HOTEL RECOMMENDATION ON EXPEDIA DATA




Abstract :
Data science is widely used in business to put what customers need right at their fingertips, and it has numerous use cases. We tackle one of them: recommending a hotel to a user based on their behavior on the Expedia website. Expedia is a popular Fortune 500 travel company with hotels all around the world. It is the most popular travel company in America and second worldwide to Booking.com.
We will use several supervised learning algorithms and recommendation algorithms to recommend a hotel to the customer. Recommendation systems suggest items to customers using customer preferences, item attributes, and so on. We have all heard the saying 'the world is a small place', but with data spanning many different countries, regions and continents, it is very hard to recommend the perfect hotel to a customer. We tackle this problem using Expedia's public data.
Objective :
The objective of this Directed Research project is to build a hotel recommendation system using Expedia's public data.
Data :
The data provided by Expedia is a random sample of its overall data, but it is not representative of the overall statistics. It includes customer interactions with the system, customer search results and their responses, and static data such as the distance between the request origin and the destination.
Expedia has an internal algorithm that groups all of its hotels into 100 clusters. Our goal is to predict which hotel cluster a user is going to book.

We have two files :
train.csv : This file contains 37 million rows of training data.
test.csv : This file contains 2.5 million rows of testing data.

Data fields are as follows :
Column                        Description
date_time                     Timestamp
site_name                     ID of the Expedia point of sale (Expedia.com, Expedia.co.uk, ...)
posa_continent                ID of the continent associated with site_name
user_location_country         ID of the country the customer is located in
user_location_region          ID of the region the customer is located in
user_location_city            ID of the city the customer is located in
orig_destination_distance     Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated
user_id                       ID of the user
is_mobile                     1 when a user connected from a mobile device, 0 otherwise
is_package                    1 if the click/booking was generated as part of a package (i.e. combined with a flight), 0 otherwise
channel                       ID of a marketing channel
srch_ci                       Check-in date
srch_co                       Check-out date
srch_adults_cnt               The number of adults specified in the hotel room
srch_children_cnt             The number of (extra occupancy) children specified in the hotel room
srch_rm_cnt                   The number of hotel rooms specified in the search
srch_destination_id           ID of the destination where the hotel search was performed
srch_destination_type_id      Type of destination
hotel_continent               Hotel continent
hotel_country                 Hotel country
hotel_market                  Hotel market
is_booking                    1 if a booking, 0 if a click
cnt                           Number of similar events in the context of the same user session
hotel_cluster                 ID of a hotel cluster

Procedure Undertaken :
In this section, we look at the different machine learning models I built to predict the hotel cluster, starting from the simplest model and moving towards more complex ones.

1. k-Nearest Neighbour model :
In k-Nearest Neighbour (knn), we look at the k nearest neighbours of a data point and use them to decide which hotel cluster the data point belongs to. There are two hyperparameters here : k and the distance measure between points. I decided to start with knn because it is a relatively simple model, and because the field 'cnt', which gives the number of similar actions made in the same user session, might be a good distance measure.
My results for knn on 100000 training rows were :
k          Accuracy
250      0.007
300      0.007
500      0.011
700      0.025
900      0.025
Clearly, knn performed poorly, and the field 'cnt' does not provide any relevant information.
Another problem with knn is its very high running time of O(ndk), where n is the cardinality of the training data, d is the dimensionality of the data and k is the number of neighbours considered. Because of this, it is impractical to run knn on our very large training dataset.
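A minimal brute-force sketch of the knn prediction step may help show where the running time comes from. It is not the code used for the results above: it assumes Euclidean distance over numeric features, and the function name, toy rows and labels are all made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest rows."""
    # Brute force: one distance per training row, each costing O(d).
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Keep the k closest rows and vote on their hotel clusters.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two numeric features per row, labels are cluster IDs.
rows = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = [7, 7, 42, 42, 42]
prediction = knn_predict(rows, labels, query=(4.9, 5.0), k=3)
```

Every prediction scans the whole training set, which is why this approach becomes so slow on tens of millions of rows.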
2. One vs Rest multi-class Logistic Regression Classification :
Logistic regression is another classification approach. Here we began with a 'One vs Rest' approach, which trains 100 classifiers (one per cluster) and assigns the cluster with the maximum probability as the final prediction.
At first, I tried One vs Rest logistic regression without removing any data fields. This led to the following accuracy : 0.036
As this accuracy is not good enough, I removed several fields : date_time, posa_continent, user_location_region, user_location_city, srch_ci, srch_co, hotel_continent.
Then I added two features of my own : days_from_booking (number of days from check in) and num_days (number of days booked for).
The accuracy we got now is :
Accuracy : 0.048
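The two engineered features can be sketched as below. This is one possible reading rather than the original code: it assumes days_from_booking is the number of days between the search timestamp (date_time) and check-in (srch_ci), that num_days is check-out minus check-in, and that dates use the formats shown. The helper name and the example row are hypothetical.

```python
from datetime import datetime


def add_date_features(row):
    """Return a copy of `row` with days_from_booking and num_days added."""
    # Assumed formats: "YYYY-MM-DD HH:MM:SS" for date_time, "YYYY-MM-DD" otherwise.
    searched = datetime.strptime(row["date_time"], "%Y-%m-%d %H:%M:%S").date()
    check_in = datetime.strptime(row["srch_ci"], "%Y-%m-%d").date()
    check_out = datetime.strptime(row["srch_co"], "%Y-%m-%d").date()

    features = dict(row)
    # Assumption: lead time is measured from the search event to check-in.
    features["days_from_booking"] = (check_in - searched).days
    # Length of stay in nights.
    features["num_days"] = (check_out - check_in).days
    return features


# Made-up example row.
example = add_date_features(
    {
        "date_time": "2014-08-11 07:46:59",
        "srch_ci": "2014-08-27",
        "srch_co": "2014-08-31",
    }
)
```

Once these two numbers are derived, the raw date columns can be dropped, as described above.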
3. Multinomial Logistic Regression :
Multinomial logistic regression uses the softmax function instead of the sigmoid function, and thus gives us, for each cluster, the probability that the data point falls in that cluster. We then select the cluster with the highest probability.
I removed several fields : date_time, posa_continent, user_location_region, user_location_city, srch_ci, srch_co, hotel_continent.
Here too I added my own features : days_from_booking (number of days from check in) and num_days (number of days booked for).

The accuracy we got with Multinomial Logistic Regression is as follows :
Accuracy : 0.053
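The difference between the two logistic regression variants comes down to how raw scores are turned into probabilities. The sketch below illustrates this with three hypothetical scores (the real problem has 100 clusters); the scores and variable names are made up, and no model is actually trained here.

```python
import math


def sigmoid(score):
    """Squash a single score into (0, 1), independently of other classes."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Turn a list of scores into one probability distribution."""
    # Subtracting the max improves numerical stability without changing the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three clusters.
scores = [2.0, 1.0, -1.0]

ovr_probs = [sigmoid(s) for s in scores]  # One vs Rest: need not sum to 1
multi_probs = softmax(scores)             # Multinomial: sums to 1

# In both cases the prediction is the cluster with the highest probability.
ovr_choice = max(range(len(scores)), key=lambda i: ovr_probs[i])
multi_choice = max(range(len(scores)), key=lambda i: multi_probs[i])
```

With One vs Rest, each classifier is scored on its own, so the probabilities are not directly comparable as a distribution; softmax normalises them across all clusters at once.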

4. XGBoost :
XGBoost is short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” was proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. It is a tree boosting algorithm and has been very popular since its inception in 2014. It is another supervised machine learning approach. For this, I used the XGBoost library.
I applied the same feature engineering here as above.
The settings I used and the accuracies we got with XGBoost are as follows :
a)
Number of rounds = 5
Maximum depth = 300
Eta = 0.1
Accuracy : 0.0827
b)
Number of rounds = 100
Maximum depth = 500
Accuracy : 0.0897

We can see that even XGBoost isn't performing well.

5. Collaborative filtering :
As none of the machine learning approaches above works well, we will take a recommendation system approach : Collaborative filtering.
The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. Using collaborative filtering, we try to find the popular hotel clusters in the search destination region using srch_destination_id and hotel_cluster. Every time we see a search destination, we use its most popular hotel clusters as our prediction.
The accuracy we got using this approach was :
Accuracy : 0.300
Surprisingly, this model gave us our best accuracy so far. This tells us that a lot of information is hidden in the popularity relations between the hotel cluster and other relevant fields.
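The "most popular clusters per destination" idea can be sketched with plain counters. This is an illustrative reconstruction, not the original code: the function names, the top_n parameter and the toy rows are assumptions.

```python
from collections import Counter, defaultdict


def fit_popular_clusters(rows):
    """Count how often each hotel_cluster appears for each srch_destination_id."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["srch_destination_id"]][row["hotel_cluster"]] += 1
    return counts


def predict_clusters(counts, destination_id, top_n=5):
    """Return the top_n most frequent clusters for a destination (may be empty)."""
    popular = counts.get(destination_id, Counter())
    return [cluster for cluster, _ in popular.most_common(top_n)]


# Toy training rows (made up).
train = [
    {"srch_destination_id": 8250, "hotel_cluster": 1},
    {"srch_destination_id": 8250, "hotel_cluster": 1},
    {"srch_destination_id": 8250, "hotel_cluster": 45},
    {"srch_destination_id": 8250, "hotel_cluster": 1},
    {"srch_destination_id": 8250, "hotel_cluster": 45},
    {"srch_destination_id": 8250, "hotel_cluster": 79},
    {"srch_destination_id": 12206, "hotel_cluster": 21},
]
popularity = fit_popular_clusters(train)
top_two = predict_clusters(popularity, 8250, top_n=2)
unseen = predict_clusters(popularity, 99999)
```

A destination that never appears in the training data yields an empty list, so a real system would need some fallback for that case.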
6. Collaborative filtering 2.0 :
We will again use Collaborative filtering, but this time with four different relations to refine our predictions :
a) user_location_city, orig_destination_distance, hotel_cluster
b) srch_destination_id, hotel_country, hotel_market, hotel_cluster
c) srch_destination_id, hotel_cluster
d) hotel_country, hotel_cluster

We will also exploit the is_booking field, giving higher priority to hotel clusters where is_booking = 1 in the training data.
We pick the hotel cluster suggested by the majority of these relations, with priority decreasing in order from a) to d).
The accuracy we get here is :
Accuracy : 0.495
This is a much better result than any of the machine learning models discussed before, and a considerable improvement on the first Collaborative filtering model.
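One way to combine the four relations is sketched below. It is only one interpretation of the procedure described above: the booking and click weights are hypothetical values, the rule of filling predictions relation by relation in priority order is an assumption, and all names and toy rows are made up.

```python
from collections import Counter, defaultdict

# Key columns for each relation, in the priority order a) to d) from the text.
RELATIONS = [
    ("user_location_city", "orig_destination_distance"),
    ("srch_destination_id", "hotel_country", "hotel_market"),
    ("srch_destination_id",),
    ("hotel_country",),
]
BOOKING_WEIGHT = 5.0  # hypothetical weight for rows with is_booking = 1
CLICK_WEIGHT = 1.0    # hypothetical weight for clicks


def fit_relations(rows):
    """Build one weighted cluster-count table per relation."""
    tables = [defaultdict(Counter) for _ in RELATIONS]
    for row in rows:
        weight = BOOKING_WEIGHT if row["is_booking"] == 1 else CLICK_WEIGHT
        for table, columns in zip(tables, RELATIONS):
            key = tuple(row.get(c) for c in columns)
            if None in key:  # e.g. a null orig_destination_distance
                continue
            table[key][row["hotel_cluster"]] += weight
    return tables


def predict_clusters(tables, query, top_n=5):
    """Collect clusters relation by relation, highest priority first."""
    picked = []
    for table, columns in zip(tables, RELATIONS):
        key = tuple(query.get(c) for c in columns)
        for cluster, _ in table.get(key, Counter()).most_common():
            if cluster not in picked:
                picked.append(cluster)
            if len(picked) == top_n:
                return picked
    return picked


# Toy rows (made up): one booking of cluster 1, clicks on clusters 45 and 79.
base = {"srch_destination_id": 8250, "hotel_country": 50, "hotel_market": 628}
train = [
    {**base, "user_location_city": 100, "orig_destination_distance": 250.5,
     "is_booking": 1, "hotel_cluster": 1},
    {**base, "user_location_city": 200, "orig_destination_distance": None,
     "is_booking": 0, "hotel_cluster": 45},
    {**base, "user_location_city": 200, "orig_destination_distance": None,
     "is_booking": 0, "hotel_cluster": 45},
    {"srch_destination_id": 9000, "hotel_country": 50, "hotel_market": 700,
     "user_location_city": 300, "orig_destination_distance": None,
     "is_booking": 0, "hotel_cluster": 79},
]
tables = fit_relations(train)
query = {**base, "user_location_city": 999, "orig_destination_distance": 1.0}
ranked = predict_clusters(tables, query)
```

In this toy example the single booking of cluster 1 outweighs the two clicks on cluster 45, and cluster 79 is only reached through the lowest-priority relation d).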

Future Developments :
In future, this could be improved by combining collaborative approaches with supervised learning models. Approaches such as neural networks and deep learning may give better results, as they are often found to tune parameter weights better than other ML models.

Conclusion :
In conclusion, Collaborative filtering approaches are sometimes stronger than machine learning approaches, depending on the data. One has to try several different approaches to see which model works best. Feature engineering also plays a vital role: many of the features in our data were not useful as numeric inputs, because they are identification values whose magnitude carries no meaning. Even in collaborative filtering, simple relations such as the most popular hotel clusters in a region gave us good accuracy, which suggests that common-sense patterns persist in real-world data.

Acknowledgements :
I wish to express my sincere gratitude to Prof. Saty Raghavachary for the opportunity to work on this research project. I am very thankful to him for his guidance and for entrusting me with this opportunity. I am also very grateful to my pals at Expedia, who helped me out by providing this valuable data and guidance.
References :
1. A Comparative Study of Collaborative Filtering Algorithms by Joonseok Lee, Mingxuan Sun, Guy Lebanon
2.  The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome H. Friedman.

5. https://github.com/dmlc/xgboost
