HOTEL RECOMMENDATION ON EXPEDIA DATA

Abstract :

Data science is widely used in business to put what customers need at their fingertips, and it has numerous use cases. We tackle one of them: recommending a hotel to a user based on their behavior on the Expedia website. Expedia is a popular Fortune 500 travel company listing hotels all around the world. It is the most popular travel company in America and second worldwide to Booking.com.

We will use several supervised learning algorithms and recommendation algorithms to recommend a hotel to the customer. Recommendation systems suggest items to customers using customer preferences, item attributes and similar signals. We have all heard the saying 'the world is a small place', but with data spanning so many countries, regions and continents, it is very hard to recommend the perfect hotel to a customer. We tackle this problem using Expedia's public data.

Objective :

The objective of this Directed Research project is to build a hotel recommendation system using Expedia's public data.

Data :

The data provided by Expedia is a random sample of its overall data, but it is not representative of the overall statistics. It includes customer interactions with the system, customer search results and their responses, and static data such as the distance between the request origin and the destination.

Expedia has an internal algorithm that groups all its hotels into 100 clusters. Our goal is to predict which hotel cluster the user is going to book.

We have two files :

train.csv : This file contains 37 million rows of training data.

test.csv : This file contains 2.5 million rows of testing data.
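With files this large, it helps to experiment on a small slice first. The following is a minimal sketch, not the code used in this project: the helper name read_first_rows is hypothetical, and the in-memory CSV stands in for train.csv. Note that taking the first rows is not the same as drawing a random sample.

```python
import csv
import io
from itertools import islice

def read_first_rows(csv_file, limit):
    """Read at most `limit` rows from an open CSV file as dictionaries."""
    reader = csv.DictReader(csv_file)
    # islice stops early, so the rest of the file is never parsed.
    return list(islice(reader, limit))

# Tiny in-memory stand-in for train.csv (only three of the real columns).
sample_csv = io.StringIO(
    "user_id,srch_destination_id,hotel_cluster\n"
    "12,8250,1\n"
    "12,8250,45\n"
    "93,8267,91\n"
)
rows = read_first_rows(sample_csv, limit=2)
```

For the real file, the same function could be called as read_first_rows(open("train.csv", newline=""), 100000). Values come back as strings and need converting before modelling.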

Data fields are as follows :

Column : Description

date_time : Timestamp

site_name : ID of the Expedia point of sale (Expedia.com, Expedia.co.uk, ...)

posa_continent : ID of the continent associated with the site name

user_location_country : The ID of the country the customer is located in

user_location_region : The ID of the region the customer is located in

user_location_city : The ID of the city the customer is located in

orig_destination_distance : Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated (type: double)

user_id : ID of user

is_mobile : 1 when a user connected from a mobile device, 0 otherwise

is_package : 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise

channel : ID of a marketing channel

srch_ci : Check-in date

srch_co : Check-out date

srch_adults_cnt : The number of adults specified in the hotel room

srch_children_cnt : The number of (extra occupancy) children specified in the hotel room

srch_rm_cnt : The number of hotel rooms specified in the search

srch_destination_id : ID of the destination where the hotel search was performed

srch_destination_type_id : Type of destination

hotel_continent : Hotel continent

hotel_country : Hotel country

hotel_market : Hotel market

is_booking : 1 if a booking, 0 if a click

cnt : Number of similar events in the context of the same user session

hotel_cluster : ID of a hotel cluster

Procedure Undertaken :

In this section, we look at the different machine learning models I built to predict the hotel cluster, starting from the simplest model and moving towards more complex ones.

1. k-Nearest Neighbour model :

In k-Nearest Neighbour (kNN), we look at the k nearest neighbours of a data point and assign it the hotel cluster that is most common among them. There are two hyperparameters here : k and the distance measure between points. I started with kNN because it is a relatively simple model, and because the field 'cnt', which gives the number of similar events in the same user session, might contribute usefully to the distance measure.

My results with kNN on 100,000 training rows were as follows :

k Accuracy

250 0.007

300 0.007

500 0.011

700 0.025

900 0.025

Clearly, kNN performed poorly, and the field 'cnt' does not appear to provide any relevant information.

Another problem with kNN is its very high running time of O(ndk), where n is the number of training rows, d is the dimensionality of the data and k is the number of neighbours we look at. Because of this running time, it is impractical to test kNN on our very large training dataset.
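To make the idea concrete, here is a minimal brute-force kNN sketch in plain Python. It is an illustration only, not the code behind the results above: the function name knn_predict, the Euclidean distance and the toy data are all assumptions. It shows why every prediction has to touch every training row.

```python
from collections import Counter
import math

def knn_predict(train_rows, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training rows."""
    # Brute force: one Euclidean distance per training row, each costing O(d).
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Keep the k closest rows and let them vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two numeric features per row; labels are made-up cluster IDs.
train_rows = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = [3, 3, 41, 41, 41]
prediction = knn_predict(train_rows, train_labels, (4.9, 5.0), k=3)
```

Here the three nearest rows all carry label 41, so that is the prediction. On millions of rows, this full scan per query is what makes the approach so slow.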

2. One vs Rest multi-class Logistic Regression Classification :

Logistic Regression is another classification approach. Here we began with a 'One vs Rest' approach, which trains 100 binary classifiers, one per cluster, and assigns as the final prediction the cluster whose classifier outputs the highest probability.

At first, I tried One vs Rest logistic regression without removing any data fields. This led to the following accuracy : 0.036

As this accuracy is not good enough, I removed several fields : date_time, posa_continent, user_location_region, user_location_city, srch_ci, srch_co, hotel_continent.
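The prediction step of One vs Rest can be sketched as follows. This is a simplified illustration with hand-picked weights and only three classifiers instead of 100; the names sigmoid and one_vs_rest_predict are hypothetical, and in practice the weights would be learned by a library such as scikit-learn.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_rest_predict(weights, biases, x):
    """Return the index of the binary classifier with the highest probability.

    weights[c] and biases[c] define the 'cluster c vs rest' classifier.
    """
    scores = [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Three toy classifiers over two features (made-up weights).
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, 0.0, 0.0]
predicted_cluster = one_vs_rest_predict(weights, biases, (0.5, 2.0))
```

Each classifier is scored independently, so the probabilities do not need to sum to 1; only their ranking matters for the final prediction.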

Then I added two features of my own : days_from_booking, the number of days between the search and the check-in date, and num_days, the number of days booked for.

The accuracy we got now is :
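A possible way to derive days_from_booking and num_days from the date fields is sketched below. The date formats and the helper name add_date_features are assumptions, and the sketch reads days_from_booking as the gap between the search timestamp and the check-in date.

```python
from datetime import datetime

def add_date_features(row):
    """Replace the raw date fields of one row with two derived features."""
    # Assumed formats: 'YYYY-MM-DD HH:MM:SS' for date_time, 'YYYY-MM-DD' otherwise.
    searched = datetime.strptime(row["date_time"], "%Y-%m-%d %H:%M:%S")
    check_in = datetime.strptime(row["srch_ci"], "%Y-%m-%d")
    check_out = datetime.strptime(row["srch_co"], "%Y-%m-%d")

    features = dict(row)
    # Days between the search and the check-in date.
    features["days_from_booking"] = (check_in.date() - searched.date()).days
    # Length of the stay in days.
    features["num_days"] = (check_out - check_in).days
    # The raw date fields are dropped once the features are derived.
    for field in ("date_time", "srch_ci", "srch_co"):
        del features[field]
    return features

example = add_date_features({
    "date_time": "2014-08-11 07:46:59",
    "srch_ci": "2014-08-27",
    "srch_co": "2014-08-31",
    "srch_destination_id": 8250,
})
```

For this example row, the search happens 16 days before check-in and the stay lasts 4 days. Real data may contain missing or malformed dates, which this sketch does not handle.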

Accuracy : 0.048

3. Multinomial Logistic Regression :

Multinomial Logistic Regression uses the softmax function instead of the sigmoid function, and thus gives us, for every cluster, the probability that the data point falls in that cluster. We then select the cluster with the highest probability.

I removed the same fields as before : date_time, posa_continent, user_location_region, user_location_city, srch_ci, srch_co, hotel_continent

Here too I added my own features : days_from_booking, the number of days between the search and the check-in date, and num_days, the number of days booked for.

The accuracy we got with Multinomial Logistic Regression is as follows :
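The softmax step can be illustrated with a few lines of plain Python. This is a sketch with made-up scores for three clusters instead of 100, and it covers only the final step, not the training of the model.

```python
import math

def softmax(scores):
    """Turn raw per-cluster scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores for three toy clusters.
probabilities = softmax([2.0, 1.0, 0.1])
predicted_cluster = max(range(len(probabilities)), key=probabilities.__getitem__)
```

Unlike the independent sigmoid outputs of One vs Rest, these probabilities are normalised jointly, so raising the score of one cluster lowers the probability of all the others.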

Accuracy : 0.053

4. XGBoost :

XGBoost is short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” was proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. It is a tree boosting algorithm and has been very popular since its inception in 2014. It is another supervised machine learning approach. For this, I used the XGBoost library.

I applied the same feature engineering here as above.

The specifications I set and the accuracies we got with XGBoost are as follows :

Number of rounds = 5

Maximum depth = 300

Eta = 0.1

Accuracy : 0.0827

Number of rounds = 100

Maximum depth = 500

Accuracy : 0.0897

We can see that even XGBoost isn't performing well.
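For readers who want to try something similar, the first configuration listed above could map onto the XGBoost Python API roughly as follows. This is a sketch under assumptions: the post does not state the objective, so "multi:softprob" is a guess, and the function name train_cluster_model is hypothetical. The xgboost package is imported only inside the function, so the snippet loads without it.

```python
# Parameters for the first run listed above.
PARAMS = {
    "objective": "multi:softprob",  # assumption: objective is not stated in the post
    "num_class": 100,               # one class per hotel cluster
    "max_depth": 300,
    "eta": 0.1,
}
NUM_ROUNDS = 5

def train_cluster_model(features, labels, params=PARAMS, num_rounds=NUM_ROUNDS):
    """Train a booster on numeric features and integer cluster labels (0-99)."""
    import xgboost as xgb  # third-party package, needed only when training

    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(params, dtrain, num_boost_round=num_rounds)
```

The second run would change num_rounds to 100 and max_depth to 500.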

5. Collaborative filtering :

As we have seen, none of the machine learning approaches is working well, so we will take a recommendation system approach : Collaborative filtering.

The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. With collaborative filtering, we will try to find the popular hotels in the search destination region using srch_destination_id and hotel_cluster. Here, every time we see a search destination, we will use its most popular hotel clusters as our prediction.
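The popularity lookup described above can be sketched with nested counters. The function names and toy rows are hypothetical, and this is an illustration of the idea rather than the exact code used here.

```python
from collections import Counter, defaultdict

def build_popularity(rows):
    """Count how often each hotel cluster appears for each search destination."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["srch_destination_id"]][row["hotel_cluster"]] += 1
    return counts

def most_popular_clusters(counts, destination_id, top_n=1):
    """Return the top_n most frequent clusters for a destination, best first."""
    # .get avoids creating empty entries for destinations never seen in training.
    seen = counts.get(destination_id, Counter())
    return [cluster for cluster, _ in seen.most_common(top_n)]

# Toy training rows with made-up IDs.
rows = [
    {"srch_destination_id": 8250, "hotel_cluster": 1},
    {"srch_destination_id": 8250, "hotel_cluster": 1},
    {"srch_destination_id": 8250, "hotel_cluster": 45},
    {"srch_destination_id": 8267, "hotel_cluster": 91},
]
counts = build_popularity(rows)
prediction = most_popular_clusters(counts, 8250)
```

A destination that never appears in the training data yields an empty list, so a real system needs a fallback for that case.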

The accuracy we got using this approach was :

Accuracy : 0.300

Surprisingly, this model gave us our best accuracy so far. This clearly tells us that a lot of information is hidden in the popularity relations between the hotel cluster and other relevant fields.

6. Collaborative filtering 2.0 :

We will again use Collaborative filtering, but with four different relations to refine our predictions :

a) user_location_city, orig_destination_distance, hotel_cluster

b) srch_destination_id, hotel_country, hotel_market, hotel_cluster

c) srch_destination_id, hotel_cluster

d) hotel_country, hotel_cluster

We will also exploit the is_booking field, giving higher priority to hotel clusters where is_booking = 1 in the training data.

We will pick the hotel cluster suggested by the majority of these relations, with the priority of the relations decreasing from a) to d).
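One possible reading of this scheme is sketched below: each relation gets its own score table, bookings add more weight than clicks, and the prediction comes from the highest-priority relation that has seen the query's key. The weights 1.0 and 0.2, the function names and the fall-through rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Key fields of relations a) to d), in decreasing priority.
RELATIONS = [
    ("user_location_city", "orig_destination_distance"),
    ("srch_destination_id", "hotel_country", "hotel_market"),
    ("srch_destination_id",),
    ("hotel_country",),
]
BOOKING_WEIGHT = 1.0  # hypothetical weight for rows with is_booking = 1
CLICK_WEIGHT = 0.2    # hypothetical weight for rows with is_booking = 0

def build_tables(rows):
    """Build one {key: {hotel_cluster: score}} table per relation."""
    tables = [defaultdict(lambda: defaultdict(float)) for _ in RELATIONS]
    for row in rows:
        weight = BOOKING_WEIGHT if row["is_booking"] == 1 else CLICK_WEIGHT
        for table, fields in zip(tables, RELATIONS):
            key = tuple(row.get(field) for field in fields)
            if None in key:  # e.g. a null orig_destination_distance
                continue
            table[key][row["hotel_cluster"]] += weight
    return tables

def predict_cluster(tables, query):
    """Use the highest-priority relation that has seen this query's key."""
    for table, fields in zip(tables, RELATIONS):
        key = tuple(query.get(field) for field in fields)
        if None in key or key not in table:
            continue
        scores = table[key]
        return max(scores, key=scores.get)
    return None  # no relation matched

def make_row(city, distance, booked, cluster):
    """Toy row with a fixed destination, country and market (made-up IDs)."""
    return {
        "user_location_city": city, "orig_destination_distance": distance,
        "srch_destination_id": 8250, "hotel_country": 50, "hotel_market": 628,
        "is_booking": booked, "hotel_cluster": cluster,
    }

tables = build_tables([
    make_row(10, 100.5, 0, 7),
    make_row(11, None, 1, 9),
    make_row(12, 3.2, 0, 7),
])
exact_match = predict_cluster(tables, make_row(10, 100.5, 0, None))
fallback = predict_cluster(tables, make_row(99, 1.0, 0, None))
```

In the toy data, the known city and distance pair resolves through relation a) to cluster 7, while an unseen pair falls through to relation b), where a single booking of cluster 9 outweighs two clicks on cluster 7.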

The accuracy we get here is :

Accuracy : 0.495

This is a much better result than any of the machine learning models we discussed before, and a considerable improvement on the first Collaborative filtering model.

Future Developments :

In future, collaborative approaches may be combined with supervised learning models to improve on this. Approaches such as Neural Networks and Deep Learning may also give better results, as they can learn feature weights and interactions that simpler models may miss.

Conclusion :

In conclusion, Collaborative filtering approaches are sometimes stronger than machine learning approaches, depending on the data. One has to try several different approaches to see which model works best. Feature engineering also plays a vital role: many of the features in our data were not useful to the models because they are plain identifiers, whose numeric value carries no meaning. Even in collaborative filtering, simple relations such as the most popular hotels in a region gave us good accuracy, which shows that common sense holds up in real-world data.

Acknowledgements :

I wish to express my sincere gratitude to Prof. Saty Raghavachary for this opportunity to work on this research project. I am very thankful to him for providing me guidance and entrusting me with this opportunity. I am very grateful to my pals at Expedia who helped me out by providing me this valuable data and guidance.

References :

1. A Comparative Study of Collaborative Filtering Algorithms by Joonseok Lee, Mingxuan Sun, Guy Lebanon

2. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome H. Friedman.

3. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

4. https://xgboost.readthedocs.io/en/latest/

5. https://github.com/dmlc/xgboost


Vyanktesh Kanungo
