#!/usr/bin/env python
# coding: utf-8

# # DoWhy - The Causal Story Behind Hotel Booking Cancellations

# ![Screenshot%20from%202020-09-29%2019-08-50.png](attachment:Screenshot%20from%202020-09-29%2019-08-50.png)

# We consider what factors cause a hotel booking to be cancelled. This analysis is based on a hotel bookings dataset from [Antonio, Almeida and Nunes (2019)](https://www.sciencedirect.com/science/article/pii/S2352340918315191). On GitHub, the dataset is available at [rfordatascience/tidytuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).
#
# There can be different reasons why a booking is cancelled. A customer may have requested something that was not available (e.g., car parking), a customer may have found later that the hotel did not meet their requirements, or a customer may have simply cancelled their entire trip. Some of these factors, like car parking, are actionable by the hotel, whereas others, like trip cancellation, are outside the hotel's control. In any case, we would like to better understand which of these factors cause booking cancellations.
#
# The gold standard for finding this out would be an experiment such as a *Randomized Controlled Trial*, wherein each customer is randomly assigned to one of two categories, i.e., each customer either gets a car parking space or does not. However, such an experiment can be too costly and, in some cases, unethical (for example, a hotel would start losing its reputation if people learned that it was randomly assigning customers to different levels of service).
#
# Can we somehow answer our query using only observational data, i.e., data that has been collected in the past?
# In[1]:


get_ipython().run_line_magic('reload_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# In[2]:


# Config dict to set the logging level
import logging.config
DEFAULT_LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'loggers': {
        '': {
            'level': 'INFO',
        },
    }
}
logging.config.dictConfig(DEFAULT_LOGGING)

# Disabling warnings output
import warnings
from sklearn.exceptions import DataConversionWarning, ConvergenceWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
warnings.filterwarnings(action='ignore', category=UserWarning)

#!pip install dowhy
import dowhy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# In[3]:


dataset = pd.read_csv('https://raw.githubusercontent.com/Sid-darthvader/DoWhy-The-Causal-Story-Behind-Hotel-Booking-Cancellations/master/hotel_bookings.csv')
dataset.head()


# In[4]:


dataset.columns


# ## Data Description
# For a quick glance at the features and their descriptions, the reader is referred to
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md

# ## Feature Engineering
#
# Let's create some new, meaningful features so as to reduce the dimensionality of the dataset:
# - **Total Stay** = stays_in_weekend_nights + stays_in_week_nights
# - **Guests** = adults + children + babies
# - **Different_room_assigned** = 1 if reserved_room_type & assigned_room_type are different, 0 otherwise.
# In[5]:


# Total stay in nights
dataset['total_stay'] = dataset['stays_in_week_nights'] + dataset['stays_in_weekend_nights']
# Total number of guests
dataset['guests'] = dataset['adults'] + dataset['children'] + dataset['babies']
# Creating the different_room_assigned feature
dataset['different_room_assigned'] = 0
slice_indices = dataset['reserved_room_type'] != dataset['assigned_room_type']
dataset.loc[slice_indices, 'different_room_assigned'] = 1
# Deleting older features
dataset = dataset.drop(['stays_in_week_nights', 'stays_in_weekend_nights', 'adults', 'children', 'babies',
                        'reserved_room_type', 'assigned_room_type'], axis=1)
dataset.columns


# We also remove other columns that either contain NULL values or have too many unique values (e.g., agent ID). We also impute missing values of the `country` column with the most frequent country. We remove `distribution_channel` since it has a high overlap with `market_segment`.

# In[6]:


dataset.isnull().sum()  # country, agent, company contain 488, 16340, 112593 missing entries
dataset = dataset.drop(['agent', 'company'], axis=1)
# Replacing missing countries with the most frequently occurring country
dataset['country'] = dataset['country'].fillna(dataset['country'].mode()[0])


# In[7]:


dataset = dataset.drop(['reservation_status', 'reservation_status_date', 'arrival_date_day_of_month'], axis=1)
dataset = dataset.drop(['arrival_date_year'], axis=1)
dataset = dataset.drop(['distribution_channel'], axis=1)


# In[8]:


# Replacing 1 by True and 0 by False for the treatment and outcome variables
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(1, True)
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(0, False)
dataset['is_canceled'] = dataset['is_canceled'].replace(1, True)
dataset['is_canceled'] = dataset['is_canceled'].replace(0, False)
dataset.dropna(inplace=True)
print(dataset.columns)
dataset.iloc[:, 5:20].head(100)


# In[9]:


dataset = dataset[dataset.deposit_type == "No Deposit"]
dataset.groupby(['deposit_type', 'is_canceled']).count()


# In[10]:


dataset_copy = dataset.copy(deep=True)


# ## Calculating Expected Counts
# Since the number of cancellations and the number of times a different room was assigned are heavily imbalanced, we first choose 1000 observations at random and count in how many cases the variables *is_canceled* and *different_room_assigned* attain the same value. This whole process is then repeated 10000 times, and the expected count turns out to be close to 50% (i.e., the probability of these two variables attaining the same value at random).
# So, statistically speaking, we have no definite conclusion at this stage. Assigning a room different from the one a customer reserved at booking time may or may not lead to them cancelling that booking.

# In[11]:


counts_sum = 0
for i in range(10000):  # note: range(10000), since we divide by 10000 below
    rdf = dataset.sample(1000)
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
counts_sum / 10000


# We now consider the scenario when there were no booking changes and recalculate the expected count.

# In[12]:


# Expected count when there are no booking changes
counts_sum = 0
for i in range(10000):
    rdf = dataset[dataset["booking_changes"] == 0].sample(1000)
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
counts_sum / 10000


# In the second case, we take the scenario when there were booking changes (>0) and recalculate the expected count.

# In[13]:


# Expected count when there are booking changes (~66.4%)
counts_sum = 0
for i in range(10000):
    rdf = dataset[dataset["booking_changes"] > 0].sample(1000)
    counts_i = rdf[rdf["is_canceled"] == rdf["different_room_assigned"]].shape[0]
    counts_sum += counts_i
counts_sum / 10000


# There is definitely some change happening when the number of booking changes is non-zero. This gives us a hint that *Booking Changes* may be a confounding variable.
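# The Monte Carlo loops above estimate P(is_canceled == different_room_assigned) by repeated sampling. As a quick sanity check, the same quantity can be computed directly as the agreement rate over the full data, without sampling noise. A minimal sketch on a toy DataFrame (the column names match the real dataset; the values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the hotel dataset (values are illustrative only)
toy = pd.DataFrame({
    "is_canceled":             [True, False, True, False, True, False],
    "different_room_assigned": [True, True, False, False, True, False],
})

# Fraction of rows where the two variables take the same value;
# on the real dataset this is the quantity the sampling loops approximate.
agreement_rate = (toy["is_canceled"] == toy["different_room_assigned"]).mean()
print(agreement_rate)  # 4 of 6 rows agree
```

On the real dataset, replacing `toy` with `dataset` gives the exact agreement rate in one line.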
# But is *Booking Changes* the only confounding variable? What if there are unobserved confounders for which we have no information (feature) in our dataset? Would we still be able to make the same claims as before?
#
# Enter *DoWhy*!

# ## Step-1. Create a Causal Graph
# Represent your prior knowledge about the predictive modelling problem as a causal graph using assumptions. Don't worry, you need not specify the full graph at this stage. Even a partial graph is enough; the rest can be figured out by *DoWhy* ;-)
#
# Here is a list of assumptions that have been translated into a causal diagram:
#
# - *Market Segment* has two levels: "TA" ("Travel Agents") and "TO" ("Tour Operators"); it should affect *Lead Time* (which is simply the number of days between booking and arrival).
# - *Country* would also play a role in deciding whether a person books early (hence a longer *Lead Time*) and what type of *Meal* a person would prefer.
# - *Lead Time* would definitely affect the number of *Days in Waitlist* (there is less chance of finding a reservation if you are booking late). Additionally, longer *Lead Times* can also lead to *Cancellations*.
# - The number of *Days in Waitlist*, the *Total Stay* in nights and the number of *Guests* might affect whether the booking is cancelled or retained.
# - *Previous Booking Retentions* would affect whether a customer is a *Repeated Guest* or not. Additionally, both of these variables would affect whether the booking gets *cancelled* or not (e.g., a customer who has retained their past 5 bookings has a higher chance of retaining this one too; similarly, a person who has cancelled bookings before has a higher chance of doing so again).
# - *Booking Changes* would affect whether the customer is assigned a *different room*, which might in turn lead to *cancellation*.
# - Finally, the number of *Booking Changes* being the only confounder affecting *Treatment* and *Outcome* is highly unlikely, and it is possible that there are some *Unobserved Confounders* about which no information is captured in our data.

# In[14]:


import pygraphviz

causal_graph = """digraph {
different_room_assigned[label="Different Room Assigned"];
is_canceled[label="Booking Cancelled"];
booking_changes[label="Booking Changes"];
previous_bookings_not_canceled[label="Previous Booking Retentions"];
days_in_waiting_list[label="Days in Waitlist"];
lead_time[label="Lead Time"];
market_segment[label="Market Segment"];
country[label="Country"];
U[label="Unobserved Confounders"];
is_repeated_guest;
total_stay;
guests;
meal;
hotel;
U->different_room_assigned;
U->is_canceled;
U->required_car_parking_spaces;
market_segment -> lead_time;
lead_time -> is_canceled;
country -> lead_time;
different_room_assigned -> is_canceled;
country -> meal;
lead_time -> days_in_waiting_list;
days_in_waiting_list -> is_canceled;
previous_bookings_not_canceled -> is_canceled;
previous_bookings_not_canceled -> is_repeated_guest;
is_repeated_guest -> is_canceled;
total_stay -> is_canceled;
guests -> is_canceled;
booking_changes -> different_room_assigned;
booking_changes -> is_canceled;
hotel -> is_canceled;
required_car_parking_spaces -> is_canceled;
total_of_special_requests -> is_canceled;
country -> {hotel, required_car_parking_spaces, total_of_special_requests, is_canceled};
market_segment -> {hotel, required_car_parking_spaces, total_of_special_requests, is_canceled};
}"""


# Here the *Treatment* is assigning the customer the same type of room they reserved during booking. The *Outcome* is whether the booking was cancelled or not.
# *Common Causes* are the variables that, according to our assumptions, have a causal effect on both *Outcome* and *Treatment*.
# As per our causal assumptions, the two variables satisfying this criterion are *Booking Changes* and the *Unobserved Confounders*.
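# A causal graph must be a DAG (no directed cycles), otherwise identification will fail. As a sanity check before handing the DOT string to DoWhy, we can verify acyclicity. Below is a minimal pure-Python sketch; the edge list is a hand-copied subset of the graph above, used only for illustration:

```python
# Hand-copied subset of the causal graph's edges (parent, child)
edges = [
    ("U", "different_room_assigned"), ("U", "is_canceled"),
    ("market_segment", "lead_time"), ("lead_time", "is_canceled"),
    ("lead_time", "days_in_waiting_list"), ("days_in_waiting_list", "is_canceled"),
    ("booking_changes", "different_room_assigned"), ("booking_changes", "is_canceled"),
    ("different_room_assigned", "is_canceled"),
]

def is_dag(edges):
    """Depth-first search for a back edge; returns True iff the graph is acyclic."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current DFS path / finished
    color = {}
    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GREY:  # back edge => directed cycle
                return False
            if c == WHITE and not visit(nxt):
                return False
        color[node] = BLACK
        return True
    return all(visit(n) for n in graph if color.get(n, WHITE) == WHITE)

print(is_dag(edges))  # True: this subset of our assumptions forms a DAG
```

The same check applied to a graph containing, say, both `a -> b` and `b -> a` would return False.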
# So, if we were not specifying the graph explicitly (not recommended!), we could also provide these as parameters in the function below.

# In[15]:


model = dowhy.CausalModel(
    data=dataset,
    graph=causal_graph.replace("\n", " "),
    treatment='different_room_assigned',
    outcome='is_canceled')
model.view_model()

from IPython.display import Image, display
display(Image(filename="causal_model.png"))


# ## Step-2. Identify the Causal Effect
# We say that Treatment causes Outcome if changing Treatment leads to a change in Outcome, keeping everything else constant.
# Thus, in this step, we use properties of the causal graph to identify the causal effect to be estimated.

# In[16]:


import statsmodels

model = dowhy.CausalModel(
    data=dataset,
    graph=causal_graph.replace("\n", " "),
    treatment="different_room_assigned",
    outcome='is_canceled')

# Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)


# ## Step-3. Estimate the identified estimand

# In[17]:


estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_stratification",
                                 target_units="ate")
# ATE = Average Treatment Effect
# ATT = Average Treatment effect on the Treated (i.e., those who were assigned a different room)
# ATC = Average Treatment effect on the Control (i.e., those who were not assigned a different room)
print(estimate)


# The result is surprising: it means that having a different room assigned _decreases_ the chances of a cancellation. There is more to unpack here: is this the correct causal effect? Could it be that different rooms are assigned only when the booked room is unavailable, and therefore assigning a different room has a positive effect on the customer (as opposed to not assigning a room at all)?
#
# There could also be other mechanisms at play. Perhaps assigning a different room only happens at check-in, and the chances of a cancellation once the customer is already at the hotel are low?
# In that case, the graph is missing a critical variable capturing _when_ these events happen. Does `different_room_assigned` happen mostly on the day of the booking? Knowing that variable could help improve the graph and our analysis.
#
# While the associational analysis earlier indicated a positive correlation between `is_canceled` and `different_room_assigned`, estimating the causal effect using DoWhy presents a different picture. It implies that a decision/policy to reduce the number of `different_room_assigned` at hotels may be counter-productive.

# ## Step-4. Refute results
#
# Note that the causal part does not come from data. It comes from your *assumptions*, which lead to *identification*. Data is simply used for statistical *estimation*. Thus it becomes critical to verify whether our assumptions in the first step were correct!
#
# What happens when another common cause exists?
# What happens when the treatment itself was a placebo?

# ### Method-1
# **Random Common Cause:** *Adds randomly drawn covariates to the data and re-runs the analysis to see if the causal estimate changes. If our assumptions were correct, the causal estimate shouldn't change by much.*

# In[18]:


refute1_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="random_common_cause")
print(refute1_results)


# ### Method-2
# **Placebo Treatment Refuter:** *Randomly assigns a covariate as the treatment and re-runs the analysis. If our assumptions were correct, this new estimate should go to 0.*

# In[19]:


refute2_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="placebo_treatment_refuter")
print(refute2_results)


# ### Method-3
# **Data Subset Refuter:** *Creates subsets of the data (similar to cross-validation) and checks whether the causal estimates vary across subsets.
# If our assumptions were correct, there shouldn't be much variation.*

# In[20]:


refute3_results = model.refute_estimate(identified_estimand, estimate,
                                        method_name="data_subset_refuter")
print(refute3_results)


# We can see that our estimate passes all three refutation tests. This does not prove its correctness, but it increases our confidence in the estimate.
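# The logic of estimation and refutation can be illustrated end-to-end on synthetic data where the true effect is known. The sketch below is purely illustrative and not part of the hotel analysis (variable names `z`, `t`, `y` are made up): it simulates a binary confounder, shows that the naive difference in means is biased, that backdoor stratification on the confounder recovers the true effect, and that a permuted (placebo) treatment yields an effect near zero, mirroring the idea behind DoWhy's `placebo_treatment_refuter`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_ate = -0.2

z = rng.binomial(1, 0.5, n)                          # confounder (think: booking changes)
t = rng.binomial(1, 0.2 + 0.6 * z)                   # treatment depends on the confounder
y = true_ate * t + 0.5 * z + rng.normal(0, 0.1, n)   # outcome depends on both

# Naive (associational) estimate: biased upward by roughly +0.3 here
naive = y[t == 1].mean() - y[t == 0].mean()

# Backdoor adjustment: estimate the effect within each stratum of z,
# then average the strata effects weighted by P(z)
adjusted = 0.0
for zv in (0, 1):
    m = z == zv
    effect_z = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
    adjusted += effect_z * m.mean()

# Placebo check: replace the treatment with a random permutation;
# the adjusted estimate should drop to ~0
t_placebo = rng.permutation(t)
placebo = 0.0
for zv in (0, 1):
    m = z == zv
    placebo += (y[m & (t_placebo == 1)].mean() - y[m & (t_placebo == 0)].mean()) * m.mean()

print(round(naive, 3), round(adjusted, 3), round(placebo, 3))
```

Here `adjusted` lands close to the true effect of -0.2 while `naive` does not, which is exactly the gap between correlation and causation that the DoWhy pipeline above addresses.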