Airbnb NYC: an analysis of the Big Apple
More than 75K houses to rent: how do you choose the best one to enjoy the view?
In this analysis I used a Kaggle dataset about NYC (more information here: Airbnb Ratings Dataset | Kaggle). Although the dataset also contains reviews for LA, I used only the file “NY_Listings.csv”, which holds the information on NYC houses.
The purpose of this analysis is to complete the Udacity Data Science Nanodegree. I propose to analyze the dataset and identify whether there is a correlation between guest reviews and the other attributes of the houses.
If you want to check the Git project for this analysis, you can find it here: rdomingues/AirbnbNYC (github.com)
In this post I will try to answer the following questions:
- How can we create a new metric that relates the overall score to the number of reviews?
- Does this metric work better than the overall score as a prediction target?
- Are the features of the house enough to determine the overall score?
The dataset has 35 columns and 75,749 rows; each row corresponds to one Airbnb accommodation. The columns describe the host, the location, the property, the reservation and, finally, some reviews.
As a first look at the dataset, we can check the mean and the variance of the discrete variables:
As we can see, the typical house in NYC accommodates 2 people, has 1 bathroom and 1 bedroom, costs about 105 dollars, and requires a minimum stay of about 6 nights. The maximum nights and the availability values do not seem to carry significant meaning.
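A summary like this takes only a few lines of Python. The sketch below is a minimal illustration using only the standard library, not the code behind the figures above: the helper names `load_rows` and `mean_and_variance`, the column names `Accommodates` and `Price`, and the sample rows are all assumptions made for the example.

```python
import csv
import statistics


def load_rows(path):
    """Read a CSV file into a list of dicts, one per listing."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def mean_and_variance(rows, column):
    """Return (mean, sample variance) of a numeric column, skipping blanks."""
    values = [
        float(row[column])
        for row in rows
        if row.get(column) not in (None, "")
    ]
    return statistics.mean(values), statistics.variance(values)


# Made-up rows for illustration only; real rows would come from
# load_rows("NY_Listings.csv"). The column names are assumed.
sample_rows = [
    {"Accommodates": "2", "Price": "100"},
    {"Accommodates": "4", "Price": "150"},
    {"Accommodates": "2", "Price": ""},
]
price_mean, price_variance = mean_and_variance(sample_rows, "Price")
```

Blank cells are skipped rather than treated as zero, so a missing price does not pull the mean down.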
Each row of the dataset corresponds to a house and includes its number of reviews and the mean of several scores. In this analysis I will focus on the number of reviews and the overall score (“Review Scores Rating”). The objective is to understand whether the house’s attributes are correlated with the overall rating, so I cannot use the other scores as features; otherwise the result would be biased.
The Review Scores Rating
When you want to book a house, which do you prefer: a house with 1k reviews and a mean of 8, or a house with a single review of 10 stars?
Looking at the metric on its own with a histogram, we can see that the values are concentrated at the high end (the distribution is left-skewed), so most people gave a positive rating.
This score disregards the number of reviews. For example, if a house received a 90% rating from its only review, should that score carry the same weight as the same rating on a house reviewed by 100 people?
To fix this, I created another column based on “Review Scores Rating” but weighted by the number of reviews. The formula used was:
Where p is the rating, q is the number of reviews and Q is the weight applied to the score, in this case 144.
With this formula, the values are now normalized, as we can see in the following histogram:
Regarding the first question: yes, it is possible to create a new metric that relates the review score to the number of reviews.
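To make the idea of weighting a rating by its number of reviews concrete, here is a small Python sketch. It is only one plausible form, p * q / (q + Q), chosen because it uses the same three symbols; it is an assumption for illustration and not necessarily the formula used in this analysis. The function name `weighted_score` is also hypothetical.

```python
def weighted_score(rating, n_reviews, weight=144):
    """Damp a rating toward 0 when it rests on few reviews.

    ASSUMED form: rating * n_reviews / (n_reviews + weight).
    This only reuses the symbols p (rating), q (n_reviews) and Q (weight);
    it is not confirmed to be the formula used in the analysis.
    """
    if n_reviews < 0:
        raise ValueError("n_reviews must be non-negative")
    return rating * n_reviews / (n_reviews + weight)


# Same rating of 90, backed by very different amounts of evidence.
single_review = weighted_score(90, 1)    # one review
many_reviews = weighted_score(90, 100)   # 100 reviews
```

Under this assumed form, a house keeps half of its rating when it has exactly Q reviews, and approaches its full rating as the number of reviews grows, so the house with 100 reviews ends up well ahead of the house with a single one.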
The supervised model
With the dataset cleaned and a normalized metric available, it was possible to test a Linear Regression model.
In fact, we checked 4 configurations:
- The new metric score, with the Number of Reviews in the dataset. As discussed before, the new score was created with a formula that uses the number of reviews, so this result is biased by construction. The r2 was 0.806478 and the mean squared error was 20.8821823.
- The new metric score, without the Number of Reviews. Here the r2 decreased to 0.312825 and the mean squared error was 71.982027.
- The “Review Scores Rating”, with the Number of Reviews. The r2 was 0.145195 and the mean squared error was 59.795813.
- Finally, the “Review Scores Rating”, without the Number of Reviews. The r2 was 0.112494 and the mean squared error was 61.477236.
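For readers less familiar with the two figures reported for each test, here is a minimal pure-Python sketch of how r2 and mean squared error are defined. The toy values are made up for illustration, and this is not the code that produced the results above; libraries such as scikit-learn provide equivalent functions (`r2_score` and `mean_squared_error` in `sklearn.metrics`).

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total spread
    return 1 - ss_res / ss_tot


# Toy ratings and predictions, made up for illustration only.
y_true = [80, 90, 100]
y_pred = [85, 90, 95]
toy_mse = mean_squared_error(y_true, y_pred)
toy_r2 = r2_score(y_true, y_pred)
```

Note that the mean squared error depends on the scale of the target, so it is only directly comparable between tests that predict the same target; r2 is scale-free and can be compared across all four.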
Analyzing the results
Regarding the second and third questions: as we can observe, the best r2 came from the first test, but it cannot be used, because the target metric is derived from a formula that includes the number of reviews, so the result is biased.
In second place, without using the “Number of Reviews”, we got a low r2 of 0.31. This was still the best valid result in these tests, which means there is only a weak relationship between the characteristics of the house and the guest’s overall review. Other factors must matter more to guests when they evaluate a house, but they are not in this dataset.
As we can see, the normalization of the review score was an important step, because the model performed worse with the original dataset variable “Review Scores Rating”.