DAT561 Final Project (Fall 2024)

Note: Please be creative in defining new variables as part of the data manipulation, and write a description at the end of each code cell as a comment. We will read your logic and description for the assessment.

Part 1: 75 points (85 points with the extra credits in the Bonus Question)

In [ ]:
import numpy as np
import pandas as pd
In [ ]:
# Read the datasets here
Property_details = pd.read_csv(?)  # Please use your own path and dataset for this part!
Order_details = pd.read_csv(?)

Question 1

Part (a): How many unique cities are there in Bulgaria?

Part (b): What is the mean, standard deviation, median, min, and max of “latitude” for all properties in Bulgaria?
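As a rough illustration of the kind of operations Question 1 calls for, here is a minimal sketch on made-up data. The column names "country" and "city" and all values are assumptions for illustration only; only "latitude" is named in the question, and the real dataset may be organized differently.

```python
import pandas as pd

# Hypothetical toy data: "country" and "city" are assumed column names,
# and every value below is invented for illustration.
toy = pd.DataFrame({
    "country": ["Bulgaria", "Bulgaria", "Bulgaria", "Greece"],
    "city": ["Sofia", "Varna", "Sofia", "Athens"],
    "latitude": [42.70, 43.20, 42.60, 37.98],
})

# Keep only the rows for one country.
bulgaria = toy[toy["country"] == "Bulgaria"]

# Count distinct cities (duplicates are counted once).
n_cities = bulgaria["city"].nunique()

# Compute several summary statistics in one call.
lat_stats = bulgaria["latitude"].agg(["mean", "std", "median", "min", "max"])

print(n_cities)
print(lat_stats)
```

Note that "std" here is the sample standard deviation (ddof=1), which is the pandas default.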

In [ ]:
# Part (a):
In [ ]:
# Part (b):

Question 2

Part (a): Create a new column called "Recommendation", which is how well the property is recommended:

For ‘starrating’ of 5: Highly Recommended

For ‘starrating’ of 4 and above (but below 5): Great Value

For ‘starrating’ of less than 4: Meh

Part (b): Which country receives the largest number of ‘Highly Recommended’ and ‘Great Value’ recommendations?
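One possible way to express the rating rules above is `np.select`, which checks conditions in order and takes the first match. The sketch below uses made-up data, and "country" is an assumed column name; it also assumes that the two labels are counted together in Part (b), which is only one reading of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "country" is an assumed column name.
toy = pd.DataFrame({
    "country": ["Bulgaria", "Bulgaria", "Greece", "Greece", "Greece"],
    "starrating": [5.0, 3.5, 4.0, 4.5, 5.0],
})

# Conditions are evaluated in order, so a rating of 5 is labeled
# "Highly Recommended" before the ">= 4" rule is considered.
conditions = [toy["starrating"] == 5, toy["starrating"] >= 4]
labels = ["Highly Recommended", "Great Value"]
toy["Recommendation"] = np.select(conditions, labels, default="Meh")

# Count rows per country that carry either of the two top labels
# (assumption: both labels are pooled into a single count).
top = toy[toy["Recommendation"].isin(labels)]
counts = top.groupby("country").size()
best_country = counts.idxmax()

print(toy)
print(best_country)
```

If the two labels should be ranked separately instead, grouping by both "country" and "Recommendation" would be the natural variation.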

In [ ]:
# Part (a):
In [ ]:
# Part (b):

Question 3

Part (a): In “ratedescription”, what is the mean of the largest 10 percent of room sizes? What is the mode of the smallest 10 percent of room sizes?

Part (b): In ‘rate type’, what is the probability of not having to pay at the hotel given free cancellation?
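The two ideas in Question 3, tail statistics and a conditional probability, can be sketched as follows. This is not a solution to the question: it assumes the room sizes have already been extracted into a numeric Series (`room_size`, hypothetical) and that ‘rate type’ has already been turned into two boolean flags (`free_cancel` and `pay_at_hotel`, both hypothetical). How to parse the real text columns is left open, since their format is not described here.

```python
import pandas as pd

# Hypothetical numeric room sizes, assumed to be already parsed from text.
room_size = pd.Series([10, 10, 12, 15, 18, 20, 22, 25, 40, 40], dtype=float)

# One interpretation of "largest/smallest 10 percent": values at or beyond
# the 90th / 10th percentile (ties at the threshold are included).
upper = room_size[room_size >= room_size.quantile(0.90)]
lower = room_size[room_size <= room_size.quantile(0.10)]
mean_top = upper.mean()
mode_bottom = lower.mode()  # a Series, because a mode can be tied

# Hypothetical boolean flags, assumed to be derived from the rate type text.
flags = pd.DataFrame({
    "free_cancel": [True, True, True, True, False, False],
    "pay_at_hotel": [True, False, False, True, True, False],
})

# P(not pay at hotel | free cancellation): restrict to the conditioning
# event first, then take the share of rows where the outcome holds.
given = flags[flags["free_cancel"]]
p_no_pay_given_free = (~given["pay_at_hotel"]).mean()

print(mean_top, mode_bottom.tolist(), p_no_pay_given_free)
```

Taking the mean of a boolean Series gives the fraction of `True` values, which is why it works as a probability estimate here.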

In [ ]:
# You can decide whether to display your output for 3(a) and 3(b) separately or together
In [ ]:
# If you displayed your output of 3(b) together with 3(a) please delete this cell

Question 4

Part (a): For each property, there are some abnormal values of 0 in “onsiteprice”. To better organize the data, create a new column “replaced onsiteprice” in the dataset that keeps each property's original non-zero “onsiteprice” values and replaces each zero with the median of that property's non-zero “onsiteprice” values.

Part (b): For each property, calculate the mean and variance of “replaced onsiteprice”, and store them in two columns named “Mean” and “Variance”. Then create a column named “Standardized Mean” to store the standardized form of the “Mean” column.
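The per-property replacement and standardization described in Question 4 could be sketched with `groupby(...).transform`, as below. The key column "propertyid" and all values are assumptions for illustration, and standardization is taken to mean a z-score (subtract the mean, divide by the sample standard deviation), which is one common reading.

```python
import pandas as pd

# Hypothetical toy data; "propertyid" is an assumed key column.
toy = pd.DataFrame({
    "propertyid": [1, 1, 1, 1, 2, 2, 2],
    "onsiteprice": [100.0, 0.0, 120.0, 140.0, 0.0, 200.0, 300.0],
})

# Turn zeros into NaN so they are ignored when computing medians.
nonzero = toy["onsiteprice"].where(toy["onsiteprice"] != 0)

# Median of the non-zero prices, computed per property and broadcast
# back to every row of that property.
medians = nonzero.groupby(toy["propertyid"]).transform("median")

# Keep original non-zero prices; fill former zeros with the property median.
toy["replaced onsiteprice"] = nonzero.fillna(medians)

# One row per property with its mean and (sample) variance.
summary = (
    toy.groupby("propertyid")["replaced onsiteprice"]
    .agg(Mean="mean", Variance="var")
    .reset_index()
)

# z-score of the per-property means (assumed meaning of "standardized").
summary["Standardized Mean"] = (
    summary["Mean"] - summary["Mean"].mean()
) / summary["Mean"].std()

print(toy)
print(summary)
```

A property whose prices are all zero would keep NaN in the new column under this approach, so that case would need a separate decision.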

In [ ]:
# Part (a):
In [ ]:
# Part (b):

Question 5

Part (a): A party of four is planning a trip. How many available hotels offer a room with a “maxoccupancy” of 4 or 2? Available hotels are those whose “propertype” is “Hotels”, whose “close” is “N”, and whose “hotelblock” is not “sold out”.

Part (b): If this party does not want to pay for a room with an average “replaced onsiteprice” higher than 230 per night, how many hotels are still available? Because prices fluctuate, compare the mean of “replaced onsiteprice” with 230.
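The filtering logic in Question 5 amounts to combining boolean conditions and then counting distinct hotels. Here is a minimal sketch on invented data: "propertyid" is an assumed key column, the other column names are spelled as they appear in the question, and the value "available" for “hotelblock” is made up.

```python
import pandas as pd

# Hypothetical toy data; "propertyid" and all values are invented.
toy = pd.DataFrame({
    "propertyid": [1, 1, 2, 3, 4, 5, 6],
    "propertype": ["Hotels", "Hotels", "Hotels", "Apartments",
                   "Hotels", "Hotels", "Hotels"],
    "close": ["N", "N", "N", "N", "Y", "N", "N"],
    "hotelblock": ["available", "available", "sold out", "available",
                   "available", "available", "available"],
    "maxoccupancy": [2, 4, 4, 4, 2, 3, 4],
    "replaced onsiteprice": [200.0, 240.0, 150.0, 100.0, 90.0, 300.0, 260.0],
})

# Combine the availability rules with the occupancy requirement.
mask = (
    (toy["propertype"] == "Hotels")
    & (toy["close"] == "N")
    & (toy["hotelblock"] != "sold out")
    & (toy["maxoccupancy"].isin([2, 4]))
)
available = toy[mask]

# Count hotels, not rows: one hotel can have several qualifying rooms.
n_available = available["propertyid"].nunique()

# Average price per remaining hotel, then apply the budget cap.
avg_price = available.groupby("propertyid")["replaced onsiteprice"].mean()
n_affordable = int((avg_price <= 230).sum())

print(n_available, n_affordable)
```

Whether the average should be taken over the qualifying rooms only (as here) or over all rows of the hotel is a judgment call that the question leaves open.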

In [ ]:
# Part (a):
In [ ]:
# Part (b):

Bonus Question:

Merge data, filter, groupby, merge three times

Part (a): For each zip code, find the most expensive property by using “replaced onsiteprice”. Provide id, name, rating, city, country, zip code, address, and average “replaced onsiteprice” of these properties.

Part (b): For each zip code, find the cheapest property by using “replaced onsiteprice”. Provide id, name, rating, city, country, zip code, address, and average “replaced onsiteprice” of these properties.

Hint: Each country has a number of hotels, and each hotel has a number of prices due to price fluctuation. You need to find the average “replaced onsiteprice” for each hotel first, and then pick out the cheapest and the most expensive hotels.
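The hint above suggests a two-step pattern: average per hotel first, then pick the extreme per group. A minimal sketch under stated assumptions follows. The table layout, the column names "propertyid", "propertyname", and "zipcode", and all values are hypothetical; the real datasets may need different keys or additional merges.

```python
import pandas as pd

# Hypothetical price rows (several per hotel) and a hypothetical
# property table (one row per hotel).
prices = pd.DataFrame({
    "propertyid": [1, 1, 2, 2, 3],
    "replaced onsiteprice": [100.0, 140.0, 300.0, 200.0, 80.0],
})
props = pd.DataFrame({
    "propertyid": [1, 2, 3],
    "propertyname": ["A", "B", "C"],
    "zipcode": ["1000", "1000", "9000"],
})

# Step 1: average price per hotel.
avg = (
    prices.groupby("propertyid", as_index=False)["replaced onsiteprice"]
    .mean()
    .rename(columns={"replaced onsiteprice": "avg_price"})
)

# Step 2: attach the descriptive columns to the averages.
merged = props.merge(avg, on="propertyid", how="inner")

# Step 3: within each zip code, locate the row with the highest and
# lowest average price, then pull those full rows.
most_expensive = merged.loc[merged.groupby("zipcode")["avg_price"].idxmax()]
cheapest = merged.loc[merged.groupby("zipcode")["avg_price"].idxmin()]

print(most_expensive)
print(cheapest)
```

`idxmax` and `idxmin` return a single row per group, so ties within a zip code are resolved by taking the first occurrence; a zip code with only one hotel appears in both results.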

In [ ]:
# Part (a):
In [ ]:
# Part (b):

Part 2 (25 Points)

For this part, we look at the logic and how you solve the problems.

Part (a):

1- You need to find "5" interesting business questions based on the datasets. Please make sure that these questions are not similar to those of other groups.
2- Write Python code to answer the questions.
3- Visualize your results for each question.

Part (b):

Write a 300-word summary of your answers and business insights you get from answering these 5 questions based on your code. Ensure that you have clearly explained why we should care about your questions and your results. Clearly explain your findings.

This part will be evaluated based on the following criteria:

1. You need to ask five business-relevant questions. (5 points)
2. You need to answer these five questions using Python and the two datasets. (5 points)
3. You need to have at least "5" graphs to visualize your insights. (6 points)
4. Your executive summary should be well-written. (6 points)
5. Your results and business insights should be interesting and meaningful. (3 points)

Note: You may use this cell to write your 5 questions

Question 1:

Question 2:

Question 3:

Question 4:

Question 5:

In [ ]:
# Your code to answer Question 1
In [ ]:
# Your code to answer Question 2
In [ ]:
# Your code to answer Question 3
In [ ]:
# Your code to answer Question 4
In [ ]:
# Your code to answer Question 5

Executive Summary & Business insights:

Note: You need to use the cell below to write your executive summary & business insights. If you need more space, press Enter to go to the next line.

write here

Grading:

PART 1 - 75 points (85 points with the extra credits in the Bonus Question)

  • Question 1: 9 points (6 points for part (a) and 3 points for part (b))
  • Question 2: 15 points (9 points for part (a) and 6 points for part (b))
  • Question 3: 12 points (9 points for part (a) and 3 points for part (b))
  • Question 4: 21 points (9 points for part (a) and 12 points for part (b))
  • Question 5: 18 points (9 points for part (a) and 9 points for part (b))
  • Bonus Question: 10 points (extra credit): (8 points for part (a) and 2 points for part (b))

PART 2 - 25 points

  • You need to ask five business-related questions (5 points).
  • You need to answer these five questions using Python and the two datasets (5 points).
  • You need to have at least "5" graphs to visualize your insights (6 points).
  • Your executive summary should be well-written (6 points).
  • Your results and business insights should be interesting and meaningful (3 points).

Good Luck!

In [ ]:
