BT4212 Homework 4
Search Engine Optimization and Analytics
Term: Fall 2024
Individual Assignment, due Nov 3, 23:59
Submission Instruction
This homework contains several coding tasks and short-answer questions that explore several predictive models for page rank. For the coding parts, please write your code in the corresponding cells. I may provide some comment lines as guidelines. For short answers, type your answer in the cells marked ANSWER: HERE. Please double-click those cells and enter your answer directly.
I recommend you use Python 3 for this homework. Python 2 may not be supported. You can use either your own PC or Google Colab to do this homework. GPU support is NOT required.
Save your notebook .ipynb file as StudentID_YourName_HW4.ipynb. Generate an .html file from .ipynb file and save as StudentID_YourName_HW4.html. Zip your notebook file and html file into a single .zip file.
Upload the zip file as StudentID_YourName_HW4.zip. Please DO NOT include the data files in your zip file (they are too large to upload and download).
Please make sure your code is executable.
If you are using Google Colab,
- Please make sure you have expanded all hidden cells. You can refer to https://stackoverflow.com/questions/62457417/unhide-all-cells-in-google-colab for more information.
- How to generate an HTML file from your notebook file in Google Colab? Please refer to https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab
HW4 is worth 80 points in total.
Background Information: About the Data
The data set is acquired in the following steps.
- Identify a set of 20 keywords.
- For each keyword, search on Google and keep the first 98 websites returned. This gives 20 × 98 = 1960 observations in total.
- Split the dataset into training and test (or validation) data. The training data includes 70% of the observations (14 × 98 = 1372 rows), while the test data has 30% of the observations (6 × 98 = 588 rows).
There are two data files: Train_dta.csv for the training data and Test_dta.csv for the test data. Opening a data file with Excel may cause unexpected errors. If that happens, simply download another copy from Canvas.
There are many columns in the data, e.g., title, URL and meta descriptions. Detailed information about each column is in the appendix. We will only use “TitleFlag”, “TitleDensity”, “URLFlag”, “URLDensity”, “MetaFlag”, “MetaDensity”, “PageAuthority”, “DomainAuthority”, “LinkingDomain”, “InboundLink” and “RankingKeyword” as features, and "ReverseRank" as the label.
To avoid potential shallow-copy and deep-copy issues, try to load the data separately for each problem, even though the data may be the same.
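One way to keep each problem's data independent is sketched below. It is only an illustration: the helper name `split_features_label` is hypothetical, the column names are copied from the feature list above (the actual CSV header may spell some of them differently, e.g. the appendix writes UrlFlag), and a small stand-in frame replaces the real files so the sketch runs on its own.

```python
import pandas as pd

# Column names copied from the feature list above (assumed to match the CSV header).
FEATURES = [
    "TitleFlag", "TitleDensity", "URLFlag", "URLDensity", "MetaFlag",
    "MetaDensity", "PageAuthority", "DomainAuthority", "LinkingDomain",
    "InboundLink", "RankingKeyword",
]
LABEL = "ReverseRank"


def split_features_label(df):
    """Return independent (deep) copies of the feature matrix and the label."""
    X = df[FEATURES].copy(deep=True)
    y = df[LABEL].copy(deep=True)
    return X, y


# In the notebook this would be something like:
#   X_train, y_train = split_features_label(pd.read_csv("Train_dta.csv"))
# Stand-in frame so the sketch runs without the data files.
demo = pd.DataFrame({name: [0.0, 1.0, 2.0] for name in FEATURES + [LABEL]})
X_demo, y_demo = split_features_label(demo)

X_demo.loc[0, "TitleFlag"] = 99.0   # modify the copy ...
print(demo.loc[0, "TitleFlag"])     # ... the original frame still holds 0.0
```

Calling the helper again at the start of each problem gives a fresh copy, so a transformation applied in one problem cannot leak into another.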
Q1. Linear Regression (10 points)
Reference: https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html
Pick the most statistically significant variable and interpret its estimated beta coefficient. (5 points)
ANSWER: HERE
Q2. Logistic Regression (15 points)
The baseline logistic regression does not require an ordinal relationship among the levels of the outcome variable, e.g., level 1 does not necessarily imply superiority or inferiority compared with level 2. For an ordinal variable, however, the levels can be ranked, so that one level implies a higher value than another, e.g., for school grades, A is better than B. We will use ordinal logistic regression for this problem.
Please use the training dataset to fit an ordinal logistic regression model with all the variables mentioned above. Use ReverseRank as the outcome variable, since it is clearly a ranked variable. (5 points)
Note that you may not be able to run ordinal logistic regression because the distributions of several variables are too skewed. You can use new_variable = np.log(the_problematic_variable+1) to transform those variables in both training and testing data. (5 points)
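The transformation in the note above could be applied along these lines. The list of skewed columns is an assumption made for illustration (which variables are actually problematic has to be checked on the data), and `log_transform` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def log_transform(df, columns):
    """Return a copy of df with log(x + 1) applied to the given columns."""
    out = df.copy()
    for col in columns:
        # np.log1p(x) computes log(x + 1), as in the note above.
        out[col] = np.log1p(out[col].astype(float))
    return out


# Assumed candidates for illustration only.
SKEWED = ["LinkingDomain", "InboundLink", "RankingKeyword"]

train_demo = pd.DataFrame({
    "LinkingDomain": [0, 9, 999],
    "InboundLink": [0, 99, 99999],
    "RankingKeyword": [0, 1, 10],
})
train_log = log_transform(train_demo, SKEWED)
print(train_log)
```

The same function, with the same column list, should be applied to the test data so that both sets are on the same scale.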
Please use the trained ordinal logistic regression model to predict the rank in the test dataset. Print the predicted rank on the test dataset and report the RMSE of the prediction. (5 points)
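For reference, RMSE can be computed with a small helper such as the one below. The function name `rmse` and the toy numbers are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are 2, 0 and -2, so the mean squared error is 8/3.
print(round(rmse([100, 99, 98], [98, 99, 100]), 4))  # 1.633
```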
References: https://www.statsmodels.org/dev/examples/notebooks/generated/ordinal_regression.html and https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/
Remarks: You will find plenty of useful material about statistical modelling on the UCLA website, although most of it is implemented in R or Stata.
Pick the most statistically significant variable and interpret its estimated beta coefficient. (5 points)
ANSWER: HERE
Problem 2. Pairwise Rank with xgboost (35 points)
Please use the training dataset to fit an XGBoost model with all the variables mentioned above. Use ReverseRank as the target variable.
Show the feature importance plot. Use "Gain" as the measure in the importance plot.
For the XGBoost prediction, please interpret the importance value of the most important variable. Please use the trained XGBoost model to predict the rank in the test dataset and report the RMSE of the prediction.
Reference: https://xgboost.readthedocs.io/en/latest/python/index.html
Please interpret the importance value of the most important variable. (5 points)
ANSWER: HERE
Try different combinations of the hyper-parameters to get a lower RMSE. (5 points)
Please note that you can use any of the hyper-parameter tuning techniques covered in the lecture. Points are not awarded based on the exact RMSE value. As long as it is lower than the original value and you can justify the technique you use, you will be awarded the points.
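As one example of a tuning technique, an exhaustive grid search can be sketched in a few lines. Everything here is illustrative: `grid_search` is a hypothetical helper, and `toy_rmse` is a stand-in for a function that would train a model with the given parameters and return its validation RMSE.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every parameter combination; return the one with the lowest score."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_rmse(params):
    # Stand-in objective; a real one would fit a model and return its RMSE.
    return abs(params["max_depth"] - 4) + abs(params["eta"] - 0.1)


grid = {"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(toy_rmse, grid)
print(best_params, best_score)
```

Scoring each combination on a held-out part of the training data, rather than on the test data, keeps the reported test RMSE an honest estimate.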
Q2. Redo Q1 by using rank:pairwise for objective and rmse for eval_metric. (15 points)
Observe that we have not used the query information yet. However, we know that each query in the dataset corresponds to 98 data points. We will now use this information by setting the group information for both the training and test data. (5 points)
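The group information is simply a list of group sizes. The sketch below builds it under the assumption that the rows of each file are stored in contiguous blocks of 98 per query; `make_groups` is a hypothetical helper name.

```python
GROUP_SIZE = 98  # pages per query, as described above


def make_groups(n_rows, group_size=GROUP_SIZE):
    """Return one group size per query, assuming contiguous blocks of rows."""
    if n_rows % group_size != 0:
        raise ValueError("row count is not a multiple of the group size")
    return [group_size] * (n_rows // group_size)


train_groups = make_groups(1372)  # 14 queries
test_groups = make_groups(588)    # 6 queries
print(len(train_groups), len(test_groups))  # 14 6

# With xgboost DMatrix objects named dtrain and dtest (not created here),
# the groups would be attached like this:
#   dtrain.set_group(train_groups)
#   dtest.set_group(test_groups)
```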
Please interpret the importance value of the most important variable. (5 points)
ANSWER: HERE
Note that here y_pred is not itself a rank. We need to work out how to obtain the rank from y_pred. The test data can be divided into 6 groups, and we need to perform the following steps for each group.
- For each group, get y_pred from your model.
- The values in y_pred are relative scores for the rank: the higher the score, the better the page rank. Among the 98 pages in one group, the page with the highest score should be ranked 100, and the page with the lowest score should be ranked 3.
- Repeat this for all groups. You are required to implement this with some simple code. y_pred_rank denotes the predicted ranks obtained by transforming y_pred.
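The steps above can be sketched with NumPy as follows. This is one possible implementation, assuming the predictions are ordered in contiguous blocks of 98 per query; `scores_to_ranks` is a hypothetical helper name, and the random scores stand in for real model output.

```python
import numpy as np


def scores_to_ranks(y_pred, group_size=98, lowest_rank=3):
    """Within each group, map the lowest score to 3 and the highest to 100."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.size % group_size != 0:
        raise ValueError("length is not a multiple of the group size")
    scores = y_pred.reshape(-1, group_size)          # one row per query
    # argsort of argsort gives each score's 0-based position in ascending order.
    order = scores.argsort(axis=1, kind="stable").argsort(axis=1, kind="stable")
    return (order + lowest_rank).ravel()


# Stand-in scores for two groups of 98 pages.
rng = np.random.default_rng(0)
y_pred = rng.normal(size=2 * 98)
y_pred_rank = scores_to_ranks(y_pred)
print(y_pred_rank.min(), y_pred_rank.max())  # 3 100
```

Tied scores are ordered by their position in the group, which is an arbitrary but deterministic choice.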
Try different combinations of the hyper-parameters to get a lower RMSE. (5 points)
Please note that you can use any of the hyper-parameter tuning techniques covered in the lecture. Points are not awarded based on the exact RMSE value. As long as it is lower than the original value and you can justify the technique you use, you will be awarded the points.
Please interpret the importance value of the most important variable. (5 points)
ANSWER: HERE
Please paste the RMSE values of all the questions above into the tables below. Please round your RMSE values to 4 decimal places.
To edit a table written in markdown below, double-click the placeholder table, edit the corresponding values and run the cell.
Before Tuning:
Question | P1-Q1 | P1-Q2 | P1-Q3 | P1-Q4 | P2-Q5 | P2-Q6 | P2-Q7 |
---|---|---|---|---|---|---|---|
RMSE | 1.0000 | 2.0000 | 3.0000 | 4.0000 | 5.0000 | 6.0000 | 7.0000 |
After Tuning:
Question | P2-Q5 | P2-Q6 | P2-Q7 |
---|---|---|---|
RMSE | 5.0000 | 6.0000 | 7.0000 |
Question: Please describe what you observe from the RMSE value for all methods of pointwise ranking. What is the best method? Why? (5 points)
Answer: HERE
Question: Please compare the results from pointwise ranking and pairwise ranking and describe your observations. (5 points)
Answer: HERE
Appendix
- ID: identification number (i.e., row number in the dataset)
- Position: the actual Google ranking of the webpage for the query
- ReverseRank: equals 101 minus Position. It is equivalent to the position. Using ReverseRank as the dependent variable can sometimes give a better prediction.
- Title: the title of the webpage
- URL: the URL of the webpage
- Meta: the meta description of the webpage
- TitleFlag: indicates whether the whole keyword is included in the page title. TitleFlag equals 1 if yes and 0 otherwise.
- UrlFlag: indicates whether the whole keyword is included in the page URL. UrlFlag equals 1 if yes and 0 otherwise.
- MetaFlag: indicates whether the whole keyword is included in the page meta description. MetaFlag equals 1 if yes and 0 otherwise.
- TitleDensity: the number of times a keyword appears in the title of a web page, as a percentage of the total number of words in the title.
- UrlDensity: the number of times a keyword appears in the URL of a web page, as a percentage of the total number of words in the URL.
- MetaDensity: the number of times a keyword appears in the meta description of a web page, as a percentage of the total number of words in the meta description.
- PageAuthority: a score developed by Moz that predicts how well a specific page will rank on search engine result pages (SERPs). https://moz.com/learn/seo/page-authority
- DomainAuthority: a search engine ranking score developed by Moz that predicts how well a website will rank on search engine result pages (SERPs). https://moz.com/learn/seo/domain-authority
- LinkingDomain: the number of unique external domains linking to this page. Two or more links from the same website are counted as one linking domain. Provided by Moz.
- InboundLink: the number of unique external pages linking to this page. Two or more links from the same page on a website are counted as one inbound link. Provided by Moz.
- RankingKeyword: the number of keywords for which this site ranks within the top 50 positions on Google US. Provided by Moz.