BT4212 Homework 4

Search Engine Optimization and Analytics

Term: Fall 2024

Individual Assignment, due Nov 3, 23:59

Submission Instruction

This homework contains several coding tasks and short-answer questions that explore predictive models for page rank. For the coding part, please write your code in the corresponding cells; some comment lines are provided as guidelines. For the short answers, type your answer in the cells marked ANSWER: HERE. Double-click those cells and enter your answer directly.

I recommend using Python 3 for this homework; Python 2 may not be supported. You can use either your own PC or Google Colab. GPU support is NOT required.

Save your notebook .ipynb file as StudentID_YourName_HW4.ipynb. Generate an .html file from the .ipynb file and save it as StudentID_YourName_HW4.html. Zip your notebook file and HTML file into a single .zip file.

Upload the zip file as StudentID_YourName_HW4.zip. Please DO NOT include the data files in your zip file (they are too large to upload and download).

Please make sure your code is executable.

If you are using Google Colab, see the setup cell below for re-installing the statsmodels package and mounting your Google Drive.

HW4 is worth 80 points in total.


In [ ]:
# Input your name and student ID
name = "your name"
stuID = "A0123456Z"
In [ ]:
# Import packages
# The recommended version is listed, but you could try using the most updated one.
import numpy as np
import pandas as pd
import statsmodels.api as sm  # recommended version: 0.13.0
from statsmodels.miscmodels.ordinal_model import OrderedModel
import xgboost as xgb  # recommended version: 1.5.0
from xgboost import plot_importance
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
In [ ]:
# To fix random seeds (Note that you may still get slightly different results.)
np.random.seed(12345)

# To ignore some warnings
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
In [ ]:
## If you are using Google Colab and get an error regarding statsmodels (OrderedModel),
## you can use the following line to re-install the statsmodels package.
## This will take some time; please RESTART the RUNTIME after installation.
# !pip install statsmodels==0.13.0

## You may also want to mount your Google Drive to allow easy file loading.
# from google.colab import drive
# drive.mount('/content/drive')

Background Information: About the Data

The data set is acquired in the following steps.

  1. Identify a set of 20 keywords.
  2. For each keyword, search on Google and record the first 98 websites. This gives 20×98=1960 observations in total.
  3. Split the dataset into training and test (or validation) data. The training data includes 70% of the observations (14×98=1372 rows), while the test data has 30% of the observations (6×98=588 rows).

There are two data files: Train_dta.csv for training data and Test_dta.csv for testing data. Opening a data file with Excel may cause some unexpected errors; if that happens, simply download another copy from Canvas.

Load Data

In [ ]:
# Load the training and test datasets and print the first 5 rows of the training dataset.
# Please use pd.read_csv("data.csv") to load your data
# Run this cell
feature_col = ["TitleFlag", "TitleDensity", "URLFlag", "URLDensity", "MetaFlag", "MetaDensity",
               "PageAuthority", "DomainAuthority", "LinkingDomain", "InboundLink", "RankingKeyword"]
train_data = pd.read_csv("Train_dta.csv")
test_data = pd.read_csv("Test_dta.csv")

There are many columns in the data, e.g., title, URL and meta descriptions. Detailed information about each column is in the appendix. We will only use "TitleFlag", "TitleDensity", "URLFlag", "URLDensity", "MetaFlag", "MetaDensity", "PageAuthority", "DomainAuthority", "LinkingDomain", "InboundLink" and "RankingKeyword" as features, and "ReverseRank" as the label.

In [ ]:
# Split feature and label
# Run this cell
train_feature = pd.read_csv("Train_dta.csv", usecols=feature_col)
train_label = pd.read_csv("Train_dta.csv", usecols=["ReverseRank"])
test_feature = pd.read_csv("Test_dta.csv", usecols=feature_col)
test_label = pd.read_csv("Test_dta.csv", usecols=["ReverseRank"])
In [ ]:
# Run this cell
print(train_feature.shape, train_label.shape, test_feature.shape, test_label.shape)
In [ ]:
# Run this cell
# This serves as your data input
X_train = train_feature        # training feature, as a dataframe in pandas
X_test = test_feature          # test feature, as a dataframe
y_train = train_label.values   # training label, as a numpy array
y_test = test_label.values     # test label, as a numpy array

To avoid potential shallow-copy/deep-copy issues, try to load the data separately for each problem, even though the inputs may be the same.

Problem 1. Pointwise Rank (35 points)

Q1. Linear Regression (10 points)

Reference: https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html


In [ ]:
# Data Input
In [ ]:
# Model: USE sm.OLS(y_train, X_train)
# Fit: USE model.fit()
# y_train is a numpy array from train_label, X_train is a pandas dataframe from train_feature.
# Remember to add an intercept by sm.add_constant() to both train and test data!
In [ ]:
# Print model summary by model.summary()
In [ ]:
# Make prediction
In [ ]:
# Print RMSE, y_test is a numpy array from test_label, y_pred is a numpy array from your model prediction
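Below is a minimal sketch of the OLS workflow outlined in the comments above, assuming the X_train, X_test, y_train and y_test objects from the Load Data cells. It is only one possible implementation, not the required solution.

# Minimal OLS sketch (assumes X_train/X_test/y_train/y_test from the Load Data cells)
X_train_ols = sm.add_constant(X_train)          # add an intercept column
X_test_ols = sm.add_constant(X_test)            # same transformation for the test set
ols_res = sm.OLS(y_train, X_train_ols).fit()    # fit ordinary least squares
print(ols_res.summary())                        # coefficients, p-values, R-squared
y_pred = ols_res.predict(X_test_ols)            # predicted ReverseRank on the test set
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))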

Pick the most statistically significant variable and interpret its estimated beta coefficient. (5 points)

ANSWER: HERE

Q2. Logistic Regression (15 points)

Baseline logistic regression does not require an ordinal relationship among the levels of the outcome variable; e.g., level 1 does not necessarily imply superiority or inferiority relative to level 2. For an ordinal variable, however, the levels can be ranked, with one level implying a higher value than another; e.g., among school grades, A is better than B. We will use ordinal logistic regression for this problem.

Please use the training dataset to fit an ordinal logistic regression model with all the variables mentioned above. Use ReverseRank, which is clearly a ranked variable, as the outcome variable. (5 points)

Note that you may not be able to run ordinal logistic regression because the distributions of several variables are too skewed. You can use new_variable = np.log(the_problematic_variable+1) to transform those variables in both training and testing data. (5 points)

Please use the trained ordinal logistic regression model to predict the rank in the test dataset. Print the predicted rank on the test dataset and report the RMSE of the prediction. (5 points)

Reference: https://www.statsmodels.org/dev/examples/notebooks/generated/ordinal_regression.html and https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/.

Remarks: You will find plenty of useful material about statistical modelling on the UCLA website, even though most of it is implemented in R or Stata.

In [ ]:
# Data Input
In [ ]:
# Adjust your input
# Please specify which parts of the data you do a log transformation on.
# There are many ways to determine the skewness,
# e.g., plotting the distribution, calling pandas.DataFrame.skew, etc.
# As long as you provide reasons for the transformation, you will be awarded the points.
In [ ]:
# Model: use OrderedModel(y_train, X_train, distr='logit')
# Fit: use model.fit(method='bfgs', disp=False)
# y_train is a numpy array from train_label, X_train is a pandas dataframe from train_feature.
In [ ]:
# Print model summary by model.summary()
In [ ]:
# Make prediction: assume the predicted output of this model is y_pred;
# for this question, you need to do y_pred.argmax(1)+3 to get the real output.
# Please refer to the reference above for how to make the prediction properly.
In [ ]:
# Print RMSE, y_test is a numpy array from test_label, y_pred is a numpy array from your model prediction
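A minimal sketch of the ordinal logistic regression workflow is shown below. The columns chosen for the log transformation are only an illustration; pick and justify your own set based on the skewness checks above.

# Minimal ordinal logistic regression sketch; the transformed columns are a hypothetical choice.
skewed_cols = ["LinkingDomain", "InboundLink", "RankingKeyword"]   # illustration only
X_train_ord = X_train.copy()
X_test_ord = X_test.copy()
for col in skewed_cols:
    X_train_ord[col] = np.log(X_train_ord[col] + 1)   # log(x+1) transform on training data
    X_test_ord[col] = np.log(X_test_ord[col] + 1)     # apply the same transform to test data

ord_res = OrderedModel(y_train.ravel(), X_train_ord, distr='logit').fit(method='bfgs', disp=False)
print(ord_res.summary())

probs = np.asarray(ord_res.predict(X_test_ord))   # one probability column per ReverseRank level
y_pred = probs.argmax(1) + 3                      # class index 0..97 maps back to ReverseRank 3..100
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))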

Pick the most statistically significant variable and interpret its estimated beta coefficient. (5 points)

ANSWER: HERE

Q3. Decision Tree (5 points)

Please use the training dataset to fit a decision tree model with all the variables mentioned above. Use ReverseRank as the target variable.

In [ ]:
# Data Input
In [ ]:
# Model & Fit: USE DecisionTreeRegressor(random_state=0).fit(...)
# y_train is a numpy array from train_label, X_train is a pandas dataframe from train_feature.
# Make prediction
In [ ]:
# Print RMSE, y_test is a numpy array from test_label, y_pred is a numpy array from your model prediction

Q4. Random Forest (5 points)

Please use the training dataset to fit a random forest model with all the variables mentioned above. Use ReverseRank as the target variable.

In [ ]:
# Data Input
In [ ]:
# Model & Fit: USE RandomForestRegressor(random_state=0).fit(...)
# y_train is a numpy array from train_label, X_train is a pandas dataframe from train_feature.
# Make prediction
In [ ]:
# Print RMSE, y_test is a numpy array from test_label, y_pred is a numpy array from your model prediction
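Q3 and Q4 share the same fit/predict/evaluate pattern; a minimal sketch covering both models, under the same data assumptions as above:

# Minimal sketch for both tree-based models; labels are flattened to 1-D to avoid sklearn warnings.
for name, model in [("Decision tree", DecisionTreeRegressor(random_state=0)),
                    ("Random forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    print(name, "RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))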

Problem 2. Pairwise Rank with xgboost (35 points)

Please use the training dataset to fit an XGBoost model with all the variables mentioned above. Use ReverseRank as the target variable.

Show the feature importance plot. Use "Gain" as the measure in the importance plot.

For the XGBoost prediction, please interpret the importance value of the most important variable. Please use the trained XGBoost model to predict the rank in the test dataset and report the RMSE of the prediction.

Reference: https://xgboost.readthedocs.io/en/latest/python/index.html

In [ ]:
# Data Input

Q1. Use reg:linear for objective and rmse for eval_metric. (10 points)

In [ ]:
# Model and Parameter Setting
# Use xgb.XGBRegressor
# Use the parameters below first; you can adjust them later.
reg = xgb.XGBRegressor(max_depth=6, eta=0.1, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                       alpha=0.5, objective="reg:linear", eval_metric="rmse", seed=1)
In [ ]:
# Fit the model by your_model.fit(X_train, y_train)
In [ ]:
# Show the feature importance plot. Use plot_importance(your_model, importance_type='gain')
# If you don't see your plot, please run this cell again.

Please interpret the importance value of the most important variable. (5 points)

ANSWER: HERE

In [ ]:
# Make prediction: y_pred should be the output of your prediction; it contains values
# from 3 to 100 (they may not be integers)
In [ ]:
## Print RMSE, y_test is a numpy array from test_label, y_pred is a numpy array from your model prediction
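A minimal sketch of the pointwise XGBoost workflow, reusing the parameter setting given above and flattening the label array to 1-D:

# Minimal XGBRegressor sketch with the parameters suggested above.
reg = xgb.XGBRegressor(max_depth=6, eta=0.1, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                       alpha=0.5, objective="reg:linear", eval_metric="rmse", seed=1)
reg.fit(X_train, y_train.ravel())                   # pointwise regression on ReverseRank
plot_importance(reg, importance_type='gain')        # gain-based feature importance plot
y_pred = reg.predict(X_test)                        # predictions roughly in the 3..100 range
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))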

Try different combinations of the hyper-parameters to get a lower RMSE. (5 points)

Please note that you can use any of the hyper-parameter tuning techniques from the lecture. Points are not awarded based on the exact RMSE value; as long as it is lower than the original value and you can justify the technique you use, you will be awarded the points.

In [ ]:
# your code here...
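One possible tuning approach is a small cross-validated grid search; the grid below is only an illustration, not a prescribed search space:

# Minimal hyper-parameter tuning sketch using GridSearchCV; the grid values are hypothetical.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [4, 6, 8],
    "eta": [0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
base = xgb.XGBRegressor(gamma=0.1, colsample_bytree=0.8, alpha=0.5,
                        objective="reg:linear", eval_metric="rmse", seed=1)
search = GridSearchCV(base, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_train, y_train.ravel())
print("Best parameters:", search.best_params_)
y_pred = search.best_estimator_.predict(X_test)
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))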

Q2. Redo Q1 by using rank:pairwise for objective and rmse for eval_metric. (15 points)

Observe that we have not utilized the query information yet. In the dataset, each query corresponds to 98 data points. We will now leverage this information by setting the group information for both the training and testing data. (5 points)

In [ ]:
# Model and Parameter Setting# Use xgb.XGBRanker# Use parameters below first and you can adjust it later.reg=xgb.XGBRanker(max_depth=6,eta=0.1,gamma=0.1,subsample=0.8,colsample_bytree=0.8,alpha=0.5,objective="rank:pairwise",eval_metric="rmse")
In [ ]:
# Data Input
In [ ]:
# Fit the model by your_model.fit(X_train, y_train, group = np.full(14,98))
In [ ]:
# Show the feature importance plot. Use plot_importance(your_model, importance_type = 'gain')

Please interpret the importance value of the most important variable. (5 points)

ANSWER: HERE

Note that y_pred is not itself a rank; we need to derive the rank from y_pred. The test data can be divided into 6 groups, and we need to perform the following steps for each group (a minimal sketch follows the list).

  1. For each group, get y_pred from your model.
  2. The values in y_pred indicate a relative score for the rank. The higher the score, the better the page rank. Among the 98 pages in a group, the page with the highest score should rank as 100, and the page with the lowest score should rank as 3.
  3. Repeat this for all groups. You are required to implement this with some simple code. y_pred_rank denotes the predicted ranks you transform from y_pred.
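A minimal sketch of the group-wise conversion, assuming the test rows are ordered query by query (6 consecutive blocks of 98 rows), which matches how the group argument is set:

# Minimal sketch: fit the ranker, then convert relative scores to ranks group by group.
reg.fit(X_train, y_train.ravel(), group=np.full(14, 98))   # 14 training queries, 98 pages each
y_pred = reg.predict(X_test)                               # relative scores, not ranks
y_pred_rank = np.zeros_like(y_pred)
for g in range(6):                                         # 6 test groups of 98 pages each
    idx = slice(g * 98, (g + 1) * 98)
    scores = y_pred[idx]
    # argsort of argsort gives positions 0..97 (lowest score -> 0); shift by 3 so ranks run 3..100
    y_pred_rank[idx] = scores.argsort().argsort() + 3
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred_rank)))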
In [ ]:
# Make prediction: y_pred is the predicted output from your model, y_pred_rank is the actual rank you get
In [ ]:
# Print RMSE, y_test is a numpy array from test_label, y_pred_rank is a numpy array from your model prediction

Try different combinations of the hyper-parameters to get a lower RMSE. (5 points)

Please note that you can use any of the hyper-parameter tuning techniques from the lecture. Points are not awarded based on the exact RMSE value; as long as it is lower than the original value and you can justify the technique you use, you will be awarded the points.

In [ ]:
# your code here...

Q3. Use rank:ndcg for objective and rmse for eval_metric. Redo Q2. (10 points)

In [ ]:
# Model and Parameter Setting# Use xgb.XGBRanker# Use parameters below first and you can adjust it later.reg=xgb.XGBRanker(max_depth=6,eta=0.1,gamma=0.1,subsample=0.8,colsample_bytree=0.8,alpha=0.5,objective="rank:ndcg",eval_metric="rmse")
In [ ]:
# Data Input
In [ ]:
# Fit the model by your_model.fit(X_train, y_train, group = np.full(14,98))
In [ ]:
# Show the feature importance plot. Use plot_importance(your_model, importance_type = 'gain')

Please interpret the importance value of the most important variable. (5 points)

ANSWER: HERE


Problem 3. Interpretation Short-Answers (10 points)

Please paste the RMSE values of all the questions above into the table below. Please round your RMSE to 4 decimal places.

To access the table written in markdown below, double-click the placeholder table, edit the corresponding values, and run the cell.

Before Tuning:

| Question | P1-Q1 | P1-Q2 | P1-Q3 | P1-Q4 | P2-Q1 | P2-Q2 | P2-Q3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RMSE | 1.0000 | 2.0000 | 3.0000 | 4.0000 | 5.0000 | 6.0000 | 7.0000 |

After Tuning:

| Question | P2-Q1 | P2-Q2 | P2-Q3 |
| --- | --- | --- | --- |
| RMSE | 5.0000 | 6.0000 | 7.0000 |

Question: Please describe what you observe from the RMSE values for all the pointwise ranking methods. What is the best method? Why? (5 points)

Answer: HERE

Question: Please compare the results from pointwise ranking and pairwise ranking and describe your observations. (5 points)

Answer: HERE

Appendix

  1. ID: identification number (i.e., row number in the dataset)

  2. Position: the actual Google ranking of the webpage for the query

  3. ReverseRank: equals 101 minus Position. It is equivalent to Position. Sometimes using ReverseRank as the dependent variable can give a better prediction.

  4. Title: the title of the webpage

  5. URL: the URL of the webpage

  6. Meta: the meta description of the webpage

  7. TitleFlag: indicates whether the whole keyword is included in the page title. TitleFlag equals 1 if yes; otherwise, 0.

  8. UrlFlag: indicates whether the whole keyword is included in the page URL. UrlFlag equals 1 if yes; otherwise, 0.

  9. MetaFlag: indicates whether the whole keyword is included in the page meta description. MetaFlag equals 1 if yes; otherwise, 0.

  10. TitleDensity: the number of times the keyword appears in the title of a web page, as a percentage of the total number of words in the title.

  11. UrlDensity: the number of times the keyword appears in the URL of a web page, as a percentage of the total number of words in the URL.

  12. MetaDensity: the number of times the keyword appears in the meta description of a web page, as a percentage of the total number of words in the meta description.

  13. PageAuthority: is a score developed by Moz that predicts how well a specific page will rank on search engine result pages (SERP). https://moz.com/learn/seo/page-authority

  14. DomainAuthority: is a search engine ranking score developed by Moz that predicts how well a website will rank on search engine result pages (SERPs). https://moz.com/learn/seo/domain-authority

  15. LinkingDomain: the number of unique external domains linking to this page. Two or more links from the same website are counted as one linking domain. Provided by Moz.

  16. InboundLink: the number of unique external pages linking to this page. Two or more links from the same page on a website are counted as one inbound link. Provided by Moz.

  17. RankingKeyword: is the number of keywords for which this site ranks within the top 50 positions on Google US. Provided by Moz.
