COMP4650/COMP6490 Document Analysis


Assignment 2 Specification (Version 2)

Machine/Deep Learning and Natural Language Processing

Document Analysis (COMP4650/COMP6490), 2024 Semester 2

Tasks

This assignment consists of 5 tasks related to classifying job descriptions.

Task 1: Analyse the Document Collection (max 5 marks, indicative)

Familiarise yourself with the following job postings dataset available at

https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction/data.

The dataset contains around 18K job postings, of which about 800 are fake. Analyse the collection; in this linguistic analysis, you may wish to report not only on the linguistic characteristics of the postings but also to describe their structure, comparing and contrasting the text under some key posting fields and/or within the different categories considered in the text classification tasks below.
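As a starting point, here is a minimal sketch of one possible analysis, assuming the Kaggle CSV has been downloaded locally as fake_job_postings.csv (the file name shown on the dataset page); the statistics chosen are illustrative only, not required.

```python
# A minimal Task 1 analysis sketch; the file name and chosen statistics
# are assumptions, not requirements.
import pandas as pd

df = pd.read_csv("fake_job_postings.csv")

# Class balance: roughly 18K postings, about 800 of them fake.
print(df["fraudulent"].value_counts())

# A simple linguistic statistic: description length (in words),
# compared between real (0) and fake (1) postings.
df["desc_words"] = df["description"].fillna("").str.split().str.len()
print(df.groupby("fraudulent")["desc_words"].describe())

# A structural comparison: how often key fields are empty in each class.
for field in ["requirements", "company_profile", "required_education"]:
    print(field, df.groupby("fraudulent")[field].apply(lambda s: s.isna().mean()))
```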

Report your analysis methods and results in your PDF file (max 360–440 words), optionally supported by tables and figures, plus a list of references. Submit your code as part of your ZIP file.

Task 2: Text Classification with Logistic Regression (max 5 + 5 = 10 marks, indicative)

In Task 2, you will use the dataset from Task 1 to perform text classification. Recall from the lectures that a text classifier predicts a label for a piece of input text, and such a text classifier can be trained from examples that have the ground truth labels. In Task 2, you will build logistic regression classifiers to label the job descriptions in the dataset.

Task 2 will consider the following two text classification problems:

1. Given a job description (found in the “description” column of the dataset file), predict the required education level of the job (found in the “required_education” column).

a. There are 13 different educational levels found in the “required_education” column of the dataset. To simplify this first classification problem, you can consider only those job postings where the “required_education” column has one of the following three values: “Master’s Degree”, “Bachelor’s Degree”, “High School or equivalent”. You can filter out the other job postings where the required educational level is not one of the above for this first classification problem.

b. You should use only the non-fraudulent job postings (indicated by the “fraudulent” column) to train, validate and test your first classifier.

2. Given a job description, predict whether it is a fraudulent job posting (found in the “fraudulent” column, where 0 indicates a real job posting and 1 indicates a fake job posting).

Imagine a job seeker who has access to many job postings that do not state the required educational level, some of which may be fake. Hypothetically, the job seeker could first apply the second classifier you will build to detect and filter out the fake job postings, and then apply the first classifier you will build to identify the job postings they are qualified for. For this assignment, however, you will train and test the two classifiers independently.

Part A (max 5 out of the max 10 Task 2 marks, indicative)

A simple approach to building a logistic regression model for text classification is to use Term Frequency × Inverse Document Frequency (TF-IDF) features. This approach is relatively straightforward to implement and can be very hard to beat in practice.

We will provide you with some starting code for the first classification problem, i.e., prediction of the educational level based on a job description. It should be straightforward to adapt the code for the second classification problem once you have completed the code for the first.

Specifically, to build a logistic regression classifier using TF-IDF features, you should first implement the get_features_tfidf function (in features.py) that takes a set of training job descriptions as input and calculates the TF-IDF (sparse) document vectors. You may want to use the TfidfVectorizer in the scikit-learn package; read its documentation before using it. For text preprocessing, you could set the analyzer argument of TfidfVectorizer to the tokenise_text function provided in features.py. Alternatively, you may set appropriate values for the arguments of TfidfVectorizer or write your own text preprocessing code. A minimal sketch is given below.
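The sketch assumes get_features_tfidf receives the raw training and held-out descriptions and returns sparse TF-IDF matrices; the exact signature in the provided features.py may differ.

```python
# A sketch of get_features_tfidf; the signature is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

def get_features_tfidf(Xr_train, Xr_test):
    # Fit the vocabulary and IDF weights on the training data only,
    # then reuse them to transform the held-out data.
    # tokenise_text is the preprocessing helper provided in features.py.
    vectoriser = TfidfVectorizer(analyzer=tokenise_text)
    X_train = vectoriser.fit_transform(Xr_train)
    X_test = vectoriser.transform(Xr_test)
    return X_train, X_test
```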

Next, implement the search_C function (in classifier.py) to try several values for the regularisation parameter C and select the best based on the accuracy on the validation data. The train_model and eval_model functions provided in the same Python file might be useful for this task. To try regularisation parameters, you should use an automatic hyper-parameter search method presented in the lectures.
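A minimal grid-search sketch for search_C is shown below. It assumes train_model(C, X_train, y_train) returns a fitted model and eval_model(model, X_val, y_val) returns validation accuracy; adjust to the actual signatures in classifier.py, and substitute the search method from the lectures if it differs.

```python
# A grid-search sketch for search_C; the helper signatures are assumptions.
import numpy as np

def search_C(X_train, y_train, X_val, y_val):
    best_C, best_acc = None, -1.0
    # C is the inverse regularisation strength, so a log-spaced grid
    # spanning several orders of magnitude is a natural choice.
    for C in np.logspace(-3, 3, 13):
        model = train_model(C, X_train, y_train)
        acc = eval_model(model, X_val, y_val)
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```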

You should then run job_classification.py, which first reads in the dataset and splits it into training, validation and test sets; it then trains a logistic regression text classifier and evaluates its performance on the test set. Make sure you first uncomment the line with the analyse_classification_tfidf function (which uses your get_features_tfidf function to generate TF-IDF features, and your search_C function to find the best value of C) in the top-level code block of job_classification.py (i.e., the block after the line “if __name__ == '__main__':”) and then run job_classification.py.

For training and testing this first classifier, remember to filter out fraudulent job postings and job postings whose required educational level is not among these three values: “Master’s Degree”, “Bachelor’s Degree”, “High School or equivalent”.
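A minimal filtering sketch, assuming the data is held in a pandas DataFrame df with the column names from the Kaggle file (as in the Task 1 sketch):

```python
# Keep only real postings with one of the three target education levels.
KEEP = {"Master's Degree", "Bachelor's Degree", "High School or equivalent"}
df1 = df[(df["fraudulent"] == 0) & (df["required_education"].isin(KEEP))]
```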

Next, modify your code to train and test a second classifier for fraudulent job posting detection, i.e., a binary classification problem where the class labels are found in the “fraudulent” column of the dataset. Because this classifier is independent of the first classifier, here you should use job postings of all educational levels.

Answer the following questions in your answers PDF for each of the two classification problems (max 360–440 words, optionally supported by tables and figures, plus a list of references):

1. What range of values for C did you try? Explain why this range is reasonable. Also explain what search technique you used and why it is appropriate here.

2. What was the best performing C value?

3. What was your accuracy on the test set?

Also make sure you submit your code as part of your ZIP file.

Part B (max 5 out of the max 10 Task 2 marks, indicative)

Another simple approach to building a text classifier is to train a logistic regression model that uses aggregated pre-trained word embeddings. While this approach, with simple aggregation, normally works best with short sequences, you will try it out on the job descriptions.

Your task is to use Word2Vec from the gensim package to learn word embeddings, and to predict the required educational level from job descriptions using a logistic regression classifier with the aggregated word embedding features. Read the gensim documentation before using Word2Vec.

First implement the train_w2v function (in word2vec.py) using Word2Vec from the gensim package, then implement the search_hyperparams function (in word2vec.py) to tune at least two of the many hyper-parameters of Word2Vec (e.g. vector_size, window, negative, alpha, epochs, etc.) as well as the regularisation parameter C for your logistic regression classifier. You should use an automatic hyper-parameter search method presented in the lectures. (Hint: the search_C function in classifier.py and the get_features_w2v function in features.py might be useful.)
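A minimal sketch of train_w2v, assuming Xt_train is a list of tokenised sentences (each a list of tokens); the hyper-parameter defaults shown are placeholders to be tuned in search_hyperparams.

```python
# A train_w2v sketch; the signature and default values are assumptions.
from gensim.models import Word2Vec

def train_w2v(Xt_train, vector_size=100, window=5, negative=5,
              alpha=0.025, epochs=5):
    model = Word2Vec(sentences=Xt_train, vector_size=vector_size,
                     window=window, negative=negative, alpha=alpha,
                     epochs=epochs, min_count=1, workers=4)
    return model.wv  # the trained word vectors (gensim KeyedVectors)
```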

Next implement the document_to_vector function in features.py. This function should convert a tokenised document (which is a list of tokenised sentences) into a vector by aggregating the embeddings of the words/tokens in the document using trained Word2Vec word vectors.
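One simple aggregation choice is mean pooling, sketched below under the assumption that doc is a list of tokenised sentences and wv is the trained gensim KeyedVectors object.

```python
# A document_to_vector sketch using mean pooling; the signature is an
# assumption.
import numpy as np

def document_to_vector(doc, wv):
    # Collect the embedding of every in-vocabulary token in the document.
    vecs = [wv[token] for sentence in doc for token in sentence
            if token in wv]
    if not vecs:  # all tokens out of vocabulary
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)
```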

Last, you should uncomment the line with the analyse_classification_w2v function (and comment out the line with analyse_classification_tfidf) in the top-level code block of job_classification.py, and then run job_classification.py to train a logistic regression text classifier with the word vectors from your Word2Vec model, and evaluate the classification performance on the test set.

Similar to Part A, train and test two classifiers for the two classification tasks: prediction of the required educational level, and identification of fraudulent job postings.

Answer the following questions in your answers PDF for each of the two classification problems (max 360–440 words, optionally supported by tables and figures, plus a list of references):

1. What hyper-parameters of Word2Vec did you tune? What ranges of values for the hyper-parameters did you try? Explain why the ranges are reasonable. Also explain what search technique you used and why it is appropriate here.

2. What were the best performing values of the hyper-parameters you tuned?

3. What was your accuracy on the test set? Compare it with the accuracy you got in Part A of this question and discuss why one is more accurate than the other.

Also make sure you submit your code as part of your ZIP file.

Task 3: Text Classification with a Transformer Encoder (max 5 marks, indicative)

In this task, you will apply a transformer-based classifier to perform the same text classification tasks as in Task 2 and compare the classification performance with the results you obtained for Task 2.

Specifically, you will train a transformer encoder using PyTorch. An input sequence is first tokenised and a special [CLS] token prepended to the list of tokens. Similar to BERT, the final hidden state (from the transformer encoder) corresponding to the [CLS] token is used as a representation of the input sequence in this text classification task.

First, implement the get_positional_encoding function in job_classifier.py. This function computes the standard sinusoidal positional encoding

PE(t, 2i) = sin(t / 10000^(2i/d)),    PE(t, 2i+1) = cos(t / 10000^(2i/d)),

for positions t ∈ {0, …, T − 1}, where d is the embedding size (emb_size) and i indexes the pairs of embedding dimensions.
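A minimal PyTorch sketch of this computation, assuming the function should return a (T, emb_size) tensor and that emb_size is even:

```python
# A get_positional_encoding sketch; the signature is an assumption.
import torch

def get_positional_encoding(T, emb_size):
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)      # (T, 1)
    i = torch.arange(0, emb_size, 2, dtype=torch.float32)      # even dims
    denom = torch.pow(10000.0, i / emb_size)                   # 10000^(2i/d)
    pe = torch.zeros(T, emb_size)
    pe[:, 0::2] = torch.sin(t / denom)  # even dimensions get sine
    pe[:, 1::2] = torch.cos(t / denom)  # odd dimensions get cosine
    return pe
```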

Next, complete the __init__ method of class JobClassifier by creating a TransformerEncoder consisting of a stack of TransformerEncoderLayer modules (the number of layers is determined by the parameter num_tfm_layer) with the specified input dimension (emb_size), number of heads (num_head), hidden dimension of the feedforward sub-network (ffn_size) and dropout probability (p_dropout). Then implement the forward method of class JobClassifier. You should implement the following steps sequentially in the function (a sketch of both methods follows the list):

(a) add the positional encoding to the embeddings of the input sequence;

(b) apply dropout (i.e. call the dropout layer of JobClassifier);

(c) call the transformer encoder you created in the __init__ method; (Hint: make sure you set the parameter src_key_padding_mask to an appropriate value.)

(d) extract the final hidden state of the [CLS] token from the transformer encoder for each sequence in the input (mini-batch);

(e) use the linear layer to compute the logits (i.e., the unnormalised classification scores) for binary classification and return the logits.
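A minimal sketch of both methods is given below. The attribute names (embedding, dropout, linear) and the forward signature are assumptions about the starter code; the sketch also reuses the get_positional_encoding sketch above.

```python
# A JobClassifier sketch; attribute names and signatures are assumptions.
import torch
import torch.nn as nn

class JobClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size, num_head, ffn_size,
                 num_tfm_layer, p_dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.dropout = nn.Dropout(p_dropout)
        layer = nn.TransformerEncoderLayer(
            d_model=emb_size, nhead=num_head, dim_feedforward=ffn_size,
            dropout=p_dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_tfm_layer)
        self.linear = nn.Linear(emb_size, 1)  # logits for binary labels

    def forward(self, token_ids, padding_mask):
        emb = self.embedding(token_ids)                        # (B, T, d)
        # (a) add positional encodings; (b) apply dropout.
        emb = emb + get_positional_encoding(emb.size(1), emb.size(2))
        emb = self.dropout(emb)
        # (c) run the encoder, masking out the padding positions.
        hidden = self.encoder(emb, src_key_padding_mask=padding_mask)
        # (d) the [CLS] token is the first position of each sequence.
        cls_state = hidden[:, 0, :]
        # (e) project to logits.
        return self.linear(cls_state)
```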

You should read the documentation before implementing these methods.

Last, complete the train_model function to learn the classifier using the training dataset. You should optimise the model parameters through mini-batch stochastic gradient descent with the Adam optimiser. Make sure you evaluate the model on the validation set after each epoch of training and save the parameters of the model that achieves the best accuracy on the validation set.
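A minimal training-loop sketch for train_model, assuming DataLoaders that yield (token_ids, padding_mask, labels) batches, a binary task with BCEWithLogitsLoss, and a hypothetical evaluate helper that returns validation accuracy; adapt the names to the starter code.

```python
# A train_model sketch; the loader format and the evaluate helper are
# assumptions.
import copy
import torch
import torch.nn as nn

def train_model(model, train_loader, val_loader, num_epoch=10, lr=1e-4):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    best_acc, best_state = -1.0, None
    for epoch in range(num_epoch):
        model.train()
        for token_ids, padding_mask, labels in train_loader:
            optimiser.zero_grad()
            logits = model(token_ids, padding_mask).squeeze(-1)
            loss = loss_fn(logits, labels.float())
            loss.backward()
            optimiser.step()
        # Evaluate after every epoch and remember the best parameters.
        acc = evaluate(model, val_loader)  # hypothetical helper
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # restore the best checkpoint
    return model
```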

You are now able to run job_classifier.py which prepares the training, validation and test datasets, trains a classifier and evaluates its accuracy on the test set. Note that this approach does not directly use the combined training and validation datasets to re-train the model, and we adopt it here for simplicity. Tuning hyper-parameters systematically is not required (but encouraged if you have access to sufficient computational resources) for this question, and setting the hyper-parameters to some reasonable values (from your experience) is acceptable. In your answers PDF, you should record the values of hyper-parameters used in your text classifier.

Similar to Task 2, train and test two classifiers for the two classification tasks: prediction of the required educational level, and identification of fraudulent job postings.

Answer the following questions in your answers PDF for each of the two classification problems (max 360–440 words, optionally supported by tables and figures, plus a list of references):

1. What is the architecture of your classifier? (Hint: you may print your classifier in Python and record the output in your answers PDF.)

2. What was your accuracy on the test set?

3. Compare your classification performance with those you got for Task 2 (Part A and Part B). Discuss the advantages and limitations of the three approaches to job description classification.

Also make sure you submit your code as part of your ZIP file.

Task 4: Text Classification with a Pre-trained Language Model (max 1 + 1.5 = 2.5 marks, indicative)

Pre-trained language models such as BERT and ChatGPT have been widely used for many NLP problems. As you have learned in the lectures, these pre-trained language models have strong capabilities to handle a wide range of tasks with little or no further training. In Task 4, you will use pre-trained language models to perform the same text classification problems as in Tasks 2 and 3.

Part A (max 1 out of the max 2.5 Task 4 marks, indicative)

In Lab 6, you will see how to fine-tune a smaller version of BERT. Apply the DistilBERT fine-tuning strategy from Lab 6 to the job posting dataset. Compare the text classification performance of the original DistilBERT and your fine-tuned DistilBERT on the two classification tasks. Note that you will still use the [CLS] token produced by the model to perform classification.
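For orientation, a minimal fine-tuning sketch with the Hugging Face transformers library is shown below; Lab 6 may use different helpers, and train_ds / val_ds are assumed to be datasets.Dataset objects with "description" and "label" columns.

```python
# A DistilBERT fine-tuning sketch; dataset names and hyper-parameter
# values are assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # use 3 for the education task

def tokenize(batch):
    return tokenizer(batch["description"], truncation=True, padding=True)

args = TrainingArguments(output_dir="distilbert-jobs", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=val_ds.map(tokenize, batched=True))
trainer.train()
```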

Analyse and discuss the results you observe in your answers PDF in free-form text (max 90–110 words, optionally supported by tables and figures, plus a list of references). Also make sure you submit your code as part of your ZIP file.

Part B (max 1.5 out of the max 2.5 Task 4 marks, indicative)

Prompt-based zero-shot text classification using a pre-trained language model such as ChatGPT has emerged as a new way of classifying text without any training involved. See, for example, the following Hugging Face page that explains how it works:

https://huggingface.co/tasks/zero-shot-classification
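For example, the Hugging Face zero-shot pipeline described on that page can be applied as follows; the model choice and the example description are assumptions.

```python
# A zero-shot classification sketch using the Hugging Face pipeline.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
labels = ["Master's Degree", "Bachelor's Degree",
          "High School or equivalent"]
result = classifier("Seeking a data analyst with strong SQL skills ...",
                    candidate_labels=labels)
print(result["labels"][0])  # the highest-scoring education level
```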

In this part of Task 4, use a Large Language Model (LLM) of your choice (e.g., ChatGPT, Gemini, or Llama) to predict the required educational level of a sample of job postings in the dataset and observe the performance of the LLM on the task.

Report your findings in your answers PDF in free-form text (max 180–220 words, optionally supported by tables and figures, plus a list of references). Things you could try include, but are not limited to, the following (you are not required to do all of them):

• Compare different LLMs on the job posting classification task, using the same subset of data you have sampled.

• Compare the performance of the prompt-based zero-shot method with the previous methods you have tried in Task 2 and Task 3.

• Use the LLM’s API to batch process a large sample of the data for prompt-based zero-shot classification, if you have access to the API. (This requires additional work of understanding the LLM’s API, which is not covered in this course.)

• Discuss the pros and cons of a training-based method you have explored in the earlier parts of the assignment and of this prompt-based zero-shot method.

Note: You do not need to use an LLM to predict whether a job posting is fraudulent in Part B of Task 4. You should submit your code, prompts, or other supporting information as part of your ZIP file.

Task 5: Reflect on Your Learning and Academic Integrity (max 2.5 marks, indicative)

Add a short, written, free-form answer (max 180–220 words) to your PDF file reflecting on your learning and on academic integrity considerations relevant to your Assignment 2 work. In particular, explain where, how, and why you used Generative AI tools in coding and/or academic writing. Demonstrate your critical thinking skills by analysing and assessing the advantages and disadvantages of this tooling for your learning. If you chose not to use Generative AI tools, reflect on the advantages and disadvantages of this choice for your learning.
