
Data Analytics ECS784P

Posted 2025-04-21 by Admin


Coursework 2 specification for 2025

Data Analytics ECS784P,

Revised on 19/02/2025,

Dr Anthony Constantinou, Dr Neville Kitson.

1. Important Dates

Release date: Week 9, Tuesday 25th March 2025 at 20:00.
Submission deadline: Week 13, Thursday 24th April 2025 at 10:00 AM.
Late submission deadline (cumulative penalty applies): Within 7 days after deadline.

General information (same as Coursework 1):

i. Students will sometimes upload their coursework as a draft and not hit the submit button. Make sure you fully complete the submission process.

ii. A penalty will be applied automatically by the system for late submissions.

a. Lecturers cannot remove the penalty!

b. Penalties can only be challenged via submission of an Extenuating Circumstances (EC) form which can be found on your Student Support page. All the information you need to know is on that page, including how to submit an EC claim along with the deadline dates and full guidelines.

c. Deadline extensions can only be granted through approval of an EC claim.

d. If you submit an EC form, your case will be reviewed by a panel. When the panel reaches a decision, they will inform both you and the module organiser (Anthony).

e. If you miss both the submission deadline and the late submission deadline, you will automatically receive a score of 0.

iii. Submissions via e-mail are not accepted.

iv. The School recommends that we set the deadline during a weekday at 10:00 AM.

v. For more details on submission regulations, please refer to your relevant student handbook.

2. Coursework overview

Coursework 2 involves applying causal machine learning to a data set of your choice. You will have to complete a series of tasks, and then answer a set of questions.

• This coursework is based on the lecture material covered between Weeks 6 and 12, and on the lab material covered between Weeks 9 and 11.

• The coursework must be completed individually.

• Submission should be a single file (Word or PDF) containing your answers to each of the questions.

o Ensure you clearly indicate which answer corresponds to what question.

o Data sets and other relevant files are not needed for submission, but do save them in case we ask to have a look at them.

• To complete the coursework, follow the tasks below and answer ALL questions enumerated in Section 3. It is recommended that you read this document in full before you start completing Task 1.

• You can start working on your answers as early as you want, but keep in mind that you need to go through the material up to Week 11 to gain the knowledge needed to answer all the questions.

TASK 1: Set up and reading

a) Visit http://bayesian-ai.eecs.qmul.ac.uk/bayesys/

b) Download the Bayesys user manual.

c) Set up the NetBeans project by following the steps in Section 1 of the manual.

Note that the screenshots in the manual may vary slightly depending on the NetBeans version, but we do not expect any misalignment with the instructions.

d) Read Sections 2, 3, 4 and 5 of the manual.

e) Skip Section 6.

f) Read Section 7 and repeat the example.

i. Skip subsections 7.3 and 7.4.

g) Read Section 8 and repeat the example.

h) Skip Sections 9, 10, 11 and 12.

i) Read Section 13.

i. Skip subsection 13.6.

TASK 2: Determine research area and prepare data set

You are free to choose or collate your own data set. As with Coursework 1, we recommend that you address a problem you are interested in or related to your professional field. If you are motivated by the subject matter, the project will be more fun for you, and you will likely perform better.

Data requirements:

• Size of data: The data set must contain at least 8 variables (yes, penalty applies for using <8 variables). There is no upper-bound restriction on the number of the variables. However, we recommend using <50 variables for the purposes of the coursework to make it much easier for you to visualise the causal graph, and to save computational runtime. While the vast majority of submissions typically rely on relatively small data sets that take a few seconds to ‘learn’, keep in mind some algorithms might take hours to complete learning when given more than 100 variables!

i. You do not need to use a special technique for feature selection – it is up to you to decide which variables to keep. We will not be assessing feature selection decisions.

ii. There is no sample-size restriction and you are free to use a part of the samples. For example, your data set may contain millions of rows and you may want to use fewer to speed-up learning.

• Re-use data from CW1: You are allowed to reuse the data set you have prepared for Coursework 1, as long as: a) you consider that data set to be suitable for causal structure learning (refer to Q1 in Section 3), and b) it contains at least 8 variables.

• Bayesys repository: You are not allowed to use any of the data sets available in the Bayesys repository for this coursework.

• Categorical data: Bayesys assumes the input data are categorical or discrete; e.g., {"low","medium","high"}, {"yellow","blue","green"}, {"<10","10-20","20+"} etc., rather than a continuous range of numbers. If your data set contains continuous variables, Bayesys will consider each value of a continuous variable as a different category. This will cause problems with model dimensionality, leading to poor accuracy and high runtime (if it is not clear why, refer to the Conditional Probability Tables (CPTs) covered in the lectures).
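To illustrate the dimensionality problem, the sketch below counts the free parameters of a single CPT using the usual rule of (number of states - 1) multiplied by the number of parent-state combinations. The function name and the example state counts are hypothetical and chosen only for illustration.

```python
from math import prod


def cpt_free_parameters(node_states, parent_states):
    """Free parameters of one CPT.

    Each combination of parent states needs (node_states - 1) independent
    probabilities, because the last probability is fixed by the others.
    """
    return (node_states - 1) * prod(parent_states)


# A 3-state node with two 3-state parents (hypothetical, discretised data).
discretised = cpt_free_parameters(3, [3, 3])

# The same node if every variable kept 500 distinct raw values (hypothetical).
undiscretised = cpt_free_parameters(500, [500, 500])

print(discretised, undiscretised)
```

With these illustrative numbers the discretised CPT needs 18 parameters, while the undiscretised one needs 124,750,000, far more than most data sets can support.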

To address this issue, you should discretise all continuous variables to reduce the number of states to reasonable levels. For example, a variable with continuous values ranging from 1 to 100 (e.g., {"14.34","78.56","89.23"}) can be discretised into categories such as {"1to20","21to40","41to60","61to80","81to100"}. Because Coursework 2 is not concerned with data pre-processing, you are free to follow any approach you wish to discretise continuous variables. You could discretise the variables manually as discussed in the above example, or even use k-means which we covered in previous lectures, or any other data discretisation approach. We will not be assessing data discretisation decisions.
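The manual discretisation described above could be sketched as follows. This is only one possible approach (equal-width bins); the function name, the default range of 1 to 100 and the choice of five bins are assumptions taken from the example in the text, not a required method.

```python
import math


def discretise(value, low=1, high=100, bins=5):
    """Map a numeric value in [low, high] to an equal-width label, e.g. '21to40'."""
    width = (high - low + 1) // bins  # 20 for the 1..100 example
    # ceil() sends 20.5 to the second bin, while exactly 20 stays in the first.
    index = math.ceil((value - low + 1) / width) - 1
    index = min(max(index, 0), bins - 1)  # clamp values at or beyond the edges
    start = low + index * width
    end = high if index == bins - 1 else start + width - 1
    return f"{start}to{end}"


for raw in ("14.34", "78.56", "89.23"):
    print(raw, "->", discretise(float(raw)))
```

Applying the function to every cell of a continuous column yields a small, fixed set of categories that Bayesys can treat as discrete states.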

• Missing data values: The input data set must not contain missing values/empty cells. If it does, the easiest solution would be to replace ALL empty cells with a new category value called missing (or use a different relevant name). This will force the algorithms to consider missing values as an additional state.

Alternatively, you could use any data imputation approach, such as MissForest.

We will not be assessing data imputation decisions.
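As a minimal sketch of the simplest option above, the snippet below replaces every empty cell with the category missing and checks the 8-variable minimum. The column names and values are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def fill_missing(rows, label="missing"):
    """Replace empty or whitespace-only cells with a placeholder category."""
    return [[cell if cell.strip() else label for cell in row] for row in rows]


def has_enough_variables(header, minimum=8):
    """True when the data set meets the minimum number of variables."""
    return len(header) >= minimum


# Hypothetical three-column sample; a real data set needs at least 8 columns.
raw = "Age,Smoker,Income\n21to40,yes,\n,no,high\n"
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], fill_missing(rows[1:])

print(data)
print(has_enough_variables(header))
```

To work on a real file, read it with csv.reader from an opened file instead of io.StringIO, and write the cleaned rows back out with csv.writer.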

Once you ensure your data set is consistent with what has been stated above, rename your data set to trainingData.csv and place it in folder Input.

TASK 3: Draw out your knowledge-based graph

1. Use your own knowledge to produce a knowledge-based causal graph based on the variables you decide to keep in your data set. Remember that this graph is based on your knowledge, and it is not necessarily correct or incorrect. You will compare the graphs learnt by the different algorithms with reference to your knowledge graph.

You may find it easier if you start drawing the graph by hand, and then record the directed relationships in the DAGtrue.csv file. In creating your DAGtrue.csv file, we recommend that you edit one of the sample files that come with Bayesys; e.g., create a copy of the DAGtrue_ASIA.csv file available in the directory Sample input files/Structure learning, then rename the file to DAGtrue.csv, and then replace the directed relationships with those present in your knowledge graph.

NOTE: Your knowledge graph should have a maximum node in-degree of 11; i.e., no node in the graph should have more than 11 parents (this is a library/package restriction).
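The in-degree restriction in the NOTE above can be checked before running Bayesys. The sketch below assumes the directed relationships have already been written out as (parent, child) pairs; it does not parse DAGtrue.csv itself, and the function name is hypothetical.

```python
from collections import Counter


def in_degree_violations(edges, limit=11):
    """Return {node: number_of_parents} for nodes with more than `limit` parents."""
    parents = Counter(child for _parent, child in edges)
    return {node: count for node, count in parents.items() if count > limit}


# Example edges, each written as (parent, child).
edges = [("Earthquake", "Alarm"), ("Burglar", "Alarm"), ("Alarm", "Call")]

print(in_degree_violations(edges))           # no node exceeds 11 parents
print(in_degree_violations(edges, limit=1))  # stricter limit, for demonstration
```

An empty result means the graph respects the limit; any node listed needs some of its incoming edges removed.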

2. Once you are happy with the graph you have prepared, ensure the file is called DAGtrue.csv and placed in folder Input.

NOTE: If your OS is not showing file extensions (e.g., .CSV or .PDF), name your file DAGtrue and not DAGtrue.csv; otherwise, the file might unintentionally end up being called DAGtrue.csv.csv (because the real extension is hidden). If this happens, Bayesys will be unable to locate the file.

3. Make a copy of the DAGtrue.csv file, rename the copy to DAGlearned.csv, and place it in folder Output. You can discard the copied file once you complete Task 3 (or save it for the Bonus task!).

4. Ensure that your DAGtrue.csv and trainingData.csv (from Task 2) files are in folder Input, and the DAGlearned.csv file is in folder Output. Run Bayesys in NetBeans.

Under tab Main, select Evaluate graph and then click on the first subprocess as shown below. Then hit the Run button found at the bottom of tab Main.

The above process will generate output information in the terminal window of NetBeans. Save the last three lines, as highlighted in the Fig below; you will need this information later when answering some of the questions in Section 3.

Additionally, the above process should have generated one PDF file in folder Input called DAGtrue.pdf. Save this file as you will need it for later.
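If Bayesys cannot locate a file, a quick check of the folder layout described in the steps above may help. The sketch below is not part of Bayesys; the function names are hypothetical, and the project folder path is something you would supply yourself.

```python
from pathlib import Path

# Files that Task 3 expects, relative to the Bayesys project folder.
EXPECTED = [
    "Input/trainingData.csv",
    "Input/DAGtrue.csv",
    "Output/DAGlearned.csv",
]


def missing_files(project_dir):
    """List the expected files that are absent from the project folder."""
    root = Path(project_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


def doubled_extensions(project_dir):
    """Find files accidentally saved with a '.csv.csv' extension."""
    root = Path(project_dir)
    return sorted(p.name for p in root.rglob("*.csv.csv"))
```

Calling missing_files and doubled_extensions with the path to your own project folder should return two empty lists when everything is in place.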

This only concerns MAC/Linux users: The above process might return an error while creating the PDF file, due to compatibility issues. Even if the system completes the process without errors, the PDF files generated may be corrupted and not open on MAC/Linux. If this happens, you should use the online GraphViz editor to produce your graphs, available here: https://edotor.net/, which converts text into a visual drawing. As an example, copy the code shown below in the web editor:

digraph {

Earthquake -> Alarm

Burglar -> Alarm

Alarm -> Call

}

If you are drawing a CPDAG containing undirected edges, then consider:

digraph {

Earthquake -> Alarm

Burglar -> Alarm

Alarm -> Call [arrowhead=none];

}

You can then edit the above code to be consistent with your DAGtrue.csv. You could copy-and-paste the variable relationships (e.g., Earthquake → Alarm) directly from DAGtrue.csv into the code editor, taking care to remove commas and quote any variable names containing spaces.
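If you have many relationships, the GraphViz text can also be generated rather than typed. The sketch below assumes the relationships are available as (source, target) pairs; the function names are hypothetical. It quotes every variable name, which GraphViz accepts and which also handles names containing spaces.

```python
def quote(name):
    """Wrap a variable name in double quotes, escaping any quotes inside it."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(directed, undirected=()):
    """Build GraphViz text from directed edges and optional undirected edges."""
    lines = ["digraph {"]
    for source, target in directed:
        lines.append(f"  {quote(source)} -> {quote(target)}")
    for source, target in undirected:
        # CPDAG-style undirected edge: drawn without an arrowhead.
        lines.append(f"  {quote(source)} -> {quote(target)} [arrowhead=none];")
    lines.append("}")
    return "\n".join(lines)


print(to_dot([("Earthquake", "Alarm"), ("Burglar", "Alarm")],
             undirected=[("Alarm", "Phone call")]))
```

The printed text can be pasted directly into the web editor.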

TASK 4: Perform structure learning

1. Run Bayesys. Under tab Main, select Structure learning and algorithm HC (default selection). Select Evaluate graph and then click on the last two (out of four) options so that you also generate the learned DAG and CPDAG in PDF files, in addition to the DAGlearned.csv file which is generated by default. Then, hit the Run button.

2. Once the above process completes, you should see:

i. Relevant text generated in the terminal window of NetBeans.

ii. The files DAGlearned.csv, DAGlearned.pdf and CPDAGlearned.pdf should be generated in folder Output. As stated in Task 3, the PDF files may be corrupted on MAC/Linux, and you will have to use the online GraphViz editor to produce the graph corresponding to DAGlearned.csv (simply copy the relationships from the CSV file into the editor as discussed in Task 3).

3. Repeat the above process for the other four algorithms; i.e., TABU, SaiyanH, MAHC and GES. Save the same output information and files that each algorithm produces (ensure you first read the NOTE below).

NOTE: As stated in the manual, Bayesys overwrites the output files every time it runs. You need to remember to either rename or move the output files to another folder before running the next algorithm.
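One way to avoid losing results to overwriting is to copy the output files into a per-algorithm folder after each run. The sketch below is a convenience script, not a Bayesys feature; the function name and the archive folder are assumptions, while the file names are those listed in step 2 above.

```python
import shutil
from pathlib import Path

# Output files named in step 2 of Task 4.
OUTPUT_FILES = ["DAGlearned.csv", "DAGlearned.pdf", "CPDAGlearned.pdf"]


def archive_outputs(output_dir, archive_dir, algorithm):
    """Copy existing output files into <archive_dir>/<algorithm>/ and return the copies."""
    source = Path(output_dir)
    destination = Path(archive_dir) / algorithm
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in OUTPUT_FILES:
        if (source / name).is_file():  # skip files that were not generated
            copied.append(Path(shutil.copy2(source / name, destination / name)))
    return copied
```

For example, calling archive_outputs("Output", "results", "HC") after the HC run, and again with "TABU" after the TABU run, keeps each algorithm's files separate (here "results" is a hypothetical folder name).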

Similarly, if you happen to have one of the output files open (for example, viewing DAGlearned.pdf in Adobe Reader while running structure learning), Bayesys will fail to replace the PDF file, and the output file will not reflect the latest iteration. Ensure you close all output files before running structure learning.

BONUS TASK (15%): Working with a Bayesian network

Note: This task can provide up to an additional 15% to your CW2 mark. However, please note that your total mark for CW2 cannot exceed 100%. For instance, if you score 95% excluding the bonus task, and 10% in the bonus task (totalling 105%), your final CW2 mark will be capped at 100%.

You have learned the graph structure and model parameters in Bayesys, and these are stored in the GeNIe_BN.xdsl file in a format readable by the GeNIe BN software. GeNIe enables us to make inferences and estimate the effects of interventions, actions, or decisions. In this task, you will use GeNIe to make inferences.

1. File requirements: For this task, we require a graphical structure and a dataset.

a. For the structure, you can use either (i) the one you constructed manually based on your knowledge, or (ii) a structure learned using any of the structure learning algorithms. Pick one and name it DAGlearned.csv, then place it in the Output folder.

b. Ensure that your dataset is also placed in the Input folder and named trainingData.csv.

2. Generating a Bayesian network model:

a. Return to the Bayesys manual and read Section 6.1. Follow the instructions in the first paragraph of Section 7.3. This process will convert your learned graph into a Bayesian network model with the file extension XDSL. To do this, run Bayesys and click on Generate BN model (GeNIe is the default choice).

b. The file generated in the Output folder, named GeNIe_BN.xdsl, can be loaded into the commercial GeNIe BN software.

3. Setup software:

a. Download the GeNIe BN software (Academic version) from this link: https://www.bayesfusion.com/downloads/

b. Under Academic Downloads, click on BayesFusion Downloads for Academia. Note that the academic version is free for academic use, but you will need to validate your profile using a Microsoft, LinkedIn, Yahoo, or Google account.

4. Visualising the model:

a. Double click on the file GeNIe_BN.xdsl, created in Step 2, to load your model into GeNIe.

b. Return to Section 7.3 of the Bayesys manual, and follow the Layout instructions to automatically spread the nodes apart for better visualisation of the network. You could also adjust the positioning of the nodes manually, if you wish. Skip subsection 7.3.1 and everything that follows.

c. Click on the Update (Lightning strike symbol) button to generate the output of each node, which should be a categorical distribution by default. Node visualisation can be changed by right-clicking on a node, selecting View as, and switching between Icon and Bar Chart, where the latter choice presents the distribution of states and associated probabilities.

d. Provide a screen capture of your network with the nodes shown as Bar Chart (i.e., showing the distribution and not a node icon). In Windows OS, you can do this by clicking on a node within your network in GeNIe, then pressing CTRL+A to select all nodes. With all nodes selected, press CTRL+C to copy them. Open Microsoft Paint (or a similar software) and press CTRL+V to paste the network as an image. Save the image.

5. Exploring the model (Optional): You can explore how your model behaves after entering evidence.

a. In GeNIe, click on Network in the menu and then select Update Immediately. This ensures that once you enter evidence into a node, the posterior distributions on the unobserved nodes are updated instantly, without the need to click the Update (Lightning strike symbol) button.

b. You can now enter evidence by either (i) right-clicking on a node, selecting Enter evidence, and choosing one of the available categories, or (ii) double-clicking on one of the states of a node to instantiate the node in that state.

6. Validation: Now you are ready to perform different forms of predictive validation on your network. These include classification accuracy, cross-validation, and ROC Curve analysis based on the confusion matrix values. We will focus on the simplest methods available in the Desktop version of GeNIe, as the more advanced ones require API access and coding.

a. More detailed information about validation is available in the GeNIe modeller manual, accessible here: https://support.bayesfusion.com/docs/GeNIe.pdf, on pages 525-544. It is recommended to read these pages (they contain many images and little text), or refer to them while following the subtasks below.

b. First, load the dataset by selecting File, then Open Data File, navigating to the Input folder in Bayesys, and selecting trainingData.csv to load it.

c. Note that the GeNIe menu options change depending on whether you are viewing a dataset or a network. Minimise the dataset (using the smaller minimise button in the top right hand corner) and view the network again.

The menu should now include a Learning tab. Click on it and select Validate.

d. After activating validation, a new window will appear showing the values assumed for each node state in the BN model, and the values read from trainingData.csv for that same variable. These may be slightly different because Bayesys might modify some values to ensure consistency with the restrictions built into GeNIe regarding variable names. For example, in the case below, numeric values are automatically prefixed with the text ‘a_’ by Bayesys (while converting the graph into a GeNIe BN model) because GeNIe does not allow variable name states to begin with a numeric value.

To ensure accurate validation, order the text-based states to match the integer-based states, as shown in the figures below. For example, the left figure shows mismatched states for the node Age, which are corrected by manually rearranging them by clicking and dragging. Click OK once the states are matched.

Left figure: States of node Age in the BN model mismatched with variable Age in trainingData.csv

Right figure: States of node Age in the BN model matched with variable Age in trainingData.csv

e. In the next window, select K-fold cross-validation and set your preferred number of folds. You will also see a list of all nodes in your network. Select a minimum of two and a maximum of four nodes to validate, then click OK.

f. The process will generate a new window displaying four types of results: Classification Accuracy, Confusion Matrix of the Classification, ROC Curve, and Calibration. Present the results for the first three types of validation (i.e., ignore Calibration) and provide a brief discussion of your interpretation of each of these three results.

3. Questions

This coursework involves applying five different structure learning algorithms to your data set. We do not expect you to have a detailed understanding of how the algorithms operate. None of the Questions focuses on the algorithms and hence, your answers should not focus on discussing differences between algorithms.

• You should answer ALL questions.

• You should answer the questions in your own words.

• Do not exceed the maximum number of words specified for each question. If a question restricts the answer to, say 100 words, only the first 100 words will be considered when marking the answer.

• Marking is out of 100.

QUESTION 1: Discuss the research area and the data set you have prepared, along with pointers to your data sources. Screen-capture part of the final version of your data set and present it here as a Figure. For example, if your data set contains 15 variables and 1,000 samples, you could present the first 10 columns and a small part of the sample size. Explain why you considered this data set to be suitable for structure learning, and what questions you expect a structure learning algorithm to answer.

Maximum number of words: 150

Marks: 10

QUESTION 2: Present your knowledge-based DAG (i.e., DAGtrue.pdf or the corresponding DAGtrue.csv graph visualised through the web editor), and briefly describe the information you have considered to produce this graph. For example, did you refer to the literature to obtain the necessary knowledge, or did you consider your own knowledge to be sufficient for this problem? If you referred to the literature to obtain additional information, provide references and very briefly describe the knowledge gained from each paper. If you did not refer to the literature, justify why you considered your own knowledge to be sufficient in determining the knowledge-based graph.

NOTE: It is possible to obtain maximum marks without referring to the literature, as long as you clearly justify why you considered your personal knowledge alone to be sufficient.

Any references provided will not be counted towards the word limit.

Maximum number of words: 200

Marks: 10


QUESTION 3: Complete Table Q3 below with the results you have obtained by applying each of the algorithms to your data set during Task 4. Compare your CPDAG scores produced by F1, SHD and BSF with the corresponding CPDAG scores shown in Table 3.1 (page 13) in the Bayesys manual.

Specifically, are your scores mostly lower, similar, or higher compared to those shown in Table 3.1 in the manual? Why do you think this is? Is this the result you expected? Explain why.

Table Q3. The scores of the five algorithms when applied to your data set.

Algorithm | CPDAG BSF | CPDAG SHD | CPDAG F1 | Log-Likelihood (LL) score | BIC score | # free parameters | Structure learning elapsed time
HC        |           |           |          |                           |           |                   |
TABU      |           |           |          |                           |           |                   |
SaiyanH   |           |           |          |                           |           |                   |
MAHC      |           |           |          |                           |           |                   |
GES       |           |           |          |                           |           |                   |

Maximum number of words: 250

Marks: 15

QUESTION 4: Present the CPDAG generated by HC (i.e., CPDAGlearned.pdf or the corresponding CPDAGlearned.csv graph visualised through the web editor). Highlight the three causal classes in the CPDAG. You only need to highlight one example for each causal class. If a causal class is not present in the CPDAG, explain why this might be the case.

Maximum number of words: 200

Marks: 10

QUESTION 5: Rank the five algorithms by score, as determined by each of the three metrics specified in Table Q5. Are your rankings consistent with the rankings shown under the column “Rankings according to the Bayesys manual” in Table Q5 below? Is this the result you expected? Explain why.

Table Q5. Rankings of the algorithms based on your data set, versus ranking of the algorithms based on the results shown in Table 3.1 in the Bayesys manual.

     | Your rankings                                               | Rankings according to the Bayesys manual
Rank | BSF [single score] | SHD [single score] | F1 [single score] | BSF [average score] | SHD [average score] | F1 [average score]
1    |                    |                    |                   | SaiyanH [0.516]     | MAHC [44.6]         | SaiyanH [0.584]
2    |                    |                    |                   | TABU [0.515]        | TABU [49.21]        | TABU [0.569]
3    |                    |                    |                   | HC [0.514]          | HC [49.46]          | HC [0.567]
4    |                    |                    |                   | GES [0.505]         | GES [50.56]         | MAHC [0.562]
5    |                    |                    |                   | MAHC [0.487]        | SaiyanH [55.22]     | GES [0.557]

Maximum number of words: 200

Marks: 10

QUESTION 6: Refer to your elapsed structure learning runtimes and compare them to the runtimes shown in Table 3.1 in the Bayesys manual. Indicate whether your results are consistent or not with the results shown in Table 3.1. Explain why.

Maximum number of words: 100

Marks: 10

QUESTION 7: Compare the BIC score, the Log-Likelihood (LL) score, and the number of free parameters generated in Task 3, against the same values produced by the five structure learning algorithms you used in Task 4, and enter these values into Table Q7.

What do you understand from the difference between those three scores? Are these the results you expected? Explain why.

Table Q7. The BIC scores, Log-Likelihood (LL) scores, and number of free parameters generated for your knowledge-based graph in Task 3 and by each of the five algorithms in Task 4.

Results             | Graph / Algorithm          | BIC score | Log-Likelihood | Free parameters
Your Task 3 results | Your knowledge-based graph |           |                |
Your Task 4 results | HC                         |           |                |
Your Task 4 results | TABU                       |           |                |
Your Task 4 results | SaiyanH                    |           |                |
Your Task 4 results | MAHC                       |           |                |
Your Task 4 results | GES                        |           |                |

Maximum number of words: 200

Marks: 15
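As background for relating the three scores in Question 7, the sketch below shows how a BIC score trades off fit against complexity. It assumes the common form BIC = LL - (log2(N) / 2) x F, where N is the sample size and F the number of free parameters; check the Bayesys manual for the exact definition it uses. All numbers are invented for illustration.

```python
import math


def bic(log_likelihood, free_parameters, sample_size):
    """BIC under the assumed form: LL minus a complexity penalty."""
    penalty = (math.log2(sample_size) / 2) * free_parameters
    return log_likelihood - penalty


# Hypothetical values: the denser graph fits better (higher LL) but uses
# more free parameters, so its BIC ends up lower.
sparse = bic(log_likelihood=-5000.0, free_parameters=20, sample_size=1024)
dense = bic(log_likelihood=-4950.0, free_parameters=60, sample_size=1024)

print(sparse, dense)
```

Under this assumed form, a graph with a higher Log-Likelihood can still receive a lower BIC if it needs many more free parameters.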

QUESTION 8: Select TWO knowledge approaches from those covered in the Week 11 Lecture and Lab; i.e., any two of the following: a) Directed, b) Undirected, c) Forbidden, d) Temporal, e) Initial graph, f) Variables are relevant, and g) Target nodes. Apply each of the two approaches to the structure learning process of HC, separately (i.e., only use one knowledge approach at a time). It is up to you to decide how many constraints to specify for each approach. Then, complete Table Q8 and explain the differences in scores produced before and after incorporating knowledge. Are these the results you expected? Explain why.

Remember to clarify which two knowledge approaches you have selected from those listed between (a) and (g) above, and show in a separate/new table the constraints you have specified for each approach. These constraints must come from your knowledge graph you have produced in Task 3. Note that knowledge approach (f) does not require any constraints; but yes, you can still use this as one of your two selections.

Table Q8. The scores of HC applied to your data, with and without knowledge.

Knowledge approach                                            | CPDAG BSF | CPDAG SHD | CPDAG F1 | LL | BIC | Free parameters | Number of edges | Runtime
Without knowledge                                             |           |           |          |    |     |                 |                 |
With knowledge: [list your 1st knowledge approach here]       |           |           |          |    |     |                 |                 |
With knowledge: [list your 2nd knowledge approach here]       |           |           |          |    |     |                 |                 |

Maximum number of words: 300

Marks: 20

QUESTION 9 (BONUS): The tasks enumerated below refer to the subtasks under the BONUS task described on pages 8-10.

1. Provide a screen capture of your network (see subtask 4).

2. Present and explain the results on Classification Accuracy (see subtask 6).

3. Present and explain the Confusion Matrix (see subtask 6).

4. Present and explain the ROC Curve (see subtask 6).

Maximum number of words: 300

Bonus marks: up to 15
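When interpreting the Classification Accuracy and Confusion Matrix requested in Question 9, it may help to see how the two relate. The sketch below assumes rows are actual states and columns are predicted states (confirm the orientation in the GeNIe output); the matrix values are invented and the function names are hypothetical.

```python
def accuracy(confusion):
    """Share of cases on the diagonal, i.e. predicted state equals actual state."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def recall_per_state(confusion):
    """For each actual state (row), the share of its cases predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Hypothetical 2-state node: rows = actual, columns = predicted.
matrix = [
    [50, 10],
    [5, 35],
]

print(accuracy(matrix))
print(recall_per_state(matrix))
```

Overall accuracy can hide weak performance on a rare state, which is why the per-state figures from the confusion matrix are worth discussing alongside it.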
