首页 » 数据科学与大数据技术 » Data Engineering 2025

Data Engineering 2025

2025-02-22 Admin 写评论

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

Assessed Coursework

Course Name Database Systems (H)

Coursework Number 1/1

Deadline Time: 16h30

Date: 14/03/2025

% Contribution to final course mark 20%

Solo or Group ✓ Solo

Group ✓

Anticipated Hours Average 20 hours

Submission Instructions

Please, read the description.

Please Note: This Coursework cannot be Re-Assessed

Code of Assessment Rules for Coursework Submission

Deadlines for the submission of coursework which is to be formally assessed will be published in course documentation, and work which is submitted later than the deadline will be subject to penalty as set out below.

The primary grade and secondary band awarded for coursework which is submitted after the published deadline will be calculated as follows:

(i) in respect of work submitted not more than five working days after the deadline

a. the work will be assessed in the usual way;

b. the primary grade and secondary band so determined will then be reduced by two secondary bands for each working day (or part of a working day) the work was submitted late.

(ii) work submitted more than five working days after the deadline will be awarded Grade H.

Penalties for late submission of coursework will not be imposed if good cause is established for the late submission. You should submit documents supporting good cause via MyCampus.

Penalty for non-adherence to Submission Instructions is: 2 bands

Data Engineering 2025

Task 1 [60 Marks]

Your team is hired by the HM Revenue & Customs (HMRC) to execute an analytics query over the last 4 years estimating the moving average of the collected city council tax across all the UK City Councils. Specifically, youare given access to the relation HMRC (Council, RecordedDate, Tax). The attribute Council is 128 bytes with 320 different values, each one associated with a city council regional authority. The attribute RecordedDate is 64 bytes with 48 different values, each one representing the date of the first day of each month from ‘01-01- 2020’ to ’01-12-2024’. The Tax attribute is a positive real number with size 64 bytes representing the revenue (in GBP) of a specific Council in a specific month. An example of this relation is provided below (the tuple values do not reflect any real figures):

HMRC

For each city council, there are 48 tax revenue values across 48 months (one per month), thus, the HMRC relation has 320*48 = 15,360 tuples. The file accommodating the HMRC relation is not sorted by any attribute.

Your team is asked to execute a window function-based analytics query showing the moving average of the tax per month per city. Given a city, the tax moving average of a month X takes the average of the tax values of the months X-1, X, an X+1. For example, for Birmingham City Council, the moving average for ’01-02-2020’ is: (£261M + £236M + £240M)/3 = £245.67M. The outcome of the moving average is stored in the attribute MovingAVG with size 64 bytes.

The analytics query is provided below:

SELECT Council, RecordedDate, Tax,

AVG (Tax) OVER (PARTITION BY Council

ORDER BY RecordedDate ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

AS MovingAVG FROM HMRC;

Objective: Practice with processing advanced analytics SQL operators using File and Indexing

StructuresGiven that this query may be deployed across diverse operational scenarios—such as high-frequency real-time dashboards for finance departments, annual compliance reporting for councils, or ad-hoc analytics for external auditors—your team must investigate how the underlying file and indexing structures perform under varying workloads, access patterns, and resource constraints (e.g., memory, latency tolerance, or infrastructure costs). Currently, three strategies are proposed for further consideration:

Strategy 1: Sorting File Structure [20 Marks]

Your team proposes to sort the HMRC file by the Council attribute using external sorting. You are asked to devise a query processing algorithm implementing the query using the sorted HMRC file only. For the external sorting, the degree of merging M = 2 and the available memory for sorting a ‘sub-file’ is 1,280 blocks.

Strategy 2: Hashing File Structure [20 Marks]

Your team proposes to hash the HMRC file using the Council attribute. You are asked to devise a query processing algorithm implementing the query using the hashed HMRC file only. The available memory dedicated for hashing is 10,000 blocks.

Strategy 3: Indexing Structure with B+ Tree [20 Marks]

Your team proposes to create a B+ Tree (secondary index) to index the HMRC file using the non-ordering, no unique Council attribute. Each internal node is of order p = 4 (4 pointers/3 key values). Each leaf node stores 3 key values along with 3 pointers and 1 sibling pointer to the ‘next’ leaf node. Each node is stored in 1 block.

You are asked to devise a query processing algorithm implementing the query using the B+ Tree. The cost for building the B+ Tree is equal to the cost for accessing all the blocks of the HMRC file.

For each Strategy:

Optional: Provide a simple sketch/diagram of the underlying data structure(s), which can help you visualizing the query processing steps of your solutions.
Optional: You can use pgAdmin4 to create the relation HMRC populating with fictitious tuples. Then, use the EXPLAIN tool over the provided query and interpret the query processing outcome from your DB server. You could also explore more options, like creating an index over the Council and/or RecordedDate, and then use the EXPLAIN tool again to check whether there you obtain more efficient query processing.
Compulsory:

Describe the steps of your method/algorithm. You must explain the steps you follow to access the blocks and how you calculate the moving average having accessed the required records. Note: the outcome of the query is stored in the main memory where the moving average is stored in the attribute MovingAVG (defined in the query). Assume that there is sufficient memory for storing the outcome of the query. [10 marks per strategy, 30 total]
Report on (i) the expected number of block accesses of each proposed query processing strategy and (ii) the required storage of the proposed structure, if any (e.g., size of B+ Tree). [5 marks per strategy, 15 total]
Calculate the total memory (in blocks) required to store the results of the query. Justify your answer. [5 marks total]
Recommend the best overall strategy for different use cases (e.g., high-frequency queries vs. annual reporting) supported by your cost analysis. [10 marks total]

A file block has size 512 bytes, and any pointer has size 16 bytes.Notes & Submission

Note 1: The Assessed Group Work is graded out of 60 Marks. The group submits one product, and all group members receive the same grade.

Note 2: If there are any issues regarding the load contribution/effort of members, please let me know by emailing the whole group and me.

Note 3: Answer ALL tasks.

Note 4: Only one member of your group will need to submit a document including your group details and answers in one PDF document file.

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *