Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
Assessed Coursework
Code of Assessment Rules for Coursework Submission
The primary grade and secondary band awarded for coursework which is submitted after the published deadline will be calculated as follows:
(ii) work submitted more than five working days after the deadline will be awarded Grade H.
Penalties for late submission of coursework will not be imposed if good cause is established for the late submission. You should submit documents supporting good cause via MyCampus.
Penalty for non-adherence to Submission Instructions is: 2 bands
Data Engineering 2025
For each city council, there are 48 tax revenue values across 48 months (one per month), thus, the HMRC relation has 320*48 = 15,360 tuples. The file accommodating the HMRC relation is not sorted by any attribute.
Your team is asked to execute a window function-based analytics query showing the moving average of the tax per month per city. Given a city, the tax moving average of a month X takes the average of the tax values of the months X-1, X, an X+1. For example, for Birmingham City Council, the moving average for ’01-02-2020’ is: (£261M + £236M + £240M)/3 = £245.67M. The outcome of the moving average is stored in the attribute MovingAVG with size 64 bytes.
Objective: Practice with processing advanced analytics SQL operators using File and Indexing
StructuresGiven that this query may be deployed across diverse operational scenarios—such as high-frequency real-time dashboards for finance departments, annual compliance reporting for councils, or ad-hoc analytics for external auditors—your team must investigate how the underlying file and indexing structures perform under varying workloads, access patterns, and resource constraints (e.g., memory, latency tolerance, or infrastructure costs). Currently, three strategies are proposed for further consideration:
Your team proposes to hash the HMRC file using the Council attribute. You are asked to devise a query processing algorithm implementing the query using the hashed HMRC file only. The available memory dedicated for hashing is 10,000 blocks.
Your team proposes to create a B+ Tree (secondary index) to index the HMRC file using the non-ordering, no unique Council attribute. Each internal node is of order p = 4 (4 pointers/3 key values). Each leaf node stores 3 key values along with 3 pointers and 1 sibling pointer to the ‘next’ leaf node. Each node is stored in 1 block.
You are asked to devise a query processing algorithm implementing the query using the B+ Tree. The cost for building the B+ Tree is equal to the cost for accessing all the blocks of the HMRC file.
- Optional: Provide a simple sketch/diagram of the underlying data structure(s), which can help you visualizing the query processing steps of your solutions.
- Optional: You can use pgAdmin4 to create the relation HMRC populating with fictitious tuples. Then, use the EXPLAIN tool over the provided query and interpret the query processing outcome from your DB server. You could also explore more options, like creating an index over the Council and/or RecordedDate, and then use the EXPLAIN tool again to check whether there you obtain more efficient query processing.
- Compulsory:
- Describe the steps of your method/algorithm. You must explain the steps you follow to access the blocks and how you calculate the moving average having accessed the required records. Note: the outcome of the query is stored in the main memory where the moving average is stored in the attribute MovingAVG (defined in the query). Assume that there is sufficient memory for storing the outcome of the query. [10 marks per strategy, 30 total]
- Report on (i) the expected number of block accesses of each proposed query processing strategy and (ii) the required storage of the proposed structure, if any (e.g., size of B+ Tree). [5 marks per strategy, 15 total]
- Calculate the total memory (in blocks) required to store the results of the query. Justify your answer. [5 marks total]
- Recommend the best overall strategy for different use cases (e.g., high-frequency queries vs. annual reporting) supported by your cost analysis. [10 marks total]