COMP9319 Web Data Compression and Search



Course Details & Outcomes

Course Description

As the amount of Web data increases, it is becoming vital not only to search and retrieve this information quickly, but also to store it in a compact manner. This is especially important for mobile devices, which are becoming increasingly popular. Without loss of generality, within this course we assume Web data (excluding media content) will be in XML or a similar format (e.g., HTML, JSON).

If time allows, we may cover optional topics such as streaming algorithms, text analytics, and Web data optimization for mobile devices. The lecture materials will be complemented by two programming assignments and numerous tutorial-style written exercises.

Course Aims

This course aims to introduce the concepts, theories, and algorithmic issues important to Web data compression and search. The course will also introduce the most recent developments in Web data optimization, along with common practice and applications. The course is composed of the following parts:

  • Adaptive coding, information theory
  • Text compression (zip, gzip, bzip, etc.)
  • Burrows-Wheeler Transform and backward search
  • XML compression
  • Indexing
  • Pattern matching and regular expression search
  • Distributed querying
  • Fast index construction
  • Implementation

Course Learning Outcomes

CLO1 : Apply the fundamentals of text compression
CLO2 : Apply advanced data compression techniques such as those based on Burrows Wheeler Transform
CLO3 : Write computer programs for Web data compression and search with optimization
CLO4 : Use selected XML processing and optimization techniques
CLO5 : Analyze the advantages and disadvantages of data compression for Web search
CLO6 : Apply basic techniques from XML distributed query processing
CLO7 : Discuss the past, present, and future of data compression and Web data optimization


Course Learning Outcomes | Assessment Item
CLO1 : Apply the fundamentals of text compression
  • Assignment 1
  • Assignment 2
  • Final Examination
CLO2 : Apply advanced data compression techniques such as those based on Burrows Wheeler Transform
  • Assignment 1
  • Assignment 2
  • Final Examination
CLO3 : Write computer programs for Web data compression and search with optimization
  • Assignment 1
  • Assignment 2
CLO4 : Use selected XML processing and optimization techniques
  • Final Examination
  • Assignment 1
  • Assignment 2
CLO5 : Analyze the advantages and disadvantages of data compression for Web search
  • Final Examination
  • Assignment 1
  • Assignment 2
CLO6 : Apply basic techniques from XML distributed query processing
  • Final Examination
CLO7 : Discuss the past, present, and future of data compression and Web data optimization
  • Final Examination

Learning and Teaching Technologies

Moodle - Learning Management System | Echo 360 | EdStem | Blackboard Collaborate

Assessments

Assessment Structure

Assessment Item | Assessment Format | Weight | Relevant Dates
Assignment 1 | Individual | 15% | Due Date: Week 5: 24 June - 30 June
Assignment 2 | Individual | 35% | Due Date: Week 9: 22 July - 28 July
Final Examination | Individual | 50% | Start Date: Not Applicable; Due Date: During Exam Period

Assessment Details

  • Assignment 1
    Assessment Overview

    This is a warm-up programming assignment for the course. Hence it will be relatively lightweight (students are expected to be able to finish the assignment in a few hours).

    Assessment of assignments will be primarily based on how accurately they satisfy the requirements; this means that most of the marks will be based on automatic marking. However, if time allows, we may also manually examine submitted assignments to determine (a) whether they are written in good style and (b) how closely they satisfy the requirements.

    Individual graded results with optional comments will be emailed to each student. Overall feedback will be discussed in the lectures, and students may talk to the tutors in consultation sessions for further feedback on their assessment.

    Course Learning Outcomes
    • CLO1 : Apply the fundamentals of text compression
    • CLO2 : Apply advanced data compression techniques such as those based on Burrows Wheeler Transform
    • CLO3 : Write computer programs for Web data compression and search with optimization
    • CLO4 : Use selected XML processing and optimization techniques
    • CLO5 : Analyze the advantages and disadvantages of data compression for Web search
  • Assignment 2
    Assessment Overview

    This is the second programming assignment for the course. It is heavier than the first because it involves more advanced techniques that students have learnt in the course (students are expected to be able to finish the assignment in a few days).

    Assessment of assignments will be primarily based on how accurately they satisfy the requirements; this means that most of the marks will be based on automatic marking. However, if time allows, we may also manually examine submitted assignments to determine (a) whether they are written in good style and (b) how closely they satisfy the requirements.

    Individual graded results with optional comments will be emailed to each student. Overall feedback will be discussed in the lectures, and students may talk to the tutors in consultation sessions for further feedback on their assessment.

    Course Learning Outcomes
    • CLO1 : Apply the fundamentals of text compression
    • CLO2 : Apply advanced data compression techniques such as those based on Burrows Wheeler Transform
    • CLO3 : Write computer programs for Web data compression and search with optimization
    • CLO4 : Use selected XML processing and optimization techniques
    • CLO5 : Analyze the advantages and disadvantages of data compression for Web search
  • Final Examination
    Assessment Overview

    The final exam will be a major assessment in this course and aims to test what students learned about data compression and search during the course of the semester. To pass this course, students are required to have satisfactory performance on the final exam even if they do very well on the assignments. In order to meet the hurdle requirement, students must score better than 40% on the final exam. Note that the hurdle will be enforced after any required scaling.

    Course Learning Outcomes
    • CLO1 : Apply the fundamentals of text compression
    • CLO2 : Apply advanced data compression techniques such as those based on Burrows Wheeler Transform
    • CLO4 : Use selected XML processing and optimization techniques
    • CLO5 : Analyze the advantages and disadvantages of data compression for Web search
    • CLO6 : Apply basic techniques from XML distributed query processing
    • CLO7 : Discuss the past, present, and future of data compression and Web data optimization
    Assignment submission Turnitin type

    Not Applicable

    Hurdle rules

    To pass this course, students are required to have satisfactory performance on the final exam even if they do very well on the assignments. In order to meet the hurdle requirement, students must score better than 40% on the final exam. Note that the hurdle will be enforced after any required scaling.
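The weights and the hurdle can be illustrated with a minimal sketch. It assumes the overall mark is a plain weighted sum of percentage scores using the weights listed under Assessment Structure; that combination rule, the function names, and the dictionary keys are illustrative assumptions, and any scaling applied before the hurdle is not modelled.

```python
# Weights (in percent) taken from the Assessment Structure table.
WEIGHTS = {"assignment1": 15, "assignment2": 35, "exam": 50}

# Students must score better than 40% on the final exam.
HURDLE_PERCENT = 40


def overall_mark(percent_scores):
    """Weighted sum of percentage scores (assumed combination rule)."""
    return sum(WEIGHTS[item] * percent_scores[item] for item in WEIGHTS) / 100


def meets_exam_hurdle(exam_percent):
    """'Better than 40%' is read here as strictly greater than 40."""
    return exam_percent > HURDLE_PERCENT


scores = {"assignment1": 100, "assignment2": 100, "exam": 40}
print(overall_mark(scores))                # 70.0
print(meets_exam_hurdle(scores["exam"]))   # False: exactly 40% is not better than 40%
```

Under these assumptions, a student with full marks on both assignments and exactly 40% on the exam has a weighted total of 70 but still does not meet the hurdle.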

General Assessment Information

Assignments will be completed individually; this means that you should do them yourself without assistance from others, except for asking advice from the Lecturer or Tutor. Assignments are the primary vehicle for learning the material in this course. If you don't do them, or simply copy and submit someone else's work, you have wasted a valuable learning opportunity.

Assignments are to be submitted via "give" before the specified time on the due date. Assessment of assignments will be primarily based on how accurately they satisfy the requirements; this means that most of the marks will be based on automatic marking. However, if time allows, we may also manually examine submitted assignments to determine (a) whether they are written in good style and (b) how closely they satisfy the requirements.

The penalty for late submission of assignments is 5% of the worth of the assignment per day late, subtracted from the raw mark. In other words, earned marks will be lost. For example, assume an assignment worth 20 marks is marked as 18 but was submitted two days late. The late penalty is 2 marks (5% of 20 marks for each of the two days), so a mark of 16 is awarded. No assignments will be accepted later than 5 days after the original deadline. The exception is an extension granted through UNSW special consideration: for example, if you are granted a one-week extension, there will be no late penalty if the assignment is submitted within 7 days after the original deadline. However, no further late submissions will be accepted after these 7 days.
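The late-penalty rule can be written out as a short calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, lateness is assumed to be counted in whole days, and flooring the result at zero is an assumption that the text does not state.

```python
from typing import Optional

PENALTY_PERCENT_PER_DAY = 5   # 5% of the assignment's worth per day late
LATEST_ACCEPTED_DAYS = 5      # nothing accepted later than 5 days after the deadline


def awarded_mark(raw_mark: float, assignment_worth: float, days_late: int,
                 extension_days: int = 0) -> Optional[float]:
    """Return the awarded mark, or None if the submission is not accepted."""
    if days_late <= 0:
        return float(raw_mark)
    if extension_days > 0:
        # With a granted extension: no penalty inside the extension,
        # and no submissions accepted once it has passed.
        return float(raw_mark) if days_late <= extension_days else None
    if days_late > LATEST_ACCEPTED_DAYS:
        return None
    penalty = assignment_worth * PENALTY_PERCENT_PER_DAY * days_late / 100
    # Assumption: the awarded mark never drops below zero.
    return max(0.0, raw_mark - penalty)


print(awarded_mark(18, 20, 2))                     # 16.0, the example from the text
print(awarded_mark(18, 20, 7, extension_days=7))   # 18.0, within a one-week extension
print(awarded_mark(18, 20, 6))                     # None, more than 5 days late
```

The first call reproduces the worked example: 5% of 20 marks is 1 mark per day, so two days late costs 2 marks.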

Grading Basis

Standard

Course Schedule

Teaching Week/Module | Activity Type | Content
Week 1: 27 May - 2 June | Lecture | Introduction, basic information theory, basic compression
Week 2: 3 June - 9 June | Lecture | More basic compression algorithms
Week 3: 10 June - 16 June | Lecture | Adaptive Huffman; Overview of BWT
Week 4: 17 June - 23 June | Lecture | Pattern matching and regular expression
Week 5: 24 June - 30 June | Lecture | FM index, backward search, compressed BWT
Week 7: 8 July - 14 July | Lecture | Suffix tree, suffix array, the linear time algorithm
Week 8: 15 July - 21 July | Lecture | XML overview; XML compression
Week 9: 22 July - 28 July | Lecture | Graph compression; Distributed Web query processing
Week 10: 29 July - 4 August | Lecture | Optional advanced topics; Course Revision

Attendance Requirements

Students are strongly encouraged to attend all classes and review lecture recordings.

General Schedule Information

The course schedule is an approximate guide to the sequence of topics in this course. It is subject to change as the term progresses.

Course Resources

Recommended Resources

There will be no textbook used in this course. Lecture slides and supplementary readings will be provided and used.

You may find the readings below useful as reference materials:

  • Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Ian H. Witten, Alistair Moffat, Timothy C. Bell, Morgan Kaufmann, 1999. (recommended reference, available at the university bookstore)
  • Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, and Trevor Strohman, Pearson Education, 2009.
  • http://www.data-compression.info contains lots of valuable resources on data compression (especially links to readings and useful advice), despite the website's pink color!
  • Data on the Web: from relations to semistructured data and XML. Serge Abiteboul, Peter Buneman, Dan Suciu. Morgan Kaufmann, 2000.

You will also find your previous textbooks on data structures and/or algorithms useful, in case you need to refer to the fundamentals of data structures and algorithms for text processing.

Course Evaluation and Development

This course is evaluated each session using MyExperience.

The MyExperience evaluation from the last time I taught this course showed that students were overall satisfied with all aspects of the course. Thus we maintain a similar style and structure for this term. Since this is the second time that we run this course after the pandemic (moving from fully online back to hybrid mode), we will go through the in-depth topics in the recorded lectures and discuss more examples and/or practical considerations in the live lectures (mixed online and in person). Please note that your feedback is important and will be considered to improve future offerings of this course (e.g., how much content can remain online).

Students are also encouraged to provide informal feedback during the term and let the lecturer know of any problems, as soon as they arise. Suggestions will be listened to very openly, positively, constructively, and thankfully, and every reasonable effort will be made to address them as soon as possible.


