Section 1: Overview
In the age of big data, there is a constant struggle with regard to storage. For example, a single Whole Genome Sequencing experiment can generate over 100 GB of raw data. For years, computer scientists have attempted to reduce the storage size of files without losing any information. In this assignment, we will be implementing our own file compression tool using the Huffman algorithm.
Task: File Compression and Uncompression
For grading purposes, you are guaranteed that original_file will be at most 10 MB.
File Uncompression
The uncompress program will then take as input a compressed file created by your compress program (compressed_file) and will uncompress the file (uncompressed_file):
Note that uncompressed_file must be completely identical to original_file! In other words, calling diff should return no differences:
Reference Solution
You have been provided two programs (refcompress and refuncompress), which are a reference solution implementation of this Project. You can use them to help guide you. Note that the reference binaries were compiled to run in this Project's coding environment, so if you attempt to run them on a different architecture, they will most likely not work.
Compiling and Running
We have provided a Makefile, but you are free to modify anything in the starter code that you'd like (including the Makefile and any of the .cpp or .hpp files) however you wish, and you are allowed to use anything in the C++ STL. Our only requirements are the following:
Once you are ready to submit your code, click the "Save & Grade" button at the bottom-left corner of this panel. Your code will not submit automatically: you must submit your code by clicking this button before the deadline, otherwise your code will not be accepted.
Before doing anything, it is a good idea to create all required .cpp files (see the Makefile) and write blank main() functions in your compress.cpp and uncompress.cpp files to get your project files to the point that you can compile successfully with make (even though the compress and uncompress executables don't do anything). Once you get things compiling, start working on compress, and once you are confident that you have that working reasonably well, start working on uncompress. Use print statements, gdb, etc. as you develop to make sure you're implementing things correctly.
Suggested Control Flow for compress
If you choose to use the code structure we have laid out in the starter, one reasonable potential development process is the following. Note that this is independent of the "Suggested Development Process" section, which gives an overview of what a functioning compress.cpp and uncompress.cpp (the source code) should look like. This section illustrates our suggested steps to go from the starter code to a functioning compress and uncompress (the executables):
○ This will allow you to compile your program as you make changes to the other files
○ As you implement the components described below, you can use your main() functions in compress.cpp and uncompress.cpp to test them out one-by-one
○ For right now, think of compress.cpp and uncompress.cpp as blank canvases where you can write code to test the other components; you will write the actual implementation of both programs later
○ Refer to HCTree.hpp to see what function signatures are declared in the HCTree class
○ You will want to use gdb or print statements to help you trace your tree to make sure it matches what you would get if you constructed it manually by hand (on small example files)
○ Incorporate one step of the control flow, then test it thoroughly to make sure things work to that point, then incorporate the next step, test thoroughly, etc.
○ How you choose to implement your compressed file header is up to you, but we have some recommendations in the "Design Notes" section, so be sure to refer to that
As mentioned, the above suggested control flows and development processes are simply what we recommend if you feel lost and don't know where to start. However, not everybody thinks/works the same! What we suggest may not be the best approach for you personally. Before you even write a single line of code, read through this guide and make sure you fully understand the required tasks of the project, and then actually sit down and map out how you personally want to approach the problem. You are of course free to deviate from or modify the programming approach you design, but simply having a roadmap to help guide you can be extremely helpful when coding.
Section 3: Design Notes
Before you write even a single line of code, it's important to properly design the various components of your compress and uncompress programs. Here, we will discuss some aspects of the design that you should be sure to think about.
See this article to learn more about how to use them.
For this assignment, you will have to read data from files and write data to files. To help you with this, in Helper.hpp and Helper.cpp, we have provided two helper classes that can make it easier to perform file I/O: FancyInputStream and FancyOutputStream. It is up to you to thoroughly read the header comments of the functions in these two classes to see how to use them. It is a good idea to have the "Project Workflow" open while you look at the function headers for FancyInputStream and FancyOutputStream to get a better idea of how you can apply various functions of those classes to the compress and uncompress workflows.
Note that, in C++, ifstream and ofstream objects are automatically closed by their destructors. Thus, when a FancyInputStream or FancyOutputStream object is destroyed (e.g. if you create it on the runtime stack and it goes out of scope), the internal ifstream/ofstream object will also automatically get destroyed (and thus will automatically close).
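To illustrate this RAII behavior, here is a minimal sketch using a plain std::ofstream (the provided FancyOutputStream wraps an ofstream internally and behaves the same way); the function name and path below are made up for illustration:

```cpp
#include <fstream>
#include <string>

// Hypothetical example: the same scope-based cleanup applies to the
// provided FancyInputStream/FancyOutputStream classes.
void write_message(const std::string& path) {
    std::ofstream out(path);  // file is opened here
    out << "hello";
    // no explicit out.close() needed: when 'out' goes out of scope at
    // the end of this function, its destructor runs, which flushes and
    // closes the file automatically
}
```

The same applies if the stream object is a data member of another object: it closes when the owning object is destroyed.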
One crucial data structure you will need is a binary trie (i.e., "code tree" or "encoding tree") that represents a Huffman code. The HCTree.hpp header file provides a possible interface for this structure (included in your repo as skeleton code); you can modify this in any way you want.
Additionally, you will write a companion HCTree.cpp implementation file that implements the interface specified in HCTree.hpp, and you will then use it in your compress and uncompress programs. Note that, when you implement the HCTree class, you will want to use the HCNode class we have provided in Helper.hpp.

As you implement Huffman's algorithm, you will find it convenient to use multiple data structures. For example, a priority queue will assist in building the Huffman Tree. Feel free to utilize other beneficial data structures. However, you should use good object-oriented design in your solution. For example, since a Huffman code tree will be used by both your compress and uncompress programs, it makes sense to encapsulate its functionality inside a single class accessible by both programs. With a good design, the main functions in the compress and uncompress programs will be quite simple: they will create objects of other classes and call their methods to do the necessary work.
Be sure to refer to the "Honey, I Shrunk the File" section of the Data Structures Cogniterra text for more information about Huffman's algorithm.
Priority Queues in C++
A C++ priority_queue is a generic container that can hold any type, including HCNode* (wink wink). By default, a priority_queue<T> will use the < operator defined for objects of type T. Specifically, if a < b, then b is taken to have a higher priority than a (i.e., by default, it functions as a max-heap).
However, in the Huffman Tree building algorithm, we want to select symbols with lower frequencies first (i.e., we want to use a min-heap). Thus, we need to tell the priority_queue how to compare HCNode* objects by defining a custom "comparison class" that will dereference the pointers and compare the objects they point to. This has been done for you via the HCNodePtrComp class in Helper.hpp. You can use it to create a priority_queue as follows: priority_queue<HCNode*, vector<HCNode*>, HCNodePtrComp> pq;
● The first argument of the template (HCNode*) tells the priority_queue that it will be storing HCNode* objects
Be sure to refer to the C++ priority_queue documentation for more information about how to use a priority_queue.
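To make the min-heap behavior concrete, here is a self-contained sketch using a hypothetical Node struct standing in for HCNode (the real HCNode and HCNodePtrComp live in Helper.hpp and may carry additional fields such as the symbol and child/parent pointers):

```cpp
#include <queue>
#include <vector>

// Hypothetical stand-in for HCNode.
struct Node {
    int count;
    explicit Node(int c) : count(c) {}
};

// Comparison class in the spirit of HCNodePtrComp: it dereferences the
// pointers and reverses <, so the priority_queue acts as a min-heap.
struct NodePtrComp {
    bool operator()(const Node* a, const Node* b) const {
        return a->count > b->count;  // smaller count = higher priority
    }
};

// Pops every node and returns the counts in the order the queue yields
// them; with the comparator above, that order is ascending.
std::vector<int> drain(std::priority_queue<Node*, std::vector<Node*>, NodePtrComp> pq) {
    std::vector<int> order;
    while (!pq.empty()) {
        order.push_back(pq.top()->count);
        pq.pop();
    }
    return order;
}
```

With this comparator, pq.top() always returns the node with the lowest count, which is exactly what Huffman's algorithm needs when merging the two least-frequent subtrees.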
Compressed File Header
Efficient Header Design
One simple but inefficient strategy for designing the header (which is what is used in the reference solution) is to simply store 256 integers (in 4-byte chunks as unsigned int objects) at the beginning of the compressed file (i.e., encode the symbol counts as the header). Note that this approach is guaranteed to use up 256*4 = 1024 bytes! In order to receive full points, you must BEAT our reference solution by coming up with a more efficient way to represent this header!
However, we strongly encourage you to implement the naive (256 unsigned int objects) approach first, and do not attempt to reduce the size of the header until you’ve gotten your compress and uncompress to work correctly for the provided inputs.
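As a sketch of that naive approach, here is what writing and reading the 256-count header could look like using plain std::ofstream/std::ifstream (the provided FancyInputStream/FancyOutputStream expose their own read/write helpers, so check their header comments for the exact interface; the function names below are made up):

```cpp
#include <cstdint>
#include <fstream>

// Naive header: 256 counts * 4 bytes each = a fixed 1024-byte header.
void write_naive_header(std::ofstream& out, const uint32_t counts[256]) {
    for (int i = 0; i < 256; i++) {
        out.write(reinterpret_cast<const char*>(&counts[i]), sizeof(uint32_t));
    }
}

// Reads the header back in the same order it was written.
void read_naive_header(std::ifstream& in, uint32_t counts[256]) {
    for (int i = 0; i < 256; i++) {
        in.read(reinterpret_cast<char*>(&counts[i]), sizeof(uint32_t));
    }
}
```

Since uncompress reads the counts in the same order compress wrote them, it can rebuild the exact same Huffman Tree before decoding any bits.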
One way to reduce the size of the header is to take a frequency approach, but to think about how many bytes you really need to store each integer. Answering this question relates to the maximum size of the files that you are required to encode: your compress program must work for input files up to 10 MB in size, so a particular byte value may occur up to ~10 million times in the file. This fact should help you determine the minimum number of bytes required to represent each frequency count in your header.
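For instance, since 10 MB ≈ 10^7 is less than 2^24 = 16,777,216, three bytes per count suffice, shrinking the naive 1024-byte header by a quarter. A sketch of packing each count into 3 bytes and back (the function names are hypothetical):

```cpp
#include <cstdint>

// Packs a frequency count (assumed < 2^24, which holds for files up to
// 10 MB) into 3 bytes, most significant byte first.
void pack3(uint32_t count, unsigned char out[3]) {
    out[0] = (count >> 16) & 0xFF;  // most significant byte
    out[1] = (count >> 8) & 0xFF;
    out[2] = count & 0xFF;          // least significant byte
}

// Reassembles the 3 bytes back into the original count.
uint32_t unpack3(const unsigned char in[3]) {
    return (static_cast<uint32_t>(in[0]) << 16) |
           (static_cast<uint32_t>(in[1]) << 8) |
           static_cast<uint32_t>(in[2]);
}
```

The same idea generalizes: pick the smallest fixed width that can hold the largest count your inputs can produce.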
Alternative approaches may use arrays to represent the structure of the tree itself in the header. With some cleverness, it is possible to optimize the header size to about 10M bits, where M is the number of distinct byte values that actually appear in the input file. This write-up may be helpful to learn about how to "serialize" the tree structure itself.
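As a sketch of one such serialization: emit 1 bit per node in pre-order ('0' for internal, '1' for leaf), followed by 8 bits of symbol at each leaf. For M distinct symbols the tree has 2M-1 nodes, so this costs (2M-1) + 8M ≈ 10M bits. TNode below is a hypothetical stand-in for HCNode, and a real implementation would emit bits through your bit-level output stream rather than into a string:

```cpp
#include <bitset>
#include <string>

// Hypothetical minimal tree node; the real HCNode in Helper.hpp differs.
struct TNode {
    TNode* c0 = nullptr;  // '0' child
    TNode* c1 = nullptr;  // '1' child
    unsigned char symbol = 0;
    bool is_leaf() const { return c0 == nullptr && c1 == nullptr; }
};

// Pre-order serialization: '0' marks an internal node, '1' marks a
// leaf and is followed by the leaf's 8-bit symbol.
void serialize(const TNode* n, std::string& bits) {
    if (n->is_leaf()) {
        bits += '1';
        bits += std::bitset<8>(n->symbol).to_string();
    } else {
        bits += '0';
        serialize(n->c0, bits);
        serialize(n->c1, bits);
    }
}
```

Deserialization mirrors this: read one bit, and either read 8 more bits to create a leaf, or recurse twice to rebuild the two subtrees.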
Section 4: How to Debug
As you work on this Project, you will almost certainly run into bugs you will need to find and fix. Because of the bitwise nature of this Project, debugging might be more challenging than usual. Here, we will provide some general tips for debugging that may prove useful.
Printing a Huffman Tree
As you implement the Huffman Tree components of compress and uncompress, you will want to verify that the Huffman Tree you built is correct by working through the exact same example by hand and manually comparing the tree your code builds against the tree you manually build by hand. To compare the tree your code builds against the tree you built yourself, you will want to somehow view the Huffman Tree in your code. We suggest doing this in one of the following two ways:
- Write a helper function that traverses an HCTree object and outputs the edges in some human-readable format (e.g. as an edge list you can then draw out by hand)
- Use gdb to set a breakpoint just after the HCTree object is built, and use gdb commands to print the entire tree (e.g. as an edge list you can then draw out by hand)
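A sketch of the first approach, using a hypothetical DbgNode instead of the real HCNode, and collecting the edges into a vector so the caller can print (or inspect) them:

```cpp
#include <string>
#include <vector>

// Hypothetical node with 0- and 1-children, loosely mirroring the
// HCNode layout in Helper.hpp.
struct DbgNode {
    DbgNode* c0 = nullptr;
    DbgNode* c1 = nullptr;
    int count = 0;
};

// Collects one line per edge ("parentCount --bit--> childCount") in
// pre-order, so you can print the list and redraw the tree by hand
// on small example files.
void edge_list(const DbgNode* n, std::vector<std::string>& out) {
    if (n == nullptr) return;
    if (n->c0) {
        out.push_back(std::to_string(n->count) + " --0--> " +
                      std::to_string(n->c0->count));
        edge_list(n->c0, out);
    }
    if (n->c1) {
        out.push_back(std::to_string(n->count) + " --1--> " +
                      std::to_string(n->c1->count));
        edge_list(n->c1, out);
    }
}
```

Labeling edges by count works on small examples; you could also include each leaf's symbol in the label to make the drawing easier to check.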
Viewing a Binary File as Bits
As you implement compress, you will likely want to try compressing a small example file that you can trace by hand and then manually look at the compressed output file to verify that it looks like what you would expect. However, if you try to open the binary file in vim or similar, you will likely see garbage data: the text editor is trying to render each byte as a human-readable character (and is failing to do so).
Instead, you can view the bit representation of a binary file using a hex dump tool. One good choice is xxd, which is a command-line hex dump tool that exists on most Linux distributions. Given any arbitrary file (plain-text or binary!), you can call xxd with the -b flag (for "bits") to view the bit representation of the bytes in the file. For example, if I have a file combooter.txt that contains the text boo (3 bytes), I can view the binary representation of the file as follows:
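A transcript of that session (the bytes 'b' and 'o' are 0x62 and 0x6F, i.e., 01100010 and 01101111; the exact column spacing of xxd's output may vary by version):

```shell
$ echo -n "boo" > combooter.txt
$ xxd -b combooter.txt
00000000: 01100010 01101111 01101111                             boo
```

Each line shows the byte offset, the bits of up to 6 bytes, and the human-readable rendering of those bytes on the right.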
If your code is taking too long to run, you will want to investigate which parts of your code are slow so you can speed up those specific parts. On pretty much any Linux platform, you can use gprof (GNU Profiler) to perform "per-function profiling". There are more complicated profiling tools that exist, so if you feel comfortable learning how to set up / use one of those, feel free, but gprof works for me. The basic workflow is as follows:
The gprof output tells you what percentage of your runtime was spent on each function that was called in your code. Note that gprof collects its timing data via periodic sampling, so the profiling results will likely vary slightly from run to run.
Section 5: Common Issues
Plain-Text Files Ending with Newline Characters
Try using a text editor like vim to create a text file niema.txt that contains only the string Niema in it, and save the file. What is the file size? Intuitively, you might think that the size is 5 bytes (each character takes exactly 1 byte), but try running ls -l or du -sb to get the file size in bytes, and the results might surprise you:
In your final implementation, you will need to support all possible bytes, including the newline character, but as you begin the Project, you may want to only play around with visible human-readable symbols. If you want to create plain-text files without a newline character at the end, you will want to use the echo command with the -n flag (which tells echo not to add a newline character at the very end), and redirect the output to a file using >:
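For example, with niema.txt from above (wc -c prints a file's size in bytes):

```shell
$ echo "Niema" > niema.txt      # a trailing newline is appended, like a text editor would
$ wc -c < niema.txt
6
$ echo -n "Niema" > niema.txt   # -n suppresses the trailing newline
$ wc -c < niema.txt
5
```

The one-byte difference is exactly the newline character (0x0A) at the end of the file.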
In general, we typically think of a single character in a plain-text file as a char, which is 1 byte (8 bits), meaning there are 2^8 = 256 possible characters. However, let's imagine creating a plain-text file containing the 5-character string ŇĩēѪą:
You may expect this file to be exactly 5 bytes, as there are exactly 5 characters, right? However, manually checking the file size may surprise us:
Why is our file 10 bytes instead of 5? The issue is that we are using symbols from the extended Unicode alphabet: basically, a single byte can only represent 256 possible symbols, so under the Unicode encoding, a single character in the extended alphabets can be represented using more than 1 byte. In this specific example, each of our 5 symbols (ŇĩēѪą) is represented using 2 bytes, which is why the file size is 10 bytes.
Just as a heads-up, you don't need to (and shouldn't try to) handle the multi-byte extended Unicode symbols as single symbols: if your compress and uncompress tools simply treat every file they encounter as files over the alphabet of all 256 possible bytes, they will be able to handle any arbitrary possible file completely fine.
No Space Left On Device
This message means that your virtual environment has run out of disk space. This typically implies that you've accidentally created a massive file that's filled up your disk space. One common cause (but not necessarily the only cause) is that a loop in your code within which you're writing to disk is running too long (e.g. potentially an infinite loop). You will need to find out which file(s) are too large and delete them using the rm command.
You can find out which files are largest using the du command (-h is for human-readable file size units, e.g. M for megabyte, and -a is to view all files, including hidden files):
$ du -h -a *
You can also use the ls command to list the files in any given directory (-h is again for human-readable, and -a is again to view all files, including hidden files):
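For example (sort -h orders by the human-readable sizes du prints, so the largest entries appear last):

```shell
$ du -h -a * | sort -h | tail -n 5   # the 5 largest files/directories, printed last
$ ls -l -h -a                        # long listing with sizes for all files here
```

Once you spot the oversized file(s), delete them with rm to reclaim the disk space.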
Valgrind Reports Memory Leak with make gprof
If your code was timing out because the runtime was too long, you likely investigated the runtime of the parts of your code using gprof (which is excellent!), and in doing so, you likely compiled your code using make gprof, which adds the -pg compilation flag to the compilation command (the -pg compilation flag is required in order to use gprof).
However, for reasons out-of-scope of this class, Valgrind doesn't play nicely with the -pg flag, and it will (potentially) incorrectly report a memory leak even if your code doesn't actually have a memory leak. Because of this, you will want to remove the -pg flag from your compilation command before checking for memory leaks (and before submitting your code, as the grader will be checking for memory leaks). For more information about Valgrind and -pg, see this post.