Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
CS4278/5278 Principles of Software Engineering Homework Assignment #5 — Debugging Automation
In this assignment you will develop two debugging support tools to reduce the effort associated with software maintenance.
First, you will implement the Delta Debugging algorithm to minimize "interesting" sets.
Second, you will implement a Coverage-Based Fault Localization Algorithm to identify suspicious lines implicated in test failures.
You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. You must indicate your partner in the report. Only one partner needs to submit the report on BrightSpace, but if you both do, nothing fatal happens.Delta Debugging
You must write a Python 3 program, delta.py, that implements delta debugging to find a one-minimal interesting subset of a given set. Your program takes two arguments:
- the size n of the set to be minimized — for example, if n is 5, the program will find an interesting subset of {0,1,2,3,4}
-
a command that determines if a given subset is interesting — for example, if the command is is-interesting.sh, you may invoke is-interesting.sh 0 2 4 to probe if the subset 0 2 4 is interesting
- the command returns the exit code 1 if the subset is interesting and 0 otherwise
- note that command may be "bash ./is-interesting.sh" (or something similarly long) on the autograding server for security/permission reasons; your program should correctly handle the case where the single command string passed in contains spaces (by treating it as one command: do not break it up)
- do not try to add your own "bash" or "./" or whatnot with string concatenation — really just simply use the command passed in and append some arguments
Your program should print out a one-minimal interesting subset in standard Python list format in sorted order. (This is the only output your program should produce. Do not do anything more. Do not submit a program with debugging output for a grade.)
Autograder Submission
We will evaluate your delta.py implementation in terms of the correct answers it generates (i.e., the one-minimal interesting subsets) and also in terms of the number of probes (i.e., calls to is-interesting.sh).
Minimizing Failure-Inducing Inputs
A standard use of Delta Debugging is to minimize failure-inducing inputs — inputs which cause a program to crash.
For example, consider this (inefficient) implementation of is-interesting.sh that checks to see if its command-line arguments contain both 3 and 6:
$ cat is-interesting.sh #!/bin/bash FIRST=0 SECOND=0 for i in $* ; do if [ $i -eq 3 ]; then FIRST=1 ; fi if [ $i -eq 6 ]; then SECOND=1 ; fi done if [ $FIRST -eq 1 ] ; then if [ $SECOND -eq 1 ] ; then exit 1 # interesting fi fi exit 0To find a one-minimal interesting subset of [0,1,2,3,4,5,6,7,8], we would run:
$ python3 delta.py 9 ./is-interesting.sh [3, 6]The above example shows the only expected or required output for your program: the single line with the list of elements.
With debugging output turned on (do not submit with debugging output like this), we see:
$ python3 delta.py 9 ./is-interesting.sh delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8] ) delta.py: calling ./is-interesting.sh 0 1 2 3 delta.py: calling ./is-interesting.sh 4 5 6 7 8 delta.py: dd(p= [4, 5, 6, 7, 8] , c= [0, 1, 2, 3] ) delta.py: calling ./is-interesting.sh 4 5 6 7 8 0 1 delta.py: calling ./is-interesting.sh 4 5 6 7 8 2 3 delta.py: dd(p= [4, 5, 6, 7, 8] , c= [2, 3] ) delta.py: calling ./is-interesting.sh 4 5 6 7 8 2 delta.py: calling ./is-interesting.sh 4 5 6 7 8 3 delta.py: dd(p= [4, 5, 6, 7, 8] , c= [3] ) delta.py: dd(p= [0, 1, 2, 3] , c= [4, 5, 6, 7, 8] ) delta.py: calling ./is-interesting.sh 0 1 2 3 4 5 delta.py: calling ./is-interesting.sh 0 1 2 3 6 7 8 delta.py: dd(p= [0, 1, 2, 3] , c= [6, 7, 8] ) delta.py: calling ./is-interesting.sh 0 1 2 3 6 delta.py: dd(p= [0, 1, 2, 3] , c= [6] ) [3, 6]In this example we see that Delta Debugging made 9 probes to is-interesting.sh to find a one-minimal subset (compared to 2^9 = 512 to exhaustively find a minimal subset). Delta Debugging is usually very efficient.
Note that the debugging output shown above is only to help you see the algorithm in action and is not something you should have enabled in your final submission.
Minimizing Failure-Inducing Changes
To use Delta Debugging to minimize a different type of set, such as a set of failure-inducing changes, it suffices to change your is-interesting.sh script.
Consider the program wireworld.c from Rosetta Code. We have an original version of it as well as 10 patches:
$ cat patch.0 6a7 > this looks bad $ cat patch.2 11c13 < "+-----------+\n" --- > "+===========+\n" $ cat patch.4 25c27 < for (i = 0; i < w*h; i++) { --- > for (i = 0+0; i < w*h; i++) {We observe that the original version compiles correctly:
$ gcc -c wireworld-original.c $But if we apply all of the patches, it no longer compiles:
$ cat patch.* | patch -p0 wireworld-original.c patching file wireworld-original.c $ gcc -c wireworld-original.c wireworld-original.c: In function ‘next_world’: wireworld-original.c:36:11: error: ‘j’ undeclared (first use in this function) out[j] = (hc == 1 || hc == 2) ? 'H' : '.'; ^ wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears inGiven the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:
$ python3 delta.py 10 ./is-failure-inducing-change delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ) delta.py: calling ./is-failure-inducing-change 0 1 2 3 4 patching file wireworld-original.c delta.py: calling ./is-failure-inducing-change 5 6 7 8 9 patching file wireworld-original.c foo.c: In function ‘next_world’: foo.c:34:11: error: ‘j’ undeclared (first use in this function) out[j] = (hc == 1 || hc == 2) ? 'H' : '.'; ^ foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in delta.py: dd(p= [] , c= [5, 6, 7, 8, 9] ) delta.py: calling ./is-failure-inducing-change 5 6 patching file wireworld-original.c foo.c: In function ‘next_world’: foo.c:34:11: error: ‘j’ undeclared (first use in this function) out[j] = (hc == 1 || hc == 2) ? 'H' : '.'; ^ foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in delta.py: dd(p= [] , c= [5, 6] ) delta.py: calling ./is-failure-inducing-change 5 patching file wireworld-original.c delta.py: calling ./is-failure-inducing-change 6 patching file wireworld-original.c foo.c: In function ‘next_world’: foo.c:34:11: error: ‘j’ undeclared (first use in this function) out[j] = (hc == 1 || hc == 2) ? 'H' : '.'; ^ foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in delta.py: dd(p= [] , c= [6] ) [6] $ cat patch.6 34c36 < out[i] = (hc == 1 || hc == 2) ? 'H' : '.'; --- > out[j] = (hc == 1 || hc == 2) ? 'H' : '.';You must write is-failure-inducing-change (a bash shell script) that determines if given patches (named patch.0 through patch.n), when applied to wireworld-original.c, result in a GCC compilation error. Your script is responsible for restoring wireworld-original.c to its pristine state after each run. You will likely have to read up on how the Unix patch utility works.
The inputs, outputs and arguments to is-failure-inducing-change are exactly the same as is-interesting from HW5a.
The wireworld example shown above is available for local testing.
Minimizing Test Suites
We can also use Delta Debugging to find one-minimal subsets of test suites that retain the same coverage. We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since many students manually minimized test suites for it in an earlier homework.
You should use Delta Debugging with the goal of finding a one-minimal subset of these 1639 test cases such that that subset has the same line coverage as the complete set as reported by gcov (just as in Homework 1). Read the PDF report guidelines carefully before beginning.
You must submit a short PDF report, via Brightspace, describing your experiences using Delta Debugging to minimize this test suite. In particular:
-
[interesting] What was your high level notion of interesting for this activity? How did you implement it?
[1 point for concept, 2 points for implementation] -
[results] What was the original coverage of the test suite? What were the size and coverage of your final result? Were you able to use Delta Debugging to produce a one-minimal subset of the test suite with the same coverage? Why or why not?
- Note that both why and why not can potentially be full-credit answers, depending on the logical argument you present and the mastery of the course material you demonstrate. This is a tricky thought question.
- (You may be tempted to ask for clarifications on this point, but I'm afraid we must leave this aspect to your interpretation. Figuring out what it means to provide a high-quality writeup here is part of the assignment.)
-
[time] How many probes did Delta Debugging make to determine if a set was interesting? What was the overall runtime? How does this formally compare to the expected "Big-Oh" performance? How does this informally compare to a manual effort to minimize the test suite (either your own in an earlier homework or a hypothetical situation)?
[1 point for numbers, 3 points for comparisons] -
[conclusion] Would you use Delta Debugging to minimize a test suite in the future? Why or why not?
[2 point for insight]
What if test suite minimization seems to be taking too long?
Planning your time in advance and scheduling so that you do not start too late are major themes in this course.
Most students are able to run the delta.py computing part of HW5c "overnight". If it seems to be running for too long, one possibility is that you have made a mistake in your programming (e.g., of Delta Debugging or the definition of interesting) and another is that you are running on a very slow setup (e.g., a virtual machine in an unplugged laptop with files stored on a networked drive, etc.).
You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the Big-Oh running time for Delta Debugging.
If all else fails and you fear it will take too long:
- Consider passing multiple png files to pngtest at once. (That may give slightly different coverage results than in HW1, but as long as you do it consistently, it is fine for this assignment.)
- Some students cached the per-test coverage results and simply re-used them later (e.g., if you know the lines covered by test #1 and the lines covered by test #2, simply union them to get the lines covered jointly).
- Some students have reported success with using Amazon AWS instances or similar free-or-cheap cloud computing options.
- If started the assignment too late and your computer will not finish in time, you may use a prefix of the test suite. That is, treat the assignment as if you had been given the first 500 tests in the zip file (for some value of 500 that will allow it to run overnight on your setup). Note that to do this you will have to change your definition of "full coverage" to match what is provided by those 500 tests. If you do this you must explicitly indicate the time taken to compute coverage on all 1639 test cases, the reduced number of tests chosen (e.g., 500), and the maximum total coverage associated with your test subset in your report.
Coverage-Based Fault Localization
In the second part of this assignment we will implement coverage-based fault localization atop the output of gcov.
In addition to reporting summary coverage statistics, gcov also indicates how many times each line was visited. This information can be found in the produced .gcov files (which are actually just text files!), as detailed in its documentation.
For example, consider a fragment of pngtest.c.gcov from an example run:
6: 2020: if (i == 1) 2: 2021: status_dots_requested = 1; -: 2022: 4: 2023: else if (verbose == 0) 4: 2024: status_dots_requested = 0;The lines begin with whitespace. Then the first number is the "number of visits observed" and the second number is the "line number in the original code". The rest of the text is the original source code to the program. So in this example, the if on line 2020 was visited 6 times. The true branch was taken 2 times, and the other four times, verbose was always 0. (You will only need to read the first "number", the colon and the second number.)
You should generate your own .gcov files and take a look at them; the format (e.g., whitespace, etc.) might be slightly different on your machine.
In class we discussed the Tarantula coverage-based fault localization algorithm. In this assignment you will implement the Ochiai coverage-based fault localization algorithm, as summarized in Pearson et al.'s Evaluating and improving fault localization (local copy). They are very similar, but Ochiai is believed to be better than Tarantula in the real world.
You must write a Python 3 program, faultloc.py, that takes the names of multiple .gcov files as command-line arguments. The arguments will all be named passi.gcov and faili.gcov. For example, you might receive pass0.gcov, pass1.gcov, pass2.gcov, and fail0.gcov, fail1.gcov as arguments. The former encodes coverage from passing runs, the latter from failing runs. Your program must compute the Ochiai suspiciousness rating for each applicable line and print out a sorted list of the top 100 most suspicious lines (and their suspiciousness values). This printout should be formatted as a Python list of tuples using the standard Python print function. For example, if line 5 has suspiciousness 0.75 and line 9 has suspiciousness 0.2, then you would print:
[(5, 0.75), (9, 0.2)]Details:
- If the beginning of a line is marked #####, - or ===== in all tests it is not relevant.
- Note that lines can actually start with whitespace (see example above). You may find python functions such as split, strip or rstrip helpful.
- If there are fewer than 100 relevant lines, print all of them.
-
When you are computing the fault localization formula (which involves division):
- If the numerator is zero, then you should include that statement with a suspiciousness of 0.0 in your final list. (Example: 0/5 is 0.0.)
- If the denominator is zero, then that value is undefined, so you should not include that statement with a suspiciousness value of 0.0 in your final list. (Example: 5/0 is undefined.)
- You must print the answers in descending order of suspiciousness, with the most suspicious lines at the top. In case of a tie, break ties by printing the line with the smaller line number first.
- To print the output, please use Python's print function on the list-of-up-to-100-pairs. Do not individually iterate through the elements and print them or cast values from one number format to another.
Submission
Via the autograder, you should:
- HW5a — submit delta.py (a Python 3 program)
- HW5b — submit is-failure-inducing-change (a bash shell script that does not end in .sh)
- HW5d — submit faultloc.py (a Python 3 program)
Via Brightspace, submit the PDF report for HW5c. You must indicate your name, VU Net ID, VU email, and partner's name, VU Net ID and VU email (if applicable) in the report. There is no explicit format (e.g., for headings or citations) required. For example, you may either use an essay structure or a point-by-point list of question answers.
FAQ and Troubleshooting
In this section we detail previous student issues and resolutions:
-
Question: For HW5b, when minimizing failure-inducing changes, my approach appears to work locally but not on the autograder.
Answer: Some students report that applying the patches one by one (rather than using cat to apply them "all at once") resolves this issue.
-
Question: I'm seeing an error like ./is_interesting.sh: line 4: syntax error near unexpected token `$'do\r''.
Answer: Some students resolved this by running dos2unix on the is-interesting.sh script, since there are extra invisible characters (the \r above) that Windows uses.
-
Question: I'm seeing an error like sh: 1: ./is-interesting-2.sh: Permission denied on the autograder.
Answer 1: Make sure you're using os.system to invoke the interesting function.
Answer 2: Make sure you're treating the command passed in to you as a single command, even if it is a string that contains spaces. Don't break it up. Locally, try out both "/bin/bash ./is-interesting.sh" and ./is-interesting.sh as commands and make sure your script works with both. -
Question: I'm seeing /bin/bash: /bin/bash: cannot execute binary file on the autograder.
Answer: You don't need to add an "extra" bash to the beginning of the command you run. Just run the command you are given.
-
Question: When I run gcov it gives me coverage for each file but does not give the total coverage.
Answer: This is a gcov version issue. You need to install gcov-5 to get the summary.