Before solving the following problems, make sure to install Codon and Seq by running these com- mands2:
bash -c "$(curl -fsSL https://exaloop.io/install.sh) "
export platform=$(uname -s | awk ' {print tolower($0)} ' )-$(uname -m)
culta(-)r(L) z(h)x(t)v(t)f(p)s-:/-/C(g)iucodon(b.com)/(/)lib(exa)cod(oop)n(s)p(q)ugins(relea)ses/download/v0.11.3/seq-${platform} .tar.gz \
Then familiarize yourself with Codon (and Seq) by going through the following tutorials:
1. https://docs.exaloop.io/codon/general/intro
2. https://docs.seq-lang.org/tutorial/primer.html
3. https://docs.seq-lang.org/tutorial/tutorial.html
If you get stuck, let us know! We will be using Piazza to help you with Codon. Codon is still in development, so don’t be surprised if things do not work out! And please note down any bugs or
inconsistencies you discover during this assignment. General grievances and rants are also appreci- ated! You will submit all of those at the end of the semester as a part of Codon feedback collection program.
Each of these problems should be implemented as a short script in the Codon language unless otherwise specified. Each script should be relatively short (no more than 100 lines of code). The source code files and discussion points (in a PDF or a TXT, file please), should be submitted to Brightspace packed in a single ZIP file.
Problem 1 (x-mer, 3 points): Implement a short Codon script that reads a FASTA file (passed as a command line argument). Then:
1. Print all palindromic 8-mers in the file to xmer-8. txt in a lexicographical order. A palindromic k-mer is a k-mer identical to its reverse complement.
2. Print all canonical 5-mers and their frequencies to xmer-5.txt. 5-mers should be sorted by their frequencies. Only print those 5-mers whose frequency is greater or equal than 5.
A canonical k-mer is a k-mer that is lexicographically smaller than its reverse complement.
Example invocation:
# Get a low-complexity region from chr1
$ samtools faidx \
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \ 1:6430000-6436000 \
> chr1-low.fa
$ codon run -plugin seq xmer.codon chr1-low.fa
# This is the output for chr1-low.fa
$ tail -n+1 xmer-*
==> xmer-8.txt <==
AAGTACTT
AATGCATT
ACCTAGGT
AGAGCTCT
AGGCGCCT
AGGGCCCT
CACCGGTG
CAGGCCTG
CCCATGGG
CCTGCAGG
CTGATCAG
CTGTACAG
CTTCGAAG
GCAGCTGC
GCTGCAGC
GGAGCTCC
TAAATTTA
==> xmer-5.txt <==
AAAAA 62
CAGGC 28
CCAGC 22
GAGGC 21
...
You can also play with one of the E.coli genomes.
Note: if a FASTA has more than one sequence, you should consider k-mers from all sequences. However, do not concatenate them: if you do so, you will end up with extra k-mers that should not be counted.
Solution. In this problem, we just need to create a set that will maintain palindromes and a dictionary that will keep k-mer counts. We can use seq. kmers method to get all k-mers from a read. We can then sort k-mers via sorted command. Frequencies can be constructed on the fly and sorted with reverse parameter.
1 import bio , sys
2
3 # Part 1
4 pals = set ()
5 with bio .FASTA(sys .argv[1], fai=False) as f: # fai=False avoids "no index" errors
6 for s in bio .seqs(f):
7 for kmer in s .kmers(1, k=8):
8 if kmer == ˜kmer: # check if this is a palindrome
9 pals .add(kmer)
10
11 with open ( ' xmer-8.txt ' , ' w ' ) as fo:
12 for k in sorted (pals):
13 print (k, file=fo)
14
15 # Part 2
16 kmers = {}
17 with bio .FASTA(sys .argv[1], fai=False) as f:
18 for s in bio .seqs(f):
19 for kmer in s .kmers(1, k=5):
20 if kmer <= ˜kmer:
21 if kmer in kmers:
22 kmers[kmer] += 1
23 else:
24 kmers[kmer] = 1
25
26 with open ( ' xmer-5.txt ' , ' w ' ) as fo:
27 for cnt, k in sorted ([(v, k) for k, v in kmers .items() if v >= 5], reverse=True):
28 print (k, cnt, file=fo)
Real test cases:
Problem 2 (Quality, 4 points): To get started, first perform the following warm-up steps:
1. Download the first 500,000 paired-end records (not lines!) from SRR2040271 sample (hint: you just need SRR2040271 1. fastq and SRR2040271 2.fastq) . Then interleave the reads into a single file interleaved. fastq (where an interleaved FASTQ is a FASTQ that contains a record from the first end followed by a record from the second end and so on).
2. Take the interleaved. fastq and de-interleave it by using only shell commands. This script should produce two files mate 1. fastq and mate 2.fastq . Submit the shell script (or a Bash one-liner) as an attachment. Explain what the script does (and how!) step by step.
Then, implement a basic FASTQ quality control program that counts and prints the number of paired-end reads in an interleaved FASTQ that pass and fail the quality control. A pair of reads fails the quality control if:
• a read or its mate that begins or ends either with a universal adapter (in this case, AGATCGGAAGAG), or a string that has an edit distance of 1 to the universal adapter (to account for a single se- quencing error).
• a median quality score of a read or its mate is lower than 20 (in this dataset, the quality offset is 33; make sure to subtract it before computing the median).
Print the number of reads that failed each filter as shown below. Note that the adapter filter takes precedence (i.e., if a read fails adapter filter, it is included in the adapter count and not in the quality count).
Example invocation:
# Take first 1250 paired-end reads from interleaved.fastq
$ head -n 10000 interleaved.fastq > interleaved-small.fastq
# This is the output for interleaved-small.fastq
$ codon run -plugin seq quality.codon interleaved-small.fastq
PASS=1214
FAIL_ADAPTER=16
FAIL_QUALITY=20
Solution. We can use the following shell commands to get a small subset of a FASTQ file. As each FASTQ record consists of 4 lines, we will need to get 4 million lines for each file. We use curl for downloading, bzip2 for decompression and head for taking the first 2 million lines (that corresponds to the first 500,000 records) in a file. Note that by doing this, you will only download the first 2 million lines, not the whole file! Also note that a long shell command can be split into multiple lines by using \ line separator:
# Download and extract 500,000 records from the first mate
culh(b)d(p)2(f)t0(t)0(.)dbj.nig.ac.jp/ddbj_database/dra/fastq/SRA269/SRA269955/SRX1038646/SRR2040271_1.fastq.bz2 \
> f_1.fq
# Download and extract 500,000 records from the second mate
culh(b)d(p)2(f)t0(t)0(.)dbj.nig.ac.jp/ddbj_database/dra/fastq/SRA269/SRA269955/SRX1038646/SRR2040271_2.fastq.bz2 \
> f_2.fq
Now we can interleave and de-interleave these files by using standard POSIX tools such as tee, cut, paste and awk (more cool examples can be found here):
# Interleave:
paste f_1.fq f_2.fq \
| aw(| pa)k(s)te-v-OF(-)S "n -v FS= "\t" ' {print($1,$3,$5,$7,$2,$4,$6,$8)} ' \
> SRR2040271.fastq
# De-interleave:
paste - - - - - - - - \
c(t)u(e)t(e) ->c5(u)t-8- t(1)r(-)4 "tr " "> "R(n)20(>)40(S)2(R)7(R)1(2)0_2(4)0fastq(271_1) .fastq) \
Finally, the solution to the problem 2 is:
1 import bio , sys , statistics
2
3 adapter = ' AGATCGGAAGAG '
4
5 # Generate all indels compatible with the adapter
6 del_adapters = set () # deletions
7 for i in range (len (adapter)+1):
8 del_adapters .add(adapter[:i] + adapter[i + 1 :])
9 ins_adapters = set () # insertions
10 for i in range (len (adapter)+1):
11 for c in ' ACGT ' :
12 ins_adapters .add(adapter[:i] + c + adapter[i:])
13
14 def has_adapter (s):
15 def hamming(s):
16 return sum (int (adapter[i] != s[i]) for i in range (12))
17 if hamming(s[:12]) <= 1 or hamming(s[-12 :]) <= 1:
18 return True
19 if s[:11] in del_adapters or s[-11 :] in del_adapters:
20 return True
21 if s[:13] in ins_adapters or s[-13 :] in ins_adapters:
22 return True
23 return False
24
25 def bad_quality(qual): # check if the median quality score falls below 20
26 m = statistics .median([ord (q) - 33 for q in qual])
27 return m < 20
28
29 read_total, fail_adapter, fail_quality = 0 , 0 , 0
30 with bio .FASTQ(sys .argv[1]) as f:
31 mate1, mate1_qual = s '' , ''
32 for r in f:
33 if read_total % 2 == 1: # second mate in the pair
34 if has_adapter(str(mate1)) or has_adapter(str (r.read)):
35 fail_adapter += 1
36 elif bad_quality(mate1_qual) or bad_quality(r.qual):
37 fail_quality += 1
38 else: # first mate in the pair
39 mate1, mate1_qual = r .read, r .qual
40 read_total += 1
41
42 print (f ' PASS={read_total // 2 - fail_adapter - fail_quality} ' )
43 print (f ' FAIL_ADAPTER={fail_adapter} ' )
44 print (f ' FAIL_QUALITY={fail_quality} ' )
Real test cases:
1. The first 500,000 interleaved records from the aforementioned SRR2040271.fastq
2. The first 250,000 interleaved records from the SRR870667.
Problem 3 (Squeeze, 3 points): Download the reads mapping to chromosome 22 from the following sample. Use Samtools—available through apt or brew—to get this data; the expected file size is ≈700 MB. Then:
1. Implement a script that reads a SAM file and prints the following six fields for each record:
• read name,
• position,
• CIGAR string,
• mapping quality,
• sequence (read), and
• quality score.
Compress this file with bzip2 and with gzip. Report the size of the compressed files.
2. Now, modify your script to store each of these fields in a separate file. You should end up with 6 different files (e.g., file readnames. txt, file positions. txt and so on). Compress each of these files with gzip and bzip2. Any improvements? Where? Why?
3. Apply Rice encoding to the file mapqualities. txt (use k = 4). Is the compression better now? How much?
In addition to the code, please submit the report explaining what you did to Brightspace. Do not submit the code to the submission server.
Bonus point: Would other coding scheme work for file mapqualities. txt? How about file positions. txt ? Why?
Hint: Run samtools -Hto obtain the list of chromosome names in the SAM file. It might be useful!
Solution. Instead of downloading the whole 16 GB SAM file, we can just download (and extract) the chromosome 22 directly from the provided URL thanks to the random access capabilities of SAM/BAM format. First, let’s find out the name of the chromosome 22 by examining the header via -H parameter:
$ samtools view -H \
https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00275/alignment/HG00275.mapped.ILLUMINA . bwa.FIN.low_coverage.20120522.bam