首页 » 计算机科学(Computing Science) » CSC427/527: Assignment 1

CSC427/527: Assignment 1

2024-01-10 Admin 写评论

CSC 427 / 527: Assignment 1

Before solving the following problems, make sure to install Codon and Seq by running these com- mands2:

bash -c "$(curl -fsSL https://exaloop.io/install.sh) "

export platform=$(uname -s | awk ' {print tolower($0)} ' )-$(uname -m)

culta(-)r(L) z(h)x(t)v(t)f(p)s-:/-/C(g)iucodon(b.com)/(/)lib(exa)cod(oop)n(s)p(q)ugins(relea)ses/download/v0.11.3/seq-${platform} .tar.gz \

Then familiarize yourself with Codon (and Seq) by going through the following tutorials:

1. https://docs.exaloop.io/codon/general/intro

2. https://docs.seq-lang.org/tutorial/primer.html

3. https://docs.seq-lang.org/tutorial/tutorial.html

If you get stuck, let us know! We will be using Piazza to help you with Codon. Codon is still in development, so don’t be surprised if things do not work out! And please note down any bugs or

inconsistencies you discover during this assignment. General grievances and rants are also appreci- ated! You will submit all of those at the end of the semester as a part of Codon feedback collection program.

Each of these problems should be implemented as a short script in the Codon language unless otherwise speciﬁed. Each script should be relatively short (no more than 100 lines of code). The source code ﬁles and discussion points (in a PDF or a TXT, ﬁle please), should be submitted to Brightspace packed in a single ZIP ﬁle.

Problem 1 (x-mer, 3 points): Implement a short Codon script that reads a FASTA ﬁle (passed as a command line argument). Then:

1. Print all palindromic 8-mers in the ﬁle to xmer-8. txt in a lexicographical order. A palindromic k-mer is a k-mer identical to its reverse complement.

2. Print all canonical 5-mers and their frequencies to xmer-5.txt. 5-mers should be sorted by their frequencies. Only print those 5-mers whose frequency is greater or equal than 5.

A canonical k-mer is a k-mer that is lexicographically smaller than its reverse complement.

Example invocation:

# Get a low-complexity region from chr1

$ samtools faidx \

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \ 1:6430000-6436000 \

> chr1-low.fa

$ codon run -plugin seq xmer.codon chr1-low.fa

# This is the output for chr1-low.fa

$ tail -n+1 xmer-*

==> xmer-8.txt <==

AAGTACTT

AATGCATT

ACCTAGGT

AGAGCTCT

AGGCGCCT

AGGGCCCT

CACCGGTG

CAGGCCTG

CCCATGGG

CCTGCAGG

CTGATCAG

CTGTACAG

CTTCGAAG

GCAGCTGC

GCTGCAGC

GGAGCTCC

TAAATTTA

==> xmer-5.txt <==

AAAAA 62

CAGGC 28

CCAGC 22

GAGGC 21

...

You can also play with one of the E.coli genomes.

Note: if a FASTA has more than one sequence, you should consider k-mers from all sequences. However, do not concatenate them: if you do so, you will end up with extra k-mers that should not be counted.

Solution. In this problem, we just need to create a set that will maintain palindromes and a dictionary that will keep k-mer counts. We can use seq. kmers method to get all k-mers from a read. We can then sort k-mers via sorted command. Frequencies can be constructed on the ﬂy and sorted with reverse parameter.

1 import bio , sys

3 # Part 1

4 pals = set ()

5 with bio .FASTA(sys .argv[1], fai=False) as f: # fai=False avoids "no index" errors

6 for s in bio .seqs(f):

7 for kmer in s .kmers(1, k=8):

8 if kmer == ˜kmer: # check if this is a palindrome

9 pals .add(kmer)

11 with open ( ' xmer-8.txt ' , ' w ' ) as fo:

12 for k in sorted (pals):

13 print (k, file=fo)

15 # Part 2

16 kmers = {}

17 with bio .FASTA(sys .argv[1], fai=False) as f:

18 for s in bio .seqs(f):

19 for kmer in s .kmers(1, k=5):

20 if kmer <= ˜kmer:

21 if kmer in kmers:

22 kmers[kmer] += 1

23 else:

24 kmers[kmer] = 1

26 with open ( ' xmer-5.txt ' , ' w ' ) as fo:

27 for cnt, k in sorted ([(v, k) for k, v in kmers .items() if v >= 5], reverse=True):

28 print (k, cnt, file=fo)

Real test cases:

1. E.coli ASM886v2

2. Human hg19 chromosome 22

Problem 2 (Quality, 4 points): To get started, ﬁrst perform the following warm-up steps:

1. Download the ﬁrst 500,000 paired-end records (not lines!) from SRR2040271 sample (hint: you just need SRR2040271 1. fastq and SRR2040271 2.fastq) . Then interleave the reads into a single ﬁle interleaved. fastq (where an interleaved FASTQ is a FASTQ that contains a record from the ﬁrst end followed by a record from the second end and so on).

2. Take the interleaved. fastq and de-interleave it by using only shell commands. This script should produce two ﬁles mate 1. fastq and mate 2.fastq . Submit the shell script (or a Bash one-liner) as an attachment. Explain what the script does (and how!) step by step.

Then, implement a basic FASTQ quality control program that counts and prints the number of paired-end reads in an interleaved FASTQ that pass and fail the quality control. A pair of reads fails the quality control if:

• a read or its mate that begins or ends either with a universal adapter (in this case, AGATCGGAAGAG), or a string that has an edit distance of 1 to the universal adapter (to account for a single se- quencing error).

• a median quality score of a read or its mate is lower than 20 (in this dataset, the quality oﬀset is 33; make sure to subtract it before computing the median).

Print the number of reads that failed each ﬁlter as shown below. Note that the adapter ﬁlter takes precedence (i.e., if a read fails adapter ﬁlter, it is included in the adapter count and not in the quality count).

Example invocation:

# Take first 1250 paired-end reads from interleaved.fastq

$ head -n 10000 interleaved.fastq > interleaved-small.fastq

# This is the output for interleaved-small.fastq

$ codon run -plugin seq quality.codon interleaved-small.fastq

PASS=1214

FAIL_ADAPTER=16

FAIL_QUALITY=20

Solution. We can use the following shell commands to get a small subset of a FASTQ ﬁle. As each FASTQ record consists of 4 lines, we will need to get 4 million lines for each ﬁle. We use curl for downloading, bzip2 for decompression and head for taking the ﬁrst 2 million lines (that corresponds to the ﬁrst 500,000 records) in a ﬁle. Note that by doing this, you will only download the ﬁrst 2 million lines, not the whole ﬁle! Also note that a long shell command can be split into multiple lines by using \ line separator:

# Download and extract 500,000 records from the first mate

culh(b)d(p)2(f)t0(t)0(.)dbj.nig.ac.jp/ddbj_database/dra/fastq/SRA269/SRA269955/SRX1038646/SRR2040271_1.fastq.bz2 \

> f_1.fq

# Download and extract 500,000 records from the second mate

culh(b)d(p)2(f)t0(t)0(.)dbj.nig.ac.jp/ddbj_database/dra/fastq/SRA269/SRA269955/SRX1038646/SRR2040271_2.fastq.bz2 \

> f_2.fq

Now we can interleave and de-interleave these ﬁles by using standard POSIX tools such as tee, cut, paste and awk (more cool examples can be found here):

# Interleave:

paste f_1.fq f_2.fq \

| aw(| pa)k(s)te-v-OF(-)S "n -v FS= "\t" ' {print($1,$3,$5,$7,$2,$4,$6,$8)} ' \

> SRR2040271.fastq

# De-interleave:

paste - - - - - - - - \

c(t)u(e)t(e) ->c5(u)t-8- t(1)r(-)4 "tr " "> "R(n)20(>)40(S)2(R)7(R)1(2)0_2(4)0fastq(271_1) .fastq) \

Finally, the solution to the problem 2 is:

1 import bio , sys , statistics

3 adapter = ' AGATCGGAAGAG '

5 # Generate all indels compatible with the adapter

6 del_adapters = set () # deletions

7 for i in range (len (adapter)+1):

8 del_adapters .add(adapter[:i] + adapter[i + 1 :])

9 ins_adapters = set () # insertions

10 for i in range (len (adapter)+1):

11 for c in ' ACGT ' :

12 ins_adapters .add(adapter[:i] + c + adapter[i:])

14 def has_adapter (s):

15 def hamming(s):

16 return sum (int (adapter[i] != s[i]) for i in range (12))

17 if hamming(s[:12]) <= 1 or hamming(s[-12 :]) <= 1:

18 return True

19 if s[:11] in del_adapters or s[-11 :] in del_adapters:

20 return True

21 if s[:13] in ins_adapters or s[-13 :] in ins_adapters:

22 return True

23 return False

25 def bad_quality(qual): # check if the median quality score falls below 20

26 m = statistics .median([ord (q) - 33 for q in qual])

27 return m < 20

29 read_total, fail_adapter, fail_quality = 0 , 0 , 0

30 with bio .FASTQ(sys .argv[1]) as f:

31 mate1, mate1_qual = s '' , ''

32 for r in f:

33 if read_total % 2 == 1: # second mate in the pair

34 if has_adapter(str(mate1)) or has_adapter(str (r.read)):

35 fail_adapter += 1

36 elif bad_quality(mate1_qual) or bad_quality(r.qual):

37 fail_quality += 1

38 else: # first mate in the pair

39 mate1, mate1_qual = r .read, r .qual

40 read_total += 1

42 print (f ' PASS={read_total // 2 - fail_adapter - fail_quality} ' )

43 print (f ' FAIL_ADAPTER={fail_adapter} ' )

44 print (f ' FAIL_QUALITY={fail_quality} ' )

Real test cases:

1. The ﬁrst 500,000 interleaved records from the aforementioned SRR2040271.fastq

2. The ﬁrst 250,000 interleaved records from the SRR870667.

Problem 3 (Squeeze, 3 points): Download the reads mapping to chromosome 22 from the following sample. Use Samtools—available through apt or brew—to get this data; the expected ﬁle size is ≈700 MB. Then:

1. Implement a script that reads a SAM ﬁle and prints the following six ﬁelds for each record:

• read name,

• position,

• CIGAR string,

• mapping quality,

• sequence (read), and

• quality score.

Compress this ﬁle with bzip2 and with gzip. Report the size of the compressed ﬁles.

2. Now, modify your script to store each of these ﬁelds in a separate ﬁle. You should end up with 6 diﬀerent ﬁles (e.g., file readnames. txt, file positions. txt and so on). Compress each of these ﬁles with gzip and bzip2. Any improvements? Where? Why?

3. Apply Rice encoding to the file mapqualities. txt (use k = 4). Is the compression better now? How much?

In addition to the code, please submit the report explaining what you did to Brightspace. Do not submit the code to the submission server.

Bonus point: Would other coding scheme work for file mapqualities. txt? How about file positions. txt ? Why?

Hint: Run samtools -Hto obtain the list of chromosome names in the SAM ﬁle. It might be useful!

Solution. Instead of downloading the whole 16 GB SAM ﬁle, we can just download (and extract) the chromosome 22 directly from the provided URL thanks to the random access capabilities of SAM/BAM format. First, let’s ﬁnd out the name of the chromosome 22 by examining the header via -H parameter:

$ samtools view -H \

https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00275/alignment/HG00275.mapped.ILLUMINA . bwa.FIN.low_coverage.20120522.bam

发表评论

电子邮件地址不会被公开。必填项已用*标注

姓名 *

电子邮件 *

验证码 *