Genome assembly#

Method#

Since the raw DNA data are long reads generated by PacBio, FastQC is not suitable for quality control. Canu, an assembler for PacBio sequences that runs correction, trimming and assembly phases, was used in this project. Other software such as SOAPdenovo, SPAdes and Pilon requires short reads as input, which are not available in this project.

Command:

canu \
-p lfts \
-d ~/genome_analysis/analyses/01_genome_assembly/01-lfts-pacbio \
genomeSize=2.6m \
stopOnReadQuality=false \
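-pacbio-raw ~/genome_analysis/data/DNA_raw_data/*.fastq.gz

  • -p: output prefix

  • -d: output directory

  • genomeSize: approximate genome size, which Canu uses to determine coverage

  • stopOnReadQuality=false: do not stop the run when Canu's read-quality check fails

  • -pacbio-raw: the input reads are raw (uncorrected) PacBio reads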
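
To illustrate why genomeSize matters: the expected coverage is the total number of sequenced bases divided by the genome size. The following is only a minimal sketch of that arithmetic, not Canu's internal code; the function name `estimate_coverage` and the example read lengths are hypothetical.

```python
def estimate_coverage(read_lengths, genome_size):
    """Return expected coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical example (not the real data set):
# 10,000 reads of 8 kb each against a 2.6 Mb genome.
example_reads = [8000] * 10000
coverage = estimate_coverage(example_reads, 2_600_000)
print(f"Expected coverage: {coverage:.1f}x")
```

If the genome size is overestimated, the coverage is underestimated by the same factor, and the reverse.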

Results and discussion#

I got 8 contigs with a total length of 2794451 bp; the longest one is 2577733 bp. Contig 1 covers over 90% of the whole assembly. In the paper, the authors obtained 2 contigs, one of which is 2569357 bp. The small difference is probably because they conducted the assembly with HGAP3 rather than Canu. The file contigs.fasta is used in downstream analysis.
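
The summary numbers above (contig count, total length, longest contig) can be recomputed from contigs.fasta with a short script. This is a minimal sketch that assumes a plain, uncompressed FASTA file; the helper names `read_fasta_lengths`, `n50` and `summarize` are made up for illustration and are not part of Canu.

```python
def read_fasta_lengths(path):
    """Return a list with the length of each sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Return contig count, total length, longest contig, its share of the total, and N50."""
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(lengths)
    longest = max(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": longest,
        "longest_fraction": longest / total,
        "n50": n50(lengths),
    }
```

Calling `summarize(read_fasta_lengths("contigs.fasta"))` would report these values for the assembly. With the numbers given above, the longest contig accounts for 2577733 / 2794451, or roughly 92%, of the total length.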

Q & A#

Genome assembly#

  1. What information can you get from the plots and reports given by the assembler?

  • A histogram of read lengths, the histogram of k-mers in the raw and corrected reads, a summary of the corrected data, a summary of overlaps, and a summary of contig lengths.

  1. What intermediate steps generate informative output about the assembly?

  • Correction, trimming and assembly.

  1. How many contigs do you expect? How many do you obtain?

  • At least one contig. 8 contigs are obtained.

  1. What is the difference between a ‘contig’ and a ‘unitig’?

  • A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. A unitig is a contig that is split at alternate paths in the assembly graph.

  1. What is the difference between a ‘contig’ and a ‘scaffold’?

  • A contig is a contiguous sequence without gaps. A scaffold consists of contigs that are ordered and oriented using information from paired-end sequencing and separated by gaps of known length.

  1. What are the kmers? What kmer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big kmer?

  • A k-mer is a contiguous sequence of k bases. The appropriate k-mer size depends on the read length and the read depth. A small k increases the chance that k-mers overlap so that the graph can be constructed, but it also increases path ambiguity and graph complexity. A large k decreases the number of alternative paths but may leave disjoint parts in the graph.
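
The trade-off between small and large k can be seen on a toy sequence. The sketch below builds a very simplified de Bruijn graph (nodes are (k-1)-mers, edges are k-mers) and counts nodes with more than one outgoing edge. The function names and the example sequence are invented for illustration; real assemblers also handle reverse complements, sequencing errors and coverage, which are ignored here.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return all contiguous substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Count nodes with more than one outgoing edge, i.e. ambiguous paths."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy sequence in which the 3-mer TGC occurs twice.
reads = ["ATGCGTGCA"]
for k in (3, 5):
    graph = de_bruijn_edges(reads, k)
    print(k, len(kmers(reads[0], k)), branching_nodes(graph))

# A read shorter than k yields no k-mers at all, so it cannot join the graph.
print(kmers("ATGC", 5))
```

In this toy case the repeated TGC creates a branch at k = 3 that disappears at k = 5, while the short read at the end shows how a large k can leave data unused and the graph disconnected.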

  1. Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

  • Replace the original noisy read sequences with consensus sequences computed from overlapping reads.
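
The idea of consensus-based correction can be sketched with a per-column majority vote. This is a deliberately simplified illustration, not Canu's actual correction algorithm: it assumes the overlapping reads are already aligned and have equal length, and the function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the per-column majority base of equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    result = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the base seen most often in this column
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Four hypothetical noisy copies of the same region, each with one error.
noisy = ["ACGTTGCA", "ACGATGCA", "ACGTTGGA", "TCGTTGCA"]
print(consensus(noisy))
```

Because each error occurs in only one read, the other reads outvote it at that position. This is why correction needs sufficient coverage.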

  1. How differently do different assemblers perform for the same data?

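  • They can produce different numbers and lengths of contigs from the same data: here Canu gave 8 contigs, while the paper reports 2 contigs with HGAP3. Assemblers may also use different algorithms (overlap graph and de Bruijn graph).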

  1. Can you see any other letter apart from AGTC in your assembly? If so, what are those?

  • No. In general, other letters from the IUPAC nucleotide code can appear; they represent a gap or multiple alternative bases at a position (for example, N stands for any base).
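
One way to check an assembly for letters other than A, C, G and T is to count every character in the sequence lines. This is a minimal sketch; the function names `count_bases` and `non_acgt` and the in-memory example records are hypothetical.

```python
from collections import Counter


def count_bases(fasta_lines):
    """Count each character in the sequence lines of a FASTA input, ignoring headers."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            counts.update(line.upper())
    return counts


def non_acgt(counts):
    """Return only the characters that are not A, C, G or T."""
    return {base: n for base, n in counts.items() if base not in "ACGT"}


# Hypothetical in-memory example; an open file handle works the same way.
example = [">contig1", "ACGTNACGT", ">contig2", "acgtr"]
print(non_acgt(count_bases(example)))
```

To check the real assembly, pass an open handle for contigs.fasta to `count_bases`. An empty result means that only A, C, G and T are present.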