Assembly pipeline (Overview)

1. Obtain raw reads from the sequencer: Raw data is essential because, unlike the automatically generated sequence data, it contains sequence quality data for each base.

1.1. 454 Sequencing

1.1.1. Flowgrams (SFF files)

1.2. Illumina sequencing

1.2.1. Very large files (about 2 GB each) in FASTQ format, supplied in pairs.

1.3. Sanger sequencing

1.3.1. AB1 files
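
To illustrate why the raw reads matter, here is a minimal sketch of reading the per-base quality values stored in a FASTQ file. It assumes plain four-line records and Phred+33 quality encoding (an assumption; older Illumina files may use a different offset). The function names `read_fastq` and `error_probability` are hypothetical and not part of any tool named in this pipeline.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) from four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if (
            qual is None
            or not header.startswith("@")
            or not plus.startswith("+")
            or len(seq) != len(qual)
        ):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def error_probability(q: int) -> float:
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    record = ["@read1\n", "GATTACA\n", "+\n", "IIII#II\n"]
    for read_id, seq, quals in read_fastq(record):
        print(read_id, seq, quals)
```

A quality of 20 corresponds to a 1 in 100 chance that the base call is wrong, which is the kind of information an assembler can weigh when reads disagree.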

2. Upload assembly data to portal

2.1. LFZ has a Web page under construction for viewing, managing and analysing whole genome sequence data: http://bgph.dyndns.org/

3. Annotation: The assembled genome is a set of G, A, T and C nucleotides. Genes must now be assigned to the genome.

3.1. The genome is automatically annotated by computer, and each annotation is curated by a human to check accuracy.

3.2. Software used: myRAST (for automated annotation), in-house software (for manual curation), and Artemis for genome viewing.

3.3. OUTPUT: Annotated genome

4. Assembly of raw reads using a genome assembler.

4.1. Software used: MIRA is currently the only assembler that performs hybrid assemblies using different sequencing technologies, e.g. a mix of 454 and Illumina.

4.1.1. De novo assemblies

4.1.1.1. Reads are assembled using algorithms based upon sequence quality, paired-end distances and average depth of coverage. The depth of coverage helps prevent misassembly of heavily repeated areas.

4.1.2. Scaffolded assemblies

4.1.2.1. Used for all genomes with a close relative. Select an appropriate scaffold, usually the closest relative, ideally identified using phylogenetic software.

4.2. OUTPUT: draft assembly
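
The role of average depth of coverage in 4.1.1.1 can be sketched as follows. This is not MIRA's algorithm, only an illustration of the idea: a region whose depth is well above the average may be a repeat that has been collapsed into one copy. The function names, the half-open `(start, end)` alignment format and the `factor` threshold are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def average_coverage(read_lengths: Sequence[int], genome_length: int) -> float:
    """Average depth = total number of sequenced bases / genome length."""
    return sum(read_lengths) / genome_length


def depth_per_position(alignments: Sequence[Tuple[int, int]], length: int) -> List[int]:
    """Depth at each position, given 0-based half-open (start, end) read placements."""
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start] += 1  # a read begins covering here
        diff[end] -= 1    # and stops covering here
    depth, running = [], 0
    for change in diff[:length]:
        running += change
        depth.append(running)
    return depth


def flag_high_depth(depth: Sequence[int], factor: float = 1.5) -> List[int]:
    """Positions whose depth is at least `factor` times the mean (hypothetical threshold)."""
    mean = sum(depth) / len(depth)
    return [i for i, d in enumerate(depth) if d >= factor * mean]


if __name__ == "__main__":
    placements = [(0, 5), (5, 10), (3, 7), (3, 7), (3, 7)]
    depth = depth_per_position(placements, 10)
    print(depth)
    print(flag_high_depth(depth))
```

In the example, positions 3 to 6 are covered four times while the rest of the sequence is covered once, so they stand out against the mean depth of 2.2.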

5. Post-processing of reads

5.1. It is important to see what the genome assembly looks like. Software used: Tablet

5.2. Proof-reading: All assemblers make mistakes, so every sequence is proof-read by a human. Software used: gap5

5.3. Single nucleotide polymorphisms (SNPs) are important for discovering markers for detection. Software used: gigaBayes, which is usually used for mammalian data but has been altered by LFZ to call bacterial SNPs.

5.4. OUTPUT: assembled, proof-read genome
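
As a rough illustration of what SNP calling in 5.3 involves, the sketch below compares the bases of the reads stacked at each position against a reference base. It is a naive majority-vote approach, not the Bayesian method used by gigaBayes and not LFZ's modified version. The `pileup` dictionary format, the function name `call_snps` and both thresholds are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def call_snps(
    reference: str,
    pileup: Dict[int, str],
    min_depth: int = 4,
    min_fraction: float = 0.8,
) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, observed_base) for candidate SNPs.

    `pileup` maps a 0-based reference position to the read bases aligned there.
    A high `min_fraction` is an assumption suited to a haploid bacterial genome,
    where nearly all reads are expected to agree at a true SNP.
    """
    snps = []
    for pos in sorted(pileup):
        bases = pileup[pos].upper()
        if len(bases) < min_depth:
            continue  # too few reads to trust a call
        base, count = Counter(bases).most_common(1)[0]
        ref = reference[pos].upper()
        if base != ref and count / len(bases) >= min_fraction:
            snps.append((pos, ref, base))
    return snps


if __name__ == "__main__":
    reference = "ACGTAC"
    pileup = {1: "TTTTT", 3: "TTTA", 4: "GG"}
    print(call_snps(reference, pileup))
```

Here position 1 is reported because all five reads show T where the reference has C. Position 3 mostly agrees with the reference, and position 4 has too few reads to call.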
