Assembly pipeline (Overview)
by John Nash
1. Obtain raw reads from the sequencer: raw data is essential because, unlike the automatically generated sequence files, it includes a quality value for each base.
1.1. 454 Sequencing
1.1.1. Flowgrams (SFF files)
1.2. Illumina sequencing
1.2.1. Very large files (about 2 GB each) in FASTQ format, delivered in pairs.
1.3. Sanger sequencing
1.3.1. AB1 files
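To illustrate why raw reads matter, here is a minimal Python sketch that reads FASTQ records and decodes the per-base quality string into Phred scores. It assumes plain four-line records and the Phred+33 quality encoding (older Illumina pipelines used a +64 offset instead); the function name read_fastq is hypothetical and is not part of any tool named in this outline.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from a four-line FASTQ stream.

    Assumption: qualities use the Phred+33 encoding.
    """
    while True:
        header = handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # skip stray blank lines
            continue
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' header line, got: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()         # '+' separator line, ignored
        quality = handle.readline().strip()
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality strings differ in length")
        # Each quality character encodes one Phred score, offset by 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:].strip(), sequence, scores


# A single made-up read, used only to show the record layout.
example = "@read1\nGATTACA\n+\nIIII##!\n"
records = list(read_fastq(io.StringIO(example)))
```

For this made-up read the first four bases decode to a score of 40 and the last three to 2, 2 and 0, which is exactly the per-base information that a sequence-only file discards.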
2. Upload assembly data to portal
2.1. LFZ has a web page under construction for viewing, managing and analysing whole-genome sequence data: http://bgph.dyndns.org/
3. Annotation: The assembled genome is a set of G, A, T and C nucleotides. Genes must now be assigned to the genome.
3.1. The genome is automatically annotated by computer, and each annotation is curated by a human to check accuracy.
3.2. Software used: myRAST (for automated annotation), in-house software (for manual curation), and Artemis (for genome viewing).
3.3. OUTPUT: Annotated genome
4. Assembly of raw reads using a genome assembler.
4.1. Software used: MIRA, currently the only assembler that performs hybrid assemblies combining different sequencing technologies, e.g. a mix of 454 and Illumina reads.
4.1.1. De novo assemblies
4.1.1.1. Reads are assembled using algorithms based on sequence quality, paired-end distances and average depth of coverage; the last of these helps prevent misassembly of heavily repeated regions.
4.1.2. Scaffolded assemblies
4.1.2.1. Used for all genomes with a close relative. Select an appropriate scaffold, usually the closest relative, ideally identified using phylogenetic software.
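The role of average depth of coverage in the de novo step can be sketched in Python as follows. This is an illustration only, not MIRA's actual algorithm: the function names, the example numbers and the twofold threshold are all assumptions. The idea is that a region whose coverage is far above the genome-wide average probably represents several copies of a repeat collapsed into one.

```python
from typing import Dict, List, Sequence


def average_coverage(read_lengths: Sequence[int], genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def flag_possible_repeats(
    contig_coverage: Dict[str, float],
    expected: float,
    factor: float = 2.0,  # hypothetical threshold, chosen for illustration
) -> List[str]:
    """Return names of contigs whose coverage is at least factor x expected."""
    return sorted(
        name for name, cov in contig_coverage.items() if cov >= factor * expected
    )


# Made-up numbers: 30,000 reads of 100 bases over a 100 kb genome.
expected = average_coverage([100] * 30_000, 100_000)

# Made-up per-contig coverage values.
suspects = flag_possible_repeats(
    {"contig1": 29.0, "contig2": 65.0, "contig3": 31.5}, expected
)
```

With these made-up values the expected coverage is 30x, and only contig2, at 65x, is flagged as a possible collapsed repeat.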