Assembly & Annotation
Some of the tools available in KBase for assembly and annotation
KBase provides multiple Apps for de novo assembly of prokaryotic Next-Generation Sequencing (NGS) reads from various sequencing platforms. These assemblies can then be annotated to explore the structural and functional features of a Genome or to use that Genome in other analyses. The interactive tutorials are a good way to learn about these workflows.
Read Processing
Trim Reads with Trimmomatic – Read trimming and adapter removal
Filter Out Low-Complexity Reads with PRINSEQ – Low-complexity read filtering
Assess Read Quality with FastQC – Quality assessment and reporting
Cutadapt – Custom adapter removal
Assembly
De novo assembly of Illumina and Ion Torrent next-generation sequencing reads. Supports single-end and paired-end read libraries.
Assemble with HipMer – HipMer is a highly parallelized port of JGI’s Meraculous assembler. Meraculous is a de Bruijn graph-based assembler that gains speed by not performing error correction. Instead, it builds contigs only from data that already has high quality scores and fills the gaps using localized assemblies of the reads. HipMer further increases the speed of Meraculous through parallelization.
Assemble with IDBA-UD – IDBA-UD is an iterative graph-based assembler for single-cell and standard short-read data, and is well suited to data with highly uneven sequencing depth. It iterates over a range of k-mer sizes, which compensates for the information loss associated with single k-mer de Bruijn graphs and makes IDBA-UD one of the more accurate microbial assemblers.
Assemble with MaSuRCA – MaSuRCA is a short-read assembler that combines the benefits of the de Bruijn graph and overlap-layout-consensus assembly approaches. The main concept is the creation of super-reads, which contain the sequence information present in the original reads. Each super-read is built by extending a read in both directions using an efficient k-mer lookup table. MaSuRCA is one of a smaller set of assemblers that biologists use for eukaryotic assembly.
Assemble with MEGAHIT – MEGAHIT is a single-node assembler for large and complex metagenomic NGS reads. It uses a succinct de Bruijn graph (SdBG) to achieve low-memory assembly, making it fast and especially suitable for assembly of small metagenomes, metatranscriptomes, or low-coverage data in general.
Assemble with SPAdes – SPAdes is a single-cell and standard assembler based on paired de Bruijn graphs, and is considered one of the most accurate microbial assemblers. SPAdes builds a multisized de Bruijn graph, detects and removes bubbles and chimeric reads, estimates insert distances from paired k-mers, and computes contigs from the paired assembly graph.
Assemble with Velvet – Velvet is a classic de Bruijn graph-based assembler that works by efficiently manipulating de Bruijn graphs through simplification and compression. It first eliminates errors with an error-correction algorithm that merges sequences that belong together. It then resolves repeats with a repeat solver that separates paths sharing local overlaps.
Compare assemblies with QUAST – Assess the output assemblies from different configurations of the same assembler, or compare assemblies from multiple assemblers to determine which one is optimal for downstream analysis.
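Two ideas recur in the list above: most of the assemblers are built on de Bruijn graphs, and assemblies are compared on summary statistics. The following minimal Python sketch illustrates both on toy data. It is not code from KBase or from any of the tools listed; the function names `de_bruijn_graph` and `n50` are hypothetical, and N50 is used here only as one common contiguity statistic of the kind an assembly comparison reports.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers. Each k-mer found in a read adds one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical toy input: two overlapping reads.
reads = ["ACGTAC", "GTACGA"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))

# Hypothetical contig lengths from two assemblies of the same reads.
assembly_a = [100, 80, 50, 30, 20]
assembly_b = [60, 60, 50, 50, 30, 30]
print("N50 A:", n50(assembly_a))
print("N50 B:", n50(assembly_b))
```

In this toy graph the node `ACG` has two outgoing edges. Branches like this are what real assemblers must resolve, using error correction, read pairing, or multiple k-mer sizes as described in the entries above. A higher N50 suggests a more contiguous assembly, but it says nothing about correctness, so it is best read alongside other statistics.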
Annotation
Genomes can be annotated with Prokka or RAST.
Annotate Domains in a Genome – identifies protein domains from widely used domain libraries (COGs, TIGRfams, Pfam).
Annotate Assembly with Prokka – combines multiple open-source annotation tools in a quick and thorough annotation pipeline for prokaryotic genomes, plasmids, and metagenomes.
Annotate Microbial Assembly – uses components from the RAST (Rapid Annotations using Subsystems Technology) toolkit to annotate an assembled bacterial or archaeal genome.
Annotate Microbial Genome – uses RAST to annotate a prokaryotic genome, to update the annotations of a genome, or to perform computations on a set of genomes so that they are consistent.
Annotate Plant Coding Sequences with Metabolic Functions – performs functional annotation of plant cDNA or protein sequences.
Bulk Annotate Genomes/Assemblies – uses components from the RAST (Rapid Annotations using Subsystems Technology) toolkit to annotate a set of genomes or assemblies.
The output of the annotation apps is a Genome, which is displayed in a tabular genome viewer (see below) that shows information about the Genome as well as a list of contigs and the genes that were called on each contig.