Background
The de novo assembly of transcriptomes from short shotgun sequences
raises challenges due to random and non-random sequencing biases and
inherent transcript complexity. We sought to define a pipeline for de
novo transcriptome assembly to aid researchers working with
emerging model systems where well-annotated genome assemblies are not
available as a reference. To detail this experimental and computational
method, we used early embryos of the sea anemone, Nematostella
vectensis, an emerging model system for studies of animal body plan
evolution. We performed RNA-seq on embryos up to 24 h of development
using Illumina HiSeq technology and evaluated independent de novo
assembly methods. The resulting reads were assembled using either the
Trinity assembler on all quality controlled reads or both the Velvet and
Oases assemblers on reads passing a stringent digital normalization filter.
A control set of mRNA standards from the National Institute of Standards and
Technology (NIST) was included in our experimental pipeline to invest our
transcriptome with quantitative information on absolute transcript levels
and to provide additional quality control.

Results
We generated >200 million paired-end reads from directional cDNA libraries
representing well over 20 Gb of sequence. The Trinity assembler pipeline,
including preliminary quality control steps, resulted in more than 86% of
reads aligning with the reference transcriptome thus generated.
Nevertheless, digital normalization combined with assembly by Velvet and
Oases required far less computing power and decreased processing time while
still mapping 82% of reads. We have made the raw sequencing reads and
assembled transcriptome publicly available.

Conclusions
Nematostella vectensis was chosen for its strategic position in the
tree of life for studies into the origins of the animal body plan; however,
the challenge of reference-free transcriptome assembly is relevant to all
systems for which well-annotated gene models and an independently verified
genome assembly may not be available. To navigate this new territory, we
have constructed a pipeline for library preparation and computational
analysis for de novo transcriptome assembly. The gene models
derived from this reference transcriptome define the set of genes transcribed
in early Nematostella development and will provide a valuable
dataset for further gene regulatory network investigations.