As I mentioned in my prior post, Sanger capillary sequencing is not going away anytime soon. Yet next-generation sequencing has made a huge mark on the world – growing from nothing in 2005 to a USD $1 billion market in 2012. And various sources estimate the market will grow 20 to 25% every year for the next five years, roughly tripling in size from where we are now.
But the underlying principles are often confounded by a too-simplistic approach – assuming that the data quality, read length, workflow, and (most importantly) the massive scale of the data can be easily handled by, say, an existing sequencing core facility. Many who are well versed in the ‘first-generation sequencing’ world are in for something of a rude awakening.
The first-generation world has a mature technology on its hands – the basic approach of dideoxy chain-terminator sequencing has been the method of choice since the early 1980s; the only differences have been in the signal and method of detection (first 32P, then 35S radioactivity separated on polyacrylamide gels, progressing to fluorescently labeled ‘BigDye’ terminators separated on automated capillaries), with its attendant automation. The sequencing itself was basically a PCR with a cleanup step – take 10–50 ng of input DNA, a sequencing master mix, and a pair of specific primers, perform the PCR, clean up the reaction, and then separate the products. A priori, you know what you are sequencing (you have designed specific primers for your region of interest), and you get a single read of 700–1,000 bases of very high quality data, called the ‘gold standard’ for just that reason.
The beauty of next-generation sequencing is its massively parallel nature – an approach termed ‘shotgun’ for its inherently random, scattershot sampling – so that instead of a single read per six-hour Sanger run, there are hundreds of millions of reads (now reaching into the billions) per 5- to 11-day run. (There are cases where the process can be shorter – e.g. Ion Torrent – but at present those read numbers are in the low millions.) This massively parallel method has opened up worlds of research that were simply unachievable before, due to cost.
As a case in point, the first Solexa 1G in early 2007 had a throughput of 800 million bases (Mb for short) at a 35-base-pair read length, for a total of about 23 million reads per 5-day run. One of the first customers at the NIH, an investigator named Keji Zhao, had published prior work mapping histone modifications using Sanger CE sequencing, but it was very expensive for limited resolution. He was able to get his CE sequencing down to a very low price, but even at $2 per read a given experiment was severely limited in what it could do. (Doing the calculations at that time: roughly $55K on CE for one-tenth the number of reads, versus on the order of $6K on the Solexa 1G for ten times the reads – in other words, 100 times the read count at about one-tenth the price.) This 100-fold economy in read count, applied to a particular application – Chromatin ImmunoPrecipitation sequencing (ChIP-Seq for short) – on the Solexa platform made Keji into a minor celebrity on the NIH campus, as his Cell publication was produced in about three months and described as a ‘tour de force’.
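For the curious, the back-of-the-envelope comparison in that parenthetical can be sanity-checked in a few lines of Python. This is just an illustrative sketch using the ballpark figures quoted above (the read counts are normalized to the same hypothetical experiment size, so only the ratios matter):

```python
# Ballpark figures from the text: Sanger CE at ~$55K for one-tenth
# the reads of a reference experiment, vs. the Solexa 1G at ~$6K
# for ten times the reads of that same experiment.
ce_cost, ce_reads = 55_000, 0.1
solexa_cost, solexa_reads = 6_000, 10

read_fold = solexa_reads / ce_reads    # 100x the read count
cost_fold = ce_cost / solexa_cost      # ~9x cheaper per experiment
per_read_fold = read_fold * cost_fold  # combined per-read advantage

print(f"{read_fold:.0f}x the reads at ~1/{cost_fold:.0f} the cost")
print(f"=> roughly {per_read_fold:.0f}-fold cheaper per read")
```

Which is why, on a cost-per-read basis, the jump was closer to three orders of magnitude than the already-impressive 100-fold read-count advantage alone.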
Breaking it down, NGS has three components: library preparation, template preparation, and sequencing. Library preparation is the most variable (and creative) part of the process, as the input DNA can be random fragments from a genomic sample, immunoprecipitated fragments from a ChIP experiment, cDNA from a messenger RNA preparation, DNase I-digested fragments that assay for open chromatin within the genome, or several other types. Stay tuned – more on library preparation to follow.