Readlengths do matter in Next Generation Sequencing

[Image: ‘Length Matters’ notebook cover from AGBT 2009 (c) Dale Yuzuki]

Recently I was asked about how important readlengths are, in the context of where MiSeq and Ion Torrent PGM currently stand in the marketplace. As the 454 advertisement used to say until recently, ‘Length Matters’. Given a number of recent announcements from Ion Torrent and the other folks in San Diego, let’s assess where we are.

In January of this year, a complimentary upgrade taking the MiSeq from 2x150bp to 2x250bp reads was announced for ‘mid-2012’. As of mid-September 2012 the first MiSeq upgrades in the field were being publicly discussed, so they are starting to be implemented now (and I would welcome any comment or feedback). At 2×250 the runtimes are vague, with “~39 hours” stated on the MiSeq datasheet.

The Ion Torrent PGM launched a 300bp kit in mid-September, with longer runtimes (on the order of 5 to 6 hours).

Cost is a main driver for readlength

A major driver for longer readlengths, from both the vendor’s and the customer’s point of view, is the cost per base. Of the funds that go into a particular run, some go to library preparation, some to template prep, and the rest to the sequencing itself. If additional sequencing reagents can increase the overall throughput by 50% or 100%, there is an accompanying drop in per-Mb cost. It is this mechanism of longer reads (and of course higher densities) that has been driving the exponential decrease in costs and increase in throughput, and the trend continues. In the context of MiSeq and Ion Torrent PGM, the timing here is simultaneous.
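To make the per-Mb arithmetic concrete, here is a minimal sketch. Every dollar figure and yield in it is a hypothetical placeholder chosen for illustration, not vendor pricing; the only point is that prep costs stay fixed while a longer-read kit spreads them over more bases.

```python
def cost_per_mb(prep_cost, reagent_cost, yield_mb):
    """Per-Mb cost of one run: (prep + sequencing reagents) / yield in Mb."""
    return (prep_cost + reagent_cost) / yield_mb

# Hypothetical figures for illustration only (not vendor pricing).
PREP_COST = 300.0       # library prep + template prep; unchanged by readlength
BASE_REAGENTS = 500.0   # sequencing reagents for the shorter-read run
BASE_YIELD_MB = 1000.0  # yield of the shorter-read run

baseline = cost_per_mb(PREP_COST, BASE_REAGENTS, BASE_YIELD_MB)

# Assumed longer-read kit: reagents cost 25% more, throughput doubles.
longer = cost_per_mb(PREP_COST, BASE_REAGENTS * 1.25, BASE_YIELD_MB * 2.0)

print(f"baseline:     ${baseline:.4f} per Mb")
print(f"longer reads: ${longer:.4f} per Mb")
```

Under these assumed numbers the per-Mb cost falls from $0.80 to about $0.46, even though the run itself costs more.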

One way the MiSeq would compete on the surface is with a 7Gb, 2×250 run, enabling a single exome per run. (The amount of data needed for a ‘typical’ exome enrichment experiment is on the order of 5 or 6 Gb.) However, it won’t give much of a bump in overall coverage of the exome target. The problem here is the target insert length.

MiSeq and human exomes

Depending on the reference paper you look at, the average size of an exon is from the mid-90 base-pairs to about 150 base-pairs (I’ve seen numbers all over this range). For TruSeq ILM libraries, the target length is 365bp, but 130bp of that is adapters (according to their data sheet, figure 3). So the insert (the targeted exome region) of any given library molecule is only 235bp long.

So what does 2×250 get you when sequencing these 235bp inserts? Perhaps greater accuracy toward the middle (quality is known to dip toward the end of a read), but not greater ‘reach’ into the targeted exons. You are simply sequencing more of the same territory, which is that 235bp insert, so the additional 2×100 ‘worth’ of sequencing (on top of the first 2×150) is going over known, previously sequenced bases. And of course there are the 65bp adapters on each end, which you will be sequencing on both the + and – strands ad nauseam.
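The overlap argument can be checked with a few lines of arithmetic. This is a rough sketch under a simplifying assumption that is mine, not the data sheet’s: each read of the pair starts at the first base of the insert on its own end and runs straight through into the far adapter once the insert is exhausted. The function name and its return layout are made up for illustration.

```python
def paired_end_coverage(insert_len, read_len, adapter_len=65):
    """For one paired-end fragment, return a tuple of
    (distinct insert bases covered, insert bases read by both mates,
     adapter bases read across both mates).

    Assumes each mate starts at the first insert base on its end.
    """
    per_read_insert = min(read_len, insert_len)    # insert bases seen by one mate
    unique = min(insert_len, 2 * per_read_insert)  # distinct insert bases seen by the pair
    overlap = 2 * per_read_insert - unique         # insert bases seen twice
    read_through = max(0, read_len - insert_len)   # bases past the end of the insert
    adapter = 2 * min(read_through, adapter_len)   # adapter bases read by both mates
    return unique, overlap, adapter

INSERT = 235  # 365bp library molecule minus 130bp of adapters, per the text

for read_len in (150, 250):
    unique, overlap, adapter = paired_end_coverage(INSERT, read_len)
    print(f"2x{read_len}: {unique} distinct insert bases, "
          f"{overlap} read twice, {adapter} adapter bases")
```

Under that assumption 2×150 already touches all 235 insert bases (65 of them twice), while 2×250 touches the same 235 bases, every one of them twice, plus 15 bases of adapter per mate.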

One can then try to re-engineer the size selection up front, but then you would have to muck around with optimizing the cluster generation, which on the MiSeq is a black box. (Okay, literally a black-and-white box, but the point is that cluster generation was a lot more modifiable and flexible in the old ‘Cluster Station’ days before the cBot and now the MiSeq, with its monolithic reagent packaging.) If you use a 500bp insert, the overall run is going to suffer. But even before those considerations, there is the targeted selection itself. Going into the details of how TruSeq does its selection, much more re-engineering has to take place before the size-selection step, so ILM has its work cut out for it to get a larger insert that takes advantage of 2×250 reads (i.e. doubling the target from 235bp to something like 470bp). How easy this would be for them to do will determine how quickly they can change the TruSeq enrichment and library construction processes, so time will tell.

Ion Torrent and human exomes

As for the Ion Torrent PGM, the NCI has run exomes on the PGM (4x 318 chips) and has shown excellent results (their video from Ion World is here). At last year’s ASHG at a LifeTech workshop, Tim Harkins presented whole-exome work on the PGM using then-new 318 chip runs and then-new 200bp kits, and demonstrated better coverage (and of course mapping) compared to the same Gb of raw data produced by 100bp runs. Interestingly, it was the longer readlengths ‘reaching into’ previously unsequenced regions (or regions captured and sequenced poorly) that would explain the better results (if I remember correctly it was on the order of 5-7% better target coverage at >20x across multiple samples and multiple runs, going from the mid-80’s to the low 90’s percentile with the same targeted enrichment).

Recent results on the Ion Proton (in Donna Muzny’s talk – she’s from Baylor College of Medicine) at Ion World were presented here; see the 18′ mark to hear about first exomes on PGM and then on Proton. She indicated a 92% exome target called at >20x with 6Gb of total coverage.

Two markets for longer reads

There are two medium-sized (but growing) markets where 300-400 base pair readlengths have a huge impact. The first is the HLA / transplantation market (the ASHI meeting is in Puerto Rico starting next week), where Life Technologies’ SeCore product (CE-based) has a very good market (and market share, in addition to other platforms for this market) and will eventually move over onto Ion. (454 made some mis-steps in entering this market, including not automating their template preparation.) The HLA region is remarkable in its function and for its complexity, and Life Technologies has done very well in addressing this market’s particular needs. I believe that we have a very strong hand to play, in a market where ILM won’t be able to compete, lacking the technology, the people (by this I mean a dedicated sales force), and the experience in it.

The other market is metagenomics. The HMP (Human Microbiome Project) published their results a few months ago and made a big splash (even making the cover of The Economist in a lead story), and 300-400 bp is critical for that market in order to take a comprehensive look at the different variable regions of 16S (V2, V3, V5, V7). Both ILM and Ion Torrent platforms will sell into this space with 2×250 and 1×400 reads respectively.

Number of reads vs. longer ones

At Ion World Tim Triche made public his request for 1000bp readlengths for Ion Torrent. He is looking at a very long non-coding RNA species in Ewing’s sarcoma (his talk is very interesting), and his case is bolstered by his work in looking at alternative splicing events. Catching novel splice variants was impossible on expression microarrays; via RNA-Seq with adequate depth, thousands were discovered in a single publication. (Reference is Sultan et al. Science 2008 here.) Longer reads make the discovery so much easier and more efficient.

If you are interested, here’s a paper from the Virginia Biotech Institute doing an MAQC-type study on 454 reads, and it is remarkable what can be done with only 3.6M reads, albeit these are 250bp in length. (Compare this to a currently-unavailable ILM whitepaper on using 1.5M short reads, and there are some striking differences. If you’d like to see this technical note – which I now can’t find anywhere on the web, after some 15 minutes of searching – contact me directly through the web form.)

A future trend

If you look at it historically, 454’s first publication had an 89bp average readlength in the latter half of 2005, going to a 2011 launch of the FLX+ (formerly named Titanium XLR) with a distribution mode readlength of 700bp. So that is nearly an eight-fold increase in six years. PGM started at 100bp in early 2010, 200bp in the fall of 2011, and 300bp in the fall of 2012, with 400bp in early access by the end of 2012 with an expected commercial launch in early 2013. So in a timeframe of three years, a four-fold increase, indicating an increased rate of development, undoubtedly due to the hard-won experience of development. (Many of the key 454/Curagen developers joined Ion Torrent in its earliest phases.)
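For anyone who wants to check the rate comparison, a small sketch follows. It treats readlength growth as a smooth compound rate, which is my simplification (real kit launches come in steps), and the helper name is made up for illustration. The inputs are the readlengths and time spans quoted above.

```python
def fold_and_annual_growth(start_bp, end_bp, years):
    """Return (overall fold increase, equivalent compound growth factor per year)."""
    fold = end_bp / start_bp
    annual = fold ** (1.0 / years)
    return fold, annual

# Figures quoted in the text.
fold_454, annual_454 = fold_and_annual_growth(89, 700, 6)   # 454: 2005 to 2011
fold_pgm, annual_pgm = fold_and_annual_growth(100, 400, 3)  # PGM: three-year span

print(f"454: {fold_454:.1f}-fold overall, x{annual_454:.2f} per year")
print(f"PGM: {fold_pgm:.1f}-fold overall, x{annual_pgm:.2f} per year")
```

The 454 ratio works out to a bit under eight-fold, or roughly 1.4x per year, against roughly 1.6x per year for the PGM, which is consistent with the faster pace described above.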

Perhaps in another three years, another four hundred bases out to 800? Time will tell – check back in 2016.


5 thoughts on “Readlengths do matter in Next Generation Sequencing”

Brian Taylor
October 8, 2012 at 8:19 am
Very interesting post, Dale. How do you see longer read lengths affecting the economics of genomic DNA sequencing? I’d particularly like to know your opinion on how longer reads will influence the downstream costs. Since we’ve reached the point where data acquisition costs less than analysis, will longer read lengths improve assembly and reduce false positives enough to significantly bring those costs down?
Dale Yuzuki
October 9, 2012 at 6:27 am
Hi Brian,
The improvement in readlength won’t be reducing the downstream overall analysis cost. The compute cycles still need to be run, and that process has to be paid for by someone, basically independent of the quality of the sequence. (If you had the same number and length of CE-quality reads you would still need to align all that data.)
And when talking about interpretation of a given variant, the expertise and time involved is substantial. Elaine Mardis wrote about this in a piece in Genome Medicine entitled “The $1,000 genome, the $100,000 analysis?” (http://genomemedicine.com/content/2/11/84). She makes the case for a clinical grade analysis pipeline, which will take many years and a focused effort to develop.
MikeF
October 10, 2012 at 8:47 am
Hi Dale,
Where do you see the longer reads from the PacBio and Oxford Nanopore systems fitting in ? Surely if we can get back to 1kb+ reads again, then the analysis heads back towards those we used in the mists of time past …… welcome back the Staden package 🙂 .
- Dale Yuzuki
  October 10, 2012 at 11:09 pm
  Hi MikeF, the primary difficulty with PacBio reads is their low accuracy – their C2 chemistry has been reported to be only a few percentage points higher, but still on the order of 85%-87% accurate on a per-base basis.
  On top of that is a throughput difficulty, with 90Mb yields per run and only about 30,000 reads per run (presuming an average readlength of 3000bp). So for a reasonable number of RNA-Seq reads as in the original post, obtaining even ‘only’ 300,000 reads means ten runs, and that is only for one sample, which represents a major hurdle.
  At last year’s AGBT there were a large number of PacBio posters (and a fair number of talks) indicating that centers that purchased a system are making good progress in understanding how to use them and make use of the data; however due to these factors (in addition to the instrument cost) PacBio is relegated to a niche market. (I’ve made reference to this before in this post: https://yuzuki.org/some-thoughts-on-pacific-biosciences-single-molecule-sequencing/ )
  As far as Oxford Nano goes, there has been no word on their progress on the commercial front, nor the early-access / beta-testing side, and if they intend to ‘launch before the end of 2012’ they simply won’t be able to. (It takes at minimum 3 months if not much longer to field-test and then incorporate changes before a full commercial launch, including changing software, documentation, manufacturing, training service people etc.) And other than what was said last February, here we are in October without any word from them. (Although ASHG is coming up in early November, where they may opt to share their progress. A quick search just now on all ASHG abstracts for the upcoming meeting came up empty.)
  Best –
  Dale
Leonardo Varuzza
October 16, 2012 at 7:58 am
Hi Dale,
Another important application for longer read lengths is de novo assembly.