Readlengths do matter in Next Generation Sequencing

Notebook cover from AGBT 2009 (c) Dale Yuzuki

Recently I was asked about how important readlengths are, in the context of where MiSeq and Ion Torrent PGM currently stand in the marketplace. As the 454 advertisement used to say until recently, ‘Length Matters’. Given a number of recent announcements from Ion Torrent and the other folks in San Diego, let’s assess where we are.

Earlier this year, in January, a complimentary upgrade was announced that would take the MiSeq from 2×150bp to 2×250bp reads in ‘mid-2012’. As of mid-September 2012 the first MiSeq upgrades in the field were being publicly discussed, so they are starting to be implemented now (and I would welcome any comment or feedback). At 2×250 the runtimes are vague, with “~39 hours” stated on the MiSeq datasheet.

The Ion Torrent PGM launched a 300bp kit in mid-September, with longer runtimes (on the order of 5 to 6 hours).

Cost is a main driver for readlength

A major driver for longer readlengths, from both the vendor’s and the customer’s point of view, is the cost per base. Of the funds that go into a particular run’s costs, some go to library preparation, some to template prep, and the rest to the sequencing itself. If additional sequencing reagents can increase the overall throughput by 50% or 100%, there is an accompanying drop in per-Mb cost. It is this mechanism of longer reads (and of course higher densities) that has been driving the exponential decrease in costs and increase in throughput, and the trend continues. For the MiSeq and the Ion Torrent PGM, the timing of these readlength increases is essentially simultaneous.
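To make the per-Mb argument concrete, here is a minimal sketch in Python. The function name `cost_per_mb` and every number in it are hypothetical, chosen only to show the shape of the effect; they are not vendor prices or real run outputs.

```python
def cost_per_mb(prep_cost, sequencing_cost, output_mb):
    """Total run cost (library/template prep plus sequencing) per Mb of output."""
    if output_mb <= 0:
        raise ValueError("output_mb must be positive")
    return (prep_cost + sequencing_cost) / output_mb


# Hypothetical figures (assumptions, not real prices): prep costs stay fixed,
# the longer-read kit costs somewhat more in sequencing reagents,
# and the run output doubles.
baseline = cost_per_mb(prep_cost=300.0, sequencing_cost=500.0, output_mb=1000.0)
longer_reads = cost_per_mb(prep_cost=300.0, sequencing_cost=650.0, output_mb=2000.0)

print(f"baseline:     {baseline:.3f} per Mb")
print(f"longer reads: {longer_reads:.3f} per Mb")
```

Because the prep costs are spread over more bases, the per-Mb figure drops even when the sequencing reagents themselves cost more.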

One way the MiSeq would compete, on the surface, is with a 7Gb, 2×250 run, enabling a single exome per run. (The amount of data needed for a ‘typical’ exome enrichment experiment is on the order of 5 or 6 Gb.) However, it won’t give much of a bump in overall coverage of the exome target. The problem here is the target insert length.

MiSeq and human exomes

Depending on the reference paper you look at, the average size of an exon is from the mid-90 base-pairs to about 150 base-pairs (I’ve seen numbers all over this range). For TruSeq ILM libraries, the target length is 365bp, but 130bp of that is adapters (according to their data sheet, figure 3). So the insert (the targeted exome region) of any given library molecule is only 235bp long.

So what does 2×250 get you when sequencing these 235bp inserts? Perhaps greater accuracy toward the middle (quality is known to dip toward the end of the read), but not greater ‘reach’ into the targeted exons. You are simply sequencing more of the same territory, which is that 235bp insert, so the additional 2×100 ‘worth’ of sequencing (on top of the first 2×150) goes over known, previously sequenced bases. And of course there are the 65bp adapters on each end, whose + and – strands you will be sequencing ad nauseam.
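The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model of my own, not anything from a vendor: it assumes each read starts at one end of the insert, runs inward, and reads through into the far adapter once it runs out of insert. The function name `paired_end_breakdown` is hypothetical; the 235bp insert and 65bp adapter figures are the ones quoted above.

```python
def paired_end_breakdown(insert_bp, read_bp, adapter_bp=65):
    """Split the bases of one paired-end fragment into three buckets.

    Simplified model (an assumption): each read starts at one end of the
    insert and reads inward; past the insert it continues into the adapter.
    """
    if insert_bp <= 0 or read_bp <= 0:
        raise ValueError("insert_bp and read_bp must be positive")
    in_insert = min(read_bp, insert_bp)        # insert bases seen by each read
    unique = min(insert_bp, 2 * in_insert)     # distinct insert positions covered
    resequenced = 2 * in_insert - unique       # insert bases read a second time
    adapter = 2 * min(read_bp - in_insert, adapter_bp)  # read-through into adapter
    return {"unique": unique, "resequenced": resequenced, "adapter": adapter}


for insert_bp, read_bp in [(235, 150), (235, 250), (470, 250)]:
    print(insert_bp, read_bp, paired_end_breakdown(insert_bp, read_bp))
```

Under these assumptions, 2×150 already covers every position of a 235bp insert, and going to 2×250 adds no new insert bases, only resequenced ones plus some adapter read-through. Only with a larger insert (470bp in the last case) do the longer reads reach new territory.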

One can then try to re-engineer the size selection up front, but then you would have to muck around with optimizing the cluster generation, which on the MiSeq is a black box. (Okay, literally a black-and-white box, but the point is that cluster generation was a lot more modifiable and flexible in the old ‘Cluster Station’ days before the cBot and now the MiSeq, with its monolithic reagent packaging.) If you use a 500bp insert, the overall run is going to suffer. But even before those considerations there is the targeted selection itself. Given the details of how TruSeq does its selection, much more re-engineering has to take place before the size-selection step, and ILM has its work cut out for it to get a larger insert that takes advantage of 2×250 reads (i.e. doubling the target from 235bp to something like 470bp). How easy this is for them will determine how quickly they can change the TruSeq enrichment and library construction processes, so time will tell.

Ion Torrent and human exomes

As for the Ion Torrent PGM, the NCI has run exomes on the PGM (4x 318 chips) and has shown excellent results (their video from Ion World is here). At last year’s ASHG, at a LifeTech workshop, Tim Harkins presented whole-exome work on the PGM using then-new 318 chip runs and then-new 200bp kits, and demonstrated better coverage (and of course mapping) compared to the same Gb of raw data produced by 100bp runs. Interestingly, it was the longer readlengths ‘reaching into’ previously unsequenced regions (or regions captured and sequenced poorly) that would explain the better results (if I remember correctly it was on the order of 5-7% better target coverage at >20x across multiple samples and multiple runs, going from the mid-80’s to the low 90’s percentile with the same targeted enrichment).

Recent results on the Ion Proton (in Donna Muzny’s talk – she’s from Baylor College of Medicine) at Ion World were presented here; see the 18′ mark to hear about first exomes on PGM and then on Proton. She indicated a 92% exome target called at >20x with 6Gb of total coverage.

Two markets for longer reads

There are two medium-sized (but growing) markets where 300-400 base pair readlengths have a huge impact. The first is the HLA / transplantation market (the ASHI meeting is in Puerto Rico starting next week), where Life Technologies’ SeCore product (CE-based) has a very good market (and market share, in addition to other platforms for this market), and will eventually move over onto Ion. (454 made some mis-steps in entering this market, including not automating their template preparation.) The HLA region is remarkable in its function and for its complexity, and Life Technologies has done very well in addressing this market’s particular needs. I believe that we have a very strong hand to play, in a market where ILM won’t be able to compete, having neither the technology, the people (by this I mean a dedicated sales force), nor the experience in it.

The other market is metagenomics. The HMP (Human Microbiome Project) published their results a few months ago and made a big splash (even making the cover of The Economist in a lead story), and 300-400 bp is critical for that market in order to take a comprehensive look at the different variable regions of 16S (V2, V3, V5, V7). Both ILM and Ion Torrent platforms will sell into this space with 2×250 and 1×400 reads respectively.

Number of reads vs. longer ones

At Ion World Tim Triche made public his request for 1000bp readlengths for Ion Torrent. He is looking at a very long non-coding RNA species in Ewing’s sarcoma (his talk is very interesting), and his case is bolstered by his work looking at alternative splicing events. Catching novel splice variants was impossible on expression microarrays; via RNA-Seq with adequate depth, thousands were discovered in a single publication. (Reference is Sultan et al. Science 2008 here.) Longer reads make the discovery so much easier and more efficient.

If you are interested, here’s a paper from the Virginia Biotech Institute doing an MAQC-type study on 454 reads, and it is remarkable what can be done with only 3.6M reads, albeit these are 250bp in length. (Compare this to a currently-unavailable ILM whitepaper on using 1.5M short reads, and there are some striking differences. If you’d like to see this technical note – which I now can’t find anywhere on the web, after some 15 minutes of searching – contact me directly through the web form.)

A future trend

If you look at it historically, 454’s first publication had an 89bp average readlength in the latter half of 2005, and by the 2011 launch of the FLX+ (formerly named Titanium XLR) the distribution mode readlength had reached 700bp. So that is nearly an eight-fold increase in six years. PGM started at 100bp in early 2010, 200bp in the fall of 2011, and 300bp in the fall of 2012, with 400bp in early access by the end of 2012 and an expected commercial launch in early 2013. So in a timeframe of three years, a four-fold increase, indicating an increased rate of development, undoubtedly due to hard-won development experience. (Many of the key 454/Curagen developers joined Ion Torrent in its earliest phases.)
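The two growth rates can be compared with a quick calculation. This is only a sketch: the readlengths and time spans are the ones quoted above, while the helper name `fold_and_annual_growth` and the assumption of smooth compounding year over year are mine.

```python
def fold_and_annual_growth(start_bp, end_bp, years):
    """Return (overall fold increase, implied fold increase per year).

    Assumes smooth compounding over the period, which is a simplification:
    real readlength gains arrive in discrete kit launches.
    """
    if start_bp <= 0 or years <= 0:
        raise ValueError("start_bp and years must be positive")
    fold = end_bp / start_bp
    return fold, fold ** (1.0 / years)


roche_454 = fold_and_annual_growth(start_bp=89, end_bp=700, years=6)
ion_pgm = fold_and_annual_growth(start_bp=100, end_bp=400, years=3)

print(f"454: {roche_454[0]:.2f}-fold overall, {roche_454[1]:.2f}x per year")
print(f"PGM: {ion_pgm[0]:.2f}-fold overall, {ion_pgm[1]:.2f}x per year")
```

Under that assumption, the PGM’s implied yearly growth factor comes out higher than 454’s, even though its overall fold increase is smaller.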

Perhaps in another three years, another four hundred bases out to 800? Time will tell – check back in 2016.

About Dale Yuzuki

A sales and marketing professional in the life sciences research-tools area, Dale is currently employed by Pillar Biosciences as a Global Marketing Manager. He represents Pillar across the East Coast, engages key customers for feedback toward further product improvement and development, and is responsible for sales activities across the region. He also represents Pillar at tradeshows, writes on a blog for them, helps guide social media strategy and tactics, and keeps track of what is going on in the marketplace. For additional biographical information, please see my LinkedIn profile, and also find me on Twitter @DaleYuzuki.