First Customer PhiX Data from the NextSeq 500


Phi X 174 image courtesy of Wikipedia user Fdardel

Illumina announced the NextSeq 500 in January at the JP Morgan Healthcare conference, and it was trumpeted (at least at the level of press releases and public relations) as being available immediately. Knowing first-hand how difficult it is to launch a new system, I had expected the first systems to ship by the end of the first quarter (March), but here we are in mid-June and the first data from the NextSeq 500 system is only now being reported.

At an Illumina workshop at the European Society of Human Genetics meeting two weeks ago it was shared that there were ~20 NextSeq 500 installs (alas I can’t find that particular tweet), and sure enough some of the first NextSeq data is now being made available. Here is the first dataset uploaded to the Sequence Read Archive, and here is an analysis by James Hadfield of Cancer Research UK of his own PhiX run, which he entitled “Why no PhiX on BaseSpace: %Q30 vs error rate, should you choose between them”.

First some commentary about James’ results: he raises a valid point that the error rate is only one of several important parameters, alongside the assigned Q score and % passing filter, yet it is the error rate that determines how much trimming of the reads will be done in the analysis, which of course affects the overall yield. He also rightly points out that the ability to call somatic mutations is severely hampered as the error rate rises (he cites the example of being able to call a 1% minor allele frequency (MAF) when the error rate is a very low 0.1%, presuming a 10:1 signal-to-noise ratio).
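That 10:1 rule of thumb is simple arithmetic. A minimal Python sketch follows; the function name and the default ratio are illustrative choices of mine, taken only from the example James cites, and real variant callers use more elaborate models.

```python
def maf_floor(error_rate, signal_to_noise=10.0):
    """Lowest allele frequency callable under a fixed signal-to-noise rule.

    Assumption: a variant is callable only when its frequency is at least
    `signal_to_noise` times the per-base error rate.
    """
    return error_rate * signal_to_noise


# Error rates here are fractions: 0.001 means 0.1%, 0.01 means 1%.
for error_rate in (0.001, 0.01):
    print(f"error {error_rate:.1%} -> MAF floor {maf_floor(error_rate):.1%}")
```

The point of the sketch is only that the floor scales linearly with the error rate: a tenfold rise in error costs a tenfold loss in sensitivity.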

For those in the cancer field, somatic mutation calling and MAF sensitivity are important; a given FFPE sample may have a low percentage of tumor in a background of normal tissue, and for targeted sequencing with a 1% sequencing error rate, a 5 to 10% minor allele frequency can be detected. A lower MAF cannot be detected at that 1% error rate, regardless of sequencing depth, which is a matter of distribution statistics. (Way back when, oh perhaps 2010 or so, this accuracy calculator was put together to show the effect of system accuracy on the coverage needed to detect a given allele frequency in a heterogeneous sample: move the slider from 20% to 5%, and note how a 5% allele can be detected at 450x coverage if the system is 99% accurate. Do take note that at 5% MAF, no level of additional coverage will suffice if the accuracy is only 98%.)
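To make the "distribution statistics" point concrete, here is a minimal Python sketch. It is not the calculator mentioned above and will not reproduce its numbers. It assumes plain binomial sampling of reads and a hypothetical calling rule in which a variant is reported only when the observed alternate-allele fraction reaches a fixed floor set by the system's noise. The names `detection_power` and `call_floor`, and the 3% and 6% floors, are mine and purely illustrative.

```python
import math


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so large depths do not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def detection_power(depth, true_maf, call_floor):
    """Chance that a variant at `true_maf` clears the calling floor at `depth`.

    Assumption: the caller needs at least call_floor * depth alternate reads.
    """
    min_alt_reads = math.ceil(round(call_floor * depth, 9))
    return binom_tail(depth, min_alt_reads, true_maf)


# A 5% allele against a floor below it (3%) and a floor above it (6%).
for call_floor in (0.03, 0.06):
    for depth in (100, 500, 2000):
        power = detection_power(depth, 0.05, call_floor)
        print(f"floor {call_floor:.0%}, depth {depth}x: power {power:.3f}")
```

Under these assumptions, extra depth helps only when the allele frequency sits above the floor. When it sits below, more reads make the observed fraction settle ever more tightly around a value the caller will reject, so the chance of detection falls rather than rises.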

What about the Q score? Shouldn’t that be sufficient without a ‘real-world’ DNA control like PhiX to compare ‘truth’ against?

You need both. Q scores give a ‘top-down’ view of the overall quality of the run (such as the % of bases above a given Q score), but it is important to remember that they are vendor-supplied metrics derived from details such as signal to noise. (The popular Broad package GATK recalibrates the manufacturer-provided Q scores as part of its workflow.) Q scores only go so far in giving a clear picture of how far back to trim bases, and in the case of this early NextSeq 500 data, it looks like a 2×100 bp format will need to suffice for the time being.
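For reference, a Q score is a Phred-scaled error probability, Q = -10·log10(P), so Q30 is a predicted error of 1 in 1,000. The short Python sketch below converts in both directions, which is what lets you put an error rate measured against PhiX on the same scale as the instrument's predicted Q scores. The function names are illustrative, and the 1% error rate is just an example value, not a figure from James' run.

```python
import math


def phred_to_error(q):
    """Predicted per-base error probability for a Phred quality score."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(error_rate):
    """Phred-scaled quality equivalent to an observed error rate (> 0)."""
    return -10.0 * math.log10(error_rate)


# What the instrument predicts for a Q30 base.
print(f"Q30 predicted error: {phred_to_error(30):.4%}")

# What a hypothetical 1% error rate measured against PhiX would mean.
print(f"1% observed error is equivalent to Q{error_to_phred(0.01):.0f}")
```

A gap between the predicted and the empirically derived Q is exactly the kind of discrepancy that a spiked-in control exposes and that a vendor-supplied metric alone cannot.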

Of course the error model will improve with subsequent iterations of the software, and the nuances of calling a ‘G’ base as an absence of signal and an ‘A’ base as a combination of signals will be further elucidated. James is happy with his NextSeq system. (After all, the system performs well above specification for PhiX at >400M reads and >75% Q30 at 2×150 bp.) But his overall point is a valid one: why don’t the Illumina-supplied datasets have internal PhiX controls so he could do a comparison?

One last point: it is an open question how much coverage is currently needed to run exomes on a NextSeq if the bases from 100 to 150 have a 1 to 4% error rate. Given that most targeted sequencing is done on heterogeneous FFPE samples, while most exome sequencing is germline (for inherited disorders), you could overcome the accuracy deficit in bases 100-150 by increasing the depth. But given that the average TruSeq Exome insert is 235 bp (Figure 3 from this datasheet PDF), how much additional coverage the 35 bases in the middle need (235 bp minus the first 100 bases read from each end) will have to await ‘real world’ runs.
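The 35-base figure can be checked with a few lines of Python. This is a sketch under simplifying assumptions of my own: every insert is exactly the 235 bp average, read 1 starts at one end and read 2 at the other, and cycles 1 to 100 count as "good" while later cycles are the higher-error ones. The function and parameter names are illustrative.

```python
def late_cycle_only_positions(insert_len, read_len, good_cycles):
    """Insert positions covered only by cycles beyond `good_cycles`.

    Positions are 0-based from the read 1 end of the insert.
    """
    positions = []
    for pos in range(insert_len):
        cycle_r1 = pos + 1            # cycle at which read 1 reaches this base
        cycle_r2 = insert_len - pos   # cycle at which read 2 reaches this base
        covered_at_all = cycle_r1 <= read_len or cycle_r2 <= read_len
        covered_by_good = cycle_r1 <= good_cycles or cycle_r2 <= good_cycles
        if covered_at_all and not covered_by_good:
            positions.append(pos)
    return positions


middle = late_cycle_only_positions(insert_len=235, read_len=150, good_cycles=100)
print(len(middle), "bases, positions", middle[0], "to", middle[-1])
```

With 2×150 bp reads those middle bases are covered twice, once by each read, but only by late cycles. Real libraries have a spread of insert sizes, so the affected fraction of an actual exome run will differ from this single-insert picture.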

 

 


About Dale Yuzuki

A sales and marketing professional in the life sciences research-tools area, Dale currently is employed by Olink as their Americas Field Marketing Director. https://olink.com For additional biographical information, please see my LinkedIn profile here: http://www.linkedin.com/in/daleyuzuki and also find me on Twitter @DaleYuzuki.

4 thoughts on “First Customer PhiX Data from the NextSeq 500”

    • Dale Yuzuki Post author

      That would seem to make logical sense Frank.

      Yet every system out on the market (with the exception of PacBio) has some level of systematic error and bias, especially with GC-rich regions. It remains to be seen exactly what type of errors the NextSeq 500 has, and you can expect that first dataset to be analysed to look for exactly that.

    • Mantis Toboggan

      I think it’s important to clarify that the ‘A’ base is NOT double-labeled (two fluors on one molecule). Rather, it is a MIXTURE of red- and green-labeled bases.

      Not sure why no one has bothered to point that out yet…