Illumina announced the NextSeq 500 in January at the JP Morgan Healthcare conference, and it was trumpeted (at least at the level of press releases and public relations) as being available immediately. Knowing first-hand how difficult it is to launch a new system, I had expected the first systems to ship by the end of the first quarter (March), but here we are in mid-June and the first data from the NextSeq 500 system is only now being reported.
At an Illumina workshop at the European Society of Human Genetics meeting two weeks ago it was shared that there were ~20 NextSeq 500 installs (alas, I can’t find that particular tweet), and sure enough some of the first NextSeq data is now being made available. Here is the first dataset uploaded to the Sequence Read Archive, and here is an analysis by James Hadfield of Cancer Research UK of his own PhiX run, which he entitled “Why no PhiX on BaseSpace: %Q30 vs error rate, should you choose between them”.
First, some commentary about James’ results: he raises a valid point that the error rate is one of several important parameters, alongside the assigned Q score and the % passing filter, yet it is the error rate that determines how much trimming of the reads will be done in the analysis, which of course affects the overall yield. He also rightly points out that the ability to call somatic mutations is severely hampered as the error rate rises (he cites the example of being able to call a 1% MAF when the error rate is a very low 0.1%, presuming a 10:1 signal-to-noise ratio).
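James’ 10:1 example boils down to one multiplication, and a rough sketch makes it easy to plug in other error rates. The function name and the default ratio of 10 are my own illustrative choices taken from his example; this is a rule of thumb, not a variant-calling model.

```python
def min_detectable_maf(error_rate, signal_to_noise=10.0):
    """Lowest minor allele frequency that still sits `signal_to_noise`
    times above the per-base error rate (rule-of-thumb assumption)."""
    return error_rate * signal_to_noise


if __name__ == "__main__":
    # 0.1% is the error rate from the example above; 1% is included for contrast.
    for err in (0.001, 0.01):
        maf = min_detectable_maf(err)
        print(f"error rate {err:.1%} -> lowest callable MAF about {maf:.1%}")
```

Relaxing the required ratio (say, to 5:1) lowers the floor proportionally, which is why the choice of ratio matters as much as the error rate itself.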
For those in the cancer field, somatic mutation calling and MAF sensitivity are important: a given FFPE sample may have a low percentage of tumor in a background of normal tissue, and for targeted sequencing with a 1% sequencing error rate, a 5 to 10% minor allele frequency can be detected. A lower MAF cannot be attained at that 1% error rate, regardless of sequencing depth, which is a matter of distribution statistics. (Way back when, oh perhaps 2010 or so, this accuracy calculator was put together to show the effect of system accuracy on the coverage needed to detect a given allele frequency in a heterogeneous sample: move the slider from 20% to 5%, and note how a 5% allele can be detected at 450x coverage if the system is 99% accurate. Do take note that at 5% MAF, no level of additional coverage will suffice if the accuracy is only 98%.)
What about the Q score? Shouldn’t that be sufficient without a ‘real-world’ DNA control like PhiX to compare ‘truth’ against?
You need both. Q scores assess the overall quality of the run from the ‘top down’ (such as the % of bases above a given Q score), but it is important to remember that they are vendor-supplied metrics built on details such as signal to noise and other quality measures. (The popular Broad package GATK adjusts / recalibrates the manufacturer-provided Q scores as part of the software.) Q scores only go so far in giving a clear picture of how far back to trim bases, and in the case of this early NextSeq 500 data, it looks like a 2×100 bp format will need to suffice for the time being.
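For reference, a Q score is just a predicted error probability on a log scale (Q = -10·log10(p)), so Q30 predicts one error in a thousand bases. Here is a small sketch of the conversion, plus a %Q30-style summary; the function names are mine, and the point of a PhiX control is precisely that the observed error rate may not match this prediction.

```python
import math


def phred_to_error_prob(q):
    """Predicted per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability."""
    return -10 * math.log10(p)


def fraction_at_or_above(scores, threshold=30):
    """Fraction of bases with a Q score at or above `threshold` (e.g. %Q30)."""
    return sum(1 for q in scores if q >= threshold) / len(scores)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q} predicts an error rate of {phred_to_error_prob(q):.4%}")
```

Comparing `phred_to_error_prob` for the reported Q scores against the mismatch rate measured on PhiX is the ‘both’ in the paragraph above: one number is the vendor’s prediction, the other is the observed truth.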
Of course the error model will improve with subsequent iterations of the software as the nuances of calling a ‘G’ base as an absence of signal and an ‘A’ base as a combination of signals are further elucidated, and James is happy with his NextSeq system. (After all, the system performs well above specification for PhiX at >400M reads and >75% Q30 at 2×150 bp.) But his overall point is a valid one: why don’t the Illumina-supplied datasets have internal PhiX controls so he could do a comparison?
One last point: it is an open question how much coverage is currently needed to run exomes on a NextSeq if bases 100 to 150 of each read carry a 1 to 4% error. Given that most are doing targeted sequencing on heterogeneous FFPE samples, and most are doing germline sequencing on exomes (for inherited disorders), you could overcome the accuracy deficit in bases 100-150 by increasing the depth. But given that the average TruSeq Exome insert is 235 bp (Figure 3 from this datasheet PDF), how much additional coverage those 35 bases in the middle need will have to await ‘real world’ runs.
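The 35 bases fall out of simple arithmetic on the insert: two 100-base reads from either end of a 235 bp fragment leave 35 bp untouched in the middle, and at 2×150 those same 35 bp are reached only by cycles 101 to 150 of each read. A back-of-the-envelope sketch, assuming an idealized insert of exactly 235 bp with no adapter read-through (the function names and the 100-cycle ‘good’ cutoff are my own assumptions):

```python
def paired_end_gap(insert_len, read_len):
    """Bases in the middle of the insert that neither read reaches."""
    return max(0, insert_len - 2 * read_len)


def late_cycle_only_positions(insert_len, read_len, good_cycles):
    """0-based insert positions covered only by cycles beyond `good_cycles`."""
    late_only = []
    for pos in range(insert_len):
        cycle_read1 = pos + 1            # read 1 starts at the left end
        cycle_read2 = insert_len - pos   # read 2 starts at the right end
        reaching = [c for c in (cycle_read1, cycle_read2) if c <= read_len]
        if reaching and min(reaching) > good_cycles:
            late_only.append(pos)
    return late_only


if __name__ == "__main__":
    print("Uncovered at 2x100:", paired_end_gap(235, 100), "bp")
    late = late_cycle_only_positions(235, 150, good_cycles=100)
    print("Covered only by cycles 101-150 at 2x150:", len(late), "bp")
```

Real libraries have a distribution of insert sizes around that average, so the affected stretch will be wider for longer fragments and disappear entirely for fragments of 200 bp or less.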