As mentioned previously, there are three main methods of sequencing. The first is the pyrophosphate detection approach; here is the second (and most popular) approach, which uses reversible terminators.
Given a set of amplified template molecules (remember there are millions to many hundreds of millions of these discrete ‘clusters’ of molecules) and a sequencing primer hybridized to one end of the adapter on each molecule, a mixture of modified deoxynucleotides is added. These nucleotides can extend the primer by only a single base, much as the terminators in the Sanger method do, and each nucleotide is labeled with a fluorescent molecule. Because there is a protecting group on the 3′ hydroxyl end, only one nucleotide is added per round, regardless of how many identical bases follow one another in the template. If there were a stretch of, say, four A bases in a row, only one A base would be added in each sequencing round. After imaging, the protecting group and the dye are chemically removed so that the next round can begin, which is what makes the terminators ‘reversible’.
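To make the one-base-per-round idea concrete, here is a minimal sketch in Python. It is a toy model, not anything the instrument runs: the function name `sequence_by_synthesis` and the convention that the template string is written in the order the polymerase meets it are assumptions made purely for illustration.

```python
# Toy model of reversible-terminator chemistry (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles):
    """Return the bases incorporated over the given number of cycles.

    `template` is assumed to be written in the order the polymerase
    encounters it. Exactly one base is incorporated per cycle, because
    the 3' protecting group blocks any further extension until it is
    removed after imaging.
    """
    read = []
    for position in range(min(cycles, len(template))):
        # One blocked, labeled nucleotide pairs with the template base.
        incorporated = COMPLEMENT[template[position]]
        # Imaging would happen here; then the block and dye are removed.
        read.append(incorporated)
    return "".join(read)


# A template with four T bases in a row needs four separate cycles
# to add the four A bases, one per cycle.
four_cycles = sequence_by_synthesis("TTTTGC", 4)
six_cycles = sequence_by_synthesis("TTTTGC", 6)
```

In this sketch a run of identical bases costs one cycle per base, so the read length is simply the number of cycles run.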
It is worth noting here that the protecting group is a rather large molecule, so a modified polymerase is needed to accommodate it, which in turn required some careful enzyme engineering.
After a single base is added, the entire flowcell is imaged using what is basically an inverted confocal microscope, with a color filter wheel that helps the photomultiplier tube / CCD camera setup discriminate between the spectral characteristics of the four dyes.
The resulting pictures look like a ‘starry sky’. Each one is ‘black and white’ because of the color filters used, but the four can be artificially combined to form a colorful image for illustration purposes. The images (there are thousands per flowcell) are .tif files that have to be individually registered (that is, each cluster has to be informatically identified and assigned to a matrix of coordinates) before the intensity of each base addition can be determined. The sequencing process of this image-based method is therefore computationally intensive. It is also time-consuming: the chemistry takes time for every single base of a 125-base read, and so does the imaging across the thousands of ‘tiles’ in a flowcell. Way back in the Genome Analyzer days (2007 is ancient history here), over a terabyte of images was saved to disk.
In those earlier days customers could go back to the terabytes of image data and use a newer version of the intensity-extraction algorithm (or even a publicly available one) to squeeze out higher-quality intensities and a better overall yield of bases. However, as the vendor-supplied Firecrest algorithm improved, there was less and less need for such tweaking of the informatics pipeline. The current state-of-the-art HiSeq 2000 simply deletes the images as the intensities are extracted, in part because they are too large for customers to store.
Downstream of the extracted intensities, the base-calling algorithm (called Bustard; in the early days Solexa had a penchant for naming their software after birds) calls individual bases by comparing the signal of each cluster across the four images generated per sequencing cycle.
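As a rough illustration of what comparing the four signals means, the sketch below calls each base by picking the brightest of the four channels for one cluster in each cycle. This is emphatically not the Bustard algorithm, whose internals are not described here; the function name `call_bases`, the (A, C, G, T) channel order, and the intensity values are all made up for the example.

```python
# Naive brightest-channel base caller (illustrative only, not Bustard).
BASES = "ACGT"  # assumed channel order for this example


def call_bases(intensities):
    """Call one base per cycle for a single cluster.

    `intensities` is a list with one entry per cycle; each entry holds
    four extracted intensities in the assumed order (A, C, G, T).
    A cycle with no positive signal is called as 'N'.
    """
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda channel: cycle[channel])
        if cycle[brightest] <= 0:
            read.append("N")  # nothing to call in this cycle
        else:
            read.append(BASES[brightest])
    return "".join(read)


# Hypothetical intensities for one cluster over three cycles.
example_read = call_bases([
    (900, 50, 30, 20),   # A channel is brightest
    (10, 20, 800, 40),   # G channel is brightest
    (0, 0, 0, 0),        # no signal at all
])
```

A real base caller has more to contend with than this, for example overlap between the dye spectra and clusters whose molecules fall out of step with each other, so treat the above as the core idea only.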
Next up is the method of sequencing-by-ligation, so stay tuned!