NHGRI’s Dr. Adam Phillippy presents a remarkable dataset – the telomere-to-telomere assembly of a complete human X chromosome
When the completion of the first survey of the entire human genome was announced on June 26, 2000, President Bill Clinton said the following:
We are here to celebrate the completion of the first survey of the entire human genome. Without a doubt, this is the most important, most wondrous map ever produced by humankind.
The moment we are here to witness was brought about through brilliant and painstaking work of scientists all over the world, including many men and women here today. It was not even 50 years ago that a young Englishman named Crick and a brash, even younger American named Watson first discovered the elegant structure of our genetic code. (Transcript of “Remarks on the Completion of the First Survey of the Entire Human Genome Project”)
As momentous and historic as that moment was, this achievement was something of a milestone marker on a long journey. The landmark papers were not published until Feb 15, 2001 in Nature and Feb 16, 2001 in Science. Continuously refined from that point on, the current Genome Reference Consortium human build 38 (or GRCh38 as it is known) incorporates gradual improvements in the quality of that sequence over the 19 years since the original announcement.
Kinds of technical improvements
The companies Affymetrix (now part of Thermo Fisher Scientific) and Illumina commercialized the first genome-wide SNP microarrays, starting with the first Affymetrix array of 1,494 SNPs using an assay called HuSNP. Looking at that original 1998 Science reference now, it is informative to note where some of those names are now. In 2004-2005 the density leapt upward, with Affymetrix releasing a 50,000-SNP chip and Illumina following closely after with its first Human-1 BeadChip with 180,000 SNPs. (Note: I was very involved with the development of that product in 2003-2005, only to start selling them after moving to the mid-Atlantic.)
The International HapMap Project was a key driver of genome map refinement, characterizing human variation. The original human genome reference was a collection, a mixture, of a small number of anonymous individuals, and the HapMap Project set out to characterize the variation of human populations by ethnicity.
From whole-genome genotyping of polymorphisms (microarrays with 1 million genotypes are now routine and inexpensive, used for direct-to-consumer genomics such as 23andMe), array content has grown nearly 1,000-fold from that first 1,494-SNP array, while the cost of a single genotype has plummeted. In 2005 the first 454 GS20 sequencers arrived on the market with a capacity of about 1M reads of over 100 basepairs (a 100 Mb throughput seems quaint today but was revolutionary then, as the highest daily throughput of an ABI 3730xl was about 1 Mb). Two years after that, the first Illumina Genome Analyzer instruments (re-branded Solexa 1G instruments, of which only a handful were installed in early 2007) changed the capacity again, with an output of short reads (originally 36 basepairs long) but 30 million of them.
These leaps in technical advances on the data generation side worked together with the refining of the maps and populating variation databases, opening up several varied lines of investigation at once: better understanding of the genetic basis of disease, the ability to look into the past history of human migration, among others.
The human reference genome is still incomplete
Dr. Adam Phillippy started his AGBT presentation with the facts: there are 102 gaps in the existing reference, and 368 unresolved issues. (An example issue is Human Genome Issue HG-2530, covering the GCNT7 gene, described as ‘poor quality (includes component misassembly)’. You can view it for yourself here.)
These gaps lead to errors in analysis, and the information contained in them remains unexplored. Working with Karen Miga at the University of California, Santa Cruz (a prominent center for bioinformatics due to its original role in the HGP, a role it retains to this day), Dr. Phillippy presented the following slide.
The centromere is a problem due to its highly repetitive nature: reads on the order of 1 megabase would be needed to sequence straight through that region, a span that existing short-read technologies simply cannot make. The open question he set out to answer was one of accuracy: could enough 100,000-basepair reads at roughly 90% accuracy be assembled across that megabase?
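A back-of-envelope way to think about the coverage side of that question (an illustration, not Dr. Phillippy’s actual analysis) is the classic Lander-Waterman model, which estimates how many randomly placed reads are needed before a region is tiled without gaps:

```python
import math

# Lander-Waterman-style estimate: expected number of uncovered gaps
# when n reads of length L are placed at random across a region of
# size G. Parameters below are illustrative: a 1 Mb centromeric
# region and 100 kb "ultralong" reads.
G = 1_000_000   # region size (1 Mb)
L = 100_000     # read length (100 kb)

def expected_gaps(n_reads, read_len, region):
    """Expected number of coverage gaps for n random reads."""
    coverage = n_reads * read_len / region
    return n_reads * math.exp(-coverage)

for n in (10, 30, 50, 100):
    print(f"{n:4d} reads ({n * L / G:.0f}x): "
          f"{expected_gaps(n, L, G):.3f} expected gaps")
```

At 1x coverage gaps are nearly certain, while by 10x they become vanishingly rare. The catch, of course, is that this model assumes reads can be placed unambiguously; in a near-identical repeat array the real difficulty is distinguishing copies, which is why read length and accuracy matter as much as raw coverage.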
Oxford Nanopore technology illustrated
He then illustrated the technical feat that Oxford Nanopore technology accomplishes. If a single nanopore were scaled up to the size of a human fist (8 cm high), the speed at which ONT measures DNA corresponds to 32 km of linear DNA passing through in 37 minutes. Doing some math: that is about 865 meters per minute, or more than 14 meters per second.
Now think about a 14-meter (nearly 50 ft) extension cord or rope going through your hand every second. That this works at this kind of speed is little short of miraculous.
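The arithmetic behind that analogy is quick to verify:

```python
# Verifying the scaled-up nanopore speed quoted in the talk:
# 32 km of (fist-scaled) DNA threaded through the pore in 37 minutes.
distance_m = 32_000   # 32 km, in metres
time_min = 37

m_per_min = distance_m / time_min
m_per_sec = m_per_min / 60

print(f"{m_per_min:.0f} m/min = {m_per_sec:.1f} m/s")
# → 865 m/min = 14.4 m/s
```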
A group effort
He pointed out two groups: first, The Long Read Club, which I wrote up here yesterday, featuring Matt Loose and Nick Loman.
The second group is the T2T Working Group, where T2T stands for ‘Telomere to Telomere’, an open, community-based effort ‘to generate the first complete assembly of the human genome’. Their website, which took a bit of work to find, is here.
Scaffolding, polishing and validation
Their design was to cover the whole genome (and then zero in on the X chromosome) with 30x Nanopore ‘ultralong’ reads for scaffolding, 60x PacBio and 50x 10X Genomics for polishing, and Bionano Genomics optical maps for structural validation.
He then illustrated the ‘Nanopore ultralong’ readlength distribution as a long-tailed one: although only a very small number of reads are ultralong, provided there were enough of them to span the region of interest, the large-gap problem could be solved.
Choosing CHM13 (a haploid cell line) to simplify the WGS (a little more about this concept here from my 2015 AGBT writeup), over the course of six months they generated 98 Gb of data: 8.9M reads with an average readlength of 76 kb.
Importantly, 44 Gb of that 98 Gb (45%) was over the desired 100 kb readlength, with a maximum read length of 1.03 Mb. According to the Long Read Club video, Matt Loose holds the existing record, a remarkable 2,272,580 basepairs. Naturally the single longest read may not be very useful, nor is it the most useful metric (the median readlength is a much better one), but it is nonetheless one measure of the distribution.
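These summary metrics are straightforward to compute from a list of read lengths. A small sketch (toy data, not the CHM13 run) that also computes N50, the length-weighted statistic routinely quoted for long-read runs, shows how a handful of ultralong reads can carry a disproportionate share of the bases:

```python
# Summary metrics for a set of read lengths (toy data): total yield,
# fraction of bases in reads >= 100 kb, and N50.
def n50(lengths):
    """Smallest length L such that reads >= L hold half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Ten ordinary reads plus two ultralong ones.
reads = [20_000] * 10 + [150_000] * 2

total = sum(reads)
ultralong_bases = sum(r for r in reads if r >= 100_000)
print(f"yield: {total:,} bases")
print(f">=100 kb reads: {ultralong_bases / total:.0%} of bases")
print(f"N50: {n50(reads):,}")
```

Here only 2 of 12 reads are ultralong, yet they hold 60% of the bases, which is the long-tailed behavior described above.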
For polishing, the higher-accuracy Pacific Biosciences reads make sense. (They showed some promising Continuous Long Read technology; Marty Badgett indicated an average readlength of 62.5 kb in their workshop presentation. However, the distribution of readlengths isn’t high enough for this application at present.) Using 10X Genomics for polishing also makes sense, bringing the high accuracy of Illumina short reads.
Verification of their assembly
The slide above summarizes a lot of data, indicating the manual work that was done and the gene regions that were corrected. It is amazing to think that the DXZ4 ‘macrosatellite’ has no fewer than 55 tandem copies spanning 165 kb, verified by digital PCR.
Of note is the region in the upper corner, where the Bionano Genomics optical map (I highlighted their new DLS Saphyr technology from AGBT 2018 here) shows an additional 13 repeats that were missed in the assembly. Even with 44 Gb of >100 kb ONT reads and the latest in de novo assemblers (these repeats are very hard to map!), Dr. Phillippy pointed out that “an assembly is a hypothesis” – ever subject to refinement. Here is a link to their GitHub data repository and additional information about the T2T project.
Only yesterday a major publication came out in Nature Communications on structural variation, mapping 154 individuals from the 26 populations of the Thousand Genomes Project, using a combination of Bionano optical mapping and 10X Genomics linked reads.
From the last portion of the abstract:
We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome. (From Levy-Sakin, Kwok et al., Nature Communications 2019, “Genome maps across 26 human populations reveal population-specific patterns of structural variation.”)