LiftOver alignments are used to map annotations from one human assembly to another. The subtracks of this track were created by the T2T consortium using the minimap2 aligner and strong filters; they map CHM13 coordinates to the human assemblies hg19 and hg38.
The T2T pipeline used the minimap2 aligner, which outputs long alignments that do not require "chaining" into longer ones. The pipeline then removed alignments that go to other chromosomes, as well as all alignments to alternate haplotypes, fix patches (corrections to the assembly), and unplaced contig sequences.
This means that the T2T alignments are tuned for high specificity. They are probably best used for mapping annotations to hg38 in automated pipelines, in cases where the final processing on hg38 does not use alts/fixes/unplaced sequences, and when one wants the mapped annotations to be as reliable as possible.
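To make the idea of coordinate mapping concrete, the following is a minimal Python sketch of how a position can be carried through the ungapped blocks of one chain. It is an illustration only, not the liftOver implementation: the function name lift_position and the (t_start, q_start, size) block tuples are hypothetical, both sequences are assumed to be on the + strand, and coordinates are assumed to be 0-based and half-open.

```python
def lift_position(blocks, pos):
    """Map a source position to the other assembly through ungapped blocks.

    blocks: list of (t_start, q_start, size) tuples, one per aligned block
            (hypothetical layout; both strands assumed to be '+').
    pos:    0-based position in the source (chain "target") assembly.
    Returns the mapped position, or None if pos falls in a gap.
    """
    for t_start, q_start, size in blocks:
        if t_start <= pos < t_start + size:
            # Inside an ungapped block the offset is preserved.
            return q_start + (pos - t_start)
    return None  # position lies in an unaligned gap and cannot be lifted


# Two aligned blocks separated by a 50 bp gap on the source side only.
example_blocks = [(0, 0, 100), (150, 100, 200)]
print(lift_position(example_blocks, 50))   # inside the first block
print(lift_position(example_blocks, 120))  # inside the gap
print(lift_position(example_blocks, 160))  # inside the second block
```

A position that falls in a gap has no counterpart in the other assembly, which is why strict filtering of the chains trades some coverage for reliability.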
Here is an example to illustrate the liftOver track:
Also, we created dot plots from these alignments:
The track displays boxes joined together by either single or double lines. The boxes represent aligning regions, single lines indicate gaps that are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in GRCh38 or GRCh37, and double lines represent more complex gaps that involve substantial sequence in both assemblies.
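The single-line versus double-line distinction can be read off the gap columns of a chain file. The sketch below assumes the standard UCSC chain layout, where each alignment line after the header is "size dt dq" and the last line of a chain holds only a size. It labels a gap "double" when both sides contain unaligned sequence and "single" otherwise. The function name classify_gaps and the example chain are made up for illustration, and the browser may apply additional display rules not modeled here.

```python
def classify_gaps(chain_text):
    """Return (dt, dq, kind) for every gap in a single chain record."""
    gaps = []
    for line in chain_text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "chain":
            continue  # skip blank lines and the header line
        if len(fields) == 3:
            _size, dt, dq = (int(x) for x in fields)
            # A gap with sequence on both sides is drawn as a double line;
            # a gap on one side only is drawn as a single line.
            kind = "double" if dt > 0 and dq > 0 else "single"
            gaps.append((dt, dq, kind))
    return gaps


# Hypothetical chain: four aligned blocks separated by three gaps.
example_chain = """\
chain 1000 chr1 1000 + 0 580 chr1 900 + 0 540 1
100 50 0
200 0 20
150 30 20
50
"""

for gap in classify_gaps(example_chain):
    print(gap)
# (50, 0, 'single')
# (0, 20, 'single')
# (30, 20, 'double')
```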
To prevent ambiguous alignments, all false duplications, as determined by the Genome in a Bottle Consortium (GCA_000001405.15_GRCh38_GRC_exclusions_T2Tv2.bed), as well as the GRCh38 modeled centromeres, were masked from the GRCh38/hg38 primary assembly. In addition, unlocalized and unplaced (random) contigs were removed.
Unlocalized and unplaced (random) contigs were removed from the GRCh37/hg19 assembly.
For the minimap2-based pipeline, the initial chain file was generated using nf-LO v1.5.1 with minimap2 v2.24 alignments. These chains were then split at all locations that contained unaligned segments greater than 1kbp or gaps greater than 10kbp. Split chain files were then converted to PAF format with extended CIGAR strings using chaintools (https://doi.org/10.5281/zenodo.6342391, v0.1), and alignments between nonhomologous chromosomes were removed. The trim-paf operation of rustybam (https://zenodo.org/record/6342176, v0.1.29) was next used to remove overlapping alignments in the query sequence, and then the target sequence, to create 1:1 alignments. PAF alignments were converted back to the chain format with paf2chain commit f68eeca, and finally, chaintools was used to generate the inverted chain file.
Full commands with parameters used were:
nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain
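For readers less familiar with awk, the filter awk '$1==$6' in the commands above keeps only PAF records whose query sequence name (column 1) equals the target sequence name (column 6), which is how alignments between nonhomologous chromosomes are dropped. A rough Python equivalent, assuming tab-separated PAF with its 12 mandatory columns, might look like this. The function name same_chromosome and the example records are hypothetical.

```python
def same_chromosome(paf_lines):
    """Yield PAF records whose query name equals the target name."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF column 1 is the query name, column 6 is the target name.
        if len(fields) >= 12 and fields[0] == fields[5]:
            yield line


# Two made-up records: chr1 to chr1 is kept, chr1 to chr5 is dropped.
example_paf = [
    "chr1\t1000\t0\t500\t+\tchr1\t1200\t100\t600\t490\t500\t60\n",
    "chr1\t1000\t500\t900\t+\tchr5\t2000\t0\t400\t380\t400\t60\n",
]

kept = list(same_chromosome(example_paf))
print(len(kept))
```

This only works because the chromosome names are identical in both assemblies (for example chr1 in both CHM13 and GRCh38); assemblies with different naming schemes would need a name mapping first.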
The above process does not add chain ids or scores. The UCSC utilities chainMergeSort and chainScore are used to update the chains:
chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg38.2bit chm13v2.0-hg38.chain
chainMergeSort out_inverted.chain | chainScore stdin hg38.2bit chm13v2.0.2bit hg38-chm13v2.0.chain
Rustybam trim-paf uses dynamic programming and the CIGAR string to find an optimal splitting point between overlapping alignments in the query sequence. It starts its trimming with the largest overlap and then recursively trims smaller overlaps.
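The idea of choosing an optimal split point can be illustrated with a much simplified model. This is not rustybam's implementation: it assumes the overlap has already been reduced to two equal-length lists of per-base scores (for example +1 for a match and -1 for a mismatch, derived from each alignment's CIGAR string), and it ignores indels. The function name best_split and the scoring scheme are assumptions made for the example.

```python
def best_split(scores_a, scores_b):
    """Find k so that alignment A keeps overlap bases [0, k) and
    alignment B keeps bases [k, n), maximizing the total score.

    Returns (k, total_score). Ties are resolved in favor of the smallest k.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same overlap")
    n = len(scores_a)
    # Start with k = 0: the whole overlap is assigned to B.
    current = sum(scores_b)
    best_k, best_score = 0, current
    for k in range(1, n + 1):
        # Moving the split one base right hands base k-1 from B to A.
        current += scores_a[k - 1] - scores_b[k - 1]
        if current > best_score:
            best_k, best_score = k, current
    return best_k, best_score


# A aligns well at the start of the overlap, B aligns well at the end.
a = [1, 1, 1, -1, -1]
b = [-1, -1, 1, 1, 1]
print(best_split(a, b))  # (2, 5)
```

Each candidate split is evaluated incrementally from the previous one, so the scan is linear in the overlap length.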
Results were validated by using chaintools to confirm that there were no overlapping sequences with respect to both CHM13v2.0 and GRCh38 in the released chain file. In addition, trimmed alignments were visually inspected with SafFire to confirm their quality.
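A check of this kind can be sketched in a few lines. The example below is not the chaintools code: it assumes the aligned blocks have already been extracted as (sequence name, start, end) tuples with 0-based half-open coordinates, and the function name find_overlaps is hypothetical. Running it once on the CHM13 coordinates and once on the GRCh38 coordinates would cover both sides of the chain file.

```python
from collections import defaultdict


def find_overlaps(intervals):
    """Return (name, start, end) for each region covered more than once."""
    by_seq = defaultdict(list)
    for name, start, end in intervals:
        by_seq[name].append((start, end))

    overlaps = []
    for name, spans in by_seq.items():
        spans.sort()
        _, prev_end = spans[0]
        for start, end in spans[1:]:
            if start < prev_end:
                # This span begins before the furthest end seen so far.
                overlaps.append((name, start, min(end, prev_end)))
            prev_end = max(prev_end, end)
    return overlaps


# Adjacent blocks (end == next start) do not count as overlapping.
example_blocks = [
    ("chr1", 0, 100),
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr2", 0, 50),
]
print(find_overlaps(example_blocks))  # [('chr1', 150, 200)]
```

An empty result for both assemblies is what a 1:1 chain file should produce.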
The T2T v1_nflo liftOver chains were generated by Nae-Chyun Chen <naechyun.chen@gmail.com> and Mitchell Vollger <mvollger@uw.edu>. The UCSC liftOver chains and the dot plots were created by Hiram Clawson.
lastz was developed by Robert Harris, Pennsylvania State University.
The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler.
The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent.
The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent.
Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.
Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University
Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961