This directory contains phylogenetic trees of fully public SARS-CoV-2
sequences, updated daily, along with associated files:

* public-latest.all.masked.pb[.gz]
  Protobuf file for use with usher --load-mutation-annotated-tree

* public-latest.all.masked.vcf.gz
  Variant Call Format (VCF) file containing mutations in public sequences,
  generated from public-latest.all.masked.pb with matUtils extract.
  Missing or ambiguous bases have been imputed by usher to the most parsimonious
  base value [ACGT] at the time each sequence was placed in the tree.

* public-latest.all.nwk.gz
  Newick tree file (usher's uncondensed-final-tree.nh output)

* public-latest.metadata.tsv.gz
  Information about each public sequence, e.g. collection date, location,
  Nextstrain clade and Pango lineage.  Dates and locations are not available
  for some sequences.

* public-latest.all.masked.ShUShER.pb.gz
  Size-limited version of public-latest.all.masked.pb.gz, randomly downsampled
  to 6,000,000 sequences to keep ShUShER from exceeding web browser memory
  limits.

* public-latest.version.txt
  A brief description including date, sources and number of sequences.
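
As a rough illustration of working with the metadata file, the sketch below
tallies rows per value of one column using only the Python standard library.
The function name count_by_column is hypothetical, and any column name passed
to it is an assumption: check the header line of the downloaded file first.

```python
import csv
import gzip
from collections import Counter


def count_by_column(path, column):
    """Count rows per value of `column` in a gzipped TSV file with a header row."""
    counts = Counter()
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text, which suits plain TSV.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Missing or empty values are grouped under "" rather than skipped.
            counts[row.get(column) or ""] += 1
    return counts
```

For example, count_by_column("public-latest.metadata.tsv.gz", "country") would
tally sequences per country, assuming the file has a column with that name.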

Previous versions of the files are available in year/month/day directories:
  http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2021/
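
A minimal sketch of building the URL of one dated directory follows.  It
assumes that month and day are zero-padded to two digits and that the same
base path applies to every year; the helper name archive_dir_url is
hypothetical, and file names inside the dated directories are not covered.

```python
from datetime import date

BASE_URL = "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/"


def archive_dir_url(day):
    """Return the year/month/day directory URL for `day`.

    Assumption: month and day components are zero-padded (e.g. 2021/06/01/).
    """
    return f"{BASE_URL}{day:%Y/%m/%d}/"
```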

The trees encoded in the Newick and protobuf files are derived from releases of
Rob Lanfear's sarscov2phylo (https://github.com/roblanf/sarscov2phylo), pruned
to include only public sequences aggregated from GenBank, COG-UK, and the
China National Center for Bioinformation, mapped to GISAID EPI_ISL_* IDs used
in the sarscov2phylo tree files.  The trees have also been re-rooted to
Wuhan/Hu-1 (GenBank MN908947.3, RefSeq NC_045512.2), and nodes with no
associated mutations have been collapsed.  Sequences released after the final
sarscov2phylo release (Nov. 13, 2020) were added to the tree using UShER.

A file that maps GISAID EPI_ISL_* IDs to public sequence IDs may be downloaded
from
  https://github.com/CDCgov/SARS-CoV-2_Sequencing/blob/master/files/epiToPublic.tsv.gz
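
Once the mapping file has been downloaded, it can be loaded into a dictionary
for lookups.  The sketch below assumes a gzipped, tab-separated file whose
first column is the EPI_ISL_* ID and whose second column is the public
sequence ID; that layout is an assumption, so inspect the file before using
it.  The function name load_id_map is hypothetical.

```python
import csv
import gzip


def load_id_map(path):
    """Map the first tab-separated column to the second, skipping short lines.

    Assumption: column 1 is the EPI_ISL_* ID and column 2 the public ID.
    """
    mapping = {}
    with gzip.open(path, "rt", newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(fields) >= 2:
                mapping[fields[0]] = fields[1]
    return mapping
```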

GenBank sequences and metadata may be downloaded from
  https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049

COG-UK sequences and metadata may be downloaded from the "Latest Sequence Data"
section of
  https://www.cogconsortium.uk/tools-analysis/public-data-analysis-2/

The China National Center for Bioinformation offers additional sequences and
metadata:
  https://bigd.big.ac.cn/ncov/release_genome

A more extensive archive of public sequence trees is available in year/month/day
directories (note that methods changed slightly over time and the trees may be
lower quality than the more recent trees on hgdownload.soe.ucsc.edu):
  https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2020/
  https://hgwdev.gi.ucsc.edu/~angie/UShER_SARS-CoV-2/2021/

The DD-MM-YY release labels used in 2020/10/ and 2020/11/ subdirectories
correspond to sarscov2phylo releases:

  https://github.com/roblanf/sarscov2phylo/releases
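
Because DD-MM-YY labels do not sort chronologically as plain strings, it can
help to convert them to dates first.  A small sketch, with the hypothetical
helper name parse_release_label:

```python
from datetime import datetime


def parse_release_label(label):
    """Parse a DD-MM-YY release label (day first, two-digit year) into a date."""
    return datetime.strptime(label, "%d-%m-%y").date()
```

Under this format, the label for the final sarscov2phylo release of
Nov. 13, 2020 would be written 13-11-20.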

This work is made possible by the open sharing of genetic data by research
groups from all over the world.  We gratefully acknowledge the authors, the
originating laboratories where the clinical specimens or virus isolates were
first obtained, and the submitting laboratories where the sequence data on
which this research is based were generated and submitted to public databases.

If you use these files, please cite:

McBroome et al. (2021).  A Daily-Updated Database and Tools for Comprehensive
SARS-CoV-2 Mutation-Annotated Trees.
https://academic.oup.com/mbe/article/38/12/5819/6361626

Please also acknowledge the submitters of SARS-CoV-2 sequences to public
databases.

Name                                        Last modified     Size
Parent Directory                                              -
2021/                                       2021-12-02 05:02  -
2022/                                       2022-12-01 15:47  -
2023/                                       2023-12-01 16:46  -
2024/                                       2024-10-01 19:39  -
public-latest.all.masked.vcf.gz             2024-10-06 21:12  655M
public-latest.metadata.tsv.gz               2024-10-06 21:23  129M
public-latest.version.txt                   2024-10-06 21:27  126
public-latest.all.masked.taxonium.jsonl.gz  2024-10-06 21:43  318M
public-latest.all.masked.ShUShER.pb.gz      2024-10-06 21:48  113M
public-latest.all.nwk.gz                    2024-10-06 21:48  120M
public-latest.all.masked.pb.gz              2024-10-06 21:49  171M
public-latest.all.msa.fa.xz                 2024-10-06 21:51  6.3M
public-latest.all.masked.msa.pb.gz          2024-10-06 21:53  1.9M
public-latest.masked.msa.nwk.gz             2024-10-06 21:53  1.2M