This file is from:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/README.txt
This directory contains compressed multiple alignments of
119 virus sequences.
The 'reference' sequence for this collection is the sequence:
NC_045512v2 - 2019-12-30 - Wuhan-Hu-1
https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
These 119 unique sequences were obtained from 3 sources:
1. NCBI Entrez search term: "SARS-CoV-2" produced 106 sequences
as of 2020-03-06. 40 of these sequences were exact duplicates
of other sequences in this set, and 21 were fragments
of gene sequences. The duplicates and fragments are not included
in the list of 119 sequences.
2. "coronaviridae" sequences (55 sequences) obtained from:
https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118
selected to show "RefSeq nucleotides"
3. 12 additional unique sequences were obtained from:
https://bigd.big.ac.cn/ncov/release_genome
All the other sequences available there were copies of the NCBI/GenBank
sequences.
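The removal of exact duplicates and fragments described in source 1 can be sketched as follows. This is only an illustration: this README does not say how duplicates or fragments were identified, so the case-insensitive comparison and the min_length cutoff are assumptions, and the sequence names are made up.

```python
def filter_sequences(records, min_length):
    """Drop short fragments and exact duplicates, keeping first occurrences.

    records: iterable of (name, sequence) pairs.
    min_length: assumed cutoff; shorter sequences are treated as fragments.
    """
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper()  # assumption: duplicates are compared ignoring case
        if len(key) < min_length:
            continue  # treated as a fragment
        if key in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(key)
        kept.append((name, seq))
    return kept


# Toy input; names and sequences are invented for the example.
example = [
    ("seqA", "ACGTACGTAC"),
    ("seqB", "acgtacgtac"),  # same bases as seqA
    ("seqC", "ACGT"),        # shorter than the cutoff
    ("seqD", "ACGTACGTTT"),
]
print([name for name, _ in filter_sequences(example, min_length=8)])
```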
Description files in this directory:
md5sums.txt - md5 sums to verify copied files
wuhCor1.119way.nameList.txt - relates each accession name to its
sequence name and sample collection date
wuhCor1.119way.nh - Phylogenetic tree used for the multiz alignment.
The tree was calculated from a distance matrix of 31-mer frequency
similarity, which was then joined by neighbor joining with the
PHYLIP toolset:
http://evolution.genetics.washington.edu/phylip.html
'neighbor' command:
http://evolution.genetics.washington.edu/phylip/progs.data.dist.html
wuhCor1.multiz119way.maf.gz - alignments with gap annotation, labeled
with accession identifiers
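As a rough illustration of the k-mer distance step behind the tree file wuhCor1.119way.nh, the sketch below builds a pairwise distance matrix from k-mer counts. This README only says "31mer frequency similarity", so the use of cosine distance here is an assumption, as are all function names; the resulting matrix would still need to be written in PHYLIP format and passed to 'neighbor' to obtain a tree.

```python
from collections import Counter
from itertools import combinations
import math


def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two k-mer count profiles."""
    dot = sum(a[kmer] * b[kmer] for kmer in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(seqs, k=31):
    """Return (names, dist) where dist[a][b] is the k-mer profile distance."""
    names = list(seqs)
    profiles = {name: kmer_counts(seqs[name], k) for name in names}
    dist = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = cosine_distance(profiles[a], profiles[b])
    return names, dist


# Toy sequences with a small k so the example runs on short strings.
toy = {"a": "ACGTACGT", "b": "ACGTACGT", "c": "TTTTTTTT"}
names, dist = distance_matrix(toy, k=3)
for name in names:
    print(name, [round(dist[name][other], 3) for other in names])
```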
sequences/ - directory with files:
https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/sequences/
sequences/dnaFasta119.tgz - gzipped tar file for the DNA fasta, 119 sequences
sequences/proteinFasta119.tgz - gzipped tar file for the proteins as obtained
from the genbank records, for example in:
https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
one .faa.gz for each sequence
(not all protein sequences are available, only 107 sequences present)
sequences/proteinTab119.tgz - the same proteins arranged in single lines
of the format:
sequenceName.proteinName<tab>amino acids . . .
One file for each of the sequences. Not all of the 119 sequences have
these protein records; there are 107 protein records here.
This file format is convenient for extracting proteins of similar length
from all the sequences. For example, the longest protein
(the ORF1ab polyprotein) is over 6000 AAs. After unpacking the tgz file
into a directory:
zcat *.faa.tab.gz | awk -F$'\t' 'length($2) > 6000' \
| awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > orf1abProtein.faa
You can drop that orf1abProtein.faa file into a multiple aligner such
as 'COBALT':
https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
to obtain a multiple alignment of that protein for 99 of these sequences.
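For those who prefer a script to the zcat/awk pipeline above, the same length filter on the tab-format files can be sketched in Python. The function names, the toy records, and the output file name longProteins.faa are hypothetical; only the sequenceName.proteinName<tab>amino acids layout and the *.faa.tab.gz file pattern come from this README.

```python
import glob
import gzip


def read_tab_files(pattern):
    """Yield text lines from every gzipped tab file matching the glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt") as handle:
            yield from handle


def proteins_longer_than(lines, min_length):
    """Yield (name, amino_acids) for records longer than min_length residues."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and len(fields[1]) > min_length:
            yield fields[0], fields[1]


def to_fasta(records):
    """Format (name, amino_acids) pairs as FASTA text, one line per sequence."""
    return "".join(f">{name}\n{aa}\n" for name, aa in records)


# Toy records in the tab layout; names and residues are made up.
sample = ["seqA.protX\tMKV\n", "seqA.protY\tMKVLAAGG\n", "seqB.protY\tMKVLAAGGT\n"]
print(to_fasta(proteins_longer_than(sample, 5)), end="")

# With the unpacked files in the current directory, the filter could be run as:
#   records = proteins_longer_than(read_tab_files("*.faa.tab.gz"), 6000)
#   with open("longProteins.faa", "w") as out:
#       out.write(to_fasta(records))
```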
For a description of multiple alignment format (MAF), see
http://genome.ucsc.edu/goldenPath/help/maf.html.
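As a minimal sketch of reading the MAF file, the code below groups the 's' (sequence) lines of each alignment block, using the field order given in the MAF description: source name, start, size, strand, source size, and aligned text. The function names and the toy records are made up for illustration; other line types (such as 'i', 'e' and 'q') are simply skipped, and blocks without 's' lines are not reported.

```python
import gzip


def read_maf_blocks(lines):
    """Yield each alignment block as a list of dicts, one per 's' line."""
    block = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank lines separate blocks
        if fields[0] == "a":  # an 'a' line starts a new alignment block
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) == 7:
            _, src, start, size, strand, src_size, text = fields
            block.append({
                "src": src,
                "start": int(start),        # zero-based start in the source
                "size": int(size),          # aligned bases, excluding gaps
                "strand": strand,
                "src_size": int(src_size),  # full length of the source sequence
                "text": text,               # aligned bases with '-' for gaps
            })
    if block:
        yield block


def open_maf(path):
    """Open a plain or gzip-compressed MAF file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Toy MAF content; sequence names and coordinates are invented.
sample = """##maf version=1
a score=10.0
s refSeq 0 8 + 100 ACGT-ACGT
s otherSeq 5 8 + 50 ACGTTAC-T

a score=5.0
s refSeq 8 4 + 100 GGCC
""".splitlines()

blocks = list(read_maf_blocks(sample))
for block in blocks:
    print([(row["src"], row["start"], row["size"]) for row in block])
```

For the compressed alignment in this directory, the same generator could be fed from open_maf(path) inside a with statement.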
PhastCons conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phastCons119way
PhyloP conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phyloP119way
---------------------------------------------------------------
To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.
Via rsync:
rsync -avz --progress \
rsync://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz119way/ ./
Via FTP:
ftp hgdownload.soe.ucsc.edu
user name: anonymous
password: <your email address>
go to the directory goldenPath/wuhCor1/multiz119way
To download multiple files from the UNIX command line, use the "mget" command.
mget <filename1> <filename2> ...
- or -
mget -a (to download all the files in the directory)
Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.
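After downloading, the copied files can be checked against the md5 sums file in this directory. The sketch below assumes that file uses the usual md5sum layout of one "checksum  filename" pair per line; the function names are hypothetical, and files that are listed but missing locally are reported as failed.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(lines):
    """Map file name to expected checksum from md5sum-style lines."""
    expected = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, name = parts
        # md5sum marks binary-mode entries with a leading '*' on the name
        expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def verify(listing_path, directory="."):
    """Return {file name: True/False} for every entry in the listing."""
    with open(listing_path) as handle:
        expected = parse_md5_listing(handle)
    results = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        results[name] = os.path.exists(path) and md5_of_file(path) == checksum
    return results


# Parsing a single md5sum-style line; the file name is made up.
print(parse_md5_listing(["d41d8cd98f00b204e9800998ecf8427e  empty.txt\n"]))
```

Calling verify() with the path to the downloaded md5 sums file and the download directory returns a per-file pass/fail map.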
---------------------------------------------------------------
All the files in this directory are freely usable for any
purpose. For data use restrictions regarding the individual
genome assemblies, see http://genome.ucsc.edu/goldenPath/credits.html.
---------------------------------------------------------------
Name                                Last modified      Size
md5sums.txt                         2020-03-13 12:16   497
sequences/                          2020-05-01 14:55   -
wuhCor1.119way.descriptiveName.nh   2020-03-13 10:52   9.6K
wuhCor1.119way.nameList.txt         2020-03-11 14:55   5.5K
wuhCor1.119way.nh                   2020-03-13 10:03   6.4K
wuhCor1.119way.phyloDistance.txt    2020-03-13 12:06   5.5K
wuhCor1.strainName119way.maf.gz     2020-05-01 13:48   1.0M