This file is from:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/README.txt
This directory contains compressed multiple alignments of 44 virus sequences.
These 44 sequences represent coronavirus strains in bat populations.
The 'reference' sequence for this collection is the sequence:
NC_045512v2 - 2019-12-30 - Wuhan-Hu-1
https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
Description files in this directory:
md5sum.txt - md5 sums to verify copied files
wuhCor1.44way.nameList.txt - maps each accession name to its
sequence name and sample collection date
wuhCor1.44way.nh - Phylogenetic tree used for multiz alignment.
The phylogenetic tree was calculated from 31-mer frequency similarity,
with neighbor joining applied to the resulting distance matrix using
the PHYLIP toolset:
http://evolution.genetics.washington.edu/phylip.html
'neighbor' command:
http://evolution.genetics.washington.edu/phylip/progs.data.dist.html
wuhCor1.multiz44way.maf.gz - alignments with gap annotation, using
accession identifiers
sequences/ - directory with files:
https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/sequences/
sequences/dnaFasta44.tgz - gzipped tar file for the DNA fasta, 44 sequences
sequences/proteinFasta44.tgz - gzipped tar file for the proteins as obtained
from the GenBank records, for example in:
https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2
one .faa.gz for each sequence
sequences/proteinTab44.tgz - the same proteins arranged in single lines
of the format:
sequenceName.proteinName<tab>amino acids . . .
One file for each of the sequences.
This file format is convenient for extracting proteins of similar length
from all the sequences. For example, the longest protein (the ORF1ab
polyprotein) is over 6000 AAs. After unpacking the tgz file into a
directory:
zcat *.faa.tab.gz | awk -F$'\t' 'length($2) > 6000' \
| awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > orf1abProtein.faa
You can drop that orf1abProtein.faa file into a multiple aligner such
as 'COBALT'
https://www.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi
to obtain a multiple alignment of that protein across these sequences.
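Before choosing a length cutoff for the proteinTab files, it can help to
see how long each protein is. The following is a minimal sketch, not part
of the original distribution: the function name protein_lengths and the
sample file with its sequence and protein names are made up for
illustration. It assumes only the one-line-per-protein layout described
above.

```shell
# Hypothetical sample in the proteinTab layout:
#   sequenceName.proteinName<tab>amino acids
printf 'seqA.protX\tMFVFLVLLPLVS\nseqA.protY\tMSDNG\nseqB.protX\tMFIFLLFLTL\n' > sample.faa.tab

# Print "name<tab>length" for every protein, longest first.
# Reads the named files, or standard input when no file is given.
protein_lengths() {
  tab=$(printf '\t')
  awk -F'\t' '{ printf "%s\t%d\n", $1, length($2) }' "$@" |
    sort -t "$tab" -k2,2nr
}

protein_lengths sample.faa.tab
```

On the unpacked files, the same function could be fed from a pipe, for
example: zcat *.faa.tab.gz | protein_lengths | head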
For a description of multiple alignment format (MAF), see
http://genome.ucsc.edu/goldenPath/help/maf.html.
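As a quick orientation in a MAF file, the sequence names can be listed
from the "s" lines, whose second field is the source name. This is a
sketch under assumptions: the function name maf_sequence_names and the
sample excerpt (including the name seqB and its coordinates) are
hypothetical and not taken from the real alignment.

```shell
# Tiny hypothetical MAF excerpt, for illustration only.
cat > sample.maf <<'EOF'
##maf version=1
a score=100.0
s NC_045512v2 0 10 + 29903 ATTAAAGGTT
s seqB        0 10 + 29000 ATTAAAGGTT

a score=50.0
s NC_045512v2 10 5 + 29903 TATAC
EOF

# Print each distinct source name found on the "s" (sequence) lines.
# Reads the named files, or standard input when no file is given.
maf_sequence_names() {
  awk '$1 == "s" { print $2 }' "$@" | sort -u
}

maf_sequence_names sample.maf
```

For the compressed alignment in this directory, the input could be piped
in: zcat wuhCor1.multiz44way.maf.gz | maf_sequence_names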
PhastCons conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phastCons44way
PhyloP conservation scores for these alignments are available at:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/phyloP44way
---------------------------------------------------------------
To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.
Via rsync:
rsync -avz --progress \
rsync://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/multiz44way/ ./
Via FTP:
ftp hgdownload.soe.ucsc.edu
user name: anonymous
password: <your email address>
go to the directory goldenPath/wuhCor1/multiz44way
To download multiple files from the UNIX command line, use the "mget" command.
mget <filename1> <filename2> ...
- or -
mget -a (to download all the files in the directory)
Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.
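After downloading, the copied files can be checked against md5sum.txt.
The sketch below assumes GNU coreutils md5sum and assumes that md5sum.txt
uses the standard md5sum output format (checksum followed by file name).
The function name verify_downloads is made up for illustration.

```shell
# Check downloaded files against md5sum.txt in the given directory
# (default: current directory). Returns non-zero on any mismatch.
verify_downloads() {
  (
    cd "${1:-.}" || exit 1
    # --ignore-missing (GNU coreutils) skips listed files that were
    # not downloaded instead of reporting them as failures.
    md5sum --ignore-missing -c md5sum.txt
  )
}

# Example call, once the files are in ./multiz44way:
#   verify_downloads ./multiz44way
```

The subshell keeps the directory change local, so the caller's working
directory is left untouched.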
---------------------------------------------------------------
All the files in this directory are freely usable for any
purpose. For data use restrictions regarding the individual
genome assemblies, see http://genome.ucsc.edu/goldenPath/credits.html.
---------------------------------------------------------------
Name                              Last modified     Size  Description
md5sum.txt                        2020-03-18 11:11  483
sequences/                        2020-03-26 15:37  -
wuhCor1.44way.descriptiveName.nh  2020-03-18 09:34  2.9K
wuhCor1.44way.nameList.txt        2020-03-14 13:55  1.9K
wuhCor1.44way.phyloDistance.txt   2020-03-18 10:43  1.8K
wuhCor1.multiz44way.maf.gz        2020-03-17 13:02  334K