sarsCov2PhyloPub
Phylogeny: Public
Phylogenetic Tree and Nucleotide Substitution Mutations in Sequences in Public Databases
Variation and Repeats

Description

This track displays a phylogenetic tree relating public SARS-CoV-2 genome sequences available from NCBI Virus / GenBank, COG-UK and the China National Center for Bioinformation, contributed by laboratories around the world, and mutations found in those sequences. By default, only very common mutations (alternate allele found in at least 1% of samples) are displayed, but other subtracks may be made visible in order to see more rare mutations.

The phylogenetic tree is inferred by the sarscov2phylo pipeline (Lanfear). For display in the narrow space to the left of the main genome browser image, nodes in the tree are collapsed unless a mutation is associated with a node; i.e. the only branching points displayed are those at which mutations occurred.

The tree is colored by Pangolin lineage (Rambaut et al.). The coloring scheme is adapted from Figure 1 of (Alm et al.) which presents a unified view of a simplified phylogenetic tree, Pangolin lineages, Nextstrain clades and GISAID clades.

Pangolin lineage(s) | Nextstrain clade | GISAID clade
A | 19B | S
B.n (n > 1) | 19A | L
n/a (color not used when coloring by lineage; overlaps on tree with B.4 - B.7) | n/a (overlaps on tree with 19A) | O
n/a (color not used when coloring by lineage; overlaps on tree with B.2) | n/a (overlaps on tree with 19A) | V
B.1.5, B.1.6, B.1.8, other B.1.n that overlap GISAID clade G | 20A (partial) | G
B.1.9, B.1.13, B.1.22, B.1.22, B.1.36, B.1.37 | 20A (partial) | GH (partial)
B.1.3, B.1.12, B.1.26, other B.1.n that overlap GISAID clade GH | 20C | GH (partial)
B.1.1 | 20B | GR

Display Conventions

In "dense" mode, a vertical line is drawn at each position where there is a mutation.
In "squish" and "pack" modes, the display shows a plot of all samples' mutations, with samples ordered using the phylogenetic tree in order to highlight patterns of linkage. "Full" display mode shows each mutation on its own row, ordered by position instead of lineage. Each sample is placed in a horizontal row of pixels; when the number of samples exceeds the number of vertical pixels for the track, multiple samples fall in the same pixel row and pixels are averaged across samples. Each mutation is a vertical bar at its position in the SARS-CoV-2 genome with white (invisible) representing the reference allele; the non-reference allele is shown in red if it changes the protein sequence of a gene, green if it falls within a gene but does not change the protein, and black if it does not fall within a gene. Tick marks are drawn at the top and bottom of each mutation's vertical bar to make the bar more visible when most alleles are reference alleles. Only single-nucleotide substitutions are displayed, not insertions or deletions. The phylogenetic tree showing inferred relationships between the samples is depicted in the left column of the display. Mousing over this will show the sample identifiers. At the default track height, about 100 samples are averaged into each row of pixels. The track height can be adjusted in the track controls, which can be reached by clicking on the gray button to the left of the tree or by right-clicking on the image. Methods Rob Lanfear regularly runs the sarscov2phylo pipeline on all complete, high-coverage sequences available from GISAID EpiCoV™. The pipeline aligns all sequences to the same reference genome used by the Genome Browser (RefSeq NC_045512.2, GenBank MN908947.3, GISAID sample hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31) using MAFFT (Katoh et al.). It masks sites identified as problematic by the ProblematicSites_SARS-CoV2 repository (De Maio et al.), as well as sites that are N's or gaps in >50% of samples. 
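The masking rule in the preceding sentence (sites that are N's or gaps in more than 50% of samples) can be illustrated with a minimal Python sketch. This is not the sarscov2phylo implementation; the function name `sites_to_mask` and the list-of-strings alignment representation are assumptions made for illustration only.

```python
def sites_to_mask(alignment, max_missing_fraction=0.5):
    """Return 0-based column indices whose fraction of N/gap characters
    exceeds max_missing_fraction.

    `alignment` is a list of equal-length aligned sequences (strings).
    The name and signature are hypothetical, not part of sarscov2phylo.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_samples = len(alignment)
    masked = []
    for col in range(length):
        # Count samples with no usable base at this column.
        missing = sum(1 for seq in alignment if seq[col] in "Nn-")
        # Strictly greater than the threshold, matching ">50% of samples".
        if missing / n_samples > max_missing_fraction:
            masked.append(col)
    return masked


if __name__ == "__main__":
    toy = ["ACGT", "ANGT", "AN-T", "A-GT"]
    print(sites_to_mask(toy))
```

In the toy alignment, only the second column is missing in more than half of the four samples, so it is the only column reported.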
FastTree (Price et al.) is used to infer the phylogenetic tree; sequences on very long branches are removed using TreeShrink (Mai et al.). The tree is re-rooted to hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05. For full details, see the sarscov2phylo documentation.

UCSC makes a reduced version of the tree that contains only samples from fully public databases (GenBank, COG-UK direct release, CNCB) that do not prohibit UCSC from offering sequence mutations for download (see Data Access). UCSC also makes several adjustments to the phylogenetic tree for compact display:
- We shorten "2019" and "2020" in dates to "19" and "20".
- We change the root of the tree to the reference genome used by the Genome Browser (NC_045512.2, Wuhan/Hu-1).
- Nodes that do not have an associated mutation are collapsed using UShER (Turakhia et al.).

Data Access

Files are available from our Download Server:
- Sequence mutations, allele counts and sample genotypes in Variant Call Format (VCF) files:
  - Mutations with a minimum alternate allele frequency of 1% (public.minAf.01.vcf.gz)
  - Mutations with a minimum alternate allele frequency of 0.1% (public.minAf.001.vcf.gz)
  - All mutations (public.ambigToN.vcf.gz)
- UCSC's rerooted, collapsed, public-sequence-only version of the phylogenetic tree (sarscov2phylo.pub.ft.nh)
- Author credits (acknowledgements.tsv.gz)

The VCF data can be explored interactively with the Table Browser or the Data Integrator, and accessed from scripts through our API. The sarscov2phylo repository includes all releases of the full phylogenetic tree.

Credits

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge the authors and the originating laboratories where the clinical specimen or virus isolate was first obtained, and the submitting laboratories where sequence data have been generated and submitted to public databases, on which this research is based.
Special thanks to Rob Lanfear for developing, running and sharing the sarscov2phylo pipeline and results. Data usage policy The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors if you intend to carry out further research using their data. Authors and/or institutions that provided the sequences are listed in acknowledgements.tsv.gz. References Lanfear, R. A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883. 2020. Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. PMID: 32669681 Alm E, Broberg EK, Connor T, Hodcroft EB, Komissarov AB, Maurer-Stroh S, Melidou A, Neher RA, O'Toole Á, Pereyaslov D et al. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020. Euro Surveill. 2020 Aug;25(32). PMID: 32794443; PMC: PMC7427299 Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. PMID: 23329690; PMC: PMC3603318 De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Masking strategies for SARS-CoV-2 alignments. virological.org. 2020 May 13. De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, Haussler D, and Corbett-Detig R. Ultrafast Sample Placement on Existing Trees (UShER) Empowers Real-Time Phylogenetics for the SARS-CoV-2 Pandemic. bioRxiv. 2020 September 28. Price MN, Dehal PS, Arkin AP. 
FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010 Mar 10;5(3):e9490. PMID: 20224823; PMC: PMC2835736 Mai U, Mirarab S. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics. 2018 May 8;19(Suppl 5):272. PMID: 29745847; PMC: PMC5998883 sarsCov2PhyloPubAllFull No Min AF (slow!) All Nucleotide Substitution Mutations in Public Sequences (slow at whole-genome scale) Variation and Repeats sarsCov2PhyloPubAllMinAf001 Min AF 0.1% Nucleotide Substitution Mutations with Alternate Allele Frequency >= 0.1% in Public Sequences Variation and Repeats sarsCov2PhyloPubAllMinAf01 Min AF 1% Nucleotide Substitution Mutations with Alternate Allele Frequency >= 1% in Public Sequences Variation and Repeats sarsCov2Phylo Phylogeny: GISAID Phylogenetic Tree and Nucleotide Substitution Mutations in High-coverage Sequences in GISAID EpiCoV TM Variation and Repeats Description This track displays a phylogenetic tree inferred from SARS-CoV-2 genome sequences collected by GISAID, and mutations found in the sequences. By default, only very common mutations (alternate allele found in at least 1% of samples) are displayed, but other subtracks may be made visible in order to see more rare mutations. The phylogenetic tree is inferred by Rob Lanfear's sarscov2phylo pipeline. For display in the narrow space to the left of the main genome browser image, nodes in the tree are collapsed unless a mutation is associated with a node; i.e. the only branching points displayed are those at which mutations occurred. Two options for coloring the tree, by Pangolin lineage (Rambaut et al.) or GISAID clade, are available. Both coloring schemes are adapted from Figure 1 of (Alm et al.) which presents a unified view of a simplified phylogenetic tree, Pangolin lineages, Nextstrain clades and GISAID clades. 
lineage(s) | Nextstrain clade | GISAID clade
A | 19B | S
B.n (n > 1) | 19A | L
n/a (color not used when coloring by lineage; overlaps on tree with B.4 - B.7) | n/a (overlaps on tree with 19A) | O
n/a (color not used when coloring by lineage; overlaps on tree with B.2) | n/a (overlaps on tree with 19A) | V
B.1.5, B.1.6, B.1.8, other B.1.n that overlap GISAID clade G | 20A (partial) | G
B.1.9, B.1.13, B.1.22, B.1.22, B.1.36, B.1.37 | 20A (partial) | GH (partial)
B.1.3, B.1.12, B.1.26, other B.1.n that overlap GISAID clade GH | 20C | GH (partial)
B.1.1 | 20B | GR

Display Conventions

In "dense" mode, a vertical line is drawn at each position where there is a mutation. In "pack" mode, the display shows a plot of all samples' mutations, with samples ordered using the phylogenetic tree in order to highlight patterns of linkage. Each sample is placed in a horizontal row of pixels; when the number of samples exceeds the number of vertical pixels for the track, multiple samples fall in the same pixel row and pixels are averaged across samples. Each mutation is a vertical bar at its position in the SARS-CoV-2 genome with white (invisible) representing the reference allele; the non-reference allele is shown in red if it changes the protein sequence of a gene, green if it falls within a gene but does not change the protein, and black if it does not fall within a gene. Tick marks are drawn at the top and bottom of each mutation's vertical bar to make the bar more visible when most alleles are reference alleles. Only single-nucleotide mutations are displayed, not insertions or deletions.

The phylogenetic tree showing inferred relationships between the samples is depicted in the left column of the display. Mousing over this will show the GISAID identifiers for the different samples. At the default track height, about 100 samples are averaged into each row of pixels.
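To make the pixel-row averaging concrete, here is a small Python sketch for a single genome position. The Genome Browser's actual binning code is not described in this text, so the even split of tree-ordered samples across rows and the name `average_into_pixel_rows` are illustrative assumptions.

```python
def average_into_pixel_rows(sample_alleles, pixel_rows):
    """Average per-sample allele calls into a fixed number of pixel rows.

    `sample_alleles` lists one value per sample in tree order
    (0 = reference allele, 1 = non-reference allele) for one position.
    Returns the non-reference fraction for each pixel row; rows that
    receive no samples are reported as 0.0.
    Hypothetical helper, not the browser's drawing code.
    """
    if pixel_rows <= 0:
        raise ValueError("pixel_rows must be positive")
    n_samples = len(sample_alleles)
    sums = [0] * pixel_rows
    counts = [0] * pixel_rows
    for i, allele in enumerate(sample_alleles):
        # Spread samples evenly over the rows, preserving tree order.
        row = i * pixel_rows // n_samples
        sums[row] += allele
        counts[row] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


if __name__ == "__main__":
    calls = [0, 0, 1, 1, 1, 1, 0, 0]
    print(average_into_pixel_rows(calls, 4))
```

Under these assumptions, a row whose samples all carry the non-reference allele would be drawn at full intensity, while a row with a mix of alleles would be drawn at an intermediate shade.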
The track height can be adjusted in the track controls, which can be reached by clicking on the gray button to the left of the tree or by right-clicking on the image. Methods Rob Lanfear regularly runs the sarscov2phylo pipeline on all complete, high-coverage sequences available from GISAID. The pipeline aligns all sequences to the same reference genome used by the Genome Browser (RefSeq NC_045512.2, GenBank MN908947.3, GISAID sample hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31) using MAFFT (Katoh et al.). It masks sites identified as problematic by the ProblematicSites_SARS-CoV2 repository (De Maio et al., Turakhia et al.), as well as sites that are N's or gaps in >50% of samples. fasttree (Price et al.) is used to infer the phylogenetic tree; sequences on very long branches are removed using TreeShrink (Mai et al.). The tree is re-rooted to hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05. For full details, see the sarscov2phylo documentation. Collapsing of nodes that do not have an associated mutation is done using strain_phylogenetics (Turakhia et al.). Data Access You can download the VCF files underlying this track (gisaid.*.vcf.gz) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. Note: while the VCF files contain mutations found in sequences collected by GISAID, they are not sufficient to reconstruct the original sequences available from GISAID due to treatment of ambiguous IUPAC bases as missing information in the VCF and omission of insertion and deletion mutations. Additionally, the subtracks that are filtered to include only mutations found in a minimum percentage of samples give very incomplete representations of samples. Researchers wishing to work with SARS-CoV-2 genomic sequences should register with GISAID and download the full sequences. Credits This work is made possible by the open sharing of genetic data by research groups from all over the world. 
We gratefully acknowledge their contributions. Sequences are collected by GISAID and may be downloaded by registered users. Special thanks to Rob Lanfear for developing, running and sharing the sarscov2phylo pipeline and results. Data usage policy The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors if you intend to carry out further research using their data. Author contact info is available via https://github.com/roblanf/sarscov2phylo/tree/master/acknowledgements. References Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. PMID: 32669681 Alm E, Broberg EK, Connor T, Hodcroft EB, Komissarov AB, Maurer-Stroh S, Melidou A, Neher RA, O'Toole Á, Pereyaslov D et al. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020. Euro Surveill. 2020 Aug;25(32). PMID: 32794443; PMC: PMC7427299 Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. PMID: 23329690; PMC: PMC3603318 De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Masking strategies for SARS-CoV-2 alignments. virological.org. 2020 May 13. De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14. Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, and Corbett-Detig R. Stability of SARS-CoV-2 Phylogenies. bioRxiv. 2020 June 9. Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010 Mar 10;5(3):e9490. 
PMID: 20224823; PMC: PMC2835736 Mai U, Mirarab S. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics. 2018 May 8;19(Suppl 5):272. PMID: 29745847; PMC: PMC5998883 sarsCov2PhyloFull All (slow!) All Nucleotide Substitution Mutations in GISAID EpiCov TM Sequences (slow at whole-genome scale) Variation and Repeats sarsCov2PhyloMinAf001 Min alt AF 0.1% Nucleotide Substitution Mutations with Alternate Allele Frequency >= 0.1% in GISAID EpiCov TM Sequences Variation and Repeats sarsCov2PhyloMinAf01 Min alt AF 1% Nucleotide Substitution Mutations with Alternate Allele Frequency >= 1% in GISAID EpiCov TM Sequences Variation and Repeats nextstrainClade Nextstrain Clades Nextstrain year-letter clade designations (19A, 19B, 20A, etc.) Variation and Repeats Description Nextstrain.org displays data about mutations that occur in the current 2019/2020 outbreak. Nextstrain has a powerful user interface for viewing the timestamped molecular phylogeny tree that it infers from the patterns of mutations in sequences worldwide using the TreeTime algorithm. Nextstrain defines clades named by year and letter, anticipating that SARS-CoV-2 will become a seasonal virus in the coming years. For more information about the rationale and properties of current clades, see https://github.com/nextstrain/ncov/blob/master/docs/src/reference/naming_clades.md. The Nextstrain Mutations track contains all mutations shown on Nextstrain, of which this track's mutations are a very small subset. Methods Nextstrain downloads SARS-CoV-2 genomes from GISAID as they are submitted by labs worldwide. Nextstrain identifies mutations that define clades of interest and puts them in the file clades.tsv. The genome sequences and metadata including clades.tsv are processed by an automated pipeline and annotations are written to a data file that UCSC downloads and extracts annotations for display. 
Data Access You can download the bigBed file underlying this track (nextstrainClade.bb) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can also be accessed from scripts through our API. Credits Thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID, and to researchers worldwide for sharing their SARS-CoV-2 genome sequences. References Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931 nextstrainSamples Nextstrain Mutations Nextstrain Subset of GISAID EpiCoV TM Sample Mutations Variation and Repeats Description Nextstrain.org displays data about mutations in the SARS-CoV-2 RNA and protein sequences that have occurred in different samples of the virus during the current 2019-2021 outbreak. Nextstrain has a powerful user interface for viewing the evolutionary tree that it infers from the patterns of mutations in sequences worldwide, but does not offer a detailed plot of mutations along the genome that can be correlated with other molecular information, so we have processed their data into this track to display the mutations called by Nextstrain for each sample that Nextstrain has obtained from GISAID. Click on the vertical column in the display for any position in the SARS-CoV-2 genome to see more details about the mutation(s) that occur at that position, including protein change (if applicable; protein changes use gene names in the Nextstrain Genes track), number of samples with the mutation, list giving the nucleotide (allele) for that position in each GISAID sample, etc. Nextstrain identifies certain clades within the phylogenetic tree according to a set of defining mutations. 
The Nextstrain Clades track provides more information about these clades and serves as a useful color key for the clade colors in the phylogenetic tree display. This track is composed of several subtracks so that different subsets of mutations may be viewed: Recurrent Bi-allelic: This is the only subtrack displayed by default. It is limited to mutations that have been observed in at least two samples, and excludes positions at which more than one alternate allele has been observed in more than one sample. All: All mutations found in all samples. <Clade> Mutations: All mutations found in samples belonging to <Clade>, which is one of Nextstrain's clades (19A, 19B, 20A, etc.) Display Conventions In "dense" mode, a vertical line is drawn at each position where there is a mutation. In "pack" mode, the display shows a plot of all samples' mutations, with samples ordered using Nextstrain's phylogenetic tree in order to highlight patterns of linkage. Each sample is placed in a horizontal row of pixels; when the number of samples exceeds the number of vertical pixels for the track, multiple samples fall in the same pixel row and pixels are averaged across samples. Each mutation is a vertical bar at its position in the SARS-CoV-2 genome with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each mutation's vertical bar to make the bar more visible when most alleles are reference alleles. Insertions and deletions are not shown as these are removed from the data by Nextstrain. The phylogenetic tree for the samples built by Nextstrain is depicted in the left column of the display. Mousing over this will show the GISAID identifiers for the different samples. 
When the vertical height of the track is set sufficiently high (10 pixels per sample with the default font), sample names are drawn to the right of the tree; however, with thousands of samples in the Nextstrain tree, and a maximum track height of 2500 pixels, the full Nextstrain tree is too large for sample names to be displayed. In the track controls, the user can choose to display subtracks containing the phylogenetic trees and mutations for individual clades. Some clades have few enough samples that they can be made tall enough to display sample names. Branches of the phylogenetic tree are colored by clade using the same color scheme as nextstrain.org. Methods Nextstrain downloads SARS-CoV-2 genomes from GISAID as they are submitted by labs worldwide, and downsamples to a subset of several thousand sequences in order to provide an interactive display. The selected subset of GISAID sequences is processed by an automated pipeline, producing an annotated phylogenetic tree data structure underlying the Nextstrain display; UCSC downloads the results and extracts annotations for display. Data Access SARS-CoV-2 mutations displayed by Nextstrain are derived from a subset of GISAID sequences, and the GISAID Terms and Conditions prohibit the redistribution of GISAID-derived data. They also require that the submitters of all sequences be acknowledged when the mutations are used. Nextstrain.org offers phylogenetic trees, author credits and other files: scroll to the bottom of the page and click "DOWNLOAD DATA", and a dialog with download options appears. All GISAID SARS-CoV-2 genome sequences and metadata are available for download from GISAID EpiCoV™ by registered users. We have a program faToVcf that can extract VCF from a multi-sequence FASTA alignment such as the "msa_date" download file from GISAID. faToVcf is available for Linux and MacOSX on the download server: https://hgdownload.soe.ucsc.edu/admin/exe. 
It requires at least 4GB of memory to process the complete msa_date file. Here are some steps to get started using faToVcf:

This command enables faToVcf to be run as a program (otherwise the command would say "Permission denied"):

    chmod a+x faToVcf

This command shows basic usage instructions and describes the options:

    ./faToVcf

This command converts msa fasta to VCF without per-sample genotype columns (substitute correct date for "0925" in filenames):

    ./faToVcf -includeRef \
        -ref='hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31|Asia' \
        -vcfChrom=NC_045512.2 \
        -noGenotypes \
        msa_0925.fasta msa_0925.sites.vcf

Credits

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID.

Data usage policy

The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors if you intend to carry out further research using their data. Author contact info is available via nextstrain.org: scroll to the bottom of the page, click "DOWNLOAD DATA" and click "ALL METADATA (TSV)" in the resulting dialog.

References

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931

Sagulenko P, Puller V, Neher RA. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol. 2018 Jan;4(1):vex042. PMID: 29340210; PMC: PMC5758920

Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015 Jan;32(1):268-74.
PMID: 25371430; PMC: PMC4271533 nextstrainSamplesViewNewClades Year-Letter Clades Nextstrain Subset of GISAID EpiCoV TM Sample Mutations Variation and Repeats nextstrainSamples21J_Delta 21J/Delta Mutations Mutations in Clade 21J/Delta Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21I_Delta 21I/Delta Mutations Mutations in Clade 21I/Delta Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21H_Mu 21H/Mu Mutations Mutations in Clade 21H/Mu Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21G_Lambda 21G/Lambda Mutations Mutations in Clade 21G/Lambda Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21F_Iota 21F/Iota Mutations Mutations in Clade 21F/Iota Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21E_Theta 21E/Theta Mutations Mutations in Clade 21E/Theta Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21D_Eta 21D/Eta Mutations Mutations in Clade 21D/Eta Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21B_Kappa 21B/Kappa Mutations Mutations in Clade 21B/Kappa Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples21A_Delta 21A/Delta Mutations Mutations in Clade 21A/Delta Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20G 20G Mutations Mutations in Clade 20G Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20F 20F Mutations Mutations in Clade 20F Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20D 20D Mutations Mutations in Clade 20D Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20C 20C Mutations Mutations in Clade 20C Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20B 20B Mutations Mutations in Clade 
20B Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples20A 20A Mutations Mutations in Clade 20A Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples19B 19B Mutations Mutations in Clade 19B Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamples19A 19A Mutations Mutations in Clade 19A Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats nextstrainSamplesViewAll All Samples Nextstrain Subset of GISAID EpiCoV TM Sample Mutations Variation and Repeats nextstrainSamplesAll All Mutations in Nextstrain Subset of GISAID EpiCov TM Samples Variation and Repeats nextstrainSamplesRb Rec Bi-allelic Recurrent Bi-allelic Mutations in Nextstrain Subset of GISAID EpiCoV TM Samples Variation and Repeats strainCons44way 44 Bat CoVs Multiz Alignment & Conservation (44 Strains with bats as hosts) Comparative Genomics Downloads for data in this track are available: Multiz alignments (MAF format), and phylogenetic trees PhyloP conservation (WIG format) PhastCons conservation (WIG format) Description This track shows multiple alignments of 44 virus sequences, aligned to the SARS-CoV-2 reference sequence NC_045512.2, genome assembly GCF_009858895.2. It also includes measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 44 virus sequences. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track. PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. 
By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data. In the track display, the sequence is labeled using its NCBI Nucleotide accession number. The mapping between sequence accession identifiers and more descriptive names is provided via a text file on our download server. Display Conventions and Configuration Pairwise alignments of each species to the SARS-CoV-2 genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity. 
In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks are drawn in one of four colors:
- Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated.
- Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated.
- Blue blocks are drawn when a polymorphism is outside a coding region.
- Pale yellow blocks are drawn when there are no aligning bases to that region in the reference genome.

Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults). To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Base Level

When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the SARS-CoV-2 sequence at those alignment positions relative to the longest non-SARS-CoV-2 sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".

Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:
- No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.

Methods

Pairwise alignments with the reference sequence were generated for each sequence using lastz version 1.04.00. Parameters used for each lastz alignment:

# hsp_threshold      = 2200
# gapped_threshold   = 4000 = L
# x_drop             = 910
# y_drop             = 3400 = Y
# gap_open_penalty   = 400
# gap_extend_penalty = 30
#        A     C     G     T
#   A    91   -90   -25  -100
#   C   -90   100  -100   -25
#   G   -25  -100   100   -90
#   T  -100   -25   -90    91
# seed=1110100110010101111 w/transition
# step=1

Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Parameters used in the chaining (axtChain) step: -minScore=10 -linearGap=loose

High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net.
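As a rough illustration of how the substitution matrix in the lastz parameter listing above is applied, the following Python sketch sums the per-column scores of two gap-free, equal-length sequences. It is only a sketch under stated assumptions: the function name ungapped_score is hypothetical, only the bases A, C, G and T are handled, and the gap penalties, seeding, and x-drop/y-drop logic that lastz itself uses are not modeled.

```python
# Substitution scores copied from the lastz parameter listing above.
BASES = "ACGT"
MATRIX = [
    [91, -90, -25, -100],   # A vs A, C, G, T
    [-90, 100, -100, -25],  # C vs A, C, G, T
    [-25, -100, 100, -90],  # G vs A, C, G, T
    [-100, -25, -90, 91],   # T vs A, C, G, T
]
SCORES = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def ungapped_score(ref: str, query: str) -> int:
    """Sum substitution scores over two equal-length, gap-free sequences.

    Hypothetical helper for illustration; not part of lastz.
    """
    if len(ref) != len(query):
        raise ValueError("sequences must have the same length")
    return sum(SCORES[(r, q)] for r, q in zip(ref.upper(), query.upper()))


if __name__ == "__main__":
    # Identical sequences score highest; transitions (A<->G, C<->T)
    # are penalized less than transversions.
    print(ungapped_score("ACGT", "ACGT"))
    print(ungapped_score("AC", "GT"))
```

The matrix is symmetric, so the order of the two sequences does not change the result.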
count  sample date  accession    phylogenetic distance  descriptive name
001    2019-12      NC_045512.2  0.000000               SARS-CoV-2/Wuhan-Hu-1
002    2013-07-24   MN996532.1   0.111391               Bat CoV RaTG13
003    2005-10-25   DQ022305.2   0.756533               Bat SARS CoV HKU3-1
004    2010-04-05   GQ153542.1   0.758373               Bat SARS CoV HKU3-7
005    2010-04-05   GQ153547.1   0.758589               Bat SARS CoV HKU3-12
006    2011-09      JX993987.1   0.825373               Bat CoV Rp/Shaanxi2011
007    2013         KJ473814.1   0.844563               BtRs-BetaCoV/HuB2013
008    2006-07-13   DQ412043.1   0.861670               Bat SARS CoV Rm1
009    2011         JX993988.1   0.866485               Bat CoV Cp/Yunnan2011
010    2006         FJ588686.1   0.870015               Bat SARS CoV Rs672/2006
011    2006-07-13   DQ412042.1   0.873059               Bat SARS CoV Rf1
012    2006-07-19   DQ648856.1   0.874586               Bat CoV (BtCoV/273/2005)
013    2013         KJ473812.1   0.876344               BtRf-BetaCoV/HeB2013
014    2016-09      MK211375.1   0.880260               CoV BtRs-BetaCoV/YN2018A
015    2013-04-17   KY417147.1   0.883717               Bat SARS-like CoV Rs4237
016    2012         KY770860.1   0.884677               Bat CoV Jiyuan-84
017    2013-04-17   KY417149.1   0.886441               Bat SARS-like CoV Rs4255
018    2012         KU973692.1   0.886655               UNVERIFIED: SARS-related CoV F46
019    2016-04-17   KY938558.1   0.886844               Bat CoV strain 16BO133
020    2016-09      MK211378.1   0.887400               CoV BtRs-BetaCoV/YN2018D
021    2014-05-12   KY417142.1   0.888076               Bat SARS-like CoV As6526
022    2013         KY770858.1   0.889779               Bat CoV Anlong-103
023    2016-09      MK211377.1   0.890783               CoV BtRs-BetaCoV/YN2018C
024    2012-09-18   KY417145.1   0.891547               Bat SARS-like CoV Rf4092
025    2013         KJ473816.1   0.892938               BtRs-BetaCoV/YN2013
026    2018-08-13   NC_004718.3  0.896070               SARS CoV
027    2013-04-17   KY417148.1   0.897176               Bat SARS-like CoV Rs4247
028    2012-09-18   KY417143.1   0.898813               Bat SARS-like CoV Rs4081
029    2013         KJ473815.1   0.900478               BtRs-BetaCoV/GX2013
030    2006-01-25   DQ071615.1   0.903660               Bat SARS CoV Rp3
031    2013-05-23   KP886808.1   0.914845               Bat SARS-like CoV YNLF_31C
032    2016-08      MK211374.1   0.920214               CoV BtRl-BetaCoV/SC2018
033    2016-09      MK211376.1   0.932471               CoV BtRs-BetaCoV/YN2018B
034    2012-09      KF367457.1   0.935102               Bat SARS-like CoV WIV1
035    2012-09-18   KY417144.1   0.938296               Bat SARS-like CoV Rs4084
036    2015-10-16   KY417152.1   0.938841               Bat SARS-like CoV Rs9401
037    2011         KF569996.1   0.940405               Rhinolophus affinis CoV LYRa11
038    2014-10-24   KY417151.1   0.945367               Bat SARS-like CoV Rs7327
039    2013-04-17   KY417146.1   0.946050               Bat SARS-like CoV Rs4231
040    2013-07-21   KT444582.1   0.961789               SARS-like CoV WIV16
041    2007-08      KY352407.1   1.063753               SARS-related CoV strain BtKY72
042    2008         NC_014470.1  1.075344               Bat CoV BM48-31/BGR/2008
043    2017-02      MG772933.1   1.076854               Bat SARS-like CoV bat-SL-CoVZC45
044    2015-07      MG772934.1   1.106462               Bat SARS-like CoV bat-SL-CoVZXC21

The multiple alignment was constructed from the resulting pairwise alignments, progressively aligned using multiz/autoMZ. The phylogenetic tree was calculated from 31-mer frequency similarity, with neighbor joining applied to the resulting distance matrix using the neighbor command from the PHYLIP toolset. The reference sequence NC_045512v2 is at the top of the tree:

(((NC_045512v2 MN996532v1) ((((DQ022305v2 GQ153547v1) GQ153542v1) (MG772933v1 MG772934v1)) ((((((DQ071615v1 KJ473815v1) ((((FJ588686v1 KY770858v1) ((((((KF367457v1 KY417144v1) (KY417151v1 KY417152v1)) ((KY417142v1 MK211377v1) (MK211376v1 MK211378v1))) ((((KT444582v1 KY417143v1) KY417149v1) KY417146v1) (KY417147v1 KY417148v1))) (KJ473816v1 KY417145v1)) MK211375v1)) NC_004718v3) KP886808v1)) MK211374v1) (KF569996v1 KU973692v1)) JX993988v1) ((((DQ412042v1 DQ648856v1) (KJ473812v1 KY770860v1)) KY938558v1) ((DQ412043v1 KJ473814v1) JX993987v1))))) (KY352407v1 NC_014470v1))

Framing tables from the genes were constructed to enable visualization of codons in the multiple alignment display.

Phylogenetic Tree Model

Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. The all-species tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 44-way alignment (msa_view).
The 4d sites were derived from the NCBI gene set, filtered to select single-coverage long transcripts. This same tree model was used in the phyloP calculations; however, the background frequencies were modified to maintain reversibility. The resulting tree model: all species. PhastCons Conservation The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly-conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al, 2005. The phastCons parameters used were: expected-length=45, target-coverage=0.3, rho=0.3. PhyloP Conservation The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements (http://compgen.cshl.edu/phast/). 
Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC). Conserved Elements The conserved elements were predicted by running phastCons with the --most-conserved option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz, Robert Harris, and Webb Miller of the Penn State Bioinformatics Group Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). 
Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC References Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632; Supplemental Materials and Methods Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. 
PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Lastz (formerly Blastz): Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 Funding This annotation track in the UCSC SARS-CoV-2 genome browser is funded by generous private donors to the UC Santa Cruz Genomics Institute. strainCons44wayViewalign Multiz Alignments Multiz Alignment & Conservation (44 Strains with bats as hosts) Comparative Genomics strainName44way Bat CoV multiz Multiz Alignment of 44 strains with bats as hosts: red=nonsyn green=syn blue=noncod yellow=noalign Comparative Genomics strainCons44wayViewphastcons Bat Element Conservation (phastCons) Multiz Alignment & Conservation (44 Strains with bats as hosts) Comparative Genomics strainPhastCons44way Bat PhastCons 44 bat virus strains Basewise Conservation by PhastCons Comparative Genomics strainCons44wayViewphyloP Basewise Conservation (phyloP) Multiz Alignment & Conservation (44 Strains with bats as hosts) Comparative Genomics strainPhyloP44way Bat PhyloP 44 Bat virus strains Basewise Conservation by PhyloP Comparative Genomics cpgIslandExt CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Expression and Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. 
Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page.

Methods

CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria:
GC content of 50% or greater
length greater than 200 bp
ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment

The entire genome sequence, masking areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al.
(1987)):

Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)

where N = length of sequence. The calculation of the track data is performed by the following command sequence:

twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
  | cpg_lh /dev/stdin 2> cpg_lh.err \
  | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
  | sort -k1,1 -k2,2n > cpgIsland.bed

The unmasked track data is constructed from twoBitToFa -noMask output for the twoBitToFa command.

Data access

CpG islands and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")

Credits

This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished).

References

Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447

cpgIslandSuper CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Expression and Regulation

Description

CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication.
However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page.

Methods

CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria:
GC content of 50% or greater
length greater than 200 bp
ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment

The entire genome sequence, masking areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)):

Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)

where N = length of sequence.
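The three segment criteria and the Obs/Exp formula above can be sketched in a few lines of Python. This is an illustrative sketch only, not the cpg_lh implementation: the function name cpg_stats and its returned field names are hypothetical, and it evaluates a single candidate segment rather than performing the maximal-scoring-segment search.

```python
def cpg_stats(seq: str) -> dict:
    """Compute CpG statistics for one candidate segment.

    Hypothetical helper for illustration; it applies the criteria
    from the text (GC content >= 50%, length > 200 bp,
    Obs/Exp CpG > 0.6) to a single sequence.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (c + g) / n
    # Percentage CpG: CpG bases (twice the CpG count) relative to length.
    per_cpg = 100.0 * 2 * cpg / n
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return {
        "length": n,
        "cpg_count": cpg,
        "gc_percent": gc_percent,
        "per_cpg": per_cpg,
        "obs_exp": obs_exp,
        "is_island": gc_percent >= 50.0 and n > 200 and obs_exp > 0.6,
    }


if __name__ == "__main__":
    stats = cpg_stats("CG" * 150)
    print(stats["obs_exp"], stats["is_island"])
```

A segment can have a high Obs/Exp ratio and still fail the test on length alone, which is why the three criteria are checked together.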
The calculation of the track data is performed by the following command sequence:

twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
  | cpg_lh /dev/stdin 2> cpg_lh.err \
  | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
  | sort -k1,1 -k2,2n > cpgIsland.bed

The unmasked track data is constructed from twoBitToFa -noMask output for the twoBitToFa command.

Data access

CpG islands and its associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator. All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")

Credits

This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished).

References

Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447

treangen Intrahost SNPs Intrahost SNP patient data from Todd Treangen's group Variation and Repeats

Description

This track shows iSNPs (intrahost SNPs). These are SNPs that have evidence for variation within one host. That is, a single patient can have variation among the various SARS-CoV-2 viruses infecting their cells. This variation is lost when a single consensus genome sequence is reported for a patient. The data were published in Sapoval et al, 2020 "Hidden genomic diversity of SARS-CoV-2: implications for qRT-PCR diagnostics and transmission". In this track, iSNPs of human patients from New York City and Houston are shown.
Display Conventions and Configuration The track contains a list of iSNPs found in patient data from New York City and Houston with nucleotide and amino acid changes, one feature per variant. The name field in this track represents the median observed allele frequency for patients meeting inclusion criteria in the VCFs provided by Sapoval et al. Finally, bedToBigBed was used to create the BigBed track. Interested users may wish to inspect each of the individual VCFs; for this track we have chosen to show a condensed version of all VCFs (see Methods). Methods VCF files were downloaded from the Rice University data repository. SARS-CoV-2 iSNPs from New York City and Houston patient data were parsed, and if the base position was modified in more than one sample then it was included. The frequency of observing a particular base (A,C,G,T) at the position when a change was recorded was then included, and the dominant base change was used to determine whether the base modification would also result in an amino acid change. The original data files are available from a shared box.com folder with VCF files. References Sapoval N, Mahmoud M, Jochum MD, Liu Y, Leo Elworth RA, Wang Q, Albin D, Ogilvie H, Lee MD, Villapol S et al. Hidden genomic diversity of SARS-CoV-2: implications for qRT-PCR diagnostics and transmission. bioRxiv. 2020 Jul 2;. PMID: 32637955; PMC: PMC7337385 pond Nat. Selection (Pond) Natural selection analysis from Sergei Pond's research group Variation and Repeats Description This track shows data from Sergei Pond's research group, updated several times between 2020 and 2022, with results published in 2022. The current dataset is from February 2022 and is scheduled to be updated soon. Contact us or Sergei if you believe that the data shown is too outdated for your analyses. The authors use several statistical techniques to identify selection sites of interest in SARS-CoV-2 data from GISAID. 
Display Conventions and Configuration This track has two subtracks: Positive Selection: "On average along interior tree branches, this site has a dN/dS>1 is accumulating non-synonymous changes (some of which might have a functional impact, but most probably don't) faster relative to synonymous changes than would be expected under neutral evolution." Negative Selection: "On average along interior tree branches, this site has a dN/dS<1, meaning that it is conserved, i.e. non-synonymous changes might be selected against. Note that sites with no changes (i.e. perfectly conserved sites) cannot be detected by dN/dS based methods" Methods The CSV used to generate the genomic coordinates of selection sites was parsed and the position, gene, site_in_gene, score, and type fields were used to generate the resulting fields provided for each site in the data. References Pond et al, 2020 "Natural selection analysis of SARS-CoV-2/COVID-19" Pond et al, 2020 "Natural selection analysis of SARS-CoV-2/COVID-19, V2" Martin DP, Lytras S, Lucaci AG, Maier W, Grüning B, Shank SD, Weaver S, MacLean OA, Orton RJ, Lemey P et al. Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function. Mol Biol Evol. 2022 Apr 11;39(4). PMID: 35325204; PMC: PMC9037384 Positive_Selection Positive Selection Sites of positive selection implicated in data from Sergei Pond's research group Variation and Repeats Negative_Selection Negative Selection Sites of negative selection implicated in data from Sergei Pond's research group Variation and Repeats ncbiGeneBGP NCBI Genes NCBI Genes from NC_045512.2 Genes and Gene Predictions Description The NCBI Gene track for the 13 Jan 2020 SARS-CoV-2 virus/GCF_009858895.2 genome assembly is constructed from the NCBI nuccore entry for NC_045512.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2 Data Access The raw data can be explored interactively with the Table Browser, or the Data Integrator. 
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found on our utilities page. The tool can also be used to obtain features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/ncbiGene.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits This track was created by Max Haeussler and Brian Raney at UCSC, with help from Daniel Schmelter and many others. Thanks to NCBI and the US National Institutes of Health for making all data available for download. ncbiProducts NCBI Proteins NCBI Proteins: annotated mature peptide products Genes and Gene Predictions Description The NCBI Mature Proteins track for the 13 Jan 2020 SARS-CoV-2 virus/GCF_009858895.2 genome assembly is constructed from the NCBI nuccore entry for NC_045512.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2 It shows the mature peptides, after cleavage, as annotated on the Genbank record. Data Access The raw data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found on our utilities page. 
The tool can also be used to obtain features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/ncbiGene.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits This track was created by Max Haeussler and Brian Raney at UCSC, with help from Daniel Schmelter and many others. Thanks to NCBI and the US National Institutes of Health for making all data available for download. ORFs ORF predictions Weizman ORF predictions Genes and Gene Predictions Description The Weizman ORFs (Open Reading Frames) track shows previously unannotated ORF predictions based on Ribo-Seq and RNA-seq data. It is a collection of tracks (super track) that contains not only the predicted gene models, but also data supporting them. Display Conventions and Configuration The Predicted ORFs track shows the predicted exons. All other tracks show the signal as a x-y plot with bars. Methods Methods from Finkel et al: To capture the full SARS-CoV-2 coding capacity, we applied a suite of ribosome profiling approaches to Vero cells infected with SARS-CoV-2 for 5 and 24 hours, and Calu3 cells infected for 7 hours. For each time point we prepared three different ribosome-profiling libraries, each one in two biological replicates. Two Ribo-seq libraries facilitate mapping of translation initiation sites, by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent 80S ribosomes at translation initiation sites from elongating. The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and gives a snap-shot of actively translating ribosomes across the body of the translated ORF. In parallel, RNA-sequencing was applied to map viral transcripts. 
The ORF prediction was done by using two computational tools, PRICE and ORF-RATER, that rely on different features of ribosome profiling data, and by manual inspection of the data. The predictions are based on Ribo-seq libraries from two time points (5 and 7 hpi) of two different cell lines (Vero E6 and Calu3 cells), infected with separate virus isolates. The Ribo-Seq data of the 24 hours samples do not show the expected profile of read distribution on viral genes and therefore were not used for the procedure of ORF predictions. For more details see the paper in the References section below. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O et al. The coding capacity of SARS-CoV-2. Nature. 2020 Sep 9;. PMID: 32906143 weizmanOrfs Weizman ORFs New ORFs based on RNA-seq and Ribo-seq by the Weizman Institute Genes and Gene Predictions Description The Weizman ORFs (Open Reading Frames) track shows previously unannotated ORF predictions based on Ribo-Seq and RNA-seq data. It is a collection of tracks (super track) that contains not only the predicted gene models, but also data supporting them. Display Conventions and Configuration The Predicted ORFs track shows the predicted exons. All other tracks show the signal as a x-y plot with bars. Methods Methods from Finkel et al: To capture the full SARS-CoV-2 coding capacity, we applied a suite of ribosome profiling approaches to Vero cells infected with SARS-CoV-2 for 5 and 24 hours, and Calu3 cells infected for 7 hours. For each time point we prepared three different ribosome-profiling libraries, each one in two biological replicates. 
Two Ribo-seq libraries facilitate mapping of translation initiation sites, by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent 80S ribosomes at translation initiation sites from elongating. The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and gives a snap-shot of actively translating ribosomes across the body of the translated ORF. In parallel, RNA-sequencing was applied to map viral transcripts. The ORF prediction was done by using two computational tools, PRICE and ORF-RATER, that rely on different features of ribosome profiling data, and by manual inspection of the data. The predictions are based on Ribo-seq libraries from two time points (5 and 7 hpi) of two different cell lines (Vero E6 and Calu3 cells), infected with separate virus isolates. The Ribo-Seq data of the 24 hours samples do not show the expected profile of read distribution on viral genes and therefore were not used for the procedure of ORF predictions. For more details see the paper in the References section below. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O et al. The coding capacity of SARS-CoV-2. Nature. 2020 Sep 9;. PMID: 32906143 unipCov2AliSwissprot Protein Alignments UCSC alignment of full-length SwissProt proteins to genome UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. It shows how the protein sequences in this database map to the genome. 
This mapping was used to "lift" the UniProt protein annotations to the SARS-CoV-2 genome. The protein annotations themselves have been curated from scientific publications by the UniProt/SwissProt staff.

Display Conventions and Configuration

Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). Clicking on an annotation shows the full annotation and provides a link to the UniProt/SwissProt record for more details. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym.

Methods

UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. For human and mouse, the alignments were filtered by retaining only proteins annotated with a given transcript in the Genome Browser table kgXref. Like all Genome Browser source code, the main script used to build this track can be found on GitHub.

Data Access

The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here.
The tool can also be used to obtain only features within a given range, for example:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipAliSwissprotCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout
Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Alejo Mujica, Regeneron Pharmaceuticals. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 Starr_Bloom_bind RBD Mut Bind S RBD Deep Mutational Scanning: ACE2 Binding (Jesse Bloom's Group) Immunology Description This track contains deep mutational scanning data measuring how changing a wild-type allele to a mutant allele affects ACE2 binding. The authors use a yeast display system to experimentally measure the effect of all possible point (amino acid) RBD mutations on protein expression and ACE2 affinity. Display Conventions and Configuration Each subtrack contains all the scores representing mutations to a particular amino acid (each annotation is an S codon). For instance, the A subtrack measures the change in ACE2 binding of S RBD if the annotated amino acid is mutated to alanine (if the wild-type amino acid is A, then the score is 0). A positive score indicates increased binding; a negative score indicates a loss of binding.
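The display conventions above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the track: the constant `S_CDS_START` (the 1-based start of the S coding sequence on the reference), the helper names and the example score are all assumptions introduced here. It shows how an S residue number maps to the 0-based, half-open codon interval that a bedGraph line would carry, and how the sign of a score is read.

```python
# Assumption: 1-based start of the S coding sequence on NC_045512.2.
S_CDS_START = 21563
CHROM = "NC_045512v2"


def codon_interval(residue):
    """Return the 0-based, half-open genomic interval of an S codon."""
    start = S_CDS_START - 1 + 3 * (residue - 1)
    return start, start + 3


def bedgraph_line(residue, avg_score):
    """Format one bedGraph line; missing (NA) scores are dropped."""
    if avg_score is None:
        return None
    start, end = codon_interval(residue)
    return f"{CHROM}\t{start}\t{end}\t{avg_score:.2f}"


def describe(score):
    """Read the sign of a binding score as described in the text."""
    if score > 0:
        return "increased binding"
    if score < 0:
        return "loss of binding"
    return "no change (e.g. wild-type residue)"


# Hypothetical score for residue 501, used only to show the output shape.
print(bedgraph_line(501, 0.24), describe(0.24))
```

Scores of `None` stand in for the NA values that the Methods section says were filtered out.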
Please see the interactive heatmap generated by the authors at this link. Structural visualizations of the data are available from the authors via dms-view here. Methods Table S2 from Starr et al. was downloaded and parsed into bedGraph format using the average of the two reported replicates. All NA values were filtered out. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KHD, Dingens AS, Navarro MJ, Bowen JE, Tortorici MA, Walls AC et al. Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding. Cell. 2020 Sep 3;182(5):1295-1310.e20. PMID: 32841599; PMC: PMC7418704 Y_bind_avg Y_bind_avg DMS data for RBD Binding Immunology W_bind_avg W_bind_avg DMS data for RBD Binding Immunology V_bind_avg V_bind_avg DMS data for RBD Binding Immunology T_bind_avg T_bind_avg DMS data for RBD Binding Immunology S_bind_avg S_bind_avg DMS data for RBD Binding Immunology R_bind_avg R_bind_avg DMS data for RBD Binding Immunology Q_bind_avg Q_bind_avg DMS data for RBD Binding Immunology P_bind_avg P_bind_avg DMS data for RBD Binding Immunology N_bind_avg N_bind_avg DMS data for RBD Binding Immunology M_bind_avg M_bind_avg DMS data for RBD Binding Immunology L_bind_avg L_bind_avg DMS data for RBD Binding Immunology K_bind_avg K_bind_avg DMS data for RBD Binding Immunology I_bind_avg I_bind_avg DMS data for RBD Binding Immunology H_bind_avg H_bind_avg DMS data for RBD Binding Immunology G_bind_avg G_bind_avg DMS data for RBD Binding Immunology F_bind_avg F_bind_avg DMS data for RBD Binding Immunology E_bind_avg E_bind_avg DMS data for RBD Binding Immunology D_bind_avg D_bind_avg DMS data for RBD Binding Immunology C_bind_avg C_bind_avg DMS data for RBD Binding Immunology A_bind_avg
A_bind_avg DMS data for RBD Binding Immunology Starr_Bloom RBD Mut Expr S RBD Deep Mutational Scanning: Expression (Jesse Bloom's Group) Immunology Description This track contains deep mutational scanning data measuring the change in expression, ACE2 binding or antibody binding from a wild-type allele to a mutant allele. The authors use a yeast display system to experimentally measure the effect of all possible point (amino acid) RBD mutations on protein expression, ACE2 affinity or antibody binding. Display Conventions and Configuration Each subtrack contains all the scores representing mutations to a particular amino acid (each annotation is an S codon). For instance, the A subtrack measures the change in expression of S RBD if the annotated amino acid is mutated to alanine (if the wild-type amino acid is A, then the score is 0). A positive score indicates increased expression; a negative score indicates a loss of expression. Please see the interactive heatmap generated by the authors at this link. Structural visualizations of the data are available from the authors via dms-view here. Methods Table S2 from Starr et al. was downloaded and parsed into bedGraph format using the average of the two reported replicates. All NA values were filtered out. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KHD, Dingens AS, Navarro MJ, Bowen JE, Tortorici MA, Walls AC et al. Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding. Cell. 2020 Sep 3;182(5):1295-1310.e20.
PMID: 32841599; PMC: PMC7418704 Y_expr_avg_Expression Y_expr_avg DMS data for RBD expression Immunology W_expr_avg_Expression W_expr_avg DMS data for RBD expression Immunology V_expr_avg_Expression V_expr_avg DMS data for RBD expression Immunology T_expr_avg_Expression T_expr_avg DMS data for RBD expression Immunology S_expr_avg_Expression S_expr_avg DMS data for RBD expression Immunology R_expr_avg_Expression R_expr_avg DMS data for RBD expression Immunology Q_expr_avg_Expression Q_expr_avg DMS data for RBD expression Immunology P_expr_avg_Expression P_expr_avg DMS data for RBD expression Immunology N_expr_avg_Expression N_expr_avg DMS data for RBD expression Immunology M_expr_avg_Expression M_expr_avg DMS data for RBD expression Immunology L_expr_avg_Expression L_expr_avg DMS data for RBD expression Immunology K_expr_avg_Expression K_expr_avg DMS data for RBD expression Immunology I_expr_avg_Expression I_expr_avg DMS data for RBD expression Immunology H_expr_avg_Expression H_expr_avg DMS data for RBD expression Immunology G_expr_avg_Expression G_expr_avg DMS data for RBD expression Immunology F_expr_avg_Expression F_expr_avg DMS data for RBD expression Immunology E_expr_avg_Expression E_expr_avg DMS data for RBD expression Immunology D_expr_avg_Expression D_expr_avg DMS data for RBD expression Immunology C_expr_avg_Expression C_expr_avg DMS data for RBD expression Immunology A_expr_avg_Expression A_expr_avg DMS data for RBD expression Immunology pyle SHAPE Struct Pyle RNA SHAPE Structure from the Pyle group mRNA and EST Description This track shows data from Anna Pyle's lab at Yale University describing the RNA secondary structure of the SARS-CoV-2 genome. The authors performed experimental measurements using SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) as well as in silico analysis. These data are described in depth in de Cesaris Araujo Tavares et al. and Huston, Wan, et al.
Display Conventions and Configuration Two tracks are available: SHAPE reactivity: A low SHAPE score (<0.4) indicates that the nucleotide is not accessible (base-paired, i.e. folded between complementary regions). A mid (0.4<SHAPE<0.85) or high (>0.85) SHAPE reactivity indicates that the nucleotide is flexible (single-stranded). A score of -999 means that no data for that nucleotide was recovered experimentally. For visualization purposes the minimum display value has been set to -0.5 and the maximum to 4. Full Length Shannon Entropy: A genome-wide Shannon entropy profile derived from base pairing probabilities was computed using SuperFold with in vivo SHAPE reactivity as constraints. A low Shannon entropy (near 0) is evidence for a well-determined RNA conformation. Note that the authors provide further data (including secondary structure predictions) at their GitHub repository. Methods For Shannon entropy, SuperFold was used with default settings to predict secondary structure for the full-length SARS-CoV-2 RNA genome. For SHAPE_reactivity, VeroE6 cells were infected with 10^5 PFU of SARS-CoV-2 isolate USA-WA1/2020. See the preprint for further details. wigToBigWig was used to convert wiggle files from the authors' GitHub repository to bigWig files after filtering out headers. The GitHub repository with all raw data files can be found here: https://github.com/pylelab/SARS-CoV-2_SHAPE_MaP_structure. References Huston NC, Wan H, Araujo Tavares RC, Wilen C, Pyle AM. Comprehensive in-vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms. bioRxiv. 2020 Jul 10;. PMID: 32676598; PMC: PMC7359520 Araujo Tavares RC, Mahadeshwar G, Pyle AM. The global and local distribution of RNA structure throughout the SARS-CoV-2 genome. bioRxiv. 2020 Jul 7;.
doi: 190660 Entropy Shannon Entropy Shannon Entropy derived from the in vivo SHAPE constrained Superfold structure prediction mRNA and EST SHAPE SHAPE Reactivity SHAPE Reactivity (VeroE6 cells, virus isolate USA-WA1/2020) mRNA and EST strainCons119way 119 Vertebrate CoVs Multiz Alignment & Conservation (119 strains: strains with vertebrate hosts and human SARS-Cov2) Comparative Genomics Downloads for data in this track are available: Multiz alignments (MAF format) and phylogenetic trees; PhyloP conservation (WIG format); PhastCons conservation (WIG format). Description This track shows multiple alignments of 119 virus sequences, aligned to the SARS-CoV-2 reference sequence SARS-CoV-2/NC_045512.2, genome assembly GCF_009858895.2_ASM985889v3. These 119 sequences are from very different coronavirus strains: All 55 reference genomes annotated as "Coronaviridae" in NCBI Viral Genomes. These cover very different vertebrate hosts, from bat to wigeon. A subset of sequences from the 2019/2020 outbreak, from GenBank (52 sequences) and BIGD (12 sequences). It also includes measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 119 virus sequences. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track. PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors.
As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data. In the track display, the sequence is labeled using its NCBI Nucleotide accession number. The mapping between sequence accession identifiers and more descriptive names is provided via a text file on our download server. Display Conventions and Configuration Pairwise alignments of each species to the SARS-CoV-2 genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity. In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks of four colors are drawn.
Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated. Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated. Blue blocks are drawn when a polymorphism is outside a coding region. Pale yellow blocks are drawn when there are no aligning bases to that region in the reference genome. Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults). To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment. Base Level When zoomed-in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the SARS-CoV-2 sequence at those alignment positions relative to the longest non-SARS-CoV-2 sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+". Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes: No codon translation: The gene annotation is not used; the bases are displayed without translation. Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment. 
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding. Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation. Methods Pairwise alignments with the reference sequence were generated for each sequence using lastz version 1.04.00. Parameters used for each lastz alignment:
# hsp_threshold      = 2200
# gapped_threshold   = 4000 = L
# x_drop             = 910
# y_drop             = 3400 = Y
# gap_open_penalty   = 400
# gap_extend_penalty = 30
#        A    C    G    T
#   A   91  -90  -25 -100
#   C  -90  100 -100  -25
#   G  -25 -100  100  -90
#   T -100  -25  -90   91
# seed=1110100110010101111 w/transition
# step=1
Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Parameters used in the chaining (axtChain) step: -minScore=10 -linearGap=loose High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net.
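To make the lastz parameters above concrete, the following Python sketch scores an already-aligned pair of sequences with the listed substitution matrix and gap penalties. It is an illustration only and does not reproduce lastz itself (no seeding, x-drop or y-drop logic). The function name is hypothetical, and the gap cost model (open penalty charged once per gap, extend penalty charged for every gap column) is an assumption about how the two penalties combine.

```python
# Substitution scores copied from the parameter listing (rows/columns: A C G T).
BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}
MATRIX = [
    [91, -90, -25, -100],
    [-90, 100, -100, -25],
    [-25, -100, 100, -90],
    [-100, -25, -90, 91],
]
GAP_OPEN = 400
GAP_EXTEND = 30


def alignment_score(ref, query):
    """Score two equal-length aligned strings; '-' marks a gap column."""
    if len(ref) != len(query):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for r, q in zip(ref, query):
        if r == "-" and q == "-":
            raise ValueError("column with gaps in both sequences")
        if r == "-" or q == "-":
            side = "ref" if r == "-" else "query"
            if side != gap_side:
                score -= GAP_OPEN  # assumed: charged once when a gap starts
            score -= GAP_EXTEND    # assumed: charged for every gap column
            gap_side = side
        else:
            score += MATRIX[INDEX[r]][INDEX[q]]
            gap_side = None
    return score


print(alignment_score("ACGT", "ACGT"))
print(alignment_score("ACGT", "AC-T"))
```

Under these assumptions a single one-base gap costs more than four matching bases gain, which is why short alignments with gaps score poorly against the hsp and gapped thresholds listed above.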
count  sample date  accession  phylogenetic distance  descriptive name
001  2019-12-30  NC_045512.2  0.000000  Wuhan-Hu-1
002  2020-01-02  MN988668.1  0.000003  2019-nCoV WHU01
003  2019-12-30  MN996528.1  0.000003  WIV04
004  2019-12-30  MT019532.1  0.000003  BetaCoV/Wuhan/IPBCAMS-WH-04/2019
005  2020-01-01  LR757996.1  0.000004  BetaCoV/Wuhan/WH-03/2019
006  2019-12-30  MN996530.1  0.000006  WIV06
007  2020-01-20  MT039873.1  0.000006  BetaCoV/Hangzhou/HZ-1/2020 20cov-1L
008  2020-01-07  NMDC60013002-07  0.000006  BetaCoV/Wuhan/YS8011/2020
009  2020-01-01  MT019533.1  0.000008  BetaCoV/Wuhan/IPBCAMS-WH-05/2020
010  2020-02-10  MT106053.1  0.000008  2019-nCoV/USA-CA8/2020
011  2020-02-23  MT118835.1  0.000008  2019-nCoV/USA-CA9/2020
012  2019-12-30  NMDC60013002-06  0.000008  BetaCoV/Wuhan/WH19008/2019
013  2019-12-30  MT019531.1  0.000010  BetaCoV/Wuhan/IPBCAMS-WH-03/2019
014  2019-12-30  NMDC60013002-10  0.000010  BetaCoV/Wuhan/WH19005/2019
015  2019-12-30  MT019530.1  0.000011  BetaCoV/Wuhan/IPBCAMS-WH-02/2019
016  2020-01-13  MT072688.1  0.000012  SARS0CoV-2/61-TW/human/2020/NPL
017  2020-01-08  MT093631.1  0.000013  SARS-CoV-2/WH-09/human/2020/CHN
018  2020-01-31  MT039887.1  0.000014  2019-nCoV/USA-WI1/2020
019  2020-01-29  MT027064.1  0.000015  2019-nCoV/USA-CA5/2020
020  2020-01-22  MN994468.1  0.000016  2019-nCoV/USA-CA2/2020
021  2020-01-01  NMDC60013002-09  0.000017  BetaCoV/Wuhan/WH19004/2020
022  2020-02-05  MT066176.1  0.000018  BetaCov/Taiwan/NTU02/2020
023  2020-02-10  LC528232.1  0.000020  SARS-CoV-2/Hu/DP/Kng/19-020
024  2020-02-10  LC528233.1  0.000020  SARS-CoV-2/Hu/DP/Kng/19-027
025  2019-12-30  MN996531.1  0.000020  WIV07
026  2019-12-30  MN996529.1  0.000021  WIV05
027  2020-01-27  MT044258.1  0.000022  SARS-CoV-2/CA6/human/2020/USA
028  2019-12-26  LR757998.1  0.000023  BetaCoV/Wuhan/WH-01/2019
029  2019-12-30  MN996527.1  0.000024  WIV02
030  2020-01-29  MT027062.1  0.000025  2019-nCoV/USA-CA3/2020
031  2020-02-05  MT123290.1  0.000026  SARS-CoV-2/IQTC01/human/2020/CHN
032  2020-01-25  LC521925  0.000027  BetaCoV/Japan/AI/I-004/2020
033  2020-01-29  LC522972  0.000027  2019-nCoV/Japan/KY/V-029/2020
034  2019-12-23  MT019529.1  0.000027  BetaCoV/Wuhan/IPBCAMS-WH-01/2019
035  2020-02-28  MT126808.1  0.000028  SARS-CoV-2/SP02/human/2020/BRA
036  2020-01-25  MT007544.1  0.000030  BetaCoV/Australia/VIC01/2020
037  2020-01-17  MT049951.1  0.000030  SARS-CoV-2/Yunnan-01/human/2020/CHN
038  2020-02-28  MT123292.1  0.000031  SARS-CoV-2/IQTC04/human/2020/CHN
039  2020-01-29  MT020781.1  0.000032  BetaCoV/Finland/1/2020
040  2020-02-06  MT106052.1  0.000032  2019-nCoV/USA-CA7/2020
041  2020-01-26  MT135041.1  0.000032  SARS-CoV-2/105/human/2020/CHN
042  2020-01-28  MT135043.1  0.000032  SARS-CoV-2/233/human/2020/CHN
043  2020-01  MN975262.1  0.000033  2019-nCoV_HKU-SZ-005b_2020
044  2020-03-05  MT152824.1  0.000033  SARS-CoV-2/WA2/human/2020/USA
045  2020-02-11  MT039888.1  0.000034  2019-nCoV/USA-MA1/2020
046  2020-01-31  LC522975  0.000035  2019-nCoV/Japan/TY/WK-521/2020
047  2020-01-29  LC522973  0.000036  2019-nCoV/Japan/TY/WK-012/2020
048  2020-01-31  LC522974  0.000036  2019-nCoV/Japan/TY/WK-501/2020
049  2020-01  MN938384.1  0.000036  2019-nCoV_HKU-SZ-002a_2020
050  2020-01-19  MN985325.1  0.000036  2019-nCoV/USA-WA1/2020
051  2020-01-22  MN997409.1  0.000036  2019-nCoV/USA-AZ1/2020
052  2020-02-11  MT106054.1  0.000036  2019-nCoV/USA-TX1/2020
053  2020-01-29  MT123291.1  0.000036  SARS-CoV-2/IQTC02/human/2020/CHN
054  2020-02-28  MT123293.1  0.000036  SARS-CoV-2/IQTC03/human/2020/CHN
055  2020-01-05  LR757995.1  0.000037  BetaCoV/Wuhan/WH-04/2019
056  2020-02-01  MT066175.1  0.000037  Taiwan/NTU01/2020
057  2020-01-23  MN994467.1  0.000038  2019-nCoV/USA-CA1/2020
058  2020-01-28  MT044257.1  0.000038  SARS-CoV-2/IL2/human/2020/USA
059  2020-02-07  MT093571.1  0.000038  SARS-CoV-2/01/human/2020/SWE
060  2020-01-21  MN988713.1  0.000039  2019-nCoV/USA-IL1/2020
061  2020-01  MT039890.1  0.000040  BetaCoV/Korea/SNU01/2020
062  2019-12-31  LR757997.1  0.001411  BetaCoV/Wuhan/WH-02/2019
063  2019-12-30  NMDC60013002-05  0.002318  BetaCoV/Wuhan/WH19002/2019
064  2013-07-24  GWHABKP00000000  0.126322  Bat CoV TG13
065  2019-03-01  GWHABKW00000000  0.318494  Pangolin-CoV-2020 MP789
066  2018-08-13  NC_004718.3  0.885159  SARS CoV
067  2018-08-13  NC_014470.1  1.088135  Bat CoV BM48-31/BGR/2008
068  2018-08-13  NC_035191.1  1.483474  Wencheng Sm shrew CoV Xingguo-101
069  2018-08-13  NC_009019.1  2.026165  Bat CoV HKU4-1
070  2018-08-13  NC_010646.1  2.170350  Beluga Whale CoV SW1
071  2018-08-13  NC_016995.1  2.225972  Wigeon CoV HKU20
072  2018-08-13  NC_001451.1  2.263777  Avian infectious bronchitis virus
073  2018-08-13  NC_026011.1  2.265050  BetaCoV HKU24 strain HKU24-R05005I
074  2018-08-13  NC_010800.1  2.317576  Turkey CoV
075  2018-08-24  NC_039207.1  2.337043  BetaCoV ErinaceusCoV/VMC/2012-174/GER/2012
076  2018-08-13  NC_025217.1  2.384179  Bat Hp-betaCoV/Zhejiang2013
077  2018-08-24  NC_038294.1  2.434929  BetaCoV England 1
078  2018-08-13  NC_019843.3  2.434930  MERS Middle East respiratory syndrome CoV
079  2018-08-13  NC_017083.1  2.491178  Rabbit CoV HKU14
080  2018-08-13  NC_016994.1  2.498432  Night-heron CoV HKU19
081  2018-08-13  NC_003045.1  2.507057  Bovine CoV
082  2019-03-10  NC_034440.1  2.542837  Bat CoV PREDICT/PDF-2180
083  2018-08-13  NC_009020.1  2.551785  Bat CoV HKU5-1
084  2019-02-21  NC_006213.1  2.589639  Human CoV OC43 strain ATCC VR-759
085  2018-08-13  NC_016996.1  2.598443  Common-moorhen CoV HKU21
086  2018-08-24  NC_011547.1  2.612548  Bulbul CoV HKU11-934
087  2018-08-13  NC_006577.2  2.649716  Human CoV HKU1
088  2018-08-13  NC_009021.1  2.778940  Bat CoV HKU9-1
089  2018-08-13  NC_012936.1  2.785289  Rat CoV Parker
090  2018-08-13  NC_028811.1  2.786411  BtMr-AlphaCoV/SAX2011
091  2018-08-13  NC_001846.1  2.792354  Mouse hepatitis virus strain MHV-A59 C12 mutant
092  2018-08-13  NC_016993.1  2.828026  Magpie-robin CoV HKU18
093  2018-08-13  NC_018871.1  2.831499  Rousettus bat CoV HKU10
094  2018-08-13  NC_030886.1  2.885630  Rousettus bat CoV GCCDC1 356
095  2018-08-13  NC_016992.1  2.887272  Sparrow CoV HKU17
096  2018-08-24  NC_039208.1  2.902341  Porcine CoV HKU15 strain HKU15-155
097  2018-08-13  NC_011550.1  2.936050  Munia CoV HKU13-3514
098  2018-08-13  NC_002645.1  2.983896  Human CoV 229E
099  2018-08-13  NC_016991.1  3.002680  White-eye CoV HKU16
100  2018-08-13  NC_034972.1  3.004210  Coronavirus AcCoV-JC34
101  2018-08-13  NC_028752.1  3.008138  Camel alphaCoV camel/Riyadh/Ry141/2015
102  2018-08-13  NC_005831.2  3.009141  Human Coronavirus NL63
103  2018-08-13  NC_023760.1  3.023450  Mink CoV strain WD1127
104  2018-08-13  NC_030292.1  3.041640  Ferret CoV FRCoV-NL-2010
105  2018-08-13  NC_003436.1  3.068747  Porcine epidemic diarrhea virus
106  2018-08-13  NC_009657.1  3.074701  Scotophilus bat CoV 512
107  2018-08-13  NC_032107.1  3.086646  NL63-related bat CoV strain BtKYNL63-9a
108  2018-08-13  NC_010438.1  3.120462  Bat CoV HKU8
109  2018-08-13  NC_022103.1  3.126101  Bat CoV CDPHE15/USA/2006
110  2018-08-13  NC_011549.1  3.154652  Thrush CoV HKU12-600
111  2018-08-13  NC_028814.1  3.185708  BtRf-AlphaCoV/HuB2013
112  2018-08-13  NC_032730.1  3.203052  Lucheng Rn rat CoV Lucheng-19
113  2018-08-13  NC_009988.1  3.204418  Bat CoV HKU2
114  2018-08-13  NC_028833.1  3.260958  BtNv-AlphaCoV/SC2013
115  2018-08-13  NC_028806.1  3.346396  Swine enteric CoV strain Italy/213306/2009
116  2018-08-24  NC_038861.1  3.359831  Transmissible gastroenteritis virus
117  2018-08-13  NC_010437.1  3.404316  Bat CoV 1A
118  2018-08-13  NC_002306.3  3.472931  Feline infectious peritonitis virus
119  2018-08-13  NC_028824.1  3.505645  BtRf-AlphaCoV/YN2012
The multiple alignment was constructed from the resulting pairwise alignments, progressively aligned using multiz/autoMZ. The phylogenetic tree was calculated from 31-mer frequency similarity; the resulting distance matrix was then joined by neighbor joining with the neighbor command of the PHYLIP toolset.
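The k-mer based distance step described above can be illustrated with a short Python sketch. The text does not state which similarity measure was applied to the 31-mer frequencies, so cosine distance between k-mer count vectors is used here purely as an assumed stand-in, and all function names are hypothetical. The resulting matrix is the kind of input a neighbor-joining program expects; the joining step itself is not shown.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """1 - cosine similarity of two k-mer count vectors (assumed metric)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def distance_matrix(seqs, k):
    """Return (names, matrix) of pairwise distances for a dict of sequences."""
    names = sorted(seqs)
    counts = {name: kmer_counts(seqs[name], k) for name in names}
    matrix = [[cosine_distance(counts[x], counts[y]) for y in names]
              for x in names]
    return names, matrix


# Toy sequences and a small k, chosen only to keep the example readable.
names, matrix = distance_matrix(
    {"seqA": "ACGTACGTAC", "seqB": "ACGTACGTTC", "seqC": "GGGGCCCCGG"}, k=3)
print(names)
print(matrix)
```

For whole genomes the same functions can be called with `k=31`; only the toy inputs above use a smaller value.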
The reference sequence NC_045512v2 is at the top of the tree: ((((((((((((((((((((((((((((((((((((((((((((((((((((((NC_045512v2 (MN996528v1 MT019532v1)) MN988668v1) LR757996v1) (MN996530v1 NMDC60013002_07)) MT039873v1) (MT106053v1 NMDC60013002_06)) MT019533v1) MT118835v1) MT019531v1) NMDC60013002_10) MT019530v1) MT072688v1) MT093631v1) MT039887v1) MT027064v1) MN994468v1) NMDC60013002_09) MT066176v1) (LC528232v1 LC528233v1)) MN996531v1) MN996529v1) MT044258v1) LR757998v1) MN996527v1) MT027062v1) (LC522972 MT123290v1)) MT019529v1) LC521925) MT126808v1) (((((((LC522973 LC522974) LC522975) (((LR757995v1 MT066175v1) MN985325v1) (MN938384v1 MN997409v1))) MN975262v1) MT106052v1) (MT135041v1 MT135043v1)) MT049951v1)) MT007544v1) MT123292v1) MT020781v1) MT152824v1) MT039888v1) (MT123291v1 MT123293v1)) MT106054v1) (MN994467v1 MT044257v1)) MT093571v1) MN988713v1) MT039890v1) NMDC60013002_05) LR757997v1) GWHABKP00000000) GWHABKW00000000) NC_004718v3) NC_014470v1) NC_035191v1) (NC_009019v1 ((NC_009020v1 ((NC_019843v3 NC_038294v1) NC_034440v1)) NC_039207v1))) ((NC_001451v1 NC_010800v1) ((((NC_011547v1 (NC_011549v1 NC_016991v1)) ((NC_011550v1 NC_016993v1) (NC_016992v1 NC_039208v1))) NC_016996v1) NC_016994v1))) NC_025217v1) (((((NC_001846v1 NC_012936v1) ((NC_003045v1 NC_006213v1) NC_017083v1)) NC_026011v1) NC_006577v2) NC_010646v1)) ((((((NC_002306v3 (NC_028806v1 NC_038861v1)) NC_003436v1) (NC_023760v1 NC_030292v1)) (((NC_002645v1 NC_028752v1) (NC_005831v2 NC_032107v1)) (((((NC_009657v1 NC_010437v1) NC_010438v1) ((NC_009988v1 NC_028824v1) NC_028833v1)) NC_022103v1) (NC_018871v1 NC_028814v1)))) (NC_028811v1 (NC_032730v1 NC_034972v1))) (NC_009021v1 NC_030886v1))) NC_016995v1) Framing tables from the genes were constructed to enable visualization of codons in the multiple alignment display. 
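The tree listing above separates sibling nodes with spaces rather than commas, so most Newick readers will not accept it directly. The Python sketch below shows one way to check the parentheses, list the leaf names and rewrite the listing in comma-separated Newick form. The helper names are hypothetical, and the sample string is only the innermost part of the tree, not the full listing.

```python
import re


def is_balanced(tree):
    """Check that every '(' has a matching ')'."""
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def leaf_names(tree):
    """Return sequence names in the order they appear in the listing."""
    return re.findall(r"[^\s()]+", tree)


def to_newick(tree):
    """Replace sibling-separating whitespace with commas and add ';'."""
    return re.sub(r"\s+", ",", tree.strip()) + ";"


sample = "((NC_045512v2 (MN996528v1 MT019532v1)) MN988668v1)"
print(is_balanced(sample))
print(leaf_names(sample))
print(to_newick(sample))
```

Running `is_balanced` on the complete listing before conversion is a cheap way to detect truncation introduced by copying the text.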
Phylogenetic Tree Model Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. The all-species tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 119-way alignment (msa_view). The 4d sites were derived from the NCBI gene set, filtered to select single-coverage long transcripts. This same tree model was used in the phyloP calculations; however, the background frequencies were modified to maintain reversibility. The resulting tree model: all species. PhastCons Conservation The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly-conserved regions and long moderately conserved regions can both obtain high scores. 
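The two-state phylo-HMM idea can be sketched with a small forward-backward computation in Python. This is a toy illustration under stated assumptions, not phastCons: the per-column likelihoods are supplied directly instead of being computed from a tree model, and the transition probabilities `mu` (conserved to non-conserved) and `nu` (non-conserved to conserved) are hypothetical values chosen for the example.

```python
def posterior_conserved(lik_cons, lik_non, mu, nu):
    """Posterior probability of the conserved state at each column.

    lik_cons[i] / lik_non[i]: likelihood of column i under each state.
    mu: P(conserved -> non-conserved), nu: P(non-conserved -> conserved).
    """
    emit = list(zip(lik_cons, lik_non))
    n = len(emit)
    if n == 0:
        return []
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]  # state 0 = conserved
    start = [nu / (mu + nu), mu / (mu + nu)]  # stationary distribution

    # Forward pass, rescaled at each column to avoid underflow.
    fwd = []
    for i in range(n):
        if i == 0:
            row = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            row = [emit[i][s] * (prev[0] * trans[0][s] + prev[1] * trans[1][s])
                   for s in (0, 1)]
        total = row[0] + row[1]
        fwd.append([row[0] / total, row[1] / total])

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        row = [trans[s][0] * emit[i + 1][0] * nxt[0]
               + trans[s][1] * emit[i + 1][1] * nxt[1] for s in (0, 1)]
        total = row[0] + row[1]
        bwd[i] = [row[0] / total, row[1] / total]

    # Combine and normalize per column.
    posts = []
    for f, b in zip(fwd, bwd):
        cons, non = f[0] * b[0], f[1] * b[1]
        posts.append(cons / (cons + non))
    return posts


# Toy likelihoods: the three middle columns fit the conserved state better.
scores = posterior_conserved(
    [0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1], [0.5] * 7, mu=0.1, nu=0.05)
print([round(p, 3) for p in scores])
```

Because each column's posterior depends on its neighbors through the transition probabilities, a run of well-fitting columns raises the score of every column in the run, which is the autocorrelation behavior described above.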
More information about phastCons can be found in Siepel et al, 2005. The phastCons parameters used were: expected-length=45, target-coverage=0.3, rho=0.3. PhyloP Conservation The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements (http://compgen.cshl.edu/phast/). Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC). Conserved Elements The conserved elements were predicted by running phastCons with the --most-conserved option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz, Robert Harris, and Webb Miller of the Penn State Bioinformatics Group Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). 
Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC References Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632; Supplemental Materials and Methods Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. 
PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Lastz (formerly Blastz): Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 strainCons119wayViewalign Multiz Alignments Multiz Alignment & Conservation (119 strains: strains with vertebrate hosts and human SARS-Cov2) Comparative Genomics strainName119way Multiz Align Multiz Alignment of 119 strains: red=nonsyn green=syn blue=noncod yellow=noalign Comparative Genomics strainCons119wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (119 strains: strains with vertebrate hosts and human SARS-Cov2) Comparative Genomics strainPhastCons119way PhastCons 119 virus strains Basewise Conservation by PhastCons Comparative Genomics strainCons119wayViewphyloP Basewise Conservation (phyloP) Multiz Alignment & Conservation (119 strains: strains with vertebrate hosts and human SARS-Cov2) Comparative Genomics strainPhyloP119way PhyloP 119 virus strains Basewise Conservation by PhyloP Comparative Genomics unipCov2Chain Protein Products UniProt Protein Products (Polypeptide Chains, after cleavage) UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. 
The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. 
TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout
Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 cpgIslandExtUnmasked Unmasked CpG CpG Islands on All Sequence (Islands < 300 Bases are Light Green) Expression and Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time, methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some other reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole.
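As a rough illustration of how this enrichment can be quantified, the sketch below computes the GC content and the observed-to-expected CpG ratio of Gardiner-Garden and Frommer (1987), Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G), for a single sequence. The function name `cpg_stats` and the example sequence are hypothetical; this is not the cpg_lh program used to build the track.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string.

    Hypothetical helper for illustration only; not part of the track pipeline.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() finds every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


# Made-up example sequence, chosen only to keep the arithmetic easy to follow.
n, gc, ratio = cpg_stats("ACGCGTTACGGC")
print(n, round(gc, 3), round(ratio, 3))
```

A real island call for this track additionally requires the segment-level thresholds given in the Methods (GC content of 50% or greater, length greater than 200 bp, ratio greater than 0.6).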
The unmasked version of the track displays potential CpG islands that exist in repeat regions and would otherwise not be visible in the repeat masked version. By default, only the masked version of the track is displayed. To view the unmasked version, change the visibility settings in the track controls at the top of this page. Methods CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria: GC content of 50% or greater; length greater than 200 bp; and a ratio greater than 0.6 of the observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment. The entire genome sequence, masked areas included, was used for the construction of the track Unmasked CpG. The track CpG Islands is constructed on the sequence after all masked sequence is removed. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula (cited in Gardiner-Garden et al. (1987)): Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) where N = length of sequence. The calculation of the track data is performed by the following command sequence: twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \ | cpg_lh /dev/stdin 2> cpg_lh.err \ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | sort -k1,1 -k2,2n > cpgIsland.bed The unmasked track data is constructed by using twoBitToFa -noMask output for the twoBitToFa step. Data access CpG islands and their associated tables can be explored interactively using the REST API, the Table Browser or the Data Integrator.
All the tables can also be queried directly from our public MySQL servers, with more information available on our help page as well as on our blog. The source for the cpg_lh program can be obtained from src/utils/cpgIslandExt/. The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") Credits This track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished). References Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987 Jul 20;196(2):261-82. PMID: 3656447 unipCov2Interest Highlights UniProt highlighted "Regions of Interest" UniProt Protein Annotations Description This track shows protein sequence annotations defined as "regions of interest" from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name. A click on the item shows additional annotation details. Mouse-over a feature to see the full UniProt annotation comment. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server.
The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipInterestCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Alejo Mujica, Regeneron Pharmaceuticals. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipCov2LocSignal Signal Peptides UniProt Signal Peptides UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. 
To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic.
In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5.
PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 transcriptome Subgenomic Canonical Canonical Subgenomic Transcripts Genes and Gene Predictions Description This track shows predicted and experimental representations of the SARS-CoV-2 transcriptome based on long-read Nanopore sequencing. Display Conventions and Configuration SARS-CoV-2 generates sub-genomic mRNAs (sgmRNAs) for all ORFs. The virus achieves this by recombination mechanisms in which the replication machinery jumps from one of many TRS-B sites (transcription regulatory sequence, body) to the TRS-L (leader sequence) during negative strand synthesis. These negative strands are then used as templates for mRNA synthesis. On these tracks, we depict the predicted mRNAs with the excised sequence drawn like introns. The ORFs predicted to be translated by these mRNAs are shown in thick boxes. The thin bars function as UTRs for that particular mRNA species. Multiple subtracks are available: TRS sites: Annotated core TRS sequence (ACGAAC) in the viral genome that allows recombination. One site, TRS*, differs from the canonical TRS site by 1 nt, but has experimental support and is required to generate a 7b mRNA. SARS-CoV-2 Transcripts: Canonical SARS-CoV-2 Transcripts (gRNA and mRNA). Generated by recombining all TRS-B sites in the above track with the leader. Note the actual recombination breakpoints can often be drawn in multiple ways (since the TRS core motif is identical), and direct RNA sequencing suggests slightly different breakpoints, depending on the mRNA. See the experimental tracks if your analysis requires a detailed understanding of the breaks. The reported breaks in this track are only meant to be approximate.
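The TRS sites subtrack marks occurrences of the core motif ACGAAC. A minimal sketch of such a scan is shown below; it is not the tool used to build the track, the function name `find_motif` is hypothetical, and the toy sequence is made up for illustration rather than taken from the viral genome. Only the given strand is searched, and positions are reported as 0-based, half-open intervals, as in BED files.

```python
def find_motif(seq, motif="ACGAAC"):
    """Return (start, end) for every occurrence of motif in seq.

    Hypothetical helper for illustration; coordinates are 0-based, half-open,
    and overlapping occurrences are reported.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append((pos, pos + len(motif)))
        # Resume one base further on so overlapping matches are not skipped.
        pos = seq.find(motif, pos + 1)
    return hits


# Made-up toy sequence containing two copies of the core motif.
toy = "TTACGAACTTGGACGAACAA"
print(find_motif(toy))
```

An exact-match scan like this would not report the TRS* site, which differs from the core motif by 1 nt.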
Methods TRS sites: The core TRS sequence (ACGAAC) was identified using the findMotif tool. The TRS* (AaGAAC) site identified in Kim et al was manually added to create overall agreement with their Figure 1. SARS CoV-2 Transcripts: Using the TRS track, we generated mRNAs which span from nt 1-75 (the end of the TRS-L core sequence) and resume at the 3' end of all TRS-B sequences. Neither the 5' and 3' terminal ends of these RNAs, nor their internal breakpoints, should be considered exact. CDS sequences match Figure 1 from Kim et al. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/kim2020/TRS.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Jason Fernandes (Haussler-lab, UCSC) for preparing this track. References Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H. The Architecture of SARS-CoV-2 Transcriptome. Cell. 2020 Apr 18;. 
PMID: 32330414; PMC: PMC7179501 mRNAs Known Transcripts (gRNA and mRNA) Canonical Subgenomic Transcripts (gRNA and mRNA) Genes and Gene Predictions TRS_sites TRS sites Transcription Regulatory Sequences (TRS) of Canonical Subgenomic Transcripts Genes and Gene Predictions kimNp Subgenomic Breakpts Subgenomic Transcript Breakpoints from Kim et al 2020: Nanopore and MGISeq Genes and Gene Predictions Description This track shows the locations of RNA transcript breakpoints as determined by Nanopore and DNA Nanoball MGISEQ sequencing by Kim et al, Cell 2020. Display Conventions and Configuration The height of the bars shows the frequency of the transcript breakpoints. This track contains six subtracks, each of which can be hidden and modified in height and min/max settings for the bars by clicking its "Configure" link above. You can also configure all tracks together with the controls at the top of the track configuration page. Credits Thanks to Hyeshik Chang for preparing and sharing custom tracks. References The architecture of SARS-CoV-2 transcriptome. Cell 2020. kimNp5pBreak Nanop NC-Brkpt 5' Nanopore Noncanonical 5' Breakpoints Genes and Gene Predictions kimNpNc3pBrk Nanop NC-Brkpt 3' Nanopore Noncanonical 3' Breakpoints Genes and Gene Predictions kimNpLdr3pBreak Nanop Ld2Bd 3' Nanopore Leader-to-body 3' Breakpoints Genes and Gene Predictions kimMgiNc5p MGI NC-Brkpt 5' Kim et al. MGISEQ Noncanonical 5' Breakpoints Genes and Gene Predictions kimMgiNc3p MGI NC-Brkpt 3' Kim et al. MGISEQ Noncanonical 3' Breakpoints expression kimMgiLdr3p MGI Ld2Bd 3' MGISEQ Leader-to-body 3' Breakpoints expression kimTranscripts Subgenomic Observed Subgenomic Transcripts found in long-read sequences by Kim et al. 2020 Genes and Gene Predictions Description This track shows predicted and experimental representations of the SARS-CoV-2 transcriptome based on long-read Nanopore sequencing.
Display Conventions and Configuration SARS-CoV-2 generates sub-genomic mRNAs (sgmRNAs) for all ORFs. The virus achieves this by recombination mechanisms in which the replication machinery jumps from one of many TRS-B sites (transcription regulatory sequence, body) to the TRS-L (leader sequence) during negative strand synthesis. These negative strands are then used as templates for mRNA synthesis. On these tracks, we depict the predicted mRNAs with the excised sequence drawn like introns. The ORFs predicted to be translated by these mRNAs are shown in thick boxes. The thin bars function as UTRs for that particular mRNA species. Multiple subtracks are available: Kim Recombined transcripts: All RNA recombination products reported by Nanopore direct RNA sequencing in Kim et al, 2020, Cell. Kim Recomb. TRS transcripts: RNA recombination products that involve the leader on the 5' end; validated by nanopore sequencing. The score is the number of reads divided by 100. RNAs with read support >100,000 are rounded to a score of 1000. Kim Recomb. Novel transcripts: RNA recombination products that do not involve the leader, discovered by nanopore sequencing. RNAs with fewer than 100 reads were discarded. The score is the number of reads. RNAs with read support >1000 are listed as 1000. Methods Kim Recombined transcripts: generated from Table S2 of Kim et al, 2020. This represents all RNA species identified by direct RNA sequencing. Scores are the number of reads divided by 100 (RNAs with >1000 reads are rounded to 1000). Kim Recomb. TRS transcripts: generated from Table S3 of Kim et al, 2020. These viral transcripts all contain the TRS-L recombined with 3' viral sequence. Kim Recomb. Novel transcripts: generated from Table S4 of Kim et al, 2020. These viral transcripts contain novel recombination events that do not involve the leader. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool.
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/kim2020/Kim_TRS.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Jason Fernandes (Haussler-lab, UCSC) for preparing this track. References Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H. The Architecture of SARS-CoV-2 Transcriptome. Cell. 2020 Apr 18;. PMID: 32330414; PMC: PMC7179501 Kim_non-TRS_sgmRNAs Kim Recomb. Novel transcripts Recombined Subgenomic Trans.: Transcripts from Kim et al. 2020 without a TRS Genes and Gene Predictions Kim_TRS_sgmRNAs Kim recomb. TRS trans. Subgenomic Trans.: Recombined transcripts from Kim et al. 2020 with a TRS Genes and Gene Predictions Kim_RNAs Kim recomb. trans. Subgenomic Trans.: All recombined transcripts from Kim et al. 2020 Genes and Gene Predictions kimIvtCov Nanopore coverage Nanopore coverage of in-vitro-transcribed RNA seq + PCR, Kim et al 2020 Mapping and Sequencing Description This track shows the coverage of Nanopore sequences from Kim et al. 2020 obtained after in-vitro reverse transcription and tiling PCR of SARS-CoV-2 genomes. This is not direct RNA sequencing, but multiplex PCR on DNA, followed by sequencing. The coverage shown here does not allow conclusions to be drawn about RNA modifications or RNA editing, but indicates regions of the genome that are harder to sequence with Nanopore sequencing. Display Conventions and Configuration Sequence coverage of every bp is shown. All reads were used.
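Per-base coverage of this kind can be thought of as counting, for each genome position, the aligned reads that span it. The sketch below does this for a list of 0-based, half-open alignment intervals using a difference array; it illustrates the idea only and is not the program used to build the track. The function name `per_base_coverage` and the example intervals are hypothetical.

```python
from itertools import accumulate


def per_base_coverage(intervals, genome_length):
    """Return a list with the number of intervals covering each position.

    Hypothetical helper for illustration. Intervals are (start, end) pairs in
    0-based, half-open coordinates; parts outside the genome are clipped.
    """
    diff = [0] * (genome_length + 1)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1  # coverage rises where a read begins
            diff[end] -= 1    # and falls just past where it ends
    # The running sum of the differences gives the depth at each position.
    return list(accumulate(diff[:-1]))


# Three made-up reads on a 10 bp toy genome.
reads = [(0, 5), (2, 8), (4, 6)]
print(per_base_coverage(reads, 10))
```

The difference array keeps the cost proportional to the number of reads plus the genome length, rather than to the total number of aligned bases.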
Related tracks ARTIC primers: a set of well-known primers used for PCR + Nanopore sequencing. Kim Transcripts: transcripts as determined with direct RNA sequencing by Kim et al. 2020 Kim RNA Brkpts: locations where RNA tends to break, as determined by direct sequencing by Kim et al. 2020 Kim RNA Modif: modified bases in the direct RNA sequencing relative to the IVT sequences. Methods BAM files of Minimap2 alignments were processed with bamCoverage. References Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H. The Architecture of SARS-CoV-2 Transcriptome. Cell. 2020 May 14;181(4):914-921.e10. PMID: 32330414; PMC: PMC7179501 kimRnaMod Subgenomic RNA Modif. Subgenomic RNA Modifications from Kim et al. 2020: gRNA S 3a E M 6 7a 7b 8 N Genes and Gene Predictions Description This track shows the locations of RNA modifications as determined by Nanopore sequencing by Kim et al, Cell 2020. Display Conventions and Configuration Very small tickmarks indicate the position of the RNA modifications on the genome. One has to zoom in to base-pair-level detail to see their exact extent. A small barchart over the tick indicates the fraction of transcripts that are modified relative to the un-modified transcripts. E.g. frac=0.3 means that 30% of the transcripts were modified. There is one barchart per transcript. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page.
The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/kim2020.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Hyeshik Chang for preparing and sharing custom tracks. References The architecture of SARS-CoV-2 transcriptome. Cell 2020. unipCov2LocExtra Extracellular UniProt Extracellular Domain UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions.
UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator.
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 artic ARTIC Primers V3 ARTIC V3 Oxford Nanopore sequencing primers Mapping and Sequencing Description This track shows the primers for the ARTIC network SARS-CoV-2 sequencing protocol, Version 3. Display Conventions and Configuration Genomic locations of primers are highlighted. A click on them shows the primer pool. Methods Artic Network primer sequences were downloaded from the file Artic Github BED file and converted to bigBed.
Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/artic.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. articV4 ARTIC Primers V4 ARTIC V4 Oxford Nanopore sequencing primers Mapping and Sequencing Description This track shows the primers for the ARTIC network SARS-CoV-2 sequencing protocol, Version 4 (June 21, 2021). Display Conventions and Configuration Genomic locations of primers are highlighted. A click on them shows the primer pool. Methods Artic Network primer sequences were downloaded from github (file SARS-CoV-2.primer.bed) and converted to bigBed. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/articV4.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. 
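The range query shown above can also be scripted. The following is a minimal sketch, assuming Python 3 and that the bigBedToBed binary is installed and on the PATH. The helper names bigbed_range_command and fetch_features are hypothetical and not part of any UCSC tool; they only wrap the same command line, with the coordinates treated as a BED-style range.

```python
import shutil
import subprocess


def bigbed_range_command(url, chrom, start, end):
    """Build the argument list for a bigBedToBed range query.

    Hypothetical helper: it reproduces the command line shown in the text
    and performs no network access itself.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",  # write the BED text to standard output
    ]


def fetch_features(url, chrom, start, end):
    """Run bigBedToBed and return its output as a list of BED lines.

    Assumes the bigBedToBed binary is on the PATH and the URL is reachable.
    """
    if shutil.which("bigBedToBed") is None:
        raise RuntimeError("bigBedToBed not found on PATH")
    result = subprocess.run(
        bigbed_range_command(url, chrom, start, end),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

For example, fetch_features("http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/articV4.bb", "NC_045512v2", 0, 29902) would issue the same request as the command above and return one string per primer record.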
articV4_1 ARTIC Primers V4.1 ARTIC V4.1 Oxford Nanopore sequencing primers Mapping and Sequencing Description This track shows the primers for the ARTIC network SARS-CoV-2 sequencing protocol, Version 4.1 (January 7, 2022), which restores the functionality of some primers against the Omicron variant. Display Conventions and Configuration Genomic locations of primers are highlighted. A click on them shows the primer pool. Methods Artic Network primer sequences were downloaded from Github (file SARS-CoV-2.primer.bed) and converted to bigBed. Data Access The raw data can be explored interactively with the REST API, Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/articV4.1.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. rapid RAPID/Midnight Primers RAPID/Midnight 1200bp amplicon Oxford Nanopore sequencing primers Mapping and Sequencing Description This track shows the primers for the RAPID SARS-CoV-2 sequencing protocol, also commonly referred to as Midnight. The primers enable amplification of the genome of SARS-CoV-2. This approach uses multiplexed 1200 base pair (bp) tiled amplicons. Briefly, two PCR reactions are performed for each SARS-CoV-2 positive patient sample to be sequenced. 
One PCR reaction contains thirty primers that generate the odd numbered amplicons ("Pool 1"), while the second PCR reaction contains twenty eight primers that generate the even numbered amplicons ("Pool 2"). After PCR, the two amplicon pools are combined and can be used for a range of downstream sequencing approaches. Primers were all designed using Primal Scheme and described in Nature Protocols 2017. This primer set results in amplicons that exhibit lower levels of variation in coverage compared to other commonly used primer sets. Display Conventions and Configuration Genomic locations of primers are highlighted. A click on them shows the primer pool. This is one of the few tracks that may be best displayed in "full" mode. Methods RAPID primer sequences were downloaded from the Google Spreadsheet and converted to bigBed. More details are available in the paper referenced below or in the supplemental files on Zenodo. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/rapid.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Freed NE, Vlková M, Faisal MB, Silander OK. Rapid and inexpensive whole-genome sequencing of SARS-CoV-2 using 1200 bp tiled amplicons and Oxford Nanopore Rapid Barcoding. Biol Methods Protoc. 2020;5(1):bpaa014. 
PMID: 33029559; PMC: PMC7454405 swift Swift Primers Swift BioSciences sequencing primers Mapping and Sequencing Description This track shows the primers for the Swift Amplicon ® SARS-CoV-2 Panel single-tube NGS assay: This kit leverages patented multiplex PCR technology, enabling library construction from 1st-strand or 2nd-strand cDNA using tiled primer pairs to target the entire 29.9 kb viral genome with a single pool of multiplexed primer pairs. Primers were designed against the NCBI Reference Sequence NC_045512.2 (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome). In silico analysis predicted zero off-target products from human host genome sequences. Display Conventions and Configuration Genomic locations of primers are highlighted. A click on them shows the primer sequence. This is one of the few tracks that may be best displayed in "full" mode. Methods Primer sequences, names and genomic locations were downloaded from Swift and converted to bigBed. More details are available from Swift Biosciences. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/swift.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. unipCov2LocTransMemb Transmem. 
Domains UniProt Transmembrane Domains UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between GenBank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g.
"glyco", "disulf bond", "Signal peptide" etc.). Clicking an annotation shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over shows the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation sites in light green. Methods UniProt sequences were first aligned to UCSC/Gencode transcript sequences with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here.
The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 publicAnnots Crowd-Sourced Data Crowd-sourced data: annotations contributed via bit.ly/cov2annots Mapping and Sequencing Description This track shows annotations made via a public spreadsheet available at http://bit.ly/cov2annots. Originally, anyone could add annotations to this spreadsheet and they went live after 1-2 days. We switched off the automated updates from the spreadsheet to the track in mid-2021. To add changes to this track now, please contact us. Display Conventions and Configuration Only start-end annotations can be shown. Contact us at genome-www@soe.ucsc.edu if you have feedback on the form, e.g. you need exon or intron lines or have datasets with more than 5-10 annotations. unipCov2LocCytopl Cytoplasmic UniProt Cytoplasmic Domains UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates.
The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. 
TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over shows the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation sites in light green. Methods UniProt sequences were first aligned to UCSC/Gencode transcript sequences with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout
Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 bpmIggCovidSum Antib Pept Array Sum (IgG) Antibody Proteome Peptide Binding Microarray, Wang et al 2020 - IgG, Covid - Sum of scores per nucleotide Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the entire proteome of SARS-CoV-2. Sera from 10 COVID-19 patients (early stage) and 12 healthy controls were screened on the peptide microarray for both IgG and IgM responses. Note that the infections here were in the early stage, unlike the other microarray track shown on this genome browser. Display Conventions and Configuration Genomic locations of peptides that were spotted on the array are highlighted. Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but these are visible only at high zoom levels. Put any track into pack mode to fully see all probe sequences. The color is assigned based on the Z-score, without any other normalization. Blue with decreasing intensity is assigned to the values -5 to 0, white is 0, and red colors with increasing intensity are used for the values 0-3.5, exactly as in the original publication figures.
There are also two wiggle/signal-style tracks that summarize the information; they show the sum of Z-scores across all peptides, as one score per nucleotide. Methods Supplemental files were converted from Excel, rearranged and run through the command-line script bigHeat to create a heatmap-like display, with multiplication factors of 2 for negative values and 0.285 for positive values. For better visibility, the matplotlib colormap seismic was used over the range 0.1 to 0.9, and the values after multiplication were restricted to the range -1 to 1 to limit the effect of outliers. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/pbm/IgG_Z-score-_COVID-19_patients/P52.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Hongye Wang, Xian Wu, Xiaomei Zhang, Xin Hou, Te Liang, Dan Wang, Fei Teng, Jiayu Dai, Hu Duan, Shubin Guo, Yongzhe Li, and Xiaobo Yu. SARS-CoV-2 Proteome Microarray for Mapping COVID-19 Antibody Interactions at Amino Acid Resolution. ACS Central Science.
2020 DOI: 10.1021/acscentsci.0c00742 pbm Antib Pept Array Antibody Proteome Peptide Binding Microarray Raw Data from Wang et al, ACS 2020, Xiaobo Yu group, NCPSB Beijing Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the entire proteome of SARS-CoV-2. Sera from 10 COVID-19 patients (early stage) and 12 healthy controls were screened on the peptide microarray for both IgG and IgM responses. Note that the infections here were in the early stage, unlike the other microarray track shown on this genome browser. Display Conventions and Configuration Genomic locations of peptides that were spotted on the array are highlighted. Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but these are visible only at high zoom levels. Put any track into pack mode to fully see all probe sequences. The color is assigned based on the Z-score, without any other normalization. Blue with decreasing intensity is assigned to the values -5 to 0, white is 0, and red colors with increasing intensity are used for the values 0-3.5, exactly as in the original publication figures. There are also two wiggle/signal-style tracks that summarize the information; they show the sum of Z-scores across all peptides, as one score per nucleotide. Methods Supplemental files were converted from Excel, rearranged and run through the command-line script bigHeat to create a heatmap-like display, with multiplication factors of 2 for negative values and 0.285 for positive values. For better visibility, the matplotlib colormap seismic was used over the range 0.1 to 0.9, and the values after multiplication were restricted to the range -1 to 1 to limit the effect of outliers. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool.
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/pbm/IgG_Z-score-_COVID-19_patients/P52.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Hongye Wang, Xian Wu, Xiaomei Zhang, Xin Hou, Te Liang, Dan Wang, Fei Teng, Jiayu Dai, Hu Duan, Shubin Guo, Yongzhe Li, and Xiaobo Yu SARS-CoV-2 Proteome Microarray for Mapping COVID-19 Antibody Interactions at Amino Acid Resolution. ACS Central Science. 2020 DOI: 10.1021/acscentsci.0c00742 bpmIgmCovidSum Antib Pept Array Sum (IgM) Antibody Proteome Peptide Binding Microarray, Wang et al 2020 - IgM, Covid - Sum of scores per nucleotide Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the entire proteome of SARS-CoV-2. Sera from 10 COVID-19 patients (early stage) and 12 healthy controls were screened on the peptide microarray for both IgG and IgM responses. Note that the infections here were in the early stage, unlike the other microarray track shown on this genome browser. Display Conventions and Configuration Genomic locations of peptides that were spotted on the array are highlighted. Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but only visible on high zoom levels. Put any track into pack mode to fully see all probe sequences. The color is assigned based on the Z-Score, without any other normalization. 
Blue with decreasing intensity is assigned to the values -5 to 0, white is 0, and red colors with increasing intensity are used for the values 0-3.5, exactly as in the original publication figures. There are also two wiggle/signal-style tracks that summarize the information; they show the sum of Z-scores across all peptides, as one score per nucleotide. Methods Supplemental files were converted from Excel, rearranged and run through the command-line script bigHeat to create a heatmap-like display, with multiplication factors of 2 for negative values and 0.285 for positive values. For better visibility, the matplotlib colormap seismic was used over the range 0.1 to 0.9, and the values after multiplication were restricted to the range -1 to 1 to limit the effect of outliers. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/pbm/IgG_Z-score-_COVID-19_patients/P52.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Hongye Wang, Xian Wu, Xiaomei Zhang, Xin Hou, Te Liang, Dan Wang, Fei Teng, Jiayu Dai, Hu Duan, Shubin Guo, Yongzhe Li, and Xiaobo Yu. SARS-CoV-2 Proteome Microarray for Mapping COVID-19 Antibody Interactions at Amino Acid Resolution. ACS Central Science.
2020 DOI: 10.1021/acscentsci.0c00742 IgG_Z-score-_COVID-19_patients IgG Z-score- early COVID-19 patients Proteome Peptide Microarray - IgG - early COVID-19 patients Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the entire proteome of SARS-CoV-2. Sera from 10 COVID-19 patients (early stage) and 12 healthy controls were screened on the peptide microarray for both IgG and IgM responses. Note that the infections here were in the early stage, unlike the other microarray track shown on this genome browser. Display Conventions and Configuration Genomic locations of peptides that were spotted on the array are highlighted. Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but these are visible only at high zoom levels. Put any track into pack mode to fully see all probe sequences. The color is assigned based on the Z-score, without any other normalization. Blue with decreasing intensity is assigned to the values -5 to 0, white is 0, and red colors with increasing intensity are used for the values 0-3.5, exactly as in the original publication figures. There are also two wiggle/signal-style tracks that summarize the information; they show the sum of Z-scores across all peptides, as one score per nucleotide. Methods Supplemental files were converted from Excel, rearranged and run through the command-line script bigHeat to create a heatmap-like display, with multiplication factors of 2 for negative values and 0.285 for positive values. For better visibility, the matplotlib colormap seismic was used over the range 0.1 to 0.9, and the values after multiplication were restricted to the range -1 to 1 to limit the effect of outliers. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool.
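The scaling described under Methods above can be illustrated with a short sketch. This is not the bigHeat implementation. It assumes the Z-score is first multiplied by a sign-dependent factor (the two factors stated above), then clipped to the limits -1 to 1, and finally mapped linearly onto the 0.1 to 0.9 portion of a colormap. The linear mapping and the function names scale_z_score and colormap_position are assumptions made for illustration.

```python
def scale_z_score(z, neg_factor=2.0, pos_factor=0.285):
    """Multiply a Z-score by a sign-dependent factor, then clip to [-1, 1].

    The default factors are the ones quoted in the Methods text; the
    function itself is a hypothetical illustration.
    """
    factor = neg_factor if z < 0 else pos_factor
    scaled = z * factor
    return max(-1.0, min(1.0, scaled))


def colormap_position(scaled, low=0.1, high=0.9):
    """Map a scaled value in [-1, 1] linearly onto [low, high].

    Assumed reading of "used from 0.1 to 0.9": -1 maps to `low`,
    0 maps to the midpoint, and +1 maps to `high`.
    """
    fraction = (scaled + 1.0) / 2.0  # 0.0 for -1, 0.5 for 0, 1.0 for +1
    return low + fraction * (high - low)


if __name__ == "__main__":
    for z in (-5.0, -0.2, 0.0, 1.0, 3.5):
        s = scale_z_score(z)
        print(z, round(s, 4), round(colormap_position(s), 4))
```

The resulting position could then be passed to a diverging colormap such as matplotlib's seismic to obtain the display color. With the factors as stated, a Z-score of 3.5 scales to just under 1, which matches the upper end of the red range described under Display Conventions.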
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/pbm/IgG_Z-score-_COVID-19_patients/P52.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Hongye Wang, Xian Wu, Xiaomei Zhang, Xin Hou, Te Liang, Dan Wang, Fei Teng, Jiayu Dai, Hu Duan, Shubin Guo, Yongzhe Li, and Xiaobo Yu SARS-CoV-2 Proteome Microarray for Mapping COVID-19 Antibody Interactions at Amino Acid Resolution. ACS Central Science. 2020 DOI: 10.1021/acscentsci.0c00742 P6 P6 P6 Immunology P52 P52 P52 Immunology P45 P45 P45 Immunology P4 P4 P4 Immunology P33 P33 P33 Immunology P32 P32 P32 Immunology P15 P15 P15 Immunology P11 P11 P11 Immunology P10 P10 P10 Immunology P1 P1 P1 Immunology IgM_Z-score-_COVID-19_patients IgM Z-score - early COVID-19 patients Proteome Peptide Microarray - IgM - early COVID-19 patients Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the entire proteome of SARS-CoV-2. Sera from 10 COVID-19 patients (early stage) and 12 healthy controls were screened on the peptide microarray for both IgG and IgM responses. Note that the infections here were in the early stage, unlike the other microarray track shown on this genome browser. Display Conventions and Configuration Genomic locations of peptides that were spotted on the array are highlighted. Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but only visible on high zoom levels. 
Put any track into pack mode to fully see all probe sequences. The color is assigned based on the Z-score, without any other normalization. Blue with decreasing intensity is assigned to the values -5 to 0, white is 0, and red colors with increasing intensity are used for the values 0-3.5, exactly as in the original publication figures. There are also two wiggle/signal-style tracks that summarize the information; they show the sum of Z-scores across all peptides, as one score per nucleotide. Methods Supplemental files were converted from Excel, rearranged and run through the command-line script bigHeat to create a heatmap-like display, with multiplication factors of 2 for negative values and 0.285 for positive values. For better visibility, the matplotlib colormap seismic was used over the range 0.1 to 0.9, and the values after multiplication were restricted to the range -1 to 1 to limit the effect of outliers. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/pbm/IgG_Z-score-_COVID-19_patients/P52.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.
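The per-nucleotide summary mentioned under Display Conventions (the sum of Z-scores across all overlapping peptides, one score per nucleotide) can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the track: peptides are assumed to be given as BED-style (start, end, z_score) tuples with a 0-based start and an exclusive end, and the function name and toy values are hypothetical.

```python
def sum_scores_per_nucleotide(peptides, genome_length):
    """Sum the scores of all peptides covering each genome position.

    peptides: iterable of (start, end, z_score) tuples, 0-based start and
              exclusive end (an assumed, BED-style convention).
    Returns a list of length genome_length with one summed score per base.
    """
    totals = [0.0] * genome_length
    for start, end, z_score in peptides:
        if not (0 <= start < end <= genome_length):
            raise ValueError(f"peptide range out of bounds: {start}-{end}")
        # Every base covered by the peptide receives the peptide's score.
        for pos in range(start, end):
            totals[pos] += z_score
    return totals


if __name__ == "__main__":
    # Two overlapping toy peptides on an 8-base toy genome.
    toy_peptides = [(0, 4, 1.0), (2, 6, -0.5)]
    print(sum_scores_per_nucleotide(toy_peptides, 8))
```

Positions covered by both toy peptides receive the sum of the two scores, which is why overlapping probes with consistent signal stand out in a summed track.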
References Hongye Wang, Xian Wu, Xiaomei Zhang, Xin Hou, Te Liang, Dan Wang, Fei Teng, Jiayu Dai, Hu Duan, Shubin Guo, Yongzhe Li, and Xiaobo Yu. SARS-CoV-2 Proteome Microarray for Mapping COVID-19 Antibody Interactions at Amino Acid Resolution. ACS Central Science. 2020 DOI: 10.1021/acscentsci.0c00742 IgM_Z-score-_COVID-19_patients_P6 P6 P6 Immunology IgM_Z-score-_COVID-19_patients_P52 P52 P52 Immunology IgM_Z-score-_COVID-19_patients_P45 P45 P45 Immunology IgM_Z-score-_COVID-19_patients_P4 P4 P4 Immunology IgM_Z-score-_COVID-19_patients_P33 P33 P33 Immunology IgM_Z-score-_COVID-19_patients_P32 P32 P32 Immunology IgM_Z-score-_COVID-19_patients_P15 P15 P15 Immunology IgM_Z-score-_COVID-19_patients_P11 P11 P11 Immunology IgM_Z-score-_COVID-19_patients_P10 P10 P10 Immunology IgM_Z-score-_COVID-19_patients_P1 P1 P1 Immunology varskip NEB VarSkip Primers New England Biolabs (NEB) VarSkip Primers Mapping and Sequencing Description This track shows the NEB VarSkip v1 and v2 sequencing primers. Display Conventions and Configuration Genomic locations of primers are highlighted. For primer tracks, the "full" visibility mode is often more suitable than the "pack" or "squish" display modes. Methods Primer sequences were downloaded from the NEB GitHub repository and converted to bigBed. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page.
The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/artic.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. varskip2b VarSkip 2b NEB VarSkip 2b (2a + spike-ins) Mapping and Sequencing varskip2 VarSkip 2 NEB VarSkip 2 Mapping and Sequencing varskipVsl1a VarSkip Long 1a NEB VarSkip Long V1a Mapping and Sequencing varskip1a VarSkip 1a NEB VarSkip V1a Mapping and Sequencing unipCov2DisulfBond Disulf. Bonds UniProt Disulfide Bonds UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. 
UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans.
Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipCov2Domain Protein Domains UniProt Domains UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff.
The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. 
TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout
Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 igg S-PBM IgG S Protein Antibody Peptide Binding Microarray - IgG - Sheng-ce Tao group, Jiao Tung Univ. Immunology pbmShanghai S Antib Pept Array S Protein Antibody Peptide Binding Microarray from Li et al, Cell & Mol Imm 2020, Sheng-ce Tao group, Jiao Tung Univ. Immunology Description This track shows intensities of a microarray spotted with short peptides derived from the S protein. Fifty-five sera from convalescent COVID-19 patients and 18 control sera were screened on the peptide microarray for both IgG and IgM responses. When comparing the microarray tracks, note that they are from patients at different stages of the infection. A total of 211 peptides were synthesized and conjugated to BSA. The conjugates along with control proteins were prepared in triplicate at three dilutions. Display Conventions and Configuration This track contains two composite tracks, for IgG and IgM. It also contains two signal (wiggle) tracks that show the "response frequency", roughly calculated as described in the paper, for IgG and IgM separately. Genomic locations of peptides that were spotted on the array are highlighted.
Because these peptides overlap, the tracks default to dense mode and the sequence is shown as labels drawn onto the rectangles, but only visible on high zoom levels. Put any track into pack mode to fully see the sequence and also the triplicates. Since every peptide was spotted in three concentrations, every peptide is shown three times in pack mode. The color is assigned based on the log of the fluorescent signal, with all negative values replaced by 0. The fluorescent signal was restricted to the range 0-15, scaled to 0-1.0 and the viridis color palette was used to assign a color intensity. This color palette is different from the one used in the original paper, to make the quantitative data easier to see. The two signal tracks show the "response frequency", as defined in the paper. The response frequency is the share of COVID samples at a position that exceed the threshold mean(x)+3*stdev(y), with x being all values at a position and y being the negative controls. At positions where two peptides overlap, the two values are summed, which means that the result on the Genome Browser can exceed 1.0. Methods Supplemental files were received from the authors, converted from Excel, rearranged and run through the command line script bigHeat to create a heatmap-like display. For better visibility, colormap viridis from matplotlib was used. Like all tracks, the exact commands are documented in our makeDoc text files. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.
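The response frequency shown in the two signal tracks can be sketched as follows. This is only an illustration of the formula quoted under Display Conventions, not the code used to build the track. It assumes that "all values at a position" means the COVID and control values pooled together and that stdev is the sample standard deviation; the function name and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def response_frequency(covid_values, control_values):
    """Share of COVID samples above mean(x) + 3 * stdev(y).

    x: all values at the position (assumed here: COVID + controls pooled).
    y: the negative-control values at the same position.
    """
    all_values = list(covid_values) + list(control_values)
    threshold = mean(all_values) + 3 * stdev(control_values)
    responders = sum(1 for value in covid_values if value > threshold)
    return responders / len(covid_values)


# Toy signal values for one peptide (made up for illustration).
covid = [9.0, 8.0, 1.0, 1.0]
controls = [0.9, 1.0, 1.1, 1.0]
frequency = response_frequency(covid, controls)
print(frequency)
```

Where two peptides overlap, the track adds their two frequencies, which is why displayed values can exceed 1.0.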
References Li Y, Lai DY, Zhang HN, Jiang HW, Tian X, Ma ML, Qi H, Meng QF, Guo SJ, Wu Y et al. Linear epitopes of SARS-CoV-2 spike protein elicit neutralizing antibodies in COVID-19 patients. Cell Mol Immunol. 2020 Oct;17(10):1095-1097. PMID: 32895485; PMC: PMC7475724 igg_COVID_527 COVID 527 COVID 527 Immunology igg_COVID_401 COVID 401 COVID 401 Immunology igg_COVID_501 COVID 501 COVID 501 Immunology igg_COVID_419 COVID 419 COVID 419 Immunology igg_COVID_532 COVID 532 COVID 532 Immunology igg_COVID_407 COVID 407 COVID 407 Immunology igg_COVID_610 COVID 610 COVID 610 Immunology igg_COVID_730 COVID 730 COVID 730 Immunology igg_COVID_14 COVID 14 COVID 14 Immunology igg_COVID_5333 COVID 5333 COVID 5333 Immunology igg_COVID_520 COVID 520 COVID 520 Immunology igg_COVID_5222 COVID 5222 COVID 5222 Immunology igg_COVID_416 COVID 416 COVID 416 Immunology igg_COVID_4371 COVID 4371 COVID 4371 Immunology igg_COVID_529 COVID 529 COVID 529 Immunology igg_COVID_531 COVID 531 COVID 531 Immunology igg_COVID_432 COVID 432 COVID 432 Immunology igg_COVID_535 COVID 535 COVID 535 Immunology igg_COVID_530 COVID 530 COVID 530 Immunology igg_COVID_522 COVID 522 COVID 522 Immunology igg_COVID_510 COVID 510 COVID 510 Immunology igg_COVID_12 COVID 12 COVID 12 Immunology igg_COVID_11 COVID 11 COVID 11 Immunology igg_COVID_727 COVID 727 COVID 727 Immunology igg_COVID_406 COVID 406 COVID 406 Immunology igg_COVID_508 COVID 508 COVID 508 Immunology igg_COVID_528 COVID 528 COVID 528 Immunology igg_COVID_404 COVID 404 COVID 404 Immunology igg_COVID_405 COVID 405 COVID 405 Immunology igg_COVID_415 COVID 415 COVID 415 Immunology igg_COVID_744 COVID 744 COVID 744 Immunology igg_COVID_19 COVID 19 COVID 19 Immunology igg_COVID_509 COVID 509 COVID 509 Immunology igg_COVID_13 COVID 13 COVID 13 Immunology igg_COVID_526 COVID 526 COVID 526 Immunology igg_COVID_4021 COVID 4021 COVID 4021 Immunology igg_COVID_18 COVID 18 COVID 18 Immunology igg_COVID_502 COVID 502 COVID 502 Immunology igg_COVID_429 COVID 429 
COVID 429 Immunology igg_COVID_436 COVID 436 COVID 436 Immunology igg_COVID_409 COVID 409 COVID 409 Immunology igg_COVID_4022 COVID 4022 COVID 4022 Immunology igg_COVID_410 COVID 410 COVID 410 Immunology igg_COVID_408 COVID 408 COVID 408 Immunology igg_COVID_523 COVID 523 COVID 523 Immunology igg_COVID_608 COVID 608 COVID 608 Immunology igg_COVID_16 COVID 16 COVID 16 Immunology igg_COVID_534 COVID 534 COVID 534 Immunology igg_COVID_4372 COVID 4372 COVID 4372 Immunology igg_COVID_414 COVID 414 COVID 414 Immunology igg_COVID_17 COVID 17 COVID 17 Immunology igg_COVID_15 COVID 15 COVID 15 Immunology igg_COVID_605 COVID 605 COVID 605 Immunology igg_COVID_533 COVID 533 COVID 533 Immunology igg_COVID_607 COVID 607 COVID 607 Immunology igg_Ctrl_LC180 Ctrl LC180 Ctrl LC180 Immunology igg_Ctrl_NC98 Ctrl NC98 Ctrl NC98 Immunology igg_Ctrl_NC95 Ctrl NC95 Ctrl NC95 Immunology igg_Ctrl_LC175 Ctrl LC175 Ctrl LC175 Immunology igg_Ctrl_NC63 Ctrl NC63 Ctrl NC63 Immunology igg_Ctrl_NC64 Ctrl NC64 Ctrl NC64 Immunology igg_Ctrl_LC174 Ctrl LC174 Ctrl LC174 Immunology igg_Ctrl_NC97 Ctrl NC97 Ctrl NC97 Immunology igg_Ctrl_NC66 Ctrl NC66 Ctrl NC66 Immunology igg_Ctrl_NC96 Ctrl NC96 Ctrl NC96 Immunology igg_Ctrl_LC181 Ctrl LC181 Ctrl LC181 Immunology igg_Ctrl_LC177 Ctrl LC177 Ctrl LC177 Immunology igg_Ctrl_NC65 Ctrl NC65 Ctrl NC65 Immunology igg_Ctrl_LC171 Ctrl LC171 Ctrl LC171 Immunology igg_Ctrl_LC169 Ctrl LC169 Ctrl LC169 Immunology igg_Ctrl_LC182 Ctrl LC182 Ctrl LC182 Immunology igg_Ctrl_LC168 Ctrl LC168 Ctrl LC168 Immunology igg_Ctrl_NC67 Ctrl NC67 Ctrl NC67 Immunology igm S-PBM IgM S Protein Antibody Peptide Binding Microarray - IgM - Sheng-ce Tao group, Jiao Tung Univ. 
Immunology igm_COVID_409 COVID 409 COVID 409 Immunology igm_COVID_4022 COVID 4022 COVID 4022 Immunology igm_COVID_436 COVID 436 COVID 436 Immunology igm_COVID_530 COVID 530 COVID 530 Immunology igm_COVID_432 COVID 432 COVID 432 Immunology igm_COVID_429 COVID 429 COVID 429 Immunology igm_COVID_405 COVID 405 COVID 405 Immunology igm_COVID_522 COVID 522 COVID 522 Immunology igm_COVID_401 COVID 401 COVID 401 Immunology igm_COVID_410 COVID 410 COVID 410 Immunology igm_COVID_509 COVID 509 COVID 509 Immunology igm_COVID_501 COVID 501 COVID 501 Immunology igm_COVID_535 COVID 535 COVID 535 Immunology igm_COVID_5222 COVID 5222 COVID 5222 Immunology igm_COVID_510 COVID 510 COVID 510 Immunology igm_COVID_607 COVID 607 COVID 607 Immunology igm_COVID_407 COVID 407 COVID 407 Immunology igm_COVID_4371 COVID 4371 COVID 4371 Immunology igm_COVID_610 COVID 610 COVID 610 Immunology igm_COVID_12 COVID 12 COVID 12 Immunology igm_COVID_533 COVID 533 COVID 533 Immunology igm_COVID_727 COVID 727 COVID 727 Immunology igm_COVID_730 COVID 730 COVID 730 Immunology igm_COVID_605 COVID 605 COVID 605 Immunology igm_COVID_508 COVID 508 COVID 508 Immunology igm_Ctrl_LC174 Ctrl LC174 Ctrl LC174 Immunology igm_COVID_608 COVID 608 COVID 608 Immunology igm_COVID_520 COVID 520 COVID 520 Immunology igm_COVID_5333 COVID 5333 COVID 5333 Immunology igm_COVID_532 COVID 532 COVID 532 Immunology igm_COVID_406 COVID 406 COVID 406 Immunology igm_COVID_14 COVID 14 COVID 14 Immunology igm_COVID_534 COVID 534 COVID 534 Immunology igm_COVID_19 COVID 19 COVID 19 Immunology igm_COVID_415 COVID 415 COVID 415 Immunology igm_COVID_4021 COVID 4021 COVID 4021 Immunology igm_COVID_419 COVID 419 COVID 419 Immunology igm_COVID_18 COVID 18 COVID 18 Immunology igm_COVID_529 COVID 529 COVID 529 Immunology igm_COVID_526 COVID 526 COVID 526 Immunology igm_COVID_744 COVID 744 COVID 744 Immunology igm_COVID_17 COVID 17 COVID 17 Immunology igm_COVID_11 COVID 11 COVID 11 Immunology igm_COVID_414 COVID 414 COVID 414 Immunology 
igm_COVID_527 COVID 527 COVID 527 Immunology igm_COVID_523 COVID 523 COVID 523 Immunology igm_Ctrl_LC171 Ctrl LC171 Ctrl LC171 Immunology igm_Ctrl_NC97 Ctrl NC97 Ctrl NC97 Immunology igm_Ctrl_NC66 Ctrl NC66 Ctrl NC66 Immunology igm_Ctrl_NC95 Ctrl NC95 Ctrl NC95 Immunology igm_COVID_531 COVID 531 COVID 531 Immunology igm_COVID_4372 COVID 4372 COVID 4372 Immunology igm_Ctrl_NC96 Ctrl NC96 Ctrl NC96 Immunology igm_Ctrl_NC64 Ctrl NC64 Ctrl NC64 Immunology igm_Ctrl_LC180 Ctrl LC180 Ctrl LC180 Immunology igm_Ctrl_NC63 Ctrl NC63 Ctrl NC63 Immunology igm_Ctrl_NC98 Ctrl NC98 Ctrl NC98 Immunology igm_Ctrl_NC65 Ctrl NC65 Ctrl NC65 Immunology igm_COVID_528 COVID 528 COVID 528 Immunology igm_Ctrl_LC182 Ctrl LC182 Ctrl LC182 Immunology igm_Ctrl_LC168 Ctrl LC168 Ctrl LC168 Immunology igm_Ctrl_LC169 Ctrl LC169 Ctrl LC169 Immunology igm_Ctrl_LC177 Ctrl LC177 Ctrl LC177 Immunology igm_COVID_416 COVID 416 COVID 416 Immunology igm_COVID_15 COVID 15 COVID 15 Immunology igm_COVID_502 COVID 502 COVID 502 Immunology igm_COVID_16 COVID 16 COVID 16 Immunology igm_COVID_404 COVID 404 COVID 404 Immunology igm_Ctrl_NC67 Ctrl NC67 Ctrl NC67 Immunology igm_Ctrl_LC175 Ctrl LC175 Ctrl LC175 Immunology igm_Ctrl_LC181 Ctrl LC181 Ctrl LC181 Immunology igm_COVID_13 COVID 13 COVID 13 Immunology igm_COVID_408 COVID 408 COVID 408 Immunology iggAllSum S-PBM: IgG Response Frequency S Protein Antibody Peptide Binding Microarray - IgG - Response Frequency Immunology igmAllSum S-PBM: IgM Response Frequency S Protein Antibody Peptide Binding Microarray - IgM - Response Frequency Immunology unipCov2Modif Glycosyl/Phosph. UniProt Amino Acid Glycosylation/Phosphorylation sites UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. 
The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. 
TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout
Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 Calu3_07hpi Calu3 7hpi Calu3 7hpi Ribo-seq and RNA-seq Genes and Gene Predictions Description The Weizman ORFs (Open Reading Frames) track shows previously unannotated ORF predictions based on Ribo-Seq and RNA-seq data. It is a collection of tracks (super track) that contains not only the predicted gene models, but also data supporting them. Display Conventions and Configuration The Predicted ORFs track shows the predicted exons. All other tracks show the signal as an x-y plot with bars. Methods Methods from Finkel et al: To capture the full SARS-CoV-2 coding capacity, we applied a suite of ribosome profiling approaches to Vero cells infected with SARS-CoV-2 for 5 and 24 hours, and Calu3 cells infected for 7 hours. For each time point we prepared three different ribosome-profiling libraries, each one in two biological replicates. Two Ribo-seq libraries facilitate mapping of translation initiation sites, by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent 80S ribosomes at translation initiation sites from elongating.
The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and gives a snap-shot of actively translating ribosomes across the body of the translated ORF. In parallel, RNA-sequencing was applied to map viral transcripts. The ORF prediction was done by using two computational tools, PRICE and ORF-RATER, that rely on different features of ribosome profiling data, and by manual inspection of the data. The predictions are based on Ribo-seq libraries from two time points (5 and 7 hpi) of two different cell lines (Vero E6 and Calu3 cells), infected with separate virus isolates. The Ribo-Seq data of the 24 hours samples do not show the expected profile of read distribution on viral genes and therefore were not used for the procedure of ORF predictions. For more details see the paper in the References section below. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O et al. The coding capacity of SARS-CoV-2. Nature. 2020 Sep 9;. 
PMID: 32906143 mRNA-seq_7hr_2 Calu3 mRNA 7hr 2 Calu3 mRNA 7hr 2 Genes and Gene Predictions mRNA-seq_7hr_1 Calu3 mRNA 7hr 1 Calu3 mRNA 7hr 1 Genes and Gene Predictions LTM_7hr_2 Calu3 LTM 7hr 2 Calu3 LTM 7hr 2 Genes and Gene Predictions LTM_7hr_1 Calu3 LTM 7hr 1 Calu3 LTM 7hr 1 Genes and Gene Predictions Harr_7hr_2 Calu3 Harr 7hr 2 Calu3 Harr 7hr 2 Genes and Gene Predictions Harr_7hr_1 Calu3 Harr 7hr 1 Calu3 Harr 7hr 1 Genes and Gene Predictions CHX_7hr_2 Calu3 CHX 7hr 2 Calu3 CHX 7hr 2 Genes and Gene Predictions CHX_7hr_1 Calu3 CHX 7hr 1 Calu3 CHX 7hr 1 Genes and Gene Predictions Q0_tracks Galaxy ENA mutations in top lineages - current quarter Most frequent lineages of current quarter Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. 
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. 
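The rolling quarter definition above can be sketched as follows. This is only an illustration under stated assumptions, not the project's own code. It assumes that each quarter starts exactly three calendar months before the next one and that a start day missing from a shorter month is clamped to that month's last day. The function names are hypothetical, and the end date actually displayed on the current quarter track is the collection date of the most recent analyzed sample rather than the update day used here.

```python
import calendar
from datetime import date


def months_back(day, months):
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month.
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day, n_quarters=4):
    """List (start, end) date pairs, most recent quarter first."""
    windows = []
    end = update_day
    for i in range(1, n_quarters + 1):
        start = months_back(update_day, 3 * i)
        windows.append((start, end))
        end = start
    return windows


windows = quarter_windows(date(2022, 6, 15))
for start, end in windows:
    print(start.isoformat(), "to", end.isoformat())
```

Because the windows are anchored on the update day, every data update shifts all four quarters, so a sample can move from one quarter's subtrack to the next between updates.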
Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation. For deletions, the whole feature has a thick rendering, while insertions are displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) reveals details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by: country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected (you could, for example, filter all current quarter lineage tracks to show only mutations that have been found, in their respective lineage, in the UK); and within-lineage frequency.
By default, only mutations that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K.
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEna Galaxy ENA mutations GalaxyProject surveillance of SARS-CoV-2 mutations through consistent processing of public raw sequencing data Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples, see [3, 4, 5]). Required metadata are: sample collection date; sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments.
Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC Genome Browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change in dominant lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation.
For deletions, the whole feature has a thick rendering, while insertions are displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) reveals details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by: country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected (you could, for example, filter all current quarter lineage tracks to show only mutations that have been found, in their respective lineage, in the UK); and within-lineage frequency. By default, only mutations that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances.
The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ0Ba-1-1-11 BA.1.1.11 mutations Mutations (amino acid level) in BA.1.1.11 between 2022-01-05 and 2022-03-09 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2].
It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples, see [3, 4, 5]). Required metadata are: sample collection date; sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC Genome Browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting 3 months prior to the day of the update.
The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change in dominant lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation. For deletions, the whole feature has a thick rendering, while insertions are displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) reveals details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage.
Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by: country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected (you could, for example, filter all current quarter lineage tracks to show only mutations that have been found, in their respective lineage, in the UK); and within-lineage frequency. By default, only mutations that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch.
Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases.
In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ0Ba-1 BA.1 mutations Mutations (amino acid level) in BA.1 between 2022-01-05 and 2022-03-09 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples, see [3, 4, 5]).
Required metadata are: sample collection date; sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC Genome Browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter.
Together the tracks can be used to explore the change in dominant lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation. For deletions, the whole feature has a thick rendering, while insertions are displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) reveals details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters).
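The SNV rendering described under Mutation feature display (a thin feature spanning the codon, with a thick part at the changed base) maps naturally onto BED-style coordinates. The sketch below is an assumption-laden illustration rather than the project's code: it assumes a simple, uninterrupted coding sequence, 0-based half-open BED coordinates with `thickStart`/`thickEnd`, and that shading is driven by a 0-1000 score derived from the within-lineage frequency. The function name `snv_feature` and the returned field names are hypothetical.

```python
def snv_feature(pos: int, cds_start: int, frequency: float) -> dict:
    """Build BED-like coordinates for an SNV inside a coding sequence.

    pos and cds_start are 1-based genome positions; the returned
    coordinates are 0-based, half-open (BED convention).
    Assumes a contiguous CDS (no frameshift or splicing).
    """
    offset = pos - cds_start              # 0-based offset into the CDS
    if offset < 0:
        raise ValueError("position lies upstream of the CDS start")
    codon_index = offset // 3             # 0-based codon number
    codon_start = (cds_start - 1) + 3 * codon_index
    return {
        "chromStart": codon_start,        # thin part: the whole base triplet
        "chromEnd": codon_start + 3,
        "thickStart": pos - 1,            # thick part: the changed base only
        "thickEnd": pos,
        "aa_position": codon_index + 1,   # 1-based amino acid position
        "score": round(frequency * 1000), # assumed 0-1000 shading score
    }


if __name__ == "__main__":
    # Spike CDS start (21563) and the A23403G substitution on NC_045512.2
    print(snv_feature(23403, 21563, 0.5))
```

For the example call, the feature covers the three bases of Spike codon 614 and the thick part marks the middle base of that codon.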
Filtering Mutations Mutation features displayed in each subtrack can be filtered by: country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected (you could, for example, filter all current quarter lineage tracks to show only mutations that have been found, in their respective lineage, in the UK); and within-lineage frequency. By default, only mutations that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server.
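The interplay of the two frequency thresholds described under Filtering Mutations (the adjustable 0.05 default and the fixed 0.001 hard filter) can be illustrated with a small sketch. The record layout, the function name `visible_mutations`, and the rule that a mutation matches a country filter if it was seen in any of the selected countries are assumptions made for this example only; they are not taken from the track configuration.

```python
HARD_FILTER = 0.001        # mutations at or below this are never stored
DEFAULT_THRESHOLD = 0.05   # default display threshold (user adjustable)


def visible_mutations(mutations, threshold=DEFAULT_THRESHOLD, countries=None):
    """Return the mutation records that would be displayed.

    Each record is a dict with a 'frequency' (0-1) and a 'countries' set.
    Assumption: a country filter keeps a mutation if it was observed in
    at least one of the selected countries.
    """
    selected = set(countries) if countries else None
    shown = []
    for m in mutations:
        if m["frequency"] <= HARD_FILTER:
            continue   # not present in the underlying data at all
        if m["frequency"] < threshold:
            continue   # hidden by the user-facing frequency filter
        if selected is not None and not (m["countries"] & selected):
            continue   # not observed in any selected country
        shown.append(m)
    return shown
```

Lowering `threshold` below 0.001 therefore reveals nothing extra, because such mutations were already dropped when the data files were built.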
The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.
PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ0Ba-1-17 BA.1.17 mutations Mutations (amino acid level) in BA.1.17 between 2022-01-05 and 2022-03-01 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. 
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. 
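The rolling quarter definition and the per-quarter choice of the five most common lineages can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the half-open date windows, and the clamping of the day of month are assumptions made for illustration only.

```python
import calendar
from collections import Counter
from datetime import date


def months_before(day: date, months: int) -> date:
    """Date `months` calendar months before `day` (day of month clamped; an assumption)."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def quarter_windows(update_day: date, n_quarters: int = 4):
    """Half-open (start, end) windows, newest first.

    Window 0 starts 3 months before the update day, as described in the text.
    """
    bounds = [months_before(update_day, 3 * i) for i in range(n_quarters + 1)]
    return [(bounds[i + 1], bounds[i]) for i in range(n_quarters)]


def top_lineages(samples, window, k: int = 5):
    """Most common lineages among (collection_date, lineage) pairs inside `window`."""
    start, end = window
    counts = Counter(lineage for collected, lineage in samples if start <= collected < end)
    return [lineage for lineage, _ in counts.most_common(k)]


# Hypothetical samples, for illustration only.
samples = [
    (date(2022, 1, 10), "BA.1.1"),
    (date(2022, 2, 1), "BA.1.1"),
    (date(2022, 3, 1), "BA.2"),
    (date(2021, 12, 1), "AY.4"),
]
windows = quarter_windows(date(2022, 4, 5))
print(windows[0])                         # current quarter: 2022-01-05 up to the update day
print(top_lineages(samples, windows[0]))  # lineages ranked by sample count in that quarter
```

With an update on 2022-04-05, the current quarter starts on 2022-01-05, which matches the start date shown in the track titles on this page.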
Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. 
By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then those histories are made publicly accessible on their server batch information, i.e., samples analyzed and their metadata, links to the histories, etc. 
are added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (counting back from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing it in INSDC databases. In particular we would like to thank: the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. 
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ0Ba-1-1 BA.1.1 mutations Mutations (amino acid level) in BA.1.1 between 2022-01-05 and 2022-03-06 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. 
The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. 
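The geometry of an SNV feature (whole feature spanning the codon, thick part marking the changed base) can be worked through in a short sketch. It assumes BED-style 0-based, half-open coordinates and an uninterrupted reading frame; `snv_feature` and `residue_number` are hypothetical helper names, not functions from the project's code.

```python
def snv_feature(cds_start: int, snv_pos: int):
    """Return (chromStart, chromEnd, thickStart, thickEnd) for an SNV.

    Both arguments are 0-based genomic positions; intervals are half-open.
    Assumes the coding sequence is read in one frame from cds_start
    (no ribosomal frameshift between cds_start and snv_pos).
    """
    if snv_pos < cds_start:
        raise ValueError("SNV lies upstream of the coding sequence")
    codon_index = (snv_pos - cds_start) // 3
    codon_start = cds_start + 3 * codon_index
    # Thin part = the whole codon, thick part = the single changed base.
    return codon_start, codon_start + 3, snv_pos, snv_pos + 1


def residue_number(cds_start: int, snv_pos: int) -> int:
    """1-based index of the amino acid whose codon contains the SNV."""
    return (snv_pos - cds_start) // 3 + 1


# The S gene coding sequence starts at 1-based position 21563 of NC_045512.2.
SPIKE_CDS_START = 21562  # 0-based

# Spike D614G corresponds to A23403G (1-based), i.e. 0-based position 23402.
print(snv_feature(SPIKE_CDS_START, 23402))     # (23401, 23404, 23402, 23403)
print(residue_number(SPIKE_CDS_START, 23402))  # 614
```

Here the changed base is the middle position of the codon, so the thick part sits in the centre of a three-base feature.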
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. 
The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (counting back from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. 
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing it in INSDC databases. In particular we would like to thank: the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ0Ba-2 BA.2 mutations Mutations (amino acid level) in BA.2 between 2022-01-05 and 2022-03-12 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. 
It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: sample collection date; sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. 
The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. 
Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. 
Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: those histories are made publicly accessible on their server; batch information (i.e., samples analyzed and their metadata, links to the histories, etc.) is added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (counting back from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing it in INSDC databases. 
In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 unipCov2Mut Mutations UniProt Amino Acid Mutations UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". 
UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between GenBank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. 
Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. 
The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 gordon2 Protein Interact. Human Interacting Proteins from Gordon et al. (* = druggable) Genes and Gene Predictions Description This track shows data from Gordon et al, 2020 "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing". The authors cloned, tagged, and expressed 26 of the mature proteins expressed by SARS-CoV-2 in human HEK293T cells and used affinity purification mass spectrometry (AP-MS) to identify human proteins that interact with viral proteins. 332 high-confidence interactions are reported between human and viral proteins. Display Conventions and Configuration On the viral genome the coordinates of the viral protein are marked and labeled with the name of the human interactor. The BED file also includes the MIST score * 1000. A MIST score of 1 indicates a highly reproducible and specific interaction; scores are multiplied by 1000 for display purposes. The set of interactions displayed here includes only the 332 "high confidence" interactions that met criteria for significance (using a MIST cutoff >= 0.7 as well as "a SAINTexpress BFDR <= 0.05 and an average spectral count >= 2."). Methods The tab-delimited version of supplementary table 2 was downloaded from bioRxiv. This table lists the viral protein that served as the "bait" and UniProt identifiers of human proteins that were captured as "prey". A version of the "UniProt Mature Protein Products (Polypeptide Chains)" track was then manually modified to rename ORFs to match paper nomenclature as indicated by Figure 1. The table was then joined and MIST scores multiplied by 1000 to produce a track reporting interactions. 
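The significance criteria and the score scaling described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the field names, and the example rows are hypothetical and are not taken from the supplementary table or from the script that built the track.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    bait: str        # viral protein used as bait
    prey: str        # human protein captured as prey
    mist: float      # MIST score, 0..1
    bfdr: float      # SAINTexpress BFDR
    avg_spec: float  # average spectral count


def is_high_confidence(x: Interaction) -> bool:
    """Apply the three thresholds quoted in the track description."""
    return x.mist >= 0.7 and x.bfdr <= 0.05 and x.avg_spec >= 2


def bed_score(mist: float) -> int:
    """Scale a MIST score to the 0..1000 integer range used for display."""
    return int(round(mist * 1000))


# Made-up rows for illustration; not real measurements.
rows = [
    Interaction("bait1", "PREY_A", 0.95, 0.00, 10.0),
    Interaction("bait1", "PREY_B", 0.65, 0.01, 5.0),  # fails the MIST cutoff
    Interaction("bait2", "PREY_C", 0.80, 0.20, 3.0),  # fails the BFDR cutoff
    Interaction("bait2", "PREY_D", 0.70, 0.05, 2.0),  # passes exactly at the cutoffs
]

kept = [(x.bait, x.prey, bed_score(x.mist)) for x in rows if is_high_confidence(x)]
print(kept)  # [('bait1', 'PREY_A', 950), ('bait2', 'PREY_D', 700)]
```

Because all three cutoffs are inclusive, an interaction sitting exactly on the thresholds is retained.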
References Gordon et al., "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing", bioRxiv 2020 Vero6_05hpi Vero6 5hpi Vero6 5hpi Ribo-seq and RNA-seq Genes and Gene Predictions Description The Weizman ORFs (Open Reading Frames) track shows previously unannotated ORF predictions based on Ribo-Seq and RNA-seq data. It is a collection of tracks (super track) that contains not only the predicted gene models, but also data supporting them. Display Conventions and Configuration The Predicted ORFs track shows the predicted exons. All other tracks show the signal as an x-y plot with bars. Methods Methods from Finkel et al: To capture the full SARS-CoV-2 coding capacity, we applied a suite of ribosome profiling approaches to Vero cells infected with SARS-CoV-2 for 5 and 24 hours, and Calu3 cells infected for 7 hours. For each time point we prepared three different ribosome-profiling libraries, each one in two biological replicates. Two Ribo-seq libraries facilitate mapping of translation initiation sites, by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent 80S ribosomes at translation initiation sites from elongating. The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and gives a snap-shot of actively translating ribosomes across the body of the translated ORF. In parallel, RNA-sequencing was applied to map viral transcripts. The ORF prediction was done by using two computational tools, PRICE and ORF-RATER, that rely on different features of ribosome profiling data, and by manual inspection of the data. The predictions are based on Ribo-seq libraries from two time points (5 and 7 hpi) of two different cell lines (Vero E6 and Calu3 cells), infected with separate virus isolates.
The Ribo-Seq data of the 24 hours samples do not show the expected profile of read distribution on viral genes and therefore were not used for the procedure of ORF predictions. For more details see the paper in the References section below. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O et al. The coding capacity of SARS-CoV-2. Nature. 2020 Sep 9;. PMID: 32906143 mRNA-seq_5hr_2 Vero6 mRNA 5hr 2 Vero6 mRNA 5hr 2 Genes and Gene Predictions mRNA-seq_5hr_1 Vero6 mRNA 5hr 1 Vero6 mRNA 5hr 1 Genes and Gene Predictions LTM_5hr_2 Vero6 LTM 5hr 2 Vero6 LTM 5hr 2 Genes and Gene Predictions LTM_5hr_1 Vero6 LTM 5hr 1 Vero6 LTM 5hr 1 Genes and Gene Predictions Harr_5hr_2 Vero6 Harr 5hr 2 Vero6 Harr 5hr 2 Genes and Gene Predictions Harr_5hr_1 Vero6 Harr 5hr 1 Vero6 Harr 5hr 1 Genes and Gene Predictions CHX_5hr_2 Vero6 CHX 5hr 2 Vero6 CHX 5hr 2 Genes and Gene Predictions CHX_5hr_1 Vero6 CHX 5hr 1 Vero6 CHX 5hr 1 Genes and Gene Predictions unipCov2Other Other Annot. UniProt Other Annotations UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. 
This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic.
In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5.
PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipCov2Struct Structure UniProt Protein Primary/Secondary Structure Annotations UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. 
UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server.
The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipCov2Repeat Repeats UniProt Repeats UniProt Protein Annotations Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt/SwissProt staff. The annotations are spread over multiple tracks, based on their "feature type" in UniProt: Track Name Description UCSC Alignment, SwissProt Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) annotations mapped using this track. UCSC Alignment, TrEMBL Protein sequences from TrEMBL mapped onto the genome.
All other tracks are (start,end) annotations mapped using this track. This track is hidden by default. To show it, click its checkbox on the track description page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse-over a feature to see the full UniProt annotation comment. For variants, the mouse-over will show the full name of the UniProt disease acronym.
The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to genome positions with pslMap and filtered again. UniProt annotations were obtained from the UniProt XML file. The annotations were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/uniprot/unipStructCov2.bb -chrom=NC_045512v2 -start=0 -end=29903 stdout Please refer to our mailing list archives for questions or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler at UCSC, with help from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Phil Berman, UCSC.
Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 Q1_tracks Galaxy ENA mutations in top lineages - a quarter ago Most frequent lineages of a quarter ago Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. 
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the Covid-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.
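The track structure described above (four rolling 3-month windows, each with its five most common lineages) can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the project's code: the helper names `months_back`, `quarter_windows` and `top_lineages`, the half-open treatment of window boundaries, and the sample layout of `(collection_date, lineage)` pairs are all hypothetical.

```python
import calendar
from collections import Counter
from datetime import date


def months_back(day, months):
    """Return the date `months` calendar months before `day`, clamping the day of month."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day, n_quarters=4):
    """Return (start, end) pairs, most recent quarter first, counted back from update_day."""
    bounds = [months_back(update_day, 3 * i) for i in range(n_quarters + 1)]
    return [(bounds[i + 1], bounds[i]) for i in range(n_quarters)]


def top_lineages(samples, start, end, k=5):
    """Return the k most frequent lineages among samples with start <= collection date < end.

    Treating the window as half-open is an assumption of this sketch.
    """
    counts = Counter(lineage for collected, lineage in samples if start <= collected < end)
    return [lineage for lineage, _ in counts.most_common(k)]


if __name__ == "__main__":
    # Hypothetical update day; windows are recomputed from it at every update.
    for start, end in quarter_windows(date(2022, 4, 5)):
        print(start.isoformat(), end.isoformat())
```

Because the windows are anchored to the update day instead of calendar quarters, the same sample can move from one subtrack to the next between updates.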
Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. 
By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. When new histories are found: (1) those histories are made publicly accessible on their server; (2) batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json; (3) pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; (4) the genome browser tracks get recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing it in INSDC databases. In particular we would like to thank: the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K.
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ1Ay-4-2-2 AY.4.2.2 mutations Mutations (amino acid level) in AY.4.2.2 between 2021-10-05 and 2022-01-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. 
The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the Covid-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin.
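The SNV convention above (feature spans the codon, thick part marks the changed base) corresponds to the chromStart/chromEnd and thickStart/thickEnd fields of a BED record. The following Python sketch shows one way to compute them; the function name `snv_feature` is hypothetical, and the sketch assumes a plus-strand coding sequence without frameshifts, 0-based half-open coordinates, and a mutated position that lies inside the coding sequence.

```python
def snv_feature(cds_start, pos):
    """Compute BED-style coordinates for an SNV inside a coding sequence.

    cds_start -- 0-based genomic start of the coding sequence (assumed plus strand)
    pos       -- 0-based genomic position of the substituted base
    The whole feature covers the affected codon; the thick part covers only `pos`.
    """
    offset = pos - cds_start
    if offset < 0:
        raise ValueError("position lies upstream of the coding sequence")
    codon_start = cds_start + (offset // 3) * 3
    return {
        "chromStart": codon_start,
        "chromEnd": codon_start + 3,
        "thickStart": pos,
        "thickEnd": pos + 1,
        "codon_number": offset // 3 + 1,  # 1-based amino acid position
    }


if __name__ == "__main__":
    # Illustrative input: a coding sequence starting at 0-based 21562 and a
    # substitution at 0-based 23402 (the well-known spike D614G site, A23403G in 1-based terms).
    print(snv_feature(21562, 23402))
```

In this example the feature spans three bases and the thick part falls on the middle base of codon 614, which is how a second-position substitution would appear in the display.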
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. 
The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server.
The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then:
- those histories are made publicly accessible on their server;
- batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json;
- pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed;
- the genome browser tracks are recalculated by:
  - parsing all analyzed data on the FTP server;
  - determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date;
  - extracting all mutations seen in each quarter for each of the five top lineages in that quarter;
  - rebuilding the bigBed files and track files.
Credits
The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank:
- The COVID-19 Genomics UK Consortium (COG-UK)
- The Estonian KoroGeno-EST initiative
- The Greece vs Corona project
References
Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643
Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1
galaxyEnaQ1Ay-43 AY.43 mutations Mutations (amino acid level) in AY.43 between 2021-10-05 and 2022-01-05 Variation and Repeats
Description
This track represents part of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2].
It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data.
galaxyEnaQ1Ba-1-17 BA.1.17 mutations Mutations (amino acid level) in BA.1.17 between 2021-10-05 and 2022-01-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]).
Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data.
galaxyEnaQ1Ay-4-2 AY.4.2 mutations Mutations (amino acid level) in AY.4.2 between 2021-10-05 and 2022-01-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g.
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data.
galaxyEnaQ1Ay-4 AY.4 mutations Mutations (amino acid level) in AY.4 between 2021-10-05 and 2022-01-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks.
The project web site has more information about available results data.
Display Conventions and Configuration
Track structure
The GalaxyProject SARS-CoV-2 mutation tracking effort is provided as a supertrack containing four subtracks, each representing mutation data from SARS-CoV-2 samples collected in a different 3-month period of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting 3 months before the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together, the tracks can be used to explore the change of dominant lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.
Mutation feature display
To facilitate such searches, the shading of mutation features reflects each mutation's observed frequency among the samples of a given lineage in the given quarter. Lineage-defining mutations should therefore be displayed in dark grey/black, while newly emerging mutations or non-systematic variant-calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base changed by the mutation. For deletions, the whole feature is rendered thick, while insertions are displayed entirely thin.
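The quarter definition given under Track structure (four 3-month windows counted back from the day of the update) can be sketched as follows. This is only an illustration under stated assumptions: the function names `months_before` and `quarter_windows` are hypothetical, the day of month is clamped when the target month is shorter, and the update day is used as the nominal end of the current quarter, whereas the track itself displays the collection date of the most recent analyzed sample.

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month
    (an assumption; the source does not specify this edge case).
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day: date, n_quarters: int = 4):
    """Return (start, end) date pairs, newest quarter first."""
    windows = []
    end = update_day
    for i in range(1, n_quarters + 1):
        start = months_before(update_day, 3 * i)
        windows.append((start, end))
        end = start
    return windows


for start, end in quarter_windows(date(2022, 1, 5)):
    print(start.isoformat(), "to", end.isoformat())
```

For an update on 2022-01-05, the newest window runs from 2021-10-05 to 2022-01-05, which matches the date range in the track labels of this section.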
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) reveals details of the mutation and the associated statistics, in particular: (1) the precise value of its observed frequency in the lineage and quarter; (2) the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and (3) the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by two criteria. The first is the country, or combination of countries, in which samples of the given lineage and collection quarter with the mutation have been collected. You could, for example, filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. The second is within-lineage frequency. By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances.
The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: (1) those histories are made publicly accessible on their server; (2) batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json; (3) pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and (4) the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date, extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.
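The track recalculation step described under Methods (pick the five most frequent lineages of a quarter, then collect the mutations seen in each) can be sketched as follows. This is an illustration only: the record layout (`lineage`, `date`, `mutations`), the function name, and the choice to treat a quarter as the interval `start < date <= end` are assumptions, and the mutation labels are placeholders. Only the 0.001 hard filter comes from the description above.

```python
from collections import Counter
from datetime import date

HARD_FILTER = 0.001  # mutations must exceed 0.1% within-lineage frequency


def top_lineage_mutations(samples, start, end, n_top=5, min_freq=HARD_FILTER):
    """Within-lineage mutation frequencies for the most common lineages.

    samples: iterable of dicts with keys 'lineage', 'date', 'mutations'
    (hypothetical layout). Returns {lineage: {mutation: frequency}}.
    """
    # Assumed interval convention: exclusive start, inclusive end.
    in_quarter = [s for s in samples if start < s["date"] <= end]
    lineage_counts = Counter(s["lineage"] for s in in_quarter)

    result = {}
    for lineage, n_samples in lineage_counts.most_common(n_top):
        mutation_counts = Counter(
            m
            for s in in_quarter
            if s["lineage"] == lineage
            for m in set(s["mutations"])  # count each mutation once per sample
        )
        result[lineage] = {
            m: count / n_samples
            for m, count in mutation_counts.items()
            if count / n_samples > min_freq
        }
    return result


samples = [
    {"lineage": "AY.4", "date": date(2021, 11, 1), "mutations": ["mutA", "mutB"]},
    {"lineage": "AY.4", "date": date(2021, 12, 1), "mutations": ["mutA"]},
    {"lineage": "AY.122", "date": date(2021, 11, 15), "mutations": ["mutA"]},
    {"lineage": "AY.4", "date": date(2021, 8, 1), "mutations": ["mutC"]},  # earlier quarter
]
print(top_lineage_mutations(samples, date(2021, 10, 5), date(2022, 1, 5)))
```

The display-time frequency filter (0.05 by default) would then be applied on top of these values by the browser, not by this step.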
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 PhyloCSFpower PhyloCSF Power Relative branch length of local alignment, a measure of PhyloCSF statistical power Comparative Genomics PhyloCSF Track Hub Description These tracks show evolutionary protein-coding potential as determined by PhyloCSF [1] to help identify conserved, functional, protein-coding regions of genomes.
PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of the amino acid sequence, because it distinguishes the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki. The Smoothed PhyloCSF track shows the PhyloCSF score for each codon in each of 6 frames, smoothed using an HMM. Regions in which most codons have a score greater than 0 are likely to be protein-coding in that frame. No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power). The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the branch length of the species present in the local alignment to the total branch length of all species in the full genome alignment. It is an indication of the statistical power available to PhyloCSF. Codons with a branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to produce a meaningful score at these codons. Codons with a branch length score greater than 0.1 but much less than 1 should be considered less certain. Caveats Around 10% of annotated protein-coding regions in human receive a score less than 0. This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment might not represent a true orthology relationship between the species.
Protein-coding regions will often have a positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand. Methods Tracks were constructed as described in Mudge et al. 2019 [2] and Jungreis et al. 2020 [3]. In brief, PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand in the wuhCor1/SARS-CoV-2 assembly using an alignment of 44 Sarbecovirus genomes, using the PhyloCSF parameters for 29mammals with the tree replaced with a tree of the 44 Sarbecovirus genomes. The scores were smoothed using a Hidden Markov Model (HMM) with 4 states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities for the coding and non-coding models is computed from the PhyloCSF score, since it represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (they can only transition to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture model of three exponential distributions, computed using Expectation Maximization. The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. PhyloCSF+0 shows the log-odds that codons in frame 0 on the '+' strand are in the coding state according to the HMM, and similarly for strand '-' and frames 1 and 2. Credits Questions about the algorithm itself should be directed to Irwin Jungreis.
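The smoothing described under Methods can be illustrated with a small posterior-decoding (forward-backward) sketch on a four-state HMM with the stated topology: one coding state, and three non-coding states that can only stay where they are or move to coding. Everything numeric here is an assumption for illustration: the transition and initial probabilities are made up (the real ones were fitted by Expectation Maximization), the scores are assumed to be in decibans (10 times log10 of the likelihood ratio), and the output is expressed as log10 odds.

```python
import math

# State 0 = coding; states 1-3 = non-coding. Hypothetical probabilities.
TRANS = [
    [0.990, 0.004, 0.003, 0.003],  # coding may stay or enter any non-coding state
    [0.100, 0.900, 0.000, 0.000],  # non-coding states only stay or return to coding
    [0.010, 0.000, 0.990, 0.000],
    [0.001, 0.000, 0.000, 0.999],
]
INIT = [0.25, 0.25, 0.25, 0.25]


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def coding_log_odds(scores, trans=TRANS, init=INIT):
    """Per-codon log10 odds of being in the coding state (state 0)."""
    if not scores:
        return []
    n = len(init)
    # Only emission ratios matter for posteriors, so the coding state gets
    # the likelihood ratio implied by the score and non-coding states get 1.
    emit = [[10 ** (s / 10.0)] + [1.0] * (n - 1) for s in scores]

    # Forward pass, rescaled at every step to avoid underflow.
    alpha = [normalize([init[i] * emit[0][i] for i in range(n)])]
    for t in range(1, len(scores)):
        prev = alpha[-1]
        alpha.append(normalize([
            emit[t][j] * sum(prev[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]))

    # Backward pass, also rescaled.
    beta = [[1.0] * n for _ in scores]
    for t in range(len(scores) - 2, -1, -1):
        beta[t] = normalize([
            sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ])

    log_odds = []
    for a, b in zip(alpha, beta):
        p = normalize([x * y for x, y in zip(a, b)])[0]
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithm finite
        log_odds.append(math.log10(p / (1.0 - p)))
    return log_odds


print(coding_log_odds([10, 10, -2, 10, 10, 10]))
```

Because each codon's posterior depends on its neighbours, a single weakly negative score flanked by strongly positive ones can still end up with positive log-odds, which is the intended effect of smoothing.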
Citing the PhyloCSF Tracks If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 [2] and Jungreis et al. 2020 [3]. References [1] Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011). [2] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research 29(12):2073-2087. doi: 10.1101/gr.246462.118. [3] Jungreis I, Sealfon R and Kellis M (2020). Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations. bioRxiv 2020. PhyloCSF_smooth PhyloCSF PhyloCSF Comparative Genomics Description These tracks show evolutionary protein-coding potential as determined by PhyloCSF [1] to help identify conserved, functional, protein-coding regions of genomes. PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of the amino acid sequence, because it distinguishes the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki. The Smoothed PhyloCSF track shows the PhyloCSF score for each codon in each of 6 frames, smoothed using an HMM.
Regions in which most codons have score greater than 0 are likely to be protein-coding in that frame. No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power). The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the branch length of the species present in the local alignment to the total branch length of all species in the full genome alignment. It is an indication of the statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to get a meaningful score at these codons. Codons with branch length score greater than 0.1 but much less than 1 should be considered less certain. Caveats Around 10% of annotated protein-coding regions in human get scores less than 0. This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment does not represent a true orthology relationship between the species. Protein-coding regions will often have positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand. Methods Tracks were constructed as described in Mudge et al. 2019 and Jungreis et al. 2020. In brief, PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand in the wuhCor1/SARS-CoV-2 assembly using an alignment of 44 Sarbecovirus genomes, using the PhyloCSF parameters for 29mammals with the tree replaced with a tree of the 44 Sarbecovirus genomes. The scores were smoothed using a Hidden Markov Model (HMM) with 4 states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. 
The ratio of the emission probabilities for the coding and non-coding models is computed from the PhyloCSF score, since it represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (they can only transition to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture model of three exponential distributions, computed using Expectation Maximization. The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. PhyloCSF+1 shows the log-odds that codons in frame 1 (sometimes called frame 0) on the '+' strand are in the coding state according to the HMM, and similarly for strand '-' and frames 2 and 3. Data Access The raw bigWig data can be explored interactively with the Table Browser, combined with other datasets in the Data Integrator tool, or downloaded directly from the download server. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits and Citations Questions about the algorithm itself should be directed to Irwin Jungreis. If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 and Jungreis et al. 2020. References Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011 Jul 1;27(13):i275-82. PMID: 21685081; PMC: PMC3117341 Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.
Genome Res. 2019 Dec;29(12):2073-2087. PMID: 31537640; PMC: PMC6886504 Jungreis I, Sealfon R, Kellis M. Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations. bioRxiv. 2020 Jun 3;. PMID: 32577641; PMC: PMC7302193 PhyloCSF_plus_1 Smoothed PhyloCSF+1 Smoothed PhyloCSF Strand + Frame 1 Comparative Genomics PhyloCSF_plus_2 Smoothed PhyloCSF+2 Smoothed PhyloCSF Strand + Frame 2 Comparative Genomics PhyloCSF_plus_3 Smoothed PhyloCSF+3 Smoothed PhyloCSF Strand + Frame 3 Comparative Genomics PhyloCSF_minus_1 Smoothed PhyloCSF-1 Smoothed PhyloCSF Strand - Frame 1 Comparative Genomics PhyloCSF_minus_2 Smoothed PhyloCSF-2 Smoothed PhyloCSF Strand - Frame 2 Comparative Genomics PhyloCSF_minus_3 Smoothed PhyloCSF-3 Smoothed PhyloCSF Strand - Frame 3 Comparative Genomics Q2_tracks Galaxy ENA mutations in top lineages - two quarters ago Most frequent lineages of two quarters ago Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. 
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. 
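The rolling quarter definition described under Track structure can be sketched as below. The helper names are hypothetical, and clamping the day of month for short months is an assumption, since the description does not say how such dates are handled. Note also that the end date shown on the current quarter track is the collection date of the most recent analyzed sample, which may be earlier than the update day used here.

```python
import calendar
from datetime import date


def months_back(day, months):
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the length of the target month
    (an assumption, e.g. 31 May minus 3 months gives 28 February).
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day, n_quarters=4):
    """List of (start, end) pairs, most recent quarter first."""
    return [
        (months_back(update_day, 3 * (k + 1)), months_back(update_day, 3 * k))
        for k in range(n_quarters)
    ]


for start, end in quarter_windows(date(2022, 1, 5)):
    print(start.isoformat(), end.isoformat())
# 2021-10-05 2022-01-05
# 2021-07-05 2021-10-05
# 2021-04-05 2021-07-05
# 2021-01-05 2021-04-05
```

Each window is computed directly from the update day rather than from the previous window, so clamping in one quarter does not shift the boundaries of earlier ones.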
Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. 
By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then those histories are made publicly accessible on their server batch information, i.e., samples analyzed and their metadata, links to the histories, etc. 
are added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date, extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K.
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ2Ay-122 AY.122 mutations Mutations (amino acid level) in AY.122 between 2021-07-05 and 2021-10-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. 
The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. 
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. 
The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then those histories are made publicly accessible on their server batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to ftp://xfer13.crg.eu/gx-surveillance.json pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed the genome browser tracks get recalculated by parsing all analyzed data on the ftp server determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date extracting all mutations seen in each quarter for each of the five top lineages in that quarter rebuilding the bigbed files and track files Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. 
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ2Ay-120 AY.120 mutations Mutations (amino acid level) in AY.120 between 2021-07-05 and 2021-10-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2].
It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: the sample collection date; the sequencing platform, library layout and strategy (currently, reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project website has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update.
The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together, the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant-calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage.
Note that for older, still-circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by two criteria. Country: the country or combination of countries in which samples of the given lineage and collection quarter carrying the mutation have been collected. You could, for example, filter all current-quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. Within-lineage frequency: By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analysis, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch.
Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases.
In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ2Ay-6 AY.6 mutations Mutations (amino acid level) in AY.6 between 2021-07-05 and 2021-10-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]).
Required metadata are: the sample collection date; the sequencing platform, library layout and strategy (currently, reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project website has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter.
Together, the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant-calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still-circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters).
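The link between within-lineage frequency and shading described under "Mutation feature display" can be pictured with a small Python sketch. The linear mapping used here is an assumption chosen purely for illustration (the browser's actual shading steps are not specified in this description), and the function name and RGB output are hypothetical.

```python
def grey_for_frequency(freq):
    """Map a within-lineage frequency in [0, 1] to an (r, g, b) grey.

    Assumed linear scale: 1.0 gives black, 0.0 gives white.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    level = round(255 * (1.0 - freq))
    return (level, level, level)


# A lineage-defining mutation (seen in nearly all samples) comes out dark,
# while a mutation at the 5% default filter threshold comes out light.
dark = grey_for_frequency(0.99)
light = grey_for_frequency(0.05)
```

Under this assumed scale, a mutation found in every sample of the lineage is drawn in black, and mutations near the default 0.05 display threshold are drawn in a very light grey.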
Filtering Mutations Mutation features displayed in each subtrack can be filtered by two criteria. Country: the country or combination of countries in which samples of the given lineage and collection quarter carrying the mutation have been collected. You could, for example, filter all current-quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. Within-lineage frequency: By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analysis, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server.
The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.
PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ2Ay-5 AY.5 mutations Mutations (amino acid level) in AY.5 between 2021-07-05 and 2021-10-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: the sample collection date; the sequencing platform, library layout and strategy (currently, reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g.
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project website has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together, the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.
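As a rough illustration of the track structure described above, the following Python sketch derives four 3-month windows backwards from an update day and picks the five most common lineages within one window. It is not the project's actual code: the function names, the (collection date, lineage) sample tuples, the half-open window boundaries and the clamping of month-end days are all assumptions made for the example.

```python
import calendar
from collections import Counter
from datetime import date


def months_back(day, months):
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last day of the target month
    (an assumption; e.g. 31 May minus 3 months becomes 28/29 February).
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def quarter_windows(update_day, n_quarters=4):
    """List (start, end) windows, newest first, each spanning three months.

    The newest window ends on the update day itself; the description says
    the displayed end date is that of the most recent analyzed sample.
    """
    bounds = [months_back(update_day, 3 * i) for i in range(n_quarters + 1)]
    return [(bounds[i + 1], bounds[i]) for i in range(n_quarters)]


def top_lineages(samples, start, end, n=5):
    """Most common lineages among samples collected in [start, end)."""
    counts = Counter(
        lineage for collected, lineage in samples if start <= collected < end
    )
    return [lineage for lineage, _ in counts.most_common(n)]
```

For a hypothetical update day of 2022-01-05, the second-newest window returned by `quarter_windows` runs from 2021-07-05 to 2021-10-05. Because the windows are anchored to the update day, a lineage can enter or leave a quarter's top five from one update to the next.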
Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant-calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still-circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by two criteria. Country: the country or combination of countries in which samples of the given lineage and collection quarter carrying the mutation have been collected. You could, for example, filter all current-quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. Within-lineage frequency:
By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analysis, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc.
are added to ftp://xfer13.crg.eu/gx-surveillance.json; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks are recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular, we would like to thank the COVID-19 Genomics UK Consortium (COG-UK), the Estonian KoroGeno-EST initiative, and the Greece vs Corona project. References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K.
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ2Ay-4 AY.4 mutations Mutations (amino acid level) in AY.4 between 2021-07-05 and 2021-10-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: the sample collection date; the sequencing platform, library layout and strategy (currently, reanalysis is done for ampliconic paired-end Illumina and ONT data); the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information); and some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting. Analysis is performed on public Galaxy servers with only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks.
The project website has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together, the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage-defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant-calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin.
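The feature geometry described above can be sketched in BED-like terms. The example below is only an illustration under stated assumptions: 0-based, half-open coordinates, a single uninterrupted coding sequence whose start (`cds_start`) is known, field names borrowed from the BED format, and an empty thick region standing in for "all thin". None of the function names come from the project.

```python
def snv_feature(cds_start, pos):
    """Coordinates for an SNV at 0-based genome position `pos`.

    The feature spans the whole codon containing `pos`; the thick part
    covers only the changed base. `cds_start` is the 0-based start of
    the coding sequence (assumed contiguous, no frameshift handling).
    """
    if pos < cds_start:
        raise ValueError("position lies upstream of the coding sequence")
    codon_start = pos - (pos - cds_start) % 3
    return {
        "chromStart": codon_start,
        "chromEnd": codon_start + 3,
        "thickStart": pos,
        "thickEnd": pos + 1,
    }


def indel_feature(start, end, kind):
    """Deletions are thick over their whole extent; insertions are all thin."""
    if kind == "deletion":
        thick_start, thick_end = start, end
    elif kind == "insertion":
        thick_start, thick_end = start, start  # empty thick region
    else:
        raise ValueError("kind must be 'deletion' or 'insertion'")
    return {
        "chromStart": start,
        "chromEnd": end,
        "thickStart": thick_start,
        "thickEnd": thick_end,
    }
```

As a usage example, assuming the spike coding sequence starts at 1-based position 21563 of NC_045512.2 and that D614G corresponds to A23403G, `snv_feature(21562, 23402)` yields a feature spanning 23401 to 23404 with the thick part at 23402 to 23403, i.e. the middle base of the codon.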
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value of its observed frequency in the lineage and quarter; the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected; and the collection date and collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still-circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by two criteria. Country: the country or combination of countries in which samples of the given lineage and collection quarter carrying the mutation have been collected. You could, for example, filter all current-quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. Within-lineage frequency: By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analysis, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances.
The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Then: (1) those histories are made publicly accessible on their server; (2) batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json; (3) pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; (4) the genome browser tracks get recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters (starting from the current date), extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 pdb PDB Structures Protein Data Bank (PDB) Sequence Matches Genes and Gene Predictions Description This track shows alignments of sequences with known protein structures in the Protein Data Bank (PDB). The PDB protein sequence has to match the genome over at least 80% of its length, so somewhat similar sequences, e.g. from the SARS 2003 outbreak, are also shown. Display Conventions and Configuration Genomic locations of PDB matches are labeled with the accession number.
Clicking on one of them shows a standard feature detail page with the PDB page integrated into it. The protein structure is shown on the PDB page. Methods PDB sequences were downloaded from the PDB website and aligned with BLAT. Only alignments with a minimum identity of 80% that span at least 80% of the query sequence were kept. References Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000 Jan 1;28(1):235-42. PMID: 10592235; PMC: PMC102472 Vero6_24hpi Vero6 24hpi Vero6 24hpi Ribo-seq and RNA-seq Genes and Gene Predictions Description The Weizman ORFs (Open Reading Frames) track shows previously unannotated ORF predictions based on Ribo-Seq and RNA-seq data. It is a collection of tracks (super track) that contains not only the predicted gene models, but also data supporting them. Display Conventions and Configuration The Predicted ORFs track shows the predicted exons. All other tracks show the signal as an x-y plot with bars. Methods Methods from Finkel et al.: To capture the full SARS-CoV-2 coding capacity, we applied a suite of ribosome profiling approaches to Vero cells infected with SARS-CoV-2 for 5 and 24 hours, and Calu3 cells infected for 7 hours. For each time point we prepared three different ribosome-profiling libraries, each one in two biological replicates. Two Ribo-seq libraries facilitate mapping of translation initiation sites, by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent 80S ribosomes at translation initiation sites from elongating. The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and gives a snap-shot of actively translating ribosomes across the body of the translated ORF. In parallel, RNA-sequencing was applied to map viral transcripts.
The ORF prediction was done by using two computational tools, PRICE and ORF-RATER, that rely on different features of ribosome profiling data, and by manual inspection of the data. The predictions are based on Ribo-seq libraries from two time points (5 and 7 hpi) of two different cell lines (Vero E6 and Calu3 cells), infected with separate virus isolates. The Ribo-Seq data of the 24 hours samples do not show the expected profile of read distribution on viral genes and therefore were not used for the procedure of ORF predictions. For more details see the paper in the References section below. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O et al. The coding capacity of SARS-CoV-2. Nature. 2020 Sep 9;. PMID: 32906143 mRNA-seq_24hr_2 Vero6 mRNA 24hr 2 Vero6 mRNA 24hr 2 Genes and Gene Predictions mRNA-seq_24hr_1 Vero6 mRNA 24hr 1 Vero6 mRNA 24hr 1 Genes and Gene Predictions LTM_24hr_2 Vero6 LTM 24hr 2 Vero6 LTM 24hr 2 Genes and Gene Predictions LTM_24hr_1 Vero6 LTM 24hr 1 Vero6 LTM 24hr 1 Genes and Gene Predictions Harr_24hr_2 Vero6 Harr 24hr 2 Vero6 Harr 24hr 2 Genes and Gene Predictions Harr_24hr_1 Vero6 Harr 24hr 1 Vero6 Harr 24hr 1 Genes and Gene Predictions CHX_24hr_2 Vero6 CHX 24hr 2 Vero6 CHX 24hr 2 Genes and Gene Predictions CHX_24hr_1 Vero6 CHX 24hr 1 Vero6 CHX 24hr 1 Genes and Gene Predictions bloomEscSerumAvg Bloom Serum Average Bloom Lab: S RBD-mutation patient serum antibody escape - average score across serum samples (patients A-K) Immunology Description The subtracks of this track show mutations that lead to escape from patient serum antibodies or monoclonal antibodies. 
Most of the mutations assayed were in the receptor binding domain (RBD) of the S protein. The data shown here were imported from different studies, listed below. The Bloom lab papers used deep mutational scanning data to measure the effect of all possible mutations in the Spike RBD using a yeast surface display system. Bloom lab - patients A-K: antibodies in sera from the Hospitalized or Ambulatory Adults with Respiratory Viral Infections (HAARVI) cohort, described in Greaney et al., Biorxiv 2021. Bloom lab - 10 antibodies: A selection of ten monoclonal antibodies, described in Greaney et al, Cell Host Microbe 2020. Bloom lab - 4 treatment antibodies: Four monoclonal antibodies licensed for treatment. The results were described in Starr et al, Biorxiv 2021. Whelan lab - 21 antibodies: a selection screen of 21 neutralizing monoclonal antibodies (mAbs) against the receptor binding domain (RBD) generated 48 escape mutants. The results were described in Liu et al, Biorxiv 2020. Rappuoli lab - serum from one patient: three mutations obtained by passaging of cells in neutralizing serum from a single patient, described in Andreano et al, Biorxiv 2021. McCoy lab - mutations tested on monoclonal antibodies and patient sera, described in Rees-Spear et al, Biorxiv 2021. For the Bloom lab data, we show just a summary of the data. Better and detailed structural visualizations are available from the authors via dms-view using the following links: patient sera, 10 monoclonal antibodies, 4 treatment antibodies. Display Conventions and Configuration Bloom lab data Scores represent the "escape fraction" (discussed at length in the Methods of the paper) which "represent the fraction of a given variant that escape antibody binding, and should in principle range from 0 to 1."
"Note that the magnitude of the measured effects of mutations on antibody escape depends on the antibody concentration and the flow cytometry gates applied, meaning that the escape fractions are comparable across sites for any given antibody, but are not precisely comparable among antibodies without external calibration." A higher score indicates a greater level of escape. The data summarized to protein positions are shown as 36 subtracks, one per sample, that indicate the maximum score per amino acid position that was assayed as shades of color or, in full mode, as an x-y barplot. Blue subtracks show data from monoclonal antibodies, red ones from patient sera. By configuring the current track (click on "Antibody escape" under the image), one can display the total sum of all scores per amino acid. The data is also summarized as two x-y barplots showing the average values per amino acid, again in red (sera) and blue (MABs). Finally, another summary track has one feature per position where the score exceeds 0.18. These features are clickable and the details page shows the exact amino acid changes and their scores. Whelan lab data Features are labeled with the nucleotide and protein coordinates and the name of the antibody. Click a feature or mouse-over a feature to show these annotations. Rappuoli lab data The three mutations are labeled with the protein coordinates. McCoy lab data Features are labeled with the amino acid mutation coordinates. Click a feature or mouse-over a feature to show a description of the specific mutation. Methods Patient sera: data was downloaded from the jbloomlab Github file and parsed into bedGraph format. 10 Antibodies: Table S1 from Starr et al. was downloaded and parsed into bedGraph format. 4 treatment antibodies: Data was downloaded from the jbloomlab Github file and parsed into bedGraph format using the total and maximum values. 21 Antibodies: Table 2 from Liu et al. 2020 was copied manually and converted to bedGraph format.
For the Rappuoli lab, the mutations were manually copied from the text. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Greaney AJ, Loes AN, Crawford K, Starr T, Malone K, Chu H, Bloom JD. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies . Biorxiv. 2021 Jan 04;. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, Hilton SK, Huddleston J, Eguia R, Crawford KHD et al. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 2020 Nov 19;. PMID: 33259788; PMC: PMC7676316 Zhuoming Liu, Laura A. VanBlargan, Paul W. Rothlauf, Louis-Marie Bloyet, Rita E. Chen, Spencer Stumpf, Haiyan Zhao, John M. Errico, Elitza S. Theel, Ali H. Ellebedy, Daved H. Fremont, Michael S. Diamond, Sean P. J. Whelan Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Biorxiv. 2020 Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. bioRxiv. 2020 Dec 1;. PMID: 33299993; PMC: PMC7724661 Andreano E, Piccini G, Licastro D, Casalino L, Johnson NV, Paciello I, Monego SD, Pantano E, Manganaro N, Manenti A et al. SARS-CoV-2 escape <i>in vitro</i> from a highly neutralizing COVID-19 convalescent plasma. bioRxiv. 2020 Dec 28;. PMID: 33398278; PMC: PMC7781313 abEscape Antibody Escape Escape from serum or monoclonal antibodies: Whelan, Bloom and Rappuoli groups Immunology Description The subtracks of this track show mutations that lead to escape from patient serum antibodies or monoclonal antibodies. 
Most of the mutations assayed were in the receptor binding domain (RBD) of the S protein. The data shown here were imported from different studies, listed below. The Bloom lab papers used deep mutational scanning data to measure the effect of all possible mutations in the Spike RBD using a yeast surface display system. Bloom lab - patients A-K: antibodies in sera from the Hospitalized or Ambulatory Adults with Respiratory Viral Infections (HAARVI) cohort, described in Greaney et al., Biorxiv 2021. Bloom lab - 10 antibodies: A selection of ten monoclonal antibodies, described in Greaney et al, Cell Host Microbe 2020. Bloom lab - 4 treatment antibodies: Four monoclonal antibodies licensed for treatment. The results were described in Starr et al, Biorxiv 2021. Whelan lab - 21 antibodies: a selection screen of 21 neutralizing monoclonal antibodies (mAbs) against the receptor binding domain (RBD) generated 48 escape mutants. The results were described in Liu et al, Biorxiv 2020. Rappuoli lab - serum from one patient: three mutations obtained by passaging of cells in neutralizing serum from a single patient, described in Andreano et al, Biorxiv 2021. McCoy lab - mutations tested on monoclonal antibodies and patient sera, described in Rees-Spear et al, Biorxiv 2021. For the Bloom lab data, we show just a summary of the data. Better and detailed structural visualizations are available from the authors via dms-view using the following links: patient sera, 10 monoclonal antibodies, 4 treatment antibodies. Display Conventions and Configuration Bloom lab data Scores represent the "escape fraction" (discussed at length in the Methods of the paper) which "represent the fraction of a given variant that escape antibody binding, and should in principle range from 0 to 1."
"Note that the magnitude of the measured effects of mutations on antibody escape depends on the antibody concentration and the flow cytometry gates applied, meaning that the escape fractions are comparable across sites for any given antibody, but are not precisely comparable among antibodies without external calibration." A higher score indicates a greater level of escape. The data summarized to protein positions are shown as 36 subtracks, one per sample, that indicate the maximum score per amino acid position that was assayed as shades of color or, in full mode, as an x-y barplot. Blue subtracks show data from monoclonal antibodies, red ones from patient sera. By configuring the current track (click on "Antibody escape" under the image), one can display the total sum of all scores per amino acid. The data is also summarized as two x-y barplots showing the average values per amino acid, again in red (sera) and blue (MABs). Finally, another summary track has one feature per position where the score exceeds 0.18. These features are clickable and the details page shows the exact amino acid changes and their scores. Whelan lab data Features are labeled with the nucleotide and protein coordinates and the name of the antibody. Click a feature or mouse-over a feature to show these annotations. Rappuoli lab data The three mutations are labeled with the protein coordinates. McCoy lab data Features are labeled with the amino acid mutation coordinates. Click a feature or mouse-over a feature to show a description of the specific mutation. Methods Patient sera: data was downloaded from the jbloomlab Github file and parsed into bedGraph format. 10 Antibodies: Table S1 from Starr et al. was downloaded and parsed into bedGraph format. 4 treatment antibodies: Data was downloaded from the jbloomlab Github file and parsed into bedGraph format using the total and maximum values. 21 Antibodies: Table 2 from Liu et al. 2020 was copied manually and converted to bedGraph format.
For the Rappuoli lab, the mutations were manually copied from the text. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Greaney AJ, Loes AN, Crawford K, Starr T, Malone K, Chu H, Bloom JD. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies . Biorxiv. 2021 Jan 04;. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, Hilton SK, Huddleston J, Eguia R, Crawford KHD et al. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 2020 Nov 19;. PMID: 33259788; PMC: PMC7676316 Zhuoming Liu, Laura A. VanBlargan, Paul W. Rothlauf, Louis-Marie Bloyet, Rita E. Chen, Spencer Stumpf, Haiyan Zhao, John M. Errico, Elitza S. Theel, Ali H. Ellebedy, Daved H. Fremont, Michael S. Diamond, Sean P. J. Whelan Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Biorxiv. 2020 Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. bioRxiv. 2020 Dec 1;. PMID: 33299993; PMC: PMC7724661 Andreano E, Piccini G, Licastro D, Casalino L, Johnson NV, Paciello I, Monego SD, Pantano E, Manganaro N, Manenti A et al. SARS-CoV-2 escape <i>in vitro</i> from a highly neutralizing COVID-19 convalescent plasma. bioRxiv. 2020 Dec 28;. PMID: 33398278; PMC: PMC7781313 bloomEscMabAvg Bloom MAB Average Bloom Lab: S RBD-mutation monoclonal antibody escape - average score across all 13 MAB samples Immunology Description The subtracks of this track show mutations that lead to escape from patient serum antibodies or monoclonal antibodies. 
Most of the mutations assayed were in the receptor binding domain (RBD) of the S protein. The data shown here were imported from different studies, listed below. The Bloom lab papers used deep mutational scanning data to measure the effect of all possible mutations in the Spike RBD using a yeast surface display system. Bloom lab - patients A-K: antibodies in sera from the Hospitalized or Ambulatory Adults with Respiratory Viral Infections (HAARVI) cohort, described in Greaney et al., Biorxiv 2021. Bloom lab - 10 antibodies: A selection of ten monoclonal antibodies, described in Greaney et al, Cell Host Microbe 2020. Bloom lab - 4 treatment antibodies: Four monoclonal antibodies licensed for treatment. The results were described in Starr et al, Biorxiv 2021. Whelan lab - 21 antibodies: a selection screen of 21 neutralizing monoclonal antibodies (mAbs) against the receptor binding domain (RBD) generated 48 escape mutants. The results were described in Liu et al, Biorxiv 2020. Rappuoli lab - serum from one patient: three mutations obtained by passaging of cells in neutralizing serum from a single patient, described in Andreano et al, Biorxiv 2021. McCoy lab - mutations tested on monoclonal antibodies and patient sera, described in Rees-Spear et al, Biorxiv 2021. For the Bloom lab data, we show just a summary of the data. Better and detailed structural visualizations are available from the authors via dms-view using the following links: patient sera, 10 monoclonal antibodies, 4 treatment antibodies. Display Conventions and Configuration Bloom lab data Scores represent the "escape fraction" (discussed at length in the Methods of the paper) which "represent the fraction of a given variant that escape antibody binding, and should in principle range from 0 to 1."
"Note that the magnitude of the measured effects of mutations on antibody escape depends on the antibody concentration and the flow cytometry gates applied, meaning that the escape fractions are comparable across sites for any given antibody, but are not precisely comparable among antibodies without external calibration." A higher score indicates a greater level of escape. The data summarized to protein positions are shown as 36 subtracks, one per sample, that indicate the maximum score per amino acid position that was assayed as shades of color or, in full mode, as an x-y barplot. Blue subtracks show data from monoclonal antibodies, red ones from patient sera. By configuring the current track (click on "Antibody escape" under the image), one can display the total sum of all scores per amino acid. The data is also summarized as two x-y barplots showing the average values per amino acid, again in red (sera) and blue (MABs). Finally, another summary track has one feature per position where the score exceeds 0.18. These features are clickable and the details page shows the exact amino acid changes and their scores. Whelan lab data Features are labeled with the nucleotide and protein coordinates and the name of the antibody. Click a feature or mouse-over a feature to show these annotations. Rappuoli lab data The three mutations are labeled with the protein coordinates. McCoy lab data Features are labeled with the amino acid mutation coordinates. Click a feature or mouse-over a feature to show a description of the specific mutation. Methods Patient sera: data was downloaded from the jbloomlab Github file and parsed into bedGraph format. 10 Antibodies: Table S1 from Starr et al. was downloaded and parsed into bedGraph format. 4 treatment antibodies: Data was downloaded from the jbloomlab Github file and parsed into bedGraph format using the total and maximum values. 21 Antibodies: Table 2 from Liu et al. 2020 was copied manually and converted to bedGraph format.
For the Rappuoli lab, the mutations were manually copied from the text. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Greaney AJ, Loes AN, Crawford K, Starr T, Malone K, Chu H, Bloom JD. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies . Biorxiv. 2021 Jan 04;. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, Hilton SK, Huddleston J, Eguia R, Crawford KHD et al. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 2020 Nov 19;. PMID: 33259788; PMC: PMC7676316 Zhuoming Liu, Laura A. VanBlargan, Paul W. Rothlauf, Louis-Marie Bloyet, Rita E. Chen, Spencer Stumpf, Haiyan Zhao, John M. Errico, Elitza S. Theel, Ali H. Ellebedy, Daved H. Fremont, Michael S. Diamond, Sean P. J. Whelan Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Biorxiv. 2020 Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. bioRxiv. 2020 Dec 1;. PMID: 33299993; PMC: PMC7724661 Andreano E, Piccini G, Licastro D, Casalino L, Johnson NV, Paciello I, Monego SD, Pantano E, Manganaro N, Manenti A et al. SARS-CoV-2 escape <i>in vitro</i> from a highly neutralizing COVID-19 convalescent plasma. bioRxiv. 2020 Dec 28;. 
PMID: 33398278; PMC: PMC7781313 bloomEscTop Bloom Strong Mutations Bloom Lab: Strong S RBD-mutation antibody escape - positions with max score > 0.18 - shading = number of samples where found Immunology Description The subtracks of this track show mutations that lead to escape from patient serum antibodies or monoclonal antibodies. Most of the mutations assayed were in the receptor binding domain (RBD) of the S protein. The data shown here were imported from different studies, listed below. The Bloom lab papers used deep mutational scanning data to measure the effect of all possible mutations in the Spike RBD using a yeast surface display system. Bloom lab - patients A-K: antibodies in sera from the Hospitalized or Ambulatory Adults with Respiratory Viral Infections (HAARVI) cohort, described in Greaney et al., Biorxiv 2021. Bloom lab - 10 antibodies: A selection of ten monoclonal antibodies, described in Greaney et al, Cell Host Microbe 2020. Bloom lab - 4 treatment antibodies: Four monoclonal antibodies licensed for treatment. The results were described in Starr et al, Biorxiv 2021. Whelan lab - 21 antibodies: a selection screen of 21 neutralizing monoclonal antibodies (mAbs) against the receptor binding domain (RBD) generated 48 escape mutants. The results were described in Liu et al, Biorxiv 2020. Rappuoli lab - serum from one patient: three mutations obtained by passaging of cells in neutralizing serum from a single patient, described in Andreano et al, Biorxiv 2021. McCoy lab - mutations tested on monoclonal antibodies and patient sera, described in Rees-Spear et al, Biorxiv 2021. For the Bloom lab data, we show just a summary of the data. Better and detailed structural visualizations are available from the authors via dms-view using the following links: patient sera, 10 monoclonal antibodies, 4 treatment antibodies. 
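The "strong mutations" summary named in this track's title (positions with a maximum score above 0.18, shaded by the number of samples where found) can be sketched as below. This is only an illustration under assumptions: the record layout `(sample, position, mutation, score)`, the function name `strong_positions`, and the reading of "number of samples where found" as the count of samples with at least one score above the cutoff at that position are hypothetical, and the example values in the usage note are made up.

```python
from collections import defaultdict

THRESHOLD = 0.18  # cutoff stated in the track title


def strong_positions(records, threshold=THRESHOLD):
    """Summarize escape scores per amino acid position.

    `records` is an iterable of (sample, position, mutation, score) tuples
    (a hypothetical layout). Returns {position: (max_score, n_samples)} for
    positions whose maximum score exceeds `threshold`, where n_samples is
    the number of distinct samples with a score above the threshold there.
    """
    best = defaultdict(float)
    samples_above = defaultdict(set)
    for sample, position, _mutation, score in records:
        best[position] = max(best[position], score)
        if score > threshold:
            samples_above[position].add(sample)
    return {
        position: (best[position], len(samples_above[position]))
        for position in sorted(best)
        if best[position] > threshold
    }
```

For instance, with the invented records `[("serumA", 484, "E484K", 0.25), ("mabX", 484, "E484Q", 0.30), ("serumA", 417, "K417N", 0.10)]`, only position 484 is kept, with a maximum of 0.30 found in two samples.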
Display Conventions and Configuration Bloom lab data Scores represent the "escape fraction" (discussed at length in the Methods of the paper) which "represent the fraction of a given variant that escape antibody binding, and should in principle range from 0 to 1." "Note that the magnitude of the measured effects of mutations on antibody escape depends on the antibody concentration and the flow cytometry gates applied, meaning that the escape fractions are comparable across sites for any given antibody, but are not precisely comparable among antibodies without external calibration." A higher score indicates a greater level of escape. The data summarized to protein positions are shown as 36 subtracks, one per sample, that indicate the maximum score per amino acid position that was assayed as shades of color or, in full mode, as an x-y barplot. Blue subtracks show data from monoclonal antibodies, red ones from patient sera. By configuring the current track (click on "Antibody escape" under the image), one can display the total sum of all scores per amino acid. The data is also summarized as two x-y barplots showing the average values per amino acid, again in red (sera) and blue (MABs). Finally, another summary track has one feature per position where the score exceeds 0.18. These features are clickable and the details page shows the exact amino acid changes and their scores. Whelan lab data Features are labeled with the nucleotide and protein coordinates and the name of the antibody. Click a feature or mouse-over a feature to show these annotations. Rappuoli lab data The three mutations are labeled with the protein coordinates. McCoy lab data Features are labeled with the amino acid mutation coordinates. Click a feature or mouse-over a feature to show a description of the specific mutation. Methods Patient sera: data was downloaded from the jbloomlab Github file and parsed into bedGraph format.
10 Antibodies: Table S1 from Starr et al, was downloaded and parsed into bedGraph format. 4 treatment antibodies: Data was downloaded from the jbloomlab Github file and parsed into bedGraph format using the total and maximum values. 21 Antibodies: Table 2 from Liu et al 2020, was copied manually and converted to bedGraph format. For the Rappuoli lab, the mutations were manually copied from the text. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Greaney AJ, Loes AN, Crawford K, Starr T, Malone K, Chu H, Bloom JD. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies . Biorxiv. 2021 Jan 04;. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, Hilton SK, Huddleston J, Eguia R, Crawford KHD et al. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 2020 Nov 19;. PMID: 33259788; PMC: PMC7676316 Zhuoming Liu, Laura A. VanBlargan, Paul W. Rothlauf, Louis-Marie Bloyet, Rita E. Chen, Spencer Stumpf, Haiyan Zhao, John M. Errico, Elitza S. Theel, Ali H. Ellebedy, Daved H. Fremont, Michael S. Diamond, Sean P. J. Whelan Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Biorxiv. 2020 Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. bioRxiv. 2020 Dec 1;. PMID: 33299993; PMC: PMC7724661 Andreano E, Piccini G, Licastro D, Casalino L, Johnson NV, Paciello I, Monego SD, Pantano E, Manganaro N, Manenti A et al. SARS-CoV-2 escape <i>in vitro</i> from a highly neutralizing COVID-19 convalescent plasma. 
bioRxiv. 2020 Dec 28;. PMID: 33398278; PMC: PMC7781313 bloomEscMax Bloom Max Escape Bloom Lab: S RBD-mutation antibody escape - maximum escape score per amino acid - 13 MABs and serum from 11 patients (A-K) Immunology Description The subtracks of this track show mutations that lead to escape from patient serum antibodies or monoclonal antibodies. Most of the mutations assayed were in the receptor binding domain (RBD) of the S protein. The data shown here were imported from different studies, listed below. The Bloom lab papers used deep mutational scanning data to measure the effect of all possible mutations in the Spike RBD using a yeast surface display system. Bloom lab - patients A-K: antibodies in sera from the Hospitalized or Ambulatory Adults with Respiratory Viral Infections (HAARVI) cohort, described in Greaney et al., Biorxiv 2021. Bloom lab - 10 antibodies: A selection of ten monoclonal antibodies, described in Greaney et al, Cell Host Microbe 2020. Bloom lab - 4 treatment antibodies: Four monoclonal antibodies licensed for treatment. The results were described in Starr et al, Biorxiv 2021. Whelan lab - 21 antibodies: a selection screen of 21 neutralizing monoclonal antibodies (mAbs) against the receptor binding domain (RBD) generated 48 escape mutants. The results were described in Liu et al, Biorxiv 2020. Rappuoli lab - serum from one patient: three mutations obtained by passaging of cells in neutralizing serum from a single patient, described in Andreano et al, Biorxiv 2021. McCoy lab - mutations tested on monoclonal antibodies and patient sera, described in Rees-Spear et al, Biorxiv 2021. For the Bloom lab data, we show just a summary of the data. Better and detailed structural visualizations are available from the authors via dms-view using the following links: patient sera, 10 monoclonal antibodies, 4 treatment antibodies. 
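The "maximum escape score per amino acid" in this track's title, combined with the bedGraph output format used for these tracks, can be illustrated with a short sketch. It is written under assumptions rather than taken from the track's build scripts: the input layout `(aa_position, score)`, the function names, and the chromosome label `NC_045512v2` are hypothetical choices for the example. The S coding sequence is taken to start at nucleotide 21563 of NC_045512.2, and bedGraph intervals are 0-based and half-open.

```python
S_CDS_START = 21563  # 1-based first nucleotide of the S coding sequence in NC_045512.2


def codon_interval(aa_position, cds_start=S_CDS_START):
    """Return the 0-based, half-open genome interval of a codon in the S gene."""
    start0 = (cds_start - 1) + 3 * (aa_position - 1)
    return start0, start0 + 3


def max_escape_bedgraph(records, chrom="NC_045512v2"):
    """Build bedGraph lines holding the maximum score per amino acid position.

    `records` is an iterable of (aa_position, score) pairs (hypothetical layout);
    `chrom` is an assumed sequence name and should match the target assembly.
    """
    best = {}
    for aa_position, score in records:
        best[aa_position] = max(score, best.get(aa_position, score))
    lines = []
    for aa_position in sorted(best):
        start, end = codon_interval(aa_position)
        lines.append(f"{chrom}\t{start}\t{end}\t{best[aa_position]:g}")
    return lines
```

Each output line spans the three bases of one codon, so a score attached to Spike position 484 covers genome bases 23012 to 23014 in 1-based coordinates.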
Display Conventions and Configuration Bloom lab data Scores represent the "escape fraction" (discussed at length in the Methods of the paper). Escape fractions "represent the fraction of a given variant that escape antibody binding, and should in principle range from 0 to 1." "Note that the magnitude of the measured effects of mutations on antibody escape depends on the antibody concentration and the flow cytometry gates applied, meaning that the escape fractions are comparable across sites for any given antibody, but are not precisely comparable among antibodies without external calibration." A higher score indicates a greater level of escape. The data summarized to protein positions are shown as 36 subtracks, one per sample, that indicate the maximum score per assayed amino acid position as shades of color or, in full mode, as an x-y barplot. Blue subtracks show data from monoclonal antibodies, red ones from patient sera. By configuring the current track (click on "Antibody escape" under the image), one can display the total sum of all scores per amino acid. The data are also summarized as two x-y barplots of the average values per amino acid, again in red (sera) and blue (MABs). Finally, another summary track has one feature per position where the score exceeds 0.18. These features are clickable and the details page shows the exact amino acid changes and their scores. Whelan lab data Features are labeled with the nucleotide and protein coordinates and the name of the antibody. Click or mouse over a feature to show these annotations. Rappuoli lab data The three mutations are labeled with the protein coordinates. McCoy lab data Features are labeled with the amino acid mutation coordinates. Click or mouse over a feature to show a description of the specific mutation. Methods Patient sera: data was downloaded from the jbloomlab GitHub file and parsed into bedGraph format.
10 Antibodies: Table S1 from Starr et al. was downloaded and parsed into bedGraph format. 4 treatment antibodies: Data was downloaded from the jbloomlab GitHub file and parsed into bedGraph format using the total and maximum values. 21 Antibodies: Table 2 from Liu et al. 2020 was copied manually and converted to bedGraph format. For the Rappuoli lab, the mutations were manually copied from the text. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Greaney AJ, Loes AN, Crawford K, Starr T, Malone K, Chu H, Bloom JD. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human serum antibodies. bioRxiv. 2021 Jan 04. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, Hilton SK, Huddleston J, Eguia R, Crawford KHD et al. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 2020 Nov 19. PMID: 33259788; PMC: PMC7676316 Liu Z, VanBlargan LA, Rothlauf PW, Bloyet LM, Chen RE, Stumpf S, Zhao H, Errico JM, Theel ES, Ellebedy AH, Fremont DH, Diamond MS, Whelan SPJ. Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. bioRxiv. 2020. Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. bioRxiv. 2020 Dec 1. PMID: 33299993; PMC: PMC7724661 Andreano E, Piccini G, Licastro D, Casalino L, Johnson NV, Paciello I, Monego SD, Pantano E, Manganaro N, Manenti A et al. SARS-CoV-2 escape in vitro from a highly neutralizing COVID-19 convalescent plasma.
bioRxiv. 2020 Dec 28;. PMID: 33398278; PMC: PMC7781313 COV2-2832Max MAB COV2-2832 Bloom antibody escape - Max Score - MAB COV2-2832 Immunology COV2-2677Max MAB COV2-2677 Bloom antibody escape - Max Score - MAB COV2-2677 Immunology COV2-2499Max MAB COV2-2499 Bloom antibody escape - Max Score - MAB COV2-2499 Immunology COV2-2479Max MAB COV2-2479 Bloom antibody escape - Max Score - MAB COV2-2479 Immunology COV2-2165Max MAB COV2-2165 Bloom antibody escape - Max Score - MAB COV2-2165 Immunology COV2-2096Max MAB COV2-2096 Bloom antibody escape - Max Score - MAB COV2-2096 Immunology COV2-2094Max MAB COV2-2094 Bloom antibody escape - Max Score - MAB COV2-2094 Immunology COV2-2082Max MAB COV2-2082 Bloom antibody escape - Max Score - MAB COV2-2082 Immunology COV2-2050Max MAB COV2-2050 Bloom antibody escape - Max Score - MAB COV2-2050 Immunology rCR3022Max MAB rCR3022 Bloom antibody escape - Max Score - MAB rCR3022 Immunology REGN10987Max MAB REGN10987 Bloom antibody escape - Max Score - MAB REGN10987 Immunology REGN10933-REGN10987Max MAB REGN10933-REGN10987 Bloom antibody escape - Max Score - MAB REGN10933-REGN10987 Immunology REGN10933Max MAB REGN10933 Bloom antibody escape - Max Score - MAB REGN10933 Immunology LY-CoV016Max MAB LY-CoV016 Bloom antibody escape - Max Score - MAB LY-CoV016 Immunology K_29Max Serum: K, day 029 Bloom antibody escape - Max Score - K_29 Immunology K_103Max Serum: K, day 103 Bloom antibody escape - Max Score - K_103 Immunology J_15Max Serum: J, day 015 Bloom antibody escape - Max Score - J_15 Immunology J_121Max Serum: J, day 121 Bloom antibody escape - Max Score - J_121 Immunology I_26Max Serum: I, day 026 Bloom antibody escape - Max Score - I_26 Immunology I_102Max Serum: I, day 102 Bloom antibody escape - Max Score - I_102 Immunology H_61Max Serum: H, day 061 Bloom antibody escape - Max Score - H_61 Immunology H_152Max Serum: H, day 152 Bloom antibody escape - Max Score - H_152 Immunology G_94Max Serum: G, day 094 Bloom antibody escape - Max 
Score - G_94 Immunology G_18Max Serum: G, day 018 Bloom antibody escape - Max Score - G_18 Immunology F_48Max Serum: F, day 048 Bloom antibody escape - Max Score - F_48 Immunology F_115Max Serum: F, day 115 Bloom antibody escape - Max Score - F_115 Immunology E_28Max Serum: E, day 028 Bloom antibody escape - Max Score - E_28 Immunology E_104Max Serum: E, day 104 Bloom antibody escape - Max Score - E_104 Immunology D_76Max Serum: D, day 076 Bloom antibody escape - Max Score - D_76 Immunology D_33Max Serum: D, day 033 Bloom antibody escape - Max Score - D_33 Immunology C_32Max Serum: C, day 032 Bloom antibody escape - Max Score - C_32 Immunology C_104Max Serum: C, day 104 Bloom antibody escape - Max Score - C_104 Immunology B_26Max Serum: B, day 026 Bloom antibody escape - Max Score - B_26 Immunology B_113Max Serum: B, day 113 Bloom antibody escape - Max Score - B_113 Immunology A_45Max Serum: A, day 045 Bloom antibody escape - Max Score - A_45 Immunology A_21Max Serum: A, day 021 Bloom antibody escape - Max Score - A_21 Immunology A_120Max Serum: A, day 120 Bloom antibody escape - Max Score - A_120 Immunology contacts PDB Ligand Contacts Potential contact residues in PDB structures of viral proteins Genes and Gene Predictions Description This track shows potential contact residues for ligands, inferred from structures in the PDB database by Alastair Fyfe, UCSC. Display Conventions and Configuration Genomic locations of contact residues are highlighted with thick blocks that look identical to exons. Contact residues of the same PDB structure are connected by thin intron lines. To display the 3D structure viewer with these contact residues highlighted, follow the outlink at the top of the details page of any feature. Methods PDB SEQRES protein sequences were aligned to the genome with tblastn. Ligands were determined manually; all amino acids closer than 3.5 Angstroms to a ligand were obtained as described below.
The positions of these close amino acids on the genome were determined using the tblastn alignment and are highlighted on the genome with exon blocks. To find nearby amino acids, inter-atom distances were calculated with the libraries GEMMI and Clipper at a threshold of 3.5 Å. Distances between atoms whose proximity is due to crystal packing or symmetry are listed but cannot be inspected in the viewer. These contacts are flagged by a check in the "Symm?" column. They include atoms in adjacent unit cells and atoms brought into proximity by application of a symmetry transformation to the asymmetric unit (ASU), but not atoms related by non-crystallographic symmetry (NCS). For each amino acid in the contact table, the "Codon" column gives the codon's nucleotide values and position in the NC_045512.2 reference sequence. Amino acids that could not be aligned to the reference, for example insertions or deletions engineered to facilitate protein expression, are flagged by "-". Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/contacts.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Track created by Alastair Fyfe, UCSC, mentored by David Haussler.
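The distance criterion described in the Methods of the PDB Ligand Contacts track can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the GEMMI/Clipper procedure used for the track: coordinates are plain (x, y, z) tuples in Å, the function name and the toy data are hypothetical, and symmetry-related or crystal-packing contacts are not considered at all.

```python
import math


def contact_residues(protein_atoms, ligand_atoms, cutoff=3.5):
    """Return the residues that have at least one atom closer than `cutoff` Å
    to any ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs, one per atom
    ligand_atoms:  iterable of (x, y, z) tuples
    """
    ligand_atoms = list(ligand_atoms)
    close = set()
    for residue_id, xyz in protein_atoms:
        if any(math.dist(xyz, lig) < cutoff for lig in ligand_atoms):
            close.add(residue_id)
    return sorted(close)


# Toy coordinates, invented for illustration: residue ids are (chain, number).
protein_atoms = [
    (("A", 41), (0.0, 0.0, 0.0)),
    (("A", 41), (1.5, 0.0, 0.0)),
    (("A", 145), (10.0, 0.0, 0.0)),
]
ligand_atoms = [(3.0, 0.0, 0.0)]

print(contact_residues(protein_atoms, ligand_atoms))
```

This all-against-all comparison is adequate for a single small structure; a real implementation would typically use a spatial index and would also have to generate symmetry mates to reproduce the contacts flagged in the "Symm?" column.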
bloomEscTotal Bloom Total Escape Bloom Lab: S RBD-mutation antibody escape - total escape score per amino acid - 13 MABs and serum from 11 patients (A-K) Immunology K_103Total Serum: K, day 103 Bloom antibody escape - Total Score - Subject K, Day 103 Immunology K_29Total Serum: K, day 029 Bloom antibody escape - Total Score - Subject K, Day 29 Immunology J_121Total Serum: J, day 121 Bloom antibody escape - Total Score - Subject J, Day 121 Immunology J_15Total Serum: J, day 015 Bloom antibody escape - Total Score - Subject J, Day 15 Immunology I_102Total Serum: I, day 102 Bloom antibody escape - Total Score - Subject I, Day 102 Immunology I_26Total Serum: I, day 026 Bloom antibody escape - Total Score - Subject I, Day 26 Immunology H_152Total Serum: H, day 152 Bloom antibody escape - Total Score - Subject H, Day 152 Immunology H_61Total Serum: H, day 061 Bloom antibody escape - Total Score - Subject H, Day 61 Immunology G_94Total Serum: G, day 094 Bloom antibody escape - Total Score - Subject G, Day 94 Immunology G_18Total Serum: G, day 018 Bloom antibody escape - Total Score - Subject G, Day 18 Immunology F_115Total Serum: F, day 115 Bloom antibody escape - Total Score - Subject F, Day 115 Immunology F_48Total Serum: F, day 048 Bloom antibody escape - Total Score - Subject F, Day 48 Immunology E_104Total Serum: E, day 104 Bloom antibody escape - Total Score - Subject E, Day 104 Immunology E_28Total Serum: E, day 028 Bloom antibody escape - Total Score - Subject E, Day 28 Immunology D_76Total Serum: D, day 076 Bloom antibody escape - Total Score - Subject D, Day 76 Immunology D_33Total Serum: D, day 033 Bloom antibody escape - Total Score - Subject D, Day 33 Immunology C_104Total Serum: C, day 104 Bloom antibody escape - Total Score - Subject C, Day 104 Immunology C_32Total Serum: C, day 032 Bloom antibody escape - Total Score - Subject C, Day 32 Immunology B_113Total Serum: B, day 113 Bloom antibody escape - Total Score - Subject B, Day 113 Immunology B_26Total 
Serum: B, day 026 Bloom antibody escape - Total Score - Subject B, Day 26 Immunology A_120Total Serum: A, day 120 Bloom antibody escape - Total Score - Subject A, Day 120 Immunology A_45Total Serum: A, day 045 Bloom antibody escape - Total Score - Subject A, Day 45 Immunology A_21Total Serum: A, day 021 Bloom antibody escape - Total Score - Subject A, Day 21 Immunology REGN10987Total MAB REGN10987 total Bloom antibody escape - Total Score - REGN10987 Immunology REGN10933-REGN10987Total MAB REGN10933-REGN10987 total Bloom antibody escape - Total Score - REGN10933-REGN10987 Immunology REGN10933Total MAB REGN10933 total Bloom antibody escape - Total Score - REGN10933 Immunology rCR3022Total MAB rCR3022 total Bloom antibody escape - Total Score - rCR3022 Immunology LY-CoV016Total MAB LY-CoV016 total Bloom antibody escape - Total Score - LY-CoV016 Immunology COV2-2832Total MAB COV2-2832 total Bloom antibody escape - Total Score - COV2-2832 Immunology COV2-2677Total MAB COV2-2677 total Bloom antibody escape - Total Score - COV2-2677 Immunology COV2-2499Total MAB COV2-2499 total Bloom antibody escape - Total Score - COV2-2499 Immunology COV2-2479Total MAB COV2-2479 total Bloom antibody escape - Total Score - COV2-2479 Immunology COV2-2165Total MAB COV2-2165 total Bloom antibody escape - Total Score - COV2-2165 Immunology COV2-2096Total MAB COV2-2096 total Bloom antibody escape - Total Score - COV2-2096 Immunology COV2-2094Total MAB COV2-2094 total Bloom antibody escape - Total Score - COV2-2094 Immunology COV2-2082Total MAB COV2-2082 total Bloom antibody escape - Total Score - COV2-2082 Immunology COV2-2050Total MAB COV2-2050 total Bloom antibody escape - Total Score - COV2-2050 Immunology escape Whelan 21 Ab Whelan lab: RBD Mutations that lead to escape from 21 monoclonal antibodies (click to show mutation details) Immunology rappuoli Rappuoli Serum Escape Rappuoli lab: S Mutations that lead to escape from neutralizing antibodies from plasma of a single patient Immunology mccoy McCoy Escape McCoy lab: S Mutation impact on neutralization by serum and mAbs Immunology cas13Crispr Cas13 CRISPR Cas13 CRISPR targets Mapping and Sequencing Description This track shows the in-silico design of crRNAs for Cas13 using the tool nCov2019_Guide_Design, as described in Abbott et al., 2020. To target highly conserved regions of the SARS-CoV-2 genome, an in-silico collection of all 3,802 possible crRNAs was generated. After excluding crRNAs that either are predicted to have potential off-target binding (≤2 mismatches in the human transcriptome) or contain poly-T sequences that may prevent crRNA expression (≥4 consecutive Ts), a set of 3,203 crRNAs was obtained. These crRNAs are also able to target SARS and MERS with ≤1 mismatch. Each crRNA has been characterized with four features: (1) Efficiency is predicted using the online tool at https://gitlab.com/sanjanalab/cas13 (2) Specificity is determined by the number of off-target loci in human mRNA with ≤2 mismatches to the crRNA (3) Generality within Coronaviridae is quantified as the percentage of Coronaviridae strains targeted by the given crRNA with perfect identity (4) Generality within SARS-CoV-2 is quantified as the percentage of 1,087 SARS-CoV-2 patient genomes downloaded on March 20, 2020 that are targeted by the given crRNA with perfect identity Method To design all possible crRNAs for the three pathogenic RNA viruses (SARS-CoV-2, SARS-CoV, and MERS-CoV), the reference genomes of SARS-CoV and MERS-CoV, along with SARS-CoV-2 genomes derived from 47 patients, were first aligned by MAFFT using the --auto flag. crRNA candidates were identified by using a sliding window to extract all 22-nucleotide (nt) sequences with perfect identity among the SARS-CoV-2 genomes. We annotated each crRNA candidate with the number of mismatches relative to the SARS-CoV and MERS-CoV genomes, as well as the GC content. 3,802 crRNA candidates were selected with perfect match against the 47 SARS-CoV-2 genomes and with ≤1 mismatch to SARS-CoV or MERS-CoV sequences.
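The sliding-window step of the Cas13 crRNA design can be sketched in Python as follows. This is an illustration only, not the nCov2019_Guide_Design code: the function names and the toy sequences are hypothetical, the genomes are assumed to be already aligned to equal length (the track used MAFFT for this), and the poly-T filter is applied to the window as extracted, without deciding whether the published pipeline tests the target or its reverse complement.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_poly_t(seq, run=4):
    """True if the sequence contains at least `run` consecutive Ts."""
    return "T" * run in seq


def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def conserved_windows(aligned_genomes, k=22):
    """Yield (start, window) for every k-mer that is free of gaps and Ns and
    identical at the same alignment columns in all genomes."""
    ref = aligned_genomes[0]
    for start in range(len(ref) - k + 1):
        window = ref[start:start + k]
        if "-" in window or "N" in window:
            continue
        if all(g[start:start + k] == window for g in aligned_genomes[1:]):
            yield start, window


# Toy alignment (invented); k=4 instead of 22 to keep the example readable.
genomes = [
    "ACGTACGGTTTTAC",
    "ACGTACGGTTTTAC",
    "ACCTACGGTTTTAC",  # differs from the others at column 2
]
candidates = [
    (start, w, round(gc_content(w), 2))
    for start, w in conserved_windows(genomes, k=4)
    if not has_poly_t(w, run=4)
]
for row in candidates:
    print(row)
```

The `mismatches` helper corresponds to the annotation step in which each candidate is compared with the SARS-CoV and MERS-CoV sequences at the same alignment position; candidates with more than one mismatch would be dropped at that point.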
To characterize the specificity of 22-nt crRNAs, we ensured that each crRNA does not target any sequences in the human transcriptome. We used Bowtie 1.2.2 to align crRNAs to the human transcriptome (hg38; including non-coding RNA) and removed crRNAs that mapped to the human transcriptome with ≤2 mismatches.

Data Access
The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/cas13Crispr.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout
Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.

Credits
The predictions for this track were produced by Xueqiu Lin and Augustine Chemparathy in the Stanley Qi lab at Stanford University.

References
Abbott, Timothy R., Girija Dhamdhere, Yanxia Liu, Xueqiu Lin, Laine Goudy, Leiping Zeng, Augustine Chemparathy, et al., 2020. Development of CRISPR as a Prophylactic Strategy to Combat Novel Coronavirus and Influenza. bioRxiv.

Q3_tracks Galaxy ENA mutations in top lineages - three quarters ago Most frequent lineages of three quarters ago Variation and Repeats

Description
This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2].
It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are:
sample collection date
sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data)
the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information)
some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting

Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore, and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data.

Display Conventions and Configuration

Track structure
The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest/current quarter starting 3 months prior to the day of the update.
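As a rough illustration of how such rolling quarters could be derived from an update day, here is a minimal sketch. The helper names and the month arithmetic (clamping to the last day of shorter months) are assumptions; the project's own scripts may compute the boundaries differently, and whether boundary days are inclusive is not stated in the text.

```python
import calendar
from datetime import date
from typing import List, Tuple


def minus_months(d: date, months: int) -> date:
    """Return d shifted back by a whole number of calendar months,
    clamping the day to the length of the target month."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def quarter_windows(update_day: date, n_quarters: int = 4) -> List[Tuple[date, date]]:
    """List (start, end) pairs, newest first; window i spans from
    3*(i+1) months to 3*i months before the update day."""
    return [
        (minus_months(update_day, 3 * (i + 1)), minus_months(update_day, 3 * i))
        for i in range(n_quarters)
    ]


# Hypothetical update day, used only to show the shape of the result.
for start, end in quarter_windows(date(2022, 4, 5)):
    print(start.isoformat(), end.isoformat())
```

With this assumed update day, the oldest of the four windows runs from 2021-04-05 to 2021-07-05. The sketch uses the update day itself as the end of the newest window, whereas the displayed end date of the current quarter is tied to the most recent analyzed sample.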
The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.

Mutation feature display
To facilitate such a search, the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter. Lineage-defining mutations should therefore be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level. For SNV mutations, the feature as a whole extends across the base triplet encoding the affected amino acid, while the thick part of the feature indicates the precise base that is changed by the mutation. For deletions, the whole feature has a thick rendering, while insertions are displayed all thin.

Mutation details
Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular:
the precise value for its observed frequency in the lineage and quarter
the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected
the collection date and the collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage
Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters).

Filtering Mutations
Mutation features displayed in each subtrack can be filtered by:
country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could, for example, filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK.
within-lineage frequency. By default, only mutations that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect).

Methods
For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data are processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch.
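The two filters described under Filtering Mutations can be pictured with the following minimal sketch. It is not the browser's implementation: the Mutation record, its field names, the any-country matching rule, and the use of an inclusive comparison are all assumptions made for illustration. Only the 0.05 default and the 0.001 floor come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional

DEFAULT_MIN_FREQ = 0.05   # default filter setting stated in the text
HARD_MIN_FREQ = 0.001     # floor already applied to the underlying data


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record for one mutation in one lineage and quarter."""
    name: str
    frequency: float            # within-lineage frequency, 0..1
    countries: FrozenSet[str]   # countries where it was observed


def visible(mutations: Iterable[Mutation],
            min_freq: float = DEFAULT_MIN_FREQ,
            countries: Optional[Iterable[str]] = None) -> List[Mutation]:
    """Keep mutations at or above the frequency threshold and, if a
    country selection is given, seen in at least one selected country
    (assumed semantics)."""
    threshold = max(min_freq, HARD_MIN_FREQ)  # cannot go below the hard filter
    wanted = frozenset(countries) if countries is not None else None
    kept = []
    for m in mutations:
        if m.frequency < threshold:
            continue
        if wanted is not None and not (m.countries & wanted):
            continue
        kept.append(m)
    return kept
```

Lowering min_freq below 0.001 has no effect in this sketch, which mirrors the statement that mutations under the hard filter are not present in the track data at all.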
Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then:
those histories are made publicly accessible on their server
batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json
pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed
the genome browser tracks are recalculated by:
  parsing all analyzed data on the FTP server
  determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date
  extracting all mutations seen in each quarter for each of the five top lineages in that quarter
  rebuilding the bigBed files and track files

Credits
The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases.
In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ3Ay-9 AY.9 mutations Mutations (amino acid level) in AY.9 between 2021-04-05 and 2021-07-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). 
Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. 
Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). 
Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. 
The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then those histories are made publicly accessible on their server batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to ftp://xfer13.crg.eu/gx-surveillance.json pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed the genome browser tracks get recalculated by parsing all analyzed data on the ftp server determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date extracting all mutations seen in each quarter for each of the five top lineages in that quarter rebuilding the bigbed files and track files Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publically available by depositing it in INSDC databases. In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. 
PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ3Ay-5 AY.5 mutations Mutations (amino acid level) in AY.5 between 2021-04-05 and 2021-07-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. 
a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. 
Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. 
By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories, then those histories are made publicly accessible on their server batch information, i.e., samples analyzed and their metadata, links to the histories, etc. 
are added to ftp://xfer13.crg.eu/gx-surveillance.json pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed the genome browser tracks get recalculated by parsing all analyzed data on the ftp server determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date extracting all mutations seen in each quarter for each of the five top lineages in that quarter rebuilding the bigbed files and track files Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publically available by depositing it in INSDC databases. In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. 
& Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ3B-1-617-2 B.1.617.2 mutations Mutations (amino acid level) in B.1.617.2 between 2021-04-05 and 2021-07-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. 
The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. 
Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. 
The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json ; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks get recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date, extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.
The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ3B-1-1-7 B.1.1.7 mutations Mutations (amino acid level) in B.1.1.7 between 2021-04-05 and 2021-07-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2].
It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. 
The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. 
Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. 
Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json ; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks get recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date, extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases.
In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 galaxyEnaQ3Ay-4 AY.4 mutations Mutations (amino acid level) in AY.4 between 2021-04-05 and 2021-07-05 Variation and Repeats Description This track represents parts of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. This project aims at fully open and transparent, high-quality reanalysis of public raw sequencing data deposited in INSDC databases on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that are providing sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]). 
Required metadata are: Sample collection date Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data) the primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information) some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting Analysis is performed on public Galaxy servers with only open-source tools orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore and includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences and pangolin lineage assignments. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC genome browser tracks. The project web site has more information about available results data. Display Conventions and Configuration Track structure The GalaxyProject SARS-CoV-2 mutations tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different 3-months periods of the Covid-19 pandemic. The quarters are redefined with each data update with the latest/current quarter starting 3 months prior to the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update. Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter. 
Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations. Mutation feature display To facilitate such search the shading of mutation features reflects the mutation's observed frequency among the samples of a given lineage in the given quarter, which means that lineage defining mutations should be displayed in dark grey/black, while newly emerging mutations or non-systematic variant calling artefacts should appear in lighter shades of grey. Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin. Mutation details Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular: the precise value for its observed frequency in the lineage and quarter the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected. the collection date and the collecting country of the sample, in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be older than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters). 
Filtering Mutations Mutation features displayed in each subtrack can be filtered by country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK. within-lineage frequency. By default only mutations are shown that have been observed in at least 5% of the samples assigned to the given lineage in the given quarter (0.05 default filter setting). You can lower or increase that threshold as you see fit. Note however, that the underlying bigbed data of the tracks is filtered to contain only data for mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect). Methods For analyses, batches of raw sequencing data get downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data gets processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another one for ONT data), which performs QC, read mapping, mapped reads postprocessing including primer trimming, variant calling and annotation and results in a collection of VCF files, one for each sample in the batch. This output gets picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview over variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch. Sequencing data downloads, execution of the three types of workflows, and export of key results files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. 
The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories; then those histories are made publicly accessible on their server; batch information, i.e., samples analyzed and their metadata, links to the histories, etc., is added to ftp://xfer13.crg.eu/gx-surveillance.json ; pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed; and the genome browser tracks get recalculated by parsing all analyzed data on the FTP server, determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date, extracting all mutations seen in each quarter for each of the five top lineages in that quarter, and rebuilding the bigBed files and track files. Credits The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC. The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world. For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel. The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular we would like to thank: The COVID-19 Genomics UK Consortium (COG-UK) The Estonian KoroGeno-EST initiative The Greece vs Corona project References Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.
PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643 Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1 addgene Addgene Plasmids Addgene Plasmid Sequences alignable to the Genome Mapping and Sequencing Description This track shows sequences contained in plasmids that can be ordered through Addgene. Many plasmids containing SARS-CoV-2 sequences are now available and more are currently undergoing quality control analysis. Some of these plasmids are described in preprints, published articles, and summaries. For a more detailed description of some of these collections, please see the links to Addgene's webpage for each plasmid. Some of the following plasmids are available to industry, in addition to academics and nonprofits. For more information on ordering, the link to each plasmid's homepage has been provided. Display Conventions and Configuration Shown on the genome are regions that can be aligned to a plasmid. The feature label is the Addgene identifier. A click on a feature shows a brief description and a link to the plasmid's full webpage for ordering. Methods GenBank files were downloaded from Addgene and were parsed to extract relevant data fields and the plasmid nucleotide sequence. The reference sequence, SARS-CoV-2 (wuhCor1 - NC_045512v2), was then downloaded from the UCSC Genome Browser and a reference database for BLAST alignment was created using makeblastdb. The viral region of each plasmid was then aligned using blastn. Alignments exceeding an expected value of 0.00001 were considered. The aligned sequences were then converted to BED format and the relevant GenBank data fields were added. Finally, bedToBigBed was used to create the bigBed track.
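The conversion from blastn alignments to BED described in the Methods above can be sketched as follows. This is a minimal illustration, not the script actually used for the track. It assumes blastn was run with the standard 12-column tabular output (-outfmt 6), that hits are kept when their e-value is at or below the 0.00001 cutoff (the usual convention, which the wording above does not confirm), and that the plasmid identifier was used as the query name. The function name blast6_to_bed and the plasmid IDs in the usage note are hypothetical.

```python
from typing import Iterable, List


def blast6_to_bed(lines: Iterable[str], max_evalue: float = 1e-5) -> List[str]:
    """Convert BLAST tabular (outfmt 6) hits into sorted BED6 lines.

    Standard outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue  # assumption: keep hits at or below the cutoff
        # BLAST reports minus-strand hits with sstart > send.
        strand = "+" if sstart <= send else "-"
        # BLAST coordinates are 1-based inclusive; BED is 0-based half-open.
        start = min(sstart, send) - 1
        end = max(sstart, send)
        rows.append((subject_id, start, end, query_id, strand))
    # bedToBigBed expects input sorted by chromosome, then start position.
    rows.sort(key=lambda r: (r[0], r[1]))
    return [f"{c}\t{s}\t{e}\t{name}\t0\t{strand}" for c, s, e, name, strand in rows]
```

For example, a plus-strand hit of a hypothetical plasmid 158035 at reference positions 21563-21762 becomes the BED interval 21562-21762, while a hit with an e-value of 0.5 is dropped. Additional GenBank fields would be appended as extra columns before running bedToBigBed.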
Credits Thanks to Nathan Mauldin and Jason Fernandes for making this track. References References can be found on the specific plasmid's webpage. Each plasmid is hosted by https://www.addgene.org. gold Assembly Assembly from Fragments Mapping and Sequencing Description This track shows the sequences used in the Jan. 2020 sars-cov-2 genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. In dense mode, this track depicts the contigs that make up the currently viewed scaffold. Contig boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist between contigs, spaces are shown between the gold and brown blocks. The relative order and orientation of the contigs within a scaffold is always known; therefore, a line is drawn in the graphical display to bridge the blocks. Component types found in this track (with counts of that type in parentheses): D - draft sequence (1) augustusGene AUGUSTUS AUGUSTUS ab initio gene predictions v3.1 Genes and Gene Predictions Description This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone. For more information on the different gene tracks, see our Genes FAQ. Methods Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. 
Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2). The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. Different training species were used for the --species option when generating these predictions for different groups of assemblies. Assembly Group Training Species Fish zebrafish Birds chicken Human and all other vertebrates human Nematodes caenorhabditis Drosophila fly A. mellifera honeybee1 A. gambiae culex S. cerevisiae saccharomyces This table describes which training species was used for a particular group of assemblies. When available, the closest related training species was used. Credits Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke. References Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656 Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192 cd8escape CD8 Escape Muts T-Cell MHCI CD8+ Escape Mutations from Agerer et al. Sci Immun 2020 Immunology Description Cytotoxic T lymphocytes (CTL) find and kill cells infected with viruses by examining viral epitopes presented by MHC class I molecules on the surface of the cell. In the context of SARS-CoV-2, infected patients have elicited CTL responses by showing higher levels of granzyme B, perforin and IFN-gamma. 
The rapid rate of mutation in SARS-CoV-2 may hinder the presentation of peptides by MHC-I molecules, allowing the virus to evade the T cell response. Agerer et al. performed deep sequencing of viral genomes from several patients and identified mutations in the Nucleocapsid (N), ORF1ab, Membrane (M), and Envelope (E) proteins. This track includes 194 mutants spanning 305 T-cell epitopes restricted to HLA-A*02:01 and HLA-B*40:01. Methods In this study, deep sequencing was performed on virus samples from Austria. To identify mutant epitopes, MHC-I binding assays were performed. Subsequently, PBMCs collected from HLA-A*02:01- and HLA-B*40:01-typed COVID-19 patients were subjected to functional assays. The methodology is described in more detail in the original article. References Agerer B, Koblischke M, Gudipati V, Montaño-Gutierrez LF, Smyth M, Popa A, Genger JW, Endler L, Florian DM, Mühlgrabner V et al. SARS-CoV-2 mutations in MHC-I-restricted epitopes evade CD8+ T cell responses. Sci Immunol. 2021 Mar 4;6(57). PMID: 33664060 rosettaMhc CD8 RosettaMHC CD8 Epitopes predicted by NetMHC and Rosetta Immunology Description As a first step toward the development of diagnostic and therapeutic tools to fight the Coronavirus disease (COVID-19), it is important to characterize CD8+ T cell epitopes in the SARS-CoV-2 peptidome that can trigger adaptive immune responses. Here, we use RosettaMHC, a comparative modeling approach which leverages existing high-resolution X-ray structures from peptide/MHC complexes available in the Protein Data Bank, to derive physically realistic 3D models for high-affinity SARS-CoV-2 epitopes. We outline an application of our method to model 439 9mer and 279 10mer predicted epitopes displayed by the common allele HLA-A*02:01, and we make our models publicly available through an online database (https://rosettamhc.chemistry.ucsc.edu).
As more detailed studies on antigen-specific T cell recognition become available, RosettaMHC models of antigens from different strains and HLA alleles can be used as a basis to understand the link between peptide/HLA complex structure and surface chemistry with immunogenicity, in the context of SARS-CoV-2 infection. This track includes 718 CD8 epitopes restricted to HLA-A*02:01 as predicted by NetMHCpan4.0 and RosettaMHC. The structural models of all 718 epitopes are available in the database (see Description). All the epitopes are scored by combining the binding affinity predicted by NetMHCpan4.0 (eluted ligand) with the binding energy calculated in the Rosetta force field: score = (0.5 * ((NetMHCpan affinity - average NetMHCpan affinity) / range of NetMHCpan affinities + (Rosetta binding energy - average Rosetta binding energy) / range of Rosetta binding energies) + 1) * 500. Methods Epitopes of lengths 9 and 10 from all reading frames of the SARS-CoV-2 proteome are generated and filtered using NetMHCpan4.0 (eluted ligand prediction). All the epitopes predicted as strong or weak binders (a total of 718) to HLA-A*02:01 by NetMHCpan4.0 (using the default %Rank cut-off) are modeled using RosettaMHC. Further, the binding energies of all 718 epitopes to HLA-A*02:01 are calculated in Rosetta. Alongside all the models, their NetMHCpan predictions and binding energies are made available through a database and in Supplementary Table 1 of Nerli and Sgourakis (2020), listed in the References section below. Notes For a full description of the methods used, refer to Nerli and Sgourakis (2020) in the References section below. Credits Nikolaos Sgourakis (nsgourak@ucsc.edu) Santrupti Nerli (snerli@ucsc.edu) Data were generated and processed at UCSC. For inquiries, please contact Nikolaos Sgourakis from the Sgourakis Research Group at UCSC. References Nerli and Sgourakis. 2020 (Manuscript submitted) (BioRxiv).
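The combined epitope score given in the RosettaMHC track description above can be written out as a short sketch. It assumes that "range" means maximum minus minimum over all scored epitopes and that "average" is the arithmetic mean; the description does not define either term, and the function name combined_scores is hypothetical. Under those assumptions each normalized term lies between -1 and 1, so the score falls between 0 and 1000.

```python
from statistics import mean
from typing import List, Sequence


def combined_scores(affinities: Sequence[float], energies: Sequence[float]) -> List[float]:
    """Combine NetMHCpan affinities and Rosetta binding energies per epitope.

    Implements score = (0.5 * (norm_affinity + norm_energy) + 1) * 500, where
    each norm term is (value - mean) / range over the whole epitope set.
    """
    if len(affinities) != len(energies):
        raise ValueError("affinities and energies must have the same length")
    if not affinities:
        return []
    mean_aff, mean_en = mean(affinities), mean(energies)
    # Assumption: "range" is max minus min across all epitopes.
    range_aff = max(affinities) - min(affinities)
    range_en = max(energies) - min(energies)
    if range_aff == 0 or range_en == 0:
        raise ValueError("range is zero; normalization is undefined")
    scores = []
    for aff, en in zip(affinities, energies):
        norm_aff = (aff - mean_aff) / range_aff
        norm_en = (en - mean_en) / range_en
        scores.append((0.5 * (norm_aff + norm_en) + 1) * 500)
    return scores
```

With three hypothetical epitopes whose affinities are 0, 1, 2 and whose energies are -30, -20, -10, the scores are 250, 500 and 750: an epitope at the mean of both measures scores 500, and epitopes at both extremes score 250 and 750 in this small set.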
cytoBandIdeo Chromosome Band (Ideogram) Ideogram for Orientation Mapping and Sequencing crisprDet CRISPR Detection CRISPR Detection Guides Mapping and Sequencing Description This track shows the locations of CRISPR detection guides and flanking primers. There are three studies on this topic, with the one by Mammoth Biosciences the most advanced. The Broad Institute's guides were validated, the NYU guides are predictions for now. Prefix Institution Overview Links Mammoth Mammoth Biosciences There are three guides in the set, two are shown on the genome. The third guide detects human RNAseP, as a sample control. Target sequence: TAATTTCTACTAAGTGTAGATAATTACTTGGGTGTGACCCT Protocol and preprint. Broad Sabeti Lab, Broad Institute Various guides are predicted, the one shown here is from the protocol. Protocol from Mar 21 2020, Guide website and preprint. NYU Sanjana Lab, New York University These are predictions by a new algorithm. Shown is the prediction with the highest score. Website Methods Sequences were mapped with BLAT. Credits Thanks to the labs for producing this work, as well as Kiley Charbonneau for pointing us to the papers. resist Drug Resistance Mutations Mutations that confer drug resistance (Anna Niewiadomska, BV-BRC) Variation and Repeats Description This track lists amino acid mutations that are known to confer drug resistance against Remdesivir, Sotrovimab, and Nirmatrelvir (Paxlovid). They were sourced with permission from Paul Gordon's website at the University of Calgary and Anna Niewiadomska from the J. Craig Venter Institute, manually input by Max Haeussler. Display Conventions Mutations are highlighted on the reference genome. Click or mouse over the mutations to show the full description. Data Access The data can be explored interactively with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server or the API. 
Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Anna Niewiadomska from the BV-BRC team at the J. Craig Venter Institute for converting these to coordinates. gap Gap Gap Locations Mapping and Sequencing Description This track shows the gaps in the Jan. 2020 SARS-CoV-2 genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of the gaps in this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap is supported by read pair data, it is a bridged gap and a white line is drawn through the black box representing the gap. This assembly has no annotated gaps. gc5BaseBw GC Percent GC Percent in 5-Base Windows Mapping and Sequencing Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. genscan Genscan Genes Genscan Gene Predictions Genes and Gene Predictions Description This track shows predictions from the Genscan program written by Chris Burge. The predictions are based on transcriptional, translational and donor/acceptor splicing signals as well as the length and compositional distributions of exons, introns and intergenic regions. For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods For a description of the Genscan program and the model that underlies it, refer to Burge and Karlin (1997) in the References section below. The splice site models used are described in more detail in Burge (1998) below. Credits Thanks to Chris Burge for providing the Genscan program. References Burge C. Modeling Dependencies in Pre-mRNA Splicing Signals. In: Salzberg S, Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Amsterdam: Elsevier Science; 1998. p. 127-163. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997 Apr 25;268(1):78-94. PMID: 9149143 multiz7way Human CoV Multiz Alignment and Conservation of 7 Strains of human coronavirus Comparative Genomics Description This track shows multiple alignments of 7 human coronavirus sequences, aligned to the SARS-CoV-2 NCBI reference sequence SARS-CoV-2 for NC_045512.2, genome assembly GCF_009858895.2_ASM985889v3. The multiple alignments were generated using Multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. In the track display, the sequences are labeled using common names. Note the table below to relate these common names to the NCBI assembly accession identifier. Display Conventions and Configuration Pairwise alignments of each species to the SARS-CoV-2 genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. 
In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity. In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks in one of four colors are drawn. Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated. Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated. Blue blocks are drawn when a polymorphism is outside a coding region. Pale yellow blocks are drawn when there are no aligning bases to that region in the reference genome. Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (+), deselect all of the species (-), or use the default settings (Reset to defaults). For text nucleotide alignments, click on the alignment tracks. To view detailed information about the alignments at a specific position, zoom to a small region or click the 'base' button to see amino acid alignments. Base Level When zoomed in to the base-level display, the track shows the amino acid composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the SARS-CoV-2 sequence at those alignment positions relative to the longest non-SARS-CoV-2 sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+". Codon translation can be turned off in base-level display mode if desired. You can select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page.
Then, select one of the following modes:
No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.
Methods Pairwise alignments with the reference sequence were generated for each sequence using LASTZ version 1.04.03. Parameters used for each LASTZ alignment:
# hsp_threshold = 3000
# gapped_threshold = 3000 = L
# x_drop = 910
# y_drop = 9400 = Y
# gap_open_penalty = 400
# gap_extend_penalty = 30
#        A    C    G    T
#   A   91 -114  -31 -123
#   C -114  100 -125  -31
#   G  -31 -125  100 -114
#   T -123  -31 -114   91
# seed=1110100110010101111 w/2 transitions
# step=1
Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Parameters used in the chaining (axtChain) step: -minScore=1000 -linearGap=loose High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net.
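To make the LASTZ scoring parameters listed above more concrete, the following Python sketch scores an already aligned pair of sequences with the same substitution matrix and gap penalties. It is an illustration only and does not reproduce LASTZ itself (no seeding, x-drop or y-drop logic). The function name `score_alignment` is hypothetical, and the gap cost of open + length * extend is an assumption about how the two penalties combine.

```python
BASES = "ACGT"
# Substitution scores copied from the LASTZ parameter block above.
MATRIX = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORES = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}
GAP_OPEN = 400    # gap_open_penalty
GAP_EXTEND = 30   # gap_extend_penalty


def score_alignment(ref, query):
    """Score two equal-length aligned strings in which '-' marks a gap.

    ASSUMPTION: a gap of length n costs GAP_OPEN + n * GAP_EXTEND.
    Simplification: adjacent gap columns count as one gap, even if the
    gap switches from one sequence to the other.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for r, q in zip(ref.upper(), query.upper()):
        if r == "-" or q == "-":
            if not in_gap:
                total -= GAP_OPEN
                in_gap = True
            total -= GAP_EXTEND
        else:
            total += SCORES[(r, q)]
            in_gap = False
    return total


identical = score_alignment("ACGT", "ACGT")
one_gap = score_alignment("ACGT", "AC-T")
transition = score_alignment("A", "G")
```

Note that transitions (A/G, C/T) are penalized less heavily than transversions in this matrix, which is why a single A-to-G mismatch costs only 31.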
count  sample date  accession    phylogenetic distance  descriptive name
1      2019-12-30   NC_045512.2  0.000000               SARS-CoV-2 (2019)
2      2003-04      NC_004718.3  0.885159               SARS-CoV-1 (Tor2)
3      2012-06-13   NC_019843.3  2.434930               MERS Middle East respiratory syndrome CoV
4      2004-03      NC_006213.1  2.589639               Human CoV OC43 strain ATCC VR-759
5      2004-04      NC_006577.2  2.649716               Human CoV HKU1
6      2000-09      NC_002645.1  2.983896               Human CoV 229E
7      2004-03      NC_005831.2  3.009141               Human Coronavirus NL63
The multiple alignment was constructed from the resulting pairwise alignments progressively aligned using MultiZ/autoMZ. The phylogenetic tree was calculated from 31-mer frequency similarity, applying neighbor joining to the resulting distance matrix with the PHYLIP toolset command neighbor. The reference sequence NC_045512v2 is at the top of the tree: ((((SARS_CoV_2 SARS_CoV_1) MERS) (OC43 HKU1)) (CoV229E NL63)) Framing tables from the genes were constructed to enable visualization of codons in the multiple alignment display. Data Access Downloads for data in this track are available: Multiz alignments (MAF format), and phylogenetic trees Credits This track was created using the following programs: Alignment tools: LASTZ (formerly Blastz) and MultiZ by Minmei Hou, Scott Schwartz, Robert Harris, and Webb Miller of the Penn State Bioinformatics Group Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC References Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al.
Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632; Supplemental Materials and Methods Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 LASTZ (formerly Blastz): Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. 
Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 icshape icSHAPE RNA Struct icSHAPE RNA Structure mRNA and EST Description This track shows normalized icSHAPE reactivity data of in vivo and in vitro SARS-CoV-2 experiments. Methods icSHAPE scores were provided by Qiangfeng Cliff Zhang and converted to bigWig. The first five nucleotides of the genome and the last thirty nucleotides have no data due to technical defects (on the 5' and 3' ends of the virus) or insufficient sequencing depth. The icSHAPE score is distributed between 0 and 1. Data Access You can download the bigWig file underlying this track (icshape/*.bw) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. http://api.genome.ucsc.edu/getData/track?genome=wuhCor1;track=icshapeInvitro;chrom=NC_045512v2;start=100;end=7875 Command-line extraction can be accomplished using an example like the following command: bigWigToWig -udcDir=. -chrom=NC_045512v2 -start=100 -end=7875 https://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/icshape/invivo.bigWig myOutput.wig Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.
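For scripted access, the API request and the bigWigToWig call shown above can be assembled programmatically. The Python sketch below only builds the URL and the argument list; it does not contact the server or run the tool. The helper names `track_url` and `bigwig_to_wig_command` are hypothetical, while the endpoint, parameter names and command-line flags are the ones used in the examples above.

```python
API_BASE = "http://api.genome.ucsc.edu/getData/track"


def track_url(genome, track, chrom, start, end):
    """Build a getData/track request URL, using ';' separators as in the example above."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    return API_BASE + "?" + ";".join(f"{key}={value}" for key, value in params)


def bigwig_to_wig_command(bigwig_url, chrom, start, end, out_path):
    """Return the bigWigToWig argument list for a region (suitable for subprocess.run)."""
    return [
        "bigWigToWig",
        "-udcDir=.",
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        bigwig_url,
        out_path,
    ]


url = track_url("wuhCor1", "icshapeInvitro", "NC_045512v2", 100, 7875)
cmd = bigwig_to_wig_command(
    "https://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/icshape/invivo.bigWig",
    "NC_045512v2", 100, 7875, "myOutput.wig",
)
```

Passing `cmd` to `subprocess.run` would require the bigWigToWig utility to be installed and on the PATH.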
References Lei Sun, Pan Li, Xiaohui Ju, Jian Rao, Wenze Huang, Shaojun Zhang, Tuanlin Xiong, Kui Xu, Xiaolin Zhou, Lili Ren, Qiang Ding, Jianwei Wang, Qiangfeng Cliff Zhang "In vivo structural characterization of the whole SARS-CoV-2 RNA genome identifies host cell target proteins vulnerable to re-purposed drugs", Biorxiv July 8 2020 icshapeInvivo icSHAPE In-vivo icSHAPE In-vivo mRNA and EST icshapeInvitro icSHAPE In-vitro icSHAPE In-vitro mRNA and EST epitopes IEDB Predictions IEDB-Predicted Epitopes from Grifoni et al 2020 Immunology Description This composite track indicates the immune epitope predictions for B cells, CD4 T-cells and CD8 T-cells, using these software packages: B cells = BepiPred 2.0, CD4 = IEDB Tepitool, CD8 = NetMHCpan4.0EL. The color range for the markers is from dark blue (strong prediction) to dark red (weak prediction). Black is used for markers with no calculated prediction. From the publication: Candidate targets for immune responses to 2019-Novel Coronavirus (nCoV): sequence homology- and bioinformatic-based predictions, full reference below. The prediction of epitopes for CD8 T-cells was run for the following HLA alleles, as they have a worldwide population frequency > 6%:
HLA allele    Frequency in worldwide population (%)
HLA-A*01:01   16.2
HLA-A*02:01   25.2
HLA-A*03:01   15.4
HLA-A*11:01   12.9
HLA-A*23:01   6.4
HLA-A*24:02   16.8
HLA-B*07:02   13.3
HLA-B*08:01   11.5
HLA-B*35:01   6.5
HLA-B*40:01   10.3
HLA-B*44:02   9.2
HLA-B*44:03   7.6
Summary We identified potential targets for immune responses to 2019-nCoV and provide essential information for understanding human immune responses to this virus and evaluation of diagnostic and vaccine candidates. Abstract Effective countermeasures against the recent emergence and rapid expansion of the 2019-Novel Coronavirus (2019-nCoV) require the development of data and tools to understand and monitor viral spread and immune responses. However, little information about the targets of immune responses to 2019-nCoV is available.
We used the Immune Epitope Database and Analysis Resource (IEDB) to catalog available data related to other coronaviruses, including SARS-CoV, which has high sequence similarity to 2019-nCoV, and is the best-characterized coronavirus in terms of epitope responses. We identified multiple specific regions in 2019-nCoV that have high homology to SARS virus. Parallel bioinformatic predictions identified a priori potential B and T cell epitopes for 2019-nCoV. The independent identification of the same regions using two approaches reflects the high probability that these regions are targets for immune recognition of 2019-nCoV. Credits Data collected by Arkal Arjun Rao for the Sgourakis Research Group, U.C. Santa Cruz References Grifoni A, Sidney J, Zhang Y, Scheuermann RH, Peters B, Sette A. Candidate targets for immune responses to 2019-Novel Coronavirus (nCoV): sequence homology- and bioinformatic-based predictions, BioRxiv 2020 (doi: https://doi.org/10.1101/2020.02.12.946087) cd8Epitopes CD8+ T-Cell CD8+ T-Cell Epitopes Immunology cd4Epitopes CD4+ T-Cell CD4+ T-Cell Epitopes Immunology bCellEpitopes B Cell B Cell Epitopes Immunology ucscToINSDC INSDC Accession at INSDC - International Nucleotide Sequence Database Collaboration Mapping and Sequencing Description This track associates UCSC Genome Browser chromosome names to accession names from the International Nucleotide Sequence Database Collaboration (INSDC). The data were downloaded from the NCBI assembly database. Credits The data for this track were prepared by Hiram Clawson. microdel Microdeletions Microdeletions in GISAID sequences Variation and Repeats Description This track shows deletions that have been found in the sequences uploaded to the GISAID database as of June 6, 2020.
Three confidence levels of deletion calls are shown: (1) deletions found in at least 1 GISAID sequence; (2) deletions found in at least 2 GISAID sequences; (3) deletions found in at least 2 GISAID sequences that could be validated with raw reads. Methods We accessed all GISAID SARS-CoV-2 sequences on June 6, 2020. We filtered to high-coverage sequences encompassing the entire SARS-CoV-2 genome (>= 29,000 bp), leaving 12,403 sequences. We aligned the sequences using MAFFT. Verification We validated several deletions with the raw reads from NCBI's SRA Run browser. Additionally, NYU Langone Health provided us with the aligned reads for many of their sequences. Data Access The raw data can be explored interactively with the Table Browser, combined with other datasets in the Data Integrator tool, or downloaded directly as "microdel.txt.gz" from the download server. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits We thank all of the labs that submitted their sequences to the GISAID database. The full acknowledgement table can be found at https://github.com/briannachrisman/SARS-CoV-2_Microdeletions/blob/master/acknowledgments.pdf. We thank the public health laboratories VIDRL and MDU-PHL at The Peter Doherty Institute for Infection and Immunity for providing over 1000 high quality raw reads to NCBI. We also thank Matthew T Maurano, Matija Snuderl, and Adriana Heguy of the NYU Langone SARS-CoV2 Sequencing Team for providing many of their raw reads. References Chrisman, Brianna Sierra, Kelley Paskov, Nate Stockham, Kevin Tabatabaei, Jae-Yoon Jung, Peter Washington, Maya Varma, Min Woo Sun, Sepideh Maleki, and Dennis P. Wall. "Indels in SARS-CoV-2 occur at template-switching hotspots." BioData Mining 14, no. 1 (2021): 1-16.
https://doi.org/10.1186/s13040-021-00251-0 nextstrainFreq Nextstrain Frequency Nextstrain Mutations Alternate Allele Frequency Variation and Repeats Description Nextstrain.org displays data about mutations that have occurred in the current 2019/2020 outbreak of SARS-CoV-2. Nextstrain has a powerful user interface for viewing the time-stamped phylogenetic tree that it infers from the patterns of mutations in sequences worldwide. Nextstrain maintains an ongoing pipeline that continuously obtains SARS-CoV-2 genome sequences and metadata from GISAID, aligns them against the reference genome (NC_045512.2), and infers a phylogenetic tree. This track shows the alternate allele frequency of each mutation reported by Nextstrain as a bar graph with the height indicating the frequency. (The Nextstrain Mutations track offers a more detailed display of the mutations, breaking up the vertical bars according to the order of virus genome samples in the phylogenetic tree.) Methods Nextstrain downloads SARS-CoV-2 genomes from GISAID as they are submitted by labs worldwide. The sequences are processed by an automated pipeline and annotations are written to a data file that UCSC downloads and extracts annotations for display. Data Access You can download the bigWig file underlying this track (nextstrainSamples*.bigWig) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. http://api.genome.ucsc.edu/getData/track?genome=wuhCor1;track=nextstrainFreqB4;chrom=NC_045512v2;start=100;end=7875 Command-line extraction can be accomplished using an example like the following command: bigWigToWig -udcDir=. -chrom=NC_045512v2 -start=100 -end=7875 http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/nextstrain/nextstrainSamples.bigWig myOutput.wig Please refer to our mailing list archives for questions, or our Data Access FAQ for more information.
Data usage policy The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors (available via the Nextstrain metadata.tsv file) if you intend to carry out further research using their data. Derived data, such as phylogenies, can be downloaded from nextstrain.org (see "DOWNLOAD DATA" link at bottom of page) - please contact the relevant authors where appropriate. Credits Thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID EpiCoV TM, and to researchers worldwide for sharing their SARS-CoV-2 genome sequences. References Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931 Sagulenko P, Puller V, Neher RA. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol. 2018 Jan;4(1):vex042. 
PMID: 29340210; PMC: PMC5758920 nextstrainFreqViewNewClades Year-Letter Clades Nextstrain Mutations Alternate Allele Frequency Variation and Repeats nextstrainFreq21J_Delta 21J/Delta Nextstrain, 21J/Delta clade: Alternate allele frequency Variation and Repeats nextstrainFreq21I_Delta 21I/Delta Nextstrain, 21I/Delta clade: Alternate allele frequency Variation and Repeats nextstrainFreq21H_Mu 21H/Mu Nextstrain, 21H/Mu clade: Alternate allele frequency Variation and Repeats nextstrainFreq21G_Lambda 21G/Lambda Nextstrain, 21G/Lambda clade: Alternate allele frequency Variation and Repeats nextstrainFreq21F_Iota 21F/Iota Nextstrain, 21F/Iota clade: Alternate allele frequency Variation and Repeats nextstrainFreq21E_Theta 21E/Theta Nextstrain, 21E/Theta clade: Alternate allele frequency Variation and Repeats nextstrainFreq21D_Eta 21D/Eta Nextstrain, 21D/Eta clade: Alternate allele frequency Variation and Repeats nextstrainFreq21B_Kappa 21B/Kappa Nextstrain, 21B/Kappa clade: Alternate allele frequency Variation and Repeats nextstrainFreq21A_Delta 21A/Delta Nextstrain, 21A/Delta clade: Alternate allele frequency Variation and Repeats nextstrainFreq20G 20G Nextstrain, 20G clade: Alternate allele frequency Variation and Repeats nextstrainFreq20F 20F Nextstrain, 20F clade: Alternate allele frequency Variation and Repeats nextstrainFreq20D 20D Nextstrain, 20D clade: Alternate allele frequency Variation and Repeats nextstrainFreq20C 20C Nextstrain, 20C clade: Alternate allele frequency Variation and Repeats nextstrainFreq20B 20B Nextstrain, 20B clade: Alternate allele frequency Variation and Repeats nextstrainFreq20A 20A Nextstrain, 20A clade: Alternate allele frequency Variation and Repeats nextstrainFreq19B 19B Nextstrain, 19B clade: Alternate allele frequency Variation and Repeats nextstrainFreq19A 19A Nextstrain, 19A clade: Alternate allele frequency Variation and Repeats nextstrainFreqViewAll All Samples Nextstrain Mutations Alternate Allele Frequency Variation and Repeats 
nextstrainFreqAll All Nextstrain, all samples: Alternate allele frequency Variation and Repeats nextstrainGene Nextstrain Genes Genes annotated by nextstrain.org/ncov Genes and Gene Predictions Description Nextstrain.org displays data about mutations that occur in the current 2019/2020 outbreak. Nextstrain has a powerful user interface for viewing the evolutionary tree that it infers from the patterns of mutations in sequences worldwide, but does not offer a detailed plot of mutations along the genome, so we have processed their genome annotations into genome browser tracks. Use this track in conjunction with the track Nextstrain Mutations (protein-coding mutations use gene names from this track). Methods Nextstrain downloads SARS-CoV-2 genomes from GISAID as they are submitted by labs worldwide. The sequences are processed by an automated pipeline and annotations are written to a data file that UCSC downloads and extracts annotations for display. Data Access You can download the bigBed file underlying this track (nextstrainGene) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. Credits Thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID, and to researchers worldwide for sharing their SARS-CoV-2 genome sequences. References Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931 nextstrainParsimony Nextstrain Parsimony Parsimony Scores for Nextstrain Mutations and Phylogenetic Tree Variation and Repeats Description Nextstrain.org displays data about single nucleotide mutation alleles in the SARS-CoV-2 RNA and protein sequences that have occurred in different samples of the virus during the current 2019/2020 outbreak. 
Nextstrain has a powerful user interface for viewing the time-stamped phylogenetic tree that it infers from the patterns of mutations in sequences worldwide. Nextstrain maintains an ongoing pipeline that continuously obtains SARS-CoV-2 genome sequences and metadata from GISAID, aligns them against the reference genome (NC_045512.2), and infers a phylogenetic tree. A parsimony score can be computed for each mutation as the minimum number of nucleotide changes along branches of the tree that would lead to the observed sample genotypes at the leaves of the tree. For example, if there is a branch for which all leaves have a mutation, and no other leaves of the tree have the mutation, then the mutation presumably occurred once on that branch and the parsimony score would be one. However, when a mutation appears on leaves belonging to several branches whose other leaves do not have the mutation, then the mutation would need to occur on multiple branches in the tree, increasing the parsimony score. Mutations with a parsimony score that is relatively high, especially when compared to alternate allele count (the number of samples/leaves with the mutation), may be of interest when identifying systematic errors and/or sites of recurrent mutations. This track shows the parsimony score of each single-nucleotide substitution reported by Nextstrain as a bar graph with the height indicating the score. (The Nextstrain Mutations track displays the phylogenetic tree and sample genotypes from which the parsimony scores were generated.) Methods Nextstrain downloads SARS-CoV-2 genomes from GISAID EpiCoV™ as they are submitted by labs worldwide. The sequences are processed by an automated pipeline and annotations are written to a data file that UCSC downloads and extracts annotations for display. UCSC computes parsimony scores using the phylogenetic tree and mutations extracted from Nextstrain.
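The parsimony score described above can be illustrated with Fitch's small-parsimony algorithm on a toy tree. This is a minimal sketch and not UCSC's implementation: it assumes a strictly binary tree written as nested tuples, and the sample names, the alleles and the function name `parsimony_score` are invented for the example.

```python
def parsimony_score(tree, alleles):
    """Minimum number of allele changes on a binary tree (Fitch's algorithm).

    tree    -- a leaf name (str) or a 2-tuple of subtrees
    alleles -- dict mapping each leaf name to its observed allele
    """
    def visit(node):
        if isinstance(node, str):
            return {alleles[node]}, 0          # leaf: observed allele, no changes
        left, right = node
        left_states, left_changes = visit(left)
        right_states, right_changes = visit(right)
        shared = left_states & right_states
        if shared:
            # The children can agree on an allele: no extra change needed.
            return shared, left_changes + right_changes
        # The children disagree: one change is required on a child branch.
        return left_states | right_states, left_changes + right_changes + 1

    return visit(tree)[1]


# Toy tree with eight hypothetical samples.
tree = ((("s1", "s2"), ("s3", "s4")), (("s5", "s6"), ("s7", "s8")))

# Mutation confined to one clade (s1, s2): a single change explains it.
clade = {name: "C" for name in ("s3", "s4", "s5", "s6", "s7", "s8")}
clade.update({"s1": "T", "s2": "T"})
clade_score = parsimony_score(tree, clade)

# Mutation scattered over three separate branches (s1, s3, s5).
scattered = {name: "C" for name in ("s2", "s4", "s6", "s7", "s8")}
scattered.update({"s1": "T", "s3": "T", "s5": "T"})
scattered_score = parsimony_score(tree, scattered)
```

In the scattered case the score equals the alternate allele count (three changes for three samples), which is the pattern the description flags as a possible sign of recurrent mutation or systematic error.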
Data Access You can download the bigWig file underlying this track (nextstrainParsimony.bw) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. Nextstrain.org offers phylogenetic trees and metadata files: scroll to the bottom of the page and click "DOWNLOAD DATA", and a dialog with download options appears. Credits This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID. References Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931 phyloGenes PhyloCSF Genes PhyloCSF Genes - Curated conserved genes Genes and Gene Predictions Description These tracks show curated SARS-CoV-2 protein-coding genes conserved within the Sarbecovirus subgenus as determined using PhyloCSF [1], FRESCo [2], and other comparative genomics methods, consistent with experimental evidence in SARS-CoV-2. Ambiguous gene names were resolved according to the recommendations in [3]. For a complete description of the evidence, see [4]. The PhyloCSF Genes track shows the conserved protein-coding genes, namely ORF1a, ORF1ab, S, ORF3a, ORF3c, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF9b. Notes: ORF3c is a 41 codon ORF overlapping ORF3a in a different frame with coordinates 25457-25582; it has also been referred to as ORF3h, ORF3a*, and 3a.iORF1. ORF9b is a 97 codon ORF overlapping N in a different frame with coordinates 28284-28577; it has also been referred to as ORF9a.
The PhyloCSF Rejected Genes track shows other gene candidates that have been proposed that do not show the signature of conserved protein-coding genes or persuasive experimental evidence of function [4], and are thus unlikely to be actual protein-coding genes, namely ORF2b, ORF3d, ORF3d-2, ORF3b, ORF9c, and ORF10. Notes: ORF2b is a 39 codon ORF with coordinates 21744-21860 overlapping the spike protein in a different frame; it has also been referred to as S.iORF1. ORF3d is a 57 codon ORF with coordinates 25524-25697 overlapping ORF3a in a different frame; it has also been referred to as ORF3b. ORF3d-2 is a 33 codon ORF with coordinates 25596-25697 that is a subset of ORF3d starting at a downstream in-frame AUG codon; it has also been referred to as 3a.iORF2. ORF3b is the 22 codon ortholog of the 5' end of SARS-CoV ORF3b with coordinates 25814-25882, ending at an in-frame stop codon that is not present in SARS-CoV. ORF9c is a 73 codon ORF overlapping N in a different frame with coordinates 28734-28955; it has also been referred to as ORF9b and ORF14. Data Access The raw data can be explored interactively with the Table Browser or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/phyloGenes/PhyloCSFgenes.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Methods See [4]. Note that the data was updated in June 2021: ORF14 was renamed to ORF9c, ORF2b and ORF3d-2 were added. 
Credits Questions should be directed to Irwin Jungreis. If you use the SARS-CoV-2 PhyloCSF Genes Track Hub, please cite Jungreis et al. 2021 [4]. References [1] Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011). [2] Sealfon RS, Lin MF, Jungreis I, Wolf MY, Kellis M, Sabeti PC (2015). FRESCo: finding regions of excess synonymous constraint in diverse viruses. Genome Biol. doi: 10.1186/s13059-015-0603-7. [3] Jungreis, I., Nelson, C. W., Ardern, Z., Finkel, Y., Krogan, N. J., Sato, K., ... & Kellis, M. (2021). Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: A homology-based resolution. Virology 558, 145-151. doi.org/10.1016/j.virol.2021.02.013 [4] Jungreis I, Sealfon R, Kellis M (2021). SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications 12(1), 1-20. doi:10.1038/s41467-021-22905-7 PhyloCSFrejected PhyloCSF Rejected Genes Genes rejected from PhyloCSF genes list Genes and Gene Predictions PhyloCSFgenes PhyloCSF Genes PhyloCSF Genes - curated conserved genes Genes and Gene Predictions poranHla1 Poran HLA I RECON HLA-I epitopes Immunology Description This track shows putative epitopes for CD4+ and CD8+ T cells whose HLA binding properties cover over 99% for US, European, and Asian populations, for both HLA-I and HLA-II. The track includes 11,776 CD8 epitopes restricted to HLA-I as predicted by RECON. All the epitopes are scored using a combined coverage score reported for USA, EUR, and API. Specifically, score = (USA_coverage+EUR_coverage+API_coverage)*1000/3. For more details, see here. Display Conventions and Configuration Genomic locations of epitopes are labeled with a unique ID. Mousing over an item shows the protein name and restrictions to HLA-I. 
A click on an item shows a standard feature detail page with both the ID and the mouse-over information. Methods For a full description of the methods used, refer here. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/poranHla/CD8-hla1.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Asaf Poran, Dewi Harjanto, Matthew Malloy, Michael S. Rooney, Lakshmi Srinivasan, Richard B. Gaynor. Sequence-based prediction of vaccine targets for inducing T cell responses to SARS-CoV-2 utilizing the bioinformatics predictor RECON. bioRxiv 2020.04.06.027805 poranHla2 Poran HLA II RECON HLA-II epitopes Immunology Description This track shows putative epitopes for CD4+ and CD8+ T cells whose HLA binding properties cover over 99% for US, European, and Asian populations, for both HLA-I and HLA-II. The track includes 11,776 CD8 epitopes restricted to HLA-I as predicted by RECON. All the epitopes are scored using a combined coverage score reported for USA, EUR, and API. Specifically, score = (USA_coverage+EUR_coverage+API_coverage)*1000/3. For more details, see here. Display Conventions and Configuration Genomic locations of epitopes are labeled with a unique ID. Mousing over an item shows the protein name and restrictions to HLA-I. 
A click on an item shows a standard feature detail page with both the ID and the mouse-over information. Methods For a full description of the methods used, refer here. Data Access The raw data can be explored interactively with the Table Browser, or combined with other datasets in the Data Integrator tool. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. Annotations can be converted from binary to ASCII text by our command-line tool bigBedToBed. Instructions for downloading this command can be found on our utilities page. The tool can also be used to obtain features within a given range without downloading the file, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/poranHla/CD8-hla1.bb -chrom=NC_045512v2 -start=0 -end=29902 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. References Asaf Poran, Dewi Harjanto, Matthew Malloy, Michael S. Rooney, Lakshmi Srinivasan, Richard B. Gaynor. Sequence-based prediction of vaccine targets for inducing T cell responses to SARS-CoV-2 utilizing the bioinformatics predictor RECON. bioRxiv 2020.04.06.027805 potPathoIndel Pot. pathogenic indels Potential pathogenic insertions and deletions from Gussow et al, PNAS 2020 Variation and Repeats Description This track shows genomic features that differentiate SARS-CoV-2 and the viruses behind the two previous deadly coronavirus outbreaks, SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV), from less pathogenic coronaviruses. 
These features include: Enhancement of the nuclear localization signals in the nucleocapsid protein Inserts in the spike glycoprotein that appear to be associated with the high case fatality rate of these coronaviruses Inserts in the spike glycoprotein that appear to be associated with the host switch from animals to humans Methods We searched for specific regions, within an alignment of 944 human coronaviruses, that contributed the most to the separation between coronaviruses with a high case fatality rate and those with a low case fatality rate, using a combination of comparative genomics and machine learning techniques. For the zoonotic insertions, we scanned an alignment of human and nonhuman coronaviruses to find regions in which over 50% of the strains in the alignment differed from the human strain, and for which the differing strains were explicitly the most distant from human. We identified only one such location, across all three high case fatality rate virus groups. References Ayal B. Gussow*, Noam Auslander*, Guilhem Faure, Yuri I. Wolf, Feng Zhang, Eugene V. Koonin. Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses. Proceedings of the National Academy of Sciences Jun 2020, 117 (26) 15193-15199; DOI: 10.1073/pnas.2008176117. resistPred Predicted Resistance Mutations that may confer drug resistance - from Coronavirus3d.org (label: count of GISAID sequences with mutation, Feb 2022) Variation and Repeats Description This track lists amino acid positions that are close to certain Covid drug target protein binding sites; potentially relevant for viral drug resistance. The protein products are the S, NSP3, and NSP5 proteins. The data is derived from 3D Protein Data Bank (PDB) X-ray crystal structures and was imported from Coronavirus3d.org, a project funded by NIAID, NIGMS and UCR. For more information, please visit that site and look through the 9 protein-drug structures listed. 
Display Conventions The display shows amino acids within 5 Angstroms of an inhibitor binding site on the S, NSP3, and NSP5 proteins. The label of the annotations is the number of GISAID sequences with at least one mutation at this position as of March 3, 2022. Click or mouse-over the mutations to show the full list of nucleotide changes and the frequency of each. Data Access You can see the original data as tables on Coronavirus3d.org. The data can be explored interactively at the Genome Browser with the Table Browser or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server or the API. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Anna Niewiadomska from the BV-BRC team at the J. Craig Venter Institute for finding the dataset and testing the track. Thanks to the Godzik Lab for making available the data in .tsv format on the coronavirus3d.org website. problematicSites Problematic Sites Problematic sites where masking or caution are recommended for analysis Mapping and Sequencing Description Attempts to infer phylogenetic relationships, sites under selection, or evidence of recombination from SARS-CoV-2 genome sequences can be led astray by sequencing errors, contamination, and hypermutable sites. In order to make reliable inferences, it is important to identify probable errors and susceptible sites within the genome sequences, carefully consider how those might affect the specific analysis one is about to perform, and perhaps exclude problematic sites from analysis. This track shows locations in the SARS-CoV-2 genome that have been identified as problematic for analysis for various reasons. They have been collected in the github repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. 
Locations have been separated into two subtracks and colored corresponding to levels of severity: Mask: Problems are expected to affect most types of analysis, so it is recommended to mask out these sites before analysis. Caution: Some types of analysis may be affected while other types may not; caution is recommended. Locations are labeled with the following terms to indicate the type of potential problem: ambiguous: Sites which show an excess of ambiguous basecalls relative to the number of alternative alleles, often emerging from a single country or sequencing laboratory amended: Previous sequencing errors which now appear to have been fixed in the latest versions of the GISAID sequences, at least in sequences from some of the sequencing laboratories highly_ambiguous: Sites with a very high proportion of ambiguous characters, relative to the number of alternative alleles highly_homoplasic: Positions which are extremely homoplasic - it is sometimes not necessarily clear if these are hypermutable sites or sequencing artefacts homoplasic: Homoplasic sites, with many mutation events needed to explain a relatively small alternative allele count interspecific_contamination: Cases (only one instance as of July 2020) in which the known sequencing issue is due to contamination from genetic material that does not have SARS-CoV-2 origin nanopore_adapter: Cases in which the known sequencing issue is due to the adapter sequences in nanopore reads narrow_src: Mutations which are found in sequences from only a few sequencing labs (usually two or three), possibly as a consequence of the same artefact reproduced independently neighbour_linked: Proximal mutations displaying near perfect linkage seq_end: Alignment ends are affected by low coverage and high error rates (masking recommended, but might be more stringent than necessary) single_src: Only observed in samples from a single laboratory Methods Multiple groups applied various methods (De Maio, Walker et al.; De Maio, 
Gozashti et al.; Turakhia et al.) to identify sites that were homoplasic, likely contaminated, likely sequencing error and/or observed in multiple virus lineages by only one or a few laboratories. They contributed their observations and recommendations to the github repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. UCSC downloaded the collection, split the sites into Mask and Caution subsets depending on the recommended action and reformatted the data for display in the Genome Browser. Data Access The original data file was downloaded from github: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf. You can download the bigBed files underlying this track (problematicSites*.bb) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. References De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Issues with SARS-CoV-2 sequencing data. virological.org. 2020 May 5. De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14. Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, and Corbett-Detig R. Stability of SARS-CoV-2 Phylogenies. bioRxiv. 2020 June 9. problematicSitesCaution Caution Problematic sites where caution is recommended for analysis Mapping and Sequencing problematicSitesMask Mask Problematic sites where masking is recommended for analysis Mapping and Sequencing rnaStructRangan Rangan RNA Rangan et al. RNA predictions mRNA and EST Description This track shows the locations of RNA structure predictions reported in Rangan et al. bioRxiv 2020 as well as recognizable matches to Rfam annotations found at https://rfam.xfam.org/search?q=coronavirus. 
Display Conventions and Configuration At zoomed out positions, RNA structure is indicated by dark (stem) and light (loops and single strands) green color bars. Zooming in to base level displays the prediction in "extended dot-bracket" notation. Credits This track was created by UCSC undergraduate students Justin Sim and Alinne Gonzalez Armenta with the indispensable assistance of Brian Raney. References Rangan Ramya, Zheludev Ivan N., Das Rhiju, 2020. RNA genome conservation and secondary structure in SARS-CoV-2 and SARS-related viruses. bioRxiv recomb Recomb. Breakpoints Recombination Breakpoints from Turakhia et al 2022 Variation and Repeats Description This track shows recombination breakpoints inferred by the RIPPLES software from a phylogenetic tree of 1.6 million SARS-CoV-2 sequences, described by Turakhia et al, Nature 2022. The track is in "density" mode by default and shows the density of recombinant sequences per nucleotide. To show every recombination event individually, uncheck the "Density plot" checkbox on the configuration page. Methods From Turakhia et al, Nature 2022: "We developed a new method for detecting recombination in pandemic-scale phylogenies, Recombination Inference using Phylogenetic PLacEmentS, RIPPLES. Because recombination violates the central assumption of many phylogenetic methods, that is, that a single evolutionary history is shared across the genome, recombinant lineages arising from diverse genomes will often be found on 'long branches', which result from accommodating the divergent evolutionary histories of the two parental haplotypes. Note that as long as recombination is relatively uncommon, phylogenetic inference is expected to remain accurate even when branch lengths are artifactually expanded. RIPPLES exploits that signal by first identifying long branches on a comprehensive SARS-CoV-2 mutation-annotated tree. 
RIPPLES then exhaustively breaks the potential recombinant sequence into distinct segments and replaces each onto a global phylogeny using maximum parsimony. RIPPLES reports the two parental nodes-hereafter termed donor and acceptor-that result in the highest parsimony score improvement relative to the original placement on the global phylogeny. Our approach therefore leverages phylogenetic signals for each parental lineage and the spatial correlation of markers along the genome. We establish significance using a null model conditioned on the inferred site-specific rates of de novo mutation." Data Access You can download the bigBed file underlying this track (primers) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can also be accessed from scripts through our API. Credits Thanks to Bryan Thornlow for sharing the data. References Turakhia Y, Thornlow B, Hinrichs A, McBroome J, Ayala N, Ye C, Smith K, De Maio N, Haussler D, Lanfear R et al. Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape. Nature. 2022 Sep;609(7929):994-997. PMID: 35952714; PMC: PMC9519458 ucscToRefSeq RefSeq Acc RefSeq Accession Mapping and Sequencing Description This track associates UCSC Genome Browser chromosome names to accession identifiers from the NCBI Reference Sequence Database (RefSeq). The data were downloaded from the NCBI assembly database. Credits The data for this track was prepared by Hiram Clawson. rfam Rfam Rfam families mRNA and EST Rfam families The SARS-CoV-2 genome was annotated using the structured RNA families from the Rfam database including the following Coronavirus-specific families: Sarbecovirus-5UTR Sarbecovirus-3UTR Coronavirus frameshift stimulating element 3' UTR pseudoknot s2m RNA Methods The annotations were generated using the Infernal cmsearch program and the Rfam covariance models (release 14.2). 
The cmsearch output was manually edited to remove the lower-scoring Betacoronavirus-5UTR and Betacoronavirus-3UTR families that belong to the same Rfam clans as the Sarbecovirus-5UTR and Sarbecovirus-3UTR families (CL00116 and CL00117). The alignments and the secondary structure were produced using LocARNA and refined based on the latest literature. Credits The curated Sarbecovirus alignments were provided by Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena). Eric Nawrocki (NCBI) revised the existing Rfam entries (RF00164, RF00165, and RF00507). We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments. The track was prepared by the Rfam team. This work is part of the BBSRC funded project to expand the coverage of viral RNAs in Rfam. References Find more information about Coronavirus families at https://rfam.org/covid-19 Citing Rfam primers RT-PCR Primers RT-PCR Detection Kit Primer Sets Mapping and Sequencing Description This track shows the locations of those primers in detection kits that match the reference sequence. The primers were copied from a spreadsheet created by the project OpenCovid19. The initial version of the track used the FASTA file from Design Flaws in COVID-19 Primers from Multiple International Labs. Most are RT-qPCR primer sets; sequencing primers have the prefix Seq1- or Seq2-. Each RT-qPCR set consists of one forward primer, one reverse primer and one internal probe, as indicated by the names. As expected, the three control primers were not found at all: US-CDC-Control_RP-P, US-CDC-Control_RP-R, US-CDC-Control_RP-F. Here is a quick overview of the origin of the primers; please see the website and spreadsheet linked above for more details: Prefix Institution Overview Manual CN-CDC- China CDC NIID- Nat. Inst. of Infect. Dis., Japan WH- National Inst. of Health (Thailand) HKU- The University of Hong Kong Detects N gene and Orf1b. 
Not specific for SARS-CoV-2, but other Sarbecovirus species are not in circulation WHO Peiris Protocol US-CDC- CDC, USA Three reactions, target: N gene. One primer/probe set detects all betacoronaviruses, two sets are specific for SARS-CoV-2. All 3 assays must be positive Instructions EU-Drosten Drosten Lab, Charite Berlin, Germany Set 1: run E and RdRp primers (designed for SARS-CoV, SARS-CoV-2, and bat-associated betacoronaviruses); if Set 1 is positive, use the SARS-CoV-2-specific detection primer Instructions The sequences and the identifiers for these primers were obtained from the following sources, among others: CDC Laboratory Guidance Comparative analysis of primer-probe sets for the laboratory confirmation of SARS-CoV-2 Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR More details can be found in the spreadsheet linked above. Methods The primers were mapped with the following command: blat ../../wuhCor1.2bit primers.fa stdout -stepSize=3 -tileSize=6 -minScore=10 -oneOff=1 -noHead -fine | pslReps stdin stdout /dev/null -minNearTopSize=10 -minCover=0.8 -nohead > primers.psl Data Access You can download the PSL file underlying this track (primers) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can also be accessed from scripts through our API. Credits This data annotation track was made by Maximilian Haeussler, with assistance from Daniel Schmelter. Fasta data collected by Tomer Altman (Biome Bioinformatics), prepared by Jason Fernandes (UCSC), updated by Darach Miller (Stanford) and the OpenCovid19 project. simpleRepeat Simple Repeats Simple Tandem Repeats by TRF Variation and Repeats Description This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialized for this purpose. These repeats can occur within coding regions of genes and may be quite polymorphic. 
Repeat expansions are sometimes associated with specific diseases. Methods For more information about the TRF program, see Benson (1999). Credits TRF was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 spikeMuts Spike Mutations Spike protein mutations from community annotation (Feb 2021) Variation and Repeats Description The SARS-CoV-2 spike protein, which binds the virus to host cells, is a key target of vaccine development to combat COVID-19, and mutations in this protein have the potential to change infectivity and response to disease treatments as well as vaccine efficacy. As of February 2021 more than a dozen mutations in this protein have been detected via sequencing worldwide, and specific mutations have been associated with viruses that are significantly more transmissible. This track presents amino acid mutations identified in the SARS-CoV-2 spike protein, based on the community annotation at CoVariants.org, supplemented by the Variants of SARS-CoV-2 Wikipedia page. Mutations in this track (with nicknames): H69-, D80Y, S98F, A222V, N439K (Nick), L452R, Y453F, S477N, E484 (Eeek), N501 (Nelly), D614G (Doug), A626S, P681H (Pooh), A701B, V1122 Information provided for each mutation, if available, includes: date first sequenced; notes of clinical significance; notes regarding geographic incidence; link to charts and tables presenting geographic distribution; link to Nextstrain build; links to relevant publications. Display Conventions The track items are colored as follows: Red: identified as strong antibody escape by Bloom Lab RBD-mutation screen. Purple: ACE2 receptor binding region (RBD). Blue: other. The mutation name, e.g. N501 as well as alternative identifiers (e.g. 
N501Y, 501Y) or nickname if present, can be typed into the browser position box to navigate the browser to the mutation position and highlight the mutation in the browser window. Release Notes This track was updated to include the L452R mutation in the B.1.429 variant (first identified in California), as displayed in the Variants of Concern track. The color scheme was also changed in this track update. Data Access The raw data can be explored interactively with the Table Browser, or Data Integrator. For automated analysis, the genome annotation can be downloaded from the downloads server. Data files for earlier versions of this track can be downloaded from our archive download server. Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits Thanks to Emma B. Hodcroft, Institute of Social and Preventive Medicine, University of Bern, Switzerland for leading and maintaining the community annotation resource on which this track is largely based. References CoVariants: SARS-CoV-2 Mutations and Variants of Interest Emma B. Hodcroft, Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland Variants of SARS-CoV-2, Wikipedia targets T-React. Epitopes T-cell reactive epitopes in patients and donors Immunology Description This track shows data from Braun et al, 2020, "Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors". The authors stimulated PBMCs with two sets (M1 and M2) of overlapping SARS-CoV peptide pools. Importantly, these are SARS-CoV peptides that have experimental support as putative MHC-II epitopes for SARS-CoV, but note that not all of them match the SARS-CoV-2 sequence exactly. Using these peptide sets, the authors "demonstrate the presence of S-reactive CD4+ T cells in 83% of COVID-19 patients, as well as in 34% of SARS-CoV-2 seronegative healthy donors (HD), albeit at lower frequencies." 
Display Conventions and Configuration Two tracks are available: M1_peptides: The more N-terminal peptide library M2 peptides: The more C-terminal peptide library The annotated interval represents the alignment of the peptide to the viral genome. The sequence displayed in the name is the SARS-CoV peptide sequence (the actual sequence that was used in the paper, not necessarily identical to the SARS-CoV-2 peptide). Methods Table S1 contains the peptide sequences (SARS-CoV sequence) used. This table was downloaded and tblastn was used to align the identified SARS-CoV peptides to the SARS-CoV-2 genome. A small number of peptides were not found or reported erroneous hits, in which the coordinates were identified by manually blat-ing the SARS-CoV-2 sequence from the alignments reported in Fig S1. References Braun et al, 2020. "Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors" M2_library M2 SARS-CoV peptides T-Cell reactive epitopes: M2 SARS-CoV peptides mapped to SARS-Cov-2 Immunology M1_library M1 SARS-CoV peptides T-Cell reactive epitopes: M1 SARS-CoV peptides mapped to SARS-Cov-2 Immunology vaccines Vaccines COVID Vaccines BioNTech/Pfizer BNT-162b2 and Moderna mRNA-1273 Immunology Description This track shows the alignment of three different mRNA vaccine sequences to the SARS-CoV-2 genome: The BioNTech/Pfizer BNT-162b2 sequence as published by the World Health Organization The reconstructed BioNTech/Pfizer BNT-162b2 RNA as sequenced by the Andrew Fire lab, Stanford University School of Medicine The Moderna mRNA-1273 sequence as sequenced by the Andrew Fire lab, Stanford University School of Medicine Note that the actual vaccines are synthesized with N1-methyl-pseudouridine (Ψ) in place of uridine. See paper by Hubert in References for a discussion. Display Conventions and Configuration The psl output from blat was converted to a bigPsl format file for display in this track. 
Depending on how much of the genome is in view, the track draws black where nucleotides are identical between the vaccine sequence and the SARS-CoV-2 sequence. Red lines indicate nucleotide differences. When a smaller section of the genome is in view, setting the "Color track by codons or bases:" option to "different mRNA bases" shows the nucleotides in the vaccine that differ from the SARS-CoV-2 sequence. Methods The mRNA sequences were obtained from the Microsoft Word documents listed in the references below. The Andrew Fire lab GitHub repository supplied the FASTA sequencing results for the BioNTech/Pfizer BNT-162b2 and Moderna mRNA-1273 samples. The PSL alignment file was obtained via the UCSC Genome Browser blat service with parameters -t=dnax -q=rnax and filtered to keep only scores above 1000, which removes the polyA match:

    gfClient -maxIntron=10 -t=dnax -q=rnax <host> <port> \
        /gbdb/wuhCor1 threeVaccines.fa stdout \
        | pslFilter -minScore=1000 stdin wuhCor1.vaccines.psl

    pslScore wuhCor1.vaccines.psl
    #tName tStart tEnd qName:qStart-qEnd score percentIdent
    NC_045512v2 21559 25384 ModernaMrna1273:54-3879 1419 68.60
    NC_045512v2 21559 25384 ReconstructedBNT162b2:51-3876 1701 72.30
    NC_045512v2 21559 25384 WHO_BNT162b2:51-3876 1701 72.30

    faCount threeVaccines.fa | tawk '{print $1,"1.."$2+1}' \
        | head -4 | tail -3 > threeVaccines.cds

    pslToBigPsl -cds=threeVaccines.cds -fa=threeVaccines.fa wuhCor1.vaccines.psl stdout \
        | sort -k1,1 -k2,2n > wuhCor1.vaccines.bigPsl

    bedToBigBed -type=bed12+13 -tab -as=$HOME/kent/src/hg/lib/bigPsl.as \
        wuhCor1.vaccines.bigPsl wuhCor1.chrom.sizes wuhCor1.vaccines.bb

Data Access The FASTA sequences and PSL alignment file can be obtained from our download server at: https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/vaccines/. The bigPsl alignment file used for the display of this track in the genome browser can be accessed from https://hgdownload.soe.ucsc.edu/gbdb/wuhCor1/bbi/wuhCor1.vaccines.bb. 
The kent command line access tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The protein encoded by the three sequences has two AA substitutions compared to the SARS-CoV-2 S glycoprotein. Variations: S:K986P and S:V987P in the vaccine sequence. See also: The tiny tweak behind COVID-19 vaccines. >BNT162b2 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFD NPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVY SSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRV QPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSF VIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL PFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGS NVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGF NFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAG TITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN TLVKQLSSNFGAISSVLNDILSRLDPPEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRV DFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNT FVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYTZZ References Dae Eun Jeong, Matthew McCoy, Karen Artiles, Orkan Ilbay, Andrew Fire, Kari Nadeau, Helen Park, Brooke Betts, Scott Boyd, Ramona Hoh, and Massa Shoura Assemblies of putative SARS-CoV2-spike-encoding mRNA sequences for vaccines BNT-162b2 and mRNA-1273 
obtained from github Bert Hubert Reverse Engineering the source code of the BioNTech/Pfizer SARS-CoV-2 Vaccine 25 Dec 2020 WikiPedia Pfizer-BioNTech COVID-19 vaccine World Health Organization MedNet Messenger RNA encoding the full-length SARS-CoV-2 spike glycoprotein Sept. 2020 document 11889 Cyril Le Nouën, Peter L. Collins, and Ursula J. Buchholz Attenuation of Human Respiratory Viruses by Synonymous Genome Recoding Frontiers in Immunology 2019; 10: 1250. PMID: 31231383 Ryan Cross The tiny tweak behind COVID-19 vaccines, Chemical & Engineering News 29 September 2020 Vol 98, issue 38 Credits Thank you to the Andrew Fire lab, Stanford University School of Medicine for providing the sequencing data of these vaccines. The presentation of this track was prepared by Hiram Clawson (hclawson@ucsc.edu). iedb Validated epitopes from IEDB Validated epitopes from IEDB Immunology Description This track shows epitope sequences displayed by various class I MHC alleles as annotated by National Institute for Allergy and Infectious Diseases (NIAID) Immune Epitope Database (IEDB). Only the epitopes with positive assays are displayed on this track. These epitopes were validated using many different methods. Click through to the IEDB page for each individual epitope to see how that epitope was validated. EpitopeID is a clickable link to each epitope on the IEDB site where details about assays, literature and HLA restriction are provided. References Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62. PMID: 19906713; PMC: PMC2808938 variantMuts Variants of Concern Mutations in Variants of Concern (VOC), Interest (VOI), or Under Monitoring (VUM) (configure to show more lineages) Variation and Repeats Description This track displays amino acid and nucleotide mutations in SARS-CoV-2 variants as defined in December 2021 by the World Health Organization (WHO). 
Note that the Centers for Disease Control (CDC) classification of SARS-CoV-2 variants differs slightly from that of the WHO. Mutations in this track were identified from viral sequences from GISAID. Variant incidence and geographic distribution information is available from links to the Outbreak.info web resource on the mutation details pages. Variants of Concern (VOC) have evidence for increased transmissibility, virulence, and/or decreased diagnostic, therapeutic, or vaccine efficacy. Variants of Interest (VOI) contain mutations suspected or confirmed to cause a change in transmissibility, virulence, or diagnostic / therapeutic / vaccine efficacy, plus evidence of significant community transmission, a cluster of cases, or detection in multiple countries. Variants under Monitoring (VUM) include variants with unclear epidemiological impact. This track includes only the four VUMs that were previously identified as Variants of Interest and have since been reclassified at this lower level of concern. The related track B.1.1.7 in USA displays a phylogenetic tree of the first B.1.1.7 (Alpha) variant sequences collected in the United States. BV-BRC has a similar list of variants of concern and their mutations, but has added representative sequences. 
Display Conventions
Track colors are based on Nextstrain.org clade coloring. The Greek-letter names assigned by the World Health Organization (WHO) are listed in this table, along with lineage and clade designations:

WHO label | Pangolin lineage | Nextstrain clade | GISAID clade | First detected | Date VOC/VOI | Type
Alpha | B.1.1.7 | 20I (V1) | GRY | Sep 2020, United Kingdom | 18-Dec-2020 | former VOC
Beta | B.1.351 | 20H (V2) | GH/501Y.V2 | May 2020, South Africa | 18-Dec-2020 | former VOC
Gamma | P.1 | 20J (V3) | GR/501Y.V3 | Nov 2020, Brazil | 11-Jan-2021 | former VOC
Delta | B.1.617.2 | 21A | GK/478K.V1 | Oct 2020, India | 11-May-2021 | former VOC
Omicron | B.1.1.529/BA.1 | 21K | GR/484A | Nov 2021, Multiple countries | 26-Nov-2021 | former VOC
Omicron | BA.2 | 21L | GRA | Nov 2021, Multiple countries | 13-Dec-2021 | former VOC
Omicron | BA.4 | 22A | GRA | Jan. 2022 | 18-May-2022 | former VUM
Omicron | BA.5 | 22B | GRA | Jan. 2022 | 18-May-2022 | former VUM
Omicron | BA.2.12.1 | 22C | GRA | Dec. 2021 | 18-May-2022 | former VUM
Omicron | BA.2.75 | 22D | GRA | May 2022, India | 29-Jul-2022 | former VOC
Omicron | BQ.1 | 22E | GRA | Feb. 2021 | 12-Oct-2022 | former VUM
Omicron | XBB | 22F | GRA | Aug. 2022 | 12-Oct-2022 | VUM as of 29 January 2024
Omicron | XBB.1.5 | 23A | GRA | Oct. 2022 | 11-Jan-2023 VUM, 15-Mar-2023 VOI | VOI as of 29 January 2024
Omicron | XBB.1.16 | 23B | GRA | Jan. 2023 | 22-Mar-2023 VUM, 17-Apr-2023 VOI | VOI as of 29 January 2024
Omicron | CH.1.1 | 23C | GRA | July 2022 | 9-Feb-2023 | former VUM
Omicron | XBB.1.9 | 23D | GRA | Dec. 2022 | 30-Mar-2023 | XBB.1.9.1 VUM as of 29 January 2024
Omicron | XBB.2.3 | 23E | GRA | Sep. 2022 | 17-May-2023 | VUM as of 29 January 2024
Omicron | EG.5.1 | 23F | GRA | Feb. 2023 | 9-Aug-2023 | EG.5 VOI as of 29 January 2024
Omicron | XBB.1.5.70 | 23G | GRA | Mar. 2023 | n/a | n/a
Omicron | HK.3 | 23H | GRA | June 2023 | n/a | n/a
Omicron | BA.2.86 | 23I | GRA | July 2023 | 21-Nov-2023 | VOI
Omicron | JN.1 | 23I | GRA | Aug. 2023 | 18-Dec-2023 | VOI
Lambda | C.37 | 21G | GR/452Q.V1 | Dec 2020, Peru | 14-Jun-2021 | former VOI
Mu | B.1.621 | 21H | GH | Jan 2021, Colombia | 30-Aug-2021 | former VOI
Epsilon | B.1.429 | 21C | GH/452R.V1 | Mar 2020, USA | 06-Jul-2021 | former VUM
Eta | B.1.525 | 21D | G/484K.V3 | Dec 2020 | 20-Sep-2021 | former VUM
Iota | B.1.526 | 21F | GH/253G.V1 | Nov 2020, USA | 20-Sep-2021 | former VUM
Kappa | B.1.617.1 | 21B | G/452R.V3 | Oct 2020, India | 20-Sep-2021 | former VUM

Mutations in the amino acid track are named with the format: [Reference amino acid][1-based coordinate in peptide][Alternate amino acid]. E.g., L452R
Mutations in the nucleotide track are named with the format: [Reference nucleotide][1-based coordinate in genome][Alternate nucleotide]. E.g., T22918G
Insertions and deletions in both tracks are named: [del/ins]_[1-based genomic coordinate of first affected nucleotide]. E.g., del_21991

Methods
For each virus variant, SARS-CoV-2 genome sequences containing all characteristic mutations of the lineage were downloaded from GISAID using the lineage search feature (restricting to complete, high-coverage genomes, and restricting to the earliest sample collection dates when there were too many results for the download limit of 10,000 sequences per query). Sequences were aligned to the SARS-CoV-2 reference genome using the global_profile_alignment.sh script from the sarscov2phylo repository. Single-nucleotide substitutions were extracted from the alignment using the UCSC tool faToVcf (available on the UCSC download server or from bioconda; also requires the SARS-CoV-2 reference sequence). Single-nucleotide substitutions present at a frequency of at least 0.95 (0.70 for Delta, 0.80 for Omicron) were retained; all others were discarded. For indel detection, the Minimap2 suite of tools was used as follows:
    minimap2 --cs [Reference Sequence] [Set of Unaligned Sequences] | paftools.js call -L 10000 -
Indels present at a frequency of at least 0.85 (0.50 for Delta, 0.70 for Omicron) were retained.
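The frequency cutoffs described here can be illustrated with a minimal Python sketch. It is not the code in lineageVariants.py: the function names, the dictionary layout, and the example counts are assumptions made for illustration, and only the cutoff values come from the text.

```python
# Cutoff values are the ones quoted in the Methods text. The names and the
# data layout are illustrative assumptions, not taken from lineageVariants.py.
DEFAULT_CUTOFF = {"snv": 0.95, "indel": 0.85}
RELAXED_CUTOFF = {
    "Delta": {"snv": 0.70, "indel": 0.50},
    "Omicron": {"snv": 0.80, "indel": 0.70},
}


def cutoff_for(who_label, kind):
    """Return the minimum frequency for a WHO label and mutation kind."""
    return RELAXED_CUTOFF.get(who_label, DEFAULT_CUTOFF)[kind]


def retain_mutations(counts, n_sequences, who_label, kind):
    """Keep mutations whose frequency among the sequences meets the cutoff.

    counts maps a mutation name (e.g. "T22918G") to the number of
    sequences that carry it; kind is "snv" or "indel".
    """
    threshold = cutoff_for(who_label, kind)
    return sorted(
        name for name, n in counts.items() if n / n_sequences >= threshold
    )


if __name__ == "__main__":
    # Made-up counts for 100 hypothetical sequences.
    snv_counts = {"T22918G": 96, "C100T": 85, "G200A": 40}
    print(retain_mutations(snv_counts, 100, "Alpha", "snv"))
    print(retain_mutations(snv_counts, 100, "Omicron", "snv"))
```

With these made-up counts, a substitution seen in 85 of 100 sequences passes the relaxed Omicron cutoff but not the default 0.95 cutoff applied to the other variants.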
Less stringent cutoffs were applied to Delta and Omicron variant sequences because of the low quality of early sequences. The results were then combined and formatted by lineageVariants.py. The entire pipeline was run using lineageVariants.sh.

Data Access
You can download the bigBed data files for this track from the UCSC Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. For complete genome FASTA sequences of variants of concern, please visit the following third-party page: SARS-CoV-2 Variants and Lineages of Concern from the Bacterial and Viral Bioinformatics Resource Center.

Release Notes
Version 2 of this track adds one new Variant of Concern (Delta), two new Variants of Interest (Lambda, Mu), and three named variants that were previously VOIs and are now designated as less concerning Variants under Monitoring (Eta, Iota, Kappa). The track labels of all variants have been updated to include WHO labels. Track colors reflect Nextstrain conventions at the time of the track data update (September 10, 2021). Omicron BA.1 was added December 2, 2021 (called B.1.1.529 at the time of discovery). Omicron lineages BA.4 and BA.5 were added in May 2022. Omicron lineages BA.2, BA.2.12.1, BA.2.75, BQ.1, XBB, XBB.1.5, XBB.1.16, CH.1.1, XBB.1.9, XBB.2.3, and EG.5.1 were added in September 2023. Omicron lineages XBB.1.5.70, HK.3, BA.2.86 and JN.1 were added in January 2024.

Credits
This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. We thank Rob Lanfear at the Australian National University for developing and maintaining the sarscov2phylo web resource. We also thank the Su, Wu, and Andersen labs at Scripps Research for creating the Outbreak.info resource. The lineageVariants scripts were developed and run at UCSC by Nick Keener, Kate Rosenbloom and Angie Hinrichs.
References
Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. PMID: 32669681
Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, Connor T, Peacock T, Robertson DL, Volz E, et al. Preliminary genomic characterization of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological. 2020 Dec 18.
Volz E, Mishra S, Chand M, Barrett JC, Johnson E, Geidelberg L, Hinsley WR, Laydon DJ, Dabrera G, O'Toole Á, et al. Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. Virological. 2020 Dec 31.
Tegally et al., December 21, 2020. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv preprint.
Zhang et al., January 20, 2021. Emergence of a novel SARS-CoV-2 strain in Southern California, USA. medRxiv preprint.
Voloch et al., December 26, 2020. Genomic characterization of a novel SARS-CoV-2 lineage from Rio de Janeiro, Brazil. medRxiv preprint.
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
Li, Heng. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094-3100. PMID: 29750242; PMC: PMC6137996
Gangavarapu, Karthik; Alkuzweny, Manar; Cano, Marco; Haag, Emily; Latif, Alaa Abdel; Mullen, Julia L.; Rush, Benjamin; Tsueng, Ginger; Zhou, Jerry; Andersen, Kristian G.; Wu, Chunlei; Su, Andrew I.; Hughes, Laura D.
Outbreak.info

variantNucMuts_XBB_2_3 Omicron XBB.2.3 Nuc Muts Omicron VOC (XBB.2.3) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_XBB_2_3 Omicron XBB.2.3 AA Muts Omicron XBB.2.3 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_XBB_1_9 Omicron XBB.1.9 Nuc Muts Omicron VOC (XBB.1.9) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_XBB_1_9 Omicron XBB.1.9 AA Muts Omicron XBB.1.9 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_CH_1_1 Omicron CH.1.1 Nuc Muts Omicron VOC (CH.1.1) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_CH_1_1 Omicron CH.1.1 AA Muts Omicron CH.1.1 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_XBB Omicron XBB Nuc Muts Omicron VOC (XBB) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_XBB Omicron XBB AA Muts Omicron XBB amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_BQ_1 Omicron BQ.1 Nuc Muts Omicron VOC (BQ.1) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_BQ_1 Omicron BQ.1 AA Muts Omicron BQ.1 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_BA_2_12_1 Omicron BA.2.12.1 Nuc Muts Omicron VOC (BA.2.12.1) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_BA_2_12_1 Omicron BA.2.12.1 AA Muts Omicron BA.2.12.1 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_BA_5 Omicron BA.5 Nuc Muts Omicron VOC (BA.5) nucleotide mutations identified from 287 GISAID sequences (May 2022) Variation and Repeats
variantAaMuts_BA_5 Omicron BA.5 AA Muts Omicron BA.5 amino acid mutations from 287 GISAID sequences (May 4, 2022) Variation and Repeats
variantNucMuts_BA_4 Omicron BA.4 Nuc Muts Omicron VOC (BA.4) nucleotide mutations identified from 573 GISAID sequences (May 2022) Variation and Repeats
variantAaMuts_BA_4 Omicron BA.4 AA Muts Omicron BA.4 amino acid mutations from 573 GISAID sequences (May 4, 2022) Variation and Repeats
variantNucMutsV2_B_1_617_1 Kappa Nuc Muts Kappa VUM (B.1.617.1 India Oct-2020) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_B_1_617_1 Kappa AA Muts Kappa VUM (B.1.617.1 India Oct-2020) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMutsV2_B_1_526 Iota Nuc Muts Iota VUM (B.1.526 USA Nov-2020) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_B_1_526 Iota AA Muts Iota VUM (B.1.526 USA Nov-2020) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMutsV2_B_1_525 Eta Nuc Muts Eta VUM (B.1.525 Dec-2020) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_B_1_525 Eta AA Muts Eta VUM (B.1.525 Dec-2020) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMutsV2_B_1_429 Epsilon Nuc Muts Epsilon VUM (B.1.429 USA Mar-2020) nucleotide mutations in 1360 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantAaMutsV2_B_1_429 Epsilon AA Muts Epsilon VUM (B.1.429 USA Mar-2020) amino acid mutations in 1360 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantNucMuts_BA_2_86 Omicron BA.2.86 Nuc Muts Omicron VOC (BA.2.86) nucleotide mutations identified from GISAID sequences (Jan 2024) Variation and Repeats
variantAaMuts_BA_2_86 Omicron BA.2.86 AA Muts Omicron BA.2.86 amino acid mutations from GISAID sequences (Jan 29, 2024) Variation and Repeats
variantNucMuts_JN_1 Omicron JN.1 Nuc Muts Omicron VOC (JN.1) nucleotide mutations identified from GISAID sequences (Jan 2024) Variation and Repeats
variantAaMuts_JN_1 Omicron JN.1 AA Muts Omicron JN.1 amino acid mutations from GISAID sequences (Jan 29, 2024) Variation and Repeats
variantNucMuts_HK_3 Omicron HK.3 Nuc Muts Omicron VOC (HK.3) nucleotide mutations identified from GISAID sequences (Jan 2024) Variation and Repeats
variantAaMuts_HK_3 Omicron HK.3 AA Muts Omicron HK.3 amino acid mutations from GISAID sequences (Jan 29, 2024) Variation and Repeats
variantNucMuts_XBB_1_5_70 Omicron XBB.1.5.70 Nuc Muts Omicron VOC (XBB.1.5.70) nucleotide mutations identified from GISAID sequences (Jan 2024) Variation and Repeats
variantAaMuts_XBB_1_5_70 Omicron XBB.1.5.70 AA Muts Omicron XBB.1.5.70 amino acid mutations from GISAID sequences (Jan 29, 2024) Variation and Repeats
variantNucMuts_EG_5_1 Omicron EG.5.1 Nuc Muts Omicron VOC (EG.5.1) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_EG_5_1 Omicron EG.5.1 AA Muts Omicron EG.5.1 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_XBB_1_16 Omicron XBB.1.16 Nuc Muts Omicron VOC (XBB.1.16) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_XBB_1_16 Omicron XBB.1.16 AA Muts Omicron XBB.1.16 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_XBB_1_5 Omicron XBB.1.5 Nuc Muts Omicron VOC (XBB.1.5) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_XBB_1_5 Omicron XBB.1.5 AA Muts Omicron XBB.1.5 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMutsV2_B_1_621 Mu Nuc Muts Mu VOI (B.1.621 Colombia Jan-2021) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_B_1_621 Mu AA Muts Mu VOI (B.1.621 Colombia Jan-2021) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMutsV2_C_37 Lambda Nuc Muts Lambda VOI (C.37 Peru Mar-2020) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_C_37 Lambda AA Muts Lambda VOI (C.37 Peru Mar-2020) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMuts_BA_2_75 Omicron BA.2.75 Nuc Muts Omicron VOC (BA.2.75) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_BA_2_75 Omicron BA.2.75 AA Muts Omicron BA.2.75 amino acid mutations from 287 GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_BA_2 Omicron BA.2 Nuc Muts Omicron VOC (BA.2) nucleotide mutations identified from GISAID sequences (Sep 2023) Variation and Repeats
variantAaMuts_BA_2 Omicron BA.2 AA Muts Omicron BA.2 amino acid mutations from GISAID sequences (Sep 22, 2023) Variation and Repeats
variantNucMuts_B_1_1_529 Omicron BA.1 Nuc Muts Omicron VOC (BA.1 SA Nov-2021) nucleotide mutations identified from GISAID sequences (Nov 2021) Variation and Repeats
variantAaMuts_B_1_1_529 Omicron BA.1 AA Muts Omicron (BA.1 SA Nov-2021) amino acid mutations from cov-lineages.org (Nov 2021) Variation and Repeats
variantNucMutsV2_B_1_617_2 Delta Nuc Muts Delta VOC (B.1.617.2 India Oct-2020) nucleotide mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantAaMutsV2_B_1_617_2 Delta AA Muts Delta VOC (B.1.617.2 India Oct-2020) amino acid mutations in 3000 GISAID sequences (Sep 10, 2021) Variation and Repeats
variantNucMutsV2_P_1 Gamma Nuc Muts Gamma VOC (P.1 Brazil Nov-2020) nucleotide mutations in 78 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantAaMutsV2_P_1 Gamma AA Muts Gamma VOC (P.1 Brazil Nov-2020) amino acid mutations in 78 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantNucMutsV2_B_1_351 Beta Nuc Muts Beta VOC (B.1.351 SA Dec-2020) nucleotide mutations in 793 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantAaMutsV2_B_1_351 Beta AA Muts Beta VOC (B.1.351 SA Dec-2020) amino acid mutations in 793 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantNucMutsV2_B_1_1_7 Alpha Nuc Muts Alpha VOC (B.1.1.7 UK Sep-2020) nucleotide mutations in 9838 GISAID sequences (Feb 5, 2021) Variation and Repeats
variantAaMutsV2_B_1_1_7 Alpha AA Muts Alpha VOC (B.1.1.7 UK Sep-2020) amino acid mutations in 9838 GISAID sequences (Feb 5, 2021) Variation and Repeats

windowmaskerSdust WM + SDust Genomic Intervals Masked by WindowMasker + SDust Variation and Repeats

Description
This track depicts masked sequence as determined by WindowMasker. The WindowMasker tool is included in the NCBI C++ toolkit. The source code for the entire toolkit is available from the NCBI FTP site.

Methods
To create this track, WindowMasker was run with the following parameters:
    windowmasker -mk_counts true -input wuhCor1.fa -output wm_counts
    windowmasker -ustat wm_counts -sdust true -input wuhCor1.fa -output repeats.bed
The repeats.bed (BED3) file was loaded into the "windowmaskerSdust" table for this track.

References
Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41. PMID: 16287941