cons160way 160 Accessions Multiz Alignment & Conservation (160 Virus Strains, Accession Names) Comparative Genomics

Downloads for data in this track are available:
Multiz alignments (MAF format), and phylogenetic trees
PhyloP conservation (WIG format)
PhastCons conservation (WIG format)

Description

This track shows multiple alignments of 160 virus sequences, composed of 158 Ebola virus sequences and two Marburg virus sequences aligned to the Ebola virus reference sequence G3683/KM034562.1. It also includes measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 160 virus sequences. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track.

PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites.

The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution).
In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data.

The data contained in the 160 Accessions and the 160 Strains tracks are the same. The only difference between these two tracks is the identifiers used to label the sequences. In the 160 Accessions track, each sequence is labeled using its NCBI Nucleotide accession number. In the 160 Strains track, we used a shortened version of the strain name from the NCBI Nucleotide entry to label each sequence, and when this was unavailable, we constructed our own using the DEFINITION, /country, and /collection_date lines from the NCBI record. The mapping between sequence identifiers and strain names is provided via a text file on our download server. Additional metadata from GenBank is provided in a tab-separated file.

Display Conventions and Configuration

Pairwise alignments of each species to the Ebola virus genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity.

In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks in one of four colors are drawn:
Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated.
Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated.
Blue blocks are drawn when a polymorphism is outside a coding region.
Pale yellow blocks are drawn when there are no aligning bases to that region in the reference genome.

Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults).

To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Base Level

When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the Ebola virus sequence at those alignment positions relative to the longest non-Ebola virus sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".

Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:
No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.

Methods

Pairwise alignments with the reference sequence were generated for each sequence using lastz version 1.03.52. Parameters used for each lastz alignment:
# hsp_threshold = 2200
# gapped_threshold = 4000 = L
# x_drop = 910
# y_drop = 3400 = Y
# gap_open_penalty = 400
# gap_extend_penalty = 30
#        A     C     G     T
#   A   91   -90   -25  -100
#   C  -90   100  -100   -25
#   G  -25  -100   100   -90
#   T -100   -25   -90    91
# seed=1110100110010101111 w/transition
# step=1

Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Parameters used in the chaining (axtChain) step: -minScore=10 -linearGap=loose

High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net.
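The lastz scoring parameters listed above can be made concrete with a small sketch. The Python below is not lastz itself and performs no alignment search; it only scores an already aligned pair of sequences using the substitution matrix and gap penalties from the parameter list. It assumes an affine gap cost in which a gap of length n costs gap_open_penalty + n * gap_extend_penalty. The function name and the example sequences are hypothetical.

```python
# Substitution scores copied from the lastz parameter list above
# (rows and columns in the order A, C, G, T).
BASES = "ACGT"
MATRIX = [
    [91, -90, -25, -100],
    [-90, 100, -100, -25],
    [-25, -100, 100, -90],
    [-100, -25, -90, 91],
]
GAP_OPEN = 400
GAP_EXTEND = 30


def score_alignment(ref: str, qry: str) -> int:
    """Score two equal-length aligned strings; '-' marks a gap.

    Assumption: a gap of length n costs GAP_OPEN + n * GAP_EXTEND.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned strings must have equal length")
    total = 0
    gap_side = None  # which sequence currently carries an open gap
    for r, q in zip(ref.upper(), qry.upper()):
        if r == "-" and q == "-":
            raise ValueError("column with gaps in both sequences")
        if r == "-" or q == "-":
            side = "ref" if r == "-" else "qry"
            if gap_side != side:
                total -= GAP_OPEN  # a new gap starts in this column
                gap_side = side
            total -= GAP_EXTEND
        else:
            total += MATRIX[BASES.index(r)][BASES.index(q)]
            gap_side = None
    return total


if __name__ == "__main__":
    print(score_alignment("ACGT", "ACGT"))          # four matching columns
    print(score_alignment("ACGTACGT", "AC--ACGT"))  # one gap of length 2
```

Under these penalties a single short gap costs more than several mismatches, so a high-scoring alignment needs long matching runs on both sides of any gap.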
The multiple alignment was constructed from the resulting best-in-genome pairwise alignments progressively aligned using multiz/autoMZ, following a simple binary tree phylogeny: ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( (KM034562v1 KJ660346v2) KJ660347v2) KJ660348v2) KM034554v1) KM034555v1) KM034557v1) KM034560v1) KM233039v1) KM233043v1) KM233045v1) KM233050v1) KM233051v1) KM233053v1) KM233056v1) KM233057v1) KM233063v1) KM233069v1) KM233070v1) KM233072v1) KM233089v1) KM233092v1) KM233096v1) KM233097v1) KM233098v1) KM233099v1) KM233103v1) KM233104v1) KM233109v1) KM233110v1) KM233113v1) AF086833v2) AF272001v1) AY142960v1) EU224440v2) KC242791v1) KC242792v1) KC242794v1) KC242796v1) KC242798v1) KC242799v1) KC242801v1) KM034551v1) KM034553v1) KM034556v1) KM034558v1) KM034559v1) KM034561v1) KM233035v1) KM233036v1) KM233037v1) KM233038v1) KM233040v1) KM233041v1) KM233042v1) KM233044v1) KM233046v1) KM233047v1) KM233048v1) KM233049v1) KM233052v1) KM233054v1) KM233055v1) KM233058v1) KM233059v1) KM233061v1) KM233062v1) KM233064v1) KM233065v1) KM233066v1) KM233067v1) KM233068v1) KM233071v1) KM233073v1) KM233074v1) KM233075v1) KM233076v1) KM233077v1) KM233078v1) KM233079v1) KM233080v1) KM233081v1) KM233082v1) KM233084v1) KM233085v1) KM233086v1) KM233087v1) KM233088v1) KM233093v1) KM233094v1) KM233095v1) KM233100v1) KM233101v1) KM233102v1) KM233105v1) KM233106v1) KM233107v1) KM233108v1) KM233111v1) KM233112v1) KM233114v1) KM233115v1) KM233116v1) KM233091v1) NC_002549v1) KM034552v1) KM233060v1) KM233083v1) KM233090v1) KM233117v1) KM233118v1) AY354458v1) KC242784v1) KC242785v1) KC242786v1) KC242787v1) KC242788v1) KC242789v1) KC242790v1) KC242793v1) KC242795v1) KC242797v1) KC242800v1) AF499101v1) JQ352763v1) HQ613402v1) HQ613403v1) KM034549v1) KM034550v1) KM034563v1) FJ217162v1) NC_014372v1) FJ217161v1) NC_014373v1) KC545395v1) KC545394v1) KC545393v1) KC545396v1) 
FJ621585v1) FJ621584v1) JX477166v1) AY769362v1) AB050936v1) EU338380v1) KC242783v2) JX477165v1) AF522874v1) NC_004161v1) FJ621583v1) KC589025v1) FJ968794v1) AY729654v1) NC_006432v1) KC545389v1) KC545390v1) KC545391v1) KC545392v1) JN638998v1) NC_024781v1) NC_001608v3) ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( (G3686v1_2014 Guinea_Kissidougou-C15_2014) Guinea_Gueckedou-C07_2014) Guinea_Gueckedou-C05_2014) G3676v1_2014) G3676v2_2014) G3677v2_2014) G3682v1_2014) EM112_2014) EM120_2014) EM124v1_2014) G3713v2_2014) G3713v3_2014) G3724_2014) G3735v1_2014) G3735v2_2014) G3764_2014) G3770v1_2014) G3770v2_2014) G3782_2014) G3814_2014) G3818_2014) G3822_2014) G3823_2014) G3825v1_2014) G3825v2_2014) G3831_2014) G3834_2014) G3846_2014) G3848_2014) G3856v1_2014) AF086833v2_1976) Mayinga_1976) Mayinga_2002) GuineaPig_Mayinga_2007) Bonduni_1977) Gabon_1994) 2Nza_1996) 13625Kikwit_1995) 1Ikot_Gabon_1996) 13709Kikwit_1995) deRoover_1976) EM096_2014) G3670v1_2014) G3677v1_2014) G3679v1_2014) G3680v1_2014) G3683v1_2014) EM104_2014) EM106_2014) EM110_2014) EM111_2014) EM113_2014) EM115_2014) EM119_2014) EM121_2014) EM124v2_2014) EM124v3_2014) EM124v4_2014) G3707_2014) G3713v4_2014) G3729_2014) G3734v1_2014) G3750v1_2014) G3750v2_2014) G3752_2014) G3758_2014) G3765v2_2014) G3769v1_2014) G3769v2_2014) G3769v3_2014) G3769v4_2014) G3771_2014) G3786_2014) G3787_2014) G3788_2014) G3789v1_2014) G3795_2014) G3796_2014) G3798_2014) G3799_2014) G3800_2014) G3805v1_2014) G3807_2014) G3808_2014) G3809_2014) G3810v1_2014) G3810v2_2014) G3819_2014) G3820_2014) G3821_2014) G3826_2014) G3827_2014) G3829_2014) G3838_2014) G3840_2014) G3841_2014) G3845_2014) G3850_2014) G3851_2014) G3856v3_2014) G3857_2014) NM042v1_2014) G3817_2014) NC_002549v1_1976) EM098_2014) G3750v3_2014) G3805v2_2014) G3816_2014) NM042v2_2014) NM042v3_2014) Zaire_1995) Luebo9_2007) Luebo0_2007) Luebo1_2007) 
Luebo23_2007) Luebo43_2007) Luebo4_2007) Luebo5_2007) 1Eko_1996) 1Mbie_Gabon_1996) 1Oba_Gabon_1996) Ilembe_2002) Mouse_Mayinga_2002) Kikwit_1995) 034-KS_2008) M-M_2007) EM095B_2014) EM095_2014) G3687v1_2014) Cote_dIvoire_CIEBOV_1994) Cote_dIvoire_1994) Bundibugyo_Uganda_2007) Bundibugyo_2007) EboBund-122_2012) EboBund-120_2012) EboBund-112_2012) EboBund-14_2012) Reston08-E_2008) Reston08-C_2008) Alice_TX_USA_MkCQ8167_1996) reconstructReston_2008) Reston_1996) Yambio_2004) Maleo_1979) Reston09-A_2009) Reston_PA_1990) Pennsylvania_1990) Reston08-A_2008) EboSud-639_2012) Boniface_1976) Gulu_Uganda_2000) Gulu_2000) EboSud-602_2012) EboSud-603_2012) EboSud-609_2012) EboSud-682_2012) Nakisamata_2011) Marburg_KitumCave_Kenya_1987) Marburg_MtElgon_Musoke_Kenya_1980) Framing tables from the genes were constructed to enable visualization of codons in the multiple alignment display. Phylogenetic Tree Model Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. The all-species tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 160-way alignment (msa_view). The 4d sites were derived from the NCBI gene set, filtered to select single-coverage long transcripts. This same tree model was used in the phyloP calculations; however, the background frequencies were modified to maintain reversibility. The resulting tree model: all species. 
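The "simple binary tree" given in the Methods for the progressive alignment is fully left-nested: each sequence is joined to the group formed by all sequences listed before it. As a sketch, the following Python builds a string in the same parenthesized, space-separated form from an ordered list of names. The function name is hypothetical, and the example uses only the first three accession names from the tree.

```python
def left_nested_tree(names: list[str]) -> str:
    """Return a left-nested binary tree such as '((A B) C)' for [A, B, C]."""
    if len(names) < 2:
        raise ValueError("need at least two sequence names")
    tree = f"({names[0]} {names[1]})"
    for name in names[2:]:
        # Each further sequence is joined to everything built so far.
        tree = f"({tree} {name})"
    return tree


if __name__ == "__main__":
    print(left_nested_tree(["KM034562v1", "KJ660346v2", "KJ660347v2"]))
```

A tree of this shape over n sequences has n - 1 opening parentheses, which accounts for the long runs of "(" that begin each tree.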
PhastCons Conservation

The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al., 2005.

The phastCons parameters used were: expected-length=45, target-coverage=0.3, rho=0.3.

PhyloP Conservation

The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements (http://compgen.cshl.edu/phast/). Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC).
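The sign convention for the phyloP scores can be sketched in a few lines of Python. This illustrates only how a p-value and a direction map to a signed score, not the likelihood ratio test itself. It assumes the logarithm is base 10, which the description does not state, and both function names are hypothetical.

```python
import math


def conacc_score(p_value: float, conserved: bool) -> float:
    """Signed score: positive for conservation, negative for acceleration.

    Assumption: the magnitude is -log10(p_value).
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    magnitude = -math.log10(p_value)
    return magnitude if conserved else -magnitude


def p_value_from_score(score: float) -> float:
    """Recover the p-value from the absolute value of a score."""
    return 10.0 ** (-abs(score))


if __name__ == "__main__":
    print(conacc_score(0.001, conserved=True))   # positive: shown in blue
    print(conacc_score(0.001, conserved=False))  # negative: shown in red
```

Under this convention a score near zero means the column is consistent with neutral evolution, whichever sign it carries.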
Conserved Elements The conserved elements were predicted by running phastCons with the --most-conserved option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz, Robert Harris, and Webb Miller of the Penn State Bioinformatics Group Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC References Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632; Supplemental Materials and Methods Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. 
A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Lastz (formerly Blastz): Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. 
PMID: 12529312; PMC: PMC430961

cons160wayViewalign Multiz Alignments Multiz Alignment & Conservation (160 Virus Strains, Accession Names) Comparative Genomics
multiz160way Multiz Align Multiz Genome Alignments of 158 Ebola strains and 2 Marburg strains Comparative Genomics
cons160wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (160 Virus Strains, Accession Names) Comparative Genomics
phastCons160way PhastCons 158 Ebola strains and 2 Marburg strains Basewise Conservation by PhastCons Comparative Genomics
cons160wayViewelements Conserved Elements Multiz Alignment & Conservation (160 Virus Strains, Accession Names) Comparative Genomics
phastConsElements160way Cons. Elements 158 Ebola strains and 2 Marburg strains Conserved Elements Comparative Genomics
cons160wayViewphyloP Basewise Conservation (phyloP) Multiz Alignment & Conservation (160 Virus Strains, Accession Names) Comparative Genomics
phyloP160way PhyloP 158 Ebola strains and 2 Marburg strains Basewise Conservation by PhyloP Comparative Genomics

strainCons160way 160 Strains Multiz Alignment & Conservation (160 Virus Strains, Strain Names) Comparative Genomics

Downloads for data in this track are available:
Multiz alignments (MAF format), and phylogenetic trees
PhyloP conservation (WIG format)
PhastCons conservation (WIG format)

Description

This track shows multiple alignments of 160 virus sequences, composed of 158 Ebola virus sequences and two Marburg virus sequences aligned to the Ebola virus reference sequence G3683/KM034562.1. It also includes measurements of evolutionary conservation using two methods (phastCons and phyloP) from the PHAST package, for all 160 virus sequences. The multiple alignments were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements identified by phastCons are also displayed in this track.
PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbors. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites.

The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites). Another important difference is that phyloP can measure acceleration (faster evolution than expected under neutral drift) as well as conservation (slower than expected evolution). In the phyloP plots, sites predicted to be conserved are assigned positive scores (and shown in blue), while sites predicted to be fast-evolving are assigned negative scores (and shown in red). The absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The phastCons scores, by contrast, represent probabilities of negative selection and range between 0 and 1. Both phastCons and phyloP treat alignment gaps and unaligned nucleotides as missing data.

The data contained in the 160 Accessions and the 160 Strains tracks are the same. The only difference between these two tracks is the identifiers used to label the sequences. In the 160 Accessions track, each sequence is labeled using its NCBI Nucleotide accession number. In the 160 Strains track, we used a shortened version of the strain name from the NCBI Nucleotide entry to label each sequence, and when this was unavailable, we constructed our own using the DEFINITION, /country, and /collection_date lines from the NCBI record. The mapping between sequence identifiers and strain names is provided via a text file on our download server. Additional metadata from GenBank is provided in a tab-separated file.

Display Conventions and Configuration

Pairwise alignments of each species to the Ebola virus genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity.

In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks in one of four colors are drawn:
Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated.
Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated.
Blue blocks are drawn when a polymorphism is outside a coding region.
Pale yellow blocks are drawn when there are no aligning bases to that region in the reference genome.

Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults).

To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.

Base Level

When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the Ebola virus sequence at those alignment positions relative to the longest non-Ebola virus sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+".

Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes:
No codon translation: The gene annotation is not used; the bases are displayed without translation.
Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment.
Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding.
Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation.

Methods

Pairwise alignments with the reference sequence were generated for each sequence using lastz version 1.03.52.
Parameters used for each lastz alignment: # hsp_threshold = 2200 # gapped_threshold = 4000 = L # x_drop = 910 # y_drop = 3400 = Y # gap_open_penalty = 400 # gap_extend_penalty = 30 # A C G T # A 91 -90 -25 -100 # C -90 100 -100 -25 # G -25 -100 100 -90 # T -100 -25 -90 91 # seed=1110100110010101111 w/transition # step=1 Pairwise alignments were then linked into chains using a dynamic programming algorithm that finds maximally scoring chains of gapless subsections of the alignments organized in a kd-tree. Parameters used in the chaining (axtChain) step: -minScore=10 -linearGap=loose High-scoring chains were then placed along the genome, with gaps filled by lower-scoring chains, to produce an alignment net. The multiple alignment was constructed from the resulting best-in-genome pairwise alignments progressively aligned using multiz/autoMZ, following a simple binary tree phylogeny: ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( (KM034562v1 KJ660346v2) KJ660347v2) KJ660348v2) KM034554v1) KM034555v1) KM034557v1) KM034560v1) KM233039v1) KM233043v1) KM233045v1) KM233050v1) KM233051v1) KM233053v1) KM233056v1) KM233057v1) KM233063v1) KM233069v1) KM233070v1) KM233072v1) KM233089v1) KM233092v1) KM233096v1) KM233097v1) KM233098v1) KM233099v1) KM233103v1) KM233104v1) KM233109v1) KM233110v1) KM233113v1) AF086833v2) AF272001v1) AY142960v1) EU224440v2) KC242791v1) KC242792v1) KC242794v1) KC242796v1) KC242798v1) KC242799v1) KC242801v1) KM034551v1) KM034553v1) KM034556v1) KM034558v1) KM034559v1) KM034561v1) KM233035v1) KM233036v1) KM233037v1) KM233038v1) KM233040v1) KM233041v1) KM233042v1) KM233044v1) KM233046v1) KM233047v1) KM233048v1) KM233049v1) KM233052v1) KM233054v1) KM233055v1) KM233058v1) KM233059v1) KM233061v1) KM233062v1) KM233064v1) KM233065v1) KM233066v1) KM233067v1) KM233068v1) KM233071v1) KM233073v1) KM233074v1) KM233075v1) KM233076v1) KM233077v1) 
KM233078v1) KM233079v1) KM233080v1) KM233081v1) KM233082v1) KM233084v1) KM233085v1) KM233086v1) KM233087v1) KM233088v1) KM233093v1) KM233094v1) KM233095v1) KM233100v1) KM233101v1) KM233102v1) KM233105v1) KM233106v1) KM233107v1) KM233108v1) KM233111v1) KM233112v1) KM233114v1) KM233115v1) KM233116v1) KM233091v1) NC_002549v1) KM034552v1) KM233060v1) KM233083v1) KM233090v1) KM233117v1) KM233118v1) AY354458v1) KC242784v1) KC242785v1) KC242786v1) KC242787v1) KC242788v1) KC242789v1) KC242790v1) KC242793v1) KC242795v1) KC242797v1) KC242800v1) AF499101v1) JQ352763v1) HQ613402v1) HQ613403v1) KM034549v1) KM034550v1) KM034563v1) FJ217162v1) NC_014372v1) FJ217161v1) NC_014373v1) KC545395v1) KC545394v1) KC545393v1) KC545396v1) FJ621585v1) FJ621584v1) JX477166v1) AY769362v1) AB050936v1) EU338380v1) KC242783v2) JX477165v1) AF522874v1) NC_004161v1) FJ621583v1) KC589025v1) FJ968794v1) AY729654v1) NC_006432v1) KC545389v1) KC545390v1) KC545391v1) KC545392v1) JN638998v1) NC_024781v1) NC_001608v3) ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( (G3686v1_2014 Guinea_Kissidougou-C15_2014) Guinea_Gueckedou-C07_2014) Guinea_Gueckedou-C05_2014) G3676v1_2014) G3676v2_2014) G3677v2_2014) G3682v1_2014) EM112_2014) EM120_2014) EM124v1_2014) G3713v2_2014) G3713v3_2014) G3724_2014) G3735v1_2014) G3735v2_2014) G3764_2014) G3770v1_2014) G3770v2_2014) G3782_2014) G3814_2014) G3818_2014) G3822_2014) G3823_2014) G3825v1_2014) G3825v2_2014) G3831_2014) G3834_2014) G3846_2014) G3848_2014) G3856v1_2014) AF086833v2_1976) Mayinga_1976) Mayinga_2002) GuineaPig_Mayinga_2007) Bonduni_1977) Gabon_1994) 2Nza_1996) 13625Kikwit_1995) 1Ikot_Gabon_1996) 13709Kikwit_1995) deRoover_1976) EM096_2014) G3670v1_2014) G3677v1_2014) G3679v1_2014) G3680v1_2014) G3683v1_2014) EM104_2014) EM106_2014) EM110_2014) EM111_2014) EM113_2014) EM115_2014) EM119_2014) EM121_2014) EM124v2_2014) EM124v3_2014) EM124v4_2014) 
G3707_2014) G3713v4_2014) G3729_2014) G3734v1_2014) G3750v1_2014) G3750v2_2014) G3752_2014) G3758_2014) G3765v2_2014) G3769v1_2014) G3769v2_2014) G3769v3_2014) G3769v4_2014) G3771_2014) G3786_2014) G3787_2014) G3788_2014) G3789v1_2014) G3795_2014) G3796_2014) G3798_2014) G3799_2014) G3800_2014) G3805v1_2014) G3807_2014) G3808_2014) G3809_2014) G3810v1_2014) G3810v2_2014) G3819_2014) G3820_2014) G3821_2014) G3826_2014) G3827_2014) G3829_2014) G3838_2014) G3840_2014) G3841_2014) G3845_2014) G3850_2014) G3851_2014) G3856v3_2014) G3857_2014) NM042v1_2014) G3817_2014) NC_002549v1_1976) EM098_2014) G3750v3_2014) G3805v2_2014) G3816_2014) NM042v2_2014) NM042v3_2014) Zaire_1995) Luebo9_2007) Luebo0_2007) Luebo1_2007) Luebo23_2007) Luebo43_2007) Luebo4_2007) Luebo5_2007) 1Eko_1996) 1Mbie_Gabon_1996) 1Oba_Gabon_1996) Ilembe_2002) Mouse_Mayinga_2002) Kikwit_1995) 034-KS_2008) M-M_2007) EM095B_2014) EM095_2014) G3687v1_2014) Cote_dIvoire_CIEBOV_1994) Cote_dIvoire_1994) Bundibugyo_Uganda_2007) Bundibugyo_2007) EboBund-122_2012) EboBund-120_2012) EboBund-112_2012) EboBund-14_2012) Reston08-E_2008) Reston08-C_2008) Alice_TX_USA_MkCQ8167_1996) reconstructReston_2008) Reston_1996) Yambio_2004) Maleo_1979) Reston09-A_2009) Reston_PA_1990) Pennsylvania_1990) Reston08-A_2008) EboSud-639_2012) Boniface_1976) Gulu_Uganda_2000) Gulu_2000) EboSud-602_2012) EboSud-603_2012) EboSud-609_2012) EboSud-682_2012) Nakisamata_2011) Marburg_KitumCave_Kenya_1987) Marburg_MtElgon_Musoke_Kenya_1980) Framing tables from the genes were constructed to enable visualization of codons in the multiple alignment display. Phylogenetic Tree Model Both phastCons and phyloP are phylogenetic methods that rely on a tree model containing the tree topology, branch lengths representing evolutionary distance at neutrally evolving sites, the background distribution of nucleotides, and a substitution rate matrix. 
The all-species tree model for this track was generated using the phyloFit program from the PHAST package (REV model, EM algorithm, medium precision) using multiple alignments of 4-fold degenerate sites extracted from the 160-way alignment (msa_view). The 4d sites were derived from the NCBI gene set, filtered to select single-coverage long transcripts. This same tree model was used in the phyloP calculations; however, the background frequencies were modified to maintain reversibility. The resulting tree model: all species.

PhastCons Conservation

The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2005). PhastCons uses a two-state phylo-HMM, with a state for conserved regions and a state for non-conserved regions. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM. These scores reflect the phylogeny (including branch lengths) of the species in question, a continuous-time Markov model of the nucleotide substitution process, and a tendency for conservation levels to be autocorrelated along the genome (i.e., to be similar at adjacent sites). The general reversible (REV) substitution model was used. Unlike many conservation-scoring programs, phastCons does not rely on a sliding window of fixed size; therefore, short highly conserved regions and long moderately conserved regions can both obtain high scores. More information about phastCons can be found in Siepel et al., 2005.

The phastCons parameters used were: expected-length=45, target-coverage=0.3, rho=0.3.
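How a two-state HMM turns per-column likelihoods into the plotted posterior probability can be sketched with a forward-backward pass. The Python below is a toy illustration, not the phastCons implementation. The per-column likelihoods are made-up inputs, whereas phastCons derives them from the tree model, and rho is not modeled. The transition probabilities are derived from expected-length and target-coverage under the assumption that expected length = 1/mu and coverage = nu/(mu + nu), following the parameterization in Siepel et al., 2005. All function and variable names are hypothetical.

```python
def conserved_posteriors(lik_cons, lik_non,
                         expected_length=45.0, target_coverage=0.3):
    """Posterior probability of the conserved state at each column.

    lik_cons[i] and lik_non[i] are the likelihoods of column i under
    the conserved and non-conserved states (toy inputs in this sketch).
    """
    n = len(lik_cons)
    if n == 0 or n != len(lik_non):
        raise ValueError("need two non-empty lists of equal length")

    mu = 1.0 / expected_length                           # conserved -> non-conserved
    nu = mu * target_coverage / (1.0 - target_coverage)  # non-conserved -> conserved
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]             # state 0 = conserved
    start = [target_coverage, 1.0 - target_coverage]     # stationary distribution
    emit = list(zip(lik_cons, lik_non))

    # Forward pass, rescaled at every column to avoid underflow.
    fwd = []
    prev = None
    for i in range(n):
        if i == 0:
            row = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            row = [emit[i][s] * sum(prev[t] * trans[t][s] for t in (0, 1))
                   for s in (0, 1)]
        z = sum(row)
        prev = [x / z for x in row]
        fwd.append(prev)

    # Backward pass, also rescaled at every column.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        row = [sum(trans[s][t] * emit[i + 1][t] * bwd[i + 1][t] for t in (0, 1))
               for s in (0, 1)]
        z = sum(row)
        bwd[i] = [x / z for x in row]

    # Combine both passes and normalize over the two states.
    post = []
    for f, b in zip(fwd, bwd):
        c, nc = f[0] * b[0], f[1] * b[1]
        post.append(c / (c + nc))
    return post


if __name__ == "__main__":
    # Ten columns that each favor the conserved state 9:1 (made-up values).
    for p in conserved_posteriors([0.9] * 10, [0.1] * 10):
        print(round(p, 3))
```

Because neighboring columns share information through the transition probabilities, a column inside a run of well-conserved columns receives a higher posterior than the same column would in isolation. This is the smoothing effect that distinguishes phastCons plots from phyloP plots.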
PhyloP Conservation The phyloP program supports several different methods for computing p-values of conservation or acceleration, for individual nucleotides or larger elements (http://compgen.cshl.edu/phast/). Here it was used to produce separate scores at each base (--wig-scores option), considering all branches of the phylogeny rather than a particular subtree or lineage (i.e., the --subtree option was not used). The scores were computed by performing a likelihood ratio test at each alignment column (--method LRT), and scores for both conservation and acceleration were produced (--mode CONACC). Conserved Elements The conserved elements were predicted by running phastCons with the --most-conserved option. The predicted elements are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM. Each element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The "score" field associated with this track contains transformed log-odds scores, taking values between 0 and 1000. (The scores are transformed using a monotonic function of the form a * log(x) + b.) The raw log odds scores are retained in the "name" field and can be seen on the details page or in the browser when the track's display mode is set to "pack" or "full". Credits This track was created using the following programs: Alignment tools: lastz (formerly blastz) and multiz by Minmei Hou, Scott Schwartz, Robert Harris, and Webb Miller of the Penn State Bioinformatics Group Conservation scoring: phastCons, phyloP, phyloFit, tree_doctor, msa_view and other programs in PHAST by Adam Siepel at Cold Spring Harbor Laboratory (original development done at the Haussler lab at UCSC). 
Chaining and Netting: axtChain, chainNet by Jim Kent at UCSC MAF Annotation tools: mafAddIRows by Brian Raney, UCSC; mafAddQRows by Richard Burhans, Penn State; genePredToMafFrames by Mark Diekhans, UCSC Tree image generator: phyloPng by Galt Barber, UCSC Conservation track display: Kate Rosenbloom, Hiram Clawson (wiggle display), and Brian Raney (gap annotation and codon framing) at UCSC References Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632; Supplemental Materials and Methods Phylo-HMMs, phastCons, and phyloP: Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. PMID: 8583911 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. PMID: 19858363; PMC: PMC2798823 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 Siepel A, Haussler D. Phylogenetic Hidden Markov Models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp. 325-351. Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995 Feb;139(2):993-1005. PMID: 7713447; PMC: PMC1206396 Chain/Net: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. 
PMID: 14500911; PMC: PMC208784 Multiz: Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004 Apr;14(4):708-15. PMID: 15060014; PMC: PMC383317 Lastz (formerly Blastz): Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. Pennsylvania State University, USA. 2007. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 strainCons160wayViewalign Multiz Alignments Multiz Alignment & Conservation (160 Virus Strains, Strain Names) Comparative Genomics strainName160way Multiz Align Multiz Genome Alignments of 158 Ebola strains and 2 Marburg strains Comparative Genomics strainCons160wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (160 Virus Strains, Strain Names) Comparative Genomics strainPhastCons160way PhastCons 158 Ebola strains and 2 Marburg strains Basewise Conservation by PhastCons Comparative Genomics strainCons160wayViewelements Conserved Elements Multiz Alignment & Conservation (160 Virus Strains, Strain Names) Comparative Genomics strainPhastConsElements160way Cons. Elements 158 Ebola strains and 2 Marburg strains Conserved Elements Comparative Genomics strainCons160wayViewphyloP Basewise Conservation (phyloP) Multiz Alignment & Conservation (160 Virus Strains, Strain Names) Comparative Genomics strainPhyloP160way PhyloP 158 Ebola strains and 2 Marburg strains Basewise Conservation by PhyloP Comparative Genomics ncbiGene NCBI Genes NCBI Genes from KM034562 GenBank Record Genes and Gene Prediction Tracks Description This track contains genes extracted from the GenBank nuccore entry for KM034562.1. 
Display Conventions This track follows the display conventions for gene prediction tracks. Methods We downloaded the GenBank record for KM034562.1, extracted the entries for each gene, and then loaded them into the UCSC database. Additional entries were added for the various forms of the GP gene. newSeqs New sequences Recent related sequences from GenBank Comparative Genomics Description This track shows the alignments of recently sequenced Ebola samples to the Ebola virus reference sequence G3683/KM034562.1. Display Conventions and Configuration Pairwise alignments of each species to the Ebola virus genome are displayed as a series of colored blocks indicating the functional effect of polymorphisms (in pack mode), or as a wiggle (in full mode) that indicates alignment quality. In dense display mode, percent identity of the whole alignments is shown in grayscale using darker values to indicate higher levels of identity. In pack mode, regions that align with 100% identity are not shown. Where identity is below 100%, blocks in one of four colors are drawn. Red blocks are drawn when a polymorphism in a coding region results in a change in the amino acid that is generated. Green blocks are drawn when a polymorphism in a coding region results in no change to the amino acid that is generated. Blue blocks are drawn when a polymorphism is outside a coding region. Pale yellow blocks are drawn when no bases align to that region of the reference genome. Checkboxes on the track configuration page allow selection of the species to include in the pairwise display. Configuration buttons are available to select all of the species (Set all), deselect all of the species (Clear all), or use the default settings (Set defaults). To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.
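The pack-mode coloring rules described above amount to a small decision procedure. The following sketch restates them in Python. It is an illustration only, not the browser's code: the function name, its arguments, and the use of the standard genetic code are assumptions made for this example, and the coding-region case assumes the two sequences passed in are a single in-frame codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption of
# this sketch; the browser's own translation code is not shown in this page).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def block_color(ref_bases, query_bases, in_coding, aligned=True):
    """Return the pack-mode block color, or None if nothing is drawn.

    ref_bases / query_bases: reference and aligned sequence for the block;
    inside a coding region they are assumed to be one in-frame codon.
    """
    if not aligned:
        return "pale yellow"  # no bases align to this reference region
    if ref_bases == query_bases:
        return None           # 100% identity: not shown
    if not in_coding:
        return "blue"         # polymorphism outside a coding region
    if CODON_TABLE[ref_bases] == CODON_TABLE[query_bases]:
        return "green"        # synonymous: amino acid unchanged
    return "red"              # amino acid changed


print(block_color("GAA", "GAG", in_coding=True))   # Glu -> Glu
print(block_color("GAA", "GAT", in_coding=True))   # Glu -> Asp
print(block_color("GAA", "GAT", in_coding=False))  # non-coding difference
```

The order of the checks matters: alignment coverage is tested first, then identity, and only then the coding status, so a block is translated only when it actually differs from the reference inside a coding region.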
Base Level When zoomed in to the base-level display, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the Ebola virus sequence at those alignment positions relative to the longest non-Ebola virus sequence. If there is sufficient space in the display, the size of the gap is shown. If the space is insufficient and the gap size is a multiple of 3, a "*" is displayed; other gap sizes are indicated by "+". Codon translation is available in base-level display mode if the displayed region is identified as a coding segment. To display this annotation, select the species for translation from the pull-down menu in the Codon Translation configuration section at the top of the page. Then, select one of the following modes: No codon translation: The gene annotation is not used; the bases are displayed without translation. Use default species reading frames for translation: The annotations from the genome displayed in the Default species to establish reading frame pull-down menu are used to translate all the aligned species present in the alignment. Use reading frames for species if available, otherwise no translation: Codon translation is performed only for those species where the region is annotated as protein coding. Use reading frames for species if available, otherwise use default species: Codon translation is done on those species that are annotated as being protein coding over the aligned region using species-specific annotation; the remaining species are translated using the default species annotation. Methods Ebola sequences were found in NCBI Nucleotide using the search term: (ebola[title] or ebolavirus[title]) and genome The sequences were aligned to the reference sequence using faAlign, an ordinary Smith-Waterman alignment command from the 'kent' source utilities.
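As an aside on the Gaps line convention described under Base Level above, the labeling rule can be written as a short function. This is a sketch under stated assumptions: the function name and the `room` parameter (the number of character cells available at that position) are hypothetical stand-ins for whatever the browser uses to decide that space is sufficient.

```python
def gap_label(gap_size, room):
    """Label for a gap on the Gaps line.

    gap_size: length of the gap in bases (positive integer).
    room: character cells available for the label (hypothetical parameter).
    """
    text = str(gap_size)
    if len(text) <= room:
        return text                          # enough space: show the size
    return "*" if gap_size % 3 == 0 else "+"  # "*" marks a multiple of 3


for size, room in [(3, 1), (12, 2), (12, 1), (10, 1)]:
    print(size, room, gap_label(size, room))
```

Marking multiples of 3 separately is useful in coding regions, since such gaps leave the downstream reading frame intact.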
Source information
psl score of alignments

reference   chrStart  chrEnd  query        querysize  score  identity  collectiondate  country          isolate
KM034562v1  0         18957   KP096422v1   18958      18941  100.00    Mar-2014        Guinea           H.sapiens-tc/GIN/14/WPG-C15
KM034562v1  0         18957   KP178538v1   18958      18941  100.00    03-Aug-2014     Liberia          Ebola virus/H.sapiens-wt/LBR/2014/Makona-201403007
KM034562v1  0         18957   KP342330v1   18958      18941  100.00    Oct-2014        Guinea: Conacry  H.sapiens-wt/GIN/2014/Conacry-192
KM034562v1  0         18957   KP096421v1   18958      18937  100.00    Mar-2014        Guinea           H.sapiens-tc/GIN/14/WPG-C07
KM034562v1  0         18957   KP096420v1   18958      18935  100.00    Mar-2014        Guinea           H.sapiens-tc/GIN/14/WPG-C05
KM034562v1  0         18957   KP260799v1   18958      18933  100.00    2014            Mali             Ebola virus H.sapiens/MLI/14/Manoka-Mali-DPR1
KM034562v1  1         18957   KP184503v1   18957      18932  100.00    25-Aug-2014     UK: GB           Ebola virus /H.sapiens-tc/GBR/2014/Makona-UK1.1
KM034562v1  0         18957   KP260800v1   18958      18927  100.00    2014            Mali             Ebola virus H.sapiens/MLI/14/Manoka-Mali-DPR2
KM034562v1  0         18957   KP260801v1   18958      18925  100.00    2014            Mali             Ebola virus H.sapiens/MLI/14/Manoka-Mali-DPR3
KM034562v1  0         18957   KP260802v1   18958      18923  100.00    2014            Mali             Ebola virus H.sapiens/MLI/14/Manoka-Mali-DPR4
KM034562v1  36        18956   KP120616v1   18920      18898  100.00    25-Aug-2014     UK: GB           H.sapiens-wt/GBR/2014/ManoRiver-UK1
KM034562v1  29        18957   KP658432v1   18929      18894  100.00    29-Dec-2014     UK: GB           Ebola virus/H.sapiens-wt/GBR/2014/Makona-UK2
KM034562v1  3         18956   KM519951v1   18953      17741  96.90     2014            DRC              Ebola virus/H.sap-wt/COD/2014/Boende-Lokolia
KM034562v1  16        18955   KP271018v1   18941      17713  96.80     20-Aug-2014     DRC              Ebola virus/H.sapiens/COD/2014/Lomela-Lokolia16
KM034562v1  45        18841   KM655246v1   18797      17682  97.10     1976            Zaire            H.sapiens-tc/COD/1976/Yambuku-Ecran
KM034562v1  53        18914   KP271020v1   18861      17635  96.80     20-Aug-2014     DRC              Ebola virus/H.sapiens/COD/2014/Lomela-Lokolia19
KM034562v1  176       18936   KP271019v1   18760      15340  96.70     20-Aug-2014     DRC              Ebola virus/H.sapiens/COD/2014/Lomela-Lokolia17
KM034562v1  4         18336   NC_016144v1  18927      3194   58.90     2003            Spain            Lloviu virus

newSeqsViewalign New sequences Recent related sequences from GenBank Comparative Genomics newSequences New sequences Recent related sequences from GenBank Comparative
Genomics gire2014SpecificSnps 2014 Specific Sites that Carry a Unique Base in 81 Sequences from 2014 Outbreak Variation and Repeats Description This track displays variants identified by Gire et al. in 81 isolates from the 2014 strain of Ebola virus (78 from Sierra Leone and 3 from Guinea) that are specific to the 2014 strain. At least one 2014 isolate carries a base not found elsewhere in the EBOV alignment. Display Conventions Items are labeled by ancestral nucleotide and 2014-derived nucleotide. Non-coding variants are black, synonymous variants are green, and missense variants are red and also labeled by gene and amino acid change. Click on an item to view more details such as the derived allele frequency in 2014 isolates and, for missense changes, the BLOSUM62 substitution score. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. 
PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods iedbBcell IEDB B Cell Immune Epitope Database and Analysis Resource (IEDB) B-Cell Epitopes Immunology Description This track shows protein sequences recognized by antibodies as annotated by the National Institute for Allergy and Infectious Diseases (NIAID) Immune Epitope Database (IEDB). Only sequences with a positive assay outcome are shown on the track. All fields annotated by IEDB and exported in their "compact" file are shown on the details page, which also provides links back to the IEDB. See also the detailed explanation of the curated Ebola data in the IEDB Knowledgebase. Display Conventions Peptide matches are labeled with the name of the antibody and the host organism, separated by a slash. When the antibody does not have a name in IEDB, the first six letters of the peptide are shown instead. Mouse over the features to see the authors of the study that described the epitope. References Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62. PMID: 19906713; PMC: PMC2808938 gire2014 2014 Variants Variants from 81 Sequences from 2014 Outbreak Variation and Repeats Description This track displays variants identified in 81 samples from the Zaire clade of Ebola viruses found by Gire et al., 2014. Display Conventions In "dense" mode, a vertical line is drawn at the position of each variant. In "full" mode, in addition to the vertical line, a label to the left shows the reference allele first and variant alleles below (A = red, C = blue, G = green, T = magenta, Indels = black). 
Hovering the pointer over any variant will prompt the display of the occurrences numbers for each allele in Gire et al., 2014. Clicking on any variant will result in full details of that variant being displayed. By default, in "pack" mode, the display shows a clustering of haplotypes in the viewed range, sorted by similarity of alleles weighted by proximity to a central variant. The clustering view can highlight local patterns of linkage. Each variant is a vertical bar with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. The vertical bar for the central variant used in clustering is outlined in purple. In order to avoid long compute times, the range of alleles used in clustering may be limited; alleles used in clustering have purple tick marks at the top and bottom. The clustering tree is displayed to the left of the main image. It does not represent relatedness of individuals; it simply shows the arrangement of local haplotypes by similarity. When a rightmost branch is purple, it means that all haplotypes in that branch are identical, at least within the range of variants used in clustering. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. 
Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gire2014Missense 2014 Missense Protein Changing Variants from 81 Sequences from 2014 Outbreak Variation and Repeats Description This track displays protein changing variants identified in 81 samples from the Zaire clade of Ebola viruses found by Gire et al., 2014. Display Conventions In "dense" mode, a vertical line is drawn at the position of each variant. In "full" mode, in addition to the vertical line, a label to the left shows the reference allele first and variant alleles below (A = red, C = blue, G = green, T = magenta, Indels = black). Hovering the pointer over any variant will prompt the display of the occurrences numbers for each allele in Gire et al., 2014. Clicking on any variant will result in full details of that variant being displayed. By default, in "pack" mode, the display shows a clustering of haplotypes in the viewed range, sorted by similarity of alleles weighted by proximity to a central variant. The clustering view can highlight local patterns of linkage. Each variant is a vertical bar with white (invisible) representing the reference allele and black representing the non-reference allele(s). 
Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. The vertical bar for the central variant used in clustering is outlined in purple. In order to avoid long compute times, the range of alleles used in clustering may be limited; alleles used in clustering have purple tick marks at the top and bottom. The clustering tree is displayed to the left of the main image. It does not represent relatedness of individuals; it simply shows the arrangement of local haplotypes by similarity. When a rightmost branch is purple, it means that all haplotypes in that branch are identical, at least within the range of variants used in clustering. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. 
PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gireZebov Zaire Variants Variants from 81 Sequences from 2014 Outbreak and 20 Zaire-clade Sequences 1976-2008 Variation and Repeats Description This track displays variants identified in 101 samples from the Zaire clade of Ebola viruses found by Gire et al. Display Conventions In "dense" mode, a vertical line is drawn at the position of each variant. In "full" mode, in addition to the vertical line, a label to the left shows the reference allele first and variant alleles below (A = red, C = blue, G = green, T = magenta, Indels = black). Hovering the pointer over any variant will prompt the display of the occurrences numbers for each allele in Gire et al., 2014. Clicking on any variant will result in full details of that variant being displayed. By default, in "pack" mode, the display shows a clustering of haplotypes in the viewed range, sorted by similarity of alleles weighted by proximity to a central variant. The clustering view can highlight local patterns of linkage. Each variant is a vertical bar with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. The vertical bar for the central variant used in clustering is outlined in purple. In order to avoid long compute times, the range of alleles used in clustering may be limited; alleles used in clustering have purple tick marks at the top and bottom. The clustering tree is displayed to the left of the main image. 
It does not represent relatedness of individuals; it simply shows the arrangement of local haplotypes by similarity. When a rightmost branch is purple, it means that all haplotypes in that branch are identical, at least within the range of variants used in clustering. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. 
PMID: 25214632 Supplemental Materials and Methods gireZebovMissense Zaire Missense Protein Changing Variants from 81 Sequences from 2014 Outbreak and 20 Zaire-clade Sequences 1976-2008 Variation and Repeats Description This track displays protein changing variants identified in 101 samples from the Zaire clade of Ebola viruses found by Gire et al., 2014. Display Conventions In "dense" mode, a vertical line is drawn at the position of each variant. In "full" mode, in addition to the vertical line, a label to the left shows the reference allele first and variant alleles below (A = red, C = blue, G = green, T = magenta, Indels = black). Hovering the pointer over any variant will prompt the display of the occurrences numbers for each allele in Gire et al., 2014. Clicking on any variant will result in full details of that variant being displayed. By default, in "pack" mode, the display shows a clustering of haplotypes in the viewed range, sorted by similarity of alleles weighted by proximity to a central variant. The clustering view can highlight local patterns of linkage. Each variant is a vertical bar with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. The vertical bar for the central variant used in clustering is outlined in purple. In order to avoid long compute times, the range of alleles used in clustering may be limited; alleles used in clustering have purple tick marks at the top and bottom. The clustering tree is displayed to the left of the main image. It does not represent relatedness of individuals; it simply shows the arrangement of local haplotypes by similarity. When a rightmost branch is purple, it means that all haplotypes in that branch are identical, at least within the range of variants used in clustering. 
Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gireIntraHost 2014 Intra-host Intra-host variants for 78 sequences from Sierra Leone 2014 Outbreak Variation and Repeats Description This track represents intrahost variants for a subset of the 78 Sierra Leone EVD patients described in Gire et al., 2014. For a subset of these 78 patients, researchers took multiple blood samples from each patient at various time points, then isolated and sequenced the Ebola virus from each sample. 
This allowed them to track the mutation and evolution of the virus within a single patient or host. Display Conventions In "dense" mode, a vertical line is drawn at the position of each variant. In "full" mode, in addition to the vertical line, a label to the left shows the reference allele first and variant alleles below (A = red, C = blue, G = green, T = magenta, Indels = black). Hovering the pointer over any variant displays the number of occurrences of each allele in Gire et al., 2014. Clicking on any variant displays full details of that variant. By default, in "pack" mode, the display shows a clustering of haplotypes in the viewed range, sorted by similarity of alleles weighted by proximity to a central variant. The clustering view can highlight local patterns of linkage. Each variant is a vertical bar with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. The vertical bar for the central variant used in clustering is outlined in purple. In order to avoid long compute times, the range of alleles used in clustering may be limited; alleles used in clustering have purple tick marks at the top and bottom. The clustering tree is displayed to the left of the main image. It does not represent relatedness of individuals; it simply shows the arrangement of local haplotypes by similarity. When a rightmost branch is purple, it means that all haplotypes in that branch are identical, at least within the range of variants used in clustering. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al.
analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gire2014FixedMissense 2014 Fixed Aa Chg Protein Changes That Are Fixed in 81 Sequences from 2014 Outbreak Variation and Repeats Description This track displays variants identified by Gire et al. in 81 isolates from the 2014 strain of Ebola virus (78 from Sierra Leone and 3 from Guinea) that are specific to the 2014 strain, fixed (i.e. present in all 81 isolates), and that cause protein-coding changes. Display Conventions Items are labeled by gene and amino acid change. The labels of two items also contain "*CONS" to indicate that the changed position is fully conserved across the Ebola genus. Click on an item to view the BLOSUM62 substitution score. 
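The distinction between "fixed" and "polymorphic" changes used by these tracks comes down to how many of the 81 isolates from 2014 carry the derived allele. A minimal sketch, assuming a per-variant count of isolates with the derived allele is available; the function name and constant are hypothetical, and the handling of masked or missing bases in the original analysis is not modeled here.

```python
N_2014_ISOLATES = 81  # 78 from Sierra Leone + 3 from Guinea


def classify_variant(derived_count, n_isolates=N_2014_ISOLATES):
    """Classify a 2014-specific variant and return (label, derived allele frequency).

    derived_count: number of 2014 isolates carrying the derived allele.
    """
    if not 0 < derived_count <= n_isolates:
        raise ValueError("derived_count must be between 1 and n_isolates")
    frequency = derived_count / n_isolates
    # "Fixed" means the derived allele is present in every 2014 isolate.
    label = "fixed" if derived_count == n_isolates else "polymorphic"
    return label, frequency


print(classify_variant(81))  # present in all 81 isolates
print(classify_variant(40))  # present in only some isolates
```

A count of zero is rejected because every variant in these tracks is, by definition, carried by at least one 2014 isolate.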
Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gire2014PolymorphicMissense 2014 Polym Aa Chg Protein Changes That Are Polymorphic in 81 Sequences from 2014 Outbreak Variation and Repeats Description This track displays variants identified by Gire et al. in 81 isolates from the 2014 strain of Ebola virus (78 from Sierra Leone and 3 from Guinea) that are specific to the 2014 strain, polymorphic in 2014 isolates, and that cause protein-coding changes. Display Conventions Items are labeled by gene and amino acid change. 
The labels of two items also contain "*CONS" to indicate that the changed position is fully conserved across the Ebola genus. Click on an item to view the BLOSUM62 substitution score. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods gire2014KikwitMissense 2014 vs Kikwit GP Protein Changes in GP Between 81 Sequences from 2014 Outbreak and Kikwit 1995 (JQ352763) Variation and Repeats Description This track displays variants identified by Gire et al. 
in 81 isolates from the 2014 strain of Ebola virus (78 from Sierra Leone and 3 from Guinea) relative to a Kikwit isolate (JQ352763) that cause protein-coding changes in the GP gene. Display Conventions Items are labeled by amino acid change and the frequency of the change in 2014 isolates. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 2014 Sep 12;345(6202):1369-72. 
PMID: 25214632 Supplemental Materials and Methods gire2014MayingaMissense 2014 vs Maynga GP Protein Changes in GP Between 81 Sequences from 2014 Outbreak and Mayinga 1976 (NC002549) Variation and Repeats Description This track displays variants identified by Gire et al. in 81 isolates from the 2014 strain of Ebola virus (78 from Sierra Leone and 3 from Guinea) relative to a Mayinga isolate (NC002549) that cause protein-coding changes in the GP gene. Display Conventions Items are labeled by amino acid change and the frequency of the change in 2014 isolates. Methods Blood samples were collected from 78 patients at Kenema Government Hospital in Sierra Leone. For details of RNA preservation, PCR, human RNA depletion, library construction and sequencing, see Supplemental Materials and Methods of Gire et al. Gire et al. analyzed the 78 Sierra Leone patient sequences together with 3 sequences from the 2014 outbreak in Guinea (Baize et al.; suspected sequencing errors were masked, see Supplemental Materials and Methods of Gire et al.), for a total of 81 sequences from 2014. In addition, some analyses included 20 sequences from past outbreaks of Zaire Ebola virus, 1976-2008, for a total of 101 sequences. Sequence variants were extracted directly from multiple sequence alignments of the group of 101 sequences (1976-2014). A custom release of SnpEff (v4.0, build 2014-07-01, to support ribosomal slippage in transcription of GP gene) was used to predict functional effect of variants on genes (noncoding, synonymous or missense). References Baize S, Pannetier D, Oestereich L, Rieger T, Koivogui L, Magassouba N, Soropogui B, Sow MS, Keïta S, De Clerck H et al. Emergence of Zaire Ebola virus disease in Guinea. N Engl J Med. 2014 Oct 9;371(15):1418-25. PMID: 24738640 Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, Jalloh S, Momoh M, Fullah M, Dudas G et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. 
Science 2014 Sep 12;345(6202):1369-72. PMID: 25214632 Supplemental Materials and Methods iedbBcellNeg IEDB B Cell Neg Immune Epitope Database and Analysis Resource (IEDB) B-Cell Epitopes with Negative Assay Result Immunology Description This track shows protein sequences assayed but not recognized by antibodies. They were annotated by the National Institute for Allergy and Infectious Diseases (NIAID) Immune Epitope Database (IEDB). Only sequences with at least one negative assay outcome are shown on the track. All fields annotated by IEDB and exported in their "compact" file are shown on the details page, which also provides links back to the IEDB. See also the detailed explanation of the curated Ebola data in the IEDB Knowledgebase. Display Conventions Peptide matches are labeled with the name of the antibody and the host organism, separated by a slash. When the antibody does not have a name in IEDB, the first six letters of the peptide are shown instead. Mouse over the features to see the authors of the study that described the epitope. References Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62. PMID: 19906713; PMC: PMC2808938 iedbTcellI IEDB T Cell I Immune Epitope Database and Analysis Resource (IEDB) Curated T-Cell Epitopes, MHC Class I Immunology Description This track shows protein sequences displayed by virus-infected cells to T-cells as annotated by the National Institute for Allergy and Infectious Diseases (NIAID) Immune Epitope Database (IEDB). Only sequences with a positive assay outcome are shown on the track. All fields annotated by IEDB and exported in their "compact" file are shown on the details page, which also provides links back to the IEDB. See also the detailed explanation of the curated Ebola data in the IEDB Knowledgebase. 
To enrich for MHC Class I epitopes, this track shows epitopes with annotated Class I alleles or, if the allele was not annotated, peptides of 12 or fewer amino acids. Display Conventions Matching peptide sequences are shown. Mouse over the features to see the exact MHC allele, if annotated. References Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62. PMID: 19906713; PMC: PMC2808938 pdb PDB Protein Data Bank (PDB) Sequence Matches Genes and Gene Prediction Tracks Description This track shows alignments of sequences with known protein structures in the Protein Data Bank (PDB). Display Conventions and Configuration Genomic locations of PDB matches are labeled with the accession number. Clicking on a match opens a standard feature detail page with the PDB page integrated into it. The protein structure is shown on the PDB page. Methods PDB sequences were downloaded from the PDB website and aligned with BLAST (tblastn). Only alignments with a minimum identity of 80% that span at least 80% of the query sequence were kept. References Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000 Jan 1;28(1):235-42. PMID: 10592235; PMC: PMC102472 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 iedbTcellII IEDB T Cell II Immune Epitope Database and Analysis Resource (IEDB) Curated T-Cell Epitopes, MHC Class II Immunology Description This track shows protein sequences displayed by virus-infected cells to T-cells as annotated by the National Institute for Allergy and Infectious Diseases (NIAID) Immune Epitope Database (IEDB).
Only sequences with a positive assay outcome are shown on the track. All fields annotated by IEDB and exported in their "compact" file are shown on the details page, which also provides links back to the IEDB. See also the detailed explanation of the curated Ebola data in the IEDB Knowledgebase. To enrich for MHC Class II epitopes, this track shows epitopes with annotated Class II alleles or, if the allele was not annotated, peptides longer than 12 amino acids. Display Conventions Matching peptide sequences are shown. Mouse over the features to see the exact MHC allele, if annotated. References Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62. PMID: 19906713; PMC: PMC2808938 iedbPred1 IEDB Pred Human I IEDB Epitope Predictions, Human T-Cell Class I Immunology Description This track shows peptides that are predicted to be displayed by human class I MHCs. Display Conventions and Configuration Peptides are shaded by IC50 values and split into subtracks by HLA-alleles. Methods For human epitope prediction, binding affinity was predicted for all possible 9-mer peptides from each protein sequence against a set of 27 HLA alleles that are most frequent in the global population, using the SMM method available in the IEDB MHC class I binding prediction tool. All peptides with predicted binding affinity (expressed as IC50 nM) below previously optimized allele-specific thresholds were selected as predicted binders. In the case of rhesus macaque class I T cell epitope prediction, the binding affinity of all 9-mer peptides was predicted for a total of 19 alleles (7 Indian and 12 Chinese rhesus macaques) that are most frequent in these species. The binding affinity predictions utilized the "IEDB recommended" method from the IEDB MHC class I binding prediction tool. Peptides with the IEDB consensus percentile rank <= 1.0 were selected.
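The human class I selection step described above (enumerate every 9-mer, then keep peptides whose predicted IC50 falls below a threshold) can be sketched roughly as follows. This is an illustration only: the function names, the toy sequence, the IC50 values and the 500 nM cutoff are all assumptions made for the example, and the actual affinities come from the IEDB prediction tool with allele-specific thresholds that are not given here.

```python
def all_kmers(protein, k=9):
    """Return (start, peptide) for every k-residue window of a protein sequence."""
    return [(i, protein[i:i + k]) for i in range(len(protein) - k + 1)]


def select_binders(predicted_ic50, threshold_nm):
    """Keep peptides whose predicted IC50 (nM) is below the given threshold.

    predicted_ic50 maps peptide -> IC50 in nM; lower IC50 means stronger binding.
    """
    return [pep for pep, ic50 in predicted_ic50.items() if ic50 < threshold_nm]


# Toy input: the 20 standard amino acids in order, not a real viral protein.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
windows = all_kmers(toy_protein)

# Hypothetical predictions for two of the windows (values invented for illustration).
toy_predictions = {"ACDEFGHIK": 50.0, "CDEFGHIKL": 5000.0}

# 500 nM is a placeholder cutoff, not one of the allele-specific thresholds.
binders = select_binders(toy_predictions, threshold_nm=500.0)
```

A real pipeline would call `select_binders` once per HLA allele, each with its own threshold.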
Lower values for either IC50 or percentile rank indicate stronger binding. References Ponomarenko J, Vaughan K, Paul S, Haeussler M, Maurer-Stroh S, Peters B, Sette A. Ebola: an analysis of immunity at the molecular level. Submitted. iedbPred IEDB Prediction Immune Epitope Database and Analysis Resource (IEDB) HLA binding predictions Immunology Description The subtracks of this track show peptides that are predicted to be displayed by human and macaque class I MHCs and human class II MHCs. Class I predictions are split by HLA allele; class II predictions are summarized into a single track. Display Conventions and Configuration Class I peptides are shaded by IC50 values and grouped by HLA-alleles. Class II peptides are shaded by percentile. Methods Class I T cell epitope prediction: For human epitope prediction, binding affinity was predicted for all possible 9-mer peptides from each protein sequence against a set of 27 HLA alleles that are most frequent in the global population, using the SMM method available in the IEDB MHC class I binding prediction tool. All peptides with predicted binding affinity (expressed as IC50 nM) below previously optimized allele-specific thresholds were selected as predicted binders. In the case of rhesus macaque class I T cell epitope prediction, the binding affinity of all 9-mer peptides was predicted for a total of 19 alleles (7 Indian and 12 Chinese rhesus macaques) that are most frequent in these species. The binding affinity predictions utilized the "IEDB recommended" method from the IEDB MHC class I binding prediction tool. Peptides with the IEDB consensus percentile rank <= 1.0 were selected. Lower values for either IC50 or percentile rank indicate stronger binding. Class II T cell epitope prediction: For human epitope predictions, 15-mer peptides overlapping by 10 aa residues were generated from aligned sequences, to avoid redundant peptides that share the same 9-mer binding core.
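As a rough illustration of the peptide generation just described, 15-mers that overlap by 10 residues correspond to a sliding window advanced 5 residues at a time. The sketch below assumes that any trailing residues too short to fill a final window are simply dropped; the source does not say how the sequence ends were handled, and the function name and toy sequence are hypothetical.

```python
def overlapping_peptides(protein, length=15, overlap=10):
    """Cut a protein into fixed-length peptides whose neighbours share `overlap` residues."""
    step = length - overlap  # 15 - 10 = 5 residues between window starts
    last_start = len(protein) - length
    return [protein[i:i + length] for i in range(0, last_start + 1, step)]


# Toy 25-residue sequence (not a real protein); windows start at 0, 5 and 10.
toy_protein = "ACDEFGHIKLMNPQRSTVWYACDEF"
peptides = overlapping_peptides(toy_protein)
```

Each consecutive pair of peptides shares exactly 10 residues, so a given 9-mer core appears in at most two neighbouring peptides.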
For each 15-mer peptide, the binding affinity was predicted (expressed as the IEDB consensus percentile rank) for seven class II human HLA alleles (HLA-DRB1*03:01, HLA-DRB1*07:01, HLA-DRB1*15:01, HLA-DRB3*01:01, HLA-DRB3*02:02, HLA-DRB4*01:01 and HLA-DRB5*01:01), using the IEDB MHC class I binding prediction tool ("IEDB recommended" method). Predicted binders were selected based on the median consensus percentile rank estimated from the consensus percentile ranks for the seven alleles. All peptides with median consensus percentile rank <= 20.0 were selected as predicted binders, the threshold being previously optimized for capturing 50% of class II human T cell responses. References Ponomarenko J, Vaughan K, Paul S, Haeussler M, Maurer-Stroh S, Peters B, Sette A. Ebola: an analysis of immunity at the molecular level. Submitted. iedbsupp1B5801 B*58:01 IEDB predicted binding Human HLA T-Cell Class I B*58:01 Immunology iedbsupp1B5701 B*57:01 IEDB predicted binding Human HLA T-Cell Class I B*57:01 Immunology iedbsupp1B5301 B*53:01 IEDB predicted binding Human HLA T-Cell Class I B*53:01 Immunology iedbsupp1B5101 B*51:01 IEDB predicted binding Human HLA T-Cell Class I B*51:01 Immunology iedbsupp1B4403 B*44:03 IEDB predicted binding Human HLA T-Cell Class I B*44:03 Immunology iedbsupp1B4402 B*44:02 IEDB predicted binding Human HLA T-Cell Class I B*44:02 Immunology iedbsupp1B4001 B*40:01 IEDB predicted binding Human HLA T-Cell Class I B*40:01 Immunology iedbsupp1B3501 B*35:01 IEDB predicted binding Human HLA T-Cell Class I B*35:01 Immunology iedbsupp1B1501 B*15:01 IEDB predicted binding Human HLA T-Cell Class I B*15:01 Immunology iedbsupp1B0801 B*08:01 IEDB predicted binding Human HLA T-Cell Class I B*08:01 Immunology iedbsupp1B0702 B*07:02 IEDB predicted binding Human HLA T-Cell Class I B*07:02 Immunology iedbsupp1A6802 A*68:02 IEDB predicted binding Human HLA T-Cell Class I A*68:02 Immunology iedbsupp1A6801 A*68:01 IEDB predicted binding Human HLA T-Cell Class I
A*68:01 Immunology iedbsupp1A3301 A*33:01 IEDB predicted binding Human HLA T-Cell Class I A*33:01 Immunology iedbsupp1A3201 A*32:01 IEDB predicted binding Human HLA T-Cell Class I A*32:01 Immunology iedbsupp1A3101 A*31:01 IEDB predicted binding Human HLA T-Cell Class I A*31:01 Immunology iedbsupp1A3002 A*30:02 IEDB predicted binding Human HLA T-Cell Class I A*30:02 Immunology iedbsupp1A3001 A*30:01 IEDB predicted binding Human HLA T-Cell Class I A*30:01 Immunology iedbsupp1A2601 A*26:01 IEDB predicted binding Human HLA T-Cell Class I A*26:01 Immunology iedbsupp1A2402 A*24:02 IEDB predicted binding Human HLA T-Cell Class I A*24:02 Immunology iedbsupp1A2301 A*23:01 IEDB predicted binding Human HLA T-Cell Class I A*23:01 Immunology iedbsupp1A1101 A*11:01 IEDB predicted binding Human HLA T-Cell Class I A*11:01 Immunology iedbsupp1A0301 A*03:01 IEDB predicted binding Human HLA T-Cell Class I A*03:01 Immunology iedbsupp1A0206 A*02:06 IEDB predicted binding Human HLA T-Cell Class I A*02:06 Immunology iedbsupp1A0203 A*02:03 IEDB predicted binding Human HLA T-Cell Class I A*02:03 Immunology iedbsupp1A0201 A*02:01 IEDB predicted binding Human HLA T-Cell Class I A*02:01 Immunology iedbsupp1A0101 A*01:01 IEDB predicted binding Human HLA T-Cell Class I A*01:01 Immunology iedbPredClassIMac IEDB Pred Macaque I IEDB Epitope Predictions, Macaque T-Cell Class I Immunology Description This track shows peptides that are predicted to be displayed by macaque class I MHCs. Display Conventions and Configuration Peptides are shaded by IC50 values and split into subtracks by HLA-alleles. Methods For human epitope prediction, binding affinity was predicted for all possible 9-mer peptides from each protein sequence against a set of 27 HLA alleles that are most frequent in the global population, using the SMM method available in the IEDB MHC class I binding prediction tool.
All peptides with predicted binding affinity (expressed as IC50 nM) below previously optimized allele-specific thresholds were selected as predicted binders. In the case of rhesus macaque class I T cell epitope prediction, the binding affinity of all 9-mer peptides was predicted for a total of 19 alleles (7 Indian and 12 Chinese rhesus macaques) that are most frequent in these species. The binding affinity predictions utilized the "IEDB recommended" method from the IEDB MHC class I binding prediction tool. Peptides with the IEDB consensus percentile rank <= 1.0 were selected. Lower values for either IC50 or percentile rank indicate stronger binding. References Ponomarenko J, Vaughan K, Paul S, Haeussler M, Maurer-Stroh S, Peters B, Sette A. Ebola: an analysis of immunity at the molecular level. Submitted. iedbsupp2B17 B*17 IEDB Predicted binding Macaque HLA T-Cell Class I B*17 Immunology iedbsupp2B09001 B*09001 IEDB Predicted binding Macaque HLA T-Cell Class I B*09001 Immunology iedbsupp2B08701 B*08701 IEDB Predicted binding Macaque HLA T-Cell Class I B*08701 Immunology iedbsupp2B08301 B*08301 IEDB Predicted binding Macaque HLA T-Cell Class I B*08301 Immunology iedbsupp2B08 B*08 IEDB Predicted binding Macaque HLA T-Cell Class I B*08 Immunology iedbsupp2B06601 B*06601 IEDB Predicted binding Macaque HLA T-Cell Class I B*06601 Immunology iedbsupp2B03901 B*03901 IEDB Predicted binding Macaque HLA T-Cell Class I B*03901 Immunology iedbsupp2B01001 B*01001 IEDB Predicted binding Macaque HLA T-Cell Class I B*01001 Immunology iedbsupp2B01 B*01 IEDB Predicted binding Macaque HLA T-Cell Class I B*01 Immunology iedbsupp2B00301 B*00301 IEDB Predicted binding Macaque HLA T-Cell Class I B*00301 Immunology iedbsupp2B00101 B*00101 IEDB Predicted binding Macaque HLA T-Cell Class I B*00101 Immunology iedbsupp2A70103 A7*0103 IEDB Predicted binding Macaque HLA T-Cell Class I A7*0103 Immunology iedbsupp2A20102 A2*0102 IEDB Predicted binding Macaque HLA T-Cell Class I A2*0102
Immunology iedbsupp2A102601 A1*02601 IEDB Predicted binding Macaque HLA T-Cell Class I A1*02601 Immunology iedbsupp2A102201 A1*02201 IEDB Predicted binding Macaque HLA T-Cell Class I A1*02201 Immunology iedbsupp2A11 A*11 IEDB Predicted binding Macaque HLA T-Cell Class I A*11 Immunology iedbsupp2A07 A*07 IEDB Predicted binding Macaque HLA T-Cell Class I A*07 Immunology iedbsupp2A02 A*02 IEDB Predicted binding Macaque HLA T-Cell Class I A*02 Immunology iedbsupp2A01 A*01 IEDB Predicted binding Macaque HLA T-Cell Class I A*01 Immunology iedbSupp3 IEDB Pred II IEDB Epitope Predictions, Human T-Cell Class II Immunology Description This track shows peptides that are predicted to be displayed by human class II MHCs. Display Conventions and Configuration Peptides are shaded by percentile. Methods For human epitope predictions, 15-mer peptides overlapping by 10 aa residues were generated from aligned sequences, to avoid redundant peptides that share the same 9-mer binding core. For each 15-mer peptide, the binding affinity was predicted (expressed as the IEDB consensus percentile rank) for seven class II human HLA alleles (HLA-DRB1*03:01, HLA-DRB1*07:01, HLA-DRB1*15:01, HLA-DRB3*01:01, HLA-DRB3*02:02, HLA-DRB4*01:01 and HLA-DRB5*01:01), using the MHC class I binding prediction tool ("IEDB recommended" method). Predicted binders were selected based on the median consensus percentile rank estimated from the consensus percentile ranks for the seven alleles. All peptides with median consensus percentile rank <= 20.0 were selected as predicted binders, the threshold being previously optimized for capturing 50% of class II human T cell responses. References Ponomarenko J, Vaughan K, Paul S, Haeussler M, Maurer-Stroh S, Peters B, Sette A. Ebola: an analysis of immunity at the molecular level. Submitted. gold Assembly Assembly from Fragments Mapping and Sequencing Tracks Description This track shows the sequences used in the Jun. 2014 Ebola virus genome assembly.
Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. In dense mode, this track depicts the contigs that make up the currently viewed scaffold. Contig boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist between contigs, spaces are shown between the gold and brown blocks. The relative order and orientation of the contigs within a scaffold is always known; therefore, a line is drawn in the graphical display to bridge the blocks. Component types found in this track (with counts of that type in parentheses): F - one finished sequence (KM034562v1/KM034562.1) cytoBandIdeo Chromosome Band (Ideogram) Ideogram for Orientation Mapping and Sequencing Tracks gap Gap Gap Locations Mapping and Sequencing Tracks Description This track shows the gaps in the Jun. 2014 Ebola virus genome assembly. Genome assembly procedures are covered in the NCBI assembly documentation. NCBI also provides specific information about this assembly. The definition of the gaps in this assembly is from the AGP file delivered with the sequence. The NCBI document AGP Specification describes the format of the AGP file. Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap is supported by read pair data, it is a bridged gap and a white line is drawn through the black box representing the gap. gc5BaseBw GC Percent GC Percent in 5-Base Windows Mapping and Sequencing Tracks Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas.
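The 5-base window calculation described above can be illustrated with a minimal sketch. It assumes non-overlapping windows and drops any trailing bases that do not fill a whole window; the track text does not state how the browser handles either detail, and the function name and toy sequence are made up for the example.

```python
def gc_percent_windows(seq, window=5):
    """Return (window_start, gc_percent) for consecutive non-overlapping windows."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start, 100.0 * gc_count / window))
    return results


# Toy sequence, not taken from the Ebola reference genome.
toy_windows = gc_percent_windows("GGCATATATA")
```

With a window this small, each value can only be a multiple of 20%, which is why the graph looks stepped when zoomed in.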
This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. patSeq Lens Patents Lens PatSeq Patent Document Sequences Literature Description This track shows genome matches to biomedical sequences submitted with patent application documents to patent offices around the world. The sequences, their mappings, and selected patent information were graciously provided by PatSeq, a search tool that is part of The Lens, Cambia. This track contains more data than the NCBI Genbank Division "Patents", as the sequences were extracted from patents directly. Display Convention and Configuration The data is split into two subtracks: one for sequences that are only part of patents that have submitted more than 100 sequences ("bulk patents") and a second track for all other sequences ("non-bulk patents"). A sequence can be part of many patent documents, with some being found in several thousand patents. This track shows only a single alignment for every sequence, colored based on its occurrence in the different patent documents and using a color schema similar to The Lens. Based on the first sequence match, the four different item colors follow this priority ranking in descending order: (1) the sequence is referenced in the claims of a granted patent; (2) the sequence is disclosed in a granted patent; (3) the sequence is referenced in the claims of a patent application; (4) the sequence is disclosed in a patent application. Sequences referenced in the claims section of a patent document define the scope of the invention and are important during litigation. Therefore, they are given priority in the color scheme. Patent grant documents form the basis of patent protection and are prioritized over applications.
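One way to read the priority ranking above is as a "best category wins" rule over all documents that mention a sequence. The sketch below is an interpretation under that assumption, not the code used to build the track; the category names and the (granted, in_claims) tuple layout are hypothetical.

```python
# Categories in descending priority, mirroring the ranking in the text.
PRIORITY = [
    "granted_in_claims",      # referenced in the claims of a granted patent
    "granted",                # disclosed in a granted patent
    "application_in_claims",  # referenced in the claims of a patent application
    "application",            # disclosed in a patent application
]


def category(granted, in_claims):
    """Map one document occurrence to its category name."""
    if granted:
        return "granted_in_claims" if in_claims else "granted"
    return "application_in_claims" if in_claims else "application"


def best_category(occurrences):
    """Pick the highest-priority category among (granted, in_claims) occurrences."""
    return min((category(g, c) for g, c in occurrences), key=PRIORITY.index)


# Hypothetical sequence seen in three documents.
example = best_category([(False, False), (True, False), (False, True)])
```

Here a granted patent that merely discloses the sequence outranks an application that claims it, matching the statement that grants are prioritized over applications.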
Hover over a feature with the mouse to see the total number of documents where the sequence has been referenced, how many of these documents are granted patents and how often the sequence has been referenced in the claims. A randomly selected document title is also shown in the mouseover. Clicking on a feature will bring up the details page, which contains information about the sequence and alignment of that feature. The link at the top of the page opens the PatSeq Analyzer with the chromosomal region covered by the feature that was clicked. The PatSeq Analyzer is a specialized genome browser that allows for the viewing and filtering of patent sequence matches in detail. The next section of the details page is a list of up to ten patent documents that include this sequence, with the number of occurrences within each document in parentheses. This is followed by up to thirty links to patent documents. The patent documents listed in these sections are displayed in order of the number of sequence occurrences in the document. Shown below these are the links to the sequence in The Lens, in the format "patentDocumentIdentifier-SEQIDNO (docSequenceCount)". The "SEQ ID NO" is an integer number, the unique identifier of a patent sequence in a patent document. When a protein sequence has been annotated on a nucleotide sequence, the "SEQ ID NO" contains the reading frame separated by a ".", e.g. "1.1" would indicate the first frame of SEQIDNO 1. The total number of sequences submitted with the patent document ("docSequenceCount") is shown in parentheses after the SEQIDNO. The links to the sequence are separated into the categories "granted and in claims", "granted", "in claims" and "applications" (=all others). Sequence identifiers link to the respective pages on PatSeq. A maximum of thirty documents are linked from this page per category listed in order of the number of sequence occurrences; please use PatSeq Analyzer to view all matching documents. 
The score of the features in this track is the number of documents where the sequence appears in the claims. For example, by setting the score filter to 1, only sequences are shown that have been referenced at least once in the claims. Methods More than 96 million patent document files were collected by The Lens. The ST.25-formatted sequences were extracted and mapped to genomes with the aligners BLAT and BWA. The minimal identity of the query over the alignment is 95%. Note that for hg19, no patents are shown on chrM, as the mitochondrial chromosome used for the mapping was the one from the Ensembl genome FASTA files. Credits Thanks to the team behind The Lens, in particular, Osmat Jefferson and Deniz Koellhofer, for making these data available. Feedback Send suggestions on the way data in this track is visualized to our support address genome@soe.ucsc.edu. Questions on the data itself are best directed to support@cambia.org. Data access The raw data can be explored interactively with the Table Browser. For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The files for this track are called patNonBulk.bb and patBulk.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The command to obtain the data as a tab-separated table looks like this: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/eboVir3/bbi/patNonBulk.bb -chrom=chr5 -start=1000000 -end=2000000 output.tsv A full log of the commands that were used to build this annotation is available from our database build description. In this text file, search for "patNonBulk" to find the right section. References Editorial: The patent bargain Nature. 2013 Dec 12;504(7479):187-188. Patently transparent. Nat Biotechnol. 
2006 May;24(5):474. PMID: 16680110 Jefferson OA, Köllhofer D, Ehrich TH, Jefferson RA. Transparency tools in gene patenting for informing policy and practice. Nat Biotechnol. 2013 Dec;31(12):1086-93. PMID: 24316644 patBulk Bulk patents Patent Lens Bulk patents Literature patNonBulk Non-bulk patents Patent Lens Non-bulk patents Literature muPIT muPIT protein map muPIT - Mapping Genomic Positions on Protein Structures Genes and Gene Prediction Tracks Description This track indicates the mapped locations viewable in the MuPIT interactive system. Use the URL link MuPIT protein structures to enter the viewing system. MuPIT interactive is a tool that allows you to map a sequence variant from its position in the human genome onto a protein structure. Viewing a variant on a protein structure can be useful in interpreting its potential biological consequences. After you have done the mapping, you can play with the protein structure by turning it around, zooming in and out, and turning color-coded annotations about the protein on and off. Credits A data mapping pipeline was developed in the Karchin lab to map from any genomic location to position in PDB structure(s). Noushin Niknafs Rachel Karchin Dewey Kim In collaboration with: Mark Diekhans (UCSC) Jing Zhu (UCSC) David Haussler (UCSC) The web server and visualization scripting was developed by In Silico Solutions: Rick Kim Michael Ryan Sean Ryan Gabriel Ritter Funding: NIH 5U01CA180956-02 NIH 3U24CA143858-2S1 NIH 5R21CA152432-02 References Niknafs N, Kim D, Kim R, Diekhans M, Ryan M, Stenson PD, Cooper DN, Karchin R. MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum Genet. 2013 Nov;132(11):1235-43. PMID: 23793516; PMC: PMC3797853 ncbiGenePfam Pfam in NCBI Gene Pfam Domains in NCBI Genes Genes and Gene Prediction Tracks Description Most proteins are composed of one or more conserved functional regions called domains. 
This track shows the high-quality, manually curated Pfam-A domains found in transcripts from the NCBI Genes track. Display Conventions and Configuration This track follows the display conventions for gene tracks. Methods The amino acid sequences from the NCBI Genes track are submitted to the set of Pfam-A HMMs, which annotate regions within the predicted peptide that are recognizable as Pfam protein domains. These regions are then mapped to the transcripts themselves using the pslMap utility. Credits pslMap was written by Mark Diekhans at UCSC. References Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al. The Pfam protein families database. Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22. PMID: 19920124; PMC: PMC2808889 uniprot UniProt UniProt SwissProt/TrEMBL Protein Annotations Genes and Gene Prediction Tracks Description This track shows protein sequences and annotations on them from the UniProt/SwissProt database, mapped to genomic coordinates. UniProt/SwissProt data has been curated from scientific publications by the UniProt staff; UniProt/TrEMBL data has been predicted by various computational algorithms. The annotations are divided into multiple subtracks, based on their "feature type" in UniProt. The first two subtracks below - one for SwissProt, one for TrEMBL - show the alignments of protein sequences to the genome; all other tracks below are the protein annotations mapped through these alignments to the genome. Track Name Description UCSC Alignment, SwissProt = curated protein sequences Protein sequences from SwissProt mapped to the genome. All other tracks are (start,end) SwissProt annotations on these sequences mapped through this alignment. Even protein sequences without a single curated annotation (splice isoforms) are visible in this track. Each UniProt protein has one main isoform, which is colored in dark blue.
Alternative isoforms are sequences that do not have annotations on them and are colored in light-blue. They can be hidden with the TrEMBL/Isoform filter (see below). UCSC Alignment, TrEMBL = predicted protein sequences Protein sequences from TrEMBL mapped to the genome. All other tracks below are (start,end) TrEMBL annotations mapped to the genome using this track. This track is hidden by default. To show it, click its checkbox on the track configuration page. UniProt Signal Peptides Regions found in proteins destined to be secreted, generally cleaved from mature protein. UniProt Extracellular Domains Protein domains with the comment "Extracellular". UniProt Transmembrane Domains Protein domains of the type "Transmembrane". UniProt Cytoplasmic Domains Protein domains with the comment "Cytoplasmic". UniProt Polypeptide Chains Polypeptide chain in mature protein after post-processing. UniProt Regions of Interest Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process. UniProt Domains Protein domains, zinc finger regions and topological domains. UniProt Disulfide Bonds Disulfide bonds. UniProt Amino Acid Modifications Glycosylation sites, modified residues and lipid moiety-binding regions. UniProt Amino Acid Mutations Mutagenesis sites and sequence variants. UniProt Protein Primary/Secondary Structure Annotations Beta strands, helices, coiled-coil regions and turns. UniProt Sequence Conflicts Differences between Genbank sequences and the UniProt sequence. UniProt Repeats Regions of repeated sequence motifs or repeated domains. UniProt Other Annotations All other annotations, e.g. compositional bias For consistency and convenience for users of mutation-related tracks, the subtrack "UniProt/SwissProt Variants" is a copy of the track "UniProt Variants" in the track group "Phenotype and Literature", or "Variation and Repeats", depending on the assembly. 
Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. TrEMBL annotations are always shown in light blue, except in the Signal Peptides, Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks. Mouse over a feature to see the full UniProt annotation comment. For variants, the mouseover shows the full name of the UniProt disease acronym. The subtracks for domains related to subcellular location are sorted from outside to inside of the cell: Signal peptide, extracellular, transmembrane, and cytoplasmic. In the "UniProt Modifications" track, lipidation sites are highlighted in dark blue, glycosylation sites in dark green, and phosphorylation in light green. Duplicate annotations are removed as far as possible: if a TrEMBL annotation has the same genome position and same feature type, comment, disease and mutated amino acids as a SwissProt annotation, it is not shown again. Two annotations mapped through different protein sequence alignments but with the same genome coordinates are only shown once. On the configuration page of this track, you can choose to hide any TrEMBL annotations. This filter will also hide the UniProt alternative isoform protein sequences because both types of information are less relevant to most users. Please contact us if you want more detailed filtering features. Note that for the human hg38 assembly and SwissProt annotations, there is also a public track hub prepared by UniProt itself, with genome annotations maintained by UniProt using their own mapping method based on those Gencode/Ensembl gene models that are annotated in UniProt for a given protein.
For proteins that differ from the genome, UniProt's mapping method will, in most cases, map a protein and its annotations to an unexpected location (see below for details on UCSC's mapping method). Methods Briefly, UniProt protein sequences were aligned to the transcripts associated with the protein, the top-scoring alignments were retained, and the result was projected to the genome through a transcript-to-genome alignment. Depending on the genome, the transcript-genome alignment was either provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI RefSeq for hg38, UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus are tried, in this order. The resulting protein-genome alignments of this process are available in the file formats for liftOver or pslMap from our data archive (see "Data Access" section below). An important step of the mapping process protein -> transcript -> genome is filtering the alignment from protein to transcript. Due to differences between the UniProt proteins and the transcripts (proteins were made many years before the transcripts were made, and human genomes have variants), the transcript with the highest BLAST score when aligning the protein to all transcripts is not always the correct transcript for a protein sequence. Therefore, the protein sequence is aligned to only a very short list of one or sometimes more transcripts, selected by a three-step procedure: Use transcripts directly annotated by UniProt: for organisms that have a RefSeq transcript track, proteins are aligned to the RefSeq transcripts that are annotated by UniProt for this particular protein.
Use transcripts for NCBI Gene ID annotated by UniProt: If no transcripts are annotated on the protein, or the annotated ones have been deprecated by NCBI, but an NCBI Gene ID is annotated, the RefSeq transcripts for this Gene ID are used. This can result in multiple matching transcripts for a protein. Use best matching transcript: If no NCBI Gene is annotated, then BLAST scores are used to pick the transcripts. There can be multiple transcripts for one protein, as their coding sequences can be identical. All transcripts within 1% of the highest observed BLAST score are used. For strategies 2 and 3, many of the transcripts found do not differ in coding sequence, so the resulting alignments on the genome will be identical. Therefore, any identical alignments are removed in a final filtering step. The details page of these alignments will contain a list of all transcripts that result in the same protein-genome alignment. On hg38, only a handful of edge cases (pseudogenes, very recently added proteins) remain in 2023 where strategy 3 has to be used. In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a protein sequence to the correct transcript, we use a three-stage process: (1) If UniProt has annotated a given RefSeq transcript for a given protein sequence, the protein is aligned to this transcript. Any difference in the version suffix is tolerated in this comparison. (2) If no transcript is annotated or the transcript cannot be found in the NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a set of NCBI RefSeq transcript IDs via the most current version of NCBI genes tables. Only the top match of the resulting alignments and all others within 1% of its score are used for the mapping. (3) If no transcript can be found after step (2), the protein is aligned to all transcripts, and the top match and all others within 1% of its score are used.
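The three-stage transcript selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the track: the function and parameter names (select_transcripts, strip_version, within_one_percent, gene_to_transcripts, scores) are hypothetical, alignment scores are assumed to be positive numbers computed beforehand, and "within 1% of the top score" is interpreted here as a score of at least 99% of the best score.

```python
from typing import Dict, Iterable, List, Optional, Set


def strip_version(accession: str) -> str:
    """Drop a version suffix, e.g. 'NM_000546.6' -> 'NM_000546'."""
    return accession.split(".", 1)[0]


def within_one_percent(scores: Dict[str, float]) -> List[str]:
    """Return the top-scoring transcript and all others within 1% of its score.

    Assumption: 'within 1%' means score >= 0.99 * best score.
    """
    if not scores:
        return []
    best = max(scores.values())
    return sorted(t for t, s in scores.items() if s >= 0.99 * best)


def select_transcripts(
    annotated: Iterable[str],
    gene_id: Optional[str],
    track_transcripts: Set[str],
    gene_to_transcripts: Dict[str, List[str]],
    scores: Dict[str, float],
) -> List[str]:
    """Hypothetical sketch of the three-stage selection described in the text."""
    # Index the transcript track by accession without version suffix, so that
    # differences in the version suffix are tolerated in stage 1.
    track_by_base = {strip_version(t): t for t in track_transcripts}

    # Stage 1: transcripts annotated directly on the protein.
    direct = [
        track_by_base[strip_version(t)]
        for t in annotated
        if strip_version(t) in track_by_base
    ]
    if direct:
        return sorted(direct)

    # Stage 2: transcripts of the annotated NCBI Gene ID, top match and
    # everything within 1% of its score.
    if gene_id is not None:
        candidates = [
            t for t in gene_to_transcripts.get(gene_id, []) if t in track_transcripts
        ]
        kept = within_one_percent({t: scores[t] for t in candidates if t in scores})
        if kept:
            return kept

    # Stage 3: fall back to all transcripts, again top match plus the 1% band.
    return within_one_percent(
        {t: s for t, s in scores.items() if t in track_transcripts}
    )
```

Identical genome alignments that result from several selected transcripts would still have to be collapsed afterwards, as the text notes; that final filtering step is not part of this sketch.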
This system was designed to resolve the problem of incorrect mappings of proteins, mostly on hg38, due to differences between the SwissProt sequences and the genome reference sequence, which has changed since the proteins were defined. The problem is most pronounced for gene families composed of either very repetitive or very similar proteins. To make sure that the alignments always go to the best chromosome location, all _alt and _fix reference patch sequences are ignored for the alignment, so the patches are entirely free of UniProt annotations. Please contact us if you have feedback on this process or example edge cases. We are not aware of a way to evaluate the results completely and in an automated manner. Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome positions with pslMap and filtered again with pslReps. UniProt annotations were obtained from the UniProt XML file. The UniProt annotations were then mapped to the genome through the alignment described above using the pslMap program. This approach draws heavily on the LS-SNP pipeline by Mark Diekhans. Like all Genome Browser source code, the main script used to build this track can be found on GitHub. Older releases This track is automatically updated on an ongoing basis, every 2-3 months. The current version name is always shown on the track details page; it includes the release of UniProt, the version of the transcript set and a unique MD5 that is based on the protein sequences, the transcript sequences, the mapping file between both and the transcript-genome alignment. The exact transcript that was used for the alignment is shown when clicking a protein alignment in one of the two alignment tracks.
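As a rough illustration of how a single MD5 can depend on several inputs at once, the following Python sketch hashes a sequence of byte strings in a fixed order. It is a hypothetical example, not the procedure used for this track: the function name release_fingerprint, the order of the inputs and the length prefix are all assumptions made for the illustration.

```python
import hashlib
from typing import Iterable


def release_fingerprint(parts: Iterable[bytes]) -> str:
    """Combine several inputs into one MD5 hex digest (hypothetical scheme).

    The caller is assumed to pass the inputs in a fixed order, e.g. protein
    sequences, transcript sequences, mapping file, transcript-genome alignment.
    """
    digest = hashlib.md5()
    for part in parts:
        # Prefix each part with its length so that the boundaries between
        # parts also influence the digest.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()
```

With such a scheme, a change in any one of the inputs yields a different fingerprint, so two releases built from identical inputs can be recognized by their matching MD5.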
For reproducibility of older analysis results and for manual inspection, previous versions of this track are available for browsing in the form of the UCSC UniProt Archive Track Hub (click this link to connect the hub now). The underlying data of all releases of this track (past and current) can be obtained from our downloads server, including the UniProt protein-to-genome alignment. Data Access The raw data of the current track can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The exact filenames can be found in the track configuration file. Annotations can be converted to ASCII text by our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/eboVir3/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Lifting from UniProt to genome coordinates in pipelines To facilitate mapping protein coordinates to the genome, we provide the alignment files in formats that are suitable for our command line tools. Our command line programs liftOver or pslMap can be used to map coordinates on protein sequences to genome coordinates. The filenames are unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap). 
Example commands:
wget -q https://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/uniprot/2022_03/unipToGenome.over.chain.gz
wget -q https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod a+x liftOver
echo 'Q99697 1 10 annotationOnProtein' > prot.bed
./liftOver prot.bed unipToGenome.over.chain.gz genome.bed unmapped.bed
cat genome.bed
Credits This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 unipConflict Seq. Conflicts UniProt Sequence Conflicts Genes and Gene Prediction Tracks unipRepeat Repeats UniProt Repeats Genes and Gene Prediction Tracks unipStruct Structure UniProt Protein Primary/Secondary Structure Annotations Genes and Gene Prediction Tracks unipOther Other Annot. UniProt Other Annotations Genes and Gene Prediction Tracks unipMut Mutations UniProt Amino Acid Mutations Genes and Gene Prediction Tracks unipModif AA Modifications UniProt Amino Acid Modifications Genes and Gene Prediction Tracks unipDomain Domains UniProt Domains Genes and Gene Prediction Tracks unipDisulfBond Disulf.
Bonds UniProt Disulfide Bonds Genes and Gene Prediction Tracks unipChain Chains UniProt Mature Protein Products (Polypeptide Chains) Genes and Gene Prediction Tracks unipLocCytopl Cytoplasmic UniProt Cytoplasmic Domains Genes and Gene Prediction Tracks unipLocTransMemb Transmembrane UniProt Transmembrane Domains Genes and Gene Prediction Tracks unipInterest Interest UniProt Regions of Interest Genes and Gene Prediction Tracks unipLocExtra Extracellular UniProt Extracellular Domain Genes and Gene Prediction Tracks unipLocSignal Signal Peptide UniProt Signal Peptides Genes and Gene Prediction Tracks unipAliTrembl TrEMBL Aln. UCSC alignment of TrEMBL proteins to genome Genes and Gene Prediction Tracks unipAliSwissprot SwissProt Aln. UCSC alignment of SwissProt proteins to genome (dark blue: main isoform, light blue: alternative isoforms) Genes and Gene Prediction Tracks spUniprot UniProt UniProt/SwissProt Annotations Genes and Gene Prediction Tracks Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt staff. The annotations are divided into two subtracks, one for all secondary structure annotations and another one for all other annotations. For the mutations curated by UniProt/SwissProt, please open the track "UniProt Variants" in the track group "Phenotype and Literature". Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disul", "signal pep" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for more details. Mouse over a feature to see the full UniProt annotation comment. Modified residues are highlighted in light blue, transmembrane regions in blue, glycosylation sites in yellow, disulfide bonds in grey, topological domains in red. 
Note that for the human hg38 assembly, there is also a public track hub prepared by UniProt itself, with genome annotations produced and maintained by UniProt using their mapping method. Methods UniProt sequences were aligned to UCSC/Gencode transcript sequences first with BLAT, then lifted to genome positions with pslMap. UniProt variants were obtained from the UniProt XML file. The variants were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. The complete script is part of the kent source tree and is located in src/utils/uniprotLift. The exact commands that were used to build this track can be found on GitHub. Data Access The raw data can be explored interactively with the Table Browser, or the Data Integrator. For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the download server. The files for this track are called spAnnot.bb and spStruct.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, for example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/eboVir3/bbi/spStruct.bb -start=0 -end=100000 stdout Please refer to our mailing list archives for questions, or our Data Access FAQ for more information. Credits This track was created by Maximilian Haeussler, with advice from Mark Diekhans and Brian Raney. Thanks to UniProt for making all data available for download. References UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5.
PMID: 22102590; PMC: PMC3245120 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 spStruct UniProt Structure UniProt/SwissProt Protein Primary/Secondary Structure Annotations Genes and Gene Prediction Tracks Description This track shows the genomic positions of protein secondary structures and amino acid modifications in the UniProt/SwissProt database. These data have been curated from scientific publications by the UniProt staff. Display Conventions and Configuration Genomic locations of UniProt/SwissProt protein secondary structures and amino acid modifications are labeled with the feature name (e.g. helix, beta, coiled-coil, disulf bond, glyco) at a given position. Mouse over a feature to see the UniProt comments. Methods UniProt sequences were aligned to RefSeq sequences first with BLAT, then lifted to genome positions with pslMap. UniProt protein secondary structures and amino acid modifications were parsed from the UniProt XML file. The features were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. The complete script is part of the kent source tree and is located in src/hg/utils/uniprotMutations. Credits This track was created by Maximilian Haeussler, with advice from Mark Diekhans and Brian Raney. References UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014 Jan;42(Database issue):D191-8. PMID: 24253303; PMC: PMC3965022 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278 spAnnot UniProt Annot. 
UniProt/SwissProt Protein Annotations Genes and Gene Prediction Tracks Description This track shows protein sequence annotations from the UniProt/SwissProt database, mapped to genomic coordinates. The data has been curated from scientific publications by the UniProt staff. Display Conventions and Configuration Genomic locations of UniProt/SwissProt annotations are labeled with a short name for the type of annotation (e.g. "glyco", "disul", "signal pep" etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt record for all details. Mouse over a mutation to see the UniProt comments. Modified residues are highlighted in light blue, transmembrane regions in blue, glycosylation sites in yellow, disulfide bonds in grey, topological domains in red. Methods UniProt sequences were aligned to RefSeq sequences first with BLAT, then lifted to genome positions with pslMap. UniProt variants were parsed from the UniProt XML file. The variants were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. The complete script is part of the kent source tree and is located in src/hg/utils/uniprotMutations. Credits This track was created by Maximilian Haeussler, with advice from Mark Diekhans and Brian Raney. Thanks to UniProt for making all data available for download. References UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014 Jan;42(Database issue):D191-8. PMID: 24253303; PMC: PMC3965022 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70.
PMID: 15108278 spMut UniProt Mutations UniProt/SwissProt Amino Acid Substitutions Variation and Repeats Description This track shows the genomic positions of natural and artificial amino acid variants in the UniProt/SwissProt database. The data has been curated from scientific publications by the UniProt staff. Display Conventions and Configuration Genomic locations of UniProt/SwissProt variants are labeled with the amino acid change at a given position and, if known, the abbreviated disease name. A "?" is used if there is no disease annotated at this location, but the protein is described as being linked to only a single disease in UniProt. Mouse over a mutation to see the UniProt comments. Artificially introduced mutations are colored green and naturally occurring variants are colored red. For full information about a particular variant, click the "UniProt variant" linkout. The "UniProt record" linkout lists all variants of a particular protein sequence. The "Source articles" linkout lists the articles in PubMed that originally described the variant(s) and were used as evidence by the UniProt curators. Methods UniProt sequences were aligned to RefSeq sequences first with BLAT, then lifted to genome positions with pslMap. UniProt variants were parsed from the UniProt XML file. The variants were then mapped to the genome through the alignment using the pslMap program. This mapping approach draws heavily on the LS-SNP pipeline by Mark Diekhans. The complete script is part of the kent source tree and is located in src/hg/utils/uniprotMutations. Credits This track was created by Maximilian Haeussler, with advice from Mark Diekhans and Brian Raney. References UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014 Jan;42(Database issue):D191-8. PMID: 24253303; PMC: PMC3965022 Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A.
The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004 May;23(5):464-70. PMID: 15108278