Introduction
^^^^^^^^^^^^
This directory contains the Dec. 2011 (GRCm38/mm10) assembly of the mouse genome
(mm10, Genome Reference Consortium Mouse Build 38 (GCA_000001635.2)), as well
as repeat annotations and GenBank sequences.
This assembly was produced by the Mouse Genome Sequencing Consortium
and the National Center for Biotechnology Information (NCBI).
For more information on the mouse genome, see the project website:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/
See also:
http://www.ncbi.nlm.nih.gov/genome/52
Patches
^^^^^^^
mm10 has been updated with patches since its release in 2012.
GRC patch releases do not change any previously existing sequences; they simply add
new sequences for fix patches or alternate haplotypes that correspond to
specific regions of the main chromosome sequences. These sequences are typically
relatively short, on the order of a few dozen kbp. For most users, the patches
are unlikely to make a difference and may complicate the analysis as they
introduce more duplication.
The initial/ subdirectory contains files for the initial release of GRCm38,
which has 66 sequences and includes neither alternate sequences nor fix
sequences. It is the same as the parent download directory. This is probably
the best genome file for aligners and most analysis tasks; the equivalent
version for the human genome is called "analysisSet".
The p6/ subdirectory contains files for GRCm38.p6 (patch release 6).
It has 239 sequences including alternate and fix sequences. Note that
these patches include "strain-specific" sequences. You may want to
check with the authors of your aligner whether the software can recognize
these sequences.
The "latest/" symbolic link points to the subdirectory for the most recent patch version.
mm10.* files in this directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCm38 release and do not
include the patch sequences that are now included in the Genome Browser.
This means that old software that downloads these files will not report
different results.
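As a rough way to check which release a downloaded file set corresponds to,
the sequence names in a chrom.sizes file can be counted and split into patch
and non-patch sequences. The sketch below is only an illustration: it assumes
the two-column, tab-separated chrom.sizes layout and the _alt and _fix name
suffixes documented later in this README. The function name and the sample
sizes are hypothetical.

```python
def summarize_chrom_sizes(text):
    """Count sequences in chrom.sizes-style text (name<TAB>size per line)."""
    total = 0
    patches = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t")[0]
        total += 1
        # Assumption: patch sequences carry an _alt or _fix name suffix.
        if name.endswith(("_alt", "_fix")):
            patches += 1
    return {"total": total, "patches": patches}


# Made-up sizes, for illustration only.
example = "chr1\t1000\nchrM\t200\nchr5_KV575237_fix\t300\n"
summary = summarize_chrom_sizes(example)
```

Applied to the chrom.sizes file of the initial release, this would be expected
to report 66 sequences and no patch sequences, per the description above.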
Sequence names
^^^^^^^^^^^^^^
For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI
calls "NC_000067.6". The sequences are identical though. To map between UCSC,
Ensembl and NCBI names, use our table "chromAlias", available via our Table
Browser or as a file:
https://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz
We also provide a Python command line tool to convert sequence names in the
most common genomics file formats:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc
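The name mapping described above can be sketched in a few lines of Python.
This is only an illustration and not the chromToUcsc tool itself. It assumes
the layout documented for mm10.chromAlias.txt under Files (first column is the
UCSC sequence name, followed by tab-separated aliases), and it further assumes
that lines starting with "#" are header lines. The function name and the
column order in the sample are hypothetical.

```python
import io


def load_alias_map(handle):
    """Map every alias to the UCSC sequence name found in the first column."""
    alias_to_ucsc = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: '#' lines are headers or comments
        fields = line.split("\t")
        ucsc_name = fields[0]
        for alias in fields[1:]:
            if alias:  # tolerate empty alias columns
                alias_to_ucsc[alias] = ucsc_name
    return alias_to_ucsc


# Sample line built from the chr1 names mentioned above.
sample = io.StringIO("# ucsc\tensembl\trefseq\nchr1\t1\tNC_000067.6\n")
aliases = load_alias_map(sample)
```

With this mapping, aliases["1"] and aliases["NC_000067.6"] both resolve to
"chr1".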
During genome assembly, reads are assembled into "contigs" (a few kbp long),
which are then joined into longer "scaffolds" of a few hundred kbp. These are
finally placed, often manually (e.g. with FISH assays), onto chromosomes.
As a result, the mm10 genome sequence files contain these types of sequences:
Chromosomes:
- made from scaffolds placed onto chromosome locations, 95% of the genome file
- format: chr{chromosome number or name}
- e.g. chr1 or chrX, chrM for the mitochondrial genome.
Unlocalized scaffolds:
- a sequence found in an assembly that is associated with a specific
chromosome but cannot be ordered or oriented on that chromosome.
- format: chr{chromosome number or name}_{sequence_accession}_random
- e.g. chr5_JH584299_random
Unplaced scaffolds:
- a sequence found in an assembly that is not associated with any chromosome.
- format: chrUn_{sequence_accession}
- e.g. chrUn_GL456379
Alternate loci scaffolds:
- a scaffold that provides an alternate representation of a locus found
in the primary assembly. These sequences do not represent a complete
chromosome sequence although there is no hard limit on the size of the
alternate locus; currently these are less than 1 Mb. These are either
NOVEL patch sequences, added through patch releases, or sequences present
in the initial assembly release.
- format: chr{chromosome number or name}_{sequence_accession}_alt
- e.g. chr6_GL456054_alt
Strain-specific alternate loci sequences:
- these are alternate loci that do not represent different versions of the
sequence within the same mouse population, but come from other strains.
Fix loci scaffolds:
- a patch that corrects sequence or reduces an assembly gap in a given
major release. FIX patch sequences are meant to be incorporated into
the primary or existing alt-loci assembly units at the next major
release.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_fix
- e.g. chr5_KV575237_fix
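The naming formats listed above lend themselves to a simple pattern-based
classifier. The following Python sketch is an illustration under the
assumption that sequence accessions contain no underscore, as in the examples
above; the function name and the category labels are hypothetical.

```python
import re

# Order matters: chrUn_ names must be tested before the generic patterns.
_PATTERNS = [
    ("unplaced", re.compile(r"^chrUn_[^_]+$")),
    ("unlocalized", re.compile(r"^chr[^_]+_[^_]+_random$")),
    ("alt", re.compile(r"^chr[^_]+_[^_]+_alt$")),
    ("fix", re.compile(r"^chr[^_]+_[^_]+_fix$")),
    ("chromosome", re.compile(r"^chr[^_]+$")),
]


def classify_sequence(name):
    """Return the sequence type implied by a UCSC-style mm10 sequence name."""
    for label, pattern in _PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"


for example in ("chr1", "chr5_JH584299_random", "chrUn_GL456379",
                "chr6_GL456054_alt", "chr5_KV575237_fix"):
    print(example, classify_sequence(example))
```

Names that do not follow the UCSC convention, such as the Ensembl name "1",
are reported as "unknown" rather than guessed.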
Files
^^^^^
Files included in this directory:
mm10.2bit - contains the complete mouse/mm10 genome sequence
in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats
Finder (with period of 12 or less) are shown in lower case; non-repeating
sequence is shown in upper case. The utility program, twoBitToFa (available
from the kent src tree), can be used to extract .fa file(s) from
this file. A pre-compiled version of the command line tool can be
found at:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
See also:
http://genome.ucsc.edu/admin/git.html
http://genome.ucsc.edu/admin/jk-install.html
chromAgp.tar.gz - Description of how the assembly was generated from
fragments, unpacking to one file per chromosome.
chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case.
chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
RepeatMasker was run with the -s (sensitive) setting.
Repeat Masker library RELEASE 20110920
April 26 2011 (open-3-3-0) version of RepeatMasker
chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
with period less than or equal to 12, and translated into UCSC's BED 5+
format (one file per chromosome).
est.fa.gz - Mouse ESTs in GenBank. This sequence data is updated once a
week via automatic GenBank updates.
md5sum.txt - checksums of files in this directory
mrna.fa.gz - Mouse mRNA from GenBank. This sequence data is updated
once a week via automatic GenBank updates.
refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
This sequence data is updated once a week via automatic GenBank
updates.
upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
transcription starts of RefSeq genes with annotated 5' UTRs.
This file is updated weekly so it might be slightly out of sync with
the RefSeq data which is updated daily for most assemblies.
upstream2000.fa.gz - Same as upstream1000, but 2000 bases.
upstream5000.fa.gz - Same as upstream1000, but 5000 bases.
xenoMrna.fa.gz - GenBank mRNAs from species other than that of
the genome. This sequence data is updated once a week via automatic
GenBank updates.
mm10.agp.gz - Description of how the assembly was generated from
fragments.
mm10.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
mm10.chromAlias.txt - sequence name alias file, one line
for each sequence name. First column is sequence name followed by
tab separated alias names.
mm10.chromAlias.bb - bigBed file for alias sequence names, one line
for each sequence name. The first three columns are the sequence in
BED format, followed by tab-separated alias names.
The .bb file is used by bedToBigBed as a URL to avoid having to download
the entire chromAlias.txt file. From the usage message:
-sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to be a
chromAlias bigBed file or a URL to a such a file (see above).
More documentation is found here:
https://genomewiki.ucsc.edu/index.php?title=Chrom_Alias
mm10.fa.align.gz - RepeatMasker .align file. RepeatMasker was run with the
-s (sensitive) setting.
2012-02-06 version of RepeatMasker
mm10.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or
less) are shown in lower case; non-repeating sequence is shown in upper
case. (again, the most current version of this file is latest/mm10.fa.gz)
mm10.fa.masked.gz - "Hard-masked" assembly sequence in one file.
Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case.
mm10.fa.out.gz - RepeatMasker .out file. RepeatMasker was run with the
-s (sensitive) setting.
2012-02-06 version of RepeatMasker
mm10.gc5Base.wigVarStep.gz - ascii data wiggle variable step values used
to construct the GC Percent track
mm10.gc5Base.bw - binary bigWig data for the gc5Base track.
mm10.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
with period less than or equal to 12, and translated into UCSC's BED
format.
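The relationship between the "soft-masked" and "hard-masked" files described
above can be illustrated with a short Python sketch. It assumes only what is
stated in this README: repeats appear in lower case in the soft-masked
sequence and as capital Ns in the hard-masked sequence. The function names
are hypothetical, and this is not how the distributed files were produced.

```python
def hard_mask(sequence):
    """Replace soft-masked (lower-case) bases with capital N."""
    return "".join("N" if base.islower() else base for base in sequence)


def soft_masked_fraction(sequence):
    """Fraction of bases in lower case, i.e. flagged as repeats."""
    if not sequence:
        return 0.0
    return sum(base.islower() for base in sequence) / len(sequence)


# Toy sequence, for illustration only.
toy = "ACGTacgtNNAC"
masked = hard_mask(toy)
```

For real chromosomes it is more practical to stream the FASTA file line by
line and apply hard_mask to each sequence line, skipping header lines that
start with ">".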
------------------------------------------------------------------
If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.cse.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/mm10/bigZips. To download multiple files, use
the "mget" command:
mget <filename1> <filename2> ...
- or -
mget -a (to download all the files in the directory)
Alternate methods to ftp access.
Using an rsync command to download the entire directory:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
rsync -avzP \
    rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz .
Or with wget, all files:
wget --timestamping \
    'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/*'
With wget, a single file:
wget --timestamping \
    'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz' \
    -O chromFa.tar.gz
To unpack the *.tar.gz files:
tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
gunzip <file>.fa.gz
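After downloading, files can be checked against the checksums in md5sum.txt.
The Python sketch below is a minimal illustration that assumes md5sum.txt
uses the usual md5sum output layout (checksum, whitespace, file name); that
layout is an assumption, not something stated in this README. The function
names are hypothetical.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse md5sum-style lines into a {filename: checksum} dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # assumption: "<checksum> <filename>" per line
            checksum, filename = parts
            expected[filename.lstrip("*")] = checksum.lower()
    return expected


# Self-check on an in-memory empty stream instead of a real download.
empty_digest = md5_of_stream(io.BytesIO(b""))
```

To check a real download, open the file in binary mode, pass the handle to
md5_of_stream, and compare the result with the entry for that file name
returned by parse_md5sum.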
------------------------------------------------------------------
This file last updated: 2021-04-20 - 20 April 2021
------------------------------------------------------------------
Name                        Last modified     Size  Description
Parent Directory                              -
chromAgp.tar.gz             2012-02-09 13:40  9.7K
chromFa.tar.gz              2012-02-09 13:54  830M
chromFaMasked.tar.gz        2012-02-09 14:02  479M
chromOut.tar.gz             2012-02-09 13:41  155M
chromTrf.tar.gz             2012-02-09 14:02  19M
est.fa.gz                   2019-10-14 17:11  788M
est.fa.gz.md5               2019-10-14 17:11  44
genes/                      2024-02-29 14:26  -
initial/                    2023-02-23 15:10  -
latest/                     2023-02-21 23:09  -
md5sum.txt                  2023-02-23 13:14  1.0K
mm10.2bit                   2012-02-07 10:52  682M
mm10.agp.gz                 2021-04-12 23:00  7.9K
mm10.chrom.sizes            2012-02-06 11:46  1.4K
mm10.chromAlias.txt         2020-09-29 10:49  2.5K
mm10.fa.align.gz            2012-02-06 22:48  1.8G
mm10.fa.gz                  2021-04-19 18:42  830M
mm10.fa.masked.gz           2021-04-13 17:07  479M
mm10.fa.out.gz              2021-04-13 17:42  155M
mm10.gc5Base.bw             2012-02-06 12:08  1.4G
mm10.gc5Base.wib            2019-01-17 14:10  506M
mm10.gc5Base.wig.gz         2019-01-17 14:10  9.5M
mm10.gc5Base.wigVarStep.gz  2012-02-06 11:55  1.3G
mm10.trf.bed.gz             2021-04-13 17:55  19M
mrna.fa.gz                  2019-10-14 16:53  261M
mrna.fa.gz.md5              2019-10-14 16:53  45
p6/                         2023-02-22 00:10  -
refMrna.fa.gz               2019-10-14 17:12  44M
refMrna.fa.gz.md5           2019-10-14 17:12  48
upstream1000.fa.gz          2012-02-09 14:18  6.6M
upstream1000.fa.gz.md5      2019-10-14 17:13  53
upstream2000.fa.gz          2012-02-09 14:19  13M
upstream2000.fa.gz.md5      2019-10-14 17:13  53
upstream5000.fa.gz          2012-02-09 14:20  30M
upstream5000.fa.gz.md5      2019-10-14 17:13  53
xenoMrna.fa.gz              2019-10-14 17:04  6.5G
xenoMrna.fa.gz.md5          2019-10-14 17:04  49
xenoRefMrna.fa.gz           2019-10-14 17:12  287M
xenoRefMrna.fa.gz.md5       2019-10-14 17:12  52