The bigGenePred format stores annotation items that are a
linked collection of exons, much as
BED files indexed as bigBeds do,
but bigGenePred carries additional information about coding frames and other
gene-specific details in eight additional fields.
bigGenePred files are created using the program bedToBigBed with a
special AutoSQL file that defines the fields of the bigGenePred. The
resulting bigBed files are in an indexed binary format. The main advantage of
the bigBed files is that only the portions of the files needed to display a
particular region are transferred to UCSC. So for large data sets, bigBed is
considerably faster than regular BED files. The bigBed file remains on
your web-accessible server (http, https, or ftp), not on the UCSC server.
Only the portion that is needed
for the chromosomal position you are currently viewing is locally cached as a
"sparse file".
Big Gene Predictions
The following AutoSQL definition is used for bigGenePred gene prediction files.
This is the bigGenePred.as
file passed to bedToBigBed with the -as= option.
Click this bed12+8 file for
an example of bigGenePred input. In alternative-splicing situations, each
transcript has its own row.
table bigGenePred
"bigGenePred gene models"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position in chromosome"
uint chromEnd; "End position in chromosome"
string name; "Name or ID of item, ideally both human readable and unique"
uint score; "Score (0-1000)"
char[1] strand; "+ or - for strand"
uint thickStart; "Start of where display should be thick (start codon)"
uint thickEnd; "End of where display should be thick (stop codon)"
uint reserved; "RGB value (use R,G,B string in input file)"
int blockCount; "Number of blocks"
int[blockCount] blockSizes; "Comma separated list of block sizes"
int[blockCount] chromStarts; "Start positions relative to chromStart"
string name2; "Alternative/human readable name"
string cdsStartStat; "enum('none','unk','incmpl','cmpl')"
string cdsEndStat; "enum('none','unk','incmpl','cmpl')"
int[blockCount] exonFrames; "Exon frame {0,1,2}, or -1 if no frame for exon"
string type; "Transcript type"
string geneName; "Primary identifier for gene"
string geneName2; "Alternative/human readable gene name"
string geneType; "Gene type"
)
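As a rough illustration of the row shape this definition implies, the sketch below checks that each row of a tab-separated input file has 20 fields and that the blockSizes, chromStarts, and exonFrames lists each hold blockCount items. The function name check_bgp_rows is hypothetical and not part of the UCSC utilities; bedToBigBed remains the authority on whether a file is valid, so treat this only as a quick pre-check.

```shell
# Hypothetical helper (not a UCSC tool): report rows of a tab-separated
# bed12+8 file that do not match the shape of the bigGenePred definition.
check_bgp_rows() {
    awk -F'\t' '
        # Count items in a comma-separated list, ignoring one trailing comma.
        function count_items(list,    parts) {
            sub(/,$/, "", list)
            if (list == "") return 0
            return split(list, parts, ",")
        }
        NF != 20 {
            printf "line %d: expected 20 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            # Field 10 is blockCount; fields 11, 12 and 16 are blockSizes,
            # chromStarts and exonFrames.
            if (count_items($11) != $10 || count_items($12) != $10 || count_items($16) != $10) {
                printf "line %d: list lengths do not match blockCount %s\n", NR, $10
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling check_bgp_rows bigGenePred.txt prints one message per problem row and returns a non-zero status if any row fails, so it can gate a later conversion step.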
Note that the bedToBigBed utility uses a substantial amount of
memory: on the order of 1.25 times the size of the
uncompressed BED input file.
To create a bigGenePred track, follow these steps:
- Create a bed12+8 bigGenePred format file whose first twelve fields
are those of a normal BED file, as described here.
(You can also read about genePred here.)
- Your bigGenePred file must have the extra eight fields described in the AutoSQL file above:
name2, cdsStartStat, cdsEndStat, exonFrames, type, geneName, geneName2, geneType.
- Your bigGenePred file must be sorted by chrom then chromStart. You can use
the UNIX sort command to do this:
sort -k1,1 -k2,2n unsorted.bed > input.bed
- Download the bedToBigBed program from the
directory
of binary utilities.
- Use the fetchChromSizes script from the same
directory
to create a chrom.sizes file for the UCSC database you are working with
(e.g. hg38). Alternatively, you can download the chrom.sizes file for
any assembly hosted at UCSC from our
downloads page (click on "Full data set" for any assembly). For example, for the hg38
database, the hg38.chrom.sizes are located at
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.
- Create the bigBed file from your sorted bigGenePred input file using the bedToBigBed
utility like so:
bedToBigBed -as=bigGenePred.as -type=bed12+8 bigGenePred.txt chrom.sizes myBigGenePred.bb
- Move the newly created bigBed file (myBigGenePred.bb) to an http,
https, or ftp location.
- Construct a custom track
using a single
track line.
Note that any of the track attributes listed
here are applicable
to tracks of type bigBed.
The most basic version of the "track" line will look something
like this:
track type=bigGenePred name="My Big GenePred" description="A Gene Set Built from Data from My Lab" bigDataUrl=http://myorg.edu/mylab/myBigGenePred.bb
- Paste this custom track line into the text box in the
custom track management page.
The bedToBigBed program can also be run with several additional options.
A full list of the available options can be seen by running
bedToBigBed with no arguments to display the usage message.
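The sorting requirement in the steps above can be checked before conversion. The sketch below wraps the sort command from the steps in a function and then re-runs sort in check mode (-c), which exits non-zero if the output is out of order. The function name sort_bed is hypothetical, and setting LC_ALL=C is an assumption made here to get byte-wise, case-sensitive ordering that does not depend on the local locale.

```shell
# Hypothetical helper: sort a BED-like file by chrom, then numerically by
# chromStart, and confirm the result is in order.
sort_bed() {
    # $1 = unsorted input, $2 = sorted output
    # LC_ALL=C is an assumption: it forces byte-wise, case-sensitive ordering.
    LC_ALL=C sort -k1,1 -k2,2n "$1" > "$2" &&
        LC_ALL=C sort -c -k1,1 -k2,2n "$2"
}
```

For example, sort_bed unsorted.bed input.bed writes the sorted rows to input.bed and returns success only if the check passes.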
Example One
In this example, you will use an existing bigGenePred file to create a bigGenePred
custom track. A bigGenePred file that contains data on the hg38
assembly has been placed on our http server.
You can create a custom track using this bigGenePred file by constructing a
"track" line that references this file like so:
track type=bigGenePred name="bigGenePred Example One" description="A bigGenePred file" bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb
Paste the above "track" line into the
custom track management page for the
human assembly hg38 (Dec. 2013), then press the submit button.
Custom tracks can also be loaded via one URL line. The link below loads the same
bigGenePred track, but includes parameters on the URL line:
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgct_customText=track%20type=bigGenePred%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb
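The link above follows a simple pattern: the track line is placed in the hgct_customText parameter with its spaces written as %20. As a minimal sketch of that pattern, the hypothetical function make_track_url below builds such a URL from a database name and a track line. It encodes spaces only, which is enough for the example here; a track line containing other reserved characters would need fuller percent-encoding.

```shell
# Hypothetical helper: build an hgTracks URL that carries a custom track line
# in the hgct_customText parameter, following the pattern of the link above.
make_track_url() {
    # $1 = database (e.g. hg38), $2 = track line
    # Only spaces are encoded (as %20); this is a simplification.
    encoded=$(printf '%s' "$2" | sed 's/ /%20/g')
    printf 'http://genome.ucsc.edu/cgi-bin/hgTracks?db=%s&hgct_customText=%s\n' "$1" "$encoded"
}
```

Passing hg38 and the track line "track type=bigGenePred name=Example bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb" reproduces the link shown above.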
With this example bigGenePred loaded, click into a gene from the track. Note
that the details page has a "Links to sequence:" section that includes
"Translated Protein", "Predicted mRNA", and "Genomic
Sequence" links. Click the "Go to ... track controls" link.
There, change the "Color track by codons:" option from "OFF"
to "genomic codons", make sure "Display mode:" is "full",
then click "Submit". Then zoom to a region where amino acids display,
such as chr9:133,255,650-133,255,700 and see how bigGenePred allows
the display of codons. Click back into the track controls page and click the box next
to "Show codon numbering". Return to the browser to see amino acid
numbering.
You can also add a parameter in the custom track line,
baseColorDefault=genomicCodons, to set the display of codons:
browser position chr10:67,884,600-67,884,900
track type=bigGenePred baseColorDefault=genomicCodons name="bigGenePred Example Two" description="A bigGenePred file" visibility=pack bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigGenePred.bb
Paste the above into the hg38 custom track
page to see an example of bigGenePred amino acid display around the beginning of the gene SIRT1
on chromosome 10.
Example Two
In this example, you will create your own bigGenePred file from an existing
bigGenePred input file.
- Save this bed12+8 bigGenePred.txt
example input file to your machine (this satisfies steps 1-3 above).
- Download the bedToBigBed utility
(step 4).
- Save this hg38.chrom.sizes text file to your machine.
It contains the chrom.sizes for the human (hg38) assembly
(step 5).
- Save this bigGenePred.as text file to your machine.
- Run the utility to create the bigBed output file
(step 6):
bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as bigGenePred.txt hg38.chrom.sizes bigGenePred.bb
- Place the bigBed file you just created (bigGenePred.bb) on a
web-accessible server (step 7).
- Construct a "track" line that points to your bigGenePred file
(see step 8).
- Create the custom track on the human assembly hg38 (Dec. 2013), and
view it in the genome browser (see step 9).
See the description in Example One above for how to view genomic codons,
including codon numbering.
Sharing Your Data with Others
If you would like to share your bigGenePred data track with a colleague, learn
how to create a URL by looking at Example 11 on
this page.
Extracting Data from bigBed Format
Since bigGenePred files are an extension of bigBed files, which are indexed binary files,
it can be difficult to
extract data from them. We have developed the following
programs, all of which are available from the
directory of binary
utilities.
- bigBedToBed — this program converts a bigBed file
to ASCII BED format.
- bigBedSummary — this program extracts summary information
from a bigBed file.
- bigBedInfo — this program prints out information about a
bigBed file.
As with all UCSC Genome Browser programs, simply type the program name
at the command line with no parameters to see the usage statement.
Troubleshooting
If you encounter an error when you run the bedToBigBed program,
it may be because your input bigGenePred file has data off the end of a chromosome.
In this case, run the bedClip program (available
here) on your input before running
bedToBigBed. It will remove the row(s) in your input BED
file that are off the end of a chromosome.
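To see which rows are affected before clipping, a quick look with awk can help. The sketch below is not bedClip and does not modify the input; it only reports rows whose chromEnd exceeds the length listed in a two-column, tab-separated chrom.sizes file, along with rows naming a chromosome that is absent from that file. The function name find_off_end_rows is hypothetical.

```shell
# Hypothetical helper (not bedClip): list rows that extend past the end of
# their chromosome according to a chrom.sizes file. It only reports.
find_off_end_rows() {
    # $1 = chrom.sizes file (name<TAB>size), $2 = BED-like input file
    awk -F'\t' '
        # First file: remember each chromosome size.
        NR == FNR { size[$1] = $2; next }
        # Second file: flag chromosomes missing from chrom.sizes.
        !($1 in size) {
            printf "line %d: unknown chromosome %s\n", FNR, $1
            next
        }
        # Flag rows whose chromEnd (field 3) is beyond the chromosome size.
        $3 + 0 > size[$1] + 0 {
            printf "line %d: chromEnd %d exceeds %s size %d\n", FNR, $3, $1, size[$1]
        }
    ' "$1" "$2"
}
```

Running find_off_end_rows hg38.chrom.sizes bigGenePred.txt prints nothing when every row fits within its chromosome.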