Frequently Asked Questions: Blat
Blat vs. Blast
Question:
"What are the differences between Blat and Blast?"
Response:
Blat is an alignment tool like BLAST, but it is structured differently. On
DNA, Blat works by keeping an index of an entire genome in memory.
Thus, the target database of BLAT is not a set of GenBank sequences, but
instead an index derived from the assembly of the entire genome. By default,
the index consists of all non-overlapping 11-mers except for those heavily
involved in repeats, and it uses less than a gigabyte of RAM. This small
memory footprint means that Blat is far more easily mirrored
than BLAST. DNA Blat is designed to quickly find sequences of 95% and
greater similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. (The default settings and expected behavior of
standalone Blat are slightly different from those of the
graphical version of Blat.)
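The indexing idea described above can be sketched in a few lines of Python. This is only an illustration of a non-overlapping tile index, not Blat's actual implementation: the function names, the `max_occurrences` cutoff used as a stand-in for "heavily involved in repeats", and the choice to scan the query at every position are all assumptions made for the example.

```python
from collections import defaultdict

def build_tile_index(target, tile_size=11, max_occurrences=1024):
    """Index non-overlapping tiles of `target`, dropping overused tiles.

    Hypothetical helper: the cutoff value is an illustrative assumption.
    """
    index = defaultdict(list)
    # Step by tile_size so that indexed tiles never overlap.
    for pos in range(0, len(target) - tile_size + 1, tile_size):
        index[target[pos:pos + tile_size]].append(pos)
    # Keep only tiles that are not over-represented in the target.
    return {tile: hits for tile, hits in index.items()
            if len(hits) <= max_occurrences}

def seed_hits(index, query, tile_size=11):
    """Return (query_pos, target_pos) pairs for every query tile in the index."""
    hits = []
    # Assumption: the query is scanned at every offset (overlapping tiles).
    for qpos in range(len(query) - tile_size + 1):
        for tpos in index.get(query[qpos:qpos + tile_size], ()):
            hits.append((qpos, tpos))
    return hits
```

For example, with a toy tile size of 3, `seed_hits(build_tile_index("AAACCCGGGTTT", tile_size=3), "CCCGGG", tile_size=3)` yields two seed pairs, one for `CCC` and one for `GGG`. A real aligner would then extend and chain such seeds into alignments.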
On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences
of 80% and greater similarity to the query of length 20+ amino acids. The
protein index requires slightly more than 2 gigabytes of RAM.
In practice -- due to sequence divergence rates over evolutionary time -- DNA
Blat works well within humans and primates, while protein Blat
continues to find good matches within terrestrial vertebrates and even earlier
organisms for conserved proteins. Within humans, protein Blat gives a much better
picture of gene families (paralogs) than DNA Blat. However, BLAST and
psi-BLAST at NCBI can find much more remote matches.
From a practical standpoint, Blat has several advantages over BLAST:
- speed (no queues, response in seconds) at the price of lesser homology depth
- the ability to submit a long list of simultaneous queries in fasta format
- five convenient output sort options
- a direct link into the UCSC browser
- alignment block details in natural genomic order
- an option to launch the alignment later as part of a custom track
Blat is commonly used to look up the location of a
sequence in the genome or determine the exon structure of an mRNA, but expert
users can run large batch jobs and make internal parameter sensitivity
changes by installing command line Blat on their own Linux server.
Blat cannot find a sequence
Question:
"I cannot find a sequence with Blat although I'm sure
it is in the genome. Am I doing something wrong?"
Response:
First, check that you are using the correct version
of the genome. For example, two versions of the human genome
are currently in use (called hg19 and hg38), and your
sequence may be present in only one of them. Many
published articles do not specify the version, so trying
a few may be necessary.
Very short sequences that span a splice site in a
cDNA sequence cannot be found because they are not
in the genome; qPCR primers are a typical example.
For these cases, you can use In-Silico PCR and select a gene set as the
target. In general, In-Silico PCR
is more sensitive and should be preferred for primers.
If you are sure that the genome is the right one and
that the sequence is indeed there (for example, because you
confirmed it with the "Short match" track), the problem
may be a result of Blat's query masking. The online
version of Blat masks 11-mers from the query that occur
more than 1024 times in the genome. The goal is to
improve speed, but this may result in missed hits when
you are searching for sequences in repeats.
To find these matches with the online version of Blat,
you can add more flanking sequence to your query. If
this is not possible, the only alternative is to
download the executables of Blat and the .2bit file of
a genome to your own machine and use the command line.
See Downloading Blat source and
documentation.
Blat use restrictions
Question:
"I received a high-volume traffic warning from your Blat
server informing me that I had exceeded the server use
limitations. Can you give me information on the UCSC
Blat server use parameters?"
Response:
Due to the high demand on our Blat servers, we restrict
service for users who programmatically query Blat or do
large batch queries. Program-driven use of Blat is
limited to a maximum of one hit every 15
seconds and no more than 5,000 hits per day. Please limit
batch queries to 25 sequences or fewer.
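A client-side script can enforce these limits before any request is sent. The sketch below is one possible way to do so and is not a UCSC-provided tool: the class and function names are hypothetical, and treating a "day" as a 24-hour window starting at the first hit is an assumption.

```python
import time

MIN_INTERVAL_SECONDS = 15   # at most one hit every 15 seconds
MAX_HITS_PER_DAY = 5000     # at most 5,000 hits per day
MAX_BATCH_SIZE = 25         # at most 25 sequences per batch query

class BlatThrottle:
    """Hypothetical helper that spaces out hits and counts them per day."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last_hit = None
        self._day_start = None
        self._hits_today = 0

    def wait_for_slot(self):
        """Block until the next hit is allowed, or raise at the daily cap."""
        now = self._clock()
        # Assumption: a "day" is 24 hours counted from the first hit.
        if self._day_start is None or now - self._day_start >= 86400:
            self._day_start = now
            self._hits_today = 0
        if self._hits_today >= MAX_HITS_PER_DAY:
            raise RuntimeError("daily Blat hit limit reached")
        if self._last_hit is not None:
            remaining = MIN_INTERVAL_SECONDS - (now - self._last_hit)
            if remaining > 0:
                self._sleep(remaining)
        self._last_hit = self._clock()
        self._hits_today += 1

def batches(sequences, size=MAX_BATCH_SIZE):
    """Split a list of sequences into batches of at most `size` items."""
    for start in range(0, len(sequences), size):
        yield sequences[start:start + size]
```

A script would call `wait_for_slot()` immediately before each query and submit one batch from `batches(...)` per hit.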
For users with high-volume Blat demands, we recommend
downloading Blat for local use. For more information,
see Downloading Blat source and
documentation.
Downloading Blat source and documentation
Question:
"Is the Blat source available for download? Is there
documentation available?"
Response:
Blat source and executables are freely available for
academic, nonprofit and personal use. Commercial licensing
information is available on the
Kent Informatics website.
Blat source may be downloaded from
http://www.soe.ucsc.edu/~kent
(look for the blatSrc* zip file with the most recent
date). For
Blat executables, go to
http://hgdownload.cse.ucsc.edu/admin/exe/ and choose your machine type.
Documentation on Blat program specifications is available
here.
Replicating web-based Blat parameters in command-line version
Question:
"I'm setting up my own Blat server and would like to use
the same parameter values that the UCSC web-based Blat
server uses."
Response:
Expect small differences between the results of
hgBlat/gfServer and those of the stand-alone command-line blat.
The best matches can be found using the pslReps and
pslCDnaFilter utilities. The web-based blat is tuned permissively,
with a minimum cut-off score of 20, so it displays most
alignments. Other than to confirm that your command-line blat is
working, there is little use in perfectly replicating the web-based blat results.
We advise deciding which filtering parameters make the most sense for
your experiment or analysis. Often these settings will
be different from, and more stringent than, those of the web-based blat.
With that in mind, use the following settings to replicate the search results of the web-based blat:
faToTwoBit:
gfServer (this is how the UCSC web-based blat servers are configured):
- blat server (capable of PCR):
gfServer start blatMachine portX -stepSize=5 -log=untrans.log database.2bit
- translated blat server:
gfServer start blatMachine portY -trans -mask -log=trans.log database.2bit
For enabling DNA/DNA and DNA/RNA
matches, only the host, port and twoBit files are needed.
The same port is used for both untranslated blat (gfClient)
and PCR (webPcr). You'll need a separate blat server on a separate
port to enable translated blat (protein searches or translated searches in protein-space).
gfClient:
- Set -minScore=0 and
-minIdentity=0. This will result in some
low-scoring, generally spurious hits, but for
interactive use it's sufficiently easy to ignore them
(because results are sorted by score) and sometimes
the low-scoring hits come in handy.
standalone blat:
- blat search:
blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl
Notes on repMatch:
The default setting for gfServer dna matches is: repMatch = 1024 * (tileSize/stepSize).
The default setting for blat dna matches is: repMatch = 1024 (if tileSize=11).
To get command-line results that are equivalent to web-based results, repMatch must
be specified when using blat.
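The repMatch arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and rounding to the nearest integer is an assumption chosen because it reproduces the -repMatch=2253 value shown in the standalone command (1024 * 11 / 5 = 2252.8); the programs themselves may round differently.

```python
def gfserver_default_rep_match(tile_size=11, step_size=5, base=1024):
    """repMatch = base * (tileSize / stepSize), rounded to the nearest integer.

    The rounding rule is an assumption; it matches the 2253 used above.
    """
    return round(base * tile_size / step_size)

def standalone_blat_args(database, query, output, step_size=5):
    """Build the argument list for the standalone command shown above,
    keeping tileSize at its DNA default of 11."""
    rep_match = gfserver_default_rep_match(tile_size=11, step_size=step_size)
    return [
        "blat",
        f"-stepSize={step_size}",
        f"-repMatch={rep_match}",
        "-minScore=0",
        "-minIdentity=0",
        database,
        query,
        output,
    ]
```

Calling `standalone_blat_args("database.2bit", "query.fa", "output.psl")` reproduces the options of the blat search line above; the list can be passed to `subprocess.run` on a machine where blat is installed.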
For more information about how to replicate the score and percent identity matches displayed
by our web-based blat, please see the following
blat FAQ.
For more information on the parameters available for
blat, gfServer, and gfClient, see the
blat
specifications.
Using the -ooc flag
Question:
"What does the -ooc flag do?"
Response:
Using any -ooc option in blat, such
as -ooc=11.ooc, speeds up searches in a way
similar to repeat-masking the sequence. The
11.ooc file contains sequences
determined to be over-represented in the genome
sequence. To speed up searches, these sequences are not
used when seeding an alignment against the genome. For
reasonably sized sequences, this will not create a
problem and will significantly reduce processing time.
By not using the 11.ooc file, you will increase
alignment time, but will also slightly increase
sensitivity. This may be important if you are aligning
shorter sequences or sequences of poor quality. For example,
if a particular sequence consists primarily of
sequences in the 11.ooc file, it will
never be seeded correctly for an alignment if the
-ooc flag is used.
In summary,
if you are not finding certain sequences and can afford
the extra processing time, you may want to run blat
without the 11.ooc file.
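The effect of excluding over-represented tiles can be shown with a toy Python example. It does not read or write real .ooc files; the function names and the small threshold are hypothetical, chosen only to show why a query made up mostly of overused tiles ends up with no seeds.

```python
from collections import Counter

def overused_tiles(genome, tile_size, rep_match):
    """Return the set of tiles occurring more than rep_match times.

    Toy stand-in for the contents of an .ooc file.
    """
    counts = Counter(genome[i:i + tile_size]
                     for i in range(len(genome) - tile_size + 1))
    return {tile for tile, n in counts.items() if n > rep_match}

def usable_seeds(query, tile_size, excluded):
    """Return the query tiles that may still be used to seed an alignment."""
    tiles = (query[i:i + tile_size]
             for i in range(len(query) - tile_size + 1))
    return [tile for tile in tiles if tile not in excluded]

# A toy genome dominated by an AC repeat, followed by unique sequence.
genome = "ACAC" * 10 + "GATTACAGATTC"
excluded = overused_tiles(genome, tile_size=4, rep_match=5)

repeat_query = "ACACACACAC"   # consists only of overused tiles
unique_query = "GATTACA"      # consists only of ordinary tiles
print(usable_seeds(repeat_query, 4, excluded))
print(usable_seeds(unique_query, 4, excluded))
```

The repeat-only query is left with an empty seed list, which corresponds to the case described above where a sequence is never seeded when the -ooc flag is used.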
Replicating web-based Blat percent identity and score calculations
Question:
"Using my own command-line Blat server, how can I
replicate the percent identity and score calculations
produced by web-based Blat?"
Response:
Command-line Blat has no option that reports
the percent ID and the score. However, we have
created scripts that include the calculations.
- View the perl script from the source tree:
pslScore.pl
- View the corresponding C program:
pslScore.c
and associated library functions pslScore
and pslCalcMilliBad in
psl.c
See our FAQ on source code
licensing and downloads for information on obtaining
the source.
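For orientation, here is a Python sketch of what the pslScore and pslCalcMilliBad functions are understood to compute. It is written from recollection of the C code rather than copied from it, so treat every formula as an assumption and check it against psl.c before relying on it. The `Psl` record, its field names, and the explicit `is_protein` flag (the C code infers this from the alignment itself) are simplifications introduced for the example.

```python
import math
from typing import NamedTuple

class Psl(NamedTuple):
    """Subset of PSL fields used by the calculations (hypothetical record)."""
    match: int
    mis_match: int
    rep_match: int
    q_num_insert: int
    t_num_insert: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    is_protein: bool = False  # simplification: passed in explicitly

def psl_score(psl):
    """Score: matches (repeat matches count half) minus mismatches and inserts."""
    size_mul = 3 if psl.is_protein else 1
    return (size_mul * (psl.match + (psl.rep_match >> 1))
            - size_mul * psl.mis_match
            - psl.q_num_insert - psl.t_num_insert)

def psl_calc_milli_bad(psl, is_mrna=True):
    """Divergence in parts per thousand (assumed form of pslCalcMilliBad)."""
    size_mul = 3 if psl.is_protein else 1
    q_ali_size = size_mul * (psl.q_end - psl.q_start)
    t_ali_size = psl.t_end - psl.t_start
    if min(q_ali_size, t_ali_size) <= 0:
        return 0
    size_dif = q_ali_size - t_ali_size
    if size_dif < 0:
        size_dif = 0 if is_mrna else -size_dif
    insert_factor = psl.q_num_insert
    if not is_mrna:
        insert_factor += psl.t_num_insert
    total = size_mul * (psl.match + psl.rep_match + psl.mis_match)
    if total == 0:
        return 0
    # floor(x + 0.5) mimics C's round() for non-negative values.
    size_penalty = math.floor(3 * math.log(1 + size_dif) + 0.5)
    bad = psl.mis_match * size_mul + insert_factor + size_penalty
    return (1000 * bad) // total

def percent_identity(psl, is_mrna=True):
    """Percent identity derived from the milli-bad value."""
    return 100.0 - psl_calc_milli_bad(psl, is_mrna) * 0.1
```

As a quick check, an ungapped 100-base alignment with 95 matches and 5 mismatches gives a score of 90 and a percent identity of 95 under these formulas.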
Replicating web-based Blat "I'm feeling lucky" search results
Question:
"How do I generate the same search results as web-based
Blat's "I'm feeling lucky" option using
command-line blat?"
Response:
The code for the "I'm feeling lucky" Blat
search orders the results based on the sort output
option that you selected on the query page. It then
returns the highest-scoring alignment of the first
query sequence.
If you are sorting results by "query, start"
or "chrom, start", generating the "I'm
feeling lucky" result is straightforward:
sort the output file by these columns, then select the
top result.
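The two position-based sorts can be sketched in Python as follows. The function names are hypothetical. The column positions follow the standard 21-column PSL layout (qName, qStart, tName, and tStart at zero-based indexes 9, 11, 13, and 15), and header lines are skipped by assuming that every alignment row starts with a number. Chromosome names are compared as plain strings here, which may not match the ordering used by the web interface.

```python
def parse_psl_line(line):
    """Extract the sort-relevant fields from one tab-separated PSL row."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qName": fields[9],
        "qStart": int(fields[11]),
        "tName": fields[13],
        "tStart": int(fields[15]),
        "line": line,
    }

def lucky(psl_lines, sort="query,start"):
    """Return the top row after sorting by 'query,start' or 'chrom,start'."""
    rows = [parse_psl_line(line) for line in psl_lines
            # Assumption: alignment rows begin with a numeric match count,
            # while header and separator lines do not.
            if line.split("\t")[0].strip().isdigit()]
    if not rows:
        return None
    if sort == "query,start":
        key = lambda r: (r["qName"], r["qStart"])
    elif sort == "chrom,start":
        key = lambda r: (r["tName"], r["tStart"])
    else:
        raise ValueError("unsupported sort option: " + sort)
    return min(rows, key=key)
```

Typical use would be `lucky(open("output.psl"))`, which returns a small dictionary describing the first row in the chosen order, or `None` if the file contains no alignments.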
To replicate any of the sort options involving score,
you must first calculate the score for each result in
your PSL output file, then sort the results by score or
by a combination that includes it (e.g. "query,
score" or "chrom, score").
See the section on Replicating
web-based Blat percent identity and score
calculations for information on calculating the
score.
Alternatively, you can try filtering your Blat PSL
output using either the pslReps or
pslCDnaFilter program available in the Genome
Browser source code. For information on obtaining the
source code, see our FAQ
on source code licensing and downloads.
Using Blat for short sequences with maximum sensitivity
Question:
"How do I configure blat for short sequences with
maximum sensitivity?"
Response:
Here are some guidelines for configuring standalone
blat and gfServer/gfClient for these conditions:
- The formula to find the shortest query size that will
  guarantee a match (if matching tiles are not marked as
  overused) is: 2 * stepSize + tileSize - 1
  For example, with stepSize set to 5 and
  tileSize set to 11, matches of query size
  2 * 5 + 11 - 1 = 20 bp will be found if the query matches the target exactly.
  The stepSize parameter can range from 1 to tileSize.
  The tileSize parameter can range from 6 to 15. For protein, the
  range starts lower.
  For minMatch=1 (e.g., protein), the minimum guaranteed match length is:
  1 * stepSize + tileSize - 1
- Try using -fine.
- Use a large value for repMatch (e.g. -repMatch=1000000)
  to reduce the chance of a tile being marked as over-used.
- Do not use an .ooc file.
- Do not use -fastMap.
- Do not use masking command-line options.
The above changes will make Blat more sensitive, but
will also reduce speed and increase memory usage.
It may be necessary to process one chromosome
at a time to reduce the memory requirements.
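The query-size formula from the first guideline can be written as a small Python helper. The function name is hypothetical, and only the two minMatch cases given above (1 and 2) are accepted, since the guidelines do not state a formula for other values.

```python
def min_guaranteed_query_size(step_size, tile_size, min_match=2):
    """Shortest exact-match query guaranteed to be found, assuming no
    matching tile is marked as overused.

    min_match=2 gives 2 * stepSize + tileSize - 1;
    min_match=1 gives 1 * stepSize + tileSize - 1.
    """
    if min_match not in (1, 2):
        raise ValueError("only minMatch values 1 and 2 are covered here")
    if not 1 <= step_size <= tile_size:
        raise ValueError("stepSize must be between 1 and tileSize")
    return min_match * step_size + tile_size - 1

print(min_guaranteed_query_size(step_size=5, tile_size=11))
print(min_guaranteed_query_size(step_size=1, tile_size=11))
```

The first call reproduces the 20 bp worked example above; the second shows how lowering stepSize to 1 shortens the guaranteed query size, at the cost of a larger index.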
A note on filtering output: increasing the
-minScore parameter value beyond one-half of
the query size has no further effect. Therefore, use
either the pslReps or pslCDnaFilter
program available in the Genome Browser source code to
filter for the size, score, coverage, or quality
desired. For information on obtaining the
source code, see our FAQ
on source code licensing and downloads.