Tool: Blast Protein
Blast Protein runs a protein sequence similarity search
using a BLAST web service hosted by the UCSF
Resource for Biocomputing, Visualization, and Informatics (RBVI).
One use is to search with a target sequence of unknown structure
to find templates for comparative modeling.
The related tool Foldseek
(Similar Structures) can also search with BLAST and other methods,
but only using a structure chain as the query; it facilitates
exploring large sets of similar structures by efficiently
showing them in 3D as backbone traces and in 2D as sequence alignment
schematics or scatter plots based on conformation.
See also: AlphaFold, ESMFold, Matchmaker, Task Manager
The Blast Protein tool can be started from the
Sequence section of the Tools menu, or by using the
Sequence Viewer context menu.
It can be manipulated like other panels.
It is also implemented as the
blastprotein command.
See also: alphafold search, esmfold search
Search Parameters
- Matrix – amino acid substitution matrix to use for alignment scoring:
  - BLOSUM45
  - BLOSUM50
  - BLOSUM62 (default)
  - BLOSUM80
  - BLOSUM90
  - PAM30
  - PAM70
  - PAM250
  - IDENTITY
- Cutoff (default 1e-3) – significance cutoff;
only hits with
E-value
no larger than the specified value will be returned
- # Sequences (default 100)
– maximum number of unique sequences to return; more hits than this number
may be obtained because multiple structures or other sequence-database entries
may have the same sequence
- Query – amino acid sequence with which to search the database,
  specified as one of the following:
  - UniProt ID – a UniProt name or accession number
  - Raw Sequence – amino acid sequence as plain text
    pasted into the entry field
  - a single protein chain in an atomic structure currently open in ChimeraX
    (available chains will be listed individually)
- Database – protein sequence database to search:
  - pdb (default) – experimentally determined structures in the
    Protein Data Bank (PDB)
  - nr – NCBI “non-redundant” database containing
    GenBank CDS translations + PDB + SwissProt + PIR + PRF,
    excluding environmental samples from whole-genome sequencing; this database
    is much larger than PDB alone and takes much longer to search
  - alphafold – artificial-intelligence-predicted structures in the
    AlphaFold Database
  - uniref100 – UniProt Reference (UniRef) cluster at 100% identity
    (identical sequences and subfragments collapsed into a single entry)
  - uniref90 – based on uniref100, but omitting
    sequences shorter than 11 residues and clustering at 90% identity
  - uniref50 – based on uniref100, but omitting
    sequences shorter than 11 residues and clustering at 50% identity
  - esmfold – artificial-intelligence-predicted structures in the
    ESM Metagenomic Atlas
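The difference between the # Sequences limit (unique sequences) and the number of hits returned can be illustrated with a small sketch. It assumes hits are available as (name, sequence) pairs, which is a hypothetical representation for illustration only and not the web service's actual output format; the entry names and sequences are made up.

```python
def group_hits_by_sequence(hits):
    """Group (name, sequence) hits so entries sharing a sequence are counted once."""
    groups = {}
    for name, sequence in hits:
        groups.setdefault(sequence, []).append(name)
    return groups

# Hypothetical hits: two database entries share the same sequence.
example_hits = [
    ("entry1_A", "MKTAYIAK"),
    ("entry1_B", "MKTAYIAK"),
    ("entry2_A", "MKTSYIAR"),
]
groups = group_hits_by_sequence(example_hits)
unique_count = len(groups)     # what a unique-sequence limit would count
hit_count = len(example_hits)  # rows that could appear in the table of hits
```

Here two unique sequences give three hits, which is why the table of hits may be longer than the # Sequences value.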
Clicking Apply (or OK, which also dismisses the dialog)
runs the search, whereas Close dismisses the dialog without
starting a search. Help opens this page in the
Help Viewer, and
Reset restores the parameters to factory default settings.
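Because the tool is also implemented as the blastprotein command, the same parameters can be assembled into a command string. The following is a minimal sketch, not the tool's own code: the helper name is hypothetical, and the keyword names (database, matrix, cutoff, maxSeqs) and the chain specifier #1/A are assumptions that should be checked against the blastprotein command documentation.

```python
# Allowed values copied from the parameter lists above.
ALLOWED_MATRICES = {
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
    "PAM30", "PAM70", "PAM250", "IDENTITY",
}
ALLOWED_DATABASES = {
    "pdb", "nr", "alphafold", "uniref100", "uniref90", "uniref50", "esmfold",
}

def build_blastprotein_command(query, database="pdb", matrix="BLOSUM62",
                               cutoff=1e-3, max_seqs=100):
    """Return a command string (hypothetical helper; keyword names are assumed)."""
    if matrix not in ALLOWED_MATRICES:
        raise ValueError(f"unknown matrix: {matrix}")
    if database not in ALLOWED_DATABASES:
        raise ValueError(f"unknown database: {database}")
    if cutoff <= 0:
        raise ValueError("cutoff must be a positive E-value")
    if max_seqs < 1:
        raise ValueError("max_seqs must be at least 1")
    return (f"blastprotein {query} database {database} matrix {matrix} "
            f"cutoff {cutoff:g} maxSeqs {max_seqs}")

# Defaults mirror the factory defaults listed above.
command = build_blastprotein_command("#1/A")
```

Inside ChimeraX, a string like this could be passed to `chimerax.core.commands.run(session, command)`; outside ChimeraX the sketch only builds and validates the text.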
BLAST Protein Results
When results are returned, the table of hits is shown in a separate window.
These results are saved in
ChimeraX sessions.
If you wish to prevent the results from docking into the main window
(which may resize it), see Tool windows start undocked in the
Window preferences.
Checkboxes in the bottom section of the panel control which columns of
information are shown in the table of hits, with buttons:
- All – show all columns
- Default – restore the previously saved
preference for which columns are shown
- Standard – use “factory default” set of
initially displayed columns
- Set Default – save the currently shown set as a
user preference
- Toggle Controls – show/hide the checkbox section
List only best-matching chain per PDB entry –
searches of the PDB usually give multiple hits per PDB entry
(to multiple chains in that entry, typically redundant because
the structure contains multiple copies of the same protein).
This option allows collapsing the results list to show
only a single hit per PDB entry, the one that best matches the query
according to its BLAST score. If multiple chains from the same PDB
entry have identical scores, the first in the list is retained.
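The collapsing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: hits are represented as dictionaries in list order, and the hit name is assumed to have the form `<entry>_<chain>` (for example 1abc_A), which may differ from the actual name format.

```python
def best_chain_per_entry(hits):
    """Keep one hit per PDB entry: the highest score, first in the list on ties."""
    best = {}
    for hit in hits:
        entry_id = hit["name"].split("_")[0]  # assumed "<entry>_<chain>" format
        current = best.get(entry_id)
        # Strict comparison: an equal score never replaces an earlier hit.
        if current is None or hit["score"] > current["score"]:
            best[entry_id] = hit
    return list(best.values())

# Hypothetical hits for illustration.
example_hits = [
    {"name": "1abc_A", "score": 200},
    {"name": "1abc_B", "score": 200},  # tie: the earlier 1abc_A is retained
    {"name": "2xyz_B", "score": 150},
    {"name": "2xyz_A", "score": 180},  # higher score replaces 2xyz_B
]
collapsed = best_chain_per_entry(example_hits)
```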
Clicking a column header sorts by the values in that column.
Double-clicking a row with a corresponding structure
fetches it,
and if a structure chain was used as the query,
automatically superimposes the hit onto the query
using matchmaker.
If the query was sequence-only (not a structure chain), the first
structure opened from the results will serve as the reference for
superimposing the others.
AlphaFold-predicted structures are colored by confidence (0-100).
ESMFold-predicted structures are colored by confidence (0-1).
One or more hits can be chosen (highlighted) in the list
by clicking and dragging with the left mouse button;
Ctrl-click (or command-click if using a Mac)
toggles whether a row is chosen.
The result panel's context menu
or the buttons across the bottom of the results dialog
can be used to:
- Load Structures
– fetch and superimpose all of the corresponding structures, if any,
for the chosen hits
- Show Sequence Alignment
– show the multiple sequence alignment of the
chosen hits with the query in the
Sequence Viewer
- Open Database Webpage
– show sequence database webpages for the chosen hits
Regardless of which hits are chosen and which columns are shown,
clicking Save Results as TSV brings up a file browser to save
the entire set of results as a tab-separated values file (*.tsv).
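A saved results file can be post-processed outside ChimeraX. The sketch below reads tab-separated text with Python's csv module and keeps hits whose E-value is no larger than a cutoff. It assumes a single header row whose column names match the table columns (such as Name and E-Value); the actual header text in a saved file may differ, and the sample rows are made up.

```python
import csv
import io

def hits_within_cutoff(tsv_text, cutoff=1e-3, evalue_column="E-Value"):
    """Return rows whose E-value is no larger than the cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[evalue_column]) <= cutoff]

# Made-up content standing in for a saved *.tsv file.
example_tsv = (
    "Hit #\tName\tE-Value\tScore\n"
    "1\t1abc_A\t1e-50\t200\n"
    "2\t2xyz_B\t0.5\t30\n"
)
kept = hits_within_cutoff(example_tsv)
kept_names = [row["Name"] for row in kept]
```

For a real file, the text could be obtained with `open(path, newline="").read()` before calling the function.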
Some columns of data are available no matter which database is searched:
- Hit # – hit number (sorted by most significant
E-value)
- Name
  - for PDB sequences, the corresponding PDB identifier, including the chain
  - for AlphaFold predictions or UniRef entries, the corresponding
    UniProt entry name
  - otherwise, sequence GI number
- E-Value – significance of the hit; BLAST
Expect value
- Score – alignment score
- Title – a brief description of the sequence
- Species – source organism,
or for UniRef entries, the taxonomic range of source organisms in the
UniRef cluster (typically broader than a single species)
Additional columns for AlphaFold entries:
Additional columns for PDB entries (from searching PDB or other databases
that include it, such as NR):
- # Atoms
– total number of atoms in the structure (all chains)
- # Polymers
– number of different polymer chains in the structure
(not counting multiple copies of the same sequence)
- # Residues
– total number of residues in the structure (all chains)
- Authors – structure authors
- Chain Copies
– number of copies of the hit chain in the structure
- Chain Names
– chain identifiers and descriptions of polymer chains in the structure
- Chain Residues
– number of residues in the hit chain
- Chain Weight
– molecular weight of the hit chain
- Date – structure deposition date
- Ligand Formulas – chemical formulae of ligand chemical components
- Ligand Names – names of ligand chemical components
- Ligand Smiles –
SMILES strings of ligand chemical components
- Ligand Symbols – residue names of ligand chemical components
- Ligand Weights
– molecular weights of ligand chemical components
- Method – method of structure determination
- PubMed ID – PubMed identifier of literature reference, if any
- Resolution – crystallographic resolution
- UniProt ID
– UniProt accession number, if any, for the hit chain
Additional columns for UniRef entries:
Notes
Pseudo-multiple alignment.
The pseudo-multiple alignment from BLAST is not a true multiple alignment,
but a consolidation of the pairwise alignments of individual hits to the query.
This output corresponds to the BLAST formatting option
(alignment view)
“flat query-anchored with letters for identities.”
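The consolidation of pairwise alignments into a query-anchored layout can be sketched in a few lines. This is a simplified illustration of the idea, not BLAST's own formatter: the function name and input representation are hypothetical, and query positions outside a hit's aligned range are padded with "-" rather than distinguished from true gaps.

```python
def pseudo_multiple_alignment(query, hits):
    """Merge pairwise alignments into query-anchored rows of equal length.

    query: ungapped query sequence.
    hits: list of (q_start, q_aln, h_aln); q_start is the 0-based query index
          of the first aligned query residue, and q_aln/h_aln are the gapped
          query and hit strings from one pairwise alignment.
    Returns a list of rows; the first row is the query.
    """
    length = len(query)
    max_ins = [0] * length  # widest hit insertion after each query position
    parsed = []
    for q_start, q_aln, h_aln in hits:
        if len(q_aln) != len(h_aln):
            raise ValueError("aligned strings must have equal length")
        residues, insertions = {}, {}
        pos = q_start - 1
        for q_char, h_char in zip(q_aln, h_aln):
            if q_char != "-":
                pos += 1  # advance to the next query residue
            if not 0 <= pos < length:
                raise ValueError("alignment falls outside the query")
            if q_char == "-":
                # Hit residue with no query counterpart: an insertion.
                insertions[pos] = insertions.get(pos, "") + h_char
            else:
                residues[pos] = h_char
        for p, ins in insertions.items():
            max_ins[p] = max(max_ins[p], len(ins))
        parsed.append((residues, insertions))

    def row(residues, insertions):
        # One cell per query position, padded to the widest insertion there.
        return "".join(
            residues.get(i, "-") + insertions.get(i, "").ljust(max_ins[i], "-")
            for i in range(length)
        )

    return [row(dict(enumerate(query)), {})] + [row(r, i) for r, i in parsed]

# Made-up example: the first hit has an insertion, the second a deletion.
rows = pseudo_multiple_alignment(
    "ACDE", [(0, "AC-DE", "ACWDE"), (1, "CDE", "C-E")]
)
```

Each hit is only ever compared with the query, never with the other hits, which is why the result is not a true multiple alignment.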
Basic Local Alignment Search Tool (BLAST).
The BLAST
software is provided by the
NCBI and described in:
Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.
Basic local alignment search tool.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ.
J Mol Biol. 1990 Oct 5;215(3):403-10.
UCSF Resource for Biocomputing, Visualization, and Informatics /
November 2024