Tool: Blast Protein

Blast Protein runs a protein sequence similarity search using a BLAST web service hosted by the UCSF Resource for Biocomputing, Visualization, and Informatics (RBVI). One use is to search with a target sequence of unknown structure to find templates for comparative modeling.

The related tool Foldseek (Similar Structures) can also search with BLAST and other methods, but only using a structure chain as the query; it facilitates exploring large sets of similar structures by efficiently showing them in 3D as backbone traces and in 2D as sequence alignment schematics or scatter plots based on conformation. See also: AlphaFold, ESMFold, Matchmaker, Task Manager

The Blast Protein tool can be started from the Sequence section of the Tools menu, or by using the Sequence Viewer context menu. It can be manipulated like other panels. It is also implemented as the blastprotein command. See also: alphafold search, esmfold search

Search Parameters

Matrix – amino acid substitution matrix to use for alignment scoring:
- BLOSUM45
- BLOSUM50
- BLOSUM62 (default)
- BLOSUM80
- BLOSUM90
- PAM30
- PAM70
- PAM250
- IDENTITY
Cutoff (default 1e-3) – significance cutoff; only hits with E-value no larger than the specified value will be returned
# Sequences (default 100) – maximum number of unique sequences to return; more hits than this number may be obtained because multiple structures or other sequence-database entries may have the same sequence
Query – amino acid sequence with which to search the database, specified as one of the following:
- UniProt ID – a UniProt name or accession number
- Raw Sequence – amino acid sequence as plain text pasted into the entry field
- a single protein chain in an atomic structure currently open in ChimeraX (available chains will be listed individually)
Database – protein sequence database to search:
- pdb (default) – experimentally determined structures in the Protein Data Bank (PDB)
- nr – NCBI “non-redundant” database containing GenBank CDS translations + PDB + SwissProt + PIR + PRF excluding environmental samples from whole-genome sequencing; this database is much larger than PDB alone and takes much longer to search
- alphafold – artificial-intelligence-predicted structures in the AlphaFold Database
- uniref100 – UniProt Reference (UniRef) cluster at 100% identity (identical sequences and subfragments collapsed into a single entry)
- uniref90 – based on uniref100, but omitting sequences shorter than 11 residues and clustering at 90% identity
- uniref50 – based on uniref100, but omitting sequences shorter than 11 residues and clustering at 50% identity
- esmfold – artificial-intelligence-predicted structures in the ESM Metagenomic Atlas
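
The interaction between Cutoff and # Sequences can be illustrated with a small Python sketch. It is only a model of the behavior described above, not the web service's implementation; the Hit type, the select_hits function, and the sample identifiers and sequences are all made up for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str       # database identifier (made-up sample values below)
    sequence: str   # hit sequence
    evalue: float   # BLAST Expect value


def select_hits(hits, cutoff=1e-3, max_unique=100):
    """Keep hits with E-value <= cutoff, limited to max_unique distinct sequences."""
    kept = []
    seen = set()
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > cutoff:
            break  # sorted by E-value, so all remaining hits also fail
        if hit.sequence not in seen:
            if len(seen) == max_unique:
                continue  # a new sequence beyond the limit is skipped
            seen.add(hit.sequence)
        kept.append(hit)
    return kept


hits = [
    Hit("1abc_A", "MKTAYIAK", 1e-40),
    Hit("1abc_B", "MKTAYIAK", 1e-40),   # same sequence as chain A
    Hit("2xyz_A", "MKTAYLAK", 1e-20),
    Hit("3pqr_A", "MKSAYLAR", 1e-5),
    Hit("4stu_A", "MRSGYLAR", 0.5),     # fails the default cutoff
]
selected = select_hits(hits, cutoff=1e-3, max_unique=2)
print([h.name for h in selected])
```

With a limit of two unique sequences, three hits are kept, because two database entries share one sequence.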

Clicking Apply (or OK, which also dismisses the dialog) runs the search, whereas Close dismisses the dialog without starting a search. Help opens this page in the Help Viewer, and Reset restores the parameters to factory default settings.

BLAST Protein Results

When results are returned, the table of hits is shown in a separate window. These results are saved in ChimeraX sessions. If you wish to prevent the results from docking into the main window (which may resize it), see Tool windows start undocked in the Window preferences.

Checkboxes in the bottom section of the panel control which columns of information are shown in the table of hits. The accompanying buttons are:

All – show all columns
Default – restore the previously saved preference for which columns are shown
Standard – use “factory default” set of initially displayed columns
Set Default – save the currently shown set as a user preference
Toggle Controls – show/hide the checkbox section

List only best-matching chain per PDB entry – searches of the PDB usually give multiple hits per PDB entry, one for each matching chain; these are typically redundant because the structure contains multiple copies of the same protein. This option collapses the results list to a single hit per PDB entry, the one that best matches the query according to its BLAST score. If multiple chains from the same PDB entry have identical scores, the first in the list is retained.
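
The collapsing rule can be sketched in a few lines of Python. This is an illustration of the rule as stated, not ChimeraX's own code; the function name best_chain_per_entry and the "entry_chain" name format (such as "1abc_A") are assumptions, and the sample hits are made up.

```python
def best_chain_per_entry(hits):
    """Collapse (name, score) hits to one per PDB entry, keeping the top score.

    Names are assumed to look like "1abc_A" (entry ID, underscore, chain ID).
    """
    best = {}
    for name, score in hits:
        entry = name.split("_")[0]
        # Strict ">" keeps the first-listed chain when scores are identical.
        if entry not in best or score > best[entry][1]:
            best[entry] = (name, score)
    return list(best.values())


sample_hits = [
    ("1abc_A", 250.0),
    ("1abc_B", 250.0),  # tie with chain A: chain A is retained
    ("2xyz_B", 180.0),
    ("2xyz_A", 190.0),  # higher score replaces chain B
]
collapsed = best_chain_per_entry(sample_hits)
print(collapsed)
```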

Clicking a column header sorts by the values in that column.

Double-clicking a row with a corresponding structure fetches it, and if a structure chain was used as the query, automatically superimposes the hit onto the query using matchmaker. If the query was sequence-only (not a structure chain), the first structure opened from the results will serve as the reference for superimposing the others. AlphaFold-predicted structures are colored by confidence on a 0-100 scale. ESMFold-predicted structures are colored by confidence on a 0-1 scale.

One or more hits can be chosen (highlighted) in the list by clicking and dragging with the left mouse button; Ctrl-click (or command-click if using a Mac) toggles whether a row is chosen. The result panel's context menu or the buttons across the bottom of the results dialog can be used to:

Load Structures – fetch and superimpose all of the corresponding structures, if any, for the chosen hits
Show Sequence Alignment – show the multiple sequence alignment of the chosen hits with the query in the Sequence Viewer
Open Database Webpage – show sequence database webpages for the chosen hits

Regardless of which hits are chosen and which columns are shown, clicking Save Results as TSV brings up a file browser to save the entire set of results as a tab-separated values file (*.tsv).
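
A saved file can then be processed outside ChimeraX. The following Python sketch reads tab-separated results with the standard csv module; the header names are assumed to match the column names listed below, and the sample rows are made up, so check the header line of an actual saved file before relying on them.

```python
import csv
import io

# Made-up sample standing in for a saved results file; real files may
# contain more columns and differently spelled headers.
sample_tsv = (
    "Hit #\tName\tE-Value\tScore\n"
    "1\t1abc_A\t1e-40\t250\n"
    "2\t2xyz_A\t1e-20\t180\n"
    "3\t3pqr_A\t1e-05\t90\n"
)


def load_hits(handle):
    """Read tab-separated hits, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["E-Value"] = float(row["E-Value"])
        row["Score"] = float(row["Score"])
        rows.append(row)
    return rows


hits = load_hits(io.StringIO(sample_tsv))
strong = [h["Name"] for h in hits if h["E-Value"] <= 1e-10]
print(strong)
```

To read a file on disk instead of the in-memory sample, pass a handle opened with open(path, newline="") to load_hits.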

Some columns of data are available no matter which database is searched:

Hit # – hit number (sorted by most significant E-value)
Name
- for PDB sequences, the corresponding PDB identifier, including the chain
- for AlphaFold predictions or UniRef entries, the corresponding UniProt entry name
- otherwise, sequence GI number
E-Value – significance of the hit; BLAST Expect value
Score – alignment score
Title – a brief description of the sequence
Species – source organism, or for UniRef entries, the taxonomic range of source organisms in the UniRef cluster (typically broader than a single species)

Additional columns for AlphaFold entries:

Gene – gene name
Protein Existence – type of evidence that the protein exists
Sequence Version – version number
Taxonomic ID – NCBI taxid of the source organism

Additional columns for PDB entries (from searching pdb or other databases that include it, such as nr):

# Atoms – total number of atoms in the structure (all chains)
# Polymers – number of different polymer chains in the structure (not counting multiple copies of the same sequence)
# Residues – total number of residues in the structure (all chains)
Authors – structure authors
Chain Copies – number of copies of the hit chain in the structure
Chain Names – chain identifiers and descriptions of polymer chains in the structure
Chain Residues – number of residues in the hit chain
Chain Weight – molecular weight of the hit chain
Date – structure deposition date
Ligand Formulas – chemical formulae of ligand chemical components
Ligand Names – names of ligand chemical components
Ligand Smiles – SMILES strings of ligand chemical components
Ligand Symbols – residue names of ligand chemical components
Ligand Weights – molecular weights of ligand chemical components
Method – method of structure determination
PubMed ID – PubMed identifier of literature reference, if any
Resolution – crystallographic resolution
UniProt ID – UniProt accession number, if any, for the hit chain

Additional columns for UniRef entries:

Cluster ID – ID number of the UniRef cluster represented by the hit
Cluster Size – number of sequences in the UniRef cluster represented by the hit
UniProt ID – UniProt accession number, if any, for the hit chain

Notes

Pseudo-multiple alignment. The pseudo-multiple alignment from BLAST is not a true multiple alignment, but a consolidation of the pairwise alignments of individual hits to the query. This output corresponds to the BLAST formatting option (alignment view) “flat query-anchored with letters for identities.”
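
How pairwise alignments can be consolidated around the query is sketched below in Python. This is a simplified illustration, not the BLAST implementation: the function name flat_query_anchored and the input format are assumptions, the sample sequences are made up, and regions of a hit outside its aligned segment are simply filled with "-".

```python
def flat_query_anchored(query, pairwise):
    """Merge pairwise alignments into query-anchored rows of equal length.

    pairwise: list of (q_start, query_aln, hit_aln), where q_start is the
    0-based query index of the first aligned residue and "-" marks a gap.
    """
    n = len(query)
    parsed = []
    max_ins = [0] * (n + 1)  # widest insertion before each query position
    for q_start, query_aln, hit_aln in pairwise:
        matched = {}   # query position -> hit character (may be "-")
        inserted = {}  # query position -> hit residues inserted before it
        pos = q_start
        for q_char, h_char in zip(query_aln, hit_aln):
            if q_char == "-":
                inserted[pos] = inserted.get(pos, "") + h_char
            else:
                matched[pos] = h_char
                pos += 1
        for p, residues in inserted.items():
            max_ins[p] = max(max_ins[p], len(residues))
        parsed.append((matched, inserted))

    def build_row(matched, inserted):
        cells = []
        for p in range(n):
            # Pad each insertion slot to the widest insertion seen there.
            cells.append(inserted.get(p, "").ljust(max_ins[p], "-"))
            cells.append(matched.get(p, "-"))
        cells.append(inserted.get(n, "").ljust(max_ins[n], "-"))
        return "".join(cells)

    query_row = build_row(dict(enumerate(query)), {})
    return [query_row] + [build_row(m, i) for m, i in parsed]


rows = flat_query_anchored(
    "ACDEFG",
    [
        (1, "CDEF", "CDQF"),    # substitution E -> Q
        (0, "AC-DE", "ACWDE"),  # W inserted relative to the query
    ],
)
print("\n".join(rows))
```

Each hit is aligned only to the query, so two hits that share a column are not necessarily well aligned to each other, which is why the result is not a true multiple alignment.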

Basic Local Alignment Search Tool (BLAST). The BLAST software is provided by the NCBI and described in:

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.

Basic local alignment search tool. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol. 1990 Oct 5;215(3):403-10.

UCSF Resource for Biocomputing, Visualization, and Informatics / November 2024