Writing mmCIF Files in ChimeraX¶
Date: April 2018
Introduction¶
There are several goals for ChimeraX’s mmCIF writer:
write out the information that ChimeraX wants to read
output inferred parts of mmCIF (so other mmCIF readers don’t have to infer the same things)
use stylized output (for readability and fast reading)
output should pass the PDB’s online deposition validator
output should pass the PDB’s mmCIF validation software
full connectivity should optionally be generated (not done)
What ChimeraX Wants¶
What ChimeraX wants to read from a mmCIF file is documented in ChimeraX Fast mmCIF Guidelines. Saving connectivity is a major issue and is discussed separately.
Inferred mmCIF Tables¶
Many data relationships in a mmCIF file can be inferred. For example, the chain identifier to entity identifier mapping (the struct_asym table) can be computed from the contents of the atom_site table. Consequently, ChimeraX’s mmCIF reader skips a table when it can infer all of the information that table would provide. Other applications’ mmCIF readers might not compute the same information, so ChimeraX outputs the tables it infers for completeness. These tables are also needed for the mmCIF files to validate.
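As a minimal sketch of this kind of inference (not ChimeraX's actual code), the following derives struct_asym rows from atom_site rows. The function name `infer_struct_asym` and the use of plain dictionaries for rows are assumptions made for illustration only.

```python
def infer_struct_asym(atom_site_rows):
    """Return (asym_id, entity_id) pairs in order of first appearance.

    atom_site_rows is assumed to be an iterable of dicts keyed by
    atom_site keyword (a hypothetical in-memory representation).
    """
    mapping = {}  # dicts keep insertion order in Python 3.7+
    for row in atom_site_rows:
        asym_id = row["label_asym_id"]
        entity_id = row["label_entity_id"]
        previous = mapping.setdefault(asym_id, entity_id)
        if previous != entity_id:
            # A chain should belong to exactly one entity.
            raise ValueError(
                f"chain {asym_id} maps to entities {previous} and {entity_id}"
            )
    return list(mapping.items())


atom_site = [
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "B", "label_entity_id": "1"},
    {"label_asym_id": "C", "label_entity_id": "2"},
]
struct_asym = infer_struct_asym(atom_site)
for asym_id, entity_id in struct_asym:
    print(asym_id, entity_id)
```

Each resulting pair corresponds to one struct_asym row with its id and entity_id keywords.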
Stylized Output¶
As shown in Benchmarking readcif, stylized PDBx/mmCIF output can be read faster than unstylized output. Fixed-column-width tables are also easier to scan visually for interesting values. ChimeraX outputs chimerax_audit_syntax.case_sensitive_flag as Y to indicate that all keywords are lowercase and appear at the beginning of a line. It also outputs chimerax_audit_syntax.fixed_width listing just the atom_site and atom_site_anisotrop tables, since those are typically the largest tables in the mmCIF file.
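To illustrate what fixed-width output looks like, here is a small sketch that pads every column of a table to the width of its widest value, so each data row has the same length. The helper `format_fixed_width` is hypothetical and is not how ChimeraX implements its writer; it also assumes a non-empty table whose values contain no whitespace, so no quoting is needed.

```python
def format_fixed_width(category, keywords, rows):
    """Format rows as a CIF loop with space-padded, fixed-width columns."""
    # Width of each column is the length of its longest value.
    widths = [
        max(len(str(row[i])) for row in rows) for i in range(len(keywords))
    ]
    lines = ["loop_"]
    lines += [f"_{category}.{keyword}" for keyword in keywords]
    for row in rows:
        cells = [str(value).ljust(width) for value, width in zip(row, widths)]
        # Keep the trailing padding so every row has identical length.
        lines.append(" ".join(cells))
    return "\n".join(lines)


keywords = [
    "group_PDB", "id", "label_atom_id", "label_comp_id",
    "label_asym_id", "label_seq_id", "Cartn_x",
]
rows = [
    ("ATOM", 1, "N", "MET", "A", 1, "27.340"),
    ("ATOM", 10, "CA", "MET", "A", 1, "26.266"),
]
text = format_fixed_width("atom_site", keywords, rows)
print(text)
```

Because every row has the same length and each column starts at the same offset, a reader can locate a value by position instead of tokenizing each line.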
Validation¶
We tested the mmCIF output in two ways:
Using the Worldwide PDB’s (wwPDB) online validator
Validating the mmCIF file against the associated mmCIF dictionary
There are several software packages that will try to validate a mmCIF file using the associated dictionary. We used the mmCIF Dictionary Suite from the wwPDB, since it supports the current mmCIF dictionary (version 5).
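One of the checks a dictionary-based validator performs is that every mandatory item of a category is present. The sketch below shows the idea only; it does not use the mmCIF Dictionary Suite, and the `MANDATORY_ITEMS` table is a tiny hand-written stand-in for the real dictionary (it lists only pdbx_src_id, which this document identifies as mandatory in entity_src_gen and entity_src_nat). The function name and the table representation are assumptions.

```python
# Hand-written stand-in for the dictionary (an assumption, not the real
# mmCIF dictionary): category name -> set of mandatory keywords.
MANDATORY_ITEMS = {
    "entity_src_gen": {"pdbx_src_id"},
    "entity_src_nat": {"pdbx_src_id"},
}


def missing_mandatory_items(tables, mandatory=MANDATORY_ITEMS):
    """Return {category: [missing keywords]} for the given tables.

    tables maps a category name to the list of keywords present in it.
    """
    problems = {}
    for category, keywords in tables.items():
        missing = mandatory.get(category, set()) - set(keywords)
        if missing:
            problems[category] = sorted(missing)
    return problems


tables = {
    "entity_src_gen": ["entity_id", "pdbx_gene_src_scientific_name"],
    "struct_asym": ["id", "entity_id"],
}
report = missing_mandatory_items(tables)
print(report)
```

A real validator also checks data types, allowed values, and parent-child relationships between categories, none of which are attempted here.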
Connectivity¶
TODO: full connectivity should optionally be generated
Problems¶
ChimeraX does not save enough information to completely regenerate some of the mmCIF tables it uses.
Heterogeneous information is discarded when reading, so it is not present when writing.
ChimeraX is only concerned about strands, so the sheet information in the struct_sheet_range table is lost. On output, the sheet identifier is given as unknown (?).
In other cases, the original mmCIF table, which is copied verbatim into the output, is non-conforming. For example, in wwPDB-provided mmCIF files, the mandatory item pdbx_src_id is often missing from the entity_src_gen and entity_src_nat tables.
Only the single-letter code for each residue in a chain’s sequence is kept. So, if a residue is not present in any entity with the same sequence, its name defaults to the standard one for that letter.
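The last problem, the residue-name fallback, can be sketched as follows. This is an illustration under stated assumptions rather than ChimeraX's implementation: the function `residue_name`, the `observed_names` mapping (sequence position to the residue name actually seen in a structure), and the "UNK" fallback for unrecognized letters are all hypothetical, and only the 20 standard amino acids are covered.

```python
# Standard three-letter names for the 20 common amino acid codes.
STANDARD_NAMES = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def residue_name(one_letter, position, observed_names):
    """Pick the residue name to write for one sequence position.

    Prefer the name actually observed at that position; otherwise fall
    back to the standard name for the single-letter code.
    """
    if position in observed_names:
        return observed_names[position]
    return STANDARD_NAMES.get(one_letter, "UNK")  # "UNK" is an assumption


sequence = "MKS"

# Position 1 was never observed, so its true name cannot be recovered.
unobserved_first = {2: "LYS", 3: "SER"}
names_default = [
    residue_name(code, pos, unobserved_first)
    for pos, code in enumerate(sequence, start=1)
]

# If position 1 was observed as selenomethionine (MSE), that name is kept.
observed_first = {1: "MSE", 2: "LYS", 3: "SER"}
names_observed = [
    residue_name(code, pos, observed_first)
    for pos, code in enumerate(sequence, start=1)
]
print(names_default, names_observed)
```

In the first case a modified residue such as MSE would be written as MET, which is the information loss described above.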
Generated mmCIF Categories and Keywords¶
For reference, all of the mmCIF categories and keywords that ChimeraX outputs are listed below.
Recognized Data Categories and Keywords¶
Category                  Keywords
atom_type                 symbol
atom_site                 group_PDB, id, label_entity_id, label_asym_id, auth_asym_id, pdbx_PDB_ins_code, label_seq_id, auth_seq_id, label_alt_id, type_symbol, label_atom_id, label_comp_id, Cartn_x, Cartn_y, Cartn_z, occupancy, B_iso_or_equiv, pdbx_PDB_model_num
atom_site_anisotrop       id, U[1][1], U[1][2], U[1][3], U[2][2], U[2][3], U[3][3]
audit_conform             dict_name, dict_version
chimerax_audit_syntax     case_sensitive_flag, fixed_width
cell                      copied from original file
chem_comp                 id, type, name (extracted from original file)
citation                  id, title, journal_abbrev, journal_volume, year, page_first, page_last, journal_issue, pdbx_database_id_PubMed, pdbx_database_id_DOI (merged from original file)
citation_author           citation_id, name, ordinal
entry                     id
entity                    id, type, pdbx_description
entity_poly               entity_id, type, nstd_monomer, pdbx_seq_one_letter_code_can
entity_poly_seq           entity_id, num, mon_id
entity_src_gen            copied from original file
entity_src_nat            copied from original file
pdbx_poly_seq_scheme      entity_id, asym_id, mon_id, seq_id, pdb_strand_id, pdb_seq_num, pdb_ins_code
pdbx_struct_assembly      copied from original file
pdbx_struct_assembly_gen  copied from original file
pdbx_struct_oper_list     copied from original file
software                  name, version, location, classification, os, type, citation_id, pdbx_ordinal
struct_asym               id, entity_id
struct_conf               id, conf_type_id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code
struct_conf_type          id
struct_conn               id, conn_type_id, ptnr1_label_asym_id, ptnr1_auth_asym_id, pdbx_ptnr1_PDB_ins_code, ptnr1_label_seq_id, ptnr1_auth_seq_id, pdbx_ptnr1_label_alt_id, ptnr1_label_atom_id, ptnr1_label_comp_id, ptnr1_symmetry, ptnr2_label_asym_id, ptnr2_auth_asym_id, pdbx_ptnr2_PDB_ins_code, ptnr2_label_seq_id, ptnr2_auth_seq_id, pdbx_ptnr2_label_alt_id, ptnr2_label_atom_id, ptnr2_label_comp_id, ptnr2_symmetry, pdbx_dist_value
struct_conn_type          id
struct_sheet_range        sheet_id, id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, symmetry, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code
symmetry                  copied from original file