Tutorial
This tutorial demonstrates the main functionality of ChemRecon.
[3]:
import chemrecon
1. Connecting to the Database
ChemRecon relies on a database of chemical information, sourced, combined, and processed based on a number of source databases.
A public ChemRecon database is available, hosted by the University of Southern Denmark.
A local Docker container can be hosted based on a ChemRecon database image (TODO link). This is typically faster, because the latency between the application and the database is lower. See the documentation for details.
An advantage of using the local database is that, if ChemRecon has write permissions on the database, certain results and computations can be cached, allowing faster execution. For the public database, we have cached a number of computations for common compounds and reactions.
Alternatively, ChemRecon can connect to an arbitrary PostgreSQL endpoint.
For this tutorial, we connect to a local Docker instance. The calls for the public database and for a custom endpoint are shown as commented-out alternatives.
[4]:
# For this demonstration, we connect to a Docker container running on the local machine.
chemrecon.connect_local_docker()
# Alternatively, we could connect to the public database.
# chemrecon.connect_public()
# It is also possible to connect to the local instance with write (dev) access, which allows caching of results in the DB.
# chemrecon.connect_local_docker_dev()
# Or, connect to an arbitrary endpoint.
# chemrecon.connect(
#     chemrecon.Params(
#         connection_title = 'custom',
#         db_name = 'chemrecon_db',
#         db_host = '1.2.3.4',
#         db_port = '5432',
#         username = 'public',
#         password = '1234'
#     ),
#     can_write = True
# )
[ChemRecon] Attempting to connect with string: postgres://public_user:chemrecon_public_password@localhost:54320/chemrecon_db
[ChemRecon] Connection succesful.
Handler set: postgres://public_user:chemrecon_public_password@localhost:54320/chemrecon_db
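The log output above shows that the connection parameters end up in a PostgreSQL connection URI. As a minimal sketch of how such a URI can be assembled, the following standalone snippet uses a hypothetical `ConnectionParams` dataclass and a hypothetical `build_connection_uri()` helper. Neither is part of ChemRecon, and the library's own construction may differ.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class ConnectionParams:
    """Hypothetical stand-in for the connection parameters shown above."""
    db_name: str
    db_host: str
    db_port: str
    username: str
    password: str


def build_connection_uri(p: ConnectionParams) -> str:
    """Assemble a postgres:// URI (illustrative helper, not ChemRecon API)."""
    # Percent-encode the credentials so that characters such as '@' or ':'
    # cannot be mistaken for URI delimiters.
    user = quote(p.username, safe="")
    pwd = quote(p.password, safe="")
    return f"postgres://{user}:{pwd}@{p.db_host}:{p.db_port}/{p.db_name}"


params = ConnectionParams(
    db_name="chemrecon_db",
    db_host="1.2.3.4",
    db_port="5432",
    username="public",
    password="1234",
)
uri = build_connection_uri(params)
print(uri)
```

Encoding the credentials matters mostly for custom endpoints, where a password may contain reserved characters.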
2. Working with Entries and Relations
The central objects in ChemRecon are entries and relations. The entries and relations page of the documentation provides a detailed description of all types of entries and relations available in ChemRecon.
Let us start with the entry M_cit from the BiGG database, representing the compound citrate. We can look for the entry in the ChemRecon database using the find_entry() function.
Compounds are indexed by an id type (corresponding to the source database) and a source id. Id types are prefixed by C_ for compounds, so C_BIGG is the type denoting compounds in the BiGG database.
[5]:
citrate_entry = chemrecon.find_entry(id_type = chemrecon.C_BIGG, source_id = 'M_cit')
print(citrate_entry)
<Compound 9234: source_id: M_cit, id_type: bigg, name: Citrate>
We can see that this entry is present in the ChemRecon database. We even get some extra information carried over from the BiGG database, such as the name.
The number (9234 in the output above) is a ChemRecon-specific identifier (entry.recon_id). These can be used to identify entries within ChemRecon, but they are not guaranteed to be stable across versions or implementations.
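Because recon_id values are not guaranteed to be stable, it can be safer to persist the (id type, source id) pair when storing references to entries outside ChemRecon. The following toy sketch illustrates that idea with a hypothetical `ToyEntry` dataclass and an in-memory dictionary. It does not use or reproduce the ChemRecon classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyEntry:
    """Hypothetical stand-in for a ChemRecon entry (not the library's class)."""
    recon_id: int          # internal identifier, may change between versions
    id_type: str           # which source database the entry comes from
    source_id: str         # identifier within that source database
    name: Optional[str] = None


toy_entries = [
    ToyEntry(9234, "bigg", "M_cit", "Citrate"),
    ToyEntry(9238, "kegg", "C00158"),
]

# Index on the stable (id_type, source_id) pair instead of recon_id.
toy_index = {(e.id_type, e.source_id): e for e in toy_entries}


def find_toy_entry(id_type: str, source_id: str) -> Optional[ToyEntry]:
    """Return the matching entry, or None if the pair is unknown."""
    return toy_index.get((id_type, source_id))


print(find_toy_entry("bigg", "M_cit"))
print(find_toy_entry("bigg", "M_unknown"))
```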
2.2 Identifying Relations
The power of ChemRecon comes from the ability to relate data across databases. Connections between entries are called relations. For instance, the references between database entries are a type of relation, the CompoundReference. We can get all references for our compound:
[6]:
result = chemrecon.get_relations_from_entry(
    entry = citrate_entry,
    relation_type = chemrecon.CompoundReference
)
# The result is a list of (relation, target_entry) tuples.
# We print them.
for relation, target_entry in result:
    print(f'{relation} \tTarget: {target_entry}')
<CompoundReference 9234 <-> 6059704] src: METANETX> Target: <Compound 6059704: source_id: MNXM1107753, id_type: mnx>
<CompoundReference 9234 <-> 27379493] src: AUTOMATIC> Target: <Compound 27379493: source_id: Citrate, id_type: cname>
<CompoundReference 9234 <-> 9235] src: BIGG> Target: <Compound 9235: source_id: R-ALL-29654, id_type: reactome>
<CompoundReference 9234 <-> 9236] src: BIGG> Target: <Compound 9236: source_id: R-ALL-433138, id_type: reactome>
<CompoundReference 9234 <-> 9237] src: BIGG> Target: <Compound 9237: source_id: R-ALL-76190, id_type: reactome>
<CompoundReference 9234 <-> 9238] src: BIGG> Target: <Compound 9238: source_id: C00158, id_type: kegg>
<CompoundReference 9234 <-> 9239] src: BIGG> Target: <Compound 9239: source_id: CHEBI:132362, id_type: chebi, name: citrate(4-), quality: manual>
<CompoundReference 9234 <-> 9240] src: BIGG> Target: <Compound 9240: source_id: CHEBI:133748, id_type: chebi, name: citrate anion, quality: manual>
<CompoundReference 9234 <-> 9241] src: BIGG> Target: <Compound 9241: source_id: CHEBI:13999, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9242] src: BIGG> Target: <Compound 9242: source_id: CHEBI:16947, id_type: chebi, name: citrate(3-), quality: manual>
<CompoundReference 9234 <-> 9243] src: BIGG> Target: <Compound 9243: source_id: CHEBI:23321, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9244] src: BIGG> Target: <Compound 9244: source_id: CHEBI:23322, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9245] src: BIGG> Target: <Compound 9245: source_id: CHEBI:30769, id_type: chebi, name: citric acid, quality: manual>
<CompoundReference 9234 <-> 9246] src: BIGG> Target: <Compound 9246: source_id: CHEBI:35802, id_type: chebi, name: 3-carboxy-2-(carboxymethyl)-2-hydroxypropanoate, quality: manual>
<CompoundReference 9234 <-> 9247] src: BIGG> Target: <Compound 9247: source_id: CHEBI:35804, id_type: chebi, name: citrate(1-), quality: manual>
<CompoundReference 9234 <-> 9248] src: BIGG> Target: <Compound 9248: source_id: CHEBI:35806, id_type: chebi, name: 3,4-dicarboxy-3-hydroxybutanoate, quality: manual>
<CompoundReference 9234 <-> 9249] src: BIGG> Target: <Compound 9249: source_id: CHEBI:35808, id_type: chebi, name: citrate(2-), quality: manual>
<CompoundReference 9234 <-> 9250] src: BIGG> Target: <Compound 9250: source_id: CHEBI:35809, id_type: chebi, name: 2-(carboxymethyl)-2-hydroxysuccinate, quality: manual>
<CompoundReference 9234 <-> 9251] src: BIGG> Target: <Compound 9251: source_id: CHEBI:35810, id_type: chebi, name: 3-carboxy-3-hydroxypentanedioate, quality: manual>
<CompoundReference 9234 <-> 9252] src: BIGG> Target: <Compound 9252: source_id: CHEBI:3727, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9253] src: BIGG> Target: <Compound 9253: source_id: CHEBI:41523, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9254] src: BIGG> Target: <Compound 9254: source_id: CHEBI:42563, id_type: chebi, quality: manual>
<CompoundReference 9234 <-> 9255] src: BIGG> Target: <Compound 9255: source_id: CHEBI:76049, id_type: chebi, name: citric acid-d4, quality: manual>
<CompoundReference 9234 <-> 9256] src: BIGG> Target: <Compound 9256: source_id: HMDB00094, id_type: hmdb>
<CompoundReference 9234 <-> 9257] src: BIGG> Target: <Compound 9257: source_id: KRKNYBCHXYNGOX-UHFFFAOYSA-K, id_type: inchikey>
<CompoundReference 9234 <-> 9258] src: BIGG> Target: <Compound 9258: source_id: META:CIT, id_type: biocyc>
<CompoundReference 9234 <-> 9259] src: BIGG> Target: <Compound 9259: source_id: MNXM131, id_type: mnx>
<CompoundReference 9234 <-> 9260] src: BIGG> Target: <Compound 9260: source_id: cpd00137, id_type: seed>
Each CompoundReference lists the recon_id of the two entries it connects, as well as the database from which the reference originates (src).
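A long list of references is often easier to read when grouped by the target's id type. The sketch below does this for a handful of (src, id type, source id) triples copied from the output above. It works on plain tuples rather than ChemRecon objects, so how these fields are accessed on real relation and entry objects is left open here.

```python
from collections import defaultdict

# A few references copied from the printed output: (src, id_type, source_id).
references = [
    ("BIGG", "kegg", "C00158"),
    ("BIGG", "chebi", "CHEBI:16947"),
    ("BIGG", "chebi", "CHEBI:30769"),
    ("BIGG", "hmdb", "HMDB00094"),
    ("METANETX", "mnx", "MNXM1107753"),
    ("BIGG", "mnx", "MNXM131"),
]

# Group the target identifiers by the database they belong to.
by_id_type = defaultdict(list)
for src, id_type, source_id in references:
    by_id_type[id_type].append(source_id)

for id_type in sorted(by_id_type):
    print(f"{id_type}: {by_id_type[id_type]}")
```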
We can also get all relations relating to the citrate entry. This will give relations of all types, not just compound references, giving a more complete view of the information related to the entry.
[7]:
result = chemrecon.get_all_relations(entry = citrate_entry)
# We print every 8th result (for brevity).
for relation, entry in result[::8]:
    print(f'{relation} {entry}')
<CompoundHasNewID 113818 -> 9234] src: BIGG> <Compound 113818: source_id: M_cit[e], id_type: bigg>
<CompoundParticipatesInReaction 9234 -> 89180] n: 1> <Reaction 89180: source_id: R_CITt4pp_1, id_type: bigg, name: Citrate transport via sodium symport periplasm >
<CompoundParticipatesInReaction 9234 -> 77266] n: -1> <Reaction 77266: source_id: R_CITt3, id_type: bigg, name: Citrate transport out via proton antiport>
<CompoundParticipatesInReaction 9234 -> 69934] n: 1> <Reaction 69934: source_id: R_CITt14, id_type: bigg, name: Citrate transport in via Ca complex>
<CompoundParticipatesInReaction 9234 -> 66483] n: 1> <Reaction 66483: source_id: R_CITx, id_type: bigg, name: Citrate transport, glyoxysome>
<CompoundParticipatesInReaction 9234 -> 53262] n: 1> <Reaction 53262: source_id: R_CITtbm, id_type: bigg, name: Citrate transport mitochondrial>
<CompoundParticipatesInReaction 9234 -> 47732] n: 1> <Reaction 47732: source_id: R_CITt3pp, id_type: bigg, name: Citrate transport out via proton antiport (periplasm)>
<CompoundParticipatesInReaction 9234 -> 42379] n: -1> <Reaction 42379: source_id: R_ACONTa_1, id_type: bigg, name: Aconitase I>
<CompoundParticipatesInReaction 9234 -> 31999] n: -1> <Reaction 31999: source_id: R_r2384, id_type: bigg, name: Mitochondrial Carrier (MC) TCDB:2.A.29.7.2>
<CompoundParticipatesInReaction 9234 -> 24980] n: 1> <Reaction 24980: source_id: R_CITt15, id_type: bigg, name: Citrate transport in via Zn complex>
<CompoundParticipatesInReaction 9234 -> 21726] n: -1> <Reaction 21726: source_id: R_AKGCITtm, id_type: bigg, name: Dicarboxylate/tricarboxylate carrier (akg:cit), mitochondrial>
<CompoundParticipatesInReaction 9234 -> 13429] n: 1> <Reaction 13429: source_id: R_CITt4pp, id_type: bigg, name: Citrate transport via sodium symport periplasm >
<CompoundParticipatesInReaction 9234 -> 11580] n: 1> <Reaction 11580: source_id: R_CITt_kt, id_type: bigg, name: Citrate proton symport periplasm >
<CompoundParticipatesInReaction 9234 -> 8083] n: 1> <Reaction 8083: source_id: R_CITt2r, id_type: bigg, name: Citrate reversible transport via symport>
<CompoundHasConjugateBase 9234 -> 113818] src: BIGG> <Compound 113818: source_id: M_cit[e], id_type: bigg>
<CompoundReference 9234 <-> 9258] src: BIGG> <Compound 9258: source_id: META:CIT, id_type: biocyc>
<CompoundReference 9234 <-> 9250] src: BIGG> <Compound 9250: source_id: CHEBI:35809, id_type: chebi, name: 2-(carboxymethyl)-2-hydroxysuccinate, quality: manual>
<CompoundReference 9234 <-> 9242] src: BIGG> <Compound 9242: source_id: CHEBI:16947, id_type: chebi, name: citrate(3-), quality: manual>
<CompoundIsPartOf 9234 -> 9234] src: BIGG> <Compound 9234: source_id: M_cit, id_type: bigg, name: Citrate>
<CompoundHasOldID 9234 -> 9263] src: BIGG> <Compound 9263: source_id: M_cit[m], id_type: bigg>
<ReactionInvolvesCompound 85373 -> 9234] n: 1> <Reaction 85373: source_id: R_r_hmr_4957, id_type: bigg, name: Citratetransporter,mitochondrial>
<ReactionInvolvesCompound 77100 -> 9234] n: 1> <Reaction 77100: source_id: R_r2381, id_type: bigg, name: Mitochondrial Carrier (MC) TCDB:2.A.29.7.2>
<ReactionInvolvesCompound 69931 -> 9234] n: -1> <Reaction 69931: source_id: R_CITt12, id_type: bigg, name: Citrate transport in via Ni complex>
<ReactionInvolvesCompound 58288 -> 9234] n: -1> <Reaction 58288: source_id: R_CITACt, id_type: bigg, name: Citrate transport via acetate antiport>
<ReactionInvolvesCompound 53257 -> 9234] n: -1> <Reaction 53257: source_id: R_CITtam, id_type: bigg, name: Citrate transport mitochondrial>
<ReactionInvolvesCompound 45487 -> 9234] n: -1> <Reaction 45487: source_id: R_EX_cit, id_type: bigg, name: Citrate exchange>
<ReactionInvolvesCompound 41715 -> 9234] n: 2> <Reaction 41715: source_id: R_FE3DCITabc, id_type: bigg, name: Iron transport from ferric-dicitrate via ABC system>
<ReactionInvolvesCompound 30662 -> 9234] n: -1> <Reaction 30662: source_id: R_r1109, id_type: bigg, name: Citrate oxaloacetate-lyase ((pro-3S)-CH2COO- -->acetate) Citrate cycle (TCA cycle) EC:4.1.3.6>
<ReactionInvolvesCompound 24977 -> 9234] n: -1> <Reaction 24977: source_id: R_CITt13, id_type: bigg, name: Citrate transport in via Co complex>
<ReactionInvolvesCompound 16312 -> 9234] n: 1> <Reaction 16312: source_id: R_CITt4_2, id_type: bigg, name: Citrate transport via sodium symport>
<ReactionInvolvesCompound 13241 -> 9234] n: -1> <Reaction 13241: source_id: R_CITt4_1, id_type: bigg, name: Citrate transport via sodium symport>
<ReactionInvolvesCompound 8165 -> 9234] n: 1> <Reaction 8165: source_id: R_CSm, id_type: bigg, name: Citrate synthase>
<ReactionInvolvesCompound 2631 -> 9234] n: 1> <Reaction 2631: source_id: R_r_cs, id_type: bigg, name: Citrate synthase>
Here, we observe various kinds of relations to other entries. For example:
The ReactionInvolvesCompound relation tells us that the listed BiGG reactions contain this compound. This relation has the n attribute, which gives the stoichiometric coefficient of citrate in that reaction.
The CompoundHasConjugateBase relation relates this compound to other forms.
3. Working with Entry Graphs
At this point, we can apply the above method iteratively in order to completely explore the space of related database entries. ChemRecon has a mechanism for this, called Entry Graphs.
An Entry Graph is a directed graph in which the vertices represent database entries, and arcs/edges represent relations between these.
An entry graph needs one or more starting entries. Let us use our citrate entry from earlier.
3.1 Constructing an Entry Graph
[8]:
entrygraph_citrate = chemrecon.EntryGraph(
    initial_entries = {citrate_entry}
)
entrygraph_citrate.draw()
[8]:
Right now, this entry graph is not particularly interesting. In order to add more entries, we need to explore the database, traversing relations and adding newly discovered connected entries to the graph.
Exploration is performed according to a protocol, defining which relations can be traversed, and optionally filtering entries and relations along the way.
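To make the idea of protocol-driven exploration concrete, here is a minimal toy sketch of a step-limited, breadth-first traversal that only follows allowed relation types. The function `explore_toy()`, the entry labels, and the relation table are all invented for illustration. The sketch also treats every relation as directed and ignores filters, which simplifies whatever ChemRecon actually does.

```python
from collections import defaultdict


def explore_toy(start, relations, allowed_types, steps):
    """Expand `start` by following allowed relation types for `steps` rounds.

    relations: iterable of (source, relation_type, target) triples.
    Returns the set of discovered entries and the list of traversed edges.
    """
    outgoing = defaultdict(list)
    for src, rel_type, dst in relations:
        outgoing[src].append((rel_type, dst))

    visited = set(start)
    frontier = set(start)
    edges = []
    for _ in range(steps):
        next_frontier = set()
        for entry in sorted(frontier):
            for rel_type, dst in outgoing.get(entry, []):
                if rel_type not in allowed_types:
                    continue  # the protocol does not permit this relation
                edges.append((entry, rel_type, dst))
                if dst not in visited:
                    visited.add(dst)
                    next_frontier.add(dst)
        if not next_frontier:
            break  # nothing new was discovered
        frontier = next_frontier
    return visited, edges


# Invented toy data: three chained compound references and one reaction.
toy_relations = [
    ("c1", "CompoundReference", "c2"),
    ("c2", "CompoundReference", "c3"),
    ("c3", "CompoundReference", "c4"),
    ("c1", "CompoundParticipatesInReaction", "r1"),
]

visited, edges = explore_toy({"c1"}, toy_relations, {"CompoundReference"}, steps=2)
print(sorted(visited))
```

With two steps and only CompoundReference allowed, the reaction r1 is never reached and c4 lies one step too far away.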
We start with the built-in protocol_compound_structure, which traverses compound references to expand the set of compound entries and includes the associated structures, as well as standardized versions thereof.
[9]:
chemrecon.explore(entrygraph_citrate, chemrecon.protocol_compound_structure, steps = 4)
[16:07:46] Initializing Normalizer
[16:07:46] Running Normalizer
[16:07:51] Running LargestFragmentChooser
[16:07:51] Running Uncharger
[16:07:51] Removed negative charge.
...
[16:07:53] Initializing Normalizer
[16:07:53] Running Normalizer
The .draw() method returns an image of the entry graph for manual inspection. If the graph is large, the image may be hard to read in a notebook.
[10]:
entrygraph_citrate.draw()
[10]:
Alternatively, the .show() method opens the generated image in the system SVG viewer.
Inspecting the graph, we see the initial entry at the top. Edges representing references, annotated with their source, connect it to a number of other Compound entries.
MolStructure entries also appear, referenced by various compounds. Some MolStructure entries are referenced by multiple Compound entries from different databases, which indicates agreement between those databases.
Sometimes, the databases agree on the general structure of the compound, but not on features such as charge or stereoisomerism. To consolidate these, ChemRecon has a Standardize relation, which connects MolStructure entries in the graph. Each such relation is annotated with the feature with respect to which the structure is standardized:
F: Fragment
I: Isotope
C: Charge
T: Tautomerism
S: Stereoisomerism
In this case, we observe that when standardizing by all features, all databases present in this graph ultimately agree on one structure, the one at the bottom.
3.2 Inspecting the Entry Graph
ChemRecon provides a number of methods for inspecting the entry graph. However, for more advanced analysis, we recommend using the underlying graph library, rustworkx. The underlying graph object is available as the g attribute of the entry graph. For the directed graphs used in ChemRecon, refer to the following documentation: rustworkx.PyDiGraph.
For instance, listing the vertices of the graph:
[11]:
for v in entrygraph_citrate.vertices():
    print(v.entry)
<Compound 9234: source_id: M_cit, id_type: bigg, name: Citrate>
<Compound 6059704: source_id: MNXM1107753, id_type: mnx>
<Compound 27379493: source_id: Citrate, id_type: cname>
<Compound 9235: source_id: R-ALL-29654, id_type: reactome>
<Compound 9236: source_id: R-ALL-433138, id_type: reactome>
<Compound 9237: source_id: R-ALL-76190, id_type: reactome>
<Compound 9238: source_id: C00158, id_type: kegg>
<Compound 9239: source_id: CHEBI:132362, id_type: chebi, name: citrate(4-), quality: manual>
<Compound 9240: source_id: CHEBI:133748, id_type: chebi, name: citrate anion, quality: manual>
<Compound 9241: source_id: CHEBI:13999, id_type: chebi, quality: manual>
<Compound 9242: source_id: CHEBI:16947, id_type: chebi, name: citrate(3-), quality: manual>
<Compound 9243: source_id: CHEBI:23321, id_type: chebi, quality: manual>
<Compound 9244: source_id: CHEBI:23322, id_type: chebi, quality: manual>
<Compound 9245: source_id: CHEBI:30769, id_type: chebi, name: citric acid, quality: manual>
<Compound 9246: source_id: CHEBI:35802, id_type: chebi, name: 3-carboxy-2-(carboxymethyl)-2-hydroxypropanoate, quality: manual>
<Compound 9247: source_id: CHEBI:35804, id_type: chebi, name: citrate(1-), quality: manual>
<Compound 9248: source_id: CHEBI:35806, id_type: chebi, name: 3,4-dicarboxy-3-hydroxybutanoate, quality: manual>
<Compound 9249: source_id: CHEBI:35808, id_type: chebi, name: citrate(2-), quality: manual>
<Compound 9250: source_id: CHEBI:35809, id_type: chebi, name: 2-(carboxymethyl)-2-hydroxysuccinate, quality: manual>
<Compound 9251: source_id: CHEBI:35810, id_type: chebi, name: 3-carboxy-3-hydroxypentanedioate, quality: manual>
<Compound 9252: source_id: CHEBI:3727, id_type: chebi, quality: manual>
<Compound 9253: source_id: CHEBI:41523, id_type: chebi, quality: manual>
<Compound 9254: source_id: CHEBI:42563, id_type: chebi, quality: manual>
<Compound 9255: source_id: CHEBI:76049, id_type: chebi, name: citric acid-d4, quality: manual>
<Compound 9256: source_id: HMDB00094, id_type: hmdb>
<Compound 9257: source_id: KRKNYBCHXYNGOX-UHFFFAOYSA-K, id_type: inchikey>
<Compound 9258: source_id: META:CIT, id_type: biocyc>
<Compound 9259: source_id: MNXM131, id_type: mnx>
<Compound 9260: source_id: cpd00137, id_type: seed>
<Compound 6654884: source_id: 650babc9-9d68-4b73-9332-11972ca26f7b/compound/cb525a90-920b-49d9-b0fc-c42655323a65, id_type: envipath>
<Compound 6654888: source_id: HMDB0000094, id_type: hmdb>
<Compound 6654896: source_id: M_C00158, id_type: kegg>
<Compound 6654898: source_id: CIT, id_type: metacyc>
<Compound 6654914: source_id: 1952, id_type: sabiork>
<Compound 6654920: source_id: M_cpd00137, id_type: seed>
<Compound 1089728: source_id: ECMDB00094, id_type: ecmdb, name: Citric acid>
<Compound 12346300: source_id: 31348, id_type: pubchem_cid, name: Citrate>
<Compound 1531575: source_id: (MNXM1107752), id_type: mnx, name: citrate(4-), properties: ['charge: -4', 'formula: C6H4O7', 'mass: 187.99790']>
<Compound 6059702: source_id: MNXM1107752, id_type: mnx>
<Compound 27368765: source_id: citrate(4-), id_type: cname>
<Compound 27379518: source_id: 3,4-dicarboxy-3-hydroxybutanoate, id_type: cname>
<Compound 27379520: source_id: citrate(2-), id_type: cname>
<Compound 27220635: source_id: citrate anion, id_type: cname>
<Compound 27379498: source_id: citric acid, id_type: cname>
<Compound 27379521: source_id: 2-(carboxymethyl)-2-hydroxysuccinate, id_type: cname>
<Compound 27379499: source_id: 3-carboxy-2-(carboxymethyl)-2-hydroxypropanoate, id_type: cname>
<Compound 27088467: source_id: citric acid-d4, id_type: cname>
<Compound 1531579: source_id: (MNXM1107753), id_type: mnx, name: citrate, properties: ['charge: -3', 'formula: C6H5O7', 'mass: 189.00517']>
<Compound 27379495: source_id: citrate(3-), id_type: cname>
<Compound 27379523: source_id: 3-carboxy-3-hydroxypentanedioate, id_type: cname>
<Compound 27379501: source_id: citrate(1-), id_type: cname>
<MolStructure -3: smiles: O=C([O-])CC([O-])(CC(=O)[O-])C(=O)[O-]>
<MolStructure -5: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)O>
<MolStructure -1: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)O>
<MolStructure -7: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)[O-]>
<MolStructure -4: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)O>
<MolStructure -8: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)[O-]>
<MolStructure -6: smiles: [2H]C([2H])(C(=O)O)C(O)(C(=O)O)C([2H])([2H])C(=O)O>
<MolStructure -2: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)[O-]>
<Compound 1089731: source_id: CIT, id_type: biocyc>
<Compound 1089732: source_id: KRKNYBCHXYNGOX-UHFFFAOYSA-N, id_type: inchikey>
<Compound 27199326: source_id: Citric acid, id_type: cname>
<Compound 1531577: source_id: InChIKey=KSXLKRAZYZIYCZ-UHFFFAOYSA-K, id_type: inchikey>
<Compound 1531581: source_id: InChIKey=KRKNYBCHXYNGOX-UHFFFAOYSA-K, id_type: inchikey>
<Compound 27368766: source_id: citrate, id_type: cname>
<MolStructure -9: smiles: [H]C([H])(C(=O)O)C(O)(C(=O)O)C([H])([H])C(=O)O>
4. Scoring Entries
Once an entry graph has been created and expanded, the entries in the graph can be assigned scores in order to evaluate the ‘connectedness’ of each entry in the graph.
In this example, to find a consensus molecular structure for a given compound, an entry graph of Compound and MolStructure entries is created and expanded, after which the MolStructure vertices are scored to find the most likely structure for that compound.
We first create a Scorer instance which scores only MolStructure vertices, and apply it to the graph.
[12]:
scorer = chemrecon.Scorer(
    score_entry_type = chemrecon.MolStructure,
)
scores = scorer(entrygraph_citrate)
for struct, score in scores.items():
    print(f'{score:.3f}: {struct}')
0.597: <MolStructure -1: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)O>
0.082: <MolStructure -5: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)O>
0.082: <MolStructure -4: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)O>
0.075: <MolStructure -2: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)[O-]>
0.054: <MolStructure -6: smiles: [2H]C([2H])(C(=O)O)C(O)(C(=O)O)C([2H])([2H])C(=O)O>
0.048: <MolStructure -3: smiles: O=C([O-])CC([O-])(CC(=O)[O-])C(=O)[O-]>
0.027: <MolStructure -7: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)[O-]>
0.027: <MolStructure -8: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)[O-]>
0.007: <MolStructure -9: smiles: [H]C([H])(C(=O)O)C(O)(C(=O)O)C([H])([H])C(=O)O>
We observe that the fully standardized structure represents the best consensus structure by a large margin.
The score of an entry is (informally) the probability that a random walk starting at one of the initial entries of the entry graph will terminate at that entry.
The parameters of the random walk can be customized by specifying weights (probabilities) using a weight function on entries and relations, which alters the probability of choosing a given path. The default weight of all entries and relations is 1.
[13]:
def entry_weight_fn(e: chemrecon.Entry) -> float:
    match e:
        case chemrecon.Compound(id_type = chemrecon.C_NAME.enum_type):
            # Connections by common name should be weighted less
            return .25
        case _:
            return 1

def relation_weight_fn(r: chemrecon.Relation) -> float:
    match r:
        case chemrecon.CompoundReference(src = chemrecon.SourceDatabase.METANETX):
            # What if we don't trust MNX
            return .1
        case _:
            return 1

scorer_modified = chemrecon.Scorer(
    score_entry_type = chemrecon.MolStructure,
    entry_weight = entry_weight_fn,
    relation_weight = relation_weight_fn
)
We now apply this custom scoring algorithm:
[14]:
scores = scorer_modified(entrygraph_citrate)
for struct, score in scores.items():
    print(f'{score:.3f}: {struct}')
0.599: <MolStructure -1: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)O>
0.081: <MolStructure -5: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)O>
0.081: <MolStructure -4: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)O>
0.071: <MolStructure -2: smiles: O=C([O-])CC(O)(CC(=O)[O-])C(=O)[O-]>
0.054: <MolStructure -3: smiles: O=C([O-])CC([O-])(CC(=O)[O-])C(=O)[O-]>
0.054: <MolStructure -6: smiles: [2H]C([2H])(C(=O)O)C(O)(C(=O)O)C([2H])([2H])C(=O)O>
0.027: <MolStructure -7: smiles: O=C([O-])CC(O)(CC(=O)O)C(=O)[O-]>
0.027: <MolStructure -8: smiles: O=C(O)CC(O)(CC(=O)O)C(=O)[O-]>
0.007: <MolStructure -9: smiles: [H]C([H])(C(=O)O)C(O)(C(=O)O)C([H])([H])C(=O)O>
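The random-walk interpretation of the score can be illustrated on a tiny graph. The sketch below is not ChemRecon's scoring algorithm. It assumes an acyclic graph, a walk that picks each outgoing edge with probability proportional to its weight, and termination at vertices without outgoing edges. The function `termination_probabilities()` and the vertex names are invented for this example.

```python
from collections import defaultdict


def termination_probabilities(edges, start):
    """Exact probability of a weighted random walk ending at each sink.

    edges: dict mapping a vertex to a list of (successor, weight) pairs.
    The graph reachable from `start` is assumed to be acyclic.
    """
    # Reverse DFS post-order gives a topological order of the reachable part.
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for succ, _ in edges.get(v, []):
            visit(succ)
        order.append(v)

    visit(start)
    order.reverse()

    mass = defaultdict(float)
    mass[start] = 1.0
    result = {}
    for v in order:
        out = edges.get(v, [])
        if not out:
            result[v] = mass[v]  # sink: the walk terminates here
            continue
        total = sum(w for _, w in out)
        for succ, w in out:
            # Split the probability mass in proportion to the edge weights.
            mass[succ] += mass[v] * w / total
    return result


# Two databases point to structure_x, one points to structure_y.
toy_graph = {
    "compound": [("db_a", 1.0), ("db_b", 1.0), ("db_c", 1.0)],
    "db_a": [("structure_x", 1.0)],
    "db_b": [("structure_x", 1.0)],
    "db_c": [("structure_y", 1.0)],
}
uniform = termination_probabilities(toy_graph, "compound")

# Down-weighting the edge to db_c shifts probability towards structure_x.
toy_graph_weighted = dict(toy_graph)
toy_graph_weighted["compound"] = [("db_a", 1.0), ("db_b", 1.0), ("db_c", 0.5)]
weighted = termination_probabilities(toy_graph_weighted, "compound")

print(uniform)
print(weighted)
```

In this toy setting, the structure supported by more paths collects more probability mass, and lowering a weight reduces the contribution of the paths that pass through it, which is the same qualitative behaviour the weight functions above are meant to control.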
5. Custom Exploration Protocols
Arbitrary protocols are defined by specifying the relation types which can be traversed, and optionally applying filters to entries and relations.
In this example, say we are interested in finding the set of enzymes which could possibly interact with a given molecule. Looking at the overview of entry and relation types, we realize that this can be achieved by looking at related compounds using CompoundReference, the reactions in which these are involved using CompoundParticipatesInReaction, and finally getting the associated enzymes using ReactionHasEnzyme.
Finally, we would like the graph to show whether the enzyme entries represent classes or individual enzymes. The ontological relation EnzymeHasInstance encodes this hierarchical relationship. We add it as a ‘terminal relation type’, meaning that the database is not explored by traversing relations of this type, but they are added after exploration if both the source and target entries are present.
[18]:
custom_protocol = chemrecon.ExplorationProtocol(
    relation_types = {
        chemrecon.CompoundReference,
        chemrecon.CompoundParticipatesInReaction,
        chemrecon.ReactionHasEnzyme
    },
    relation_types_terminal = {
        chemrecon.EnzymeHasInstance
    }
)
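The behaviour of terminal relation types described above can be sketched as a post-processing step: once exploration has finished, a terminal relation is kept only if both of its endpoints were already discovered. The helper `add_terminal_relations()` and the entry labels below are hypothetical and only illustrate this rule on plain tuples.

```python
def add_terminal_relations(visited, candidate_relations, terminal_types):
    """Keep terminal relations whose source and target are both in `visited`.

    candidate_relations: iterable of (source, relation_type, target) triples.
    """
    return [
        (src, rel_type, dst)
        for src, rel_type, dst in candidate_relations
        if rel_type in terminal_types and src in visited and dst in visited
    ]


# Entries found during a (toy) exploration.
visited_entries = {"reaction_1", "enzyme_class", "enzyme_1"}

# Invented candidate relations; enzyme_2 was never discovered.
candidates = [
    ("enzyme_class", "EnzymeHasInstance", "enzyme_1"),
    ("enzyme_class", "EnzymeHasInstance", "enzyme_2"),
    ("reaction_1", "ReactionHasEnzyme", "enzyme_1"),
]

terminal_edges = add_terminal_relations(
    visited_entries, candidates, {"EnzymeHasInstance"}
)
print(terminal_edges)
```

Because terminal relations never introduce new entries, they annotate the graph without enlarging it.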
Let us apply this to find the enzymes which interact with citrate.
[19]:
eg_custom = chemrecon.EntryGraph(initial_entries = {citrate_entry})
chemrecon.explore(eg_custom, custom_protocol, steps = 3)
eg_custom.draw()
[19]: