PubChem API in Mathematica#
by Vishank Patel
PubChem API Documentation: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access
Mathematica PubChem documentation: https://reference.wolfram.com/language/ref/service/PubChem.html
These recipe examples were tested on March 30, 2022.
Attribution: This tutorial was adapted from supporting information in:
Scalfani, V. F.; Ralph, S. C.; Alshaikh, A. A.; Bara, J. E. Programmatic Compilation of Chemical Data and Literature From PubChem Using Matlab. Chemical Engineering Education, 2020, 54, 230. https://doi.org/10.18260/2-1-370.660-115508 (and vfscalfani/MATLAB-cheminformatics)
Setup#
Establish the Mathematica PubChem connection:
pubchem = ServiceConnect["PubChem"]
1. PubChem Similarity#
Search for chemical structures in PubChem via a Fingerprint Tanimoto Similarity Search.
Get compound image#
compoundID = "2734162";
pubchem["CompoundImage", {"CompoundID" -> compoundID}]
(*Replace the above CompoundID value to customize*)
Retrieve InChI and SMILES#
compProperties = pubchem["CompoundProperties", {"CompoundID" -> compoundID}][[1]]
(*The result holds one entry per compound; storing the first
entry (the output we need) makes the data easier to query*)
compProperties //OutputForm (*Changed to plain text output*)
Dataset[<|CompoundID -> 2734162, MolecularFormula -> C8H15N2+,
  MolecularWeight -> 139.22 grams per mole, CanonicalSMILES -> CCCCN1C=C[N+](=C1)C,
  IsomericSMILES -> CCCCN1C=C[N+](=C1)C,
  InChI -> InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1,
  InChIKey -> IQQRAVYLUAZUGX-UHFFFAOYSA-N,
  IUPACName -> 1-butyl-3-methylimidazol-3-ium, XLogP -> 1.3,
  ExactMass -> 139.123523487 grams per mole,
  MonoisotopicMass -> 139.123523487 grams per mole, TPSA -> 8.8, Complexity -> 93,
  Charge -> 1, HBondDonorCount -> 0, HBondAcceptorCount -> 0,
  RotatableBondCount -> 3, HeavyAtomCount -> 10, IsotopeAtomCount -> 0,
  AtomStereoCount -> 0, DefinedAtomStereoCount -> 0, UndefinedAtomStereoCount -> 0,
  BondStereoCount -> 0, DefinedBondStereoCount -> 0, UndefinedBondStereoCount -> 0,
  CovalentUnitCount -> 1, Volume3D -> 121.3, XStericQuadrupole3D -> 4.97,
  YStericQuadrupole3D -> 1.63, ZStericQuadrupole3D -> 0.91, FeatureCount3D -> 3,
  FeatureAcceptorCount3D -> 0, FeatureDonorCount3D -> 0, FeatureAnionCount3D -> 0,
  FeatureCationCount3D -> 1, FeatureRingCount3D -> 1, FeatureHydrophobeCount3D -> 1,
  ConformerModelRMSD3D -> 0.6, EffectiveRotorCount3D -> 3, ConformerCount3D -> 10,
  Fingerprint2D ->
   AAADccBzAAAAAAAAAAAAAAAAAAAAAWAAAAAAAAAAAAAAAAABgAAAHAAAAAAACADBAgQvkBcMEACgABAnZAAAgC0REqAJQAAYMACASAAAiAAUAAAIAAKAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 41], <||>]
To extract the properties:
compProperties["InChI"]
InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1
compProperties["IsomericSMILES"]
CCCCN1C=C[N+](=C1)C
Perform a Similarity Search#
We will use the PubChem API to perform a Fingerprint Tanimoto Similarity Search (SS).
(2D Tanimoto threshold 95% to 1-Butyl-3-methyl-imidazolium; CID = 2734162)
rawSSCIDs = pubchem["CompoundCID", {"CompoundID" -> compoundID, Method -> "Similarity2DSearch", "Threshold" -> 95}];
ssCIDs = Normal[rawSSCIDs["CompoundID"][[;;25]]]; (*Taking the first 25 matches*)
ssCIDs // Shallow //OutputForm
{12971008, 304622, 11448496, 11424151, 11171745, 2734161, 529334, 118785, 61347,
  11160028, <<15>>}
In the pubchem["CompoundCID", ...] call above, change the "Threshold" value to set the desired Tanimoto threshold (e.g., 97 or 90).
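To make the threshold concrete, here is a minimal sketch of how a Tanimoto score is computed from two binary fingerprints. The 8-bit fingerprints below are hypothetical and chosen only for illustration; PubChem computes the score server-side on its own, much longer, substructure fingerprints, so this is not the service's implementation.

```mathematica
(* Tanimoto coefficient for two equal-length 0/1 fingerprints:
   bits set in both, divided by bits set in either *)
tanimoto[a_List, b_List] /; Length[a] == Length[b] :=
  Total[BitAnd[a, b]]/Total[BitOr[a, b]]

(* Hypothetical 8-bit fingerprints, for illustration only *)
fpA = {1, 1, 0, 1, 0, 0, 1, 0};
fpB = {1, 1, 0, 0, 0, 0, 1, 1};

score = tanimoto[fpA, fpB]   (* 3 shared bits / 5 bits set in either = 3/5 *)
N[score] >= 0.95             (* False: this pair would not pass a 95% threshold *)
```

A lower threshold admits more distant analogues and therefore returns more CIDs.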
similarCompoundData = {};
For[i = 1, i <= Length[ssCIDs], i++,
tempID = ssCIDs[[i]];
tempData =
pubchem["CompoundProperties", "CompoundID" -> tempID][[1]][{"IsomericSMILES", "CompoundID", "InChI", "MolecularWeight", "HeavyAtomCount", "RotatableBondCount", "Charge"}];
AppendTo[similarCompoundData, tempData]
]
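The loop above can also be written without AppendTo. The sketch below is one possible alternative, not the original authors' method: fetchAll and fetchSimilarityRow are hypothetical helper names, and the 0.25 second default delay is an assumption chosen to stay under PubChem's published limit of five requests per second.

```mathematica
(* fetchAll applies `fetch` to each id and pauses after every call;
   Map collects the results, so no AppendTo is needed *)
fetchAll[ids_List, fetch_, delay_ : 0.25] :=
  Map[Function[id, With[{row = fetch[id]}, Pause[delay]; row]], ids]

(* The same request as in the loop above, wrapped as a function *)
similarityKeys = {"IsomericSMILES", "CompoundID", "InChI", "MolecularWeight",
   "HeavyAtomCount", "RotatableBondCount", "Charge"};
fetchSimilarityRow[id_] :=
  pubchem["CompoundProperties", "CompoundID" -> id][[1]][similarityKeys]

(* Usage (requires the pubchem connection and ssCIDs defined earlier): *)
(* similarCompoundData = fetchAll[ssCIDs, fetchSimilarityRow]; *)
```

Passing the request as an argument keeps fetchAll reusable for the SMARTS section, where only the list of property keys differs.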
similarCompoundData[[;;3]] // Dataset (*Displaying the first three elements*)
similarCompoundData[[;;3]] //OutputForm (*Changed to plain text output*)
{Dataset[<|IsomericSMILES -> CCCN1C=C[N+](=C1)C.[I-], CompoundID -> 12971008,
   InChI -> InChI=1S/C7H13N2.HI/c1-3-4-9-6-5-8(2)7-9;/h5-7H,3-4H2,1-2H3;1H/q+1;/p-1,
   MolecularWeight -> 252.10 grams per mole, HeavyAtomCount -> 10,
   RotatableBondCount -> 2, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 7], <||>],
 Dataset[<|IsomericSMILES -> CCCCN1C=CN=C1C, CompoundID -> 304622,
   InChI -> InChI=1S/C8H14N2/c1-3-4-6-10-7-5-9-8(10)2/h5,7H,3-4,6H2,1-2H3,
   MolecularWeight -> 138.21 grams per mole, HeavyAtomCount -> 10,
   RotatableBondCount -> 3, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 7], <||>],
 Dataset[<|IsomericSMILES -> CCCCN1C=C[N+](=C1)C.[I-], CompoundID -> 11448496,
   InChI -> InChI=1S/C8H15N2.HI/c1-3-4-5-10-7-6-9(2)8-10;/h6-8H,3-5H2,1-2H3;1H/q+1;/p-1,
   MolecularWeight -> 266.12 grams per mole, HeavyAtomCount -> 11,
   RotatableBondCount -> 3, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 7], <||>]}
Export the data as a CSV file:
data = {};
For[i = 1, i <= Length[ssCIDs], i++,
entries = Values[similarCompoundData[[i]]];
AppendTo[data, entries]
]
data // Normal // Shallow
data //Normal //Shallow //OutputForm (*Changes to normal text output*)
{{CCCN1C=C[N+](=C1)C.[I-], 12971008, InChI=1S/C7H13N2.HI/c1-3-4-9-6-5-8(2)7-9;/h5-7H,3-4H2,1-2H3;1H/q+1;/p-1, Quantity[<<2>>], 10, 2, 0},
 {CCCCN1C=CN=C1C, 304622, InChI=1S/C8H14N2/c1-3-4-6-10-7-5-9-8(10)2/h5,7H,3-4,6H2,1-2H3, Quantity[<<2>>], 10, 3, 0},
 {CCCCN1C=C[N+](=C1)C.[I-], 11448496, InChI=1S/C8H15N2.HI/c1-3-4-5-10-7-6-9(2)8-10;/h6-8H,3-5H2,1-2H3;1H/q+1;/p-1, Quantity[<<2>>], 11, 3, 0},
 {CCCCN1C=C[N+](=C1)C.C(#N)[S-], 11424151, InChI=1S/C8H15N2.CHNS/c1-3-4-5-10-7-6-9(2)8-10;2-1-3/h6-8H,3-5H2,1-2H3;3H/q+1;/p-1, Quantity[<<2>>], 13, 3, 0},
 {CCCCN1C=C[N+](=C1)C.C(=[N-])=NC#N, 11171745, InChI=1S/C8H15N2.C2N3/c1-3-4-5-10-7-6-9(2)8-10;3-1-5-2-4/h6-8H,3-5H2,1-2H3;/q+1;-1, Quantity[<<2>>], 15, 3, 0},
 {CCCCN1C=C[N+](=C1)C.[Cl-], 2734161, InChI=1S/C8H15N2.ClH/c1-3-4-5-10-7-6-9(2)8-10;/h6-8H,3-5H2,1-2H3;1H/q+1;/p-1, Quantity[<<2>>], 11, 3, 0},
 {CCCCCN1C=CN=C1, 529334, InChI=1S/C8H14N2/c1-2-3-4-6-10-7-5-9-8-10/h5,7-8H,2-4,6H2,1H3, Quantity[<<2>>], 10, 4, 0},
 {CCCN1C=CN=C1, 118785, InChI=1S/C6H10N2/c1-2-4-8-5-3-7-6-8/h3,5-6H,2,4H2,1H3, Quantity[<<2>>], 8, 2, 0},
 {CCCCN1C=CN=C1, 61347, InChI=1S/C7H12N2/c1-2-3-5-9-6-4-8-7-9/h4,6-7H,2-3,5H2,1H3, Quantity[<<2>>], 9, 3, 0},
 {CCCN1C=C[N+](=C1)C.[Br-], 11160028, InChI=1S/C7H13N2.BrH/c1-3-4-9-6-5-8(2)7-9;/h5-7H,3-4H2,1-2H3;1H/q+1;/p-1, Quantity[<<2>>], 10, 2, 0},
 <<15>>}
Export["pubchem_similarity_data.csv", Normal[data]]
pubchem_similarity_data.csv
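The export above writes no header row, and the MolecularWeight column holds Quantity expressions rather than plain numbers. The sketch below shows one way to handle both before exporting. It runs on two hand-written associations whose values are copied from the output shown earlier; stripUnits is a hypothetical helper name, and with the real data each row would first be converted with Normal.

```mathematica
(* Sample rows in the shape returned above, values copied from the displayed output *)
rows = {
   <|"CompoundID" -> 304622, "IsomericSMILES" -> "CCCCN1C=CN=C1C",
     "MolecularWeight" -> Quantity[138.21, "Grams"/"Moles"]|>,
   <|"CompoundID" -> 12971008, "IsomericSMILES" -> "CCCN1C=C[N+](=C1)C.[I-]",
     "MolecularWeight" -> Quantity[252.10, "Grams"/"Moles"]|>};

(* Replace each Quantity with its bare number; leave other values unchanged *)
stripUnits[row_Association] :=
  Map[If[QuantityQ[#], QuantityMagnitude[#], #] &, row]

header = Keys[First[rows]];                          (* column names from the first row *)
table = Prepend[Values /@ (stripUnits /@ rows), header];
csvText = ExportString[table, "CSV"]                 (* use Export["file.csv", table] to write a file *)
```

Dropping the unit loses information, so it can help to record it in the column name (for example "MolecularWeight (g/mol)").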
Retrieve Images of Compounds from Similarity Search#
Table[pubchem["CompoundImage", {"CompoundID" -> id}], {id, ssCIDs}]
2. PubChem SMARTS Search#
Search for chemical structures from a SMARTS substructure query.
Define SMARTS queries#
View pattern syntax at: https://smartsview.zbh.uni-hamburg.de/
Note: These are vinyl imidazolium substructure searches
smartsQ = {"[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1","[CR0H2][n+]1[cH1][cH1]n([CR0H2][CR0H1]=[CR0H2])[cH1]1","[CR0H2][n+]1[cH1][cH1]n([CR0H2][CR0H2][CR0H1]=[CR0H2])[cH1]1"};
Add your own SMARTS queries to customize. You can add as many as desired to the list.
Perform a SMARTS query search
api = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
smartsQURL = {};
For[i = 1, i <= Length[smartsQ], i++,
tempURL = api <> "fastsubstructure/smarts/" <> smartsQ[[i]] <> "/cids/JSON";
AppendTo[smartsQURL, tempURL]
]
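The queries above are pasted into the URL unchanged. SMARTS strings can contain characters with special meaning in a URL (for example "#" for triple bonds, or "/" and "\" for bond stereochemistry), so percent-encoding the query first is a cautious option. This is a sketch under that assumption, not a required step for the three queries shown here; smartsExample and encoded are names introduced only for this example.

```mathematica
api = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";

(* First query from smartsQ, repeated here so the example is self-contained *)
smartsExample = "[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1";

(* URLEncode percent-encodes reserved characters such as [ ] + = *)
encoded = URLEncode[smartsExample];

url = api <> "fastsubstructure/smarts/" <> encoded <> "/cids/JSON"

(* Round trip: decoding recovers the original SMARTS string *)
URLDecode[encoded] === smartsExample
```

To encode every query at once, the same call can be mapped over the list: URLEncode /@ smartsQ.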
(*performing substructure searches for each query link in smartsQURL*)
hitCIDs = {};
For[i = 1, i <= Length[smartsQURL], i++,
tempData = Import[smartsQURL[[i]], "RawJSON"];
AppendTo[hitCIDs, tempData];
Pause[1]
]
hitCIDsAll = Flatten[hitCIDs[[All, "IdentifierList", "CID"]]];
hitCIDsAll // Shallow //OutputForm
{121235111, 2881855, 86657882, 46178576, 23724184, 139254006, 132274871, 87560886,
  87559770, 87327009, <<819>>}
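The Part expression used above relies on each JSON response nesting its CIDs under "IdentifierList" and then "CID". The sketch below reproduces that extraction offline on two small hand-written responses (the CIDs are taken from the output above, but their grouping into two responses is made up for illustration). It also removes duplicates, since a compound can match more than one query.

```mathematica
(* Two hand-written responses in the shape of a /cids/JSON reply *)
responses = {
   ImportString["{\"IdentifierList\": {\"CID\": [121235111, 2881855]}}", "RawJSON"],
   ImportString["{\"IdentifierList\": {\"CID\": [2881855, 86657882]}}", "RawJSON"]};

(* Same extraction as above: one CID list per response, then flattened *)
merged = Flatten[responses[[All, "IdentifierList", "CID"]]];

(* Keep the first occurrence of each CID *)
unique = DeleteDuplicates[merged]
(* {121235111, 2881855, 86657882} *)
```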
As in the PubChem similarity search, we will extract data for only the first 25 CIDs:
hitCIDsShort = hitCIDsAll[[;; 25]]
{121235111, 2881855, 86657882, 46178576, 23724184, 139254006, 132274871, 87560886,
  87559770, 87327009, 59435292, 24766550, 2881640, 2881449, 2881324, 2881232,
  141176071, 139241369, 138404213, 138373746, 135377330, 135361018, 132427329,
  132275640, 129853306}
smartsCompoundData = {};
For[i = 1, i <= Length[hitCIDsShort], i++,
tempID = hitCIDsShort[[i]];
tempData =
pubchem["CompoundProperties", "CompoundID" -> tempID][[1]][{"CompoundID", "InChI", "CanonicalSMILES", "MolecularWeight",
"IUPACName", "HeavyAtomCount", "CovalentUnitCount", "Charge"}];
AppendTo[smartsCompoundData, tempData]
]
smartsCompoundData[[;;3]]
smartsCompoundData[[;;3]] //OutputForm (*Changed to normal text output*)
{Dataset[<|CompoundID -> 121235111,
   InChI -> InChI=1S/C7H11N2.C2F6NO4S2/c1-3-8-5-6-9(4-2)7-8;3-1(4,5)14(10,11)9-15(12,13)2(6,7)8/h3,5-7H,1,4H2,2H3;/q+1;-1,
   CanonicalSMILES -> CC[N+]1=CN(C=C1)C=C.C(F)(F)(F)S(=O)(=O)[N-]S(=O)(=O)C(F)(F)F,
   MolecularWeight -> 403.3 grams per mole,
   IUPACName -> bis(trifluoromethylsulfonyl)azanide;1-ethenyl-3-ethylimidazol-3-ium,
   HeavyAtomCount -> 24, CovalentUnitCount -> 2, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 8], <||>],
 Dataset[<|CompoundID -> 2881855,
   InChI -> InChI=1S/C15H17N2O3.BrH/c1-4-16-7-8-17(11-16)10-13(18)12-5-6-14(19-2)15(9-12)20-3;/h4-9,11H,1,10H2,2-3H3;1H/q+1;/p-1,
   CanonicalSMILES -> COC1=C(C=C(C=C1)C(=O)C[N+]2=CN(C=C2)C=C)OC.[Br-],
   MolecularWeight -> 353.21 grams per mole,
   IUPACName -> 1-(3,4-dimethoxyphenyl)-2-(3-ethenylimidazol-1-ium-1-yl)ethanone;bromide,
   HeavyAtomCount -> 21, CovalentUnitCount -> 2, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 8], <||>],
 Dataset[<|CompoundID -> 86657882,
   InChI -> InChI=1S/C13H23N2.BrH/c1-3-5-6-7-8-9-10-15-12-11-14(4-2)13-15;/h4,11-13H,2-3,5-10H2,1H3;1H/q+1;/p-1,
   CanonicalSMILES -> CCCCCCCC[N+]1=CN(C=C1)C=C.[Br-],
   MolecularWeight -> 287.24 grams per mole,
   IUPACName -> 1-ethenyl-3-octylimidazol-3-ium;bromide, HeavyAtomCount -> 16,
   CovalentUnitCount -> 2, Charge -> 0|>,
  TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`AnyType, 8], <||>]}
Exporting the Data to a CSV file#
(*Initializing the data with headers*)
smartsData = {Normal[Keys[smartsCompoundData[[1]]]]}; (*Normal turns the keys and values from a dataset to a list*)
For[i = 1, i <= Length[hitCIDsShort], i++,
entries = Normal[Values[smartsCompoundData[[i]]]];
AppendTo[smartsData, entries]
]
smartsData[[;;3]] //Dataset
smartsData[[;;3]] //OutputForm (*Changed to normal text output*)
{{CompoundID, InChI, CanonicalSMILES, MolecularWeight, IUPACName, HeavyAtomCount, CovalentUnitCount, Charge},
 {121235111, InChI=1S/C7H11N2.C2F6NO4S2/c1-3-8-5-6-9(4-2)7-8;3-1(4,5)14(10,11)9-15(12,13)2(6,7)8/h3,5-7H,1,4H2,2H3;/q+1;-1,
  CC[N+]1=CN(C=C1)C=C.C(F)(F)(F)S(=O)(=O)[N-]S(=O)(=O)C(F)(F)F, 403.3 grams per mole,
  bis(trifluoromethylsulfonyl)azanide;1-ethenyl-3-ethylimidazol-3-ium, 24, 2, 0},
 {2881855, InChI=1S/C15H17N2O3.BrH/c1-4-16-7-8-17(11-16)10-13(18)12-5-6-14(19-2)15(9-12)20-3;/h4-9,11H,1,10H2,2-3H3;1H/q+1;/p-1,
  COC1=C(C=C(C=C1)C(=O)C[N+]2=CN(C=C2)C=C)OC.[Br-], 353.21 grams per mole,
  1-(3,4-dimethoxyphenyl)-2-(3-ethenylimidazol-1-ium-1-yl)ethanone;bromide, 21, 2, 0}}
Export["pubchem_smarts_data.csv", smartsData]
pubchem_smarts_data.csv
Retrieve Images of CID Compounds from SMARTS query match#
Table[pubchem["CompoundImage", {"CompoundID" -> id}], {id, hitCIDsShort}]
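The Table above returns a flat list of images with no indication of which CID each one shows. As a possible refinement, the sketch below arranges any list of items into a labeled grid. labeledGrid is a hypothetical helper name and the default of five items per row is an arbitrary choice.

```mathematica
(* Pair each item with its label, then lay the pairs out perRow at a time;
   UpTo lets the last row be shorter than the others *)
labeledGrid[items_List, labels_List, perRow_Integer : 5] :=
  Grid[Partition[MapThread[Labeled, {items, labels}], UpTo[perRow]], Frame -> All]

(* Offline demonstration with placeholder strings instead of images *)
labeledGrid[{"img1", "img2", "img3"}, {121235111, 2881855, 86657882}, 2]

(* Usage with the images retrieved above (requires the pubchem connection): *)
(* labeledGrid[Table[pubchem["CompoundImage", {"CompoundID" -> id}], {id, hitCIDsShort}], hitCIDsShort] *)
```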