# PubChem API in Mathematica

by Vishank Patel

**PubChem API Documentation**: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access

**Mathematica PubChem documentation:** https://reference.wolfram.com/language/ref/service/PubChem.html

These recipe examples were tested on March 30, 2022.

**Attribution:** This tutorial was adapted from supporting information in:

**Scalfani, V. F.**; Ralph, S. C.; Alshaikh, A. A.; Bara, J. E. Programmatic Compilation of Chemical Data and Literature From PubChem Using Matlab. *Chemical Engineering Education*, **2020**, *54*, 230. https://doi.org/10.18260/2-1-370.660-115508 and https://github.com/vfscalfani/MATLAB-cheminformatics

### Setup

Establish the Mathematica PubChem connection:

In [None]:
pubchem = ServiceConnect["PubChem"]

## 1. PubChem Similarity

Search for chemical structures in PubChem via a Fingerprint Tanimoto Similarity Search.

### Get compound image

In [None]:
compoundID = "2734162";

pubchem["CompoundImage", {"CompoundID" -> compoundID}] 
(*Replace the above CompoundID value to customize*)

### Retrieve InChI and SMILES

In [None]:
compProperties = pubchem["CompoundProperties", {"CompoundID" -> compoundID}][[1]]
(*Mathematica's output is a list of associations; storing the first
element of the list (the needed output) makes the data easier to query*)

In [None]:
compProperties //OutputForm  (*Changed to plain text output*)

To extract the properties:

In [None]:
compProperties["InChI"]

In [None]:
compProperties["IsomericSMILES"]

### Perform a Similarity Search

We will use the PubChem API to perform a Fingerprint Tanimoto Similarity Search (SS).

(2D Tanimoto threshold 95% to 1-Butyl-3-methyl-imidazolium; CID = 2734162)
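
For intuition about what the threshold means: the Tanimoto coefficient between two binary fingerprints is the number of bits set in both, divided by the number of bits set in at least one. The sketch below computes it for two made-up 8-bit vectors. `tanimoto` is a hypothetical helper name, and the toy vectors are not real PubChem fingerprints.

```mathematica
(* Tanimoto coefficient for two equal-length 0/1 vectors:
   T = c/(a + b - c), where a and b are the bits set in each vector
   and c is the number of bits set in both *)
tanimoto[fp1_List, fp2_List] := Module[{common = Total[fp1*fp2]},
  common/(Total[fp1] + Total[fp2] - common)]

(* Made-up fingerprints, for illustration only *)
fpA = {1, 1, 0, 1, 0, 0, 1, 1};
fpB = {1, 1, 0, 1, 0, 1, 1, 0};

N[tanimoto[fpA, fpB]]
```

Here four bits are shared and six bits are set in at least one vector, so the coefficient is 2/3, which would not pass a 95% threshold.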

In [None]:
rawSSCIDs = pubchem["CompoundCID", {"CompoundID" -> compoundID, Method -> "Similarity2DSearch", "Threshold" -> 95}];

In [None]:
ssCIDs = Normal[rawSSCIDs["CompoundID"][[;;25]]];  (*Taking the first 25 matches*)
ssCIDs // Shallow //OutputForm

In the similarity search query above, you can adjust the `"Threshold"` value to the desired Tanimoto threshold (e.g., 97, 90, etc.).
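
To see how the threshold affects the number of hits, the search can be repeated at a few values. This sketch reuses only the request shown above; the names `thresholds` and `hitCounts` are illustrative, the one-second pause is an assumption meant to keep the request rate low, and each call is a live PubChem query, so the counts will change over time.

```mathematica
pubchem = ServiceConnect["PubChem"];
compoundID = "2734162";

thresholds = {97, 95, 90};

(* Run one similarity search per threshold and count the returned CIDs *)
hitCounts = Table[
   Pause[1];
   t -> Length[
     Normal[pubchem["CompoundCID", {"CompoundID" -> compoundID, 
         Method -> "Similarity2DSearch", "Threshold" -> t}]["CompoundID"]]],
   {t, thresholds}]
```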

In [None]:
similarCompoundData = {};

For[i = 1, i <= Length[ssCIDs], i++,

 tempID = ssCIDs[[i]];
 tempData = 
  pubchem["CompoundProperties", "CompoundID" -> tempID][[1]][{"IsomericSMILES", "CompoundID", "InChI", "MolecularWeight", "HeavyAtomCount", "RotatableBondCount", "Charge"}];
 AppendTo[similarCompoundData, tempData]
 ]
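
The same collection step can also be written without a counter variable or repeated `AppendTo` calls. The following is a sketch of one alternative using `Table`. `propertyKeys`, `getProperties`, and `exampleCIDs` are hypothetical names, the one-element CID list is a stand-in for `ssCIDs`, and the short pause is an assumption intended to keep the request rate modest.

```mathematica
pubchem = ServiceConnect["PubChem"];

propertyKeys = {"IsomericSMILES", "CompoundID", "InChI", "MolecularWeight", 
   "HeavyAtomCount", "RotatableBondCount", "Charge"};

(* Same request as in the loop above, wrapped in a helper *)
getProperties[id_] := 
  pubchem["CompoundProperties", "CompoundID" -> id][[1]][propertyKeys]

(* Stand-in for ssCIDs; replace with the list from the similarity search *)
exampleCIDs = {"2734162"};

(* Table collects one result per CID, so no AppendTo is needed *)
similarCompoundDataAlt = Table[
   Pause[0.5];
   getProperties[id],
   {id, exampleCIDs}]
```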

In [None]:
similarCompoundData[[;;3]] // Dataset (*Displaying the first three elements*)

In [None]:
similarCompoundData[[;;3]] //OutputForm (*Changed to plain text output*)

Export the data as a CSV file:

In [None]:
data = {};
For[i = 1, i <= Length[ssCIDs], i++,
 entries = Values[similarCompoundData[[i]]];
 AppendTo[data, entries]
 ]

In [None]:
data // Normal // Shallow

In [None]:
data //Normal //Shallow //OutputForm (*Changed to plain text output*)

In [None]:
Export["pubchem_similarity_data.csv", Normal[data]]
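
The CSV written above has no header row. A sketch of adding one is shown here with two made-up records so that it runs without a PubChem connection; the file name and record values are placeholders. If the elements of `similarCompoundData` are `Dataset` objects rather than plain associations, apply `Normal` to each one first.

```mathematica
(* Made-up records standing in for similarCompoundData *)
records = {
   <|"CompoundID" -> 1, "HeavyAtomCount" -> 10, "Charge" -> 1|>,
   <|"CompoundID" -> 2, "HeavyAtomCount" -> 12, "Charge" -> 0|>};

(* First row holds the keys, remaining rows hold the values *)
rows = Prepend[Values /@ records, Keys[First[records]]];

(* Write to a temporary location; change the path as needed *)
outFile = FileNameJoin[{$TemporaryDirectory, "example_with_headers.csv"}];
Export[outFile, rows]
```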

### Retrieve Images of Compounds from Similarity Search

In [None]:
Table[pubchem["CompoundImage", {"CompoundID" -> id}], {id, ssCIDs}]

## 2. PubChem SMARTS Search

Search for chemical structures from a SMARTS substructure query.

### Define SMARTS queries

View pattern syntax at: https://smartsview.zbh.uni-hamburg.de/

Note: These are vinyl imidazolium substructure searches

In [None]:
smartsQ = {"[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1","[CR0H2][n+]1[cH1][cH1]n([CR0H2][CR0H1]=[CR0H2])[cH1]1","[CR0H2][n+]1[cH1][cH1]n([CR0H2][CR0H2][CR0H1]=[CR0H2])[cH1]1"};

Add your own SMARTS queries to customize. You can add as many as desired to the list.
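
Because the three queries differ only in the number of acyclic CH2 spacers before the vinyl group, a list like this can also be generated programmatically. A minimal sketch, where `vinylImidazoliumQuery` and `smartsQGenerated` are hypothetical names and not part of any library:

```mathematica
(* Build the SMARTS string with n acyclic CH2 spacers between
   the ring nitrogen and the vinyl group *)
vinylImidazoliumQuery[n_Integer?NonNegative] := StringJoin[
  "[CR0H2][n+]1[cH1][cH1]n(",
  ConstantArray["[CR0H2]", n],
  "[CR0H1]=[CR0H2])[cH1]1"]

(* n = 0, 1, 2 gives the same three strings as the list defined above *)
smartsQGenerated = Table[vinylImidazoliumQuery[n], {n, 0, 2}]
```

Extending the range of `n` would add longer spacers to the search.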

### Perform a SMARTS query search

In [None]:
api = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";

smartsQURL = {};
For[i = 1, i <= Length[smartsQ], i++,
 tempURL = api <> "fastsubstructure/smarts/" <> smartsQ[[i]] <> "/cids/JSON";
 AppendTo[smartsQURL, tempURL]
 ]
 
(*Perform a substructure search for each query URL in smartsQURL*)

hitCIDs = {};
For[i = 1, i <= Length[smartsQURL], i++,
 tempData = Import[smartsQURL[[i]], "RawJSON"];
 AppendTo[hitCIDs, tempData];
 Pause[1]
 ]
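
SMARTS strings can contain characters such as `[`, `]`, `+`, `=`, and `#` that are reserved or have special meaning in a URL; a raw `#`, for example, would start a URL fragment. The URLs above are built from the raw strings. If a query fails for that reason, percent-encoding the SMARTS with `URLEncode` before joining is one option. This is a defensive sketch that assumes the PubChem endpoint decodes the path in the usual way; `encodedURL` is an illustrative name.

```mathematica
api = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
smarts = "[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1";

(* Percent-encode the query before inserting it into the URL path *)
encodedURL = api <> "fastsubstructure/smarts/" <> URLEncode[smarts] <> "/cids/JSON"

(* Check that decoding recovers the original SMARTS string *)
URLDecode[URLEncode[smarts]] === smarts
```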

In [None]:
hitCIDsAll = Flatten[hitCIDs[[All, "IdentifierList", "CID"]]];
hitCIDsAll // Shallow //OutputForm
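
Each imported response is an association with the CID list nested under `"IdentifierList"` and `"CID"`, which is why a single `Part` specification can pull the CIDs out of every response at once. The sketch below uses made-up CIDs to show the extraction offline; `mockHits` is a placeholder, and the `DeleteDuplicates` step is an optional addition in case one compound matches more than one query.

```mathematica
(* Made-up responses with the same nesting as the imported JSON *)
mockHits = {
   <|"IdentifierList" -> <|"CID" -> {101, 102}|>|>,
   <|"IdentifierList" -> <|"CID" -> {102, 103}|>|>};

(* Pull the CID list out of every response, then merge into one list *)
allCIDs = Flatten[mockHits[[All, "IdentifierList", "CID"]]];

(* Optional: drop CIDs that were returned by more than one query *)
uniqueCIDs = DeleteDuplicates[allCIDs]
```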

As in the PubChem similarity search above, we will extract data for only the first 25 CIDs.

In [None]:
hitCIDsShort = Take[hitCIDsAll, UpTo[25]]  (*UpTo avoids an error if fewer than 25 CIDs are returned*)

In [None]:
smartsCompoundData = {};
For[i = 1, i <= Length[hitCIDsShort], i++,
 tempID = hitCIDsShort[[i]];
 tempData = 
  pubchem["CompoundProperties", "CompoundID" -> tempID][[1]][{"CompoundID", "InChI", "CanonicalSMILES", "MolecularWeight", 
    "IUPACName", "HeavyAtomCount", "CovalentUnitCount", "Charge"}];
 AppendTo[smartsCompoundData, tempData]
 ]

In [None]:
smartsCompoundData[[;;3]]

In [None]:
smartsCompoundData[[;;3]] //OutputForm (*Changed to normal text output*)

### Exporting the Data to a CSV file

In [None]:
(*Initializing the data with headers*)
smartsData = {Normal[Keys[smartsCompoundData[[1]]]]}; (*Normal turns the keys and values from a dataset to a list*)

For[i = 1, i <= Length[hitCIDsShort], i++,
 entries = Normal[Values[smartsCompoundData[[i]]]];
 AppendTo[smartsData, entries]
 ]

In [None]:
smartsData[[;;3]] //Dataset

In [None]:
smartsData[[;;3]] //OutputForm (*Changed to normal text output*)

In [None]:
Export["pubchem_smarts_data.csv", smartsData]

### Retrieve Images of CID Compounds from SMARTS query match

In [None]:
Table[pubchem["CompoundImage", {"CompoundID" -> id}], {id, hitCIDsShort}]