PubChem API in Unix Shell#

by Avery Fernandez and Vincent F. Scalfani

These recipe examples were tested on August 4, 2022 using GNOME Terminal (with Bash 4.4.20) in Ubuntu 18.04.

PubChem API Documentation: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access

Attribution: This tutorial was adapted from supporting information in:

Scalfani, V. F.; Ralph, S. C.; Alshaikh, A. A.; Bara, J. E. Programmatic Compilation of Chemical Data and Literature From PubChem Using Matlab. Chemical Engineering Education, 2020, 54, 230. https://doi.org/10.18260/2-1-370.660-115508 (and vfscalfani/MATLAB-cheminformatics)

Note

This tutorial uses curl and jq for interacting with the PubChem API. You may also be interested in using the NCBI EDirect command line program. We have several tutorials for EDirect in our EDirectChemInfo repository.

Setup#

Program requirements#

To run this code, you will first need to install curl and jq. curl requests the data from the API, and jq parses the returned JSON. In addition, if you want to print molecules as ASCII characters in your terminal, you will need to install RDKit and download the print_mols Python script.
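
Before running the examples, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch; the function name require_tools is hypothetical and not part of the original tutorial.

```shell
# Hypothetical helper (not from the original tutorial): report any
# required command-line tools that are missing from PATH.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the two tools this tutorial depends on.
require_tools curl jq || echo "Install the missing tools before continuing."
```

To check for the optional ASCII drawing dependency as well, python3 could be added to the argument list.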

Define base URL#

Define the PubChem PUG-REST API base URL:

api="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"
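
Every request in this tutorial appends a namespace, an identifier, and an operation to this base URL. As a small sketch of that pattern, the hypothetical helper pug_url below (our name, not part of PubChem or the original tutorial) assembles the pieces:

```shell
api="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"

# Hypothetical helper: build a PUG-REST compound URL from
# <namespace> <identifier> <operation/output>.
pug_url() {
  printf '%s%s/%s/%s\n' "$api" "$1" "$2" "$3"
}

# Print the two URLs used in the next sections.
pug_url cid 2734162 PNG
pug_url cid 2734162 property/InChI,IsomericSMILES/JSON
```

Printing the URL before passing it to curl is a simple way to debug a request that returns an unexpected result.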

1. PubChem Similarity#

Get Compound Image#

We can search for a compound and display an image, for example: 1-Butyl-3-methyl-imidazolium; CID = 2734162

compoundID="2734162"
curl -s "${api}cid/${compoundID}/PNG" -o CID_2734162.png

Note

The silent option (-s) for curl was used to hide the progress outputs.

If you want to open the PNG file in an image viewer program from your terminal, try xdg-open:

xdg-open CID_2734162.png

Output:

../../_images/CID_2734162.png
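
Because -s also suppresses curl's error messages, a failed download can go unnoticed and leave an error body saved under a .png name. The sketch below is one possible way to guard against that, using curl's -S and -w '%{http_code}' options; the function names is_success and fetch_png are hypothetical.

```shell
api="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"

# Hypothetical helper: succeed only for 2xx HTTP status codes.
is_success() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Hypothetical helper: download the PNG for a CID and report failures.
# -S re-enables the error messages that -s would otherwise hide;
# -w '%{http_code}' prints the HTTP status after the transfer.
fetch_png() {
  cid=$1
  out="CID_${cid}.png"
  status=$(curl -sS -o "$out" -w '%{http_code}' "${api}cid/${cid}/PNG") || status=000
  if is_success "$status"; then
    echo "Saved $out"
  else
    echo "Request for CID $cid failed (HTTP $status)" >&2
    rm -f "$out"
    return 1
  fi
}

# Usage (requires network access):
# fetch_png 2734162
```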

Retrieve InChI and SMILES#

request=$(curl -s "$api""cid/""$compoundID""/property/inchi,IsomericSMILES/JSON")
echo "$request" | jq '.'

Output:

{
  "PropertyTable": {
    "Properties": [
      {
        "CID": 2734162,
        "IsomericSMILES": "CCCCN1C=C[N+](=C1)C",
        "InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"
      }
    ]
  }
}

Now, extract the InChI:

echo "$request" | jq '.["PropertyTable"]["Properties"][0]["InChI"]'

Output:

"InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"

And the IsomericSMILES:

echo "$request" | jq '.["PropertyTable"]["Properties"][0]["IsomericSMILES"]'

Output:

"CCCCN1C=C[N+](=C1)C"

Display Molecule as ASCII Drawing#

We can use the extracted SMILES to generate an ASCII drawing within our terminal. First, we will extract the SMILES using jq, and then pipe the SMILES to a print_mols Python script, which uses the cheminformatics program RDKit to parse the SMILES, compute drawing coordinates, and then print the molecule as ASCII characters:

echo "$request" | jq '.["PropertyTable"]["Properties"][0]["IsomericSMILES"]' | tr -d '"' | python3 print_mols.py -

Output:

                                            C
                                        *
                                    C         *

                                  *             N
                                                      *
C               C               N             *             C
    *       *                         *
        C           *       *               C

                        C

Note

tr -d '"' removes the quotes around the extracted SMILES; python3 print_mols.py - prints the molecule.
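
As an alternative to tr, jq's -r (raw output) option prints strings without the surrounding JSON quotes, and the bracketed paths can be written in the equivalent dot notation. The sketch below uses a copy of the response shown earlier so that it runs without a network call; the variable names smiles and inchi are illustrative.

```shell
# Sample response copied from the output shown earlier in this tutorial.
request='{"PropertyTable":{"Properties":[{"CID":2734162,"IsomericSMILES":"CCCCN1C=C[N+](=C1)C","InChI":"InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"}]}}'

# Dot notation is equivalent to the ["key"] form; -r drops the JSON quotes.
smiles=$(printf '%s\n' "$request" | jq -r '.PropertyTable.Properties[0].IsomericSMILES')
inchi=$(printf '%s\n' "$request" | jq -r '.PropertyTable.Properties[0].InChI')

echo "$smiles"
echo "$inchi"

# Usage with the drawing script (requires RDKit and print_mols.py):
# printf '%s\n' "$smiles" | python3 print_mols.py -
```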

Retrieve Identifier and Property Data#

Get the following data for the retrieved CIDs (stored in the Bash array idList): InChI, Isomeric SMILES, Molecular Weight (MW), Heavy Atom Count, Rotatable Bond Count, and Charge. As a test, we will only get data for the first 5 CIDs:

for id in "${idList[@]:0:5}"
do
  compound=$(echo "$id" | sed 's/ //g')
  request=$(curl -s "${api}cid/${compound}/property/InChI,IsomericSMILES,MolecularWeight,HeavyAtomCount,RotatableBondCount,Charge/JSON")
  echo "$request" | jq '.["PropertyTable"]["Properties"][0]'
  sleep 1
done

Output:

{
  "CID": 2734161,
  "MolecularWeight": "174.67",
  "IsomericSMILES": "CCCCN1C=C[N+](=C1)C.[Cl-]",
  "InChI": "InChI=1S/C8H15N2.ClH/c1-3-4-5-10-7-6-9(2)8-10;/h6-8H,3-5H2,1-2H3;1H/q+1;/p-1",
  "Charge": 0,
  "RotatableBondCount": 3,
  "HeavyAtomCount": 11
}
{
  "CID": 61347,
  "MolecularWeight": "124.18",
  "IsomericSMILES": "CCCCN1C=CN=C1",
  "InChI": "InChI=1S/C7H12N2/c1-2-3-5-9-6-4-8-7-9/h4,6-7H,2-3,5H2,1H3",
  "Charge": 0,
  "RotatableBondCount": 3,
  "HeavyAtomCount": 9
}
{
  "CID": 529334,
  "MolecularWeight": "138.21",
  "IsomericSMILES": "CCCCCN1C=CN=C1",
  "InChI": "InChI=1S/C8H14N2/c1-2-3-4-6-10-7-5-9-8-10/h5,7-8H,2-4,6H2,1H3",
  "Charge": 0,
  "RotatableBondCount": 4,
  "HeavyAtomCount": 10
}
{
  "CID": 304622,
  "MolecularWeight": "138.21",
  "IsomericSMILES": "CCCCN1C=CN=C1C",
  "InChI": "InChI=1S/C8H14N2/c1-3-4-6-10-7-5-9-8(10)2/h5,7H,3-4,6H2,1-2H3",
  "Charge": 0,
  "RotatableBondCount": 3,
  "HeavyAtomCount": 10
}
{
  "CID": 118785,
  "MolecularWeight": "110.16",
  "IsomericSMILES": "CCCN1C=CN=C1",
  "InChI": "InChI=1S/C6H10N2/c1-2-4-8-5-3-7-6-8/h3,5-6H,2,4H2,1H3",
  "Charge": 0,
  "RotatableBondCount": 2,
  "HeavyAtomCount": 8
}
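
In the output above, MolecularWeight is returned as a quoted JSON string, whereas Charge and the counts are numbers. Numeric comparisons on the molecular weight therefore need a conversion first. The sketch below uses jq's tonumber on CID and MolecularWeight pairs copied from the output; the 130 cutoff and the variable names are arbitrary choices for illustration.

```shell
# CID and MolecularWeight pairs copied from the output above.
properties='[
  {"CID": 2734161, "MolecularWeight": "174.67"},
  {"CID": 61347,   "MolecularWeight": "124.18"},
  {"CID": 529334,  "MolecularWeight": "138.21"},
  {"CID": 304622,  "MolecularWeight": "138.21"},
  {"CID": 118785,  "MolecularWeight": "110.16"}
]'

# Convert each string to a number, then keep the CIDs with MW below 130.
light_cids=$(printf '%s\n' "$properties" |
  jq -r '.[] | select((.MolecularWeight | tonumber) < 130) | .CID')

echo "$light_cids"
```

With this sample data, the command prints 61347 and 118785, one per line.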

Note

sed 's/ //g' removes the extra space before the CID values. tr -d ' ' should also work to remove the extra space.
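
PUG-REST also accepts a comma-separated list of CIDs in a single request, which avoids one HTTP call (and one sleep) per compound. The sketch below folds the space cleanup into a hypothetical join_cids helper; the three CIDs are taken from the output above, and the leading spaces mimic the idList values described in the note.

```shell
api="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"

# Hypothetical helper: join its arguments with commas and strip spaces.
# The function body runs in a subshell, so the IFS change stays local.
join_cids() (
  IFS=,
  printf '%s\n' "$*" | tr -d ' '
)

# Build one request URL covering several CIDs.
cids=$(join_cids " 2734161" " 61347" " 529334")
url="${api}cid/${cids}/property/MolecularWeight,Charge/JSON"
echo "$url"

# Usage (requires network access), in Bash with the idList array:
# curl -s "${api}cid/$(join_cids "${idList[@]:0:5}")/property/MolecularWeight,Charge/JSON" | jq '.PropertyTable.Properties[]'
```

The response then contains one entry per CID in the Properties array, so the jq path iterates over the array with [] instead of selecting [0].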

We can modify the jq line to extract specific data values, such as the MolecularWeight:

for id in "${idList[@]:0:5}"
do
  compound=$(echo "$id" | sed 's/ //g')
  request=$(curl -s "${api}cid/${compound}/property/InChI,IsomericSMILES,MolecularWeight,HeavyAtomCount,RotatableBondCount,Charge/JSON")
  echo "$request" | jq '.["PropertyTable"]["Properties"][0]["MolecularWeight"]'
  sleep 1
done

Output:

"174.67"
"124.18"
"138.21"
"138.21"
"110.16"