# PubChem API in C
by Cyrus Gomes

**PubChem API Documentation**: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access

These recipe examples were tested on July 25, 2023.

**Attribution:** This tutorial was adapted from supporting information in:

**Scalfani, V. F.**; Ralph, S. C.; Alshaikh, A. A.; Bara, J. E. Programmatic Compilation of Chemical Data and Literature From PubChem Using Matlab. *Chemical Engineering Education*, **2020**, *54*, 230. https://doi.org/10.18260/2-1-370.660-115508 (see also https://github.com/vfscalfani/MATLAB-cheminformatics)

## Setup

First, install curl, jq, and the libcurl development package by running the following command in the terminal:

In [None]:
!sudo apt install curl jq libcurl4-openssl-dev

Next, we create a directory to hold all of our PubChem projects:

In [1]:
!mkdir Pub_Chem

Finally, we change the directory to the folder we created:

In [None]:
%cd Pub_Chem

## 1. PubChem Property

### Get property details

Next, we create a folder for the current project and change into it:

In [3]:
!mkdir Property

In [None]:
%cd Property

We utilize the `%%file` command to create the following makefile which will compile our program and create an executable.

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Set the bin variable as the name of the binary file we are creating
BIN=property_search

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM


The `%%file` command is used again to create the .c file that contains the program's code:

In [None]:
%%file property_search.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* CURL program that retrieves property details about the CID 
and outputs to terminal. Custom property fields can be added */

int main (int argc, char* argv[]) {
    
    // If arguments are invalid then return
    if (argc < 2){                                                                                      
        printf("Error. Please try again correctly.\n");
        return -1;
    }

    // Initialize the CURL HTTP connection
    CURL *curl = curl_easy_init();

    // Bits of the url that are joined together later                                                                      
    char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/";                            
    char url[1000];
    char label_1[] = "/property/";
    char format[] = "/JSON";

    // Check if CURL initialization is a success or not
    if (!curl) {                                                                                         
        fprintf(stderr, "init failed\n");
        return EXIT_FAILURE;
    }

    // Use the default property list when only a CID is given,
    // or when "-p" is passed without a property list
    if ((argc==2)||((argc==3) && (strcmp(argv[2],"-p")==0))) {
        char search_type[] = "/property/inchi,IsomericSMILES,MolecularFormula,MolecularWeight/JSON";

        // Combine all the bits to produce a functioning url
        // (snprintf never writes past the end of the url buffer)
        snprintf(url, sizeof(url), "%s%s%s", api, argv[1], search_type);

    }

    // Use the custom property list passed after "-p"
    else if ((argc==4)&&(strcmp(argv[2],"-p")==0)) {

        // Combine all the bits to produce a functioning url
        snprintf(url, sizeof(url), "%s%s%s%s%s", api, argv[1], label_1, argv[3], format);

    }

    // If the arguments are invalid then report the problem and return
    else {
        fprintf(stderr, "Usage: %s <CID> [-p <property1,property2,...>]\n", argv[0]);
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Perform the HTTP request; by default libcurl writes the response body to stdout
    CURLcode result = curl_easy_perform(curl);

    // If result is not retrieved then output error
    if (result != CURLE_OK) {
        fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
    }

    // Deallocate memory for the CURL connection
    curl_easy_cleanup(curl);

    // Report failure to the shell if the request did not succeed
    return (result == CURLE_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
}
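
The program above assembles the request URL in a fixed 1000-byte buffer, so a very long property list could overflow it. The following is a minimal sketch of a bounded alternative. The helper name `build_property_url` is hypothetical (it is not part of the original program), and it assumes the same PUG REST URL layout used above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not in the original program): writes the PUG REST
   property URL for `cid` into `out`. Returns 0 on success, or -1 if the
   URL does not fit in `out_size` bytes. */
int build_property_url(char *out, size_t out_size,
                       const char *cid, const char *properties) {
    assert(out != NULL && cid != NULL && properties != NULL);

    const char *api = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/";

    // snprintf never writes more than out_size bytes and returns the
    // length the complete string would have had
    int needed = snprintf(out, out_size, "%s%s/property/%s/JSON",
                          api, cid, properties);

    // A negative value signals an encoding error; a value >= out_size
    // means the URL was truncated
    if (needed < 0 || (size_t)needed >= out_size) {
        return -1;
    }
    return 0;
}
```

A caller could check the return value before passing the buffer to `curl_easy_setopt`, and print an error instead of sending a truncated URL.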

The following command compiles the program and creates the executable:

In [None]:
!make

We can look up a compound by its CID, for example: 1-Butyl-3-methyl-imidazolium; CID = 2734162

If we run the executable and enter the CID and the custom properties that we want to add, we get the result:

In [8]:
!./property_search 2734162 -p "inchi"

{
  "PropertyTable": {
    "Properties": [
      {
        "CID": 2734162,
        "InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"
      }
    ]
  }
}


We can add additional properties as follows:

In [9]:
!./property_search 2734162 -p "inchi,XLogP,HBondDonorCount,HBondAcceptorCount,RotatableBondCount"

{
  "PropertyTable": {
    "Properties": [
      {
        "CID": 2734162,
        "InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1",
        "XLogP": 1.3,
        "HBondDonorCount": 0,
        "HBondAcceptorCount": 0,
        "RotatableBondCount": 3
      }
    ]
  }
}


The following command is used to output the default fields (inchi,IsomericSMILES,MolecularFormula,MolecularWeight):

In [10]:
!./property_search 2734162

{
  "PropertyTable": {
    "Properties": [
      {
        "CID": 2734162,
        "MolecularFormula": "C8H15N2+",
        "MolecularWeight": "139.22",
        "IsomericSMILES": "CCCCN1C=C[N+](=C1)C",
        "InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"
      }
    ]
  }
}


The following command is used to output the SMILES with jq:

In [11]:
# Get SMILES with jq
!./property_search 2734162 | jq '.["PropertyTable"]["Properties"][0]["IsomericSMILES"]'

"CCCCN1C=C[N+](=C1)C"


## 2. PubChem Compound Image

### Download image of the requested CID

We move back up to the Pub_Chem folder and create a new folder for this project:

In [None]:
%cd ..

In [13]:
!mkdir Image

In [None]:
%cd Image

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Set the bin variable as the name of the binary file we are creating
BIN=image_download

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM


In [None]:
%%file image_download.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* This code was adapted from https://stackoverflow.com/questions/10112959/download-an-image-from-server-curl-however-taking-suggestions-c
and modified to download the pubchem images */

// Download custom CID image in a .png format

// Write the data received from the URL into the file
size_t callbackfunction(void *ptr, size_t size, size_t nmemb, void* userdata) {
    // The userdata pointer is the file stream set with CURLOPT_WRITEDATA
    FILE* stream = (FILE*)userdata;

    // Check if a stream is detected to write into the file
    if (!stream) {
        printf("!!! No stream\n");
        return 0;
    }

    // Write the received chunk and return the number of bytes written;
    // libcurl treats any other value as an error
    size_t written = fwrite(ptr, size, nmemb, stream);
    return written * size;
}

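// Download the image at the URL and report whether it succeeded
bool download_png(char* url, char name[]) {
    // Build the file name "<name>.png" in a local buffer; appending to
    // argv[1] in place would write past the end of the argument
    char filename[256];
    snprintf(filename, sizeof(filename), "%s.png", name);
    FILE* fp = fopen(filename, "wb");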

    // If file is not created abort the system
    if (!fp) {
        printf("!!! Failed to create file on the disk\n");                                     
        return false;
    }

    // Initialize the CURL connection
    CURL* curlCtx = curl_easy_init();                                                           
    
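    // If initialization does not work then close the file and report failure
    // (this function returns bool, so EXIT_FAILURE would be read as true)
    if (!curlCtx) {
        fprintf(stderr, "init failed\n");
        fclose(fp);
        return false;
    }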

    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
    curl_easy_setopt(curlCtx, CURLOPT_URL, url);

    // Set the data pointer for writing the response body of the HTTP request
    // The third parameter is a pointer to the file where the response data will be written.
    curl_easy_setopt(curlCtx, CURLOPT_WRITEDATA, fp);

    // Set the callback function which is called by libcurl for the response body of the HTTP request
    curl_easy_setopt(curlCtx, CURLOPT_WRITEFUNCTION, callbackfunction);
    
    // Set the option to enable HTTP redirects
    // For the third parameter the value of 1L enables following of HTTP redirects, and a value of 0L disables it.
    curl_easy_setopt(curlCtx, CURLOPT_FOLLOWLOCATION, 1L);

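    // Perform an HTTP request
    CURLcode rc = curl_easy_perform(curlCtx);

    // If request is unsuccessful then release resources and abort
    if (rc != CURLE_OK) {
        printf("!!! Failed to download: %s\n", url);
        curl_easy_cleanup(curlCtx);
        fclose(fp);
        return false;
    }

    long res_code = 0;

    // Get the response code returned by the HTTP server
    curl_easy_getinfo(curlCtx, CURLINFO_RESPONSE_CODE, &res_code);

    // Deallocate memory for the CURL connection
    curl_easy_cleanup(curlCtx);

    // Avoid resource leaks by closing file pointer
    fclose(fp);

    // Treat HTTP error statuses (for example, an unknown CID) as failures
    if (res_code >= 400) {
        printf("!!! Server returned HTTP status %ld\n", res_code);
        return false;
    }

    return true;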
}

int main(int argc, char* argv[]) {
    // If the number of arguments is not exactly 2 then error
    if (argc != 2) {
        printf("Error. Please try again correctly.\n");
        return -1;
    }

    // Bits of data required for the API search
    char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/";
    char type[] = "/PNG";
    char url[1000];

    // Combine all the bits together to create the final URL
    // (snprintf never writes past the end of the url buffer)
    snprintf(url, sizeof(url), "%s%s%s", api, argv[1], type);

    // If image not found retrieve error
    if (!download_png(url, argv[1])) {
        printf("!! Failed to download file \n");
        return -1;
    }

    return 0;
}
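
libcurl expects a write callback to return the number of bytes it handled, which is `size * nmemb`. As an illustration of that contract, the sketch below collects the response in memory instead of a file. The `memory_buffer` struct and `memory_callback` function are hypothetical names introduced here; only the callback signature comes from libcurl.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer used as the callback's userdata */
struct memory_buffer {
    char *data;   // heap-allocated bytes received so far
    size_t size;  // number of valid bytes in data
};

/* Same signature libcurl expects for CURLOPT_WRITEFUNCTION.
   Appends the received chunk to the buffer and returns the number of
   bytes handled; a different value tells libcurl to abort the transfer. */
size_t memory_callback(void *ptr, size_t size, size_t nmemb, void *userdata) {
    assert(userdata != NULL);
    struct memory_buffer *buf = (struct memory_buffer *)userdata;
    size_t chunk = size * nmemb;

    // Grow the buffer, keeping one extra byte for a terminating '\0'
    char *grown = realloc(buf->data, buf->size + chunk + 1);
    if (!grown) {
        return 0;  // out of memory: report failure to libcurl
    }

    buf->data = grown;
    memcpy(buf->data + buf->size, ptr, chunk);
    buf->size += chunk;
    buf->data[buf->size] = '\0';
    return chunk;
}
```

To use it, a program would start with `struct memory_buffer buf = {NULL, 0};`, pass `&buf` with `CURLOPT_WRITEDATA` and `memory_callback` with `CURLOPT_WRITEFUNCTION`, and call `free(buf.data)` when done.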

In [None]:
!make

We can pass any CID to download its image; the file is saved as `<CID>.png` (here, 2734162.png):

In [18]:
!./image_download 2734162
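
If the request fails on the server side, the saved file may hold an error message rather than image data. One way to check is to compare the first eight bytes of the file with the standard PNG signature. The sketch below does this; `looks_like_png` is a hypothetical helper name, not part of the original program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns true if the file at `path` starts with the
   8-byte PNG signature, false if it cannot be read or does not match. */
bool looks_like_png(const char *path) {
    assert(path != NULL);
    static const unsigned char signature[8] =
        {0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    unsigned char header[8];

    FILE *fp = fopen(path, "rb");
    if (!fp) {
        return false;
    }

    // Read exactly one 8-byte header; fread returns the number of items read
    size_t items = fread(header, sizeof(header), 1, fp);
    fclose(fp);

    return items == 1 && memcmp(header, signature, sizeof(signature)) == 0;
}
```

For the download above, the check would be `looks_like_png("2734162.png")`.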

## 3. PubChem Similarity Search

### Performs a similarity search and returns the CID list

In [None]:
%cd ..

In [20]:
!mkdir Similarity

In [None]:
%cd Similarity

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Set the bin variable as the name of the binary file we are creating
BIN=similarity_search

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM


In [None]:
%%file similarity_search.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// Perform a similarity search with results (CID) in a .txt file

// Retrieve the search result and output to a .txt file
bool similarity_search_file(char* url, char name[])                                                           
{
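    // Build the file name "<name>.txt" in a local buffer; appending to
    // argv[1] in place would write past the end of the argument
    char filename[256];
    snprintf(filename, sizeof(filename), "%s.txt", name);
    FILE* fp = fopen(filename, "wb");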

    // If file is not created abort the system                                                                   
    if (!fp) {                                                                                                
        printf("!!! Failed to create file on the disk\n");
        return false;
    }

    // Initialize the CURL connection
    CURL* curl = curl_easy_init();                                                                 
    
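    // If initialization does not work then close the file and report failure
    // (this function returns bool, so EXIT_FAILURE would be read as true)
    if (!curl) {
        fprintf(stderr, "init failed\n");
        fclose(fp);
        return false;
    }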

    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set    
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Set the data pointer for writing the response body of the HTTP request
    // The third parameter is a pointer to the file where the response data will be written.
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);

    // Perform an HTTP request
    CURLcode rc = curl_easy_perform(curl);

    // If request is unsuccessful then release resources and abort
    if (rc != CURLE_OK) {
        printf("!!! Failed to download: %s\n", url);
        curl_easy_cleanup(curl);
        fclose(fp);
        return false;
    }

    // Clean up allocated resources
    curl_easy_cleanup(curl);                                                                         

    // Avoid memory leaks by closing file pointer                                                                                                   
    fclose(fp);                                                                                          
    return true;
}

// Retrieve the search result and output to stdout
bool similarity_search(char* url)                                                       
{
    // Initialize the curl connection (http request)
    CURL* curl = curl_easy_init();                                                                 
    
    // If initialization does not work then report failure
    // (this function returns bool, so EXIT_FAILURE would be read as true)
    if (!curl) {
        fprintf(stderr, "init failed\n");
        return false;
    }
    
    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Perform an HTTP request; by default libcurl writes the response body to stdout
    CURLcode result = curl_easy_perform(curl);

    // If result is not retrieved then output error
    if (result != CURLE_OK){
        fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
    }

    // Clean up allocated resources
    curl_easy_cleanup(curl);

    // Report whether the request succeeded
    return result == CURLE_OK;
}

int main(int argc, char* argv[]) {

    // If arguments are lower than 2 then error
    if (argc < 2){                                                                                        
        printf("Error. Please try again correctly.\n");
        return -1;
    }

    // Bits of data required for the API search
    char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
    char search_type[] = "fastsimilarity_2d/cid/";
    char url[1000];
    char ending[] = "/cids/JSON?Threshold=95";

    // Combine all the bits together to create the final URL
    // (snprintf never writes past the end of the url buffer)
    snprintf(url, sizeof(url), "%s%s%s%s", api, search_type, argv[1], ending);

    // Check if conditions match to output to stdout
    if (argc==2) {                                                                           

        // Check if the API request was fulfilled and downloaded
        if (!similarity_search(url)) {
            printf("!! Failed to retrieve data\n");
            return -1;
        }                                           
    
    }

    // Check if conditions match to output to the default (.txt) file
    else if ((argc==3)&&(strcmp(argv[2],"-o")==0)) {                                                     

        // Check if the api request was fulfilled and downloaded to the file
        if (!similarity_search_file(url, argv[1])) {
            printf("!! Failed to download file \n");
            return -1;
        }
    
    }

    return 0;
}
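
The program above hard-codes `Threshold=95` in the URL. As a sketch of how the threshold could be made configurable, the helper below builds the same URL from a threshold given as text. The name `build_similarity_url` is hypothetical, and the accepted range of 0 to 100 is an assumption made for this example rather than a documented PubChem limit.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: builds a fastsimilarity_2d URL for `cid` with the
   threshold given as text (e.g. from argv). Returns 0 on success, -1 if the
   threshold is not an integer from 0 to 100 (assumed range) or the URL
   does not fit in `out_size` bytes. */
int build_similarity_url(char *out, size_t out_size,
                         const char *cid, const char *threshold_text) {
    assert(out != NULL && cid != NULL && threshold_text != NULL);

    char *end;
    long threshold = strtol(threshold_text, &end, 10);

    // Reject empty input, trailing characters, and out-of-range values
    if (end == threshold_text || *end != '\0' ||
        threshold < 0 || threshold > 100) {
        return -1;
    }

    int needed = snprintf(out, out_size,
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"
        "fastsimilarity_2d/cid/%s/cids/JSON?Threshold=%ld",
        cid, threshold);

    // A value >= out_size means the URL was truncated
    return (needed < 0 || (size_t)needed >= out_size) ? -1 : 0;
}
```

Calling it with `"2734162"` and `"95"` reproduces the URL that the program above sends.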

In [24]:
!make

gcc -g -Wall similarity_search.c -o similarity_search -lcurl


We will use the PubChem API to perform a Fingerprint Tanimoto Similarity Search (SS).

(2D Tanimoto threshold 95% to 1-Butyl-3-methyl-imidazolium; CID = 2734162)
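
For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it for fingerprints stored as arrays of 64-bit words. This storage layout and the function names are assumptions made for illustration; PubChem computes the score on its own servers with its own fingerprints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word (Kernighan's method) */
static int count_bits(uint64_t word) {
    int count = 0;
    while (word) {
        word &= word - 1;  // clear the lowest set bit
        count++;
    }
    return count;
}

/* Tanimoto coefficient of two bit fingerprints stored as `n_words`
   64-bit words each: |A and B| / |A or B|. Returns 0.0 if both are empty. */
double tanimoto(const uint64_t *a, const uint64_t *b, size_t n_words) {
    assert(a != NULL && b != NULL);
    int common = 0;  // bits set in both fingerprints
    int either = 0;  // bits set in at least one fingerprint

    for (size_t i = 0; i < n_words; i++) {
        common += count_bits(a[i] & b[i]);
        either += count_bits(a[i] | b[i]);
    }
    return either == 0 ? 0.0 : (double)common / either;
}
```

With this definition, a 95% threshold keeps only compounds whose coefficient with the query is at least 0.95.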

We can pass any CID and print the first five similar CIDs with jq:

In [25]:
!./similarity_search 2734162 | jq ".IdentifierList.CID[0:5]"

[
  61347,
  529334,
  2734161,
  12971008,
  304622
]


We can also write the list of CIDs to a .txt file named after the CID (here, 2734162.txt):

In [26]:
!./similarity_search 2734162 -o

## 4. PubChem SMARTS Search

### Performs a SMARTS substructure search and returns the CID list

In [None]:
%cd ..

In [28]:
!mkdir Smarts

In [None]:
%cd Smarts

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Set the bin variable as the name of the binary file we are creating
BIN=smarts_search

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Removes the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM


In [None]:
%%file smarts_search.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// Run a SMARTS query and write the combined list of matching CIDs to stdout or to a default/custom file

// Retrieve the search result and outputs it to a file
bool smarts_search_file(char* url, char name[]) {

    // Open the output file using the name exactly as given
    FILE* fp = fopen(name, "wb");

    // If file is not created abort the system                                                                   
    if (!fp) {                                                                                               
        printf("!!! Failed to create file on the disk\n");
        return false;
    }

    // Initialize the CURL connection
    CURL* curl = curl_easy_init();                                                                 
    
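    // If initialization does not work then close the file and report failure
    // (this function returns bool, so EXIT_FAILURE would be read as true)
    if (!curl) {
        fprintf(stderr, "init failed\n");
        fclose(fp);
        return false;
    }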

    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set    
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Set the data pointer for writing the response body of the HTTP request
    // The third parameter is a pointer to the file where the response data will be written.
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);

    // Perform an HTTP request
    CURLcode rc = curl_easy_perform(curl);

    // If request is unsuccessful then release resources and abort
    if (rc != CURLE_OK) {
        printf("!!! Failed to download: %s\n", url);
        curl_easy_cleanup(curl);
        fclose(fp);
        return false;
    }

    // Clean up allocated resources
    curl_easy_cleanup(curl);                                                                         

    // Avoid memory leaks by closing file pointer                                                                                                      
    fclose(fp);                                                                                          
    return true;
}


// Retrieve the search result and outputs to stdout
bool smarts_search(char* url) {

    // Initialize the CURL connection
    CURL* curl = curl_easy_init();                                                                 
    
    // If initialization does not work then report failure
    // (this function returns bool, so EXIT_FAILURE would be read as true)
    if (!curl) {
        fprintf(stderr, "init failed\n");
        return false;
    }
    
    // Set the URL to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Perform an HTTP request
    CURLcode result = curl_easy_perform(curl);
    
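    // If result is not retrieved then output error and release resources
    if (result != CURLE_OK) {
        fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
        curl_easy_cleanup(curl);
        return false;
    }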

    // Clean up allocated resources
    curl_easy_cleanup(curl);                                                                                                                                                                 
    return true;
}

int main(int argc, char* argv[]) {
    // If no argument options are provided please return an error
    if (argc < 2){
        printf("Error. Please try again correctly.\n");
        return 0;
    }

    // Call the libcurl library to initialize the HTTP request for encoding
    CURL *curl_en = curl_easy_init();                                                          

    // If initialization does not work then error
    if (!curl_en) {                                                                             
        fprintf(stderr, "init failed\n");
        return EXIT_FAILURE;
    }
    
    // Check if conditions match to output to stdout
    if (argc == 2) {
        
        // Check if the initialization of the HTTP request works
        if (curl_en) {
        
            // Bits of data required for the API search 
            char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";             
            char search_type[] = "fastsubstructure/smarts/";
            char url[1000];
            char ending[] = "/cids/TXT";

            // Function which encodes the query
            char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);                          

            // Combine the bits to form a complete url
            sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);                 

            /* Condition to check whether the api request was fulfilled
            and downloaded*/
            if (!smarts_search(url)) {                                            
                printf("!! Failed to download file \n");
                return -1;
            }

            curl_free(encoded_smarts);
        }
    }
    
    // Check if conditions match to output to the default (output.txt) file
    else if ((argc==3) && (strcmp(argv[2],"-o")==0)) {
        
        // Check if the initialization of the HTTP request works                              
        if (curl_en) {        

            // Bits of data required for the API search                                                                 
            char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";             
            char search_type[] = "fastsubstructure/smarts/";
            char url[1000];
            char ending[] = "/cids/TXT";

            // Function which encodes the query
            char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);                          
            
            // Combines the bits to form a complete url
            sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);                 
            char filename[] = "output.txt";

            /* Condition to check whether the api request was fulfilled
            and downloaded to a default file*/
            if (!smarts_search_file(url, filename)) {                                                                                
                printf("!! Failed to download file \n");
                return -1;
            }

            curl_free(encoded_smarts);
        }
    }

    // Check if conditions match to output to the custom (.txt) file
    else if ((argc==4)&&(strcmp(argv[2],"-o")==0)){

        // Check if the initialization of the HTTP request works                                        
        if (curl_en) {    

            // Bits of data required for the API search                                                                       
            char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
            char search_type[] = "fastsubstructure/smarts/";
            char url[1000];
            char ending[] = "/cids/TXT";
            char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);

            // Combine all the bits together to create the final URL
            sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);                 

            // Check if the api request was fulfilled and downloaded to the custom file
            if (!smarts_search_file(url, argv[3])) {
                printf("!! Failed to download file \n");
                return -1;
            }

            // Free the memory occupied for encoded url
            curl_free(encoded_smarts);      
        }
    }
    
    // Free the memory for the curl_en connection
    curl_easy_cleanup(curl_en);                                                                

    return 0;
}
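
SMARTS strings contain characters such as `[`, `]`, `#`, and `+` that have special meaning in a URL, which is why the program passes the query through `curl_easy_escape` first. To illustrate the transformation, the sketch below is a simplified stand-in that percent-encodes every byte except unreserved characters. The function name `percent_encode` is hypothetical; in a real program, `curl_easy_escape` remains the better choice.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for curl_easy_escape: returns a newly allocated
   string in which every byte except letters, digits, '-', '.', '_' and '~'
   is replaced by %XX. The caller must free() the result. */
char *percent_encode(const char *input) {
    assert(input != NULL);
    static const char hex[] = "0123456789ABCDEF";
    size_t length = strlen(input);

    // Worst case: every byte expands to three characters
    char *output = malloc(3 * length + 1);
    if (!output) {
        return NULL;
    }

    size_t pos = 0;
    for (size_t i = 0; i < length; i++) {
        unsigned char c = (unsigned char)input[i];
        if (isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            output[pos++] = (char)c;   // unreserved: copy unchanged
        } else {
            output[pos++] = '%';       // everything else becomes %XX
            output[pos++] = hex[c >> 4];
            output[pos++] = hex[c & 0x0F];
        }
    }
    output[pos] = '\0';
    return output;
}
```

For example, the `#` in `CCCCCCC#C` becomes `%23`, so the query is sent as `CCCCCCC%23C`.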

In [None]:
!make

We can run a custom SMARTS query and print the first 10 matching CIDs to the terminal:

In [33]:
!./smarts_search "CCCCCCC#C" | head -n10

6291
6231
5991
9839306
6540478
64139
55245
40973
27812
14687


We can run a custom query and write the results to a custom file (here, test1):

In [34]:
!./smarts_search "[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1" -o test1

We can print the first 5 lines of that file:

In [35]:
!head -n 5 test1

121235111
132274871
129853306
129853221
129850195
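
Because the TXT output holds one CID per line, the saved file is easy to load back into a C program. The sketch below reads such a file into an array. The helper name `read_cids` is hypothetical, and the sketch assumes every CID fits in a `long`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one CID per line from `path` into `cids`
   (at most `max_cids` entries) and returns how many were stored,
   or -1 if the file cannot be opened. Lines that do not start with a
   positive number are skipped. */
long read_cids(const char *path, long *cids, long max_cids) {
    assert(path != NULL && cids != NULL);
    FILE *fp = fopen(path, "r");
    if (!fp) {
        return -1;
    }

    char line[64];
    long count = 0;
    while (count < max_cids && fgets(line, sizeof(line), fp)) {
        char *end;
        long cid = strtol(line, &end, 10);

        // end == line means no digits were parsed on this line
        if (end != line && cid > 0) {
            cids[count++] = cid;
        }
    }

    fclose(fp);
    return count;
}
```

For the file created above, `read_cids("test1", cids, 5)` would fill `cids` with the five values printed by `head`.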
