PubChem API in C#
by Cyrus Gomes
PubChem API Documentation: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access
These recipe examples were tested on July 25, 2023.
Attribution: This tutorial was adapted from supporting information in:
Scalfani, V. F.; Ralph, S. C.; Alshaikh, A. A.; Bara, J. E. Programmatic Compilation of Chemical Data and Literature From PubChem Using Matlab. Chemical Engineering Education, 2020, 54, 230. https://doi.org/10.18260/2-1-370.660-115508 (and vfscalfani/MATLAB-cheminformatics)
Setup#
First, install curl, jq, and the libcurl development headers by running the following command in the terminal:
!sudo apt install curl jq libcurl4-openssl-dev
Then we create a directory to hold the PubChem projects:
!mkdir Pub_Chem
Finally, we change the directory to the folder we created:
%cd Pub_Chem
1. PubChem Property#
Get property details#
First, we create a folder for the current project and change to that directory:
!mkdir Property
%cd Property
We use the %%file command to create the following makefile, which compiles our program and creates an executable.
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=property_search
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
# the RM command used -r to remove directories and -f to force delete
$(RM) -rf $(BIN) *.dSYM
We use the %%file command again to create the .c file, which contains the code for the program:
%%file property_search.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* CURL program that retrieves property details about the CID
and outputs to terminal. Custom property fields can be added */
int main (int argc, char* argv[]) {
// If arguments are invalid then return
if (argc < 2){
printf("Error. Please try again correctly.\n");
return -1;
}
// Initialize the CURL HTTP connection
CURL *curl = curl_easy_init();
// Bits of the url that are joined together later
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/";
char url[1000];
char label_1[] = "/property/";
char format[] = "/JSON";
// Check if CURL initialization is a success or not
if (!curl) {
fprintf(stderr, "init failed\n");
return EXIT_FAILURE;
}
// Check if the conditions match for using the default property
if ((argc==2)||((argc==3) && (strcmp(argv[2],"-p")==0))) {
char search_type[] = "/property/inchi,IsomericSMILES,MolecularFormula,MolecularWeight/JSON";
// Combine all the bits to produce a functioning url (snprintf keeps the result within the url buffer)
snprintf(url, sizeof(url), "%s%s%s", api, argv[1], search_type);
}
// Check if the conditions match for using custom property
else if ((argc==4)&&(strcmp(argv[2],"-p")==0)) {
// Combine all the bits to produce a functioning url (snprintf keeps the result within the url buffer)
snprintf(url, sizeof(url), "%s%s%s%s%s", api, argv[1], label_1, argv[3], format);
}
// If the arguments are invalid then print the expected usage and return
else {
fprintf(stderr, "Usage: %s CID [-p property1,property2,...]\n", argv[0]);
curl_easy_cleanup(curl);
return EXIT_FAILURE;
}
// Set the url to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
curl_easy_setopt(curl, CURLOPT_URL, url);
// Perform the HTTP request; by default libcurl prints the response body to stdout
CURLcode result = curl_easy_perform(curl);
// If result is not retrieved then output error
if (result != CURLE_OK) {
fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
}
// Deallocate memory for the CURL connection
curl_easy_cleanup(curl);
return EXIT_SUCCESS;
}
The program is compiled, and the executable is created, with the following command:
!make
As an example compound, we use 1-Butyl-3-methyl-imidazolium (CID = 2734162).
If we run the executable with the CID and the -p option followed by the custom properties that we want, we get the result:
!./property_search 2734162 -p "inchi"
{
"PropertyTable": {
"Properties": [
{
"CID": 2734162,
"InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"
}
]
}
}
We can add additional properties as follows:
!./property_search 2734162 -p "inchi,XLogP,HBondDonorCount,HBondAcceptorCount,RotatableBondCount"
{
"PropertyTable": {
"Properties": [
{
"CID": 2734162,
"InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1",
"XLogP": 1.3,
"HBondDonorCount": 0,
"HBondAcceptorCount": 0,
"RotatableBondCount": 3
}
]
}
}
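The property list passed with -p is inserted into the request URL as a comma-separated path segment. The following is a minimal sketch of that URL construction in isolation, using snprintf so that an overly long CID or property list is rejected instead of overflowing the buffer. The helper name `build_property_url` is hypothetical and is not part of property_search.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of property_search.c): writes the PUG REST
   property URL for `cid` and the comma-separated `props` list into `out`.
   Returns 0 on success, -1 if an argument is empty or the URL does not fit. */
int build_property_url(char *out, size_t out_size,
                       const char *cid, const char *props) {
    assert(out != NULL && cid != NULL && props != NULL);
    if (strlen(cid) == 0 || strlen(props) == 0) {
        return -1;
    }
    int n = snprintf(out, out_size,
                     "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%s/property/%s/JSON",
                     cid, props);
    /* snprintf returns the length it needed; a value >= out_size
       means the result was truncated */
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return 0;
}
```

Calling `build_property_url(url, sizeof url, "2734162", "inchi,XLogP")` fills `url` with the same address the program requests for `-p "inchi,XLogP"`.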
The following command is used to output the default fields (inchi,IsomericSMILES,MolecularFormula,MolecularWeight):
!./property_search 2734162
{
"PropertyTable": {
"Properties": [
{
"CID": 2734162,
"MolecularFormula": "C8H15N2+",
"MolecularWeight": "139.22",
"IsomericSMILES": "CCCCN1C=C[N+](=C1)C",
"InChI": "InChI=1S/C8H15N2/c1-3-4-5-10-7-6-9(2)8-10/h6-8H,3-5H2,1-2H3/q+1"
}
]
}
}
The following command is used to output the SMILES with jq:
# Get SMILES with jq
!./property_search 2734162 | jq '.["PropertyTable"]["Properties"][0]["IsomericSMILES"]'
"CCCCN1C=C[N+](=C1)C"
2. PubChem Compound Image#
Download image of the requested CID#
We move back up to the Pub_Chem folder and create a new directory for this project:
%cd ..
!mkdir Image
%cd Image
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=image_download
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
# the RM command used -r to remove directories and -f to force delete
$(RM) -rf $(BIN) *.dSYM
%%file image_download.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
/* This code was adapted from https://stackoverflow.com/questions/10112959/download-an-image-from-server-curl-however-taking-suggestions-c
and modified to download the pubchem images */
// Download custom CID image in a .png format
// Callback that receives the data downloaded from the URL and writes it into the file
size_t callbackfunction(void *ptr, size_t size, size_t nmemb, void* userdata) {
// Declare a file stream used to hold data
FILE* stream = (FILE*)userdata;
// Check if a stream is detected to write into the file
if (!stream) {
printf("!!! No stream\n");
return 0;
}
// Write the received data to the file and return the number of bytes handled
size_t written = fwrite(ptr, size, nmemb, stream);
return written * size;
}
// Retrieve the image result and check whether it is found or not
bool download_png(char* url, char name[]) {
// Build the file name "<name>.png" in a local buffer (appending to argv[1] in place could overflow it)
char filename[256];
snprintf(filename, sizeof(filename), "%s.png", name);
FILE* fp = fopen(filename, "wb");
// If file is not created abort the system
if (!fp) {
printf("!!! Failed to create file on the disk\n");
return false;
}
// Initialize the CURL connection
CURL* curlCtx = curl_easy_init();
// If initialization does not work then close the file and return an error
if (!curlCtx) {
fprintf(stderr, "init failed\n");
fclose(fp);
return false;
}
// Set the url to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
curl_easy_setopt(curlCtx, CURLOPT_URL, url);
// Set the data pointer for writing the response body of the HTTP request
// The third parameter is a pointer to the file where the response data will be written.
curl_easy_setopt(curlCtx, CURLOPT_WRITEDATA, fp);
// Set the callback function which is called by libcurl for the response body of the HTTP request
curl_easy_setopt(curlCtx, CURLOPT_WRITEFUNCTION, callbackfunction);
// Set the option to enable HTTP redirects
// For the third parameter the value of 1L enables following of HTTP redirects, and a value of 0L disables it.
curl_easy_setopt(curlCtx, CURLOPT_FOLLOWLOCATION, 1L);
// Perform an HTTP request
CURLcode rc = curl_easy_perform(curlCtx);
long res_code = 0;
// Get the response code returned by the HTTP server
curl_easy_getinfo(curlCtx, CURLINFO_RESPONSE_CODE, &res_code);
// Deallocate memory for the CURL connection
curl_easy_cleanup(curlCtx);
// Avoid resource leaks by closing file pointer
fclose(fp);
// If the request failed or the server did not answer with 200 (OK) then abort
if (rc != CURLE_OK || res_code != 200) {
printf("!!! Failed to download: %s\n", url);
return false;
}
}
int main(int argc, char* argv[]) {
// If the number of arguments is not exactly 2 (program name and CID) then error
if (argc != 2) {
printf("Error. Please try again correctly.\n");
return -1;
}
// Bits of data required for the API search
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/";
char type[] = "/PNG";
char url[1000];
// Combine all the bits together to create the final URL
sprintf(url, "%s%s%s", api, argv[1], type);
// If the image could not be downloaded then report an error
if (!download_png(url, argv[1])) {
printf("!! Failed to download file \n");
return -1;
}
return 0;
}
!make
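The callback in image_download.c passes each received chunk to fwrite. A libcurl write callback is expected to return the number of bytes it handled (size * nmemb); any other value is treated as a write error. The sketch below shows the same contract with a callback that collects the data in memory instead of a file. The `memory_buffer` struct and the `append_to_buffer` name are hypothetical, and the block does not include libcurl so that it compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer passed to the callback as `userdata` */
struct memory_buffer {
    char *data;
    size_t length;
};

/* Same signature as a libcurl write callback: appends the received chunk
   to the buffer and returns the number of bytes it handled. */
size_t append_to_buffer(void *ptr, size_t size, size_t nmemb, void *userdata) {
    assert(ptr != NULL && userdata != NULL);
    struct memory_buffer *buf = (struct memory_buffer *)userdata;
    size_t chunk = size * nmemb;
    /* Grow the buffer by the chunk size plus one byte for a terminator */
    char *grown = realloc(buf->data, buf->length + chunk + 1);
    if (grown == NULL) {
        return 0; /* a short count signals failure to the caller */
    }
    memcpy(grown + buf->length, ptr, chunk);
    buf->data = grown;
    buf->length += chunk;
    buf->data[buf->length] = '\0'; /* keep the buffer usable as a C string */
    return chunk;
}
```

With libcurl, such a callback would be registered through CURLOPT_WRITEFUNCTION and the buffer through CURLOPT_WRITEDATA, in the same way the program above registers `callbackfunction` and `fp`. The caller frees `data` when done.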
We can change the CID to download the image of a different compound:
!./image_download 2734162
3. PubChem Similarity Search#
Performs a similarity search and returns the CID list#
%cd ..
!mkdir Similarity
%cd Similarity
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=similarity_search
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
# the RM command used -r to remove directories and -f to force delete
$(RM) -rf $(BIN) *.dSYM
%%file similarity_search.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
// Perform a similarity search with results (CID) in a .txt file
// Retrieve the search result and output to a .txt file
bool similarity_search_file(char* url, char name[])
{
// Build the file name "<name>.txt" in a local buffer (appending to argv[1] in place could overflow it)
char filename[256];
snprintf(filename, sizeof(filename), "%s.txt", name);
FILE* fp = fopen(filename, "wb");
// If file is not created abort the system
if (!fp) {
printf("!!! Failed to create file on the disk\n");
return false;
}
// Initialize the CURL connection
CURL* curl = curl_easy_init();
// If initialization does not work then close the file and return an error
if (!curl) {
fprintf(stderr, "init failed\n");
fclose(fp);
return false;
}
// Set the url to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
curl_easy_setopt(curl, CURLOPT_URL, url);
// Set the data pointer for writing the response body of the HTTP request
// The third parameter is a pointer to the file where the response data will be written.
curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
// Perform an HTTP request
CURLcode rc = curl_easy_perform(curl);
// If request is unsuccessful then release the resources and abort
if (rc != CURLE_OK) {
printf("!!! Failed to download: %s\n", url);
curl_easy_cleanup(curl);
fclose(fp);
return false;
}
// Clean up allocated resources
curl_easy_cleanup(curl);
// Avoid memory leaks by closing file pointer
fclose(fp);
return true;
}
// Retrieve the search result and output to stdout
bool similarity_search(char* url)
{
// Initialize the curl connection (http request)
CURL* curl = curl_easy_init();
// If initialization does not work then return an error
if (!curl) {
fprintf(stderr, "init failed\n");
return false;
}
// Set the url to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
curl_easy_setopt(curl, CURLOPT_URL, url);
// Perform an HTTP request
CURLcode result = curl_easy_perform(curl);
// If result is not retrieved then output error and report the failure
if (result != CURLE_OK){
fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
curl_easy_cleanup(curl);
return false;
}
// Clean up allocated resources
curl_easy_cleanup(curl);
return true;
}
int main(int argc, char* argv[]) {
// If arguments are lower than 2 then error
if (argc < 2){
printf("Error. Please try again correctly.\n");
return -1;
}
// Bits of data required for the API search
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
char search_type[] = "fastsimilarity_2d/cid/";
char url[1000];
char ending[] = "/cids/JSON?Threshold=95";
// Combine all the bits together to create the final URL
sprintf(url, "%s%s%s%s", api, search_type, argv[1], ending);
// Check if conditions match to output to stdout
if (argc==2) {
// Check if the API request was fulfilled and downloaded
if (!similarity_search(url)) {
printf("!! Failed to retrieve data\n");
return -1;
}
}
// Check if conditions match to output to the default (.txt) file
else if ((argc==3)&&(strcmp(argv[2],"-o")==0)) {
// Check if the api request was fulfilled and downloaded to the file
if (!similarity_search_file(url, argv[1])) {
printf("!! Failed to download file \n");
return -1;
}
}
return 0;
}
!make
gcc -g -Wall similarity_search.c -o similarity_search -lcurl
We use the PubChem API to perform a fingerprint Tanimoto similarity search (2D Tanimoto threshold of 95%) against 1-Butyl-3-methyl-imidazolium (CID = 2734162).
We can change the CID to get the list of CIDs similar to a different compound. Here we print the first five results with jq:
!./similarity_search 2734162 | jq ".IdentifierList.CID[0:5]"
[
61347,
529334,
2734161,
12971008,
304622
]
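The 95% threshold is fixed in the `ending` string of similarity_search.c. As a sketch of how it could be made adjustable, the helper below builds the same fastsimilarity_2d URL with the threshold as a parameter. The name `build_similarity_url` and the 0 to 100 range check are assumptions for illustration, not part of the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the 2D similarity search URL for `cid` with
   the given Tanimoto `threshold` (percent) into `out`.
   Returns 0 on success, -1 on invalid input or if the URL does not fit. */
int build_similarity_url(char *out, size_t out_size,
                         const char *cid, int threshold) {
    assert(out != NULL && cid != NULL);
    /* Assumed valid range for a percentage threshold */
    if (strlen(cid) == 0 || threshold < 0 || threshold > 100) {
        return -1;
    }
    int n = snprintf(out, out_size,
                     "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/%s/cids/JSON?Threshold=%d",
                     cid, threshold);
    /* A return value >= out_size means snprintf truncated the URL */
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return 0;
}
```

Passing 95 reproduces the URL used above; a lower value such as 90 would be expected to return a longer list of CIDs.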
We can also write the list of CIDs to a .txt file named after the CID by adding the -o option:
!./similarity_search 2734162 -o
4. PubChem SMARTS Search#
Performs a SMARTS substructure search and returns the CID list#
%cd ..
!mkdir Smarts
%cd Smarts
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=smarts_search
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Removes the binary file and an ".dSYM" (debug symbols for debugging) directories
# the RM command used -r to remove directories and -f to force delete
$(RM) -rf $(BIN) *.dSYM
%%file smarts_search.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
// The following program runs a SMARTS substructure query and outputs the matching CIDs to stdout or to a default/custom file
// Retrieve the search result and output it to a file
bool smarts_search_file(char* url, char name[]) {
// Create the output file using the given name as-is (no extension is appended)
FILE* fp = fopen(name, "wb");
// If file is not created abort the system
if (!fp) {
printf("!!! Failed to create file on the disk\n");
return false;
}
// Initialize the CURL connection
CURL* curl = curl_easy_init();
// If initialization does not work then close the file and return an error
if (!curl) {
fprintf(stderr, "init failed\n");
fclose(fp);
return false;
}
// Set the url to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
curl_easy_setopt(curl, CURLOPT_URL, url);
// Set the data pointer for writing the response body of the HTTP request
// The third parameter is a pointer to the file where the response data will be written.
curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
// Perform an HTTP request
CURLcode rc = curl_easy_perform(curl);
// If request is unsuccessful then release the resources and abort
if (rc != CURLE_OK) {
printf("!!! Failed to download: %s\n", url);
curl_easy_cleanup(curl);
fclose(fp);
return false;
}
// Clean up allocated resources
curl_easy_cleanup(curl);
// Avoid memory leaks by closing file pointer
fclose(fp);
return true;
}
// Retrieve the search result and outputs to stdout
bool smarts_search(char* url) {
// Initialize the CURL connection
CURL* curl = curl_easy_init();
// If initialization does not work then return an error
if (!curl) {
fprintf(stderr, "init failed\n");
return false;
}
// Set the URL to which the HTTP request will be sent to
// first parameter is for the initialized curl HTTP request, second for the URL option to be set, and third for the URL to be set
curl_easy_setopt(curl, CURLOPT_URL, url);
// Perform an HTTP request
CURLcode result = curl_easy_perform(curl);
// If result is not retrieved then clean up and report the failure
if (result != CURLE_OK) {
curl_easy_cleanup(curl);
return false;
}
// Clean up allocated resources
curl_easy_cleanup(curl);
return true;
}
int main(int argc, char* argv[]) {
// If no arguments are provided then print an error and return
if (argc < 2){
printf("Error. Please try again correctly.\n");
return 0;
}
// Call the libcurl library to initialize the HTTP request for encoding
CURL *curl_en = curl_easy_init();
// If initialization does not work then error
if (!curl_en) {
fprintf(stderr, "init failed\n");
return EXIT_FAILURE;
}
// Check if conditions match to output to stdout
if (argc == 2) {
// Check if the initialization of the HTTP request works
if (curl_en) {
// Bits of data required for the API search
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
char search_type[] = "fastsubstructure/smarts/";
char url[1000];
char ending[] = "/cids/TXT";
// Function which encodes the query
char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);
// Combine the bits to form a complete url
sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);
/* Condition to check whether the api request was fulfilled
and downloaded*/
if (!smarts_search(url)) {
printf("!! Failed to download file \n");
return -1;
}
curl_free(encoded_smarts);
}
}
// Check if conditions match to output to the default file (named "output")
else if ((argc==3) && (strcmp(argv[2],"-o")==0)) {
// Check if the initialization of the HTTP request works
if (curl_en) {
// Bits of data required for the API search
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
char search_type[] = "fastsubstructure/smarts/";
char url[1000];
char ending[] = "/cids/TXT";
// Function which encodes the query
char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);
// Combines the bits to form a complete url
sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);
char filename[] = "output";
/* Condition to check whether the api request was fulfilled
and downloaded to a default file*/
if (!smarts_search_file(url, filename)) {
printf("!! Failed to download file \n");
return -1;
}
curl_free(encoded_smarts);
}
}
// Check if conditions match to output to the custom (.txt) file
else if ((argc==4)&&(strcmp(argv[2],"-o")==0)){
// Check if the initialization of the HTTP request works
if (curl_en) {
// Bits of data required for the API search
char api[] = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/";
char search_type[] = "fastsubstructure/smarts/";
char url[1000];
char ending[] = "/cids/TXT";
char *encoded_smarts = curl_easy_escape(curl_en, argv[1], 0);
// Combine all the bits together to create the final URL
sprintf(url, "%s%s%s%s", api, search_type, encoded_smarts, ending);
// Check if the api request was fulfilled and downloaded to the custom file
if (!smarts_search_file(url, argv[3])) {
printf("!! Failed to download file \n");
return -1;
}
// Free the memory occupied for encoded url
curl_free(encoded_smarts);
}
}
// Free the memory for the curl_en connection
curl_easy_cleanup(curl_en);
return 0;
}
!make
We can run a custom SMARTS query and print the first 10 matching CIDs:
!./smarts_search "CCCCCCC#C" | head -n10
6291
6231
5991
9839306
6540478
64139
55245
40973
27812
14687
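SMARTS strings contain characters such as `#`, `[`, `+` and `=` that cannot appear unescaped in a URL path, which is why smarts_search.c passes the query through curl_easy_escape. To show roughly what that encoding does, here is a hand-written percent-encoder that leaves only letters, digits, `-`, `.`, `_` and `~` untouched. The `percent_encode` helper is hypothetical and only illustrative; the program itself relies on libcurl's function.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: percent-encodes `in` into `out`, keeping only
   A-Z, a-z, 0-9, '-', '.', '_' and '~' unescaped.
   Returns 0 on success, -1 if `out` is too small. */
int percent_encode(const char *in, char *out, size_t out_size) {
    assert(in != NULL && out != NULL);
    static const char hex[] = "0123456789ABCDEF";
    size_t j = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];
        if (isalnum(c) || strchr("-._~", c) != NULL) {
            /* Need room for this character and the terminator */
            if (j + 1 >= out_size) return -1;
            out[j++] = (char)c;
        } else {
            /* Need room for "%XX" and the terminator */
            if (j + 3 >= out_size) return -1;
            out[j++] = '%';
            out[j++] = hex[c >> 4];
            out[j++] = hex[c & 0x0F];
        }
    }
    if (j >= out_size) return -1;
    out[j] = '\0';
    return 0;
}
```

For the query above, `CCCCCCC#C` becomes `CCCCCCC%23C`, so the `#` is no longer interpreted as the start of a URL fragment.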
We can also run a custom query and write the results to a custom file (here test1) with the -o option:
!./smarts_search "[CR0H2][n+]1[cH1][cH1]n([CR0H1]=[CR0H2])[cH1]1" -o test1
We can print the first 5 lines of the file:
!head -n 5 test1
121235111
132274871
129853306
129853221
129850195
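The TXT output format returns one CID per line, as the listing above shows. A minimal sketch of turning such text into an array of numbers for further processing in C follows. The `parse_cid_list` name is hypothetical, and lines that do not begin with a positive number are simply skipped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses newline-separated CIDs from `text` into
   `cids` (at most `max` entries) and returns how many were stored. */
size_t parse_cid_list(const char *text, long *cids, size_t max) {
    assert(text != NULL && cids != NULL);
    size_t count = 0;
    const char *p = text;
    while (*p != '\0' && count < max) {
        char *end;
        long value = strtol(p, &end, 10);
        /* end == p means no digits were converted on this line */
        if (end != p && value > 0) {
            cids[count++] = value;
        }
        /* Advance to the start of the next line, if there is one */
        const char *newline = strchr(end, '\n');
        if (newline == NULL) {
            break;
        }
        p = newline + 1;
    }
    return count;
}
```

The text could come from reading a file such as test1 into memory, or from a buffer filled by a libcurl write callback.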