# PubMed API in C

by Cyrus Gomes

These recipe examples were tested on July 25, 2023.

**NCBI Entrez Programming Utilities documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/

**Please see NCBI's Data Usage Policies and Disclaimers:** https://www.ncbi.nlm.nih.gov/home/about/policies/

## Setup

First, install the curl, jq, and libcurl development packages by running the following command in the terminal:

In [None]:
!sudo apt install curl jq libcurl4-openssl-dev

Then we create a `PubMed` directory to hold our projects:

In [1]:
!mkdir PubMed

Finally, we change the directory to the folder we created:

In [None]:
%cd PubMed

## 1. Basic PubMed API call

We create a folder for the current project and then change to that directory:

In [3]:
!mkdir basic_api_call

In [None]:
%cd basic_api_call

Then we use the `%%file` command to create the following makefile, which compiles our program and creates an executable.

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Sets the bin variable as the name of the binary file we are creating
BIN=api_call

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM

The `%%file` command is used again to create our .c file, which contains the code for the program:

In [None]:
%%file api_call.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* CURL program that retrieves JSON data from the PubMed API
This program allows a custom PubMed ID to be used */

/* We are going to be inputting the custom ID like this: ./api_call -i 42342346
If the arguments are missing then we use the default: "27933103" */

int main (int argc, char* argv[]) {
    
    // If arguments are invalid then return
    if (argc > 5) {                                                                                      
        printf("Error. Please try again correctly.\n");
        return -1;
    }

    // Buffer that holds the PubMed ID (or comma-separated list of IDs)
    char indicator[100] = {0};

    // If there is ./api_call or -i
    if ((argc == 1) || ((argc == 2) && (strcmp(argv[1], "-i")==0))) {
        // These arguments run the default parameters and keeps the codes as they are
        strcat(indicator, "27933103");
    }

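    // If there is ./api_call -i 34813985
    else if ((argc == 3) && (strcmp(argv[1], "-i")==0)) {
        // Only the PubMed ID is changed; reject input that does not fit in the buffer
        if (strlen(argv[2]) >= sizeof(indicator)) {
            fprintf(stderr, "Error. The ID list is too long.\n");
            return -1;
        }
        strcat(indicator, argv[2]);
    }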

    else {
        printf("usage: ./api_call [-i ID]\n\n");
        printf("the api_call program is used to retrieve json data from the PubMed API\n\n");
        printf("optional arguments\n");
        printf("\t -i ID    optional custom PubMed ID; default is '27933103'\n");
        return -1;
    }

    // Initialize the CURL HTTP connection
    CURL *curl = curl_easy_init();

    // Bits of the url that are joined together later
    char api[] = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&";                                                                     
    char type1[] = "id=";                          
    char url[1000];
    char label[] = "&retmode=json";

    // Check if CURL initialization is a success or not
    if (!curl) {                                                                                         
        fprintf(stderr, "init failed\n");
        return EXIT_FAILURE;
    }
        
    // Combine all the bits to produce a functioning url; snprintf never writes past the end of url
    snprintf(url, sizeof(url), "%s%s%s%s", api, type1, indicator, label);
                                          
    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Perform the HTTP request; by default libcurl writes the response body to stdout
    CURLcode result = curl_easy_perform(curl);

    // If result is not retrieved then output error and exit with a failure code
    if (result != CURLE_OK) {
        fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Deallocate memory for the CURL connection
    curl_easy_cleanup(curl);                                                                            
    return EXIT_SUCCESS;
}
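
The program above joins the URL pieces into a fixed-size buffer. As a minimal sketch of the same step done with an explicit length check, the hypothetical helper `build_esummary_url` below (an illustration, not part of `api_call.c`) reports an error instead of truncating the URL when the ID list is too long:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of api_call.c): writes the ESummary URL for
   a comma-separated list of PubMed IDs into out.
   Returns 0 on success, -1 if an argument is invalid or the URL does not fit. */
int build_esummary_url(char *out, size_t cap, const char *ids) {
    if (out == NULL || ids == NULL || cap == 0) {
        return -1;
    }

    int n = snprintf(out, cap,
                     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
                     "?db=pubmed&id=%s&retmode=json",
                     ids);

    /* snprintf returns the length it needed; n >= cap means the URL was cut off */
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    return 0;
}
```

A caller could declare `char url[1000];`, call `build_esummary_url(url, sizeof url, "27933103")`, and pass `url` to `curl_easy_setopt` only when the return value is 0.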

The following command compiles the program and creates the executable:

In [None]:
!make

The article we are requesting has PubMed ID: 27933103

To print the JSON data, we run the program and pipe its output to jq:

In [None]:
!./api_call | jq '.'

To output the data for multiple ids from the PubMed API, we enter the following command:

In [None]:
!./api_call -i "34813985,34813140" | jq '.'
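
The `-i` option accepts several IDs as one comma-separated string. When the IDs are held in an array inside a C program, a sketch like the following could assemble that string. The helper name `join_ids` is an assumption made for illustration and is not part of `api_call.c`:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: joins count ID strings with commas into out.
   Returns 0 on success, -1 if the result does not fit in cap bytes. */
int join_ids(char *out, size_t cap, const char *const ids[], size_t count) {
    if (out == NULL || ids == NULL || cap == 0) {
        return -1;
    }

    size_t used = 0;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        /* Write a comma before every ID except the first */
        int n = snprintf(out + used, cap - used, "%s%s",
                         (i > 0) ? "," : "", ids[i]);
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;  /* not enough room left in the buffer */
        }
        used += (size_t)n;
    }
    return 0;
}
```

The joined string can then be used wherever a single ID is expected, for example as the value of the `id=` parameter in the ESummary URL.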

To output the full author records for the article, we enter the following command:

In [None]:
!./api_call | jq '.["result"]["27933103"]["authors"][]'

To output only the author names:

In [11]:
!./api_call | jq '.["result"]["27933103"]["authors"][]["name"]'

"Scalfani VF"
"Williams AJ"
"Tkachenko V"
"Karapetyan K"
"Pshenichnov A"
"Hanson RM"
"Liddie JM"
"Bara JE"


To output the source name from the PubMed API, we enter the following command:

In [12]:
!./api_call -i 34813072 | jq '.["result"]["34813072"]["source"]'

"Methods Mol Biol"


Here, we output the source name for multiple ids:

In [2]:
%%bash

# List of IDs
idList=('34813985' '34813932' '34813684' '34813661' '34813372' '34813140' '34813072')

for id in "${idList[@]}"; do 

    # Retrieve the source name for the given id
    ./api_call -i "$id" | jq --arg location "$id" '.["result"][$location]["source"]'
    
    # Sleep delay
    sleep 1
    
done

"Cell Calcium"
"Methods"
"FEBS J"
"Dev Growth Differ"
"CRISPR J"
"Chembiochem"
"Methods Mol Biol"


## 2. PubMed API Calls with Requests & Parameters

We go back to our original directory

In [None]:
%cd ..

We create a folder for the current project:

In [15]:
!mkdir api_request_parameter

We then change directory to the project that we are working on

In [None]:
%cd api_request_parameter

In [None]:
%%file makefile

# Set the variable CC to gcc, which is used to build the program
CC=gcc

# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall

# Set the bin variable as the name of the binary file we are creating
BIN=api_req_par

# Create the binary file with the name we put
all: $(BIN)

# Map any file ending in .c to a binary executable. 
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c

	# Compile the .c file using the gcc compiler with the CFLAGS and links 
	# resulting binary with the CURL library
	$(CC) $(CFLAGS) $< -o $@ -lcurl

# Clean target which removes specific files
clean:

	# Remove the binary file and an ".dSYM" (debug symbols for debugging) directories
	# the RM command used -r to remove directories and -f to force delete
	$(RM) -rf $(BIN) *.dSYM

The `%%file` command is used again to create our .c file, which contains the code for the program:

In [None]:
%%file api_req_par.c

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* CURL program that retrieves JSON data from the PubMed API
This program allows custom request to be used along with the parameter */

/* We will input the custom database and query like this: ./api_req_par -d "pubmed" -q "neuroscience+intervention+learning"
If the arguments are missing then we use the default: "pubmed" "neuroscience" */

int main (int argc, char* argv[]) {
    
    // If arguments are invalid just return
    if (argc > 5) {                                                                                      
        printf("Error. Please try again correctly.\n");
        return -1;
    }

    // Default parameter and request codes
    char parameter[100] = {0};
    char request[500] = {0};

    // If there is ./api_req_par -d/-q
    if ((argc == 1) || ((argc == 2) && ((strcmp(argv[1], "-d")==0) || (strcmp(argv[1], "-q")==0)))) {
        // These arguments run the default parameters and keeps the codes as they are
        strcat(parameter,"pubmed");
        strcat(request, "neuroscience");
    }

    // If there is ./api_req_par -d "pubmed"
    else if ((argc == 3) && (strcmp(argv[1], "-d")==0)) {
        // Only the parameter code is changed
        strcat(parameter,argv[2]);
        strcat(request, "neuroscience");
    }

    // If there is ./api_req_par -d "pubmed" -q
    else if ((argc == 4) && (strcmp(argv[1], "-d")==0) && (strcmp(argv[3], "-q")==0)) {
        // Only the parameter code is changed
        strcat(parameter,argv[2]);
        strcat(request, "neuroscience");
    }

    // If there is ./api_req_par -d "pubmed" -q "neuroscience+intervention+learning"
    else if ((argc == 5) && (strcmp(argv[1], "-d")==0) && (strcmp(argv[3], "-q")==0)) {
        // Both the parameter and request codes are changed
        strcat(parameter,argv[2]);
        strcat(request, argv[4]);
    }

    // If there is ./api_req_par -q "neuroscience+intervention+learning"
    else if ((argc == 3) && (strcmp(argv[1], "-q")==0)) {
        // Only the request code is changed
        strcat(parameter,"pubmed");
        strcat(request, argv[2]);
    }

    // If there is ./api_req_par -q "neuroscience+intervention+learning" -d
    else if ((argc == 4) && (strcmp(argv[1], "-q")==0) && (strcmp(argv[3], "-d")==0)) {
        // Only the request code is changed
        strcat(parameter,"pubmed");
        strcat(request, argv[2]);
    }

    // If there is ./api_req_par -q "neuroscience+intervention+learning" -d "pubmed" 
    else if ((argc == 5) && (strcmp(argv[1], "-q")==0) && (strcmp(argv[3], "-d")==0)) {
        // Both the request and parameter codes are changed
        strcat(parameter,argv[4]);
        strcat(request, argv[2]);
    }

    else {
        printf("usage: ./api_req_par [-q query] [-d database]\n\n");
        printf("the api_req_par program is used to retrieve json data from the PubMed API\n\n");
        printf("optional arguments\n");
        printf("\t -q query        optional custom query; default is 'neuroscience'\n");
        printf("\t -d parameter    optional custom database code; default is 'pubmed',  see: https://www.ncbi.nlm.nih.gov/books/NBK25499/\n");
        return -1;
    }

    // Initialize the CURL HTTP connection
    CURL *curl = curl_easy_init();

    // Bits of the url that are joined together later
    char api[] = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=";                                                                     
    char type1[] = "&";
    char type2[] = "term=";
    char type3[] = "&retmode=json";                           
    char url[1000];

    // Check if CURL initialization is a success or not
    if (!curl) {                                                                                         
        fprintf(stderr, "init failed\n");
        return EXIT_FAILURE;
    }
        
    // Combine all the bits to produce a functioning url; snprintf never writes past the end of url
    snprintf(url, sizeof(url), "%s%s%s%s%s%s", api, parameter, type1, type2, request, type3);
                                          

    // Set the url to which the HTTP request will be sent to
    // first parameter is for the initialized curl HTTP request, second for the option to be set, and third for the value to be set
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Perform the HTTP request; by default libcurl writes the response body to stdout
    CURLcode result = curl_easy_perform(curl);

    // If result is not retrieved then output error and exit with a failure code
    if (result != CURLE_OK) {
        fprintf(stderr, "download problem: %s\n", curl_easy_strerror(result));
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Deallocate memory for the CURL connection
    curl_easy_cleanup(curl);                                                                            
    return EXIT_SUCCESS;
}
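
The program above pastes the query into the URL exactly as typed, so the caller has to write `+` for spaces. As a rough sketch of how a plain-text term could be encoded first, the hypothetical helper `encode_term` below (an assumption for illustration, not part of `api_req_par.c`) turns spaces into `+` and percent-encodes other reserved characters such as the brackets in `[au]`:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: encodes a search term for use in a URL query string.
   Letters, digits, and - _ . ~ are copied, spaces become '+', and every
   other byte becomes %XX. Returns 0 on success, -1 if out is too small. */
int encode_term(char *out, size_t cap, const char *term) {
    static const char hex[] = "0123456789ABCDEF";

    if (out == NULL || term == NULL || cap == 0) {
        return -1;
    }

    size_t used = 0;
    for (const unsigned char *p = (const unsigned char *)term; *p != '\0'; p++) {
        if (isalnum(*p) || *p == '-' || *p == '_' || *p == '.' || *p == '~') {
            if (used + 1 >= cap) return -1;   /* keep room for the terminator */
            out[used++] = (char)*p;
        } else if (*p == ' ') {
            if (used + 1 >= cap) return -1;
            out[used++] = '+';
        } else {
            if (used + 3 >= cap) return -1;
            out[used++] = '%';
            out[used++] = hex[*p >> 4];
            out[used++] = hex[*p & 0x0F];
        }
    }
    out[used] = '\0';
    return 0;
}
```

Because `&` is encoded as well, extra parameters can no longer be appended through the term, so they would need to be added to the URL separately. libcurl also provides `curl_easy_escape` for percent-encoding; it encodes a space as `%20` rather than `+`.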

The following command compiles the program and creates the executable:

In [None]:
!make

The default database is "pubmed" and the default query is "neuroscience".

Running the program without arguments uses these defaults:

In [None]:
!./api_req_par | jq '.'

In [None]:
!./api_req_par -q "aspirin" -d "pccompound" | jq '.'

The number of returned IDs can be adjusted with the `retmax` parameter:

In [1]:
!./api_req_par -q "neuroscience+intervention+learning&retmax=25" | jq '.esearchresult.idlist'

[
  "38305455",
  "38304851",
  "38304576",
  "38303964",
  "38303627",
  "38302998",
  "38302981",
  "38302296",
  "38301832",
  "38301514",
  "38301234",
  "38300213",
  "38299388",
  "38298927",
  "38298912",
  "38298803",
  "38298796",
  "38298788",
  "38298783",
  "38298781",
  "38298775",
  "38297494",
  "38296969",
  "38295471",
  "38293166"
]


In [23]:
!./api_req_par -q "neuroscience+intervention+learning&retmax=25" | jq '.esearchresult.idlist | length'

25

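The commands above append `&retmax=25` to the query string itself, which works because the program pastes the query into the URL unchanged. A minimal sketch of an alternative, assuming a hypothetical helper named `build_esearch_url` that is not part of `api_req_par.c`, passes `retmax` as its own argument:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes an ESearch URL with an explicit retmax value.
   db and term are assumed to be URL-safe already (for example "pubmed" and
   "neuroscience+intervention+learning").
   Returns 0 on success, -1 on invalid input or if the URL does not fit. */
int build_esearch_url(char *out, size_t cap, const char *db,
                      const char *term, int retmax) {
    if (out == NULL || db == NULL || term == NULL || cap == 0 || retmax < 0) {
        return -1;
    }

    int n = snprintf(out, cap,
                     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
                     "?db=%s&term=%s&retmax=%d&retmode=json",
                     db, term, retmax);

    /* n >= cap means snprintf had to truncate the URL */
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    return 0;
}
```

Keeping `retmax` separate from the term makes it possible to validate the number before the request is sent.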

We can also use the query to search for an author.

We will add `[au]` after the name to specify it is an author

In [24]:
!./api_req_par -q "Darwin[au]" | jq '.esearchresult.count'

"630"


We get the `idlist` for the custom request:

In [25]:
!./api_req_par -q "Coral+Reefs&retmode=json&usehistory=y&sort=pub+date" | jq '.esearchresult.idlist'

[
  "37393678",
  "37315600",
  "37209734",
  "37290662",
  "37286001",
  "37257610",
  "37247740",
  "37286027",
  "37399735",
  "37385181",
  "37331272",
  "37311517",
  "37137368",
  "37105476",
  "37022443",
  "36549653",
  "37465983",
  "37487981",
  "37481620",
  "37100135"
]


Searching based on publication types:

We can do this by adding **AND** to the search term, followed by the filter:
```
term=<searchQuery>+AND+filter[filterType]
```
`[pt]` specifies that the filter type is publication type.

More filters can be found at https://pubmed.ncbi.nlm.nih.gov/help/

In [None]:
!./api_req_par -q "stem+cells+AND+clinical+trial[pt]" | jq '{esearchresult: .esearchresult}'
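
The `term=<searchQuery>+AND+filter[filterType]` pattern can also be assembled in C before the request is made. The sketch below assumes a hypothetical helper `make_filtered_term` (not part of `api_req_par.c`) that joins a query, a filter value, and a filter tag such as `pt`, then replaces spaces with `+`:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: builds "<query>+AND+<filter>[<tag>]" in out and
   replaces every space with '+'.
   Returns 0 on success, -1 on invalid input or if the term does not fit. */
int make_filtered_term(char *out, size_t cap, const char *query,
                       const char *filter, const char *tag) {
    if (out == NULL || query == NULL || filter == NULL || tag == NULL || cap == 0) {
        return -1;
    }

    int n = snprintf(out, cap, "%s+AND+%s[%s]", query, filter, tag);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }

    /* E-utilities search terms use '+' in place of spaces */
    for (char *p = out; *p != '\0'; p++) {
        if (*p == ' ') {
            *p = '+';
        }
    }
    return 0;
}
```

Called with `"stem cells"`, `"clinical trial"`, and `"pt"`, the helper produces the same term as the command above.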

## 3. PubMed API metadata visualization

### Frequency of topic sortpubdate field
Here we extract the sortpubdate field for the results of a “hydrogel drug” search, limited to the clinical trial publication type:

In [27]:
!./api_req_par -q "hydrogel+drug+AND+clinical+trial[pt]&sort=pub+date&retmax=500" | jq '.esearchresult.idlist[0:10]'

[
  "36418469",
  "36870516",
  "36842739",
  "36203046",
  "36261491",
  "35830550",
  "34653384",
  "35556170",
  "35413602",
  "35041809"
]


In [28]:
!./api_req_par -q "hydrogel+drug+AND+clinical+trial[pt]&sort=pub+date&retmax=500" | jq '.esearchresult.idlist | length'

302


The following code will store the list of IDs in a text file:

In [29]:
!./api_req_par -q "hydrogel+drug+AND+clinical+trial[pt]&sort=pub+date&retmax=500" | jq '.esearchresult.idlist' > idList.txt

To format the text file we use:

In [30]:
!cat idList.txt | tr -d '",[]' > idList2.txt

In [31]:
!sed -i '/^$/d' idList2.txt

In [32]:
!cat idList2.txt | wc -l

302


Show the first 10 IDs:

In [33]:
!head -10 idList2.txt

  36418469
  36870516
  36842739
  36203046
  36261491
  35830550
  34653384
  35556170
  35413602
  35041809
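
The `tr` and `sed` steps above strip the JSON punctuation so that only the IDs remain. For comparison, the same cleanup could be sketched in C by collecting every run of digits from the jq output. The helper name `extract_ids` and the `ID_LEN` limit are assumptions made for this illustration:

```c
#include <ctype.h>
#include <stddef.h>

#define ID_LEN 16  /* assumed maximum ID length, including the terminator */

/* Hypothetical helper: copies each run of digits found in text into ids,
   up to max_ids entries, and returns how many IDs were stored.
   Digits beyond ID_LEN - 1 characters in a single run are dropped. */
size_t extract_ids(const char *text, char ids[][ID_LEN], size_t max_ids) {
    size_t count = 0;
    const char *p = text;

    while (*p != '\0' && count < max_ids) {
        if (isdigit((unsigned char)*p)) {
            size_t len = 0;
            while (isdigit((unsigned char)*p)) {
                if (len < ID_LEN - 1) {
                    ids[count][len++] = *p;
                }
                p++;
            }
            ids[count][len] = '\0';
            count++;
        } else {
            p++;  /* skip quotes, commas, brackets, and whitespace */
        }
    }
    return count;
}
```

This approach only suits input like the `idlist` array, where digits appear nowhere except in the IDs themselves.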


We want to get the ESummary record for each of the IDs, so we copy the `api_call` program from our previous project into the current directory:

In [34]:
!cp ../basic_api_call/api_call .

We test to see if we get the date for one ID

In [35]:
!./api_call -i 34813072 | jq '.["result"]["34813072"]["sortpubdate"][0:10]'

"2022/01/01"


We then do the same to all the IDs and store them in a .txt file

In [3]:
%%bash

# Now loop through each ID and get the sortpubdate field.
# Note that this sortpubdate field may not necessarily be equivalent to a publication date

while read -r id; do

  # Retrieve data from the api and append the date to the .txt file
  ./api_call -i "$id" | jq --arg ids "$id" '.["result"][$ids]["sortpubdate"][0:10]' >> date_time.txt
  
  # Sleep delay
  sleep 1
  
done < idList2.txt

In [4]:
!head -10 date_time.txt

"2010/09/01"
"2009/03/01"
"2009/03/01"
"2010/08/01"
"2009/03/01"
"2010/08/01"
"2009/02/15"
"2010/07/01"
"2009/02/01"
"2010/01/01"


### Frequency of publication for an author search

In [38]:
!./api_req_par -q "Reed+LK[au]&sort=pub+date&retmax=500" | jq '.["esearchresult"]["count"]'

"59"


We store the id list data in a .txt file

In [5]:
!./api_req_par -q "Reed+LK[au]&sort=pub+date&retmax=500" | jq '.["esearchresult"]["idlist"]' > id_list3.txt

To format the text file we use:

In [6]:
!cat id_list3.txt | tr -d '",[]' > idList4.txt

In [7]:
!sed -i '/^$/d' idList4.txt

In [8]:
!cat idList4.txt | wc -l

67


Show the first 10 IDs:

In [43]:
!head -10 idList4.txt

  37302379
  36871651
  37292993
  36468157
  35691520
  36061313
  35856017
  34801137
  34786536
  34425636


In [44]:
%%bash

# Loop through each ID and append its sortpubdate field to date_time2.txt

while read id; do

  ./api_call -i "$id" | jq --arg ids "$id" '.["result"][$ids]["sortpubdate"][0:10]' >> date_time2.txt
  
  # Sleep delay
  sleep 1

done < idList4.txt
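
The date files produced above hold one quoted `YYYY/MM/DD` string per line. As a starting point for a frequency analysis, the sketch below shows one way to tally entries per year in C. The helper names `parse_year` and `count_in_year` are assumptions made for illustration:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: returns the year from a line such as "2010/09/01"
   (with or without the surrounding quotes that jq prints), or -1 if the
   line does not hold a date, for example when jq printed null. */
int parse_year(const char *line) {
    int year = 0, month = 0, day = 0;

    if (line == NULL) {
        return -1;
    }
    if (*line == '"') {
        line++;  /* skip the opening quote */
    }
    if (sscanf(line, "%4d/%2d/%2d", &year, &month, &day) != 3) {
        return -1;
    }
    return year;
}

/* Hypothetical helper: counts how many of the n lines fall in the given year. */
size_t count_in_year(const char *const lines[], size_t n, int year) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (parse_year(lines[i]) == year) {
            count++;
        }
    }
    return count;
}
```

In a full program, each line of `date_time.txt` or `date_time2.txt` could be read with `fgets` and passed to `parse_year`, with the per-year counts then written out for plotting.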