# Scopus API in Python

by Vincent F. Scalfani

These recipe examples use the Elsevier Scopus API and the Python Scopus API-wrapper package, [pybliometrics](https://pybliometrics.readthedocs.io/en/stable/). Code was tested and sample data were downloaded from the Scopus API on February 16, 2022 via http://api.elsevier.com and http://www.scopus.com. This tutorial content is intended to help facilitate academic research. Before continuing or reusing any of this code, please be aware of Elsevier's [API policies and appropriate use-cases](https://dev.elsevier.com/use_cases.html). You will also need to register for an API key in order to use the Scopus API.

## 1. Initial Pybliometrics Setup

The first time you run `import pybliometrics`, it will prompt you for your Elsevier Scopus API Key,
which is then saved to a local config file. See the documentation:
https://pybliometrics.readthedocs.io/en/stable/configuration.html

In [1]:
import pybliometrics

In [2]:
# import other libraries needed
from pybliometrics.scopus import ScopusSearch
import time
import numpy as np
import pandas as pd

## 2. Get Author Data

### Number of Records for Author

In [3]:
# Scopus Author ID field (AU-ID): 55764087400, Vincent Scalfani
q1 = ScopusSearch('AU-ID(55764087400)', download=False)
q1.get_results_size()

21

### Download Record Data

In [4]:
q1 = ScopusSearch('AU-ID(55764087400)')

# save to dataframe
df1 = pd.DataFrame(q1.results)

In [5]:
# view column names
df1.columns

Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')

In [6]:
# number of rows
len(df1)

21

In [None]:
# view first 5 rows
# df1.head(5)

In [8]:
# We can index data from our new dataframe, df1.
# For example, create a list of just the DOIs
dois = df1.doi.tolist()
print(dois)

['10.1021/acs.jchemed.1c00904', '10.5860/crln.82.9.428', '10.1021/acs.iecr.8b02573', '10.1021/acs.jchemed.6b00602', '10.5062/F4TD9VBX', '10.1021/acs.macromol.6b02005', '10.1186/s13321-016-0181-z', '10.1021/acs.chemmater.5b04431', '10.1021/acs.jchemed.5b00512', '10.1021/acs.jchemed.5b00375', '10.5860/crln.76.9.9384', '10.5860/crln.76.2.9259', '10.1126/science.346.6214.1258', '10.1021/ed400887t', '10.1016/j.acalib.2014.03.015', '10.5062/F4XS5SB9', '10.1021/ma300328u', '10.1021/mz200108a', '10.1021/ma201170y', '10.1021/ma200184u', '10.1021/cm102374t']
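
Some Scopus records have no DOI, so a list like the one above can contain missing values. The sketch below shows one way to turn DOIs into resolver links while skipping missing entries. The helper name `doi_to_url` and the `None` entry in the sample list are illustrative assumptions, not part of pybliometrics or of the data above.

```python
# Hypothetical helper: build https://doi.org/ links and skip missing DOIs.
def doi_to_url(doi):
    """Return a DOI resolver link, or None when the DOI is missing or empty."""
    if doi is None or str(doi).strip() == "":
        return None
    return "https://doi.org/" + str(doi).strip()

# Two DOIs copied from the output above, plus an assumed missing value (None)
sample_dois = ['10.1021/acs.jchemed.1c00904', None, '10.1186/s13321-016-0181-z']

# Keep only the entries that produced a link
links = [url for url in map(doi_to_url, sample_dois) if url is not None]
print(links)
```

When working directly with the dataframe, `df1.doi.dropna()` is a common way to drop missing values before calling `tolist()`.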


In [9]:
# Get a list of article titles
titles = df1.title.tolist()
titles

['Using NCBI Entrez Direct (EDirect) for Small Molecule Chemical Information Searching in a Unix Terminal',
 'Using the linux operating system full-time tips and experiences from a subject liaison librarian',
 'Analysis of the Frequency and Diversity of 1,3-Dialkylimidazolium Ionic Liquids Appearing in the Literature',
 'Rapid Access to Multicolor Three-Dimensional Printed Chemistry and Biochemistry Models Using Visualization and Three-Dimensional Printing Software Programs',
 'Text analysis of chemistry thesis and dissertation titles',
 'Phototunable Thermoplastic Elastomer Hydrogel Networks',
 'Programmatic conversion of crystal structures into 3D printable files using Jmol',
 'Dangling-End Double Networks: Tapping Hidden Toughness in Highly Swollen Thermoplastic Elastomer Hydrogels',
 'Replacing the Traditional Graduate Chemistry Literature Seminar with a Chemical Research Literacy Course',
 '3D Printed Block Copolymer Nanostructures',
 'Hypotheses in librarianship: Applying the sci

In [10]:
# now a list of the cited by count
cited_by = df1.citedby_count.tolist()
print(cited_by)

[0, 0, 16, 23, 4, 11, 18, 6, 10, 24, 0, 0, 0, 94, 6, 34, 39, 31, 18, 44, 11]


In [11]:
# get sum of cited_by counts
sum(cited_by)

389
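
Beyond the total, a few other summary values can be read from the same list with the standard library. This is a minimal sketch that reuses the `cited_by` values printed above; the variable names `middle`, `uncited`, and `top_position` are chosen here for illustration.

```python
from statistics import median

# cited by counts copied from the output above
cited_by = [0, 0, 16, 23, 4, 11, 18, 6, 10, 24, 0, 0, 0, 94, 6, 34, 39, 31, 18, 44, 11]

total = sum(cited_by)                          # total citations across all records
middle = median(cited_by)                      # median citations per record
uncited = cited_by.count(0)                    # number of records with no citations
top_position = cited_by.index(max(cited_by))   # list position of the most-cited record

print(total, middle, uncited, top_position)
```

Because `dois` and `cited_by` come from the same dataframe in the same row order, `dois[top_position]` would give the DOI of the most-cited record.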

## 3. Get Author Data in a Loop

### Number of Records for Author

In [12]:
# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)

[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]
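
Judging from the tab delimiter and the output above, `authors.txt` appears to hold one author name and one Scopus author ID per line, separated by a tab. The sketch below assumes that layout and adds a light validation step; it reads from an in-memory stand-in (`io.StringIO`) rather than the real file so that it runs on its own.

```python
import csv
import io

# Stand-in for authors.txt: one "name<TAB>Scopus author ID" pair per line (assumed layout)
sample_file = io.StringIO("Emy Decker\t36660678600\n\nLindsey Lowry\t57210944451\n")

author_list = []
for line_number, row in enumerate(csv.reader(sample_file, delimiter='\t'), start=1):
    if not row:
        continue  # skip blank lines
    if len(row) != 2 or not row[1].strip().isdigit():
        raise ValueError(f"line {line_number}: expected name<TAB>numeric ID, got {row}")
    author_list.append([row[0].strip(), row[1].strip()])

print(author_list)
```

Checking the IDs before looping over them avoids spending API calls on malformed queries.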


In [13]:
# get number of Scopus records for each author
num_records = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')', download=False)
    num = q.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records.append([author, authorID, num])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [14]:
num_records

[['Emy Decker', '36660678600', 14],
 ['Lindsey Lowry', '57210944451', 4],
 ['Karen Chapman', '35783926100', 29],
 ['Kevin Walker', '56133961300', 8],
 ['Sara Whitver', '57194760730', 4]]
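
The loop above builds each query by string concatenation. As an alternative sketch, the query string can be produced by a small function that also checks the ID is numeric. The name `author_query` is a hypothetical helper introduced here, not a pybliometrics function.

```python
# Hypothetical helper: build a Scopus AU-ID query string from an author ID.
def author_query(author_id):
    """Return a query such as 'AU-ID(36660678600)' after a basic sanity check."""
    author_id = str(author_id).strip()
    if not author_id.isdigit():
        raise ValueError(f"Scopus author IDs are expected to be numeric, got {author_id!r}")
    return f"AU-ID({author_id})"

# Two entries copied from the author list above
author_list = [['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451']]
queries = [author_query(author_id) for _, author_id in author_list]
print(queries)
```

Each resulting string could then be passed to `ScopusSearch(..., download=False)` as in the loop above.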

### Download Record Data

In [15]:
# Let's say we want the DOIs and cited by counts in a list
cites = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # create a dataframe
    q_df = pd.DataFrame(q.results)
       
    # save DOIs to a list
    doi = q_df.doi.tolist()
    
    # save citedby_count to a list
    citedby_count = q_df.citedby_count.tolist()
       
    # compile saved scopus data into a list of lists               
    cites.append([author, doi, citedby_count])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)   

In [16]:
# The cites variable is a list of lists with the data
# view data for first two authors
cites[0:2]

[['Emy Decker',
  ['10.1108/RSR-08-2021-0051',
   '10.1080/1072303X.2021.1929642',
   '10.1080/15367967.2021.1900740',
   '10.1080/15367967.2020.1826951',
   '10.1080/10691316.2020.1781725',
   '10.1145/3347709.3347805',
   '10.4018/978-1-5225-5631-2.ch09',
   '10.1016/B978-0-08-102409-6.00007-9',
   '10.1108/LM-10-2016-0078',
   '10.1016/B978-0-08-100775-4.00010-8',
   '10.1108/S0732-067120160000036013',
   '10.4018/978-1-4666-8624-3',
   '10.1108/S0065-2830(2013)0000037006',
   '10.1108/07378831011096268'],
  [0, 0, 7, 0, 0, 0, 3, 0, 6, 1, 2, 0, 0, 10]],
 ['Lindsey Lowry',
  ['10.1080/1941126X.2021.1949153',
   '10.5860/lrts.65n1.4-13',
   '10.1080/00987913.2020.1733173',
   '10.1080/1941126X.2019.1634951'],
  [1, 0, 1, 0]]]

In [17]:
# We can transform this into a flat list as follows
# credit to Avery Fernandez for help with this clever transformation!
cites_flat = []
for authors in range(len(cites)):
    for doi in range(len(cites[authors][1])):
        cites_flat.append([cites[authors][0], cites[authors][1][doi], cites[authors][2][doi]])
cites_flat[0:18] # show first 2 author sets

[['Emy Decker', '10.1108/RSR-08-2021-0051', 0],
 ['Emy Decker', '10.1080/1072303X.2021.1929642', 0],
 ['Emy Decker', '10.1080/15367967.2021.1900740', 7],
 ['Emy Decker', '10.1080/15367967.2020.1826951', 0],
 ['Emy Decker', '10.1080/10691316.2020.1781725', 0],
 ['Emy Decker', '10.1145/3347709.3347805', 0],
 ['Emy Decker', '10.4018/978-1-5225-5631-2.ch09', 3],
 ['Emy Decker', '10.1016/B978-0-08-102409-6.00007-9', 0],
 ['Emy Decker', '10.1108/LM-10-2016-0078', 6],
 ['Emy Decker', '10.1016/B978-0-08-100775-4.00010-8', 1],
 ['Emy Decker', '10.1108/S0732-067120160000036013', 2],
 ['Emy Decker', '10.4018/978-1-4666-8624-3', 0],
 ['Emy Decker', '10.1108/S0065-2830(2013)0000037006', 0],
 ['Emy Decker', '10.1108/07378831011096268', 10],
 ['Lindsey Lowry', '10.1080/1941126X.2021.1949153', 1],
 ['Lindsey Lowry', '10.5860/lrts.65n1.4-13', 0],
 ['Lindsey Lowry', '10.1080/00987913.2020.1733173', 1],
 ['Lindsey Lowry', '10.1080/1941126X.2019.1634951', 0]]
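
The same flattening can also be written without index arithmetic, by unpacking each author entry and pairing DOIs with counts using `zip`. This is an equivalent sketch, not a replacement for the version above; the sample `cites` data is a subset of the output shown earlier (the first three records for Emy Decker and all four for Lindsey Lowry).

```python
# Subset of the cites data shown above: [author, list of DOIs, list of cited by counts]
cites = [
    ['Emy Decker',
     ['10.1108/RSR-08-2021-0051', '10.1080/1072303X.2021.1929642', '10.1080/15367967.2021.1900740'],
     [0, 0, 7]],
    ['Lindsey Lowry',
     ['10.1080/1941126X.2021.1949153', '10.5860/lrts.65n1.4-13',
      '10.1080/00987913.2020.1733173', '10.1080/1941126X.2019.1634951'],
     [1, 0, 1, 0]],
]

# One row per (author, DOI, count); zip pairs each DOI with its count
cites_flat = [
    [author, doi, count]
    for author, dois, counts in cites
    for doi, count in zip(dois, counts)
]

for row in cites_flat:
    print(row)
```

Note that `zip` stops at the shorter list, so this form assumes the DOI and count lists for each author have the same length, which holds here because both come from the same dataframe.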

In [18]:
# add to dataframe
cites_df = pd.DataFrame(cites_flat)
cites_df.head(18)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Emy Decker | 10.1108/RSR-08-2021-0051 | 0 |
| 1 | Emy Decker | 10.1080/1072303X.2021.1929642 | 0 |
| 2 | Emy Decker | 10.1080/15367967.2021.1900740 | 7 |
| 3 | Emy Decker | 10.1080/15367967.2020.1826951 | 0 |
| 4 | Emy Decker | 10.1080/10691316.2020.1781725 | 0 |
| 5 | Emy Decker | 10.1145/3347709.3347805 | 0 |
| 6 | Emy Decker | 10.4018/978-1-5225-5631-2.ch09 | 3 |
| 7 | Emy Decker | 10.1016/B978-0-08-102409-6.00007-9 | 0 |
| 8 | Emy Decker | 10.1108/LM-10-2016-0078 | 6 |
| 9 | Emy Decker | 10.1016/B978-0-08-100775-4.00010-8 | 1 |


### Save Record Data to a File

Here is one way to loop over author queries and save all of the Scopus document data for each author to a file.

In [19]:
# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)

[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]


In [20]:
# ****this writes one file for each author dataset*****

for authorName,authorID in author_list:
    
    # query search by Author ID
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # convert the results to a new dataframe on each loop
    df = pd.DataFrame(q.results)
    
    # save to a tab-separated file, e.g. Karen_Chapman_35783926100_ScopusData.tsv
    file_name = authorName.replace(' ', '_') + "_" + authorID + "_ScopusData.tsv"
    df.to_csv(file_name, sep='\t', index=False)
    
    # delay two seconds between api calls to be nice to Elsevier servers
    time.sleep(2)
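
The file naming used in the loop above can be isolated into a small function, which makes it easy to check the names before any files are written. This is a sketch; `scopus_filename` is a hypothetical helper name, and unlike a plain `replace(' ', '_')` it also collapses repeated whitespace in the author name.

```python
# Hypothetical helper: build the per-author output file name used above.
def scopus_filename(author_name, author_id):
    """Return a name such as 'Karen_Chapman_35783926100_ScopusData.tsv'."""
    safe_name = "_".join(str(author_name).split())  # whitespace runs become one underscore
    return f"{safe_name}_{author_id}_ScopusData.tsv"

# Two entries copied from the author list above
for name, author_id in [['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300']]:
    print(scopus_filename(name, author_id))
```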

In [None]:
# load one of the files into pandas
df_author3 = pd.read_csv('Karen_Chapman_35783926100_ScopusData.tsv', delimiter='\t')
# df_author3.head(5) # view first 5

In [22]:
# get info about citedby_count
df_author3.citedby_count.describe()

count    29.000000
mean      5.034483
std       5.703901
min       0.000000
25%       1.000000
50%       3.000000
75%       8.000000
max      21.000000
Name: citedby_count, dtype: float64

In [23]:
# get info about publication titles
df_author3.publicationName.describe()

count                                           29
unique                                          11
top       Behavioral and Social Sciences Librarian
freq                                             8
Name: publicationName, dtype: object

## 4. Get References via a Title Search

### Number of Title Match Records

In [24]:
# Search Scopus for all references containing 'ChemSpider' in the record title
q2 = ScopusSearch('TITLE(ChemSpider)',download=False)
q2.get_results_size()

7

In [25]:
# repeat this in a loop
titleWord_list = ['ChemSpider', 'PubChem', 'ChEMBL', 'Reaxys', 'SciFinder']

# get number of Scopus records for each title search
num_records_title = []
for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch('TITLE' +'(' + titleWord + ')',download=False)
    numt = qt.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records_title.append([titleWord,numt])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [26]:
num_records_title

[['ChemSpider', 7],
 ['PubChem', 79],
 ['ChEMBL', 53],
 ['Reaxys', 8],
 ['SciFinder', 30]]
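
The "query, collect, pause" pattern used in these loops can be factored into one reusable function. The sketch below is an assumption about how one might structure it: `call_politely` is a hypothetical helper, and a dictionary lookup stands in for the Scopus call so the example runs offline. The `sleep` parameter is injectable so the pauses can be observed without actually waiting.

```python
import time

# Hypothetical helper: call func on each item, pausing between calls.
def call_politely(func, items, delay=1.0, sleep=time.sleep):
    """Return [[item, func(item)], ...], sleeping `delay` seconds between calls."""
    results = []
    for position, item in enumerate(items):
        if position > 0:
            sleep(delay)  # pause between calls, not before the first one
        results.append([item, func(item)])
    return results

# Offline stand-in for the API: counts copied from the output above
sample_counts = {'ChemSpider': 7, 'PubChem': 79, 'ChEMBL': 53}

pauses = []  # records each requested pause instead of sleeping
num_records_title = call_politely(sample_counts.get, list(sample_counts),
                                  delay=1.0, sleep=pauses.append)
print(num_records_title)
print(pauses)
```

With a configured API key, `func` could instead be something like `lambda word: ScopusSearch('TITLE(' + word + ')', download=False).get_results_size()`, using the same calls as the loop above and the default `sleep=time.sleep`.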

### Download Title Match Record Data

In [27]:
# download records and create a list of selected metadata
titleWord_list = ['ChemSpider', 'PubChem', 'ChEMBL', 'Reaxys', 'SciFinder']
scopus_title_data = []

for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch('TITLE' +'(' + titleWord + ')') 
    
    # create the dataframe
    qt_df = pd.DataFrame(qt.results)
    
    # save DOIs to a list
    doi = qt_df.doi.tolist()
    
    # save title to a list
    title = qt_df.title.tolist()

    # save coverDate to a list
    coverDate = qt_df.coverDate.tolist()
    
    # compile saved scopus_title_data into a list of lists               
    scopus_title_data.append([titleWord, doi, title, coverDate])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [28]:
# create a flat list of scopus_title_data
scopus_title_data_flat = []
for titleWord in range(len(scopus_title_data)):
    for doi in range(len(scopus_title_data[titleWord][1])):
        scopus_title_data_flat.append([scopus_title_data[titleWord][0], # titleWord
                                       scopus_title_data[titleWord][1][doi], # doi
                                       scopus_title_data[titleWord][2][doi], # title
                                       scopus_title_data[titleWord][3][doi]]) # coverdate

# add to dataFrame
scopus_title_data_df = pd.DataFrame(scopus_title_data_flat)


scopus_title_data_df.rename(columns={0:"titleWord",1: "doi",2: "title", 3: "coverDate"},
                            inplace=True)
scopus_title_data_df

|   | titleWord | doi | title | coverDate |
|---|---|---|---|---|
| 0 | ChemSpider | 10.1039/c5np90022k | Editorial: ChemSpider-a tool for Natural Produ... | 2015-08-01 |
| 1 | ChemSpider | 10.1021/bk-2013-1128.ch020 | ChemSpider: How a free community resource of d... | 2013-01-01 |
| 2 | ChemSpider | 10.1007/s13361-011-0265-y | Identification of "known unknowns" utilizing a... | 2012-01-01 |
| 3 | ChemSpider | 10.1002/9781118026038.ch22 | Chemspider: A Platform for Crowdsourced Collab... | 2011-05-03 |
| 4 | ChemSpider | 10.1021/ed100697w | Chemspider: An online chemical information res... | 2010-11-01 |
| ... | ... | ... | ... | ... |
| 172 | SciFinder | 10.1021/ci0003808 | Strategies for chemical reaction searching in ... | 2000-01-01 |
| 173 | SciFinder | 10.1002/nadc.19990471212 | SciFinder scholar - Ein erster erfahrungsbericht | 1999-01-01 |
| 174 | SciFinder | 10.1021/cen-v074n025.p043 | Chemical abstracts service launches release 2.... | 1996-01-01 |
| 175 | SciFinder |  | Scientists online at their desktops SciFinder | 1996-01-01 |
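
Once the metadata is in tabular form, simple tallies are straightforward. The sketch below counts records per title word and per publication year with `collections.Counter`, using only the nine rows visible in the output above (not the full 176-row result), so the totals shown are for this sample only.

```python
from collections import Counter

# (titleWord, coverDate) pairs copied from the rows visible above
records = [
    ('ChemSpider', '2015-08-01'),
    ('ChemSpider', '2013-01-01'),
    ('ChemSpider', '2012-01-01'),
    ('ChemSpider', '2011-05-03'),
    ('ChemSpider', '2010-11-01'),
    ('SciFinder', '2000-01-01'),
    ('SciFinder', '1999-01-01'),
    ('SciFinder', '1996-01-01'),
    ('SciFinder', '1996-01-01'),
]

# Tally records per search term
per_word = Counter(word for word, _ in records)

# coverDate is formatted YYYY-MM-DD, so the first four characters give the year
per_year = Counter(int(date[:4]) for _, date in records)

print(per_word)
print(sorted(per_year.items()))
```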
