Scopus API in Python

by Vincent F. Scalfani

These recipe examples use the Elsevier Scopus API and the Python Scopus API wrapper package, pybliometrics. The code was tested, and sample data downloaded from the Scopus API, on February 16, 2022 via http://api.elsevier.com and http://www.scopus.com. This tutorial content is intended to help facilitate academic research. Before continuing or reusing any of this code, please be aware of Elsevier’s API policies and appropriate use cases. You will also need to register for an API key in order to use the Scopus API.

1. Initial Pybliometrics Setup

The first time you run import pybliometrics, it will prompt you for your Elsevier Scopus API Key, which is then saved to a local config file. See the documentation: https://pybliometrics.readthedocs.io/en/stable/configuration.html

import pybliometrics
# import other libraries needed
from pybliometrics.scopus import ScopusSearch
import time
import numpy as np
import pandas as pd

2. Get Author Data

Number of Records for Author

# Scopus Author ID field (AU-ID): 55764087400, Vincent Scalfani
q1 = ScopusSearch('AU-ID(55764087400)', download=False)
q1.get_results_size()
21

Download Record Data

q1 = ScopusSearch('AU-ID(55764087400)')

# save to dataframe
df1 = pd.DataFrame(q1.results)
# view column names
df1.columns
Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')
# number of rows
len(df1)
21
# view first 5 rows
# df1.head(5)
# We can index data from our new dataframe, df1.
# For example, create a list of just the DOIs
dois = df1.doi.tolist()
print(dois)
['10.1021/acs.jchemed.1c00904', '10.5860/crln.82.9.428', '10.1021/acs.iecr.8b02573', '10.1021/acs.jchemed.6b00602', '10.5062/F4TD9VBX', '10.1021/acs.macromol.6b02005', '10.1186/s13321-016-0181-z', '10.1021/acs.chemmater.5b04431', '10.1021/acs.jchemed.5b00512', '10.1021/acs.jchemed.5b00375', '10.5860/crln.76.9.9384', '10.5860/crln.76.2.9259', '10.1126/science.346.6214.1258', '10.1021/ed400887t', '10.1016/j.acalib.2014.03.015', '10.5062/F4XS5SB9', '10.1021/ma300328u', '10.1021/mz200108a', '10.1021/ma201170y', '10.1021/ma200184u', '10.1021/cm102374t']
# Get a list of article titles
titles = df1.title.tolist()
titles
['Using NCBI Entrez Direct (EDirect) for Small Molecule Chemical Information Searching in a Unix Terminal',
 'Using the linux operating system full-time tips and experiences from a subject liaison librarian',
 'Analysis of the Frequency and Diversity of 1,3-Dialkylimidazolium Ionic Liquids Appearing in the Literature',
 'Rapid Access to Multicolor Three-Dimensional Printed Chemistry and Biochemistry Models Using Visualization and Three-Dimensional Printing Software Programs',
 'Text analysis of chemistry thesis and dissertation titles',
 'Phototunable Thermoplastic Elastomer Hydrogel Networks',
 'Programmatic conversion of crystal structures into 3D printable files using Jmol',
 'Dangling-End Double Networks: Tapping Hidden Toughness in Highly Swollen Thermoplastic Elastomer Hydrogels',
 'Replacing the Traditional Graduate Chemistry Literature Seminar with a Chemical Research Literacy Course',
 '3D Printed Block Copolymer Nanostructures',
 'Hypotheses in librarianship: Applying the scientific method',
 'Recruiting students to campus: Creating tangible and digital products in the academic library',
 'Finally free',
 '3D printed molecules and extended solid models for teaching symmetry and point groups',
 'Repurposing Space in a Science and Engineering Library: Considerations for a Successful Outcome',
 'A model for managing 3D printing services in academic libraries',
 'Morphological phase behavior of poly(RTIL)-containing diblock copolymer melts',
 'Network formation in an orthogonally self-assembling system',
 'Access to nanostructured hydrogel networks through photocured body-centered cubic block copolymer melts',
 'Synthesis and ordered phase separation of imidazolium-based alkyl-ionic diblock copolymers made via ROMP',
 'Thermally stable photocuring chemistry for selective morphological trapping in block copolymer melt systems']
# now a list of the cited by count
cited_by = df1.citedby_count.tolist()
print(cited_by)
[0, 0, 16, 23, 4, 11, 18, 6, 10, 24, 0, 0, 0, 94, 6, 34, 39, 31, 18, 44, 11]
# get sum of cited_by counts
sum(cited_by)
389
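Because `dois` and `cited_by` are taken from the same dataframe, they share row order, so the two lists can be paired to rank records by citation count. The following is a minimal sketch that hard-codes four DOI and count pairs copied from the output above instead of querying Scopus; the names `ranked`, `top_doi`, and `top_count` are illustrative only.

```python
# A few DOI / cited-by pairs copied from the printed output above (not a live query)
dois = ['10.1021/acs.jchemed.6b00602', '10.1021/ed400887t',
        '10.5062/F4XS5SB9', '10.1021/ma200184u']
cited_by = [23, 94, 34, 44]

# pair each DOI with its count, then sort by count, highest first
ranked = sorted(zip(dois, cited_by), key=lambda pair: pair[1], reverse=True)

# the first entry is the most-cited record in this small sample
top_doi, top_count = ranked[0]
print(top_doi, top_count)
```

With the full lists from `df1`, the same two lines of pairing and sorting would apply unchanged.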

3. Get Author Data in a Loop

Number of Records for Author

# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)
[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]
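The code above appears to assume that `authors.txt` is a tab-delimited text file with one author per line: the name, a tab, then the Scopus author ID. As a minimal sketch of that assumed layout, the example below writes two of the rows shown above to a temporary file and reads them back with `csv.reader`; the temporary directory and the name `sample_rows` are illustrative only.

```python
import csv
import os
import tempfile

# two rows copied from the printed author_list above
sample_rows = [['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'authors.txt')

    # write a small tab-delimited file: name <TAB> author ID
    with open(path, 'w', newline='') as outfile:
        csv.writer(outfile, delimiter='\t').writerows(sample_rows)

    # read it back with the same delimiter used in the tutorial code
    with open(path, newline='') as infile:
        author_list = list(csv.reader(infile, delimiter='\t'))

print(author_list)
```

Note that `csv.reader` returns every field as a string, so the author IDs stay as text and can be concatenated directly into a query string.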
# get number of Scopus records for each author
num_records = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')', download=False)
    num = q.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records.append([author, authorID, num])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)
num_records
[['Emy Decker', '36660678600', 14],
 ['Lindsey Lowry', '57210944451', 4],
 ['Karen Chapman', '35783926100', 29],
 ['Kevin Walker', '56133961300', 8],
 ['Sara Whitver', '57194760730', 4]]
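The query string in the loop above is built by string concatenation. A minimal sketch of the same step as a small helper follows; `build_author_query` is a hypothetical name and not part of pybliometrics, and the digit check is an added assumption that Scopus author IDs are purely numeric, as they are in the list above. The returned string would be passed to `ScopusSearch` in the same way as before.

```python
def build_author_query(author_id):
    """Return a Scopus search string for one author ID (hypothetical helper)."""
    author_id = author_id.strip()
    # assumption: author IDs are purely numeric, as in the author list above
    if not author_id.isdigit():
        raise ValueError(f'unexpected author ID: {author_id!r}')
    return f'AU-ID({author_id})'

# two rows copied from the printed author_list above
author_list = [['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451']]
queries = [build_author_query(author_id) for _, author_id in author_list]
print(queries)
```

Validating the ID before the request is made means a malformed row in `authors.txt` fails locally instead of using up an API call.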

Download Record Data

# Let's say we want the DOIs and cited by counts in a list
cites = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # create a dataframe
    q_df = pd.DataFrame(q.results)
       
    # save DOIs to a list
    doi = q_df.doi.tolist()
    
    # save citedby_count to a list
    citedby_count = q_df.citedby_count.tolist()
       
    # compile saved scopus data into a list of lists               
    cites.append([author, doi, citedby_count])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)   
# The cites variable is a list of lists with the data
# view data for first two authors
cites[0:2]
[['Emy Decker',
  ['10.1108/RSR-08-2021-0051',
   '10.1080/1072303X.2021.1929642',
   '10.1080/15367967.2021.1900740',
   '10.1080/15367967.2020.1826951',
   '10.1080/10691316.2020.1781725',
   '10.1145/3347709.3347805',
   '10.4018/978-1-5225-5631-2.ch09',
   '10.1016/B978-0-08-102409-6.00007-9',
   '10.1108/LM-10-2016-0078',
   '10.1016/B978-0-08-100775-4.00010-8',
   '10.1108/S0732-067120160000036013',
   '10.4018/978-1-4666-8624-3',
   '10.1108/S0065-2830(2013)0000037006',
   '10.1108/07378831011096268'],
  [0, 0, 7, 0, 0, 0, 3, 0, 6, 1, 2, 0, 0, 10]],
 ['Lindsey Lowry',
  ['10.1080/1941126X.2021.1949153',
   '10.5860/lrts.65n1.4-13',
   '10.1080/00987913.2020.1733173',
   '10.1080/1941126X.2019.1634951'],
  [1, 0, 1, 0]]]
# We can transform this into a flat list as follows
# credit to Avery Fernandez for help with this clever transformation!
cites_flat = []
for authors in range(len(cites)):
    for doi in range(len(cites[authors][1])):
        cites_flat.append([cites[authors][0], cites[authors][1][doi], cites[authors][2][doi]])
cites_flat[0:18] # show first 2 author sets
[['Emy Decker', '10.1108/RSR-08-2021-0051', 0],
 ['Emy Decker', '10.1080/1072303X.2021.1929642', 0],
 ['Emy Decker', '10.1080/15367967.2021.1900740', 7],
 ['Emy Decker', '10.1080/15367967.2020.1826951', 0],
 ['Emy Decker', '10.1080/10691316.2020.1781725', 0],
 ['Emy Decker', '10.1145/3347709.3347805', 0],
 ['Emy Decker', '10.4018/978-1-5225-5631-2.ch09', 3],
 ['Emy Decker', '10.1016/B978-0-08-102409-6.00007-9', 0],
 ['Emy Decker', '10.1108/LM-10-2016-0078', 6],
 ['Emy Decker', '10.1016/B978-0-08-100775-4.00010-8', 1],
 ['Emy Decker', '10.1108/S0732-067120160000036013', 2],
 ['Emy Decker', '10.4018/978-1-4666-8624-3', 0],
 ['Emy Decker', '10.1108/S0065-2830(2013)0000037006', 0],
 ['Emy Decker', '10.1108/07378831011096268', 10],
 ['Lindsey Lowry', '10.1080/1941126X.2021.1949153', 1],
 ['Lindsey Lowry', '10.5860/lrts.65n1.4-13', 0],
 ['Lindsey Lowry', '10.1080/00987913.2020.1733173', 1],
 ['Lindsey Lowry', '10.1080/1941126X.2019.1634951', 0]]
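The index-based loops above can also be written with `zip`, which pairs each DOI with its citation count directly. This is a minimal alternative sketch and not the method used in the tutorial; it uses one author's data copied from the output above, in the same `[author, [dois], [counts]]` shape as `cites`.

```python
# nested structure in the same shape as cites: [author, [dois], [counts]]
cites = [
    ['Lindsey Lowry',
     ['10.1080/1941126X.2021.1949153', '10.5860/lrts.65n1.4-13',
      '10.1080/00987913.2020.1733173', '10.1080/1941126X.2019.1634951'],
     [1, 0, 1, 0]],
]

# one row per (author, DOI, count); zip pairs each DOI with its count
cites_flat = [
    [author, doi, count]
    for author, doi_list, count_list in cites
    for doi, count in zip(doi_list, count_list)
]

for row in cites_flat:
    print(row)
```

When the flat list is loaded into pandas, passing a `columns` list such as `['author', 'doi', 'citedby_count']` to `pd.DataFrame` would label the columns instead of leaving them as 0, 1, and 2; those column names are only a suggestion.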
# add to dataframe
cites_df = pd.DataFrame(cites_flat)
cites_df.head(18)
    0              1                                   2
0   Emy Decker     10.1108/RSR-08-2021-0051            0
1   Emy Decker     10.1080/1072303X.2021.1929642       0
2   Emy Decker     10.1080/15367967.2021.1900740       7
3   Emy Decker     10.1080/15367967.2020.1826951       0
4   Emy Decker     10.1080/10691316.2020.1781725       0
5   Emy Decker     10.1145/3347709.3347805             0
6   Emy Decker     10.4018/978-1-5225-5631-2.ch09      3
7   Emy Decker     10.1016/B978-0-08-102409-6.00007-9  0
8   Emy Decker     10.1108/LM-10-2016-0078             6
9   Emy Decker     10.1016/B978-0-08-100775-4.00010-8  1
10  Emy Decker     10.1108/S0732-067120160000036013    2
11  Emy Decker     10.4018/978-1-4666-8624-3           0
12  Emy Decker     10.1108/S0065-2830(2013)0000037006  0
13  Emy Decker     10.1108/07378831011096268           10
14  Lindsey Lowry  10.1080/1941126X.2021.1949153       1
15  Lindsey Lowry  10.5860/lrts.65n1.4-13              0
16  Lindsey Lowry  10.1080/00987913.2020.1733173       1
17  Lindsey Lowry  10.1080/1941126X.2019.1634951       0

Save Record Data to a File

Here is one way to loop over author queries and save each author's Scopus document data to its own file.

# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)
[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]
# ****this writes one file for each author dataset*****

for authorName,authorID in author_list:
    
    # query search by Author ID
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # convert the results to a new dataframe on each loop
    df = pd.DataFrame(q.results)
    
    # save to a tab-delimited file named <Author_Name>_<authorID>_ScopusData.tsv
    df.to_csv(f"{authorName.replace(' ', '_')}_{authorID}_ScopusData.tsv", sep='\t', index=False)
    
    # delay two seconds between api calls to be nice to Elsevier servers
    time.sleep(2)
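The file name written in the loop above follows the pattern author name (spaces replaced by underscores), author ID, then `_ScopusData.tsv`. A minimal sketch of that naming step as a standalone function is shown below; `scopus_filename` is a hypothetical helper name introduced only for illustration.

```python
def scopus_filename(author_name, author_id):
    """Build the per-author TSV file name used in the loop above (hypothetical helper)."""
    return f"{author_name.replace(' ', '_')}_{author_id}_ScopusData.tsv"

# one row copied from the author list above
print(scopus_filename('Karen Chapman', '35783926100'))
```

Keeping the naming rule in one place makes it easier to rebuild the same file name later when a saved file is loaded back into pandas.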
# load one of the files into pandas
df_author3 = pd.read_csv('Karen_Chapman_35783926100_ScopusData.tsv', delimiter='\t')
# df_author3.head(5) # view first 5
# get info about citedby_count
df_author3.citedby_count.describe()
count    29.000000
mean      5.034483
std       5.703901
min       0.000000
25%       1.000000
50%       3.000000
75%       8.000000
max      21.000000
Name: citedby_count, dtype: float64
# get info about publication titles
df_author3.publicationName.describe()
count                                           29
unique                                          11
top       Behavioral and Social Sciences Librarian
freq                                             8
Name: publicationName, dtype: object