Scopus API in Python#

by Vincent F. Scalfani

These recipe examples use the Elsevier Scopus API and pybliometrics, a Python wrapper package for the Scopus API. Code was tested and sample data downloaded from the Scopus API on February 16, 2022 via http://api.elsevier.com and http://www.scopus.com. This tutorial content is intended to help facilitate academic research. Before continuing or reusing any of this code, please be aware of Elsevier’s API policies and appropriate use cases. You will also need to register for an API key in order to use the Scopus API.

1. Initial Pybliometrics Setup#

The first time you run import pybliometrics, it will prompt you for your Elsevier Scopus API Key, which is then saved to a local config file. See the documentation: https://pybliometrics.readthedocs.io/en/stable/configuration.html

import pybliometrics

# import other libraries needed
from pybliometrics.scopus import ScopusSearch
import time
import numpy as np
import pandas as pd

2. Get Author Data#

Number of Records for Author#

# Scopus Author ID field (AU-ID): 55764087400, Vincent Scalfani
q1 = ScopusSearch('AU-ID(55764087400)', download=False)
q1.get_results_size()
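The query string passed to ScopusSearch is plain Scopus search syntax, so it can be assembled and checked before any API call is made. A minimal sketch, assuming Scopus author IDs are purely numeric; the helper name build_au_id_query is hypothetical and not part of pybliometrics:

```python
def build_au_id_query(author_id):
    """Return an 'AU-ID(...)' query string for a Scopus author ID.

    Hypothetical helper (not part of pybliometrics); assumes author IDs
    consist only of digits.
    """
    author_id = str(author_id).strip()
    if not author_id.isdigit():
        raise ValueError(f"Unexpected Scopus author ID: {author_id!r}")
    return f"AU-ID({author_id})"


print(build_au_id_query(55764087400))
```

The returned string can be passed as the first argument to ScopusSearch, in the same way as the literal query above.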

Download Record Data#

q1 = ScopusSearch('AU-ID(55764087400)')

# save to dataframe
df1 = pd.DataFrame(q1.results)

# view column names
df1.columns

Index(['eid', 'doi', 'pii', 'pubmed_id', 'title', 'subtype',
       'subtypeDescription', 'creator', 'afid', 'affilname',
       'affiliation_city', 'affiliation_country', 'author_count',
       'author_names', 'author_ids', 'author_afids', 'coverDate',
       'coverDisplayDate', 'publicationName', 'issn', 'source_id', 'eIssn',
       'aggregationType', 'volume', 'issueIdentifier', 'article_number',
       'pageRange', 'description', 'authkeywords', 'citedby_count',
       'openaccess', 'freetoread', 'freetoreadLabel', 'fund_acr', 'fund_no',
       'fund_sponsor'],
      dtype='object')

# number of rows
len(df1)

# view first 5 rows
# df1.head(5)

# We can index data from our new dataframe, df1.
# For example, create a list of just the DOIs
dois = df1.doi.tolist()
print(dois)

['10.1021/acs.jchemed.1c00904', '10.5860/crln.82.9.428', '10.1021/acs.iecr.8b02573', '10.1021/acs.jchemed.6b00602', '10.5062/F4TD9VBX', '10.1021/acs.macromol.6b02005', '10.1186/s13321-016-0181-z', '10.1021/acs.chemmater.5b04431', '10.1021/acs.jchemed.5b00512', '10.1021/acs.jchemed.5b00375', '10.5860/crln.76.9.9384', '10.5860/crln.76.2.9259', '10.1126/science.346.6214.1258', '10.1021/ed400887t', '10.1016/j.acalib.2014.03.015', '10.5062/F4XS5SB9', '10.1021/ma300328u', '10.1021/mz200108a', '10.1021/ma201170y', '10.1021/ma200184u', '10.1021/cm102374t']

# Get a list of article titles
titles = df1.title.tolist()
titles

['Using NCBI Entrez Direct (EDirect) for Small Molecule Chemical Information Searching in a Unix Terminal',
 'Using the linux operating system full-time tips and experiences from a subject liaison librarian',
 'Analysis of the Frequency and Diversity of 1,3-Dialkylimidazolium Ionic Liquids Appearing in the Literature',
 'Rapid Access to Multicolor Three-Dimensional Printed Chemistry and Biochemistry Models Using Visualization and Three-Dimensional Printing Software Programs',
 'Text analysis of chemistry thesis and dissertation titles',
 'Phototunable Thermoplastic Elastomer Hydrogel Networks',
 'Programmatic conversion of crystal structures into 3D printable files using Jmol',
 'Dangling-End Double Networks: Tapping Hidden Toughness in Highly Swollen Thermoplastic Elastomer Hydrogels',
 'Replacing the Traditional Graduate Chemistry Literature Seminar with a Chemical Research Literacy Course',
 '3D Printed Block Copolymer Nanostructures',
 'Hypotheses in librarianship: Applying the scientific method',
 'Recruiting students to campus: Creating tangible and digital products in the academic library',
 'Finally free',
 '3D printed molecules and extended solid models for teaching symmetry and point groups',
 'Repurposing Space in a Science and Engineering Library: Considerations for a Successful Outcome',
 'A model for managing 3D printing services in academic libraries',
 'Morphological phase behavior of poly(RTIL)-containing diblock copolymer melts',
 'Network formation in an orthogonally self-assembling system',
 'Access to nanostructured hydrogel networks through photocured body-centered cubic block copolymer melts',
 'Synthesis and ordered phase separation of imidazolium-based alkyl-ionic diblock copolymers made via ROMP',
 'Thermally stable photocuring chemistry for selective morphological trapping in block copolymer melt systems']

# now a list of the cited by count
cited_by = df1.citedby_count.tolist()
print(cited_by)

[0, 0, 16, 23, 4, 11, 18, 6, 10, 24, 0, 0, 0, 94, 6, 34, 39, 31, 18, 44, 11]

# get sum of cited_by counts
sum(cited_by)
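Because the DOI and cited-by lists come from the same dataframe, they share the same order and can be paired item by item. The sketch below uses a hand-copied subset of the values printed above, so it runs without an API call; the variable names are illustrative only:

```python
# a hand-copied subset of the dois and cited_by values printed above
sample_dois = [
    '10.1021/acs.jchemed.1c00904',
    '10.1021/acs.iecr.8b02573',
    '10.1021/acs.jchemed.6b00602',
    '10.1021/ed400887t',
]
sample_cited_by = [0, 16, 23, 94]

# pair each DOI with its count; both lists are in the same order
pairs = list(zip(sample_dois, sample_cited_by))

# most-cited record in the subset
top_doi, top_count = max(pairs, key=lambda pair: pair[1])
print(top_doi, top_count)

# total citations in the subset
subset_total = sum(sample_cited_by)
print(subset_total)
```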

3. Get Author Data in a Loop#

Number of Records for Author#

# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)

[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]
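The loader above expects authors.txt to hold one author per line, with the name and the Scopus author ID separated by a tab. A small sketch that writes such a file into a temporary directory and reads it back the same way; the temporary path and the variable names are for illustration only:

```python
import csv
import os
import tempfile

# two sample rows in the tab-delimited layout: name <TAB> Scopus author ID
sample_rows = [['Emy Decker', '36660678600'],
               ['Lindsey Lowry', '57210944451']]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'authors.txt')

    # write the sample file
    with open(path, 'w', newline='') as outfile:
        csv.writer(outfile, delimiter='\t').writerows(sample_rows)

    # read it back as a list of [name, ID] lists
    with open(path, newline='') as infile:
        author_list_check = list(csv.reader(infile, delimiter='\t'))

print(author_list_check)
```

The IDs are read as strings, which is what the query-building code needs.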

# get number of Scopus records for each author
num_records = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch(f'AU-ID({authorID})', download=False)
    num = q.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records.append([author, authorID, num])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

num_records

[['Emy Decker', '36660678600', 14],
 ['Lindsey Lowry', '57210944451', 4],
 ['Karen Chapman', '35783926100', 29],
 ['Kevin Walker', '56133961300', 8],
 ['Sara Whitver', '57194760730', 4]]
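The call, collect, pause pattern used in the loop above can be factored into a small helper. This is a sketch that assumes a fixed pause between calls is sufficient; call_with_delay is a hypothetical name, and the stand-in function replaces the real Scopus query so the example runs offline:

```python
import time


def call_with_delay(func, items, delay=1.0):
    """Call func on each item, sleeping `delay` seconds between calls.

    Hypothetical helper, not part of pybliometrics.
    """
    results = []
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay)  # pause before every call except the first
        results.append(func(item))
    return results


# stand-in for a function that would query Scopus for one author ID
def fake_record_count(author_id):
    return len(author_id)


counts = call_with_delay(fake_record_count, ['36660678600', '57210944451'], delay=0.1)
print(counts)
```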

Download Record Data#

# Let's say we want the DOIs and cited by counts in a list
cites = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch(f'AU-ID({authorID})')
    
    # create a dataframe
    q_df = pd.DataFrame(q.results)
       
    # save DOIs to a list
    doi = q_df.doi.tolist()
    
    # save citedby_count to a list
    citedby_count = q_df.citedby_count.tolist()
       
    # compile saved scopus data into a list of lists               
    cites.append([author, doi, citedby_count])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)   
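The loop above assumes every query returns at least one record. A defensive sketch, assuming an empty search may come back as None or as an empty list rather than a list of records; the Record namedtuple is a stand-in for the real result rows, and extract_doi_and_counts is a hypothetical helper:

```python
from collections import namedtuple

# stand-in for the rows in q.results; real rows have many more fields
Record = namedtuple('Record', ['doi', 'citedby_count'])


def extract_doi_and_counts(results):
    """Return (dois, counts) lists, tolerating None or empty results."""
    if not results:
        return [], []
    dois = [record.doi for record in results]
    counts = [record.citedby_count for record in results]
    return dois, counts


sample = [Record('10.1080/1941126X.2021.1949153', 1),
          Record('10.5860/lrts.65n1.4-13', 0)]

print(extract_doi_and_counts(sample))
print(extract_doi_and_counts(None))
```

Returning two empty lists keeps the later flattening step working for authors with no records.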

# The cites variable is a list of lists with the data
# view data for first two authors
cites[0:2]

[['Emy Decker',
  ['10.1108/RSR-08-2021-0051',
   '10.1080/1072303X.2021.1929642',
   '10.1080/15367967.2021.1900740',
   '10.1080/15367967.2020.1826951',
   '10.1080/10691316.2020.1781725',
   '10.1145/3347709.3347805',
   '10.4018/978-1-5225-5631-2.ch09',
   '10.1016/B978-0-08-102409-6.00007-9',
   '10.1108/LM-10-2016-0078',
   '10.1016/B978-0-08-100775-4.00010-8',
   '10.1108/S0732-067120160000036013',
   '10.4018/978-1-4666-8624-3',
   '10.1108/S0065-2830(2013)0000037006',
   '10.1108/07378831011096268'],
  [0, 0, 7, 0, 0, 0, 3, 0, 6, 1, 2, 0, 0, 10]],
 ['Lindsey Lowry',
  ['10.1080/1941126X.2021.1949153',
   '10.5860/lrts.65n1.4-13',
   '10.1080/00987913.2020.1733173',
   '10.1080/1941126X.2019.1634951'],
  [1, 0, 1, 0]]]

# We can transform this into a flat list as follows
# credit to Avery Fernandez for help with this clever transformation!
cites_flat = []
for author, dois, counts in cites:
    # pair each DOI with its cited by count and repeat the author name per row
    for doi, count in zip(dois, counts):
        cites_flat.append([author, doi, count])
cites_flat[0:18] # show first 2 author sets

[['Emy Decker', '10.1108/RSR-08-2021-0051', 0],
 ['Emy Decker', '10.1080/1072303X.2021.1929642', 0],
 ['Emy Decker', '10.1080/15367967.2021.1900740', 7],
 ['Emy Decker', '10.1080/15367967.2020.1826951', 0],
 ['Emy Decker', '10.1080/10691316.2020.1781725', 0],
 ['Emy Decker', '10.1145/3347709.3347805', 0],
 ['Emy Decker', '10.4018/978-1-5225-5631-2.ch09', 3],
 ['Emy Decker', '10.1016/B978-0-08-102409-6.00007-9', 0],
 ['Emy Decker', '10.1108/LM-10-2016-0078', 6],
 ['Emy Decker', '10.1016/B978-0-08-100775-4.00010-8', 1],
 ['Emy Decker', '10.1108/S0732-067120160000036013', 2],
 ['Emy Decker', '10.4018/978-1-4666-8624-3', 0],
 ['Emy Decker', '10.1108/S0065-2830(2013)0000037006', 0],
 ['Emy Decker', '10.1108/07378831011096268', 10],
 ['Lindsey Lowry', '10.1080/1941126X.2021.1949153', 1],
 ['Lindsey Lowry', '10.5860/lrts.65n1.4-13', 0],
 ['Lindsey Lowry', '10.1080/00987913.2020.1733173', 1],
 ['Lindsey Lowry', '10.1080/1941126X.2019.1634951', 0]]

# add to dataframe
cites_df = pd.DataFrame(cites_flat)
cites_df.head(18)

	0	1	2
0	Emy Decker	10.1108/RSR-08-2021-0051	0
1	Emy Decker	10.1080/1072303X.2021.1929642	0
2	Emy Decker	10.1080/15367967.2021.1900740	7
3	Emy Decker	10.1080/15367967.2020.1826951	0
4	Emy Decker	10.1080/10691316.2020.1781725	0
5	Emy Decker	10.1145/3347709.3347805	0
6	Emy Decker	10.4018/978-1-5225-5631-2.ch09	3
7	Emy Decker	10.1016/B978-0-08-102409-6.00007-9	0
8	Emy Decker	10.1108/LM-10-2016-0078	6
9	Emy Decker	10.1016/B978-0-08-100775-4.00010-8	1
10	Emy Decker	10.1108/S0732-067120160000036013	2
11	Emy Decker	10.4018/978-1-4666-8624-3	0
12	Emy Decker	10.1108/S0065-2830(2013)0000037006	0
13	Emy Decker	10.1108/07378831011096268	10
14	Lindsey Lowry	10.1080/1941126X.2021.1949153	1
15	Lindsey Lowry	10.5860/lrts.65n1.4-13	0
16	Lindsey Lowry	10.1080/00987913.2020.1733173	1
17	Lindsey Lowry	10.1080/1941126X.2019.1634951	0

Save Record Data to a file#

Here is one method for looping over author queries and saving all Scopus document data to files, one file per author.

# load a list of author names and Scopus AUIDs
import csv
with open('authors.txt') as infile:
    rows = csv.reader(infile, delimiter='\t')
    author_list = list(rows)
print(author_list)

[['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], ['Sara Whitver', '57194760730']]

# ****this writes one file for each author dataset*****

for authorName,authorID in author_list:
    
    # query search by Author ID
    q = ScopusSearch(f'AU-ID({authorID})')
    
    # convert to dataframe
    df = pd.DataFrame(q.results)
    
    # save to a tab-separated file named Name_ID_ScopusData.tsv
    filename = f"{authorName.replace(' ', '_')}_{authorID}_ScopusData.tsv"
    df.to_csv(filename, sep='\t', index=False)
    
    # delay two seconds between api calls to be nice to Elsevier servers
    time.sleep(2)

# load one of the files into pandas
df_author3 = pd.read_csv('Karen_Chapman_35783926100_ScopusData.tsv', delimiter='\t')
# df_author3.head(5) # view first 5
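The file names follow the pattern Name_ID_ScopusData.tsv, with spaces in the author name replaced by underscores. A small sketch of that naming rule as a reusable function; scopus_data_path is a hypothetical helper, and the optional folder argument is an addition not used in the loop above:

```python
from pathlib import Path


def scopus_data_path(author_name, author_id, folder='.'):
    """Build the per-author TSV path, e.g. Name_ID_ScopusData.tsv.

    Hypothetical helper; the folder argument is an illustrative addition.
    """
    safe_name = str(author_name).strip().replace(' ', '_')
    return Path(folder) / f"{safe_name}_{author_id}_ScopusData.tsv"


print(scopus_data_path('Karen Chapman', '35783926100').name)
```

Building the name in one place keeps the writing loop and any later read_csv call in agreement.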

# get info about citedby_count
df_author3.citedby_count.describe()

count    29.000000
mean      5.034483
std       5.703901
min       0.000000
25%       1.000000
50%       3.000000
75%       8.000000
max      21.000000
Name: citedby_count, dtype: float64

# get info about publication titles
df_author3.publicationName.describe()

count                                           29
unique                                          11
top       Behavioral and Social Sciences Librarian
freq                                             8
Name: publicationName, dtype: object

4. Get References via a Title Search#

Number of Title Match Records#

# Search Scopus for all references containing 'ChemSpider' in the record title
q2 = ScopusSearch('TITLE(ChemSpider)',download=False)
q2.get_results_size()

# repeat this in a loop
titleWord_list = ['ChemSpider', 'PubChem', 'ChEMBL', 'Reaxys', 'SciFinder']

# get number of Scopus records for each title search
num_records_title = []
for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch(f'TITLE({titleWord})', download=False)
    numt = qt.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records_title.append([titleWord,numt])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

num_records_title

[['ChemSpider', 7],
 ['PubChem', 79],
 ['ChEMBL', 53],
 ['Reaxys', 8],
 ['SciFinder', 30]]

Download Title Match Record Data#

# download records and create a list of selected metadata
titleWord_list = ['ChemSpider', 'PubChem', 'ChEMBL', 'Reaxys', 'SciFinder']
scopus_title_data = []

for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch(f'TITLE({titleWord})')
    
    # create the dataframe
    qt_df = pd.DataFrame(qt.results)
    
    # save DOIs to a list
    doi = qt_df.doi.tolist()
    
    # save title to a list
    title = qt_df.title.tolist()

    # save coverDate to a list
    coverDate = qt_df.coverDate.tolist()
    
    # compile saved scopus_title_data into a list of lists               
    scopus_title_data.append([titleWord, doi, title, coverDate])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

# create a flat list of scopus_title_data
scopus_title_data_flat = []
for titleWord, dois, titles, coverDates in scopus_title_data:
    # one row per record: search word, DOI, title, cover date
    for doi, title, coverDate in zip(dois, titles, coverDates):
        scopus_title_data_flat.append([titleWord, doi, title, coverDate])

# add to dataFrame
scopus_title_data_df = pd.DataFrame(scopus_title_data_flat)

# rename the default integer column labels
scopus_title_data_df.rename(columns={0: "titleWord", 1: "doi", 2: "title", 3: "coverDate"},
                            inplace=True)
scopus_title_data_df

	titleWord	doi	title	coverDate
0	ChemSpider	10.1039/c5np90022k	Editorial: ChemSpider-a tool for Natural Produ...	2015-08-01
1	ChemSpider	10.1021/bk-2013-1128.ch020	ChemSpider: How a free community resource of d...	2013-01-01
2	ChemSpider	10.1007/s13361-011-0265-y	Identification of "known unknowns" utilizing a...	2012-01-01
3	ChemSpider	10.1002/9781118026038.ch22	Chemspider: A Platform for Crowdsourced Collab...	2011-05-03
4	ChemSpider	10.1021/ed100697w	Chemspider: An online chemical information res...	2010-11-01
...	...	...	...	...
172	SciFinder	10.1021/ci0003808	Strategies for chemical reaction searching in ...	2000-01-01
173	SciFinder	10.1002/nadc.19990471212	SciFinder scholar - Ein erster erfahrungsbericht	1999-01-01
174	SciFinder	10.1021/cen-v074n025.p043	Chemical abstracts service launches release 2....	1996-01-01
175	SciFinder	None	Scientists online at their desktops SciFinder	1996-01-01
176	SciFinder	None	SciFinder from CAS: Information at the desktop...	1995-07-01

177 rows × 4 columns
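One quick consistency check is that the number of rows in the combined table equals the sum of the per-term record counts retrieved earlier. A sketch using the counts from the num_records_title output, copied by hand so that it runs without an API call:

```python
# counts copied from the num_records_title output
num_records_title = [['ChemSpider', 7],
                     ['PubChem', 79],
                     ['ChEMBL', 53],
                     ['Reaxys', 8],
                     ['SciFinder', 30]]

# total number of records expected across all title searches
expected_rows = sum(count for _, count in num_records_title)
print(expected_rows)  # 177, matching the row count of the table
```

With live data the same comparison could be made against len(scopus_title_data_df), bearing in mind that Scopus counts can change between the two sets of API calls.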
