College Scorecard API in C
by Cyrus Gomes and Avery Fernandez
College Scorecard API documentation: https://collegescorecard.ed.gov/data/data-documentation/
College Scorecard Copyright Status: https://www2.ed.gov/notices/copyright/index.html
The College Scorecard API is an online tool hosted by the U.S. Department of Education that contains data concerning higher education institutions.
These recipe examples were tested on August 23, 2024.
Setup
First, install the CURL and jq packages by typing the following command in the terminal:
!sudo apt install curl jq
Create a directory for this tutorial:
!mkdir College_Scorecard
Change to the newly created directory:
%cd College_Scorecard
Create a variable for the API key
An API key is required to access the College Scorecard API. An API key can be obtained through the signup link in the College Scorecard API documentation.
Save your API key to a separate text file, then create a variable for your key. Avoid displaying your API key in your terminal (to prevent accidental sharing).
# Create the key file
!touch "apiKey.txt"
We read the key from this file in each bash cell that needs it, since Jupyter does not share shell variables between bash cells.
# Input the key into the file by copy/paste or keying in manually
# Read the key from the file
!apiKey=$(cat "apiKey.txt")
We use the %%file command to create the following makefile, which will compile our program and create an executable.
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=college_score
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Remove the binary file and any ".dSYM" (debug symbols) directories
# The RM command uses -r to remove directories and -f to force deletion
$(RM) -rf $(BIN) *.dSYM
Writing makefile
This command is used again to create our .c file, which contains the code for the program.
%%file ./college_score.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*CURL program that retrieves College Scorecard data from
http://api.data.gov/ed/collegescorecard/v1/ */
int main(int argc, char* argv[]) {

    // Return if arguments are invalid
    if (argc < 2 || argc > 3) {
        printf("Error. Please try again correctly. (./college_score -url [url])\n");
        return EXIT_FAILURE;
    }

    // Initialize CURL HTTP connection
    CURL *curl = curl_easy_init();
    if (!curl) {
        fprintf(stderr, "CURL initialization failed\n");
        return EXIT_FAILURE;
    }

    char link[] = "http://api.data.gov/ed/collegescorecard/v1/schools?fields=";
    char url[1000];

    if ((argc == 2) && (strcmp(argv[1], "-url") == 0)) {
        snprintf(url, sizeof(url), "%s", link);
    }
    else if ((argc == 3) && (strcmp(argv[1], "-url") == 0)) {
        snprintf(url, sizeof(url), "%s", argv[2]);
    }
    else {
        fprintf(stderr, "Error. Please try again correctly. (./college_score -url [url])\n");
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Set the URL to which the HTTP request will be sent
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Set option to follow redirections
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    // Perform the HTTP request
    CURLcode result = curl_easy_perform(curl);

    // Check if the request was successful
    if (result != CURLE_OK) {
        fprintf(stderr, "Download problem: %s\n", curl_easy_strerror(result));
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Cleanup and resource deallocation
    curl_easy_cleanup(curl);
    return EXIT_SUCCESS;
}
Writing ./college_score.c
!make
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
gcc -g -Wall college_score.c -o college_score -lcurl
1. Get names of all institutions
To start, we’ll use a basic query to find the names of all educational institutions recognized by the College Scorecard API.
All of the data for the API can be found using the v1/schools endpoint.
Fields in the College Scorecard API are accessed with a <time>.<category>.<name> sequence:
- <time> indicates the year of the data to be accessed. To access the most recent data, use latest.
- <category> and <name> can be found in the Data Dictionary file that can be downloaded from the API's documentation. The <category> of a field is given by the dev-category column in the Institution_Data_Dictionary section, and the <name> is given by the developer-friendly name column.
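As a stand-alone sketch, the three parts combine into a single dotted identifier (latest.student.size is used here purely as an illustrative field):

```shell
# Assemble a College Scorecard field id from its three parts
time_part="latest"     # <time>: "latest" or a specific year
category="student"     # <category>: from the dev-category column
name="size"            # <name>: from the developer-friendly name column
field="${time_part}.${category}.${name}"
echo "$field"          # prints latest.student.size
```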
%%bash
# Store the key
key=$(cat apiKey.txt)
# Store the field name
field='school.name'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?fields=${field}&api_key=${key}"
# Call the API to retrieve the metadata
names=$(./college_score -url "$url" | jq '.["metadata"]')
echo "$names"
{
"page": 0,
"total": 6484,
"per_page": 20
}
The total value indicates the total number of results returned by this query. These results are paginated, so each query returns only the number of results given by the per_page parameter, which has a default value of 20 and a maximum value of 100. The page number is indicated by page, which is set to 0 by default.
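The number of pages needed to fetch every result is a ceiling division of the total by the page size. A small sketch using the totals above:

```shell
# Number of pages needed to fetch all results, 100 at a time
total=6484       # metadata.total reported by the query above
per_page=100     # maximum page size allowed by the API
total_pages=$(( (total + per_page - 1) / per_page ))  # ceiling division
echo "$total_pages"   # prints 65, i.e. pages 0 through 64
```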
We can use a loop to create an API request for each page:
%%bash
# Declare url parameters
field='school.name'
sort_key='school.name'
page_size='100'
# Store the key in the key variable
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate the total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?fields=${field}&page=${page_number}&per_page=${page_size}&sort=${sort_key}&api_key=${key}"
# Retrieve the school names
name_data=$(./college_score -url "$url2" | jq '.results[] | ."school.name"')
# Output the school names to a .txt file
echo "$name_data" >> institution_names.txt
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print number of institution names found
cat institution_names.txt | wc -l
6484
# Print first 10 results
!head -n 10 institution_names.txt
"A Better U Beauty Barber Academy"
"A T Still University of Health Sciences"
"Aaniiih Nakoda College"
"ABC Adult School"
"ABC Adult School - Cabrillo Lane"
"ABC Beauty Academy"
"ABCO Technology"
"Abcott Institute"
"Abilene Christian University"
"Abilene Christian University-Undergraduate Online"
2. Get names of all universities
College Scorecard API requests can also include conditions so that only certain institutions are selected.
In this example, we limit the results to institutions that award graduate degrees. To do this, we set the degrees_awarded.highest parameter to 4, which indicates that the highest degree awarded by an institution is a graduate degree. This information is within the Institution_Data_Dictionary section of the College Scorecard data dictionary.
%%bash
# Declare url parameters
condition='latest.school.degrees_awarded.highest=4'
field='school.name'
sort_key='school.name'
page_size='100'
# Store the key
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&page=${page_number}&per_page=${page_size}&sort=${sort_key}&api_key=${key}"
# Retrieve the school names
name_data=$(./college_score -url "$url2" | jq '.results[] | ."school.name"')
# Output the school names to a .txt file
echo "$name_data" >> university_names.txt
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print number of university names found
cat university_names.txt | wc -l
1986
# Print first 10 results
!head -n 10 university_names.txt
"A T Still University of Health Sciences"
"Abilene Christian University"
"Abraham Lincoln University"
"Academy for Five Element Acupuncture"
"Academy for Jewish Religion"
"Academy for Jewish Religion-California"
"Academy of Art University"
"Academy of Chinese Culture and Health Sciences"
"Academy of Vocal Arts"
"Acupuncture and Integrative Medicine College-Berkeley"
3. Find number of universities by state
The school.state_fips data element contains a number that corresponds to each state. This mapping is given below:
%%bash
states='{
"1": "Alabama",
"2": "Alaska",
"4": "Arizona",
"5": "Arkansas",
"6": "California",
"8": "Colorado",
"9": "Connecticut",
"10": "Delaware",
"11": "District of Columbia",
"12": "Florida",
"13": "Georgia",
"15": "Hawaii",
"16": "Idaho",
"17": "Illinois",
"18": "Indiana",
"19": "Iowa",
"20": "Kansas",
"21": "Kentucky",
"22": "Louisiana",
"23": "Maine",
"24": "Maryland",
"25": "Massachusetts",
"26": "Michigan",
"27": "Minnesota",
"28": "Mississippi",
"29": "Missouri",
"30": "Montana",
"31": "Nebraska",
"32": "Nevada",
"33": "New Hampshire",
"34": "New Jersey",
"35": "New Mexico",
"36": "New York",
"37": "North Carolina",
"38": "North Dakota",
"39": "Ohio",
"40": "Oklahoma",
"41": "Oregon",
"42": "Pennsylvania",
"44": "Rhode Island",
"45": "South Carolina",
"46": "South Dakota",
"47": "Tennessee",
"48": "Texas",
"49": "Utah",
"50": "Vermont",
"51": "Virginia",
"53": "Washington",
"54": "West Virginia",
"55": "Wisconsin",
"56": "Wyoming",
"60": "American Samoa",
"64": "Federated States of Micronesia",
"66": "Guam",
"69": "Northern Mariana Islands",
"70": "Palau",
"72": "Puerto Rico",
"78": "Virgin Islands"
}'
echo "$states" > states
Using this mapping, we can find the number of universities in each state:
%%bash
# Declare an associative array
declare -A state_frequency
# Declare the parameters for the url
condition='latest.school.degrees_awarded.highest=4'
field='latest.school.state_fips'
page_size='100'
# Store the key
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&page=${page_number}&per_page=${page_size}&api_key=${key}"
# Retrieve the state FIPS codes
state_data=$(./college_score -url "$url2" | jq '.results[] | ."latest.school.state_fips"')
# Output the frequency to a text file
echo "$state_data" >> frequencies.txt
# Clear the state_data variable
unset state_data
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Replace the FIPS codes in frequencies.txt with state names
while IFS= read -r line; do
# Look up the state name for the FIPS code on this line
result=$(cat states | jq '.["'"$line"'"]')
echo "$result" >> state_freq.txt
done < "frequencies.txt"
# Create a table of state and frequency, sorted alphabetically by state
# (count is moved to the end so multi-word state names stay intact)
sort state_freq.txt | uniq -c | awk '{count=$1; $1=""; sub(/^ /, ""); print $0, count}' > final_output.txt
# Print the top 5 results alphabetically
cat final_output.txt | head -n 5
"Alabama" 30
"Alaska" 4
"Arizona" 22
"Arkansas" 20
"California" 208
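The sort | uniq -c pattern used above can be demonstrated on a small stand-alone input; the awk step must keep multi-word state names (such as "New York") intact rather than printing only their first word:

```shell
# Toy frequency count with sort | uniq -c, preserving multi-word names
printf '%s\n' '"New York"' '"Alabama"' '"New York"' > demo_states.txt
counts=$(sort demo_states.txt | uniq -c | awk '{count=$1; $1=""; sub(/^ /, ""); print $0, count}')
echo "$counts"
```

This prints "Alabama" 1 followed by "New York" 2.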
Now, we can sort and display the results:
# Retrieve the state and frequencies from highest to lowest
!sort -k2,2rn final_output.txt
"California" 208
"New York" 155
"Pennsylvania" 109
"Texas" 105
"Illinois" 81
"Florida" 75
"Massachusetts" 73
"Ohio" 65
"Missouri" 56
"North Carolina" 55
"Virginia" 52
"Georgia" 50
"Indiana" 49
"Puerto Rico" 48
"Tennessee" 47
"Michigan" 45
"Wisconsin" 40
"Minnesota" 38
"New Jersey" 38
"South Carolina" 33
"Alabama" 30
"Kentucky" 30
"Colorado" 29
"Maryland" 29
"Oregon" 28
"Washington" 28
"Iowa" 27
"Louisiana" 27
"Connecticut" 26
"Kansas" 25
"Oklahoma" 25
"Arizona" 22
"Arkansas" 20
"Nebraska" 20
"West Virginia" 18
"District of Columbia" 16
"Maine" 16
"Mississippi" 15
"South Dakota" 15
"Vermont" 14
"New Hampshire" 13
"New Mexico" 13
"Utah" 13
"Rhode Island" 12
"North Dakota" 11
"Idaho" 8
"Nevada" 8
"Montana" 7
"Hawaii" 6
"Delaware" 5
"Alaska" 4
"Guam" 1
"Virgin Islands" 1
"Wyoming" 1
4. Retrieving multiple data points in a single query
The following example uses multiple conditions and multiple fields. Conditions in the query are separated by & characters, while fields are separated by commas.
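The joining trick used in the next cell relies on "${array[*]}", which concatenates array elements using the first character of IFS as the separator. A minimal stand-alone sketch:

```shell
# Join a bash array with a chosen separator via IFS
parts=(alpha beta gamma)
IFS=','
joined="${parts[*]}"
unset IFS
echo "$joined"   # prints alpha,beta,gamma
```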
%%bash
# Declare an array
conditions=(
'latest.school.degrees_awarded.highest=4'
'latest.student.size__range=1000..'
)
# Set the IFS to '&' (the separator)
IFS='&'
# Join the array elements using the separator
conditions_string="${conditions[*]}"
# Reset the IFS to its default value
unset IFS
fields=(
'school.name'
'latest.admissions.admission_rate.overall'
'latest.student.size'
'latest.cost.tuition.out_of_state'
'latest.cost.tuition.in_state'
'latest.student.demographics.median_hh_income'
'latest.school.endowment.begin'
)
# Set the IFS to ',' (the separator)
IFS=','
# Join the array elements using the separator
fields_string="${fields[*]}"
# Reset the IFS to its default value
unset IFS
# Declare parameters
key=$(cat apiKey.txt)
page_size='100'
sort_key='school.name'
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${conditions_string}&fields=${fields_string}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Create the headers for the table and put it in data.tsv
echo -e "Name\tAdmission Rate\tSize\tTuition Out of State\tTuition In State\tMedian Household Income\tEndowment" >> data.tsv
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${conditions_string}&fields=${fields_string}&page=${page_number}&per_page=${page_size}&api_key=${key}"
# Retrieve the university records for this page
university_data=$(./college_score -url "$url2" | jq '.results[]')
# Split the JSON data into separate JSON objects and store them in an array
IFS=$'\n' read -r -d '' -a json_array <<< "$(echo "$university_data" | jq -c '.')"
# Convert each record to a TSV row
for ((i = 0; i < ${#json_array[@]}; i++)); do
tsv_data=$(echo "${json_array[i]}" | jq -r '[ .["school.name"], (.["latest.admissions.admission_rate.overall"] | tostring), (.["latest.student.size"] | tostring), (.["latest.cost.tuition.out_of_state"] | tostring), (.["latest.cost.tuition.in_state"] | tostring), (.["latest.student.demographics.median_hh_income"] | tostring), (.["latest.school.endowment.begin"] | tostring) ] | join("\t")')
echo "$tsv_data" >> data.tsv
done
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print first 5 results of the TSV
cat data.tsv | head -n 5
Name Admission Rate Size Tuition Out of State Tuition In State Median Household Income Endowment
Alabama A & M University 0.684 5196 18634 10024 49720 null
University of Alabama at Birmingham 0.8668 12776 21216 8832 55735 711616804
University of Alabama in Huntsville 0.781 6985 24770 11878 58688 100321016
Alabama State University 0.966 3296 19396 11068 46065 130589167
We can query the resulting TSV file to find the data for specific universities:
%%bash
# Row to be retrieved
row=$(cat data.tsv | grep "Alabama State University")
# Print the retrieved row
echo "$row"
Alabama State University 0.966 3296 19396 11068 46065 130589167
We can also query the TSV file to find the data for universities that satisfy certain conditions:
# $1 - Name
# $2 - Admission Rate
# $3 - Size
# $4 - Tuition Out of State
# $5 - Tuition In State
# $6 - Median Household Income
# $7 - Endowment
# Use awk to filter rows to only include universities with an admission rate under 10%
!awk -F'\t' '$2 < 0.1' data.tsv | sort
Brown University 0.0506 7222 65146 65146 79027 6520175000
Columbia University in the City of New York 0.0395 8902 66139 66139 76971 14349970000
Cornell University 0.0747 15676 63200 63200 80346 9485887743
Dartmouth College 0.0638 4412 62658 62658 79834 8484189450
Duke University 0.0635 6570 62688 62688 78468 12692472000
Harvard University 0.0324 7973 57261 57261 76879 53165753000
Johns Hopkins University 0.0725 5643 60480 60480 81539 9315279000
Massachusetts Institute of Technology 0.0396 4638 57986 57986 77426 27394039000
Northeastern University 0.068 16172 60192 60192 80190 1365215950
Northwestern University 0.0721 8837 63468 63468 81811 11361182000
Princeton University 0.057 5527 57410 57410 81428 37026442000
Rice University 0.0868 4480 54960 54960 77707 8080292000
Stanford University 0.0368 7761 58416 58416 80275 37788187000
Tufts University 0.0969 6747 65222 65222 82793 2646506000
University of California-Los Angeles 0.0857 32423 43473 13401 72896 3161977000
University of Chicago 0.0543 7511 64260 64260 74573 9594956009
University of Pennsylvania 0.065 10572 63452 63452 78252 20523546000
Vanderbilt University 0.0667 7144 60348 60348 76279 10928512332
Williams College 0.085 2138 61770 61770 77966 3911574095
Yale University 0.0457 6639 62250 62250 75345 42282852000
%%bash
# Specify the endowment in scientific notation
Endowment='1.0e+10'
# Convert scientific notation to integer
Endowment_int=$(printf "%.0f" "$Endowment")
# Use awk to compare the endowment value in data.tsv with the integer value
awk -v endowment="$Endowment_int" -F'\t' 'NR > 1 && $7 > endowment && $7 != "null"' data.tsv | sort
Columbia University in the City of New York 0.0395 8902 66139 66139 76971 14349970000
Duke University 0.0635 6570 62688 62688 78468 12692472000
Emory University 0.1135 7017 57948 57948 80509 12218692520
Harvard University 0.0324 7973 57261 57261 76879 53165753000
Massachusetts Institute of Technology 0.0396 4638 57986 57986 77426 27394039000
Northwestern University 0.0721 8837 63468 63468 81811 11361182000
Princeton University 0.057 5527 57410 57410 81428 37026442000
Stanford University 0.0368 7761 58416 58416 80275 37788187000
Texas A & M University-College Station 0.6265 56792 40139 13239 67194 16895619110
University of Michigan-Ann Arbor 0.1769 32448 55334 16736 77145 16795776000
University of Notre Dame 0.1291 8917 60301 60301 76710 18385354000
University of Pennsylvania 0.065 10572 63452 63452 78252 20523546000
University of Virginia-Main Campus 0.1866 17103 55914 20342 79524 10366577975
Vanderbilt University 0.0667 7144 60348 60348 76279 10928512332
Washington University in St Louis 0.1176 7801 60590 60590 79298 13668081000
Yale University 0.0457 6639 62250 62250 75345 42282852000
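The comparison above works because printf "%.0f" expands the scientific-notation string into a plain integer string before awk sees it:

```shell
# Convert a scientific-notation value to a plain integer string
Endowment='1.0e+10'
Endowment_int=$(printf "%.0f" "$Endowment")
echo "$Endowment_int"   # prints 10000000000
```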
5. Retrieving all data for an institution
The College Scorecard API can also be used to retrieve all of the data for a particular institution. The example below finds all data for The University of Alabama:
%%bash
# Store the key
key=$(cat apiKey.txt)
# Set the condition (spaces are URL-encoded as %20)
condition='school.name=The%20University%20of%20Alabama'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&api_key=${key}"
# Call the API and store the full JSON response
api_data=$(./college_score -url "$url" | jq '.')
# Print first 20 lines of the result
echo "$api_data" | head -n 20
{
"metadata": {
"page": 0,
"total": 1,
"per_page": 20
},
"results": [
{
"latest": {
"school": {
"zip": "35487-0100",
"city": "Tuscaloosa",
"name": "The University of Alabama",
"alias": null,
"state": "AL",
"locale": 13,
"address": "739 University Blvd",
"dolflag": 1,
"branches": 1,
"men_only": 0,
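The %20 sequences in the condition above are percent-encoded spaces. A small sketch of encoding a school name by hand (only spaces are handled here; a full URL encoder would also cover other reserved characters):

```shell
# Percent-encode the spaces in a school name for use in a query string
name="The University of Alabama"
encoded=$(printf '%s' "$name" | sed 's/ /%20/g')
echo "$encoded"   # prints The%20University%20of%20Alabama
```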
Finally, we’ll look at the breakdown of size of each program at the University of Alabama:
%%bash
# Store the key
key=$(cat apiKey.txt)
# Set the condition (spaces are URL-encoded as %20)
condition='school.name=The%20University%20of%20Alabama'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&api_key=${key}"
# Call the API and extract the program percentages
api_data=$(./college_score -url "$url" | jq '.["results"][0]["latest"]["academics"]["program_percentage"]')
# Sort the data by the value
sorted_data=$(echo "$api_data" | jq 'to_entries | sort_by(-.value) | from_entries')
# Print results
echo "$sorted_data"
{
"business_marketing": 0.2868,
"engineering": 0.1235,
"health": 0.1008,
"communication": 0.0944,
"family_consumer_science": 0.0798,
"social_science": 0.0786,
"psychology": 0.0403,
"biological": 0.0341,
"parks_recreation_fitness": 0.0268,
"education": 0.0243,
"visual_performing": 0.021,
"multidiscipline": 0.0141,
"computer": 0.0139,
"public_administration_social_service": 0.0111,
"english": 0.0105,
"mathematics": 0.0095,
"physical_science": 0.0095,
"history": 0.0087,
"language": 0.0042,
"resources": 0.0036,
"philosophy_religious": 0.0028,
"ethnic_cultural_gender": 0.0016,
"legal": 0,
"library": 0,
"military": 0,
"humanities": 0,
"agriculture": 0,
"architecture": 0,
"construction": 0,
"transportation": 0,
"personal_culinary": 0,
"science_technology": 0,
"precision_production": 0,
"engineering_technology": 0,
"security_law_enforcement": 0,
"communications_technology": 0,
"mechanic_repair_technology": 0,
"theology_religious_vocation": 0
}