College Scorecard API in C
by Cyrus Gomes and Avery Fernandez
College Scorecard API documentation: https://collegescorecard.ed.gov/data/data-documentation/
College Scorecard Copyright Status: https://www2.ed.gov/notices/copyright/index.html
The College Scorecard API is an online tool hosted by the U.S. Department of Education that contains data concerning higher education institutions.
These recipe examples were tested on August 23, 2024.
Setup
First, install the CURL and jq packages by typing the following command in the terminal:
!sudo apt install curl jq
Create a directory for this tutorial:
!mkdir College_Scorecard
Change to the newly created directory:
%cd College_Scorecard
Create a variable for the API key
An API key is required to access the College Scorecard API. An API key can be obtained through the signup link in the College Scorecard API documentation.
Save your API key to a separate text file, then create a variable for your key. Avoid displaying your API key in your terminal (to prevent accidental sharing).
# Create the key file
!touch "apiKey.txt"
We read the key from this file in each bash cell that needs it, since Jupyter does not share shell variables between bash cells.
# Input the key into the file by copy/paste or keying in manually
# Read the key from the file
!apiKey=$(cat "apiKey.txt")
We use the %%file command to create the following makefile, which will compile our program and create an executable.
%%file makefile
# Set the variable CC to gcc, which is used to build the program
CC=gcc
# Enable debugging information and enable all compiler warnings
CFLAGS=-g -Wall
# Set the bin variable as the name of the binary file we are creating
BIN=college_score
# Create the binary file with the name we put
all: $(BIN)
# Map any file ending in .c to a binary executable.
# "$<" represents the .c file and "$@" represents the target binary executable
%: %.c
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
$(CC) $(CFLAGS) $< -o $@ -lcurl
# Clean target which removes specific files
clean:
# Remove the binary file and any ".dSYM" (debug symbols) directories
# The RM command uses -r to remove directories and -f to force deletion
$(RM) -rf $(BIN) *.dSYM
Writing makefile
This command is used again to create our .c file, which contains the code for the program.
%%file ./college_score.c
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*CURL program that retrieves College Scorecard data from
http://api.data.gov/ed/collegescorecard/v1/ */
int main(int argc, char* argv[]) {

    // Return if arguments are invalid
    if (argc < 2 || argc > 3) {
        printf("Error. Please try again correctly. (./college_score -url [url])\n");
        return EXIT_FAILURE;
    }

    // Initialize CURL HTTP connection
    CURL *curl = curl_easy_init();
    if (!curl) {
        fprintf(stderr, "CURL initialization failed\n");
        return EXIT_FAILURE;
    }

    char link[] = "http://api.data.gov/ed/collegescorecard/v1/schools?fields=";
    char url[1000];

    if ((argc == 2) && (strcmp(argv[1], "-url") == 0)) {
        snprintf(url, sizeof(url), "%s", link);
    }
    else if ((argc == 3) && (strcmp(argv[1], "-url") == 0)) {
        snprintf(url, sizeof(url), "%s", argv[2]);
    }
    else {
        fprintf(stderr, "Error. Please try again correctly. (./college_score -url [url])\n");
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Set the URL to which the HTTP request will be sent
    curl_easy_setopt(curl, CURLOPT_URL, url);

    // Set option to follow redirections
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    // Perform the HTTP request
    CURLcode result = curl_easy_perform(curl);

    // Check if the request was successful
    if (result != CURLE_OK) {
        fprintf(stderr, "Download problem: %s\n", curl_easy_strerror(result));
        curl_easy_cleanup(curl);
        return EXIT_FAILURE;
    }

    // Cleanup and resource deallocation
    curl_easy_cleanup(curl);
    return EXIT_SUCCESS;
}
Writing ./college_score.c
!make
# Compile the .c file using the gcc compiler with the CFLAGS and links
# resulting binary with the CURL library
gcc -g -Wall college_score.c -o college_score -lcurl
1. Get names of all institutions
To start, we’ll use a basic query to find the names of all educational institutions recognized by the College Scorecard API.
All of the data for the API can be found using the v1/schools endpoint.
Fields in the College Scorecard API are accessed with a <time>.<category>.<name> sequence:
- <time> indicates the year of the data to be accessed. To access the most recent data, use latest.
- <category> and <name> can be found in the Data Dictionary file that can be downloaded from the API's documentation. The <category> of a field is given by the dev-category column in the Institution_Data_Dictionary section, and the <name> is given by the developer-friendly name column.
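As a stand-alone sketch, the three parts combine into a single dotted identifier (latest.student.size is used here purely as an illustrative field):

```shell
# Assemble a College Scorecard field id from its three parts
time_part="latest"     # <time>: "latest" or a specific year
category="student"     # <category>: from the dev-category column
name="size"            # <name>: from the developer-friendly name column
field="${time_part}.${category}.${name}"
echo "$field"          # prints latest.student.size
```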
%%bash
# Store the key
key=$(cat apiKey.txt)
# Store the field name
field='school.name'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?fields=${field}&api_key=${key}"
# Call the API to retrieve the metadata
names=$(./college_score -url "$url" | jq '.["metadata"]')
echo "$names"
{
"page": 0,
"total": 6484,
"per_page": 20
}
The total value indicates the total number of results returned by this query. These results are paginated, so each query returns only the number of results given by the per_page parameter, which has a default value of 20 and a maximum value of 100. The page number is indicated by page, which is set to 0 by default.
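The number of pages needed to fetch every result is a ceiling division of the total by the page size. A small sketch using the totals above:

```shell
# Number of pages needed to fetch all results, 100 at a time
total=6484       # metadata.total reported by the query above
per_page=100     # maximum page size allowed by the API
total_pages=$(( (total + per_page - 1) / per_page ))  # ceiling division
echo "$total_pages"   # prints 65, i.e. pages 0 through 64
```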
We can use a loop to create an API request for each page:
%%bash
# Declare url parameters
field='school.name'
sort_key='school.name'
page_size='100'
# Store the key in the key variable
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate the total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?fields=${field}&page=${page_number}&per_page=${page_size}&sort=${sort_key}&api_key=${key}"
# Retrieve the school names
name_data=$(./college_score -url "$url2" | jq '.results[] | ."school.name"')
# Output the school names to a .txt file
echo "$name_data" >> institution_names.txt
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print number of institution names found
cat institution_names.txt | wc -l
6484
# Print first 10 results
!head -n 10 institution_names.txt
"A Better U Beauty Barber Academy"
"A T Still University of Health Sciences"
"Aaniiih Nakoda College"
"ABC Adult School"
"ABC Adult School - Cabrillo Lane"
"ABC Beauty Academy"
"ABCO Technology"
"Abcott Institute"
"Abilene Christian University"
"Abilene Christian University-Undergraduate Online"
2. Get names of all universities
College Scorecard API requests can also include conditions so that only certain institutions are selected.
In this example, we limit the results to institutions that award graduate degrees. To do this, we set the degrees_awarded.highest parameter to 4, which indicates that the highest degree awarded by an institution is a graduate degree. This information is within the Institution_Data_Dictionary section of the College Scorecard data dictionary.
%%bash
# Declare url parameters
condition='latest.school.degrees_awarded.highest=4'
field='school.name'
sort_key='school.name'
page_size='100'
# Store the key
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&page=${page_number}&per_page=${page_size}&sort=${sort_key}&api_key=${key}"
# Retrieve the school names
name_data=$(./college_score -url "$url2" | jq '.results[] | ."school.name"')
# Output the school names to a .txt file
echo "$name_data" >> university_names.txt
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print number of university names found
cat university_names.txt | wc -l
1986
# Print first 10 results
!head -n 10 university_names.txt
"A T Still University of Health Sciences"
"Abilene Christian University"
"Abraham Lincoln University"
"Academy for Five Element Acupuncture"
"Academy for Jewish Religion"
"Academy for Jewish Religion-California"
"Academy of Art University"
"Academy of Chinese Culture and Health Sciences"
"Academy of Vocal Arts"
"Acupuncture and Integrative Medicine College-Berkeley"
3. Find number of universities by state
The school.state_fips data element contains a number that corresponds to each state. This mapping is given below:
%%bash
states='{
"1": "Alabama",
"2": "Alaska",
"4": "Arizona",
"5": "Arkansas",
"6": "California",
"8": "Colorado",
"9": "Connecticut",
"10": "Delaware",
"11": "District of Columbia",
"12": "Florida",
"13": "Georgia",
"15": "Hawaii",
"16": "Idaho",
"17": "Illinois",
"18": "Indiana",
"19": "Iowa",
"20": "Kansas",
"21": "Kentucky",
"22": "Louisiana",
"23": "Maine",
"24": "Maryland",
"25": "Massachusetts",
"26": "Michigan",
"27": "Minnesota",
"28": "Mississippi",
"29": "Missouri",
"30": "Montana",
"31": "Nebraska",
"32": "Nevada",
"33": "New Hampshire",
"34": "New Jersey",
"35": "New Mexico",
"36": "New York",
"37": "North Carolina",
"38": "North Dakota",
"39": "Ohio",
"40": "Oklahoma",
"41": "Oregon",
"42": "Pennsylvania",
"44": "Rhode Island",
"45": "South Carolina",
"46": "South Dakota",
"47": "Tennessee",
"48": "Texas",
"49": "Utah",
"50": "Vermont",
"51": "Virginia",
"53": "Washington",
"54": "West Virginia",
"55": "Wisconsin",
"56": "Wyoming",
"60": "American Samoa",
"64": "Federated States of Micronesia",
"66": "Guam",
"69": "Northern Mariana Islands",
"70": "Palau",
"72": "Puerto Rico",
"78": "Virgin Islands"
}'
echo "$states" > states
Using this mapping, we can find the number of universities in each state:
%%bash
# Declare an associative array
declare -A state_frequency
# Declare the parameters for the url
condition='latest.school.degrees_awarded.highest=4'
field='latest.school.state_fips'
page_size='100'
# Store the key
key=$(cat apiKey.txt)
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&fields=${field}&page=${page_number}&per_page=${page_size}&api_key=${key}"
# Retrieve the state FIPS codes
state_data=$(./college_score -url "$url2" | jq '.results[] | ."latest.school.state_fips"')
# Output the frequency to a text file
echo "$state_data" >> frequencies.txt
# Clear the state_data variable
unset state_data
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Replace the FIPS codes in frequencies.txt with state names
while IFS= read -r line; do
# Look up the state name for the FIPS code on this line
result=$(cat states | jq '.["'"$line"'"]')
echo "$result" >> state_freq.txt
done < "frequencies.txt"
# Create a table of state and frequency, sorted alphabetically by state
# (count is moved to the end so multi-word state names stay intact)
sort state_freq.txt | uniq -c | awk '{count=$1; $1=""; sub(/^ /, ""); print $0, count}' > final_output.txt
# Print the top 5 results alphabetically
cat final_output.txt | head -n 5
"Alabama" 30
"Alaska" 4
"Arizona" 22
"Arkansas" 20
"California" 208
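The sort | uniq -c pattern used above can be demonstrated on a small stand-alone input; the awk step must keep multi-word state names (such as "New York") intact rather than printing only their first word:

```shell
# Toy frequency count with sort | uniq -c, preserving multi-word names
printf '%s\n' '"New York"' '"Alabama"' '"New York"' > demo_states.txt
counts=$(sort demo_states.txt | uniq -c | awk '{count=$1; $1=""; sub(/^ /, ""); print $0, count}')
echo "$counts"
```

This prints "Alabama" 1 followed by "New York" 2.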
Now, we can sort and display the results:
# Retrieve the state and frequencies from highest to lowest
!sort -k2,2rn final_output.txt
"California" 208
"New York" 155
"Pennsylvania" 109
"Texas" 105
"Illinois" 81
"Florida" 75
"Massachusetts" 73
"Ohio" 65
"Missouri" 56
"North Carolina" 55
"Virginia" 52
"Georgia" 50
"Indiana" 49
"Puerto Rico" 48
"Tennessee" 47
"Michigan" 45
"Wisconsin" 40
"Minnesota" 38
"New Jersey" 38
"South Carolina" 33
"Alabama" 30
"Kentucky" 30
"Colorado" 29
"Maryland" 29
"Oregon" 28
"Washington" 28
"Iowa" 27
"Louisiana" 27
"Connecticut" 26
"Kansas" 25
"Oklahoma" 25
"Arizona" 22
"Arkansas" 20
"Nebraska" 20
"West Virginia" 18
"District of Columbia" 16
"Maine" 16
"Mississippi" 15
"South Dakota" 15
"Vermont" 14
"New Hampshire" 13
"New Mexico" 13
"Utah" 13
"Rhode Island" 12
"North Dakota" 11
"Idaho" 8
"Nevada" 8
"Montana" 7
"Hawaii" 6
"Delaware" 5
"Alaska" 4
"Guam" 1
"Virgin Islands" 1
"Wyoming" 1
4. Retrieving multiple data points in a single query
The following example uses multiple conditions and multiple fields. Conditions in the query are separated by & characters, while fields are separated by commas.
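The joining trick used in the next cell relies on "${array[*]}", which concatenates array elements using the first character of IFS as the separator. A minimal stand-alone sketch:

```shell
# Join a bash array with a chosen separator via IFS
parts=(alpha beta gamma)
IFS=','
joined="${parts[*]}"
unset IFS
echo "$joined"   # prints alpha,beta,gamma
```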
%%bash
# Declare an array
conditions=(
'latest.school.degrees_awarded.highest=4'
'latest.student.size__range=1000..'
)
# Set the IFS to '&' (the separator)
IFS='&'
# Join the array elements using the separator
conditions_string="${conditions[*]}"
# Reset the IFS to its default value
unset IFS
fields=(
'school.name'
'latest.admissions.admission_rate.overall'
'latest.student.size'
'latest.cost.tuition.out_of_state'
'latest.cost.tuition.in_state'
'latest.student.demographics.median_hh_income'
'latest.school.endowment.begin'
)
# Set the IFS to ',' (the separator)
IFS=','
# Join the array elements using the separator
fields_string="${fields[*]}"
# Reset the IFS to its default value
unset IFS
# Declare parameters
key=$(cat apiKey.txt)
page_size='100'
sort_key='school.name'
# Attach the key to the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${conditions_string}&fields=${fields_string}&api_key=${key}"
# Call the API and retrieve the metadata
metadata_tot=$(./college_score -url "$url" | jq '.["metadata"]["total"]')
# Calculate total number of pages
total_pages=$(( (metadata_tot + page_size - 1) / page_size ))
# Create the headers for the table and put it in data.tsv
echo -e "Name\tAdmission Rate\tSize\tTuition Out of State\tTuition In State\tMedian Household Income\tEndowment" >> data.tsv
# Loop through each page of the dataset, sending a request for each page
for ((page_number = 0; page_number < total_pages; page_number++)); do
# Combine the parameters to create a working url
url2="http://api.data.gov/ed/collegescorecard/v1/schools?${conditions_string}&fields=${fields_string}&page=${page_number}&per_page=${page_size}&api_key=${key}"
# Retrieve the university records for this page
university_data=$(./college_score -url "$url2" | jq '.results[]')
# Split the JSON data into separate JSON objects and store them in an array
IFS=$'\n' read -r -d '' -a json_array <<< "$(echo "$university_data" | jq -c '.')"
# Convert each record to a TSV row
for ((i = 0; i < ${#json_array[@]}; i++)); do
tsv_data=$(echo "${json_array[i]}" | jq -r '[ .["school.name"], (.["latest.admissions.admission_rate.overall"] | tostring), (.["latest.student.size"] | tostring), (.["latest.cost.tuition.out_of_state"] | tostring), (.["latest.cost.tuition.in_state"] | tostring), (.["latest.student.demographics.median_hh_income"] | tostring), (.["latest.school.endowment.begin"] | tostring) ] | join("\t")')
echo "$tsv_data" >> data.tsv
done
# Wait 1 second between API calls to be nicer on the host servers
sleep 1
done
# Print first 5 results of the TSV
cat data.tsv | head -n 5
Name Admission Rate Size Tuition Out of State Tuition In State Median Household Income Endowment
Alabama A & M University 0.684 5196 18634 10024 49720 null
University of Alabama at Birmingham 0.8668 12776 21216 8832 55735 711616804
University of Alabama in Huntsville 0.781 6985 24770 11878 58688 100321016
Alabama State University 0.966 3296 19396 11068 46065 130589167
We can query the resulting TSV file to find the data for specific universities:
%%bash
# Row to be retrieved
row=$(cat data.tsv | grep "Alabama State University")
# Print the retrieved row
echo "$row"
Alabama State University 0.966 3296 19396 11068 46065 130589167
We can also query the TSV file to find the data for universities that satisfy certain conditions:
# $1 - Name
# $2 - Admission Rate
# $3 - Size
# $4 - Tuition Out of State
# $5 - Tuition In State
# $6 - Median Household Income
# $7 - Endowment
# Use awk to filter rows to only include universities with an admission rate under 10%
!awk -F'\t' '$2 < 0.1' data.tsv | sort
Brown University 0.0506 7222 65146 65146 79027 6520175000
Columbia University in the City of New York 0.0395 8902 66139 66139 76971 14349970000
Cornell University 0.0747 15676 63200 63200 80346 9485887743
Dartmouth College 0.0638 4412 62658 62658 79834 8484189450
Duke University 0.0635 6570 62688 62688 78468 12692472000
Harvard University 0.0324 7973 57261 57261 76879 53165753000
Johns Hopkins University 0.0725 5643 60480 60480 81539 9315279000
Massachusetts Institute of Technology 0.0396 4638 57986 57986 77426 27394039000
Northeastern University 0.068 16172 60192 60192 80190 1365215950
Northwestern University 0.0721 8837 63468 63468 81811 11361182000
Princeton University 0.057 5527 57410 57410 81428 37026442000
Rice University 0.0868 4480 54960 54960 77707 8080292000
Stanford University 0.0368 7761 58416 58416 80275 37788187000
Tufts University 0.0969 6747 65222 65222 82793 2646506000
University of California-Los Angeles 0.0857 32423 43473 13401 72896 3161977000
University of Chicago 0.0543 7511 64260 64260 74573 9594956009
University of Pennsylvania 0.065 10572 63452 63452 78252 20523546000
Vanderbilt University 0.0667 7144 60348 60348 76279 10928512332
Williams College 0.085 2138 61770 61770 77966 3911574095
Yale University 0.0457 6639 62250 62250 75345 42282852000
%%bash
# Specify the endowment in scientific notation
Endowment='1.0e+10'
# Convert scientific notation to integer
Endowment_int=$(printf "%.0f" "$Endowment")
# Use awk to compare the endowment value in data.tsv with the integer value
awk -v endowment="$Endowment_int" -F'\t' 'NR > 1 && $7 > endowment && $7 != "null"' data.tsv | sort
Columbia University in the City of New York 0.0395 8902 66139 66139 76971 14349970000
Duke University 0.0635 6570 62688 62688 78468 12692472000
Emory University 0.1135 7017 57948 57948 80509 12218692520
Harvard University 0.0324 7973 57261 57261 76879 53165753000
Massachusetts Institute of Technology 0.0396 4638 57986 57986 77426 27394039000
Northwestern University 0.0721 8837 63468 63468 81811 11361182000
Princeton University 0.057 5527 57410 57410 81428 37026442000
Stanford University 0.0368 7761 58416 58416 80275 37788187000
Texas A & M University-College Station 0.6265 56792 40139 13239 67194 16895619110
University of Michigan-Ann Arbor 0.1769 32448 55334 16736 77145 16795776000
University of Notre Dame 0.1291 8917 60301 60301 76710 18385354000
University of Pennsylvania 0.065 10572 63452 63452 78252 20523546000
University of Virginia-Main Campus 0.1866 17103 55914 20342 79524 10366577975
Vanderbilt University 0.0667 7144 60348 60348 76279 10928512332
Washington University in St Louis 0.1176 7801 60590 60590 79298 13668081000
Yale University 0.0457 6639 62250 62250 75345 42282852000
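The comparison above works because printf "%.0f" expands the scientific-notation string into a plain integer string before awk sees it:

```shell
# Convert a scientific-notation value to a plain integer string
Endowment='1.0e+10'
Endowment_int=$(printf "%.0f" "$Endowment")
echo "$Endowment_int"   # prints 10000000000
```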
5. Retrieving all data for an institution
The College Scorecard API can also be used to retrieve all of the data for a particular institution. The example below finds all data for The University of Alabama:
%%bash
# Store the key
key=$(cat apiKey.txt)
# Set the condition (spaces are URL-encoded as %20)
condition='school.name=The%20University%20of%20Alabama'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&api_key=${key}"
# Call the API and store the full JSON response
api_data=$(./college_score -url "$url" | jq '.')
# Print first 20 lines of the result
echo "$api_data" | head -n 20
{
"metadata": {
"page": 0,
"total": 1,
"per_page": 20
},
"results": [
{
"latest": {
"school": {
"zip": "35487-0100",
"city": "Tuscaloosa",
"name": "The University of Alabama",
"alias": null,
"state": "AL",
"locale": 13,
"address": "739 University Blvd",
"dolflag": 1,
"branches": 1,
"men_only": 0,
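The %20 sequences in the condition above are percent-encoded spaces. A small sketch of encoding a school name by hand (only spaces are handled here; a full URL encoder would also cover other reserved characters):

```shell
# Percent-encode the spaces in a school name for use in a query string
name="The University of Alabama"
encoded=$(printf '%s' "$name" | sed 's/ /%20/g')
echo "$encoded"   # prints The%20University%20of%20Alabama
```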
Finally, we’ll look at the breakdown of size of each program at the University of Alabama:
%%bash
# Store the key
key=$(cat apiKey.txt)
# Set the condition (spaces are URL-encoded as %20)
condition='school.name=The%20University%20of%20Alabama'
# Store the url
url="http://api.data.gov/ed/collegescorecard/v1/schools?${condition}&api_key=${key}"
# Call the API and extract the program percentages
api_data=$(./college_score -url "$url" | jq '.["results"][0]["latest"]["academics"]["program_percentage"]')
# Sort the data by the value
sorted_data=$(echo "$api_data" | jq 'to_entries | sort_by(-.value) | from_entries')
# Print results
echo "$sorted_data"
{
"business_marketing": 0.2868,
"engineering": 0.1235,
"health": 0.1008,
"communication": 0.0944,
"family_consumer_science": 0.0798,
"social_science": 0.0786,
"psychology": 0.0403,
"biological": 0.0341,
"parks_recreation_fitness": 0.0268,
"education": 0.0243,
"visual_performing": 0.021,
"multidiscipline": 0.0141,
"computer": 0.0139,
"public_administration_social_service": 0.0111,
"english": 0.0105,
"mathematics": 0.0095,
"physical_science": 0.0095,
"history": 0.0087,
"language": 0.0042,
"resources": 0.0036,
"philosophy_religious": 0.0028,
"ethnic_cultural_gender": 0.0016,
"legal": 0,
"library": 0,
"military": 0,
"humanities": 0,
"agriculture": 0,
"architecture": 0,
"construction": 0,
"transportation": 0,
"personal_culinary": 0,
"science_technology": 0,
"precision_production": 0,
"engineering_technology": 0,
"security_law_enforcement": 0,
"communications_technology": 0,
"mechanic_repair_technology": 0,
"theology_religious_vocation": 0
}