NFL Weather Data Documentation
Summary
Weather can play an important role in an NFL game, but data on the conditions for individual games can be hard to find. This document describes the process for acquiring weather data via the Meteostat API and matching it to NFL game IDs. It also links to a historical archive of NFL game weather data beginning with the 2000 season.
Included Files
The files can be found here. The files are as follows:
stadium_coordinates.csv
- StadiumName: name of the stadium.
- RoofType: whether the stadium is indoor, outdoor, or has a retractable roof.
- Longitude: longitude of the stadium.
- Latitude: latitude of the stadium.
- StadiumAzimuthAngle: azimuth angle of the field (0 is North, 90 is East, 180 is South, 270 is West).
games.csv
- game_id: id corresponding to the game.
- Season: season the game takes place in.
- StadiumName: name of the stadium.
- TimeStartGame: date and time the game started, in the local timezone.
- TimeEndGame: date and time the game ended, in the local timezone.
- TZOffset: hours offset from the local timezone of the game to the UTC timezone.
games_weather.csv
- game_id: id corresponding to the game.
- Source: source of weather data (Meteostat or Wunderground).
- DistanceToStation: distance from the stadium to the station in miles.
- TimeMeasure: date and time the weather was measured, in the local timezone.
- Temperature: temperature in Fahrenheit.
- DewPoint: dew point in Fahrenheit.
- Humidity: relative humidity as a percentage.
- Precipitation: precipitation in inches.
- WindSpeed: wind speed in miles per hour.
- WindDirection: azimuth angle of the wind direction (0 is North, 90 is East, 180 is South, 270 is West).
- Pressure: pressure in inches of mercury.
Please note that the games_weather.csv file was created using an older version of the Meteostat API, so following the methods below may give slightly different results.
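For readers who only want the archived data, the files can be combined directly. A minimal sketch, assuming the CSVs have been downloaded into the working directory:

```r
library(dplyr)
library(readr)

#loading the archived files
df_weather <- read_csv('games_weather.csv')
df_games   <- read_csv('games.csv')

#attaching game metadata (season, stadium, kickoff time) to each weather reading
df <- inner_join(df_weather, df_games, by = 'game_id')
```

From here the readings for any game can be pulled out by its game_id.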
Acquiring the Weather Data
The weather data currently posted on GitHub was acquired via version 1 of the Meteostat API. Version 1 was used to obtain JSON files of weather station locations and the weather readings for a particular date and time. The API has since been updated, allowing new ways to obtain the data. To use it, one must register for an API key, which is completely free, but be aware that it has limits on calls per day.
library(tidyverse)
The first step is to match games to their times and physical locations. To do this, the longitude and latitude of each stadium and the dates of the games are needed. They can be found in stadium_coordinates.csv and games.csv.
#loading coordinate data
df_coordinates <- readr::read_csv('stadium_coordinates.csv')
df_games <- readr::read_csv('games.csv')
To capture each game, it is ideal to perform one API query per game and location. To be conservative about grabbing all the necessary data, a 3 day window surrounding the day of interest is used.
#converting dates to dates and not characters
df_games <- df_games %>%
mutate(
TimeStartGame = lubridate::parse_date_time(TimeStartGame, orders = '%m/%d/%Y %H:%M'),
TimeEndGame = lubridate::parse_date_time(TimeEndGame, orders = '%m/%d/%Y %H:%M')
)
#will be used for weather fetching
df_weather_fetch <- df_games %>%
#determining the first and last date around each game - adding a one-day buffer to make sure all data is captured
#using substr, since only the date is needed; leaving it as a character is fine because it will be part of the API call
mutate(first_date = substr(TimeStartGame - lubridate::days(1), 1, 10),
last_date = substr(TimeEndGame + lubridate::days(1), 1, 10)) %>%
#joining the coordinates
inner_join(df_coordinates, by = c("StadiumName" = "StadiumName")) %>%
#selecting relevant columns
select(game_id, Longitude, Latitude, first_date, last_date) %>%
distinct()
As Meteostat updated their API to version 2, there are now multiple ways of acquiring the data. Each has advantages and disadvantages:
- Method 1: Acquiring Hourly Data from Each Station near Stadium Coordinates For Each Game.
- Method 2: Acquiring Point Data for Each Set of Stadium Coordinates For Each Game.
- Method 3: Acquiring Bulk Data for Every Station near NFL Stadiums.
Method 1 - Acquiring Hourly Data from Each Station near Stadium Coordinates For Each Game
Advantages:
- Guaranteed method to acquire data from each station for each game.
Disadvantages:
- Very long computation time required to acquire data for every game in dataset.
This method is similar to how the data was acquired using the version 1 API. However, due to new restrictions on volume (a 10 day limit per query) and on the number of queries per time period (2,000 queries per day), it is not feasible to acquire data for every game without running this code for many days (possibly weeks). However, this method guarantees data for every game.
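To see why the query budget matters, a rough back-of-envelope calculation helps. The game and station counts below are illustrative assumptions, not figures taken from the dataset:

```r
#roughly 267 regular-season plus playoff games per season, for 20 seasons
n_games <- 267 * 20
#one nearby-station query plus one hourly query per returned station,
#assuming ~15 stations within range of a typical stadium
queries_per_game <- 1 + 15
daily_limit <- 2000
#days needed if every call counts against the daily cap
ceiling(n_games * queries_per_game / daily_limit)
#-> 43
```

Even under these modest assumptions the run takes on the order of six weeks, which is why Methods 2 and 3 exist.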
In this method, Meteostat’s Nearby Station and Hourly Station commands are used.
Method 1: Acquiring the Raw Data
To begin using the API to fetch the weather data, iterate through each row of df_weather_fetch
and query hourly weather data from every station within a 100 kilometer radius of the stadium and append to a dataframe and/or csv.
#will store results
df_weather_raw_method1 <- data.frame()
API_KEY = "" #INSERT YOUR API KEY HERE
#iterating through each game
for(i in seq(1, nrow(df_weather_fetch))){
#tracking which iteration of game
print(paste0(i, ' / ', nrow(df_weather_fetch)))
#selecting row
row <- df_weather_fetch[i,]
#query to find nearby stations for the given longitude and latitude
#make sure to set API_KEY to your registered API key
#setting limit to 99 to maximize the number of stations returned
#text of query
query_station <- paste0('https://api.meteostat.net/v2/stations/nearby?lat=',
row$Latitude,'&lon=',row$Longitude,'&limit=99')
#assuming result is missing to start
query_station_content <- ""
consecutive_calls <- 0
#while result is missing
while(query_station_content == "") {
#response of query
query_station_response = httr::GET(url = query_station, httr::add_headers(`x-api-key` = API_KEY))
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#content of query
query_station_content <- httr::content(query_station_response, as = 'text')
#adding to consecutive calls
consecutive_calls <- consecutive_calls + 1
#at this point we assume API key is out of queries for day
if(consecutive_calls > 5){
#saving csv of progress
write.csv(x = df_weather_raw_method1, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_station_data = jsonlite::fromJSON(query_station_content)$data %>% as.data.frame()
#iterating through each station
for(j in seq_len(nrow(query_station_data))){
#tracking which iteration of station
print(paste0(' ', j, ' / ', nrow(query_station_data)))
#query to get weather for given station and date
query_hourly = paste0('https://api.meteostat.net/v2/stations/hourly?station=',
query_station_data$id[j],'&start=',row$first_date,
'&end=',row$last_date)
#assuming result is missing to start
query_hourly_content <- ""
consecutive_calls <- 0
while(query_hourly_content == "") {
#response of query
query_hourly_response = httr::GET(url = query_hourly, httr::add_headers(`x-api-key` = API_KEY))
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#content of query
query_hourly_content <- httr::content(query_hourly_response, as = 'text')
#adding to consecutive calls
consecutive_calls <- consecutive_calls + 1
#at this point we assume API key is out of queries for day
if(consecutive_calls > 5){
#saving csv of progress
write.csv(x = df_weather_raw_method1, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_hourly_data = jsonlite::fromJSON(query_hourly_content)$data
#many times result will be empty list
if(length(query_hourly_data) > 0){
#appending to dataframe
df_weather_raw_method1 <- bind_rows(df_weather_raw_method1,
cbind(row %>%
mutate(distance = query_station_data$distance[j]),
query_hourly_data %>% as.data.frame()))
}
}
}
Method 1: Dealing With NA
For method 1, it is ideal to use the closest possible station for each stadium. However, since the closest stations will have a multitude of missing values, it is ideal to take some steps to fill them. An easy solution that accounts for most of the NA values is to fill from the next closest station:
#cleaning the raw weather data
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#converting time to date time
mutate(time = lubridate::parse_date_time(time, orders = c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M"))) %>%
#grouping by game and time
group_by(game_id, time) %>%
#dealing with NA
arrange(distance) %>%
#filling NA of closer distanced weather stations with further weather stations
fill(c('temp', 'dwpt', 'rhum', 'prcp',
'wspd', 'wdir', 'pres'), .direction = 'up') %>%
#ungrouping
ungroup()
Now, all data that is not from the station with the minimum distance to the stadium is filtered out:
#cleaning the raw weather data
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#grouping by game and time
group_by(game_id, time) %>%
#filtering for only closest distance to stadium
filter(distance == min(distance)) %>%
#ungrouping
ungroup()
Next, the remaining NA values are dealt with using a linear approximation:
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#filling remaining NA with linear interpolation
group_by(game_id) %>%
arrange(time) %>%
#filling NA with linear approximation - precipitation/wind direction cannot be well predicted this way
mutate_at(vars('temp', 'dwpt',
'rhum', 'wspd', 'pres'), ~zoo::na.approx(., maxgap = 3, na.rm = F)) %>%
#ungrouping
ungroup()
df_weather_raw <- df_weather_raw_method1
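To see what the interpolation step above does, here is a tiny standalone illustration of zoo::na.approx with maxgap = 3 (the toy vector is made up):

```r
library(zoo)

#toy hourly temperature readings with two gaps
x <- c(50, NA, NA, 56, NA, NA, NA, NA, 60)
#gaps of length <= maxgap are filled linearly; longer gaps are left as NA
na.approx(x, maxgap = 3, na.rm = FALSE)
#-> 50 52 54 56 NA NA NA NA 60
```

The first gap (length 2) is interpolated, while the second gap (length 4) exceeds maxgap and stays NA, so it can be caught downstream.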
This concludes the first method of data acquisition.
Method 2 - Acquiring Point Data for Each Set of Stadium Coordinates For Each Game
Advantages:
- Fast computation time.
- No cleaning required after data acquisition.
Disadvantages:
- No guarantee of data for every game.
Meteostat now has an option to acquire “Point” data based on a longitude and latitude. Acquiring data this way takes advantage of Meteostat’s weather models, which combine information from various stations. Since the weather at the longitude and latitude of the stadium is the desired final data, fewer total queries are required to cover the entire dataset. However, data for a number of games cannot be acquired this way, including nearly every game at the stadiums for Carolina from 2000 - 2004, New England from 2016 - 2019, Seattle from 2015 - 2019, and a few others.
In this method, Meteostat’s Point Hourly command is used.
#will store results
df_weather_raw_method2 <- data.frame()
API_KEY = "" #INSERT YOUR API KEY HERE
#iterating through each game
for(i in seq(1, nrow(df_weather_fetch))){
row <- df_weather_fetch[i,]
#tracking progress
print(paste0(i, ' / ', nrow(df_weather_fetch)))
#query to get weather for given station and date
query_hourly = paste0('https://api.meteostat.net/v2/point/hourly?lat=',
row$Latitude,'&lon=',row$Longitude,
'&start=',row$first_date,
'&end=',row$last_date)
#assuming result is missing to start
query_hourly_content <- ""
num_calls <- 0
while(query_hourly_content == "") {
#response of query
query_hourly_response = httr::GET(url = query_hourly, httr::add_headers(`x-api-key` = API_KEY))
#content of query
query_hourly_content <- httr::content(query_hourly_response, as = 'text')
num_calls <- num_calls + 1
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#at this point we assume API key is out of queries for day
if(num_calls > 5) {
#saving csv of progress
write.csv(x = df_weather_raw_method2, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_hourly_data = jsonlite::fromJSON(query_hourly_content)$data
#many times result will be empty list
if(length(query_hourly_data) > 0){
df_weather_raw_method2 <- bind_rows(df_weather_raw_method2,
cbind(row,
query_hourly_data %>% as.data.frame()))
}
}
df_weather_raw <- df_weather_raw_method2
Method 3 - Acquiring Bulk Data for Every Station near NFL Stadiums
Method 3 is not recommended unless you have a computer with substantial free memory and processing power. One can learn how to acquire bulk weather data here.
Advantages:
- No API key is required.
- Guaranteed method to acquire data from each station for each game.
Disadvantages:
- Significant processing power is needed to unzip the data and filter for the required rows.
No code is included for this method due to the extensive time required to test it and ensure correctness. The idea would be to use the code in method 1 to find the weather stations, then filter the bulk hourly data for the hours of the relevant games. The code that fills NA values in method 1 can be reused as well.
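As a rough illustration only, here is a sketch of what reading one station's bulk file could look like. The endpoint path and column order below are assumptions based on Meteostat's bulk documentation and should be verified against the current docs before use; the station id shown is hypothetical:

```r
library(readr)

#column order of Meteostat's bulk hourly CSVs (assumed - verify against the docs)
bulk_cols <- c('date', 'hour', 'temp', 'dwpt', 'rhum', 'prcp', 'snow',
               'wdir', 'wspd', 'wpgt', 'pres', 'tsun', 'coco')

#download and read the gzipped hourly history for one station
read_station_bulk <- function(station_id){
  url <- paste0('https://bulk.meteostat.net/v2/hourly/', station_id, '.csv.gz')
  read_csv(url, col_names = bulk_cols)
}

#e.g. read one (hypothetical) station and keep only rows around a game date:
#read_station_bulk('72509') %>%
#  dplyr::filter(date >= '2019-09-05', date <= '2019-09-07')
```

Each station's full history is one file, so the filtering burden falls entirely on your machine rather than on the API.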
Cleaning the data
Now that the data has been acquired via one of the 3 methods, the cleaning process can begin. First, columns that are essentially all NA and columns that are no longer needed (including Longitude, Latitude, first_date and last_date) are dropped:
df_weather_raw <- df_weather_raw %>%
select(-c('snow', 'first_date', 'last_date', 'wpgt', 'Longitude',
'tsun', 'Latitude', 'coco'))
Finally, since many NFL fans prefer the imperial system of measurement, the data is converted from metric to imperial. An EstimatedCondition column is also added.
#conversion constants between Celsius and Fahrenheit
celsius2Fahrenheit_slope <- 9/5
celsius2Fahrenheit_intercept <- 32
mm2Inches <- 0.039370
km2Miles <- 0.621371
hPa2Inches <- 0.02952998
#raw weather data
df_weather_raw <- df_weather_raw %>%
mutate(
TimeMeasure = time,
#making direction go from 0 to 360, and setting as NA if there is no wind
WindDirection = ifelse(wspd == 0, NA, wdir) %% 360,
#fixing issues where humidity can be greater than 100%
Humidity = ifelse(rhum > 100, 100, rhum),
#converting Celsius to Fahrenheit
Temperature = round(temp * celsius2Fahrenheit_slope + celsius2Fahrenheit_intercept, 2),
DewPoint = round(dwpt * celsius2Fahrenheit_slope + celsius2Fahrenheit_intercept, 2),
#converting millimeters to inches (note: prcp, not pres)
Precipitation = round(prcp * mm2Inches, 3),
#converting km / hour to miles / hour
WindSpeed = round(wspd * km2Miles, 2),
#converting hPa to inches of mercury
Pressure = round(pres * hPa2Inches, 4),
#determining condition from precipitation and temperature
EstimatedCondition = ifelse(prcp == 0, "Clear", "Rain"),
EstimatedCondition = ifelse(prcp > 0,
ifelse(Temperature <= 34, "Snow", "Rain"),
EstimatedCondition),
#intensity thresholds are in inches (0.098 in ~ 2.5 mm, 0.3 in ~ 7.6 mm), so the converted Precipitation is compared
EstimatedCondition = ifelse(EstimatedCondition %in% c("Snow", "Rain"),
paste0(case_when(Precipitation < 0.098 ~ "Light ",
Precipitation > 0.3 ~ "Heavy ",
T ~ "Moderate "), EstimatedCondition),
EstimatedCondition)) %>%
#selecting relevant columns
select(game_id, TimeMeasure, Temperature,
DewPoint, Humidity, Precipitation, WindSpeed, WindDirection, Pressure)
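A few quick sanity checks on the conversion constants, using well-known reference points:

```r
#0 C freezes at 32 F, 100 C boils at 212 F
stopifnot(0 * (9/5) + 32 == 32)
stopifnot(100 * (9/5) + 32 == 212)
#standard sea-level pressure: 1013.25 hPa is 29.92 inHg
stopifnot(round(1013.25 * 0.02952998, 2) == 29.92)
#25.4 mm is one inch
stopifnot(round(25.4 * 0.039370, 2) == 1)
```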
Merging Weather Data
The weather data is joined to the games data by game_id, and measurement times outside of the game window are then filtered out. The TimeStartFilter and TimeEndFilter variables are created for this filtering. They are converted to UTC time and given a one-hour buffer on each side to ensure all wanted data is kept.
df_games <- df_games %>%
mutate(TimeStartFilter = TimeStartGame - lubridate::hours(TZOffset) - lubridate::hours(1),
TimeEndFilter = TimeEndGame - lubridate::hours(TZOffset) + lubridate::hours(1))
Finally, one must merge and filter out the unwanted measurements:
#joining the weather data to the games and filtering to the game window
df_weather_games <- left_join(df_games, df_weather_raw, by = c("game_id"= "game_id")) %>%
filter( ((TimeMeasure >= TimeStartFilter) & (TimeMeasure <= TimeEndFilter)) )
The data is cleaned and ready to analyze and visualize.
Plotting Results
Now that the data is clean, it can be visualized. The following code creates a simple plot showing the general wind speed trends at every outdoor stadium.
#used for plot values
plot_vals <- df_weather_games %>%
group_by(game_id) %>%
#selecting 1st read of wind
slice(1) %>%
#joining games
inner_join(df_games) %>%
#joining coordinates
inner_join(df_coordinates) %>%
#applying filters
filter(RoofType == "Outdoor", Season >= 2015) %>%
#renaming stadium
mutate(StadiumNameAdj = paste0(HomeTeam, ': ', StadiumName),
#limiting max value of wind speed
WindSpeed = ifelse(WindSpeed >= 25, 25, WindSpeed)) %>%
#selecting values of interest
select(StadiumNameAdj, WindSpeed, WindDirection) %>%
#grouping by stadium
group_by(StadiumNameAdj) %>%
#filtering for only 20 or more observations
filter(n() > 20) %>%
#ungrouping
ungroup()
set.seed(0)
## Add random noise (WindDirection comes in multiples of 10)
plot_vals$WindDirection <- plot_vals$WindDirection + runif(nrow(plot_vals), -3, 3)
#will be used for axis text
key <- expand.grid(WindDirection = 45/2, WindSpeed = c(5, 10, 15, 20, 25),
StadiumNameAdj = unique(plot_vals$StadiumNameAdj)) %>%
mutate(WindLabel = ifelse(WindSpeed == 25, " 25+", WindSpeed))
plot_vals %>%
ggplot(aes(x = WindDirection, y = WindSpeed)) +
#axis lines
geom_hline(yintercept = seq(0, 25, by = 5), colour = "grey90", size = 0.2) +
geom_vline(xintercept = seq(0, 360-1, by = 45), colour = "grey90", size = 0.2) +
#axis text
geom_text(data = key, aes(label = WindLabel), colour = "grey", size = 2.2) +
#vectors
geom_segment(aes(y = 0, xend = WindDirection, yend = WindSpeed),
arrow = arrow(length = unit(0.03, "npc"))) +
coord_polar(start = 0) + xlab('') +
#removing y label
ylab('') +
#scale
scale_x_continuous(breaks = c(0, 90, 180, 270), labels = c("N", "E", "S", "W"), lim = c(0,360)) +
scale_y_continuous(lim = c(0, 25)) +
#black white theme
theme_bw() +
#title
ggtitle("Wind Direction and Magnitudes, Outdoor Stadiums 2015-2019") +
#faceting by stadium
facet_wrap(~StadiumNameAdj, nrow = 5) +
#theme
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
strip.background = element_blank(),
strip.text = element_text(size = 8),
panel.border = element_blank(),
panel.grid = element_blank()
)