NFL Weather Data Documentation
Summary
Weather can play an important role in an NFL game, but data on the conditions for individual games can be hard to find. This document describes the process for acquiring weather data via the Meteostat API and matching it to NFL game IDs. It also links to a historical archive of NFL game weather data beginning with the 2000 season.
Included Files
The files can be found here. The files are as follows:
stadium_coordinates.csv
- StadiumName: name of the stadium.
- RoofType: whether the stadium is indoor, outdoor, or has a retractable roof.
- Longitude: longitude of the stadium.
- Latitude: latitude of the stadium.
- StadiumAzimuthAngle: azimuth angle of the field (0 is North, 90 is East, 180 is South, 270 is West).
games.csv
- game_id: id corresponding to the game.
- Season: season the game takes place in.
- StadiumName: name of the stadium.
- TimeStartGame: date and time the game started, in the local timezone.
- TimeEndGame: date and time the game ended, in the local timezone.
- TZOffset: hours offset from the local timezone of the game to the UTC timezone.
games_weather.csv
- game_id: id corresponding to the game.
- Source: source of weather data (Meteostat or Wunderground).
- DistanceToStation: distance from the stadium to the station in miles.
- TimeMeasure: date and time the weather was measured, in the local timezone.
- Temperature: temperature in Fahrenheit.
- DewPoint: dew point in Fahrenheit.
- Humidity: relative humidity as a percentage.
- Precipitation: precipitation in inches.
- WindSpeed: wind speed in miles per hour.
- WindDirection: azimuth angle of the wind direction (0 is North, 90 is East, 180 is South, 270 is West).
- Pressure: pressure in inches of mercury.
Please note that the games_weather.csv file was created using an older version of the Meteostat API, so following the methods below may give slightly different results.
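For readers who only want the archived data, the files can be combined directly. A minimal sketch, assuming the CSVs have been downloaded into the working directory:

```r
library(dplyr)
library(readr)

#loading the archived files
df_weather <- read_csv('games_weather.csv')
df_games   <- read_csv('games.csv')

#attaching game metadata (season, stadium, kickoff time) to each weather reading
df <- inner_join(df_weather, df_games, by = 'game_id')
```

From here the readings for any game can be pulled out by its game_id.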
Acquiring the Weather Data
The weather data currently posted on GitHub was acquired via version 1 of the Meteostat API. Version 1 was used to obtain JSON files of weather station locations and the weather readings for a particular date and time. The API has since been updated, allowing new ways to obtain the data. To use it, one must register for an API key, which is completely free, but be aware that it has limits on calls per day.
library(tidyverse)
The first step is to match games to their times and physical locations. To do this, the longitude and latitude of each stadium and the dates of the games are needed. They can be found in stadium_coordinates.csv and games.csv.
#loading coordinate data
df_coordinates <- readr::read_csv('stadium_coordinates.csv')
df_games <- readr::read_csv('games.csv')
To capture each game, it is ideal to perform one API query per game and location. To be conservative about grabbing all the necessary data, a 3 day window surrounding the day of interest is used.
#converting dates to dates and not characters
df_games <- df_games %>%
mutate(
TimeStartGame = lubridate::parse_date_time(TimeStartGame, orders = '%m/%d/%Y %H:%M'),
TimeEndGame = lubridate::parse_date_time(TimeEndGame, orders = '%m/%d/%Y %H:%M')
)
#will be used for weather fetching
df_weather_fetch <- df_games %>%
#determining the first and last date around each game - adding a one-day buffer to make sure all data is captured
#using substr, since only the date is needed; leaving it as a character is fine because it will be part of the API call
mutate(first_date = substr(TimeStartGame - lubridate::days(1), 1, 10),
last_date = substr(TimeEndGame + lubridate::days(1), 1, 10)) %>%
#joining the coordinates
inner_join(df_coordinates, by = c("StadiumName" = "StadiumName")) %>%
#selecting relevant columns
select(game_id, Longitude, Latitude, first_date, last_date) %>%
distinct()
As Meteostat updated their API to version 2, there are now multiple ways of acquiring the data. Each has advantages and disadvantages:
- Method 1: Acquiring Hourly Data from Each Station near Stadium Coordinates For Each Game.
- Method 2: Acquiring Point Data for Each Set of Stadium Coordinates For Each Game.
- Method 3: Acquiring Bulk Data for Every Station near NFL Stadiums.
Method 1 - Acquiring Hourly Data from Each Station near Stadium Coordinates For Each Game
Advantages:
- Guaranteed method to acquire data from each station for each game.
Disadvantages:
- Very long computation time required to acquire data for every game in dataset.
This method is similar to how the data was acquired using the version 1 API. However, due to new restrictions on volume (a 10 day limit per query) and on the number of queries per time period (2,000 queries per day), it is not feasible to acquire data for every game without running this code for many days (possibly weeks). However, this method guarantees data for every game.
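To see why the query budget matters, a rough back-of-envelope calculation helps. The game and station counts below are illustrative assumptions, not figures taken from the dataset:

```r
#roughly 267 regular-season plus playoff games per season, for 20 seasons
n_games <- 267 * 20
#one nearby-station query plus one hourly query per returned station,
#assuming ~15 stations within range of a typical stadium
queries_per_game <- 1 + 15
daily_limit <- 2000
#days needed if every call counts against the daily cap
ceiling(n_games * queries_per_game / daily_limit)
#-> 43
```

Even under these modest assumptions the run takes on the order of six weeks, which is why Methods 2 and 3 exist.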
In this method, Meteostat’s Nearby Station and Hourly Station commands are used.
Method 1: Acquiring the Raw Data
To begin using the API to fetch the weather data, iterate through each row of df_weather_fetch
and query hourly weather data from every station within a 100 kilometer radius of the stadium and append to a dataframe and/or csv.
#will store results
df_weather_raw_method1 <- data.frame()
API_KEY = "" #INSERT YOUR API KEY HERE
#iterating through each game
for(i in seq(1, nrow(df_weather_fetch))){
#tracking which iteration of game
print(paste0(i, ' / ', nrow(df_weather_fetch)))
#selecting row
row <- df_weather_fetch[i,]
#query to find nearby stations for the given longitude and latitude
#make sure to set API_KEY to your registered API key
#setting limit to 99 to maximize the number of stations returned
#text of query
query_station <- paste0('https://api.meteostat.net/v2/stations/nearby?lat=',
row$Latitude,'&lon=',row$Longitude,'&limit=99')
#assuming result is missing to start
query_station_content <- ""
consecutive_calls <- 0
#while result is missing
while(query_station_content == "") {
#response of query
query_station_response = httr::GET(url = query_station, httr::add_headers(`x-api-key` = API_KEY))
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#content of query
query_station_content <- httr::content(query_station_response, as = 'text')
#adding to consecutive calls
consecutive_calls <- consecutive_calls + 1
#at this point we assume API key is out of queries for day
if(consecutive_calls > 5){
#saving csv of progress
write.csv(x = df_weather_raw_method1, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_station_data = jsonlite::fromJSON(query_station_content)$data %>% as.data.frame()
#iterating through each station
for(j in seq_len(nrow(query_station_data))){
#tracking which iteration of station
print(paste0(' ', j, ' / ', nrow(query_station_data)))
#query to get weather for given station and date
query_hourly = paste0('https://api.meteostat.net/v2/stations/hourly?station=',
query_station_data$id[j],'&start=',row$first_date,
'&end=',row$last_date)
#assuming result is missing to start
query_hourly_content <- ""
consecutive_calls <- 0
while(query_hourly_content == "") {
#response of query
query_hourly_response = httr::GET(url = query_hourly, httr::add_headers(`x-api-key` = API_KEY))
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#content of query
query_hourly_content <- httr::content(query_hourly_response, as = 'text')
#adding to consecutive calls
consecutive_calls <- consecutive_calls + 1
#at this point we assume API key is out of queries for day
if(consecutive_calls > 5){
#saving csv of progress
write.csv(x = df_weather_raw_method1, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_hourly_data = jsonlite::fromJSON(query_hourly_content)$data
#many times result will be empty list
if(length(query_hourly_data) > 0){
#appending to dataframe
df_weather_raw_method1 <- bind_rows(df_weather_raw_method1,
cbind(row %>%
mutate(distance = query_station_data$distance[j]),
query_hourly_data %>% as.data.frame()))
}
}
}
Method 1: Dealing With NA
For method 1, it is ideal to use the closest possible station for each stadium. However, since the closest stations will have a multitude of missing values, it is ideal to take some steps to fill them. An easy solution that accounts for most of the NA values is to fill from the next closest station:
#cleaning the raw weather data
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#converting time to date time
mutate(time = lubridate::parse_date_time(time, orders = c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M"))) %>%
#grouping by game and time
group_by(game_id, time) %>%
#dealing with NA
arrange(distance) %>%
#filling NA of closer distanced weather stations with further weather stations
fill(c('temp', 'dwpt', 'rhum', 'prcp',
'wspd', 'wdir', 'pres'), .direction = 'up') %>%
#ungrouping
ungroup()
Now, all data that is not from the station with the minimum distance to the stadium is filtered out:
#cleaning the raw weather data
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#grouping by game and time
group_by(game_id, time) %>%
#filtering for only closest distance to stadium
filter(distance == min(distance)) %>%
#ungrouping
ungroup()
Next, the remaining NA values are dealt with using a linear approximation:
df_weather_raw_method1 <- df_weather_raw_method1 %>%
#filling remaining NA with linear interpolation
group_by(game_id) %>%
arrange(time) %>%
#filling NA with linear approximation - precipitation/wind direction cannot be well predicted this way
mutate_at(vars('temp', 'dwpt',
'rhum', 'wspd', 'pres'), ~zoo::na.approx(., maxgap = 3, na.rm = F)) %>%
#ungrouping
ungroup()
df_weather_raw <- df_weather_raw_method1
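To see what the interpolation step above does, here is a tiny standalone illustration of zoo::na.approx with maxgap = 3 (the toy vector is made up):

```r
library(zoo)

#toy hourly temperature readings with two gaps
x <- c(50, NA, NA, 56, NA, NA, NA, NA, 60)
#gaps of length <= maxgap are filled linearly; longer gaps are left as NA
na.approx(x, maxgap = 3, na.rm = FALSE)
#-> 50 52 54 56 NA NA NA NA 60
```

The first gap (length 2) is interpolated, while the second gap (length 4) exceeds maxgap and stays NA, so it can be caught downstream.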
This concludes the first method of data acquisition.
Method 2 - Acquiring Point Data for Each Set of Stadium Coordinates For Each Game
Advantages:
- Fast computation time.
- No cleaning required after data acquisition.
Disadvantages:
- No guarantee of data for every game.
Meteostat now has an option to acquire “Point” data based on a longitude and latitude. Acquiring data this way takes advantage of Meteostat’s weather models, which combine information from various stations. Since the weather at the longitude and latitude of the stadium is the desired final data, fewer total queries are required to cover the entire dataset. However, data for a number of games cannot be acquired this way, including nearly every game at the stadiums for Carolina from 2000 - 2004, New England from 2016 - 2019, Seattle from 2015 - 2019, and a few others.
In this method, Meteostat’s Point Hourly command is used.
#will store results
df_weather_raw_method2 <- data.frame()
API_KEY = "" #INSERT YOUR API KEY HERE
#iterating through each game
for(i in seq(1, nrow(df_weather_fetch))){
row <- df_weather_fetch[i,]
#tracking progress
print(paste0(i, ' / ', nrow(df_weather_fetch)))
#query to get weather for given station and date
query_hourly = paste0('https://api.meteostat.net/v2/point/hourly?lat=',
row$Latitude,'&lon=',row$Longitude,
'&start=',row$first_date,
'&end=',row$last_date)
#assuming result is missing to start
query_hourly_content <- ""
num_calls <- 0
while(query_hourly_content == "") {
#response of query
query_hourly_response = httr::GET(url = query_hourly, httr::add_headers(`x-api-key` = API_KEY))
#content of query
query_hourly_content <- httr::content(query_hourly_response, as = 'text')
num_calls <- num_calls + 1
#prevent more than 1 call per second (MeteoStat's limit)
Sys.sleep(1)
#at this point we assume API key is out of queries for day
if(num_calls > 5) {
#saving csv of progress
write.csv(x = df_weather_raw_method2, file = 'weather_raw.csv', row.names = F)
#stopping the run when out of calls for the day
stop('assuming the API key is out of queries for the day')
}
}
#cleaned data of query
query_hourly_data = jsonlite::fromJSON(query_hourly_content)$data
#many times result will be empty list
if(length(query_hourly_data) > 0){
df_weather_raw_method2 <- bind_rows(df_weather_raw_method2,
cbind(row,
query_hourly_data %>% as.data.frame()))
}
}
df_weather_raw <- df_weather_raw_method2
Method 3 - Acquiring Bulk Data for Every Station near NFL Stadiums
Method 3 is not recommended unless you have a computer with substantial free memory and processing power. One can learn how to acquire bulk weather data here.
Advantages:
- No API key is required.
- Guaranteed method to acquire data from each station for each game.
Disadvantages:
- Significant processing power is needed to unzip the data and filter for the required rows.
No code is included for this method due to the extensive time required to test it and ensure correctness. The idea would be to use the code in method 1 to find the weather stations, then filter the bulk hourly data for the hours of the relevant games. The code that fills NA values in method 1 can be reused as well.
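As a rough illustration only, here is a sketch of what reading one station's bulk file could look like. The endpoint path and column order below are assumptions based on Meteostat's bulk documentation and should be verified against the current docs before use; the station id shown is hypothetical:

```r
library(readr)

#column order of Meteostat's bulk hourly CSVs (assumed - verify against the docs)
bulk_cols <- c('date', 'hour', 'temp', 'dwpt', 'rhum', 'prcp', 'snow',
               'wdir', 'wspd', 'wpgt', 'pres', 'tsun', 'coco')

#download and read the gzipped hourly history for one station
read_station_bulk <- function(station_id){
  url <- paste0('https://bulk.meteostat.net/v2/hourly/', station_id, '.csv.gz')
  read_csv(url, col_names = bulk_cols)
}

#e.g. read one (hypothetical) station and keep only rows around a game date:
#read_station_bulk('72509') %>%
#  dplyr::filter(date >= '2019-09-05', date <= '2019-09-07')
```

Each station's full history is one file, so the filtering burden falls entirely on your machine rather than on the API.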
Cleaning the data
Now that the data has been acquired via one of the 3 methods, the cleaning process can begin. First, columns that are essentially all NA and columns that are no longer needed (including Longitude, Latitude, first_date and last_date) are dropped:
df_weather_raw <- df_weather_raw %>%
select(-c('snow', 'first_date', 'last_date', 'wpgt', 'Longitude',
'tsun', 'Latitude', 'coco'))
Finally, since many NFL fans prefer the imperial system of measurement, the data is converted from metric to imperial. An EstimatedCondition column is also added.
#conversion constants between Celsius and Fahrenheit
celsius2Fahrenheit_slope <- 9/5
celsius2Fahrenheit_intercept <- 32
mm2Inches <- 0.039370
km2Miles <- 0.621371
hPa2Inches <- 0.02952998
#raw weather data
df_weather_raw <- df_weather_raw %>%
mutate(
TimeMeasure = time,
#making direction go from 0 to 360, and setting as NA if there is no wind
WindDirection = ifelse(wspd == 0, NA, wdir) %% 360,
#fixing issues where humidity can be greater than 100%
Humidity = ifelse(rhum > 100, 100, rhum),
#converting Celsius to Fahrenheit
Temperature = round(temp * celsius2Fahrenheit_slope + celsius2Fahrenheit_intercept, 2),
DewPoint = round(dwpt * celsius2Fahrenheit_slope + celsius2Fahrenheit_intercept, 2),
#converting millimeters to inches (note: prcp, not pres)
Precipitation = round(prcp * mm2Inches, 3),
#converting km / hour to miles / hour
WindSpeed = round(wspd * km2Miles, 2),
#converting hPa to inches of mercury
Pressure = round(pres * hPa2Inches, 4),
#determining condition from precipitation and temperature
EstimatedCondition = ifelse(prcp == 0, "Clear", "Rain"),
EstimatedCondition = ifelse(prcp > 0,
ifelse(Temperature <= 34, "Snow", "Rain"),
EstimatedCondition),
#intensity thresholds are in inches (0.098 in ~ 2.5 mm, 0.3 in ~ 7.6 mm), so the converted Precipitation is compared
EstimatedCondition = ifelse(EstimatedCondition %in% c("Snow", "Rain"),
paste0(case_when(Precipitation < 0.098 ~ "Light ",
Precipitation > 0.3 ~ "Heavy ",
T ~ "Moderate "), EstimatedCondition),
EstimatedCondition)) %>%
#selecting relevant columns
select(game_id, TimeMeasure, Temperature,
DewPoint, Humidity, Precipitation, WindSpeed, WindDirection, Pressure)
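A few quick sanity checks on the conversion constants, using well-known reference points:

```r
#0 C freezes at 32 F, 100 C boils at 212 F
stopifnot(0 * (9/5) + 32 == 32)
stopifnot(100 * (9/5) + 32 == 212)
#standard sea-level pressure: 1013.25 hPa is 29.92 inHg
stopifnot(round(1013.25 * 0.02952998, 2) == 29.92)
#25.4 mm is one inch
stopifnot(round(25.4 * 0.039370, 2) == 1)
```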
Merging Weather Data
The weather data is joined to the games data by game_id, and measurement times outside of the game window are then filtered out. The TimeStartFilter and TimeEndFilter variables are created for this filtering. They are converted to UTC time and given a one-hour buffer on each side to ensure all wanted data is kept.
df_games <- df_games %>%
mutate(TimeStartFilter = TimeStartGame - lubridate::hours(TZOffset) - lubridate::hours(1),
TimeEndFilter = TimeEndGame - lubridate::hours(TZOffset) + lubridate::hours(1))
Finally, one must merge and filter out the unwanted measurements:
#joining the weather data to the games and filtering to the game window
df_weather_games <- left_join(df_games, df_weather_raw, by = c("game_id"= "game_id")) %>%
filter( ((TimeMeasure >= TimeStartFilter) & (TimeMeasure <= TimeEndFilter)) )
The data is cleaned and ready to analyze and visualize.
Plotting Results
Now that the data is clean, it can be visualized. The following code creates a simple plot showing the general wind speed trends at every outdoor stadium.
#used for plot values
plot_vals <- df_weather_games %>%
group_by(game_id) %>%
#selecting 1st read of wind
slice(1) %>%
#joining games
inner_join(df_games) %>%
#joining coordinates
inner_join(df_coordinates) %>%
#applying filters
filter(RoofType == "Outdoor", Season >= 2015) %>%
#renaming stadium
mutate(StadiumNameAdj = paste0(HomeTeam, ': ', StadiumName),
#limiting max value of wind speed
WindSpeed = ifelse(WindSpeed >= 25, 25, WindSpeed)) %>%
#selecting values of interest
select(StadiumNameAdj, WindSpeed, WindDirection) %>%
#grouping by stadium
group_by(StadiumNameAdj) %>%
#filtering for only 20 or more observations
filter(n() > 20) %>%
#ungrouping
ungroup()
set.seed(0)
## Add random noise (WindDirection comes in multiples of 10)
plot_vals$WindDirection <- plot_vals$WindDirection + runif(nrow(plot_vals), -3, 3)
#will be used for axis text
key <- expand.grid(WindDirection = 45/2, WindSpeed = c(5, 10, 15, 20, 25),
StadiumNameAdj = unique(plot_vals$StadiumNameAdj)) %>%
mutate(WindLabel = ifelse(WindSpeed == 25, " 25+", WindSpeed))
plot_vals %>%
ggplot(aes(x = WindDirection, y = WindSpeed)) +
#axis lines
geom_hline(yintercept = seq(0, 25, by = 5), colour = "grey90", size = 0.2) +
geom_vline(xintercept = seq(0, 360-1, by = 45), colour = "grey90", size = 0.2) +
#axis text
geom_text(data = key, aes(label = WindLabel), colour = "grey", size = 2.2) +
#vectors
geom_segment(aes(y = 0, xend = WindDirection, yend = WindSpeed),
arrow = arrow(length = unit(0.03, "npc"))) +
coord_polar(start = 0) + xlab('') +
#removing y label
ylab('') +
#scale
scale_x_continuous(breaks = c(0, 90, 180, 270), labels = c("N", "E", "S", "W"), lim = c(0,360)) +
scale_y_continuous(lim = c(0, 25)) +
#black white theme
theme_bw() +
#title
ggtitle("Wind Direction and Magnitudes, Outdoor Stadiums 2015-2019") +
#faceting by stadium
facet_wrap(~StadiumNameAdj, nrow = 5) +
#theme
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
strip.background = element_blank(),
strip.text = element_text(size = 8),
panel.border = element_blank(),
panel.grid = element_blank()
)