This year I had a particularly fun travel experience that involved a few hours of sitting on a flight during an active tornado warning, and an eventual tornado touchdown at Chicago O'Hare International Airport (ORD). If you've ever had the displeasure of flying into ORD (as a Chicago native it's hard to avoid unless you fly Southwest), you'll know it's an absolute sprawling nightmare. Built and expanded over the last 80 years, it represents all the poor design decisions you could make for an airport.
As part of my nightmare travel experience I got to utilize Biden's Airline Travel Reform to receive a full refund for my cancelled flight (obviously something that should be the standard but historically has not been). I was curious to try to quantify the effectiveness of this reform and understand the impact it's had on the travel experience. Luckily we have the Bureau of Transportation Statistics, which has a ton of datasets to sift through. I found that they distribute two main datasets for "On-Time" airline information.
While I initially struggled to find information on these datasets and what they actually contain, I eventually found that the regulations related to this are all well defined in 14 CFR § 234.4. If we read this we can find that airlines are required to provide data on "reportable flights", which prior to January 1, 2018 was defined as:
any nonstop flight, including a mechanically delayed flight, to or from any airport within the contiguous 48 states that accounts for at least 1 percent of domestic scheduled-passenger enplanements in the previous calendar year, as reported to the Department pursuant to part 241 of this title. Qualifying airports will be specified periodically in accounting and reporting directives issued by the Office of Airline Information.
And afterwards it was defined as:
any domestic nonstop scheduled passenger flight, including a mechanically delayed flight, held out to the public under the reporting carrier's code, to or from any U.S. large, medium, small, or non-hub airport as defined in 49 U.S.C. 47102. Qualifying airports will be specified periodically in accounting and reporting directives issued by the Office of Airline Information.
Based on my understanding, these datasets share most of their information, but the latter definition likely pulls in data from a larger selection of airports. Either way, both datasets are very similar, so I started with the smaller dataset beginning January 2018 to experiment.
Actually downloading the dataset proved difficult, as the initial browser requests would just time out. I eventually copied over the request as a curl command and let it run for a while. Eventually I was greeted with the following (to which I did let out quite a sigh; I can't imagine why it took them 2-3 minutes to respond with this...):
<html>
  <head>
    <title>Object moved</title>
  </head>
  <body>
    <h2>
      Object moved to
      <a href="https://transtats.bts.gov/PREZIP/On_Time_Marketing_Carrier_On_Time_Performance_Beginning_January_2018_2024_8.zip">here</a>.
    </h2>
  </body>
</html>
Luckily for me, the full data turned out to be trivially accessible over HTTP and nicely split up into month-by-month zipped CSV files. I made a quick list of the URLs in vim and let the following bash/wget loop handle it:
while read -r url; do
  wget -c "$url"
done < urls.txt
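For reference, the URL list itself can be generated with a small loop, assuming the PREZIP naming pattern seen in the redirect above (year and unpadded month as the final two components) holds for every month:

# Hypothetical generator for urls.txt; assumes the naming pattern
# observed in the "Object moved" response applies to all months.
for year in $(seq 2018 2024); do
  for month in $(seq 1 12); do
    echo "https://transtats.bts.gov/PREZIP/On_Time_Marketing_Carrier_On_Time_Performance_Beginning_January_2018_${year}_${month}.zip"
  done
done > urls.txt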
After a while I had all the zip files downloaded locally onto an NVMe array. The next step was to decompress everything (and ignore the annoying zip prompts about the 'readme.txt' duplicate file found in each zip):
for i in *.zip; do unzip -o "$i"; done
I initially thought this might be a good opportunity to play with DuckDB, but after a few minutes of experimenting I decided to just go with my standard data-exploration tool, ClickHouse. To import the CSV files I just had to write a schema matching their layout:
CREATE TABLE on_time_marketing_carrier
(
`FlightDate` Date,
`Marketing_Airline_Network` String,
`DOT_ID_Marketing_Airline` UInt64,
`Flight_Number_Marketing_Airline` UInt64,
`Operating_Airline` String,
`DOT_ID_Operating_Airline` UInt64,
`Flight_Number_Operating_Airline` UInt64,
`Tail_Number` String,
`Origin` String,
`OriginStateName` String,
`Dest` String,
`DestStateName` String,
`DepTime` String,
`DepDelay` Float64,
`TaxiOut` Float64,
`WheelsOff` String,
`WheelsOn` String,
`TaxiIn` Float64,
`CRSArrTime` String,
`ArrTime` String,
`ArrDelay` Float64,
`AirTime` Float64,
`Distance` Float64,
`CRSElapsedTime` Float64,
`ActualElapsedTime` Float64,
`Cancelled` Float64,
`Diverted` Float64,
`CarrierDelay` Float64,
`WeatherDelay` Float64,
`NASDelay` Float64,
`SecurityDelay` Float64,
`LateAircraftDelay` Float64
)
ENGINE = MergeTree
ORDER BY tuple(FlightDate, Flight_Number_Marketing_Airline, Flight_Number_Operating_Airline)
After creating the table above, I loaded the data serially using the following bash for-loop:
for i in *.csv; do clickhouse-client -q "INSERT INTO bts.on_time_marketing_carrier FORMAT CSVWithNames" < "$i"; done
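To sanity-check the load, a quick aggregate confirms the row count and date range look reasonable; something along these lines works:

-- Verify the load: total rows plus earliest/latest flight date
SELECT
    count() AS total_rows,
    min(FlightDate) AS earliest,
    max(FlightDate) AS latest
FROM bts.on_time_marketing_carrier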
After this I had most of the data (I initially skipped 2018) loaded into ClickHouse and ready for ad-hoc querying. I was able to build a bunch of base visualizations around this data, and because the larger generic dataset had a nearly identical schema, much of this work translated over to it. Because of this I decided to just switch over and utilize the larger dataset. I repeated the process above but used the alternative URL pattern of https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2014_2.zip. This time I loaded data all the way back to January 2000 to provide a bit more historical context.
This dataset is almost identical but differs slightly, and thus required a small schema change:
CREATE TABLE on_time_performance
(
`FlightDate` Date,
`Reporting_Airline` String,
`DOT_ID_Reporting_Airline` UInt64,
`Flight_Number_Reporting_Airline` UInt64,
`Tail_Number` String,
`Origin` String,
`OriginStateName` String,
`Dest` String,
`DestStateName` String,
`DepTime` String,
`DepDelay` Float64,
`TaxiOut` Float64,
`WheelsOff` String,
`WheelsOn` String,
`TaxiIn` Float64,
`CRSArrTime` String,
`ArrTime` String,
`ArrDelay` Float64,
`AirTime` Float64,
`Distance` Float64,
`CRSElapsedTime` Float64,
`ActualElapsedTime` Float64,
`Cancelled` Float64,
`Diverted` Float64,
`CarrierDelay` Float64,
`WeatherDelay` Float64,
`NASDelay` Float64,
`SecurityDelay` Float64,
`LateAircraftDelay` Float64
)
ENGINE = MergeTree
ORDER BY tuple(FlightDate, Reporting_Airline, Flight_Number_Reporting_Airline)
A good first step is looking at overall flight volumes in the dataset and making sure they correlate with real-world events we know should have an impact.
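The volume rollup is a one-liner in ClickHouse; a sketch, assuming the second table lives in the same bts database as before:

-- Flights per month; events like COVID-19 should show up as obvious dips
SELECT
    toStartOfMonth(FlightDate) AS month,
    count() AS flights
FROM bts.on_time_performance
GROUP BY month
ORDER BY month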
I said above that ORD is an awful travel experience and I wasn't lying, but representing this in the dataset takes a bit of thinking. I realized the largest pain-point at O'Hare is usually taxiing (this is mostly due to the prevailing weather conditions and runway layout). Sure enough, when we chart average taxi-in and taxi-out times across the most popular airports, we quickly see that ORD is an outlier:
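The aggregation behind a chart like this is straightforward; here is a sketch, assuming "most popular" means highest flight count (the taxi-in version is the same shape, just grouped by Dest over TaxiIn):

-- Average taxi-out time at the ten busiest origin airports
SELECT
    Origin,
    count() AS flights,
    avg(TaxiOut) AS avg_taxi_out_minutes
FROM bts.on_time_performance
GROUP BY Origin
ORDER BY flights DESC
LIMIT 10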
And looking at the sum of total time spent taxiing, it's even more noticeable:
For completeness, the taxi-in charts show the same thing:
My hypothesis was that things would be getting better in 2024 based on the newer regulations. To see this I started looking at the average arrival and departure delays over 3-month periods. Despite my theory, it actually seems like things are getting worse:
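Something along these lines produces the quarterly averages (a sketch; note that cancelled flights likely load with zeroed delay fields under this schema, which dilutes the averages slightly):

-- Average departure/arrival delay per quarter (3-month buckets)
SELECT
    toStartOfQuarter(FlightDate) AS quarter,
    avg(DepDelay) AS avg_dep_delay,
    avg(ArrDelay) AS avg_arr_delay
FROM bts.on_time_performance
GROUP BY quarter
ORDER BY quarter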
Looking at the sum of arrival/departure delay time we also don't see much of an improvement (although it looks like maybe things initially started getting better?):
Taking a deeper look into delays, we can graph the time spent in each of the tracked categories and get a feel for why flights get delayed:
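Since the BTS tracks five delay categories directly in the dataset, a per-year rollup is enough to see the breakdown; roughly:

-- Total delay minutes attributed to each tracked category, per year
SELECT
    toYear(FlightDate) AS year,
    sum(CarrierDelay) AS carrier_minutes,
    sum(WeatherDelay) AS weather_minutes,
    sum(NASDelay) AS nas_minutes,
    sum(SecurityDelay) AS security_minutes,
    sum(LateAircraftDelay) AS late_aircraft_minutes
FROM bts.on_time_performance
GROUP BY year
ORDER BY year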
Charting cancelled flights, we can see a few notable events like COVID-19 and the 2022 Southwest Airlines scheduling crisis.
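Since Cancelled is a 0/1 flag in this schema, the monthly cancellation rate is just its mean; a sketch:

-- Share of flights cancelled each month; spring 2020 and December 2022
-- should both stand out as spikes
SELECT
    toStartOfMonth(FlightDate) AS month,
    sum(Cancelled) / count() AS cancellation_rate
FROM bts.on_time_performance
GROUP BY month
ORDER BY month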
Of course there is plenty more to look at in this dataset. If you want to try playing around yourself, I recommend following my instructions above for downloading the data and then using this Jupyter notebook as a starting place.