This project takes a look at traffic accidents around Atlanta, Georgia for the years 2017-2020. Since I didn’t have many projects on my website, I decided to put together a few smaller, basic projects. For this one, I wanted more practice with cleaning data, working with scikit-learn classifiers, making interactive plots, and using various other tools.

What did I do in this project?

First, I browsed through Kaggle Datasets to find one that seemed interesting. I had no idea what I wanted to work with; I just wanted to pick something that looked like it would make good practice. Eventually, I settled on "US Accidents (4.2 million records): A Countrywide Traffic Accident Dataset (2016 - 2020)", which contains traffic accidents and the delays those accidents caused.

Once I had selected a dataset, I started by cleaning the data. I reduced the dataset to the Atlanta area, removed unneeded data, and filled in missing data. Next, I made various interactive plots of traffic accidents around Atlanta. After cleaning and looking at the data, I trained several classifiers to make predictions on it. The dataset contained the estimated delay caused by each traffic accident at its location, and I used this as my target value. Finally, I made some plots that compared the accidents of 2017-2019 with those of 2020. I chose 2020 for the comparison because that is when COVID-19 began, and I wanted to see how traffic accidents changed.
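The cleaning steps described above (filter to Atlanta, drop unneeded columns, fill missing values) can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the tiny DataFrame stands in for the Kaggle file, the column names are assumed, and treating "Source" as the unneeded column and using a median fill are choices made only for the example.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Kaggle file. Column names and values are assumptions
# made for illustration, not rows taken from the real dataset.
raw = pd.DataFrame({
    "City": ["Atlanta", "Atlanta", "Dallas", "Atlanta"],
    "State": ["GA", "GA", "TX", "GA"],
    "Start_Time": ["2017-03-01 08:15:00", "2018-07-04 17:40:00",
                   "2019-01-10 09:00:00", "2020-04-20 12:05:00"],
    "Temperature(F)": [55.0, np.nan, 40.0, 71.0],
    "Weather_Condition": ["Clear", None, "Rain", "Cloudy"],
    "Source": ["A", "B", "A", "B"],  # treated here as an unneeded column
})


def clean_accidents(df):
    """Filter to Atlanta, drop unneeded columns, and fill missing values."""
    # 1. Keep only rows for the area of interest.
    out = df[(df["City"] == "Atlanta") & (df["State"] == "GA")].copy()
    # 2. Drop columns that are not needed for plotting or prediction.
    out = out.drop(columns=["Source"])
    # 3. Fill missing values: median for numbers, a placeholder for text.
    median_temp = out["Temperature(F)"].median()
    out["Temperature(F)"] = out["Temperature(F)"].fillna(median_temp)
    out["Weather_Condition"] = out["Weather_Condition"].fillna("Unknown")
    # 4. Parse timestamps so accidents can be grouped by year, weekday, or hour.
    out["Start_Time"] = pd.to_datetime(out["Start_Time"])
    out["Year"] = out["Start_Time"].dt.year
    return out.reset_index(drop=True)


atlanta = clean_accidents(raw)
print(atlanta)
```

Filling numeric gaps with the median of the filtered rows is only one option; dropping sparse columns or filling by group may suit the real data better.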

What were the results?

I managed to get the data clean, and I made several plots to look at accident density around Atlanta. The plots break accidents down by year, weather, weekday, and hour of the day.

For the predictions, I tried several types of classifiers. I managed to get significantly better accuracy than the dummy classifier I used as a baseline. At the end, I combined my best classifiers into a voting classifier, though this only slightly increased the accuracy.
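For readers unfamiliar with the setup, here is a minimal sketch of comparing a dummy baseline against a voting classifier in scikit-learn. It runs on synthetic data from make_classification, and the three member models are assumptions chosen for illustration; they are not necessarily the classifiers used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the delay categories.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Hard voting: each member predicts a class and the majority wins.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
voter_acc = voter.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"voting accuracy:   {voter_acc:.3f}")
```

The dummy score gives a floor that any real model has to beat, which matters when one class dominates the data. A voting ensemble tends to help most when its members make different kinds of mistakes.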

Next, I compared the accidents from 2020 with those from 2017-2019 to see the effects of COVID-19. The results for this section were not what I expected. I expected accidents to decrease dramatically in 2020, but that didn’t seem to be the case. In the end, I believe this was due to the completeness of the dataset. The author said that he believed the dataset to be a subset of the actual accidents, since it was only collected from traffic site APIs across the internet. I believe this affected the numbers.
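One simple way to set up this kind of comparison is to count accidents per month and year, then put the 2017-2019 monthly average next to the 2020 count. The sketch below uses made-up timestamps purely to exercise the code; the numbers are not from the dataset, and the column names are hypothetical.

```python
import pandas as pd

# Made-up timestamps purely to exercise the code; not values from the dataset.
starts = pd.to_datetime(pd.Series([
    "2017-01-05", "2017-02-10", "2018-01-20", "2018-01-25",
    "2019-02-14", "2019-02-20", "2020-01-03", "2020-02-08", "2020-02-09",
]))

frame = pd.DataFrame({"year": starts.dt.year, "month": starts.dt.month})

# Rows are months, columns are years, cells are accident counts.
counts = frame.groupby(["month", "year"]).size().unstack("year", fill_value=0)

# Average the pre-2020 years and place 2020 alongside for comparison.
comparison = pd.DataFrame({
    "avg_2017_2019": counts[[2017, 2018, 2019]].mean(axis=1),
    "count_2020": counts[2020],
})
comparison["change"] = comparison["count_2020"] - comparison["avg_2017_2019"]

print(comparison)
```

If the dataset's coverage changed from year to year, raw counts like these will mix real changes in accidents with changes in reporting, which fits the completeness concern described above.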

Final Thoughts

Overall, I had no real plans when starting this project. I wanted to pick a dataset and just explore it, and there was no initial goal when I decided on the traffic dataset. Once I began working on the project, I found it was a good dataset for working with interactive plots. However, when I got to the prediction part, I found that simply predicting the provided traffic delay categories was not very interesting. As an entry-level practice project, I think it was fine, but in the end, I wish I had picked something a little more interesting.

The Notebook

The notebook is embedded below. However, it’s best viewed at nbviewer.org. The project can be found on my GitHub.