F1 Data Science Project part 4 - refactor!
I’m determined to finish this project (what does finished even look like for this project?) by the end of this year. There are only two races left in the season, and both the drivers’ world championship and the constructors’ championship have already been decided.
If you haven't seen the GitHub repo for this yet, be sure to check out the link here. And remember, you will need Poetry installed first to use the Jupyter notebooks.
So, I decided to update the project with new data and also decided to do something that I have been needing to do for a long time…
Refactor
Yes, there has been a need to refactor a large amount of the Python code. The problem was that whenever I needed to add a new set of race results, I copied the code that took the new race data and added it to the race results data frame. All I had to do was change a few names of variables and a string.
As you can imagine, the main problems with this approach are:
- Mistakes being made when updating variables and strings
- It looks super messy
- Difficult to read
This was done for each race, and it was just turning into a mess. So I decided to refactor the entire Jupyter notebook for race data and fastest lap.
The two blocks of code that I have refactored are the following:
Analysis
driver_count = 0
while driver_count < len(bahrain_df.index):
    pos = bahrain_df.loc[driver_count, 'Pos']
    driverNo = bahrain_df.loc[driver_count, 'No']
    name = bahrain_df.loc[driver_count, 'Driver']
    car = bahrain_df.loc[driver_count, 'Car']
    laps = bahrain_df.loc[driver_count, 'Laps']
    time = bahrain_df.loc[driver_count, 'Time/Retired']
    points = bahrain_df.loc[driver_count, 'PTS']
    # add row
    race_results.loc[-1] = [pos, name, car, laps, time, points, 'BAHRAIN']
    # shift the index
    race_results.index = race_results.index + 1
    driver_count += 1
race_results.head()
Fastest lap
driver_count = 0
while driver_count < len(fl_bahrain.index):
    time = fl_bahrain.loc[driver_count, 'TIME']
    driverNo = fl_bahrain.loc[driver_count, 'NO']
    name = fl_bahrain.loc[driver_count, 'DRIVER']
    avg_speed = fl_bahrain.loc[driver_count, 'AVG SPEED']
    car = fl_bahrain.loc[driver_count, 'CAR']
    if time == 'DNF':
        driver_count += 1
    else:
        #time = re.sub(r'[^0-9.]', '', time)
        #timeFloat = float(time)
        formattedTime = datetime.strptime(time,'%M:%S.%f')
        formattedTime = formattedTime.strftime('%H:%M:%S.%f')[:-5]
        #finalTime = timeFloat + fastest_lap_bahrain_time.strftime('%M:%S.%f')[:-3]
        # add row
        fastest_lap.loc[-1] = [driverNo, name, car, 'BAHRAIN', formattedTime, avg_speed]
        # shift the index
        fastest_lap.index = fastest_lap.index + 1
        driver_count += 1
# CONVERT FL COLUMN TO TIMEDELTA FORMAT
fastest_lap['FL'] = pd.to_timedelta(fastest_lap['FL'])
fastest_lap.head()
As you could imagine, having these blocks of code repeated for each race made the Jupyter notebooks hard to read! So refactoring was needed to make it easier and better.
So, let’s dive in with the analysis function first.
Analysis
From the code block above, we can see several places where a parameter can be used:
- The name of the country (in capitals)
- The specific race data frame
- The data frame for all the race data itself
But rather than just taking the code and putting it into a function, I also wanted to be able to pass in a boolean that controls whether the results are committed to the data frame or just shown as a preview.
So the function then changed to look like this:
def addRaceData(country_df, country_name, race_results, commit_mode):
    "Adds race data to the race_results data frame."
    driver_count = 0
    print("Adding data for " + country_name)
    try:
        if(commit_mode == True):
            print("Commit mode is set to TRUE - data WILL be added to the race results data frame!")
            while driver_count < len(country_df.index):
                pos = country_df.loc[driver_count, 'Pos']
                driverNo = country_df.loc[driver_count, 'No']
                name = country_df.loc[driver_count, 'Driver']
                car = country_df.loc[driver_count, 'Car']
                laps = country_df.loc[driver_count, 'Laps']
                time = country_df.loc[driver_count, 'Time/Retired']
                points = country_df.loc[driver_count, 'PTS']
                # add row
                race_results.loc[-1] = [pos, name, car, laps, time, points, country_name]
                # shift the index
                race_results.index = race_results.index + 1
                driver_count += 1
            print("Race data for " + country_name + " has been added")
        elif(commit_mode == False):
            print("Commit mode is set to FALSE - data will NOT be added to the race results data frame!")
            print("Here's what would've been added to the race results data frame")
            while driver_count < len(country_df.index):
                pos = country_df.loc[driver_count, 'Pos']
                driverNo = country_df.loc[driver_count, 'No']
                name = country_df.loc[driver_count, 'Driver']
                car = country_df.loc[driver_count, 'Car']
                laps = country_df.loc[driver_count, 'Laps']
                time = country_df.loc[driver_count, 'Time/Retired']
                points = country_df.loc[driver_count, 'PTS']
                print(pos, name, car, laps, time, points, country_name)
                driver_count += 1
            print("Race data for " + country_name + " has NOT been added")
    except:
        print("ERROR! Double check the arguments provided for the function.\nHave you imported the race data CSV?\nHas the race_results data frame been created?")
Which is great! However, regardless of whether True or False is passed in, I’m repeating the code that creates the variables. This honestly should only happen once. All that matters is whether a new row should be added or not.
So let’s do some more improvements:
def addRaceData(country_df, country_name, race_results, commit_mode):
    "Adds race data to the race_results data frame."
    driver_count = 0
    print("Adding data for " + country_name)
    print("Commit mode is set to " + str(commit_mode))
    try:
        while driver_count < len(country_df.index):
            pos = country_df.loc[driver_count, 'Pos']
            driverNo = country_df.loc[driver_count, 'No']
            name = country_df.loc[driver_count, 'Driver']
            car = country_df.loc[driver_count, 'Car']
            laps = country_df.loc[driver_count, 'Laps']
            time = country_df.loc[driver_count, 'Time/Retired']
            points = country_df.loc[driver_count, 'PTS']
            if commit_mode:
                # add row
                race_results.loc[-1] = [pos, name, car, laps, time, points, country_name]
                # shift the index (only when a row was actually added,
                # so preview mode leaves race_results untouched)
                race_results.index = race_results.index + 1
            else:
                # print row
                print(pos, name, car, laps, time, points, country_name)
            driver_count += 1
    except Exception:
        print("ERROR! Double check the arguments provided for the function.\nHave you imported the race data CSV?\nHas the race_results data frame been created?")
Which looks much better. This function is now easier to read, and it is clear what the outcome will be when the user passes in True or False.
And now adding race data looks like this:
addRaceData(bahrain_df, "BAHRAIN", race_results, True)
Much better!
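As an aside, the row-by-row loop could also be swapped for a single pd.concat call. This is only a sketch of an alternative, not what the notebook does: the name add_race_data_concat is made up for illustration, and it assumes race_results has seven columns in the same order as the list built in addRaceData. Unlike addRaceData, it returns a new data frame instead of modifying race_results in place.

```python
import pandas as pd

# Columns read from each race data frame, in the order addRaceData uses them.
RESULT_COLUMNS = ["Pos", "Driver", "Car", "Laps", "Time/Retired", "PTS"]


def add_race_data_concat(country_df, country_name, race_results, commit_mode):
    """Return race_results with one race appended (hypothetical helper)."""
    # Pick the columns of interest and tag every row with the race name.
    new_rows = country_df[RESULT_COLUMNS].copy()
    new_rows["race"] = country_name
    # Assumption: race_results has the same seven columns in the same order,
    # so the names can be matched by position.
    new_rows.columns = race_results.columns
    if not commit_mode:
        # Preview only: show the rows and hand back the original frame.
        print(new_rows.to_string(index=False))
        return race_results
    return pd.concat([race_results, new_rows], ignore_index=True)
```

Calling it would look like race_results = add_race_data_concat(bahrain_df, "BAHRAIN", race_results, True), so the result has to be assigned back.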
Fastest lap
The fastest lap code was in a similar situation, with a few spots where arguments could be used in the function:
- The fastest lap data frame for the specific race
- The country name
- The data frame for all the fastest lap data
Here’s the first iteration of the function:
def addFastestLaps(fl_country, country_name, fastest_lap, commit_mode):
    driver_count = 0
    print("Adding data for " + country_name)
    try:
        if(commit_mode==True):
            print("Commit mode is set to TRUE - data WILL be added to the fastest lap data frame!")
            while driver_count < len(fl_country.index):
                time = fl_country.loc[driver_count, 'Time']
                driverNo = fl_country.loc[driver_count, 'No']
                name = fl_country.loc[driver_count, 'Driver']
                avg_speed = fl_country.loc[driver_count, 'Avg Speed']
                car = fl_country.loc[driver_count, 'Car']
                if time == 'DNF':
                    driver_count += 1
                else:
                    #time = re.sub(r'[^0-9.]', '', time)
                    #timeFloat = float(time)
                    formattedTime = datetime.strptime(time,'%M:%S.%f')
                    formattedTime = formattedTime.strftime('%H:%M:%S.%f')[:-5]
                    #finalTime = timeFloat + fastest_lap_bahrain_time.strftime('%M:%S.%f')[:-3]
                    # add row
                    fastest_lap.loc[-1] = [driverNo, name, car, country_name, formattedTime, avg_speed]
                    # shift the index
                    fastest_lap.index = fastest_lap.index + 1
                    driver_count += 1
            # CONVERT FL COLUMN TO TIMEDELTA FORMAT
            fastest_lap['FL'] = pd.to_timedelta(fastest_lap['FL'])
            print("Data for " + country_name + " has been added!")
        elif(commit_mode==False):
            print("Commit mode is set to FALSE - data will NOT be added to the fastest lap data frame!")
            while driver_count < len(fl_country.index):
                time = fl_country.loc[driver_count, 'Time']
                driverNo = fl_country.loc[driver_count, 'No']
                name = fl_country.loc[driver_count, 'Driver']
                avg_speed = fl_country.loc[driver_count, 'Avg Speed']
                car = fl_country.loc[driver_count, 'Car']
                if time == 'DNF':
                    driver_count += 1
                else:
                    #time = re.sub(r'[^0-9.]', '', time)
                    #timeFloat = float(time)
                    formattedTime = datetime.strptime(time,'%M:%S.%f')
                    formattedTime = formattedTime.strftime('%H:%M:%S.%f')[:-5]
                    #finalTime = timeFloat + fastest_lap_bahrain_time.strftime('%M:%S.%f')[:-3]
                    print(driverNo, name, car, country_name, formattedTime, avg_speed)
                    driver_count += 1
    except:
        print("ERROR! Double check the arguments provided for the function.\nHave you imported the race data CSV?\nHas the fastest_lap data frame been created?")
Similar to the analysis function, I added an additional argument for a commit mode.
And once again I had fallen into the same hole: regardless of whether you pass in True or False, the code to create the variables is repeated. So there is room for improvement.
And after some more refactoring:
def addFastestLaps(fl_country, country_name, fastest_lap, commit_mode):
    driver_count = 0
    print("Adding data for " + country_name)
    print("Commit mode is set to " + str(commit_mode))
    try:
        while driver_count < len(fl_country.index):
            time = fl_country.loc[driver_count, 'Time']
            driverNo = fl_country.loc[driver_count, 'No']
            name = fl_country.loc[driver_count, 'Driver']
            avg_speed = fl_country.loc[driver_count, 'Avg Speed']
            car = fl_country.loc[driver_count, 'Car']
            driver_count += 1
            # skip drivers who did not set a lap time
            if time == 'DNF':
                continue
            # parse 'M:SS.fff' and reformat as 'HH:MM:SS.f'
            # (the [:-5] slice keeps only tenths of a second)
            formattedTime = datetime.strptime(time,'%M:%S.%f')
            formattedTime = formattedTime.strftime('%H:%M:%S.%f')[:-5]
            if commit_mode:
                # add row
                fastest_lap.loc[-1] = [driverNo, name, car, country_name, formattedTime, avg_speed]
                # shift the index (only when a row was actually added)
                fastest_lap.index = fastest_lap.index + 1
            else:
                # print row
                print(driverNo, name, car, country_name, formattedTime, avg_speed)
        if commit_mode:
            # CONVERT FL COLUMN TO TIMEDELTA FORMAT (once, after all rows are added)
            fastest_lap['FL'] = pd.to_timedelta(fastest_lap['FL'])
    except Exception:
        print("ERROR! Double check the arguments provided for the function.\nHave you imported the race data CSV?\nHas the fastest_lap data frame been created?")
That’s much better! And calling this function to add the fastest lap data now looks like this:
addFastestLaps(fl_bahrain, "BAHRAIN", fastest_lap, True)
Hooray! Much better!
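One detail worth spelling out is what the strptime and strftime lines do to a lap time. The sketch below first repeats the two-step formatting from the function on a made-up lap time, then shows a possible alternative that builds the timedelta directly. The helper name lap_time_to_timedelta and the sample value are hypothetical and not part of the notebook.

```python
from datetime import datetime

import pandas as pd

sample = "1:34.678"  # made-up lap time in minutes:seconds.milliseconds

# The approach used in addFastestLaps: parse, reformat, then slice.
# strftime pads the fraction to six digits, so [:-5] leaves one decimal place.
formatted = datetime.strptime(sample, "%M:%S.%f").strftime("%H:%M:%S.%f")[:-5]
print(formatted)  # 00:01:34.6


def lap_time_to_timedelta(lap_time):
    """Convert 'M:SS.fff' to a pandas Timedelta, keeping the full precision."""
    parsed = datetime.strptime(lap_time, "%M:%S.%f")
    return pd.Timedelta(
        minutes=parsed.minute,
        seconds=parsed.second,
        microseconds=parsed.microsecond,
    )


print(lap_time_to_timedelta(sample).total_seconds())
```

If millisecond precision matters for the graphs, slicing with [:-3] instead of [:-5] would keep three decimal places in the formatted string.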
But things were not quite right, just yet…
After I updated the functions and tested them on one set of data, I updated the notebook to use the new functions.
But in a few scenarios, my function would fall over (this is why you have try and except statements!)
I was confused. The data frame that held the data for the race had been created, and the data frame that holds all the data was populated as expected. So what gives?
Case sensitivity
Yes, case sensitivity. The headers were not consistent across the CSV files: some were lower case, and some were upper case. Column names are case sensitive when working with Pandas, so looking up 'Driver' fails if the file calls the column 'DRIVER'.
So, I updated the CSV files to ensure they are each the same. This meant updating over ten files to ensure that the headers all match. And then it worked, I could export the completed results to a CSV file!
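Editing the files by hand works, but the headers could also be normalised in code right after loading each CSV. The following is a minimal sketch of that idea rather than what the project does: normalise_headers is a hypothetical helper, and the list of canonical names is taken from the columns used in addRaceData.

```python
import pandas as pd

# Canonical header spellings, as used by addRaceData.
CANONICAL_HEADERS = ["Pos", "No", "Driver", "Car", "Laps", "Time/Retired", "PTS"]


def normalise_headers(df, canonical=CANONICAL_HEADERS):
    """Rename columns to their canonical spelling, ignoring case and padding."""
    lookup = {name.lower(): name for name in canonical}
    # Columns that are not in the lookup are left unchanged.
    return df.rename(columns=lambda col: lookup.get(str(col).strip().lower(), col))


# Example with inconsistent headers, similar to the mismatched CSV files.
messy = pd.DataFrame(
    [[1, 11, "Driver A", "Team X", 57, "1:37:33.584", 26]],
    columns=["POS", "no", "DRIVER", " Car", "laps", "TIME/RETIRED", "pts"],
)
clean = normalise_headers(messy)
print(list(clean.columns))
```

With something like this in place, a new CSV with differently cased headers would not need to be edited before it is passed to the functions.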
But things were STILL not quite right…
I noticed in the fastest lap data frame that the fastest lap column was showing incorrect data. It would display the fastest lap as:
P0DT0H1M34.600S
Which is unreadable (it is an ISO 8601 duration string), and I wouldn’t be able to use it to make graphs with any meaning.
The FL column held timedelta values at this point (from the pd.to_timedelta call in the function), so I needed a way to show the value in seconds as a plain float64. Thankfully, there was such a way:
fastest_lap['FL'] = fastest_lap['FL'].dt.total_seconds()
And this fixed it, crisis averted!
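To see what the conversion does, here is a small self-contained sketch using two made-up lap times. The variable names fl_timedelta and fl_seconds are only for illustration.

```python
import pandas as pd

# Two made-up lap times in the 'HH:MM:SS.f' format produced by addFastestLaps.
fl_timedelta = pd.to_timedelta(pd.Series(["00:01:34.6", "00:01:35.1"]))
print(fl_timedelta.dtype)           # a timedelta dtype, not a number
print(fl_timedelta.iloc[0].isoformat())  # the ISO 8601 duration form

# total_seconds() turns each timedelta into a plain float number of seconds,
# which is much easier to plot.
fl_seconds = fl_timedelta.dt.total_seconds()
print(fl_seconds.dtype)             # float64
print(fl_seconds.tolist())
```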
What next?
The F1 2022 season is ending soon, so afterwards I’ll be preparing plenty of graphs and performing some more analysis. I also plan on using the data in Tableau to create some interactive charts.
There are several teams and drivers that would be interesting to perform some analysis on.
Stay tuned!