F1 Data Science Project part 5 - analysis with Pandas and Seaborn!

The season may have ended, but the analysis hasn’t! (And it will continue with the next season...)

Now that we have a full set of results, we can perform some deeper analysis.

In case you haven’t cloned the repo to play around with the data, view the repo on GitHub here. Remember that you will need to have Python Poetry installed for dependency management.

Let’s jump in!

Average number of points per team

What would be interesting to see is the average number of points for each team across all the races. Since I have a CSV file with the results of each race, we can use that to calculate the averages. So let’s do some initial setup work:

First, load the data and create an empty data frame.

import pandas as pd
import seaborn as sns

raceResults = pd.read_csv('data/compiled-data/race-results.csv')
raceResults.head()

# Empty data frame with the two columns we will fill in
avgPoints = pd.DataFrame(columns=['Team', 'AvgPoints'])
avgPoints.head()

At first, I took this approach for finding the average for each team:

redBull = raceResults.loc[raceResults['Car'] == 'Red Bull Racing RBPT']
redBull.head()
redBull['PTS'].mean()

I then repeated this for the Ferrari team.

At this point, I thought to myself: there must be a better way of doing this! And there was!

I decided to create a list of the team names to use in a for loop.

teamNameArray = ['Ferrari', 'Red Bull Racing RBPT', 'Mercedes', 'McLaren Mercedes', 'Alpine Renault', 'Alfa Romeo Ferrari', 'AlphaTauri RBPT', 'Williams Mercedes', 'Aston Martin Aramco Mercedes', 'Haas Ferrari']
len(teamNameArray)

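for i in teamNameArray:
    # Average points for this team, rounded to 2 decimal places
    testRes = round(raceResults.loc[raceResults['Car'] == i, 'PTS'].mean(), 2)
    # Add the row at index -1, then shift the index so -1 is free for the next team
    avgPoints.loc[-1] = [i, testRes]
    avgPoints.index = avgPoints.index + 1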

avgPoints.sort_values(by='AvgPoints', inplace=True, ascending=False)

The loop is interesting as it uses the loc property to select every row that matches the team name. The ‘round’ function then rounds the average calculated by Pandas to 2 decimal places. Finally, the team name and average score are added to the data frame (with the index shifted afterwards).
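
As an aside, the same per-team averages can be computed without a loop by using the groupby method in Pandas. The sketch below is a minimal example, assuming the same Car and PTS column names as above. The rows in sampleResults are made up purely for illustration and are not real race results, and the names sampleResults and avgPointsGrouped are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not real race results).
sampleResults = pd.DataFrame({
    'Car': ['Ferrari', 'Ferrari', 'Mercedes', 'Mercedes', 'Haas Ferrari'],
    'PTS': [25, 18, 15, 12, 1],
})

# Group rows by team, average the points, round to 2 decimal places,
# then sort from the highest average to the lowest.
avgPointsGrouped = (
    sampleResults.groupby('Car')['PTS']
    .mean()
    .round(2)
    .sort_values(ascending=False)
    .rename_axis('Team')
    .reset_index(name='AvgPoints')
)

print(avgPointsGrouped)
```

One advantage of this approach is that it picks up every team found in the data, so the list of team names doesn't need to be maintained by hand.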

With the data ready, let’s visualise the results with seaborn.

avgPointsPlot = sns.barplot(x='Team', y='AvgPoints', data=avgPoints)
avgPointsPlot.set_xticklabels(avgPointsPlot.get_xticklabels(), rotation=45, horizontalalignment='right')

It’s clear that Red Bull Racing is in a league of their own. However, what’s interesting is that you can ‘group’ the other teams together. Ferrari and Mercedes are close together, as are Alpine and McLaren, and so on for the other teams.

The average speed for each of the teams

So, the average points show that Red Bull were clearly performing well throughout the season, but can we say the same for the average speed throughout the season?

The answer: not really. In fact, it was very close!

Let’s build out the data we need:

# Empty data frame with the two columns we will fill in
avgSpeedTeam = pd.DataFrame(columns=['Team', 'AvgSpeed'])
avgSpeedTeam.head()

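# Assumes fastestLap has already been loaded (the read_csv call is not shown in this post)
for i in teamNameArray:
    # Average speed for this team, rounded to 2 decimal places
    testRes = round(fastestLap.loc[fastestLap['CAR'] == i, 'AVG_SPEED'].mean(), 2)
    # Add the row at index -1, then shift the index so -1 is free for the next team
    avgSpeedTeam.loc[-1] = [i, testRes]
    avgSpeedTeam.index = avgSpeedTeam.index + 1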

avgSpeedTeam.sort_values(by='AvgSpeed', inplace=True, ascending=False)

The setup is similar to before: a loop calculates the average speed for each team and then adds it to the data frame.

And let’s build the graph:

avgSpeedPlotTeam = sns.barplot(x='Team', y='AvgSpeed', data=avgSpeedTeam)
avgSpeedPlotTeam.set_xticklabels(avgSpeedPlotTeam.get_xticklabels(), rotation=45, horizontalalignment='right')

Which shows…

The average speeds across the teams are fairly similar, unlike the average number of points.

Bonus - Linear Regression

As a bonus, I decided to create a plot that shows a linear regression line.

To do that, I needed to merge two data frames together, which is done like so:

# avgFastestLapTeam (columns Team and AvgFastestLap) is not built in this post
df1 = avgFastestLapTeam
df2 = avgSpeedTeam

mergedDf = pd.merge(df1, df2, on="Team", how="inner")

(Note that an inner merge was used, so only teams that appear in both data frames are kept.)
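
To show what the choice of merge type changes, here is a small sketch comparing an inner merge with an outer merge. The frames lapFrame and speedFrame and all of their values are hypothetical and exist only for illustration. They deliberately contain one team each that the other frame lacks.

```python
import pandas as pd

# Hypothetical values for illustration only.
lapFrame = pd.DataFrame({
    'Team': ['Ferrari', 'Mercedes', 'Williams Mercedes'],
    'AvgFastestLap': [90.1, 90.4, 91.8],
})
speedFrame = pd.DataFrame({
    'Team': ['Ferrari', 'Mercedes', 'Haas Ferrari'],
    'AvgSpeed': [210.5, 209.8, 205.2],
})

# how='inner' keeps only the teams that appear in both frames.
innerMerged = pd.merge(lapFrame, speedFrame, on='Team', how='inner')

# how='outer' keeps every team and fills the missing values with NaN.
outerMerged = pd.merge(lapFrame, speedFrame, on='Team', how='outer')

print(innerMerged)
print(outerMerged)
```

For a regression plot the inner merge is the safer choice, since every remaining row has both values.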

The merged data frame now holds the average fastest lap and the average speed for every team. This is what the regression plot will use.

Now that we have a merged data frame, we can make a plot of linear regression like so:

linearRegFl = sns.lmplot(data=mergedDf, x='AvgFastestLap', y='AvgSpeed')

Looks great!

Now I could add a hue to the plot for the teams, which would make sense. However, when this is applied...

It doesn't draw the regression line. It does, however, tell us that the top scoring teams have a high average speed and fastest lap.
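
A likely explanation is that, with a hue set, lmplot fits a separate regression for each hue level. With one row per team there is only a single point in each group, so there is nothing to draw a line through. One possible workaround, sketched below, is to draw a single regression line with regplot and then colour the points by team with scatterplot on the same axes. This is not what the notebook does. The sampleMerged frame, its team names and its values are hypothetical and only mirror the column names used above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
sampleMerged = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C', 'Team D'],
    'AvgFastestLap': [90.1, 90.4, 91.0, 91.8],
    'AvgSpeed': [210.5, 209.8, 207.9, 205.2],
})

fig, ax = plt.subplots()

# One regression line fitted across all teams (no points, no confidence band).
sns.regplot(data=sampleMerged, x='AvgFastestLap', y='AvgSpeed',
            scatter=False, ci=None, ax=ax)

# Draw the individual points on the same axes, coloured by team.
sns.scatterplot(data=sampleMerged, x='AvgFastestLap', y='AvgSpeed',
                hue='Team', ax=ax)
```

Because regplot is an axes-level function, it can share the axes with the scatter plot, which lmplot (a figure-level function) does not make as easy.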

Conclusion

And that wraps up this post on analysing these results. Be sure to clone the repo if you want to take a full look at the Jupyter notebook. The plots have also been saved to a folder under the data directory if you want to view them.

Thanks for reading!

Tags:

Python

I'm Joshua Blewitt, I'm passionate about product, a technology advocate, customer champion, curious mind and writer. I've worked for companies such as Rightmove, Domino's Pizza and IQVIA.
