Exploring COVID data with SQL

Introduction

Recently, I decided to work on my SQL skills a bit by exploring some COVID data produced by Our World in Data, and you can find the GitHub repo here if you want to recreate the database yourself (sadly my idea of uploading the backup of the database itself as a tar file didn’t work as planned, it’s too big for GitHub). Feel free to clone the repo, and explore the data.

In fact, this dataset is updated daily and can be reused freely!

I will say that preparing the data was the hardest part, so the repo will include specific CSV files if you are working with a Postgres database (like I did).

So in this short post, I wanted to share some of the queries I wrote, and what they returned.

But let’s say I’ve created the database (using Postgres), and now I need to create the two tables I need; one for deaths, and one for vaccinations:

-- Table: public.covid_deaths

-- DROP TABLE IF EXISTS public.covid_deaths;

CREATE TABLE IF NOT EXISTS public.covid_deaths
(
    iso_code character varying(50) COLLATE pg_catalog."default",
    continent character varying(50) COLLATE pg_catalog."default",
    location character varying(50) COLLATE pg_catalog."default",
    date date,
    population numeric,
    total_cases numeric,
    new_cases numeric,
    new_cases_smoothed numeric,
    total_deaths numeric,
    new_deaths numeric,
    new_deaths_smoothed numeric,
    total_cases_per_million numeric,
    new_cases_per_million numeric,
    new_cases_smoothed_per_million numeric,
    total_deaths_per_million numeric,
    new_deaths_per_million numeric,
    new_deaths_smoothed_per_million numeric,
    reproduction_rate numeric,
    icu_patients numeric,
    icu_patients_per_million numeric,
    hosp_patients numeric,
    hosp_patients_per_million numeric,
    weekly_icu_admissions numeric,
    weekly_icu_admissions_per_million numeric,
    weekly_hosp_admissions numeric,
    weekly_hosp_admissions_per_million numeric,
    total_tests numeric
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.covid_deaths
    OWNER to postgres;
-- Table: public.covid_vaccinations

-- DROP TABLE IF EXISTS public.covid_vaccinations;

CREATE TABLE IF NOT EXISTS public.covid_vaccinations
(
    iso_code character varying(50) COLLATE pg_catalog."default",
    continent character varying(50) COLLATE pg_catalog."default",
    location character varying(50) COLLATE pg_catalog."default",
    date date,
    total_tests numeric,
    new_tests numeric,
    total_tests_per_thousand numeric,
    new_tests_per_thousand numeric,
    new_tests_smoothed numeric,
    new_tests_smoothed_per_thousand numeric,
    positive_rate numeric,
    tests_per_case numeric,
    tests_units character varying(50) COLLATE pg_catalog."default",
    total_vaccinations numeric,
    people_vaccinated numeric,
    people_fully_vaccinated numeric,
    total_boosters numeric,
    new_vaccinations numeric,
    new_vaccinations_smoothed numeric,
    total_vaccinations_per_hundred numeric,
    people_vaccinated_per_hundred numeric,
    people_fully_vaccinated_per_hundred numeric,
    total_boosters_per_hundred numeric,
    new_vaccinations_smoothed_per_million numeric,
    new_people_vaccinated_smoothed numeric,
    new_people_vaccinated_smoothed_per_hundred numeric,
    stringency_index numeric,
    population_density numeric,
    median_age numeric,
    aged_65_older numeric,
    aged_70_older numeric,
    gdp_per_capita numeric,
    extreme_poverty numeric,
    cardiovasc_death_rate numeric,
    diabetes_prevalence numeric,
    female_smokers numeric,
    male_smokers numeric,
    handwashing_facilities numeric,
    hospital_beds_per_thousand numeric,
    life_expectancy numeric,
    human_development_index numeric,
    excess_mortality_cumulative_absolute numeric,
    excess_mortality_cumulative numeric,
    excess_mortality numeric,
    excess_mortality_cumulative_per_million numeric
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.covid_vaccinations
    OWNER to postgres;

At this point, the data can be imported, I used PgAdmin to import the CSV data into the respective tables.

And with the tables and data now ready, let’s jump into some queries.

Total Cases

Let’s look at the number of total cases vs total deaths and vs population. Essentially, these queries show:

  • what’s the likelihood of dying if you catch COVID
  • what’s the percentage of population that caught COVID
-- looking at total cases vs total deaths
-- shows the likelihood of dying if you contract covid in your country

SELECT cd.location,
       cd.date,
       cd.total_cases,
       cd.total_deaths,
       (cd.total_deaths/cd.total_cases)*100 AS death_percentage
FROM covid_deaths cd
WHERE cd.location like '%States'
    AND cd.continent IS NOT NULL
ORDER BY 1,
         2;

-- Looking at total cases vs population
-- shows percentage of population that caught covid

SELECT cd.location,
       cd.date,
       cd.total_cases,
       cd.population,
       (cd.total_cases/cd.population)*100 AS InfectionPercentage
FROM covid_deaths cd
WHERE cd.location like '%States'
    AND cd.continent IS NOT NULL
ORDER BY 1,
         2;

For the first query, apart from selecting several rows, the death percentage is calculated by taking the total deaths, dividing it by total cases and then multiplying it by one hundred. This query would returns thousands of rows so I decided to filter it by a specific country.

What this query returns is that early on in the pandemic, the death percentage peaked at roughly 6% (as the number of deaths were high against the relatively low number of cases), but now it sits at 1%.

The second query is somewhat similar to the first, but now the infection percentage is calculated by taking the total cases, dividing it by the population and then multiplying it by one hundred. Again, I’m filtering the results by a specific country.

What this query shows is that the percentage of people infected jumped massively at the start of the pandemic. But in the United States, it seems to have settled at 30% of the population being infected.

Countries

Let’s move away from a specific country and let’s see if we can find some data about a range of countries.

-- Looking at countries with highest infection rate compared to population

SELECT cd.location,
       cd.population,
       MAX(cd.total_cases) AS HighestInfectionCount,
       MAX((cd.total_cases/cd.population))*100 AS PercentPopulationInfected
FROM covid_deaths cd
WHERE cd.continent IS NOT NULL
GROUP BY cd.population,
         cd.location
ORDER BY PercentPopulationInfected DESC;

-- showing countries with highest death count per population

SELECT cd.location,
       MAX(cd.total_deaths) AS TotalDeathCount
FROM covid_deaths cd
WHERE cd.continent IS NOT NULL
GROUP BY cd.location
ORDER BY TotalDeathCount DESC;

The first query selects and then groups by the population and the location, while finding the highest value of the total cases recorded and calculating the percentage of the population that has been infected (by taking the max values of the total cases, the population, dividing them and multiplying by 100).

This query returns a list of locations, their population, the highest infection count and their percentage of population that has been infected. There are several locations that haven’t released their infection counts, so there are rows which return a NULL value.

The second query is slightly different, as it only finds the max value of total deaths from each location. What the query shows is that, although there are locations that didn’t report the number of deaths which show a NULL value, there are countries that do. The results ordered by the highest number of total deaths.

Global numbers

Let’s take a look at the global numbers for cases, deaths, and a death percentage. This attempts to calculate the percentage of deaths by taking the sum of the new deaths, dividing it by the sum of new cases, and then multiplying it by 100.

-- global numbers
-- handling dividing by zero

SELECT cd.date,
       SUM(cd.new_cases) AS total_cases,
       SUM(cd.new_deaths) AS total_deaths,
       CASE
           WHEN SUM(cd.new_cases) = 0 THEN NULL
           ELSE SUM(cd.new_deaths)/SUM(cd.new_cases)*100
       END AS death_percentage
FROM covid_deaths cd
WHERE cd.continent IS NOT NULL
    AND cd.new_cases IS NOT NULL
GROUP BY cd.date
ORDER BY 1,
         2;

This is when things get slightly tricky. Calculating the death percentage at first returned an error as it couldn’t calculate dividing by zero. So to handle this issue, I decided to use a case statement to change the value from zero to NULL.

The query returns a lot of NULL rows but it does calculate the death percentage.

Vaccinations

The last query we’ll look at is examining the number of new vaccinations and the rolling total:

-- total population versus vaccinations
 -- Use Common Table Expression
WITH PopVsVac (Continent, Location, Date, Population, New_Vaccinations, RollingPeopleVaccinated) as
    (SELECT cd.continent,
            cd.location,
            cd.date,
            cd.population,
            cv.new_vaccinations,
            SUM(cv.new_vaccinations) OVER (PARTITION BY cd.location
                                           ORDER BY cd.location,
                                                    cd.date) as RollingPeopleVaccinated
     FROM covid_deaths cd
     JOIN covid_vaccinations cv ON cd.location = cv.location
     AND cd.date = cv.date
     WHERE cd.continent IS NOT NULL
         AND cv.new_vaccinations IS NOT NULL)
SELECT *
FROM PopVsVac;

I decided to use a Common Table Expression for this (to make it easier to read and maintain), and the rolling total is calculated by using a partition for each location.

The result of this query is that this shows the number of new vaccinations, the rolling total of vaccinations for each country. This could be expanded to show the percentage of the population that have been vaccinated.

Bonus; create a view

As an added bonus to this post, let’s look at creating a view:

-- create a view

CREATE VIEW country_infection_rate AS
SELECT cd.location,
       cd.population,
       MAX(cd.total_cases) AS HighestInfectionCount,
       MAX((cd.total_cases/cd.population))*100 AS PercentPopulationInfected
FROM covid_deaths cd
WHERE cd.continent IS NOT NULL
GROUP BY cd.population,
         cd.location
ORDER BY PercentPopulationInfected DESC;

SELECT *
FROM country_infection_rate;

What this view will return is the percentage of the population that has been infected in a location, ordered by the location that has the highest percentage first.

Making a view will make it easier to find this information (and that all the queries ran previously could also be made into a view).

Wrap up

So that was a quick overview of the exploratory work in SQL! Don’t forget to check out the GitHub repo, and if you have any feedback, get in touch!

Until next time 👋

Citations

Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni Dattani, Diana Beltekian, Esteban Ortiz-Ospina and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource]

Tags:

Blog

A photo of me!

I'm Joshua Blewitt, I'm passionate about product, a technology advocate, customer champion, curious mind and writer. I've worked for companies such as Rightmove, Domino's Pizza and IQVIA.

Let me know your thoughts!
More about me