Long and wide data

Consider these two tables (made up data):

‘Longer’ table
Store	Day	Status
Media Markt	Weekday	Open
Media Markt	Weekend	Closed
Elgiganten	Weekday	Open
Elgiganten	Weekend	Open
Expert	Weekday	Closed
Expert	Weekend	Closed

‘Wider’ table
Store	Weekday	Weekend
Media Markt	Open	Closed
Elgiganten	Open	Open
Expert	Closed	Closed

They contain the same kind of information (except perhaps specifying the open status as specifically “Status”). Each of these forms have different advantages, thus it would be useful to be able to convert between these two formats in SQL.

Pivot longer

Consider the table birthstatistics which first lines looks like this:

SELECT * FROM birthstatistics
LIMIT 5

5 records
region	sex	foreign/Swedish background	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020
01 Stockholm county	boys	foreign born	13346	13157	12759	12675	13271	14325	15126	16074	16901	18075	19050	20059	20827	21843	23084	25124	26270	27060	26484
01 Stockholm county	boys	born in Sweden with two foreign born parents	31596	32185	32785	33456	34282	35224	36180	37107	38221	39262	40386	41639	43249	45001	46696	48375	50232	52305	53741
01 Stockholm county	boys	born in Sweden with one parent born in Sweden and one foreign born parent	29815	30395	30976	31565	32102	32667	33368	34193	35105	35894	36845	37945	39228	40391	41558	42512	43310	44123	44619
01 Stockholm county	boys	born in Sweden with two parents born in Sweden	131708	133236	134604	135502	136532	137152	137235	137895	138408	138634	138838	140035	141358	142826	143973	144656	144897	144727	144212
01 Stockholm county	girls	foreign born	12946	12732	12570	12570	13376	14356	15184	16107	16738	17719	18685	19278	19976	20694	21615	23256	24556	25475	24805

Let’s say we want to do a bunch of operations where we specify mathematical constraints on the year number. Let’s say one of them is to get the data where the year is an even number. If there was a column called Year we could specify “WHERE Year % 2 = 0”. But we can’t do that as each year is a column and not a cell value in a column. Thus we would like instead a form where there was a column called “Year”. Then we can have the values in each cell be inside a single column that we will call “Persons”. This is one way to achieve that in SQL (not the prettiest way, but one of the most straightforward):

CREATE View long_table AS
SELECT region, sex, 'foreign/Swedish', 2002 as Year, [2002] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2003 as Year, [2003] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2004 as Year, [2004] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2005 as Year, [2005] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2006 as Year, [2006] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2007 as Year, [2007] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2008 as Year, [2008] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2009 as Year, [2009] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2010 as Year, [2010] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2011 as Year, [2011] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2012 as Year, [2012] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2013 as Year, [2013] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2014 as Year, [2014] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2015 as Year, [2015] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2016 as Year, [2016] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2017 as Year, [2017] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2018 as Year, [2018] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2019 as Year, [2019] as Persons FROM birthstatistics
UNION ALL
SELECT region, sex, 'foreign/Swedish', 2020 as Year, [2020] as Persons FROM birthstatistics;

Now we can do a statement like the following:

SELECT * FROM long_table
WHERE Year % 2 = 0 AND sex = "girls" AND Persons > 100000
LIMIT 5;

5 records
region	sex	‘foreign/Swedish’	Year	Persons
01 Stockholm county	girls	foreign/Swedish	2002	125531
14 Västra Götaland county	girls	foreign/Swedish	2002	120074
01 Stockholm county	girls	foreign/Swedish	2004	127747
14 Västra Götaland county	girls	foreign/Swedish	2004	118544
01 Stockholm county	girls	foreign/Swedish	2006	129493

Pivot wider

Now let’s say we want boys and girls to be columns. We can achieve that with the following:

CREATE VIEW My_view AS
SELECT region, 'foreign/Swedish', Year,
  SUM(CASE WHEN sex = 'boys' THEN Persons ELSE 0 END) AS Boys,
  SUM(CASE WHEN sex = 'girls' THEN Persons ELSE 0 END) AS Girls
FROM long_table
GROUP BY region, 'foreign/Swedish', Year

Why are we using SUM? Well in this example we kind of assume that given a distinct combination of the columns ‘region’, ‘foreign/Swedish’ and ‘Year’, then there only exists on row where ‘sex’ is specified as for example boy, and the value in ‘Persons’ is the value we’ll put under the column ‘boy’ in the widened table with that combination. But what if there are multiple? One way for example to deal with it is to just pick the first row and ignore the others. But one way that usually generalizes well is to sum all of them (perhaps each row symbolizes a sub-region for example).

Here’s an example of a query using the the view with widened data:

SELECT * FROM My_View
WHERE Year % 2 = 0
LIMIT 5

5 records
region	‘foreign/Swedish’	Year	Boys	Girls
01 Stockholm county	foreign/Swedish	2002	206465	196752
01 Stockholm county	foreign/Swedish	2004	211124	200638
01 Stockholm county	foreign/Swedish	2006	216187	205821
01 Stockholm county	foreign/Swedish	2008	221909	210582
01 Stockholm county	foreign/Swedish	2010	228635	216624

As an example let’s create two graphs for the total boys through the years and the total girls through the years.

CREATE VIEW My_total AS
SELECT Year, SUM(Boys) AS Boys, SUM(Girls) AS Girls
FROM My_view
GROUP BY Year

library(ggplot2)
my_view_df <- dbGetQuery(con, "SELECT * FROM My_total")
ggplot(my_view_df, aes(Year, Boys)) + geom_line()

ggplot(my_view_df, aes(Year, Girls)) + geom_line()