{"id":3027,"date":"2019-06-17T15:48:02","date_gmt":"2019-06-17T03:48:02","guid":{"rendered":"https:\/\/www.helenanderson.co.nz\/?p=3027"},"modified":"2020-05-14T18:27:35","modified_gmt":"2020-05-14T06:27:35","slug":"sql-aggregate-functions","status":"publish","type":"post","link":"https:\/\/helenanderson.co.nz\/sql-aggregate-functions\/","title":{"rendered":"SQL Aggregate Functions and Mariah Carey"},"content":{"rendered":"\n

Aggregate functions bring your data to life.<\/p>\n\n\n\n

They allow you to take a standard table and use a function or two to build metrics and answer questions about the data. If you are familiar with\u00a0Pivot Tables\u00a0or\u00a0Excel Functions<\/a>\u00a0the terminology is similar and you’ll be up and running in no time.<\/p>\n\n\n\n


\n\n\n\n

Introduction<\/a>
COUNT all the things<\/a>
COUNT DISTINCT<\/a>
GROUP BY<\/a>
ORDER BY<\/a>
GROUP BY multiple columns<\/a>
HAVING<\/a>
Types of Functions available<\/a><\/p><\/blockquote>\n\n\n\n


\n\n\n\n

Introduction to aggregate functions<\/h2>\n\n\n\n

In this example, I’m going to be using a subset of the Billboard Top 100 dataset, available on\u00a0Kaggle<\/a>. I grew up in the 90s so I’ve loaded the 90s rows into a table called music_charts in\u00a0SQL Server.<\/p>\n\n\n\n

First up, I’m going to look at what I’m dealing with:<\/p>\n\n\n\n

select top 10 *\n  from dbo.music_charts<\/code><\/pre>\n\n\n\n
<\/figure>\n\n\n\n
\n\n\n\n

The table has loaded in fine with the four columns I am looking for. But, If I am to answer any of my burning questions<\/a> about 90s music and who was the top of the charts I’m going to need to use aggregate functions.<\/p>\n\n\n\n


\n\n\n\n

COUNT all the things<\/h2>\n\n\n\n

The first way to explore this dataset is to COUNT how large it is. This is similar to the COUNT function in Excel and works in the same way.<\/p>\n\n\n\n

This function counts up all the rows and returns, in this case, 1000 to the results window. 100 rows per year over 10 years. It’s all there and ready to go.<\/p>\n\n\n\n

select count(*) as table_count\n  from dbo.music_charts<\/code><\/pre>\n\n\n\n
\n\n\n\n

COUNT DISTINCT<\/h2>\n\n\n\n

Next up, I want to see how many artists are in the table. I can see that there are duplicates where an artist reached the top of the charts more than once.<\/p>\n\n\n\n

We can do that by combining COUNT with DISTINCT:<\/p>\n\n\n\n

select count(distinct artist) as artist_count\n  from dbo.music_charts<\/code><\/pre>\n\n\n\n
\n\n\n\n

This has returned a COUNT of 545 DISTINCT artists from the table. It still isn’t super useful so we need to add more columns.<\/p>\n\n\n\n


\n\n\n\n

Filtering with GROUP BY<\/h2>\n\n\n\n

It’s nice to have some facts and figures, but it would be more useful to answer some questions like ‘which artist featured the most across the top 100 charts in the 1990s?’<\/p>\n\n\n\n

Similar to a pivot table in Excel, we can do this by adding more columns.<\/p>\n\n\n\n

However, it’s not quite as simple as just dropping in a column. If we run this query we get an error:<\/p>\n\n\n\n

select \nartist, \ncount(artist) as artist_count\nfrom dbo.music_charts\n\n\/* Msg 8120, Level 16, State 1, Line 1\nColumn 'dbo.music_charts.artist' is invalid in the select list \nbecause it is not contained in either an aggregate function \nor the GROUP BY clause. *\/<\/code><\/pre>\n\n\n\n

We need to explicitly say that we want all the artists to be collapsed down into one row with the count in the artist_count column next door. We do this using GROUP BY.<\/p>\n\n\n\n


\n\n\n\n
<\/a><\/figure>\n\n\n\n
\n\n\n\n
select \nartist, \ncount(artist) as artist_count\nfrom dbo.music_charts\ngroup by artist<\/code><\/pre>\n\n\n\n
\n\n\n\n
<\/a><\/figure><\/div>\n\n\n\n
\n\n\n\n

ORDER BY<\/h2>\n\n\n\n

This is getting close but needs one more clause to answer my question.<\/p>\n\n\n\n

select top 5\nartist, \ncount(artist) as artist_count\nfrom dbo.music_charts\ngroup by artist\norder by artist_count desc, artist\n<\/code><\/pre>\n\n\n\n
\n\n\n\n

By ordering by the artist_count we can see that Mariah Carey appears the most in the dataset. Naturally.<\/p>\n\n\n\n


\n\n\n\n
<\/a><\/figure><\/div>\n\n\n\n
\n\n\n\n

GROUP BY multiple columns<\/h2>\n\n\n\n

Now we have the hang of the syntax we can start drilling into the data further and answering questions like ‘Which Mariah Carey songs featured more often in the Top 100 charts in the 90s?’<\/p>\n\n\n\n

select\nartist, \nsong,\ncount(artist) as artist_count\nfrom dbo.charts\nwhere artist = 'mariah carey'\ngroup by artist, song\norder by artist_count desc, song, artist\n<\/code><\/pre>\n\n\n\n
\n\n\n\n
<\/a><\/figure><\/div>\n\n\n\n
\n\n\n\n

HAVING<\/h2>\n\n\n\n

The last thing to cover off is HAVING. This allows us to put a condition on the aggregate. This is different from using a WHERE that puts a condition on individual rows.<\/p>\n\n\n\n

To better illustrate how this works let’s use an example where we want to see who has featured on the list more than five times.<\/p>\n\n\n\n

We count up the number of artists and then use HAVING as the last step to filter on the count.<\/p>\n\n\n\n

select\nartist, \ncount(artist) as artist_count\nfrom dbo.music_charts\ngroup by artist\nhaving count(*) > 5\norder by artist_count desc<\/code><\/pre>\n\n\n\n
\n\n\n\n
<\/a><\/figure><\/div>\n\n\n\n
\n\n\n\n

Types of Functions available<\/h2>\n\n\n\n

But it doesn’t stop with counting things.<\/p>\n\n\n\n

The basic functions used day to day in SQL Server<\/a> are:<\/p>\n\n\n\n