GROUP BY: Aggregating Your Data

Last week, we learned to use WHERE to efficiently return only the rows that we want from a database. But what if you want to summarize the data more efficiently?

It turns out that you can have the database do the summarization for you with the GROUP BY keyword.

Like Python’s collections.Counter or pandas groupby(), SQL’s GROUP BY lets you summarize data by category. It allows you to count, sum, and average across groups.

The Problem

At this point, if you were asked how many customers are in the city of San Antonio, you might reach for ORDER BY and order by the city, then count the results for San Antonio.

SELECT city 
FROM customers
ORDER BY city

You might also use a WHERE clause and set city = 'San Antonio', return the results, and count the number of rows. This seems more optimized since you will return only the rows where the city is San Antonio

SELECT city 
FROM customers
WHERE city = 'San Antonio'

Neither of these is efficient because what you are looking for is a simple count rather than multiple rows.

These approaches would be even less effective if you wanted to know the number of customers in each city.

Python Comparison

Let’s take a moment to look at how we might get the number of customers in each city using Python. You could use Counter, manual grouping, or Pandas to achieve this.

Here we can see this achieved with a counter

from collections import Counter
city_counts = Counter(c['city'] for c in customers)

Here is a manual grouping

city_counts = {}
for c in customers:
    city = c['city']
    city_counts[city] = city_counts.get(city, 0) + 1

and Pandas

import pandas as pd
df = pd.DataFrame(customers)
df.groupby('city').size()

The issue with all of these solutions is that if you simply want to know how many customers are in each city, you are sending all 500 records to Python and letting it manage them. This gets to be even more of an issue if you have 1 million customers.

Thankfully, SQL can do this in the database and simply return results that contain the city and the number of customers in that city.

-- SQL: GROUP BY with COUNT
SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city

Same result, but the work happens on the database server.

Aggregate Functions

In this example, COUNT() is a function that aggregates values in the query. In this case, it counts them, but there are many aggregate functions in SQL.

Here are some common ones:

Function	What it does	Example
COUNT(*)	Count all rows	How many customers?
COUNT(column)	Count non-NULL values	How many have emails?
SUM(column)	Add up values	Total sales
AVG(column)	Average value	Average order size
MIN(column)	Smallest value	First signup date
MAX(column)	Largest value	Most recent signup

One thing to note: COUNT(*) counts all rows, while COUNT(column) only counts rows where that column is not NULL. This distinction matters when your data has missing values.

The Golden Rule

Aggregate functions are very useful, and they work well with the GROUP BY clause. There is one thing to remember when using them that will trip you up, even after years of writing SQL. If you are receiving errors or unexpected results when using aggregate functions and GROUP BY, this is the first thing that you should check, twice.

Every column in SELECT must be either:

In the GROUP BY clause, OR
Inside an aggregate function

-- ✅ Correct: city is in GROUP BY, COUNT is aggregate
SELECT city, COUNT(*)
FROM customers
GROUP BY city

-- ❌ Wrong: name is not grouped or aggregated
SELECT city, name, COUNT(*)
FROM customers
GROUP BY city
-- Error! Which "name" should it show for each city?

This trips up ~~beginners~~ everyone. The database doesn’t know which name to pick when there are 50 customers in San Antonio, so it either throws an error or gives bad results.

The Basics

Here are some examples of how you can use aggregate functions and GROUP BY together while keeping the golden rule of grouping in mind.

First, if you just want to know how many rows are in a table you can use the following syntax

SELECT COUNT(*) as customer_count
FROM customers

You can also return the count of customers per city like we saw before.

SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city

You can also use multiple aggregate functions for the same GROUP BY

SELECT city, 
       COUNT(*) as total,
       MIN(signup_date) as first_signup,
       MAX(signup_date) as last_signup
FROM customers
GROUP BY city

..or GROUP BY multiple columns

SELECT city, is_premium, COUNT(*) as count
FROM customers
GROUP BY city, is_premium

Notice that in both these examples any column in the SELECT clause that is not an aggregate function is in the GROUP BY clause.

Combining with WHERE and ORDER BY

So how does the database handle using a WHERE or ORDER BY with a GROUP BY?

WHERE filters before grouping:

-- Count premium customers per city
SELECT city, COUNT(*) as premium_count
FROM customers
WHERE is_premium = true
GROUP BY city

The ORDER BY clause sorts the grouped results:

-- Cities with most customers first
SELECT city, COUNT(*) as customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC

Below is the execution order, remember that this does not match the order that you write them in a SQL query.

FROM customers           -- 1st: Get data
WHERE is_premium = true  -- 2nd: Filter rows
GROUP BY city            -- 3rd: Group remaining rows
SELECT city, COUNT(*)    -- 4th: Pick columns and aggregate
ORDER BY premium_count   -- 5th: Sort results

GROUP BY lets you count and summarize. But what if you want to filter after grouping? For instance, “only return cities with more than 10 customers”? WHERE won’t work, and we need a new keyword that we will discuss next week.

The Problem#

Python Comparison#

Aggregate Functions#

The Golden Rule#

The Basics#

Combining with WHERE and ORDER BY#

Stay Updated