Window Functions: The Feature Python Developers Miss Most

This week, we are going to focus on window functions, which are very powerful data manipulation tools. It’s something that you can do in Python, but in SQL, it is super simple and very powerful. We’ll continue using the same orders and customers tables from previous posts in our DuckDB sample database.

Before I continue, I have a confession to make. While I have used window functions, I haven’t used them nearly as much as I should have. I only discovered them in the past few years. Prior to that, I used some other SQL tricks for these types of queries, but this method is much cleaner.

So let’s dive in. What problem do window functions solve? They seem to do aggregation-type activities, can’t GROUP BY do this?

The short answer is, not very well.

Window functions allow you to do things like calculate running totals, rankings, and moving averages, which tend to be very difficult to do otherwise because it requires a ‘window’ into a subset of the data.

The Problem GROUP BY Can’t Solve

So let’s look at a simple grouped query. This is each customer id and the total amount ordered for that customer. The GROUP BY allows the aggregation by customer id.

SELECT customer_id, SUM(amount) as total
FROM orders
GROUP BY customer_id

But what if you want:

Each order row, plus the customer’s running total at that point?
Each order row, plus how it compares to the average?
Each order row, plus its rank within that customer’s orders?

GROUP BY can’t do this, but window functions can.

Python Comparison

Looking at a Python-based alternative, Pandas handles this reasonably well, though you have to bring back all the data unsummarized for it to do it.

df['running_total'] = df.groupby('customer_id')['amount'].cumsum()
df['rank'] = df.groupby('customer_id')['amount'].rank(ascending=False)

Without Pandas, it’s quite tedious to loop and summarize the data.

Now let’s look at a window function.

SELECT *,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) as rank
FROM orders

Here we will return all orders with the associated cumulative order amount, by date no less. It also ranks the amount of the purchase against the customer’s other purchases. Pretty powerful!

Syntax

Let’s take a closer look at the window function syntax. There are two parameters. First is the PARTITION BY column, which is similar to GROUP BY since it says which column to focus the activity on, except that it doesn’t collapse down the data. The second parameter is ORDER BY column, which is the order of the data in the partition. Very important for ranks, and running totals.

function() OVER (
    PARTITION BY column    -- Like GROUP BY, but doesn't collapse
    ORDER BY column        -- Order within each partition
)

Essential Window Functions

Now, let’s go through some valuable window functions that are available.

Use ROW_NUMBER() to get a unique number for each row in the window.

SELECT name, city,
       ROW_NUMBER() OVER (ORDER BY signup_date) as signup_order
FROM customers

Both RANK() and DENSE_RANK() are available for ranking data. The difference is the way that they handle ties in the data. If two numbers have the same ranking (1, 2, 2, 4), RANK() will skip the next value and resume counting. DENSE_RANK() does not do this and would rank these as (1, 2, 2, 3).

SELECT customer_id, amount,
       RANK() OVER (ORDER BY amount DESC) as rank,
       DENSE_RANK() OVER (ORDER BY amount DESC) as dense_rank
FROM orders

We are familiar with SUM(), AVG(), and COUNT() as aggregate functions. These are also available as window functions, allowing them to be used over a partition.

SELECT customer_id, order_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
       AVG(amount) OVER (PARTITION BY customer_id) as customer_avg
FROM orders

LAG() and LEAD() allow you to reference data from the previous or next row in the result set.

SELECT customer_id, order_date, amount,
       LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_amount,
       LEAD(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as next_amount
FROM orders

Practical Examples

Now, let’s look at some practical examples of how these window functions can be used.

Top 3 Orders Per Customer

Here is a simple query that combines CTEs with window functions to return the top three largest orders for each customer.

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) as rn
    FROM orders
)
SELECT * FROM ranked WHERE rn <= 3

Change Over Time

Here is another useful application of window functions. Here we are comparing the results to the previous time period and determining the difference.

SELECT order_date, amount,
       LAG(amount) OVER (ORDER BY order_date) as prev_amount,
       amount - LAG(amount) OVER (ORDER BY order_date) as change
FROM orders

Try It Yourself

Challenge 1: Find Each Customer’s Most Recent Order

Use ROW_NUMBER() with a CTE to return only the most recent order for each customer. You’ll need to partition by customer_id and order by order_date DESC.

Challenge 2: Month-Over-Month Change

Using LAG(), write a query that calculates the difference in order amount compared to the previous order for each customer. Partition by customer_id and order by order_date.

What’s Next?

It’s time to set you free to work on your own data and tables. We have seen that briefly already, and next week we are going to take a closer look at SQL’s Data Definition Language (DDL) that lets you create tables.

The Problem GROUP BY Can’t Solve#

Python Comparison#

Syntax#

Essential Window Functions#

Practical Examples#

Top 3 Orders Per Customer#

Change Over Time#

Try It Yourself#

Challenge 1: Find Each Customer’s Most Recent Order#

Challenge 2: Month-Over-Month Change#

What’s Next?#

Stay Updated