Sql | Jamal Hansen

GROUP BY: Aggregating Your Data

Last week, we learned to use WHERE to efficiently return only the rows that we want from a database. But what if you want to summarize the data more efficiently? It turns out that you can have the database do the summarization for you with the GROUP BY keyword. Like Python’s collections.Counter or pandas groupby(), SQL’s GROUP BY lets you summarize data by category. It allows you to count, sum, and average across groups. ...

WHERE: Filtering Your Data

We have come a long way in the past couple of months, working through the core SQL keywords. So far, we can SELECT columns, specify FROM where our data lives, and ORDER BY to sort results. That is quite a lot, and today we are going to unlock the real power of SQL by giving you the ability to filter your results before they are returned from the server. When you’re querying a remote database, this is a huge optimization win; not only will your database send fewer rows of data, but your network and your client will receive less data to sort through. ...

A spoonful of letters with alphabet soup

ORDER BY: Sorting Your Results

We now have a firm grasp on how to use SELECT: Choosing Your Columns and FROM: Where Your Data Lives to tell the database where to find data and how to format the columns when it returns it. With this knowledge, we can pull back all of the data from a table in a database. There is still a problem with the data that we receive from a query. It can come back in any order. It may return in the same order 9 times out of 10, but there is no guarantee that it will come back in the same order next time. This happens because database engines optimize execution plans based on factors like data volume, indexes, and available memory, and those optimizations can change between queries. ...

SELECT: Choosing Your Columns

You have written SELECT * many times by now. It works, but it’s a bit like asking for everything in the fridge when you just want milk. This week, we will look at the SELECT clause and see that it does more than just pick columns. It transforms your output. Previously, we looked closely at the FROM clause, which tells the database where the query will find the data. The SELECT clause defines which columns will be returned, and you can reshape data on the way out. ...

FROM: Where Your Data Lives

We have come a long way over the last five posts, but we are just getting started. So far, we have explored concepts that will help us along our journey, but haven’t talked a whole lot about SQL itself. We have seen some basic SQL that uses a couple of keywords, SELECT and FROM, but we haven’t looked very closely at what these do. Let’s do that now, starting with FROM. ...

SQL Thinks in Sets, Not Loops

Remember back when we started, I mentioned SQL was difficult because of how I was thinking? I was asking it to perform steps to return data. This didn’t work because SQL uses a declarative syntax that describes the final result. Until I realized this, SQL felt hard. Let’s explore this concept further. Working with lists and loops When you work with lists in Python, one of the first tools you reach for is the for loop. The for loop is great because it lets you take every item in the list and apply some logic to it, one at a time. It might look something like this. ...

Don't forget to save! Persisting your DuckDB database

I still remember losing schoolwork and video game progress because I forgot to save. That sinking feeling when hours of work vanish because you were too caught up in the flow to pause and save. In our last post, we created a customer database and generated 500 rows of fake data. Our in-memory database has the same problem: when Python exits, all that data vanishes: import duckdb con = duckdb.connect(':memory:') con.execute("CREATE TABLE customers (id INT, name VARCHAR)") con.execute("INSERT INTO customers VALUES (1, 'Alice')") print(con.execute("SELECT * FROM customers").fetchdf()) # Script ends... and the data is gone forever A database is supposed to provide persistent storage, isn’t it? Let’s fix that with one small change. ...

Generate Practice Data with faker

Last week, we got DuckDB running with three hardcoded rows. That got us started, but three rows? You can eyeball that. Let’s generate hundreds of realistic customers and build a dataset worth exploring. Python has the perfect tool: faker. It’s a library that generates realistic fake data: names, emails, addresses, dates, and anything else you’d find in a real database. Let’s use it to build a dataset we can explore for the rest of this series. ...

I know Python; Why learn SQL

I learned SQL very early in my career. At the time, I didn’t understand why, and for the first month or so, it didn’t make sense to me. The syntax didn’t resemble any language I had seen before, and it employed concepts with which I was unfamiliar. This all made SQL seem scary, and oddly enough, SQL hasn’t changed much in the past 40 years, which makes it even more of an oddity. ...