Famine's Domain

Famine's Domain

Developer / Street Fighter / Beast

Using Common Table Expressions

Let’s talk about Common Table Expressions (CTE) with Microsoft SQL Server. These are temporary data sets within the execution scope of a single SELECT, INSERT, UPDATE or VIEW statement. CTE’s only last for the duration of your query or connection and can be self-referencing more than one time in a single query for uses like recursion or a general view.

The reason I wanted to talk about common table expressions (CTE) is because it’s something simple, easy to learn and pretty heavily used in analytical tasks. I keep getting stuck on deciding what to write about because we are now in a world where there are millions of pages of content out there. It’s pretty overwhelming when you think about it because like most, you get stuck trying to talk about something different. You start focusing on bigger ideas that take more thought and prep work.

Well, I’ve given up on that idea and figured I would talk about the basics too. I’ll focus on the use cases of those basics and try to show how I’ve used them in real world situations that generally apply to analytics, data science, data engineer and so on.

CTE Syntax

So, what is the basic syntax of the common table expression (CTE)?

Example

It’s really that simple. It can be even more simple than that, but look at what is happening here. You’re basically creating a temporary result set on FactProductSales. When the outer query is executed, so is the inner query. You get similar effects when taking that inner query and making a sub-query. In theory, they should be similar performance. However, the CTE can be used recursively where a sub-query cannot. Outside of that, many may chose to use CTE in replace of sub-query because they favor the syntax and structure clarity better.

If the CTE is referenced more than once in the outer query, then you risk having the inner query executed multiple times as opposed to just once. This is ultimately decided by the query optimizer. This is why others may opt for alternatives unless they are doing specific recursion use cases. One option may just be using a temporary table where they can index and have statistics for example.

Stacking CTE’s

In this example, I have two inner query definitions targeting two sets of data at different points in time. The outer query is then using a UNION to join both of those datasets together similar to what you may do with a sub-query. You can obviously still just use one inner query here and just SELECT within the range of both groups. Then when you create your outer join, you just filter the dates with your WHERE clause. The concern however is that you will select more data than you need.

That being said, there is nothing stopping you from going the UNION route in the first inner join. Like in many cases in SQL, there are a number of ways to skin a cat. Try the various ways, see how the query optimizer treats those different ways. Don’t just be refined to CTE’s. Pick the best option for the use case. I typically use CTE for smaller datasets and take different approaches for larger datasets.

Using CTE with MERGE

Outside of the common recursion examples, CTE’s are commonly used with MERGE statements in order to create a filtered view. That filtered view is used to compare staging data with primary data or easily understood as old data with new data. The MERGE statement helps make that comparison and then provides you with a series of logical actions you can take such as INSERT, UPDATE or DELETE.

In my real-world example similar to the query below, I receive data in 24-hour increments every night in batch. My primary table that I wish to target with these new records has an clustered index based on time stamps. Using the CTE with a MERGE statement allows me to INSERT those new records from the past 24-hours on top of that table and on top of that index.

While I can just use a simple INSERT statement to achieve the desired results, what happens when there is a republish of old data? I will either have to alter the INSERT statement or just simply use a MERGE with a CTE that filters the inner query based on the new data I received. That way I only target the primary data–that may be filled with billions of records–I want to read.

Here you can see how I used the CTE to filter the primary table where I want to INSERT new records that do not match the filter from the CTE. You can expand on this more by adding a WHEN MATCHED clause to the MERGE statement to also UPDATE records where certain fields have changed.

In the above example, you may have noticed the reference of the columns in the CTE are missing (i.e.: PrimaryData). You do not have to define your columns after you call AS. It’s good practice to define them for the end user to get an understanding of what fields are available from the inner query.

Common Table Expressions Termination

I was taught to always start my Common Table Expressions with a semicolon (;). That’s because if you have a Common Table Expressions in a query where there are other definitions above the Common Table Expressions , then the Common Table Expressions will fail to execute and give you an error when you run the query. While always putting the semicolon before you begin the Common Table Expressions (for example:  ;WITH MyCTE…) does ensure the Common Table Expressions will run, it’s a poor practice. Every statement above or below the Common Table Expressions including the Common Table Expressions itself should always be terminated with a semicolon.

Recursion Example

In most cases I deal with, recursion is used when I need a function that calls itself and when I need to divide and conquer a problem by taking the original problem and dividing it into smaller (easily solved) pieces. For example, the Fibonacci sequence is a common example where you have to reference what happened in the past to output the future:

Fibonacci Sequence

0, 1, 1, 2, 3, 5, 8, 13

0 + 1 = 1

1 + 1 = 2

1 + 2 = 3

2 + 3 = 5…

With Common Table Expressions, you will need to reference the CTE within the CTE. Confused? Take a look at this basic recursion example.

You n number in this case is not a variable, but the alias selecting the number one (1). The UNION is referencing that first n alias again and adding + 1 to the result. This keeps looping until the filter of WHERE n+1 <= 10 is satisfied. In this case, you should get a result of 10 records that go from 1 to 10 when your outer query select n from numbers.

In the Fibonacci sequence example with SQL, the same approach is taken where you SELECT n, but you also select the previous number + N in order to get the desired result of

 xn = xn – 1 +  xn – 2

Now lets look at how you can accomplish that rule in SQL.

The end result with be desired results of 0, 1, 1, 2, 3, 5, 8, 13 all the way to 701,408,733.

Wrapping Up

That should cover a good bit of the Common Table Expressions and some examples. Feel free to share your thoughts on the CTE and of course, if you notice any incorrect information above or errors. I am not a master at what I do, but I get by.

Glen 'Famine' Swan
Follow Me

Glen 'Famine' Swan

Glen is a 8-year video game industry professional with over 10 AAA accredit titles under his belt. Currently, he is a practicing developer and data scientist within the digital marketing industry.
Glen 'Famine' Swan
Follow Me

Latest posts by Glen 'Famine' Swan (see all)