Advanced SQL: Going beyond the Basics
Structured Query Language (SQL) makes a great addition to your toolbox. It may not be as popular as Python or JavaScript, but knowing SQL, especially advanced SQL, offers a number of benefits.
For one thing, knowing advanced SQL techniques can help you get better insights from your data, so that you can communicate your ideas more effectively. And since SQL is subject to the same performance bottlenecks and performance issues as other languages, a deeper knowledge will help you optimize your queries.
Finally, additional knowledge will help you stand out from other developers, who might know CRUD operations but not advanced SQL concepts. This gives you a career advantage.
This article will introduce you to some advanced SQL concepts, including subqueries, running totals, and common table expressions (CTEs). To follow along, you can use any database that supports SQL and an editor of your choice. You can also use an online SQL editor.
The following examples use an online SQL editor along with an SQLite database. If you are working with a database such as PostgreSQL, you might have to change the syntax; however, the underlying concepts are the same.
UNION, INTERSECT, and EXCEPT Operators
You can use sample data from this resource. In this example, all the commands have been combined into a single script in this repository. Run the script and it will create three tables with sample data. Below is the schema of the tables:
The first five rows of each table are shown below:
UNION
, INTERSECT
, and EXCEPT
operators help you combine results from two queries. Operators in SQL are similar to math operators while working with sets. You must remember the following rule:
The number of columns and the order of the columns in the table returned by the first query must be equal to the number of columns and the order of the columns in the table returned by the second query. Moreover, the data type of the corresponding columns must be compatible.
Below is the general syntax for using operators in SQL:
SELECT * FROM TABLE1
OPERATOR
SELECT * FROM TABLE2
UNION Operator
The UNION
operator returns a combination of the tables from the two queries. By default, the result will not have duplicate values. However, you can use the UNION ALL
operator if you want to keep the duplicates.
The following command will return a list of all distinct cities:
SELECT working_area FROM AGENTS
UNION
SELECT cust_city FROM CUSTOMER
The following command will return a list of all cities, including duplicates:
SELECT
working_area
FROM
AGENTS
UNION ALL
SELECT
cust_city
FROM
CUSTOMER
Below is an image comparing the outputs of UNION
and UNION ALL
:
As you can see, in the right table, London and Bangalore are repeated.
INTERSECT Operator
The INTERSECT
operator returns the common elements in the tables returned by the first and second query.
-- Returns a list of common cities in both columns
SELECT
working_area
FROM
AGENTS
INTERSECT
SELECT
cust_city
FROM
CUSTOMER
The result is shown below:
EXCEPT Operator
The EXCEPT
operator returns all the elements in the table returned by the first query that are not also in the table returned by the second query.
SELECT
working_area
FROM
AGENTS
EXCEPT
SELECT
cust_city
FROM
CUSTOMER
The above command will return a table with a single row containing Vancouver. You’ll note that Vancouver is the only city that is in CUSTOMER.cust_city
but not in AGENTS.working_area
.
Subqueries
In SQL, you can have a query inside a query, also known as a subquery. A subquery can also have a query inside it. Subqueries take three main forms:
- As part of a
SELECT
statement - After
FROM
- After
WHERE
Subqueries help you break down complex joins and unions into smaller parts to make your query more readable and maintainable, especially since you can use results from existing queries. Subqueries also make debugging easier, because you can debug each query independently. Understanding subqueries also helps you write recursive SQL.
Subquery with SELECT
Look at the tables named AGENTS
and ORDERS
. You are tasked with returning a table with two columns. The first column should contain the agent_code
from the AGENTS
table, and the second should contain the sum of the ORDERS.ord_amount
column. The second column should have the same value, or the sum repeated multiple times.
First, return the sum of all order amounts:
SELECT
SUM(ord_amount)
FROM
ORDERS
The above command will be your subquery. You could use it as shown below to obtain the required result:
SELECT
agent_code,
(
SELECT
SUM(ord_amount)
FROM
ORDERS
) total
FROM
AGENTS
The command in the second line is the subquery. It returns the sum of the ORDERS.ord_amount
column. You can also name the second column, either by using AS
or by including your preferred name after the subquery. Below is a screenshot of the partial result:
Subquery with FROM
The following SQL command returns a table with the total number of order amounts, an average of order amounts, and the number of orders sold by an agent to a customer:
SELECT
agent_code,
cust_code,
sum(ord_amount) total_amnt,
avg(ord_amount) avg_amount,
count(*) AS num
FROM
ORDERS
GROUP BY
agent_code,
cust_code
You must use the result to find the total number of orders sold by each agent. You can use the above command as a subquery along with FROM
:
SELECT
agent_code,
count(agent_code) AS count
FROM
(
SELECT
agent_code,
cust_code,
sum(ord_amount) total_amnt,
avg(ord_amount) avg_amount,
count(*) AS num
FROM
ORDERS
GROUP BY
agent_code,
cust_code
)
GROUP BY
agent_code
The above command returns the total number of customers that each agent sold to. Instead of selecting from an existing table, you select from the table returned by your subquery. Below is the partial result:
Subquery with WHEN
Now use the AGENTS
and CUSTOMER
tables to return a table that only contains common names shared by agents and customers.
First, return a table with the names of every customer, using this command:
SELECT
cust_name
FROM
CUSTOMER
Now use the IN
operator and the table returned by the above query to find the common names. Below is the command, using the previous command as a subquery:
SELECT
agent_name
FROM
AGENTS
WHERE
agent_name IN (
SELECT
cust_name
FROM
CUSTOMER
)
The WHERE
clause checks if the AGENTS.agent_name
is present within the table returned by your subquery. It will return a table with two rows: Benjamin and Charles.
Common Table Expressions (CTEs)
In SQL, you create temporary variables to store the results from your subqueries. These variables are only accessible within the outer query. For the earlier example, in which you returned a table with the number of orders sold by each agent, you can store that result as a CTE to make your queries more readable.
WITH CTE_AGENT_CUSTOMER AS (
SELECT
agent_code,
cust_code,
sum(ord_amount) total_amnt,
avg(ord_amount) avg_amount,
count(*) AS num
FROM
ORDERS
GROUP BY
agent_code,
cust_code
)
SELECT
agent_code,
count(agent_code) AS count,
AVG(total_amnt) AS AVG,
SUM(total_amnt) AS SUM
FROM
CTE_AGENT_CUSTOMER
GROUP BY
agent_code
CTEs are required when working with recursive SQL.
Recursive CTEs in SQL
Recursion in SQL is like recursion in any other language. You divide a bigger problem into a smaller problem and keep on solving the smaller problem till you reach the terminating condition. If you don’t provide a terminating condition, it will send your SQL command into an infinite loop.
The following are two common problems that can be solved using recursion.
First N Numbers
Your task is to use recursion to print a table with numbers from one to twenty. Below is the recursive code:
WITH NUMBER_CTE AS (
SELECT
1 AS num -- Starting Table
UNION
-- Combine Starting Table with Table from Below Query
SELECT
num + 1 as num -- Update Rule
FROM
NUMBER_CTE -- CTE calling itself, ie recursion
WHERE
num < 20 -- Terminating Condition
)
SELECT
*
from
NUMBER_CTE -- Displaying the recursive CTE
Consider the first and the second query of the UNION
operator separately.
SELECT
1 AS num -- Starting Table
The above is a straightforward query that returns a column called num
with the value of 1
in it.
SELECT
num + 1 as num -- Update Rule
FROM
NUMBER_CTE -- CTE calling itself, ie recursion
WHERE
num < 20 -- Terminating Condition
The above query increments the value of column num
by one as long as the original value is less than 20
.
In the first run, the result from the first subquery will be:
num
------
1
------
And the result from the second query will be:
num
------
2
------
The UNION
operator will combine both tables and update your CTE to the following table:
num
------
1
------
2
------
In the second query, since you select from your CTE, the two queries along with the UNION
operator will be executed again. The first query will again return:
num
------
1
------
The second query will use the CTE from the previous run and increment each row by one, returning the following:
num
------
2
------
3
------
The UNION
operator again combines the tables from both queries. The updated CTE will be:
num
------
1
------
2
------
3
------
This process will keep going until the column contains a row with the value 20
. In this case, the WHEN
condition won’t be satisfied and you will stop the recursion process.
Factorial
Now try a SQL command that creates a table with factorial values. Try to find the factorial of 5
.
WITH FACTORIAL AS (
SELECT
5 AS NUM,
1 AS PRODUCT -- Starting Table
UNION
-- Combine Starting Table with Table from Below Query
SELECT
NUM - 1 AS NUM,
PRODUCT * NUM AS PRODUCT -- Update Rule
FROM
FACTORIAL -- CTE calling itself, i.e recursion
WHERE
NUM > 1 -- Terminating Condition
)
SELECT
*
FROM
FACTORIAL
Your initial table should have two columns—one to store the current multiplier and one to store the current product. The product will be 1
while the first multiplier will be 5
.
The result from the first query will be:
num product
------ ---------
5 1
------ ---------
The result from the second query will be:
num product
-------- --------------
5-1 = 4 5*1 = 5
-------- --------------
The new CTE will be:
num product
------ ---------
5 1
------ ---------
4 5
------ ---------
In the next run, the result from the first query will be the same:
num product
------ ---------
5 1
------ ---------
The result from the second query will be calculated based on the CTE from the previous run:
num product
--------- ---------
5 - 1 = 4 1 * 5 = 5
--------- ---------
4 - 1 = 3 5 * 4 = 20
--------- ---------
The UNION
operator will combine both tables. The updated CTE will be:
num product
------ ---------
5 1
------ ---------
4 5
------ ---------
3 20
------ ---------
This process will keep going until you reach the terminating condition.
General Syntax for Recursive CTEs
Based on the above problems, you can see the general syntax for recursive CTEs in SQL:
WITH CTE AS
(
-- Starting Table
UNION -- Combine Starting Table with Table from Below Query
-- Updated Table with Update conditions if any
FROM CTE-- CTE calling itself, i.e recursion
-- Terminating Condition
)
SELECT * FROM CTE
The terminating condition doesn’t necessarily have to be a WHERE
condition; it can also be a JOIN
condition. You’ll look at terminating conditions with JOIN
in the following cases.
Add Sample Data
Recursion is commonly used to find hierarchical structures in a company or when working with family trees. You’ll need to create a new table in your database. The first column will be the agent’s code, and the second will be their manager’s agent code. If an agent has no manager, the value in the second column will be NULL
. Below is the hierarchical structure you will be adding to your database:
You can find the script to add the data in this repo.
Task One
Say you’re given an agent’s ID and are asked to return a table with the agent’s manager, skip-manager, skip-skip-manager, etc., all the way to the top. Below is the command to complete this task:
with HIERARCHY_CTE AS (
SELECT
h1.agent_code,
h1.manager_code
from
HIERARCHY h1
WHERE
h1.agent_code == 'A006'
UNION
SELECT
h2.agent_code,
h2.manager_code
FROM
HIERARCHY h2
JOIN HIERARCHY_CTE ON h2.AGENT_CODE = HIERARCHY_CTE.manager_code
)
SELECT
*
FROM
HIERARCHY_CTE
Again, consider each query individually.
SELECT
h1.agent_code,
h1.manager_code
FROM
HIERARCHY h1
WHERE
h1.agent_code == 'A006'
The above query returns the agent’s code and their manager’s agent code. Below is the output:
agent_code manager_code
--------- ------------
A006 A005
Now look at the second query:
SELECT
h2.agent_code,
h2.manager_code
FROM
HIERARCHY h2
JOIN HIERARCHY_CTE ON h2.AGENT_CODE = HIERARCHY_CTE.manager_code
This performs a JOIN
between the HIERARCHY
table and the CTE you created. It tries to find the rows where the agent_code
from HIERARCHY
matches the manager_code
for the CTE. It then returns the agent’s code and their manager’s code, or A006
’s manager and A006
’s skip-manager.
Below is the output:
agent_code manager_code
--------- ------------
A005 A003
The union returns the following table:
agent_code manager_code
--------- ------------
A006 A005
--------- ------------
A005 A003
In the second run, the first query will again return:
agent_code manager_code
--------- ------------
A006 A005
The second query, however, will include A003
’s manager as well. Below is the output:
agent_code manager_code
--------- ------------
A005 A003
--------- ------------
A003 A001
The UNION
operator combines the tables. Your updated CTE will be:
agent_code manager_code
--------- ------------
A006 A005
--------- ------------
A005 A003
--------- ------------
A003 A001
A001
has no manager. In the next run, you will hit the terminating condition.
Task Two
In this task, you’ll assign a level to each agent. The agents with no boss are level 1, the agents who report to them are level 2, and so on. Below is the command that returns this table:
with HIERARCHY_CTE AS (
SELECT
h1.agent_code,
h1.manager_code,
1 AS level
from
HIERARCHY h1
WHERE
h1.manager_code IS NULL
UNION
SELECT
h2.agent_code,
h2.manager_code,
level + 1 AS level
from
HIERARCHY h2
JOIN HIERARCHY_CTE on h2.manager_code == HIERARCHY_CTE.agent_code
)
SELECT
*
FROM
HIERARCHY_CTE
This is similar to the command from your previous task. The main difference is that you start from the top agent, rather than the bottom.
Running Totals
Running totals are also known as cumulative sums. Consider the ORDERS
table in your database. It has columns called ord_amount
and ord_date
. Try to calculate the cumulative sum on each date record in the ord_date
column. Below is the command:
SELECT
ord_date,
SUM(ord_amount) OVER (
ORDER BY
ord_date
) AS Running_Total
FROM
ORDERS
The OVER
command tells SQL that you want to find the cumulative sum of the ord_amount
column over the whole data set. The column in the ORDER BY
clause is used to sort the results by date.
CASE WHEN Statements
CASE WHEN
statements in SQL are similar to if-else or switch-case statements in other languages. Below is the general syntax for CASE WHEN
statements:
CASE
WHEN CONDITION THEN RESULT
WHEN CONDITION THEN RESULT
WHEN CONDITION THEN RESULT
ELSE RESULT
END Column_Name
The result in ELSE
is returned if none of the conditions return true.
The CUSTOMER
table has a column called grade
with values from 0-4
. You are tasked with creating a new column called grade_letter
that maps 1
to A
, 2
to B
, and everything else to F
. You can use CASE WHEN
statements to complete this task. Below is the command:
SELECT
*,
CASE WHEN grade == 1 THEN 'A' WHEN grade == 2 THEN 'B' ELSE 'F' END grade_letter
FROM
CUSTOMER
Conclusion
This guide offered you some hands-on experience in working with more advanced SQL topics. Implementing these tools will help you gain more insights from your data so that you can write better queries in SQL.
If you are working with SQL, you might be interested in CloudQuery. The SQL-based open source cloud asset inventory integrates with major cloud providers including Azure, AWS, and GCP. You can use it to audit and evaluate your cloud assets from a current and historical perspective.
For more on what CloudQuery can do for you, check out its documentation.
This article was originally posted on
https://www.realpythonproject.com/advanced-sql-going-beyond-the-basics/
Connect with me on LinkedIn:
https://www.linkedin.com/in/rahulbanerjee2699/