Advanced SQL: Going beyond the Basics

Structured Query Language (SQL) makes a great addition to your toolbox. It may not be as popular as Python or JavaScript, but knowing SQL, especially advanced SQL, offers a number of benefits.

For one thing, knowing advanced SQL techniques can help you get better insights from your data, so that you can communicate your ideas more effectively. And since SQL is subject to the same performance bottlenecks and performance issues as other languages, a deeper knowledge will help you optimize your queries.

Finally, additional knowledge will help you stand out from other developers, who might know CRUD operations but not advanced SQL concepts. This gives you a career advantage.

This article will introduce you to some advanced SQL concepts, including subqueries, running totals, and common table expressions (CTEs). To follow along, you can use any database that supports SQL and an editor of your choice. You can also use an online SQL editor.

The following examples use an online SQL editor along with an SQLite database. If you are working with a database such as PostgreSQL, you might have to change the syntax; however, the underlying concepts are the same.

UNION, INTERSECT, and EXCEPT Operators

You can use sample data from this resource. In this example, all the commands have been combined into a single script in this repository. Run the script and it will create three tables with sample data. Below is the schema of the tables:

Agents table schema Image courtesy of w3resource

The first five rows of each table are shown below:

First five rows

UNION, INTERSECT, and EXCEPT operators help you combine results from two queries. Operators in SQL are similar to math operators while working with sets. You must remember the following rule:

The number of columns and the order of the columns in the table returned by the first query must be equal to the number of columns and the order of the columns in the table returned by the second query. Moreover, the data type of the corresponding columns must be compatible.

Below is the general syntax for using operators in SQL:

SELECT * FROM TABLE1
OPERATOR
SELECT * FROM TABLE2

UNION Operator

The UNION operator returns a combination of the tables from the two queries. By default, the result will not have duplicate values. However, you can use the UNION ALL operator if you want to keep the duplicates.

The following command will return a list of all distinct cities:

SELECT working_area FROM AGENTS
UNION
SELECT cust_city FROM CUSTOMER

The following command will return a list of all cities, including duplicates:

SELECT 
  working_area 
FROM 
  AGENTS 
UNION ALL
SELECT 
  cust_city 
FROM 
  CUSTOMER

Below is an image comparing the outputs of UNION and UNION ALL:

UNION vs. UNION ALL

As you can see, in the right table, London and Bangalore are repeated.

INTERSECT Operator

The INTERSECT operator returns the common elements in the tables returned by the first and second query.

-- Returns a list of common cities in both columns
SELECT 
  working_area 
FROM 
  AGENTS 
INTERSECT 
SELECT 
  cust_city 
FROM 
  CUSTOMER

The result is shown below:

INTERSECT operator

EXCEPT Operator

The EXCEPT operator returns all the elements in the table returned by the first query that are not also in the table returned by the second query.

SELECT 
  working_area 
FROM 
  AGENTS 
EXCEPT 
SELECT 
  cust_city 
FROM 
  CUSTOMER

The above command will return a table with a single row containing Vancouver. You’ll note that Vancouver is the only city that is in CUSTOMER.cust_city but not in AGENTS.working_area.

Subqueries

In SQL, you can have a query inside a query, also known as a subquery. A subquery can also have a query inside it. Subqueries take three main forms:

  • As part of a SELECT statement
  • After FROM
  • After WHERE

Subqueries help you break down complex joins and unions into smaller parts to make your query more readable and maintainable, especially since you can use results from existing queries. Subqueries also make debugging easier, because you can debug each query independently. Understanding subqueries also helps you write recursive SQL.

Subquery with SELECT

Look at the tables named AGENTS and ORDERS. You are tasked with returning a table with two columns. The first column should contain the agent_code from the AGENTS table, and the second should contain the sum of the ORDERS.ord_amount column. The second column should have the same value, or the sum repeated multiple times.

First, return the sum of all order amounts:

SELECT 
  SUM(ord_amount) 
FROM 
  ORDERS

The above command will be your subquery. You could use it as shown below to obtain the required result:

SELECT 
  agent_code, 
  (
    SELECT 
      SUM(ord_amount) 
    FROM 
      ORDERS
  ) total 
FROM 
  AGENTS

The command in the second line is the subquery. It returns the sum of the ORDERS.ord_amount column. You can also name the second column, either by using AS or by including your preferred name after the subquery. Below is a screenshot of the partial result:

Partial result

Subquery with FROM

The following SQL command returns a table with the total number of order amounts, an average of order amounts, and the number of orders sold by an agent to a customer:

SELECT 
  agent_code, 
  cust_code, 
  sum(ord_amount) total_amnt, 
  avg(ord_amount) avg_amount, 
  count(*) AS num 
FROM 
  ORDERS 
GROUP BY 
  agent_code, 
  cust_code

You must use the result to find the total number of orders sold by each agent. You can use the above command as a subquery along with FROM:

SELECT 
  agent_code, 
  count(agent_code) AS count 
FROM 
  (
    SELECT 
      agent_code, 
      cust_code, 
      sum(ord_amount) total_amnt, 
      avg(ord_amount) avg_amount, 
      count(*) AS num 
    FROM 
      ORDERS 
    GROUP BY 
      agent_code, 
      cust_code
  ) 
GROUP BY 
  agent_code

The above command returns the total number of customers that each agent sold to. Instead of selecting from an existing table, you select from the table returned by your subquery. Below is the partial result:

Partial result

Subquery with WHEN

Now use the AGENTS and CUSTOMER tables to return a table that only contains common names shared by agents and customers.

First, return a table with the names of every customer, using this command:

SELECT 
  cust_name 
FROM 
  CUSTOMER

Now use the IN operator and the table returned by the above query to find the common names. Below is the command, using the previous command as a subquery:

SELECT 
  agent_name 
FROM 
  AGENTS 
WHERE 
  agent_name IN (
    SELECT 
      cust_name 
    FROM 
      CUSTOMER
  )

The WHERE clause checks if the AGENTS.agent_name is present within the table returned by your subquery. It will return a table with two rows: Benjamin and Charles.

Common Table Expressions (CTEs)

In SQL, you create temporary variables to store the results from your subqueries. These variables are only accessible within the outer query. For the earlier example, in which you returned a table with the number of orders sold by each agent, you can store that result as a CTE to make your queries more readable.

WITH CTE_AGENT_CUSTOMER AS (
  SELECT 
    agent_code, 
    cust_code, 
    sum(ord_amount) total_amnt, 
    avg(ord_amount) avg_amount, 
    count(*) AS num 
  FROM 
    ORDERS 
  GROUP BY 
    agent_code, 
    cust_code
) 
SELECT 
  agent_code, 
  count(agent_code) AS count, 
  AVG(total_amnt) AS AVG, 
  SUM(total_amnt) AS SUM 
FROM 
  CTE_AGENT_CUSTOMER 
GROUP BY 
  agent_code

CTEs are required when working with recursive SQL.

Recursive CTEs in SQL

Recursion in SQL is like recursion in any other language. You divide a bigger problem into a smaller problem and keep on solving the smaller problem till you reach the terminating condition. If you don’t provide a terminating condition, it will send your SQL command into an infinite loop.

The following are two common problems that can be solved using recursion.

First N Numbers

Your task is to use recursion to print a table with numbers from one to twenty. Below is the recursive code:

WITH NUMBER_CTE AS (
  SELECT 
    1 AS num -- Starting Table
  UNION 
    -- Combine Starting Table with Table from Below Query
  SELECT 
    num + 1 as num -- Update Rule
  FROM 
    NUMBER_CTE -- CTE calling itself, ie recursion
  WHERE 
    num < 20 -- Terminating Condition
    ) 
SELECT 
  * 
from 
  NUMBER_CTE -- Displaying the recursive CTE

Consider the first and the second query of the UNION operator separately.

SELECT 
  1 AS num -- Starting Table

The above is a straightforward query that returns a column called num with the value of 1 in it.

SELECT 
  num + 1 as num -- Update Rule
FROM 
  NUMBER_CTE -- CTE calling itself, ie recursion
WHERE 
  num < 20 -- Terminating Condition

The above query increments the value of column num by one as long as the original value is less than 20.

In the first run, the result from the first subquery will be:

num
------
1
------

And the result from the second query will be:

num
------
2
------

The UNION operator will combine both tables and update your CTE to the following table:

num
------
1
------
2
------

In the second query, since you select from your CTE, the two queries along with the UNION operator will be executed again. The first query will again return:

num
------
1
------

The second query will use the CTE from the previous run and increment each row by one, returning the following:

num
------
2
------
3
------

The UNION operator again combines the tables from both queries. The updated CTE will be:

num
------
1
------
2
------
3
------

This process will keep going until the column contains a row with the value 20. In this case, the WHEN condition won’t be satisfied and you will stop the recursion process.

Factorial

Now try a SQL command that creates a table with factorial values. Try to find the factorial of 5.

WITH FACTORIAL AS (
  SELECT 
    5 AS NUM, 
    1 AS PRODUCT -- Starting Table
  UNION 
    -- Combine Starting Table with Table from Below Query
  SELECT 
    NUM - 1 AS NUM, 
    PRODUCT * NUM AS PRODUCT -- Update Rule
  FROM 
    FACTORIAL -- CTE calling itself, i.e recursion
  WHERE 
    NUM > 1 -- Terminating Condition
    ) 
SELECT 
  * 
FROM 
  FACTORIAL

Your initial table should have two columns—one to store the current multiplier and one to store the current product. The product will be 1 while the first multiplier will be 5.

The result from the first query will be:

num    product

------ ---------
5        1
------ ---------

The result from the second query will be:

num        product
--------   --------------
5-1 = 4     5*1 = 5
--------   --------------

The new CTE will be:

num    product
------ ---------
5         1
------ ---------
4         5
------ ---------

In the next run, the result from the first query will be the same:

num    product
------ ---------
5        1
------ ---------

The result from the second query will be calculated based on the CTE from the previous run:

num         product
---------   ---------
5 - 1 = 4   1 * 5 = 5
---------   ---------
4 - 1 = 3   5 * 4 = 20
---------   ---------

The UNION operator will combine both tables. The updated CTE will be:

num     product
------  ---------
5         1
------  ---------
4         5
------  ---------
3        20
------  ---------

This process will keep going until you reach the terminating condition.

General Syntax for Recursive CTEs

Based on the above problems, you can see the general syntax for recursive CTEs in SQL:

WITH CTE AS
(
    -- Starting Table
    UNION -- Combine Starting Table with Table from Below Query
    -- Updated Table with Update conditions if any
    FROM CTE-- CTE calling itself, i.e recursion
    -- Terminating Condition
)

SELECT * FROM CTE

The terminating condition doesn’t necessarily have to be a WHERE condition; it can also be a JOIN condition. You’ll look at terminating conditions with JOIN in the following cases.

Add Sample Data

Recursion is commonly used to find hierarchical structures in a company or when working with family trees. You’ll need to create a new table in your database. The first column will be the agent’s code, and the second will be their manager’s agent code. If an agent has no manager, the value in the second column will be NULL. Below is the hierarchical structure you will be adding to your database:

Hierarchical structure

You can find the script to add the data in this repo.

Task One

Say you’re given an agent’s ID and are asked to return a table with the agent’s manager, skip-manager, skip-skip-manager, etc., all the way to the top. Below is the command to complete this task:

with HIERARCHY_CTE AS (
  SELECT 
    h1.agent_code, 
    h1.manager_code 
  from 
    HIERARCHY h1 
  WHERE 
    h1.agent_code == 'A006' 
  UNION 
  SELECT 
    h2.agent_code, 
    h2.manager_code 
  FROM 
    HIERARCHY h2 
    JOIN HIERARCHY_CTE ON h2.AGENT_CODE = HIERARCHY_CTE.manager_code
) 
SELECT 
  * 
FROM 
  HIERARCHY_CTE

Again, consider each query individually.

SELECT 
  h1.agent_code, 
  h1.manager_code 
FROM 
  HIERARCHY h1 
WHERE 
  h1.agent_code == 'A006'

The above query returns the agent’s code and their manager’s agent code. Below is the output:

agent_code   manager_code
---------   ------------
A006         A005

Now look at the second query:

SELECT 
  h2.agent_code, 
  h2.manager_code 
FROM 
  HIERARCHY h2 
  JOIN HIERARCHY_CTE ON h2.AGENT_CODE = HIERARCHY_CTE.manager_code

This performs a JOIN between the HIERARCHY table and the CTE you created. It tries to find the rows where the agent_code from HIERARCHY matches the manager_code for the CTE. It then returns the agent’s code and their manager’s code, or A006’s manager and A006’s skip-manager.

Below is the output:

agent_code  manager_code
---------   ------------
A005        A003

The union returns the following table:

agent_code  manager_code
---------   ------------
A006        A005
---------   ------------
A005        A003

In the second run, the first query will again return:

agent_code   manager_code
---------    ------------
A006           A005

The second query, however, will include A003’s manager as well. Below is the output:

agent_code manager_code
--------- ------------
A005        A003
--------- ------------
A003        A001

The UNION operator combines the tables. Your updated CTE will be:

agent_code    manager_code
---------    ------------
A006           A005
---------    ------------
A005           A003
---------    ------------
A003           A001

A001 has no manager. In the next run, you will hit the terminating condition.

Task Two

In this task, you’ll assign a level to each agent. The agents with no boss are level 1, the agents who report to them are level 2, and so on. Below is the command that returns this table:

with HIERARCHY_CTE AS (
  SELECT 
    h1.agent_code, 
    h1.manager_code, 
    1 AS level 
  from 
    HIERARCHY h1 
  WHERE 
    h1.manager_code IS NULL 
  UNION 
  SELECT 
    h2.agent_code, 
    h2.manager_code, 
    level + 1 AS level 
  from 
    HIERARCHY h2 
    JOIN HIERARCHY_CTE on h2.manager_code == HIERARCHY_CTE.agent_code
) 
SELECT 
  * 
FROM 
  HIERARCHY_CTE

This is similar to the command from your previous task. The main difference is that you start from the top agent, rather than the bottom.

Running Totals

Running totals are also known as cumulative sums. Consider the ORDERS table in your database. It has columns called ord_amount and ord_date. Try to calculate the cumulative sum on each date record in the ord_date column. Below is the command:

SELECT 
  ord_date, 
  SUM(ord_amount) OVER (
    ORDER BY 
      ord_date
  ) AS Running_Total 
FROM 
  ORDERS

The OVER command tells SQL that you want to find the cumulative sum of the ord_amount column over the whole data set. The column in the ORDER BY clause is used to sort the results by date.

CASE WHEN Statements

CASE WHEN statements in SQL are similar to if-else or switch-case statements in other languages. Below is the general syntax for CASE WHEN statements:

CASE
WHEN CONDITION THEN RESULT
WHEN CONDITION THEN RESULT
WHEN CONDITION THEN RESULT
ELSE RESULT
END Column_Name

The result in ELSE is returned if none of the conditions return true.

The CUSTOMER table has a column called grade with values from 0-4. You are tasked with creating a new column called grade_letter that maps 1 to A, 2 to B, and everything else to F. You can use CASE WHEN statements to complete this task. Below is the command:

SELECT 
  *, 
  CASE WHEN grade == 1 THEN 'A' WHEN grade == 2 THEN 'B' ELSE 'F' END grade_letter 
FROM 
  CUSTOMER

Conclusion

This guide offered you some hands-on experience in working with more advanced SQL topics. Implementing these tools will help you gain more insights from your data so that you can write better queries in SQL.

If you are working with SQL, you might be interested in CloudQuery. The SQL-based open source cloud asset inventory integrates with major cloud providers including Azure, AWS, and GCP. You can use it to audit and evaluate your cloud assets from a current and historical perspective.

For more on what CloudQuery can do for you, check out its documentation.


This article was originally posted on 
https://www.realpythonproject.com/advanced-sql-going-beyond-the-basics/

Connect with me on LinkedIn: 
https://www.linkedin.com/in/rahulbanerjee2699/