Speeding up SQL queries by orders of magnitude using UNION
SQL is a very powerful tool for querying data. It allows you to write queries against your relational data in a declarative manner, describing what data you want to retrieve without having to specify how to retrieve it. In most cases this works very well, and the query optimizer in many database engines (MySQL, PostgreSQL, etc.) will create an efficient query plan.
Efficient query plans rely on a schema that uses appropriate data types, especially for primary key columns, where mistakes such as misusing `VARCHAR` can kill performance. Another critical element of enabling fast query plans is appropriately indexing columns, which eliminates the need to perform full table scans when retrieving data. Unfortunately, even when following these schema rules, it’s possible to write SQL queries with surprisingly poor performance, often to the bewilderment of the developer who wrote them. Perhaps the most surprising aspect of this type of query is that it is often written in the most intuitive way to describe the desired data.
A performance trap: the diamond-shaped schema
One of the most common cases where SQL query performance can degrade significantly is a diamond-shaped schema, in which there are multiple paths for joining two tables together. In such a schema, a query is likely to use `OR` to join tables in more than one way, which eliminates the optimizer’s ability to create an efficient query plan. This scenario is best illustrated through an example.
Imagine we have the following schema for a chain of retail stores that sell food and drink. The table layout is as follows:
stores
+---------+------+
customers +---->| id | int |<----------------+
+----------+------+ | | address | text | |
+--->| id | int | | +---------+------+ |
| | name | text | | |
| | store_id | int +--+ employees |
| +----------+------+ +----------+------+ |
| +----->| id | int | |
| | | name | text | |
| customer_orders | | role | text | |
| +-------------+-----------+ | | store_id | int +--+
| | id | int |<--+ | +----------+------+
+--+ customer_id | int | | |
| created | timestamp | | |
+-------------+-----------+ | | employee_markouts
| | +--------------+-----------+
| | | id | int |
customer_order_items | +--+ employee_id | int |
+-------------------+-----+ | | meal_item_id | int +--+
| id | int | | | created | timestamp | |
| customer_order_id | int +--+ +--------------+-----------+ |
+--+ meal_item_id | int | |
| +-------------------+-----+ |
| meal_items |
| +-------+------+ |
+-------------------------------->| id | int |<--------------------+
| label | text |
| price | int |
+-------+------+
Here are a few key features of this schema:
- Both `customers` and `employees` belong to `stores`.
- Customers place `customer_orders`, which consist of one or several `customer_order_items`.
- Employees periodically receive free items as part of their employment, which are recorded as `employee_markouts`.
- Both `customer_order_items` and `employee_markouts` reference `meal_items`, which contains the labels and prices of the food items sold.
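For readers who want to experiment with this layout, here is a minimal sketch of the schema using Python’s built-in `sqlite3` module. The article itself uses PostgreSQL, so the types are simplified for SQLite and the index names are our own choices, not taken from the original setup:

```python
import sqlite3

# In-memory SQLite stand-in for the article's PostgreSQL schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (
    id INTEGER PRIMARY KEY,
    address TEXT
);
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    store_id INTEGER REFERENCES stores(id)
);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    role TEXT,
    store_id INTEGER REFERENCES stores(id)
);
CREATE TABLE meal_items (
    id INTEGER PRIMARY KEY,
    label TEXT,
    price INTEGER
);
CREATE TABLE customer_orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    created TIMESTAMP
);
CREATE TABLE customer_order_items (
    id INTEGER PRIMARY KEY,
    customer_order_id INTEGER REFERENCES customer_orders(id),
    meal_item_id INTEGER REFERENCES meal_items(id)
);
CREATE TABLE employee_markouts (
    id INTEGER PRIMARY KEY,
    employee_id INTEGER REFERENCES employees(id),
    meal_item_id INTEGER REFERENCES meal_items(id),
    created TIMESTAMP
);

-- Index the foreign keys so joins can avoid full table scans.
CREATE INDEX customers_store_id_idx ON customers (store_id);
CREATE INDEX employees_store_id_idx ON employees (store_id);
CREATE INDEX customer_orders_customer_id_idx ON customer_orders (customer_id);
CREATE INDEX customer_order_items_order_id_idx ON customer_order_items (customer_order_id);
CREATE INDEX employee_markouts_employee_id_idx ON employee_markouts (employee_id);
""")
```

The foreign-key indexes are the “proper indexing” the rest of the post assumes.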
For the purposes of our testing, we’ll be deploying this schema with proper indexing on a PostgreSQL 12.6 database with the following number of records in each table:
| Table | Number of Records |
|---|---|
| stores | 800 |
| employees | 20,000 |
| employee_markouts | 25,000 |
| customers | 20,000 |
| customer_orders | 100,000 |
| customer_order_items | 550,482 |
| meal_items | 500 |
All of the orders and markouts are randomly distributed amongst the customers and employees, respectively. Employees and customers are also randomly distributed across stores.
Handling a report request
In order to audit inventory, the logistics team at the corporate headquarters requests a tool that can generate a report containing all `meal_items` that left a given store’s inventory on a particular day. This requires a query that includes both the items sold to customers and the items recorded as employee markouts for the specified store on the specified day.
To break this request down into more manageable segments, we’ll first retrieve all of the meal items that are a part of an employee markout created on the given day at the given store. Once we have this, we’ll expand it to include meal items that have been purchased by customers.
The query that retrieves only the employee markout data starts at the `stores` table and joins through the employee tables down to the `meal_items` table. This is a fairly straightforward query, and since the relevant columns are indexed, we expect it to perform well.
Query #1 - Retrieving only employee meal items
SELECT meal_items.*, employee_markouts.employee_id
FROM stores
INNER JOIN employees
ON stores.id = employees.store_id
INNER JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
INNER JOIN meal_items
ON employee_markouts.meal_item_id = meal_items.id
WHERE stores.id = 250
AND employee_markouts.created >= '2021-02-03'
AND employee_markouts.created < '2021-02-04';
We get the following when we run this query:
| id | label | price | employee_id |
|---|---|---|---|
| 173 | Pizza | 3.73 | 3737 |
| 339 | Tuna Sashimi | 21.41 | 3737 |
2 results, Execution Time: 1.499 ms
This query gives us the data we’re looking for and runs in a blazing fast 1.499 milliseconds—the excellent performance we expected. The problem is that we’re not done yet: we also need to retrieve the meal items that are part of customer orders. To do this, we’ll modify the above query in the following ways:
- We’ll include a second branch of joins from `stores` to `meal_items` through the customer tables, updating the final join into the `meal_items` table to use an `OR` that merges both branches.
- Since we’re looking for meal items that are part of either employee markouts or customer orders, we’ll convert all of our joins to `LEFT` joins and add a condition in the customer branch to ignore employee markouts.
- We’ll also change the columns we’re selecting to include either the `employee_id` or the `customer_id` that the meal item belongs to.
Our new query with these changes incorporated is below:
Query #2 - Retrieving both employee and customer meal items using multi-branch joins
SELECT
meal_items.*,
employee_markouts.employee_id,
customer_orders.customer_id
FROM stores
-- employees branch
LEFT JOIN employees
ON stores.id = employees.store_id
LEFT JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
-- customers branch
LEFT JOIN customers
ON (stores.id = customers.store_id AND employee_markouts.id IS null)
LEFT JOIN customer_orders
ON customers.id = customer_orders.customer_id
LEFT JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
-- join both branches into meal_items
LEFT JOIN meal_items
ON (customer_order_items.meal_item_id = meal_items.id
OR employee_markouts.meal_item_id = meal_items.id
)
WHERE stores.id = 250
AND meal_items.id IS NOT null
AND ( employee_markouts.created >= '2021-02-03' AND employee_markouts.created < '2021-02-04'
OR customer_orders.created >= '2021-02-03' AND customer_orders.created < '2021-02-04'
)
GROUP BY meal_items.id, employee_markouts.id, customer_orders.id, customer_order_items.id;
Running query #2 gives the following results (abridged for brevity):
| id | label | price | employee_id | customer_id |
|---|---|---|---|---|
| 5 | Stinky Tofu | 21.24 | | 3769 |
| 17 | Chicken Sandwich | 18.37 | | 11085 |
| 25 | Kebab | 16.30 | | 3769 |
| … | … | … | … | … |
| 173 | Pizza | 3.73 | 3737 | |
| 339 | Tuna Sashimi | 21.41 | 3737 | |
| … | … | … | … | … |
| 490 | Ribeye Steak | 20.10 | | 1052 |
45 results, Execution Time: 3264.547 ms
Of the 45 results, 43 represent meal items purchased by customers, and we continue to see the two meal items from the previous query that come from employee markouts. Unfortunately, the performance of this multi-branch query is far worse than before. The sub-2-millisecond query from before has ballooned into a sluggish 3,264 milliseconds.
This performance may be acceptable for a one-off query, but for any other use case, this execution time of more than three seconds is very poor given the relatively small amount of data in our database. Our database has roughly only 750,000 rows. If we were dealing with row counts in the tens or hundreds of millions, the performance of our report would likely be in the tens of seconds. This is an unacceptable amount of time to make our end users wait, especially if they need to run multiple reports, so we need to find a way to achieve better performance.
Sticking to fast and simple queries
After running each of our two queries, it’s apparent that attempting to retrieve meal items for both employees and customers in a single query, the way we wrote query #2, results in a significant degradation in performance. Even without being familiar with the specifics of the query plan (which we can see by re-running the query prefixed with `EXPLAIN` or `EXPLAIN ANALYZE`), we can stick to simpler queries that we expect to perform well, and see whether there’s a better way to compose the results.
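As a quick illustration of inspecting a plan, here is a sketch using SQLite’s `EXPLAIN QUERY PLAN` (a rough analogue of PostgreSQL’s `EXPLAIN`); the tiny two-table setup and index name here are our own stand-ins, not the article’s database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, store_id INTEGER);
CREATE INDEX employees_store_id_idx ON employees (store_id);
""")

# Prefixing a query with EXPLAIN QUERY PLAN returns the plan steps
# instead of the rows, so we can see how the tables are accessed.
plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT employees.name
FROM stores
INNER JOIN employees ON stores.id = employees.store_id
WHERE stores.id = 250;
""").fetchall()

for step in plan:
    print(step[-1])  # the human-readable detail column of each plan step
```

In PostgreSQL, `EXPLAIN ANALYZE` additionally executes the query and reports actual row counts and timings per plan node.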
Query #1 retrieved only the meal items that are part of employee markouts, and it performed extremely well. Let’s try writing the query that retrieves only the meal items that are part of customer orders and examine its performance. Like the query for employee data, this query will join the tables between `stores` and `meal_items`, but do so through the customer tables instead. There are three customer-specific tables rather than two employee-specific tables, but otherwise this query is very similar to the first:
Query #3 - Retrieving only customer meal items
SELECT meal_items.*, customer_orders.customer_id
FROM stores
INNER JOIN customers
ON stores.id = customers.store_id
INNER JOIN customer_orders
ON customers.id = customer_orders.customer_id
INNER JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
INNER JOIN meal_items
ON customer_order_items.meal_item_id = meal_items.id
WHERE stores.id = 250
AND customer_orders.created >= '2021-02-03'
AND customer_orders.created < '2021-02-04';
When we run this query, we get the following results:
| id | label | price | customer_id |
|---|---|---|---|
| 5 | Stinky Tofu | 21.24 | 3769 |
| 17 | Chicken Sandwich | 18.37 | 11085 |
| … | … | … | … |
| 482 | Vegetable Soup | 4.50 | 3769 |
| 490 | Ribeye Steak | 20.10 | 1052 |
43 results, Execution Time: 102.283 ms
We get exactly the results we expect. Looking at the performance, we see that this query runs in only 102 milliseconds. This is slower than query #1 because we have significantly more customers than employees in our database, but still far faster than the 3264 milliseconds query #2 took to run.
Now we are in a situation where we retrieve the correct results, albeit split across two queries. Even so, the combined runtime of query #1 (only employee meal items) and query #3 (only customer meal items) is more than 30 times faster than that of query #2 (both employee and customer meal items through multi-branch joins). All we need to do is merge the results of these two queries, and SQL has an operation that lets us do exactly that while preserving this speed.
Preserving performance through UNION
The `UNION` operation allows us to merge the results of two queries. Since we know that query #1 and query #3 are each significantly faster than query #2, we expect the result of merging them with `UNION` to be fast as well.
We use both query #1 and query #3 nearly verbatim in what will be our new combined query. Since the `UNION` operation requires that the results of each query contain the same columns, we have to include a `NULL` placeholder column for whichever of `employee_id` or `customer_id` a given side of the `UNION` does not retrieve.
One other thing the `UNION` operation does is deduplicate rows in the result set. Since we don’t need deduplication, we can use `UNION ALL` to tell the database engine to skip that step, which yields a performance boost on larger data sets.
The resulting query is as follows:
Query #4 - Retrieving both employee and customer meal items using UNION
SELECT -- employees query
meal_items.*,
employee_markouts.employee_id,
null as customer_id
FROM stores
INNER JOIN employees
ON stores.id = employees.store_id
INNER JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
INNER JOIN meal_items
ON employee_markouts.meal_item_id = meal_items.id
WHERE stores.id = 250
AND employee_markouts.created >= '2021-02-03'
AND employee_markouts.created < '2021-02-04'
UNION ALL
SELECT -- customers query
meal_items.*,
null as employee_id,
customer_orders.customer_id
FROM stores
INNER JOIN customers
ON stores.id = customers.store_id
INNER JOIN customer_orders
ON customers.id = customer_orders.customer_id
INNER JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
INNER JOIN meal_items
ON customer_order_items.meal_item_id = meal_items.id
WHERE stores.id = 250
AND customer_orders.created >= '2021-02-03'
AND customer_orders.created < '2021-02-04';
Given what we’ve seen above, we expect 45 results from this query. Two for employees, and 43 for customers. Running the query gives the following results:
| id | label | price | employee_id | customer_id |
|---|---|---|---|---|
| 173 | Pizza | 3.73 | 3737 | |
| 339 | Tuna Sashimi | 21.41 | 3737 | |
| … | … | … | … | … |
| 403 | Ricotta Stuffed Ravioli | 11.09 | | 17910 |
| 386 | Tacos | 11.02 | | 17910 |
45 results, Execution Time: 112.309 ms
We get exactly the results we expect, in a blazing fast 112 milliseconds. This is now a single query that gives us the same results as query #2, but does so approximately 30 times faster. Using `UNION` here costs us virtually nothing in terms of performance: the total time is essentially just the sum of the two underlying queries.
It’s worth noting that the results of the above query are ordered differently than those of our original query, which were ordered by the `id` column. This is because the `UNION` operation appends rows in the order that it runs each underlying query (which is also why we get the employee meal items first). If we need the order to match, we can wrap query #4 in a very simple outer `SELECT` that orders the results by `id`:
Query #5 - Retrieving both employee and customer meal items using UNION, ordered by id
SELECT * FROM (
-- ... Query #4 from above, omitted for brevity
) results
ORDER BY id;
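The same wrapper pattern can be sketched with a toy example using Python’s `sqlite3` module; the two tables here are hypothetical stand-ins for the two halves of query #4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE markout_items (id INTEGER)")
conn.execute("CREATE TABLE order_items (id INTEGER)")
conn.executemany("INSERT INTO markout_items VALUES (?)", [(173,), (339,)])
conn.executemany("INSERT INTO order_items VALUES (?)", [(5,), (490,)])

# Wrapping the UNION ALL in an outer SELECT lets us apply a single
# ORDER BY to the merged result set, regardless of which branch
# each row came from.
rows = conn.execute("""
SELECT * FROM (
    SELECT id FROM markout_items
    UNION ALL
    SELECT id FROM order_items
) results
ORDER BY id;
""").fetchall()

print([row[0] for row in rows])  # [5, 173, 339, 490]
```

Sorting 45 rows is trivial work for the database, which is why the wrapper adds almost nothing to the execution time below.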
Which gives us:
| id | label | price | employee_id | customer_id |
|---|---|---|---|---|
| 5 | Stinky Tofu | 21.24 | | 3769 |
| 17 | Chicken Sandwich | 18.37 | | 11085 |
| … | … | … | … | … |
| 173 | Pizza | 3.73 | 3737 | |
| … | … | … | … | … |
| 490 | Ribeye Steak | 20.10 | | 1052 |
45 results, Execution Time: 113.340 ms
Query #5 gives us exactly the same results in the same order as query #2, but with a 2,880% increase in performance. This is an outstanding improvement, and query #5 is now performant enough to be used in virtually any application.
Conclusion
There are many ways to write a SQL query to retrieve a given set of results. Most database engines are great at creating performant query plans, but certain features within a query can derail the query planner and result in a very slow query. In this post, we covered a common scenario that results in poor query performance: using `OR` to combine multiple branches of joins in a single query.
Arriving at query #2 to get the combined results was the intuitive way of thinking through the problem, and something that someone with intermediate or advanced SQL skills could come up with. However, once we realized that performance was bad, we applied the following steps to find a solution:
- We focused on writing only simple, well-performing queries that each retrieved a different portion of our desired results.
- We merged the results using SQL’s `UNION` operation.
- We ensured the ordering was identical using a simple wrapper query.
This technique can be applied in many situations where query performance is poor due to this type of diamond-shaped branching and merging. When working on production software systems, we often see performance bottlenecks caused by slow queries disappear when the queries are rewritten in this manner. In many cases, the performance improvement is so dramatic that it eliminates the need to cache query results in systems like Redis, reducing system complexity in addition to improving performance.
SQL’s `UNION` operation is not usually thought of as a means to boost performance. However, in many cases it can dramatically speed up queries by allowing an otherwise complex query to be split into several faster, simpler queries whose results are then merged together. Recognizing when `UNION` can be applied takes some practice, but once you are aware of this technique, you can look for situations where a performance bottleneck can be removed through this approach.
Ben Levy and Christian Charukiewicz are Partners and Principal Software Engineers at Foxhound Systems. At Foxhound Systems, we focus on building fast and reliable custom software. Are you facing a performance issue or looking for help with something you’re working on? Reach out to us at info@foxhound.systems.