Speeding up SQL queries by orders of magnitude using UNION
SQL is a very powerful tool for querying data. It allows you to write queries against your relational data in a declarative manner, letting you describe what data that you want to retrieve without having to describe how to retrieve it. In most cases, this works very well, and the query optimizer in many database engines (MySQL, PostgreSQL, etc.) will create an efficient query plan.
Efficient query plans rely on a schema that uses appropriate data types, especially for primary key columns, where doing things such as misusing
VARCHAR can kill performance. Another critical element of enabling fast query plans is appropriately indexing columns, which eliminates the need to perform full table scans when retrieving data. Unfortunately, even following these schema rules, it’s possible to write SQL queries that have surprisingly poor performance, often leading to the bewilderment of the developer writing such a query. Perhaps the most surprising aspect of this type of query is that it is often written in the most intuitive way to describe the data.
A performance trap: the diamond-shaped schema
One of the most common cases where SQL query performance can degrade significantly is in a diamond shaped schema, where there are multiple ways of joining two tables together. In such a schema, a query is likely to use
OR to join tables in more than one way, which eliminates the optimizer’s ability to create an efficient query plan. This scenario is best illustrated through an example.
Imagine we have the following schema for a chain of retail stores that sell food and drink. The table layout is as follows:
stores +---------+------+ customers +---->| id | int |<----------------+ +----------+------+ | | address | text | | +--->| id | int | | +---------+------+ | | | name | text | | | | | store_id | int +--+ employees | | +----------+------+ +----------+------+ | | +----->| id | int | | | | | name | text | | | customer_orders | | role | text | | | +-------------+-----------+ | | store_id | int +--+ | | id | int |<--+ | +----------+------+ +--+ customer_id | int | | | | created | timestamp | | | +-------------+-----------+ | | employee_markouts | | +--------------+-----------+ | | | id | int | customer_order_items | +--+ employee_id | int | +-------------------+-----+ | | meal_item_id | int +--+ | id | int | | | created | timestamp | | | customer_order_id | int +--+ +--------------+-----------+ | +--+ meal_item_id | int | | | +-------------------+-----+ | | meal_items | | +-------+------+ | +-------------------------------->| id | int |<--------------------+ | label | text | | price | int | +-------+------+
Here are a few key features of this schema:
- Customers place
customer_orderswhich consist of one or several
- Employees periodically get free items for their employment, which are recorded as
meal_items, which include the labels and prices of the food items sold.
For the purposes of our testing, we’ll be deploying this schema with proper indexing on a PostgreSQL 12.6 database with the following number of records in each table:
|Table||Number of Records|
All of the orders and markouts are randomly distributed amongst the customers and employees, respectively. Employees and customers are also randomly distributed across stores.
Handling a report request
In order to audit inventory, the logistics team at the corporate headquarters requests a tool that can generate a report containing all
meal_items that left a given store’s inventory on a particular day. This requires a query that includes items that were both sold to customers as well as recorded as employee markouts for the specified store on the specified day.
To break this request down into more manageable segments, we’ll first retrieve all of the meal items that are a part of an employee markout created on the given day at the given store. Once we have this, we’ll expand it to include meal items that have been purchased by customers.
The query that retrieves only employee markout data starts at the
stores table and joins the employee tables down to the
meal_items table. This is a fairly straightforward query, and since the columns are indexed, we expect it to perform well.
Query #1 - Retrieving only employee meal items
SELECT meal_items.*, employee_markouts.employee_id FROM stores INNER JOIN employees ON stores.id = employees.store_id INNER JOIN employee_markouts ON employees.id = employee_markouts.employee_id INNER JOIN meal_items ON employee_markouts.meal_item_id = meal_items.id WHERE stores.id = 250 AND employee_markouts.created >= '2021-02-03' AND employee_markouts.created < '2021-02-04';
We get the following when we run this query:
2 results, Execution Time: 1.499 ms
This query gives us the data we’re looking for and runs in a blazing fast 1.499 milliseconds—the excellent performance we expected. The problem is that we’re not done yet, we also need to retrieve the meal items that are a part of customer orders. In order to do this, we’ll modify the above query in the following ways:
- We’ll include a second branch of joins from
meal_itemsthrough the customers tables, updating the final join into the
meal_itemstable to use an
ORto merge both branches.
- Since we’re looking for meal items that are a part of either employee markouts or customer orders, we’ll convert all of our joins to be
LEFTjoins and add a condition in our customers branch to ignore employee markouts.
- We’ll also change the columns we’re selecting to include either one of the
customer_idthat the meal item belongs to.
Our new query with these changes incorporated is below:
Query #2 - Retrieving both employee and customer meal items using multi-branch joins
SELECT *, meal_items. employee_markouts.employee_id, customer_orders.customer_idFROM stores -- employees branch LEFT JOIN employees ON stores.id = employees.store_id LEFT JOIN employee_markouts ON employees.id = employee_markouts.employee_id -- customers branch LEFT JOIN customers ON (stores.id = customers.store_id AND employee_markouts.id IS null) LEFT JOIN customer_orders ON customers.id = customer_orders.customer_id LEFT JOIN customer_order_items ON customer_orders.id = customer_order_items.customer_order_id -- join both branches into meal_items LEFT JOIN meal_items ON (customer_order_items.meal_item_id = meal_items.id OR employee_markouts.meal_item_id = meal_items.id )WHERE stores.id = 250 AND meal_items.id IS NOT null AND ( employee_markouts.created >= '2021-02-03' AND employee_markouts.created < '2021-02-04' OR customer_orders.created >= '2021-02-03' AND customer_orders.created < '2021-02-04' )GROUP BY meal_items.id, employee_markouts.id, customer_orders.id, customer_order_items.id;
Running query #2 gives the following results (abridged for brevity):
45 results, Execution Time: 3264.547 ms
Of the 45 results, 43 represent meal items purchased by customers, and we continue to see the two meal items from the previous query that come from employee markouts. Unfortunately, the performance of this multi-branch query is far worse than before. The sub-2-millisecond query from before has ballooned into a sluggish 3,264 milliseconds.
This performance may be acceptable for a one-off query, but for any other use case, this execution time of more than three seconds is very poor given the relatively small amount of data in our database. Our database has roughly only 750,000 rows. If we were dealing with row counts in the tens or hundreds of millions, the performance of our report would likely be in the tens of seconds. This is an unacceptable amount of time to make our end users wait, especially if they need to run multiple reports, so we need to find a way to achieve better performance.
Sticking to fast and simple queries
After running each of our two queries, it’s apparent that attempting to retrieve meal item counts for both employees and customers in a single query in the way that we wrote query #2 resulted in a significant degradation in performance. Even without being familiar with the specifics of the query plan (which we can see by re-running the query prefixed with
EXPLAIN ANALYZE), we can try to stick to using simpler queries that we think will have better performance, and see whether there’s a better way to compose the results.
Query #1 retrieved meal items only a part of employee markouts and it performed extremely well. Let’s try writing the query to retrieve only the meal items that are a part of customer orders and examine its performance. Like the query for employee data, this query will join the tables between the
meal_items tables, but instead do so through the customer tables. There’s three customer-specific tables rather than two employee-specific tables, but otherwise this query is very similar to the first:
Query #3 - Retrieving only customer meal items
SELECT meal_items.*, customer_orders.customer_id FROM stores INNER JOIN customers ON stores.id = customers.store_id INNER JOIN customer_orders ON customers.id = customer_orders.customer_id INNER JOIN customer_order_items ON customer_orders.id = customer_order_items.customer_order_id INNER JOIN meal_items ON customer_order_items.meal_item_id = meal_items.id WHERE stores.id = 250 AND customer_orders.created >= '2021-02-03' AND customer_orders.created < '2021-02-04';
When we run this query, we get the following results:
43 results, Execution Time: 102.283 ms
We get exactly the results we expect. Looking at the performance, we see that this query runs in only 102 milliseconds. This is slower than query #1 because we have significantly more customers than employees in our database, but still far faster than the 3264 milliseconds query #2 took to run.
Now we are in a situation where we retrieve the correct results, albeit split across two queries. Despite this, the runtime of both query #1 (only employee meal items) and query #3 (only customer meal items) put together is more than 30 times faster than query #2 (both employee and customer meal items through multi-branch joins). All we need to do is merge the results of these queries. The good news is that SQL has an operation that will let us do this while preserving this speed.
Preserving performance through UNION
UNION operation allows us to merge the results of two queries. Since we know that query #1 and query #3 are each significantly faster than query #2, we would expect that the results of the
UNION operation will be fast as well.
We use both query #1 and query #3 nearly verbatim in what will be our new combined query. Since the
UNION operation requires that the results of each query contain the same columns, we have to include a
NULL placeholder column for whichever type of data (either
customer_id) the given side of the
UNION will not retrieve.
One other thing that the
UNION operation does is deduplicate rows in the result set. Since we don’t care about deduplication, we can use
UNION ALL to tell the database engine that it can skip the deduplication step. This results in a performance boost with larger data sets.
The resulting query is as follows:
Query #4 - Retrieving both employee and customer meal items using
SELECT -- employees query *, meal_items. employee_markouts.employee_id,null as customer_id FROM stores INNER JOIN employees ON stores.id = employees.store_id INNER JOIN employee_markouts ON employees.id = employee_markouts.employee_id INNER JOIN meal_items ON employee_markouts.meal_item_id = meal_items.id WHERE stores.id = 250 AND employee_markouts.created >= '2021-02-03' AND employee_markouts.created < '2021-02-04' UNION ALL SELECT -- customers query *, meal_items.null as employee_id, customer_orders.customer_idFROM stores INNER JOIN customers ON stores.id = customers.store_id INNER JOIN customer_orders ON customers.id = customer_orders.customer_id INNER JOIN customer_order_items ON customer_orders.id = customer_order_items.customer_order_id INNER JOIN meal_items ON customer_order_items.meal_item_id = meal_items.id WHERE stores.id = 250 AND customer_orders.created >= '2021-02-03' AND customer_orders.created < '2021-02-04';
Given what we’ve seen above, we expect 45 results from this query. Two for employees, and 43 for customers. Running the query gives the following results:
|403||Ricotta Stuffed Ravioli||11.09||17910|
45 results, Execution Time: 112.309 ms
We get exactly the same results we expect, in a blazing fast 112 milliseconds. This is now a single query that gives us the same results that query #2 gave us, but does so approximately 30 times faster. Using
UNION here costs us virtually nothing in terms of performance. The time is essentially just the sum of the two underlying queries.
It’s worth noting that the results of the above query are ordered differently than our original query, which is ordered by the
id column. This is because the
UNION operation appends rows in the order that it runs each underlying query (which is also why we get the employee meal items first). If we need the order to match, we can achieve this by wrapping query #4 in a very simple
SELECT operation that orders the results by
Query #5 - Retrieving both employee and customer meal items using
UNION, ordered by
SELECT * FROM ( -- ... Query #4 from above, omitted for brevity ORDER BY id;) results
Which gives us:
45 results, Execution Time: 113.340 ms
Query #5 gives us exactly the same results in the same order as query #2, but with a 2,880% increase in performance. This is an outstanding improvement, and is now performant enough as to where query #5 can be used in any application.
There are many ways to write a SQL query to retrieve a given set of results. Most database engines are great at creating performant query plans, but certain features within a query can derail the query planner and result in a very slow query. In this post, we covered a common scenario that results in poor query performance: using
OR to combine multiple branches of joins in a single query.
Arriving at query #2 to get the combined results was the intuitive way of thinking through the problem, and something that someone with intermediate or advanced SQL skills could come up with. However, once we realized that performance was bad, we applied the following steps to find a solution:
- We focused on writing only simpler and well-performing queries that each gave different portions of our desired results.
- We merged the results using SQL’s
- We ensured ordering was identical using a simple wrapper query.
This technique can be applied in many situations where query performance is poor due to this type of diamond-shaped branching and merging. When working on production software systems, we often see performance bottlenecks caused by slow queries removed when rewriting queries in this manner. In many cases, the performance improvement is so dramatic that it absolves the need to cache query results in systems like Redis, resulting in less system complexity in addition to better performance.
UNION operation is not usually thought of as a means to boost performance. However, in many cases it can dramatically speed queries up by enabling an otherwise complex query to be split into several faster and simpler queries that are then merged together. Recognizing when
UNION can be applied takes some practice, but once someone is aware of this technique, it’s possible to look for situations where a performance bottleneck can be removed through this approach.
Ben Levy and Christian Charukiewicz are Partners and Principal Software Engineers at Foxhound Systems. At Foxhound Systems, we focus on building fast and reliable custom software. Are you facing a performance issue or looking for help with something you’re working on? Reach out to us at email@example.com.