Speeding up SQL queries by orders of magnitude using UNION

SQL is a very powerful tool for querying data. It allows you to write queries against your relational data in a declarative manner, letting you describe what data you want to retrieve without having to specify how to retrieve it. In most cases, this works very well, and the query optimizer in many database engines (MySQL, PostgreSQL, etc.) will create an efficient query plan.

Efficient query plans rely on a schema that uses appropriate data types, especially for primary key columns, where doing things such as misusing VARCHAR can kill performance. Another critical element of enabling fast query plans is appropriately indexing columns, which eliminates the need to perform full table scans when retrieving data. Unfortunately, even following these schema rules, it’s possible to write SQL queries that have surprisingly poor performance, often leading to the bewilderment of the developer writing such a query. Perhaps the most surprising aspect of this type of query is that it is often written in the most intuitive way to describe the data.

A performance trap: the diamond-shaped schema

One of the most common cases where SQL query performance can degrade significantly is a diamond-shaped schema, in which there are multiple ways of joining two tables together. In such a schema, a query is likely to use OR to join tables in more than one way, which eliminates the optimizer’s ability to create an efficient query plan. This scenario is best illustrated through an example.

Imagine we have the following schema for a chain of retail stores that sell food and drink. The table layout is as follows:

                                stores
                                +---------+------+
     customers            +---->| id      | int  |<----------------+
     +----------+------+  |     | address | text |                 |
+--->| id       | int  |  |     +---------+------+                 |
|    | name     | text |  |                                        |
|    | store_id | int  +--+                   employees            |
|    +----------+------+                      +----------+------+  |
|                                      +----->| id       | int  |  |
|                                      |      | name     | text |  |
|  customer_orders                     |      | role     | text |  |
|  +-------------+-----------+         |      | store_id | int  +--+
|  | id          | int       |<--+     |      +----------+------+
+--+ customer_id | int       |   |     |
   | created     | timestamp |   |     |
   +-------------+-----------+   |     |  employee_markouts
                                 |     |  +--------------+-----------+
                                 |     |  | id           | int       |
    customer_order_items         |     +--+ employee_id  | int       |
    +-------------------+-----+  |        | meal_item_id | int       +--+
    | id                | int |  |        | created      | timestamp |  |
    | customer_order_id | int +--+        +--------------+-----------+  |
 +--+ meal_item_id      | int |                                         |
 |  +-------------------+-----+                                         |
 |                                 meal_items                           |
 |                                 +-------+------+                     |
 +-------------------------------->| id    | int  |<--------------------+
                                   | label | text |
                                   | price | int  |
                                   +-------+------+
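In SQL terms, the diagram corresponds to something like the following DDL. This is a sketch: the column types come from the diagram, while the foreign key constraints are our own phrasing of the arrows.

```sql
-- A sketch of the schema above as PostgreSQL DDL.
CREATE TABLE stores (
  id      int PRIMARY KEY,
  address text
);

CREATE TABLE meal_items (
  id    int PRIMARY KEY,
  label text,
  price int
);

CREATE TABLE employees (
  id       int PRIMARY KEY,
  name     text,
  role     text,
  store_id int REFERENCES stores (id)
);

CREATE TABLE employee_markouts (
  id           int PRIMARY KEY,
  employee_id  int REFERENCES employees (id),
  meal_item_id int REFERENCES meal_items (id),
  created      timestamp
);

CREATE TABLE customers (
  id       int PRIMARY KEY,
  name     text,
  store_id int REFERENCES stores (id)
);

CREATE TABLE customer_orders (
  id          int PRIMARY KEY,
  customer_id int REFERENCES customers (id),
  created     timestamp
);

CREATE TABLE customer_order_items (
  id                int PRIMARY KEY,
  customer_order_id int REFERENCES customer_orders (id),
  meal_item_id      int REFERENCES meal_items (id)
);
```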

Here are a few key features of this schema:

- The employees and customers tables each belong to a store through a store_id column.
- Employees remove inventory from a store via employee_markouts, which references employees and meal_items directly.
- Customers remove inventory from a store via customer_orders and customer_order_items, which ultimately reference meal_items.
- Both branches start at stores and end at meal_items, giving the schema its diamond shape: there are two distinct join paths between those two tables.

For the purposes of our testing, we’ll be deploying this schema with proper indexing on a PostgreSQL 12.6 database with the following number of records in each table:

Table                | Number of Records
stores               | 800
employees            | 20,000
employee_markouts    | 25,000
customers            | 20,000
customer_orders      | 100,000
customer_order_items | 550,482
meal_items           | 500

All of the orders and markouts are randomly distributed amongst the customers and employees, respectively. Employees and customers are also randomly distributed across stores.
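Here, “proper indexing” means an index on each foreign key and timestamp column that the queries below join or filter on. A sketch (the index names are our own):

```sql
-- Indexes on the foreign key and timestamp columns used by the queries below
-- (index names are our own)
CREATE INDEX employees_store_id_idx            ON employees (store_id);
CREATE INDEX employee_markouts_employee_id_idx ON employee_markouts (employee_id);
CREATE INDEX employee_markouts_item_id_idx     ON employee_markouts (meal_item_id);
CREATE INDEX employee_markouts_created_idx     ON employee_markouts (created);
CREATE INDEX customers_store_id_idx            ON customers (store_id);
CREATE INDEX customer_orders_customer_id_idx   ON customer_orders (customer_id);
CREATE INDEX customer_orders_created_idx       ON customer_orders (created);
CREATE INDEX customer_order_items_order_id_idx ON customer_order_items (customer_order_id);
CREATE INDEX customer_order_items_item_id_idx  ON customer_order_items (meal_item_id);
```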

Handling a report request

In order to audit inventory, the logistics team at the corporate headquarters requests a tool that can generate a report containing all meal_items that left a given store’s inventory on a particular day. This requires a query that includes items that were both sold to customers as well as recorded as employee markouts for the specified store on the specified day.

To break this request down into more manageable segments, we’ll first retrieve all of the meal items that are a part of an employee markout created on the given day at the given store. Once we have this, we’ll expand it to include meal items that have been purchased by customers.

The query that retrieves only employee markout data starts at the stores table and joins the employee tables down to the meal_items table. This is a fairly straightforward query, and since the columns are indexed, we expect it to perform well.

Query #1 - Retrieving only employee meal items

SELECT meal_items.*, employee_markouts.employee_id
FROM stores
INNER JOIN employees
ON stores.id = employees.store_id
INNER JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
INNER JOIN meal_items
ON employee_markouts.meal_item_id = meal_items.id
WHERE stores.id = 250
AND employee_markouts.created >= '2021-02-03'
AND employee_markouts.created < '2021-02-04';

We get the following when we run this query:

id  | label        | price | employee_id
173 | Pizza        | 3.73  | 3737
339 | Tuna Sashimi | 21.41 | 3737

2 results, Execution Time: 1.499 ms

This query gives us the data we’re looking for and runs in a blazing fast 1.499 milliseconds, the excellent performance we expected. The problem is that we’re not done yet: we also need to retrieve the meal items that are part of customer orders. To do this, we’ll modify the above query in the following ways:

- Convert each INNER JOIN to a LEFT JOIN so that rows from one branch aren’t discarded for lacking a match in the other.
- Add a second branch of joins through the customers, customer_orders, and customer_order_items tables.
- Join both branches into meal_items with an OR condition.
- Filter on the created date of either employee_markouts or customer_orders.
- Add a GROUP BY clause to collapse the duplicate rows that the multi-branch join produces.

Our new query with these changes incorporated is below:

Query #2 - Retrieving both employee and customer meal items using multi-branch joins

SELECT
  meal_items.*,
  employee_markouts.employee_id,
  customer_orders.customer_id
FROM stores
-- employees branch
LEFT JOIN employees
ON stores.id = employees.store_id
LEFT JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
-- customers branch
LEFT JOIN customers
ON (stores.id = customers.store_id AND employee_markouts.id IS null)
LEFT JOIN customer_orders
ON customers.id = customer_orders.customer_id
LEFT JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
-- join both branches into meal_items
LEFT JOIN meal_items
ON (customer_order_items.meal_item_id = meal_items.id
    OR employee_markouts.meal_item_id = meal_items.id
   )
WHERE stores.id = 250
AND meal_items.id IS NOT null
AND ( employee_markouts.created >= '2021-02-03' AND employee_markouts.created < '2021-02-04'
      OR customer_orders.created >= '2021-02-03' AND customer_orders.created < '2021-02-04'
    )
GROUP BY meal_items.id, employee_markouts.id, customer_orders.id, customer_order_items.id;

Running query #2 gives the following results (abridged for brevity):

id  | label            | price | employee_id | customer_id
5   | Stinky Tofu      | 21.24 |             | 3769
17  | Chicken Sandwich | 18.37 |             | 11085
25  | Kebab            | 16.30 |             | 3769
173 | Pizza            | 3.73  | 3737        |
339 | Tuna Sashimi     | 21.41 | 3737        |
490 | Ribeye Steak     | 20.10 |             | 1052

45 results, Execution Time: 3264.547 ms

Of the 45 results, 43 represent meal items purchased by customers, and we continue to see the two meal items from the previous query that come from employee markouts. Unfortunately, the performance of this multi-branch query is far worse than before. The sub-2-millisecond query from before has ballooned into a sluggish 3,264 milliseconds.

This performance may be acceptable for a one-off query, but for any other use case, an execution time of more than three seconds is very poor given the relatively small amount of data in our database, which contains only about 750,000 rows in total. If we were dealing with row counts in the tens or hundreds of millions, the runtime of our report would likely stretch into the tens of seconds. That is an unacceptable amount of time to make our end users wait, especially if they need to run multiple reports, so we need to find a way to achieve better performance.

Sticking to fast and simple queries

After running each of our two queries, it’s apparent that attempting to retrieve meal item counts for both employees and customers in a single query in the way that we wrote query #2 resulted in a significant degradation in performance. Even without being familiar with the specifics of the query plan (which we can see by re-running the query prefixed with EXPLAIN or EXPLAIN ANALYZE), we can try to stick to using simpler queries that we think will have better performance, and see whether there’s a better way to compose the results.
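For example, prefixing query #2 with EXPLAIN ANALYZE makes PostgreSQL execute the query and print the plan it chose, with actual row counts and timings for each plan node:

```sql
-- Execute query #2 and report the chosen plan with per-node timings
EXPLAIN ANALYZE
SELECT
  meal_items.*,
  employee_markouts.employee_id,
  customer_orders.customer_id
FROM stores
-- ... the remainder of query #2, unchanged
```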

Query #1 retrieved only the meal items that are part of employee markouts, and it performed extremely well. Let’s try writing the query that retrieves only the meal items that are part of customer orders and examine its performance. Like the query for employee data, this query will join the tables between the stores and meal_items tables, but will instead do so through the customer tables. There are three customer-specific tables rather than two employee-specific ones, but otherwise this query is very similar to the first:

Query #3 - Retrieving only customer meal items

SELECT meal_items.*, customer_orders.customer_id
FROM stores
INNER JOIN customers
ON stores.id = customers.store_id
INNER JOIN customer_orders
ON customers.id = customer_orders.customer_id
INNER JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
INNER JOIN meal_items
ON customer_order_items.meal_item_id = meal_items.id
WHERE stores.id = 250
AND customer_orders.created >= '2021-02-03'
AND customer_orders.created < '2021-02-04';

When we run this query, we get the following results:

id  | label            | price | customer_id
5   | Stinky Tofu      | 21.24 | 3769
17  | Chicken Sandwich | 18.37 | 11085
482 | Vegetable Soup   | 4.50  | 3769
490 | Ribeye Steak     | 20.10 | 1052

43 results, Execution Time: 102.283 ms

We get exactly the results we expect. Looking at the performance, we see that this query runs in only 102 milliseconds. This is slower than query #1 because we have significantly more customers than employees in our database, but it is still far faster than the 3,264 milliseconds query #2 took to run.

Now we are in a situation where we retrieve the correct results, albeit split across two queries. Even so, the combined runtime of query #1 (only employee meal items) and query #3 (only customer meal items) is more than 30 times shorter than that of query #2 (both employee and customer meal items through multi-branch joins). All we need to do is merge the results of these queries. The good news is that SQL has an operation that lets us do exactly this while preserving this speed.

Preserving performance through UNION

The UNION operation allows us to merge the results of two queries. Since we know that query #1 and query #3 are each significantly faster than query #2, we would expect that the results of the UNION operation will be fast as well.

We use both query #1 and query #3 nearly verbatim in what will be our new combined query. Since the UNION operation requires that the results of each query contain the same columns, we have to include a NULL placeholder column for whichever type of data (either employee_id or customer_id) the given side of the UNION will not retrieve.

One other thing that the UNION operation does is deduplicate rows in the result set. Since we don’t care about deduplication, we can use UNION ALL to tell the database engine that it can skip the deduplication step. This results in a performance boost with larger data sets.
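A toy example shows the difference between the two (both statements are valid PostgreSQL):

```sql
SELECT 1 AS n UNION     SELECT 1;  -- UNION deduplicates: returns one row
SELECT 1 AS n UNION ALL SELECT 1;  -- UNION ALL concatenates: returns two rows
```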

The resulting query is as follows:

Query #4 - Retrieving both employee and customer meal items using UNION

SELECT -- employees query
  meal_items.*,
  employee_markouts.employee_id,
  null as customer_id
FROM stores
INNER JOIN employees
ON stores.id = employees.store_id
INNER JOIN employee_markouts
ON employees.id = employee_markouts.employee_id
INNER JOIN meal_items
ON employee_markouts.meal_item_id = meal_items.id
WHERE stores.id = 250
AND employee_markouts.created >= '2021-02-03'
AND employee_markouts.created < '2021-02-04'
UNION ALL
SELECT -- customers query
  meal_items.*,
  null as employee_id,
  customer_orders.customer_id
FROM stores
INNER JOIN customers
ON stores.id = customers.store_id
INNER JOIN customer_orders
ON customers.id = customer_orders.customer_id
INNER JOIN customer_order_items
ON customer_orders.id = customer_order_items.customer_order_id
INNER JOIN meal_items
ON customer_order_items.meal_item_id = meal_items.id
WHERE stores.id = 250
AND customer_orders.created >= '2021-02-03'
AND customer_orders.created < '2021-02-04';

Given what we’ve seen above, we expect 45 results from this query. Two for employees, and 43 for customers. Running the query gives the following results:

id  | label                   | price | employee_id | customer_id
173 | Pizza                   | 3.73  | 3737        |
339 | Tuna Sashimi            | 21.41 | 3737        |
403 | Ricotta Stuffed Ravioli | 11.09 |             | 17910
386 | Tacos                   | 11.02 |             | 17910

45 results, Execution Time: 112.309 ms

We get exactly the same results we expect, in a blazing fast 112 milliseconds. This is now a single query that gives us the same results that query #2 gave us, but does so approximately 30 times faster. Using UNION here costs us virtually nothing in terms of performance. The time is essentially just the sum of the two underlying queries.

It’s worth noting that the results of the above query are ordered differently than our original query, which is ordered by the id column. This is because the UNION operation appends rows in the order that it runs each underlying query (which is also why we get the employee meal items first). If we need the order to match, we can achieve this by wrapping query #4 in a very simple SELECT operation that orders the results by id:

Query #5 - Retrieving both employee and customer meal items using UNION, ordered by id

SELECT * FROM (
   -- ... Query #4 from above, omitted for brevity
) results ORDER BY id;

Which gives us:

id  | label            | price | employee_id | customer_id
5   | Stinky Tofu      | 21.24 |             | 3769
17  | Chicken Sandwich | 18.37 |             | 11085
173 | Pizza            | 3.73  | 3737        |
490 | Ribeye Steak     | 20.10 |             | 1052

45 results, Execution Time: 113.340 ms

Query #5 gives us exactly the same results in the same order as query #2, but roughly 29 times faster. This is an outstanding improvement, and query #5 is now performant enough to be used in any application.

Conclusion

There are many ways to write a SQL query to retrieve a given set of results. Most database engines are great at creating performant query plans, but certain features within a query can derail the query planner and result in a very slow query. In this post, we covered a common scenario that results in poor query performance: using OR to combine multiple branches of joins in a single query.

Arriving at query #2 to get the combined results was the intuitive way of thinking through the problem, and something that someone with intermediate or advanced SQL skills could come up with. However, once we realized that performance was bad, we applied the following steps to find a solution:

  1. We focused on writing only simpler and well-performing queries that each gave different portions of our desired results.
  2. We merged the results using SQL’s UNION operation.
  3. We ensured ordering was identical using a simple wrapper query.

This technique can be applied in many situations where query performance is poor due to this type of diamond-shaped branching and merging. When working on production software systems, we often see performance bottlenecks caused by slow queries disappear when the queries are rewritten in this manner. In many cases, the performance improvement is so dramatic that it obviates the need to cache query results in systems like Redis, resulting in less system complexity in addition to better performance.

SQL’s UNION operation is not usually thought of as a means to boost performance. However, in many cases it can dramatically speed queries up by enabling an otherwise complex query to be split into several faster and simpler queries that are then merged together. Recognizing when UNION can be applied takes some practice, but once you’re aware of the technique, you’ll start spotting situations where it can remove a performance bottleneck.


Ben Levy and Christian Charukiewicz are Partners and Principal Software Engineers at Foxhound Systems. At Foxhound Systems, we focus on building fast and reliable custom software. Are you facing a performance issue or looking for help with something you’re working on? Reach out to us at info@foxhound.systems.