A photograph a baby elephant walking away from the camera, towards a body of water with which has an adult elephant bathing in it.

Essential elements of high performance applications

Offloading work to the SQL database

July 25, 2022

Christian Charukiewicz Partner & Principal Software Engineer

Ben Levy Partner & Principal Software Engineer

performance-optimization sql essential-elements-of-high-performance

In the previous post in this series we discussed the importance of proper use of indexes for performance when defining your SQL database. Creating indexes is a simple but essential part of application performance. However, once your database schema is created and indexes are employed, another key aspect of building a fast application is effectively leaning on the database in order to quickly execute work and perform computations. Doing so effectively will often reduce the total amount of work that your application servers as well as your database have to perform. What this looks like in practice is writing SQL queries that use the full breadth of features available.

Joins and Subqueries

One of the simplest examples of this is knowing how and when to employ JOIN queries, as well as understanding the differences between different types of JOINs and how they can impact performance.

For example, a common practice that hurts application performance is employing “N+1” queries. We’ve written about this as well as other approaches for retrieving data in the beginning of our post on grouping query results with Haskell. In short, code structured in this manner will run one query for each item in a list of other items retrieved in a preceding query.

Here’s an example of code that does this, where we need to query both a posts table and a users table to determine the total number of posts made by each author that has recently made a post:

# Runs a SQL query to retrieve recent posts
posts = selectAllRecentPosts()
postCounts = {}
for post in posts:
    # Runs a SQL query to count of posts for a given user
    postCounts[post.authorId] = countPostsForAuthor(post.authorId)

Where selectAllRecentPosts() may run a query like:

SELECT *
FROM posts
WHERE created >= NOW() - INTERVAL '1' DAY;

And countPostsForAuthor(123) runs a query like:

SELECT COUNT(DISTINCT posts.id)
FROM posts
WHERE posts.author_id = 123;

These queries are both simple, and it’s easy to make them both take advantage of indexes in order to ensure they run quickly. However, the issue with the N+1 approach is that the second query has to run for each post retrieved by the first query. This means that a list of 20 posts will run 21 (or 20+1) queries.

Not only does this require waiting for the database to execute 21 queries (a number that will scale as the number of recently created posts goes up), but this also requires 21 separate network calls between the application and the database. It is common practice to have database server hardware that is separate from the application server, so network latency of 7 milliseconds per query will add 147 milliseconds to our request in network latency alone. If we had 150 recently created posts, this network latency would balloon to a whopping 1,057 milliseconds, or over 1 second of just waiting for the application to send and retrieve data from the database.

With this in mind, we want to offload this work to the database while also reducing the number of database calls required. So instead of the code we saw earlier, we’re going to move all of the work to the database and change our code to invoke the following query:

SELECT recent_authors.id, COUNT(posts.id)
FROM (
    SELECT DISTINCT users.*
    FROM users
    INNER JOIN posts
    ON users.id = posts.author_id
    WHERE posts.created >= NOW() - INTERVAL '1' DAY
) AS recent_authors
INNER JOIN posts
ON recent_authors.id = posts.author_id
GROUP BY recent_authors.id;

This query is more complicated than either one of the preceding queries on their own. It employs an INNER JOIN to retrieve users as well as the associated posts table data based. It also uses a subquery to determine which authors have been recently active. Using this approach, we are certain that we will always run one query. Moreover, our code will have been simplified as well, as we were able to eliminate the for loop entirely and just invoke this query directly.

This example of replacing two separate queries with a single more complex query that employs a JOIN and a subquery illustrates our general point—let the database do the work.

One other important point to make here is that the type of JOIN we are using here is very deliberate. An INNER JOIN will join together two tables and only keep rows that have matches on both the left and right side of the join. In our case, that means every result must include both a users table entry and a posts table entry. This is exactly what we want, since we’re only interested in display posts counts for users that have recently made a post.

If, by contrast, we wanted to generate a report that included the name of each user as well as how many posts they’ve authored, including users that did not author any posts, we might write a query like this:

SELECT users.id, COUNT(DISTINCT posts.id)
FROM users
LEFT JOIN posts
ON users.id = posts.authorId
GROUP BY users.id;

Here, the LEFT JOIN (which can also be called and written as LEFT OUTER JOIN) will change the way the query runs to keep results from the users table even if there are no matching posts. This is a requirement given that we are interested in displaying a 0 count for users that have never authored anything.

Different types of joins have different behaviors and performance implications. INNER and LEFT joins are the two most common types you are likely to encounter in most applications, but there are also FULL, RIGHT, and CROSS joins that you may encounter in certain situations.

SQL databases aren’t slow, and in a properly indexed database with sensibly written queries, leaning into the database will result in better system performance than attempting to write simpler queries and then instead performing the same work in the application.

Cloudtrellis

A new service built by Foxhound Systems Discover problems with your website before your users do

Cloudtrellis scans your entire site for broken links, accessibility issues, and SEO errors to ensure a flawless user experience.

Detect error pages, broken links, accessibility issues, and SEO problems
Create scans with tailored configurations for each website and subdomain you manage
Schedule scans to run monthly, weekly, or even daily to closely monitor for new issues
Get notified of new scan results via email
Share scan results with your team via direct link

Learn more

Grouping and Aggregations

Another way to lean on our database is to use it for calculating aggregates. In simple terms, aggregation is the computation of a single value from many values (or many rows, in the case of SQL aggregations). Examples of aggregate functions supported by most databases include SUM, AVG, MIN, MAX. Perhaps the most commonly used aggregation function is COUNT. Aggregations will almost always be used alongside a GROUP BY statement, which defines how to aggregate the rows for a given set of results.

We employed this manner of grouping above in order to count the number of posts each user had created in our specified search interval. In the outer query, we indicated that we wanted to GROUP BY users.id, which causes all rows with a given user ID to “collapse” into a single row, and any aggregation functions—the COUNT function in our case—to treat all rows in each group as an input to the respective calculation. In our query, every distinct posts.id for each users.id was counted, giving us the number of posts each user created.

Grouping can be somewhat confusing to people first learning SQL, since when the aggregation function is removed, we can’t typically “see” the results of the GROUP BY operation. Different database engines will handle this situation slightly differently, but generally speaking the effect of grouping isn’t visible until some aggregation function is applied.

Window Functions

Above, we mentioned how grouping causes rows to “collapse” into a single row, specified by whatever column(s) should be used as the group identifier. Window functions provide another way to apply aggregations to subsets of rows in a given set of results, but unlike with grouping, using window functions allows rows in a given grouping (or “partition”) to maintain their own identity and continue to appear in the result set.

Basic window function example

This behavior of window functions is best illustrated through an example. Suppose we have a post_metrics table that contains metrics about the performance of each post in our blog. We query the table but use a window function in order to see view count metrics about each individual post, as well as the average post performance for each author.

SELECT
    author,
    post_title,
    view_count,
    AVG(view_count) OVER (PARTITION BY author) AS average_view_count
FROM
    post_metrics;

Which would give us a set of results like:

author	post_title	view_count	average_view_count
John Smith	SQL basics, pt 1	1012	772
John Smith	SQL basics, pt 2	748	772
John Smith	SQL basics, pt 3	556	772
Michael Walters	Use Python for your next project	972	1660.5
Michael Walters	I was wrong, use Rust for your next project	2349	1660.5

As we can see above, the benefit of using a window function over a GROUP BY statement is that we preserve the presence of each row in the result set. If we were to GROUP BY author (as opposed to applying the window over PARTITION BY author), we would still be able to see the average_view_count. However, depending on database engine, inferring information about the post_title and view_count columns would be either impossible due to error (such as in PostgreSQL, which would not allow us to include these non-aggregated and non-grouped columns in the query) or nonsensical (such as in MySQL, which would display an arbitrary single value from the group).

More advanced window function example

Let’s look at another application of window functions that really shows their strength. Suppose that we’re writing banking software and want to create an account statement that includes account activity, such as the list of all debits and credits that most banks make visible in their web applications. Let’s assume our database already has a charges table that includes each debit or credit, including the id of the account, the value of the change, and the date with the following data:

account_id	value	date
1	20	2022-07-12
2	95	2022-07-12
2	40	2022-07-13
1	35	2022-07-13
1	50	2022-07-14
1	-15	2022-07-15
2	-135	2022-07-15

Generating a statement that includes activity requires indicating whether each charge is a debit or a credit. In our case we’ll treat positive values as a debit and negative charges as a credit. The activity view also requires displaying the account balance (the running total) along with each charge. Fortunately, we can use a window function to achieve this.

SELECT
    account_id,
    abs(value) AS charge_amount,
    CASE WHEN value > 0 THEN 'CREDIT' ELSE 'DEBIT' END AS charge_kind,
    SUM(value) OVER (
        PARTITION BY account_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS balance,
    created_at AS transaction_date
FROM charges;

This produces:

account_id	charge_amount	charge_kind	balance	transaction_date
1	20	CREDIT	20	2022-07-12
1	35	CREDIT	55	2022-07-13
1	50	CREDIT	105	2022-07-14
1	15	DEBIT	90	2022-07-15
2	95	CREDIT	95	2022-07-12
2	40	CREDIT	135	2022-07-13
2	135	DEBIT	0	2022-07-15

Like our first example, this query partitions the data, in this case by the account_id. However, then the window function applies both an ordering as well as indicates which rows should be used in the calculation. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW tell the database to use only rows up to and including the current row in order to calculate the sum of the charges. Without specifying this, we would not see a running total in each row, but the final balance amount repeated for every row in each partition, similar to how we saw the average_view_count repeated in the previous example.

In practice, we may want to limit the data retrieval to one account at a time. In order to do that, we’d simple add a condition like WHERE charges.account_id = 1 to the end of the query. The window function would still generate the balance column for just the one account.

When building our application, we can utilize window functions to perform these types of operations directly in the query, precluding the need to make an extra pass over the result set in our application code. Without the use of a window function, if we wanted to preserve individual post metrics in our result set, we would either need to write a second query to calculate the average view count or make a pass through the result set in our application code in order to compute the average for each author.

Set Operations: Union, Intersect, and Except

Another way of maximizing the workload of the database is utilizing UNION, which enables two otherwise distinct queries that return the same column set to be executed in a single query. UNION can be used to reduce the number of distinct but similar calls to the database, reducing the round trip time that we highlighted earlier. UNION can also be used as an optimization technique , enabling us to rewrite queries in a way that improves their performance while making them easier to understand. In the linked post, we walk through an in-depth example of a large query joining many tables with complex join conditions that was split into two much simpler queries and the result was combined using UNION.

The general structure of a UNION query is:

SELECT
    col_1, col_2, col_3
FROM
    some_table
INNER JOIN
    some_other_table
...
UNION
SELECT
    col_1, col_2, col_3
FROM
    some_table
LEFT JOIN
    another_table
...

The result of a query shaped like the one above would be a single set of rows with the col_1, col_2, and col_3 columns. It does not matter which tables either side of the UNION performs a SELECT on, so long as the set of columns returned on each side are named the same and returned in the same order.

The UNION operation has the benefit of automatically deduplicating rows. If deduplication is unnecessary or undesirable, the UNION ALL operation should be used instead. For larger data sets, the performance of UNION ALL may be better than UNION as the deduplication can add to the query execution time.

In addition to UNION, there are also INTERSECT and EXCEPT operations that can be performed. INTERSECT will only include rows that appear in the result of both queries, whereas EXCEPT will include all rows from the left query other than ones that also appear in the right query. Both INTERSECT and EXCEPT will deduplicate rows unless the ALL keyword is used as a suffix. These operations are useful for filtering data when a single SELECT query cannot be used to define the conditions while achieving a well performing query, or when readability of the single-query form suffers.

All three operations can be used to both reduce the number of distinct queries that need to be sent to the database as well as reduce the workload of the application servers. For example, there’s no need for merging and deduplicating the result of two separate queries in your code when the SQL database has already done the work by the time the query results are returned.

Wrap Up

In this post we discussed some of the primary operations that SQL databases enable us to write queries that not only retrieve data but also perform complex computations. Effectively utilizing features like subqueries, aggregations, and set operations allows us to reduce the amount of chatter with the database, cuts down on the amount of data sent over the network, and can altogether eliminate computations that both our database and application servers need to perform. Applying these techniques effectively is absolutely critical to a highly performant application.