Essential elements of high performance applications: SQL indexes
There are many aspects that go into making a fast application. Web application performance is a broad topic because there are numerous concerns in making a page load quickly or a button click feel responsive. One of the difficulties developers must grapple with in pursuit of performance is that any one of these facets can become a bottleneck if overlooked.
Building fast web applications requires a comprehensive understanding and examination of the entire system. In this post, we kickoff a series that covers essential elements that go into building high performance web applications. Throughout this series, we are going to discuss the performance of web applications written in a high level language (such as PHP or Python), backed by a SQL database, and where the frontend interacts with the back end through HTTP requests that download HTML, JSON, or a combination of both, since this structure is the most common one found today.
One of the biggest boons to performance in a web application is effective use of indexes. An index improves the performance of queries by using a lookup table to answer a query instead of scanning all of the data in the underlying table. The role of an index in SQL is similar to the role of the index in the back of a book—instead of scanning the entire book for a particular term, we can look for the term in the index and then jump to the page in the book that the index specifies for that term.
Indexes improve the performance of queries that use conditions to retrieve or modify data. This will frequently be
SELECT queries but will often include
UPDATE queries as well, and more specifically queries containing clauses such as
JOIN, amongst others. The trade off is that indexes reduce the performance of
UPDATE queries, since the database engine must update the relevant indices whenever the underlying data changes.
Indexes are configured for a table by specifying which columns are a part of the index. Suppose we create the following table:
CREATE TABLE users ( id SERIAL PRIMARY KEY, VARCHAR, password_hash VARCHAR, email_address TIMESTAMP created_at );
It’s worth noting that most SQL databases will automatically create an index on primary key and unique columns, so in the example above our
id column will have an index. However, if we find ourselves searching for recently created users by filtering on the
created_at column, we’ll want to create an index on that column as well:
CREATE INDEX users_created_at_index ON users (created_at);
In PostgreSQL, we can use the
\d command to see information about a table. Here’s the output after both creating the table and the index:
postgres=# \d users Table "public.users" Column | Type | Collation | Nullable | Default ---------------+-----------------------------+-----------+----------+----------------------------------- id | integer | | not null | nextval('users_id_seq'::regclass) password_hash | character varying | | | email_address | character varying | | | created_at | timestamp without time zone | | | Indexes: "users_pkey" PRIMARY KEY, btree (id) "users_created_at_index" btree (created_at)
We can see that the
users_created_at_index exists on the
created_at column. The output gives us information about the type of data structure this index uses (a B-tree) as well as whether the index is part of a primary key, which we can see is true for the index on the
id column. The default data structure your database engine uses to create indexes is adequate for most use cases, and it is rare that you will have to specify something different.
With the above index, queries that use the
created_at column will now be able to take advantage of the index. Here’s an example query that finds users created in the last week:
SELECT * FROM users WHERE created_at >= NOW() - INTERVAL '7' DAY;
In both cases the query will return the same results, but in a table with a lot of data, the presence of the index on the
created_at column may allow the query to complete hundreds or even thousands of times faster than without.
In some instances, it can be beneficial to include multiple columns in an index. When your use case involves frequently running queries that specify conditions on multiple columns, creating multicolumn indexes that include several or all of the columns in the condition can significantly improve performance.
For example, suppose we often run this query (with
? being a stand-in for arbitrary string values):
SELECT first, last, email FROM users WHERE first = '?' AND last = '?';
Since both the
last columns are used in our
WHERE conditions, we can create an index on both columns:
CREATE INDEX users_first_last_index ON users (first, last);
Multicolumn indexes have very specific performance characteristics. Generally speaking, utilization of a multicolumn index is most efficient when there are constraints on the leftmost column in the index (
first in this case). It’s worth reading your database engine’s documentation on multicolumn indexes to understand how to best utilize them for your use case.
Building on top of multicolumn indexes, we’ll introduce one last feature: the covering index. When a multicolumn index contains every column that is to be retrieved by the query, most database engines can avoid reading from the table altogether and will return the values directly from the index.
Here’s an example of a query that is suitable for this type of optimization:
SELECT email, created_at FROM users WHERE email = '?' AND created_at >= '?';
We can create the following index that will end up serving as a covering index for this query:
CREATE INDEX users_email_created_at_index ON users (email, created_at);
Since we’re returning only the
created_at columns from the query, and the index contains both of these columns, the database will be able to perform an index-only scan and skip reading from the table in order to return our desired results. This can lead to a significant improvement in performance for this query.
However, keep in mind that every index comes with a cost: while the performance improves for querying the data, the performance of writing data goes down, as the index must be updated alongside the underlying table data. Deciding what indexes to create often requires analysis of the performance of your database running under a real workload. Preemptively creating numerous multi-column indexes in an attempt to improve the performance of various queries that your application may perform is not recommended.
Further reading about indexes
Utilizing indexes effectively is an important part of building a fast application and what we’ve covered here only scratches the surface. You can read the index documentation for the SQL database of your choice to learn more.
- Use the index, Luke!
- PostgreSQL index documentation
- MySQL index documentation
- SQLite index documentation
Christian Charukiewicz is a Partner at Foxhound Systems, where we focus on building fast, reliable, and intuitive custom software. Have an idea for a new application? We’ll deliver the best version of it. Start a project with us.