Query optimization with subqueries

Piotr Kolaczkowski
Calendar icon
2 marca 2010

Modern relational database systems, including those with open source code, allow the creation of very complex SQL queries. By placing subqueries in SELECT, FROM and WHERE sections, as well as by combining queries using operators such as UNION or INTERSECT, it's not difficult to write a query that won't fit on the monitor. This flexibility, unfortunately, comes at a price: analyzing such a complex query is quite a challenge for a database engine and, as practice shows, some database systems don't handle it very well. The easiest way to explain this is with an example, modeled on a real case.

Let's imagine that we are developing a system, one of whose tasks is to record information about the activity of users posting files on the server. For each registered user, we want to know when and what file he posted. The database schema for this functionality looks as follows:

Table "public.test_user" Column | Type | Modifiers -----------------+-----------------------+----------- user_id | integer | not null name | character varying(64) | not null login | character varying(16) | not null hashed_password | character varying(32) | not null Indexes: "test_user_pkey" PRIMARY KEY, btree (user_id) "test_user_login_key" UNIQUE, btree ("login")
Table "public.test_upload" Column | Type | Modifiers -------------+------------------------+----------- upload_id | integer | not null user_id | integer | not null path | character varying(255) | not null upload_time | timestamp | not null Indexes:
    "test_upload_pkey" PRIMARY KEY, btree (upload_id) Foreign-key constraints:


test_upload_user_id_fkey" FOREIGN KEY (user_id) REFERENCES test_user(user

_id)The table testuser*** contains about 10,000 users. One user, on average, has placed 5 files on the server, although there are, of course, those who have not placed a single one, as well as those who have placed several hundred. Each posted file corresponds to one record in the test**upload* table.

Let's say our example service needs to display 10 different users who have recently added some files. The user who has added a file the longest is to be at the end of the list, the user whose file is "freshest" - at the very beginning. However, no user should be in the "top 10" list more than once.

A first approach to this query could look like the following:

SELECT name, path, upload_time FROM test_user u JOIN test_upload l ON (u.user_id = l.user_id) ORDER BY upload_time DESCIT LIMIT 10;

Unfortunately, the query, although it actually displays the most recently uploaded files, doesn't quite do what we expect - one user can appear more than once in the list. With help comes an additional condition that will eliminate older entries for the same user:

SELECT name, path, upload_time FROM test_user u JOIN test_upload l ON (u.user_id = l.user_id) WHERE upload_time = (SELECT max(upload_time) FROM test_upload WHERE user_id = u.user_id) ORDER BY upload_time DESC LIMIT 10;

Now the query is already working correctly, but depending on the database system we are working on, we may have introduced another serious problem: the query will execute much slower. On our test system based on PostgreSQL 8.2, the first (incorrect) version executed 200 ms. The "corrected" version, meanwhile, executed for just over 10 minutes, or about 3,000 times slower. The reason for this slowness is shown by the EXPLAIN result:

QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=50967080.34..50967080.36 rows=10 width=40) (actual time=617733.543..617733.554 rows=10 loops=1) -> Sort (cost=50967080.34..50967080.52 rows=75 width=40) (actual time=617733.542..617733.548 rows=10 loops=1) Sort Key: l.upload_date -> Nested Loop (cost=0.00..50967078.00 rows=75 width=40) (actual time=16.545..617691.016 rows=9967 loops=1) Join Filter: (l.upload_date = (subplan)) -> Seq Scan on test_upload l (cost=0.00..894.00 rows=50000 width=28) (actual time=0.012..41.359 rows=50000 loops=1) -> Index Scan using test_user_pkey on test_user u (cost=0.00..0.28 rows=1 width=20) (actual time=0.009..0.011 rows=1 loops=50000) Index Cond: (u.user_id = l.user_id) SubPlan -> Aggregate (cost=1019.02..1019.03 rows=1 width=4) (actual time=12.333..12.333 rows=1 loops=50000) -> Seq Scan on test_upload (cost=0.00..1019.00 rows=6 width=4) (actual time=1.967..12.323 rows=6 loops=50000) Filter: (user_id = $0) Total runtime: 617733.733 ms

Po pierwsze, podzapytanie jest wykonywane dosyć nieoptymalnie - skanowana jest cała tabela, aby znaleźć pliki jednego użytkownika. Po drugie, podzapytanie jest wykonywane 50000 razy, raz na każdy rekord analizowany w głównym zapytaniu. Z pierwszym problemem można sobie poradzić przez dodanie odpowiedniego indeksu:

CREATE INDEX upload_user_id_idx ON test_upload(user_id);

Ta prosta zmiana spowodowała, że czas zapytania skrócił się do ok 0,6 sekundy:

QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=422099.29..422099.31 rows=10 width=40) (actual time=614.207..614.214 rows=10 loops=1) -> Sort (cost=422099.29..422099.47 rows=74 width=40) (actual time=614.205..614.208 rows=10 loops=1) Sort Key: l.upload_time -> Merge Join (cost=0.00..422096.99 rows=74 width=40) (actual time=0.071..602.762 rows=9967 loops=1) Merge Cond: (u.user_id = l.user_id) Join Filter: (l.upload_time = (subplan)) -> Index Scan using test_user_pkey on test_user u (cost=0.00..378.25 rows=10000 width=20) (actual time=0.021..5.243 rows=10000 loops=1) -> Index Scan using upload_user_id_idx on test_upload l (cost=0.00..1595.25 rows=50000 width=28) (actual time=0.014..24.648 rows=50000 loops=1) SubPlan -> Aggregate (cost=8.38...8.39 rows=1 width=4) (actual time=0.009..0.009 rows=1 loops=50000) -> Index Scan using upload_user_id_idx on test_upload (cost=0.00..8.36 rows=6 width=4) (actual time=0.003..0.006 rows=6 loops=50000) Index Cond: (user_id = $0) Total runtime: 614.338 ms

In this article, however, we wanted to present a different technique for optimizing subqueries: by eliminating them. It is known that a subquery that is not there does not need time. What we care about, however, is that by eliminating a subquery, the whole thing is still correct, i.e. returns the right results, which means we need to replace it with something. The key to using this technique is to replace the subquery with a join. Joins are easier for the system to compute if only because there are different algorithms for implementing joins and the optimizer has more "room for manoeuvre" here. Besides, bulk access to a large table is usually cheaper than thousands of small, simple accesses selecting a few records at a time.

Many database engines can perform such a transformation automatically for uncorrelated subqueries, but in this case we are dealing with a correlated subquery because it references a surrounding query. Commercial database systems would handle this case as well, but if we are not lucky enough to use them, we have to handle it ourselves.

First, let's remove the uploaduserid_idx index, which is no longer needed. Then let's execute the query:

SELECT user_id, max(upload_time) FROM test_upload GROUP BY user_id;

It executes in only 63 ms and contains all the necessary data to check if the user file is the "last" one and should be included in the result. Now you just need to combine this query with the full contents of the table with users and files, and finally sort accordingly:

SELECT name, path, upload_time FROM test_user u JOIN test_upload l ON (u.user_id = l.user_id) JOIN ( SELECT user_id, max(upload_time) AS ud FROM test_upload GROUP BY user_id ) x ON (x.user_id = l.user_id AND l.upload_time = ud) ORDER BY upload_time DESC LIMIT 10;

The execution time of this query was 167 ms, i.e. eliminating the subquery provided a speedup of almost 4 times:

QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=4324.97..4324.99 rows=10 width=40) (actual time=166.005..166.011 rows=10 loops=1) -> Sort (cost=4324.97..4325.13 rows=66 width=40) (actual time= rows=10 loops=1) Sort Key: l.upload\_date -> Hash Join (cost=2053.31..4322.97 rows=66 width=40) (actual time=93.562..155.832 rows=9967 loops=1) Hash Cond: ((l.user\_id = u.user\_id) AND (l.upload\_date = x.ud)) -> Seq Scan on test\_upload l (cost=0.00..894.00 rows=50000 width=28) (actual time=0.010..18.370 rows=50000 loops=1) -> Hash (cost=1920.04..1920.04 rows=8885 width=28) (actual time=93.524..93.524 rows=9922 loops=1) -> Hash Join (cost=1498.00..1920.04 rows=8885 width=28) (actual time=68.164..86.451 rows=9922 loops=1) Hash Cond: (x.user\_id = u.user\_id) -> HashAggregate (cost=1144.00..1255.06 rows=8885 width=8) (actual time=52.218..57.378 rows=9922 loops=1) -> Seq Scan on test\_upload (cost=0.00..894.00 rows=50000 width=8) (actual time=0.008..18.568 rows=50000 loops=1) -> Hash (cost=229.00..229.00 rows=10000 width=20) (actual time=15.925..15.925 rows=10000 loops=1) -> Seq Scan on test\_user u (cost=0.00..229.00 rows=10000 width=20) (actual time=0.007..5.835 rows=10000 loops=1) Total runtime: 166.287 ms

We ran the tests on a small dataset that completely fit in memory. However, what happens if we increase the amount of data? To test this, we generated a second, large dataset so that the user table contained 500,000 records and the table with added files contained 10 million. This time, the version with a subquery using the index on testupload(userid) executed for more than 30 minutes and we had to stop the test. In contrast, the version without the correlated subquery took 44 seconds. In contrast, the cost ratio of the two queries estimated by the PostgreSQL optimizer was about 1600:1. The differences are so significant because this time the data does not fit entirely in memory and each time the subquery was executed, it required physical access to a random disk space. Meanwhile, in the second case, the records are retrieved sequentially and no time is wasted positioning the disk heads. We did not test the subquery version without an index on this dataset. Had we done so, it is likely that the publication date of this article would have had to be pushed back a year.

As you can see, using a join and an uncorrelated subquery in FROM instead of a correlated subquery in WHERE can provide big performance gains. However, it is also important to keep in mind the risks that this technique brings. First of all, by changing the form of the query, we run the risk that the new query will not be equivalent to the original. In many cases, the substitution may seem mechanical, but great care must be taken not to introduce duplicate records when adding another join. Joins, unlike the use of EXISTS, IN or '=' operators in the WHERE section, can not only eliminate records, but also duplicate them. This problem is usually solved by making sure that more than one record is never included (in our case by simply noting that the user\_id is unique), or by adding a DISTINCT word so that any duplicates are removed at the end. The second danger is to try to use this technique whenever and wherever you can. And, unfortunately, it does not always yield gains in performance. We managed to speed up the presented query to 0.4 ms on a small dataset and 1 s on a large dataset. How? That's a topic for a separate article.


Read also

Calendar icon

17 maj

Investor Data Room - analyzing startup data in R Shiny
Data Room is a tool for sharing key company information that investors or potential business partners need. Read the article on Inves...
Calendar icon

19 kwiecień

Overview and comparison of classification algorithms in Python
Artificial intelligence, including algorithms based on deep neural networks, have become increasingly popular recently. Image generat...
Calendar icon

27 luty

Looker Studio with data from its own API
Looker Studio commonly referred to as Google Reports is Google's browser-based answer to business analytics. Formerly, its full name ...