Modern relational database systems, including open source ones, allow the creation of very complex SQL queries. By placing subqueries in the SELECT, FROM and WHERE clauses, and by combining queries with operators such as UNION or INTERSECT, it is not difficult to write a query that won't fit on the monitor. This flexibility unfortunately comes at a price: analyzing such a complex query is quite a challenge for a database engine and, as practice shows, some database systems do not handle it very well. The easiest way to explain this is with an example, modeled on a real case.
Let's imagine that we are developing a system whose tasks include recording information about the activity of users uploading files to a server. For each registered user, we want to know when they uploaded which files. The database schema for this functionality looks as follows:
Table "public.test_user" Column | Type | Modifiers -----------------+-----------------------+----------- user_id | integer | not null name | character varying(64) | not null login | character varying(16) | not null hashed_password | character varying(32) | not null Indexes: "test_user_pkey" PRIMARY KEY, btree (user_id) "test_user_login_key" UNIQUE, btree ("login")
Table "public.test_upload" Column | Type | Modifiers -------------+------------------------+----------- upload_id | integer | not null user_id | integer | not null path | character varying(255) | not null upload_time | timestamp | not null Indexes: "test_upload_pkey" PRIMARY KEY, btree (upload_id) Foreign-key constraints:
"
test_upload_user_id_fkey" FOREIGN KEY (user_id) REFERENCES test_user(user
_id)The table testuser*** contains about 10,000 users. One user, on average, has placed 5 files on the server, although there are, of course, those who have not placed a single one, as well as those who have placed several hundred. Each posted file corresponds to one record in the test**upload* table.
Let's say our example service needs to display the 10 distinct users who most recently added files. The user whose newest upload is the oldest should be at the end of the list, the user whose file is the "freshest" at the very beginning. No user, however, may appear in the "top 10" list more than once.
A first approach to this query could look like the following:
SELECT name, path, upload_time
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id)
 ORDER BY upload_time DESC
 LIMIT 10;
Unfortunately, although the query does display the most recently uploaded files, it doesn't do quite what we expect: one user can appear in the list more than once. An additional condition comes to the rescue, eliminating the older entries of the same user:
SELECT name, path, upload_time
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id)
 WHERE upload_time = (SELECT max(upload_time)
                        FROM test_upload
                       WHERE user_id = u.user_id)
 ORDER BY upload_time DESC
 LIMIT 10;
Now the query works correctly, but depending on the database system we are using, we may have introduced another serious problem: the query executes much more slowly. On our test system running PostgreSQL 8.2, the first (incorrect) version executed in 200 ms. The "corrected" version, meanwhile, ran for just over 10 minutes, about 3,000 times slower. The output of EXPLAIN reveals the reason for this slowness.
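The plan can be inspected by prefixing a query with EXPLAIN; EXPLAIN ANALYZE additionally executes the query and reports actual row counts and timings (so for the slow variant, be prepared to wait). For example:

-- EXPLAIN shows the estimated plan; EXPLAIN ANALYZE also runs the query.
EXPLAIN ANALYZE
SELECT name, path, upload_time
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id)
 WHERE upload_time = (SELECT max(upload_time)
                        FROM test_upload
                       WHERE user_id = u.user_id)
 ORDER BY upload_time DESC
 LIMIT 10;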
First, the subquery is executed quite inefficiently: the whole table is scanned in order to find the files of a single user. Second, the subquery is executed 50,000 times, once for every record analyzed in the main query. The first problem can be dealt with by adding an appropriate index:
CREATE INDEX upload_user_id_idx ON test_upload(user_id);
This simple change reduced the query time to about 0.6 seconds.
In this article, however, we want to present a different technique for optimizing subqueries: eliminating them altogether. A subquery that isn't there takes no time. What we care about, of course, is that after eliminating the subquery the whole query is still correct, i.e. returns the right results, which means we need to replace the subquery with something. The key to this technique is replacing it with a join. Joins are easier for the system to compute, if only because several different algorithms exist for implementing them, giving the optimizer more "room for manoeuvre". Besides, bulk access to a large table is usually cheaper than thousands of small, simple accesses selecting a few records at a time.
Many database engines can perform such a transformation automatically for uncorrelated subqueries, but here we are dealing with a correlated subquery: it references a column of the surrounding query. Commercial database systems can handle this case as well, but if we are not lucky enough to be using one, we have to do it ourselves.
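To make the distinction concrete, consider this schematic pair of queries against our schema (an illustration added for clarity, not one of the benchmark queries). The first subquery mentions no columns of the outer query and can be evaluated once; the second references u.user_id, so a naive executor re-evaluates it for every outer row:

-- Uncorrelated: the inner SELECT is independent of the outer query.
SELECT name
  FROM test_user
 WHERE user_id IN (SELECT user_id FROM test_upload);

-- Correlated: the inner SELECT references u.user_id from the outer query.
SELECT name
  FROM test_user u
 WHERE EXISTS (SELECT 1 FROM test_upload WHERE user_id = u.user_id);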
First, let's remove the upload_user_id_idx index, which is no longer needed. Then let's execute the query:
SELECT user_id, max(upload_time) FROM test_upload GROUP BY user_id;
It executes in only 63 ms and contains all the data necessary to check whether a given file is its user's "latest" one and should be included in the result. Now we just need to join this query to the full contents of the user and file tables, and finally sort accordingly:
SELECT name, path, upload_time
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id)
  JOIN (SELECT user_id, max(upload_time) AS ud
          FROM test_upload
         GROUP BY user_id) x
    ON (x.user_id = l.user_id AND l.upload_time = ud)
 ORDER BY upload_time DESC
 LIMIT 10;
The execution time of this query was 167 ms, so eliminating the subquery gave a speedup of almost 4 times.
We ran the tests on a small dataset that fit completely in memory. What happens, however, if we increase the amount of data? To check, we generated a second, large dataset in which the user table contained 500,000 records and the table of uploaded files 10 million. This time the subquery version, using the index on test_upload(user_id), ran for more than 30 minutes and we had to abort the test. The version without the correlated subquery took 44 seconds. Meanwhile, the cost ratio of the two queries estimated by the PostgreSQL optimizer was about 1600:1. The differences are so significant because this time the data does not fit entirely in memory, and each execution of the subquery required physical access to a random location on disk. In the second case, by contrast, records are read sequentially and no time is wasted positioning the disk heads. We did not test the index-less subquery version on this dataset; had we done so, the publication date of this article would probably have had to be pushed back a year.
As you can see, using a join and an uncorrelated subquery in FROM instead of a correlated subquery in WHERE can bring big performance gains. However, one should also keep in mind the risks this technique carries. First of all, by changing the form of the query we run the risk that the new query will not be equivalent to the original. In many cases the substitution may seem mechanical, but great care must be taken not to introduce duplicate records when adding another join. Joins, unlike the EXISTS, IN or '=' operators in the WHERE clause, can not only eliminate records but also duplicate them. This problem is usually solved by making sure the join can never match more than one record (in our case, by simply noting that user_id is unique in the derived table), or by adding DISTINCT so that any duplicates are removed at the end. The second danger is the temptation to apply this technique always and everywhere, for unfortunately it does not always yield a performance gain. We managed to speed up the presented query to 0.4 ms on the small dataset and 1 s on the large one. How? That's a topic for a separate article.
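To illustrate the duplication risk with a schematic example (again an addition for clarity, not a benchmark query): replacing an EXISTS test with a plain join multiplies each user row by the number of matching uploads, and DISTINCT is one way to collapse the result back:

-- A user with three uploads appears once here...
SELECT u.user_id, u.name
  FROM test_user u
 WHERE EXISTS (SELECT 1 FROM test_upload WHERE user_id = u.user_id);

-- ...but three times here, once per matching upload record:
SELECT u.user_id, u.name
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id);

-- DISTINCT restores one row per user after the join:
SELECT DISTINCT u.user_id, u.name
  FROM test_user u
  JOIN test_upload l ON (u.user_id = l.user_id);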