Search

Tuesday, July 6, 2010

MySQL Union and query optimization


UNION can be a pretty nice tool. It’s a good way to query several tables with the same data with a single call to the database server. The most common example includes adding results from archive table to your recent data. MYSQL ships with two flavors of Union; DISTINCT and ALL (assume this is the same with Oracle and MSSQL but I don’t use them enough to know).

Basic syntax:
(SELECT fields FROM table WHERE conditions)
UNION [DISTINCT | ALL]
(SELECT fields FROM table2 WHERE conditions)
[WHERE]
ORDER BY
[LIMIT]

UNION defaults to UNION DISTINCT implicitly. The difference between DISTINCT and ALL is pretty self explanatory; DISTINCT prevents duplicates just like in a normal SELECT statement. The problem with DISTINCT is that it’s very VERY costly. Both DISTINCT and ALL build a temp table to store the results; however, DISTINCT must index the results to avoid duplicates. In addition to that, from what I’ve read, applying ORDER BY and LIMIT to the entire UNION will cause MYSQL to ignore any indexes as well. I have the need to UNION up to eight tables. In my test I was working with a query I know that was giving me problems; it has six tables.

Table 1: ~25,000 records
Table 2: ~8,000 records
Table 3: ~300,000 records
Table 4: ~200,000 records
Table 5: ~820,000 records
Table 6: ~125,000 records

My where clause was pretty simple, just looking for all the records added by a single user. UNION DISTINCT caused my ColdFusion page to time out after 60 seconds. I could increase the timeout to get an accurate picture of how long it actually takes but I don’t feel like breaking anything right now (that’s what my 9-5 job is for). At first I was afraid it was because MYSQL ignores the index when ordering and limiting unions. My first thought was I needed to have to run each query individually then do a QofQ to re-order them.

In my particular situation I'm not going to have any duplicates because the data in each query is already unique to the other tables. In this case I can safely use UNION ALL without getting any duplicates with the added benefit of the speed UNION ALL gives me over UNION DISTINCT. How much faster is it? Considerably. My query went from timing out at 60 seconds to completing in 34 seconds. That’s pretty sweet, but not great.

My needs require me to limit the end result to 500 records. That’s fine for my end result, but what about each SELECT within the union? Believe it or not 99% of all the records in these six tables match my WHERE condition. That means my unions create a temp table of ~1,500,000 records of which I only need 500. No wonder UNION DISTINCT timed out; that’s a lot of indexing. Building such a large table when I only need a very small subset of that is just silly. Fortunately I can limit the results from the individual selects. Limiting each select to 500 now gives me a temp table with only 3,000 records, much better than 1.5M and growing! Unfortunately, to be sure I get the results I expect, I also need to include the same ORDER BY clause to each select that I have at the end of the entire union. Even with the overhead of ordering 1.5M records, the query now finishes in 21 seconds, a savings of 13 more seconds.

Most of the time the query only runs using tables 1 and 2, which are pretty small so it runs very fast. If anyone has any other ideas on how to make the larger unions run better, please comment below!

No comments:

Post a Comment