I’m trying to optimize quickly optimize the search functionality of some outdated forum software written in PHP. I’ve got my work down to a query that looks like this:
SELECT thread.threadid FROM thread AS thread INNER JOIN word AS word ON (word.title LIKE 'word1' OR word.title LIKE 'word2') INNER JOIN postindex AS postindex ON (postindex.wordid = word.wordid) INNER JOIN post AS postquery ON (postquery.postid = postindex.postid) WHERE thread.threadid = postquery.threadid GROUP BY thread.threadid HAVING COUNT(DISTINCT word.wordid) = 2 LIMIT 25;
word1
and word2
are examples; there could be any number of words. The number at the very end of the query is the total number of words. The idea is that a thread most contain all words in the search query, spread out over any number of posts.
This query often exceeds 60 seconds with only two words, and times out. I’m stumped; I can’t figure out how to further optimize this horrid search engine.
As far as I can tell, everything is indexed properly, and I’ve run ANALYZE
recently. Most of the database is running on InnoDB. Here’s the output of EXPLAIN
:
+----+-------------+-----------+--------+----------------------------------------------------------------------------------------+---------+---------+------------------------------+------+-----------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+--------+----------------------------------------------------------------------------------------+---------+---------+------------------------------+------+-----------------------------------------------------------+ | 1 | SIMPLE | word | range | PRIMARY,title | title | 150 | NULL | 2 | Using where; Using index; Using temporary; Using filesort | | 1 | SIMPLE | postindex | ref | wordid,temp_ix | temp_ix | 4 | database1.word.wordid | 3 | Using index condition | | 1 | SIMPLE | postquery | eq_ref | PRIMARY,threadid,showthread | PRIMARY | 4 | database1.postindex.postid | 1 | NULL | | 1 | SIMPLE | thread | eq_ref | PRIMARY,forumid,postuserid,pollid,title,lastpost,dateline,prefixid,tweeted,firstpostid | PRIMARY | 4 | database1.postquery.threadid | 1 | Using index | +----+-------------+-----------+--------+----------------------------------------------------------------------------------------+---------+---------+------------------------------+------+-----------------------------------------------------------+
Update
LIMIT 25
doesn’t seem to be helping much. It shaves off maybe second from a query that normally returns hundreds of results.
Clarification
The part that’s slowing down MySQL is the GROUP BY ... HAVING ...
bit. With GROUP BY
, the LIMIT
is pretty much useless for improving performance. Without GROUP BY
, and as long as the LIMIT
remains, the queries are quite speedy.
SQL Info
Output of SHOW CREATE TABLE postindex;
:
CREATE TABLE `postindex` ( `wordid` int(10) unsigned NOT NULL DEFAULT '0', `postid` int(10) unsigned NOT NULL DEFAULT '0', `intitle` smallint(5) unsigned NOT NULL DEFAULT '0', `score` smallint(5) unsigned NOT NULL DEFAULT '0', UNIQUE KEY `wordid` (`wordid`,`postid`), KEY `temp_ix` (`wordid`), KEY `postid` (`postid`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8
I didn’t make the table, so I have no idea why there’s a duplicate index on wordid; however, I’m not willing to delete it, since this is ancient, fickle software.
Advertisement
Answer
You can try several rewrites and compare execution plan and times.
Using 2 EXISTS
subqueries (one for each word to be checked):
SELECT t.threadid FROM thread AS t WHERE EXISTS ( SELECT 1 FROM post AS p JOIN postindex AS pi ON pi.postid = p.postid JOIN word AS w ON pi.wordid = w.wordid WHERE w.title = 'word1' AND t.threadid = p.threadid ) AND EXISTS ( SELECT 1 FROM post AS p JOIN postindex AS pi ON pi.postid = p.postid JOIN word AS w ON pi.wordid = w.wordid WHERE w.title = 'word2' AND t.threadid = p.threadid ) ;
Using one EXISTS
subquery:
SELECT t.threadid FROM thread AS t WHERE EXISTS ( SELECT 1 FROM post AS p1 JOIN postindex AS pi1 ON pi1.postid = p1.postid JOIN word AS w1 ON w1.wordid = pi1.wordid AND w1.title = 'word1' JOIN post AS p2 ON p2.threadid = p1.threadid JOIN postindex AS pi2 ON pi2.postid = p2.postid JOIN word AS w2 ON w2.wordid = pi2.wordid AND w2.title = 'word2' WHERE t.threadid = p1.threadid AND t.threadid = p2.threadid ) ;
A single query with many joins and GROUP BY
only to remove the duplicate threadid
:
SELECT t.threadid FROM thread AS t JOIN post AS p1 ON p1.threadid = t.threadid JOIN postindex AS pi1 ON pi1.postid = p1.postid JOIN word AS w1 ON w1.wordid = pi1.wordid AND w1.title = 'word1' JOIN post AS p2 ON p1.threadid = t.threadid JOIN postindex AS pi2 ON pi2.postid = p2.postid JOIN word AS w2 ON w2.wordid = pi2.wordid AND w2.title = 'word2' WHERE p1.threadid = p2.threadid -- this line is redundant GROUP BY t.threadid ;