25 February 2010

PostgreSQL Optimizer Bits: Semi and Anti Joins

The series “PostgreSQL Optimiser Bits” will introduce the strategies and highlights of the PostgreSQL optimiser. We start today with a new feature of PostgreSQL 8.4: Semi and Anti Joins.

Since version 8.4, PostgreSQL has been offering a new optimisation strategy for the optimisation of certain queries: Semi and Anti Joins.

A Semi Join is a specific form of a join, which only takes the keys of relation a into account if these are also present in the associated table b. An Anti Join is the negative form of a Semi Join: that is, a key picked in table a will be taken into account if it is not present in table b.

To summarize, Semi and Anti Joins are specific forms of a join which only take certain keys on the left side into account – where queries want to make sure certain keys exist, but are not concerned with the content of the key itself. This behaviour is already widely known in Object Relation Mappers (ORM) which formulate such queries using EXIST() or NOT EXIST().

Compared to PostgreSQL 8.3 the same query is possible with a much simpler and more efficient query plan. The following simple example shows this improvement: take two tables, a, b and an EXIST() query. A certain set of data from a is to be found which has its equivalent a.id2 = b.id in b. Of course, this aim can also be accomplished by one single join, however, this example shows the improvements of the optimizer solving this query.


The optimiser in PostgreSQL in 8.3 determines the following plan for this example. Keep in mind that both tables a and b each have an index on the column id and id2.

                                QUERY PLAN
 Index Scan using a_id_idx on a  (cost=0.00..8355.27 rows=503 width=4)
   Index Cond: (id = 200)
   Filter: (subplan)
     ->  Index Scan using b_id_idx on b  (cost=0.00..8.27 rows=1 width=4)
           Index Cond: ($0 = id)

In contrast, in PostgreSQL 8.4 the optimizer can use a hash Semi Join:

                                QUERY PLAN
 Hash Semi Join  (cost=27.52..78.16 rows=969 width=4)
   Hash Cond: (a.id2 = b.id)
   ->  Index Scan using a_id_idx on a  (cost=0.00..37.32 rows=969 width=8)
         Index Cond: (id = 200)
   ->  Hash  (cost=15.01..15.01 rows=1001 width=4)
         ->  Seq Scan on b  (cost=0.00..15.01 rows=1001 width=4)

The reduced costs of this query plan are more than obvious – and lower costs mean fewer I/O accesses. So, in future a more detailed analysis of such queries is worth a look.

Categories: PostgreSQL
Tags: PostgreSQL PostgreSQL Competence Center


About the author

Bernd Helmle

share post: