Optimizing Subqueries

What Is a Subquery

When an application runs a SQL statement to operate the database, a large number of subqueries are used because they are more clear than table join. Especially in complicated query statements, subqueries have more complete and independent semantics, which makes SQL statements clearer and easy to understand. Therefore, subqueries are widely used.

In GaussDB(DWS), subqueries can also be called sublinks based on the location of subqueries in SQL statements.

  • Subquery: corresponds to a scope table (RangeTblEntry) in the query parse tree. That is, a subquery is a SELECT statement following immediately after the FROM keyword.

  • Sublink: corresponds to an expression in the query parsing tree. That is, a sublink is a statement in the WHERE or ON clause or in the target list.

    In conclusion, a subquery is a scope table and a sublink is an expression in the query parsing tree. A sublink can be found in constraint conditions and expressions. In GaussDB(DWS), sublinks can be classified into the following types:

    • exist_sublink: corresponding to the EXIST and NOT EXIST statements.

    • any_sublink: corresponding to the OP ANY(SELEC...) statement. OP can be the IN, <, >, or = operator.

    • all_sublink: corresponding to the OP ALL(SELECT...) statement. OP can be the IN, <, >, or = operator.

    • rowcompare_sublink: corresponding to the RECORD OP (SELECT...) statement.

    • expr_sublink: corresponding to the (SELECT with a single target list item) statement.

    • array_sublink: corresponding to the ARRAY(SELECT...) statement.

    • cte_sublink: corresponding to the WITH(...) statement.

    The sublinks commonly used in OLAP and HTAP are exist_sublink and any_sublink. The sublinks are pulled up by the optimization engine of GaussDB(DWS). Because of the flexible use of subqueries in SQL statements, complex subqueries may affect query performance. Subqueries are classified into non-correlated subqueries and correlated subqueries.

    • Non-correlated subquery

      The execution of a subquery is independent from any attribute of outer queries. In this way, a subquery can be executed before outer queries.

      Example:

      select t1.c1,t1.c2
      from t1
      where t1.c1 in (
          select c2
          from t2
          where t2.c2 IN (2,3,4)
      );
                                QUERY PLAN
      ---------------------------------------------------------------
      Streaming (type: GATHER)
         Node/s: All datanodes
         ->  Hash Right Semi Join
               Hash Cond: (t2.c2 = t1.c1)
               ->  Streaming(type: REDISTRIBUTE)
                     Spawn on: All datanodes
                     ->  Seq Scan on t2
                           Filter: (c2 = ANY ('{2,3,4}'::integer[]))
               ->  Hash
                     ->  Seq Scan on t1
      (10 rows)
      
    • Correlated subquery

      The execution of a subquery depends on some attributes of outer queries which are used as AND conditions of the subquery. In the following example, t1.c1 in the t2.c1 = t1.c1 condition is a dependent attribute. Such a subquery depends on outer queries and needs to be executed once for each outer query.

      Example:

      select t1.c1,t1.c2
      from t1
      where t1.c1 in (
          select c2
          from t2
          where t2.c1 = t1.c1 AND t2.c2 in (2,3,4)
      );
                                       QUERY PLAN
      -----------------------------------------------------------------------
      Streaming (type: GATHER)
         Node/s: All datanodes
         ->  Seq Scan on t1
               Filter: (SubPlan 1)
               SubPlan 1
                 ->  Result
                       Filter: (t2.c1 = t1.c1)
                       ->  Materialize
                               ->  Streaming(type: BROADCAST)
                                   Spawn on: All datanodes
                     ->  Seq Scan on t2
                                         Filter: (c2 = ANY ('{2,3,4}'::integer[]))
      (12 rows)
      

More Optimization Examples

  1. Change the base table to a replication table and create an index on the filter column.

create table master_table (a int);
create table sub_table(a int, b int);
select a from master_table group by a having a in (select a from sub_table);

In this example, a correlated subquery is contained. To improve the query performance, you can change sub_table to a replication table and create an index on the a column.

  1. Modify the SELECT statement, change the subquery to a JOIN relationship between the primary table and the parent query, or modify the subquery to improve the query performance. Ensure that the subquery to be used is semantically correct.

explain (costs off)select * from master_table as t1 where t1.a in (select t2.a from sub_table as t2 where t1.a = t2.b);
                        QUERY PLAN
----------------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Seq Scan on master_table t1
         Filter: (SubPlan 1)
         SubPlan 1
           ->  Result
                 Filter: (t1.a = t2.b)
                 ->  Materialize
                       ->  Streaming(type: BROADCAST)
                             Spawn on: All datanodes
                             ->  Seq Scan on sub_table t2
(11 rows)

In the preceding example, a subplan is used. To remove the subplan, you can modify the statement as follows:

explain(costs off) select * from master_table as t1 where exists (select t2.a from sub_table as t2 where t1.a = t2.b and t1.a = t2.a);
                    QUERY PLAN
--------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Hash Semi Join
         Hash Cond: (t1.a = t2.b)
         ->  Seq Scan on master_table t1
         ->  Hash
               ->  Streaming(type: REDISTRIBUTE)
                     Spawn on: All datanodes
                     ->  Seq Scan on sub_table t2
(9 rows)

In this way, the subplan is replaced by the semi-join between the two tables, greatly improving the execution efficiency.