Spark Enhanced Open Source Feature: Optimized SQL Query of Cross-Source Data

Scenario

Enterprises usually store massive data, such as from various databases and warehouses, for management and information collection. However, diversified data sources, hybrid dataset structures, and scattered data storage lower query efficiency.

The open source Spark only supports simple filter pushdown during querying of multi-source data. The SQL engine performance is deteriorated due of a large amount of unnecessary data transmission. The pushdown function is enhanced, so that aggregate, complex projection, and complex predicate can be pushed to data sources, reducing unnecessary data transmission and improving query performance.

Only the JDBC data source supports pushdown of query operations, such as aggregate, projection, predicate, aggregate over inner join, and aggregate over union all. All pushdown operations can be enabled based on your requirements.

Table 1 Enhanced query of cross-source query

Module

Before Enhancement

After Enhancement

aggregate

The pushdown of aggregate is not supported.

  • Aggregation functions including sum, avg, max, min, and count are supported.

    Example: select count(*) from table

  • Internal expressions of aggregation functions are supported.

    Example: select sum(a+b) from table

  • Calculation of aggregation functions is supported. Example: select avg(a) + max(b) from table

  • Pushdown of having is supported.

    Example: select sum(a) from table where a>0 group by b having sum(a)>10

  • Pushdown of some functions is supported.

    Pushdown of lines in mathematics, time, and string functions, such as abs(), month(), and length() are supported. In addition to the preceding built-in functions, you can run the SET command to add functions supported by data sources.

    Example: select sum(abs(a)) from table

  • Pushdown of limit and order by after aggregate is supported. However, the pushdown is not supported in Oracle, because Oracle does not support limit.

    Example: select sum(a) from table where a>0 group by b order by sum(a) limit 5

projection

Only pushdown of simple projection is supported. Example: select a, b from table

  • Complex expressions can be pushed down.

    Example: select (a+b)*c from table

  • Some functions can be pushed down. For details, see the description below the table.

    Example: select length(a)+abs(b) from table

  • Pushdown of limit and order by after projection is supported.

    Example: select a, b+c from table order by a limit 3

predicate

Only simple filtering with the column name on the left of the operator and values on the right is supported. Example:

select * from table where a>0 or b in ("aaa", "bbb")

  • Complex expression pushdown is supported.

    Example: select * from table where a+b>c*d or a/c in (1, 2, 3)

  • Some functions can be pushed down. For details, see the description below the table.

    Example: select * from table where length(a)>5

aggregate over inner join

Related data from the two tables must be loaded to Spark. The join operation must be performed before the aggregate operation.

The following functions are supported:

  • Aggregation functions including sum, avg, max, min, and count are supported.

  • All aggregate operations can be performed in a same table. The group by operations can be performed on one or two tables and only inner join is supported.

The following scenarios are not supported:

  • aggregate cannot be pushed down from both the left- and right-join tables.

  • aggregate contains operations, for example, sum(a+b).

  • aggregate operations, for example, sum(a)+min(b).

aggregate over union all

Related data from the two tables must be loaded to Spark. union must be performed before aggregate.

Supported scenarios:

Aggregation functions including sum, avg, max, min, and count are supported.

Unsupported scenarios:

  • aggregate contains operations, for example, sum(a+b).

  • aggregate operations, for example, sum(a)+min(b).

Precautions

  • If external data source is Hive, query operation cannot be performed on foreign tables created by Spark.

  • Only MySQL and MPPDB data sources are supported.