Spark SQL is an important component of Apache Spark and subsumes Shark. It lets engineers who are unfamiliar with MapReduce get started quickly: users can enter SQL statements directly to analyze, process, and query data.
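As a minimal sketch of this workflow, the following registers a small in-memory table (the table name and columns are hypothetical) and queries it with a plain SQL statement instead of MapReduce code:

```scala
import org.apache.spark.sql.SparkSession

object SqlQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlQuickStart")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for a real table.
    val sales = Seq(("east", 100), ("west", 250), ("east", 50))
      .toDF("region", "amount")
    sales.createOrReplaceTempView("sales")

    // Analyze the data with an ordinary SQL statement.
    val totals = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    totals.show()

    spark.stop()
  }
}
```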
Spark SQL can mitigate data skew during joins. Data whose keys are not skewed is distributed evenly across tasks for processing. For data that contains skewed keys, Spark SQL broadcasts the smaller table and performs a map-side join, so that this data is also distributed evenly across tasks. This makes full use of CPU resources and improves performance.
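A sketch of the broadcast (map-side) join technique, using hypothetical table names. Broadcasting the small table avoids shuffling the large, skewed table, so a hot key cannot overload a single reduce task:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SkewJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SkewJoinSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: a large fact table where one key ("hotKey")
    // is heavily skewed, plus a small dimension table.
    val facts = Seq.fill(1000)(("hotKey", 1)) ++ Seq(("otherKey", 2))
    val large = facts.toDF("key", "value")
    val small = Seq(("hotKey", "A"), ("otherKey", "B")).toDF("key", "label")

    // The broadcast hint ships the small table to every executor,
    // turning the join into a broadcast hash (map-side) join.
    val joined = large.join(broadcast(small), "key")
    joined.explain() // the physical plan should show a BroadcastHashJoin

    spark.stop()
  }
}
```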
Spark SQL employs the coalesce operator to handle small files: it merges the many small partitions produced by small files in a table. This reduces the number of hash buckets during a shuffle operation and improves performance.
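A minimal sketch of merging partitions with `coalesce` (the partition counts here are illustrative). Unlike `repartition`, `coalesce` merges existing partitions without a full shuffle, so downstream stages launch far fewer tasks:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CoalesceSmallFiles")
      .master("local[*]")
      .getOrCreate()

    // Simulate a table read from many small files, which yields
    // many tiny partitions (200 here, chosen arbitrarily).
    val df = spark.range(0, 10000).toDF("id").repartition(200)

    // Merge down to a small number of partitions without a shuffle.
    val merged = df.coalesce(8)
    println(merged.rdd.getNumPartitions) // 8

    spark.stop()
  }
}
```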
For details about Spark SQL architecture and principles, see http://spark.apache.org/docs/2.1.0/programming-guide.html.