Spark SQL

28 Apr 2017

It seems that a huge part of understanding the differences between Spark 1 and 2 is getting a handle on Spark SQL. I gave the Spark SQL paper by Armbrust et al. (2015) a first run-through and am collecting my thoughts on it.

Resilient Distributed Datasets (RDDs) were and still are the fundamental elements of Spark. Spark 1 was all about the programmer transforming and acting on RDDs directly. While a programmer can still touch RDDs, the main achievement in Spark 2 is the Dataset abstraction. Everyone is encouraged to work with Datasets for most purposes at this point, and I’ve certainly got plenty to learn on that front. But I was interested in the DataFrame abstraction, because it seemed like the awkward and tragically overlooked middle child in the RDD / DataFrame / Dataset Spark family. DataFrames were introduced along with Spark SQL, and their most important contribution was bringing SQL functionality to Spark. That functionality has since been folded into Datasets (in Spark 2, a DataFrame is just a Dataset of untyped Row objects), hence the tendency to forget DataFrames.
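
To keep the three abstractions straight in my head, here is a minimal sketch of how they relate, assuming a spark-shell session where the SparkSession named spark is predefined. The Person case class and its values are made up for illustration.

```scala
// Minimal sketch, assuming a spark-shell session where `spark` (a SparkSession)
// is predefined. The Person case class and its values are made up for illustration.
import spark.implicits._

case class Person(name: String, age: Int)

// Spark 1 style: manipulate the RDD directly with arbitrary Scala functions,
// which the engine treats as opaque and cannot optimize.
val rdd = spark.sparkContext.parallelize(Seq(Person("Ada", 36), Person("Grace", 45)))
val adults = rdd.filter(_.age >= 18)

// Spark 2 style: a Dataset carries a schema, so Catalyst can see and optimize
// the operations; a DataFrame is just a Dataset of untyped Row objects.
val ds = rdd.toDS()
val df = ds.toDF()
df.printSchema()
```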

The major achievement of Spark SQL is integrating the relational approach of SQL with the procedural approach of Scala / Spark, so that relational operations can be performed directly within the Spark framework using Scala, Python, or Java. The DataFrame API is an important piece of this achievement, as is the Catalyst optimizer, which uses Scala’s functional language features to act as a sort of extensible query compiler for optimizing DataFrame manipulations in Spark.
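
As a concrete (if made-up) illustration of that relational-meets-procedural idea, here is roughly what the DataFrame API looks like from Scala. The input file and the column names are invented; this is a sketch of the shape of the API, not anything from the paper.

```scala
// Sketch only: assumes a spark-shell session (spark predefined); the input
// file and the column names (status, date) are invented for illustration.
import org.apache.spark.sql.functions._

val events = spark.read.parquet("events.parquet")

// Relational operators expressed as ordinary Scala method calls, which
// Catalyst turns into an optimized query plan before anything actually runs.
val dailyCounts = events
  .filter(col("status") === "ok")
  .groupBy(col("date"))
  .agg(count("*").as("n"))
  .orderBy(col("date"))

dailyCounts.show()
```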

A fair bit of this paper discusses the Spark basics of lazy transformations, which the scheduler builds up and optimizes as a DAG, and actions, which actually execute that plan. I learned these basics a little while ago, but this paper also goes into fine-grained detail on how Catalyst accomplishes this optimization. The key ingredients seem to be Scala’s pattern matching and quasiquote features, so I should look into these further as I continue learning. The really cool thing about Catalyst is that it was designed to be extensible and has already benefited from a number of customizations. ADAM is one such contributor to Catalyst, and now I have a better background for understanding what it means to say that ADAM improved the optimization of coordinate joins, for example. The Spark SQL authors say that writing new Catalyst optimization rules is relatively easy, but I think I have a little ways to go before I am tinkering with it!
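
The paper’s running example of a Catalyst rule is constant folding, written as a Scala partial function over the expression tree. To get a feel for the pattern-matching part (quasiquotes are for code generation and I haven’t touched them yet), here is a standalone toy version on an expression tree of my own, not Catalyst’s actual classes.

```scala
// A toy expression tree of my own, just to illustrate the style of rule that
// Catalyst expresses with Scala pattern matching. Not Catalyst's real classes.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// "Constant folding": fold the children first, then collapse an Add of two
// literals into a single precomputed literal.
def fold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (fold(l), fold(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b)
      case (fl, fr)                 => Add(fl, fr)
    }
  case other => other
}

// fold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
//   == Add(Attribute("x"), Literal(3))
```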

One of the main payoffs of Catalyst is the huge leap in advanced analytics it has facilitated. The authors discuss the new support for JSON, XML, and other data sources, as well as integration with the machine learning library MLlib. Data sources and user-defined functions are also extensible, but again I think this will come later for me.
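
For instance, reading semi-structured data and adding a user-defined function both end up being only a couple of lines. The file name, the name column, and the UDF itself here are invented; this is just a sketch of the shape of the API.

```scala
// Sketch only: assumes a spark-shell session (spark predefined); the file,
// the name column, and the UDF are invented for illustration.
import org.apache.spark.sql.functions.col

// Spark infers a schema from the JSON records.
val people = spark.read.json("people.json")
people.printSchema()

// Register a UDF once; the returned handle works in the DataFrame API,
// and the registered name also works inside SQL queries.
val initial = spark.udf.register("initial", (name: String) => name.take(1).toUpperCase)
people.select(initial(col("name"))).show()
```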

The bottom line for me right now is that I’m excited to carry my new 601 MySQL skills and relational database knowledge into the Spark SQL domain. I have had some trouble making sense of how to manipulate RDDs directly, so it’s really empowering to see that I should be able to move pretty freely around Spark DataFrames and Datasets with Spark SQL. It also makes me feel like working with Spark Datasets is a much more broadly applicable skill set, because really anyone who knows SQL can get a handle on what I’m working on.
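
As a last sketch of why that feels so freeing: once a DataFrame is registered as a temporary view, the query against it is just ordinary SQL. This reuses the hypothetical people DataFrame from the earlier sketch and assumes it has an age column.

```scala
// Sketch only: reuses the hypothetical people DataFrame from above and
// assumes it has an age column. The query itself is plain SQL.
people.createOrReplaceTempView("people")

spark.sql("""
  SELECT age, count(*) AS n
  FROM people
  WHERE age IS NOT NULL
  GROUP BY age
  ORDER BY n DESC
""").show()
```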