CSIS 604 - Spark

12 Jul 2017

Spark

Hadoop provides a super resilient/reliable way to distribute big data across many nodes using the MapReduce algorithm, lots of engineered redundancy, and recovery procedures [1], but it is excessively costly in certain situations. In an acyclic job, where a batch of data goes from point A to point B, it makes perfect sense to read the input, save intermediate files as necessary, and write out the results. In a cyclic job, on the other hand, where a single input is reused across many iterations of a single algorithm, it doesn’t make sense to keep rereading the same input and rewriting intermediate results over and over. Spark carries the resiliency and reliability of Hadoop (though it achieves them differently) into a new framework that handles iterative reads and writes really efficiently by loading inputs into an in-memory Resilient Distributed Dataset (RDD) [2]. The RDD is resilient because partitions lost to node failures can be recomputed, thanks in part to the functional features of Scala and also to the directed acyclic graph (DAG) of transformations generated and remembered by the master node.

The main idea of Spark is the RDD and the things you can do with one. You can create an RDD in various ways, like reading a file or building some type of Scala collection and then parallelizing it. One cool thing for the programmer to keep in mind is that an RDD isn’t actually materialized until a parallel operation like reduce, collect, or foreach forces it; cache is just a hint to keep the dataset in memory once it has been computed, and a cached RDD can be reused, which is also really good to keep in mind. The meat and potatoes of Spark include 1) filtering for certain conditions, mapping with custom functions, reducing back to the master, and caching (the text search example), 2) foreach-looping through each element and using accumulator variables as safe, add-only global variables (the logistic regression example), and 3) broadcasting read-only variables to workers (the least-squares example). Broadcast variables are sent to worker nodes only once to minimize overhead. If that wasn’t enough, Scala can be run via the interpreter for a super-fun and exciting live coding extravaganza. These examples run extremely fast, so sorry, Hadoop.
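To make the laziness and the caching concrete, here is a minimal sketch in the spirit of the paper’s text search example, assuming spark is the SparkContext-style handle from the paper and an HDFS path that is just a placeholder:

    // Count the lines containing "ERROR" in a big log file.
    // Nothing actually runs until reduce() is called.
    val file = spark.textFile("hdfs://...")        // RDD of lines; lazy
    val errs = file.filter(_.contains("ERROR"))    // derived RDD; still lazy
    val cachedErrs = errs.cache()                  // hint: keep in memory after first compute
    val ones = cachedErrs.map(_ => 1)              // one 1 per error line
    val count = ones.reduce(_ + _)                 // parallel operation: triggers the chain

Accumulators and broadcast variables look like this in the same 2010-era API; the blank-line count and the severity table are my own illustrative assumptions, not examples from the paper:

    // Accumulator: workers may only add to it; only the driver reads .value.
    val blanks = spark.accumulator(0)
    file.foreach(line => if (line.trim.isEmpty) blanks += 1)
    println(blanks.value)

    // Broadcast variable: shipped to each worker once, then reused by every task.
    val severity = spark.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))
    val total = file
      .map(line => severity.value.collect { case (w, s) if line.contains(w) => s }.sum)
      .reduce(_ + _)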

The paper discusses implementation details of Spark. I didn’t realize that Spark is built on Mesos; I actually thought it was the other way around, that Mesos was built in response to the promise of Spark. Mesos is a helpful layer separating a cluster of machines from Hadoop, Spark, or some other framework trying to use the cluster [3]. I spent a hot second trying to understand Mesos in the context of trying to understand what YARN is and what it does. My understanding is that Mesos is sort of like a cluster operating system and that YARN is yet another way to schedule tasks and resources on the cluster. So maybe Mesos is the port authority and YARN is the crane crew or something (did I mention I like analogies?). I think Mesos is more of a manager for the hardware cluster itself, while YARN is more of a manager for all of the software trying to use the hardware, and would therefore handle cluster users like Hadoop, Spark, etc.

The next set of implementation details they discuss is how lost partitions are recomputed, circumventing the replication strategy of Hadoop. Each RDD implements a shared interface with methods that report which partitions it has, where each partition should live, and how to rebuild a lost partition from its parent dataset by replaying the transformation that produced it. The tasks (and the functions they close over) are serialized and shipped to each node.
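The paper spells that interface out as three methods; here is a minimal Scala sketch where the method names come from the paper but the signatures are my own illustrative guesses:

    // Sketch of the common RDD interface described in the paper.
    // Method names follow the paper; the types are assumptions.
    trait RDD[T] {
      def getPartitions: Seq[Int]                             // list of partition IDs
      def getIterator(partition: Int): Iterator[T]            // (re)compute one partition from its parents
      def getPreferredLocations(partition: Int): Seq[String]  // locality hints for the scheduler
    }

A lost partition is recovered by calling getIterator on it again: the RDD remembers which parent dataset and which transformation produced it, so only that slice of work gets replayed.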

The final implementation detail is the customization of the Scala interpreter, so that variables and objects defined interactively are shipped out and kept up-to-date on whatever nodes need them across the cluster.

The authors say that RDDs are an abstraction of distributed shared memory (DSM), but that the way Spark recovers failed partitions is more efficient than the DSM checkpointing strategy, and also that Spark jumped on the bandwagon of software trying to improve the safety of shared memory from the programmer’s standpoint. Included in the long list of software bested by Spark, Twister implemented a shared-memory distributed framework, but without the fault tolerance of RDDs. A welcome addition to the many strong efforts to be second best, Dryad attempted to use functional programming elements like those in Scala, but without the in-memory triumph of the almighty Sparkolites. I was interested to see IPython come up here, because I didn’t think it was in the slightest bit related, but it turns out that in addition to being interactive, IPython also aims to run parallel cluster computations. Too bad, IPython: Spark can crunch bigger data than you will ever hope to.

The future work is interesting, both because the shuffle operation they propose was later implemented successfully and, I believe, powers Spark SQL [4], and because they don’t yet mention the DataFrame or Dataset abstractions introduced in later versions of Spark 1 and Spark 2, respectively. Perhaps they weren’t expecting the Catalyst optimizer to make Datasets so fast, but the world may never know. And don’t even get me started on streaming, because I don’t understand it at all!

References

[1] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1–10.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” in Proc. 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud ’10), 2010.
[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” in Proc. 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11), 2011.
[4] M. Armbrust et al., “Spark SQL: Relational Data Processing in Spark,” in Proc. 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15), 2015, pp. 1383–1394.