Walter Blair Personal Blog

MapReduce

In 2008, Dean and Ghemawat [1]⁠ wrote a follow-up to their original 2004 presentation of the MapReduce distributed processing API, a paper which has been cited an insane number of times [2]⁠. Before developing MapReduce, Google was crunching big data all the time for many different reasons, and programmers were getting bogged down in the logistics of handling big data distribution across clusters. Though index and sort algorithms are fairly straightforward, the difficulty in all of these efforts was that programmers from lots of different backgrounds and with many different skill sets had to become experts in managing distributed processing. What a pain! In 2004, Dean and Ghemawat presented MapReduce which provides an API that takes care of everything under the hood so that a programmer without really thinking about it could leverage the powerful Google clusters to do all kinds of different tasks.

MapReduce allows a programmer to crunch big data without worrying about all the moving parts. In my mind, MapReduce consists of a programming model and a cluster manager. The programming model is sort of to divide and conquer a very large input. A map function takes an input of some size, divides it up, and does something to each piece. This something might be a count or a search or a sort or something else. After each little input chunk is mapped to an intermediate output, each intermediate piece is then reduced / aggregated back to some smaller number of outputs. Using this model, a really expensive problem can be solved by lots of cheap little machines. The logistics are a bit less straightforward. The master-slave model is used here, as it’s the master’s job to chunk the input, send each piece to several redundant map workers, keep track of how each worker is doing on each mapping task, tell the reduce workers where to get their intermediate data via RPC, and then help get all the reduced outputs back to the global disk. When workers fail, as they inevitably will in a cluster of thousands of machines, it’s the master’s job to re-execute the failed task to get things moving again (See Fig 1).

In the related works sections of both the 2004 and 2008 papers, Dean and Ghemawat emphasize three key features that put MapReduce ahead of the competition: 1) The programmer doesn’t need to worry about cluster management, 2) the model is fault-tolerant, and 3) the master is able to optimize local disk locality of input chunks. It seems a little odd that the 2008 paper doesn’t really offer anything new in the related works section, because I would think that lots of people would have been working on the problem during this period. It’s certainly possible that the 2004 publication quashed the hopes and dreams of Google’s competitors and stopped them in their tracks, but it’s also possible that thousands upon thousands of citations of the original were just begging for a rehash sequel with prettier fonts…I’m not judging though…But at the same time, it would have been nice for the authors to have offered a bit more perspective on the community that was rapidly building around Hadoop (and Phoenix?) and also address some of the limitations that must have already been identified for a project like Apache Spark to have been started by 2009.

For example, the grep example spends 75% of its runtime on setup while the master sends the data chunks out to all of the workers (See Fig 2). Is there any way around this requirement that each worker have a local copy of the data? Also, the sort example spends half of its time and energy on reading and writing intermediate outputs … surely there must be some hope of generating these intermediates in memory to alleviate this I/O overhead … (See Fig 3). Finally, there are some design differences between HDFS and GFS, like different write consistency guarantees [3]⁠, that I’m sure reflect the different use cases of Google versus users in the open-source community, but Dean and Ghemawat don’t discuss any trade offs that they weigh in-house that might shed some light on the divergent evolution between Google’s implementation and Hadoop.

References
[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol. 51, no. 1, p. 107, 2008.
[2] J. Dean and S. Ghemawat, “Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2004.
[3] A. Daphalapurkar, M. Shimpi, and P. Newalkar, “Mapreduce & Comparison of HDFS And GFS,” Int. J. Eng. Comput. Sci., vol. 3, no. 9, pp. 8321–8325, 2014.

CSIS 604 - MapReduce