CSIS 604 - Pig

14 Jul 2017

Pig

Leave it to Yahoo!, the only company to include the most annoying punctuation mark possible as part of their name, to develop a bad solution to a real problem [1]. The introduction to this paper is really nice, because it gives a concise rundown of the benefits and limitations of Hadoop’s MapReduce framework. Hadoop stepped in to solve the problem of crunching huge amounts of data: it implements the MapReduce model on top of giant, easily scalable clusters of commodity hardware and provides all the infrastructure needed to run it reliably [2]. The authors highlight Hadoop’s reliance on and support for procedural programming as one of the key reasons for its popularity, but argue that the low-level details involved in Hadoop development severely limit its capability to do anything more complex than map and reduce. If you want to do a filter, for example, you have to write all kinds of custom code! Perish the thought! Though Spark would solve this issue in a much better way a few years later [3], we’ll focus on one of Yahoo!’s brain!children and donations to Apache, Pig Latin.

The solution that Pig Latin provides to Hadoop’s limited low-level programming approach is a high-level language that bridges the gap between procedural programming, with which most developers are comfortable, and the more flexible, high-level declarative programming of SQL. It’s odd to me that the goal of Pig is to give programmers the power to specify the execution plan of an elaborate map-transform-reduce job, because I would think that computers would be much, much better at optimizing such transformations. But perhaps they view it like driving stick…except that they do promise the future development of an optimizer, which is totally sending me mixed! Messages! The authors’ guiding-light motivation seems to be to make programmers feel good about themselves, as evidenced by the quoted testimonial from a happy programmer. To dive for one sentence into stylistic concerns, I would normally celebrate the apparent embrace of diversity represented by the authors’ use of feminine pronouns for Pig developers were it not for the fact that the stated motivation is to help entrenched and/or confused developers work with large, difficult datasets. Anywho! Enough sarcasm for now, on to the technical concerns.
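To make the “programmer writes the execution plan” idea concrete, here is the running example from the paper [1], lightly annotated (a caveat: in released versions of Apache Pig, the grouping key after a GROUP is referenced as group rather than by its original field name, if I recall correctly):

    -- Per-category average pagerank of high-pagerank URLs,
    -- but only for categories containing a large number of such URLs [1].
    good_urls  = FILTER urls BY pagerank > 0.2;
    groups     = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    output     = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Each statement is one explicit step in the dataflow, and the order of the statements is the plan the programmer drives stick through.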

Perhaps there is an important niche for Pig, then: it helps with all of the issues mentioned above, especially in the context of ad hoc analysis, where the developer wants to minimize the overhead of writing and debugging new code so that potentially fleeting analyses stay cheap. I appreciate the broadest view of the motivation, which is to let users tinker around and try new things with large datasets with minimal time spent on tedious tasks or on trying to understand what execution choices an opaque optimizer made in executing a query.

Getting more into the details of how Pig achieves a more user-friendly environment, the authors describe the rejection of the first normal form’s requirement that fields be atomic in favor of a more flexible approach that allows fields to contain nested data structures like tuples, bags, and maps. It sounds like this could get hairy, but when I remember that the motivation is ad hoc analysis rather than large-scale production, I can see why this would help a developer work more quickly. More specifically, this structure prioritizes easy-to-write user-defined functions (UDFs), which, several years before Spark entered the scene, would be a real advantage. I think providing the easiest possible access to UDFs must be the key to making Pig a super convenient language for procedural programmers.
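The paper’s own UDF example makes the point: because a field can hold a bag, a UDF can return a whole set of values per input record. A minimal sketch based on that example (expandQuery and the jar name are my hypothetical stand-ins; REGISTER is how Apache Pig loads Java UDF jars):

    REGISTER myudfs.jar;  -- hypothetical jar containing the Java UDF
    -- expandQuery(queryStr) returns a bag of query expansions; the nested
    -- data model lets that whole bag sit directly in the output field.
    expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryStr);

The paper also shows wrapping the call in FLATTEN(…) when you’d rather unnest the bag into one output tuple per expansion.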

I skimmed through the language details, in large part due to my current lack of intent to learn Pig, but I gather that the priority in the language design was to get a developer into parallel-only data wrangling as quickly and painlessly as possible, by recasting the map-reduce process as a procedural foreach–group-by model that is more familiar to most users.
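Roughly, the correspondence works out to this: everything before a GROUP is the map phase, the GROUP itself is the shuffle, and the FOREACH over the grouped data is the reduce. A minimal sketch, assuming a logs relation with url and userId fields (isValid is a hypothetical UDF):

    -- "map" phase: per-record processing
    clean_logs = FILTER logs BY isValid(url);
    -- shuffle boundary
    by_user    = GROUP clean_logs BY userId;
    -- "reduce" phase: per-group aggregation
    counts     = FOREACH by_user GENERATE group, COUNT(clean_logs);
    -- nothing actually executes until output is requested
    STORE counts INTO 'user_counts';

As the paper notes, Pig only builds a logical plan as statements are parsed; compilation into map-reduce jobs is deferred until a command like STORE asks for output.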

What really piqued my interest was the inclusion of a debugger feature, because I’ve found it really confusing and difficult to debug Scala-Spark applications without extra help like Spark’s DAG explorer and such. As much as I make fun of them, I identify with the authors’ assertion that most developers have procedural programming ingrained into their heads and that it’s sometimes mind-bending to reason through a map-reduce program. I like that Pig Pen makes variables and UDFs transparent, I like that you can generate dummy data to test everything, and I sort of hate the name except that it perhaps reminds people that pigs are naturally very clean and organized little critters if they have their druthers.
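As best I can tell, this idea lives on in released versions of Apache Pig as the ILLUSTRATE command, which propagates a small, automatically generated sample dataset through each step of a script:

    -- Show concrete example tuples flowing through every statement that
    -- leads to 'output', without running the job over the full dataset.
    ILLUSTRATE output;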

Regarding the related work, it is super interesting that Dryad comes up, and it clearly illustrates the breakneck pace of competition in this domain. The authors acknowledge that Microsoft’s Dryad makes similar progress over Hadoop but basically say that they can’t comment on how its language compares to Pig because it’s under wraps. I do have an affinity for open-source projects, so I appreciate that Yahoo!, though it pains me to write their company name, has contributed so much to Hadoop and related tools like Pig. I absolutely love that Google developed a language called Sawzall, because I am going to sawzall the crap out of the OSB that will be on my garage in a few weeks, but what’s more related to the current topic is that it seems to be harder to use than Pig. The authors also confidently and almost pithily assert that their Pig Pen debugger is matchlessly awesome.

For future work, the authors point to an execution plan optimizer, which sounds like a set of training wheels that I suppose could help but seems to somehow compromise the freedom so freely given to programmers in the initial concept. I also like that the authors propose a GUI that illustrates the execution plan, because that’s exactly what helps so much when trying to understand a Spark job, so it was obviously a good idea. Everybody and their mom tries to add support for other languages, so it certainly makes sense that Python and Perl users should be thrown a bone in this regard. What sounds more difficult is their final proposal, which is to embed Pig into full procedural languages to expand the procedural tools available to developers. This sounds quite ambitious, but a quick scan through the various releases suggests that they’ve done a great job with Python and JRuby but maybe not Perl… (https://pig.apache.org/releases.html#News).

References

[1] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A Not-So-Foreign Language for Data Processing,” Proc. 2008 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD ’08), pp. 1099–1110, 2008.

[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” Proc. IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), pp. 1–10, 2010.

[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” Proc. 2nd USENIX Conf. on Hot Topics in Cloud Computing (HotCloud ’10), p. 10, 2010.