Hive v. MySQL Cluster
Fuad et al. present a quick-and-dirty comparison of MySQL Cluster, an apparently common framework for crunching big data, and the newer Apache Hadoop MapReduce + Hive data management framework, with a half-baked bone thrown to Pig along the way [1]. The authors seem to acknowledge that MySQL is sort of the king here, perhaps for historical reasons and developer familiarity, but their aim is to show that as datasets grow, the choice is either to keep beefing up the MySQL hardware to keep pace with ballooning memory requirements or to move to a Hadoop/Hive cluster built on low-cost hardware that scales inexpensively.
We’ve looked at Hadoop’s MapReduce programming model [2] and HDFS distributed file management [3], but we haven’t looked specifically at MySQL Cluster. Wading briefly into the Hadoop v. MySQL Cluster debates on StackOverflow, it looks like there are major differences to consider here beyond scale. Some users argue that MySQL Cluster naturally extends all of the normal typed, relational MySQL features that we learned about in CSIS 601, whereas Hadoop doesn’t care at all about normalization, data types, etc. I can’t really verify this claim, as it seems to me that MySQL Cluster is less an extension of MySQL than a separate distributed storage architecture that can be reached through SQL, NoSQL, etc. [4]. The real difference seems to be how reliability is achieved: MySQL Cluster randomly shards data and does a lot of synchronous updating, which requires heavy-duty memory gymnastics to keep track of indices and keys, whereas Hadoop relies more on systematically selected redundancy [1].
Hive is a technology we haven’t yet looked into, and I don’t really understand much about data warehousing in general. Looking at one of the technical reports on Hive made me realize I didn’t know anything about it at all [5]! Hive seems to be a layer through which Hadoop can work with most of the data people deal with most of the time. One of the things Hive does is translate SQL-like queries into MapReduce jobs that Hadoop can run, which I now realize is the basis for Hive’s inclusion in the current paper. More and more, Hive performs data warehousing services, but I’ll learn about that another day. Interestingly, Steinbach compares Hive to Pig and says that one of the major differences between the two is simply whether a developer is more comfortable with an imperative, step-by-step language like Pig’s or a declarative language like Hive’s SQL [5].
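To make the relationship between a SQL query and a MapReduce job concrete for myself, here is a minimal Python sketch of how an aggregate-and-sort query along the lines of SELECT user, SUM(bytes) AS total FROM t GROUP BY user ORDER BY total DESC might break down into map, shuffle, and reduce steps. The table, column names, and function names are all hypothetical, and this only illustrates the general pattern; it is not how Hive actually plans or executes queries.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table; not data from the paper.
rows = [
    {"user": "ann", "bytes": 120},
    {"user": "bob", "bytes": 300},
    {"user": "ann", "bytes": 80},
]


def map_phase(row):
    # Emit a (group key, value) pair, as a mapper would for GROUP BY user.
    yield row["user"], row["bytes"]


def shuffle(pairs):
    # Collect all values under their key, mimicking the shuffle step
    # that sits between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Apply the aggregate, here SUM(bytes), to one group.
    return key, sum(values)


def run_query(table):
    pairs = (pair for row in table for pair in map_phase(row))
    grouped = shuffle(pairs)
    totals = [reduce_phase(key, values) for key, values in grouped.items()]
    # ORDER BY total DESC
    return sorted(totals, key=lambda kv: kv[1], reverse=True)


result = run_query(rows)
print(result)
```

The point of the sketch is only that the declarative query never mentions mappers or reducers; something has to derive those steps from it, and that is the role I understand Hive to play.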
The take-home message from the data in the current paper is that the Hadoop/Hive cluster takes a fair amount of overhead to get started and is, largely for this reason, slower than MySQL Cluster on smaller versions of the test dataset. When the number of records increases to a seemingly ridiculous but totally plausible size, MySQL Cluster’s performance suffers greatly from all the memory required to keep track of everything, whereas Hadoop/Hive finally reaps the benefits of its over-the-top architecture and outperforms MySQL Cluster. The authors note that beefier hardware may confer advantages to MySQL Cluster, whereas adding more low-cost hardware will suffice for Hadoop.
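The crossover described above can be sketched with a toy cost model: a fixed startup cost plus a per-record cost for each system. Every number below is made up for illustration and none comes from the paper. A straight line is also a simplification, since the MySQL Cluster slowdown is attributed to memory pressure rather than to a constant per-record cost.

```python
def job_time(millions_of_records, startup_s, seconds_per_million):
    """Toy model: total time = fixed startup overhead + linear per-record cost."""
    return startup_s + millions_of_records * seconds_per_million


def crossover(startup_a, rate_a, startup_b, rate_b):
    """Dataset size (in millions of records) at which both systems take the
    same time, or None if the two lines never cross at a positive size."""
    if rate_a == rate_b:
        return None
    n = (startup_b - startup_a) / (rate_a - rate_b)
    return n if n > 0 else None


# Hypothetical parameters, NOT measurements from the paper.
LOW_OVERHEAD = {"startup_s": 1.0, "seconds_per_million": 4.0}    # stand-in for MySQL Cluster
HIGH_OVERHEAD = {"startup_s": 30.0, "seconds_per_million": 1.0}  # stand-in for Hadoop/Hive

break_even = crossover(
    LOW_OVERHEAD["startup_s"], LOW_OVERHEAD["seconds_per_million"],
    HIGH_OVERHEAD["startup_s"], HIGH_OVERHEAD["seconds_per_million"],
)
print(f"Break-even at roughly {break_even:.1f} million records (toy numbers)")
```

Below the break-even size the low-overhead system wins because startup dominates; above it the system with the cheaper per-record cost pulls ahead, which is the shape of the result the authors report.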
It seems odd to throw Pig into the comparison here, because Pig wasn’t developed for speed but rather for ad hoc data exploration. On the other hand, it’s nice to have filler data to make your paper longer! The authors mention several times that Pig is not well suited to the straightforward queries used in this study, so it would have been really interesting to include some of the more complex queries to which they allude. This study was a half step away from being able to show at what level of query complexity a Pig implementation becomes preferable to a SQL one, and they have left me in suspense!
One result for which I don’t believe I saw an explanation was the difference between Hive’s Query 6 performance and that of MySQL Cluster and Pig. Hive doesn’t take a huge performance hit on Query 6, which is an aggregation and sort, but this is by far the worst query for MySQL Cluster and Pig. I don’t really understand why, but I have a feeling that this single query was the real reason Hive’s average performance looked so much better, so it merits some serious discussion. If I just missed it, then I am a monkey’s uncle.
References
[1] A. Fuad, A. Erwin, and H. P. Ipung, “Processing performance on Apache Pig, Apache Hive and MySQL cluster,” Proc. 2014 Int. Conf. Information, Commun. Technol. Syst. ICTS 2014, pp. 297–301, 2014.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” 2010 IEEE 26th Symp. Mass Storage Syst. Technol. (MSST), pp. 1–10, 2010.
[4] M. Ronstrom and L. Thalmann, “MySQL cluster architecture overview,” MySQL …, 2004.
[5] C. Steinbach, “Apache Hadoop* Community Spotlight: Apache Hive,” 2013.