CSIS 604 - HDFS

09 Jul 2017

HDFS

Along with giving a thorough treatment of the Hadoop Distributed File System (HDFS), Shvachko et al. 2010 [1] answered several questions I had after reading the previously assigned paper on MapReduce [2]. I had been wondering what happened to the Google File System (GFS) and whether it led to or became HDFS. I don’t know how much drama was involved, but it seems that HDFS development followed several of the design choices made in GFS while being led chiefly by Yahoo!. That’s the last time I’m going to use the exclamation mark in this paper…

Hadoop is a system that works behind the scenes in a MapReduce job to plan and execute all of the data acquisition, partitioning, and movement, to optimize where computations are placed, and to manage worker nodes reliably. HDFS is the file system component and therefore one of the core components of Hadoop, along with MapReduce. HDFS uses a master/slave architecture in which a single namenode holds all of the metadata (the file system namespace plus the current state of the machines), and the many datanodes in the cluster do the work and report their status back to the namenode. There is also a client that can read, write, and manage files by coordinating with the namenode and pushing data to the datanodes.
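To make that client interaction concrete, here is a minimal sketch of reading and writing a file through Hadoop's Java FileSystem API. The namenode address and file path are placeholders I made up, and error handling is omitted; think of it as an illustration of the pattern rather than production code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the namenode; the address here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/notes.txt"); // placeholder path

        // Write: the client asks the namenode to allocate blocks and choose
        // datanodes, then streams the bytes to those datanodes directly.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the namenode returns block locations, and the client pulls
        // the data straight from the datanodes (the namenode never touches it).
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```

The point of the split is visible even in this toy example: the namenode only handles metadata requests, while the heavy data traffic goes client-to-datanode.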

One of the major challenges that comes with distributing data is reliability, and most of the paper is concerned with design choices that involve reliability:

1) The namespace, which I understand to be the file and directory metadata the namenode keeps for the cluster, is protected by checkpoints and a journal. In the same way that a trial balance was used in my accounting internship to double-check transactions, checkpoints and the journal keep an account of the goings-on in the cluster.
2) Each datanode must pass a handshake with the namenode that verifies compatible software versions and namespace IDs.
3) Datanodes send block reports to the namenode on the state of their data blocks and also send pings at regular intervals called heartbeats.
4) Checkpoint and Backup nodes and file system snapshots are maintained to recover gracefully from namenode and other failures.
5) Data blocks are replicated in triplicate across datanodes: the first replica is placed on the writer's node for speed, while the second and third replicas are placed on two different nodes in another rack. The replication factor is configurable per file (see the sketch after this list).
6) A block scanner periodically verifies data block checksums to look for corruption.
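As a small, hedged illustration of item 5, here is how a client could set the replication factor with the Java API, both as a client-side default (the dfs.replication property) and per file after the fact. The file path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default replication factor for newly created files
        // (three is the HDFS default).
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed per file later; the namenode then
        // schedules the extra copies (or deletions) in the background.
        Path path = new Path("/user/demo/notes.txt"); // placeholder path
        fs.setReplication(path, (short) 4);

        fs.close();
    }
}
```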

Let’s talk about related work. It’s interesting to me that one of the most important and guiding decisions when designing a distributed file system is how to ensure data reliability, that is, whether to rely on guarantees or redundancy (or both), and HDFS and GFS chose triplicate redundancy. Several other distributed file systems went with something like RAID instead. HDFS also shares the architecture of a single namenode organizing a cluster of datanodes, but it sounds like some related systems, including GFS, have distributed the namespace, which I take to mean that there would be several namenodes in a given cluster. I’m not exactly sure I have this right, and I’m definitely not sure how it would work, so I’ll stick with the framework involving a single namenode. It sounds strange to have multiple maestros conducting a single symphony, but jumping ahead to future work, it sounds like Yahoo was (or is) trying to transform the namespace from a cluster namespace into a job namespace, meaning that one namenode would exist per job. This would allow multiple namenodes to exist in a single cluster, I think. The goal is to make the namenode/datanode architecture more scalable by dividing up the amount of metadata a single namenode is responsible for, thereby easing the namenode’s memory limitations. A distinguishing characteristic of HDFS seems to be its API, which allows easy access from the HDFS client, though I’m not exactly sure what this implies about the related distributed file systems.

One issue I didn’t really understand was client permissions. The authors admitted that the checks on the client’s identity were minimal and promised to add some sort of external trusted source of authentication in future work. I skimmed the Hadoop 2.8.0 permissions guide (https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html) to look for external authentication but didn’t really understand what I was looking at. It still looks like identity is mainly the operating system’s responsibility. I’m not really clear on what the security threat is here; presumably the client would be accessed via some machine that has robust security.
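For what it's worth, the permission model itself (POSIX-style owner/group/other bits) is exposed through the same Java API. Below is a rough sketch of setting and reading those bits on a file; the path is a placeholder, and note that none of this addresses the authentication gap the authors pointed out, since the namenode simply trusts whatever identity the client presents.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/notes.txt"); // placeholder path

        // Owner: read+write, group: read, other: none. The namenode enforces
        // these bits against the identity the client claims to have.
        fs.setPermission(path, new FsPermission(FsAction.READ_WRITE,
                                                FsAction.READ,
                                                FsAction.NONE));

        // Read the bits back.
        System.out.println(fs.getFileStatus(path).getPermission());
        fs.close();
    }
}
```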

Another issue I didn’t quite understand is why a core switch failure couldn’t be mitigated somewhat by requiring that some number of replicas be located behind different core switches. I guess this is a choice made in the ongoing speed vs. reliability tug of war: I would presume that spreading replicas across core switches would improve reliability but decrease performance.

References
[1] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Proc. IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), pp. 1–10, 2010.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.