Ceph
Leave it to UC Santa Cruz to make everything that is good in the world. Weil et al. introduced Ceph in 2006 as a truly scalable and resilient file system designed for large clusters [1]. It seems like this is the first time many of the modern file system features I take for granted showed up, like the use of reusable, deterministic functions to regenerate lost data and/or metadata when recovering from node failures. I totally thought this was just a Spark thing! But then again, Spark was also developed in this neck of the woods, hence my shameless fawning over this institution.
It sounds like the motivation for Ceph was the increasingly strained performance of high-performance clusters operating at rapidly growing scale. Traditional file systems tended to be very centralized, in a way that reminds me of the days when everyone got their news from one of a handful of major networks. The information was as reliable as possible, thanks to all of the control a centralized system affords, but the tradeoff was speed and availability. It takes a while for information to flow through centralized systems, and when the control points fail they become a huge bottleneck. The technology that laid the immediate groundwork for Ceph was the development of object-based storage devices (OSDs), which seem like the first step in distributing the decision-making and execution optimization of a distributed file system. Ceph took the next step by distributing not only the data but also the metadata in order to widen this bottleneck.
Which makes me wonder about the name. Ceph is short for cephalopod, a root that literally means head, yet the project goes a long way toward decentralizing file systems for the sake of speed and resilience. Distributing data and metadata works a lot like distributing news outlets: a bit of central control is sacrificed so that worker nodes can get what they need faster, and so that the system can recover easily when a single cog stops turning.
An important component behind this distribution is a hash function of sorts called CRUSH, which lets the location of any piece of data be calculated on the spot, circumventing the need to travel to a different part of the cluster to look up allocation information; it also means file metadata no longer has to carry long allocation lists. Dynamic Subtree Partitioning is the algorithm that dynamically optimizes how metadata is distributed across the cluster: it identifies very busy directories and maps their contents across a range of MDS servers, easing the workload on each one and preventing metadata I/O from becoming a bottleneck.
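To convince myself I had the point, here is a toy sketch of computed placement in the spirit of CRUSH. It is emphatically not the real CRUSH algorithm (which walks a weighted hierarchy of failure domains); it just captures the core idea that any machine holding the same cluster map calculates the same answer, so nobody has to ask a lookup table where an object lives. All of the names below are mine.

```python
# Toy illustration of computed placement in the spirit of CRUSH -- not the real
# algorithm, just the idea that placement is computed, not looked up.
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic across machines and runs (unlike Python's built-in hash()).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def object_to_pg(object_name: str, num_pgs: int = 128) -> int:
    # Objects are first hashed into placement groups (PGs), not straight onto OSDs.
    return stable_hash(object_name) % num_pgs

def pg_to_osds(pg_id: int, osds: list[str], replicas: int = 3) -> list[str]:
    # Deterministically rank the OSDs for this PG and keep the first few as replicas.
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg_id}:{osd}"))
    return ranked[:replicas]

osd_names = [f"osd.{i}" for i in range(10)]
pg = object_to_pg("/experiments/run-7/output.dat")
print(pg, pg_to_osds(pg, osd_names))  # identical output on every node with the same map
```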
A lot of the work of keeping information reliable comes from the way a Ceph client talks to the MDS in charge of the relevant metadata and to the OSDs that actually hold the data. The client asks the MDS for a capability, like read or write, on a particular file; the MDS decides whether to grant it and locks the file accordingly if the client is allowed to go work with it. The client then independently figures out where the data lives using CRUSH and reads from or writes to the OSDs holding it. Finally, the client reports back to the MDS on any changes it made. Fully synchronous I/O like this takes a long time, so Ceph offers the option of relaxed, asynchronous semantics, which work quite well owing to a number of POSIX extensions developed by the scientific computing community over the years. Ceph is much less lax with metadata and makes it fairly inconvenient to switch off synchronous I/O there.
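To keep that dance straight in my head, here is a toy, in-memory sketch of it, reusing the placement helpers from the sketch above. Everything here (ToyMDS, ToyOSD, and their methods) is invented for illustration and is not Ceph's real client API; a real client would also stripe a file over many objects rather than writing just one.

```python
# Toy sketch of the client <-> MDS <-> OSD flow described above.  ToyMDS/ToyOSD
# and their methods are made-up stand-ins, not Ceph's actual interfaces.
from dataclasses import dataclass

@dataclass
class Capability:
    path: str
    mode: str
    granted: bool

class ToyMDS:
    """Stands in for a metadata server: grants capabilities, records updates."""
    def open(self, path: str, mode: str) -> Capability:
        # A real MDS would also lock the file and hand back striping/layout info.
        return Capability(path, mode, granted=True)
    def close(self, cap: Capability, new_size: int) -> None:
        # A real MDS would fold the new size/mtime back into the metadata tree.
        print(f"MDS updated: {cap.path} is now {new_size} bytes")

class ToyOSD:
    """Stands in for an object storage device holding object data."""
    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
    def write(self, name: str, data: bytes) -> None:
        self.objects[name] = data

def client_write(mds: ToyMDS, osd_daemons: dict[str, ToyOSD],
                 cluster: list[str], path: str, data: bytes) -> None:
    cap = mds.open(path, "write")             # 1. ask the MDS for a write capability
    if not cap.granted:
        raise PermissionError(path)
    pg = object_to_pg(path)                   # 2. compute placement locally, CRUSH-style
    for osd_name in pg_to_osds(pg, cluster):  # 3. talk to the OSD replicas directly
        osd_daemons[osd_name].write(path, data)
    mds.close(cap, new_size=len(data))        # 4. report the result back to the MDS

cluster = [f"osd.{i}" for i in range(6)]
client_write(ToyMDS(), {name: ToyOSD() for name in cluster},
             cluster, "/data/results.csv", b"a,b,c\n")
```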
I’ve talked a bit about the OSDs that hold the distributed data, and I’m a little surprised to find that the authors centralize these OSDs, in a sense, within the Reliable Autonomic Distributed Object Store (RADOS). Part of RADOS’s responsibility is to implement CRUSH to pseudo-randomly (and therefore resiliently) distribute data across placement groups, which are groups of OSDs designed to strike a balance between locality optimization and fault tolerance. The key point here is that the CRUSH hashing strategy is replicable, so the mapping of data to OSDs can be regenerated by any machine with the hash function and the cluster map. Another responsibility of RADOS is to respond to messages from OSDs and monitor OSD faults in a distributed framework. What boosts cluster performance here is that it is not RADOS’s responsibility to recover from failures centrally; rather, it is the job of each OSD to talk to the other OSDs in its placement group and sort of get with the program.
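One more small extension of the earlier toy placement sketch, to show why recovery can be this decentralized: when the cluster map loses an OSD, everyone recomputes placement against the new map, and the placement group's surviving members bring the newly chosen OSD up to speed. Again, this is my own illustration of the idea, not RADOS's actual peering and recovery protocol.

```python
# When an OSD fails, no central party hands out new locations: each node just
# recomputes placement against the updated cluster map, and the PG's surviving
# replicas copy data to whichever OSD newly appears in the result.
def placement_after_failure(pg_id: int, osds: list[str], failed: set[str],
                            replicas: int = 3) -> list[str]:
    alive = [o for o in osds if o not in failed]
    return pg_to_osds(pg_id, alive, replicas)

all_osds = [f"osd.{i}" for i in range(10)]
before = pg_to_osds(42, all_osds)
after = placement_after_failure(42, all_osds, failed={before[0]})
print(before, "->", after)  # the two survivors stay put; one new peer is chosen
```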
The brief discussion of the OSD’s local file system went somewhat over my head. My shallow understanding is that an OSD avoids the usual overhead of a Linux kernel filesystem in favor of one (EBOFS) that is particularly well suited to dealing in objects. I don’t fully understand what is meant by a user space approach, though my guess is that the object store runs inside the OSD’s own process rather than in the kernel, which is how it sidesteps that overhead. The authors understandably spend a while demonstrating the remarkable performance of Ceph, but I think the quickest demonstration was the near-linear scaling of cluster performance under a handful of different tasks.
What stood out to me the most in the related work section was the mention of the Google File System, which was an inspiration for what would later become HDFS [2]. The authors of the current paper claim that GFS is most useful for very large files and that its departure from POSIX standards makes it a bit too niche. I don’t exactly understand whether they mean large input files or large file chunks, and I also don’t know what ‘large’ really is, but I would think the scientific computing community very often deals with large files: genomic data, satellite and medical imagery, etc. On the other hand, I have no doubt that Google was a bit too proprietary with its conventions. I’m also interested in how HDFS interacts with Ceph at this point. From a really brief scan online, it seems like a lot of folks are running Hadoop on top of the Ceph file system (CephFS). My first thought is that the motivation here is to keep existing Hadoop tooling while swapping out the storage layer underneath it, but I also wonder if this has to do with the file size and POSIX issues raised by the authors.
References
[1] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: A Scalable, High-Performance Distributed File System,” in Proc. 7th USENIX Symp. on Operating Systems Design and Implementation (OSDI ’06), pp. 307–320, 2006.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Proc. IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), pp. 1–10, 2010.