Key Takeaways
- The open source distributed block storage system Ceph RBD started from an idea triggered by community feedback and was implemented through collaborative, iterative open source development.
- The architecture of RBD leverages core Ceph/RADOS capabilities to deliver scalable and reliable distributed block storage.
- The open and transparent development process, involving both core maintainers and community contributors, was key to RBD’s quick adoption.
- RBD is foundational for cloud infrastructure such as OpenStack and Kubernetes, demonstrating the long-term value of building on open standards and collaboration.
- Open source systems like Ceph may have humble beginnings, but they continue to evolve through the community-driven innovation that is central to their success.
This year marks fifteen years of RADOS Block Device (RBD), the Ceph block storage interface. Ceph is a distributed storage system started by Sage Weil for his doctoral dissertation at the University of California, Santa Cruz, and originally designed purely as a distributed filesystem built to scale from the ground up. It has since evolved into a unified, enterprise-grade storage platform: in addition to the filesystem interface, Ceph now supports object and block storage. The RESTful object storage interface (RADOS Gateway, or RGW), designed to be compatible with AWS S3, and RBD, the block storage system, were later additions that expanded Ceph’s capabilities. This anniversary is a good opportunity to look back at how RBD came to be.
I joined the Ceph project in 2008. My first commit was in January; I started working full-time later that year. The beginning was very exciting. Sage and I shared an office on the fiftieth floor of a high-rise building in downtown Los Angeles, and every day at lunch we had our very private Ceph conference. Nowadays, Cephalocon, the annual Ceph conference, draws hundreds of participants from all over the world. At that point, Ceph had graduated from academia and had just started its second phase: incubation at DreamHost, a company Sage had co-founded years before. The total number of people working on Ceph full-time was two.
On my first day, Sage told me that there was a TODO file in the repository and that I should do what I wanted. I took the two parts of that sentence as two distinct and independent pieces of information, and I leaned towards “I should do what I want” after I took a look at the TODO file and saw the following:
- ENOSPC
- finish client failure recovery (reconnect after long eviction; and slow delayed reconnect)
- make kclient use ->sendpage?
- rip out io interruption?
bugs
- journal assert(header.wrap) in print_header()
big items
- ENOSPC
- enforceable quotas?
- mds security enforcement
- client, user authentication
- cas
...
In those early days, the heap of things we could do was endless, so we really tried to explore a lot of directions. The then-recent snapshots feature that Sage had added to Ceph accounted for a significant chunk of this TODO file. The first project I chose to work on, the ceph.conf configuration system, was not as glamorous but was essential. Up until then, in order to run any Ceph command, you needed to pass all the configuration on the command line. For an academic project that may have been acceptable, but a viable configuration system is required for any useful application.
A RESTful Object Storage System
We continued working on getting the Ceph filesystem ready for primetime and, while doing so, we also thought about other great things the storage system could do. In early 2009, I started to work on a new and exciting RESTful object storage system (initially dubbed C3, and very quickly switched to the temporary name RADOS Gateway, or RGW). Ceph already had an internal object storage system called RADOS, so why not expose it directly via the S3 API? It turned out that there were a lot of reasons why a direct 1:1 mapping of RADOS to S3 was not a good idea.
The RADOS semantics are quite different from what an S3-compatible system requires. For example, RADOS has a relatively small object size limit, while S3 supports objects as large as 5 TB. S3 keeps objects in indexed buckets; RADOS keeps objects in unindexed pools, so listing objects without an index was a very inefficient operation. We hit quite a few such issues on our way to figuring out the right RGW architecture.
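To make the indexing mismatch concrete, here is a minimal, purely illustrative Python sketch (not the actual RGW design; all names are hypothetical). Because a RADOS pool is a flat, unindexed namespace, an S3-style gateway has to maintain its own bucket index alongside the data objects in order to answer listing requests efficiently.

```python
# Purely illustrative sketch; not the real RGW code or data layout.
class FakePool:
    """Stand-in for a RADOS pool: a flat, unindexed set of named objects."""
    def __init__(self):
        self.objects = {}                     # object name -> payload

class FakeGateway:
    """S3-style PUT/LIST that maintains an explicit per-bucket index."""
    def __init__(self, pool):
        self.pool = pool
        self.indexes = {}                     # bucket name -> set of keys

    def put_object(self, bucket, key, data):
        # Store the payload in the flat pool under a composed name...
        self.pool.objects[f"{bucket}_{key}"] = data
        # ...and record the key in a bucket index, because the pool itself
        # cannot answer "list everything in bucket X" without a full scan.
        # (In real RGW the index also lives in RADOS, with a different layout.)
        self.indexes.setdefault(bucket, set()).add(key)

    def list_bucket(self, bucket):
        return sorted(self.indexes.get(bucket, set()))

gw = FakeGateway(FakePool())
gw.put_object("photos", "cat.jpg", b"...")
gw.put_object("photos", "dog.jpg", b"...")
print(gw.list_bucket("photos"))               # ['cat.jpg', 'dog.jpg']
```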
In parallel, I explored what we called “active objects”, a precursor to Ceph object classes. The idea was to push computation closer to the data, so that we could extend the storage system to do more things. In the first iteration you could push a Python code snippet that was then executed in the Ceph Object Storage Daemons (OSDs).
Following Google News
Back then, when you googled Ceph, most of the search results were either about the Council on Education for Public Health or about the Ceph alien species in the Electronic Arts Crysis game series. I set a Google News alert for the “Ceph” keyword to see if anyone was publishing anything about our project. In early November 2009 I received a notification that linked to an article about Sheepdog, a new distributed block storage system for QEMU. It triggered the Google News alert because someone in the comments suggested that Ceph could be a more viable solution. I pointed it out to Sage:
me: http://www.linux-kvm.com/content/sheepdog-distributed-storage-management-qemukvm
note the ceph reference in the responses
Sage: nice!
yeah this got me thinking that it would be really easy to make a block device driver that just stripes over objects
me: yeah.. we might want to invest some time doing just that
maybe having some rados kernel client
and having a block device based on that
Sage: it'd mean cleaning up the monc, osdc interfaces.. but that's probably a good thing anyway
...
Early RBD Implementations
Understandably, we didn’t put all of our other work on hold to implement this. We were busy implementing CephX, the Ceph authentication and authorization subsystem (the X was a placeholder until we decided how to name it, a task we never got around to). The Ceph filesystem kernel module had yet to be merged into the Linux kernel, a milestone we had been actively working towards for a while. True to the open process that made Ceph what it is, Sage published a mailing list message about the idea the next week. He suggested two projects (Weil, Sage. Email to the ceph-devel mailing list, 11 November 2009):
- Put together a similar qemu storage driver that uses librados to store the image fragments. This should be extremely simple (the storage driver is implemented entirely in user space). I suspect the hardest part would be deciding how to name and manage the list of available images.
- Write a linux block device driver that does the same thing. This would be functionally similar to creating a loopback device on top of ceph, but could avoid the file system layer and any interaction with the MDS. Bonus here would be fully supporting TRIM and barriers.
The response to this call to action came a few months later from Christian Brunner, who sent us an initial implementation of a QEMU driver. We took what he created as a basis and started getting it ready for inclusion into upstream QEMU. The Ceph filesystem module was merged into the upstream Linux kernel within a couple of weeks, which was a huge success for the project. I also decided to work on a second kernel driver, this time a block device driver compatible with the QEMU driver.
The two RBD drivers were separate implementations that shared very little code: one was written to run in userspace and integrate with the QEMU block interfaces, while the other ran as a Linux kernel module and implemented the kernel block driver interface. Both drivers were pretty lean and converted block I/O operations into RADOS object operations. A single block image was striped over many small RADOS objects, which allowed operations to run concurrently on multiple OSDs and let RBD benefit directly from Ceph’s scale-out capabilities.
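As a rough illustration of the striping scheme, the following Python sketch maps a byte offset in a block image to a RADOS object name and an offset within that object. The object size and naming convention here are simplified assumptions for the example, not the exact format either driver uses.

```python
# Illustrative sketch of the striping idea only; real RBD object naming,
# header formats, and striping options are more involved.
OBJECT_SIZE = 4 * 1024 * 1024          # assume fixed-size 4 MiB objects

def block_to_object(image_prefix: str, byte_offset: int):
    """Map a byte offset in the block image to (RADOS object name, offset)."""
    index = byte_offset // OBJECT_SIZE
    offset_in_object = byte_offset % OBJECT_SIZE
    # Hypothetical naming: each image gets a unique prefix, and each stripe
    # unit becomes its own RADOS object identified by its index.
    return f"{image_prefix}.{index:016x}", offset_in_object

# A write at offset 9 MiB lands in the third object of the image; because
# different objects usually live on different OSDs, independent requests
# can be serviced in parallel.
print(block_to_object("image_abc123", 9 * 1024 * 1024))
```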
More Capabilities
We added more capabilities to the two RBD drivers: a management tool for RBD volumes and support for snapshots. For snapshots to work correctly, the running instances needed to learn about them as they were created. To do this, I implemented a new Ceph subsystem called Watch/Notify, which allows events to be sent through RADOS objects. Each RBD instance “watches” its image metadata object, and the admin tool sends a notification to that object when a new snapshot is created.
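The following is a purely conceptual Python sketch of the watch/notify pattern; the class and method names are hypothetical and do not reflect the actual librados API.

```python
# Conceptual sketch of watch/notify; names are hypothetical, not librados calls.
class FakeImageHeader:
    """Stand-in for the RADOS object holding an image's metadata."""
    def __init__(self):
        self.watchers = []

    def watch(self, callback):
        # A mapped RBD instance registers interest in its image header object.
        self.watchers.append(callback)

    def notify(self, event):
        # The admin tool notifies through the same object, and every watcher
        # (every mapped instance of the image) receives the event.
        for callback in self.watchers:
            callback(event)

header = FakeImageHeader()
header.watch(lambda event: print("refreshing snapshot list after", event))
header.notify({"op": "snap_create", "name": "before-upgrade"})
```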
Another subsystem we created, and used here for the first time, was Ceph object classes. This mechanism allows specialized code to be registered in the RADOS I/O path and invoked to either mutate a RADOS object or perform a complex read operation on it. The first object class tracked the names of the RBD snapshots and their mappings to RADOS snapshot IDs. Instead of a racy read-modify-write cycle that would have required locking or some other mechanism to deal with races, we could send a single RBD snapshot-creation call that executed atomically on the OSD.
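The sketch below conveys the object-class idea in plain Python: the client names an operation, and the mutation runs as a single atomic step next to the data, so concurrent snapshot creations cannot clobber each other. The names are hypothetical; real object classes are native plugins that run inside the OSD.

```python
# Conceptual sketch only; real object classes are native plugins inside the OSD.
import threading

class FakeOSD:
    """Executes registered 'object class' methods atomically with respect to
    other operations on the same object."""
    def __init__(self):
        self.header = {"snaps": {}}        # snapshot name -> snapshot id
        self._lock = threading.Lock()      # stands in for per-object ordering

    def call(self, method, *args):
        # One round trip: the client names the method, and the OSD runs it
        # next to the data instead of the client doing read-modify-write.
        with self._lock:
            return method(self.header, *args)

def snap_add(header, snap_name, snap_id):
    """Hypothetical object-class method: register a snapshot name -> id."""
    if snap_name in header["snaps"]:
        raise ValueError("snapshot already exists")
    header["snaps"][snap_name] = snap_id
    return snap_id

osd = FakeOSD()
osd.call(snap_add, "daily-backup", 7)
print(osd.header["snaps"])                  # {'daily-backup': 7}
```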
Upstream Acceptance
Creating the RBD Linux kernel device driver required cleaning up all the Ceph kernel code and moving the common RADOS code into a separate kernel library. We got it into the Linux kernel in record time: it was merged upstream in October 2010, just over six months after the filesystem was merged.
Christian, who continued to help with the development of the QEMU driver, now recalls what the hardest part of getting the QEMU driver upstream was:
“At that time it was quite a discussion to convince the QEMU project that a driver for a distributed storage system would be needed”.
Moreover, there was a separate discussion within the QEMU project about whether the driver should be merged at all, or whether they should instead create a pluggable block storage mechanism that would allow different drivers to be added externally, without their involvement. Around the same time, the Sheepdog driver was also under review and was part of the same discussion. The QEMU developers didn’t want to deal with the issues and bugs that the new drivers would inevitably bring with them. Both we and the Sheepdog developers made it clear that we would deal with any issues arising from our drivers. In the end, the monolithic path prevailed and it was decided that the drivers would be part of the QEMU repository.
We went through the review process and made sure we were responsive and fixed all the issues the reviewers brought up. For example, the driver’s original threading model was wrong and needed to be addressed. Finally, a few weeks after the RBD driver was merged into the Linux kernel, the QEMU driver was merged upstream as well.
It took only about a year to go from the first idea to having both drivers merged. The whole project spurred multiple subprojects that are now fundamental to the Ceph ecosystem. RBD was built on the sound foundations that RADOS provided and, as such, benefited from the hard work of all Ceph project contributors.
The Benefits of RBD
RBD became almost an overnight success, and it is now a key piece of storage infrastructure across virtualization and cloud systems. It became the de facto standard for persistent storage in OpenStack thanks to its seamless integration and scalability. In Kubernetes it provides the reliable persistent storage layer that stateful containerized applications require. In traditional virtualization and private cloud environments, it offers open, distributed, and highly available VM storage as an alternative to proprietary storage. It continues to improve and evolve, thanks to the hard work of the many contributors to Ceph and to the other projects that intersect with it.
Looking Ahead
This is a collaborative effort that demonstrates the power of open source. Whatever replaces the old TODO file will hopefully be less obscure, but it will not be any shorter. There is still much to do, and there are even more places to innovate that we cannot yet imagine. Sometimes the spark of an idea and a willing community are all that is needed.
Thank you to Christian Brunner and Sage Weil for their valuable comments.