The Exadata Storage Server provides high-performance I/O services for the Oracle RDBMS, and Oracle is using PMEM to improve its logging latency by up to 8x.
One of the most performance-sensitive types of I/O is database logging, where low latency is vital, especially for online transaction processing systems. Until recently, the Exadata Storage Server provided low-latency database logging by using flash devices. However, the advent of persistent memory brought the ability to write data to a persistent store with the latency of a memory access, which is much faster than even flash storage. Nonetheless, using persistent memory for database logging presented some significant challenges:
- A single Exadata Storage Server can provide database logging for many databases. Simply placing the database logs on persistent memory would be cost-prohibitive, as the logs could occupy many gigabytes of space.
- The Exadata Storage Server and the Oracle RDBMS communicate via a network, with the RDBMS sending I/O requests and Exadata replying with an I/O response. Although the network is a high-speed interconnect, network latency is still a major factor, and the round-trip for network requests would far surpass the latency of writing data to persistent memory.
To solve the challenge of using persistent memory in a space-efficient and cost-efficient manner, we decided to use a minimal amount of persistent memory in a shared fashion, and as a temporary store:
- Minimal usage – By default, Exadata will use 960 MB of persistent memory for the purpose of servicing database logging requests. The 960 MB is configurable and can be changed should the need arise.
- Shared – The persistent memory allocated for logging use is available to any and all databases. This avoids the problem of trying to configure logging space per database, and lets each database use as much space as it needs on demand.
- Temporary store – Once logging data is written to persistent memory, it resides there only briefly: as soon as possible, it is flushed to disk or flash, where the actual database log is stored permanently. Once the data is flushed, the corresponding space in persistent memory is made available for reuse.
It should be noted that this is NOT a caching mechanism. The persistent memory log is never used to satisfy reads; all data is flushed as soon as possible in FIFO order.
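The shared, temporary-store behavior described above can be sketched as a small FIFO staging area. This is an illustrative model only, not Exadata's implementation: the class `StagingLog`, its method names, and the tiny 16-byte capacity are hypothetical choices for the example, and a Python list stands in for the permanent log on disk or flash.

```python
from collections import deque


class StagingLog:
    """Toy model of a shared, fixed-size staging area flushed in FIFO order."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.pending = deque()   # (database name, payload), oldest first
        self.permanent = []      # stand-in for the log on disk or flash

    def append(self, db, payload):
        """Stage a log write; return False if there is no free space."""
        if self.used + len(payload) > self.capacity:
            return False
        self.pending.append((db, payload))
        self.used += len(payload)
        return True

    def flush_one(self):
        """Move the oldest staged write to permanent storage and free its space."""
        if not self.pending:
            return False
        db, payload = self.pending.popleft()
        self.permanent.append((db, payload))
        self.used -= len(payload)
        return True


log = StagingLog(capacity_bytes=16)
log.append("sales", b"12345678")
log.append("hr", b"abcdefgh")
full = not log.append("sales", b"x")   # area is full until something is flushed
log.flush_one()                        # oldest entry leaves first
retry = log.append("sales", b"x")      # freed space is reusable by any database
```

Note that nothing in the sketch reads staged data back to serve a request, which mirrors the point that the persistent memory log is a staging area rather than a cache.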
To solve the problem of achieving ultra-low latency with minimal network overhead, we decided to use RDMA. Specifically, we used RoCE (RDMA over Converged Ethernet) along with IB Verbs, which provide a rich set of functionality for remote memory operations. The primary IB Verbs feature we used was Shared Receive Queues (SRQs); in a nutshell, an SRQ is a mechanism that allows you to create a pool of buffers that is shareable across a set of client connections. The Exadata Storage Server partitions the persistent memory log into a set of buffers, and then places those buffers into SRQs. Once clients connect to the Exadata Storage Server, they can issue an RDMA operation, which will then use any available free buffers assigned to those SRQs. When a message is sent via RDMA, the server is notified which buffers contain the data. When the server is done processing the request, it places the buffers back on the SRQs. In addition, logging data sent via RDMA is prioritized on the network end-to-end to prevent any congestion.
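The buffer lifecycle described above (post buffers to a shared pool, let any client's message land in a free one, notify the server, then repost) can be modeled in a few lines. This is a toy sketch in plain Python, not RDMA code and not Exadata's implementation: `SharedReceivePool` and its methods are hypothetical names, and returning `None` when no buffer is free is a simplification of what a real transport would do. In libibverbs, the corresponding real calls would be `ibv_create_srq` and `ibv_post_srq_recv`, which this sketch does not use.

```python
class SharedReceivePool:
    """Toy stand-in for a shared receive queue: one buffer pool, many clients."""

    def __init__(self, num_buffers, buffer_size):
        self.buffer_size = buffer_size
        self.buffers = [bytearray(buffer_size) for _ in range(num_buffers)]
        self.free = list(range(num_buffers))   # indices of posted (free) buffers
        self.completions = []                  # (client id, buffer index, length)

    def receive(self, client_id, data):
        """Place an incoming message in any free buffer and notify the server."""
        if len(data) > self.buffer_size:
            raise ValueError("message larger than a buffer")
        if not self.free:
            return None                        # no buffer posted (simplified)
        index = self.free.pop(0)
        self.buffers[index][:len(data)] = data
        self.completions.append((client_id, index, len(data)))
        return index

    def process_next(self):
        """Server side: consume one completion, then repost the buffer."""
        client_id, index, length = self.completions.pop(0)
        payload = bytes(self.buffers[index][:length])
        self.free.append(index)                # available to any client again
        return client_id, payload


pool = SharedReceivePool(num_buffers=2, buffer_size=8)
first = pool.receive("db1", b"redo-1")
second = pool.receive("db2", b"redo-2")
third = pool.receive("db1", b"redo-3")   # None: both buffers are in use
handled = pool.process_next()            # frees the buffer holding redo-1
fourth = pool.receive("db2", b"redo-4")  # a different client reuses that buffer
```

The point of the shared pool is visible in the last line: a buffer first used by one client is reused by another, so no per-client buffer reservation is needed.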
However, the nature of an RDMA-based protocol presents its own problems:
- How can the client safely consider the request as completed? How can we guarantee that the logging data has been copied to the persistent memory on the Storage Server?
Persistence is guaranteed by ensuring that the RDMA transfers reach the ADR (Asynchronous DRAM Refresh) domain.
- How do we deal with Storage Server crashes (both software and hardware)?
The inherent nature of persistent memory is the saving grace here. Regardless of whether the Storage Server suffers a software crash or a node crash, we can guarantee that after the software restarts or the node reboots, any previously sent logging data will still be there in persistent memory. The Storage Server can then flush any unwritten data to its permanent location. There are issues around ensuring message/data validity and the ordering of writes; those are solvable, but outside the scope of this blog post.
- How do we deal with a client dying in the middle of sending a request?
The primary issue here is message/data validity. Depending on the timing of a client’s death, logging data sent via RDMA will either arrive intact with appropriate notification to the server, or it won’t. In the case of the former, it is not interesting at all, as the request will be processed normally. It is the latter case which is interesting because data could be partially copied to persistent memory. Even though the server may not be notified, a server crash at the same time could be problematic: a subsequent server restart must be able to differentiate between a complete/valid message and an incomplete/invalid one. Again, the mechanism for making this determination is outside the scope of this blog post.
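The post does not describe how the server tells a complete message from an incomplete one. Purely as an illustration of the general problem, one widely used technique is to frame each record with its length and a checksum, so that a torn record fails validation after a restart. The sketch below assumes that approach; the header layout and the names `frame` and `is_complete` are hypothetical and are not claimed to be what Exadata does.

```python
import struct
import zlib

# Hypothetical header layout: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def frame(payload):
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def is_complete(buffer):
    """Return True only if the buffer holds a whole record with a matching checksum."""
    if len(buffer) < HEADER.size:
        return False
    length, checksum = HEADER.unpack_from(buffer)
    payload = buffer[HEADER.size:HEADER.size + length]
    return len(payload) == length and zlib.crc32(payload) == checksum


record = frame(b"commit txn 42")
torn = record[:-4] + b"\x00" * 4      # simulate a transfer that stopped early
intact_ok = is_complete(record)       # whole record passes validation
torn_ok = is_complete(torn)           # partially written record is rejected
```

A recovery pass built on this idea would flush only the records that validate and discard the rest.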
The end result of this new scheme is that we have seen up to an 8x performance improvement in database logging latency.
In summary, the Exadata Storage Server uses persistent memory in combination with RDMA (via IB Verbs) to provide an extremely low-latency database logging service for clients connected via a high-speed network.
I am a Consulting Member of Technical Staff in the Exadata Development group at Oracle. I have been at Oracle for 25 years, and have worked on various products over the years including the RDBMS (Oracle’s flagship product), as well as Exadata, our most successful engineered system. I would categorize myself as a systems programmer. In the last several years, I have worked on two major Exadata features: Flash Logging and PMEM Logging – both of them are designed to provide high performance, low latency logging services for the Oracle RDBMS.