Zest: The Maximum Reliable Terabytes Per Second Storage for Petascale Systems

Paul Nowoczynski, Nathan Stone, Jared Yanovich, Jason Sommerfield

ABSTRACT: The PSC has developed a prototype distributed file system infrastructure that vastly accelerates aggregated write bandwidth on large compute platforms. Write bandwidth, more than read bandwidth, is the dominant bottleneck in HPC I/O scenarios due to writing checkpoint data, visualization data and post-processing (multi-stage) data. We have prototyped a scalable solution on the Cray XT3 compute platform that will be directly applicable to future petascale compute platforms having of order 10^6 cores. Our design emphasizes high-efficiency scalability, low-cost commodity components, lightweight software layers, end-to-end parallelism, client-side caching and software parity, and a unique model of load-balancing outgoing I/O onto high-speed intermediate storage followed by asynchronous reconstruction to a 3rd-party parallel file system.

KEYWORDS: Parallel Application Checkpoint, Parallel I/O, Petascale Storage, Client-side RAID, High-performance Commodity Storage, Terabytes Per Second.

1. Introduction
Computational power in modern High Performance Computing (HPC) platforms is rapidly increasing. Moore's Law alone accounts for doubling processing power roughly every 18 months. But a historical analysis of the fastest computing platforms by Top500.org shows a doubling of compute power in HPC systems roughly every 14 months, so that the first petaflop computing platform is expected in late 2008. This accelerated growth trend is due largely to an increase in the number of processing cores; the current fastest computer has roughly 256K cores. An increase in the number of cores imposes two types of burdens on the storage subsystem: larger data volume and more requests. The data volume increases because the physical memory per core is generally kept balanced, resulting in a larger aggregate data volume, on the order of petabytes for petascale systems. But more cores also mean more file system clients, more I/O requests to the storage servers and ultimately more seeking of the back-end storage media while storing that data. This will result in higher observed latencies and lower overall I/O performance.

Today, HPC sites implement parallel file systems comprised of an increasing number of distributed storage nodes. However, in the current environment, disk bandwidth performance greatly lags behind that of CPU, memory, and interconnects. This means that as the number of clients continues to increase and outpace the performance improvement trends of storage devices, larger and larger storage systems will be necessary to accommodate the equivalent I/O workload. Through our analysis of prospective multi-terabyte/sec storage architectures we have concluded that increasing the bandwidth efficiency of constituent disks is essential to reeling in the rising cost of large parallel storage systems and minimizing the number of storage system components.

It is common wisdom that disks in large parallel storage systems only expose a portion of their aggregate spindle bandwidth to the application. Optimally, the only bandwidth loss in the storage system would come from redundancy overhead. Today, however, in realistic HPC scenarios the modules used to compose parallel storage systems generally attain < 50% of their aggregate spindle bandwidth. There are several possible culprits which may be responsible for this degradation; only one need be present to negatively impact performance: the aggregate spindle bandwidth is greater than the bandwidth of the connecting bus; the RAID controller's parity calculation engine output is slower than the connecting bus; or sub-optimal LBA request ordering caused by the filesystem. The first two factors are direct functions of the storage controller and may be rectified by matched input and output bandwidths from host to disk. The last factor, which is essentially 'seek' overhead, is more difficult to overcome because of the codependence of the disk layer and filesystem on the simple linear block interface. The RAID layer further complicates matters by incorporating several spindles into the same block device address range and forcing them to be managed in strict unison.

Zest attempts to increase per-spindle efficiency through a design which implements performance-wise data placement, as opposed to placement which is more friendly to today's filesystem metadata schemas. Data is stored by Zest via the fastest mode available to the server, without concern for file fragmentation or provisions for storing global metadata. As a result, the current implementation of Zest has no application-level read support. Instead it serves as a transitory cache which copies its data into a full-featured filesystem at a non-critical time. This method is well suited for application checkpoint data because immediate readback capabilities are not required.

Prior to the advent of petascale computing this data storage method would be considered prohibitive because it destroys two inferential systems which are critical to today's parallel I/O infrastructures: the object-based parallel file system metadata schema and the block-level RAID parity group association. RAID systems infer that every same-numbered block within the respective set of spindles is bound together to form a protected unit. This method is effective because only the address of a failed block is needed to determine the location of its protection unit 'cohorts', with no further state being stored. Despite this inferential advantage, we contend that strict parity clustering can be detrimental to performance because it pushes data to specific regions on specific disks.

Object-based parallel file systems use file-object maps to describe the location of a file's data. These maps are key components to the efficiency of the object-storage method because they allow for arbitrary amounts of data to be indexed by a very small data structure composed merely of an ordered list of storage servers and a stride. In essence, the map describes the location of the file's sub-files and the number of bytes which may be accessed before proceeding to the next sub-file or stripe. Besides the obvious advantages in the area of metadata storage, there are several caveats of this method. The most obvious is that the sub-files are the static products of the object metadata model, which was designed with its own efficiency in mind. The result is an overly deterministic data placement method: forcing I/O into a specific sub-file increases complexity at the spindle because the backing filesystem's block allocation schemes cannot guarantee sequentiality in the face of thousands or millions of simultaneous I/O streams.

2. Design Concepts

The Zest checkpoint I/O system employs three primary concepts to achieve its performance target of 90% of aggregate spindle bandwidth.
2a. Relatively Non-Deterministic Data Placement

Zest is designed to perform sequential I/O whenever possible. To achieve a high degree of sequentiality, Zest's block allocation scheme is not determined by data offset or the file object identifier but rather by the next available block on the disk. Additionally, the sequentiality of the allocation scheme is not affected by the number of clients, the degree of randomization within the incoming data streams, or the RAID attributes (i.e. parity position) of the block. Because it minimizes seeks, this simple, non-deterministic data placement method is extremely effective for presenting sequential data streams to the spindle. It should be noted that a block's parity position does restrict the number of disks which may handle it. This is the only determinism maintained in the write process and is necessary to uphold the semantics of the RAID protection scheme.
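As a sketch of this policy (names and structures are illustrative, not Zest's actual source), block selection reduces to advancing a per-disk free-block cursor, independent of file offset or object identifier:

    #include <stdint.h>

    struct zest_disk {
        uint64_t next_free;   /* allocation cursor; advances sequentially */
        uint64_t nblocks;     /* disk capacity in Zest blocks */
    };

    /* Return the next available block, or UINT64_MAX when the disk is
     * full. Reclaimed (synced) blocks would be fed back into the free
     * pool by the freespace manager described in Section 3b. */
    uint64_t zest_alloc_block(struct zest_disk *d)
    {
        return d->next_free < d->nblocks ? d->next_free++ : UINT64_MAX;
    }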
2b. Client-side Parity Calculation

In order to prevent potential server-side RAID bottlenecks, Zest places the parity generation and checksumming workload onto the clients. The HPC resource, which is the source of the I/O, has orders of magnitude more memory bandwidth and CPU cycles at its disposal than the storage servers. Placing the parity workload onto the client CPUs saves the storage system from requiring costly RAID controllers and guarantees that parity processing capacity scales with the number of clients.
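The parity calculation itself can be as simple as an XOR across the write buffers of one parity group; a minimal client-side sketch follows (the buffer size and function name are assumptions for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ZEST_BLKSZ (64 * 1024)    /* assumed write-buffer payload size */

    /* XOR 'ndata' data buffers into 'parity', producing the final member
     * of a RAID5-style parity group before it is shipped to a zestion. */
    void zest_client_gen_parity(uint8_t *const data[], size_t ndata,
                                uint8_t parity[ZEST_BLKSZ])
    {
        memset(parity, 0, ZEST_BLKSZ);
        for (size_t i = 0; i < ndata; i++)
            for (size_t j = 0; j < ZEST_BLKSZ; j++)
                parity[j] ^= data[i][j];
    }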
2c. No Leased Locks

To minimize network RPC overhead, features which induce blocking, and the complexity of the I/O servers, Zest purposely does not use leased locks. Instead, it ensures the integrity of intra-page, unaligned writes performed by multiple clients. Typically, filesystem caches are page-based and therefore a global lock is needed to ensure the update atomicity of a page. Zest does not use such a method; instead it uses vector-based write buffers. One possible caveat of this method is that Zest cannot guarantee transactional ordering for overlapping writes. Since it is uncommon for large parallel HPC applications to write into overlapping file offsets, we do not feel that this is a fatal drawback.
3. Server Design

The Zest I/O server, otherwise known as a zestion, appears as a storage controller / file server hybrid. Similar to a controller, the zestion manages I/O to each drive as a separate device. I/O is not done into a virtual LUN of multiple disks or volumes but rather to each disk. In the vein of a file server, the zestion is aware of file inodes and file extents. This combination of behaviors enables Zest to interact with a filesystem in a way which a simple block device cannot.

The zestion is composed of several subsystems, which are described in the following sections.

3a. Networking and RPC Stack

Zest uses a modified version of the LNET and ptlrpc libraries found in the Lustre filesystem. There are several reasons for this, the primary being the need to maintain compatibility with LNET routers for use on the Cray XT3. Presently, Zest supports both usermode LNET drivers (tcplnd and uptlld). On the zestion, the tcplnd is the functional equivalent of the kernel-mode Lustre ksocklnd. Some modifications were made to tcplnd to support multi-rail configurations, per-interface statistics, and the InfiniBand sockets-direct protocol.

After further investigation into the Lustre RPC library it was decided to adopt the implementation because of its proven robustness, performance, and logical integration with the LNET/Portals API. Ptlrpc also provides a service layer abstraction which aids in the creation of multi-threaded network servers. Zest makes use of this service layer to establish two RPC services: I/O and metadata. The Zest I/O and metadata services are groups of symmetric threads which process all client RPCs. Metadata RPCs are not concerned with bulk data movement but instead interface with the zestion's inode cache and with the namespace of the accompanying full-featured filesystem. The I/O service is responsible for pulling data buffers from the clients and passing them into the write processing queues called raid vectors.

3b. Disk I/O Subsystem

The Zest disk I/O subsystem assigns one thread for each valid disk as determined by the configuration system. Disk numbers are assigned at format time and are stored within the Zest superblock. Each disk thread is the sole authority for its disk; its duties include performing reads and writes, I/O request scheduling, rebuilding active data lost due to disk failure, freespace management and block allocation, tracking of bad blocks, and statistics keeping.

In order to ensure proper RAID semantics, the disk I/O system interacts with a set of queues called raid vectors. This construct exists to ensure that write blocks of differing parity positions are not stored onto the same disk. Raid vectors are filled with write buffers by the RPC stack, which appropriately places incoming buffers into their respective queues. Disks are assigned to raid vectors based on their number. Given a 16-disk zestion, a 3+1 RAID scheme would create four raid vector queues where disks[0-3] were assigned to queue0, disks[4-7] to queue1, and so on. This configuration allows multiple drives to process write requests from a single queue. The result of this design is a pull-based I/O system where incoming I/Os are handled by the devices which are ready to accept them. Devices which are slow naturally take less work, and devices recognized as failed remove themselves from all raid vector queues. In order to be present on multiple raid vectors, disk I/O threads have the ability to simultaneously block on multiple input sources. This capability allows each disk thread to accept write I/O requests on behalf of many RAID schemes and read requests from the syncer and parity regeneration subsystems (described below).
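Under the contiguous assignment of the 16-disk example above, the disk-to-raid-vector mapping reduces to integer division; the following sketch is illustrative rather than Zest's actual code:

    #define GROUP_SIZE 4    /* 3 data disks + 1 parity disk */

    /* disks[0-3] -> queue0, disks[4-7] -> queue1, ... as in the text */
    static inline int raid_vector_of(int disk_id)
    {
        return disk_id / GROUP_SIZE;
    }

    /* A block's parity position (Section 2a) further restricts which
     * disks within the chosen vector may store it; one plausible rule,
     * not confirmed by the paper, is disk_id % GROUP_SIZE == position. */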
The disk I/O subsystem may be configured to use one of three access modes: SCSI generic, block, or file. The SCSI generic mode provides a zero-copy I/O path to the disk; we have found this mode to be extremely efficient. The 'file' mode is useful for testing and debugging on workstations which do not have multiple disk drives available for Zest use.
3c. Syncer and File Reconstruction

At present, Zest does not support globally-stored file metadata; it therefore relies on copying its data into a full-featured filesystem to present the data for readback. In practice this accompanying filesystem exists on the same physical storage, and the copy process occurs after the application's write burst has completed.

Upon storing an entire parity group stream from a client, the completed parity group is passed into the syncer's work queue. From there the syncer issues a read request to each disk holding a member of the parity group. The disk I/O thread services this read request once all of the write queues are empty. Once the read I/O is completed, the read request handle is passed back to the syncer. From there the data is written to the full-featured filesystem via a pwrite syscall (the I/O vector parameters necessary for the pwrite were provided by the client and stored adjacent to the file data). When the entire parity group has been copied out, the syncer instructs the disk threads to schedule reclamation for each of the synced blocks. This process occurs only after all members of the parity group have been copied.
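A minimal sketch of the copy-out step (struct layout and names are illustrative; error handling and retries omitted): the client-supplied I/O vectors are simply replayed against the full-featured filesystem with pwrite:

    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct zest_iov {
        off_t  file_off;   /* destination offset within the file */
        size_t buf_off;    /* fragment's offset within the Zest block */
        size_t len;        /* fragment length */
    };

    /* Replay the stored vectors for one synced block into the
     * full-featured filesystem; returns 0 on success. */
    int syncer_copyout(int fd, const char *blk,
                       const struct zest_iov *iov, int niov)
    {
        for (int i = 0; i < niov; i++)
            if (pwrite(fd, blk + iov[i].buf_off, iov[i].len,
                       iov[i].file_off) != (ssize_t)iov[i].len)
                return -1;
        return 0;
    }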
The syncer, and other Zest 'read' clients, are required to perform a checksum on the data returned from the disk. This checksum protects the data and its associated metadata (primarily the describing I/O vectors). In the event of a checksum failure, the block is scheduled to be rebuilt through the parity regeneration service.

3d. Parity Declustering and Regeneration

Zest's parity system is responsible for two primary tasks: storing declustered parity state and the reconstruction of blocks lost to failed disks or media errors.
Prior to being passed to the syncer process, completed parity groups are handed to the parity declustering service, where they are stored to a solid-state device (the parity device). Parity device addressing is based on the disk and block numbers of the newly written blocks. Indexing the parity device by disk and block number allows for inquiry on behalf of corrupt blocks where the only known information is the disk and block numbers. This is necessary for handling the case of a corrupt Zest block.
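Because the descriptor address is a pure function of the (disk, block) pair, a lookup needs no auxiliary index. An illustrative sketch follows; the slot size is an assumption, chosen to be consistent with the 8-million-block / 4-gigabyte figures given in Section 5a (8M x 512 B = 4 GB):

    #include <stdint.h>

    #define PG_DESC_SLOT 512ULL   /* assumed per-block descriptor slot,
                                     large enough for the "few hundred
                                     byte" parity group structure */

    static inline uint64_t parity_dev_addr(uint32_t disk, uint64_t block,
                                           uint64_t blocks_per_disk)
    {
        return ((uint64_t)disk * blocks_per_disk + block) * PG_DESC_SLOT;
    }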
The parity group structure is a few hundred bytes in size and lists all members of the protection unit. For each member in the parity group, the structure is copied to that member's respective parity device address.

During normal operation, the parity device is updated in conjunction with incoming writes in an asynchronous manner by the parity device thread. This operation is purposely asynchronous to minimize blocking in the disk I/O thread's main routine. As a result, the parity device is not the absolute authority on parity group state. Instead, the on-disk structures have precedence in determining the state of the declustered parity groups. Currently, at boot time, active parity groups are joined by a group finding operation and the parity device is verified against this collection. In the event of a failed disk, the parity device is relied upon as the authority for the failed disk's blocks. In the future this fsck-like operation will be supplemented with a journal.
4. Client Design and Implementation

The Zest client currently exists as both a FUSE (file system in user-space) mount and a statically linkable library. The latter, used primarily for the Cray XT3, is similar to the liblustre library in that it is single-threaded. It should be recognized that the Zest system is not optimized for single client performance but rather for large multitudes of parallel clients. Therefore it stands to reason that, despite zestions being equally accessible by all compute processors, the Zest client does not stripe its output across zestions. Instead, the group of client processors is evenly distributed across the set of zestions. We expect to make use of this behavior to implement checkpoint bandwidth provisioning for mixed workloads.

4a. Parity and Checksum Calculation

As described above, the Zest client is tasked with calculating parity on its outgoing data stream and performing a 64-bit checksum on each write buffer and its associated metadata. Performing these calculations on the client distributes this workload across a larger number of cores and optimizes the compute resource performance as a whole by allowing the zestions to focus on data movement.

The Zest parity system is configurable by the client based on the degree of protection sought by the application and the hardware located at the server. At present, Zest gracefully handles only single device failures through RAID5. Both the Zest server and client are equipped to handle short write streams where the number of buffers in the stream is smaller than the requested RAID scheme.
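The paper does not name the checksum algorithm, so purely for illustration, any 64-bit hash over the buffer and its describing vectors would serve; FNV-1a is shown here as one well-known candidate:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative 64-bit checksum (FNV-1a), applied to a write buffer
     * and, chained via the running value 'h', to its metadata. Pass
     * h = 0 to start a fresh checksum. */
    uint64_t zest_csum64(const void *buf, size_t len, uint64_t h)
    {
        const uint8_t *p = buf;
        if (h == 0)
            h = 0xcbf29ce484222325ULL;        /* FNV offset basis */
        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;            /* FNV prime */
        }
        return h;
    }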
4b. Write Aggregation

The client aggregates small I/Os into its vector-based cache buffers on a per-file-descriptor basis. Designed to work on an MPP machine such as the Cray XT3, Zest assigns a small number of data buffers to each file descriptor. These buffers can hold any offset within the respective file, though a maximum number of fragments (vectors) per buffer is enforced. This maximum is determined at zestion format time and is directly contingent on the number of I/O vectors which can be stored in the metadata region of a Zest block. Typically we have configured this maximum to 16, meaning that a client may fill a write buffer until either its capacity is consumed or the maximum number of fragments has been reached.
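A compact sketch of such a buffer (sizes and field names are assumptions, not Zest's actual format):

    #include <stdint.h>
    #include <string.h>

    #define ZEST_MAX_FRAGS 16              /* as configured in the text */
    #define ZEST_BUF_SZ    (64 * 1024)     /* assumed payload capacity  */

    struct zest_frag { uint64_t file_off; uint32_t buf_off, len; };

    struct zest_wbuf {
        struct zest_frag frags[ZEST_MAX_FRAGS];
        int              nfrags;
        uint32_t         used;
        char             data[ZEST_BUF_SZ];
    };

    /* Append one write fragment; returns -1 when the buffer must first
     * be flushed to the zestion (capacity or fragment limit reached). */
    int zest_wbuf_append(struct zest_wbuf *b, uint64_t off,
                         const void *p, uint32_t len)
    {
        if (b->nfrags == ZEST_MAX_FRAGS || b->used + len > ZEST_BUF_SZ)
            return -1;
        b->frags[b->nfrags++] = (struct zest_frag){ off, b->used, len };
        memcpy(b->data + b->used, p, len);
        b->used += len;
        return 0;
    }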
4c. Client to Server Data Transfer

The elemental transfer mode from client to server is pull-based, implemented via LNetGet(). As write buffers are consumed, they are placed into an RPC set and the zestion is instructed to schedule the retrieval of the buffer. The RPC set is a functional construct of the Lustre ptlrpc library which allows groups of semantically related RPC requests to be managed as a single operation. This fits nicely with Zest's concept of parity groups; hence, the Zest client assigns an RPC set to each active parity group.

Ensuring the viability of the client's parity groups requires the client to hold its buffers until the entire group (or RPC set) has been acknowledged by the zestion. Zest supports both write-back and write-through server caching; the protocol for acknowledgement hinges on the caching policy requested by the client. Depending on the size of the write() request and the availability of buffers, zero-copy or buffer-copy mode may be used.

5. Server Fault Handling

5a. Media Error Handling

All Zest data and metadata are protected by a 64-bit checksum used to detect media errors. On 'read' the zestion verifies the checksum to ensure that the block has not been compromised. When a bad block is found, its parity group information is located via a lookup into a separate device. The parity device is a solid state memory device whose purpose is to maintain the parity group structure for every block on the system. Any Zest block's parity group descriptor is located via a unique address composed of the block's disk and block identifiers. Since the I/O pattern to the parity device is essentially random, the zestion has been outfitted with a small solid-state disk. Currently an 8 million block Zest system requires a 4 gigabyte parity device. The parity device update path is asynchronous; therefore, if needed, the entire device may be reconstructed during the file system check (Section 5c).
5b. Run-time Disk Failures

Since the Zest system manages RAID and file-objects, handling of disk failures only requires that volatile blocks are rebuilt. Zest has full knowledge of the disk's contents, so old or unused blocks are not considered for rebuilding. During the course of normal operation, Zest maintains lists and indexes of all blocks being used within the system. In the event of a disk failure, the set of blocks which have not been synchronized, or whose parity group cohorts have not been synchronized, will be rebuilt. Here the parity device will be used to determine the failed block's parity group cohorts. It must be noted that, at this time, Zest cannot recover from the simultaneous failure of a disk and one of its parity group cohorts.
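The rebuild filter follows directly from this description; a minimal sketch (flag names are illustrative):

    #include <stdbool.h>

    struct zest_blk_state {
        bool in_use;          /* allocated and holding live data       */
        bool synced;          /* copied out to the full-featured fs    */
        bool cohorts_synced;  /* all parity-group members copied out   */
    };

    /* Only volatile blocks are rebuilt after a disk failure. */
    static inline bool needs_rebuild(const struct zest_blk_state *b)
    {
        return b->in_use && (!b->synced || !b->cohorts_synced);
    }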
5c. File System Check and Boot-time Failures

On boot, the file system check analyzes the system's parity groups and schedules the synchronization of volatile blocks. If a disk fails in the start-up phase, any volatile blocks which it may have been holding are located during the fsck process and rebuilt.

5d. Multi-pathing and Failover

Zest servers support pairwise dynamic fail-over through disk multi-pathing, disk UUID identification, and Linux-HA software. Zestion fail-over pairs are configured to recognize each other's disks through the global configuration file. Since Zest clients are able to resubmit writes to other zestions, the fail-over procedure is not as fragile or imminent as one might expect. The primary post fail-over activity is to examine the partner's disks in search of non-synchronized data and process that data accordingly.

One pivotal advantage of relatively non-deterministic data placement is that it allows Zest clients to send parity groups to any zestion within the storage network. The result is that, in the event of a server failure, a client may resend an entire parity group to any other zestion. We predict that this feature will be extremely valuable in large parallel storage networks because it allows for perfect rebalancing of I/O workloads in the event of server node failures and therefore eliminates the creation of hot spots.

6. Zest within the Cray XT3 Environment

Adapting to the Cray XT3 I/O environment, the Zest server cluster is located on an InfiniBand cloud external to the machine. The Cray SIO nodes act as a gateway between the XT3's internal network and the Zest servers. Lustre LNET routing services are run on each of the SIO nodes to route traffic from the Seastar compute interconnect to the external InfiniBand network, providing seamless connectivity from compute processors to the external storage.

To achieve compatibility with the Lustre routing service, Zest's networking library is largely based on Lustre's LNET and RPC subsystems. The Zest server uses a modified TCP-based Lustre networking driver with additional support for the InfiniBand sockets-direct protocol. Since Zest is a user mode service, it does not have access to the native, kernel-mode InfiniBand Lustre driver, so SDP was chosen for its convenient path onto the InfiniBand fabric from a sockets-based driver.

7. Zest Performance Results

Here are preliminary performance results from a single 12-disk zestion. Both the PSC Cray XT3 and a small Linux cluster were used as client systems. The zestion's drives are SATA-2 and operate at a sustained rate of 75MB/s. The maximum observed performance of the entire set of disks, tested with SCSI generic I/O, was 900MB/s. The zestion self-test, which utilizes the I/O codepath without using the RPC layer, measured a sustained back-end bandwidth rate of 840MB/s.

To date, the best numbers are attained when using the Linux cluster as a client. This is due to the use of the InfiniBand sockets-direct protocol as the transport. We have not been successful running the Lustre ksocklnd with sockets-direct, so tests performed from the Cray XT3 have relied on IP over InfiniBand (IPoIB). The IPoIB path is significantly inferior to the sockets-direct protocol. It should be noted that these tests did not utilize native InfiniBand RDMA.

Figure P0: Linux cluster, sockets-direct protocol, client RAID 5+1. (Y-axis is MB/s.)

Figure P0 shows that in the best case, 120 client threads, the end-to-end throughput of the zestion was 89.6% of the aggregate spindle bandwidth. However the application, using a RAID5 5+1 scheme, which incurs a 17% overhead, saw 75% of the aggregate. It can be safely extrapolated that had the application used an 11+1 parity stripe, ~82% of the aggregate would have been realized. This result shows that more work must be done to eliminate the 8% loss to server overhead. It also demonstrates that the application-observed bandwidth is largely dependent on the parity overhead of the chosen stripe width.
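The extrapolation is simple accounting. With k data blocks per parity block, the application-visible efficiency is the server's end-to-end efficiency scaled by the parity factor k/(k+1), using only the numbers quoted above:

    \[ \eta_{\mathrm{app}} = 0.896 \cdot \tfrac{k}{k+1} \]
    \[ k = 5:\quad 0.896 \cdot \tfrac{5}{6} \approx 0.747 \quad (\text{the observed } 75\%) \]
    \[ k = 11:\quad 0.896 \cdot \tfrac{11}{12} \approx 0.821 \quad (\approx 82\%) \]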
Figure P1: Cray XT3, IPoIB protocol, client RAID 5+1. (Y-axis is MB/s.)

Figure P1 shows Zest performance when using the Cray XT3 as a client. This XT3 client test shows that IPoIB is not an effective interconnect. We are confident that IPoIB is the culprit because equally poor results were observed while using this mode on the Linux cluster (280MB/s from 100 PEs). As described above in Section 6, the test used several Cray SIO nodes as LNET routers. The routers were configured to use IPoIB since sockets-direct was not available. In an attempt to maximize throughput, the zestion employed multiple IPoIB interfaces. These interfaces were joined into the same LNET network through the multi-rail feature which we added to the tcplnd driver.

10. Conclusion and Future Development

Zest is designed to facilitate fast parallel I/O for the largest production systems currently conceived (petascale), and it includes features like configurable client-side parity calculation, multi-level checksums, and implicit load-balancing within the server. It requires no hardware RAID controllers, and is capable of using lower-cost commodity components (disk shelves filled with SATA drives). We minimize the impact of many client connections, many I/O requests and many file system seeks upon the back-end disk performance, and in so doing leverage the performance strengths of each layer of the storage stack.

In the future we aim to improve network performance by leveraging the upcoming Lustre LNET user-to-kernel mode bridging module. This module provides LNET API compatibility to userspace applications; therefore it should be easily adopted by Zest. It is our hope that using the native LNET RDMA drivers will substantially increase throughput.

Much contemplation has been given to the development of a global-metadata system for Zest. At this time (Spring '08), the high-level design has been considered and at least one key data structure has been implemented for this purpose. It is most likely, however, that immediate efforts will be put forth into stabilization and performance improvements of the present transitory cache design.

About the Authors

Paul Nowoczynski, Jared Yanovich, Nathan Stone, and Jason Sommerfield are members of the Pittsburgh Supercomputing Center's Advanced Systems Group.