Zest: The Maximum Reliable Terabytes Per Second Storage for
Petascale Systems
Paul Nowoczynski, Nathan Stone, Jared Yanovich, Jason Sommerfield
ABSTRACT: The PSC has developed a prototype distributed file system infrastructure
that vastly accelerates aggregated write bandwidth on large compute platforms. Write
bandwidth, more than read bandwidth, is the dominant bottleneck in HPC I/O scenarios
due to writing checkpoint data, visualization data and post-processing (multi-stage)
data. We have prototyped a scalable solution on the Cray XT3 compute platform that
will be directly applicable to future petascale compute platforms having of order 106
cores. Our design emphasizes high-efficiency scalability, low-cost commodity
components, lightweight software layers, end-to-end parallelism, client-side caching
and software parity, and a unique model of load-balancing outgoing I/O onto high-
speed intermediate storage followed by asynchronous reconstruction to a 3rd-party
parallel file system.

KEYWORDS: Parallel Application Checkpoint, Parallel I/O, Petascale Storage, Client-
side Raid, High-performance commodity storage, Terabytes per second.
1. Introduction
Today, HPC sites implement parallel file systems Computational power in modern High Performance comprised of an increasing number of distributed storage Computing (HPC) platforms is rapidly increasing. nodes. However, in the current environment, disk Moore s Law alone accounts for doubling processing bandwidth performance greatly lags behind that of CPU, power roughly every 18 months. But a historical analysis memory, and interconnects. This means that as the of the fastest computing platforms by Top500.org shows number of clients continues to increase and outpace the a doubling of compute power in HPC systems roughly performance improvement trends of storage devices, every 14 months, so that the first petaflop computing larger and larger storage systems will be necessary to platform is expected in late 2008. This accelerated accommodate the equivalent I/O workload. Through our growth trend is due largely to an increase in the number analysis of prospective multi-terabyte/sec storage of processing cores; the current fastest computer has architectures we have concluded that increasing the roughly 256K cores. An increase in the number of cores bandwidth efficiency of constituent disks is essential to imposes two types of burdens on the storage subsystem: reeling in the rising cost of large parallel storage systems larger data volume and more requests. The data volume and minimizing the number of storage system increases because the physical memory per core is generally kept balanced resulting in a larger aggregate data volume, on the order of petabytes for petascale It is common wisdom that disks in large parallel storage systems. But more cores also mean more file system systems only expose a portion of their aggregate spindle clients, more I/O requests to the storage servers and bandwidth to the application. Optimally, the only ultimately more seeking of the back-end storage media bandwidth loss in the storage system would come from while storing that data. This will result in higher redundancy overhead. Today, however, in realistic HPC observed latencies and lower overall I/O performance.
scenarios the modules used to compose parallel storage write process and is necessary to uphold the semantics of systems generally attain < 50% of their aggregate spindle bandwidth. There are several possible culprits which may be responsible for this degradation, of them, only Prior to the advent of petascale computing this data one need be present to negatively impact performance: storage method would be considered prohibitive because the aggregate spindle bandwidth is greater than the it destroys two inferential systems which are critical to bandwidth of the connecting bus; the raid controller's today's parallel I/O infrastructures: the object-based parity calculation engine output is slower than the parallel file system metadata schema and the block-level connecting bus; and sub-optimal LBA request ordering RAID parity group association. RAID systems infer that caused by the filesystem. The first two factors are direct every same numbered block within the respective set of functions of the storage controller and may be rectified spindles are bound together to form a protected unit. by matched input and output bandwidths from host to This method is effective because only the address of a disk. The last factor, which is essentally 'seek' overhead, failed block is needed to determine the location of its is more difficult to overcome because of the protection unit 'cohorts' with no further state being stored.
codependence of the disk layer and filesystem on the Despite this inferential advantage, we contend that strict simple linear block interface. The raid layer further parity clustering can be detrimental to performance complicates matters by incorporating several spindles because it pushes data to specific regions on specific into the same block device address range and forcing them to be managed in strict unison.
Object-based parallel file systems use file-object maps to Zest attempts to increase per-spindle efficiency through a describe the location of a file's data. These maps are key design which implements performance-wise data components to the efficiency of the object-storage placement as opposed to those which are more friendly to method because they allow for arbitrary amounts of data today's filesystem metadata schemas. Data stored by Zest to be indexed by a very small data structure composed is done via the fastest mode available to the server merely of an ordered list of storage servers and a stride. without concern to file fragmentation or provisions for In essence, the map describes the location of the file's storing global metadata. As a result, the current sub-files and the number of bytes which may be accessed implementation of Zest has no application-level read before proceeding to the subfile or stripe. Besides the support. Instead it serves as a transitory cache which obvious advantages in the area of metadata storage, there copies its data into a full-featured filesystem at a non- are several caveats of this method. The most obvious is critical time. This method is well suited for application that the sub-files are the static products of the object checkpoint data because immediate readback capabiliies metadata model which was designed with its own efficiency in mind. The result is an overly deterministic data placement method in which by forcing I/O into a 2. Design Concepts
specific sub-file, increases complexity at the spindle because the backing filesystem's block allocation The Zest checkpoint I/O system employs three primary schemes cannot guarantee sequentiality in the face of concepts to achieve its performance target of 90% thousands or millions of simultaneous IO streams. 2b. Client-side Parity Calculation
2a. Relatively Non-Deterministic Data Placement
In order to prevent potential server-side raid bottlenecks, Zest is designed to perform sequential I/O whenever Zest places the parity generation and checksumming possible. To achieve a high degree of sequentiality, workload onto the clients. The HPC resource, which is Zest's block allocation scheme is not determined by data the source of the I/O, has orders of magnitude more offset or the file object identifier but rather the next memory bandwidth and CPU cycles at its disposal than available block on the disk. Additionally, the that of the storage servers. Placing the parity workload sequentiality of the allocation scheme is not affected by onto the client CPUs saves the storage system from the number of clients, the degree of randomization within requiring costly raid controllers and guarantees that parity the incoming data streams, or the RAID attributes (i.e. parity position) of the block. Because it minimizes seeks, this simple, non-deterministic data placement method is 2c. No Leased Locks
extremely effective for presenting sequential data streams To minimize network RPC overhead; features which to the spindle. It should be noted that a block's parity induce blocking; and the complexity of the IO servers; position does restrict the number of disks which may Zest purposely does not use leased locks. Instead, it handle it. This is the only determinism maintained in the ensures the integrity of intra-page, unaligned writes performed by multiple clients. Typically, filesystem caches are page-based and therefore a global lock is authority for his disk, it duties include: performing reads needed to ensure the update atomicity of a page. Zest and writes, io request scheduling, rebuilding active data does not use such a method, instead it uses vector-based lost due to disk failure, freespace management and block write buffers. One possible caveat of this method is that allocation, tracking of bad blocks, and statistics keeping.
Zest cannot guarantee transactional ordering for overlapping writes. Since it is uncommon for large In order to ensure proper RAID semantics, the disk I/O parallel HPC applications to write into overlapping file system interacts with a set of queues called raid vectors. offsets we do not feel that this is a fatal drawback.
This construct exists to ensure that write blocks of differing parity positions are not stored onto the same 3. Server Design
disk. Raid vectors are filled with write buffers by the Rpc stack, who appropriately places incoming buffers The Zest I/O server, otherwise known as a zestion, into their respective queues. Disks are assigned to raid appears as a storage controller / file server hybrid. vectors based on their number. Given a 16-disk zestion, a Similar to a controller, the zestion manages I/O to each 3+1 RAID scheme would create four raid vector queues drive as a seperate device. I/O is not done into a virtual where disks[0-3] were assigned to queue0, disks[4-7] to lun of multiple disks or volumes but rather to each disk. queue1, and so on. This configuration allows for In the vein of a file server, the zestion is aware of file multiple drives to process write requests from a single inodes and file extents. This combination of behaviors queue. The result of this design is a pull-based I/O enables Zest to interact with a filesystem in a way which system where incoming I/O's are handled by the devices which are ready to accept them. Devices which are slow naturally take less work and devices recognized as failed, The zestion is composed of several subsystems which are remove themselves from all raid vector queues. In order to be present on multiple raid vectors, disk I/O threads have the ability to simultaneously block on multiple input 3a. Networking and RPC Stack
sources. This capability allows for each disk thread to Zest uses a modified version of the LNET and ptlrpc accept write I/O requests on behalf of many raid schemes libraries found in the Lustre filesystem. There are several and read requests from the syncer and parity reasons for this, the primary being the need to maintain regeneration subsystems (described below).
capability with LNET routers for use on the Cray XT3. Presently, Zest supports both usermode LNET drivers The disk I/O subsystem may be configured to use one of (tcplnd and uptlld). On the zestion, the tcplnd is the three access modes: scsi generic, block, or file. The scsi functional equivalent to kernel mode Lustre ksocklnd. generic mode provides a zero-copy I/O path to the disk, Some modifications were made to tcplnd for supporting we have found this mode to be extremely efficient. The multi-rail configurations, per-interface statistics, and the 'file' mode is useful for testing and debugging on workstations which do not have multiple disk drives available for Zest use.
After further investigation into the Lustre rpc library it was decided to adopt the implementation because of its 3c. Syncer and File Reconstruction
proven robustness, performance, and logical integration At present, Zest does not support globally-stored file with the LNET/Portals API. Ptlrpc also provides a metadata therefore it relies on copying its data into a full- service layer abstraction which aids in the creation of featured filesystem to present the data for readback. In multi-threaded network servers. Zest makes use of this practice this accompanying filesystem exists on the same service layer to establish two RPC services: IO and physical storage and the copy process occurs after the metadata. The Zest IO and metadata services are groups of symmetric threads which process all client RPCs. Metadata RPCs are not concerned with bulk data Upon storing an entire parity group stream from a client, movement but instead interface with the zestion's inode the completed parity group is passed into the syncer's cache and with the namespace of the accompanying full- work queue. From there the syncer issues a read request featured filesystems. The IO service is responsible for to each disk holding a member of the parity group. The pulling data buffers from the clients and passing them disk I/O thread services this read request once all of the into the write processing queues called raid vectors.
write queues are empty. Once the read I/O is completed, the read request handle is passed back the syncer. From 3b. Disk I/O Subsystem
there it is written to the full-feature filesystem via a The Zest disk I/O subsytem assigns one thread for each pwrite syscall (the io vector parameters necessary for the valid disk as determined by the configuration system. system pwrite were provided by the client and stored Disk numbers are assigned at format time and are stored adjacently to the file data). When the entire parity group within the Zest superblock. Each disk thread is the sole has been copied out, the syncer instructs the disk threads output across zestions. Instead, the group of client to schedule reclamation for each of the synced blocks. processors are evenly distributed across the set of This process occurs only after all members of the parity zestions. We expect to make use of this behavior to implement checkpoint bandwidth provisioning for mixed workloads.
The syncer, and other Zest 'read' clients, are required to perform a checksum on the data returned from the disk. 4a. Parity and Checksum Calculation
This checksum protects the data and it associated As described above, the Zest client is tasked with metadata (primarily the describing io vectors). In the calculating parity on its outgoing data stream and event of a checksum failure, the block is scheduled to be performing a 64-bit checksum on each write buffer and rebuilt through the parity regeneration service.
associated metadata. Performing these calculation on the client distributes this workload across a larger number of 3d. Parity Declustering and Regeneration
cores and optimizes the compute resource performance as Zest's parity system is responsible for two primary tasks: a whole by allowing the zestions to focus on data storing declustered parity state and the reconstruction of The Zest parity system is configurable by the client based Prior to being passed to the syncer process, completed on the degree of protection sought by the application and parity groups are handed to the parity declustering service the hardware located at the server. At present, Zest where they are stored to a solid-state device (parity gracefully handles only single device failures through device). Parity device addressing is based on the disk and RAID5. Both Zest server and client are equipped to block numbers of the newly written blocks. Indexing the handle short write streams where the number of buffers in parity device by disk and block number allows for inquiry the stream is smaller than the requested raid scheme.
on behalf of corrupt blocks where the only known information are the disk and block numbers. This is 4b. Write Aggregation
necessary for handling the case of a corrupt Zest block. The client aggregates small I/Os into its vector-based The parity group structure is a few hundred bytes in size cache buffers on a per file-descriptor basis. Designed to and lists all members of the protection unit. For each work on an MPP machine such as the Cray XT3, Zest member in the parity group, the structure is copied to that assigns a small number of data buffers to each file member's respective parity device address.
descriptor. These buffers can hold any offset within the respective file though a maximum number of fragments During normal operation, the parity device is being (vectors) per buffer is enforced. This maximum is updated in conjunction with incoming writes in an determined at zestion format time and is directly asynchronous manner by the parity device thread. This contigent on the number of io vectors which can be stored operation is purposely asynchronous to minimize in the metadata region of a Zest block. Typically we blocking in the disk I/O thread's main routine. As a have configured this maximum to 16 meaning that a result, the parity device is not the absolute authority on client may fill a write buffer until either its capacity is parity group state. Instead, the on-disk structures have consumed or the maximum number of fragments has precedence in determining the state of the declustered parity groups. Currently at boot time, active parity groups are joined by a group finding operation and the parity device is verified against this collection. In the 4c. Client to Server Data Transfer
event of a failed disk, the parity device is relied upon as The elemental transfer mode from client to server is pull- the authority for the failed disk's blocks. In the future this based, implemented via LNetGet(). As write buffers are fsck-like operation will be supplemented with a journal.
consumed, they are placed into a rpc set and the zestion is instructed to schedule the retrieval of the buffer. The rpc 4. Client Design and Implementation
set is a functional construct of the lustre ptlrpc library which allows groups of semantically related rpc requests The Zest client currently exists as both a FUSE (file to be managed as a single operation. This fits nicely with system in user-space) mount and a statically linkable Zest's concept of parity groups, hence, the zest client library. The latter, used primarily for the Cray XT3, is assigns an rpc set to each active parity group. similar to the liblustre library in that it is single-threaded. It should be recognized that the Zest system is not Ensuring the viability of the client's parity groups optimized for single client performance but rather large requires the client to hold it buffers until the entire group multitudes of parallel clients. Therefore it stands to (or rpc set) has been acknowledged by the zestion. Zest reason that despite zestions being equally accessible by supports both write-back and write-through server all compute processors, the zest client does not stripe its caching, the protocol for acknowledgement hinges around the caching policy requested by the client. volatile blocks which it may have been holding are Depending on the size of the write() request and the located during the fsck process and rebuilt.
availability of buffers, zero-copy or buffer-copy mode may be used. 5c. Multi-pathing and Failover
Zest servers support pairwise dynamic fail-over through
One pivotal advantage of relative non-deterministic data disk multi-pathing, disk UUID identification, and Linux- placement is that it allows Zest clients to send parity HA software. Zestion fail-over pairs are configured to groups to any zestion within the storage network. The recognize each others disks through the global result is that, in the event of a server failure, a client may configuration file. Since zest clients are able to resubmit resend an entire parity group to any other zestion. We writes to other zestions, the fail-over procedure is not as predict that this feature will be extremely valuable in fragile or imminent as one might expect. The primary large parallel storage networks because it allows for post fail-over activity is to examine the partner's disks in perfect rebalancing of I/O workloads in the event of search of non-synchronized data and process that data server node failures and therefore eliminates the creation 6. Zest within the Cray XT3 Environment
5. Server Fault Handling
Adapting the Cray XT3 I/O environment, the Zest server 5a. Media Error Handling
cluster is located on an InfiniBand cloud external to the All Zest data and metadata are protected by a 64-bit machine. The Cray SIO nodes act as a gateway between checksum used to detect media errors. On 'read' the the XT3's internal network and the Zest servers. Lustre zestion verifies the checksum to ensure that the block has LNET routing services are run on each of the SIO nodes not been compromised. When a bad block is found its to route traffic from Seastar compute interconnect to the parity group information is located via a lookup into a external InfiniBand network, providing seamless separate device. The parity device is a solid state connectivity from compute processors to the external memory device whose purpose is to maintain the parity group structure for every block on the system. Any Zest block's parity group descriptor is located via a unique To achieve compatibility with the Lustre routing service, address composed of the block's disk and block Zest's networking library is largely based on Lustre's identifiers. Since the IO pattern to the parity device is LNET and RPC subsystems. The Zest server uses a essentially random it has been outfitted with a small modified TCP-based Lustre networking driver with solid-state disk. Currently an 8 million block Zest system additional support for InfiniBand sockets-direct protocol. requires a 4 gigabyte parity device. The parity device Since Zest is a user mode service, it does not have access update path is asynchronous and therefore, if needed, the to the native, kernel-mode InfiniBand Lustre driver so entire device may be reconstructed during file system SDP was chosen for its convenient path into the 5b. Run-time Disk Failures
7. Zest Performance Results
Since the Zest system manages RAID and file-objects, handling of disk failures only requires that volatile blocks Here are preliminary performance results from a single are rebuilt. Zest has full knowledge of the disk's contents 12-disk zestion. Both the PSC Cray XT3 and a small so old or unused blocks are not considered for rebuilding. linux cluster we used as client systems. The zestion's During the course of normal operation, Zest maintains drives are SATA-2 and operate at a sustained rate of lists and indexes of all blocks being used within the 75MB/s. The maximum observed performance of the system. In the event of a disk failure, the set of blocks entire set of disks, tested with scsi generic io, was who have not been synchronized or whose parity group 900MB/s. The zestion self-test, which utilizes the I/O cohorts have not been synchronized will be rebuilt. Here codepath, without using the rpc layer, measured a the parity device will be used to determine the failed sustained back-end bandwith rate of 840MB/s.
block's parity group cohorts. It must be noted that, at this time, Zest cannot recover from simultaneous failure of a To date, the best numbers are attained when using the linux cluster as a client. This is due to the use of the Infiniband sockets-direct protocol as the transport. We 5c. File System Check and Boot-time Failures
have not been successful running the Lustre ksocklnd On boot, the file system check analyzes the system's with sockets-direct so tests performed from the Cray XT3 parity groups and schedules the synchronization of have relied on IP over Infiniband (IPoIB). The IPoIB volatile blocks. If a disk fails in the start-up phase any path is significantly inferior to the sockets-direct protocol. It should be noted that these tests did not utilize Figure P1shows Zest performance when using the Cray XT3 as a client. This XT3 client test shows that IpoIB is not an effective interconnect. We are assured that IPoIB is the culprit because equally poor results were observed while using this mode on the linux cluster (280MB/s from 100pe's). As described above in Section 6, the test used several Cray SIO as Lnet routers. The routers were configured to use IPoIB since sockets-direct was not available. In an attempt to maximize throughput, the zestion employed multiple IpoIB interfaces. These interfaces were joined into the same Lnet network through the multi-rail feature which we added into the 10. Conclusion and Future Development
Zest is designed to facilitate fast parallel I/O for the largest production systems currently conceived (petascale) and it includes features like configurable client-side parity calculation, multi-level checksums, and implicit load-balancing within the server. It requires no hardware RAID controllers, and is capable of using lower Figure P0: Linux Cluster: Sockets-direct cost commodity components (disk shelves filled with protocol, Client Raid 5+1. (Y-Axis is MB/s) SATA drives). We minimize the impact of many client connections, many I/O requests and many file system Figure P0 shows that in the best case, 120 client threads, seeks upon the backend disk performance and in so doing the end-to-end throughput of the zestion was 89.6% of the leverage the performance strengths of each layer of the aggregate spindle bandwidth. However the application, using a RAID5 5+1, which incurs a 17% overhead, saw 75% of the aggregate. lt can be safely extrapolated that In the future we aim to improve network performance by had the application used an 11+1 parity stripe, ~82% of leveraging the upcoming Lustre LNET user to kernel the aggregate would have been realized. This result mode bridging module. This module provides LNET api shows that more work must be done to eliminate the 8% compatibility to userspace applications therefore it should loss to server overhead. It also demonstrates that the be easily adapted by Zest. It is our hope that using the application observed bandwidth is largely dependent on native Lnet RDMA drivers will substantially increase Much contemplation has been given to the development of a global-metadata system for Zest. At this time (Spring '08), the high-level design has been considered and at least one key data structure has been implemented for this purpose. It is most likely, however, that immediate efforts will be put forth into stabilization and performance improvements of the present transitory About the Authors
Paul Nowoczynski, Jared Yanovich, Nathan Stone, and Jason Sommerfield are members of the Pittsburgh Supercomputing Center's Advanced Systems Group.
Firgure P1: Cray XT3, IPoIB protocol, Client Raid 5+1. (Y-axis is MB/s)

Source: https://cug.org/5-publications/proceedings_attendee_lists/2008CD/S08_Proceedings/pages/Authors/16-19Thursday/Nowoczynski-Thursday18A/Nowoczynski-Thursday18A-paper.pdf


Bicycle Accident 12/21/11 and Consequences S. J. PealeIn riding my bicycle to work, I have a choice between two routes: 1. Over Fairview overpassfor a distance of about 4 miles. 2: On a bikeway under both the freeway and Hollister Avefor a distance of about 6.7 miles. The latter is safer with less traffic. On the morning ofDec. 21, 2011, I decided to take the shorter route because it was cold. Th

Microsoft word - inflexxionthe buzz on caffeinea.doc

The Buzz on Caffeine: How Caffeine Affects Your Health If you entered your college years without acquiring a taste for caffeine, late nights studying may kick off a caffeine habit. But what exactly is the deal with caffeine? Is it a harmless habit or something to worry about? What is caffeine? Caffeine is a natural stimulant found in coffee, tea, chocolate, many soft drinks and some m

Copyright © 2010-2014 Online pdf catalog