Architecture Improvements for Data-Intensive High-End Computing

At the hardware level, emerging nonvolatile storage-class memory devices, such as flash-memory based solid-state drives and phase-change memory, offer considerably better performance than hard disk drives, especially for random accesses [7, 14]. However, these devices do not reduce data movement across the network; they help narrow the performance gap between CPU and I/O but cannot solve the I/O bottleneck problem alone.

Active storage [24, 30, 39], active disks [25, 11], and smart disks [12] have gained increasing attention recently. Active storage leverages the computing capability of storage nodes, performing certain computations there to reduce the bandwidth requirement between storage and compute nodes. Active disks and smart disks integrate a processing unit within the disk storage device and offload computations to that embedded processor. However, these architecture improvements are designed to exploit either the idle computing power of storage nodes or an embedded processor, and thus have limited computation-offloading capability; DEP provides a much more powerful platform for the same purpose. I/O forwarding (in both hardware and software forms) [2, 17] and data shipping [28] offload I/O requests to dedicated nodes, which aggregate the requests and carry them out on behalf of compute nodes. The data nodes proposed in the DEP design can perform all of these functions and more.
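The core benefit of computing where the data resides can be illustrated with a small sketch. This is not an API from any of the cited systems; the function names and the byte accounting are purely illustrative, and the "storage node" is simulated as a local callable.

```python
# Sketch of the active-storage idea: run a reduction where the data lives,
# so only the (small) result crosses the network instead of the raw data.
# All names here are illustrative, not an API from the cited systems.

def ship_data_then_reduce(storage_read, nbytes_per_elem=8):
    """Traditional path: move every element to the compute node, then reduce."""
    data = storage_read()                 # full dataset crosses the network
    moved = len(data) * nbytes_per_elem
    return sum(data), moved

def active_storage_reduce(storage_read, nbytes_per_elem=8):
    """Active-storage path: the storage node reduces locally, ships one value."""
    data = storage_read()                 # this read stays local to storage
    result = sum(data)                    # computation offloaded to storage
    return result, nbytes_per_elem       # only the scalar result moves

dataset = list(range(1_000_000))
r1, moved1 = ship_data_then_reduce(lambda: dataset)
r2, moved2 = active_storage_reduce(lambda: dataset)
assert r1 == r2                           # same answer either way
print(moved1 // moved2)                   # network-traffic ratio
```

For a simple sum over a million 8-byte elements, the offloaded path moves six orders of magnitude less data; this is the bandwidth saving that active storage, active disks, and smart disks all target.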

Programming Model Improvements for Data-Intensive High-End Computing

Current parallel programming models are designed for computation-intensive applications. They include the Message Passing Interface (MPI) [16], Global Arrays [22], Unified Parallel C [15], Chapel, X10, Co-array Fortran, and data-parallel models such as High Performance Fortran (HPF). These models focus primarily on memory abstractions and communication mechanisms among processes. I/O is treated as a peripheral activity and often as a separate phase in these programming models and execution paradigms, typically carried out through a subset of interfaces such as MPI-IO [34].

Advanced I/O libraries, such as the Hierarchical Data Format (HDF), Parallel netCDF (PnetCDF), and the Adaptable IO System (ADIOS), provide high-level abstractions, map those abstractions onto I/O operations in one way or another, and complement parallel programming models in managing I/O activities. The recent MapReduce programming model [13, 29] has quickly gained wide adoption and proven effective for many data-intensive applications. The MapReduce model, however, is typically layered on top of distributed file systems, is not designed for high-performance computing semantics, and requires applications to be expressed in terms of specific Map and Reduce abstractions [13, 29]. DEP, in contrast, is designed for general parallel applications, with an increased programming capability.
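The restriction to "specific Map and Reduce abstractions" can be seen in a minimal sketch of the model's shape: the user supplies only a map function and a reduce function, and the framework handles the grouping (shuffle) between the two phases. This is plain, single-process Python with hypothetical names, not the distributed runtime of [13].

```python
# Minimal sketch of the MapReduce programming-model shape: user code is
# confined to map_fn and reduce_fn; the framework does the shuffle.
from collections import defaultdict

def map_fn(doc):
    # map: emit (key, value) pairs from one input record
    for word in doc.split():
        yield word, 1

def reduce_fn(key, values):
    # reduce: combine all values collected for one key
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:
        for k, v in map_fn(item):          # map phase
            groups[k].append(v)            # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

docs = ["a rose is a rose", "a daisy is a daisy"]
print(map_reduce(docs, map_fn, reduce_fn))
```

Any computation that does not decompose into this emit-group-combine pattern must be contorted to fit it, which is one reason the model is a poor match for general HPC applications.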

Runtime System Improvements for Data-Intensive High-End Computing

There has been a significant amount of research effort in optimizing I/O performance with runtime libraries, such as collective I/O [35, 19, 8], two-phase I/O, extended two-phase I/O, data sieving, server-directed I/O, disk-directed I/O, lightweight I/O [26], partitioned collective I/O [40], layout-aware collective I/O [8], the ADIOS library [20], and resonant I/O [41]. These strategies collect and aggregate small requests into larger ones at the I/O client, middleware, or server level.
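The common idea behind these strategies, coalescing many small, noncontiguous requests into a few large contiguous ones before they reach the file system, can be sketched in a few lines. The simulation below is illustrative only; real implementations live inside MPI-IO middleware such as ROMIO, and the `max_gap` parameter is a stand-in for the hole tolerance that data sieving uses.

```python
# Illustrative sketch of request aggregation as done by two-phase I/O and
# data sieving: merge (offset, length) requests that are contiguous or
# separated by a gap no larger than max_gap.

def coalesce(requests, max_gap=0):
    """Merge sorted (offset, length) requests whose gaps are <= max_gap."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1] + max_gap:
            prev_off, prev_len = merged[-1]          # extend previous request
            merged[-1] = (prev_off,
                          max(prev_off + prev_len, off + length) - prev_off)
        else:
            merged.append((off, length))             # start a new request
    return merged

# Several processes each issue small strided requests (offset, length):
small = [(0, 4), (8, 4), (4, 4), (12, 4), (100, 4), (104, 4)]
print(coalesce(small))   # two contiguous requests instead of six
```

Six 4-byte requests collapse into two contiguous ones, which is exactly the reduction in request count that makes collective I/O effective on parallel file systems.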

Many caching, buffering, staging, and prefetching optimizations exist at runtime as well, such as collective caching [18], collective buffering [23], active buffering [21], discretionary caching [37], SpecHint prefetching [6], transparent informed prefetching (TIP) [27], adaptive prefetching based on time-series modeling [36], multiple-level caching and prefetching for Blue Gene systems [4], and our prior work on pre-execution based prefetching [9, 10] and signature based prefetching with post-execution analysis [5]. Abbasi et al. recently proposed the DataStager framework, whose data staging services move output data to dedicated staging or I/O nodes prior to storage; it has proven effective in reducing I/O overhead and interference on compute nodes [1]. Zheng et al. proposed a preparatory data analytics (PreDatA) approach that prepares and characterizes scientific data as it is generated (e.g., data reorganization and metadata annotation) to speed up subsequent data access [42]. These approaches have shown considerable performance improvement with dedicated output-staging services and preparatory analysis. Our proposed DEP approach, built upon the server-push architecture [31, 32], leverages dedicated nodes as well, but differs in important ways. The dedicated data processing nodes serve both reads and writes, and can provide buffering or staging, but their primary role is data reduction. The notion of data processing nodes in DEP is a rethinking of HEC systems architecture to provide balanced computational and I/O capability. DEP addresses the I/O bottleneck fundamentally at the level of the execution paradigm, including systems architecture and the programming model, rather than through runtime optimizations alone.
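The prefetching strategies above share a common skeleton: a predictor watches the access stream and fetches the next expected block ahead of demand, so a hit avoids the I/O stall. The toy class below shows only that skeleton with a simple stride predictor; the class and its names are hypothetical, and real systems such as TIP [27] or pre-execution prefetching [9, 10] use far richer access-pattern knowledge.

```python
# Toy sketch of runtime prefetching with a stride predictor. Hypothetical
# names; fetch() stands in for an actual (slow) I/O operation.

class Prefetcher:
    def __init__(self, fetch):
        self.fetch = fetch        # function: block id -> data
        self.cache = {}           # blocks fetched ahead of demand
        self.last = None
        self.hits = self.misses = 0

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            data = self.cache.pop(block)     # prefetch hit: no I/O stall
        else:
            self.misses += 1
            data = self.fetch(block)         # demand miss: pay full latency
        if self.last is not None:            # predict next block by stride
            nxt = block + (block - self.last)
            self.cache[nxt] = self.fetch(nxt)
        self.last = block
        return data

p = Prefetcher(fetch=lambda b: f"block-{b}")
for b in [0, 1, 2, 3, 4]:                    # sequential scan: stride 1
    p.read(b)
print(p.hits, p.misses)                      # 3 hits, 2 misses
```

After a short warm-up the sequential scan hits in the prefetch cache on every access, which is how these techniques hide I/O latency behind computation even though the total volume of data moved is unchanged.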

Parallel file systems (PFS), such as Lustre, GPFS [28], PanFS, PVFS, and PPFS2, enable concurrent I/O accesses from multiple clients to files. Numerous optimizations improve file system performance, such as data staging services [1], exploiting latent I/O asynchrony [38], and a log-structured interposition layer [3]. A comprehensive comparison between PVFS and the distributed file system HDFS was presented in [33].


  1. H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan and F. Zheng. DataStager: Scalable Data Staging Services for Petascale Applications. Cluster Computing 13(3): 277-290, 2010.
  2. N. Ali, P. H. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. B. Ross, L. Ward and P. Sadayappan. Scalable I/O Forwarding Framework for High-performance Computing Systems. In Proc. of the 2009 IEEE Intl. Conf. on Cluster Computing, 2009.
  3. J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte and M. Wingate. PLFS: A Checkpoint Filesystem for Parallel Applications. In Proc. of ACM/IEEE Supercomputing Conference, 2009.
  4. J. G. Blas, F. Isaila, J. Carretero, R. Latham and R. Ross. Multiple-Level MPI File Write-Back and Prefetching for Blue Gene Systems. In Proc. of PVM/MPI, 2009.
  5. S. Byna, Y. Chen, X.-H. Sun, R. Thakur and W. Gropp. Parallel I/O Prefetching Using MPI File Caching and I/O Signatures. In Proc. of the ACM/IEEE Supercomputing Conference (SC'08), 2008.
  6. F. Chang and G. A. Gibson. Automatic I/O Hint Generation Through Speculative Execution. In Proc. of the 3rd Symposium on Operating Systems Design and Implementation (OSDI), 1999.
  7. F. Chen, D. A. Koufaty and X. Zhang. Hystor: Making the Best Use of Solid State Drives in High Performance Storage Systems. In Proc. of the International Conference on Supercomputing (ICS), 2011.
  8. Y. Chen, X.-H. Sun, R. Thakur, P. C. Roth and W. Gropp. LACIO: A New Collective I/O Strategy for Parallel I/O Systems. In Proc. of IEEE International Parallel and Distributed Processing Symposium (IPDPS' 11), May, 2011.
  9. Y. Chen, S. Byna, X.-H. Sun, R. Thakur, and W. Gropp. Hiding I/O Latency with Pre-execution Prefetching for Parallel Applications. Best paper award finalist, in Proc. of the ACM/IEEE SuperComputing Conference (SC'08), Nov. 2008.
  10. Y. Chen, S. Byna, X.-H. Sun, R. Thakur and W. Gropp. Exploring Parallel I/O Concurrency with Speculative Prefetching. In Proc. of the 37th International Conference on Parallel Processing (ICPP'08), 2008.
  11. G. Chockler and D. Malkhi. Active Disk Paxos with Infinitely Many Processes. In Proc. of the 21st Annual Symposium on Principles of Distributed Computing, pp. 78-87, 2002.
  12. S. Chiu, W.-K. Liao and A. Choudhary. Design and Evaluation of Distributed Smart Disk Architecture for I/O-Intensive Workloads. In ICCS, 2003.
  13. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), pp. 137-150, December, 2004.
  14. X. Y. Dong and Y. Xie. AdaMS: Adaptive MLC/SLC Phase-change Memory Design for File Storage. ASP-DAC, 31-36, 2011.
  15. T. E. Ghazawi and L. Smith. UPC: Unified Parallel C. ACM/IEEE conference on Supercomputing (SC'06), 2006.
  16. W. D. Gropp, E. Lusk and R. Thakur. Using MPI-2. MIT Press, 1999.
  17. K. Iskra, J. W. Romein, K. Yoshii and P. Beckman. ZOID: I/O Forwarding Infrastructure for Petascale Architectures. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 153 -162, 2008.
  18. W.-K. Liao, A. Ching, K. Coloma, A. Choudhary and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proc. of IEEE International Parallel and Distributed Processing Symposium, 2007.
  19. W.-K. Liao and A. Choudhary. Dynamic Adaptive File Domain Partitioning Methods for Collective I/O based on Underlying Parallel File System Locking Protocols. In Proc. of ACM/IEEE Supercomputing Conference, 2008.
  20. J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki and C. Jin. Flexible I/O and Integration for Scientific Codes Through the Adaptable I/O System (ADIOS). In Proc. of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, 2008.
  21. X. S. Ma, M. Winslett, J. Lee and S.-k. Yu. Faster Collective Output through Active Buffering. IPDPS, 2002.
  22. J. Nieplocha, R. J. Harrison and R. J. Littlefield. Global arrays: a Portable “Shared-Memory” Programming Model for Distributed Memory Computers. ACM/IEEE Supercomputing Conference (SC'94), 1994.
  23. B. Nitzberg and V. M. Lo. Collective Buffering: Improving Parallel I/O Performance. HPDC, 1997.
  24. E. Riedel, G. Gibson and C. Faloutsos. Active Storage For Large-Scale Data Mining and Multimedia. In Proc. of the 24th Intl. Conference on Very Large Data Bases, 1998.
  25. E. Riedel and G. Gibson. Active Disks: Remote Execution for Network-Attached Storage. Technical Report, Carnegie Mellon University, Pittsburgh, 1997.
  26. R. Oldfield, L. Ward, R. Riesen, A. B. Maccabe, P. Widener and T. Kordenbrock. Lightweight I/O for Scientific Applications. In Proc. of IEEE International Conf. on Cluster Computing, 2006.
  27. R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky and J. Zelenka. Informed Prefetching and Caching. In Proc. of the 15th ACM Symposium on Operating Systems Principles (SOSP’95), 1995.
  28. F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In 1st USENIX Conference on File and Storage Technologies, 2002.
  29. S. Sehrish, G. Mackey, J. Wang and J. Bent. MRAP: A Novel MapReduce-based Framework to Support HPC Analytics Applications with Access Patterns. In Proc. of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), 2010.
  30. S. W. Son, S. Lang, P. Carns, R. Ross, R. Thakur, B. Ozisikyilmaz, P. Kumar, W.-K. Liao and A. Choudhary. Enabling Active Storage on Parallel I/O Software Stacks. In Proc. of the 26th IEEE Symp. on Massive Storage Systems & Technologies, 2010.
  31. X.-H. Sun, S. Byna and Y. Chen. Server-based Data Push Architecture for Multi-processor Environments. Journal of Computer Science and Technology, Vol. 22, No. 5, 641 – 652, 2007.
  32. X.-H. Sun, S. Byna and Y. Chen. Improving Data Access Performance with Server Push Architecture. In Proc. of the NSF Next Generation Software Program Workshop (with IPDPS'07), 2007.
  33. W. Tantisiriroj, S. W. Son, S. Patil, S. Lang, G. Gibson and R. B. Ross. On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS. In Proc. of ACM/IEEE Supercomputing Conference (SC'11), 2011.
  34. R. Thakur, R. Ross, E. Lusk and W. Gropp. Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation. Technical Memorandum ANL/MCS-TM-234, Mathematics and Computer Science Division, ANL, 2004.
  35. R. Thakur, W. Gropp and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proc. of the 7th Symposium on the Frontiers of Massively Parallel Computation, 1999.
  36. N. Tran and D. A. Reed. Automatic ARIMA Time Series Modeling for Adaptive I/O Prefetching. IEEE Trans. Parallel Distrib. Syst. 15(4): 362-377 (2004).
  37. M. Vilayannur, A. Sivasubramaniam, M. T. Kandemir, R. Thakur and R. Ross. Discretionary Caching for I/O on Clusters. Cluster Computing 9(1): 29-44, 2006.
  38. P. M. Widener, M. Payne, P. G. Bridges, M. Wolf, H. Abbasi, S. McManus and K. Schwan. Exploiting Latent I/O Asynchrony in Petascale Science Applications. ICPP Workshops, 105-112, 2009.
  39. Y. Xie, K.-K. M. Reddy, D. Feng, D. D. E. Long, Y. Kang, Z. Niu and Z. Tan. Design and Evaluation of Oasis: An Active Storage Framework based on the T10 OSD Standard. In 27th IEEE Symp. on Mass Storage Systems and Technologies (MSST), 2011.
  40. W. Yu and J. S. Vetter. ParColl: Partitioned Collective I/O on the Cray XT. ICPP, 562-569, 2008.
  41. X. Zhang, S. Jiang, and K. Davis. Making Resonance a Common Case: A High-performance Implementation of Collective I/O on Parallel File Systems. IPDPS, 2009.
  42. F. Zheng, H. Abbasi, C. Docan, J. F. Lofstead, Q. Liu, S. Klasky, M. Parashar, N. Podhorszki, K. Schwan and M. Wolf. PreDatA - Preparatory Data Analytics on Peta-scale Machines. IPDPS, 2010.