CNS-1405697
CSR: Medium:
Pythia: An Application Analysis and Online Modeling Based Prediction Framework for Scalable Resource Management

PIs: Ali R. Butt and Chao Wang

Distributed Systems & Storage Lab.

Table of Contents:

  1. Project Overview
  2. Motivation
  3. Team Members
  4. Major Outcomes
  5. Summary of Significant Results
  6. Completed Thesis
  7. Broader Impact
  8. Publications

Project Overview:

Pythia exploits compile-time analysis of User-Defined Functions (UDFs) and integrated online simulations of distributed software framework (DSF) runtimes to provide a comprehensive solution for efficient DSF resource management and scheduling. The scientific value and innovation of this research lie in three tightly coupled research tasks that together enable the development of Pythia, which aims to achieve efficient resource management for large-scale data management systems such as Hadoop.

Online oracle architecture to assist DSF runtime.

Motivation:

Pythia Team:

PIs:

  1. Ali R. Butt
  2. Chao Wang

PhD:

  1. Hyogi Sim
  2. Yue Cheng
  3. Ali Anwar
  4. Luna Xu
  5. Bharti Wadhwa
  6. Arnab K. Paul
  7. Krish K. R.
  8. Tariq Kamal
  9. Aleksandr Khasymski
  10. Markus Kusano

MSc:

  1. Safdar Iqbal
  2. Arpit Goyal
  3. Sangeetha Banavathi Srinivasa
  4. Mohammed Salman
  5. Wenjie Zhuang
  6. Arjun Passi
  7. Sepideh Khoshnood
  8. Yuzhong Wen

REU Scholars:

  1. Syeda Mahmood

The PIs are committed to supporting women in systems research. The Pythia project has supported six women in various roles (PhD, MSc, and REU).

Major Outcomes:

CAST: Cloud Data Analytics Storage Tiering [HPDC’15]

This work addresses analytics workload management in the cloud: finding a tiering strategy that efficiently uses all available storage service options (e.g., virtual machine (VM)-local ephemeral disks, network-attached persistent disks, and cloud object stores) to meet tenant requirements. To address this problem, we proposed CAST, a cloud storage tiering solution that tenants can leverage to reduce monetary cost and improve the performance of cloud analytics workloads.

We evaluated CAST using production workload traces from Facebook and a 400-core Google Cloud based Hadoop cluster. An enhancement, CAST++, extends CAST to meet deadlines for analytics workflows while minimizing cost. The results demonstrated that CAST++ achieves 1.21x performance and reduces deployment costs by 51.4% compared to a local storage configuration. Compared to extant storage-characteristic-oblivious cloud deployment strategies, CAST++ improved performance by as much as 37.1% while reducing deployment costs by as much as 51.4%.
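To illustrate the kind of trade-off a tiering planner weighs, here is a minimal Python sketch that picks the cheapest storage tier whose estimated runtime still meets a deadline. It is not the CAST or CAST++ algorithm: the cost model, the `Tier` fields, the tier names, and all prices and throughputs are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_gb_hour: float  # hypothetical storage price
    throughput_mb_s: float    # hypothetical sustained throughput


# Hypothetical tiers; the numbers are made up for this sketch.
TIERS = [
    Tier("ephemeral-ssd", 0.001, 400.0),
    Tier("persistent-disk", 0.0001, 100.0),
    Tier("object-store", 0.00005, 50.0),
]


def estimate(job_gb, tier, vm_price_per_hour):
    """Return (runtime_hours, total_cost) under a simple I/O-bound model."""
    runtime_h = (job_gb * 1024) / tier.throughput_mb_s / 3600
    # Cost accrues for both the VMs and the storage while the job runs.
    cost = runtime_h * (vm_price_per_hour + job_gb * tier.price_per_gb_hour)
    return runtime_h, cost


def cheapest_tier(job_gb, tiers, vm_price_per_hour, deadline_h):
    """Pick the lowest-cost tier that meets the deadline, or None."""
    feasible = []
    for tier in tiers:
        runtime_h, cost = estimate(job_gb, tier, vm_price_per_hour)
        if runtime_h <= deadline_h:
            feasible.append((cost, tier.name))
    return min(feasible)[1] if feasible else None
```

With these made-up numbers, a loose deadline favors the slower but cheaper persistent disk, while a tight deadline forces the faster and more expensive tier. This is the style of cost versus performance decision that a tenant-side planner has to automate.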

MOS: Workload-aware Elasticity for Cloud Object Stores [HPDC’16]

MOS goes beyond extant object store systems and offers a novel solution wherein the traditional monolithic design is dynamically partitioned into multiple fine-grained microstores, each serving a different type of workload. Prototype performance evaluation on a 128-core local testbed shows that, compared to a statically configured setup, MOS improves the sustained throughput by 79% for a large-object intensive workload while reducing the 95th-percentile tail latency by up to 70% for a small-object intensive workload. Large-scale simulation experiments on a 50-server cluster show that, using the same set of resources, MOS++, an optimized version that leverages containers to manage resources at fine granularity, achieves up to 18% performance improvement over baseline MOS.

MOS and its enhancement, MOS++, outperform extant object stores in multi-tenant environments by leveraging containers for fine-grained resource management and higher resource efficiency. We also developed COSPerf, a cloud object store simulator, to further verify the design choices in MOS++ and similar systems.
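The idea of partitioning one monolithic store into workload-specific microstores can be sketched as follows. This is an illustrative Python sketch only, not the MOS provisioning policy: the 1 MiB size threshold, the two workload classes, and the rule of splitting servers in proportion to request counts are all assumptions made for the example.

```python
from collections import Counter

SMALL_OBJECT_LIMIT = 1 * 1024 * 1024  # hypothetical 1 MiB threshold


def classify(size_bytes):
    """Map a request to a (hypothetical) microstore class by object size."""
    return "small" if size_bytes < SMALL_OBJECT_LIMIT else "large"


def partition_servers(request_sizes, total_servers):
    """Split servers among microstores in proportion to observed requests.

    Every active microstore gets at least one server (an assumption of
    this sketch); the remaining servers are shared by request count.
    """
    counts = Counter(classify(s) for s in request_sizes)
    if not counts:
        return {}
    if total_servers < len(counts):
        raise ValueError("need at least one server per active microstore")
    total = sum(counts.values())
    alloc = {cls: 1 for cls in counts}
    spare = total_servers - len(counts)
    shares = {cls: spare * n / total for cls, n in counts.items()}
    for cls in counts:
        alloc[cls] += int(shares[cls])
    # Hand out servers lost to rounding, largest fractional share first.
    leftover = total_servers - sum(alloc.values())
    by_fraction = sorted(
        counts, key=lambda c: shares[c] - int(shares[c]), reverse=True
    )
    for cls in by_fraction[:leftover]:
        alloc[cls] += 1
    return alloc
```

Re-running `partition_servers` on a fresh window of requests gives a new split, which is one simple way to picture the dynamic repartitioning described above.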

GERBIL: MPI+YARN [CCGrid’15]

MBal: A Load Balanced Memory Cache Tier [EuroSys’15]

The work on in-memory key-value stores resulted in a framework called MBal, which provides adaptive load balancing both within a server and across the cache cluster through a multi-phase load balancer (e.g., key replication and coordinated data migration). MBal can dynamically mitigate load imbalance while supporting high performance and resource utilization efficiency and reducing key-value query tail latency.

MBal goes beyond extant systems and offers a holistic solution wherein the load balancing model tracks hotspots and applies different strategies based on imbalance severity: key replication, server-local data migration, or cross-server coordinated data migration. Performance evaluation on an 8-core commodity server shows that, compared to a state-of-the-art approach, MBal scales with the number of cores and executes 2.3x and 12x more queries/second for GET and SET operations, respectively. Furthermore, MBal cohesively employs different migration and replication techniques to improve performance by balancing load both within a server and across servers to redistribute and mitigate hotspots. Testing on a cloud-based 20-node cluster demonstrates that the considered load balancing techniques effectively complement one another and, compared to Memcached, can improve latency and throughput by 35% and 20%, respectively.
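The severity-based choice among the three strategies can be sketched in Python as below. This is a simplified illustration under stated assumptions, not MBal's actual balancer: the imbalance metric (hottest load divided by mean load), the three threshold values, and the escalation order are hypothetical, and every server is assumed to report at least one worker load.

```python
def imbalance(loads):
    """Ratio of the hottest unit's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


def choose_strategy(server_loads, replicate_at=1.2, local_at=1.5, cross_at=2.0):
    """Pick a balancing action from per-server lists of worker loads.

    The thresholds are hypothetical; a more severe imbalance escalates
    to a more expensive action.
    """
    worst_local = max(imbalance(workers) for workers in server_loads.values())
    cross = imbalance([sum(workers) for workers in server_loads.values()])
    if cross >= cross_at:
        return "cross-server migration"
    if worst_local >= local_at:
        return "server-local migration"
    if max(worst_local, cross) >= replicate_at:
        return "key replication"
    return "no action"
```

For example, `choose_strategy({"s1": [180, 20], "s2": [100, 100]})` selects server-local migration, because the two servers carry equal total load while one worker on `s1` is far hotter than its neighbor.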

Other related projects:

AnalyzeThis: An Analysis Workflow-Aware Storage System [SC’15]:
An analysis workflow-aware storage system that seamlessly blends together the flash storage and data analysis.

Multi-tiered Buffer Cache for Persistent Memory Devices:
A tiered caching system for combining PM devices to achieve the best of both PCM and FB-DRAM at lower cost-per-GB.

TurnKey: Unlocking Pluggable Distributed Key-Value Stores [HotStorage’16]:
A development platform that eases distributed KV store programming by providing common distributed management functionalities.

Scalable Metering Systems [IC2E’15, Cloud’15]:
A detailed study of monitoring data collected by OpenStack with a goal to pinpoint the limitations of the current approach and design alternate solutions.

MEMTUNE [IPDPS’16]:
Dynamic memory management for in-memory data analytic platforms.

DUX [CCGrid’16]:
An application-attuned dynamic data management system for data processing frameworks.

Summary of Significant Results:

Completed Thesis and student supervision:

PhD thesis:

  1. Yue Cheng, Workload-aware Efficient Storage Systems, 2017
  2. Tariq Kamal, Computational Cost Analysis of Large-Scale Agent-Based Epidemic Simulations, 2016
  3. Krish K. R., Exploiting Heterogeneity in Distributed Software Frameworks, 2015
  4. Aleksandr Khasymski, Accelerated Storage Systems, 2015

MSc thesis:

  1. M. Safdar Iqbal, The Multi-tiered Future of Storage: Understanding Cost and Performance Trade-offs in Modern Storage Systems, 2017
  2. Sangeetha Banavathi Srinivasa, Smart Load Balancing of exa-scale storage systems, 2017
  3. Mohammed Salman, Improving Endurance and Performance in Flash Storage Clusters, 2017
  4. Yuzhong Wen, Replication of Concurrent Applications in a Shared Memory Multikernel, 2016
  5. Sepideh Khoshnood, Constraint Solving for Diagnosing Concurrency Bugs, 2015

MSc non-thesis:

  1. Arpit Goyal, Efficient I/O Management for Distributed Software Frameworks, 2017

Broader Impact:

Our approach and workflow-aware analysis system will reduce time-to-solution for running simulations and models by supporting easy-to-use and easy-to-program distributed software frameworks. Such systems enable highly efficient and scalable computing applications, which in turn can benefit society, e.g., by enabling faster discovery of new drugs or exposing new physical phenomena. Our research thus shares its societal impact with other efforts to improve computer-based modeling for scientific discovery, and it has significant potential to improve quality of life.

Publications:

      1. Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai, Geoffroy R. Vallee, Seung-Hwan Lim, and Ali R. Butt. TagIt: A Fast and Efficient Scientific Data Discovery Service. To appear in Proceedings of the 2017 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), Denver, CO, pages 10, November 2017. (AR: 18.7%).
      2. Arnab Kumar Paul, Wenjie Zhuang, Luna Xu, Min Li, Mustafa Rafique, and Ali R. Butt. CHOPPER: Optimizing Data Partitioning for In-Memory Data Analytics Frameworks. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster), Taipei, Taiwan, pages 10, September 2016. (AR: 24%).
      3. Ali Anwar, Yue Cheng, Hai Huang, and Ali R. Butt. ClusterOn: Building Highly Configurable and Reusable Clustered Data Services using Simple Data Nodes. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), Denver, CO, pages 5, June 2016. (AR: 36.9%).
      4. Ali Anwar, Yue Cheng, Aayush Gupta, and Ali R. Butt. MOS: Workload-aware Elasticity for Cloud Object Stores. In Proceedings of the 25th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), Kyoto, Japan, pages 12, May 2016. (AR: 15.5%).
        An initial design, Taming the Cloud Object Storage with MOS, appeared in Proceedings of the 10th Parallel Data Storage Workshop (PDSW), Austin, Texas, pages 6, November 2015. (AR: 36%).
        A related poster was presented in 6th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2015), Tokyo, Japan, July 2015.
      5. Markus Kusano and Chao Wang. Flow-sensitive composition of thread-modular abstract interpretation. In Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), Seattle, WA, November 2016.
      6. Chungha Sung, Markus Kusano, Nishant Sinha, and Chao Wang. Static DOM event dependency analysis for testing web applications. In Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), Seattle, WA, November 2016.
      7. Luna Xu, Min Li, Li Zhang, Ali R. Butt, Yandong Wang, and Zane Zhenhua Hu. MEMTUNE: Dynamic Memory Management for In-memory Data Analytic Platforms. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, pages 10, May 2016. (AR: 23%).
      8. Yue Cheng, M. Safdar Iqbal, Aayush Gupta, and Ali R. Butt. Provider versus Tenant Pricing Games for Hybrid Object Stores in the Cloud. In IEEE Internet Computing: Special Issue on Cloud Storage, 20(3):28-35, May/June 2016.
      9. Ali Anwar, Yue Cheng, and Ali R. Butt. Towards Managing Variability in the Cloud. In Proceedings of the 1st IEEE International Workshop on Variability in Parallel and Distributed Systems (VarSys), Chicago, IL, pages 4, May 2016.
      10. Krish K. R., Bharti Wadhwa, M. Safdar Iqbal, M. Mustafa Rafique, and Ali R. Butt. On Efficient Hierarchical Storage for Big Data Processing. In 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Cartagena, Colombia, pages 6, May 2016. (AR: 36%).
      11. Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai, Devesh Tiwari, Ali Anwar, Ali R. Butt, and Lavanya Ramakrishnan. AnalyzeThis: An Analysis Workflow-Aware Storage System. In Proceedings of the 2015 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15), Austin, TX, pages 12, Nov. 2015. (AR: 22.1%).
      12. Min Li, Dakshi Agrawal, Frederick Reiss, Berni Schiefer, Ali R. Butt, Josep Lluis Larriba Pey, Francois Raab, Doshi Kshitij and Yinglong Xia. SparkBench: A Spark Performance Testing Suite. In Proceedings of the Seventh TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2015), Kohala Coast, HI, pages 20, August 2015.
      13. Ali Anwar, Anca Sailer, Andrzej Kochut, and Ali R. Butt. Anatomy of Cloud Monitoring and Metering: A case study and open problems. In Proceedings of the 6th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2015), Tokyo, Japan, pages 7, July 2015. (AR: 29.4%).
      14. Yue Cheng, M. Safdar Iqbal, Aayush Gupta, and Ali R. Butt. Pricing Games for Hybrid Object Stores in the Cloud: Provider vs. Tenant. In Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), Santa Clara, CA, pages 7, July 2015. (AR: 32.8%).
      15. Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai, Devesh Tiwari, Ali Anwar, Ali R. Butt, and Lavanya Ramakrishnan. AnalyzeThis: An Analysis Workflow-Aware Storage System. Poster in the 2015 USENIX Annual Technical Conference (ATC), Santa Clara, CA, July 2015.
      16. Ali Anwar, Anca Sailer, Andrzej Kochut, Charles O. Schulz, Alla Segal, and Ali R. Butt. Cost-Aware Cloud Metering with Scalable Service Management Infrastructure. In Proceedings of the IEEE 2015 International Conference on Cloud Computing (IEEE Cloud), New York, pages 8, June 2015. (AR: 17%).
        An earlier version of the paper, Scalable Metering for an Affordable IT Cloud Service Management, appeared in Proceedings of the IEEE International Conference on Cloud Engineering (IC2E), Tempe, AZ, pages 6, March 2015.
      17. Yue Cheng, M. Safdar Iqbal, Aayush Gupta, and Ali R. Butt. CAST: Tiering Storage for Data Analytics in the Cloud. In Proceedings of the International ACM Symposium on High-Performance Distributed Computing (HPDC), Portland, Oregon, pages 12, June 2015. (AR: 16.4%).
      18. Hung-Ching Chang, Bo Li, Godmar Back, Ali R. Butt, and Kirk Cameron. LUC: Limiting the Unintended Consequences of Power Scaling on Parallel Transaction-Oriented Workloads. In Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Hyderabad, India, pages 10, May 2015. (AR: 21.8%).
      19. Luna Xu, Min Li, and Ali R. Butt. Gerbil: MPI+YARN. In Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Shenzhen, Guangdong, China, pages 10, May 2015. (AR: 25.7%).
      20. Yue Cheng, Aayush Gupta, and Ali R. Butt. An In-Memory Object Caching Framework with Adaptive Load Balancing. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), Bordeaux, France, pages 16, April 2015. (AR: 20.8%).
      21. Shengjian Guo, Markus Kusano, Chao Wang, Zijiang Yang, and Aarti Gupta. Assertion guided symbolic execution of multithreaded programs. In Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering (FSE), Bergamo, Italy, September 2015.
      22. Sepideh Khoshnood, Markus Kusano, and Chao Wang. ConcBugAssist: Constraint solving for diagnosis and repair of concurrency bugs. In Proceedings of the ACM International Symposium on Software Testing and Analysis (ISSTA), Baltimore, MD, July 2015.
      23. Markus Kusano, Arijit Chattopadhyay, and Chao Wang. Dynamic generation of likely invariants for multithreaded programs. In Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE), Florence, Italy, May 2015.
      24. Naling Zhang, Markus Kusano, and Chao Wang. Dynamic partial order reduction for relaxed memory models. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Portland, OR, June 2015.
      25. Fang Liu, Xiaokui Shu, Danfeng Yao, and Ali R. Butt. Privacy-Preserving Scanning of Big Content for Sensitive Data Exposure with MapReduce. In Proceedings of the Fifth ACM Conference on Data and Application Security and Privacy (CODASPY), San Antonio, TX, pages 12, March 2015. (AR: 21.3%).
      26. Tariq Kamal, Ali R. Butt, Keith Bisset, and Madhav Marathe. Cost Estimation of Parallel Constrained Producer-Consumer Algorithms. In Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Turku, Finland, pages 8, February 2015. (AR: 28.0%).
        An earlier version of the paper, Load Analysis and Cost Estimation of Parallel Constrained Producer-Consumer Algorithms, appeared in Proceedings of the IEEE International Conference on Cluster Computing (Cluster), Madrid, Spain, pages 286--287, September 2014.
      27. Krish K. R., M. Safdar Iqbal, M. Mustafa Rafique and Ali R. Butt. Towards Energy Awareness in Hadoop. In Proceedings of the Fourth International Workshop on Network-Aware Data Management (NDM) @SC'14, New Orleans, LA, pages 16-22, November 2014. (AR: 45.5%).
      28. Krish K. R., M. Safdar Iqbal, and Ali R. Butt. VENU: Orchestrating SSDs in Hadoop Storage. Short paper in Proceedings of the IEEE International Conference on Big Data (BigData), Washington, DC, pages 207-212, October 2014. (AR: 40.2%).
