dssl_ Distributed Systems & Storage Laboratory

Information

Our research group meets weekly on Wed 12:00pm in GP 4211.
If you are interested in joining our research group, please contact Dr. Butt.

News

Old news!

Research

Our research is supported by grants from the National Science Foundation, DOE's Oak Ridge National Laboratory, IBM, NetApp, and Virginia Tech.

Current Projects

Please see the Publications page for a more up-to-date listing of our research. The list below tends to lag behind.

AMOCA: Capability-aware programming model for asymmetric many-core processors for data-intensive workloads
HPC Data Management: Developing innovative data staging and offloading techniques for meeting Center-User Service Level Agreements
ERP: Energy-Reliability trade-offs in disks
MRPerf: Realistic simulator for designing MapReduce clusters
I/O for Heterogeneous Multiprocessors: Exploring efficient I/O techniques for the Cell/BE processor
ReplayCache: Using history-based cache management to improve I/O performance for repetitive application runs
SARD: Techniques for servicing I/O requests from peer node memory
Disconnected Operations in Grids: Integrating intermittent mobile devices in Grids via support for disconnected operations
PeerStripe: Storing large data on contributory resources
FlexiCache: Modular Linux buffer cache design to support customized replacement policies

Old Projects

The effect of Kernel prefetching on file system buffer cache:
Designing effective block replacement algorithms to minimize file system buffer cache misses is a challenging task. Despite the well-known interactions between prefetching and caching, almost all buffer cache replacement algorithms have been proposed and compared without taking into account file system prefetching, which exists in all modern operating systems. In this project we studied the effect of such kernel prefetching and showed (SIGMETRICS'05) that it can significantly affect the relative performance, measured as the number of actual disk I/Os, of many well-known replacement algorithms: it can not only narrow the performance gap but also change the relative performance benefits of different algorithms. The goal of the project is to demonstrate that buffer caching research needs to take file system prefetching into consideration.
More information on this project is available at the AccuSim webpage.
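To illustrate why prefetching matters when counting disk I/Os, here is a minimal sketch, not the AccuSim simulator itself. It assumes an LRU buffer cache and a hypothetical fixed-size readahead window (real kernels adapt the window dynamically); the function name `count_disk_ios` and its parameters are made up for this example.

```python
from collections import OrderedDict

def count_disk_ios(trace, capacity, window=1):
    """Count disk I/O requests for an LRU cache with fixed-window readahead.

    trace    -- sequence of block numbers accessed by the application
    capacity -- number of blocks the buffer cache can hold
    window   -- blocks fetched per disk I/O (1 means no prefetching);
                a fixed window is a simplifying assumption
    """
    cache = OrderedDict()  # block -> True, ordered from LRU to MRU
    disk_ios = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # hit: refresh recency
            continue
        disk_ios += 1  # one request covers the whole readahead window
        for b in range(block, block + window):
            cache[b] = True
            cache.move_to_end(b)
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return disk_ios

sequential_scan = list(range(16))
print(count_disk_ios(sequential_scan, capacity=8, window=1))
print(count_disk_ios(sequential_scan, capacity=8, window=4))
```

For a sequential scan, the same replacement policy issues far fewer disk requests once readahead is modeled, which is why comparing algorithms by miss count alone can be misleading.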
Predicting program behavior using program-counter:
Borrowing ideas from computer architecture research, we found that the program counter can also serve as an indication of program behavior for the operating system kernel. We leveraged this insight in two applications, power management of hard disks (IEEE TC'06) and file system buffer cache management (OSDI'04), with promising results. The approach is implemented in the Linux kernel.
More information about the project can be found at the PCOS webpage.
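The idea of using the program counter as a behavior hint can be sketched roughly as follows. This is a simplified illustration, not the published algorithm or its kernel implementation: the class `PCPredictor`, the reuse-ratio heuristic, and the 0.5 threshold are all assumptions made for this example.

```python
class PCPredictor:
    """Track, per call-site program counter (PC), how often the blocks
    that PC first brought in were accessed again later."""

    def __init__(self):
        self.first_pc = {}       # block -> PC that first accessed it
        self.fetched = {}        # PC -> number of blocks it first accessed
        self.reused = {}         # PC -> how many of those were accessed again
        self.seen_reuse = set()  # blocks already counted as reused

    def record(self, pc, block):
        if block not in self.first_pc:
            self.first_pc[block] = pc
            self.fetched[pc] = self.fetched.get(pc, 0) + 1
        elif block not in self.seen_reuse:
            self.seen_reuse.add(block)
            origin = self.first_pc[block]
            self.reused[origin] = self.reused.get(origin, 0) + 1

    def predicts_reuse(self, pc, threshold=0.5):
        """Guess whether blocks fetched from this PC are worth caching."""
        fetched = self.fetched.get(pc, 0)
        if fetched == 0:
            return False  # no history for this call site yet
        return self.reused.get(pc, 0) / fetched >= threshold

predictor = PCPredictor()
SCAN_PC, LOOP_PC = 0x400, 0x500  # hypothetical call-site addresses
for block in range(10):          # one-time scan: blocks never touched again
    predictor.record(SCAN_PC, block)
for _ in range(3):               # loop: the same blocks are accessed repeatedly
    for block in (100, 101, 102):
        predictor.record(LOOP_PC, block)

print(predictor.predicts_reuse(SCAN_PC))
print(predictor.predicts_reuse(LOOP_PC))
```

A cache manager could use such a prediction to evict scan blocks early, and a disk power manager could use a similar per-PC history to anticipate idle periods.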
Peer-to-peer resource management:
Modern resource sharing systems comprise thousands of resources, and peer-to-peer (p2p) approaches can be used to provide resource self-organization in the presence of failures. We designed two projects based on this concept. First, we used p2p mechanisms to manage Condor pools (SC'03, JPDC 66:1). Condor is a distributed system that allows sharing of resources within an administrative domain. We developed an automatic collaboration framework that allowed remote pools to discover each other, and therefore enabled resource sharing across administrative domains. Second, we applied the p2p approach to harness idle disk space on nodes within academic and corporate setups (SC'04, JoGC'06). We developed a distributed file system by extending the Network File System (NFS), which allows sharing of idle disk space in a transparent manner.
More information about the project can be found here.
Ensuring fairness in resource sharing:
We observed that in resource sharing systems some users tend to only utilize resources without contributing resources to the system. This creates an imbalance and results in the eventual collapse of the system. We developed a DHT-based accountability and feedback mechanism that allows users in the system to determine the "credit-worthiness" of other users (VM'04, PPoPP'05, SC'05, JoGC'06). This information can then be used to decide whether or not to allow exchange of resources with a particular user. The project solves the practical problem of fairness in sharing by providing a distributed accountability mechanism. Greedy users can be quickly identified and isolated, resulting in a robust system.
More information about the project can be found at the GridCop webpage.
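The credit-worthiness check described above might look roughly like the sketch below. It is only an illustration under stated assumptions: a plain in-memory dictionary stands in for the DHT, and the class `CreditLedger`, the balance rule, and the `allowance` parameter are hypothetical rather than taken from the published mechanism.

```python
class CreditLedger:
    """Record resource exchanges and flag users who consume much more
    than they contribute. A real deployment would store these records
    in a DHT; plain dictionaries stand in for it here (assumption)."""

    def __init__(self, allowance=10):
        self.allowance = allowance  # units a user may consume beyond contributions
        self.contributed = {}       # user -> units provided to others
        self.consumed = {}          # user -> units used from others

    def record_exchange(self, provider, consumer, units):
        self.contributed[provider] = self.contributed.get(provider, 0) + units
        self.consumed[consumer] = self.consumed.get(consumer, 0) + units

    def balance(self, user):
        return self.contributed.get(user, 0) - self.consumed.get(user, 0)

    def is_credit_worthy(self, user):
        """Allow an exchange only if the user's deficit is within the allowance."""
        return self.balance(user) >= -self.allowance

ledger = CreditLedger(allowance=10)
ledger.record_exchange(provider="alice", consumer="bob", units=8)
ledger.record_exchange(provider="bob", consumer="alice", units=5)
ledger.record_exchange(provider="alice", consumer="carol", units=20)

for user in ("alice", "bob", "carol"):
    print(user, ledger.balance(user), ledger.is_credit_worthy(user))
```

In this toy run, a user who only consumes quickly exceeds the allowance and would be refused further exchanges, while users who both give and take remain eligible.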
Query caching in peer-to-peer networks:
We observed that p2p query traffic exhibits temporal locality and can benefit from caching. In the first part of this project, a query caching proxy was installed at the boundary of an organization, and queries originating from inside the organization were cached. We refer to this as forward-caching. Next, we cached queries that originated outside the organization and were forwarded into it (WCW9). We refer to this as reverse-caching. We found that if the cache capacity is fixed in terms of the number of cached query replies, forward and reverse query caching are equivalent in both hit ratio and bandwidth savings. The project provided insight into caching for reducing p2p traffic (which is now the most prevalent traffic on the Internet) and improving bandwidth utilization.
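A query cache whose capacity is fixed in terms of the number of cached replies can be sketched as below. This is a minimal illustration rather than the proxy used in the project: the LRU eviction policy, the class `QueryCache`, and the `fake_fetch` stand-in for forwarding a query to the p2p network are all assumptions.

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache keyed by query string; capacity counts cached replies."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.replies = OrderedDict()  # query -> reply, ordered from LRU to MRU
        self.hits = 0
        self.lookups = 0

    def lookup(self, query, fetch):
        """Return the cached reply, or call fetch(query) and cache the result."""
        self.lookups += 1
        if query in self.replies:
            self.hits += 1
            self.replies.move_to_end(query)  # refresh recency on a hit
            return self.replies[query]
        reply = fetch(query)                 # miss: forward the query
        self.replies[query] = reply
        if len(self.replies) > self.capacity:
            self.replies.popitem(last=False)  # evict the least recently used reply
        return reply

    def hit_ratio(self):
        return self.hits / self.lookups if self.lookups else 0.0

def fake_fetch(query):
    # Hypothetical stand-in for forwarding the query to the p2p network.
    return f"results for {query}"

cache = QueryCache(capacity=2)
for q in ["a", "b", "a", "c", "b", "a"]:
    cache.lookup(q, fake_fetch)

print(cache.hits, cache.lookups)
```

The same structure serves either direction: a forward cache sees queries leaving the organization, a reverse cache sees queries entering it, and only the traffic fed to `lookup` differs.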
Designing computational grid portals:
Computational grids provide computing power by sharing resources across administrative domains. This sharing, coupled with the need to execute untrusted code from arbitrary users, introduces security hazards. Grid environments are built on top of platforms that control access to resources within a single administrative domain, at the granularity of a user. In wide-area multi-domain grid environments, the overhead of maintaining user accounts is prohibitive, and securing access to resources via user accountability is impractical. Typically, these issues are handled by implementing checks that guarantee the safety of applications, so that they can run in shared user accounts. We showed (JPDC 63:10, IPDPS'02) that safety checks -- language-based, compile-time, link-time or load-time -- currently implemented in most grid environments are either inadequate or limit allowed grid users and applications. We surveyed various grid systems to highlight the problems and limitations of current grid environments, and we proposed a runtime process monitoring technique. The approach allows setting up an execution environment that supports the full legitimate use allowed by the security policy of a shared resource.