Compiler and Runtime Support for Data Intensive Computing on Multi-Dimensional Data
The analysis and processing of large datasets that arise from simulations and from sensors play an increasingly important role in many domains of scientific research. Typical examples of very large scientific datasets include long-running simulations of time-dependent phenomena that periodically generate snapshots of their state (e.g., hydrodynamics and chemical transport simulation for estimating pollution impact on water bodies, magnetohydrodynamics simulation of planetary magnetospheres, simulation of a flame sweeping through a volume, airplane wake simulations), archives of raw and processed remote sensing data (e.g., AVHRR, Thematic Mapper, MODIS), and archives of medical images (e.g., high-resolution light microscopy, CT imaging, MRI, sonography). These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or varying experimental conditions such as temperature, velocity, or magnetic field. The increasing importance of such datasets has been widely recognized.
In this project, we are developing techniques and middleware system support for 1) efficient storage and querying of multi-dimensional datasets on storage clusters, and 2) efficient processing of data on distributed storage and computation systems. Our approach is to develop policies that optimize computational efficiency for a broad range of computations carried out on multi-dimensional datasets. These policies need to take into account the spatial structure of a dataset, the partitioning of the dataset between storage units, and the computations to be performed. We implement these policies in middleware systems to support development of applications that consist of interacting components of data query and application specific processing operations. Some of the results from our research so far can be summarized as follows:
Support for partial data replication and data reordering to support data subsetting and data aggregation queries on distributed memory parallel machines. We developed a cost model and compiler and runtime optimizations for combined use of space partitioned and attribute partitioned replicas for executing data subsetting range queries. The algorithm allows uneven partitioning of replicas across storage nodes. Different replicas can be partitioned across different subsets of storage nodes. We investigated application of dynamic programming and greedy heuristics for determining the best set of replicas.
Support for applying XML data management in scientific data analysis workflows. We have developed runtime support that aims to address issues associated with metadata management, data storage and management, and execution of data analysis workflows on distributed storage and computation platforms. Our framework couples a distributed, filter-stream based dataflow engine with a distributed XML-based data and metadata management system.
Support for efficient data source abstractions that provide an object-relational view of data while hiding the details of storage and transport mechanisms and dataset layouts. In this abstraction, Basic Data Sources (BDS) interpret flat files as a set of records and are the building blocks of the view mechanism. Derived Data Sources (DDS) may be built on top of BDSs and provide more complex objects that serve the scientists' needs. The simplest DDS is one that supports a join-based view over BDSs. We examined issues involved in building such DDSs for scientific applications and considered distributed versions of the indexed join and the Grace Hash join algorithms.
A framework and techniques for efficient storage, retrieval, and processing of multi-resolution datasets on parallel and distributed disk-based storage clusters. We evaluated the techniques in the navigation and parallel visualization of terabyte-scale 3D biomedical image datasets.
Support for efficient use of peer-to-peer storage systems for querying and subsetting of multi-dimensional datasets. We have developed several query scheduling strategies to improve query response times when multiple queries are submitted to the system. We have implemented these strategies in a prototype data server system built on top of a peer-to-peer object storage system called Pond.
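The join-based view over Basic Data Sources mentioned above can be illustrated with a small, single-process sketch of the Grace Hash join. This is not the project's implementation: the record layout, the function names `partition` and `grace_hash_join`, and the use of in-memory lists in place of on-disk or per-node partitions are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, List

# Hypothetical record type: one row, as a BDS might yield it from a flat file.
Record = dict
KeyFn = Callable[[Record], Hashable]


def partition(records: Iterable[Record], key: KeyFn,
              num_partitions: int) -> List[List[Record]]:
    """Phase 1: hash every record on its join key into one of N buckets.

    In a disk-based or distributed setting each bucket would be written to a
    file or sent to a node; plain lists stand in for that here.
    """
    buckets: List[List[Record]] = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[hash(key(rec)) % num_partitions].append(rec)
    return buckets


def grace_hash_join(left: Iterable[Record], right: Iterable[Record],
                    left_key: KeyFn, right_key: KeyFn,
                    num_partitions: int = 4) -> Iterator[Record]:
    """Phase 2: join matching bucket pairs with an in-memory hash table.

    Records with equal keys always land in buckets with the same index,
    so each pair of buckets can be joined independently of the others.
    """
    left_parts = partition(left, left_key, num_partitions)
    right_parts = partition(right, right_key, num_partitions)
    for lpart, rpart in zip(left_parts, right_parts):
        table = defaultdict(list)
        for lrec in lpart:                      # build side
            table[left_key(lrec)].append(lrec)
        for rrec in rpart:                      # probe side
            for lrec in table.get(right_key(rrec), []):
                yield {**lrec, **rrec}


# Two toy "data sources" keyed by a grid cell id.
grid = [{"cell": 1, "x": 0.0}, {"cell": 2, "x": 1.0}, {"cell": 3, "x": 2.0}]
temps = [{"cell": 2, "temp": 300.0}, {"cell": 3, "temp": 310.0},
         {"cell": 4, "temp": 320.0}]

joined = sorted(
    grace_hash_join(grid, temps, lambda r: r["cell"], lambda r: r["cell"]),
    key=lambda r: r["cell"],
)
```

Because each bucket pair is joined independently, the second phase is the natural unit of work to spread across storage nodes in a distributed variant.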
Project Publications
Publications
Xi Zhang, Tahsin M. Kurc, Joel H. Saltz, "Design and Analysis of a Multi-dimensional Data Sampling Service for Large Scale Data Analysis Applications", 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2006: pp. 58.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Optimizing the Execution of Multiple Data Analysis Queries in Parallel and Distributed Environments", IEEE Transactions on Parallel and Distributed Systems, 2004: pp. 520-532.
Sivaramakrishnan Narayanan, Umit V. Catalyurek, Tahsin M. Kurc, Xi Zhang, Joel H. Saltz, "Applying Database Support for Large Scale Data Driven Science in Distributed Environments", Proceedings of the Fourth International Workshop on Grid Computing (Grid 2003), 2003: pp. 141-148.
Sivaramakrishnan Narayanan, Tahsin M. Kurc, Umit V. Catalyurek, Joel H. Saltz, "Database Support for Data-Driven Scientific Applications in the Grid", Parallel Processing Letters, 2003: pp. 245-271.
Matthew Spencer, Renato A. Ferreira, Michael Beynon, Tahsin M. Kurc, Umit V. Catalyurek, Alan Sussman, Joel H. Saltz, "Executing Multiple Pipelined Data Analysis Operations in the Grid", Proceedings of the 2002 ACM/IEEE SC02 Conference, 2002: pp. 1-18.
Alan Sussman, Beomseok Nam, "Improving Access to Multi-dimensional Self-describing Scientific Datasets", 2002: pp. 172-179.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Exploiting Functional Decomposition for Efficient Parallel Processing of Multiple Data Analysis Queries", Proceedings of International Parallel and Distributed Processing Symposium (IPDPS 2003), 2002: pp. 10.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Eugene Borovikov, Joel H. Saltz, "On Cache Replacement Policies for Servicing Mixed Data Intensive Query Workloads", Proceedings of the Second Workshop on Caching, Coherence and Consistency (WC3-02), 2002.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Active Proxy-G: Optimizing the Query Execution Process in the Grid", Proceedings of the 2002 ACM/IEEE conference on Supercomputing (SC2002), 2002: pp. 57-57.
Henrique Andrade, Tahsin M. Kurc, Joel H. Saltz, Alan Sussman, "Multiple Query Optimization for Data Analysis Applications on Clusters of SMPs", Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid2002), 2002: pp. 154.
Michael Beynon, Chialin Chang, Umit V. Catalyurek, Tahsin M. Kurc, Alan Sussman, Henrique Andrade, Renato A. Ferreira, Joel H. Saltz, "Processing Large-Scale Multidimensional Data in Parallel and Distributed Environments", Parallel Computing (special issue on parallel data-intensive algorithms and applications), 2002: pp. 827-859.
Michael Beynon, Tahsin M. Kurc, Umit V. Catalyurek, Alan Sussman, Joel H. Saltz, "Efficient Manipulation of Large Datasets on Heterogeneous Storage Systems", Proceedings of 16th International Parallel and Distributed Processing Symposium (IPDPS), The 11th Heterogeneous Computing Workshop (HCW 2002), 2002: pp. 0084.
Henrique Andrade, Tahsin M. Kurc, Umit V. Catalyurek, Alan Sussman, Joel H. Saltz, "Persistent Caching in a Multiple Query Optimization Framework", Proceedings of the Sixth Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, 2002.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Scheduling Multiple Data Visualization Query Workloads on a Shared Memory Machine", Proceedings of the Fifth Merged IPPS/SPDP (International Parallel Processing Symposium & Symposium on Parallel and Distributed Processing), 2002: pp. 11-18.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Eugene Borovikov, Joel H. Saltz, "Servicing Mixed Data Intensive Query Workloads", 2002.
Michael Beynon, Tahsin M. Kurc, Umit V. Catalyurek, Chialin Chang, Alan Sussman, Joel H. Saltz, "Distributed Processing of Very Large Datasets with DataCutter", Parallel Computing, 2001: pp. 1457-1478.
Tahsin M. Kurc, Umit V. Catalyurek, Chialin Chang, Alan Sussman, Joel H. Saltz, "Exploration and Visualization of Very Large Datasets with the Active Data Repository", IEEE Computer Graphics & Applications, 2001: pp. 24-33.
Tahsin M. Kurc, Umit V. Catalyurek, Chialin Chang, Alan Sussman, Joel H. Saltz, "Visualization of Large Datasets with the Active Data Repository", IEEE Computer Graphics and Applications, 2001: pp. 24-33.
Henrique Andrade, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Efficient Execution of Multiple Query Workloads in Data Analysis Applications", Proceedings of the 2001 ACM/IEEE conference on Supercomputing (SC2001), 2001: pp. 53.
Alan Sussman, Larry Davis, Eugene Borovikov, "An Efficient System for Multi-Perspective Imaging and Volumetric Shape Analysis", Proceedings of the Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM'2001), 2001.
Chialin Chang, Tahsin M. Kurc, Alan Sussman, Umit V. Catalyurek, Joel H. Saltz, "A Hypergraph-Based Workload Partitioning Strategy for Parallel Data Aggregation", Proceedings of the Tenth SIAM Conference on Parallel Processing for Scientific Computing, 2001.
Chialin Chang, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Optimizing Retrieval and Processing of Multi-dimensional Scientific Datasets", 14th International Parallel and Distributed Processing Symposium (IPDPS'00), 2000: pp. 405.
Michael Beynon, Renato A. Ferreira, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems", Eighth NASA Goddard Conference on Mass Storage Systems and Technologies/Seventeenth IEEE Symposium on Mass Storage Systems, 2000: pp. 119-133.
Tahsin M. Kurc, Chialin Chang, Renato A. Ferreira, Alan Sussman, Joel H. Saltz, "Querying Very Large Multi-dimensional Datasets in ADR", Proceedings of the 1999 ACM/IEEE SC99 Conference, 1999.
Chialin Chang, "Cost Models for Query Processing Strategies in the Active Data Repository", 1999.
Renato A. Ferreira, Alan Sussman, Joel H. Saltz, "Database Methods for Efficient Manipulation of Very Large Datasets", Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), 1999.
Chialin Chang, Renato A. Ferreira, Alan Sussman, Joel H. Saltz, "Infrastructure for Building Parallel Database Systems for Multi-dimensional Data", Proceedings of the IEEE Second Merged IPPS/SPDP Symposiums, 1999: pp. 582-588.
Chialin Chang, Tahsin M. Kurc, Alan Sussman, Joel H. Saltz, "Query Planning for Range Queries with User-defined Aggregation on Multi-dimensional Scientific Datasets", 1999.
Presentations
Xi Zhang, Tony C. Pan, Umit V. Catalyurek, Tahsin M. Kurc, Joel H. Saltz, "Serving Queries to Multi-Resolution Datasets on Disk-based Storage Clusters", 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid04), Chicago, IL, Presented: 2004-04-21
Tech Reports
Gagan Agrawal, Umit V. Catalyurek, Tahsin M. Kurc, Sivaramakrishnan Narayanan, Joel H. Saltz, "An Approach for Automatic Data Virtualization", Issued: 2004-06-01
Xi Zhang, Tony C. Pan, Umit V. Catalyurek, Tahsin M. Kurc, Joel H. Saltz, "Serving Queries to Multi-Resolution Datasets on Disk-based Storage Clusters", Issued: 2004-04-01
Michael Beynon, Tahsin M. Kurc, Umit V. Catalyurek, Joel H. Saltz, "A Component-based Implementation of Iso-surface Rendering for Visualizing Large Datasets", Issued: 2001-05-01