Database applications that use multi-terabyte datasets are becoming increasingly important in scientific fields such as astronomy and biology. To improve query execution performance, modern DBMSs build indexes and materialized views on the wide tables that store experimental data. The replication of data in indexes and views, however, implies large amounts of additional storage space and incurs high update costs as new experiments add or change large volumes of data. In this paper we explore automatic data partitioning as a tool to redesign the relational tables in a database for faster sequential access before creating indexes and views. We present AutoPart, a vertical partitioning tool that uses the optimizer's hints to determine optimal partitions for a given workload. According to our experiments, the schemas recommended by AutoPart (a) improve query execution times by 12%-44% when compared to the original schema without indexes, and (b) execute queries up to 37% faster and updates almost twice as fast compared to the original schema after indexing, while requiring half the space for indexes. Finally, we show that a form of categorical partitioning can further improve query performance by 29%-62% without any indexes and by up to 54% with indexes, while update performance improves by up to 60%. AutoPart can be used with any commercial database system. Our experimental results are based on the Sloan Digital Sky Survey (SDSS) database, a real-world astronomical database, running on Microsoft SQL Server 2000.
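The core idea of vertical partitioning can be illustrated with a minimal sketch. The sketch below is a simplified attribute-affinity heuristic, not AutoPart's actual algorithm, which consults the optimizer's cost estimates; the workload representation and the `threshold` parameter are assumptions made for illustration:

```python
from collections import defaultdict

def vertical_partition(workload, threshold=1):
    """Greedily group attributes that the workload frequently reads
    together, so sequential scans touch fewer unused columns.

    workload: list of (attribute_set, frequency) pairs, one per query class.
    Returns a list of fragments (sets of attribute names)."""
    # Count how often each pair of attributes appears in the same query.
    affinity = defaultdict(int)
    attrs = set()
    for cols, freq in workload:
        attrs |= cols
        ordered = sorted(cols)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                affinity[a, b] += freq

    def frag_affinity(f1, f2):
        return sum(affinity[min(a, b), max(a, b)] for a in f1 for b in f2)

    # Start from singleton fragments and repeatedly merge the most
    # strongly co-accessed pair, until no merge clears the threshold.
    fragments = [{a} for a in sorted(attrs)]
    while True:
        best, score = None, threshold - 1
        for i in range(len(fragments)):
            for j in range(i + 1, len(fragments)):
                s = frag_affinity(fragments[i], fragments[j])
                if s > score:
                    best, score = (i, j), s
        if best is None:
            return fragments
        i, j = best
        fragments[i] |= fragments.pop(j)
```

For a hypothetical SDSS-like workload where position lookups dominate, attributes read together (e.g. `objID`, `ra`, `dec`) end up in one fragment while rarely co-accessed columns are split off, which is the effect the paper exploits before building any indexes.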
Modern scientific applications consume massive volumes of data produced by computer simulations. Such applications require new data management capabilities in order to scale to terabyte-scale data volumes [25,10]. The most common way to discretize the application domain is to decompose it into tetrahedra, forming an unstructured tetrahedral mesh. Modern simulations generate meshes of high resolution and precision, to be queried by a visualization or analysis tool. Tetrahedral meshes are extremely flexible and therefore vital for accurately modeling complex geometries, but they are also difficult to index. To reduce query execution time, applications either use only subsets of the data or rely on different (less flexible) structures, thereby trading accuracy for speed. This paper presents efficient indexing techniques for generic spatial queries on tetrahedral meshes. Because the prevailing multidimensional indexing techniques attempt to approximate the tetrahedra using simpler shapes (rectangles), query performance deteriorates significantly as a function of the mesh's geometric complexity. We develop Directed Local Search (DLS), an efficient indexing algorithm based on mesh topology information that is practically insensitive to the geometric properties of meshes. We show how DLS can be easily and efficiently implemented within modern database systems without requiring new, exotic index structures or complex preprocessing. Finally, we present a new data layout approach for tetrahedral mesh datasets that provides better performance than traditional space-filling curves. In our PostgreSQL implementation, DLS reduces both the number of disk page accesses and the query execution time by between 25% and a factor of 4.
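The topology-driven walk behind DLS can be sketched as follows. This is a deliberate simplification: the real algorithm performs geometric containment tests on tetrahedra, whereas this sketch steps greedily toward the neighbor whose centroid is closest to the query point; the adjacency-list mesh representation is an assumption for illustration:

```python
import math

def directed_local_search(neighbors, centroids, query, seed):
    """Walk the mesh from a seed cell toward the query point using only
    topology (neighbor links), not a geometric index.

    neighbors: dict mapping cell id -> list of adjacent cell ids.
    centroids: dict mapping cell id -> centroid coordinates.
    Returns the cell where the walk converges."""
    def dist(cell):
        return math.dist(centroids[cell], query)

    current = seed
    visited = {seed}  # guard against cycling on degenerate meshes
    while True:
        best = min((n for n in neighbors[current] if n not in visited),
                   key=dist, default=None)
        if best is None or dist(best) >= dist(current):
            return current
        visited.add(best)
        current = best
```

Because each step only follows precomputed neighbor links, the walk's cost depends on path length through the mesh rather than on how poorly bounding rectangles approximate sliver-shaped tetrahedra, which is why a topology-based search is insensitive to geometric complexity.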
This paper presents Oracle Database Replay, a novel approach to testing changes to the relational database management system component of an information system (software upgrades, hardware changes, etc.). Database Replay makes it possible to subject a test system to a real production workload, which helps identify potential problems before implementing the planned changes on the production system. Any interesting workload period of a production database system can be captured with minimal overhead. The captured workload can then be used to drive a test system while maintaining the concurrency and load characteristics of the real production workload. Test results obtained using Database Replay therefore provide very high assurance in determining the impact of changes to a production system before those changes are applied. This paper presents the architecture of Database Replay as well as experimental results that demonstrate its usefulness as a testing methodology.
Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as biology and astronomy. However, contention at a centralized backend database is a major performance bottleneck, limiting the scalability of Internet-based database applications. Mid-tier caching reduces contention at the backend database by distributing database operations to the cache. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from the various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1.4-million-query astronomy workload demonstrate up to a 70% reduction in network traffic and up to a threefold reduction in query response time when compared with alternative units of cache replacement.
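A query prototype groups all queries of the same class under one cache unit by abstracting away their literal constants. The sketch below illustrates that idea with regular expressions; it is an assumption-laden simplification, since a real system would use a proper SQL parser for template extraction:

```python
import re

def query_prototype(sql):
    """Reduce a SQL query to its prototype by replacing literal
    constants with placeholders, so every instance of the same query
    class maps to the same cache unit."""
    proto = re.sub(r"'[^']*'", "?", sql)            # string literals
    proto = re.sub(r"\b\d+(\.\d+)?\b", "?", proto)  # numeric literals
    return re.sub(r"\s+", " ", proto).strip()       # normalize spacing
```

Under this scheme, two astronomy queries that differ only in the object identifier they look up (say, `WHERE objID = 1237648` versus `WHERE objID = 99`) share one prototype, so the cache can size and organize storage per query class rather than per individual query string.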