Transaction Management and Query Processing in Massive Digital Databases

Description

Massive Digital Database Systems (MDDSs) store petabytes of data, with terabytes being added every day. Examples of such applications include digital libraries for news, office article management systems and earth observation satellite systems. To store, retrieve and manage such massive amounts of digital data, there is a need to develop efficient MDDSs. MDDSs use hierarchical storage systems consisting of primary, secondary and tertiary storage devices to handle the huge amounts of data while achieving a better price-performance ratio. In such a hierarchical storage system, the tertiary device holds all the data and metadata while the secondary and primary devices act as a two-level cache. Disk farms provide secondary storage, while tape libraries provide tertiary storage.
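The storage hierarchy described above can be illustrated with a minimal sketch. It is not the prototype's implementation: the class name `HierarchicalStore`, the dictionary-based devices and the absence of any eviction policy are all assumptions made for illustration. The sketch only shows the lookup order (primary, then secondary, then tertiary) and how a read from tape populates both cache levels.

```python
class HierarchicalStore:
    """Toy three-level store (hypothetical): tape holds everything,
    disk and memory act as a two-level cache. No eviction is modeled."""

    def __init__(self, tape_contents):
        self.tape = dict(tape_contents)  # tertiary storage: all data and metadata
        self.disk = {}                   # secondary storage: first cache level above tape
        self.memory = {}                 # primary storage: fastest cache level

    def read(self, key):
        """Return (value, level) where level names the device that served the read."""
        if key in self.memory:
            return self.memory[key], "memory"
        if key in self.disk:
            # Promote the item to primary storage on a disk hit.
            self.memory[key] = self.disk[key]
            return self.memory[key], "disk"
        # Miss in both caches: fetch from tape and fill both cache levels.
        value = self.tape[key]
        self.disk[key] = value
        self.memory[key] = value
        return value, "tape"
```

In this sketch the first read of an item is served from tape and later reads are served from the caches, which is the behavior the price-performance argument relies on.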

In MDDSs, data is typically retrieved based on contents. The most popular form of retrieval is based on keywords that occur in articles. To retrieve data efficiently, metadata, in the form of indexes, is required. Typically the size of the metadata is of the same order of magnitude as the data, and hence metadata will also reside on tertiary storage. The disk will contain only the metadata that is currently needed by queries and the additions made by updates. Given this, metadata such as index structures becomes even more important in these systems: data is accessed only after processing the metadata.
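As a rough illustration of keyword-based metadata, the following sketch builds a small in-memory inverted index that maps each keyword to the set of articles containing it. The class name `KeywordIndex`, whitespace tokenization and the conjunctive (AND) query semantics are assumptions for this example; the source does not specify the index structure actually used.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy inverted index (hypothetical): keyword -> set of article ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_article(self, article_id, text):
        """Index an article under every whitespace-separated word it contains."""
        for word in text.lower().split():
            self._postings[word].add(article_id)

    def search(self, *keywords):
        """Return the ids of articles that contain every given keyword."""
        if not keywords:
            return set()
        # .get() avoids creating empty posting lists for unknown keywords.
        posting_sets = [self._postings.get(k.lower(), set()) for k in keywords]
        return set.intersection(*posting_sets)
```

A query consults only the index to find matching article ids; the articles themselves are fetched afterwards, which is why the metadata sits on the critical path of every query.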

The focus of this work is on high-performance query and transaction processing techniques for MDDSs, with emphasis on metadata management. The performance of an MDDS can be measured in terms of the response time for the queries and the recency or age of the items retrieved. Both need to be minimized. The special characteristics of data and metadata, the high latency of tapes and the desired performance criteria demand the development of novel approaches to query and transaction processing in MDDSs. The key to satisfying the performance requirements is to exploit the characteristics of the metadata as well as of the queries and updates to the metadata.

We are interested in MDDSs in which data is being added continuously and in which users enter the MDDS dynamically and pose queries on the fly. News-on-demand, i.e., a digital library for news, and on-line information retrieval systems have these characteristics. Here new articles are constantly added to the database, and the articles retrieved by the dynamic queries must ideally include the most recent additions. Thus the articles of interest to a user are known only when a query arrives. This is unlike systems such as SIFT and Tapestry, which are geared to respond continuously to statically specified queries or filters. In these systems, a user is informed about a new article if it passes the filtering criterion. Even though we are interested in on-the-fly queries, our techniques are also applicable in such situations.

In this work we specifically address the following:

We analyze the functionality and correctness properties of updates and then develop an efficient scheme for executing queries concurrently with updates. The queries have short response times and are guaranteed to return the most recent articles. Our concurrency control technique uses just latches, i.e., short-term locks, but still satisfies atomicity, consistency, and durability of updates. Also, a query is executed so as to return the most recent articles satisfying the query predicate, including those articles being added by the concurrent update transaction(s).
We develop logging and recovery techniques for efficiently migrating metadata updates from disk to tape. They are designed to minimize the disruption that migrating the updates to tape causes to the tape accesses required by queries. Changes to metadata are migrated to tapes in a lazy manner. However, they are logged to stable storage and maintained in such a way that query results are based on the most recent state of the database.
Since the time to mount a tape and seek data within a tape is on the order of a few tens of seconds, access to tapes must be even better optimized than access to disks. We develop novel approaches to scheduling the tape access requests of dynamically arriving concurrent queries such that the average response time of queries is minimized. Our approaches minimize the number of tape mounts by reading/writing data from/to a mounted tape opportunistically. Which tape to mount next is based on the lengths of the queries, measured in terms of the number of data items accessed, as well as on the specific tapes on which those items reside. The tape scheduling techniques have the desirable property that response times are minimized while fairness to queries of different lengths is maintained.
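The tape scheduling idea in the last item can be sketched as follows. This is not the algorithm from the paper: the function name `schedule_tapes` and the scoring rule are assumptions chosen only to illustrate the two ingredients named above, namely serving every pending request on a mounted tape (opportunistic access) and choosing the next tape from query lengths and item placement. Here each pending query gives a tape a score equal to the fraction of that query's remaining items that the tape holds, so tapes that bring short queries close to completion are preferred.

```python
def schedule_tapes(queries):
    """Hypothetical heuristic, not the authors' scheduler.

    queries: dict mapping query id -> dict mapping tape id -> number of
             items that query still needs from that tape.
    Returns (mount_order, finish_order).
    """
    # Keep only positive item counts, and drop queries that need nothing.
    remaining = {}
    for query_id, needs in queries.items():
        positive = {tape: n for tape, n in needs.items() if n > 0}
        if positive:
            remaining[query_id] = positive

    mount_order, finish_order = [], []
    while remaining:
        # Score each tape by the fraction of each query it would satisfy.
        scores = {}
        for needs in remaining.values():
            total = sum(needs.values())
            for tape, count in needs.items():
                scores[tape] = scores.get(tape, 0.0) + count / total
        # Sorting first makes tie-breaking deterministic (lowest tape id wins).
        tape = max(sorted(scores), key=scores.get)
        mount_order.append(tape)
        # Opportunistic access: serve every pending request on the mounted tape.
        for query_id in list(remaining):
            remaining[query_id].pop(tape, None)
            if not remaining[query_id]:
                finish_order.append(query_id)
                del remaining[query_id]
    return mount_order, finish_order
```

Because a mounted tape serves all pending requests for it, each tape is mounted at most once per batch in this sketch. A real scheduler would also have to handle queries that arrive while a tape is mounted and the fairness concern mentioned above, neither of which is modeled here.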

Results of the performance tests on a prototype system show the superior performance of the developed algorithms and reveal that to build high performance MDDSs it is imperative that we adopt approaches that exploit the data and transaction characteristics. For information service providers, the resulting high performance will be very attractive in order to remain competitive.

It is important to point out that while the traditional transaction model and associated transaction processing approaches may be able to solve some of the problems in MDDSs, they cater to more restrictive database situations and hence are likely to fall short of the recency and response time performance criteria. This is because data components in MDDSs do not have the tight interrelationships that are the norm in typical database applications. For instance, in a news database, each news item can be considered independent of the others. The metadata about two items are also not tightly related. Of course, the metadata about a news item must correctly reflect the contents of that item, and this leads to consistency requirements relating a data item and its metadata. These characteristics have implications for how we design the queries and updates as transactions, in particular, how we achieve the atomicity, consistency, and durability of updates and the correctness of query results. Hence the issues addressed in this paper to achieve good performance become important given the current trend to use DBMSs for managing metadata.

By concentrating on correctness and system-related performance issues, our work complements the work done thus far in information retrieval for building efficient MDDSs. All of our techniques are general enough to be used independently; however, the maximum benefit is obtained by combining them. During the course of our work, several other new issues surfaced. These relate to the management of data within the disks and the tapes as well as to system integration. We have presented them in a separate section, thus providing some directions for future work.

Each of our schemes is based on simple ideas and is founded on sound observations about the nature of the data, of the queries and updates, and of the data storage devices. Nevertheless the practical impact of these ideas has been shown to be significant.

Publications

Mohan U. Kamath and Krithi Ramamritham, "Efficient Transaction Support for Dynamic Information Retrieval Systems", To appear in 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '96), Zurich, Switzerland, August 1996.

Mohan U. Kamath and Krithi Ramamritham, "Efficient Transaction Management and Query Processing in Massive Digital Databases", University of Massachusetts Computer Science Technical Report 95-93, October 1995.

If you have any comments on this page or need further information, please send your mail to kamath@cs.umass.edu
Last Update: 23 June 95