Massive Digital Database Systems (MDDSs) store petabytes of data, with terabytes being added every day. Examples of such applications include digital libraries for news, office article management systems and earth observation satellite systems. To store, retrieve and manage such massive amounts of digital data, there is a need to develop efficient MDDSs. MDDSs use hierarchical storage systems consisting of primary, secondary and tertiary storage devices to handle the huge amounts of data while achieving a better price-performance ratio. In such a hierarchical storage system, the tertiary device holds all the data and metadata, while the secondary and primary devices act as a two-level cache. Disk farms provide secondary storage, and tape libraries provide tertiary storage.
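The storage hierarchy described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the design used in this work: the class names `LRUCache` and `HierarchicalStore` are hypothetical, least-recently-used eviction is an assumed policy, a plain dictionary stands in for the tape library, and stored values are assumed never to be `None`.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache standing in for one storage level."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


class HierarchicalStore:
    """Tertiary storage holds everything; primary and secondary act as caches."""

    def __init__(self, tape, memory_slots, disk_slots):
        self.tape = tape  # hypothetical stand-in for the tape library (a dict)
        self.memory = LRUCache(memory_slots)  # primary storage
        self.disk = LRUCache(disk_slots)      # secondary storage

    def read(self, key):
        """Return (value, name of the level that satisfied the read)."""
        value = self.memory.get(key)
        if value is not None:
            return value, "primary"
        value = self.disk.get(key)
        if value is not None:
            self.memory.put(key, value)  # promote to the faster level
            return value, "secondary"
        value = self.tape[key]  # slow path: fetch from tertiary storage
        self.disk.put(key, value)
        self.memory.put(key, value)
        return value, "tertiary"
```

A first read of an item is served from the tertiary level and populates both caches; later reads are served from the primary or secondary level until the item is evicted.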
In MDDSs, data is typically retrieved based on content. The most popular form of retrieval is based on keywords that occur in articles. To retrieve data efficiently, metadata, in the form of indexes, is required. Typically the size of the metadata is of the same order of magnitude as the data, and hence the metadata will also reside on tertiary storage. The disk will contain only the metadata that is currently needed by queries and the additions made by updates. Given this, metadata such as index structures become even more important in these systems, because data is accessed only after the metadata has been processed.
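As a rough illustration of keyword-based metadata, the sketch below builds a simple in-memory inverted index and answers a conjunctive keyword query. The function names `build_keyword_index` and `search` are hypothetical, whitespace tokenization is an assumption, and the index structures developed in this work may differ substantially.

```python
from collections import defaultdict


def build_keyword_index(articles):
    """Map each lowercase word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.lower().split():
            index[word].add(article_id)
    return index


def search(index, keywords):
    """Return the ids of articles that contain every keyword."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)
```

The query is resolved entirely against the index; only the matching article ids would then need to be fetched from the data store.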
The focus of this work is on high-performance query and transaction processing techniques for MDDSs, with emphasis on metadata management. The performance of an MDDS can be measured in terms of the response time for the queries and the recency, or age, of the items retrieved. Both need to be minimized. The special characteristics of data and metadata, the high latency of tapes and the desired performance criteria demand the development of novel approaches to query and transaction processing in MDDSs. The key to satisfying the performance requirements is to exploit the characteristics of the metadata as well as of the queries and updates to the metadata.
We are interested in MDDSs in which data is being added continuously and in which users enter the MDDS dynamically and pose queries on the fly. News-on-demand, i.e., a digital library for news, and on-line information retrieval systems have these characteristics. Here new articles are constantly added to the database, and the articles retrieved by the dynamic queries must ideally include the most recent additions. Thus the articles of interest to a user are known only when a query arrives. This is unlike systems such as SIFT and Tapestry, which are geared to respond continuously to statically specified queries or filters. In these systems, a user is informed about a new article if it passes the filtering criterion. Even though we are interested in on-the-fly queries, our techniques are also applicable in such situations.
In this work we specifically address the following:
Results of the performance tests on a prototype system show the superior performance of the developed algorithms and reveal that, to build high-performance MDDSs, it is imperative that we adopt approaches that exploit the data and transaction characteristics. For information service providers, the resulting high performance will be very attractive in order to remain competitive.
It is important to point out that while the traditional transaction model and associated transaction processing approaches may be able to solve some of the problems in MDDSs, they cater to more restrictive database situations and hence are likely to be found wanting when satisfying the recency and response time performance criteria. This is because data components in MDDSs do not have the tight interrelationships that are the norm in typical database applications. For instance, in a news database, each news item can be considered independent of another. The metadata about two items are also not tightly related. Of course, the metadata about a news item must correctly reflect the contents of that item, and this leads to consistency requirements relating a data item and its metadata. These characteristics have implications for how we design the queries and updates as transactions, in particular, how we achieve the atomicity, consistency, and durability of updates and the correctness of query results. Hence the issues addressed in this paper to achieve good performance become important given the current trend to use DBMSs for managing metadata.
By concentrating on correctness and system-related performance issues, our work complements the work done thus far in information retrieval for building efficient MDDSs. All of our techniques are general enough to be used independently; however, maximum benefits can be obtained by combining them. During the course of our work, several other new issues surfaced. These relate to the management of data within the disks and the tapes as well as system integration. We have presented them in a separate section, thus providing some directions for future work.
Each of our schemes is based on simple ideas and is founded on sound observations about the nature of the data, of the queries and updates, and of the data storage devices. Nevertheless, the practical impact of these ideas has been shown to be significant.