The increasing number of XML repositories has provided the impetus to design and develop systems that can store and query XML data efficiently. Research to improve system performance has largely concentrated on indexing paths and optimizing XML queries. However, the storage configuration of XML data on disk also affects the efficiency of an XML data management system. Existing XML storage strategies can be classified into two categories: native XML storage and non-native XML storage. The main distinction between them is their data model. The former is based on XML data models such as the Document Object Model (DOM) and the Object Exchange Model (OEM), while the latter is based on the traditional relational or object-oriented data model. An evaluation of the alternative non-native storage strategies has been given in [6]. Here, we focus on native XML storage strategies. Several native storage strategies have been developed [1,2,3,5,8,11]. These can be classified into Element-Based (EB), Subtree-Based (SB) and Document-Based (DB) strategies. Both the Lore system [3] and TIMBER [1] use the classic EB strategy, where each element is an atomic unit of storage and the elements are stored in pre-order. Natix [2] is a well-known example of the SB strategy. It divides the XML document tree into subtrees according to the physical page size, such that each subtree is a record. The sizes of the subtrees are kept as close as possible to the size of the physical page. A split matrix is defined to ensure that correlated element nodes remain clustered. As in the EB strategy, the records are stored in pre-order. The storage module in the Apache Xindice system [8] employs the DB strategy, whereby the entire XML document constitutes a single record.
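The difference in record granularity among the EB, SB and DB strategies can be illustrated with a minimal Python sketch. It is not the implementation of Lore, TIMBER, Natix or Xindice: the function names, the sample document `DOC`, and the `capacity` parameter (counted in elements rather than bytes, as a stand-in for the physical page size) are assumptions made for illustration, and the subtree split is a simple greedy rule that ignores Natix's split matrix.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = ("<lib><book><title>A</title><author>B</author></book>"
       "<book><title>C</title></book></lib>")


def preorder(elem):
    """Yield the elements of a tree in pre-order (document order)."""
    yield elem
    for child in elem:
        yield from preorder(child)


def element_based(root):
    """EB: each element is its own record; records follow pre-order."""
    return [[e.tag] for e in preorder(root)]


def document_based(root):
    """DB: the entire document is a single record."""
    return [[e.tag for e in preorder(root)]]


def subtree_based(root, capacity):
    """SB (simplified): a subtree that fits in `capacity` elements becomes
    one record; otherwise its root is stored alone and its children are
    split recursively. Records are emitted in pre-order."""
    records = []

    def visit(elem):
        size = sum(1 for _ in preorder(elem))
        if size <= capacity:
            records.append([e.tag for e in preorder(elem)])
        else:
            records.append([elem.tag])
            for child in elem:
                visit(child)

    visit(root)
    return records


root = ET.fromstring(DOC)
print(element_based(root))     # six single-element records
print(subtree_based(root, 3))  # records bounded by the assumed capacity
print(document_based(root))    # one record holding the whole document
```

With a capacity of three elements, the sketch stores the oversized root element on its own and each `book` subtree as one record, which shows how the SB strategy sits between the per-element and per-document extremes.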
Other variations of storage strategies can be found in NeoCore XMS [11], where the XML data is first flattened to expose only the pure XML information before it is passed on to a digital pattern process to create icons. Tamino [5] is a leading commercial native XML database, but the available details of its storage structure are sketchy. All the above native storage strategies are schema-independent, even though schema information in the form of an XML Schema or DTD is usually available or even indispensable. In order to facilitate data exchange, a standard schema (or DTD) is typically defined on the underlying XML files and published. Examples of available standard schemas or DTDs include the Chemical Markup Language, the Mathematical Markup Language and the News Markup Language. Popular XML datasets such as DBLP [9], the Movie database [10], Shakespeare's plays [12] and XMark [4] each come with their own DTD. The availability of schema information is crucial to data exchange applications and query optimization. We observe that schema information also has a key role to play in designing efficient and effective storage strategies for XML management systems. In this work, we develop a prototype native XML storage system called OrientStore. OrientStore implements two schema-guided storage strategies: Element-Based Clustering (EBC) and Logical Partition-Based Clustering (LPC). In contrast with existing storage systems for XML data, OrientStore has the following unique features: (a) It concretely investigates how schema information can be utilized to reduce the storage requirement and the response time of queries. (b) It implements two schema-guided storage strategies, EBC and LPC, which cluster correlated data in different ways to reduce the number of I/Os required during retrieval.