Bioinformatics xml documents index method based on r - Tree method

In this chapter, the author has researched and presented an improved method for more

efficient processing of XPath queries, which is considered the basis for building complex

queries. The BioX-tree method proposes some important enhancements such as adding

pointers to indicate relationships: ancestors - descendants, parents - children, siblings, the step

of converting from XML documents to space added some parameters, redesigned query

algorithms on XPath's main axes to speed up execution. The experimental part was carried out

on bioinformatics XML data from reputable and popular sources in the biological community.

In this new structure, the experimental results show that it is much more efficient than the Rtree-based indexing method on point queries, which is the most commonly used algorithm in

practice, because the new algorithms are based on Tracking link trajectories by using highly

optimized pointers for reading and recording hard disk I / O. But besides that there are still

some disadvantages that the author continues to study and present solutions in chapter 3.

26 trang | Chia sẻ: honganh20 | Lượt xem: 481 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Bioinformatics xml documents index method based on r - Tree method, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

they have the format: (MBR, object_ptr) - where object_ptr refers to a data set in the database and MBR is an n-dimensional rectangle containing the space objects it presents. The non-leaf nodes have the form: (MBR, chirld_ptr) - where chirld_ptr is the address of another node in the tree and the MBR includes rectangles in the lower nodes. An R-tree satisfies the following predicates: -Each node contains the number of child nodes in the range m and M except the root node. 6 -For each input type (MBR, object_ptr) at leaf node, MBR is the smallest rectangle containing the n-dimensional data object represented by object_ptr. -For each input type (MBR, chirld_ptr) at non-leaf node, MBR is the smallest rectangle containing the rectangle in the chirld node. -The root node has at least 2 subnodes except that it is the leaf node. -This is a balanced tree. 1.4.3. Some basic algorithms in R-tree method [30] a) Search in organizational data structure as R-tree b) Insertion c). 1.4.4.Some improved methods - XR-tree [32] - AR*-tree [87] 1.5. The remaining issues The model shown in Figure 1.1 helps to convert XML data into multi-dimensional space, thereby applying spatial indexing and querying methods to increase processing speed and reduce data size when indexing. Because bioinformatics XML is in fact very diverse, R-trees will be more suitable and selected as the basis in this thesis. Figure 1.1: Overview Model Figure 1.2: Model shows data conversion and indexing on hard disk R-tree method still has some problems when applied to process bioinformatics XML data as follows: 1) Firstly, it is overlapping problem. For spatial-based index technique, the larger search space the more time it takes for getting the returned node set. But the weakness of R-tree based method is that the queries require a fairly large data scan window, thus causing a considerable impact on the query performance. 7 2).The problem of the sibling connection of tags after converting to space, which is expressed as points in space such as parent, preceding, sibling, descendant, child, following, etc. with Xpath axis. In Figure 2.2 shows the data distribution on the coordinate system, the author recognizes that all data is skewed as a trapezoid / diagonally aligned (like an airplane wing). Meanwhile, all the previous methods did not care about that, so the queries have not improved significantly when querying in the data zone of this airplane wing. 1.6 Conclusion Chapter 1 presents some fundamental concepts of bioinformatics and bioinformatics data. Bioinformatics data is becoming huge due to the regular contribution and sharing of the research community. Because the problems of bioinformatics data analysis are very diverse, the storage information documents need structure that is easy to change, flexible, diverse and especially easy to share / contribute. Currently, XML documents are an important standard for describing and storing huge bioinformatics data. However, XML documents have text and semi-structured data, so the extraction is not the same as regular data. Chapter 1 also presents related studies of the problem of XML data extraction, the indexing methods, the algorithms proposed in previous studies have been mentioned, in which R-tree is The algorithm appearing effective with XML documents and XPath queries. On that basis, chapter 1 analyzes and presents the research issues of the thesis. CHAPTER 2. BIOX-TREE INDEXING METHOD 2.1. INTRODUCTION The methods given above for indexing in space based on R-tree are having problems: Firstly, it is overlapping problem. For spatial-based index technique, the larger search space the more time it takes for getting the returned node set. But the weakness of R-tree based method is that they create an unoptimal window query. Figure 2.1 illustrates an instance of XML document with several small points represent XML data in planar. Assume that, from the context node v we want to get all of its descendant nodes by using a window query {pre(E), ; 0, post(E)} [36]. The really needed window is what in white color defined by the tree browse value with the preorder value of left-most descendant node and the tree browse value with the post order value of righ-most descendant node of node E. As the result, the waste area covered by dark color by the query window corresponding to descendant axis causes a considerable impact on the query performance, which the range can be very large in many cases. Figure 2.1. Scanning range of pre-order and post -order (gray zone) and zoomed (white zone) for descendant queries is performed according to the sample query 8 Secondly, it is a matter of the connection of tags after converting to space, which is expressed as points in space such as parent, preceding, sibling, descendant, child, following, etc. with Xpath axis. In Figure 2.2 shows the data distribution on the coordinate system, with tested rice DNA data on 1000 nodes (Figure 2.2a), with tested Swissprot data on about 20,000 nodes (Figure 2.2b), the author recognizes that all data is skewed as a trapezoid / diagonally aligned (like an airplane wing). The author has tested on many different XML documents, from a few hundred nodes to several hundred thousand nodes, with the same results. (a) (b) Figure 2.2: Example of distributing conversion points for an XML document Meanwhile all previous methods did not concern about that, they only focused on processing the relationship between parent/child or ancestor/descendant and omitted the other axes that are considered an important part on query processing, especially processing query stream of XPath queries with predicates. The queries have not improved significantly when querying in the airplane wing data zone. From there, the author digs into the new indexing method, improved from the R-tree to help XPath queries run more efficiently in a number of axes. Based on the model selected in Chapter 1, the author will make suggestions for improvement in the components: conversion, indexing, query processing module. Figure 2.3: Proposed parts for improvement in the BioX-tree method The results of this chapter are published in works 1, 2, 3 and 4 in the "List of author's works". 2.2. BioX-tree Indexing method 2.2.1. XML document conversion Still following that general principle in XML document analysis and transformation, the author has built a separate program to ensure accuracy when compared to the R-tree-based method of the previous studies. In document [20], the conversion is implemented by using two procedures startEuity (t, a, att) and endEuity (t). Here, the author has added a new parameter of 9 the author's own way to increase searching efficiency, besides still using the parameters by the pre order value and the post order value. It is the parameter used to indicate the level l (level) of each node. The above two procedures are modified with new names startElement and endElement, presented in Algorithm 2.1. Algorithm ConvertXMLDocument(XMLdoc) Input: XMLdocument need converting Đầu ra: file txt containing values in space of a node(E) = {pre (E), post (E), par (E), att (E), tag (E)}. Begin l = -1; startElement(t, a, atts) l ← l +1; v ← (pre = gpre, post = _, par = (S.top()).pre, level= l, att = a, tag = t) S.push(v); Gpre ← gpre + 1; for v’ IN atts do startElement(v’, true, nil ); endElement(v’); endElement(t) v ← S.pop(); v.post ← gpost; gpost ← gpost +1; insert v in the result table; l ← l-1; End Algorithm 2.1: Two modified algorithms in XML document conversion 2.2.2. Index structure on BioX-tree BioX-tree applies a different insert / split strategy to achieve sibling relationships of XML data more easily, while not affecting the spatial differential ability of the index too much. Similar to the XPath and R-tree method mentioned in Chapter 1, each tag name in the XML document after conversion is represented as an entry [30] consisting of 5 attributes node (E) = {pre (E), post (E), par (E), att (E), tag (E)}. A node will have the size corresponding to a block in the hard drive. Non-leaf nodes have the form (pointer, MBR) in which the pointer pointer points to the child and the MBR is the smallest rectangle surrounding all entries attached to it. We simply understand that the non-leaf nodes will contain metadata information of leaf nodes, need to know the information about leaf nodes can be found here. Leaf nodes, which contain the elements after conversion, are responsible for maintaining the aligned trajectories of the actual XML data. To do that, the author applies double-linking methods to keep the connections with the preceding and following XML. The author also uses pointers to stay connected with the parents of the XML children. In short, the author uses 3 pointers in a leaf node to connect with the preceding and following siblings and their parents, so each node of this type will have a set of tuple-shaped pointers (previouspointer, nextpointer, parpointer). 10 The purpose is to try to maintain a relationship that reflects the wing airplane data distribution in space to make the query windows smaller and force a node on the tree to contain only Its siblings, making it possible for us to quickly find sibling relationships in BioX-tree. Figure 2.4: Tree hierarchy under tags in rice DNA XML documents Figure 2.5: Leaf nodes show a connection on the structure tree of BioX-tree For example, Figure 2.2 depicts the tree structure of a document related to rice DNA that the author will test in the following sections, here is an XML data set published on Gene NCBI bank. They are numbered aby the pre order value and the post order value on the top based on the numbering type and algorithm described above. After transforming the data (for simplicity we only use the pre order value to describe), the nodes will be represented in the structure of the BioX-tree tree as shown in Figure 2.3, the data nodes whether the same parent will be stored in the same leaf node. In case the leaf node too many entries and overflows from the array, it will be split and have pointers connected to each other to ensure still connecting with the siblings. Straight arrows represent pointers from a leaf node to their previous and next sibling, curved arrows represent connections with their parents. In this example, entries with pre-order value of 21, 22, and 23 are siblings node in the XML document that will be inserted in the same leaf node and a pointer will be used to connect. to their parent node, which is the node containing entry 24. That is, 21, 22, 23 and 24 are siblings and have the same parent. 2.2.3. Algorithms Because changing the tree structure will affect the insertion, deletion and query of nodes, the author will redesign some algorithms to be more appropriate. This section will show the modified algorithms, and the ones that are not shown the author will reuse as in the original method. 11 2.2.3.1. Insertion algorithm With the goal is to keep the sibling connection of the XML data. The insertion algorithm is quite complicated. A plain pseudo-code explains the insert process as well as the split strategy in case of leaf node is fully available in algorithms 2.2. Algorithm Insert(N,E) Insert: context node N và entry E will be inserted. Begin 1. Call FindSiblingNode(N,E) to find leaf node N’ containing siblings of new entry E need inserting. 2. if node N’ is found 3. if (N’ has space to add) then 4. insert entry E into N’ 5. else 6. Call CreateNewLeafNode(E) to create a new leaf node on entry E tree needs inserting here 7. endif 8. else 9. Call CreateNewLeafNode(E) to create a new leaf node on tree needs inserting here 10. endif End. Algorithm 2.2: Insertion algorithm Algorithm FindsiblingNode(N, E) Input: context node N, entry E need finding siblings. Output: node N contain siblings of entry E. Begin 1. if N is not a leaf node 2. Browse searching entries E’ in N has MBR intersect with MBR of entry E 3. Call FindSiblingNode(N’, E) in which N’ is subnode of N indicated by E’ 4. else 5. if N containing an entry is sibling of E 6. return N 7. endif End. Algorithm 2.3: Algorithm FindSiblingNode Algorithm CreateNewLeafNode(E) Input: entry E is inserted. Begin 1. 2. 3. 4. Finding sibling node of entry E, N’ is found if (N’ has room to add entry e) then Add entry E to N’ else 12 5. 6. 7. 8. 9. 10. 12. 13. Search from bottom to top until you see the parent Q. Go to the nearest right path from parent node Q to node P with level 1 if non-leaf node P is not full entry E will be added to leaf node in P else Create a new non-leaf node R in level 1, Create a new non-leaf node and insert entry E. endif endif End. Algorithm 2.4: Algorithm CreateNewLeafNode 2.2.3.2. Query algorithm BioX-tree is different from the R-tree method in that it can directly answer most queries on axes without the need of fine-tuning step, while in fact the R-tree based method is able to directly answer 4 main axies queries (ancestor, preceding, following, descendant) as shown at the beginning. Before going into the details of each algorithm, Algorithm 2.5 and Algorithm 2.6 show the algorithms used for point and range queries. These are basic spatial queries and are considered sub-algorithms available in the R-tree method, the author does not make any further improvements here. With the help of these queries, we can get the result returned as a set of nodes or exactly one node as desired. Algorithm FindNode(N,E) // point query Input: context node N và entry E need finding Out: node N contains entry E Begin 1. if (N is a leaf node 2. Browse finding entries E’ where N has MBR intersect with MBR of entry E 3. Call FindNode(N’, E) where N’ is child node of N pointed by E’ 4. else 5. if N contains entry E 6. return N 7. endif End. Algorithm 2.5: Point query algorithm Algorithm RangeQuery(N, Q, RESULT) //window query Đầu vào: context node N (at beginning, context node will be original node ) và query window Q Output: RESULT list containing all entries has MBRintersect with Q Begin 1. if N in not a leaf node 2. Browse finding entries E’ where N has MBR intersect with MBR in Q 3. Call RangeQuery(N’, Q, RESULT) where N’is child node of N pointed by E 13 4. else 5. Browse finding entries E’’ where N has MBR intersect with MBR in Q 6. 7. Add E’’ to RESULT 8. endif End. Algorithm 2.6: Range query algorithm To avoid listing different types of queries but with similarities, the author divided the query processing algorithms on BioX-tree into two categories: one that included algorithms for sibling queries and one type for other queries. 2.2.4. Query processing 2.2.4.1 Sibling query algorithm Thuật toán SiblingQuery(N, E, RESULT) Input: context node N (at beginning, ontext node is root node and entry E need finding siblings Output: RESULT list containing entries is sibling of entry E Begin 1. Call FindNode(N, E) to find node N’ containing entry E 2. if (N’ is found) 3. Browse entries E’ in N’ 4. Add E’ to RESULT 5. if (following siblings according to pointer F not null) 6. Call FollowingSiblingQuery(NF, RESULT) where NFis node pointed by F 7. if (preceding sibling according to pointer P not null) 8. Call PrecedingSiblingQuery(NP, RESULT) where NP is node pointed by P 9. Else 10. Sibling node is not found 11. Endif End. Algorithm 2.7:Sibling query algorithm Algorithm FollowingSiblingQuery(NF, RESULT) Đầu vào: context node NF Đầu ra: RESULT list containing entries is following sibling of NF Begin 1. Browse entries E’ in NF 2. Add E’ to RESULT 3. if (following siblings according to pointer F not null) 4. Call FollowingSiblingQuery(NF’, RESULT) where NF’ is node pointed by F 5. endif 6. endfor End. Algorithm 2.8: Following sibling query algorithm 14 Algorithm PrecedingSiblingQuery(NP, RESULT) Đầu vào: context node NP Đầu ra: RESULT list containing entries is preceding sibling of NP Begin 1. Browse entries E’ in NP 2. Add E’ to RESULT 3. if (preceding siblings according to pointer P not null) 4. Call PrecedingSiblingQuery(NP’, RESULT) where NP’ is node pointed by P 5. endif 6. endfor End. Algorithm 2.9: Preceding Sibling Query Algorithm 2.2.4.2. Other query algorithms Algorithm ChildrenQuery(N, Q, RESULT) Input: context node N and window query Q Output: RESULT list contains all children of entry E Begin 1. if (N is a non-leaf node) 2. Browse entries E’ in N have MBR intersect with MBR of Q 3. Call ChildrenQuery(N’, Q, RESULT) where N’ is a child node of N pointed by E’ 4. else 5. Browse entries E’’ where E has MBR intersect with MBR of Q 6. Call SiblingQuery(N, E’’, RESULT) 7. endif End. Algorithm 2.10: child query algorithm Algorithm AncestorQuery(N, E, RESULT) Input: context node N and entry E need finding ancestor Output: RESULT list contains all ancestors of E Begin 1. Call FindNode(N, E) to find node N containing entry E 2. if (N not null) 3. if ( parent pointer F not null) 4. Browse entries E’ in NP, NP is node pointed by P 5. if E’ is ancestor of E, add E’ to RESULT 6. Jump to step 3, so that NP replace N 7. else 8. Input Node is root 9. else 10. Node found is not existed 11. endif End. Algorithm 2.11: Ancestor query algorithm 15 Unlike the above algorithms, in the Descendant query, the author only tended to minimize the query window size and then embed this window into a normal range query. As in the conversion, the author has added a level l parameter to the nodes, by using this parameter we can reduce the window size of the search space. Experimental results of algorithm 2.2.5. Assess the complexity of algorithms Sibling query of the proposed method is complicated: -The best case is O (k + logm N), where n is the number of nodes in the tree, m is the number of entries in 1 node. Where k is the number of siblings found in the query. -The worst case is O (k + N) -The average is O (k + m logmN) 2.6. Experimental results of BioX-tree method 2.6.1 Experimetal model and environment  Test model Figure 2.6: Experimental model of BioX-tree và R-tree method  Experimental data The author uses 4 different bioinformatics sources from reputable data sources. They describe various biodiversity: DNA, Protein, and descriptions of subspecies: DNACorn, DNARice, Swissprot, Allhomologies  Scenario In XPath queries, queries on sibling axis are most important and most frequently used because they return small and specific result sets and used in calculations. For example, the user needs to search XML data: descendants of the current node, or sibling of the current node. The other sub-axis queries like ancestors, descendants, preceding, following are often a collection of a lot of results and are rarely used in practice. Experimental scenario of two types of XPath queries on sibling axis and children axis is a type of point query. The remaining axes of XPath use range queries. Each of the above XML documents is treated as a different database and is implemented separately. On an XML document, the author randomly selects 200 tag names, then queries find the tag name sets related to the Xpath axes. The queries performed on XML documents increase in size, as shown by the complexity of the XML tree with the number of name / node tags of 20,000 - 40,000 - 60,000 - 80,000 respectively. For each type of query, the average result of accessing hard drive times of the 16 above 200 options will be retrieved to evaluate the performance of the methods. Fewer hard disk access times mean higher query performance.  Experimental tooling and environment Algorithm programming tool is a programming language of C ++ in Visual Studio 2008. Experimental running environment on computers with CPU configuration: Intel Xeon E5520 - 2.7 GHz, RAM: 20 GB, OS: Windows Server 2008 R2 Enterprise. 2.6.2. Program construction  Design an index file  Program design.  Class diagram Figure 2.7: Class diagram of BioX-tree 2.6.3. Assess the effectiveness of data size reduction In order to evaluate the actual effectiveness of the method of compressing data from XML documents to documents converted into digital space, the author has experimented on practical documents as Figure 2.11. This shows that the compression ratio is quite good, especially with DNA description documents. However, the Allhomologies document describing species information is quite surprising because of its low compression ratio. To understand why compression ratios differ between documents, the author analyzed the XML files and identified a problem. Allhomologies documents describe species information so most of them have Attribute tags, this tag describes many attributes on a string. The conversion algorithm that encounters these tags will have to separate each Attribute in a string into separate tags thus increasing the size of the converted document. Thus, it can be seen that, in fact, this conversion method does not always have good compression ratio because it depends on the structure of the XML document. Figure 2.8: Compare the size of XML documents and documents converted to digital space 17 2.6.4. Compare the results of the BioX-tree and R-tree methods In spatial indexing methods, the unit of performance measurement will be the average number of nodes accessed, because the actual processing time is fast or slow depending on whether a query needs access (I / O) more or less on the hard drive to read the blocks. That is, the less access to read the block nodes, the better the processing time will be. a. Node query Figures 2.9 and 2.10 show that the performance of BioX-tree is much better than R-tree. The reason is that to achieve results, R-trees must use scoping queries to scan all sibling or descendant nodes, then filter out the expected nodes. But the BioX-tree handles these queries by first approaching only one leaf node containing the object and then searching all its sibling and child node through pointers. This helps avoid the R-tree overlap problem. The bigger the size of XML data is, the more R-tree will overlap. That is why R-tree performance decreases rapidly as data size increases. Figure 2.9: Sibling query Figure 2.10: children query b. Range query Figure 2.11 shows that the performance of the BioX-tree is slightly lower than that of R-tree except for large data. The reason is that the author has forced the sibling nodes of an XML data object into some leaf nodes of the R-tree. Certainly, it makes the indexing structure less optimal, leading to overlapping problems. However, thanks to the pointer (to parents) of the BioX-tree, the performance of the BioX-tree is not much worse than that of the R-tree. Figure 2.12 shows that the performance of BioX-tree is slightly better than R-tree. Instead of scanning entirely one of the four discrete areas on the plane, the BioX-tree only looks for the children of a descendant node and then uses the pointers (to the sibling node) to reach the rest. Similar to Figure 2.12, Figure 2.13, 2.14 shows that the performance of BioX-tree is a little less than that of R-tree. The reason is that range queries are forced to scan one of the four discrete regions resulting in a serious overlap problem Figure 2.11: Ancestor query Figure 2.12: descendant query 18 Figure 2.13:Following query node Figure 2.14: Preceeding query node In summary, the author's proposed indexing methods are much better performing with node queries but performance is almost similar to range queries. In fact, node queries are used more than range queries because users rarely need all the ancestor or descendant data of an object. In fact, querying preceding and following sibling, children brings many benefits to the DNA database in searching. 2.7. Conclusion In this chapter, the author has researched and presented an improved method for more efficient processing of XPath queries, which is considered the basis for building complex queries. The BioX-tree method proposes some important enhancements such as adding pointers to indicate relationships: ancestors - descendants, parents - children, siblings, the step of converting from XML documents to space added some parameters, redesigned query algor

Các file đính kèm theo tài liệu này:

bioinformatics_xml_documents_index_method_based_on_r_tree_me.pdf