Efficiently Handling Wildcard Queries in XML
Neha Singh IIT Bombay, Mumbai, India neha_ns@cse.iitb.ac.in |
Nitin Gupta |
ABSTRACT
A fundamental problem in the field of querying repositories of XML documents is the non-existence of an efficient solution to queries involving wildcards. Often, a wildcard node in a XML query either refers to an XPath step with the wildcard nodetest, or as a specifier for a set of nodes. We introduce a navigational model for querying XML data over distributed warehouses that efficiently processes such complex twig query patterns. Our model follows the ancestral path of nodes which have a high probability of returning the answers, and therefore intelligently computes the needed structural identifiers. A thorough experimental evaluation on a synthetic and a real dataset demonstrate the efficacy of our approach.
![]() |
Figure 1: Twig Pattern for queries Q1 and Q2. |
State-of-the-art algorithms for matching XML query twig pattern not involving wildcards, such as Holistic Twig Join [7] use structural identifiers (DocId, LeftPos : RightPos, LevelNum) to capture the structural relationship between the nodes in the XML database. Hence they fail on such wildcard queries because the structural identifiers corresponding to wildcard node of the query are unknown. Commonly known solutions to such incomplete queries include:
We introduce the Navigational Selection Algorithm (NSA), a step towards efficiently processing queries involving wildcards. NSA is a navigational model for querying XML warehouses. In a nutshell, NSA follows the ancestral path of nodes which have a high probability of returning the answers, and therefore intelligently computes the needed structural identifiers. NSA performs a backward search [5, 14] on the XML graph and interleaves this with the state-of-the-art structural twig pattern solvers to find the answers efficiently. We present the effectiveness of our model using a wide range of queries over the benchmark XMark dataset [18], and the real ProteinDB dataset [4], distributed over a number of remote repositories. We show that our model not only performs better in query execution time, but also drastically reduces the total amount of data transferred over the network for distributed repositories.
2. RELATED WORK
To the best of our knowledge, the only previous works that deal with wildcards are [8, 15]. [8] focuses on query reduction and does not propose any efficient algorithm for handling wildcards per se. The proposed layered axis is not generic enough to handle all types of wildcards. [15] is extremely inefficient even when we consider queries with simple string-content matching predicates on non-leaf nodes. On the technical aspect of the approach, the closest research we found was MTrees [16] that uses a specialized index structure, and generic search algorithms [19]. MTree is an index that allows efficient graph traversal, and integrates algorithms to solve XPath queries. However, they do not focus on the specification of the query, and therefore end up suggesting only an alternative to structural joins [19]. There has been considerable work in the field of reachability and efficient labeling of graphs for structural joins [10, 20, 9, 7]. But as mentioned before, all of these are complimentary to what we propose in this paper, and could easily replace the twig pattern solver of the system. There has also been a lot of research on searching in graphs [2, 5, 19]. We use one of these which we discovered had the requisite properties for XML graph search. The last main related area of research has been on node ranking [13, 3]. Though all of these are really good for keyword search or ranking the answers, they are not the best for graph traversal.
3. BACKGROUND
We discuss certain aspects of the Bidirectional Search algorithm [5]
and the Activation Model [14] that are quintessential for the understanding of our
model. Bidirectional Search was proposed by Kacholia et al. [2, 5, 14]
for schema-agnostic text search on graphs. Given a keyword query,
it aims to find a minimal subtree in the data graph that contains at
least one node corresponding to each keyword. The weight of an edge along reflects the
strength of the proximity relationship between two nodes (small
values correspond to greater proximity). Each answer tree obtained
is assigned a relevance score, and answers are presented in decreasing
order of that score. Finding such a subgraph is a NP-complete
problem (computation of minimum Steiner trees).
The search algorithms described in the paper offer heuristics for
incrementally computing query results.
An interesting aspect of this model is the activation spread. The bidirectional algorithm has just two iterators for traversal, and therefore it prioritizes the nodes on some basis for execution. The authors propose a novel prioritization scheme based on spreading activation (a kind of Pagerank [6] which decays with distance). The overall tree score can depend on either the edge score or the node prestige, and both need to be taken into account when defining activation to prioritize search. Higher the node prestige for a node, higher is its priority for expansion. This technique allows preferential expansion of paths that have less branching, and the same mechanism can be extended to implement other useful features, such as enforcing constraints using edge types to restrict search to specified search paths, or prioritizing certain paths over others. We leverage upon this property of the algorithm in order to evaluate twig patterns over graphs of XML repositories.
4. PROPOSED SOLUTION
4.1 DATA MODEL
The XML document is represented as a directed acyclic graph. Here we
discuss a heuristical ranking mechanism to calculate the node and
edge weights for the XML graph that allows efficient processing
of queries involving wildcards. Nodes in the graph represent XML elements and edges represent the parent-child relations, inter-document and intra-document links between the XML elements.
Node Weights: Node weights form a key ingredient of the navigation algorithm as they determine how the graph is traversed. As you will later, we use the navigational model to locate good candidates for NSIs occurring in the upper portion of the query twig pattern and post-processing for those occurring near the leafset. Thus for efficient navigation of XML database, we assign higher weight to nodes that have higher probability for being solutions for the wildcard nodes occurring in the upper portion of the query twig pattern.
Based on the above factors, we use the following normalized
form of node weights:
where
In the above equations, S(.) returns the subtree of a node, D(.)
returns the depth of a node in the document in which it occurs, T(.)
retuns the set of keywords in a node, and R(.) returns the rank of
a keyword, which is directly proportional to its frequency of occurrence.
The function P(.) is PageRank-style popularity function,
which takes into account the weight of the subtree nodes.
g1 represents
the decay in the effect due to distance (for example, far-lying
nodes have less effect on the popularity of a node).
Edge Weights: The edge weights play an
important role in activation spread, and therefore must be chosen
based on the weights of adjacent nodes. We would also ideally like
to spread more activation to ancestor nodes than descendant nodes.
Therefore, we use the following edge weights: For a node n, let A
be the set of nodes that have an outgoing edge to n, D be the
set of nodes that have an incoming edge from n, and E be the set of edges.
Then the weights of edges (m, n) Î E,m Î A and (n,m) Î E,m Î D are given by (a) and (b), respectively.

where g2 represents the skew in activation flow among ancestors
and descendants. The first fraction in (a) gives the relative weight
of one node compared to other competing nodes. The second fraction in (a) gives the fraction
of the total activation that n should pass to nodes which have an
incoming edge to n. The formula in (b) similarly corresponds to
nodes to which n has an outgoing edge.
4.2 NSA ALGORITHM
![]() |
Figure 2: A prefix-based index structure with counts |
4.2.1 Indexing To enable finding the list of structural identifiers containing keywords from the query tree pattern, an inverted index is constructed, where each keyword maps onto a list of structural identifiers, each of which is given by (DocID, Start:End, Level, Pointer to Node in Data Graph). For optimized query processing, the inverted index used by our system requires the presence of a function which estimates the number of structural identifiers for any given regular expression. This is to check whether or a not a node in the query is a good candidate for inclusion in the holistic join. Consider the prefix-based index structure as shown in Figure 2. Given the keyword “a*”, both “asia” and “africa” would have to be considered. But an estimate on the number of structural indices for “a*” is rather easy to pre-compute, and can immediately return the number 35. For large inverted indices, the structural identifiers are distributed over a number of remote machines using DHTs or equivalent techniques. We present the effect of both of these options in the experimental evaluation.
4.2.2 Queries and Conversions In order to process the query efficiently, we apply certain conversion rules to the query. This is to allow the integration of wildcards in specifying nodes of the query tree pattern. Consider a query Q whose twig pattern is represented as a tree. Let the set of nodes of this tree be given by VQ. Let TQ(.) return the keywords present in a query node. We categorize the possible keyword forms in a node as:
![]() |
Figure 3: Query with different keyword types |
![]() |
Figure 4: Query Preprocessing for NSA |
4.2.3 Algorithm
The Navigational Selection Algorithm is a multi-phase process,
and tries to optimize the query and intermediate results in each of
these phases. In a nutshell, the NSA proceeds as follows: the algorithm
identifies a set of nodes in the data graph and begins a backward
search from these nodes in order to find the needed structural
identifiers. Intermittently, it performs a twig pattern join using the
known identifiers and the NSI found by backward search to obtain
final answers. NSA also employs a twig pattern solver. Although
any twig pattern solver that uses the aforementioned structural indexing
can be employed, in order to simplify the explanation of
the algorithm, we use the Holistic Twig Join [7] as the base twig
pattern solver.
Query Preprocessing. In the first phase, NSA fetches the identifier list for each query node using the conversion rules specified above. Some of these identifier lists might be infinite, resulting in Needed Structural Identifiers (NSI). If there is no descendant query node for a NSI node, then we might as well discard it (example the * node below “annotation” in Figure 4). This is a reasonable assumption since such query nodes only test the presence of a node in the documents without any structural constraints. The immediate descendants of NSI nodes in the twig query pattern that are not NSI nodes themselves are called Nodes of Interest (NOI) (example the keyword “dark” in Figure 4). Others nodes are treated either as NSI if the identifier list is very big, or as ordinary nodes for reasonable sized identifier lists. Upon obtaining the nodes corresponding to NSI, and subsequently the nodes of interest, the preprocessing step follows the pointer from structural identifiers of the nodes of interest to mark the corresponding nodes in the data graph as search roots.
Navigation and pattern matching. In this phase, the NSA starts a search in the data graph beginning with the search roots. Search roots are nodes in the data graph that correspond to the nodes of interest located by the preprocessing step. Backward search beginning from these nodes is an attempt to locate the corresponding NSI in the data graph. By exploring nodes, the algorithm populates the lists for NSI. Intermittently, it performs a twig pattern join (holistic-twig-join) using the obtained identifier lists to get answers. However, if the identifier list corresponding to a NSI is small (i.e. the NSI query node does not have enough candidate identifiers), NSA increases the activation of this particular node’s subtree and further explores the graph. It then attempts a join, and keeps iterating this process until all answers have been found.
Algorithm 1 gives the pseudo-code for second phase of the Navigational Selection Algorithm. It takes as input NOISet, the nodes of interest, LQ, a mapping from query nodes to known identifier lists (initially empty for NSI nodes), and NSIMap, a mapping from NOI nodes to the corresponding NSI query node. The algorithm constructs a mapping from nodes in the data graph to query nodes, given by qnode. NSA also maintains a priority queue, queue, which is used to prioritize the nodes for searching. The Navigate() function explores a1 nodes at a time. It removes from queue a1 nodes with the maximum priority and adds their corresponding identifiers to the identifier list of the query node that they may belong to. It then adds to queue the parents of these nodes, or in case the parents are already in the queue, increases their priority. This is synonymous to backward search discussed earlier, with activation spread based on node weights.
A number of parameters such as a1, b1, b2, g1, g2 and g3 have been used in the NSA and weight functions. These parameters have to be specified by the system administrator based on the dataset and workload. Automating this is a part of ongoing research. The use of these various parameters is as follows: a1 is the amount of graph to be explored before a join is attempted. It works in harmony with b1, the minimum number of candidate identifiers required for NSI nodes before holistic twig join is called. b2 is the measure of the size of graph to be explored before the algorithm quits. To find all answers, b2 can be assumed to be |V|. g1and g2 have been explained earlier. g3 is the amount of “priority” or “activation” increase per node during each iteration – it effectively strikes a balance between a depth-first and a breadth-first strategy for exploring the graph.
Answer Post-processing. The answers obtained from the first two phases of NSA may not be entirely correct. Recall that prefix-void words had effectively been converted to unspecified words by query conversion, and some node tests were discarded in the preprocessing phase. Answer post-processing is necessary to ensure the correctness of returned answers. We assume that the occurrence of such keywords in limited, and therefore will not result in large unnecessary data getting transferred over the network. The experimental results validate this.
5. UNIQUENESS OF THE APPROACH
The algorithm described above addresses an unsolved problem in the field of complex twig patterns. We use the navigational model to locate good
candidates for wildcard nodes occurring in the upper portion of the query twig
pattern and post-processing for those occurring near the leafset. To prevent the wasteful exploration of the XML graph thus cutting down on resulting processing and communication overhead, we have developed a ranking mechanism for
prioritizing
exploration of good candidate nodes. The experimental results show the efficacy of our solution.
6. EXPERIMENTAL RESULTS
![]() |
Figure 5 (a): Query Execution Time |
![]() |
Figure 5 (b): Total data transfer |
XMark. The queries over XMark did not include suffix-based queries, and therefore did not require any kind of postprocessing. The results show that both NSA and NSA-D, despite the network data transfer, outperformed PathIndex, though not by big margins. However, the gain in time is primarily due to reduced processing due to structural indices used by NSA as opposed to the memory intensive path-indices, which do not have associated algorithms such as holistic twig joins.
ProteinDB. ProteinDB is a more versatile database. The node variety of nodes for ProteinDB in terms of content, depth and replication is much wider than the synthetic XMark. Therefore, we did a larger number of our tests on ProteinDB. NSA and NSA-D showed a two-fold improvement in query execution time over path indices, and multiple orders of magnitude improvement in the amount of data transferred over the network. An exception to this is the last query, which included suffix-based wildcard nodes, which led to NSA and NSA-D performing as poorly as the naive algorithm.
7. CONCLUSION
Using the rich search framework and popular techniques to solve
a simple twig pattern solver, we have developed an efficient system for solving more complicated twig patterns involving incompletely
specified keywords and nodes. Using
a comprehensive set of experiments, we showed that our algorithm has stability, efficiency and the ability to perform
on a distributed XML warehouse. Furthermore, our algorithm is often orders of magnitude faster than any of the
existing techniques to solve the incomplete query patterns.
8. REFERENCES
[1] S. Abiteboul, I. Manolescu, and N. Preda. Constructing and
querying peer-to-peer warehouses of xml resources. In
ICDE, 2005.
[2] B. Aditya, S. Chakrabarti, R. Nasre, R. Desai, A. Hulgeri,
S. Sudarshan, and H. Karambelkar. User interaction in the
banks system. ICDE, 2003.
[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou.
ObjectRank: Authority-based keyword search in databases.
In VLDB, 2004.
[4] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N.
Bhat, H. Weissig, I. N. Shindyalov, and P.E. Bourne. The
protein data bank. Nucleic Acids Res, 28(1):235–242,
January 2000.
[5] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and
S. Sudarshan. Keyword searching and browsing in databases
using BANKS. In ICDE, 2002.
[6] S. Brin and L. Page. The anatomy of a large-scale
hypertextual web search engine. In WWW, 1998.
[7] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins:
optimal xml pattern matching. In SIGMOD, 2002.
[8] C. Y. Chan, W. Fan, and Y. Zeng. Taming xpath queries by
minimizing wildcard steps. In VLDB, 2004.
[9] L. Chen, A. Gupta, and M. E. Kurul. Efficient algorithms for
pattern matching on directed acyclic graphs. In ICDE, 2005.
[10] E. Cohen, E. Halperin, H. Kaplan, U. Zwick. Reachability
and distance queries via 2-hop labels. In SODA, 2002.
[11] W. Fan, C. Y. Chan, and M. Garofalakis. Secure xml
querying with security views. In SIGMOD, 2004.
[12] G. Gottlob, C. Koch, and R. Pichler. The complexity of xpath
query evaluation. In ACM PODS, 2003.
[13] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram.
Xrank: ranked keyword search over xml documents. In
SIGMOD, 2003.
[14] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan,
R. Desai, and H. Karambelkar. Bidirectional expansion for
keyword search on graph databases. In VLDB, 2005.
[15] J. Lu, T. W. Ling, C. Y. Chan, and T. Chen. From region
encoding to extended dewey: On efficient processing of xml
twig pattern matching. In VLDB, 2005.
[16] P. M. Pettovello and F. Fotouhi. Mtree: an xml xpath graph
index. In SAC, 2006.
[17] J. Robie, M. Kay, D. Chamberlin, A. Berglund, M. F.
Fernández, J. Siméon, and Scott Boag. XML path language
(XPath) 2.0. W3C recommendation, W3C, January 2007.
[18] A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu,
and R. Busse. Xmark: A benchmark for xml data
management. In VLDB, 2002.
[19] D. Shasha, J. T. Wang, and R. Giugno. Algorithmics and
applications of tree and graph searching. In PODS, 2002.
[20] H. Wang, H. He2, J. Yang, P. S. Yu, and J. Xu Yu. Dual
labeling: Answering graph reachability queries in constant
time. In ICDE, 2006.