Table of Links
Abstract and 1 Introduction
2 Background
3 Approach and 3.1 Differential Testing for XML Processors
3.2 XPath Expression Generation
3.3 XML Generation
4 Evaluation
4.1 Effectiveness
4.2 Efficiency
4.3 Comparison to the State of the Art
4.4 Analysis of BaseX Historical Bug Reports
5 Related Work
6 Conclusion, Acknowledgments, and References
2 BACKGROUND
Running example. Figure 1 shows a running example that we will subsequently use to explain basic XML and XPath concepts and outline the challenges of automated testing as applied in this context. The left shows an XML document with the root node Books, while the right shows an XPath expression //*[@id * -1 < 2]. We adapted this example from a bug-inducing test case that XPress discovered.[2] As shown, for the query on the document, BaseX returned an empty result, while both Saxon and eXist returned all three Book nodes.
XML. Extensible Markup Language (XML) is a text format for describing structured data. XML documents are trees that consist of nodes, as illustrated in Figure 1. An XML document has a single root element node (see ). Each element node has a tag name (see Books, Book, and Author). Element nodes can include attribute nodes. For example, two of the nodes have both attribute nodes id and year. An element node can also include child element nodes; in the example, the node contains three child element nodes. Element nodes can hold text contents, which can be of any defined data type. For the node with attribute id = 1, the text content it holds is “A fairy tale”. Attribute nodes are disallowed from holding child nodes. In the example, id and year are integer-typed attribute nodes and the name attributes are string-typed attribute nodes.
XPath. The XPath language is an expression language that allows navigating the XML tree and hierarchic addressing of the element nodes. XPath is at the core of both eXtensible Stylesheet Language – Transformation (XSLT) [7] and XQuery, a more expressive query language for XML [6]. XSLT transforms XML documents into other formats and the XQuery language is a super-set of XPath expressions. XQuery extends XPath to provide functionalities such as node constructors and SQL-like clauses.
XPath structure. An XPath expression describes the selection and transformation of nodes of the XML tree. Figure 2 shows a simplified XPath 3.0 [4] grammar using EBNF notation from the W3C XML 1.0 standard [3]. We introduce the non-established terms Section and Section Prefix to describe our generation approach in
Section 3.2. XPath expressions consist of one or more sections, and a section contains one section prefix followed by zero or more predicates. In Figure 1, the XPath expression //*[@id*(-1)<2] consists of a single section with section prefix //* and a single predicate [@id*(-1)<2]. Each section prefix starts with either / or //. / is called the path operator, which accepts a node sequence as the left-hand operand and orders it in document order while eliminating duplicate nodes. // represents the abbreviated relative path /descendant-or-self::node()/, which matches the current context and all descendant nodes of the current context, regardless of the intermediate path. An axis step consists of an optional axis and a name test.
XPath axes. Axes define the relationship between selected nodes and current context nodes. For example, the axis parent:: selects all parent nodes of current context nodes. If omitted, it is equivalent to child::, which selects all direct children nodes of current context nodes. A name test is a string literal to fetch only nodes with the same tag name. It could also be a wildcard *, which matches all nodes without applying filters. The section prefix //* in the example selects all descendant nodes of the document node, which is all element nodes in the document.
XPath predicates. Predicates in XPath include positional predicates and boolean predicates. Positional predicates contain an expression that evaluates to a single integer and select only values whose position in the context matches the integer value. In the XPath expression /Books/Book[1], [1] is a positional predicate and selects only the first child of , which is the node with @id=1. Boolean predicates evaluate current context nodes to a boolean value according to a given expression and only nodes for which the predicate evaluates to true are selected. In Figure 1, [@id * -1 < 2] is a boolean predicate. The query //*[@id * -1 < 2] selects all nodes in the XML document with attribute id that satisfy id * -1 < 2. The three nodes with tag name Book in the document have attribute id, and all satisfy the condition. Therefore, if correctly evaluated, this query should return all three Book nodes.
Logic bug. For the test input in Figure 1, systems like Saxon and eXist-DB both returned a result set with three Book nodes, while BaseX returned an empty result set. The difference between the processors indicates a potential bug. Based on our manual analysis, we suspected that BaseX computed an incorrect result, which is why we reported it to the BaseX developers. They fixed the bug quickly. The reason for this bug was an incorrect simplification of the arithmetic expression x * a > b to x > b / a. When the divisor is a negative number, the original operator > should be reversed to <.
XPath standard. There are majorly two different standards of XPath implementations in use today, which we need to consider in our work. The XPath 1.0 standard was the first version. As a superset of XPath 1.0, the XPath 3.0 standard is the latest standard of the XPath language and provides more functionalities such as advanced data types and functions [5]. Most multi-model DBMSs, which support XPath queries, support only XPath 1.0 [1] (e.g., Oracle, MySQL, and PostgreSQL). While some specialized XML processors support also only XPath 1.0 (e.g., libXML2), others support the more recent XPath 3.0 standard (e.g., BaseX, eXist-DB, and Saxon).
XPath versions and differential testing. The same queries might produce different results under different standards. For example, for the XPath expression Book/@name = false(), under the XPath 1.0 standard, the expression is expected to return true. @name is first cast into its equivalent boolean value. In the current case has no name attribute, therefore, an empty node set is returned. The equivalent boolean value evaluates to false for empty nodes. Comparing false to false is equal, therefore true is returned. Under the XPath 3.0 standard, however, the result is expected to be false. @name returns an empty sequence and equality comparison between an empty sequence and a boolean value false would evaluate to false. Thus, applying differential testing to XML processors supporting different standards is infeasible.
[2] https://github.com/BaseXdb/basex/issues/2188
Authors:
(1) Shuxin Li, Southern University of Science and Technology China and Work done during an internship at the National University of Singapore ([email protected]);
(2) Manuel Rigger, National University of Singapore Singapore ([email protected]).