Table of Links
Abstract and 1. Introduction
2. Pairing functions
- Enumerating trees
- LZ-trees
- Conclusion and References
Abstract
I present a simple algorithm for enumerating the trees generated by a Context Free Grammar (CFG). The algorithm uses a pairing function to form a bijection between CFG derivations and natural numbers, so that trees can be uniquely decoded from counting. This provides a general way to number expressions in natural logical languages, and potentially can be extended to other combinatorial problems. I also show how this algorithm may be generalized to more general forms of derivation, including analogs of Lempel-Ziv coding on trees.
1 Introduction
While context-free grammars (CFGs) are important in computational linguistics and theoretical computer science, there is no simple, memoryless algorithm for enumerating the trees generated by an arbitrary CFG. One approach is to maintain a priority queue of partially expanded trees according to probability, and expand them through (e.g.) the leftmost unexpanded nonterminal in the tree. This, however, requires storing multiple trees in memory, which can become slow when enumerating many trees. Incremental polynomial time algorithms are also known [1] and related questions have been studied for lexicographic enumeration [2–4]. These algorithms are not particularly well-known, and the tools required to state and analyze them are complex. In contrast, simple techniques exist for enumerating binary trees with a fixed grammar (e.g. S → SS | x). A variety of techniques and history is reviewed in Section 7.2.1.6 of [5], including permutation-based methods and gray codes [6–9]. These algorithms, however, do not obviously generalize to arbitrary CFGs.
The goal of the present paper is to present an variant of integer-based enumeration schemes that works for arbitrary CFGs. The algorithm is itself very basic—just a few lines—but relies on a abstraction here called an IntegerizedStack that may be useful in other combinatorial problems. The proposed algorithm does not naturally enumerate in lexicographic order (though variants may exist) but it is efficient: its time complexity is linear in the number of nodes present in the next enumerated tree, and it does not require additional data structures or pre-computation of anything from the grammar. Because the algorithm constructs a simple bijection between a the natural numbers N and trees, it also provides a convenient scheme for G¨odel-numbering [10, 11], when the CFG is used to describe formulas. We then extend this algorithm to a tree-based algorithms analogous to LZ compression.
:::info
Author:
(1) Steven T. Piantadosi.
:::
:::info
This paper is available on arxiv under CC BY 4.0 license.
:::
