Bisimulation is an equivalence relating *equivalent behaving* nodes. Bisimulation partitioning uses this equivalence relation to group these *equivalent behaving* nodes, and this grouping can then be used as an index over the original data. One example of the usage of bisimulation partitioning is the grouping of equivalent pieces of information in *XML* documents and in documents on the web. By doing so, bisimulation partitioning provides a way to calculate typical indices such as the 1-index and the F&B index of these *XML* documents.

For my master thesis I studied how this problem can be solved for very large XML documents and, more generally, other very large hierarchical data sources in the form of directed acylic graphs (*DAG*s). As the size of these real-world *DAG* data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory. The end result presented in my master thesis had a worst-case IO-complexity of *O(Sort(|N| + Sort(|E|)))*, where *|N|* and *|E|* are the numbers of nodes and edges in the data graph and *Sort(n)* is the number of accesses to external memory needed to sort an input of size *n*. Upon further research, we improved on this with a more efficient approach with a worst-case IO cost of *O(Sort(|N| + |E|))*. We also study specializations of this algorithm to common variations of bisimulation for tree-structured *XML* data sets.

The master thesis also briefly looks at partition maintenance. Thereby we have provided lower bounds on the complexity of bisimulation partition maintenance (in terms of observable changes to the partitions). We also give some sketches on how bisimulation partitions can be updated. Thereby we see that edge changes are especially challenging to support efficiently.

I have implemented the bisimulation partitioning algorithms presented in my master thesis in C++. I provide free access (under a BSD-like license) to the implemented algorithms. I have used Microsoft Visual Studio 2010 (Professional Edition) to develop the software; but there should be no reason why the (free) Express edition would not work. Furthermore the amount of platform dependent code is kept to a minimum; it thus should not be too much work to get the source working under any modern compiler on any modern platform. The following libraries have been used:

- Boost C++ libraries; version 1.46.1; retrieved from SourceForge.
- libxml2; version 2.7.8; retrieved from xmlsoft.org.
- STXXL: Standard Template Library for Extra Large Data Sets; version 1.3.1; retrieved from SourceForge.

See the 3rdparty directory in the source package for additional details on how these libraries are setup to work properly.

- The implementation of the bisimulation partitioning algorithms.
source code (.zip).
Full implementation of the bisimulation partitioning algorithms used in the experiments presented in the Master thesis and at SIGMOD.
- SIGMOD 2012
*Efficient external-memory bisimulation on DAGs*. . (2012). In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 553-564, ACM. DOI: 10.1145/2213836.2213899. See also DBDBD 2011, Master thesis. author copy, slides.## Abstract

In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient

*XPath*query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory.Our general algorithm has a worst-case IO-complexity of

*O(Sort(|N| + |E|))*, where*|N|*and*|E|*are the numbers of nodes and edges, resp., in the data graph and*Sort(n)*is the number of accesses to external memory needed to sort an input of size*n*. We also study specializations of this algorithm to common variations of bisimulation for tree-structured*XML*data sets. We empirically verify efficient performance of the algorithms on graphs and*XML*documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source*C++*implementation has been made freely available. - Master thesis
*Bisimulation partitioning and partition maintenance*. . (2011). Eindhoven University of Technology. Adviser: George H. L. Fletcher. See also SIGMOD 2012, DBDBD 2011. author copy, thesis, final slides, mid-term slides, poster.## Abstract

The combination of graphs and node bisimulation is widely used within and outside of computer science. One example of this combination is constructing indices for speeding up queries on

*XML*documents. Thereby*XML*documents can be represented by trees and many index types for indexing*XML*documents utilize the notion of bisimulation. Thereby the notion of bisimulation is used to relate nodes that have equivalent behavior with respect to queries performed on the*XML*documents. By replacing these bisimilar nodes one can reduce the size of the*XML*document and as such speed up queries. The objective of this thesis is to develop techniques for constructing and maintaining bisimulation partitions. Thereby a bisimulation partition groups nodes based on bisimilarity. In this thesis we primarily focus on very large directed acyclic graphs. The results in this thesis can for example be used to index very large*XML*documents.