The SkipJoin temporal join algorithm is a variant of the standard high-performance internal-memory forward-scan algorithm for joining sorted arrays of events (intervals with a start and end time). SkipJoin improves on forward-scan algorithms by providing efficient support for windowed joins and for skewed data. To support SkipJoin, we also provide the Stab-Forest temporal index structure. This index structure supports highly-efficient stab-queries and multi-stab-queries (which retrieve events active at given points in time) and efficiently supports append-only operations.
The implementation of SkipJoin and of the Stab-Forest is in C++14 and is tested using two modern C++ compiler, namely Microsoft C/C++ Compiler Version 19.13.26132 for x64 (part of Visual Studio 2017) and GNU g++ 8.1.0. The code should work with any sufficiently standard-compliant C++ compiler. Some of the supporting scripts are implemented using Java (any modern java compiler should suffice). The implementation is available for public use under the BSD-license (license details included in the source files).
In our experiments, we used the Airline On-Time Performance Data (AOTPD) dataset and the Civil Unrest Event Data (CUED) datasets. The raw datasets used have a total size of approximately 2.5GB. Hence, we do not provide copies of these datasets here. Instructions on how to build the datasets is provided with the source code. Furthermore, we have offline backups of the datasets that can be provided upon request.
Stab-Forests: Dynamic Data Structures for Efficient Temporal Query Processing. (2020). In: 27th International Symposium on Temporal Representation and Reasoning (TIME 2020), 18:1-18:19, Schloss Dagstuhl. DOI: 10.4230/LIPIcs.TIME.2020.18. author copy, slides..
Many sources of data have temporal start and end attributes or are created in a time-ordered manner. Hence, it is only natural to consider joining datasets based on these temporal attributes. To do so efficiently, several internal-memory temporal join algorithms have recently been proposed. Unfortunately, these join algorithms are designed to join entire datasets and cannot efficiently join skewed datasets in which only few events participate in the join result.
To support high-performance internal-memory temporal joins of skewed datasets, we propose the skip-join algorithm, which operates on stab-forests. The stab-forest is a novel dynamic data structure for indexing temporal data that allows efficient updates when events are appended in a time-based order. Our stab-forests efficiently support not only traditional temporal stab-queries, but also more general multi-stab-queries. We conducted an experimental evaluation to compare the skip-join algorithm with state-of-the-art techniques using real-world datasets. We observed that the skip-join algorithm outperforms other techniques by an order of magnitude when joining skewed datasets and delivers comparable performance to other techniques on non-skewed datasets.