Pattern Discovery in Colored Strings

Lipták, Zsuzsanna; Puglisi, Simon J.; Rossi, Massimiliano

doi:10.1145/3429280


arXiv:2004.04858 (cs)

[Submitted on 9 Apr 2020 (v1), last revised 28 May 2021 (this version, v2)]

Title: Pattern Discovery in Colored Strings

Authors: Zsuzsanna Lipták, Simon J. Puglisi, Massimiliano Rossi


Abstract: In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces.
We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2\log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.
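To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not either of the suffix-tree algorithms described in the abstract and does not enforce the minimality conditions mentioned there; it simply enumerates substrings and checks the "same color at the same distance" property directly, so it is far slower than the stated bounds. Several details are assumptions made for illustration only: the function name `naive_colored_patterns`, the representation of colors as a second string of equal length, measuring the distance `d >= 1` from the last character of an occurrence, and treating an occurrence whose target position falls past the end of the string as a violation.

```python
from collections import defaultdict


def naive_colored_patterns(text, colors, max_len=None):
    """Brute-force illustration (hypothetical helper, not the paper's method).

    Returns a dict mapping (pattern, distance) to the single color that
    follows every occurrence of `pattern` at that distance.
    """
    n = len(text)
    if len(colors) != n:
        raise ValueError("text and colors must have the same length")
    if max_len is None:
        max_len = n

    # Record the start position of every occurrence of every substring.
    occurrences = defaultdict(list)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            occurrences[text[start:end]].append(start)

    found = {}
    for pattern, starts in occurrences.items():
        m = len(pattern)
        # Assumption: distance d is counted from the last character
        # of the occurrence, so the target position is s + m - 1 + d.
        for d in range(1, n - m + 1):
            seen = set()
            for s in starts:
                pos = s + m - 1 + d
                if pos >= n:
                    # Assumption: running off the end counts as a violation.
                    seen = set()
                    break
                seen.add(colors[pos])
                if len(seen) > 1:
                    break
            if len(seen) == 1:
                found[(pattern, d)] = next(iter(seen))
    return found


# Example: characters "abab" with per-position colors "xyxy".
result = naive_colored_patterns("abab", "xyxy")
```

In this toy input, every occurrence of `a` is followed at distance 1 by color `y`, so `("a", 1)` is reported, whereas `b` is not reported at distance 1 because its second occurrence has no position after it. Storing all substring occurrences explicitly is only workable for very short inputs, which is the kind of cost the suffix-tree approaches in the paper are meant to avoid.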

Comments:	22 pages, 5 figures, 2 tables, published in ACM Journal of Experimental Algorithmics. This is the journal version of the paper with the same title at SEA 2020 (18th Symposium on Experimental Algorithms, Catania, Italy, June 16-18, 2020)
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2004.04858 [cs.DS]
	(or arXiv:2004.04858v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2004.04858
Journal reference:	Zs. Lipták, Simon J. Puglisi, Massimiliano Rossi: Pattern Discovery in Colored Strings. ACM Journal of Experimental Algorithmics, Vol. 26, 1.1:1-1.1:26 (2021)
Related DOI:	https://doi.org/10.1145/3429280

Submission history

From: Massimiliano Rossi
[v1] Thu, 9 Apr 2020 23:51:23 UTC (4,267 KB)
[v2] Fri, 28 May 2021 15:35:35 UTC (2,942 KB)
