CCP is a software for searching a regular expression in a text.
It is based on the determinization algorithm optimized for homogeneous automata described
in the article
Compact and fast algorithms for regular expression search
by J.-M. Champarnaud, F. Coulon, T. Paranthoen
Report LIFAR / AIA 2003.01 of the University of Rouen, France
|
- Abstract
This paper describes an improvement of the brute force determinization algorithm in
the case of homogeneous NFAs, as well as its application to pattern matching.
Brute force determinization with limited memory
may provide a partially determinized automaton, but its bounded complexity
makes it be a fail-safe procedure contrary to the classical subset construction.
We investigate the particular case of pattern matching
based on Glushkov NFAs, which involves the determinization
of an almost homogeneous automaton.
Actually, our algorithm is inspired both by recent results of Champarnaud concerning the
subset automaton of a homogeneous NFA and by the algorithm recently designed by
Navarro and Raffinot to implement
the brute force determinization of the Glushkov NFA of a regular pattern.
We obtain an algorithm with an average exponentially smaller memory requirement
for a given level of determinization, which, considering a bounded memory,
implies a quadratically smaller parsing time.
- The software
The sources of CCP (version 0.3) can be downloaded here: ccp-0.3.tar.gz.
This archive will extract into the directory CCP-0.3. To compile,
just type 'make'.
This software is free for academic research only. Please read the file LICENCE contained in the archive
before installation.
- Using CCP
The command
ccp [OPTIONS] <pattern> <filename>
shows the occurences of the regular expression <pattern> in the file
<filename> together with their offsets. Offsets are expressed w.r.t. the beginning of the file.
Type ccp -h to get the list of the options.
The regular expression is entered infix. The operators are
| for the union
* for the Kleene star
The concatenation operator is not written. Let A and B be two subsexpressions,
write AB for their concatenation. The operator '*' has the highest priority.
Then comes the concatenation, and '|' has the lowest priority.
The square brackets are used as shortcuts for the union:
[abc] stands for any letter a, b or c,
[a-z] stands for any letter between a and z in the ascii order.
Some examples:
|
alice|wonder
|
matches "alice", "wonder" and nothing else.
|
|
include|(#*define)
|
matches "include", "define", "#define", "##define", ...
|
|
inc*lude
|
matches "inlude", "include", "incclude", ...
|
|
[A-Z]([a-z]|[A-Z]*)
|
matches "Ae", "ABE", but does not match "BbAdf"
|
Special characters can be entered by typing their decimal ascii code
on 3 digits preceded by \.
"Alice\032and" stands for "Alice and"
The symbol \000 is forbidden.
- Development
Please report bugs to Fabien.Coulon@univ-rouen.fr.
Any improvement to the code is welcome too.
The file countcode.c contains the procedures for managing the output, which can be easily
modified so that CCP shows some other informations.
|