CCP - a compact and fast algorithm for Regular Expression Search
__________________________________
 
CCP is a software for searching a regular expression in a text. It is based on the determinization algorithm optimized for homogeneous automata described in the article
Compact and fast algorithms for regular expression search
by J.-M. Champarnaud, F. Coulon, T. Paranthoen
Report LIFAR / AIA 2003.01 of the University of Rouen, France


  • Abstract
    This paper describes an improvement of the brute force determinization algorithm in the case of homogeneous NFAs, as well as its application to pattern matching. Brute force determinization with limited memory may provide a partially determinized automaton, but its bounded complexity makes it be a fail-safe procedure contrary to the classical subset construction. We investigate the particular case of pattern matching based on Glushkov NFAs, which involves the determinization of an almost homogeneous automaton. Actually, our algorithm is inspired both by recent results of Champarnaud concerning the subset automaton of a homogeneous NFA and by the algorithm recently designed by Navarro and Raffinot to implement the brute force determinization of the Glushkov NFA of a regular pattern. We obtain an algorithm with an average exponentially smaller memory requirement for a given level of determinization, which, considering a bounded memory, implies a quadratically smaller parsing time.
     
  • The software
    The sources of CCP (version 0.3) can be downloaded here: ccp-0.3.tar.gz.
    This archive will extract into the directory CCP-0.3. To compile, just type 'make'.
    This software is free for academic research only. Please read the file LICENCE contained in the archive before installation.
     
  • Using CCP
    The command
     
       ccp [OPTIONS] <pattern> <filename>
     
    shows the occurences of the regular expression <pattern> in the file <filename> together with their offsets. Offsets are expressed w.r.t. the beginning of the file. Type ccp -h to get the list of the options.
    The regular expression is entered infix. The operators are
     
       | for the union
       * for the Kleene star

     
    The concatenation operator is not written. Let A and B be two subsexpressions, write AB for their concatenation. The operator '*' has the highest priority. Then comes the concatenation, and '|' has the lowest priority.
    The square brackets are used as shortcuts for the union:
     
       [abc] stands for any letter a, b or c,
       [a-z] stands for any letter between a and z in the ascii order.

     
    Some examples:
     
    alice|wonder matches "alice", "wonder" and nothing else.
    include|(#*define) matches "include", "define", "#define", "##define", ...
    inc*lude matches "inlude", "include", "incclude", ...
    [A-Z]([a-z]|[A-Z]*)   matches "Ae", "ABE", but does not match "BbAdf"

    Special characters can be entered by typing their decimal ascii code on 3 digits preceded by \.
     
       "Alice\032and" stands for "Alice and"
     
    The symbol \000 is forbidden.
  • Development
    Please report bugs to Fabien.Coulon@univ-rouen.fr. Any improvement to the code is welcome too.
    The file countcode.c contains the procedures for managing the output, which can be easily modified so that CCP shows some other informations.