JPattern

Snobol4-Style Pattern Matching Primitives for Java

Reference Manual


Last Updated:01 October 2007
Latest Version:Jpattern 2.0
Minimum JDK Level:JDK 1.5

Table of contents

Introduction

Most current programming languages provide some form of pattern matching. As a rule, the pattern matching is based on regular expressions. The Comit/Snobol3/Snobol4/Spitbol programming languages have provided pattern matching but based on a pattern matching paradigm that is strictly more powerful than standard regular expressions.

The goal of this project is to make Snobol4-style pattern matching available as a package for the Java programming language. Rather than build such a package from scratch, Rober Dewar's existing Ada-based Gnat Spitbol Patterns code was converted to Java. The result is generally consistent with that Ada package, although some changes were made to conform to the capabilities and limitations of the Java language. In light of the derived nature of this code, the original GNAT license is assumed to cover this code. The compiler, however is licensed under a BSD license. See the License Section for more details.

Pattern Matching Tutorial

The following introduction is taken from the file G-SPIPAT.ADS. It has been modified to conform to the Java version for various details.

API Overview

The API deliberately bears more than a passing resemblance to the java.util.regex structure. The major classes are jpattern.Pattern, jpattern.Matcher and jpattern.MatchResult (which is a java interface).

Pattern Functions

Pattern class instances are constructed using a set of functions defined in the class jpattern.Pattern. These pattern functions take one of three kinds of argument types.
  1. java.lang.string. This represents a simple string. In some cases, a string of length one will be treated as a single character.

  2. jpattern.Pattern. This represents a sub-pattern.

  3. jpattern.Variable. This represents a name that will be interpreted later to refer to a value in a map. See Variables for more details.
The specific pattern construction operators are as follows.

Cancel() Concat(Pattern, Pattern) Rem()
Abort() Defer(Variable) Rest()
Alternate(Pattern, Pattern) Fail() RPos(int)
Any(String) Fence() RPos(Variable)
Any(Variable) Fence(Pattern) RTab(int)
Arb() IAssign(Pattern, Variable) RTab(Variable)
Arbno(String P) Java(Object) Setcur(Variable)
Arbno(Pattern) Len(int) Span(String)
Assign(Pattern, Variable) Len(Variable) Span(Variable)
Bal() NotAny(String) Succeed()
Bal(String) NotAny(Variable) Tab(int)
Bal(Variable) NSpan(String) Tab(Variable)
Break(String) NSpan(Variable) StringPattern(String)
Break(Variable) Pos(int) CharPattern(char)
BreakX(String) Pos(Variable) External(String,Object...)
BreakX(Variable) Replace(Pattern, Variable)

All of these methods are static public and all of them throw the exception jpattern.Error.

Consider the following pattern.

("abc" | "ab") & ("de" | "cd")
This pattern could be construct using the following sequence of Pattern functions.

Pattern pe1 = Pattern.Alt("abc","ab");
Pattern pe2 = Pattern.Alt("de","cd");
Pattern pe3 = Pattern.Concat(pe1,pe2);

Matching

As with java.util.regex.Pattern an instance of jpattern.Matcher is created by calling a factory method of the jpattern.Pattern class. This method has several signatures.
    public Matcher matcher();
    public Matcher matcher(String subject);
    public Matcher matcher(String subject, VarMap vars);
    public Matcher matcher(String subject, VarMap vars, ExternalMap externs);

The meaning of the VarMap and ExternalMap arguments are explained elsewhere in this document.

The class jpattern.Matcher implements the following relevant methods

    public boolean match() throws Error;
    public boolean match(String subject, VarMap vars, ExternalMap externs)
        throws Error;

    public Matcher setAnchorMode(boolean b);
    public Matcher setStackSize(int n) throws Error;
    public Matcher setSubject(String newsubject);
    public Matcher reset(String newsubject); // alias for setSubject
    public Matcher setVarMap(VarMap map);
    public Matcher setExternalMap(ExternalMap map);

The first match() assumes that the various parameters of the match have already been set, either by the call to Pattern.matcher() or by invocation of the various parameter setting methods such as setSubject().

The jpattern.MatchResult interface - like its java.util.regex.MatchResult analog - provides a means for returning three values from the match if the match succeeds. It is an interface defined as follows.

public interface MatchResult {
    public String getSubject();
    public int getStart();
    public int getStop();
}

The meaning of the methods is as follows.

Note that the cursor indices are zero-based to be more useful with Java strings. This differs from the Snobol standard of one-based indices.

The jpattern.Matcher class implements the jpattern.MatchResult interface, so as a rule, the Matcher can be used to obtain the match results.

Variables

It is possible to use variables within patterns for a number of purposes. A variable is represented as a string matching the following regular expression: "[_a-zA-Z][_a-zA-Z0-9]*" (Note that sometimes RE's are more compact).

As a rule, the variable name should be preceded by the defer operator ("+") or an assignment operator ("." "$" "*" "**") or a replacement operator ("="). The leading "+" is not required unless the variable name is also a keyword.

As a rule, any elementary pattern operator that takes an argument may have a variable as its argument.

During the match, if a reference to a variable is required, then it searched for in the VarMap passed to the jpattern.Matcher. The value associated with the variable in the map can be of the following types (but see also here).

java.util.Collection
If a java Collection is associated with the variable, and the variable is being used as an immediate assignment (e.g. "*x"), then the value assigned to the variable is appended to the end of the collection. In conjunction with the fail pattern, this can be used to collect all possible matches of a pattern against a subject string.
jpattern.ExternalVariable
If the value associated with the variable in the map is of this type, then the get() method is invoked for that object when it is used as a value or as the source of a replace. If the variable is being used as the target of an immediate or deferred assignment then its put() method is invoked to do something with the value. This would allow, for example, a trace of a pattern match to be output to standard output. See the discussion about communication with Java for more details.
java.lang.Object
For this final case, the object is converted to a string and that value is used either for replacement or for matching.
If the variable is not defined in the map, then if it is involved in an assignment, it is inserted into the map. Otherwise, an exception is thrown signalling an undefined variable.

Pattern Specification Language

The string representation of patterns uses the following BNF syntax.
    pattern := alt;
    alt := cat | cat '|' alt;
    cat := replace | replace '&' cat | replace cat;
    replace := assign | assign '=' variable;
    assign :=   element
              | element '**' variable
              | element '.' variable
              | element '*' variable
              | element '$' variable;
    element := operator | '(' pattern ')' | defer | INT | STRING;
    operator := 
                | "+" variable
                | "abort" | cancel
                | "any" "(" sjvarg ")"
                | "arb" "(" sjvarg ")"
                | "arbno" "(" sjvarg ")"
                | "bal" | bal "(" sjvarg ")"
                | "break" "(" sjvarg ")"
                | "breakx" "(" sjvarg ")"
                | "fence" | fence "(" sjvarg ")"
                | "fail"
                | "len" "(" njvarg ")"
                | "notany", "(" sjvarg ")"
                | "nspan" "(" sjvarg ")"
                | "pos" "(" njvarg ")"
                | "rem" | "rest"
                | "rpos", "(" njvarg ")"
                | "rtab" "(" njvarg ")"
                | "setcur" "(" defer ")"
                | "span" "(" sjvarg ")"
                | "succeed"
                | "tab" "(" njvarg ")"
                ;
    sjvarg := STRING | JAVASTRING | defer;
    njvarg := INT | JAVASTRING | defer;
    defer := variable | "+" variable;

    variable := ID;

    # Lexical Elements
    ID := [_a-zA-Z][_a-zA-Z0-9]* that is not a keyword.
    INT := [-]?[0-9]+ | "0x" [0-9a-fA-F]+
    STRING := '"'  '"'
    JAVASTRING := '`'  '`'

Whitespace (" \t\n\r...") and comments may appear in expressions where desired. The comments are either the "//...\n" type or the "/*...*/" type. Be careful to not put an end-of-line escape before any comments on that line. That is, do not write

@len(5) \ //comment 1
bal()@
but rather write this.
@len(5) //comment 1 \
bal()@

Pattern Matching Overview

We will start with two basic functions: concatenation ("&") and alternation ("|"). If A and B are patterns, then "A & B" means try to match A, and if that succeeds, then try to match B starting after A. Note that the & is actually optional, and the juxtiposition of the two patterns is taken to imply concatenation. Thus "A B" is the treated the same as "A & B". Note that this means you must be careful with whitespace since, for example, fence("a") and fence  ("a") mean two different things.

The expression "A | B" means try to match A and if that fails, then try to match B starting at the same place where the attempt to match A began.

There is full backtracking, which means that if a given pattern element fails to match, then previous alternatives are matched. For example if we have the pattern:

(A | B) (C | D) (E | F)
First we attempt to match A, if that succeeds, then we go on to try to match C, and if that succeeds, we go on to try to match E. If E fails, then we try F. If F fails, then we go back and try matching D instead of C. Let's make this explicit using a specific example, and introducing the simplest kind of pattern element, which is a literal string. The meaning of this pattern element is simply to match the characters that correspond to the string characters. Now let's rewrite the above pattern form with specific string literals as the pattern elements:
("ABC" | "AB")("DEF" | "CDE")("GH" | "IJ")
The following strings will be attempted in sequence:
  1. ABC . DEF . GH
  2. ABC . DEF . IJ
  3. ABC . CDE . GH
  4. ABC . CDE . IJ
  5. AB . DEF . GH
  6. AB . DEF . IJ
  7. AB . CDE . GH
  8. AB . CDE . IJ
Here we use the dot simply to separate the pieces of the string matched by the three separate elements.

Moving the Start Point

A pattern is not required to match starting at the first character of the string, and is not required to match to the end of the string. The first attempt does indeed attempt to match starting at the first character of the string, trying all the possible alternatives. But if all alternatives fail, then the starting point of the match is moved one character, and all possible alternatives are attempted at the new anchor point.

The entire match fails only when every possible starting point has been attempted. As an example, suppose that we had the subject string

"ABABCDEIJKL"
matched using the pattern in the previous example:
("ABC" | "AB") & ("DEF" | "CDE") & ("GH" or "IJ")
This would succeed, after two anchor point moves:

"ABABCDEIJKL"
   ^^^^^^^
where the "^^^^^^^" indicates the matched section.

This mode of pattern matching is called the unanchored mode. It is also possible to put the pattern matcher into anchored mode by calling the function Match.setAnchorMode(true) This will cause all subsequent matches to be performed in anchored mode, where the match is required to start at the first character. We will also see later how the effect of an anchored match can be obtained for a single specified anchor point if this is desired.

Available Pattern Operations

Abort/Cancel
Pattern Function: Abort() or Cancel()
Compiler Syntax: abort or cancel
Description: The Abort (Cancel) operator immediately aborts the entire pattern match, signalling failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Abort, but the GNU version changed it to Cancel because Abort is a reserved Ada word.

Alternation
Pattern Function: Alternate(Pattern A, Pattern B)
Compiler Syntax: A | B
Description: The alternation operator means first attempt to match A, and then if that does not succeed, match B.

Any
Pattern Function: Any(String s)
Compiler Syntax: any(<string>)
Description: The Any(S) operator where S is a string, matches a single character that is any one of the characters in S. Fails if the current character is not one of the given set of characters.

Arb
Pattern Function: Arb()
Compiler Syntax: arb
Description: The Arb operator matches any string. First it matches the null string, and then on a subsequent failure, matches one character, and then two characters, and so on. It only fails if the entire remaining string is matched.

Arbno
Pattern Function: Arbno(String S) or Arbno(Pattern P)
Compiler Syntax: arbno(<pattern>)
Description: The Arbno(P) operator where P is any pattern, matches any number of instances of the pattern, starting with zero occurrences. It is thus equivalent to ("" | (P & ("" | (P & ("" ....))))). The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation.

Bal
Pattern Function: Bal() or Bal(String parens)
Compiler Syntax: bal or bal(<string>)
Description: The Bal operator matches a non-empty string that is parentheses balanced with respect to ordinary () characters. Examples of balanced strings are "ABC", "A((B)C)", and "A(B)C(D)E". Bal matches the shortest possible balanced string on the first attempt, and if there is a subsequent failure, attempts to extend the string. Optionally, the Bal operator can be given a two character string to indicate the parentheses characters. For example Bal("[]") will match balanced square brackets.

Break
Pattern Function: Break(String S)
Compiler Syntax: break(<string>)
Description: The Break(S) operator where S is a string, matches a string of zero or more characters up to but not including a break character that is one of the characters given in the string S. Can match the null string, but cannot match the last character in the string, since a break character is required to be present.

BreakX
Pattern Function: BreakX(String S)
Compiler Syntax: breakx(<string>)
Description: The BreakX(S) operator where S is a string, behaves exactly like Break(S) when it first matches, but if a string is successfully matched, then a susequent failure causes an attempt to extend the matched string.

Concatenation
Pattern Function: Concat(Pattern A, Pattern B)
Compiler Syntax: A & B
Description: The concatenation operator means match A followed immediately by matching B.

Fail
Pattern Function: Fail()
Compiler Syntax: fail
Description: The Fail operator is equivalent to the null alternation. Matches no possible strings, so it always signals failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects.

Fence
Pattern Function: Fence() or Fence(Pattern p)
Compiler Syntax: fence or Fence(<pattern>)
Description: The Fence operator matches the null string at first, and then if a failure causes alternatives to be sought, aborts the match (like a Cancel). Note that using Fence at the start of a pattern has the same effect as matching in anchored mode.

The Fence(P) operator, where P is a pattern, attempts to match the pattern P including trying all possible alternatives of P. If none of these alternatives succeeds, then the Fence pattern fails. If one alternative succeeds, then the pattern match proceeds, but on a subsequent failure, no attempt is made to search for alternative matches of P. The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation.

Len
Pattern Function: Len(int n)
Compiler Syntax: len(<integer>)
Description: The Len(N) operator where N is a natural number, matches the given number of characters. For example, Len(10) matches any string that is exactly ten characters long.

NotAny
Pattern Function: NotAny(String S)
Compiler Syntax: notany(<string>)
Description: The NotAny(N) operator where S is a string, matches a single character that is not one of the characters of S. Fails if the current characer is one of the given set of characters.

NSpan
Pattern Function: NSpan(String S)
Compiler Syntax: nspan(<string>)
Description: The NSpan(S) operator where S is a string, matches a string of zero or more characters that is among the characters given in the string. Always matches the longest possible such string. Always succeeds, since it can match the null string.

Pos
Pattern Function: Pos(int n)
Compiler Syntax: pos(<integer>)
Description: The Pos(N) operator where N is a natural number, matches the null string if exactly N characters have been matched so far, and otherwise fails.

Rem/Rest
Pattern Function: Rem() or Rest()
Compiler Syntax: rem or rest
Description: The Rem (Rest) operator matches from the current point to the last character in the string. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Rem, but the GNU version changed it to Rest because Rem is a reserved Ada word.

Rpos
Pattern Function: Rpos(int n)
Compiler Syntax: rpos(<integer>)
Description: The Rpos(N) operator where N is a natural number, matches the null string if exactly N characters remain to be matched, and otherwise fails.

Rtab
Pattern Function: Rtab(int n)
Compiler Syntax: rtab(<integer>)
Description: The Rtab(N) operator where N is a natural number, matches characters from the current position until exactly N characters remain to be matched in the string. Fails if fewer than N unmatched characters remain in the string.

Span
Pattern Function: Span(String S)
Compiler Syntax: span(<string>)
Description: The Span(S) operator where S is a string, matches a string of one or more characters that is among the characters given in the string. Always matches the longest possible such string. Fails if the current character is not one of the given set of characters.

Succeed
Pattern Function: Succeed()
Compiler Syntax: succeed
Description: The Succeed operator repeatedly matches the null string (it is equivalent to the alternation ("" | "" | "" ....)). This is a special pattern element, which is useful in conjunction with some of the special pattern elements that have side effects.

Tab
Pattern Function: Tab(int n)
Compiler Syntax: tab(<integer>)
Description: The Tab(N) operator where N is a natural number, matches characters from the current position until exactly N characters have been matched in all. Fails if more than N characters have already been matched. It is technically ambiguous about its value if the cursor is less than the length of the subject string and the length is less than N. Here is it assumed that the match fails in this case.

Recursive Pattern Matching

Defer
Pattern Function: Defer(Variable B)
Compiler Syntax: [+]<name>/td>
Description: The Defer operator (+V) where V is a variable, obtains at match time a value for V from the VarMap argument to the Match function. It then proceeds to match that value as if it had been present in the original pattern at that point. The leading "+" is optional unless the variable name matches a keyword. See also the Variables discussion.

A deferred pattern can be used to create a recursive pattern that will, at pattern matching time, follow the pointer to obtain the referenced pattern, and then match this pattern. Consider for example, the following Java code.

Pattern P
    = Pattern.Alternate("A",
                        Pattern.Concat("B",
                                       Pattern.Defer("P")));
VarMap vars = new VarMap();
vars.put("P",P);
By placing the pattern P into the map with the name P (the name can be anything consistent), we cause P to recurse through the map. The Examples section provides illustrations of recursive patterns.

On the first attempt to match, this pattern attempts to match the string "A". If this fails, then the alternative matches a "B", followed by an attempt to match the value of "P", which is of source the Pattern P. This second attempt now first attempts to match "A", and so on. The result is a pattern that will match a string of B's followed by a single A.

This particular example could simply be written as nspan("B") & "A", but the use of recursive patterns in the general case can construct complex patterns which could not otherwise be built.

Pattern Assignment Operations In addition to the overall result of a pattern match, which indicates success or failure, it is often useful to be able to keep track of the pieces of the subject string that are matched by individual pattern elements, or subsections of the pattern.

Deferred Assignment
Pattern Function: Assign(Pattern P, Variable v)
Compiler Syntax: <pattern> * <variable> or <pattern> . <variable>
Description: The deferred assignment operator takes an arbitrary pattern P and a variable V that will be set to the substring matched by P. The assignment is deferred to the end of the match. If the entire match is successful, and if the pattern P was part of the successful match, then at the end of the matching operation the assignment to S of the string matching P is performed.

Immediate Assignment
Pattern Function: IAssign(Pattern P, Variable v)
Compiler Syntax: <pattern> ** <variable> or <pattern> $ <variable>
Description: The immediate assignment operator takes an arbitrary pattern P P a variable V that will be set to the substring matched by P. This assignment happens during pattern matching, so if P matches more than once, then the assignment happens more than once.

Set Cursor
Pattern Function: Setcur(Variable v)
Compiler Syntax: setcur(<variable>)
Description: This (badly named) cursor assignment operation assigns the current cursor position to the variable N. The cursor position is defined as the count of characters that have been matched so far (including any start point moves).

Deferred Replacement

Deferred Replacement
Pattern Function: Replace(Pattern P, Variable v)
Compiler Syntax: <pattern>=<variable>
Description: Deferred replacement is similar to deferred assignment. With assignment, a variable is associated with a pattern and when matching completes, the substring matched by that pattern is assigned to the variable.

With deferred replacement, after a successful match, the substring matched by the pattern is replaced by the value of the variable (at the end of matching).

In order to access the modified subject string, the MatchResult argument must used.

The JPattern Pseudo-Compiler

The package jpattern.compiler supports the compilation of string expressions into equivalent Pattern objects. It is capable of converting a string representation of a pattern to equivalent Java code (via a command line interface) or to a Pattern object at runtime (via an API).

jpattern.compiler.Main

The command line interface is supported by the program jpattern.compiler.Main.

Summary:

java -classpath jpattern.jar jpattern.compiler.Main OPTIONS

Options:

OptionDescription
-pat '<pattern>' Specify a pattern expression.
-subject="<string>"Specify a subject string.
-var "var=<value>"Specify an initial variable assignment.; may be repeated.
-tag "<string>Specify a string that identifies pattern lines in the standard input. The same tag is used to mark the beginning and the end of the string.
-xmltag "<string>Specify an xml style opening tag that identifies pattern lines in the standard input. The closing tag is the same as the open tag, but with the standard "/". For example, -xmltag <pattern> would mark strings with the tags <pattern>...</pattern>.
-squoteIndicate that pattern expressions are using single quotes instead of the usual double quotes.
-debugTurn on lowest level of debug output.
-debugn <N> Set the debug level to N
-vPrint out the parameters.
-f <filename>Take input from this file
-o <filename>Send non-error output to this file.
-err <filename>Send error output to this file.
-conflictAvoid naming conflicts by using fully qualified names for pattern operations.

Description:

Command Line Usage

There are two primary calling forms, exemplified by the following.
  1. java -classpath jpattern.jar jpattern.compiler.Main
    -pat "arb & len(5)=x" -subject "1234567" -var "x=xyz"

  2. java -classpath jpattern.jar jpattern.compiler.Main
    -tag "@" < javacode.pat > javacode.java

The first command line case performs an actual match. It compiles the "-pat" argument and matches it against the "-subject" string. It reports if the match succeeds, and if so, prints out the final values of all variables and the final subject line after replacement. The example above, for example would report success and that the final subject line was "xyz67" and the final value of x was "xyz".

The second case reads lines from standard input ("javacode.pat" in this case) and writes them to standard output ("javacode.java" in this case). As each line is read, it is searched for paired occurrences of the java tag string (the argument to the -tag flag or -xmltag flag). The text between the occurrences is assumed to be a pattern expression. Multiple patterns per line is disallowed. That expression is compiled to equivalent Java code to construct the Pattern and is written to standard output in place of the tagged text. Patterns may span multiple lines by ending each line with a backslash character ("\").

As an alternative, it should be possible to invoke the compiler directly using the following command form: java -jar jpattern.jar ... This is because the jar file has the appropriate manifest within it. It should be noted, however, that java -D flags cannot be used when using the -jar flag.

To see how the compiler works, suppose the first example was run with the following input file.

public class Test0
{
    static void  makePattern()
    {
        Pattern Lnum = @pos(0) (Digs ".") span(" ")@;
        Pattern Digs = @span("0123456789")@;
    }
}
It would produce the following output file.
public class Test0
{
    static void makePattern()
    {
        Pattern Lnum = Pattern.Concat(
            Pattern.Pos(0),
            Pattern.Concat(
                Pattern.Concat(
                    Pattern.Defer(Variable.create("Digs")),
                    Pattern.StringPattern(".")
                ),
                Pattern.Span(" ")
            )
        );
        Pattern Digs = Pattern.Span("0123456789");
    }
}

You can examine the .pat files in the test directory to see other examples.

API Usage

The compiler may be invoked within a Java program to dynamically generate Patterns from pattern specifications. The Java code fragment equivalent to the first command line above would be as follows.
// Instantiate the compiler and compile the desired pattern.
Compiler compiler = new Compiler();
Pattern p = compiler.compile("arb & len(5)=x");

// Now create a VarMap and get a matcher from pattern p.
VarMap vars = new VarMap();
Match matcher = p.matcher();

//Recast the matcher as a match result
MatchResult result = (MatchResult)matcher;

// Now perform the match and report the result
String subject = "1234567";
boolean ok = matcher.match(subject,vars);
stdout.print(ok?"succeed: "+result.getSubject():"fail.");

Interacting with Java

It can be desirable for a pattern to interact with the surrounding Java code, and some (new) mechanisms are provided for that purpose.

It should be realized that the interaction can occur at two different points in the lifetime of a pattern. First, interaction can occur when a pattern is constructed using the the JPattern Pseudo-Compiler. Second, interaction can occur when the pattern is used to match a string.

Interaction During Pattern Construction
Java expressions may be passed through the compilation process by enclosing them in backquotes ('`'): `5 + 6`, for example. Several test cases - including 1, 3a3 and 5 - demonstrate this. Be careful when using this mechanism because no type checking is performed, so you can cause type errors when compiling the resulting Java code.

If the java expression in backquotes occurs in a position that requires it to be a pattern object, then the expression is passed through a function called Pattern.Java(). This function forces the value of the expression to be converted to a Pattern object. This is shown in test case 5.

Interaction During Pattern Matching
Ideally, it should be possible to refer to Java variables during pattern matching. Unfortunately, this would require the use of reflection, which is cumbersome and relatively slow. This is why the Variable mechanism is provided instead.

ExternalVariable: Another mechanism, the ExternalVariable mechanism, is provided that can give some of the benefits of access to surrounding Java. Consider the following example (taken from test case 5a).

class AccessJava implements ExternalVariable {
...
    Object current = null;
...
    public Object get(VarMap vars) {return current;}
    public void put(VarMap vars, Object val) {current = val;}
...
}
...
class Test5 extends Test {
...
    AccessJava func;
...
    func=new AccessJava();};
...
    vars.put("AccessJava",func);
    Pattern p = @("before" & span(" ")) $ +AccessJava & "after" = +AccessJava@;
}

The AccessJava class implements the jpattern.ExternalVariable interface. This interface provides a method called get() and a method called put(). is evaluated by the matcher whenever a variable of type ExternalFunction is encountered. In this case, AccessJava just returns the value of its parent's javaVariable1 instance variable. Since this is evaluated at match time, it provides a usable mechanism for accessing Java variables and code at that time.

The AccessJava function is used by storing an instance of it under some name ("AccessJava" in this case) in the variable map that will be passed to the Match.match() method. In the code above, the variable AccessJava is referenced at the end of the pattern, so during match it will invoke the AccessJava.get() method as the last part of the match. Of course, the reference could be any where as long as there is no type conflict.

ExternalPattern: The ExternalPattern mechanism allows users of jpattern to define their own pattern operations.

Let us define a new external pattern called RE(<regular expression>), where the <regular expression> is a string in quotes that represents a regular expression as defined by java.util.regex.Pattern.

This semantics of this user-defined pattern are that it attempts to repeatedly match the regular expression - call it RE - against the subject at the current cursor. Thus, it is roughly similar to the altered regular expression "RE+", which of course would be a legal RE for this pattern and would negate the need for the implicit plus. But the point here is to demonstrate how to use the ExternalPattern mechanism.

The basic process is to define two classes: one that implements the ExternalPattern interface and one that subclasses the ExternalMatcher class. The latter can be an inner, final class within the first.

So, for the RE pattern, we define the following outline structure (error checking and such have been elided).

class REPattern implements ExternalPattern
{
    // The ExternalPattern interface requires three methods
    // getName, getNargs, and matcher.

    public String getName() {return "RE";}
    public int getNargs() {return 1;}

    public ExternalMatcher matcher(Object[] argv) throws Error
    {
        java.util.regex.Pattern jpat = null;
        jpat = java.util.regex.Pattern.compile((String)argv[0]);
        ExternalMatcher rem = new REMatcher(jpat);
        return rem;
    }
}

The matcher() method is basically a factory method for creating ExternalMatcher instances. The reason for this is because the ExternalPattern instances are shared by all occurrences in a compiled pattern. So for each actual match of that pattern we must keep a separate piece of state to track the match.

The corresponding ExternalMatcher for REPattern is REMatcher and is defined as follows (again with elision).

    final class REMatcher extends ExternalMatcher
    {
        // The only relevant state is the regular expression
        java.util.regex.Pattern regexp = null;

        public REMatcher(java.util.regex.Pattern jpat) {regexp = jpat;}

        public boolean initial() throws Error {...}
    
        public boolean retry() throws Error {...}

        public void fail() {jpat = null; ...}
    }

On initial match attempt, the initial() method is called. It returns true or false if the initial match succeeds or fails. When the match backtracks into this operation, the retry() method is called. It returns true or false if the match retry succeeds or fails. If it fails, then the fail() method is invoked to allow the external pattern to reclaim any allocated resources.

This external pattern is defined by something like this.

    ExternalMap externs = new ExternalMap(<regular expression string>);
    externs.add(new REPattern());

Assume the invocation of the RE pattern looks something like this.

    RE("[a][b]")

The argument string "[a][b]" specifies the regular expression to match. When the matcher encounters this pattern it invokes REPattern.matcher(<regular expression string>) to obtain the REMatcher instance.

The class ExternalMatcher provides access to a set of fields that provide access to the current state of the overall match. The relevant definition is as follows.

abstract public class ExternalMatcher {
    ...
    public String Subject;
    public VarMap Vars;
    public ExternalMap Externs;
    public int Anchor;
    public int Cursor;
    public boolean Cancel;
}

The meaning of the fields is as follows.

  • Subject: the subject string being used in the match.
  • Vars: the variable map being used in the match.
  • Externs: the external pattern map being used in the match.
  • Anchor: the cursor index as it was at the beginning of the call to the initial() method.
  • Cursor: the current cursor index.
  • Cancel: return true if the match should be completely cancelled.

    The complete initial() method for REMatcher is defined as follows (again with elision).

        // Try the initial match. We want the maximum match
        // starting at the current cursor location.
        public boolean initial() throws Error
        {
            // Get the substring to match (with some test reporting)
            int curse = super.Cursor;
            String target = super.Subject.substring(curse);
            java.util.regex.Matcher m = regexp.matcher(target);
            boolean sf = m.lookingAt(); // anchored RE match
            if(!sf) return false; // failed                 
            // set the new cursor
            super.Cursor = (curse+m.end());
            return true;
        }
    

    The complete retry() method is defined as follows.

        public boolean retry() throws Error
        {
            // Get the substring to match
            int curse = super.Cursor;
            String target = Subject.substring(curse);
            java.util.regex.Matcher m = regexp.matcher(target);
            boolean sf = m.lookingAt(); // anchored RE match
            if(!sf) return false; // failed                 
            // set the new cursor
            super.Cursor = (curse+m.end());
            return true;
        }
    

    Test case 12 shows the example completely worked out and operational.

    Examples of Pattern Matching

    Example 1. This shows a simple example of the use of pattern replacement to remove a line number from the start of a string. We assume that the line number has the form of a string of decimal digits followed by a period, followed by one or more spaces. The tag delimiter for pattern specifications is "@". We will occasionally include the concatenation operator ("&").
    Pattern Digs = @span("0123456789")@;
    Pattern Lnum = @pos(0) & (+Digs & ".")=nil & span(" ")@;
    

    Now to use this pattern we simply do a match with a replacement:

    VarMap vars = new VarMap(); vars.put("Digs",Digs); vars.put("nil","");
    boolean ok = Lnum.matcher().match(Line, vars);
    

    which replaces the line number by the null string.

    The style we use here - defining constant patterns and then using them - is typical. It is possible to build up patterns dynamically, but it is usually more efficient to build them in pieces in advance using constant declarations. Note in particular that although it is possible to construct a pattern directly as an argument for the Match routine, it is much more efficient to preconstruct the pattern as we did in this example.

    Example 2. Now let's look at the use of pattern assignment to break a string into sections. Suppose that the input string has two unsigned decimal integers, separated by spaces or a comma, with spaces allowed anywhere. Then we can isolate the two numbers with the following pattern:

    Pattern B = @nspan(" ")@;
    Pattern N = @span("0123456789")@;
    Pattern T = @nspan(" ") +N*Num1 span(" ,") +N*Num2@;
    

    The match operation

    T.matcher().match(" 124, 257  ")
    

    would assign the string 124 to Num1 and the string 257 to Num2.

    Example 3. Now let's see how more complex elements can be built from the set of primitive elements. The following pattern matches strings that have the syntax of Ada 95 based literals:

    VarMap vars = new VarMap();
    vars.put("DecDigits","0123456789");
    vars.put("HexDigits","0123456789abcdefABCDEF");
    
    Pattern Digs = @span(+DecDigits)@;
    Pattern UDigs = @+Digs arbno("_" +Digs")@;
    Pattern Hdig = @span(+HexDigits)@;
    Pattern UHdig = @+Hdig arbno("_" +Hdig)@;
    Pattern Bnum = @+UDigs "#" +UHdig "#"@;
    vars.put("Digs",Digs);
    vars.put("UDigs",UDigs);
    vars.put("Hdig",Hdigs);
    vars.put("UHdig",UHdigs);
    vars.put("Bnum",Bnum);
    

    A match against Bnum will now match the desired strings, e.g. it will match "16#123_abc#", but not "a#b#". However, this pattern is not quite complete, since it does not allow colons to replace the pound signs. The following is more complete:

    Pattern Bchar = @any("#:")@;
    vars.put("Bchar",Bchar);
    Pattern Bnum = @+UDigs +Bchar +UHdig +Bchar@;
    

    But that is still not quite right, since it allows # and : to be mixed, and they are supposed to be used consistently. We solve this by using a deferred match.

    Pattern Bnum  = @+UDigs Bchar*Temp UHdig +Temp@;
    

    Here the first instance of the base character is stored in Temp, and then later in the pattern we rematch the value that was assigned.

    Example 4. For an example of a recursive pattern, let's define a pattern that is like the built in Bal, but the string matched is balanced with respect to square brackets or curly brackets.

    The language for such strings might be defined in extended BNF as

    ELEMENT ::= <any character other than [] or {}>
               | '[' BALANCED_STRING ']'
               | '{' BALANCED_STRING '}'
    BALANCED_STRING ::= ELEMENT {ELEMENT}
    

    Here we use "{}" to indicate zero or more occurrences of a term, as is common practice in extended BNF. Now we can translate the above BNF into recursive patterns as follows:

    Pattern Element = @notany("[]{}") \
                      | ("[" Balanced_String "]") \
                      | ("{" Balanced_String "}")@;
    Pattern Balanced_String = @Element arbno(Element)@;
    
    VarMap vars = new VarMap();
    vars.put("Balanced_String",Balanced_String);
    

    Note the important use of + here to refer to a pattern not yet defined. Note also that we use assignments precisely because we cannot refer to as yet undeclared variables in initializations.

    Now that this pattern is constructed, we can use it as though it were a new primitive pattern element. For example, the match:

    Pattern p = @+Balanced_String*Current_Output fail@;
    P.matcher().match("xy[ab{cd}]")
    

    will generate the output (assuming that Current_Output has been initialized with an instance of java.util.Collection).

    x xy xy[ab{cd}] y y[ab{cd}] [ab{cd}] a ab ab{cd} b b{cd} {cd} c cd d
    

    Note that the function of the fail operator here is simply to force the pattern Balanced_String to match all possible alternatives. Studying the operation of this pattern in detail is highly instructive.

    Example 5. Finally we give a rather elaborate example of the use of deferred matching and the ExternalFunction mechanism. The following declarations build up a pattern which will find the longest string of decimal digits in the subject string.

    public class Test5b extends Test
    {
        GTS gts = new GTS();
    ...
        public Pattern makePattern()
        {
            vars.put("GTS",gts);
            vars.put("Digit","0123456789");
            Pattern Digs = @span(+Digit)@;
            // Note that the comments are inside the backslash
            Pattern Find = @""*Max fence           // initialize Max to null \
                            & breakx(+Digit)       // scan looking for digits \
                            & ((span(+Digit)*Cur   // assign next string to Cur \
                                & +GTS             // check size(Cur) > Size(Max) \
                                & setcur(+Loc))    // if so, save location \
                               * Max)              // and assign to Max. \
                            & fail@;               // seek all alternatives
            return Find; // top level pattern
        }
    }
    ...
    class GTS implements ExternalVariable
    {
    ...
        public Object get(VarMap vars) {
            String Cur = vars.getString("Cur","");
            String Max = vars.getString("Max","");
            return Boolean.valueOf(Cur.length() > Max.length());
        }
        public void put(VarMap vars, Object val) {} // not used
    }
    

    As we see from the comments here, complex patterns like this take on aspects of sequential programs. In fact they are sequential programs with general backtracking. In this pattern, we first use a pattern assignment that matches null and assigns it to Max, so that it is initialized for the new match. Now BreakX scans to the next digit. Arb would do here, but BreakX will be more efficient. Once we have found a digit, we scan out the longest string of digits with Span, and assign it to Cur. The deferred call to GTS tests if the string we assigned to Cur is the longest so far. If not, then failure is signalled, and we seek alternatives (this means that BreakX will extend and look for the next digit string). If the call to GTS succeeds then the matched string is assigned as the largest string so far into Max and its location is saved in Loc. Finally Fail forces the match to fail and seek alternatives, so that the entire string is searched.

    If the pattern Find is matched against a string, the variable Max at the end of the pattern will have the longest string of digits, and Loc will be the starting character location of the string. For example, Match("ab123cd4657ef23", Find) will assign "4657" to Max and 11 to Loc (indicating that Subject.substring(0,11) has been matched. This example is taken from tests/Test5b.pat.



    Point of Contact

    Author:Dennis Heimbigner
    Software Engineering Research Laboratory
    Computer Science Department
    University of Colorado
    Boulder, Colorado 80309-0430 USA
    dennis.heimbigner@colorado.edu


    License

    The code is divided into two parts. The source code which is derived directly from the Ada source code is licensed under that source's license, which is essentially the LGPL. The compiler source code is licensed under the BSD license. See the file license.txt for more more details.