| Last Updated: | 01 October 2007 |
| Latest Version: | Jpattern 2.0 |
| Minimum JDK Level: | JDK 1.5 |
The goal of this project is to make Snobol4-style pattern matching available as a package for the Java programming language. Rather than build such a package from scratch, Rober Dewar's existing Ada-based Gnat Spitbol Patterns code was converted to Java. The result is generally consistent with that Ada package, although some changes were made to conform to the capabilities and limitations of the Java language. In light of the derived nature of this code, the original GNAT license is assumed to cover this code. The compiler, however is licensed under a BSD license. See the License Section for more details.
| Cancel() | Concat(Pattern, Pattern) | Rem() |
| Abort() | Defer(Variable) | Rest() |
| Alternate(Pattern, Pattern) | Fail() | RPos(int) |
| Any(String) | Fence() | RPos(Variable) |
| Any(Variable) | Fence(Pattern) | RTab(int) |
| Arb() | IAssign(Pattern, Variable) | RTab(Variable) |
| Arbno(String P) | Java(Object) | Setcur(Variable) |
| Arbno(Pattern) | Len(int) | Span(String) |
| Assign(Pattern, Variable) | Len(Variable) | Span(Variable) |
| Bal() | NotAny(String) | Succeed() |
| Bal(String) | NotAny(Variable) | Tab(int) |
| Bal(Variable) | NSpan(String) | Tab(Variable) |
| Break(String) | NSpan(Variable) | StringPattern(String) |
| Break(Variable) | Pos(int) | CharPattern(char) |
| BreakX(String) | Pos(Variable) | External(String,Object...) |
| BreakX(Variable) | Replace(Pattern, Variable) |
All of these methods are static public and all of them throw the exception jpattern.Error.
Consider the following pattern.
("abc" | "ab") & ("de" | "cd")
This pattern could be construct using the following sequence of
Pattern functions.
Pattern pe1 = Pattern.Alt("abc","ab");
Pattern pe2 = Pattern.Alt("de","cd");
Pattern pe3 = Pattern.Concat(pe1,pe2);
public Matcher matcher();
public Matcher matcher(String subject);
public Matcher matcher(String subject, VarMap vars);
public Matcher matcher(String subject, VarMap vars, ExternalMap externs);
The meaning of the VarMap and ExternalMap arguments are explained elsewhere in this document.
The class jpattern.Matcher implements the following relevant methods
public boolean match() throws Error;
public boolean match(String subject, VarMap vars, ExternalMap externs)
throws Error;
public Matcher setAnchorMode(boolean b);
public Matcher setStackSize(int n) throws Error;
public Matcher setSubject(String newsubject);
public Matcher reset(String newsubject); // alias for setSubject
public Matcher setVarMap(VarMap map);
public Matcher setExternalMap(ExternalMap map);
The first match() assumes that the various parameters of the match have already been set, either by the call to Pattern.matcher() or by invocation of the various parameter setting methods such as setSubject().
The jpattern.MatchResult interface - like its java.util.regex.MatchResult analog - provides a means for returning three values from the match if the match succeeds. It is an interface defined as follows.
public interface MatchResult {
public String getSubject();
public int getStart();
public int getStop();
}
The meaning of the methods is as follows.
The jpattern.Matcher class implements the jpattern.MatchResult interface, so as a rule, the Matcher can be used to obtain the match results.
As a rule, the variable name should be preceded by the defer operator ("+") or an assignment operator ("." "$" "*" "**") or a replacement operator ("="). The leading "+" is not required unless the variable name is also a keyword.
As a rule, any elementary pattern operator that takes an argument may have a variable as its argument.
During the match, if a reference to a variable is required, then it searched for in the VarMap passed to the jpattern.Matcher. The value associated with the variable in the map can be of the following types (but see also here).
pattern := alt;
alt := cat | cat '|' alt;
cat := replace | replace '&' cat | replace cat;
replace := assign | assign '=' variable;
assign := element
| element '**' variable
| element '.' variable
| element '*' variable
| element '$' variable;
element := operator | '(' pattern ')' | defer | INT | STRING;
operator :=
| "+" variable
| "abort" | cancel
| "any" "(" sjvarg ")"
| "arb" "(" sjvarg ")"
| "arbno" "(" sjvarg ")"
| "bal" | bal "(" sjvarg ")"
| "break" "(" sjvarg ")"
| "breakx" "(" sjvarg ")"
| "fence" | fence "(" sjvarg ")"
| "fail"
| "len" "(" njvarg ")"
| "notany", "(" sjvarg ")"
| "nspan" "(" sjvarg ")"
| "pos" "(" njvarg ")"
| "rem" | "rest"
| "rpos", "(" njvarg ")"
| "rtab" "(" njvarg ")"
| "setcur" "(" defer ")"
| "span" "(" sjvarg ")"
| "succeed"
| "tab" "(" njvarg ")"
;
sjvarg := STRING | JAVASTRING | defer;
njvarg := INT | JAVASTRING | defer;
defer := variable | "+" variable;
variable := ID;
# Lexical Elements
ID := [_a-zA-Z][_a-zA-Z0-9]* that is not a keyword.
INT := [-]?[0-9]+ | "0x" [0-9a-fA-F]+
STRING := '"' '"'
JAVASTRING := '`' '`'
Whitespace (" \t\n\r...") and comments may appear in expressions where desired. The comments are either the "//...\n" type or the "/*...*/" type. Be careful to not put an end-of-line escape before any comments on that line. That is, do not write
@len(5) \ //comment 1 bal()@but rather write this.
@len(5) //comment 1 \ bal()@
The expression "A | B" means try to match A and if that fails, then try to match B starting at the same place where the attempt to match A began.
There is full backtracking, which means that if a given pattern element fails to match, then previous alternatives are matched. For example if we have the pattern:
First we attempt to match A, if that succeeds, then we go on to try to match C, and if that succeeds, we go on to try to match E. If E fails, then we try F. If F fails, then we go back and try matching D instead of C. Let's make this explicit using a specific example, and introducing the simplest kind of pattern element, which is a literal string. The meaning of this pattern element is simply to match the characters that correspond to the string characters. Now let's rewrite the above pattern form with specific string literals as the pattern elements:(A | B) (C | D) (E | F)
("ABC" | "AB")("DEF" | "CDE")("GH" | "IJ")
The following strings will be attempted in sequence:
The entire match fails only when every possible starting point has been attempted. As an example, suppose that we had the subject string
matched using the pattern in the previous example:"ABABCDEIJKL"
("ABC" | "AB") & ("DEF" | "CDE") & ("GH" or "IJ")
This would succeed, after two anchor point moves:
where the "^^^^^^^" indicates the matched section."ABABCDEIJKL" ^^^^^^^
This mode of pattern matching is called the unanchored mode. It is also possible to put the pattern matcher into anchored mode by calling the function Match.setAnchorMode(true) This will cause all subsequent matches to be performed in anchored mode, where the match is required to start at the first character. We will also see later how the effect of an anchored match can be obtained for a single specified anchor point if this is desired.
| Abort/Cancel | |
|---|---|
| Pattern Function: | Abort() or Cancel() |
| Compiler Syntax: | abort or cancel |
| Description: | The Abort (Cancel) operator immediately aborts the entire pattern match, signalling failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Abort, but the GNU version changed it to Cancel because Abort is a reserved Ada word. |
| Alternation | |
|---|---|
| Pattern Function: | Alternate(Pattern A, Pattern B) |
| Compiler Syntax: | A | B |
| Description: | The alternation operator means first attempt to match A, and then if that does not succeed, match B. |
| Any | |
|---|---|
| Pattern Function: | Any(String s) |
| Compiler Syntax: | any(<string>) |
| Description: | The Any(S) operator where S is a string, matches a single character that is any one of the characters in S. Fails if the current character is not one of the given set of characters. |
| Arb | |
|---|---|
| Pattern Function: | Arb() |
| Compiler Syntax: | arb |
| Description: | The Arb operator matches any string. First it matches the null string, and then on a subsequent failure, matches one character, and then two characters, and so on. It only fails if the entire remaining string is matched. |
| Arbno | |
|---|---|
| Pattern Function: | Arbno(String S) or Arbno(Pattern P) |
| Compiler Syntax: | arbno(<pattern>) |
| Description: | The Arbno(P) operator where P is any pattern, matches any number of instances of the pattern, starting with zero occurrences. It is thus equivalent to ("" | (P & ("" | (P & ("" ....))))). The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation. |
| Bal | |
|---|---|
| Pattern Function: | Bal() or Bal(String parens) |
| Compiler Syntax: | bal or bal(<string>) |
| Description: | The Bal operator matches a non-empty string that is parentheses balanced with respect to ordinary () characters. Examples of balanced strings are "ABC", "A((B)C)", and "A(B)C(D)E". Bal matches the shortest possible balanced string on the first attempt, and if there is a subsequent failure, attempts to extend the string. Optionally, the Bal operator can be given a two character string to indicate the parentheses characters. For example Bal("[]") will match balanced square brackets. |
| Break | |
|---|---|
| Pattern Function: | Break(String S) |
| Compiler Syntax: | break(<string>) |
| Description: | The Break(S) operator where S is a string, matches a string of zero or more characters up to but not including a break character that is one of the characters given in the string S. Can match the null string, but cannot match the last character in the string, since a break character is required to be present. |
| BreakX | |
|---|---|
| Pattern Function: | BreakX(String S) |
| Compiler Syntax: | breakx(<string>) |
| Description: | The BreakX(S) operator where S is a string, behaves exactly like Break(S) when it first matches, but if a string is successfully matched, then a susequent failure causes an attempt to extend the matched string. |
| Concatenation | |
|---|---|
| Pattern Function: | Concat(Pattern A, Pattern B) |
| Compiler Syntax: | A & B |
| Description: | The concatenation operator means match A followed immediately by matching B. |
| Fail | |
|---|---|
| Pattern Function: | Fail() |
| Compiler Syntax: | fail |
| Description: | The Fail operator is equivalent to the null alternation. Matches no possible strings, so it always signals failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. |
| Fence | |
|---|---|
| Pattern Function: | Fence() or Fence(Pattern p) |
| Compiler Syntax: | fence or Fence(<pattern>) |
| Description: | The Fence operator
matches the null string at first, and then if a failure
causes alternatives to be sought, aborts the match (like
a Cancel). Note that using Fence at the start of a pattern
has the same effect as matching in anchored mode.
The Fence(P) operator, where P is a pattern, attempts to match the pattern P including trying all possible alternatives of P. If none of these alternatives succeeds, then the Fence pattern fails. If one alternative succeeds, then the pattern match proceeds, but on a subsequent failure, no attempt is made to search for alternative matches of P. The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation. |
| Len | |
|---|---|
| Pattern Function: | Len(int n) |
| Compiler Syntax: | len(<integer>) |
| Description: | The Len(N) operator where N is a natural number, matches the given number of characters. For example, Len(10) matches any string that is exactly ten characters long. |
| NotAny | |
|---|---|
| Pattern Function: | NotAny(String S) |
| Compiler Syntax: | notany(<string>) |
| Description: | The NotAny(N) operator where S is a string, matches a single character that is not one of the characters of S. Fails if the current characer is one of the given set of characters. |
| NSpan | |
|---|---|
| Pattern Function: | NSpan(String S) |
| Compiler Syntax: | nspan(<string>) |
| Description: | The NSpan(S) operator where S is a string, matches a string of zero or more characters that is among the characters given in the string. Always matches the longest possible such string. Always succeeds, since it can match the null string. |
| Pos | |
|---|---|
| Pattern Function: | Pos(int n) |
| Compiler Syntax: | pos(<integer>) |
| Description: | The Pos(N) operator where N is a natural number, matches the null string if exactly N characters have been matched so far, and otherwise fails. |
| Rem/Rest | |
|---|---|
| Pattern Function: | Rem() or Rest() |
| Compiler Syntax: | rem or rest |
| Description: | The Rem (Rest) operator matches from the current point to the last character in the string. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Rem, but the GNU version changed it to Rest because Rem is a reserved Ada word. |
| Rpos | |
|---|---|
| Pattern Function: | Rpos(int n) |
| Compiler Syntax: | rpos(<integer>) |
| Description: | The Rpos(N) operator where N is a natural number, matches the null string if exactly N characters remain to be matched, and otherwise fails. |
| Rtab | |
|---|---|
| Pattern Function: | Rtab(int n) |
| Compiler Syntax: | rtab(<integer>) |
| Description: | The Rtab(N) operator where N is a natural number, matches characters from the current position until exactly N characters remain to be matched in the string. Fails if fewer than N unmatched characters remain in the string. |
| Span | |
|---|---|
| Pattern Function: | Span(String S) |
| Compiler Syntax: | span(<string>) |
| Description: | The Span(S) operator where S is a string, matches a string of one or more characters that is among the characters given in the string. Always matches the longest possible such string. Fails if the current character is not one of the given set of characters. |
| Succeed | |
|---|---|
| Pattern Function: | Succeed() |
| Compiler Syntax: | succeed |
| Description: | The Succeed operator repeatedly matches the null string (it is equivalent to the alternation ("" | "" | "" ....)). This is a special pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. |
| Tab | |
|---|---|
| Pattern Function: | Tab(int n) |
| Compiler Syntax: | tab(<integer>) |
| Description: | The Tab(N) operator where N is a natural number, matches characters from the current position until exactly N characters have been matched in all. Fails if more than N characters have already been matched. It is technically ambiguous about its value if the cursor is less than the length of the subject string and the length is less than N. Here is it assumed that the match fails in this case. |
| Defer | |
|---|---|
| Pattern Function: | Defer(Variable B) |
| Compiler Syntax: | [+]<name>/td> |
| Description: |
The Defer operator (+V) where V is a variable,
obtains at match time
a value for V from the VarMap argument to the Match function.
It then proceeds to match that value as if it had been present
in the original pattern at that point. The leading "+" is optional
unless the variable name matches a keyword.
See also the Variables discussion.
A deferred pattern can be used to create a recursive pattern that will, at pattern matching time, follow the pointer to obtain the referenced pattern, and then match this pattern. Consider for example, the following Java code.
By placing the pattern P into the map with the name P (the name can be
anything consistent), we cause P to recurse through the map.
The Examples section
provides illustrations of recursive patterns.
On the first attempt to match, this pattern attempts to match the string "A". If this fails, then the alternative matches a "B", followed by an attempt to match the value of "P", which is of source the Pattern P. This second attempt now first attempts to match "A", and so on. The result is a pattern that will match a string of B's followed by a single A. This particular example could simply be written as nspan("B") & "A", but the use of recursive patterns in the general case can construct complex patterns which could not otherwise be built. |
| Deferred Assignment | |
|---|---|
| Pattern Function: | Assign(Pattern P, Variable v) |
| Compiler Syntax: | <pattern> * <variable> or <pattern> . <variable> |
| Description: | The deferred assignment operator takes an arbitrary pattern P and a variable V that will be set to the substring matched by P. The assignment is deferred to the end of the match. If the entire match is successful, and if the pattern P was part of the successful match, then at the end of the matching operation the assignment to S of the string matching P is performed. |
| Immediate Assignment | |
|---|---|
| Pattern Function: | IAssign(Pattern P, Variable v) |
| Compiler Syntax: | <pattern> ** <variable> or <pattern> $ <variable> |
| Description: | The immediate assignment operator takes an arbitrary pattern P P a variable V that will be set to the substring matched by P. This assignment happens during pattern matching, so if P matches more than once, then the assignment happens more than once. |
| Set Cursor | ||
|---|---|---|
| Pattern Function: | Setcur(Variable v) |
|
| Compiler Syntax: | setcur(<variable>) | |
| Description: | This (badly named) cursor assignment operation assigns the current cursor position to the variable N. The cursor position is defined as the count of characters that have been matched so far (including any start point moves). | |
| Deferred Replacement | ||
|---|---|---|
| Pattern Function: | Replace(Pattern P, Variable v) |
|
| Compiler Syntax: | <pattern>=<variable> | |
| Description: |
Deferred replacement is similar to deferred assignment.
With assignment, a variable is associated with a pattern and
when matching completes, the substring matched by that pattern
is assigned to the variable.
With deferred replacement, after a successful match, the substring matched by the pattern is replaced by the value of the variable (at the end of matching). In order to access the modified subject string, the MatchResult argument must used. | |
| Option | Description |
| -pat '<pattern>' | Specify a pattern expression. |
| -subject="<string>" | Specify a subject string. |
| -var "var=<value>" | Specify an initial variable assignment.; may be repeated. |
| -tag "<string> | Specify a string that identifies pattern lines in the standard input. The same tag is used to mark the beginning and the end of the string. |
| -xmltag "<string> | Specify an xml style opening tag that identifies pattern lines in the standard input.
The closing tag is the same as the open tag, but with the standard "/".
For example, -xmltag <pattern> would mark strings with the tags
<pattern>...</pattern>. |
| -squote | Indicate that pattern expressions are using single quotes instead of the usual double quotes. |
| -debug | Turn on lowest level of debug output. |
| -debugn <N> | Set the debug level to N |
| -v | Print out the parameters. |
| -f <filename> | Take input from this file |
| -o <filename> | Send non-error output to this file. |
| -err <filename> | Send error output to this file. |
| -conflict | Avoid naming conflicts by using fully qualified names for pattern operations. |
java -classpath jpattern.jar jpattern.compiler.Main
-pat "arb & len(5)=x" -subject "1234567" -var "x=xyz"
java -classpath jpattern.jar jpattern.compiler.Main
-tag "@" < javacode.pat > javacode.java
The second case reads lines from standard input ("javacode.pat" in this case) and writes them to standard output ("javacode.java" in this case). As each line is read, it is searched for paired occurrences of the java tag string (the argument to the -tag flag or -xmltag flag). The text between the occurrences is assumed to be a pattern expression. Multiple patterns per line is disallowed. That expression is compiled to equivalent Java code to construct the Pattern and is written to standard output in place of the tagged text. Patterns may span multiple lines by ending each line with a backslash character ("\").
As an alternative, it should be possible to invoke
the compiler directly using the following command form:
java -jar jpattern.jar ...
This is because the jar file has the appropriate manifest within it.
It should be noted, however, that java -D flags
cannot be used when using the -jar flag.
To see how the compiler works, suppose the first example was run with the following input file.
public class Test0
{
static void makePattern()
{
Pattern Lnum = @pos(0) (Digs ".") span(" ")@;
Pattern Digs = @span("0123456789")@;
}
}
It would produce the following output file.
public class Test0
{
static void makePattern()
{
Pattern Lnum = Pattern.Concat(
Pattern.Pos(0),
Pattern.Concat(
Pattern.Concat(
Pattern.Defer(Variable.create("Digs")),
Pattern.StringPattern(".")
),
Pattern.Span(" ")
)
);
Pattern Digs = Pattern.Span("0123456789");
}
}
You can examine the .pat files in the test directory to see other examples.
// Instantiate the compiler and compile the desired pattern.
Compiler compiler = new Compiler();
Pattern p = compiler.compile("arb & len(5)=x");
// Now create a VarMap and get a matcher from pattern p.
VarMap vars = new VarMap();
Match matcher = p.matcher();
//Recast the matcher as a match result
MatchResult result = (MatchResult)matcher;
// Now perform the match and report the result
String subject = "1234567";
boolean ok = matcher.match(subject,vars);
stdout.print(ok?"succeed: "+result.getSubject():"fail.");
It should be realized that the interaction can occur at two different points in the lifetime of a pattern. First, interaction can occur when a pattern is constructed using the the JPattern Pseudo-Compiler. Second, interaction can occur when the pattern is used to match a string.
Interaction During Pattern Construction
Java expressions may be passed through the compilation process
by enclosing them in backquotes ('`'): `5 + 6`, for example.
Several test cases - including 1, 3a3 and 5 -
demonstrate this. Be careful when using this
mechanism because no type checking is performed, so you can cause
type errors when compiling the resulting Java code.
If the java expression in backquotes occurs in a position that requires it to be a pattern object, then the expression is passed through a function called Pattern.Java(). This function forces the value of the expression to be converted to a Pattern object. This is shown in test case 5.
Interaction During Pattern Matching
Ideally, it should be possible to refer to Java variables
during pattern matching. Unfortunately, this would require the
use of reflection, which is cumbersome and relatively slow.
This is why the
Variable
mechanism is provided instead.
ExternalVariable: Another mechanism, the ExternalVariable mechanism, is provided that can give some of the benefits of access to surrounding Java. Consider the following example (taken from test case 5a).
class AccessJava implements ExternalVariable {
...
Object current = null;
...
public Object get(VarMap vars) {return current;}
public void put(VarMap vars, Object val) {current = val;}
...
}
...
class Test5 extends Test {
...
AccessJava func;
...
func=new AccessJava();};
...
vars.put("AccessJava",func);
Pattern p = @("before" & span(" ")) $ +AccessJava & "after" = +AccessJava@;
}
The AccessJava class implements the jpattern.ExternalVariable interface. This interface provides a method called get() and a method called put(). is evaluated by the matcher whenever a variable of type ExternalFunction is encountered. In this case, AccessJava just returns the value of its parent's javaVariable1 instance variable. Since this is evaluated at match time, it provides a usable mechanism for accessing Java variables and code at that time.
The AccessJava function is used by storing an instance of it under some name ("AccessJava" in this case) in the variable map that will be passed to the Match.match() method. In the code above, the variable AccessJava is referenced at the end of the pattern, so during match it will invoke the AccessJava.get() method as the last part of the match. Of course, the reference could be any where as long as there is no type conflict.
ExternalPattern: The ExternalPattern mechanism allows users of jpattern to define their own pattern operations.
Let us define a new external pattern called RE(<regular expression>), where the <regular expression> is a string in quotes that represents a regular expression as defined by java.util.regex.Pattern.
This semantics of this user-defined pattern are that it attempts to repeatedly match the regular expression - call it RE - against the subject at the current cursor. Thus, it is roughly similar to the altered regular expression "RE+", which of course would be a legal RE for this pattern and would negate the need for the implicit plus. But the point here is to demonstrate how to use the ExternalPattern mechanism.
The basic process is to define two classes: one that implements the ExternalPattern interface and one that subclasses the ExternalMatcher class. The latter can be an inner, final class within the first.
So, for the RE pattern, we define the following outline structure (error checking and such have been elided).
class REPattern implements ExternalPattern
{
// The ExternalPattern interface requires three methods
// getName, getNargs, and matcher.
public String getName() {return "RE";}
public int getNargs() {return 1;}
public ExternalMatcher matcher(Object[] argv) throws Error
{
java.util.regex.Pattern jpat = null;
jpat = java.util.regex.Pattern.compile((String)argv[0]);
ExternalMatcher rem = new REMatcher(jpat);
return rem;
}
}
The matcher() method is basically a factory method for creating ExternalMatcher instances. The reason for this is because the ExternalPattern instances are shared by all occurrences in a compiled pattern. So for each actual match of that pattern we must keep a separate piece of state to track the match.
The corresponding ExternalMatcher for REPattern is REMatcher and is defined as follows (again with elision).
final class REMatcher extends ExternalMatcher
{
// The only relevant state is the regular expression
java.util.regex.Pattern regexp = null;
public REMatcher(java.util.regex.Pattern jpat) {regexp = jpat;}
public boolean initial() throws Error {...}
public boolean retry() throws Error {...}
public void fail() {jpat = null; ...}
}
On initial match attempt, the initial() method is called. It returns true or false if the initial match succeeds or fails. When the match backtracks into this operation, the retry() method is called. It returns true or false if the match retry succeeds or fails. If it fails, then the fail() method is invoked to allow the external pattern to reclaim any allocated resources.
This external pattern is defined by something like this.
ExternalMap externs = new ExternalMap(<regular expression string>);
externs.add(new REPattern());
Assume the invocation of the RE pattern looks something like this.
RE("[a][b]")
The argument string "[a][b]" specifies the regular expression to match. When the matcher encounters this pattern it invokes REPattern.matcher(<regular expression string>) to obtain the REMatcher instance.
The class ExternalMatcher provides access to a set of fields that provide access to the current state of the overall match. The relevant definition is as follows.
abstract public class ExternalMatcher {
...
public String Subject;
public VarMap Vars;
public ExternalMap Externs;
public int Anchor;
public int Cursor;
public boolean Cancel;
}
The meaning of the fields is as follows.
The complete initial() method for REMatcher is defined as follows (again with elision).
// Try the initial match. We want the maximum match
// starting at the current cursor location.
public boolean initial() throws Error
{
// Get the substring to match (with some test reporting)
int curse = super.Cursor;
String target = super.Subject.substring(curse);
java.util.regex.Matcher m = regexp.matcher(target);
boolean sf = m.lookingAt(); // anchored RE match
if(!sf) return false; // failed
// set the new cursor
super.Cursor = (curse+m.end());
return true;
}
The complete retry() method is defined as follows.
public boolean retry() throws Error
{
// Get the substring to match
int curse = super.Cursor;
String target = Subject.substring(curse);
java.util.regex.Matcher m = regexp.matcher(target);
boolean sf = m.lookingAt(); // anchored RE match
if(!sf) return false; // failed
// set the new cursor
super.Cursor = (curse+m.end());
return true;
}
Test case 12 shows the example completely worked out and operational.
Pattern Digs = @span("0123456789")@;
Pattern Lnum = @pos(0) & (+Digs & ".")=nil & span(" ")@;
Now to use this pattern we simply do a match with a replacement:
VarMap vars = new VarMap(); vars.put("Digs",Digs); vars.put("nil","");
boolean ok = Lnum.matcher().match(Line, vars);
which replaces the line number by the null string.
The style we use here - defining constant patterns and then using them - is typical. It is possible to build up patterns dynamically, but it is usually more efficient to build them in pieces in advance using constant declarations. Note in particular that although it is possible to construct a pattern directly as an argument for the Match routine, it is much more efficient to preconstruct the pattern as we did in this example.
Example 2. Now let's look at the use of pattern assignment to break a string into sections. Suppose that the input string has two unsigned decimal integers, separated by spaces or a comma, with spaces allowed anywhere. Then we can isolate the two numbers with the following pattern:
Pattern B = @nspan(" ")@;
Pattern N = @span("0123456789")@;
Pattern T = @nspan(" ") +N*Num1 span(" ,") +N*Num2@;
The match operation
T.matcher().match(" 124, 257 ")
would assign the string 124 to Num1 and the string 257 to Num2.
Example 3. Now let's see how more complex elements can be built from the set of primitive elements. The following pattern matches strings that have the syntax of Ada 95 based literals:
VarMap vars = new VarMap();
vars.put("DecDigits","0123456789");
vars.put("HexDigits","0123456789abcdefABCDEF");
Pattern Digs = @span(+DecDigits)@;
Pattern UDigs = @+Digs arbno("_" +Digs")@;
Pattern Hdig = @span(+HexDigits)@;
Pattern UHdig = @+Hdig arbno("_" +Hdig)@;
Pattern Bnum = @+UDigs "#" +UHdig "#"@;
vars.put("Digs",Digs);
vars.put("UDigs",UDigs);
vars.put("Hdig",Hdigs);
vars.put("UHdig",UHdigs);
vars.put("Bnum",Bnum);
A match against Bnum will now match the desired strings, e.g. it will match "16#123_abc#", but not "a#b#". However, this pattern is not quite complete, since it does not allow colons to replace the pound signs. The following is more complete:
Pattern Bchar = @any("#:")@;
vars.put("Bchar",Bchar);
Pattern Bnum = @+UDigs +Bchar +UHdig +Bchar@;
But that is still not quite right, since it allows # and : to be mixed, and they are supposed to be used consistently. We solve this by using a deferred match.
Pattern Bnum = @+UDigs Bchar*Temp UHdig +Temp@;
Here the first instance of the base character is stored in Temp, and then later in the pattern we rematch the value that was assigned.
Example 4. For an example of a recursive pattern, let's define a pattern that is like the built in Bal, but the string matched is balanced with respect to square brackets or curly brackets.
The language for such strings might be defined in extended BNF as
ELEMENT ::= <any character other than [] or {}>
| '[' BALANCED_STRING ']'
| '{' BALANCED_STRING '}'
BALANCED_STRING ::= ELEMENT {ELEMENT}
Here we use "{}" to indicate zero or more occurrences of a term, as is common practice in extended BNF. Now we can translate the above BNF into recursive patterns as follows:
Pattern Element = @notany("[]{}") \
| ("[" Balanced_String "]") \
| ("{" Balanced_String "}")@;
Pattern Balanced_String = @Element arbno(Element)@;
VarMap vars = new VarMap();
vars.put("Balanced_String",Balanced_String);
Note the important use of + here to refer to a pattern not yet defined. Note also that we use assignments precisely because we cannot refer to as yet undeclared variables in initializations.
Now that this pattern is constructed, we can use it as though it were a new primitive pattern element. For example, the match:
Pattern p = @+Balanced_String*Current_Output fail@;
P.matcher().match("xy[ab{cd}]")
will generate the output (assuming that Current_Output has been initialized with an instance of java.util.Collection).
x xy xy[ab{cd}] y y[ab{cd}] [ab{cd}] a ab ab{cd} b b{cd} {cd} c cd d
Note that the function of the fail operator here is simply to force the pattern Balanced_String to match all possible alternatives. Studying the operation of this pattern in detail is highly instructive.
Example 5. Finally we give a rather elaborate example of the use of deferred matching and the ExternalFunction mechanism. The following declarations build up a pattern which will find the longest string of decimal digits in the subject string.
public class Test5b extends Test
{
GTS gts = new GTS();
...
public Pattern makePattern()
{
vars.put("GTS",gts);
vars.put("Digit","0123456789");
Pattern Digs = @span(+Digit)@;
// Note that the comments are inside the backslash
Pattern Find = @""*Max fence // initialize Max to null \
& breakx(+Digit) // scan looking for digits \
& ((span(+Digit)*Cur // assign next string to Cur \
& +GTS // check size(Cur) > Size(Max) \
& setcur(+Loc)) // if so, save location \
* Max) // and assign to Max. \
& fail@; // seek all alternatives
return Find; // top level pattern
}
}
...
class GTS implements ExternalVariable
{
...
public Object get(VarMap vars) {
String Cur = vars.getString("Cur","");
String Max = vars.getString("Max","");
return Boolean.valueOf(Cur.length() > Max.length());
}
public void put(VarMap vars, Object val) {} // not used
}
As we see from the comments here, complex patterns like this take on aspects of sequential programs. In fact they are sequential programs with general backtracking. In this pattern, we first use a pattern assignment that matches null and assigns it to Max, so that it is initialized for the new match. Now BreakX scans to the next digit. Arb would do here, but BreakX will be more efficient. Once we have found a digit, we scan out the longest string of digits with Span, and assign it to Cur. The deferred call to GTS tests if the string we assigned to Cur is the longest so far. If not, then failure is signalled, and we seek alternatives (this means that BreakX will extend and look for the next digit string). If the call to GTS succeeds then the matched string is assigned as the largest string so far into Max and its location is saved in Loc. Finally Fail forces the match to fail and seek alternatives, so that the entire string is searched.
If the pattern Find is matched against a string, the variable Max at the end of the pattern will have the longest string of digits, and Loc will be the starting character location of the string. For example, Match("ab123cd4657ef23", Find) will assign "4657" to Max and 11 to Loc (indicating that Subject.substring(0,11) has been matched. This example is taken from tests/Test5b.pat.
| Author: | Dennis Heimbigner |
| Software Engineering Research Laboratory | |
| Computer Science Department | |
| University of Colorado | |
| Boulder, Colorado 80309-0430 USA | |
| dennis.heimbigner@colorado.edu |