public class RE extends Object
To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b");                  // Compile expression
boolean matched = r.match("xaaaab");     // Match against "xaaaab"
String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1);     // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5
int startInside = r.getParenStart(1);    // startInside will be index 1
int endInside = r.getParenEnd(1);        // endInside will be index 5
int lenInside = r.getParenLength(1);     // lenInside will be 4
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
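The backreference behaviour described above can be tried with a short sketch. Because the RE class is not part of the JDK, this example swaps in the standard java.util.regex package, which accepts the same \1 notation. The class name BackreferenceSketch and the helper repeatedNumber are made up for illustration, and RE itself may differ in details.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT the RE class documented here.
final class BackreferenceSketch {

    // Returns the digits captured by group 1 if the input contains "n=n",
    // or null if no such substring exists. (Hypothetical helper.)
    static String repeatedNumber(String input) {
        Matcher m = Pattern.compile("([0-9]+)=\\1").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedNumber("2=2"));   // prints 2
        System.out.println(repeatedNumber("12=12")); // prints 12
        System.out.println(repeatedNumber("1=2"));   // prints null
    }
}
```

Here find() plays the role of RE.match, which also searches for the pattern anywhere in the input, and group(1) corresponds to getParen(1).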
The full regular expression syntax accepted by RE is described here:
Characters
    unicodeChar   Matches any identical unicode character
    \             Used to quote a meta-character (like '*')
    \\            Matches a single '\' character
    \0nnn         Matches a given octal character
    \xhh          Matches a given 8-bit hexadecimal character
    \uhhhh        Matches a given 16-bit hexadecimal character
    \t            Matches an ASCII tab character
    \n            Matches an ASCII newline character
    \r            Matches an ASCII return character
    \f            Matches an ASCII form feed character
Character Classes
    [abc]         Simple character class
    [a-zA-Z]      Character class with ranges
    [^abc]        Negated character class
NOTE: Incomplete ranges will be interpreted as "starts from zero" or "ends with last character".
Standard POSIX Character Classes
    [:alnum:]     Alphanumeric characters.
    [:alpha:]     Alphabetic characters.
    [:blank:]     Space and tab characters.
    [:cntrl:]     Control characters.
    [:digit:]     Numeric characters.
    [:graph:]     Characters that are printable and are also visible. (A space is printable, but not visible, while an 'a' is both.)
    [:lower:]     Lower-case alphabetic characters.
    [:print:]     Printable characters (characters that are not control characters).
    [:punct:]     Punctuation characters (characters that are not letters, digits, control characters, or space characters).
    [:space:]     Space characters (such as space, tab, and formfeed, to name a few).
    [:upper:]     Upper-case alphabetic characters.
    [:xdigit:]    Characters that are hexadecimal digits.
Non-standard POSIX-style Character Classes
    [:javastart:] Start of a Java identifier
    [:javapart:]  Part of a Java identifier
Predefined Classes
    .             Matches any character other than newline
    \w            Matches a "word" character (alphanumeric plus "_")
    \W            Matches a non-word character
    \s            Matches a whitespace character
    \S            Matches a non-whitespace character
    \d            Matches a digit character
    \D            Matches a non-digit character
Boundary Matchers
    ^             Matches only at the beginning of a line
    $             Matches only at the end of a line
    \b            Matches only at a word boundary
    \B            Matches only at a non-word boundary
Greedy Closures
    A*            Matches A 0 or more times (greedy)
    A+            Matches A 1 or more times (greedy)
    A?            Matches A 1 or 0 times (greedy)
    A{n}          Matches A exactly n times (greedy)
    A{n,}         Matches A at least n times (greedy)
    A{n,m}        Matches A at least n but not more than m times (greedy)
Reluctant Closures
    A*?           Matches A 0 or more times (reluctant)
    A+?           Matches A 1 or more times (reluctant)
    A??           Matches A 0 or 1 times (reluctant)
Logical Operators
    AB            Matches A followed by B
    A|B           Matches either A or B
    (A)           Used for subexpression grouping
    (?:A)         Used for subexpression clustering (just like grouping but no backrefs)
Backreferences
    \1            Backreference to 1st parenthesized subexpression
    \2            Backreference to 2nd parenthesized subexpression
    \3            Backreference to 3rd parenthesized subexpression
    \4            Backreference to 4th parenthesized subexpression
    \5            Backreference to 5th parenthesized subexpression
    \6            Backreference to 6th parenthesized subexpression
    \7            Backreference to 7th parenthesized subexpression
    \8            Backreference to 8th parenthesized subexpression
    \9            Backreference to 9th parenthesized subexpression
All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.
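The difference between greedy and reluctant closures is easiest to see side by side. The sketch below is an illustration only: it uses the standard java.util.regex package as a stand-in for RE (which is not part of the JDK), and the names ClosureSketch and firstMatch are hypothetical. The *? notation is the same in both, though other details may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT the RE class documented here.
final class ClosureSketch {

    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* takes as much as it can while still letting 'b' match.
        System.out.println(firstMatch("a.*b", "aXbXb"));  // prints aXbXb
        // Reluctant: .*? takes as little as it can before 'b' matches.
        System.out.println(firstMatch("a.*?b", "aXbXb")); // prints aXb
    }
}
```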
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence.
The following are recognized as line terminators:
RE runs programs compiled by the RECompiler class. For reasons of efficiency, however, the RE matcher class does not
include the actual regular expression compiler. You can construct a single RECompiler object and re-use it to compile
each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and
RECompiler are not thread-safe (for efficiency reasons, and because thread safety is deemed to be a rare requirement
for these classes), so you will need to construct a separate compiler or matcher object for each thread (unless
you do thread synchronization yourself). Once an expression has been compiled into an REProgram object, that REProgram
can be safely shared across multiple threads and RE objects.
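The sharing rule above (one compiled program, one matcher per thread) can be sketched with the JDK's own regex classes, which follow a comparable split: a compiled Pattern is immutable and shareable, while a Matcher is not thread-safe. This is an analogy under that assumption, not code against RE or REProgram; SharedProgramSketch and matchAll are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT RE/REProgram themselves.
final class SharedProgramSketch {

    // Compiled once and shared; plays the role the text assigns to REProgram.
    static final Pattern PROGRAM = Pattern.compile("a*b");

    // Matches each input on its own thread and reports whether it matched.
    static boolean[] matchAll(String[] inputs) {
        boolean[] results = new boolean[inputs.length];
        Thread[] workers = new Thread[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                // Each thread builds its own matcher (the per-thread RE analogue).
                Matcher own = PROGRAM.matcher(inputs[slot]);
                results[slot] = own.find();
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // join makes each worker's write visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return results;
    }

    public static void main(String[] args) {
        boolean[] r = matchAll(new String[] {"aaaab", "xyz", "b"});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints true false true
    }
}
```

With RE, the corresponding arrangement would presumably be a shared REProgram passed to a separate RE(REProgram) instance in each thread.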
ISSUES:
See Also:
RECompiler
Modifier and Type | Field and Description |
---|---|
static int | MATCH_CASEINDEPENDENT - Flag to indicate that matching should be case-independent (folded) |
static int | MATCH_MULTILINE - Newlines should match as BOL/EOL (^ and $) |
static int | MATCH_NORMAL - Specifies normal, case-sensitive matching behaviour. |
static int | MATCH_SINGLELINE - Consider all input a single body of text - newlines are matched by . |
static int | REPLACE_ALL - Flag bit that indicates that subst should replace all occurrences of this regular expression. |
static int | REPLACE_BACKREFERENCES - Flag bit that indicates that subst should replace backreferences |
static int | REPLACE_FIRSTONLY - Flag bit that indicates that subst should only replace the first occurrence of this regular expression. |
Constructor and Description |
---|
RE() - Constructs a regular expression matcher with no initial program. |
RE(REProgram program) - Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
RE(REProgram program, int matchFlags) - Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
RE(String pattern) - Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |
RE(String pattern, int matchFlags) - Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |
Modifier and Type | Method and Description |
---|---|
int | getMatchFlags() - Returns the current match behaviour flags. |
String | getParen(int which) - Gets the contents of a parenthesized subexpression after a successful match. |
int | getParenCount() - Returns the number of parenthesized subexpressions available after a successful match. |
int | getParenEnd(int which) - Returns the end index of a given paren level. |
int | getParenLength(int which) - Returns the length of a given paren level. |
int | getParenStart(int which) - Returns the start index of a given paren level. |
REProgram | getProgram() - Returns the current regular expression program in use by this matcher object. |
String[] | grep(Object[] search) - Returns an array of Strings containing the toString representations of the objects in search that match this regular expression. |
protected void | internalError(String s) - Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). |
boolean | match(CharacterIterator search, int i) - Matches the current regular expression program against the characters of a CharacterIterator, starting at a given index. |
boolean | match(String search) - Matches the current regular expression program against a String. |
boolean | match(String search, int i) - Matches the current regular expression program against a String, starting at a given index. |
protected boolean | matchAt(int i) - Match the current regular expression program against the current input string, starting at index i of the input string. |
protected int | matchNodes(int firstNode, int lastNode, int idxStart) - Try to match a string against a subset of nodes in the program |
void | setMatchFlags(int matchFlags) - Sets match behaviour flags which alter the way RE does matching. |
protected void | setParenEnd(int which, int i) - Sets the end of a paren level |
protected void | setParenStart(int which, int i) - Sets the start of a paren level |
void | setProgram(REProgram program) - Sets the current regular expression program used by this matcher object. |
static String | simplePatternToFullRegularExpression(String pattern) - Converts a 'simplified' regular expression to a full regular expression |
String[] | split(String s) - Splits a string into an array of strings on regular expression boundaries. |
String | subst(String substituteIn, String substitution) - Substitutes a string for this regular expression in another string. |
String | subst(String substituteIn, String substitution, int flags) - Substitutes a string for this regular expression in another string. |
public static final int MATCH_CASEINDEPENDENT
public static final int MATCH_MULTILINE
public static final int MATCH_NORMAL
public static final int MATCH_SINGLELINE
public static final int REPLACE_ALL
public static final int REPLACE_BACKREFERENCES
public static final int REPLACE_FIRSTONLY
public RE()
public RE(REProgram program)
Parameters:
program - Compiled regular expression program
See Also:
RECompiler
public RE(REProgram program, int matchFlags)
Parameters:
program - Compiled regular expression program (see RECompiler)
matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
    MATCH_NORMAL              // Normal (case-sensitive) matching
    MATCH_CASEINDEPENDENT     // Case folded comparisons
    MATCH_MULTILINE           // Newline matches as BOL/EOL
See Also:
RECompiler, REProgram
public RE(String pattern) throws RESyntaxException
Parameters:
pattern - The regular expression pattern to compile.
Throws:
RESyntaxException - Thrown if the regular expression has invalid syntax.
See Also:
RECompiler
public RE(String pattern, int matchFlags) throws RESyntaxException
Parameters:
pattern - The regular expression pattern to compile.
matchFlags - The matching style
Throws:
RESyntaxException - Thrown if the regular expression has invalid syntax.
See Also:
RECompiler
public int getMatchFlags()
    MATCH_NORMAL              // Normal (case-sensitive) matching
    MATCH_CASEINDEPENDENT     // Case folded comparisons
    MATCH_MULTILINE           // Newline matches as BOL/EOL
See Also:
setMatchFlags(int)
public String getParen(int which)
Parameters:
which - Nesting level of subexpression
public int getParenCount()
public final int getParenEnd(int which)
Parameters:
which - Nesting level of subexpression
public final int getParenLength(int which)
Parameters:
which - Nesting level of subexpression
public final int getParenStart(int which)
Parameters:
which - Nesting level of subexpression
public REProgram getProgram()
See Also:
setProgram(me.regexp.REProgram)
public String[] grep(Object[] search)
Parameters:
search - Array of Objects to search
protected void internalError(String s) throws Error
Parameters:
s - Error description
Throws:
Error
public boolean match(CharacterIterator search, int i)
Parameters:
search - String to match against
i - Index to start searching at
public boolean match(String search)
Parameters:
search - String to match against
public boolean match(String search, int i)
Parameters:
search - String to match against
i - Index to start searching at
protected boolean matchAt(int i)
Parameters:
i - The input string index to start matching at
protected int matchNodes(int firstNode, int lastNode, int idxStart)
Parameters:
firstNode - Node to start at in program
lastNode - Last valid node (used for matching a subexpression without matching the rest of the program as well).
idxStart - Starting position in character array
public void setMatchFlags(int matchFlags)
Parameters:
matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
    MATCH_NORMAL              // Normal (case-sensitive) matching
    MATCH_CASEINDEPENDENT     // Case folded comparisons
    MATCH_MULTILINE           // Newline matches as BOL/EOL
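To make the effect of the match behaviour flags concrete, here is a rough sketch using the closest equivalents in the standard java.util.regex package: Pattern.CASE_INSENSITIVE for case folding, Pattern.MULTILINE for newline-aware ^ and $, and Pattern.DOTALL for newlines matched by '.'. Treating these as counterparts of MATCH_CASEINDEPENDENT, MATCH_MULTILINE and MATCH_SINGLELINE is an assumption made for illustration, and the class MatchFlagSketch and helper found are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex flags, NOT the RE.MATCH_* constants.
final class MatchFlagSketch {

    // True if regex, compiled with the given flag bits, occurs in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Case folding (compare MATCH_CASEINDEPENDENT).
        System.out.println(found("abc", 0, "ABC"));                        // false
        System.out.println(found("abc", Pattern.CASE_INSENSITIVE, "ABC")); // true

        // ^ and $ at line boundaries (compare MATCH_MULTILINE).
        System.out.println(found("^second$", 0, "first\nsecond"));                 // false
        System.out.println(found("^second$", Pattern.MULTILINE, "first\nsecond")); // true

        // '.' matching a newline (compare MATCH_SINGLELINE).
        System.out.println(found("a.b", 0, "a\nb"));              // false
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb")); // true

        // Flag bits combine with bitwise OR.
        int both = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^SECOND$", both, "first\nsecond")); // true
    }
}
```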
protected final void setParenEnd(int which, int i)
Parameters:
which - Which paren level
i - Index in input array
protected final void setParenStart(int which, int i)
Parameters:
which - Which paren level
i - Index in input array
public void setProgram(REProgram program)
Parameters:
program - Regular expression program compiled by RECompiler.
See Also:
RECompiler, REProgram
public static String simplePatternToFullRegularExpression(String pattern)
Parameters:
pattern - The pattern to convert
public String[] split(String s)
Please note that the first string in the resulting array may be an empty string. This happens when the very first character of input string is matched by the pattern.
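The leading empty string mentioned in this note can be reproduced with a small sketch. It uses Pattern.split from the standard java.util.regex package as a stand-in, which shows the same effect when the input begins with a match; how RE.split treats other edge cases (for example trailing matches) is not covered here. SplitSketch and splitOn are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT RE.split itself.
final class SplitSketch {

    // Splits input around every match of regex.
    static String[] splitOn(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // The input starts with a match, so the first element is "".
        System.out.println(Arrays.toString(splitOn(",", ",a,b"))); // prints [, a, b]
        // No match at the start: no empty leading element.
        System.out.println(Arrays.toString(splitOn(",", "a,b")));  // prints [a, b]
    }
}
```

Callers that do not want the empty first element need to check for it explicitly.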
Parameters:
s - String to split on this regular expression
public String subst(String substituteIn, String substitution)
Parameters:
substituteIn - String to substitute within
substitution - String to substitute for all matches of this regular expression.
public String subst(String substituteIn, String substitution, int flags)
It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", a String to substituteIn of "visit us: http://www.apache.org!" and the substitution String "<a href=\"$0\">$0</a>", the resulting String returned by subst would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".
Note: $0 represents the whole match.
Parameters:
substituteIn - String to substitute within
substitution - String to substitute for matches of this regular expression
flags - One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY flag bit is set, only the first
occurrence of this regular expression is replaced. If the bit is not set (REPLACE_ALL), all
occurrences of this pattern will be replaced. If the flag REPLACE_BACKREFERENCES is set, all
backreferences will be processed.
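The $0 substitution example above can be checked with a short sketch. Since RE is not part of the JDK, this uses Matcher.replaceAll and Matcher.replaceFirst from java.util.regex, which also expand $0 in the replacement string. Mapping them to REPLACE_ALL with REPLACE_BACKREFERENCES and to REPLACE_FIRSTONLY is an assumption for illustration; SubstSketch, linkAll and linkFirst are hypothetical names.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT RE.subst itself.
final class SubstSketch {

    // The URL pattern from the text, written as a Java string literal.
    static final Pattern URL = Pattern.compile("http://[\\.\\w\\-\\?/~_@&=%]+");

    // $0 in the replacement stands for the whole match.
    static final String LINK = "<a href=\"$0\">$0</a>";

    // Wraps every URL in an anchor tag (compare REPLACE_ALL).
    static String linkAll(String text) {
        return URL.matcher(text).replaceAll(LINK);
    }

    // Wraps only the first URL (compare REPLACE_FIRSTONLY).
    static String linkFirst(String text) {
        return URL.matcher(text).replaceFirst(LINK);
    }

    public static void main(String[] args) {
        System.out.println(linkAll("visit us: http://www.apache.org!"));
        // prints visit us: <a href="http://www.apache.org">http://www.apache.org</a>!
        System.out.println(linkFirst("http://a.org and http://b.org"));
        // prints <a href="http://a.org">http://a.org</a> and http://b.org
    }
}
```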