Anant Athale

Java Developer : Article

Simplify Pattern Matching

Use java.util.regex

Taken together, the two regular expressions accept the passwords "010101" and "m0rpheus" and reject the password "agentsmith." The same result could have been achieved without regular expressions, but notice how they make the solution concise and elegant.

More Meta-Characters - Grouping, Capturing
The parentheses () are used for two functions, grouping and capturing.

They group the enclosed elements and capture the text matched by the enclosed sub-expression. The backreferences (\1, \2, etc.) allow the text captured by a group to be used again in the same regular expression. Groups are numbered by the position of their opening parenthesis, counting from left to right, and that number determines which backreference refers to which group. Java also provides access to the captured text outside the regular expression: in replacement strings, '$1' and '$2' refer to the text captured by groups 1 and 2 (the same text that '\1' and '\2' refer to inside the expression), and the group(n) method of the class java.util.regex.Matcher returns it directly.
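As a small illustration of backreferences and '$1', the sketch below finds a doubled word and collapses it. The class name, method names, and sample sentence are hypothetical and are not taken from the article's listings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example, not one of the article's listings.
class BackreferenceDemo {
    // (\w+) captures a word; \1 requires the same word again after whitespace.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Replaces each doubled word with a single copy; $1 is the text of group 1.
    static String collapseDoubledWords(String text) {
        return DOUBLED.matcher(text).replaceAll("$1");
    }

    // Returns the first doubled word through group(1), or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "follow the the white rabbit";
        System.out.println(firstDoubledWord(text));
        System.out.println(collapseDoubledWords(text));
    }
}
```

Inside the expression the captured word is reached with \1, in the replacement string with $1, and from Java code with group(1).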

The other type of parentheses, (?: ), known as grouping-only parentheses, group but do not capture any matched text. This is a useful construct when backreferences aren't required, and it can speed up the match operation because no captured text has to be preserved. It also answers the question asked earlier about the difference between the regular expressions in Listings 2 and 3. Notice that Listing 3 uses the non-capturing parentheses and is therefore more efficient.
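The difference between the two kinds of parentheses can be seen in the group count. The following is a minimal sketch with made-up patterns (the article's own expressions are in Listings 2 and 3); it shows only the bookkeeping and does not measure speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical patterns chosen for illustration only.
class GroupingOnlyDemo {
    // Both patterns accept the same strings; only the first group's capture differs.
    static final Pattern CAPTURING = Pattern.compile("(agent|mr)\\.(\\w+)");
    static final Pattern GROUPING_ONLY = Pattern.compile("(?:agent|mr)\\.(\\w+)");

    // Returns the name part, which is always the last capturing group.
    static String name(Pattern p, String text) {
        Matcher m = p.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return m.group(m.groupCount());
    }

    public static void main(String[] args) {
        System.out.println(CAPTURING.matcher("").groupCount());
        System.out.println(GROUPING_ONLY.matcher("").groupCount());
        System.out.println(name(GROUPING_ONLY, "agent.smith"));
    }
}
```

With (?: ) the alternation is still grouped, but the name becomes group 1 instead of group 2 and nothing is stored for the prefix.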

Package java.util.regex
The package is relatively small, consisting of two final classes, 'java.util.regex.Pattern' and 'java.util.regex.Matcher', and an exception class, 'java.util.regex.PatternSyntaxException'. Together these classes form Java's regular expressions framework.

An instance of the class 'Pattern' is the compiled representation of the specified regular expression string. The 'Matcher' object does the match operations on a specified character sequence and provides additional functions to access and use the results from the match. Decoupling the pattern definition from the pattern matcher lets the same pattern be used by multiple matchers. The 'Pattern' class also provides a static 'matches(String regex, CharSequence input)' method that can be used in cases where the pattern (and matcher) need not be preserved for reuse later. Notice that in Listing 3 the patterns 'pContent' and 'pLength' have been declared outside the method. This lets the pattern instances be reused across multiple invocations of the method and so is more efficient. This option isn't available when using the java.lang.String matches() method as shown in Listing 2, which compiles the expression on every call.
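The reuse described above can be sketched as follows. The field names pLength and pContent come from the text, but the rules themselves (6 to 8 characters, at least one digit) are an assumption made for this sketch, chosen only so that it accepts "010101" and "m0rpheus" and rejects "agentsmith"; the article's actual expressions are in Listings 2 and 3.

```java
import java.util.regex.Pattern;

// A sketch under assumed rules; not a reproduction of Listing 2 or Listing 3.
class PasswordRules {
    // Compiled once and shared by every call to isValid().
    private static final Pattern pLength = Pattern.compile(".{6,8}");    // assumed rule
    private static final Pattern pContent = Pattern.compile(".*\\d.*");  // assumed rule

    // Reuses the precompiled patterns; only a new Matcher is created per call.
    static boolean isValid(String password) {
        return pLength.matcher(password).matches()
                && pContent.matcher(password).matches();
    }

    // One-off variant: both convenience methods compile the expression on every call.
    static boolean isValidOneOff(String password) {
        return Pattern.matches(".{6,8}", password)
                && password.matches(".*\\d.*");
    }

    public static void main(String[] args) {
        String[] samples = {"010101", "m0rpheus", "agentsmith"};
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + isValid(samples[i]));
        }
    }
}
```

Both variants return the same answers; the difference is only in how often the expressions are compiled.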

The compile() methods in the 'Pattern' class accept a regular expression string, verify its syntax, and compile it. An unchecked exception, 'java.util.regex.PatternSyntaxException', is thrown if the syntax is invalid. The mode modifiers discussed earlier can be specified as flags at pattern compile time using the 'Pattern' class's static final variables (CASE_INSENSITIVE, DOTALL, etc.). Multiple flags can be combined using the bitwise OR (|) operator as shown in the pattern pLength (Line 23, Listing 3).
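A short sketch of both points, syntax checking and combined flags, is given below. The expression "neo.smith" and the helper names are hypothetical and serve only as an illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helpers for illustration.
class CompileDemo {
    // Reports whether the expression compiles; the exception is unchecked,
    // so catching it is optional and done here only to turn it into a boolean.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Two flags combined with |: ignore case, and let '.' match line terminators.
    static boolean matchesWithFlags(String text) {
        Pattern p = Pattern.compile("neo.smith",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(text).matches();
    }

    // The same expression compiled without flags.
    static boolean matchesWithoutFlags(String text) {
        return Pattern.compile("neo.smith").matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("(\\d"));   // unclosed group
        System.out.println(matchesWithFlags("NEO\nSmith"));
        System.out.println(matchesWithoutFlags("NEO\nSmith"));
    }
}
```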

The 'Matcher' class allows multiple ways to do a match and query the results. The matches() method checks if the entire input string matches the regular expression and returns a boolean. The find() method checks for an instance of the regular expression anywhere in the input string and can be re-invoked to check for further instances. These are the two most commonly used search methods, although the class provides additional methods, such as lookingAt(), that are appropriate under special conditions.
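The difference between the search methods can be sketched with a single digit pattern. The class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers for illustration.
class MatchVsFindDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the whole input must be matched by the expression.
    static boolean wholeInputIsDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    // find(): each call continues after the previous match.
    static int countDigitRuns(String s) {
        Matcher m = DIGITS.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // lookingAt(): the match must start at the beginning but need not reach the end.
    static boolean startsWithDigits(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputIsDigits("2140AD"));
        System.out.println(startsWithDigits("2140AD"));
        System.out.println(countDigitRuns("a1b22c333"));
    }
}
```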

Similarly, there are a number of result query methods. The group() and group(int i) methods are the ones used most often. The group() method provides access to the text matched by the previous match operation, while the group(int i) method returns the text captured by the ith group (capturing parentheses). The regular expression "(\\d)([A-C])" when applied with find() to the string "2140AD," for example, would make group(1) return "0," group(2) return "A," and group(0) return "0A". Notice that group(0) always returns the entire text matched and is equivalent to group().
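The "2140AD" example can be checked with a few lines of code. The helper below is a hypothetical convenience that collects group(0) through the last group after one find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class GroupDemo {
    // Returns group(0)..group(n) of the first match, or an empty array if none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("(\\d)([A-C])", "2140AD");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + ") = " + g[i]);
        }
    }
}
```

Note that find() is needed here: matches() would fail because the expression covers only two of the six characters.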

Finally, let's discuss some of the text-replacement functions in the Matcher class and then look at an example that uses them. The replaceAll(String newText) method replaces all instances of the text matched by the regular expression with the new text while replaceFirst(String newText) replaces just the first instance. The advanced replace operations appendReplacement() and appendTail() together offer fine-grained control over how the replace is done. Their use is shown in the example in Listing 4.
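Before turning to the advanced operations, the two simple replace methods can be sketched as follows. The masking character and the sample password are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helpers for illustration.
class ReplaceDemo {
    private static final Pattern DIGIT = Pattern.compile("\\d");

    // Replaces every digit with '#'.
    static String maskAllDigits(String text) {
        return DIGIT.matcher(text).replaceAll("#");
    }

    // Replaces only the first digit with '#'.
    static String maskFirstDigit(String text) {
        return DIGIT.matcher(text).replaceFirst("#");
    }

    public static void main(String[] args) {
        System.out.println(maskAllDigits("m0rph3us"));
        System.out.println(maskFirstDigit("m0rph3us"));
    }
}
```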

Example 2: Text Conversion
So far we've looked at examples that do validation and determine if the input text meets the requirements. Now let's look at an extension of the previous password example that modifies the target text to make it meet the requirements. Zion's security policy now requires that the passwords be encrypted in some way (fudged) before they're sent across the wire. Let's look at a simple text-conversion utility shown in Listing 4 that uses regular expressions to reverse all the digits in the input text. The idea is not to write a sophisticated encryption algorithm but to demonstrate some of the advanced regular expression features.

The program uses a simple regular expression (Line 5) to match each instance of the digit class in the password. The find() method searches for the digits and is called repeatedly to iterate through the matches (Line 7). The group(1) method returns the text captured by the capturing parentheses (\\d) in the regular expression, i.e., the digit found by the current match. Each digit is then reversed using the specified array and appended to the new password StringBuffer. The appendReplacement method takes care of copying the text found between the matches before it appends the replacement, while appendTail appends the text that remains after the last match. The password "010101" would be sent as "989898."
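Since Listing 4 is not reproduced here, the following is a reconstruction sketched from the description above, not the original listing. It assumes that "reversing" a digit means mapping 0 to 9, 1 to 8, and so on, which is consistent with "010101" becoming "989898"; the class, method, and array names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A reconstruction under stated assumptions; not the article's Listing 4.
class DigitFlipper {
    private static final Pattern DIGIT = Pattern.compile("(\\d)");

    // Assumed mapping: index d holds the replacement for digit d.
    private static final String[] FLIPPED =
            {"9", "8", "7", "6", "5", "4", "3", "2", "1", "0"};

    static String flipDigits(String password) {
        Matcher m = DIGIT.matcher(password);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int digit = Integer.parseInt(m.group(1));
            // Copies the text since the previous match, then the replacement.
            m.appendReplacement(out, FLIPPED[digit]);
        }
        // Copies whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flipDigits("010101"));
        System.out.println(flipDigits("m0rpheus"));
    }
}
```

Unlike replaceAll(), this loop can compute a different replacement for every match, which is what the per-digit lookup needs.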

Example 3
A couple of patterns shown in Listing 5 demonstrate the use of regular expressions in matching e-mail and Web addresses. The e-mail pattern (Line 1, Listing 5) matches addresses that end with matrix.com, matrix.net, or matrix.org. It also matches the person's name that follows the address. For example, the pattern matches the word 'Trinity' along with the e-mail address in tn@matrix.com (Trinity). The URL pattern matches the hostname followed by optional path names as in http://www.zion.com/antimatrix.html. You'll note the extra 'Pattern.MULTILINE' option argument. It lets the anchors ^ and $ match at the start and end of every line in the input, so an anchored pattern can match a URL on any line of multiline text. By default the anchors match only at the start and end of the entire input.
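Listing 5 is not reproduced here, so the patterns below are assumptions written to fit the description, not the article's own expressions. They illustrate one possible e-mail pattern with a trailing name and an anchored URL pattern that depends on Pattern.MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical patterns; the article's actual expressions are in Listing 5.
class AddressPatterns {
    // Group 1: address at matrix.com/.net/.org; group 2: name in parentheses.
    private static final Pattern EMAIL = Pattern.compile(
            "([\\w.]+@matrix\\.(?:com|net|org))\\s*\\((\\w+)\\)");

    // Same URL expression, with and without MULTILINE.
    private static final String URL_REGEX = "^http://([\\w.-]+)(/\\S*)?$";
    private static final Pattern URL_PER_LINE =
            Pattern.compile(URL_REGEX, Pattern.MULTILINE);
    private static final Pattern URL_WHOLE_INPUT = Pattern.compile(URL_REGEX);

    // Returns {address, name} or null when nothing matches.
    static String[] emailAndName(String text) {
        Matcher m = EMAIL.matcher(text);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    // Returns the hostname of a URL that occupies a full line, or null.
    static String hostOnAnyLine(String text) {
        Matcher m = URL_PER_LINE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Without MULTILINE the URL must be the entire input.
    static boolean foundWithoutMultiline(String text) {
        return URL_WHOLE_INPUT.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] parts = emailAndName("tn@matrix.com (Trinity)");
        System.out.println(parts[0] + " / " + parts[1]);
        String text = "see below\nhttp://www.zion.com/antimatrix.html\nend";
        System.out.println(hostOnAnyLine(text));
        System.out.println(foundWithoutMultiline(text));
    }
}
```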

Conclusion
Regular expressions are handy when writing pattern-matching programs like Internet form validations, converting text to HTML or vice versa, parsing documents, help programs, etc. With the introduction of a regular expressions package in Java, it's become more convenient to use them in a wide variety of applications without having to rely on external packages. We saw some of the commonly used regular expression constructs supported by Java and their use in a number of examples. However, the scope of this article limits the depth and number of constructs discussed. For a more detailed discussion please use the references mentioned at the end.

Note that the regular expression constructs in Java may have a slightly different meaning in other languages that support regular expressions (like Perl, .NET, etc.) and therefore regular expressions may not be entirely portable across languages.

Listing 1 shows a list (subset) of regular expression meta-characters that Java supports. Listing 2 shows the password validation program that uses java.lang.String. Listing 3 shows the password validation program that uses java.util.regex. Listing 4 shows a text conversion utility using java.util.regex.

References

  • JavaDoc J2SE 1.4.2 API, Sun Microsystems, Inc., (http://java.sun.com/j2se/1.4.2/docs/api/index.html)
  • Mastering Regular Expressions, 2nd Edition, Jeffrey E. F. Friedl, (www.oreilly.com/catalog/regex/)
  • Java Performance Tuning, 2nd Edition, Jack Shirazi, (www.oreilly.com/catalog/javapt2/index.html)

    Anant Athale is a senior software engineer at Motorola Labs. He specializes in enterprise Java technologies and is an active participant in the Java Community Process (JSR 262, 260). He is Sun certified and has a master's degree from Arizona State University.

    Most Recent Comments
    Kaarle 06/06/05 01:35:55 AM EDT

    Could be an interesting article if the Listings 1... that the article refers to were included.

    Regex Group 04/15/05 02:33:40 PM EDT

    http://groups-beta.google.com/group/regex