Anant Athale

Related Topics: Java Developer Magazine

Simplify Pattern Matching

Use java.util.regex

Pattern matching using "regular expressions" can help automate a number of text-processing operations such as search and replace, input validation, text conversion, and filtering. What otherwise requires significant amounts of code can be done in just a few lines with regular expressions, because the underlying regular expression engine does most of the work. Some programming languages such as Perl and operating system utilities such as grep have supported regular expressions for a number of years. Before J2SE 1.4, however, Java (J2SDK) didn't support them, and one had to use external packages such as Jakarta Regexp or IBM's commercial package (com.ibm.regex). Thankfully that changed with the introduction of the java.util.regex package, which provides a standard implementation for specifying and handling regular expressions. This article shows how you can quickly use it to implement pattern-based search features. It starts by reviewing some important regular expression fundamentals and then dives into the details of the package. The embedded examples demonstrate the important constructs through simple use cases.

What's a Regular Expression and Why It's Important
If you've used regular expressions in other languages, the following sections will introduce you to the Java flavor and help uncover some of the new features. If you're not familiar with regular expressions, you'll soon discover how to use them effectively to handle text processing in ways you never thought possible before.

A regular expression is a mechanism to specify a textual pattern and detect the presence of that pattern in a given character sequence. In other words, it's a pattern language. A regular expression pattern is typically specified as a combination of two types of characters: literals and meta-characters. Literals are normal text characters (a, b, c, 1, 2), while meta-characters (e.g., * and $) convey a special meaning to the regular expression engine; they are discussed in the next few sections. A regular expression engine understands the pattern language: it interprets the regular expression, performs the pattern match, and processes the results. The language and the engine together make regular expressions a powerful tool that simplifies pattern matching. Implementations such as java.util.regex and JRegex also provide additional query and utility functions (replace, split, etc.) that are useful for modifying the target text. For details about other Java implementations and implementations available in other languages, please consult the references section.
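To make the distinction between literals and meta-characters concrete, here is a minimal sketch using java.util.regex and the regex-aware methods of java.lang.String. The class name RegexBasics, the helper contains(), and the sample text are illustrative assumptions, not taken from the article's listings.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; class, method, and sample text are assumptions.
class RegexBasics {
    // Returns true if the pattern occurs anywhere in the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Neo met Trinity; Trinity met Morpheus";

        // A pattern made of literals only.
        System.out.println(contains("Trinity", text));   // true

        // '.' is a meta-character that matches any single character.
        System.out.println(contains("N.o", text));       // true

        // Utility operations that modify or break up the target text.
        System.out.println(text.replaceAll("Trinity", "Tank"));
        System.out.println(Arrays.toString(text.split(";\\s*")));
    }
}
```

Note that the backslash in \s has to be doubled inside a Java string literal, which is why the patterns quoted later in this article contain sequences such as \\b and \\d.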

Meta-Characters
Meta-characters give regular expressions their expressive power. I will discuss a frequently used subset of the meta-characters that Java supports. For a complete list, please consult Sun's API documentation (class java.util.regex.Pattern). A number of examples that use these meta-characters immediately follow this discussion.

Anchors
An anchor matches a predefined position in the target text rather than any characters. Anchors are similar to reference points and are used to determine the relative positions of other elements in the regular expression. They are typically used to match boundary positions (start or end of a string, line, or word), although they can also match any other position using the special "Lookaround" constructs shown in Listing 1. The Lookaround constructs match a position based on a given condition. A positive lookahead (?=Neo) matches a position that's immediately followed by the text 'Neo', whereas a negative lookahead (?!Neo) matches a position that is not immediately followed by the text 'Neo'. Lookbehind constructs (positive (?<=...), negative (?<!...)) work the same way in the opposite direction: they test the text immediately before the position.
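The following sketch shows the word-boundary anchor and the four Lookaround constructs in action. Since Listing 1 is not reproduced here, the class name LookaroundDemo, the helper findAll(), and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; names and sample text are assumptions.
class LookaroundDemo {
    // Collects every substring of text matched by regex.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Mr. Neo, Mr. Smith, Agent Smith";

        // Word boundary anchor: both occurrences of the whole word "Smith".
        System.out.println(findAll("\\bSmith\\b", text).size());        // 2

        // Positive lookahead: "Mr. " only where "Neo" follows.
        System.out.println(findAll("Mr\\. (?=Neo)", text).size());      // 1

        // Negative lookahead: "Mr. " only where "Neo" does not follow.
        System.out.println(findAll("Mr\\. (?!Neo)", text).size());      // 1

        // Positive lookbehind: "Smith" only where "Agent " precedes it.
        System.out.println(findAll("(?<=Agent )Smith", text).size());   // 1

        // Negative lookbehind: "Smith" only where "Agent " does not precede it.
        System.out.println(findAll("(?<!Agent )Smith", text).size());   // 1
    }
}
```

The lookaround conditions themselves consume no characters: in each case the matched text is only "Mr. " or "Smith", never the text that was tested.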

Character Classes, Class Shorthands and Alternation
A character class construct [...] specifies a list of characters, any one of which may appear at that position in the match, while the construct [^...] specifies a list of characters that must not appear there. In the case of [...] a match is considered successful if any one of the listed characters is found. For example, the regular expression [cw]ould matches instances of the words 'could' and 'would'. The class notation implies a logical OR condition between its elements. The same OR condition can be expressed for whole sub-expressions with "Alternation", written (x|y), where matching either x or y is considered a success. Therefore, the earlier regular expression could also be written as (c|w)ould.

Inside a class, the special meta-character - specifies a range of values, so the class [a-z] specifies all letters from a through z. A class shorthand is a simplified representation of a commonly used class, such as digit (\d), word character (\w), and whitespace (\s). A list of class shorthands available in Java is shown in Listing 1.
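As a quick check of the constructs above, the sketch below compares the class form and the alternation form of the same pattern and exercises a negated class, a range, and two shorthands. The class name CharClassDemo and the constant names are assumptions for illustration; String.matches() succeeds only if the pattern matches the entire string.

```java
// Illustrative sketch; class and constant names are assumptions.
class CharClassDemo {
    static final String CLASS_FORM = "[cw]ould";
    static final String ALTERNATION_FORM = "(c|w)ould";

    public static void main(String[] args) {
        // The class form and the alternation form accept the same words.
        System.out.println("could".matches(CLASS_FORM));        // true
        System.out.println("would".matches(ALTERNATION_FORM));  // true
        System.out.println("should".matches(CLASS_FORM));       // false

        // Negated class: any single character except c or w, then "ould".
        System.out.println("mould".matches("[^cw]ould"));       // true
        System.out.println("could".matches("[^cw]ould"));       // false

        // Range and shorthands: one lowercase letter followed by one digit,
        // then word characters, whitespace, and one more word character.
        System.out.println("x7".matches("[a-z]\\d"));           // true
        System.out.println("a_1 b".matches("\\w+\\s\\w"));      // true
    }
}
```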

Quantifiers
Quantifiers indicate how many instances of the element they are applied to are required for a successful match. Java supports three quantifier types: greedy, reluctant, and possessive. Greedy quantifiers try to match as much as possible, while their reluctant counterparts (with ? appended) try to match as little as is needed to fulfill a match. In practice, a greedy quantifier first consumes as much of the input as it can, often up to the end of the line, and then backtracks one character at a time until the rest of the pattern matches. That backtracking can turn into real performance overhead when the target text is big. Reluctant (or lazy) quantifiers start with the fewest characters allowed and expand only until the rest of the pattern matches, so they stop at the first successful match instead of running through the entire line. Possessive quantifiers (with + appended) behave like greedy ones but never give back what they have matched. Because they don't keep the prior match states around, they are useful in optimizing match operations, although they can also make a pattern fail where the greedy form would succeed. Listing 1 details all three types of quantifiers.
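The difference between the three quantifier types is easiest to see by running the same pattern in each form against the same text. The sketch below is an illustration only; the class name QuantifierDemo, the helper firstMatch(), and the sample markup are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; names and sample text are assumptions.
class QuantifierDemo {
    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>Neo</b> and <b>Trinity</b>";

        // Greedy: .* runs to the end of the input, then backtracks
        // to the LAST "</b>", so the whole text is matched.
        System.out.println(firstMatch("<b>.*</b>", text));

        // Reluctant: .*? expands only until the FIRST "</b>".
        System.out.println(firstMatch("<b>.*?</b>", text));   // <b>Neo</b>

        // Possessive: .*+ runs to the end of the input and never gives
        // anything back, so "</b>" can no longer match.
        System.out.println(firstMatch("<b>.*+</b>", text));   // null
    }
}
```

The possessive case shows why these quantifiers need care: they save the engine from storing backtracking states, but only help when the element that follows cannot overlap with what the quantifier consumes.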

Mode Modifiers
These are special constructs that turn certain powerful regex features 'on' or 'off.' The default mode for these features is 'off' since they involve additional overhead when doing a match. The use of (?i) in a regular expression, for example, turns on the case-insensitive match mode. Java also supports specifying the modes when the pattern is compiled, by passing flag constants (static final fields such as Pattern.CASE_INSENSITIVE) of the class java.util.regex.Pattern to Pattern.compile(). The Pattern class is discussed below in the java.util.regex section.
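Both ways of switching on a mode can be compared side by side. In the sketch below, the class name ModeModifierDemo and the field names are assumptions for illustration; Pattern.CASE_INSENSITIVE and Pattern.MULTILINE are constants of java.util.regex.Pattern and can be combined with the bitwise OR operator.

```java
import java.util.regex.Pattern;

// Illustrative sketch; class and field names are assumptions.
class ModeModifierDemo {
    // Mode switched on inside the expression itself.
    static final Pattern EMBEDDED = Pattern.compile("(?i)neo");

    // Same mode switched on with a flag when the pattern is compiled.
    static final Pattern FLAGGED = Pattern.compile("neo", Pattern.CASE_INSENSITIVE);

    // No modifier: the match is case-sensitive by default.
    static final Pattern PLAIN = Pattern.compile("neo");

    // Flags are combined with '|'. MULTILINE lets ^ and $ match at line breaks.
    static final Pattern MULTI =
            Pattern.compile("^neo$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    public static void main(String[] args) {
        System.out.println(EMBEDDED.matcher("NEO").matches());       // true
        System.out.println(FLAGGED.matcher("NEO").matches());        // true
        System.out.println(PLAIN.matcher("NEO").matches());          // false
        System.out.println(MULTI.matcher("Trinity\nNEO").find());    // true
    }
}
```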

Example 1: Input Validation
Let's now review an example that uses the meta-characters discussed so far to address the password validation needs at Zion Corporation. The security standards set at Zion require that passwords contain only alphanumeric characters, include at least one digit, and be between six and 32 characters long.

Listings 2 and 3 show two possible solutions to the same problem. The first approach (Listing 2) uses the built-in regular expression support inside the java.lang.String matches() method. The second approach (Listing 3) uses the classes provided by the java.util.regex package. The underlying mechanics are the same in either case and are discussed next. I'll leave the API specifics to the next section.

Let's see how the solution meets the specified requirements. The regular expression pattern on Line 3 (Listing 2) is the same as the Pattern pContent (Line 5, Listing 3). The pattern uses a combination of meta-characters, namely the character class [a-z], a class shorthand (\d, shorthand for the character class [0-9]), and greedy quantifiers (*, +). Put in the context of the solution, the pattern "\\b(?i)([a-z]*\\d+[a-z]*)\\b" is successful if, between the word boundaries, there are 0 or more letters followed by 1 or more digits followed by 0 or more letters. Note that, as written, this allows only a single run of digits, so a password such as a1b2c3 does not match it. The mode modifier (?i) indicates that the search is case-insensitive. Notice that there are a couple of differences between the regular expressions in the two listings. The obvious one is the use of comments in Listing 3. The other difference is more subtle but important. Did you find it? Check out the next section (Capturing, Grouping) to verify the answer.

The pattern on Line 4 (Listing 2) addresses the password-length requirement, using the {min,max} quantifier, which imposes minimum and maximum limits on the number of times the preceding element may repeat. In this case the pattern "\\b(?i)([a-z0-9]){6,32}\\b" is successful if there are between six and 32 alphanumeric characters between the word boundaries. Notice that in Listing 3 the case-insensitive option is specified using the final variables in the class Pattern, which makes the expression more readable. The variables are discussed further in the following sections.
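Since Listings 2 and 3 are not reproduced here, the following is only a sketch of how the two approaches might look, built from the two patterns quoted in the text. The class name PasswordCheck, the method names, and the single-pattern lookahead variant at the end are assumptions and are not the author's code.

```java
import java.util.regex.Pattern;

// Hypothetical reconstruction; not the article's Listings 2 and 3.
class PasswordCheck {
    // Patterns quoted in the article text.
    static final String CONTENT = "\\b(?i)([a-z]*\\d+[a-z]*)\\b";
    static final String LENGTH = "\\b(?i)([a-z0-9]){6,32}\\b";

    // Approach 1: regex support built into java.lang.String.
    static boolean isValidWithString(String password) {
        return password.matches(CONTENT) && password.matches(LENGTH);
    }

    // Approach 2: precompiled java.util.regex.Pattern objects, with the
    // case-insensitive mode given as a flag instead of (?i).
    private static final Pattern P_CONTENT =
            Pattern.compile("\\b([a-z]*\\d+[a-z]*)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern P_LENGTH =
            Pattern.compile("\\b([a-z0-9]){6,32}\\b", Pattern.CASE_INSENSITIVE);

    static boolean isValidWithPattern(String password) {
        return P_CONTENT.matcher(password).matches()
                && P_LENGTH.matcher(password).matches();
    }

    // Assumed alternative (not from the article): one pattern whose lookahead
    // requires a digit anywhere, so digits may appear in several places.
    private static final Pattern P_SINGLE =
            Pattern.compile("(?=[a-z]*\\d)[a-z0-9]{6,32}", Pattern.CASE_INSENSITIVE);

    static boolean isValidSinglePattern(String password) {
        return P_SINGLE.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidWithString("Trinity7"));    // true
        System.out.println(isValidWithPattern("Neo1"));       // false: too short
        System.out.println(isValidWithPattern("Trinity"));    // false: no digit
        System.out.println(isValidWithPattern("a1b2c3"));     // false: two digit runs
        System.out.println(isValidSinglePattern("a1b2c3"));   // true
    }
}
```

Precompiling the patterns, as in the second approach, avoids recompiling the expression on every call, which String.matches() does internally.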

More Stories By Anant Athale

Anant Athale is a senior software engineer at Motorola Labs. He specializes in enterprise Java technologies and is an active participant in the Java Community Process (JSR 262, 260). He is Sun certified and has a master's degree from Arizona State University.

Most Recent Comments
Kaarle 06/06/05 01:35:55 AM EDT

Could be an interesting article if the Listings 1... that the article refers to were included.

Regex Group 04/15/05 02:33:40 PM EDT

http://groups-beta.google.com/group/regex