Sunday, July 20, 2008

How to Use Regular Expression Classes in the .Net Framework

Programming Language: Any .NET Language
Type: Tutorial
Level: Everyone
Topic Category: General Programming
Main Series: Regular Expressions in the .NET Framework
Topic Title:
How to Use Regular Expression Classes in the .Net Framework
My last article on wikiHow:
How to Use Regular Expression Classes in the .Net Framework - wikiHow

Many beginning programmers spend a lot of time hand-coding search-and-replace logic, even though the .NET Framework offers a powerful set of regular expression classes for performing a wide range of tasks on text. This article introduces the regular expression classes in the .NET Framework and shows how to use them to perform those tasks more easily and systematically.

Steps
  1. Decide the specific use of regular expressions you need. Regular expressions are normally used for one of the following:
    1. Check if a string has a certain pattern within it
    2. Validate a string against a pattern
    3. Replace a specific pattern with another pattern
    4. Split a string using a pattern delimiter
    5. Find all occurrences of a certain pattern within a string
    6. Extract pattern pieces/groups from a string (like in syntax checking/highlighting)
  2. Design the regular expression you want to use. See a tutorial on regular expressions or visit an online library of regular expressions.
  3. Decide which regular expression matching options you want to use. The settings you need to decide on are:
    1. Case sensitivity of matching
    2. Whether you want to ignore any white spaces within the regular expression while matching or not
    3. Whether the matching is multiline or not (multiline mode changes the meaning of ^ and $ so that they match the beginning and end of each line, not only the beginning and end of the whole string)
    4. The direction of the matching process (left-to-right or right-to-left)
    5. Whether . matches every character including the newline, or every character except the newline
    6. Whether to compile the regular expression to MSIL code (slower startup, faster matching) or to interpret it (faster startup, slower matching)
    7. Whether to ignore culture variance/change or not
    8. Whether to use ECMA Script compliant mode or not
    9. Whether to capture every group (any sub-expression within parentheses) or only groups that are named
  4. Declare a RegexOptions value and combine all the options you chose (using the bitwise OR operator "|"). This step is optional.
  5. Choose the member method you want to use. This depends on your choice in step 1. You have the following choices:
    • IsMatch() - if you only need to check whether a match was found or not
    • Match() - if you want to get the first match found. To get the following matches, call NextMatch() on the returned Match object; calling Match() again with the same input returns the first match again.
    • Matches() - if you want to retrieve all matches of the pattern in one call
    • Split() - if you want to split the string at the matches of the pattern
    • Replace() - if you want to replace the matches of the pattern with another pattern or string
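  6. Decide whether you want to use the static version of the method or the instance version. Static methods do not require you to create a Regex object, but they take the pattern and options on every call. An instance is constructed once from the pattern and options and can then be reused for many calls, which suits a pattern that is applied repeatedly. According to your choice, follow these steps: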
    • For static versions of the methods:
      1. Declare an appropriate reference to hold the results of the operation if necessary. Here is a list of the methods and the type of results they return:
        • IsMatch() - bool
        • Match() - Match
        • Matches() - MatchCollection
        • Replace() - string
        • Split() - string[]
      2. Call the appropriate method on the Regex class, passing it the input string, the regular expression, the replacement pattern in the case of Replace(), and the options (in that order), and assign the result to the reference you declared in the previous step.
    • For instance versions of the methods:
      1. Create a Regex object passing the regular expression you created, and the matching options to the constructor.
      2. Declare an appropriate result reference to hold the results obtained (like step 1 in the static version of the method)
      3. Call the method you chose on the Regex object, passing it the string to be matched, and assign the result to the reference created in the previous step.
  7. Use the results obtained at the end of step 6 in the rest of your code. Usual uses of the results are listed below in the "Common Uses of Regex Methods Results" section.
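The steps above can be sketched in C# roughly as follows. This is only an illustration under stated assumptions: the class name RegexStepsSketch, its method names, and the patterns are hypothetical and were chosen for the example; only Regex, RegexOptions, MatchCollection and their members come from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names and patterns are illustrative only.
public static class RegexStepsSketch
{
    // Steps 2 to 4: an example pattern and an options value combined with "|".
    private const string WordPattern = @"\b(?<word>[a-z]+)\b";
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Multiline;

    // Step 6, instance version: construct the Regex once and reuse it.
    private static readonly Regex WordRegex = new Regex(WordPattern, Options);

    // Step 6, static version: the caller creates no Regex object.
    // Argument order is input, pattern, options.
    public static bool ContainsWord(string input)
    {
        return Regex.IsMatch(input, WordPattern, Options);
    }

    // Instance Matches(): all matches in one call.
    public static int CountWords(string input)
    {
        MatchCollection matches = WordRegex.Matches(input);
        return matches.Count;
    }

    // Static Split(): split at commas with optional surrounding whitespace.
    public static string[] SplitOnCommas(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    // Static Replace(): input, pattern, replacement.
    public static string MaskDigits(string input)
    {
        return Regex.Replace(input, @"\d", "#");
    }
}
```

For example, RegexStepsSketch.CountWords("Hello big World") counts three words, and RegexStepsSketch.MaskDigits("tel 42") yields "tel ##". Holding the Regex in a static readonly field is one possible design choice for a pattern that is applied many times; for a one-off check the static method is usually enough.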
Common Uses of Regex Methods Results
  • IsMatch(): returns a boolean value that is usually used in:
    • A single conditional construct such as the "if .. else" code construct or the "?:" operator construct. This is usually when you want to check whether a pattern exists anywhere in a string to decide whether to take some action (for example, checking whether the text of a post contains an offensive word to decide whether to allow the post or ban it altogether).
    • A looping construct such as a while or do ... while loop. To iterate through all matches, loop on the Success property of the Match object returned by the instance version of Match() and advance with NextMatch() for as long as there are matches. IsMatch() in a loop is sometimes used with streamed text, to check for a pattern as the text arrives through the stream.
    • Validation of controls. Usually this is done by binding some property of the control to the result (for example, keeping a text box disabled as long as the regex cannot find a match in the string within the text box).
  • Replace(): returns a string with the pattern replaced with the replacement pattern or string. The resulting string is usually used in place of the original string. Some examples of the use of Replace() are:
    • Replacing all offensive words in a post with special characters such as ! or #
    • Replacing all HTML markup with HTML code that displays the markup instead of executing it (for example, replacing < with &lt; and > with &gt;)
    • Encrypting strings (permutation encryption)
    • Replacing special characters with other escaped values (for example replacing \ with \\)
  • Split(): returns a string array with all tokens after splitting the original string at the patterns found. This is usually used in code parsing.
  • Match(): returns a Match object that has information on the match found by the last call of Match(). Match objects contain information on capture groups and captures within the single match. Example uses of the Match object result and the Match() method:
    • Syntax highlighting.
    • Storing the matches found in some data storage facility such as a database.
    • Performing more complex replacing on the string by calculating the replacement of each capture.
    • Detailed finding of matches within a long text.
  • Matches(): returns a MatchCollection object which is actually a collection of Match objects (you may think of it as a typed ArrayList of Match objects). This method is a call-once alternative to Match() so it has the same uses.
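A minimal sketch of how these results are typically consumed is shown below. The class RegexResultsSketch, its method names, and the key=value pattern are assumptions made for illustration; Match.Success, Match.NextMatch(), Match.Groups, and the Replace() overload that takes a MatchEvaluator are members of the real Regex classes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the pattern matches pairs such as "width=10".
public static class RegexResultsSketch
{
    private static readonly Regex PairRegex =
        new Regex(@"(?<key>\w+)=(?<value>\d+)");

    // IsMatch(): a boolean used directly in a conditional expression.
    public static string Classify(string input)
    {
        return PairRegex.IsMatch(input) ? "has pairs" : "no pairs";
    }

    // Match(): walk the matches one at a time with Success and NextMatch().
    public static List<string> KeysOneByOne(string input)
    {
        var keys = new List<string>();
        Match m = PairRegex.Match(input);
        while (m.Success)
        {
            keys.Add(m.Groups["key"].Value); // named capture group
            m = m.NextMatch();
        }
        return keys;
    }

    // Matches(): the same matches retrieved in a single call.
    public static int SumValues(string input)
    {
        int sum = 0;
        foreach (Match m in PairRegex.Matches(input))
        {
            sum += int.Parse(m.Groups["value"].Value);
        }
        return sum;
    }

    // Replace() with a MatchEvaluator: each replacement is computed
    // from the groups of the match it replaces.
    public static string DoubleValues(string input)
    {
        return PairRegex.Replace(input, m =>
            m.Groups["key"].Value + "=" +
            (2 * int.Parse(m.Groups["value"].Value)));
    }
}
```

With the input "width=10, height=20", KeysOneByOne returns "width" and "height", SumValues returns 30, and DoubleValues returns "width=20, height=40".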
Tips
  • All matching options are turned off by default (that is, if you don't specify any of the options or use RegexOptions.None)
  • If you don't understand one of the regular expression matching options, leave it turned off.
  • You can use Regex.Escape() and Regex.Unescape() methods to escape/un-escape special regex characters within a regular expression
  • To get the regular expression that was passed to the constructor of a Regex object, use the ToString() method. The method is overridden in Regex so that it returns the regular expression.
  • Thoroughly examine your regular expression before using it in a production-environment application. Regular expressions can be very tricky. Look at the "Related websites and online tutorials" section for further reading on regular expressions.
  • Use lookahead and lookbehind expressions wisely. They are hard to write and cost a lot of processing.
  • If your regular expression is large, very complex, backtracks heavily, or is intended to process large amounts of data, consider using a compiled version of the regular expression. In the .NET Framework you can compile regular expressions into an assembly with the Regex.CompileToAssembly() method.
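Some of these tips can be illustrated with a short sketch. The class RegexTipsSketch and its method names are hypothetical. Note also that the sketch uses the RegexOptions.Compiled option, which compiles the pattern in memory when the Regex is constructed; it is shown here as a simpler stand-in and is not the same thing as the Regex.CompileToAssembly() method mentioned in the last tip.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class illustrating a few of the tips.
public static class RegexTipsSketch
{
    // Regex.Escape(): treat arbitrary text as a literal inside a pattern,
    // so characters such as "." lose their special meaning.
    public static bool ContainsLiteral(string input, string literal)
    {
        string pattern = Regex.Escape(literal);
        return Regex.IsMatch(input, pattern);
    }

    // ToString() returns the pattern that was passed to the constructor.
    public static string PatternOf(Regex regex)
    {
        return regex.ToString();
    }

    // RegexOptions.Compiled: slower construction, faster matching.
    // Assumption: used here in place of Regex.CompileToAssembly().
    public static Regex BuildCompiled(string pattern)
    {
        return new Regex(pattern, RegexOptions.Compiled);
    }
}
```

For instance, ContainsLiteral("price: 1x50", "1.50") is false because the escaped "." no longer matches an arbitrary character, while the unescaped pattern "1.50" would have matched.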
Warnings
  • Passing an invalid regular expression to the constructor of Regex or to one of the static methods in the Regex class will throw an exception (an ArgumentException), so wrap such calls in try ... catch blocks when the pattern is not known to be valid.
  • Extremely large regular expressions might cause the machine running them to run out of memory, so avoid using them.
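The first warning can be handled along the following lines. This is a minimal sketch: the class RegexSafetySketch and the method TryCreate are hypothetical names, and returning a boolean instead of rethrowing is just one possible design.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper that validates a pattern before it is used.
public static class RegexSafetySketch
{
    // Returns true and the constructed Regex when the pattern parses;
    // returns false and null when the pattern is invalid or null.
    public static bool TryCreate(string pattern, out Regex regex)
    {
        try
        {
            regex = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // An invalid pattern such as "(unclosed" ends up here.
            regex = null;
            return false;
        }
    }
}
```

A caller might use it when the pattern comes from user input, for example a search box, and show an error message when TryCreate returns false.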
Things You Will Need
  • A .NET or Mono compliant compiler and programming language.
  • Prior knowledge of regular expressions. See the "Related Websites and Online Tutorials" section for further information on regular expressions.
  • A regular expressions toolbox application like RegexBuddy will help you test your regular expressions. This is optional.
  • A .NET integrated development environment will help you write code more easily. This is optional.
Related Websites and Online Tutorials