Regular Expressions in Python

4.2.3 Regex functions

search( pattern, string[, flags])
If pattern matches any part of string, a MatchObject is returned; otherwise None is returned. (Note: a match can be empty, that is, it need not contain any characters of the given string. For example, these patterns match any string: r'', r'y*'.) Here's an example of using search():
import re

result = re.search(r'\w+@\w+\.com', 'long text xyz@xyz.com long')
if result:
    print "contains email!"
    print result.group()
else:
    print "no!"

Note: write the pattern as a raw string such as r'...'; otherwise every backslash in it must be doubled. For example, to search for a sequence of tabs in text, use re.search(r'\t+', text) or re.search('\\t+', text).
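To make the difference concrete, here is a small sketch. The sample text and variable names are made up for illustration. Without the r prefix, Python turns '\b' into a backspace character before the regex engine sees it, so the word-boundary pattern silently fails to match.

```python
import re

text = 'a cat sat'  # sample text, chosen only for illustration

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw_result = re.search(r'\bcat\b', text)

# Plain string: Python converts '\b' to a backspace character first, so the
# pattern looks for backspace + 'cat' + backspace and finds nothing.
plain_result = re.search('\bcat\b', text)

# Doubling the backslashes gives the same pattern as the raw string.
escaped_result = re.search('\\bcat\\b', text)

print(raw_result.group())      # cat
print(plain_result)            # None
print(escaped_result.group())  # cat
```

print() is called with a single argument here so that the snippet runs under both Python 2 and Python 3.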

The optional third argument (flags) modifies the meaning of the given pattern. The flags can be any of I, L, M, S, U, X. They can be combined with the | operator. For example, re.search(pat, string, re.M|re.U) searches string with both the MULTILINE and UNICODE flags in effect. Each flag is detailed below.
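As a quick illustration of combining flags with |, the following sketch runs the same pattern with no flags, with one flag, and with two. The sample text and variable names are assumptions made for the example.

```python
import re

text = 'The cat\nthe dog\nA theory'  # sample text, for illustration only

# No flags: ^ anchors only at the start of the string, matching is case-sensitive.
no_flags = re.findall(r'^the', text)

# re.I alone: case-insensitive, but still only the start of the string.
ignore_case = re.findall(r'^the', text, re.I)

# re.I | re.M: case-insensitive, and ^ also matches after each newline.
both = re.findall(r'^the', text, re.I | re.M)

print(no_flags)     # []
print(ignore_case)  # ['The']
print(both)         # ['The', 'the']
```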

I
IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.
L
LOCALE
Make \w, \W, \b, and \B dependent on the current locale. The "current locale" is the one in effect for the running program, as set with locale.setlocale() from the standard locale module.
M
MULTILINE
When specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
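A minimal sketch of the difference, using a made-up two-line string that ends in a newline:

```python
import re

text = 'first line\nsecond line\n'  # sample text; note the trailing newline

# Default: ^ matches only at the very start of the string; $ matches at the
# very end and just before the final newline.
starts_default = re.findall(r'^\w+', text)
ends_default = re.findall(r'\w+$', text)

# With re.M, ^ and $ also match at the start and end of every line.
starts_multi = re.findall(r'^\w+', text, re.M)
ends_multi = re.findall(r'\w+$', text, re.M)

print(starts_default)  # ['first']
print(ends_default)    # ['line']
print(starts_multi)    # ['first', 'second']
print(ends_multi)      # ['line', 'line']
```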
S
DOTALL
Make the "." special character match any character at all, including a newline; without this flag, "." will match anything except a newline.
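For instance, in this small sketch (sample text made up for illustration) the pattern spans a line break only when the flag is given:

```python
import re

text = 'start\nend'  # two lines joined by a newline

# Without the flag, "." refuses to match the newline, so the search fails.
without_flag = re.search(r'start.end', text)

# With re.S, "." matches the newline too.
with_flag = re.search(r'start.end', text, re.S)

print(without_flag)                # None
print(with_flag.group() == text)   # True
```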
U
UNICODE
Make \w, \W, \b, and \B dependent on the Unicode character properties database. For Example:
result=re.search(r'\w+', u'真善美αβγ!',re.U)
if result:
    print result.group().encode('utf8')
else:
    print "no match"

Note that Python's re module also allows Unicode characters in the pattern string. Just be sure to put the unicode prefix 'u' on the pattern string. For example:

result=re.findall(ur'善+', u'真善美αβγ!',re.U)
print result[0].encode('utf8')
X
VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash. When a line contains a "#" that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such "#" through the end of the line are ignored.
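As an illustration, the e-mail pattern from the search() example can be spread over several commented lines. This is only a sketch; the variable names are made up.

```python
import re

# The same pattern as r'\w+@\w+\.com', written in verbose form. Whitespace
# and "#" comments inside the pattern are ignored by the regex engine.
email_pattern = r'''
    \w+      # user name
    @        # literal at-sign
    \w+      # domain name
    \.com    # literal ".com"
'''

result = re.search(email_pattern, 'long text xyz@xyz.com long', re.X)
found = result.group()
print(found)  # xyz@xyz.com
```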
match( pattern, string[, flags])
match() is like search(), except that the match must start at the beginning of string. For example, re.search('me','somestring') matches, while re.match('me','somestring') returns None.

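Note: match() is not exactly equivalent to search() with "^". With the MULTILINE flag, "^" can match after any newline, while match() still matches only at the very beginning of the string. For example: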

 re.search(r'^B', 'A\nB',re.M) # succeeds
 re.match(r'B', 'A\nB',re.M)   # fails
split( pattern, string[, maxsplit = 0])
Returns a list of the substrings obtained by splitting string at each match of pattern. For example:
re.split(r' +', 'what   do  you think')
# returns ['what', 'do', 'you', 'think']

If the boundary pattern is enclosed in parentheses, then the text it matches is also included in the returned list. For example:

re.split(r'( +)', 'what   do  you think')
# returns ['what', '   ', 'do', '  ', 'you', ' ', 'think']

If there is more than one capturing group in pattern, the text of every group is included in the returned list, in sequence. For example:

re.split(r'( +)(@+)', 'what   @@do  @@you @@think')
# returns ['what', '   ', '@@', 'do', '  ', '@@', 'you', ' ', '@@', 'think']

If the optional maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list.
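A short sketch of maxsplit, reusing the sample string from the examples above (the variable names are made up):

```python
import re

text = 'what   do  you think'

# Split at every run of spaces.
unlimited = re.split(r' +', text)

# Stop after two splits; the rest of the string stays in one piece.
two_splits = re.split(r' +', text, maxsplit=2)

print(unlimited)   # ['what', 'do', 'you', 'think']
print(two_splits)  # ['what', 'do', 'you think']
```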

findall( pattern, string[, flags])
Return a list of all non-overlapping matches of pattern in string. For example:
re.findall(r'@+', 'what   @@@do  @@you @think')
# returns ['@@@', '@@', '@']
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. For example:
re.findall(r'( +)(@+)', 'what   @@@do  @@you @think')
# returns [('   ', '@@@'), ('  ', '@@'), (' ', '@')]
Empty matches are included in the result unless they touch the beginning of another match. For example:
re.findall(r'\b', 'what   @@@do  @@you @think')
# returns ['', '', '', '', '', '', '', '']
finditer( pattern, string[, flags])
Like findall(), except that it returns an iterator that yields a MatchObject for each match. It is typically used in a loop. For example:
for matched in re.finditer(r'(\w+)', 'what   do  you think'):
    print matched.group()
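Because each item is a match object rather than a plain string, the position of every match is available as well. A minimal sketch, with a made-up variable name:

```python
import re

text = 'what   do  you think'

# Collect the matched text together with its start and end offsets.
words = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\w+', text)]

print(words)
# [('what', 0, 4), ('do', 7, 9), ('you', 11, 14), ('think', 15, 20)]
```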
sub( pattern, repl, string[, count])
Returns the string obtained by replacing each match of pattern in string with the replacement repl. If the pattern isn't found, string is returned unchanged. Any “\number” in repl is replaced by the text captured by the corresponding group in pattern (that is, the sub-pattern enclosed in the number-th pair of parentheses). For example:
newstr=re.sub(r'([^-]+)--(.+)$', r'\1--Me, not \2','"what do you mean?" --A Sage')
# returns: "what do you mean?" --Me, not A Sage
repl can also be a function for more complicated replacement. When a match is found, the function is called and its return value used as the replacement string. For example:
def fun(matchObj):
    if matchObj.group(0) == '--A Sage':
        return '--Me'
    else:
        return '--Some Joe'

newstr=re.sub(r'--.+$', fun,'"what do you mean?" --xyz')
print newstr       # prints: "what do you mean?" --Some Joe

The first argument pattern may be a string or a regex object. To specify regular expression flags, use a regex object, or embed the flags in the pattern by writing “(?iLmsux)” (any subset of these letters) at the beginning of it. For example, re.sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'. (Python 2.7 and later also accept an optional flags argument after count.) (See regex pattern syntax for detail.)
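Both approaches can be sketched side by side. The variable names here are made up, and re.compile() is the standard way to build a regex object.

```python
import re

# Compile the pattern once with the IGNORECASE flag, then call sub() on
# the resulting regex object.
b_run = re.compile(r'b+', re.I)
via_object = b_run.sub('x', 'bbbb BBBB')

# The same effect with the flag embedded at the start of the pattern.
via_inline = re.sub(r'(?i)b+', 'x', 'bbbb BBBB')

print(via_object)  # x x
print(via_inline)  # x x
```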

The optional argument count is the maximum number of pattern occurrences to be replaced.
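For example, in this small sketch (sample string made up for illustration) only the first two runs of "@" are replaced when count is 2:

```python
import re

text = 'a@@b@c@@@d'  # sample text with three runs of "@"

replace_all = re.sub(r'@+', '#', text)
replace_two = re.sub(r'@+', '#', text, count=2)

print(replace_all)  # a#b#c#d
print(replace_two)  # a#b#c@@@d
```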

In addition to character escapes and backreferences as described above, "\g<name>" will use the substring matched by the group named "name", as defined by the (?P<name>...) syntax. "\g<number>" uses the corresponding group number; "\g<2>" is therefore equivalent to "\2", but isn't ambiguous in a replacement such as "\g<2>0". "\20" would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character "0". The backreference "\g<0>" substitutes in the entire substring matched by the pattern.
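A brief sketch of both forms. The group names first and last and the sample strings are made up for illustration.

```python
import re

# Named groups, referenced by name in the replacement.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                 r'\g<last>, \g<first>',
                 'Xah Lee')

# \g<1>0 means "group 1, then a literal 0"; writing \10 instead would be
# read as a reference to group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1 b2')

print(swapped)  # Lee, Xah
print(padded)   # a10 b20
```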

subn( pattern, repl, string[, count])
Perform the same operation as sub(), but return a tuple: (new_string, number_of_subs_made).
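A minimal sketch showing the returned tuple, with and without a match (the variable names are made up):

```python
import re

# Three runs of "@" are replaced, so the count is 3.
new_text, n_subs = re.subn(r'@+', '#', 'what   @@@do  @@you @think')

# No match: the string comes back unchanged and the count is 0.
unchanged = re.subn(r'z', '#', 'abc')

print(new_text)   # what   #do  #you #think
print(n_subs)     # 3
print(unchanged)  # ('abc', 0)
```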
escape( string)
Return string with a backslash character "\" inserted in front of every non-alphanumeric character. This is useful if you want to use a given string as a pattern for an exact match, so that no character in the string is interpreted with its special regex meaning.
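The following sketch shows why escaping matters when the text to find contains regex metacharacters. The sample strings and variable names are made up.

```python
import re

literal = 'a.b*c'  # contains the regex metacharacters "." and "*"

# Used directly, "." and "*" keep their regex meaning, so unrelated text matches.
loose = re.search(literal, 'axbbc')

# After escape(), only the exact characters a.b*c match.
exact_miss = re.search(re.escape(literal), 'axbbc')
exact_hit = re.search(re.escape(literal), 'see a.b*c here')

print(loose.group())      # axbbc
print(exact_miss)         # None
print(exact_hit.group())  # a.b*c
```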
exception error
Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.
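One common use is checking a pattern before relying on it. The helper is_valid_pattern below is hypothetical, written only for this sketch; it relies on re.compile() raising re.error for a malformed pattern.

```python
import re

def is_valid_pattern(pattern):
    """Hypothetical helper: True if pattern compiles, False otherwise."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

valid = is_valid_pattern(r'(\w+)')
invalid = is_valid_pattern(r'(\w+')   # unmatched parenthesis

# A failed search is not an error; it simply returns None.
no_match = re.search(r'xyz', 'abc')

print(valid)     # True
print(invalid)   # False
print(no_match)  # None
```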

Page created: 2005-04, by Xah Lee.
For copyright and terms, see terms.html