Introduction
Regular expressions are a powerful tool used for searching, matching and manipulating strings based on patterns. To use it the following steps can be followed:
- Import the regex module with
import re
. - Create a Regex object with the
re.compile()
function. (Remember to use a raw string.) - Pass the string to search into the Regex object’s
search()
method. This returns a Match object. - Call the Match object’s
group()
method to return the first found string of the actual matched text.
import re
>>> phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}') # check for a phone number
>>> mo = phone_regex.search('My number is 312-156-8940.') # get match objects
>>> print(mo.group()) # print match objects
'312-156-8940'
Pattern matching capabilities
Grouping with parenthesis
Adding parenthesis to the regular expression will create groups. The different
groups can be returned with the group()
method, specifying the group number
(1-indexed) to return.
>>> phone_regex = re.compile(r'(\d{3})-(\d{3}-\d{4})') # added parenthesis
>>> mo = phone_regex.search('My number is 312-156-8940.')
>>> mo.group(1)
'312'
>>> mo.group(2)
'156-8940'
>>> mo.group() # returns all as a string
'312-156-8940'
>>> mo.groups() # returns tuple
('312', '156-8940')
Other functionalities
Regex | Returns | Functionality |
---|---|---|
r'Batman|Robin' | There's Batman and Robin -> Batman | Looks for either of the given substrings |
r'Bat(man|mobile|copter)' | Batmobile lost a wheel -> Batmobile | The first string + any of the substrings in parenthesis |
r'Bat(wo)?man' | Batman and Batwoman -> Batman | The substring before ? can be zero or once |
r'Bat(wo)*man' | Batwowoman and Batwoman -> Batwowoman | The substring before * can be zero or more times |
r'Bat(wo)+man' | Batman and Batwoman -> batwoman | The substring before + has to be once or more |
r'(ha){3}' | hahahaha -> hahaha | Match the pattern {n} times |
r'(ha){2,4}' | hahahahaha -> hahahaha | Match pattern within range {n,m} , always takes higher range by default (greedy) |
r'(ha){2,4}?' | hahahahaha -> haha | Match pattern within range {n,m} , takes lower range (non-greedy) |
search() vs findall()
While search()
will return a match object of the first matched text in the
search string, the findall()
method will return a list of every match in the
string. If there are groups in the regex, findall()
will return a list of
tuples.
>>> phoneNumRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
>>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo.group()
'415-555-9999'
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']
Character Classes
Shorthand codes for common character classes
Shorthand character class | Represents |
---|---|
\d | Any numeric digit from 0-9 |
\D | Any character that is not a numeric digit |
\w | Any letter, numeric digit, or the underscore character. (like matching “word” characters) |
\W | Any character that is not a letter, numeric digit, or the underscore character |
\s | Any space, tab, or newline character (“space” characters) |
\S | Any character that is not a space, tab, or newline |
Creating own character classes
To create character classes, the characters or character ranges can be specified inside square brackets, for example:
r'[aeiouAEIOU]'
for vowelsr'[a-zA-Z]'
for lower case and upper case lettersr'[0-5.]'
for characters from 0 to 5 and period.
A negative character class can also be created, i.e. a class to exclude the
speficied characters. To do this, the caret character (^
) is used at the
beginning of the class. For example:
r'[^aeiouAEIOU]'
for any character that is not a vowel
The caret and dollar signs
The caret ^
is also used to indicate that the match must occur at the
beginning of the searched text. Likewise, the dollar sign $
character
indicates that the string must end with the regex pattern.
# starts with
>>> begginsw = re.compile(r'^Hello')
>>> begginsw.search('Hello world')
<re.Match object; span=(0, 5), match='Hello'>
# ends with
>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<re.Match object; span=(16, 17), match='2'>
# begins and ends with
>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<re.Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
Wildcard character
- The dot
.
is a wildcard and will match anything except for a new line. .*
will match everything and anything except for a new line. Use case below.- To match also the new line character, pass
re.DOTALL
as a second argument inre.compile()
# match everything and anything
>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'
# include new line
>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.').group()
'Serve the public trust.\nProtect the innocent.'
Case insensitive matching
To ignore casing pass the re.IGNORECASE
or re.I
as a second argument to
re.compile()
Substituting strings
The sub()
method for a regex object can be used to replace text patterns by
another one. It takes as arguments the string to replace any matches and the
string.
# using another word
>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'
# using the same word to replace
>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent
Eve knew Agent Bob was a double agent.')
A**** told C**** that E**** knew B**** was a double agent.'
Multi-line regex
To improve readability and break complex regex into separate lines, the
re.VERBOSE
argument can be passed to re.compile()
. This would allow a regex
as follows:
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
\d{3} # first 3 digits
(\s|-|\.) # separator
\d{4} # last 4 digits
(\s*(ext|x|ext.)\s*\d{2,5})? # extension
)''', re.VERBOSE)
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
re.compile()
only takes two arguments, so in order to use multiple
functionalities in the same object the pipe |
character can be used:
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
Cheat Sheet
Regex | Functionality |
---|---|
? | matches zero or one of the preceding group. |
* | matches zero or more of the preceding group. |
+ | matches one or more of the preceding group. |
{n} | matches exactly n of the preceding group. |
{n,} | matches n or more of the preceding group. |
{,m} | matches 0 to m of the preceding group. |
{n,m} | matches at least n and at most m of the preceding group. |
{n,m}? , *? , +? | performs a non-greedy match of the preceding group. |
^spam | means the string must begin with spam. |
spam$ | means the string must end with spam. |
. | matches any character, except newline characters. |
\d , \w , \s | match a digit, word, or space character, respectively. |
\D , \W , \S | match anything except a digit, word, or space character, respectively. |
[abc] | matches any character between the brackets (such as a, b, or c). |
[^abc] | matches any character that isn’t between the brackets. |