A primer on Python Regular Expression
Regex is a group of characters which helps to find pattern within a string. Regex is used in lot of applications including the search engines, search and for find and replace in text documents
Being a Data Scientist it is good to know regex which is found useful in data cleaning since it helps to identify text of a given pattern and perform operations like replace, count etc.
Every character in a regular expression is either a meta-character or a literal. For example: In regex b.?, b is a literal and . & ? is a meta-character
List of widely used Meta-characters:
Meta-Characters | Description | |
---|---|---|
. | Matches any character except a newline | |
^ | Matches start of the String | |
$ | Matches end of the String | |
[] | Indicate a Set of Characters | |
[^] | Match any character except given | |
() | capture and group | |
Match either abc or def | ||
|Either escapes special characters or denotes special sequence |
Examples:
Regex | String | Results | |
---|---|---|---|
a.c | asc,apc,b6c,a5c | ‘asc’, ‘apc’, ‘a5c’ | |
(^a.+\d) | abad6c | ‘abad6’ | |
(\d.+b$) | abad6cb | ‘6cb’ | |
\d([a-z]+)\d | 197-63abd698-cdef | ‘abd’ | |
\w([^a-z]+)\w | abABCd | ‘ABC’ | |
(\w.+) (\w.+) | My hero | (‘My’, ‘hero’) | |
abc | def | 124-pdef9-24abc98 | ‘def’,’abc’ |
* | hgy*opt | * |
Quantifiers
A quantifier after a token or group specifies how often that a preceding element is allowed to occur.
The most common quantifiers are the question mark ?, the asterisk * and the plus sign +
List of common Quantifiers
Quantifier | Description |
---|---|
? | Match 0 or One repetitions of preceding RE |
* | Match 0 or more repetitions of preceding RE |
+ | Match 1 or more repetitions of preceding RE but not Zero. Will not match ac |
{n} | Matches exactly n copies of the preceding RE, it’s 2 here |
{m,n} | Match minimum of m and maximum of n repetitions of preceding RE |
Examples:
Regex | String | Results |
---|---|---|
ab?c | ac | ac |
de*fg | dfg | dfg |
ab+c | abbbc | abbbc |
\d([a-z]+)\d | 197-63abd698-cdef | ‘abd’ |
a{2} | 24abc98aa0 | aa |
c{3,4} | aacccbbb | ccc |
Special Sequence:
it contains an escape character() followed by the below literals which gives them a special meaning
List of commonly used special sequences
Meta-Characters | Description |
---|---|
\d | Matches decimal digits including [0-9] |
\D | Opposite of \d i.e. [^0-9]. Matches any character which is not decimal |
\w | [A-Za-z0-9_] |
\W | Matches any character which is not word character |
\s | Whitespace |
\S | Matches any character not whitespace |
\b | Matches empty string at start or end of a word |
\B | Matches empty string when not at start or end of a word |
Examples:
Regex | String | Results |
---|---|---|
\d | Absdd45sd23sd | ac |
\D | dfg | dfg |
\w | abbbc | abbbc |
\W | 197-63abd698-cdef | ‘abd’ |
\s | 24abc98aa0 | aa |
\S | aacccbbb | ccc |
\b | 24abc98aa0 | aa |
\B | aacccbbb | ccc |
How to use Regular Expression in Python?
First import the regular expression module
import re
Convert Regex Expression into Regex Object using Compile
Compile function is used to convert a regular expression into a regular expression object. Which can be used with the functions match(), find(), search () etc.
import re
regex_obj = re.compile('STRE')
regex_obj
Output: re.compile(r’stre’, re.UNICODE)
Now use this regular expression object to find pattern in a string
search_str = 'NEWYORK15AVE26STREET'
regex_obj.search(search_str)
We used search() function to find the pattern ‘STRE’ in search_str i.e. NEWYORK15AVE26STREET and the result is a Match Object
Output: <_sre.SRE_Match object; span=(14, 18), match=’STRE’>
Commonly Used Regex methods and attributes in Python
Search (Regex Extract)
It looks through the entire string and returns a match object where regex match is found or else returns a None
Lets find London in the given string using search() method
import re
expr = r'LONDON'
str = 'NEWYORK15AVE26STREET56DOWNINGAVE65 \
LONDON65MOUNTAINAVE23FIFTYSTREET11 LONDON'
re.search(expr,str)
It returns a match object because LONDON is found in the second line. So you have seen search() looks for the pattern in multi-line mode as well.
Output: <_sre.SRE_Match object; span=(41, 47), match=’LONDON’>
Note: span=(41,47) in the match object is the position of the string from 41 thru 46.
Match
This method matches the regular expression at the beginning of String only and returns a Match Object. Unlike Search() it doesn’t look for the expression in a new line.
if you run the same regular expression on the string above then it returns a None
We change the expression to ‘NEWYORK’
import re
expr = r'NEWYORK'
str = 'NEWYORK15AVE26STREET56DOWNINGAVE65 LONDON65MOUNTAINAVE23FIFTYSTREET11'
print(re.match(expr,str))
It returns a match object because NEWYORK matches the beginning of the string
Output: <_sre.SRE_Match object; span=(0, 3), match=’NEW’>
split
split is used to split the string using the compiled pattern. if no pattern is passed then space is used.
We are splitting this string ‘5A123B456C7’ on any character between A-Z
import re
re.split('[A-Z]+', '5A123B456C7')
Output: [‘5’, ‘123’, ‘456’, ‘7’]
split with flag ignore case
You can use the re.IGNORECASE flag to ignore the case for splitting, if I re-write the above expression like this it will give the same output
re.split('[a-z]+', '5A123B456C7',flags=re.IGNORECASE)
Output: [‘5’, ‘123’, ‘456’, ‘7’]
maxsplit
You can also specify the maximum number of split that you want for the given string.
We will just pass the maxsplit value as 1 in the above expression
re.split('[a-z]+', '5A123B456C7',1,flags=re.IGNORECASE)
It just splits the string one time only because we passed the maxsplit value = 1
Output: [‘5’, ‘123B456C7’]
Findall
It Returns all non-overlapping matches of pattern in string, as a list of strings
Find all the Characters after a number in the String
expr = r'\d([A-Z])'
str = '5A123B456C7'
re.findall(expr,str)
The output is A,B and C because A follows 5 and B follows 3 and C follows 6 in the string.
All these characters are returned as a list of strings
Output: [‘A’, ‘B’, ‘C’]
finditer
it returns an iterable match object.
Get all the characters from the string as an iterator
expr = r'\w'
str = '5A123B456C7'
itrobj=re.finditer(expr,str)
itrobj
Output: <callable_iterator at 0x255dbf58978>
Now lets iterate through this object and get all the characters of the string
[item.group() for item in itrobj]
Output: [‘5’, ‘A’, ‘1’, ‘2’, ‘3’, ‘B’, ‘4’, ‘5’, ‘6’, ‘C’, ‘7’]
Sub (Regex replace)
It replaces the matching pattern with the given value starting from left to right
We will replace all characters which are followed by a digit in the given string with the character ‘C’
expr = r'\d([A-Z])'
str = '5A123B456C7'
re.sub(expr,'C',str)
Output: ‘T12T45T7’
Maximum number of pattern occurrences to be replaced
Now we will use count parameter to define the maximum number of patterns found that needs to be replaced. will pass the count value = 2
# sub - replace
expr = r'\d([A-Z])'
str = '5A123B456C7'
re.sub(expr,'T',str, count=2)
You can see only the first two characters after a digit is replaced. the third character is not replaced and still it’s ‘C’
Output: ‘T12T456C7’
Subn
It returns a tuple with the replaced string and the number of replacement that is done
expr = r'\d([A-Z])'
str = '5A123B456C7'
re.subn(expr,'T',str,count=2)
So in the above case with count=2 it returns (‘T12T456C7’, 2)
Groups in Python regex
group returns the subgroups of the match. if there is a single argument then it returns a single value and if there are multiple argument then a tuple of matched strings are returned.
expr = r'(\w+) (\w+)'
str = '5A123B 4T56C7'
matchobj=re.match(expr,str)
Get Entire match subgroups
matchobj.groups()
it returns tuple of both the matched string i.e. the entire match subgroups
Output: (‘5A123B 4T56C7’)
Get the first matched group
matchobj.group(1)
Output: ‘5A123B’
Get the Second matched group
matchobj.group(1)
Output: ‘4T56C7’
Start and End in group
If you want to know the index of the match group in the string then use start for the starting position index of matched pattern and end for the ending position index
Here is an Address
Address = "12 Downing StreetBoulevard,Fischer Ave, Kraigal, PY"
We want to Read the Address after stripping Boulevard and using start and end group function
addmatch = re.search('Boulevard',Address)
Address[:addmatch.start()] + Address[addmatch.end():]
Output: ‘12 Downing Street,Fischer Ave, Kraigal, PY’
How to debug a Regular Expression?
debug function gives the debug statements for the regular expression like their ASCII characters and other details that help you to debug if the regular expression is syntactically correct or not
import re
re.compile('stre',re.DEBUG)
Output:
LITERAL 115 LITERAL 116 LITERAL 114 LITERAL 101 LITERAL 101
expr = r'\d{2}(.+)\d{2}'
# str = 'NEWYORK15AVE26STREET'
re.compile(expr,re.DEBUG)
Output:
MAX_REPEAT 2 2
IN
CATEGORY CATEGORY_DIGIT
SUBPATTERN 1 0 0
MAX_REPEAT 1 MAXREPEAT
ANY None
MAX_REPEAT 2 2
IN
CATEGORY CATEGORY_DIGIT
How to handle Error in Regular Expression?
An exception is raised when an invalid regular expression is passed to the function or when aan error occurs during compile.
Here we are passing \k an invalid sequence and it throws an exception when compiled.
# Error in Regex
expr = r'\d(\k)'
str = '5A123B456C7'
try:
rex=re.compile(expr)
except re.error as e:
print(e)
Output: bad escape \k at position 3
Conclusion:
In this detailed post on regex we have covered the critical python functions associated and used for regular expression.
Here are some of the highlights from this post: that we have touched in this post:
- What are metachracters, Quantifiers and special sequences in regex
- How Regex is used in python and what are the commonly used regex methods and functions in python
- Groups in Python Regex and Debugging a regular expression
- Error handling in Regular Expression
I would say regex is more of practice and understanding the concept and it’s implementation to use it like a pro