Using Regular Expressions in Python: A Brief Guide
This article will provide a brief guide to using regular expressions in Python.
Regular expressions are effective tools for pattern matching and text processing. A regex, or regular expression, is a sequence of characters that forms a search pattern.
You can use a regex to check whether a string contains a particular search pattern. Regular expressions are supported by many programming languages, including Python, which provides a powerful and flexible regular expression engine that can handle a wide range of text-matching tasks.
What Are Regular Expressions?
Regular expressions (RegEx) are character sequences that define a search pattern used to locate a string or group of strings, such as finding all email addresses in a document or validating the format of a phone number. A regex can tell whether a piece of text matches a specific pattern, and it can also be divided into one or more sub-patterns.
Python supports regular expressions through the built-in re module. Its most basic operation is a search, which takes a regular expression and a string and returns either the first match or None. Regular expressions are widely used in text editors, command-line utilities, and programming languages.
Regular expressions consist of two types of characters:
- Literals: These are characters that match themselves. For example, the letter "a" will match the letter "a" in a text string.
- Metacharacters: These are special characters that have a special meaning. For example, the dot (.) metacharacter matches any single character.
Using Regular Expressions in Python
Python provides a built-in module called "re" that provides regular expression support. This module provides several functions for working with regular expressions, including searching for matches, replacing matches, and splitting a string into a list of substrings based on a pattern.
Regular expression syntax also includes several special characters (metacharacters) that can be combined to create complex patterns. Here are some of the most commonly used special characters in regular expressions:
Character | Description
. | Matches any single character except the newline (\n). For example, the regular expression he.. will match "hell", "help", etc.
* | Matches zero or more occurrences of the preceding character. For example, the regular expression a* will match zero or more occurrences of the letter "a".
+ | Matches one or more occurrences of the preceding character. For example, the regular expression a+ will match one or more occurrences of the letter "a".
? | Matches zero or one occurrence of the preceding character. For example, the regular expression colou?r will match both "color" and "colour".
{m,n} | Matches the preceding character between m and n times. For example, the regular expression a{2,3} will match either "aa" or "aaa".
[] | Matches any single character within the brackets. For example, the regular expression [aeiou] will match any lowercase vowel.
\ | Removes the special meaning of the character that follows it. For example, the regular expression \. will match a literal period.
^ | Matches at the start of the string. For example, the regular expression ^hello will match only if the string starts with "hello".
$ | Matches at the end of the string. For example, the regular expression hello$ will match only if the string ends with "hello".
| | Alternation (either/or). For example, the regular expression suman|ritik will match if the string contains either "suman" or "ritik".
Let’s look at some of the most important metacharacters in detail:
. – Dot
The dot (.) matches any single character except the newline character (\n). For instance:
- a.b matches an a, followed by any one character, followed by a b, so it is found in strings such as acb, acbd, and abbb.
- .. matches any string that has at least two consecutive characters.
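The dot examples above can be checked with a short sketch. The sample strings beyond acb, acbd, and abbb are assumptions added for illustration.

```python
import re

# "a.b" means: an "a", then any one character (except newline), then a "b".
pattern = r"a.b"

samples = ["acb", "acbd", "abbb", "ab", "a\nb"]

# Record whether the pattern is found anywhere in each sample string.
dot_results = {s: bool(re.search(pattern, s)) for s in samples}

for sample, found in dot_results.items():
    print(repr(sample), found)
```

The last two samples fail for different reasons: "ab" has no character between the a and the b, and in "a\nb" the character in between is a newline, which the dot skips by default.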
* – Star
The star (*) matches zero or more occurrences of the regex that comes before it. For instance:
- ab*c matches in the strings ac, abc, abbbc, and dabc, but not in abdc, because there the b is not followed by c.
+ – Plus
The plus (+) matches one or more occurrences of the regex that comes before it. For instance:
- ab+c matches in the strings abc, abbc, and dabc, but not in ac (there is no b) or abdc (the b is not followed by c).
? – Question Mark
The question mark (?) matches zero or one occurrence of the regex that comes before it. For instance:
- ab?c matches in the strings ac, acb, and dabc. It does not match abbc, because there are two b's, or abdc, because the b is not followed by c.
{m,n} – Braces
Braces match the preceding regex repeated anywhere from m to n times, inclusive. For instance:
- a{2,4} matches in the strings aaab, baaac, and gaad, but not in abc or bc, because those contain only one a or none at all. Write the quantifier without a space: with a space, as in a{2, 4}, Python's re module treats the braces as literal characters.
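To compare the four quantifiers side by side, here is a minimal sketch that runs each one against the same strings. The pattern ab{2,3}c and the sample list are assumptions chosen for illustration, not taken from the examples above.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc", "abdc"]

# Each pattern differs only in how many "b" characters it allows
# between the "a" and the "c".
patterns = [r"ab*c", r"ab+c", r"ab?c", r"ab{2,3}c"]

# Map each pattern to the sample strings in which it is found.
quantifier_matches = {
    pattern: [s for s in samples if re.search(pattern, s)]
    for pattern in patterns
}

for pattern, hits in quantifier_matches.items():
    print(pattern, "->", hits)
```

None of the patterns match abdc, because in every case the run of b characters must be followed directly by c.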
[] – Square Brackets
Square brackets ([]) define a character class: a set of characters, any one of which may match. The character class [abc], for instance, matches a single a, b, or c.
A - symbol between two characters inside the square brackets specifies a range of characters. For instance:
- [0-3] is the same as [0123].
- [a-c] is the same as [abc].
A caret (^) at the start of the character class negates it. For instance:
- [^0-3] matches any character other than 0, 1, 2, or 3.
- [^a-c] matches any character that is not an a, b, or c.
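A short sketch of character classes in practice. It uses re.findall, which returns every non-overlapping match as a list; the input strings are assumptions for illustration.

```python
import re

# A set of characters: any one lowercase vowel.
vowels = re.findall(r"[aeiou]", "regular")

# A range: any one of the digits 0, 1, 2, or 3.
low_digits = re.findall(r"[0-3]", "2024-05-13")

# A negated class: any character that is NOT a, b, or c.
not_abc = re.findall(r"[^a-c]", "abcdef")

print(vowels)
print(low_digits)
print(not_abc)
```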
\ – Backslash
The backslash (\) removes the special meaning of the character that follows it, so it works as an escape for metacharacters. For example, the dot (.) is one of the metacharacters (as shown in the table above), so a pattern containing a bare dot matches any character. To search for a literal dot in a string, put a backslash before it (\.). The example below shows the difference.
Code:
import re
s = 'suman.singh'
# without using \
match = re.search(r'.', s)
print(match)
# using \
match = re.search(r'\.', s)
print(match)
Output:
<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(5, 6), match='.'>
| – Or Symbol
The or symbol (|) matches if the pattern either before or after it is present in the string. For instance:
- a|b matches any string that contains either a or b, such as acd, bcd, and abcd.
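Here is a small sketch of alternation. The sample strings are made up for illustration, and the second pattern uses a non-capturing group, (?:...), which is not covered above; it is included only to show how parentheses limit the scope of |.

```python
import re

samples = ["suman singh", "hello ritik", "nobody here"]

# Without parentheses, | separates the two whole alternatives.
alternation_hits = [s for s in samples if re.search(r"suman|ritik", s)]

# Inside a group, | applies only to the alternatives within the parentheses,
# so this matches "gray" or "grey" but not "groy".
grey_hits = re.findall(r"gr(?:a|e)y", "gray grey groy")

print(alternation_hits)
print(grey_hits)
```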
Special Sequences
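A special sequence is a backslash followed by a letter. Some special sequences, such as \A, \b, and \Z, specify the position in the search string where the match must take place rather than matching an actual character; others, such as \d, \s, and \w, are shorthand for commonly used character classes. They make it simpler to write patterns that are used frequently.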
Special Sequences List
Special Sequence | Description | Example
\A | Matches only if the pattern that follows appears at the start of the string. | \Afor -> for suman
\b | Matches at a word boundary. \b(string) looks for the string at the beginning of a word, and (string)\b looks for it at the end of a word. | \bsu -> suman
\B | The opposite of \b: matches only where there is no word boundary, so the pattern must not begin or end a word. | \Bge -> together
\d | Matches any decimal digit; for ASCII text this is equivalent to the set class [0-9]. | \d -> 1526
\D | Matches any character that is not a digit; this is equivalent to the set class [^0-9]. | \D -> suman
\s | Matches any whitespace character. | \s -> sum an
\S | Matches any non-whitespace character. | \S -> s uman
\w | Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]. | \w -> 3425
\W | Matches any character that is not alphanumeric or an underscore. | \W -> >$
\Z | Matches only if the pattern that precedes it appears at the end of the string. | an\Z -> suman
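The sketch below tries a few of the sequences from the table. The sample text and the variable names are assumptions for illustration; re.finditer is used to get the position of each match.

```python
import re

text = "Call 555 0199 before 9am"

# \d+ matches each run of one or more digits.
digit_runs = re.findall(r"\d+", text)

# \A and \Z anchor the pattern to the start and end of the string.
starts_with_call = bool(re.search(r"\ACall", text))
ends_with_am = bool(re.search(r"am\Z", text))

# \b requires a word boundary before "ge"; \B requires that there is none.
phrase = "get together"
at_word_start = [m.start() for m in re.finditer(r"\bge", phrase)]
inside_word = [m.start() for m in re.finditer(r"\Bge", phrase)]

print(digit_runs, starts_with_call, ends_with_am)
print(at_word_start, inside_word)
```

In "get together", \bge finds only the "ge" that begins "get", while \Bge finds only the one in the middle of "together".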
Basic Regular Expression Operations
1. Searching for Matches
The most basic operation in regular expressions is searching for a match in a string. The "re" module provides the "search" function for this purpose. Here is an example of how to use the "search" function to find a pattern in a string:
Code:
import re
text = "Suman Raghav and Ron are friends"
pattern = "friends"
result = re.search(pattern, text)
if result:
    print("String Pattern Found")
else:
    print("String Pattern not Found")
This code will output "String Pattern Found" because the pattern "friends" is found in the text.
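Beyond a found/not-found check, the match object returned by "search" carries the matched text and its position. The following is a minimal sketch using the same text; the pattern R\w+ is an assumption chosen for illustration, and re.match is shown only for contrast.

```python
import re

text = "Suman Raghav and Ron are friends"

# re.search scans the whole string and returns the first match (or None).
found = re.search(r"R\w+", text)
if found:
    print(found.group())  # the matched text
    print(found.span())   # (start, end) indexes of the match

# re.match only tries to match at the very start of the string,
# so the same pattern finds nothing here and returns None.
at_start = re.match(r"R\w+", text)
print(at_start)
```

Because "search" returns None when nothing matches, it is safest to test the result before calling methods such as group on it.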
2. Replacing Matches
Another common operation in regular expressions is replacing matches in a string. The "re" module provides the "sub" function for this purpose. Here is an example of how to use the "sub" function to replace a pattern in a string:
Code:
import re
text = "Suman Raghav and Ron are friends"
pattern = "friends"
replacement = "students"
result = re.sub(pattern, replacement, text)
print(result)
This code will output "Suman Raghav and Ron are students" because the pattern "friends" is replaced with "students" in the original text.
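The pattern passed to "sub" can be any regular expression, not just a fixed word. A brief sketch, with input strings invented for illustration:

```python
import re

messy = "Suman   Raghav  and \t Ron"

# Collapse every run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", messy)

# The optional count argument limits how many matches are replaced.
first_only = re.sub(r"a", "A", "banana", count=1)

print(tidy)
print(first_only)
```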
3. Splitting a String Based on a Pattern
The "re" module can also be used to split a string into a list of substrings based on a pattern. The split function is used for this purpose. Here is an example of how to use the "split" function to split a string based on whitespace characters:
Code:
import re
text = "Suman Raghav and Ron are friends"
result = re.split(r"\s", text)
print(result)
This code will output ['Suman', 'Raghav', 'and', 'Ron', 'are', 'friends'] because the string is split at each whitespace character.
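One detail worth knowing is that \s splits at every single whitespace character, so consecutive spaces produce empty strings in the result. The sketch below, using made-up input, shows the difference that \s+ makes and also splits on a character class.

```python
import re

spaced = "Suman  Raghav"  # two spaces between the names

# Splitting on a single whitespace character leaves an empty string
# between the two adjacent spaces.
single = re.split(r"\s", spaced)

# Splitting on one or more whitespace characters avoids that.
collapsed = re.split(r"\s+", spaced)

# Any pattern can be the separator, for example a comma or semicolon
# followed by optional whitespace.
colours = re.split(r"[,;]\s*", "red, green;blue")

print(single)
print(collapsed)
print(colours)
```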
4. Regular Expression Flags
Regular expressions in Python support flags that modify the behavior of the regular expression engine. Flags are passed through the optional flags argument of the regular expression functions. Some of the most widely used flags are listed below:
- re.IGNORECASE or re.I: Makes the regular expression case-insensitive.
- re.MULTILINE or re.M: Allows the ^ and $ metacharacters to match the beginning and end of each line in a multiline string rather than just the beginning and end of the entire string.
- re.DOTALL or re.S: Makes the dot (.) metacharacter match any character, including a newline character (\n).
- re.ASCII or re.A: Makes \w, \W, \b, \B, \d, \D, \s, and \S match ASCII characters only instead of the full Unicode range.
Here is an example of how to use the IGNORECASE flag to make a regular expression case-insensitive:
Code:
import re
text = "Suman has a brown coloured bag."
pattern = "BROWN"
result = re.search(pattern, text, re.IGNORECASE)
if result:
    print("String Pattern Found")
else:
    print("String Pattern not Found")
This code will output "String Pattern Found": although the pattern "BROWN" is in uppercase and the text contains "brown" in lowercase, the IGNORECASE flag makes the search case-insensitive.
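The other flags in the list can be tried in the same way. The following sketch uses a two-line string invented for illustration and also shows that flags can be combined with the | operator.

```python
import re

text = "first line\nsecond line"

# By default, ^ matches only at the start of the whole string.
default_starts = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches at the start of each line.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# By default, the dot does not match the newline between the two lines.
dot_default = re.search(r"line.second", text)

# With re.DOTALL, the dot matches the newline as well.
dot_all = re.search(r"line.second", text, re.DOTALL)

# Flags can be combined with the | operator.
combined = re.findall(r"^SECOND", text, re.IGNORECASE | re.MULTILINE)

print(default_starts, multiline_starts)
print(dot_default, dot_all)
print(combined)
```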
5. Grouping and Capturing
Regular expressions in Python also support the grouping and capturing of substrings within a match. Grouping is achieved using parentheses (()). Each group captures the part of the match inside its parentheses, and the captured text can be accessed using the "group" method of the match object.
Here is an example of how to use grouping and capturing in regular expressions:
Code:
import re
text = "Suman Singh (sumansingh@example.com) wrote an email"
pattern = r"(\w+@\w+\.\w+)"
result = re.search(pattern, text)
if result:
    print("Email address validated: " + result.group(1))
else:
    print("Email address not validated")
This code will output "Email address validated: sumansingh@example.com" because the regular expression matches the email address in the text and captures it using a group.
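A pattern can contain more than one group. As a sketch, the email pattern can be split into two groups, one for the user name and one for the domain; the group names user and domain in the second pattern are assumptions chosen for illustration.

```python
import re

text = "Suman Singh (sumansingh@example.com) wrote an email"

# Two numbered groups: the part before the @ and the part after it.
result = re.search(r"(\w+)@(\w+\.\w+)", text)
if result:
    print(result.group(0))  # the whole match
    print(result.group(1))  # first group
    print(result.group(2))  # second group
    print(result.groups())  # all groups as a tuple

# Named groups use the (?P<name>...) syntax and can be read by name.
named = re.search(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)", text)
if named:
    print(named.group("user"), named.groupdict())
```

Group 0 always refers to the entire match, while groups 1 and up follow the order of the opening parentheses in the pattern.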
Conclusion
- For text processing and pattern matching, regular expressions are an effective tool.
- The "re" module in Python provides a flexible and powerful regular expression engine.
- Special characters such as ., *, +, ?, ^, $, [], (), and | are used to define patterns in regular expressions.
- The most commonly used regular expression functions in Python are "search", "match", "findall", "sub", and "split".
- Regular expression flags such as re.IGNORECASE, re.MULTILINE, re.DOTALL, and re.ASCII can modify the behavior of the regular expression engine.
- Grouping and capturing of substrings within a match can be achieved using parentheses (()) and the "group" method of the match object.