A regular expression (regex) is a special sequence of characters that defines a search pattern. It is used for searching, matching, and manipulating strings, for example to extract features from text or to replace substrings.
Think of it as a powerful search tool that can match complex patterns of text, which will be especially useful later when you learn Natural Language Processing.
A regex consists of literal characters and special characters that together define the pattern.
There are certain quantifiers that you should know about.
Quantifiers specify how many times the preceding element may repeat for the pattern to match. They are mainly useful when the number of characters to be matched is not known in advance. There are six types of quantifiers:
‘*’ = Matches: 0 or more occurrences of the preceding element
Example:
Regex: a*
Matches: “” (empty string), “a”, “aa”, “aaa”, etc.
‘+’ = Matches: 1 or more occurrences of the preceding element
Example:
Regex: a+
Matches: “a”, “aa”, “aaa”, but not “” (empty string).
‘?’ = Matches: 0 or 1 occurrence of the preceding element
Example:
Regex: a?
Matches: “” (empty string), “a”.
{n} = Matches: Exactly n occurrences of the preceding element
Example:
Regex: a{3}
Matches: “aaa”, but not “aa” or “aaaa”.
{n,} = Matches: n or more occurrences of the preceding element
Example:
Regex: a{2,}
Matches: “aa”, “aaa”, “aaaa”, etc.
{n,m} = Matches: between n and m (inclusive) occurrences of the preceding element
Example:
Regex: a{2,4}
Matches: “aa”, “aaa”, “aaaa”, but not “a” or “aaaaa”.
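To check the six quantifiers yourself, here is a minimal sketch that tests each pattern against runs of the letter “a”. The helper name full_matches and the list of candidate strings are made up for this illustration. It uses re.fullmatch(), which only succeeds when the whole string fits the pattern.

```python
import re

# The six quantifiers, each applied to the single character "a".
patterns = ["a*", "a+", "a?", "a{3}", "a{2,}", "a{2,4}"]

# Candidate strings: the empty string and runs of one to five "a" characters.
candidates = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]


def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


# Print one row per quantifier so the results can be compared side by side.
for pattern in patterns:
    print(f"{pattern:8} {full_matches(pattern)}")
```

Note that the comma form is written without a space (a{2,4}); with a space inside the braces, Python treats the text as literal characters rather than as a quantifier.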
Python’s re module provides functions for working with regular expressions.
Basic Regex Operations
Let’s first understand common regex patterns:
1. Matching Patterns
Here we use a regular expression to extract hashtags from a given text. This can be useful for analyzing social media posts or any other text that contains hashtags.
• import re: this loads Python’s re module, which provides the regular expression operations used below.
import re
# Sample text with hashtags
text = "Check out our new features! #Exciting #TechNews #Innovation #2024"
# Regular expression pattern to match hashtags
pattern = r'#\w+'
# Find all hashtags in the text
hashtags = re.findall(pattern, text)
# Print the hashtags
print("Extracted hashtags:", hashtags)
Extracted hashtags: ['#Exciting', '#TechNews', '#Innovation', '#2024']
‘#’ matches the literal # character that starts a hashtag.
\w+ matches one or more word characters (letters, digits, and underscores) that follow the #.
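To see where \w+ stops, here is a small sketch using a made-up sample sentence (not from the example above). Punctuation and hyphens are not word characters, so the match ends just before them, while underscores and digits are kept.

```python
import re

# Hypothetical sample text, chosen to show where \w+ stops matching.
sample = "Loving #data_science! Also #AI-powered tools in #2024."

# The underscore is a word character; "!", "-" and "." are not.
tags = re.findall(r"#\w+", sample)
print(tags)  # ['#data_science', '#AI', '#2024']
```

If hyphenated hashtags should be kept whole, the pattern would need a character class that also allows the hyphen.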
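2. Searching Patterns
Here we use a regular expression to find all occurrences of a specific word in a text. Let’s say we want to find every instance of the word “cat” in a given text, regardless of case. The \b on each side of the pattern marks a word boundary, so “cat” inside a longer word is not matched.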
import re
# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."
pattern = r'\bcat\b'
# Find all occurrences of the word "cat" in the text
matches = re.findall(pattern, text, re.IGNORECASE)
# Print the matches
print("Found occurrences of 'cat':", matches)
Found occurrences of 'cat': ['cat', 'cat', 'cat']
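The sample text above only contains lowercase “cat”, so it does not really show what \b and re.IGNORECASE contribute. The sketch below uses a made-up sentence (an assumption for illustration) that mixes cases and hides “cat” inside a longer word.

```python
import re

# Hypothetical sentence with mixed case and "cat" hidden inside "concatenate".
sample = "Cat owners concatenate photos of every CAT and cat."

# No word boundaries: also picks up the "cat" inside "concatenate".
without_boundary = re.findall(r"cat", sample, flags=re.IGNORECASE)

# Word boundaries plus IGNORECASE: whole words only, in any case.
with_boundary = re.findall(r"\bcat\b", sample, flags=re.IGNORECASE)

# Word boundaries without the flag: only the exact lowercase word.
case_sensitive = re.findall(r"\bcat\b", sample)

print(without_boundary)  # ['Cat', 'cat', 'CAT', 'cat']
print(with_boundary)     # ['Cat', 'CAT', 'cat']
print(case_sensitive)    # ['cat']
```

Notice that findall() returns each match as it appears in the text, so the original capitalization is preserved.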
3. Replacing Patterns
The re.sub() function searches for a pattern and replaces every match with another string.
In this simple example, we replace all occurrences of the word “cat” with “dog”.
Positional arguments: the order of arguments matters. re.sub() expects the pattern first, then the replacement, then the text to search. Pass flags by keyword (flags=...), because the fourth positional parameter is count, not flags.
import re
# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."
# Regular expression pattern to match the word "cat"
pattern = r'\bcat\b'
# Replace all occurrences of "cat" with "dog"
replaced_text = re.sub(pattern, 'dog', text, flags=re.IGNORECASE)
# Print the replaced text
print("Replaced text:", replaced_text)
Replaced text: The dog is on the roof. The other dog is playing with a toy. My dog loves to sleep.
The pattern r'\bcat\b' matches “cat” only as a whole word, and re.sub() performs the replacement.
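Since the argument order of re.sub() is easy to get wrong, here is a short sketch that passes count and flags by keyword. The sample sentence is a shortened, mixed-case variant made up for this illustration.

```python
import re

# Hypothetical mixed-case variant of the sample text.
text = "The cat is on the roof. The other Cat is playing. My CAT loves to sleep."
pattern = r"\bcat\b"

# Replace every match, ignoring case.
all_replaced = re.sub(pattern, "dog", text, flags=re.IGNORECASE)

# Replace only the first match by passing count as a keyword.
first_only = re.sub(pattern, "dog", text, count=1, flags=re.IGNORECASE)

print(all_replaced)
print(first_only)
```

Writing count=... and flags=... explicitly avoids the common mistake of passing a flag in the position where re.sub() expects count.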
4. The split() function
The split() function returns a list where the string has been split at each match.
import re
txt = "The rain in Spain"
# Split at every whitespace character (raw string avoids an invalid escape warning)
x = re.split(r"\s", txt)
print(x)
['The', 'rain', 'in', 'Spain']
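The pattern \s splits at every single whitespace character, which gives empty strings when words are separated by more than one space. The sketch below uses a made-up string with irregular spacing to compare \s with \s+, and also shows the maxsplit parameter of re.split().

```python
import re

# Hypothetical input with irregular spacing: two spaces, then three, then one.
messy = "The" + " " * 2 + "rain" + " " * 3 + "in Spain"

# Splitting at each single whitespace character leaves empty strings behind.
single = re.split(r"\s", messy)

# Splitting at runs of whitespace gives clean words.
runs = re.split(r"\s+", messy)

# maxsplit=1 stops after the first split and keeps the rest intact.
limited = re.split(r"\s+", messy, maxsplit=1)

print(single)  # ['The', '', 'rain', '', '', 'in', 'Spain']
print(runs)    # ['The', 'rain', 'in', 'Spain']
print(limited)
```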
We can also use a regular expression to search for and extract email addresses from a text. The code below may look complicated at first, but the pattern is simple once you read it piece by piece.
import re
# Sample text containing email addresses
text = """
Here are some email addresses to check:
john.doe@example.com, jane_smith123@domain.org, and admin@my-site.co.uk.
Feel free to contact us at support@company.com! """
# Regular expression pattern to match email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Find all email addresses in the text
emails = re.findall(pattern, text)
# Print the email addresses
print("Extracted email addresses:", emails)
Extracted email addresses: ['john.doe@example.com', 'jane_smith123@domain.org', 'admin@my-site.co.uk', 'support@company.com']
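To make the email pattern easier to read, here is the same pattern written with the re.VERBOSE flag, which lets us spread it over several lines and comment each piece. The variable names and the sample sentence are made up for this illustration. Treat the pattern as a simplified sketch: it handles common addresses but does not cover every valid email format.

```python
import re

# The same pattern as above, split into commented pieces with re.VERBOSE.
email_pattern = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits, and . _ % + -
    @                   # the literal @ separator
    [a-zA-Z0-9.-]+      # domain name: letters, digits, dots, hyphens
    \.                  # the dot before the top-level domain
    [a-zA-Z]{2,}        # top-level domain: at least two letters
    """,
    re.VERBOSE,
)

# Hypothetical sample: two addresses that fit and one with no dotted domain.
sample = "Write to john.doe@example.com or admin@my-site.co.uk, not to user@localhost."

found = email_pattern.findall(sample)
print(found)  # ['john.doe@example.com', 'admin@my-site.co.uk']
```

The address user@localhost is skipped because the pattern requires a dot followed by at least two letters after the domain name.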
