Meritshot Tutorials
Regular Expressions
What is a Regular Expression?
A regular expression (regex) is a special sequence of characters that defines a search pattern. It is used for searching, matching, and manipulating strings, for example to extract features from text or to replace substrings.
Think of it as a powerful search tool that can match complex patterns of text. It becomes especially useful later, when you learn Natural Language Processing.
A regular expression consists of literal characters and special characters that together define the pattern.
Quantifiers
Quantifiers specify how many times the preceding element may be repeated for a match to succeed. They are mainly used when the number of characters to be matched is not known in advance. There are six types of quantifiers:
- * = Matches: 0 or more occurrences of the preceding element
Example:
Regex: a*
Matches: “” (empty string), “a”, “aa”, “aaa”, etc.
- + = Matches: 1 or more occurrences of the preceding element
Example:
Regex: a+
Matches: “a”, “aa”, “aaa”, but not “” (empty string).
- ? = Matches: 0 or 1 occurrence of the preceding element
Example:
Regex: a?
Matches: “” (empty string), “a”.
- {n} = Matches: Exactly n occurrences of the preceding element
Example:
Regex: a{3}
Matches: “aaa”, but not “aa” or “aaaa”.
- {n,} = Matches: n or more occurrences of the preceding element
Example:
Regex: a{2,}
Matches: “aa”, “aaa”, “aaaa”, etc.
- {n,m} = Matches: Between n and m (inclusive) occurrences of the preceding element (range)
Example:
Regex: a{2,4}
Matches: “aa”, “aaa”, “aaaa”, but not “a” or “aaaaa”.
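The quantifier rules above can be checked with a short sketch. It relies on re.fullmatch(), which succeeds only when the entire string matches the pattern. The helper name matches is made up for this illustration and is not part of the re module. In Python, the braces must be written without spaces, otherwise they are treated as literal characters.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more occurrences
print(matches(r"a*", ""))           # True
print(matches(r"a*", "aaa"))        # True
# + : one or more occurrences
print(matches(r"a+", ""))           # False
# ? : zero or one occurrence
print(matches(r"a?", "aa"))         # False
# {n} : exactly n occurrences
print(matches(r"a{3}", "aaa"))      # True
print(matches(r"a{3}", "aaaa"))     # False
# {n,} : n or more occurrences
print(matches(r"a{2,}", "a"))       # False
# {n,m} : between n and m occurrences, inclusive
print(matches(r"a{2,4}", "aaaa"))   # True
print(matches(r"a{2,4}", "aaaaa"))  # False
# A space inside the braces turns the quantifier into literal text
print(matches(r"a{2, 4}", "aaa"))   # False
```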
Using the re Module
Python’s re module provides functions for working with regular expressions.
Basic Regex Operations
Let’s first understand common regex patterns:
- \d: Matches any digit (0-9).
- \w: Matches any word character: letters and digits (a-z, A-Z, 0-9) and the underscore.
- \s: Matches any whitespace character (spaces, tabs, newlines).
- .: Matches any character except a newline.
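As a quick illustration of these four patterns, the sketch below runs each one through re.findall() on a small made-up string. The sample text and variable names are assumptions chosen only for this example.

```python
import re

# Made-up sample containing a space, a tab, and a newline
sample = "ID 42\tuser_7\nok"

digits = re.findall(r"\d+", sample)     # runs of digits
words = re.findall(r"\w+", sample)      # runs of word characters
spaces = re.findall(r"\s", sample)      # single whitespace characters
any_char = re.findall(r".", "a\nb")     # every character except the newline

print(digits)    # ['42', '7']
print(words)     # ['ID', '42', 'user_7', 'ok']
print(spaces)    # [' ', '\t', '\n']
print(any_char)  # ['a', 'b']
```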
1. Matching Patterns
Regular expressions can be used to extract hashtags from a given text. This is useful for analyzing social media posts or any other text that contains hashtags.
The example starts with import re, which loads the module that provides regular expression operations.
import re
# Sample text with hashtags
text = "Check out our new features! #Exciting #TechNews #Innovation #2024"
# Regular expression pattern to match hashtags
pattern = r'#\w+'
# Find all hashtags in the text
hashtags = re.findall(pattern, text)
# Print the hashtags
print("Extracted hashtags:", hashtags)
Extracted hashtags: ['#Exciting', '#TechNews', '#Innovation', '#2024']
- # matches the literal hash symbol that starts a hashtag.
- \w+ matches one or more word characters (letters, digits, and underscores) that follow the #.
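To hint at how this pattern could support the social media analysis mentioned earlier, here is a minimal sketch that counts hashtags across several posts. The posts list is invented for the example, and lowercasing the tags before counting is a design choice made here, not something the pattern requires.

```python
import re
from collections import Counter

# Hypothetical posts, made up for this example
posts = [
    "Launch day! #TechNews #Innovation",
    "More #TechNews coming soon",
    "No tags here",
]

# Collect every hashtag from every post, lowercased so that
# "#TechNews" and "#technews" would be counted together
counts = Counter(
    tag.lower()
    for post in posts
    for tag in re.findall(r"#\w+", post)
)

print(counts.most_common(1))  # [('#technews', 2)]
```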
2. Searching Patterns
Regular expressions can also find all occurrences of a specific word in a text. Let’s say we want to find all instances of the word “cat” in a given text, regardless of case.
import re
# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."
# Regular expression pattern to match the whole word "cat"
pattern = r'\bcat\b'
# Find all occurrences of the word "cat" in the text
matches = re.findall(pattern, text, re.IGNORECASE)
# Print the matches
print("Found occurrences of 'cat':", matches)
Found occurrences of 'cat': ['cat', 'cat', 'cat']
- r'\bcat\b' is the regular expression pattern that matches “cat” as a whole word.
- findall() searches the text for all substrings that match the pattern.
- IGNORECASE makes the search case-insensitive, so it matches “cat”, “Cat”, “CAT”, etc.
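The sample text above only contains lowercase “cat”, so the effect of the flag and of the word boundaries is easy to miss. The following sketch uses a different, made-up sentence to show both. It also uses re.search(), which returns only the first match as a match object.

```python
import re

# Made-up sentence mixing cases and words that merely contain "cat"
text = "Cat videos: my CAT ignores the catalog and the bobcat."

# Whole word, any case
whole = re.findall(r"\bcat\b", text, re.IGNORECASE)
# No word boundaries: also hits "catalog" and "bobcat"
loose = re.findall(r"cat", text, re.IGNORECASE)
# Whole word, case-sensitive: no lowercase standalone "cat" exists here
strict = re.findall(r"\bcat\b", text)
# First match only, as a match object with position information
first = re.search(r"\bcat\b", text, re.IGNORECASE)

print(whole)                          # ['Cat', 'CAT']
print(loose)                          # ['Cat', 'CAT', 'cat', 'cat']
print(strict)                         # []
print(first.group(), first.start())   # Cat 0
```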
3. Replacing Patterns
The re.sub() function searches for a pattern and replaces each match with another string.
In this simple example, we replace all occurrences of the word “cat” with “dog”.
import re
# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."
# Regular expression pattern to match the word "cat"
pattern = r'\bcat\b'
# Replace all occurrences of "cat" with "dog"
replaced_text = re.sub(pattern, 'dog', text, flags=re.IGNORECASE)
# Print the replaced text
print("Replaced text:", replaced_text)
Replaced text: The dog is on the roof. The other dog is playing with a toy. My dog loves to sleep.
- re.sub(pattern, replacement, string) takes these arguments:
- pattern: The regular expression pattern to search for.
- replacement: The string that replaces each match.
- string: The string to search in.
- \b matches a word boundary to ensure “cat” is matched as a whole word.
- cat is the literal word we want to match.
Basically, the pattern r'\bcat\b' ensures that we match the word “cat” exactly, and re.sub() performs the replacement.
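To see why the word boundaries matter for replacement, the sketch below compares the pattern with and without \b on a made-up sentence. It also shows the optional count argument of re.sub(), which limits how many matches are replaced.

```python
import re

# Made-up sentence where "cat" also appears inside a longer word
text = "My cat chased another cat through the category list."

# With word boundaries, "category" is left alone
with_boundary = re.sub(r"\bcat\b", "dog", text)
# Without them, the "cat" inside "category" is replaced too
without_boundary = re.sub(r"cat", "dog", text)
# count=1 replaces only the first match
first_only = re.sub(r"\bcat\b", "dog", text, count=1)

print(with_boundary)     # My dog chased another dog through the category list.
print(without_boundary)  # My dog chased another dog through the dogegory list.
print(first_only)        # My dog chased another cat through the category list.
```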
4. The split() function:
The split() function returns a list where the string has been split at each match.
import re
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
['The', 'rain', 'in', 'Spain']
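Splitting on a pattern becomes more useful when the separators vary. This minimal sketch, using a made-up input line, splits on any run of commas, semicolons, or whitespace, and then uses the optional maxsplit argument to stop after the first split.

```python
import re

# Made-up line with inconsistent separators
line = "apples, oranges;bananas   grapes"

# Split on one or more commas, semicolons, or whitespace characters
parts = re.split(r"[,;\s]+", line)
# maxsplit=1 splits only at the first separator
head_and_rest = re.split(r"[,;\s]+", line, maxsplit=1)

print(parts)          # ['apples', 'oranges', 'bananas', 'grapes']
print(head_and_rest)  # ['apples', 'oranges;bananas   grapes']
```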
Example (real-life application):
Here we use a regular expression to search for and extract email addresses from a text. The code below may look complicated at first, but each part of the pattern is simple once it is broken down.
import re
# Sample text containing email addresses
text = """
Here are some email addresses to check:
john.doe@example.com, jane_smith123@domain.org, and admin@my-site.co.uk.
Feel free to contact us at support@company.com!
"""
# Regular expression pattern to match email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Find all email addresses in the text
emails = re.findall(pattern, text)
# Print the email addresses
print("Extracted email addresses:", emails)
Extracted email addresses: ['john.doe@example.com', 'jane_smith123@domain.org', 'admin@my-site.co.uk', 'support@company.com']
- [a-zA-Z0-9._%+-]+ matches the part before the @ symbol, which can include letters, digits, dots, underscores, percent signs, plus signs, and hyphens.
- @ matches the literal @ symbol.
- [a-zA-Z0-9.-]+ matches the domain part of the email, which can include letters, digits, dots, and hyphens.
- \.[a-zA-Z]{2,} matches the top-level domain (TLD) part, which starts with a dot followed by at least two letters (e.g., .com, .org, .uk).
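The same pattern can be reused to check whether a single string looks like an email address. The sketch below compiles it once with re.compile() and applies fullmatch() so that the whole string must fit. The helper name looks_like_email is made up for this example, and the pattern is a simplified approximation, not a complete validator for every valid address.

```python
import re

# Same simplified pattern as above, compiled once for reuse
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def looks_like_email(candidate):
    """Hypothetical helper: True when the whole string fits the pattern."""
    return EMAIL.fullmatch(candidate) is not None

print(looks_like_email("john.doe@example.com"))    # True
print(looks_like_email("admin@my-site.co.uk"))     # True
print(looks_like_email("no-at-sign.example.com"))  # False (no @ symbol)
print(looks_like_email("user@localhost"))          # False (no dot and TLD)
print(looks_like_email("user@site.c"))             # False (TLD shorter than 2 letters)
```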