Meritshot Tutorials

Python Tutorial

Regular Expressions

What is a Regular Expression?

A regular expression (regex) is a special sequence of characters that defines a search pattern. It is used for searching, matching, and manipulating strings based on patterns, for example to extract features from text, replace substrings, and perform other string manipulations.

Think of it as a powerful search tool that can match complex patterns of text. It becomes especially useful later, when you learn Natural Language Processing.

They consist of literal characters and special characters that define the pattern.

Quantifiers

Quantifiers specify how many times the preceding element may repeat for the pattern to match. They are mainly useful when the number of characters to match is not known in advance. There are six types of quantifiers:

  • * = Matches: 0 or more occurrences of the preceding element

Example:

Regex: a*

Matches: “” (empty string), “a”, “aa”, “aaa”, etc.

  • + = Matches: 1 or more occurrences of the preceding element

Example:

Regex: a+

Matches: “a”, “aa”, “aaa”, but not “” (empty string).

  • ? = Matches: 0 or 1 occurrence of the preceding element

Example:

Regex: a?

Matches: “” (empty string), “a”.

  • {n} = Matches: exactly n occurrences of the preceding element

Example:

Regex: a{3}

Matches: “aaa”, but not “aa” or “aaaa”.

  • {n,} = Matches: n or more occurrences of the preceding element

Example:

Regex: a{2,}

Matches: “aa”, “aaa”, “aaaa”, etc.

  • {n,m} = Matches: between n and m (inclusive) occurrences of the preceding element

Example:

Regex: a{2,4}

Matches: “aa”, “aaa”, “aaaa”, but not “a” or “aaaaa”.
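The six quantifiers above can be checked side by side. The following is a minimal sketch, not part of the original tutorial: the `cases` and `results` names are chosen here for illustration, and `re.fullmatch` is used so that a pattern only counts as matching when it covers the whole string.

```python
import re

# Each pattern is tested against whole strings with re.fullmatch,
# so True means the entire string satisfies the quantifier.
cases = {
    r"a*": ["", "a", "aaa"],
    r"a+": ["", "a", "aaa"],
    r"a?": ["", "a", "aa"],
    r"a{3}": ["aa", "aaa", "aaaa"],
    r"a{2,}": ["a", "aa", "aaaa"],
    r"a{2,4}": ["a", "aaa", "aaaaa"],
}

results = {}
for pattern, samples in cases.items():
    # Convert each match object (or None) to a plain True/False.
    results[pattern] = [bool(re.fullmatch(pattern, s)) for s in samples]
    print(pattern, results[pattern])
```

Note that the range quantifiers are written without a space after the comma, as in a{2,4}.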

Using the re Module

Python’s re module provides functions for working with regular expressions, such as re.findall(), re.sub(), and re.split().

Basic Regex Operations

Let’s first understand common regex patterns:

  • \d: Matches any digit (0-9).
  • \w: Matches any alphanumeric character (a-z, A-Z, 0-9) and the underscore.
  • \s: Matches any whitespace character (spaces, tabs, newlines).
  • .: Matches any character except a newline.
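To see these character classes in action, here is a small sketch. The sample string and variable names are made up for this illustration and do not come from the original tutorial.

```python
import re

# Hypothetical sample string containing digits, an underscore, a tab and a newline.
sample = "Room 42, floor_3\tEast wing\n"

digits = re.findall(r"\d", sample)    # individual digits
words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)    # every whitespace character
dot_on_newline = re.fullmatch(r".", "\n")  # '.' does not match a newline by default

print(digits)
print(words)
print(spaces)
print(dot_on_newline)
```

The comma after "42" does not appear in `words` because it is neither a word character nor whitespace.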

Matching Patterns

This example uses a regular expression to extract hashtags from a given text, which can be useful for analyzing social media posts or any other text that contains hashtags.

  • import re: this module provides regular expression operations.

import re

# Sample text with hashtags
text = "Check out our new features! #Exciting #TechNews #Innovation #2024"

# Regular expression pattern to match hashtags
pattern = r'#\w+'

# Find all hashtags in the text
hashtags = re.findall(pattern, text)

# Print the hashtags
print("Extracted hashtags:", hashtags)

Extracted hashtags: ['#Exciting', '#TechNews', '#Innovation', '#2024']

  • # matches the literal hash symbol that starts each hashtag.
  • \w+ matches one or more word characters (letters, digits, and underscores) that follow the #.

Searching Patterns

This example uses a regular expression to find all occurrences of a specific word in a text. Let’s say we want to find every instance of the word “cat” in a given text, regardless of its case.

import re

# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."

# Regular expression pattern to match the whole word "cat"
pattern = r'\bcat\b'

# Find all occurrences of the word "cat" in the text
matches = re.findall(pattern, text, re.IGNORECASE)

# Print the matches
print("Found occurrences of 'cat':", matches)

Found occurrences of 'cat': ['cat', 'cat', 'cat']

  • The pattern r'\bcat\b' matches “cat” only as a whole word.
  • re.findall() searches the text for all substrings that match the pattern.
  • re.IGNORECASE makes the search case-insensitive, so it matches “cat”, “Cat”, “CAT”, etc.
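The sample text above only contains lowercase “cat”, so the effect of re.IGNORECASE is not visible there. The sketch below uses a made-up sentence with mixed case to show it, and also uses re.search(), which returns only the first match as a match object. The text and variable names are illustrative assumptions.

```python
import re

# Hypothetical text with mixed-case occurrences and a longer word containing "cat".
text = "My Cat chased another CAT, but the caterpillar escaped."

# re.search returns the first match (or None if nothing matches).
first = re.search(r"\bcat\b", text, re.IGNORECASE)
if first:
    print(first.group(), first.start(), first.end())

# re.findall returns every match as a list of strings.
all_matches = re.findall(r"\bcat\b", text, re.IGNORECASE)
print(all_matches)
```

“caterpillar” is skipped because \b requires a word boundary right after “cat”.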

Replacing Patterns

re.sub() lets us search for a specific pattern and replace every match with another string.

In this simple example, we replace all occurrences of the word “cat” with “dog”.

The first three arguments of re.sub() are positional, so their order matters: the pattern, then the replacement, then the string to search.

import re

# Sample text
text = "The cat is on the roof. The other cat is playing with a toy. My cat loves to sleep."

# Regular expression pattern to match the word "cat"
pattern = r'\bcat\b'

# Replace all occurrences of "cat" with "dog"
replaced_text = re.sub(pattern, 'dog', text, flags=re.IGNORECASE)

# Print the replaced text
print("Replaced text:", replaced_text)

Replaced text: The dog is on the roof. The other dog is playing with a toy. My dog loves to sleep.

  • re.sub(pattern, replacement, string)
  • pattern: The regular expression pattern to search for
  • replacement: The string that replaces each match
  • string: The string to search
  • \b matches a word boundary to ensure “cat” is matched as a whole word
  • cat is the literal word we want to match

Basically, the pattern r'\bcat\b' ensures that we match the word “cat” exactly, and re.sub() performs the replacement.
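To illustrate why the word boundaries matter, the sketch below runs the same replacement with and without \b. The sentence is invented for this comparison and the variable names are assumptions.

```python
import re

# Hypothetical text where "cat" also appears inside longer words.
text = "The cat sat near the catalog. Cat toys are scattered."

# With \b, only the standalone word is replaced.
with_boundary = re.sub(r"\bcat\b", "dog", text, flags=re.IGNORECASE)

# Without \b, every "cat" substring is replaced, even inside other words.
without_boundary = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)

print(with_boundary)
print(without_boundary)
```

The replacement string is inserted as written, so “Cat” becomes lowercase “dog” in both versions.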

The split() function

The re.split() function returns a list where the string has been split at each match.

import re

# Split the text at every whitespace character
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']
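re.split() becomes more useful than str.split() when the separator itself varies. The following sketch, with a made-up input line, splits on commas or semicolons with optional surrounding whitespace, and shows the maxsplit parameter that limits the number of splits.

```python
import re

# Hypothetical input with inconsistent separators and spacing.
line = "apples, oranges;bananas ,  grapes"

# Split on a comma or semicolon, swallowing any whitespace around it.
fruits = re.split(r"\s*[,;]\s*", line)

# maxsplit=1 performs only the first split; the rest stays in one piece.
first_two = re.split(r"\s*[,;]\s*", line, maxsplit=1)

print(fruits)
print(first_two)
```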

Example (real-life application):

This example uses a regular expression to find and extract email addresses from a text. The code below may look complicated at first, but each part of the pattern is simple once it is broken down.

import re

# Sample text containing email addresses
text = """
Here are some email addresses to check:
john.doe@example.com, jane_smith123@domain.org, and admin@my-site.co.uk.
Feel free to contact us at support@company.com!
"""

# Regular expression pattern to match email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all email addresses in the text
emails = re.findall(pattern, text)

# Print the email addresses
print("Extracted email addresses:", emails)

Extracted email addresses: ['john.doe@example.com', 'jane_smith123@domain.org', 'admin@my-site.co.uk', 'support@company.com']

  • [a-zA-Z0-9._%+-]+ matches the part before the @ symbol, which can include letters, digits, dots, underscores, percent signs, plus signs, and hyphens.
  • @ matches the literal @ symbol.
  • [a-zA-Z0-9.-]+ matches the domain part of the email, which can include letters, digits, dots, and hyphens.
  • \.[a-zA-Z]{2,} matches the top-level domain (TLD) part, which starts with a dot followed by at least two letters (e.g., .com, .org, .uk).
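One way to keep a pattern like this readable is to compile it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The sketch below restates the same pattern in that form. The `email_pattern` name and the sample text are assumptions made for this illustration, and the pattern remains a simplified one that does not cover every valid email address.

```python
import re

# Same pattern as above, written across several lines with re.VERBOSE.
email_pattern = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+    # local part before the @ symbol
    @                    # literal @ symbol
    [a-zA-Z0-9.-]+       # domain name, may contain dots and hyphens
    \.[a-zA-Z]{2,}       # dot followed by a top-level domain of 2+ letters
    """,
    re.VERBOSE,
)

# Hypothetical text; the last address has no dot in its domain, so it is skipped.
text = "Contact john.doe@example.com or admin@my-site.co.uk. Not matched: user@localhost"
emails = email_pattern.findall(text)
print(emails)
```

Compiling the pattern once with re.compile() also lets it be reused across many strings without passing the pattern text each time.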