What is RegEx?

Regular expressions are a small language of their own for matching patterns in text.
They are highly useful in text data processing.

The official Python source defines the Regex in the following way:

An expression containing ‘meta’ characters and literals to identify and/or replace a pattern matching that expression.
- Meta Characters: these characters have a special meaning
- Literals: these characters are the actual characters that are to be matched

Use Cases
- To search a string pattern
- To split a string based on a pattern
- To replace a part of the string
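These three use cases line up with three functions in Python's built-in re module: re.search, re.split and re.sub. Here is a minimal sketch; the sample string and variable names are made up purely for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
text = "apple, banana;cherry"

# Search: find the first run of word characters
first_word = re.search(r"\w+", text).group()

# Split: break the string on a comma or semicolon plus optional spaces
parts = re.split(r"[,;]\s*", text)

# Replace: swap every comma or semicolon (and trailing spaces) for " | "
replaced = re.sub(r"[,;]\s*", " | ", text)

print(first_word)
print(parts)
print(replaced)
```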

A Selected List of Advanced Regex Usages in Python

Case 1: Extract Username and Domain from Email

Key Concepts: use of the group() method on the match object returned by re.search, and numbered captures using parentheses
pattern: “(\w+)\.(\w+)@(\w+)\.\w+”

import re

email = "senthil.kumar@gutszen.com"
pattern = r"(\w+)\.(\w+)@(\w+)\.\w+"
match = re.search(pattern, email)
if match:
    first_name = match.group(1)
    last_name = match.group(2)
    company = match.group(3)
    print(f"first_name: {first_name}")
    print(f"last_name: {last_name}")
    print(f"company: {company}")

first_name: senthil
last_name: kumar
company: gutszen
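If counting parentheses gets tedious, named groups are one alternative worth knowing. The sketch below reuses the same email; the group names first_name, last_name and company are my own choice, not something the pattern requires.

```python
import re

email = "senthil.kumar@gutszen.com"
# (?P<name>...) gives each capture a name instead of a number
pattern = r"(?P<first_name>\w+)\.(?P<last_name>\w+)@(?P<company>\w+)\.\w+"
match = re.search(pattern, email)

# groupdict() returns all named captures as a dictionary
details = match.groupdict() if match else {}
print(details)
```

The names make the intent readable at the call site, and match.group("company") works just like match.group(3).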

Case 2: A Regex Gotcha - An example where `raw_string_literal` is needed

In most cases, with or without a raw string literal, a Python pattern works fine.
But consider the following example, where text is a raw string literal with a backslash (\) in it:

text = r"Can you capture? this\that"
pattern = r"\w+\\\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: this\that

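What happens if I try the example below, where neither text nor pattern is a raw string literal?
Python now reads \t in text as a tab and \\ in pattern as a single backslash, so the pattern is effectively \w+\w+. Notice the stray word hat at the end of the matches.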

text = "Can you capture? this\that"
pattern = "\w+\\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Can
Matched String: you
Matched String: capture
Matched String: this
Matched String: hat

What if I try the example below, with a raw pattern that looks for a tab?
Notice that this<tab_space>hat is captured as a single match.

text = "Can you capture? this\that"
pattern = r"\w+\t\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: this    hat
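To see why the three variants behave so differently, it can help to look at the strings themselves rather than the regex. A small sketch (the variable names plain and raw are just for illustration):

```python
# Same characters typed in the source, very different strings at runtime
plain = "this\that"   # \t is interpreted as a single tab character
raw = r"this\that"    # the backslash and the t are kept as two characters

print(len(plain))      # one character shorter than raw
print(len(raw))
print("\t" in plain)   # the tab is really there
print("\\" in raw)     # the backslash is really there
```

The raw string is one character longer because nothing in it was collapsed into an escape sequence.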

Case 3A: Importance of `Greedy` Operator !!

Use of ? after a quantifier to make it non-greedy (lazy); without it, + is greedy and matches as much as possible

text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+?)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello
Matched String: Hi

text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello', and he replied, 'Hi
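Another way to get the short matches, without relying on the lazy ?, is to forbid the closing quote inside the capture. This is only a sketch of an alternative, not a replacement for the example above:

```python
import re

text = "She said, 'Hello', and he replied, 'Hi'"
# [^']+ can never run past a closing quote, so no lazy ? is needed
pattern = r"'([^']+)'"
matches = re.findall(pattern, text)
print(matches)
```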

Case 3B: Importance of Escaping Parentheses!!

What if you want to capture text within parentheses?

text = "She said, (Hello), and he replied, (Hi)"
pattern = r"\((.+?)\)"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello
Matched String: Hi
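When the delimiters come from a variable, hand-escaping is easy to get wrong. One option is to let re.escape add the backslashes. In this sketch, open_delim and close_delim are hypothetical names for the delimiters:

```python
import re

text = "She said, (Hello), and he replied, (Hi)"
# Hypothetical delimiters; re.escape adds the backslashes for us
open_delim, close_delim = "(", ")"
pattern = re.escape(open_delim) + r"(.+?)" + re.escape(close_delim)

print(pattern)
matches = re.findall(pattern, text)
print(matches)
```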

Case 4: Splitting Sentences using Regex

Use of a negated character class [^<patterns>] to match any character except the listed <patterns>, stopping at the first one we meet

text = "Hello! My Name is Senthil. How are you doing?"
pattern = r"([^.?!]+[.?!])"
sentences = re.findall(pattern, text)
for sentence in sentences:
    print(f"Sentence: {sentence.strip()}")

Sentence: Hello!
Sentence: My Name is Senthil.
Sentence: How are you doing?
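A different route to the same sentences is to split instead of find. The sketch below assumes sentences are separated by whitespace after the punctuation, which holds for this sample but not for every text:

```python
import re

text = "Hello! My Name is Senthil. How are you doing?"
# (?<=[.?!]) is a lookbehind: split on whitespace only when it follows ., ? or !
sentences = re.split(r"(?<=[.?!])\s+", text)
print(sentences)
```

Because the lookbehind does not consume the punctuation, each sentence keeps its closing mark and no strip() is needed.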

Case 5: Extraction of different URL Formats

Multiple Concepts: the OR operator |; ? for 0 or 1 match; [^/\s]+ means one or more characters that are neither a / nor whitespace

text = "Visit my website at https://www.example.com and check out www.blog.example.com or http://blogspot.com"
pattern = r"https?://[^/\s]+|www\.[^/\s]+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"URL: {match}")

URL: https://www.example.com
URL: www.blog.example.com
URL: http://blogspot.com
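One detail worth checking in URL patterns like this is the dot. Unescaped, it matches any character, so a www prefix followed by anything at all would pass. A small sketch on a made-up string that is not a URL:

```python
import re

# Hypothetical string that is not a URL at all
text = "the wwwxyz token"

loose = re.findall(r"www.[^/\s]+", text)    # . matches any character
strict = re.findall(r"www\.[^/\s]+", text)  # \. matches only a literal dot

print(loose)
print(strict)
```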

Where Regex Fails!

Bonus Case - The Repeating Patterns - Extracting html tags

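The expectation for the Python code below is to capture all tags and their contents.
But the regex captures only the outermost <div> tag, because re.findall returns non-overlapping matches and the nested tags sit inside the text that the first match has already consumed.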

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
pattern = r"<(.+?)>(.+?)</\1>"
matches = re.findall(pattern, html)
for match in matches:
    tag = match[0]
    content = match[1]
    print(f"Tag: {tag}, Content: {content}")

Tag: div, Content: <p>This is a paragraph</p><span>This is a span</span>

Best Solution - Use specific modules (avoid regex)

In this html parsing case, use BeautifulSoup

from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
soup = BeautifulSoup(html, 'html.parser')

def process_tags(element):
    if not element.name.startswith(r"["):
        print(f"Tag: {element.name}, Content: {element.get_text()}")
    for child in element.children:
        if child.name:
            process_tags(child)

process_tags(soup)

Tag: div, Content: This is a paragraphThis is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span

Insisting on a Regex Solution?

# First collect every opening tag name, then match each tag's content separately
fetch_tags_pattern = r"<(\w+)>"
tag_matches = re.findall(fetch_tags_pattern, html)

for tag_name in tag_matches:
    tag_pattern = f"<({tag_name})>(.*?)</{tag_name}>"
    matches = re.findall(tag_pattern, html)
    for match in matches:
        tag = match[0]
        # Replace any nested tags in the content with a space
        content = re.sub(r"<.*?>", " ", match[1])
        print(f"Tag: {tag}, Content: {content}")

Tag: div, Content:  This is a paragraph  This is a span 
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span
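Even this per-tag approach has limits. As a sketch of where it breaks, take a made-up input in which a div is nested inside another div (nested_html is a hypothetical sample, not from the example above):

```python
import re

# Hypothetical input with a div nested inside another div
nested_html = "<div>outer <div>inner</div></div>"
matches = re.findall(r"<(div)>(.*?)</div>", nested_html)
print(matches)
```

The lazy match stops at the first closing tag, so the outer div's content is cut short and the inner div is never reported on its own. A regex of this kind cannot keep track of nesting depth, which is what a parser does.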

Conclusion

I am sure that, over the years you have worked with regex, you have collected many better examples.
As with Bash scripting, the key with regex is to know when to stop and use some other solution.
Thank you for reading this blog piece. Happy Regexing!
