An Ode to the Ubiquitous Regex

A selected list of confusing regex patterns that helped me better understand how regex works
Coding
Python
Author

Senthil Kumar

Published

November 8, 2022

What is RegEx?

  • Regular Expressions are a language of their own for matching patterns
  • They are highly useful in text data processing

The official Python source defines the Regex in the following way:

An expression containing ‘meta’ characters and literals to identify and/or replace a pattern matching that expression.
  • Meta Characters: these characters have a special meaning
  • Literals: these characters are the actual characters that are to be matched

Use Cases:
  • To search a string pattern
  • To split a string based on a pattern
  • To replace a part of the string
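The three use cases map onto three functions of Python's re module. Here is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "order-123, order-456; order-789"

# Search: find the first run of digits.
first = re.search(r"\d+", text)
first_number = first.group() if first else None

# Split: break on a comma or semicolon followed by optional whitespace.
parts = re.split(r"[,;]\s*", text)

# Replace: mask every run of digits.
masked = re.sub(r"\d+", "###", text)

print(first_number)  # 123
print(parts)         # ['order-123', 'order-456', 'order-789']
print(masked)        # order-###, order-###; order-###
```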

A Selected List of Advanced Regex Usages in Python

Case 1: Extract Username and Domain from Email

  • Key Concepts: Use of the group() method of the match object returned by re.search, and numbered captures using parentheses
  • pattern: "(\w+)\.(\w+)@(\w+)\.\w+"
import re

email = "senthil.kumar@gutszen.com"
# Raw string, so the backslashes reach the regex engine unchanged
pattern = r"(\w+)\.(\w+)@(\w+)\.\w+"
match = re.search(pattern, email)
if match:
    first_name = match.group(1)
    last_name = match.group(2)
    company = match.group(3)
    print(f"first_name: {first_name}")
    print(f"last_name: {last_name}")
    print(f"company: {company}")
first_name: senthil
last_name: kumar
company: gutszen
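If counting parentheses becomes error-prone, named groups are one possible alternative. The sketch below uses the same email; the group names are my own choice, not something the pattern requires.

```python
import re

email = "senthil.kumar@gutszen.com"

# (?P<name>...) gives each capture a name instead of a number.
named_pattern = r"(?P<first_name>\w+)\.(?P<last_name>\w+)@(?P<company>\w+)\.\w+"

match = re.search(named_pattern, email)
parts = match.groupdict() if match else {}

print(parts)
# {'first_name': 'senthil', 'last_name': 'kumar', 'company': 'gutszen'}
```

match.group("company") also works here and returns the same value as match.group(3).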

Case 2: A Regex Gotcha - An example where raw_string_literal is needed

  • In most cases, with or without a raw string literal, the Python pattern works fine, as a Stack Overflow comment points out
  • But consider the following example, where text is a raw string literal with a \ in it
text = r"Can you capture? this\that"
pattern = r"\w+\\\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: this\that
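  • What happens if I try the example below, where neither text nor pattern is a raw string literal?
  • Notice the word hat at the end of the matches: in text, \t is read as a tab character, and the pattern reduces to \w+\w+, which matches any run of two or more word characters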
text = "Can you capture? this\that"
pattern = "\w+\\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Can
Matched String: you
Matched String: capture
Matched String: this
Matched String: hat
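  • What if I try the example below, where only the pattern is a raw string literal?
  • Notice the capture of this<tab_space>hat: the \t in the pattern matches the tab character in text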
text = "Can you capture? this\that"
pattern = r"\w+\t\w+"

matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: this    hat
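A quick way to see what is going on in the three examples is to inspect the strings themselves before any regex is involved. This is only a small sketch; the variable names are mine.

```python
# Without the r prefix, Python turns "\t" into a single tab character.
plain = "this\that"
# With the r prefix, the backslash and the "t" are kept as two characters.
raw = r"this\that"

print(len(plain), len(raw))  # 8 9
print(repr(plain))           # 'this\that'
print(repr(raw))             # 'this\\that'

# The raw pattern from the first example, written without the r prefix:
same_pattern = "\\w+\\\\\\w+"
print(same_pattern == r"\w+\\\w+")  # True
```

The last line shows why raw strings are the usual choice for patterns: the non-raw spelling needs every backslash doubled.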

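Case 3A: Importance of the Non-Greedy Operator !!

  • Use of ? after a quantifier as a non-greedy (lazy) operator: .+? matches as little as possible, whereas the greedy .+ matches as much as possible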
text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+?)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello
Matched String: Hi
text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello', and he replied, 'Hi
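An alternative to the .+? pattern is a negated character class that simply refuses to cross a quote. A sketch on the same text:

```python
import re

text = "She said, 'Hello', and he replied, 'Hi'"

# [^']+ matches one or more characters that are not a single quote,
# so the match cannot run past the closing quote.
quoted = re.findall(r"'([^']+)'", text)

print(quoted)  # ['Hello', 'Hi']
```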

Case 3B: Importance of Escaping Parentheses!!

  • What if you want to capture text within parentheses?
text = "She said, (Hello), and he replied, (Hi)"
pattern = r"\((.+?)\)"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello
Matched String: Hi
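When the literal text comes from a variable rather than being typed by hand, re.escape can add the backslashes for you. A small sketch, with a made-up literal value:

```python
import re

text = "She said, (Hello), and he replied, (Hi)"

# Hypothetical literal we want to find exactly as written.
literal = "(Hi)"

# re.escape puts a backslash in front of the regex special characters.
escaped = re.escape(literal)
found = re.findall(escaped, text)

print(escaped)  # \(Hi\)
print(found)    # ['(Hi)']
```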

Case 4: Splitting Sentences using Regex

  • Use of a negated character class [^<patterns>] to match a run of characters that are not in <patterns>, stopping at the first character that is
text = "Hello! My Name is Senthil. How are you doing?"
pattern = r"([^.?!]+[.?!])"
sentences = re.findall(pattern, text)
for sentence in sentences:
    print(f"Sentence: {sentence.strip()}")
Sentence: Hello!
Sentence: My Name is Senthil.
Sentence: How are you doing?
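The same split can also be sketched with re.split and a lookbehind, which cuts at the whitespace that follows a sentence-ending character. Like the pattern above, it is a rough heuristic and will also split after abbreviations such as "Dr.".

```python
import re

text = "Hello! My Name is Senthil. How are you doing?"

# (?<=[.?!]) asserts that the previous character ends a sentence;
# \s+ is the whitespace that is removed by the split.
sentences = re.split(r"(?<=[.?!])\s+", text)

print(sentences)
# ['Hello!', 'My Name is Senthil.', 'How are you doing?']
```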

Case 5: Extraction of different URL Formats

  • Multiple Concepts: the OR operator |; ? for 0 or 1 match; [^/\s]+ means one or more characters other than / or whitespace
text = "Visit my website at https://www.example.com and check out www.blog.example.com or http://blogspot.com"
pattern = r"https?://[^/\s]+|www\.[^/\s]+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"URL: {match}")
URL: https://www.example.com
URL: www.blog.example.com
URL: http://blogspot.com
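One detail worth checking in URL patterns is the dot: unescaped, it matches any character, not only a literal ".". The sketch below compares the two spellings on a made-up sample string.

```python
import re

# Hypothetical sample with a look-alike that is not a real www. prefix.
sample = "see wwwXexample.com and www.example.com"

# Unescaped dot: "." matches any character, including the "X".
loose = re.findall(r"www.[^/\s]+", sample)
# Escaped dot: only a literal "." is accepted after "www".
strict = re.findall(r"www\.[^/\s]+", sample)

print(loose)   # ['wwwXexample.com', 'www.example.com']
print(strict)  # ['www.example.com']
```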

Where Regex Fails!

Bonus Case - The Repeating Patterns - Extracting html tags

  • The expectation for the Python code below is to capture all tags and their contents.
  • But the regex captures only the outermost <div> tag: re.findall returns non-overlapping matches, and the first match already consumes the whole string
html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
pattern = r"<(.+?)>(.+?)</\1>"
matches = re.findall(pattern, html)
for match in matches:
    tag = match[0]
    content = match[1]
    print(f"Tag: {tag}, Content: {content}")
Tag: div, Content: <p>This is a paragraph</p><span>This is a span</span>

Best Solution - Use specific modules (avoid regex)

  • In this html parsing case, use BeautifulSoup
from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
soup = BeautifulSoup(html, 'html.parser')

def process_tags(element):
    # Skip the root BeautifulSoup object, whose name is "[document]"
    if not element.name.startswith("["):
        print(f"Tag: {element.name}, Content: {element.get_text()}")
    for child in element.children:
        if child.name:
            process_tags(child)

process_tags(soup)
Tag: div, Content: This is a paragraphThis is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span

Insisting on a Regex Solution?

import re

# Collect every opening tag name, e.g. ['div', 'p', 'span']
fetch_tags_pattern = r"<(\w+)>"
tag_names = re.findall(fetch_tags_pattern, html)

for tag_name in tag_names:
    # Match this tag together with its content
    tag_pattern = f"<({tag_name})>(.*?)</{tag_name}>"
    matches = re.findall(tag_pattern, html)
    for tag, content in matches:
        # Replace any nested tags with a space
        content = re.sub(r"<.*?>", " ", content)
        print(f"Tag: {tag}, Content: {content}")
Tag: div, Content:  This is a paragraph  This is a span 
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span
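Another way to get at the inner tags with a single pattern is a lookahead, which lets matches overlap. This is only a sketch for this small, well-formed snippet: it keeps nested tags inside the outer content, and it would break on tags with attributes or on a tag nested inside another tag of the same name.

```python
import re

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"

# The lookahead (?=...) matches without consuming any text, so the search
# can start again inside a tag that an earlier match already covered.
overlapping_pattern = r"(?=<(\w+)>(.+?)</\1>)"

overlapping_matches = re.findall(overlapping_pattern, html)
for tag, content in overlapping_matches:
    print(f"Tag: {tag}, Content: {content}")
# Tag: div, Content: <p>This is a paragraph</p><span>This is a span</span>
# Tag: p, Content: This is a paragraph
# Tag: span, Content: This is a span
```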

Conclusion

  • I am sure that, over your years of working with Regex, you have come across many better examples
  • As with Bash scripting, the key with Regex is to know when to stop trying it and use some other solution
  • Thank you for reading this blog piece. Happy Regexing!