What is RegEx?
- Regular Expressions are a language of their own for matching patterns
- They are highly useful in text data processing
The official Python source defines the Regex in the following way:
An expression containing ‘meta’ characters and literals to identify and/or replace a pattern matching that expression
- Meta Characters: these characters have a special meaning
- Literals: these characters are the actual characters that are to be matched
Use Cases
- To search a string pattern
- To split a string based on a pattern
- To replace a part of the string
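The three use cases above map onto three functions in Python's built-in re module. The snippet below is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "apple, banana;cherry"  # hypothetical sample string

# Search: find the first run of lowercase letters
first = re.search(r"[a-z]+", text)
first_word = first.group(0) if first else None

# Split: break on a comma or semicolon followed by optional whitespace
parts = re.split(r"[,;]\s*", text)

# Replace: swap every separator for " | "
replaced = re.sub(r"[,;]\s*", " | ", text)

print(first_word)
print(parts)
print(replaced)
```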
A Selected List of Advanced Regex Usages in Python
Case 1: Extract Username and Domain from Email
Key Concepts: Use of the group() method on the match object returned by re.search, and numbered captures using properly placed parentheses
Pattern: "(\w+)\.(\w+)@(\w+)\.\w+"
import re

email = "senthil.kumar@gutszen.com"
# Three numbered groups: first name, last name, company (domain without the TLD)
pattern = r"(\w+)\.(\w+)@(\w+)\.\w+"
match = re.search(pattern, email)
if match:
    first_name = match.group(1)
    last_name = match.group(2)
    company = match.group(3)
    print(f"first_name: {first_name}")
    print(f"last_name: {last_name}")
    print(f"company: {company}")
first_name: senthil
last_name: kumar
company: gutszen
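As an optional variation on numbered captures, Python's re module also supports named groups via (?P<name>...). The sketch below reuses the same email; the group names are my own choice, not something the pattern requires.

```python
import re

email = "senthil.kumar@gutszen.com"
# Same pattern as above, but each group carries a name (names are illustrative)
named_pattern = r"(?P<first_name>\w+)\.(?P<last_name>\w+)@(?P<company>\w+)\.\w+"

m = re.search(named_pattern, email)
# groupdict() returns the named captures as a dictionary
parts = m.groupdict() if m else {}
print(parts)
```

Named groups can make longer patterns easier to read, since match.group("company") says more than match.group(3).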
Case 2: A Regex Gotcha - An example where a raw string literal is needed
- In most cases, with or without a raw string literal, the Python pattern works fine (stackoverflow comment)
- But consider the following example, where text is a raw string literal with a backslash (\) in it
text = r"Can you capture? this\that"
pattern = r"\w+\\\w+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: this\that
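- What happens if I try the example below, where neither text nor pattern is a raw string literal?
- Do notice the word hat at the end of the matches
- Without the raw prefix, \t in text becomes a tab character, and "\w+\\w+" reduces to the pattern \w+\w+, which matches any run of two or more word characters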
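# Non-raw on purpose: \t in text is a tab, and the pattern reduces to \w+\w+
# (newer Python versions also warn about the invalid escape \w here)
text = "Can you capture? this\that"
pattern = "\w+\\w+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")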
Matched String: Can
Matched String: you
Matched String: capture
Matched String: this
Matched String: hat
- What if I try the example below, where only the pattern is a raw string literal?
- Do notice the capture of this<tab_space>hat
text = "Can you capture? this\that"
pattern = r"\w+\t\w+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: this hat
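To see why the three examples above behave so differently, it can help to inspect what the two strings actually contain. This is a small sketch using only built-ins; the variable names are illustrative.

```python
raw_text = r"this\that"    # raw: backslash and "t" stay as two separate characters
plain_text = "this\that"   # non-raw: \t is interpreted as a single tab character

print(len(raw_text))       # 9 characters: t h i s \ t h a t
print(len(plain_text))     # 8 characters: t h i s <tab> h a t
print(repr(raw_text))
print(repr(plain_text))

has_backslash = "\\" in raw_text
has_tab = "\t" in plain_text
```

So the raw text needs a pattern that matches a literal backslash, while the non-raw text needs one that matches a tab.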
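Case 3A: Importance of the Greedy Operator!!
- Use of ? after a quantifier to make it non-greedy (lazy), so that the match stops at the first possible closing quote instead of the last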
text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+?)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello
Matched String: Hi
# Without the ?, .+ is greedy and runs through to the last quote
text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+)'"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello', and he replied, 'Hi
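A common alternative to the lazy quantifier is a negated character class that simply refuses to cross a quote. The sketch below puts the three variants side by side on the same text; the variable names are illustrative.

```python
import re

text = "She said, 'Hello', and he replied, 'Hi'"

greedy = re.findall(r"'(.+)'", text)      # runs to the last quote
lazy = re.findall(r"'(.+?)'", text)       # stops at the first closing quote
negated = re.findall(r"'([^']+)'", text)  # never matches a quote inside the group

print(greedy)
print(lazy)
print(negated)
```

On this input the lazy and negated versions give the same result; the negated class just states the intent ("anything but a quote") more explicitly.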
Case 3B: Importance of Escaping Parentheses!!
- What if you want to capture text within parentheses?
text = "She said, (Hello), and he replied, (Hi)"
# \( and \) match literal parentheses; the inner ( ) is the capture group
pattern = r"\((.+?)\)"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")
Matched String: Hello
Matched String: Hi
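When the text to match is a fixed literal that happens to contain metacharacters, re.escape can add the backslashes for you. A minimal sketch, assuming we want to find the literal string "(Hi)" (the variable names are illustrative):

```python
import re

text = "She said, (Hello), and he replied, (Hi)"
literal = "(Hi)"

# Unescaped, the parentheses form a capture group, so only "Hi" is returned
unescaped_result = re.findall(literal, text)

# re.escape backslash-escapes the metacharacters so they match literally
escaped = re.escape(literal)
escaped_result = re.findall(escaped, text)

print(escaped)
print(unescaped_result)
print(escaped_result)
```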
Case 4: Splitting Sentences using Regex
- Use of [^<patterns>] (a negated character class) to match any character except those in <patterns>, so each match runs until it meets one of them
text = "Hello! My Name is Senthil. How are you doing?"
pattern = r"([^.?!]+[.?!])"
sentences = re.findall(pattern, text)
for sentence in sentences:
    print(f"Sentence: {sentence.strip()}")
Sentence: Hello!
Sentence: My Name is Senthil.
Sentence: How are you doing?
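The same split can also be approached from the other direction: instead of matching each sentence, split on the whitespace that follows a terminator. This sketch uses re.split with a lookbehind and is only one possible alternative, not a replacement for the pattern above.

```python
import re

text = "Hello! My Name is Senthil. How are you doing?"

# (?<=[.?!]) asserts that the whitespace is preceded by ., ? or !
# without consuming the terminator itself
sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in sentences:
    print(f"Sentence: {sentence}")
```

Because the lookbehind does not consume the punctuation, each piece keeps its terminator and no strip() is needed.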
Case 5: Extraction of different URL Formats
- Multiple Concepts: the OR operator |; ? for 0 or 1 match; [^/\s]+ means one or more characters that are anything but a / or whitespace
text = "Visit my website at https://www.example.com and check out www.blog.example.com or http://blogspot.com"
# The dot after www is escaped so that it matches only a literal "."
pattern = r"https?://[^/\s]+|www\.[^/\s]+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"URL: {match}")
URL: https://www.example.com
URL: www.blog.example.com
URL: http://blogspot.com
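One detail worth checking in URL patterns is the dot after www: unescaped, . matches any character, not just a literal dot. The sketch below shows the difference on a made-up sample string.

```python
import re

sample = "wwwxexample.com and www.example.com"  # hypothetical input

# Unescaped dot: "x" is accepted in place of the dot
loose = re.findall(r"www.[^/\s]+", sample)

# Escaped dot: only a literal "." after www is accepted
strict = re.findall(r"www\.[^/\s]+", sample)

print(loose)
print(strict)
```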
Where Regex Fails!
Best Solution - Use specific modules (avoid regex)
- In this HTML parsing case, use BeautifulSoup
from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
soup = BeautifulSoup(html, 'html.parser')

def process_tags(element):
    # Skip the top-level soup object, whose name is "[document]"
    if not element.name.startswith(r"["):
        print(f"Tag: {element.name}, Content: {element.get_text()}")
    # Recurse into child tags only; plain text nodes have no name
    for child in element.children:
        if child.name:
            process_tags(child)

process_tags(soup)
Tag: div, Content: This is a paragraphThis is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span
Insisting on a Regex Solution?
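# First pass: collect the names of all opening tags
fetch_tags_pattern = r"\<(\w+)\>"
tag_matches = re.findall(fetch_tags_pattern, html)

for tag in tag_matches:
    # Second pass: lazily capture everything between <tag> and </tag>
    tag_pattern = f"<({tag})>(.*?)</{tag}>"
    matches = re.findall(tag_pattern, html)
    for match in matches:
        tag = match[0]
        # Replace any nested tags in the content with a space
        content = re.sub('(<.*?>)',' ',match[1])
        print(f"Tag: {tag}, Content: {content}")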
Tag: div, Content: This is a paragraph This is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span
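The regex approach above works on this input, but it is fragile. As a rough illustration of where it breaks, consider a tag nested inside a tag of the same name; the HTML string here is a made-up example.

```python
import re

nested_html = "<div><div>inner</div>outer</div>"  # hypothetical nested input

# The lazy (.*?) stops at the first </div>, which closes the inner div,
# so the outer div's content is cut short and "outer" is lost
nested_matches = re.findall(r"<(div)>(.*?)</div>", nested_html)
print(nested_matches)
```

A regular expression of this kind has no notion of nesting depth, which is why a real parser is the safer choice for HTML.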
Conclusion
- I am sure that, over your years of working with Regex, you have come across many better examples
- As with Bash scripting, the key with Regex is to know when to stop trying it and use some other solution
- Thank you for reading this blog piece. Happy Regexing!