Regex

Introduction

1. What is the purpose of regex?
Regular expressions (regex) are a tool for pattern matching and string manipulation.
2. List the common use cases and examples for regex.
  • Text Search and Manipulation: Search for specific patterns or strings in text documents.
  • Data Validation: Validate user input in forms or applications. (E.g. password creation)
  • Data Extraction: Extract specific information from a larger dataset. (E.g. Extracting email addresses)
  • Data Cleaning: Clean and preprocess data by removing or replacing unwanted patterns. (E.g. Fix formatting issues in datasets.)
3. What are the key steps to solve a regex problem?
  1. Understand the requirements - What needs to be included or excluded?
  2. Identify the patterns in the inclusion or exclusion list.
  3. Represent the patterns using a regular expression.
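
The three steps can be sketched in Python on a hypothetical requirement that is not part of these notes: accept codes made of exactly six digits and reject everything else. The sample lists and the helper name is_valid are assumptions made for the example.

```python
import re

# Step 1: understand the requirements (hypothetical sample data).
include = ["123456", "000001", "987654"]
exclude = ["12345", "1234567", "12a456", ""]

# Step 2: identify the pattern: every accepted value is exactly six digits.
# Step 3: represent the pattern as a regular expression.
pattern = re.compile(r"^[0-9]{6}$")


def is_valid(text):
    """Return True when the string matches the pattern."""
    return pattern.search(text) is not None


results_in = [is_valid(s) for s in include]
results_out = [is_valid(s) for s in exclude]
print(results_in)   # [True, True, True]
print(results_out)  # [False, False, False, False]
```

Keeping an explicit inclusion and exclusion list makes it easy to re-run the check whenever the pattern changes.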

Meta Characters

Meta-characters are special characters with a predefined meaning in regular expressions.

1. Syntax: .
A wildcard that matches any single character except a newline.
2. Syntax: *
Zero or more occurrences.
3. Syntax: +
One or more occurrences.
4. Syntax: ?
Zero or one occurrence.
5. Syntax: .*
Matches zero or more occurrences of any character, as many as possible. In other words, .* means “match any sequence of characters, including an empty sequence.”
6. Syntax: .*?
Matches any character (.) any number of times (*), as few times as possible to make the regex match (?).
7. Syntax: \
The backslash \ is an escape character. Place it before a special character such as ^ $ * . { ( ) \ so that the regex treats that character literally.
8. Syntax: ^pattern
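Anchor that marks the beginning of a line (the beginning of the string unless multiline mode is on). The text must start with the pattern.
9. Syntax: pattern$
Anchor that marks the end of a line (the end of the string unless multiline mode is on). The text must end with the pattern.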
10. Syntax: ^pattern$
When ^ and $ are used together, they ensure that the entire string conforms to the specified pattern, not just a part of it.
11. Syntax: |
Acts as a logical OR, allowing you to specify alternative patterns.
12. Syntax: ()
Parentheses are used to group characters or subpatterns together.
13. Syntax: {}
Curly brackets specify the number of occurrences, or a range of occurrences, of a character or a group of characters. For example, {3} means exactly three and {2,5} means two to five.
14. Syntax: [ab]
Matches any one character from the set inside the square brackets, here a or b.
15. Syntax: [^ab]
Matches any one character that is not inside the square brackets, here anything except a and b.
16. Syntax: [a-c]
Matches any one character in the range a to c.
17. Syntax: [a-cm]

Matches any one character that is either:

  • between a and c or
  • m.
18. Syntax: [a-cA-Cx]

Matches any one character that is either:

  • between a and c lowercase,
  • between A and C uppercase or
  • x.
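
A few of the meta-characters above can be tried out with a short Python sketch. The sample strings are made up for illustration; the calls use the standard re module.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so the match runs to the last '>'.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? takes as little as possible, so each tag is matched separately.
lazy = re.findall(r"<.*?>", text)

# A set with a range plus one extra character.
in_set = re.findall(r"[a-cm]", "a b c d m z")

# A negated set: anything except a, b and the space character.
not_in_set = re.findall(r"[^ab ]", "a b c d")

# Escaping a meta-character so that it is matched literally.
literal_dot = re.findall(r"\.", "v1.2.3")

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(lazy)         # ['<b>', '</b>', '<i>', '</i>']
print(in_set)       # ['a', 'b', 'c', 'm']
print(not_in_set)   # ['c', 'd']
print(literal_dot)  # ['.', '.']
```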

Exercises

1. Match exactly 3 digits.
^[0-9]{3}$
2. Match exactly 3 characters of any kind.
^.{3}$
3. Match 4 to 6 letters.
^[a-zA-Z]{4,6}$
4. Match at least 4 ha.
(ha){4,}
5. Match at most 3 ha.
(ha){0,3}
6. Match at least one a.
a+
7. Match zero or one a.
a?
8. Match either logwood or plywood.
(log|ply)wood
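
The exercise patterns can be checked with a short Python sketch. The test strings are made up for illustration. Exercise 2 is written as ^.{3}$ (a dot outside brackets, so it stays a wildcard), exercise 3 as ^[a-zA-Z]{4,6}$ (comma in the quantifier), exercise 5 as (ha){0,3}, and ^ and $ are added to exercises 4 to 8 so that the whole string is tested; these choices are assumptions of the sketch.

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r"^[0-9]{3}$", ["123", "007"], ["12", "1234", "12a"]),
    (r"^.{3}$", ["abc", "a.1"], ["ab", "abcd"]),
    (r"^[a-zA-Z]{4,6}$", ["abcd", "AbCdEf"], ["abc", "abcdefg", "abc1"]),
    (r"^(ha){4,}$", ["hahahaha", "hahahahaha"], ["hahaha"]),
    (r"^(ha){0,3}$", ["", "ha", "hahaha"], ["hahahaha"]),
    (r"^a+$", ["a", "aaa"], ["", "b"]),
    (r"^a?$", ["", "a"], ["aa"]),
    (r"^(log|ply)wood$", ["logwood", "plywood"], ["wood", "firewood"]),
]

failures = []
for pattern, good, bad in cases:
    for s in good:
        if re.search(pattern, s) is None:
            failures.append((pattern, s, "expected a match"))
    for s in bad:
        if re.search(pattern, s) is not None:
            failures.append((pattern, s, "expected no match"))

print(failures)  # an empty list means every case behaved as expected
```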

Special Sequences

Regex special sequences are sequences of characters with a special meaning when used in a regular expression. They are represented by a backslash (\) followed by a specific character.

1. Syntax: \A
Matches if the specified characters are at the beginning of a string.
2. Syntax: \Z
Matches if the specified characters are at the end of a string.
3. Syntax: \w
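Matches any word character. For ASCII text this is equivalent to the character class [a-zA-Z0-9_]; in Python 3, Unicode letters and digits also match unless the re.ASCII flag is set.
4. Syntax: \W
Matches any character that is not a word character. For ASCII text this is equivalent to the character class [^a-zA-Z0-9_].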
5. Syntax: \b
Asserts a word boundary: matches the position where a word (a sequence of word characters) starts or ends.
6. Syntax: \B
Asserts a position in the input string where there is no word boundary.
7. Syntax: \d
Matches any decimal digit. Equivalent to [0-9] for ASCII text.
8. Syntax: \D
Matches any character that is not a decimal digit. Equivalent to [^0-9] for ASCII text.
9. Syntax: \s
Matches a single whitespace character (space, tab, newline and so on).
10. Syntax: \S
Matches a single non-whitespace character.
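
The special sequences can be compared side by side in a small Python sketch. The sample strings are invented for the example.

```python
import re

text = "Order 66 shipped to cat_lover at 9am"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters
spaces = re.findall(r"\s", text)        # single whitespace characters
non_word = re.findall(r"\W", "a-b c")   # characters that are not word characters

# \b requires a word boundary, \B requires that there is none.
whole_cat = re.findall(r"\bcat\b", "cat concat cats cat.")
inside_cat = re.findall(r"\Bcat\B", "cat concatenate")

# \A and \Z anchor to the start and end of the whole string.
starts = re.search(r"\AOrder", text) is not None
ends = re.search(r"9am\Z", text) is not None

print(digits)      # ['66', '9']
print(words)       # ['Order', '66', 'shipped', 'to', 'cat_lover', 'at', '9am']
print(len(spaces)) # 6
print(non_word)    # ['-', ' ']
print(whole_cat)   # ['cat', 'cat']
print(inside_cat)  # ['cat']
print(starts, ends)  # True True
```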

Exercises

1. Given a string “This is a ball.” Use \b to match the word ball.
\bball\b
2. Provide the regex to find all words starting with 'b' or 'e' in a given string.
\b[be]\w*
3. Given a string “This is a baseball.” Will the regex \bball\b match the ball in baseball?
No, because \b requires a word boundary on both sides, so the pattern matches ball only as a whole word, not as part of another word.
4. Provide the regex to check if a string starts with hi.
\Ahi
5. Provide the regex to find the whole word red.
\bred\b
6. Provide the regex to check if a string ends with bye.
bye\Z
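7. Provide the regex to check if a string starts with exactly 2 digits.
\A\d{2}(\D|\Z)
The trailing (\D|\Z) rejects a third digit; \A\d{2} alone also matches strings that start with three or more digits.
8. Provide the regex to check if a string ends with exactly 2 non-digit characters.
(\A|\d)\D{2}\Z
The leading (\A|\d) rejects a third non-digit character; \D{2}\Z alone also matches strings that end with three or more non-digit characters.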
9. Provide the regex to match with the pattern: [1270X160, 800X600, 1024X768].
\d{3,4}X\d{3}
10. Provide the regex to match with the pattern: [John Wallace, Steve King, Adam Smith].
([a-zA-Z]+)\s([a-zA-Z]+)
11. Provide the regex to match with the pattern: [7:32, 6.12, 12:23, 1.23].
(\d{1,2})[:.](\d{2})
12. Provide the regex to match with the pattern: [745.246.4369, 234.325.6543].
(\d{3})\.(\d{3})\.(\d{4})
13. Provide the regex to match with the pattern: [Jan 5th 1987, Aug 3rd 2009].
([a-zA-Z]{3})\s(\w{3,4})\s(\d{4})
14. Provide the regex to match with the pattern: [(745).246.4369, (234).325.6543].
\(\d{3}\)\.\d{3}\.\d{4}
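
The answers to exercises 1, 3 and 9 to 14 can be verified against the sample values listed in the questions. This is a minimal Python sketch; re.fullmatch is used here so that each sample must match from start to end, which is a choice made for the check rather than part of the exercises.

```python
import re

# Pattern -> sample values taken from the exercise statements.
samples = {
    r"\d{3,4}X\d{3}": ["1270X160", "800X600", "1024X768"],
    r"([a-zA-Z]+)\s([a-zA-Z]+)": ["John Wallace", "Steve King", "Adam Smith"],
    r"(\d{1,2})[:.](\d{2})": ["7:32", "6.12", "12:23", "1.23"],
    r"(\d{3})\.(\d{3})\.(\d{4})": ["745.246.4369", "234.325.6543"],
    r"([a-zA-Z]{3})\s(\w{3,4})\s(\d{4})": ["Jan 5th 1987", "Aug 3rd 2009"],
    r"\(\d{3}\)\.\d{3}\.\d{4}": ["(745).246.4369", "(234).325.6543"],
}

all_match = all(
    re.fullmatch(pattern, value) is not None
    for pattern, values in samples.items()
    for value in values
)

# Parentheses capture the parts of a match as groups.
date_parts = re.fullmatch(
    r"([a-zA-Z]{3})\s(\w{3,4})\s(\d{4})", "Jan 5th 1987"
).groups()

# \b keeps ball from matching inside baseball.
ball_alone = re.findall(r"\bball\b", "This is a ball.")
ball_in_baseball = re.findall(r"\bball\b", "This is a baseball.")

print(all_match)         # True
print(date_parts)        # ('Jan', '5th', '1987')
print(ball_alone)        # ['ball']
print(ball_in_baseball)  # []
```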

Python Regex Functions

In Python, the re module provides support for regular expressions.

1. findall()
Returns a list containing all matches. The list contains the matches in the order they are found.
2. search()
Searches the string for a match and returns a match object if one is found, or None otherwise. If there is more than one match, only the first is returned.
3. split()
Returns a list where the string has been split at each match.
4. sub()
Replaces the matches with the text of your choice.
5. match_object.string
Returns the string passed into the function.
6. match_object.group()
Returns the part of the string where there was a match.
7. match_object.span()
Returns a tuple with the start and end positions of the match.
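
A short sketch of findall(), search() and the match object follows. The sample sentence and variable names are assumptions made for the example.

```python
import re

txt = "The rain in Spain"

# findall() returns every match, in the order found.
all_ai = re.findall(r"ai", txt)

# search() returns a match object for the first match only.
m = re.search(r"\bS\w+", txt)
original = m.string   # the string that was searched
matched = m.group()   # the text that matched
position = m.span()   # (start, end) indexes of the match

# search() returns None when nothing matches, so test before using it.
no_digit = re.search(r"\d", txt)

print(all_ai)    # ['ai', 'ai']
print(original)  # The rain in Spain
print(matched)   # Spain
print(position)  # (12, 17)
print(no_digit)  # None
```

Note that group() and span() are methods and need parentheses, while string is a plain attribute.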

Exercises

1. Search for the first white-space character in the string.
re.search(r'\s', txt)
2. Split at each white-space character from a string.
re.split(r'\s', txt)
3. Split at the first white-space character from a string.
re.split(r'\s', txt, maxsplit=1)
4. Replace every white-space character with :.
re.sub(r'\s', ':', txt)
5. Replace the first and second white-space character with :.
re.sub(r'\s', ':', txt, count=2)
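
The five exercises can be run together as one script. This is a sketch with a made-up sample string; it uses raw strings for the patterns, passes maxsplit and count as keyword arguments, and names the variable txt so that the built-in str is not shadowed.

```python
import re

txt = "one two three four"

first_ws = re.search(r"\s", txt)                # exercise 1
split_all = re.split(r"\s", txt)                # exercise 2
split_once = re.split(r"\s", txt, maxsplit=1)   # exercise 3
sub_all = re.sub(r"\s", ":", txt)               # exercise 4
sub_two = re.sub(r"\s", ":", txt, count=2)      # exercise 5

print(first_ws.start())  # 3
print(split_all)         # ['one', 'two', 'three', 'four']
print(split_once)        # ['one', 'two three four']
print(sub_all)           # one:two:three:four
print(sub_two)           # one:two:three four
```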

Last updated on 25 Aug 2024