Regex

Introduction

1. What is the purpose of regex?

A regular expression (regex) is a pattern that describes a set of strings. It is a tool for pattern matching and string manipulation.

2. List the common use cases and examples for regex.

Text Search and Manipulation: Search for specific patterns or strings in text documents.
Data Validation: Validate user input in forms or applications. (E.g. password creation)
Data Extraction: Extract specific information from a larger dataset. (E.g. Extracting email addresses)
Data Cleaning: Clean and preprocess data by removing or replacing unwanted patterns. (E.g. Fix formatting issues in datasets.)

3. What are the key steps to solve a regex problem?

Understand the requirements - What needs to be included or excluded?
Identify the patterns in the inclusion or exclusion list.
Represent the patterns using a regular expression.
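The three steps above can be sketched in Python as follows. The requirement used here (two uppercase letters, a hyphen, three digits) and the sample strings are hypothetical and only serve as an illustration.

```python
import re

# Step 1: state the requirement as examples (hypothetical data).
should_match = ["AB-123", "XY-007"]
should_not_match = ["ab-123", "AB123", "AB-12", "ABC-123"]

# Step 2: the shared pattern is two uppercase letters, a hyphen, three digits.
# Step 3: write that pattern as a regular expression.
pattern = re.compile(r"^[A-Z]{2}-[0-9]{3}$")

matched = [s for s in should_match if pattern.search(s)]
rejected = [s for s in should_not_match if not pattern.search(s)]

print(matched)
print(rejected)
```

Keeping an inclusion list and an exclusion list side by side makes it easy to re-check the pattern whenever it changes.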

Meta Characters

Meta-characters are special characters with a predefined meaning in regular expressions.

1. Syntax: .

A wildcard that represents any character (except newline character).

2. Syntax: *

Zero or more occurrences.

3. Syntax: +

One or more occurrences.

4. Syntax: ?

Zero or one occurrence.

5. Syntax: .*

Matches zero or more occurrences of any character (except newline). In other words, .* means “match any sequence of characters, including an empty sequence.” It is greedy: it matches as many characters as possible.

6. Syntax: .*?

Matches any character (.) any number of times (*), but as few times as possible while still letting the overall regex match (?). This is called lazy, or non-greedy, matching.
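The difference between .* and .*? is easiest to see side by side. A minimal sketch, assuming Python's re module and a made-up sample string:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as it can, so the match runs to the last </b>.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy: .*? stops at the first </b> that lets the pattern succeed.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```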

7. Syntax: \

The backslash \ is an escape character. Place it before a special character such as ^ $ * . { ( ) \ so that the regex engine treats that character literally instead of as a meta-character. For example, \. matches a literal dot.
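A short sketch of why escaping matters, assuming Python's re module. The sample text is made up; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

price_text = "Total: 3.50 or 3x50?"

# Unescaped, the dot is a wildcard and also matches the "x".
unescaped = re.findall(r"3.50", price_text)

# Escaped, the dot only matches a literal dot.
escaped = re.findall(r"3\.50", price_text)

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.findall(re.escape("3.50"), price_text)

print(unescaped)     # ['3.50', '3x50']
print(escaped)       # ['3.50']
print(auto_escaped)  # ['3.50']
```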

8. Syntax: ^pattern

An anchor that matches at the beginning of the string (or of each line in multiline mode). It matches text that starts with the pattern.

9. Syntax: pattern$

An anchor that matches at the end of the string (or of each line in multiline mode). It matches text that ends with the pattern.

10. Syntax: ^pattern$

When ^ and $ are used together, they ensure that the entire string conforms to the specified pattern, not just a part of it.
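The effect of the anchors can be checked with a small sketch, assuming Python's re module and made-up sample strings:

```python
import re

samples = ["123", "abc1234", "12"]

# No anchors: the pattern may match anywhere inside the string.
contains_three_digits = [s for s in samples if re.search(r"[0-9]{3}", s)]

# ^ only: the string must start with the pattern.
starts_with_abc = [s for s in samples if re.search(r"^abc", s)]

# $ only: the string must end with the pattern.
ends_with_4 = [s for s in samples if re.search(r"4$", s)]

# ^ and $ together: the whole string must conform to the pattern.
exactly_three_digits = [s for s in samples if re.search(r"^[0-9]{3}$", s)]

print(contains_three_digits)  # ['123', 'abc1234']
print(starts_with_abc)        # ['abc1234']
print(ends_with_4)            # ['abc1234']
print(exactly_three_digits)   # ['123']
```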

11. Syntax: |

Acts as a logical OR, allowing you to specify alternative patterns.

12. Syntax: ()

Parentheses are used to group characters or subpatterns together.

13. Syntax: {}

Curly brackets specify the number of occurrences, or a range of occurrences, of the preceding character or group: {n} means exactly n, {n,} means at least n, and {n,m} means between n and m.
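Alternation, grouping and curly brackets are often combined. The sketch below, assuming Python's re module (re.fullmatch requires the whole string to match), shows how a group changes what a quantifier applies to. The sample strings are made up.

```python
import re

# Without a group, {2} applies only to the preceding character "b".
no_group = re.fullmatch(r"ab{2}", "abb")

# With a group, {2} applies to the whole subpattern "ab".
with_group = re.fullmatch(r"(ab){2}", "abab")

# Alternation inside a group limits the OR to that part of the pattern.
wood = re.fullmatch(r"(log|ply)wood", "plywood")

# {m,n} gives a range of occurrences.
candidates = ["a", "aa", "aaa", "aaaa"]
two_or_three = [s for s in candidates if re.fullmatch(r"a{2,3}", s)]

print(no_group is not None)    # True
print(with_group is not None)  # True
print(wood.group(1))           # ply
print(two_or_three)            # ['aa', 'aaa']
```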

14. Syntax: [ab]

Matches any one of the characters inside the square brackets. Here it matches a single character that is either a or b.

15. Syntax: [^ab]

Matches any one character that is not inside the square brackets. Here it matches any character except a and b.

16. Syntax: [a-c]

Matches any one character in the range a to c.

17. Syntax: [a-cm]

Matches any one character that is either:

between a and c, or
m.

18. Syntax: [a-cA-Cx]

Matches any one character that is either:

between a and c (lowercase),
between A and C (uppercase), or
x.
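The character classes above can be compared on one sample string. A minimal sketch, assuming Python's re module; the sample text is made up:

```python
import re

text = "abcxyz ABC m"

either_a_or_b = re.findall(r"[ab]", text)
not_a_or_b = re.findall(r"[^ab]", "abcab")
a_to_c = re.findall(r"[a-c]", text)
a_to_c_or_m = re.findall(r"[a-cm]", text)
mixed = re.findall(r"[a-cA-Cx]", text)

print(either_a_or_b)  # ['a', 'b']
print(not_a_or_b)     # ['c']
print(a_to_c)         # ['a', 'b', 'c']
print(a_to_c_or_m)    # ['a', 'b', 'c', 'm']
print(mixed)          # ['a', 'b', 'c', 'x', 'A', 'B', 'C']
```

Each class matches exactly one character per match, which is why findall returns single characters here.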

Exercises

1. Match a string of exactly 3 digits.

^[0-9]{3}$

2. Match a string of exactly 3 characters of any kind.

^.{3}$

3. Match a string of 4 to 6 letters.

^[a-zA-Z]{4,6}$

4. Match at least 4 ha.

(ha){4,}

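5. Match at most 3 repetitions of ha.

(ha){0,3}

Python also accepts (ha){,3}, but writing the lower bound explicitly is more portable across regex flavours.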

6. Match at least one a.

a+

7. Match zero or one a.

a?

8. Match either logwood or plywood.

(log|ply)wood
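One way to check answers like these is to run them against strings that should and should not match. The sketch below does that with Python's re module; the sample strings are made up for illustration, and anchors are added where the whole string has to conform. Note that inside square brackets a dot is a literal dot, so the wildcard is written outside brackets, and a range in curly brackets is written with a comma.

```python
import re

# (pattern, sample string, expected result of re.search)
cases = [
    (r"^[0-9]{3}$", "123", True),
    (r"^[0-9]{3}$", "1234", False),
    (r"^.{3}$", "a-1", True),
    (r"^.{3}$", "a-12", False),
    (r"^[a-zA-Z]{4,6}$", "regex", True),
    (r"^[a-zA-Z]{4,6}$", "abc", False),
    (r"(ha){4,}", "hahahaha", True),
    (r"(ha){4,}", "hahaha", False),
    (r"^(ha){0,3}$", "hahaha", True),
    (r"^(ha){0,3}$", "hahahaha", False),
    (r"a+", "banana", True),
    (r"^(log|ply)wood$", "plywood", True),
    (r"^(log|ply)wood$", "softwood", False),
]

# Collect every case whose actual result differs from the expected one.
failures = [
    (pattern, sample)
    for pattern, sample, expected in cases
    if bool(re.search(pattern, sample)) != expected
]

print(failures)  # an empty list means every case behaved as expected
```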

Special Sequences

Regex special sequences are sequences of characters with a special meaning when used in a regular expression. They are represented by a backslash (\) followed by a specific character.

1. Syntax: \A

Matches if the specified characters are at the beginning of a string.

2. Syntax: \Z

Matches if the specified characters are at the end of a string.
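\A and \Z look similar to ^ and $, but they always refer to the whole string. A minimal sketch of the difference, assuming Python's re module and made-up sample strings:

```python
import re

text = "hi there\nhi again"

# In multiline mode, ^ matches at the start of every line...
caret_hits = re.findall(r"^hi", text, flags=re.MULTILINE)

# ...while \A still matches only at the very start of the string.
start_hits = re.findall(r"\Ahi", text, flags=re.MULTILINE)

# $ also matches just before a trailing newline; \Z does not.
dollar_match = re.search(r"bye$", "bye\n")
end_match = re.search(r"bye\Z", "bye\n")

print(caret_hits)                # ['hi', 'hi']
print(start_hits)                # ['hi']
print(dollar_match is not None)  # True
print(end_match is not None)     # False
```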

3. Syntax: \w

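Matches any word character: a letter, a digit or an underscore. For ASCII text this is equivalent to the character class [a-zA-Z0-9_]. In Python 3, string patterns also match Unicode letters and digits unless the re.ASCII flag is set.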

4. Syntax: \W

Matches any character that is not a word character. It is equivalent to the character class [^a-zA-Z0-9_].

5. Syntax: \b

Asserts a word boundary. It matches the position where a word (a sequence of word characters) starts or ends, without consuming any characters.

6. Syntax: \B

Asserts a position in the input string where there is no word boundary.
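A small sketch of \b and \B, assuming Python's re module and a made-up sample string. The patterns are written as raw strings (r"...") because in an ordinary Python string "\b" means a backspace character.

```python
import re

text = "cat concat cats"

# \b on both sides: only the standalone word "cat".
whole_word = re.findall(r"\bcat\b", text)

# \B before "cat": only a "cat" that is NOT at the start of a word.
inside_word = re.findall(r"\Bcat", text)

# \b before "cat": every word that starts with "cat".
starting_words = re.findall(r"\bcat\w*", text)

print(whole_word)      # ['cat']
print(inside_word)     # ['cat']  (the one inside "concat")
print(starting_words)  # ['cat', 'cats']
```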

7. Syntax: \d

Matches any decimal digit. Equivalent to [0-9].

8. Syntax: \D

Matches any character that is not a decimal digit. Equivalent to [^0-9].

9. Syntax: \s

Matches any single whitespace character (space, tab, newline and so on).

10. Syntax: \S

Matches any single character that is not a whitespace character.
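The character-type sequences can be compared on one sample string. A minimal sketch, assuming Python's re module; the sample text is made up:

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)           # single digits
numbers = re.findall(r"\d+", text)         # runs of digits
words = re.findall(r"\w+", text)           # runs of word characters
non_word = re.findall(r"\W", text)         # spaces and punctuation
whitespace = re.findall(r"\s", text)       # whitespace only
non_space_runs = re.findall(r"\S+", text)  # runs without whitespace

print(digits)          # ['4', '2', '7']
print(numbers)         # ['42', '7']
print(words)           # ['Room', '42', 'Floor', '7']
print(non_word)        # [' ', ',', ' ', ' ']
print(whitespace)      # [' ', ' ', ' ']
print(non_space_runs)  # ['Room', '42,', 'Floor', '7']
```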

Exercises

1. Given a string “This is a ball.” Use \b to match the word ball.

\bball\b

2. Provide the regex to find all words starting with 'b' or 'e' in a given string.

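\b[be]\w*

The leading \b ensures the b or e is the first letter of a word rather than a letter in the middle of one.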

3. Given a string “This is a baseball.” Will the regex \bball\b match the ball in baseball?

No. There is no word boundary between e and b inside baseball, so \bball\b only matches ball as a whole word, not as part of another word.

4. Provide the regex to check if a string starts with hi.

\Ahi

5. Provide the regex to find the whole word red.

\bred\b

6. Provide the regex to check if a string ends with bye.

bye\Z

7. Provide the regex to check if a string starts with exactly 2 digits.

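\A\d{2}(\D|\Z)

\A\d{2} alone would also accept a string that starts with three or more digits. (\D|\Z) requires the two digits to be followed by a non-digit or by the end of the string.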

8. Provide the regex to check if a string ends with exactly 2 non-digit characters.

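(\A|\d)\D{2}\Z

\D{2}\Z alone would also accept a string that ends with three or more non-digits. (\A|\d) requires the two non-digits to be preceded by a digit or by the start of the string.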

9. Provide the regex to match with the pattern: [1270X160 , 800X600, 1024X768].

\d{3,4}X\d{3}

10. Provide the regex to match with the pattern: [John Wallace, Steve King, Adam Smith].

([a-zA-Z]+)\s([a-zA-Z]+)

11. Provide the regex to match with the pattern: [7:32, 6.12, 12:23, 1.23].

(\d{1,2})[:.](\d{2})

12. Provide the regex to match with the pattern: [745.246.4369, 234.325.6543].

(\d{3})\.(\d{3})\.(\d{4})

13. Provide the regex to match with the pattern: [Jan 5th 1987, Aug 3rd 2009].

([a-zA-Z]{3})\s(\w{3,4})\s(\d{4})

14. Provide the regex to match with the pattern: [(745).246.4369, (234).325.6543].

\(\d{3}\)\.\d{3}\.\d{4}
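Answers to exercises like these can be checked with re.findall and re.search. The sketch below assumes Python's re module and uses the sample data from the exercises, plus a few made-up strings. When a pattern contains groups, findall returns one tuple of group values per match. Literal parentheses are escaped as \( and \).

```python
import re

resolutions = re.findall(r"\d{3,4}X\d{3}", "1270X160, 800X600, 1024X768")
names = re.findall(r"([a-zA-Z]+)\s([a-zA-Z]+)", "John Wallace, Steve King, Adam Smith")
times = re.findall(r"(\d{1,2})[:.](\d{2})", "7:32, 6.12, 12:23, 1.23")
phones = re.findall(r"(\d{3})\.(\d{3})\.(\d{4})", "745.246.4369, 234.325.6543")
dates = re.findall(r"([a-zA-Z]{3})\s(\w{3,4})\s(\d{4})", "Jan 5th 1987, Aug 3rd 2009")
bracketed = re.findall(r"\(\d{3}\)\.\d{3}\.\d{4}", "(745).246.4369, (234).325.6543")

# Words starting with b or e (made-up sentence).
b_or_e_words = re.findall(r"\b[be]\w*", "a big elephant ate bread")

# "Exactly two" checks need something that stops a third digit or non-digit.
starts_two_digits = [s for s in ["12abc", "123abc", "12"]
                     if re.search(r"\A\d{2}(\D|\Z)", s)]
ends_two_non_digits = [s for s in ["12ab", "1abc", "ab"]
                       if re.search(r"(\A|\d)\D{2}\Z", s)]

print(resolutions)
print(names)
print(times)
print(phones)
print(dates)
print(bracketed)
print(b_or_e_words)
print(starts_two_digits)
print(ends_two_non_digits)
```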

Python Regex Functions

In Python, the re module provides support for regular expressions.

1. findall()

Returns a list containing all matches. The list contains the matches in the order they are found.

2. search()

Searches the string for a match, and returns a match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned.

3. split()

Returns a list where the string has been split at each match.

4. sub()

Replaces the matches with the text of your choice.

5. match_object.string

Returns the string passed into the function.

6. match_object.group()

Returns the part of the string where there was a match.

7. match_object.span()

Returns a tuple containing the start and end positions of the match.
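The functions and match-object members above can be tried together. A minimal sketch, assuming a made-up sample string. Note that string is an attribute, while group() and span() are methods, and that search() returns None when nothing matches.

```python
import re

txt = "The rain in Spain"

all_ai = re.findall(r"ai", txt)      # every non-overlapping match, in order
match = re.search(r"\bS\w+", txt)    # first match only, or None
pieces = re.split(r"\s", txt)        # split at each whitespace character
replaced = re.sub(r"\s", "_", txt)   # replace each whitespace character

if match is not None:
    original = match.string          # the string passed to search()
    matched_text = match.group()     # the text that matched
    position = match.span()          # (start, end) of the match

no_match = re.search(r"Portugal", txt)

print(all_ai)        # ['ai', 'ai']
print(matched_text)  # Spain
print(position)      # (12, 17)
print(pieces)        # ['The', 'rain', 'in', 'Spain']
print(replaced)      # The_rain_in_Spain
print(no_match)      # None
```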

Exercises

1. Search for the first white-space character in the string.

re.search(r"\s", txt)

2. Split at each white-space character from a string.

re.split(r'\s', txt)

3. Split at the first white-space character from a string.

re.split(r'\s', txt, maxsplit=1)

4. Replace every white-space character with :.

re.sub(r'\s', ':', txt)

5. Replace the first and second white-space character with :.

re.sub(r'\s', ':', txt, count=2)
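The exercise answers can be run end to end as below. This is a sketch with a made-up sample string. The patterns are raw strings so that \s reaches the regex engine unchanged, the limits are passed by keyword (maxsplit, count) for readability, and the variable is named txt to avoid shadowing the built-in str.

```python
import re

txt = "The rain in Spain"

first_space = re.search(r"\s", txt)
split_all = re.split(r"\s", txt)
split_once = re.split(r"\s", txt, maxsplit=1)
replace_all = re.sub(r"\s", ":", txt)
replace_two = re.sub(r"\s", ":", txt, count=2)

print(first_space.start())  # 3
print(split_all)            # ['The', 'rain', 'in', 'Spain']
print(split_once)           # ['The', 'rain in Spain']
print(replace_all)          # The:rain:in:Spain
print(replace_two)          # The:rain:in Spain
```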


Last updated on 25 Aug 2024