Python Regex (re – regular expression operations)

A regular expression, or regex, is a sequence of characters that defines a search pattern. It is mostly used with strings to find characters, sequences of characters, or substrings. Regex is also used in many areas of development, such as form validation, password validation, and pattern matching.

Python has a module dedicated to regular expressions. This module is called re, and it provides several functions for performing regex operations.

In simple words, a regular expression is a set of rules against which a string is matched.

Example of Regex.

^a.z$

The above is a regular expression made using metacharacters. It matches a string that starts with ‘a’, has exactly one character (any character) after it, and ends with ‘z’.

The following strings will match the regex.

  • anz
  • adz
  • axz

The following strings will not match the regex.

  • axxxz
  • bxz
  • bxb

So this is how regex works. We have a pattern, a string, and then we use a function from the re module to operate on them.
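
The two lists above can be checked with a short script. This is a minimal sketch that assumes re.search() as the matching function; the variable names are only illustrative.

```python
import re

pattern = r"^a.z$"  # 'a', then exactly one character, then 'z'
candidates = ["anz", "adz", "axz", "axxxz", "bxz", "bxb"]

# re.search() returns a match object on success and None on failure
results = {text: re.search(pattern, text) is not None for text in candidates}

for text, matched in results.items():
    print(text, "matches" if matched else "does not match")
```

The first three strings are reported as matches and the last three are not.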

To make a regular expression, you should first know what pattern you are looking for, then use metacharacters to define that pattern, and finally use a regex function to search for the pattern in the string.

Types of Metacharacters in Python re

Metacharacters are special characters that are used to create regular expressions. The regex engine interprets them specially instead of matching them literally. In the above example, we already used three metacharacters. Let’s discuss them.

1. Caret – ^

To check if a string starts with a certain character, we use the caret (^) symbol. The caret anchors the pattern to the very start of the string, so whatever follows it must appear there.

For example:

  • ‘^a’ will match ‘abc’, ‘abcdef’, ‘a123z’, ‘aaaa’, and ‘a’.
  • ‘^a’ will not match ‘baz’, ‘zaaa’, ‘1aaaaa’, ‘123aaa’ and ‘zazaza’.

2. Dollar – $

To check if a string ends with a certain character, we use the dollar ($) symbol. The dollar anchors the pattern to the very end of the string (or just before a trailing newline), so whatever precedes it must appear there.

For example:

  • ‘z$’ will match ‘abz’, ‘abczzz’, ‘a123z’, ‘zzzz’, and ‘z’.
  • ‘z$’ will not match ‘bza’, ‘zaaa’, ‘zzz1’, ‘zzz123’ and ‘zazaza’.
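
Both anchors can be tried out together. The sketch below is only an illustration: it assumes re.search() as the matching function and filters a few of the strings listed above.

```python
import re

starts_with_a = re.compile(r"^a")  # 'a' must be the first character
ends_with_z = re.compile(r"z$")    # 'z' must be the last character

caret_hits = [s for s in ["abc", "a123z", "baz", "zaaa"] if starts_with_a.search(s)]
dollar_hits = [s for s in ["abz", "a123z", "bza", "zzz1"] if ends_with_z.search(s)]

print(caret_hits)   # ['abc', 'a123z']
print(dollar_hits)  # ['abz', 'a123z']
```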

3. Period (dot) – .

The period or dot matches any single character except a newline. Using n periods in a row matches exactly n characters.

For example:

  • ‘a.z’ will match ‘abz’, ‘a1z’, and ‘azz’.
  • ‘a.z’ will not match ‘abbz’, ‘a11z’, and ‘az’.
  • ‘a..z’ will match ‘abcz’, ‘a12z’, and ‘azzz’.
  • ‘a..z’ will not match ‘abz’, ‘abbbbz’, and ‘az’.
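
To see that each dot stands for exactly one character, the following sketch compares whole strings against the pattern. It assumes re.fullmatch(), which requires the entire string to match; matches_whole is a hypothetical helper name used only for this example.

```python
import re

def matches_whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

one_dot = [matches_whole(r"a.z", s) for s in ["abz", "a1z", "abbz"]]
two_dots = [matches_whole(r"a..z", s) for s in ["abcz", "a12z", "abz"]]

print(one_dot)   # [True, True, False]
print(two_dots)  # [True, True, False]
```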

4. Star – *

To check if the preceding character occurs zero or more times, we use the star (*) symbol.

For example:

  • ‘ab*z’ will match ‘abbz’, ‘abz’, and ‘az’.

5. Plus – +

To check if the preceding character occurs one or more times, we use the plus (+) symbol.

For example:

  • ‘ab+z’ will match ‘abz’, ‘abbz’, and ‘abbbbbbz’.
  • ‘ab+z’ will not match ‘az’.
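
The difference between star and plus shows up on the string ‘az’, which has no ‘b’ at all. A small sketch, assuming re.search() as the matching function:

```python
import re

words = ["az", "abz", "abbz"]

# * allows zero or more 'b' characters between 'a' and 'z'
star_hits = [w for w in words if re.search(r"ab*z", w)]

# + requires at least one 'b' between 'a' and 'z'
plus_hits = [w for w in words if re.search(r"ab+z", w)]

print(star_hits)  # ['az', 'abz', 'abbz']
print(plus_hits)  # ['abz', 'abbz']
```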

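6. Question mark – ?

To check if the preceding character occurs zero or one time, we use the question mark (?) symbol.

For example:

  • ‘ab?z’ will match ‘az’ and ‘abz’.
  • ‘ab?z’ will not match ‘abbz’.

7. Square brackets – []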

The square brackets are used to define a set of characters we wish to match. A match needs only one of the characters inside the brackets, not all of them.

For example:

  • ‘[az]’ will match ‘abc’, ‘xyz’, and ‘az’.
  • ‘[az]’ will not match ‘bc’, ‘xy’, and ‘bcxy’.
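
A quick way to see which characters a set picks out is to list every match. The sketch below assumes re's findall method, which returns all matches as a list; the variable names are illustrative.

```python
import re

char_set = re.compile(r"[az]")  # matches a single 'a' or a single 'z'

found_in_abc = char_set.findall("abc")
found_in_az = char_set.findall("az")
found_in_bcxy = char_set.findall("bcxy")

print(found_in_abc)   # ['a']
print(found_in_az)    # ['a', 'z']
print(found_in_bcxy)  # []
```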

8. Alternation (vertical bar) – |

The alternation or vertical bar (|) is used as an or operator. The pattern matches if the alternative on either side of the bar is found, and each alternative can be a single character or a longer sequence.

For example:

  • ‘a|z’ will match ‘abc’, ‘xyz’, and ‘abcz’.
  • ‘a|z’ will not match ‘bcd’ and ‘xy’.
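
Alternation can be sketched as follows, assuming re.search() as the matching function. The ‘cat|dog’ pattern is a made-up example added here to show that each alternative may be longer than one character.

```python
import re

single = re.compile(r"a|z")     # either 'a' or 'z'
words = re.compile(r"cat|dog")  # either 'cat' or 'dog' (hypothetical example)

first_in_abc = single.search("abc").group()
first_in_xyz = single.search("xyz").group()
nothing = single.search("bcd")
pet = words.search("hot dog").group()

print(first_in_abc, first_in_xyz, nothing, pet)  # a z None dog
```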

9. Braces – {}

The braces are used to check the number of times a character occurs consecutively. Two numbers are specified, written as {n,m}: n is the minimum number of times the character should appear and m is the maximum number of times it should appear.

For example:

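  • ‘a{2,6}’ will match ‘aa’, ‘aaaa’, and ‘aaaaaa’.
  • ‘a{2,6}’ will not match ‘a’. With anchors, ‘^a{2,6}$’ also rejects ‘aaaaaaaaa’; without them, a search still finds a run of six inside that string.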
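
The minimum and maximum can be checked by testing runs of different lengths. This sketch assumes the whole string must consist of the run, so it uses fullmatch(); the list of lengths is arbitrary.

```python
import re

run_of_a = re.compile(r"a{2,6}")  # between 2 and 6 consecutive 'a' characters

lengths = [1, 2, 4, 6, 9]

# Keep only the lengths whose string of a's is accepted as a whole
accepted = [n for n in lengths if run_of_a.fullmatch("a" * n)]

print(accepted)  # [2, 4, 6]
```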

Special sequences

Special sequences in Python work much like metacharacters. Each one is a backslash followed by a character, and they make common regular expressions easier to define.

1. \A

\A is used to check if certain characters are present at the start of a string or not.

For example:

  • ‘\Ahell’ will match ‘hello’ and ‘hello world’.
  • ‘\Ahell’ will not match ‘hey hello’.
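
The sketch below tries \A on the strings above, assuming re.search() as the matching function. It also shows one place where \A and ^ differ: with the re.MULTILINE flag, ^ matches at the start of every line, while \A still matches only at the start of the whole string.

```python
import re

at_start = re.search(r"\Ahell", "hello world")   # match object
not_at_start = re.search(r"\Ahell", "hey hello")  # None

text = "hey\nhello"  # 'hello' starts the second line, not the string
caret_multiline = re.search(r"^hell", text, re.MULTILINE)
backslash_a_multiline = re.search(r"\Ahell", text, re.MULTILINE)

print(at_start is not None, not_at_start is None)                   # True True
print(caret_multiline is not None, backslash_a_multiline is None)   # True True
```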

2. \b and \B

\b is used to check if certain characters are present at the beginning or end of a word. Placed before the characters, as in ‘\bhell’, it requires them to start a word.

For example:

  • ‘\bhell’ will match ‘hello world’, ‘hey hello’, and ‘world hell’.
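  • ‘\bhell’ will not match ‘hey world’, ‘world hey’, and ‘shell’.

Note: \b works on the words of a string, not the whole string. In Python code, write the pattern as a raw string such as r'\bhell', because '\b' in a normal string literal is a backspace character.

\B is used to check if certain characters are not at the beginning or end of a word. It is the opposite of \b.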

For example:

  • ‘\Bhell’ will not match ‘hello world’, ‘hey hello’, and ‘world hell’.
  • ‘\Bhell’ will match ‘shell’ and ‘seashells’.
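
Word boundaries are easiest to understand by comparing ‘hello’ with ‘shell’, where ‘hell’ sits inside a word. A minimal sketch, assuming re.search() and raw-string patterns (the r prefix keeps Python from reading \b as a backspace):

```python
import re

start_of_word = re.search(r"\bhell", "hey hello")   # 'hell' starts the word 'hello'
inside_word = re.search(r"\bhell", "shell")         # 'hell' is preceded by 's'

not_at_boundary = re.search(r"\Bhell", "shell")     # 'hell' is inside a word
at_boundary = re.search(r"\Bhell", "hello world")   # 'hell' starts a word

print(start_of_word.span())          # (4, 8)
print(inside_word)                   # None
print(not_at_boundary.span())        # (1, 5)
print(at_boundary)                   # None
```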

3. \d and \D

\d matches any decimal digit in the string.

For example:

  • ‘\d’ will match ‘hey123’, ‘1234’, and ‘123hello234’.
  • ‘\d’ will not match ‘hey’ and ‘hello’.

\D matches any character that is not a decimal digit. It is the opposite of \d.

For example:

  • ‘\D’ will not match ‘1234’ and ‘2021’.
  • ‘\D’ will match ‘hey’, ‘hello’, and ‘hey123’.
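
Since \d and \D each match one character at a time, listing the matches makes their behavior visible. The sketch assumes re.findall() and re.search(); the sample strings are taken from the examples above.

```python
import re

digits = re.findall(r"\d", "123hello234")  # every digit character
no_digits = re.search(r"\d", "hello")      # None, there is no digit

non_digits = re.findall(r"\D", "hey")      # every non-digit character
only_digits = re.search(r"\D", "1234")     # None, every character is a digit

print(digits)       # ['1', '2', '3', '2', '3', '4']
print(no_digits)    # None
print(non_digits)   # ['h', 'e', 'y']
print(only_digits)  # None
```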

4. \w and \W

\w matches any alphanumeric character, meaning a letter, a digit, or an underscore (equivalent to [a-zA-Z0-9_] for ASCII text).

For example:

  • ‘\w’ will match ‘heyhello123’, ‘hey’, and ‘12234’.
  • ‘\w’ will not match ‘*&^&*’.

\W matches any character other than an alphanumeric character. It is the opposite of \w.

For example:

  • ‘\W’ will not match ‘heyhello123’, ‘hey’, and ‘12234’.
  • ‘\W’ will match ‘*&^&*’.
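
The split between word and non-word characters can be seen on a single mixed string. A small sketch, assuming re.findall() and re.search(); the string ‘a_1!’ is a made-up example.

```python
import re

word_chars = re.findall(r"\w", "a_1!")      # letter, underscore, digit
non_word_chars = re.findall(r"\W", "a_1!")  # only the punctuation
symbols = re.search(r"\W", "*&^&*")         # match object
plain_word = re.search(r"\W", "hey")        # None

print(word_chars)      # ['a', '_', '1']
print(non_word_chars)  # ['!']
print(symbols.group(), plain_word)  # * None
```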

Python Regular Expression Module ‘re’ Functions

Now that we know how to create regular expressions using metacharacters and special sequences, we can move to the re module.

As mentioned earlier, the re module has several functions that can be used for regex operations.

1. search()

The search() function is one of the most commonly used functions of the re module. It takes two arguments, the regex and the string, and scans the string for the first location where the pattern matches.

If the pattern is matched, search() returns a match object (with details such as the matched text and its position); if it is not, it returns None.
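
Here is a minimal sketch of search() and the match object it returns. The sample sentence and the ‘\d+’ pattern are illustrative; group(), start(), end(), and span() are standard match-object methods.

```python
import re

text = "order 42 shipped, order 77 pending"

match = re.search(r"\d+", text)  # first run of digits only
if match:
    print(match.group())  # 42
    print(match.span())   # (6, 8)

# When nothing matches, search() returns None
no_match = re.search(r"\d+", "no digits here")
print(no_match)  # None
```

Because the result can be None, it is usually safest to test it with an if statement before calling methods on it.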

2. findall()

The findall() function is used to find all the non-overlapping matches of a pattern in a string. It returns a list that contains all the matches. Like search(), findall() takes two arguments, the regex and the string.

If there is no match, it returns an empty list.
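
A short sketch of findall(), using an illustrative sentence and the ‘\d+’ pattern:

```python
import re

text = "order 42 shipped, order 77 pending"

numbers = re.findall(r"\d+", text)              # every run of digits
nothing = re.findall(r"\d+", "no digits here")  # no match at all

print(numbers)  # ['42', '77']
print(nothing)  # []
```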

3. split()

The split() function splits the string at every match of the pattern. It returns a list that contains the pieces of the split string.
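
The sketch below splits on runs of digits and on commas with optional spaces. The input strings are made up for illustration; maxsplit is an optional argument of re.split() that limits the number of splits.

```python
import re

by_digits = re.split(r"\d+", "one1two22three333four")
by_commas = re.split(r",\s*", "a, b,c")
first_only = re.split(r",\s*", "a, b,c", maxsplit=1)  # split at most once

print(by_digits)   # ['one', 'two', 'three', 'four']
print(by_commas)   # ['a', 'b', 'c']
print(first_only)  # ['a', 'b,c']
```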

4. sub()

The sub() function is used to replace every matched substring with another substring. It returns a new string. It takes three arguments: the regex, the replacement substring, and the string.
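
A minimal sketch of sub() with illustrative inputs. The optional count argument, passed by keyword here, limits how many matches are replaced.

```python
import re

# Collapse every run of whitespace into a single space
cleaned = re.sub(r"\s+", " ", "too   many    spaces")

# Replace only the first two digits
masked = re.sub(r"\d", "#", "a1b2c3", count=2)

print(cleaned)  # too many spaces
print(masked)   # a#b#c3
```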

5. subn()

The subn() function is similar to the sub() function, but it returns a tuple.

The tuple contains the new string and the number of replacements made.
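
The returned tuple can be unpacked directly, as in this small sketch with made-up inputs:

```python
import re

new_text, changes = re.subn(r"\d", "#", "a1b2c3")
print(new_text, changes)  # a#b#c# 3

# With no match, the string is unchanged and the count is 0
unchanged = re.subn(r"\d", "#", "abc")
print(unchanged)  # ('abc', 0)
```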

TL;DR

  • A regular expression or regex is a sequence of characters that defines a search pattern.
  • Python has a re module to work with regular expressions.
  • Metacharacters are special characters used to create regular expressions.
  • The metacharacters available in Python are dollar ($), caret (^), plus (+), star (*), square brackets ([]), alternation (|), period (.), question mark (?), and braces ({}).
  • The special sequences are also used to create regular expressions efficiently.
  • The re module has several functions to use with regular expressions such as search, split, sub, subn, and findall.

For more in-depth details, refer to the Python re documentation.


