Regular Expressions in Python: Everything You Need To Know

 


Regex was something I avoided at all costs, and for far longer than I would like to admit. However, I decided one day that I needed to learn how to use it.
The journey of a programmer is not complete without regex. It is a tool you need to have in your tool belt, and it can save you the pain of writing unnecessarily long code.

My first step into the world of regex was to dive right in. The basic concepts were difficult to grasp, but they became quite clear once I understood them.
The process of stringing sentences together became second nature once I understood the words and the grammar.

At their core, regular expressions search for patterns among characters. A regex is itself a sequence of characters that describes such a pattern, and it is commonly used for text parsing and string validation.
Consider a sheet of cardboard with certain shapes cut out of it. A shape can only pass through if it matches one of the cut-outs precisely. The same idea translates into a regex string.

What would a regex search pattern look like for this?

Regex: circle|triangle|hexagon

Input statement: The three incoming shapes are circles, triangles, and rectangles.

What a simple concept!
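
To see the cardboard analogy in code, here is a minimal sketch that runs the pattern above against the input statement with Python's re.findall. The variable names are only for illustration.

```python
import re

# the three "cut-outs" from the cardboard analogy, separated by the or operator
pattern = r"circle|triangle|hexagon"
sentence = "The three incoming shapes are circles, triangles, and rectangles."

# findall returns every non-overlapping match, scanning from left to right
matches = re.findall(pattern, sentence)
print(matches)  # ['circle', 'triangle']
```

The rectangles do not fit any cut-out, so they are ignored, and no hexagon ever arrives.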

It is worth noting that regular expressions come from formal language theory, where they describe what are called regular languages. Most of the programming languages we use today support regex through built-in (or downloadable) modules.

We can use them in the language of our choice. For all the code in this post, in addition to the regex101 links sprinkled throughout, I will be using the Python regex module re.

In the world of code, how do you create this metaphorical cardboard? Let’s look at an example.

You have the string “Sylvie is 20 years old.” and you want to extract only the age from it. You only need the number here. For this regex pattern, we use \d, a special sequence that matches any single digit (we’ll discuss the details of the patterns later).

# import libraries
import re
txt = "Sylvie is 20 years old."
# regex to get each digit from the string
age = re.findall(r'\d', txt)
print(age)

The output is:

['2', '0']

We are not exactly where we want to be, but we are getting closer. At least we figured out the digits! Our cut-out red cardboard block acts as a number identifier.

Depending on the requirement, an identifier can be a literal character (for example, the regex a searches the input string for every occurrence of the letter a) or one of a group of special characters. We will see many of them later.

The number as a whole, not individual digits, is what we need. Using another block, let’s modify the identifier a little.

# import libraries
import re
txt = "Sylvie is 20 years old."
# get only 2 digit numbers
age = re.findall(r'\d{2}', txt)
print(age)

Which gives:

['20']

Congratulations! We have Sylvie’s age from the string! However, what if the year of birth was also included in the string? Using the expression above, we would get every 2-digit group, including those inside the year.

That’s not what we want, so let’s modify it a little more. Do we have any options?

We have to include the boundary expression \b on both sides, since the two-digit number has a space on its left and on its right while the digits inside a longer number do not.

# import libraries
import re
txt = "Sylvie is 20 years old, she was born in the year 2001"
# get only standalone 2 digit numbers
age = re.findall(r'\b\d{2}\b', txt)
print(age)

Output:

['20']

Even when there is another 4-digit number in the string, we get the result we need.

That’s pretty cool, isn’t it? Using just one line of code, we could extract the number from a string.

In our code-cardboard-sheet, we already talked about how there are a bunch of characters we need to use to create these virtual cut-outs. Let’s take a look at them.

Characters

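  • \d: any digit 0 to 9
  • \D: anything other than a digit
  • \s: any whitespace character (space, tab, newline)
  • \S: anything other than whitespace
  • \w: any word character (letter, digit, or underscore)
  • \W: anything other than a word character
  • \b: word boundary, the position between a word character and a non-word character
  • . : matches any character except a newline
  • \. : matches a literal period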
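
As a quick check of the list above, here is a small sketch that applies several of these characters to one sample string. The sample string and variable names are made up for illustration.

```python
import re

sample = "Room 42, floor 3."

digits = re.findall(r"\d", sample)        # every single digit
non_digits = re.findall(r"\D", sample)    # everything that is not a digit
spaces = re.findall(r"\s", sample)        # whitespace characters
words = re.findall(r"\b\w+\b", sample)    # runs of word characters between boundaries
any_char = re.findall(r".", sample)       # unescaped period: any character
periods = re.findall(r"\.", sample)       # escaped period: a literal "."

print(digits)         # ['4', '2', '3']
print(words)          # ['Room', '42', 'floor', '3']
print(len(any_char))  # 17, one match per character of the sample
print(periods)        # ['.']
```

The difference between the last two lines is the one that most often causes surprises: an unescaped period matches everything, while \. matches only the period itself.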

Modifiers

  • {}: repetition counts, like \d{3} matches exactly 3 digits, \d{3,5} matches 3 to 5 digits. In general, it is {min,max}
  • []: a set of characters. It matches a single character from the content of the brackets. For example, [a-z] matches any one lowercase letter.
  • +: Matches the element before it one or more times, so [a-z]+a matches one or more lowercase letters followed by an a
  • ?: Matches the element before it zero or one time, so [a-z]?a matches an a with at most one lowercase letter before it
  • *: Matches the element before it zero or more times, so [a-z]*a matches an a with any number of lowercase letters before it
  • $: Indicates the end of the line
  • ^: Indicates the start of the line
  • |: Or operator. For example, col(o|ou)r will match both the American and British spellings of the word color.
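
The modifiers are easier to tell apart when they are run side by side. The following is a minimal sketch on made-up inputs. Note that the or operator is wrapped in a non-capturing group (?:...) here, because re.findall would otherwise return only the contents of the group instead of the whole match.

```python
import re

numbers = "12 1234 1234567"
# {min,max}: how many times the preceding element may repeat
three_digits = re.findall(r"\d{3}", numbers)
three_to_five = re.findall(r"\d{3,5}", numbers)
print(three_digits)   # ['123', '123', '456']
print(three_to_five)  # ['1234', '12345']

# +, ? and * applied to the set [a-z], each followed by a literal "a"
one_or_more = re.findall(r"[a-z]+a", "banana")
zero_or_one = re.findall(r"[a-z]?a", "banana")
zero_or_more = re.findall(r"[a-z]*a", "banana")
print(one_or_more)   # ['banana']
print(zero_or_one)   # ['ba', 'na', 'na']
print(zero_or_more)  # ['banana']

# ^ and $ anchor the match to the start and the end of the text
first_word = re.findall(r"^\w+", "hello regex world")
last_word = re.findall(r"\w+$", "hello regex world")
print(first_word, last_word)  # ['hello'] ['world']

# | inside a non-capturing group: either spelling matches
spellings = re.findall(r"col(?:o|ou)r", "color and colour")
print(spellings)  # ['color', 'colour']
```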

These are the most common ones and will come in handy in this post and for most use cases. There are many more character classes, though, so it is worth keeping a regex cheat sheet at hand.

Apart from knowing these characters, it also makes sense to understand some basics of how a regex engine works. It will save you an immense amount of time spent in guesswork in case of unexpected results from complex regexes.

How do Regular Expressions Work

A regex engine can be either text-directed or regex-directed, with the latter being the more popular. Most likely, the tools you use rely on a regex-directed engine. Using Python, we can run a simple test to see which type we are using:

import re
# both alternatives can match at the start of the text
pattern = r"regex|regex not"
output = re.findall(pattern, "regex not")
print(output)

Output:

['regex']

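With a regex-directed engine, the output is regex; a text-directed engine would return the longer match, regex not. Why does this matter? Only regex-directed engines implement certain important features, such as lazy quantifiers and backreferences.

It is also important to know that this engine is eager: it scans the text from left to right and, at each position, tries the alternatives in the order they are written, returning the first one that matches (which is how the above example works). The next example makes this even clearer.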

import re
pattern = r"dragon|fly|ing"
output = re.findall(pattern, "The dragonfly became friends with the flying dragon")
print(output)

Output:

['dragon', 'fly', 'fly', 'ing', 'dragon']

The matching is clearly from left to right. Knowing these basics will definitely benefit you in the long run.
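
Here is a short sketch of what this behaviour means in practice. It shows that the order of alternatives changes the result, and it gives one small example each of the lazy quantifiers and backreferences mentioned earlier. The sample strings are made up for illustration.

```python
import re

text = "regex not"
# the engine takes the first alternative that matches, not the longest one
short_first = re.findall(r"regex|regex not", text)
long_first = re.findall(r"regex not|regex", text)
print(short_first)  # ['regex']
print(long_first)   # ['regex not']

tags = "<a><b>"
# greedy: .+ grabs as much as it can before the closing ">"
greedy = re.findall(r"<.+>", tags)
# lazy: .+? stops at the first ">" it can reach
lazy = re.findall(r"<.+?>", tags)
print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']

# backreference: \1 must repeat exactly what the first group captured
doubled = re.search(r"\b(\w+) \1\b", "it was the the best of times")
print(doubled.group())  # the the
```

A practical rule that follows from the first pair: when one alternative is a prefix of another, put the longer one first.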

Now that we know these basics, we can start applying them. How can regex be used? Among the several uses are:

  • Parsing input like text, logs, web data, etc.
  • Input validation
  • Testing output results
  • Searching text
  • Data restructuring
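
As a small taste of the last item, data restructuring, here is a hedged sketch that rewrites dates with re.sub, a function of the re module that replaces every match. The log line and the date formats are assumptions chosen for illustration.

```python
import re

# hypothetical log line with dates written as year-month-day
log_line = "Order shipped on 2021-03-05 and delivered on 2021-03-09."

# three capturing groups: year, month, day
iso_date = r"(\d{4})-(\d{2})-(\d{2})"

# \3, \2 and \1 refer back to the captured day, month and year
restructured = re.sub(iso_date, r"\3/\2/\1", log_line)
print(restructured)
# Order shipped on 05/03/2021 and delivered on 09/03/2021.
```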

However, it is always easier to learn from examples. Now that we understand how to use regex and all its characters, let’s see three concrete examples of how you can use it yourself.

Example 1. Validating email entries

Almost all regex tutorials include this example by default. It’s like the “Hello World” of regexes, so I’ll include it here too. Given an input, we can validate the email address format. Here is a simple example you can follow.

Here are some steps to take. Our first assumption is that (most of the time) valid email addresses look like:

someone@mailservice.domain

The user part usually consists of letters or numbers. Occasionally, there can also be special characters, but let’s start with a simple example.

The characters are from a to z, and capitalization is allowed, so A to Z, as well as numbers 0 to 9. The regular expression for this group will look like this:

[a-zA-Z0-9]

Similarly, the mail service is generally alphabetic, like Gmail, gmx, Hotmail, etc., and is preceded by an @ sign. Hence:

@[a-zA-Z]

The most popular domain endings include com, net, edu, and org, and they follow a period character. The period has to be escaped, because an unescaped period matches any character.

\.(com|net|edu|org)

If we bring this all together, we have the regex:

[a-zA-Z0-9]+@[a-zA-Z]+\.(com|net|org|edu)

Now let us use it in a Python script that takes an email address as input and validates whether it matches the format requirements.

# import libraries
import re
'''
In this case, we consider the general email pattern:
someone@mailservice.domain
'''
# define the valid email input pattern (the period is escaped to match a literal ".")
pattern_email = r"[a-zA-Z0-9]+@[a-zA-Z]+\.(com|net|org|edu)"
# read the email address from the user
user_input = input()
# fullmatch requires the whole input to fit the pattern, not just a part of it
if re.fullmatch(pattern_email, user_input):
    print(f"{user_input} is a valid email.")
else:
    print(f"{user_input} is invalid.")

Input:

test12@355.com

Output:

test12@355.com is invalid.

Input:

gmenariya@gmail.com

Output:

gmenariya@gmail.com is a valid email.
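
For checking several addresses at once, the same pattern can be wrapped in a small helper. This is only a sketch: the function name is_valid_email and the extra test addresses are made up, and the pattern is the simplified one from this example, so it rejects many addresses that are valid in the real world (for instance those with a dot in the user part).

```python
import re

# simplified pattern from this example; (?:...) groups without capturing
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.(?:com|net|org|edu)")


def is_valid_email(address: str) -> bool:
    """Return True if the whole address fits the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None


for address in [
    "gmenariya@gmail.com",   # fits the pattern
    "test12@355.com",        # mail service contains digits
    "someone@gmailXcom",     # no literal period before the ending
    "first.last@gmail.com",  # real-world style, but outside this simple pattern
]:
    print(address, is_valid_email(address))
```

Only the first address is reported as valid. The third one shows why the period must be escaped: with an unescaped period, the X would be accepted in its place.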

Example 2. Names and ages from text

Previously, we looked at how to get the age out of a string using a simplified example. Let’s now try putting the names and ages extracted from a text into a dictionary.

We already understand how the regex \b\d{2}\b can be used to extract a 2-digit number from the string. For this example, we consider 2 or 3 digits (since there is also a centenarian).

Every name in this text starts with a capital letter followed by at least three lowercase letters, which sets the names apart from short capitalised words such as Her and He. Therefore, this pattern should work:

\b[A-Z][a-z]{3,}\b

Pairing each name with its age shouldn’t be a problem here, because in this text every age is mentioned after the name it belongs to, so both lists come out in the same order.

# import libraries
import re
# triple quotes let the string span several lines
txt = """Sylvie is 20 years old, her father, Christoph, is 55.
Her grandfather Johannes was born at the end of WW-1 in 1918.
He was 100 years old when he died in 2018"""
# Since the ages are either two or three digits, the regex is \d{2,3}.
# It is bound on both sides with \b so that we do not get 2 or 3 digit groups
# from a larger number like 1918.
# For the names, it is relatively simple in the case of this text. All capitalised
# words that are at least 4 characters long should do the trick, also bound.
ages = re.findall(r'\b\d{2,3}\b', txt)
names = re.findall(r'\b[A-Z][a-z]{3,}\b', txt)
# pair each name with the age found at the same position in its list
print(dict(zip(names, ages)))

Output:

{'Sylvie': '20', 'Christoph': '55', 'Johannes': '100'}
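
To see why the word boundaries matter in this example, here is a small comparison on the same text with and without \b. The variable names are only for illustration.

```python
import re

txt = """Sylvie is 20 years old, her father, Christoph, is 55.
Her grandfather Johannes was born at the end of WW-1 in 1918.
He was 100 years old when he died in 2018"""

# without boundaries, pieces of the years are picked up as ages
ages_unbounded = re.findall(r"\d{2,3}", txt)
# with boundaries, only standalone 2 or 3 digit numbers remain
ages_bounded = re.findall(r"\b\d{2,3}\b", txt)

print(ages_unbounded)  # ['20', '55', '191', '100', '201']
print(ages_bounded)    # ['20', '55', '100']
```

With the unbounded list, zip would pair Johannes with 191, so the boundaries are what keep the names and ages aligned.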

In the future, we will have to use more than just regex as the input text becomes more complex, and we will need to extract more information intelligently from the text. In any case, you will definitely use regex to some extent for simplification, even in complex cases.

I hope you enjoyed learning from this article. As always, there are a lot more intricacies in the world of regular expressions than could be explained within the scope of this post. However, it should get you started really well, and soon you will be on your way to being a wizard!
