Regex Tutorial with Examples

This tutorial covers the core concepts of regular expressions (regex) with hands-on examples. It also shows how to use regex in R and Python.

Introduction

Regex is short for 'regular expression'. It is mainly used to extract substrings from a string by searching for a specific pattern. The search pattern is defined by the regular expression.

The search pattern can be a single letter, a fixed string or a complex pattern made up of letters, digits and punctuation.

Regular expressions can be used to search and replace text.


Uses of Regular expression

There are several real-world use cases for regular expressions. Some of them are as follows -

Fetch email addresses mentioned in a long paragraph
Validate a 10-digit phone number, Social Security Number or email address
Extract text from HTML or XML code
Rename multiple files in a single run
Remove specified punctuation from text
Web scraping : Search all web pages for content that contains a specific string
Replace a complex pattern with a blank or a specific character

Let's start with the basics

1. Anchor and Word Boundaries

Symbol	Description
^	Beginning of line
$	End of line
\b	Word boundary (placed on both sides of a pattern to match a whole word)

Examples

1. ^abc matches abc at the beginning of the text 'abcd'

2. ^the matches the at the start of the text 'the beginning'

3. done$ matches done at the end of the text 'I am done'

4. \ban\b matches the whole word an in the text 'Elephant an animal'
\ban\b does not match the an inside Elephant or animal because it only matches whole words.
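The four anchor examples above can be reproduced with Python's built-in re module. This is a minimal sketch: the variable names are illustrative, and re.search and re.findall are used only as one convenient way to try the patterns.

```python
import re

# ^ anchors the pattern to the start of the string.
starts_abc = re.search(r"^abc", "abcd")
starts_the = re.search(r"^the", "the beginning")

# $ anchors the pattern to the end of the string.
ends_done = re.search(r"done$", "I am done")

# \b marks a word boundary, so only the standalone word "an" is found,
# not the "an" inside "Elephant" or "animal".
whole_words = re.findall(r"\ban\b", "Elephant an animal")

print(starts_abc.group())  # abc
print(starts_the.group())  # the
print(ends_done.group())   # done
print(whole_words)         # ['an']
```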

2. OR Condition

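An OR condition can be defined with | inside a group ( ), or with a character class [ ] when the alternatives are single characters. See the examples below.

1. the(m|n) matches them or then in the text 'them then there theme' (including the them inside theme)

2. the[mn] is equivalent to the(m|n). Inside [ ] the | symbol is a literal character, so the[m|n] would also match the| and should be avoided.

3. \bthe[mn]\b matches only the whole words them or then in the text 'them then there theme'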
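A short Python sketch of the OR condition follows. It also shows why | should not be placed inside square brackets: there it is treated as a literal character. The sample text 'the| them' is made up for illustration.

```python
import re

text = "them then there theme"

# A group with | and a character class select the same alternatives here.
# finditer is used because findall would return only the group contents.
with_group = [m.group() for m in re.finditer(r"the(m|n)", text)]
with_class = re.findall(r"the[mn]", text)

# Word boundaries exclude the "them" inside "theme".
whole_words = re.findall(r"\bthe[mn]\b", text)

# Inside [ ], | is a literal character, so the[m|n] also matches "the|".
pipe_match = re.findall(r"the[m|n]", "the| them")

print(with_group)   # ['them', 'then', 'them']
print(with_class)   # ['them', 'then', 'them']
print(whole_words)  # ['them', 'then']
print(pipe_match)   # ['the|', 'them']
```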

3. Case Insensitive

The search patterns in all of the above examples are case-sensitive. To make a pattern case-insensitive, prefix it with the inline flag (?i). Most regex engines support it, though not all.

1. (?i)abc matches both abc and ABC in the text 'abc ABC'

2. (?i)a[bd]a performs a case-insensitive match of a followed by either b or d and then a. In the text 'abc ABA Ada' it matches ABA and Ada.
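In Python the same effect can be obtained either with the inline (?i) flag or by passing re.IGNORECASE. The sketch below tries both forms on the sample texts from the examples above.

```python
import re

# Inline flag at the start of the pattern.
inline_flag = re.findall(r"(?i)abc", "abc ABC")

# Equivalent: pass the flag as an argument instead of writing (?i).
flag_argument = re.findall(r"a[bd]a", "abc ABA Ada", flags=re.IGNORECASE)

print(inline_flag)    # ['abc', 'ABC']
print(flag_argument)  # ['ABA', 'Ada']
```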

4. Quantifiers

Quantifiers specify how many times a regex element may occur.

Expression	Description
*	Item occurs zero or more times
+	Item occurs one or more times
?	Item occurs zero or one time
{A}	Item occurs exactly A times
{A,}	Item occurs A or more times
{A,B}	Item occurs between A and B times
.	Any single character (by default, except a line break)
.*	Matches zero or more of any character

1. def* matches de followed by f zero or more times. Example - de def deff defff

2. def+ matches de followed by f at least one time. Example - def deff defff

3. \bdef?\b matches the whole word de followed by f zero or one time. Example - de def

4. \bdef{2}\b matches the whole word de followed by f exactly two times. Example - deff

5. \bdef{2,}\b matches the whole word de followed by f two or more times. Example - deff defff

6. \bdef{3,4}\b matches the whole word de followed by f either 3 or 4 times. Example - defff deffff

7. a.* matches the first a and all characters after it up to the end of the line
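To see how each quantifier behaves, the sketch below runs the patterns from this section against one sample string in Python. Word boundaries are added to every pattern so that the results are whole words; the sample strings are illustrative.

```python
import re

text = "de def deff defff deffff"

zero_or_more = re.findall(r"\bdef*\b", text)
one_or_more = re.findall(r"\bdef+\b", text)
zero_or_one = re.findall(r"\bdef?\b", text)
exactly_two = re.findall(r"\bdef{2}\b", text)
two_or_more = re.findall(r"\bdef{2,}\b", text)
three_to_four = re.findall(r"\bdef{3,4}\b", text)

# a.* starts at the first "a" and runs to the end of the line.
rest_of_line = re.search(r"a.*", "xyz abc def").group()

print(zero_or_more)   # ['de', 'def', 'deff', 'defff', 'deffff']
print(one_or_more)    # ['def', 'deff', 'defff', 'deffff']
print(zero_or_one)    # ['de', 'def']
print(exactly_two)    # ['deff']
print(two_or_more)    # ['deff', 'defff', 'deffff']
print(three_to_four)  # ['defff', 'deffff']
print(rest_of_line)   # abc def
```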

5. Create Grouping

By placing part of a regular expression inside ( ), you create a group. A group lets you apply an OR condition to a portion of the regex or apply a quantifier to the entire group.

It also helps to extract a portion of the matched text.

ab(cd|de)* matches ab followed by either cd or de, zero or more times. Example - ab abcd abcdde
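Both uses of a group can be sketched in Python: applying a quantifier to the whole group, and pulling out the part of a match that a group captured. The 'user id=42' text and the id= pattern are hypothetical examples, not taken from the tutorial.

```python
import re

# Quantifier applied to a whole group: ab, then cd or de, repeated.
pattern = re.compile(r"ab(cd|de)*")
candidates = ["ab", "abcd", "abcdde", "abce"]
# fullmatch succeeds only if the entire string fits the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

# Extracting a portion of the match: group(1) is the text captured
# by the first pair of parentheses.
match = re.search(r"id=([0-9]+)", "user id=42")
captured = match.group(1)

print(results)   # {'ab': True, 'abcd': True, 'abcdde': True, 'abce': False}
print(captured)  # 42
```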

6. Back Reference

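A back reference such as \1 matches the same text that was captured by the first group. (name)\1 matches 'namename', that is, the text 'name' followed by a repeat of it.

Replace (Substitution) using Back-reference

(ab|cd)e(fg|hi) matches either ab or cd, followed by e, followed by either fg or hi
Enter \1\2 as the substitution and each match is replaced by the text of the first and second groups. For example, 'abefg' becomes 'abfg'.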
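The following Python sketch shows a back reference inside a pattern and inside a replacement string. The repeated-word pattern and the sample sentences are illustrative additions.

```python
import re

# \1 must repeat exactly what group 1 captured.
doubled = re.search(r"(name)\1", "my namename tag").group()

# A more general use: find a word that is immediately repeated.
repeated = re.search(r"\b([a-z]+) \1\b", "this is is a test").group(1)

# In a replacement, \1 and \2 insert the captured groups,
# so the "e" between them is dropped.
substituted = re.sub(r"(ab|cd)e(fg|hi)", r"\1\2", "abefg cdehi")

print(doubled)      # namename
print(repeated)     # is
print(substituted)  # abfg cdhi
```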

7. Lazy Quantifier

A lazy (or non-greedy) quantifier matches a regex element as few times as possible, whereas a greedy quantifier matches it as many times as possible.

You can convert a greedy quantifier into a lazy quantifier by simply adding a ? after it.

<.*?> matches the shortest text that starts with < and ends with >. In the text '<b>bold</b>' it matches <b> and </b> separately, whereas the greedy <.*> matches the whole text.
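The difference between greedy and lazy matching is easiest to see side by side. A minimal Python sketch, using a made-up HTML fragment as the sample text:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, up to the last ">".
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first ">" it can reach.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```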

8. How to match a literal dot or asterisk

Put a backslash \ before a dot or an asterisk to escape it. In other words, the backslash makes the regex treat the character literally instead of as a special symbol.

abc\* matches the literal text abc*, whereas abc* matches ab followed by zero or more c (for example abcc)
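The effect of escaping can be checked in Python by running the escaped and unescaped patterns on the same text. The sketch also uses re.escape, a standard-library helper that adds the backslashes for you; the sample strings are illustrative.

```python
import re

# Escaped asterisk: matches the literal text "abc*".
literal_star = re.findall(r"abc\*", "abc* abcc")
# Unescaped asterisk: "ab" followed by zero or more "c".
quantified = re.findall(r"abc*", "abc* abcc")

# Escaped dot matches only a real dot; an unescaped dot matches any character.
literal_dot = re.findall(r"3\.6", "3.6 306")
any_char = re.findall(r"3.6", "3.6 306")

# re.escape builds the escaped pattern from plain text.
escaped = re.escape("abc*")

print(literal_star)  # ['abc*']
print(quantified)    # ['abc', 'abcc']
print(literal_dot)   # ['3.6']
print(any_char)      # ['3.6', '306']
print(escaped)       # abc\*
```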

9. POSIX Regular Expressions

POSIX character classes are written inside square brackets. Like the character ranges used so far, they match letters, digits, punctuation and more.

POSIX	Description	ASCII
[:digit:]	Digits	[0-9]
[:lower:]	Lowercase letters	[a-z]
[:upper:]	Uppercase letters	[A-Z]
[:alpha:]	Lower and uppercase letters	[a-zA-Z]
[:alnum:]	Lower and uppercase letters and digits	[a-zA-Z0-9]
[:blank:]	Space and tab	[ \t]
[:space:]	All whitespace characters, including line breaks	[ \t\r\n\v\f]
[:punct:]	Punctuation	[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]

Select strings that have a letter followed by one or more digits

[[:alpha:]][[:digit:]]+

[[:alpha:]] means any letter
[[:digit:]] means any digit
+ means the previous item occurs one or more times
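Python's built-in re module does not support POSIX classes such as [[:alpha:]], so a Python version of this pattern has to use the ASCII equivalents from the table. A minimal sketch under that assumption, reusing the sample text from the R section of this tutorial:

```python
import re

text = "sample text B2 testing B52"

# [a-zA-Z] stands in for [[:alpha:]] and [0-9] for [[:digit:]].
matches = re.findall(r"[a-zA-Z][0-9]+", text)

print(matches)  # ['B2', 'B52']
```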

How to use regex with R and Python

R

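1. grep(pattern, x)
Searches for a pattern in each element of a vector x and returns the indices of the matching elements

2. gsub(pattern, replacement, x)
Replaces every match of a pattern in each element of a vector x

# Remove every letter that is followed by one or more digits
x = "sample text B2 testing B52"
gsub('[[:alpha:]][[:digit:]]+', '', x)
# Returns "sample text  testing "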

Python

The built-in re module provides regular expressions in Python.

1. re.search(pattern, x)
Searches the string x for the first match of a pattern and returns a match object, or None if there is no match

2. re.sub(pattern, replacement, x)
Replaces every match of a pattern in the string x

import re
# Remove a run of letters that is followed by digits or dots
x = 'Welcome to Python3.6'
re.sub(r'[a-zA-Z]+[0-9.]+', '', x)
# Returns 'Welcome to '
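Since re.search returns a match object rather than the matched text, a typical call checks the result before using it. The sketch below is one way to do that; the pattern [0-9.]+ and the second sample string are illustrative.

```python
import re

x = "Welcome to Python3.6"

# re.search returns a match object for the first match, or None.
match = re.search(r"[0-9.]+", x)
if match:
    found = match.group()    # the matched text
    position = match.span()  # (start, end) indices in x

# No digits here, so the result is None.
no_match = re.search(r"[0-9]+", "no digits")

print(found)     # 3.6
print(position)  # (17, 20)
print(no_match)  # None
```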

Exercises : Regular Expression

1. Replace the abbreviation for thousand (K) with 000

x = "K 25K 2K"
Desired Output : K 25000 2000

Solution

gsub('([0-9])K', '\\1000',x)

The back reference is written as \\1 because the backslash is itself an escape character inside an R string, so it has to be doubled.
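For comparison, here is a sketch of the same exercise in Python. One difference is worth noting: in a Python replacement string, \1 followed directly by digits is not read as group 1 plus the digits, so the unambiguous form \g<1> is used instead.

```python
import re

x = "K 25K 2K"

# \g<1> refers to group 1 without swallowing the zeros that follow it.
result = re.sub(r"([0-9])K", r"\g<1>000", x)

print(result)  # K 25000 2000
```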

2. Remove extra characters

x = "var1_avg_a1 var1_a_avg_7"
Desired Output : var1 var1_a

Solution

gsub('_avg_.*?[0-9]', '',x)

The ? after .* makes the quantifier non-greedy (lazy), so each match stops at the first digit that follows _avg_.
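The same solution can be sketched in Python, together with the greedy variant to show why the ? is needed here.

```python
import re

x = "var1_avg_a1 var1_a_avg_7"

# Lazy: each match ends at the first digit after "_avg_".
lazy_result = re.sub(r"_avg_.*?[0-9]", "", x)

# Greedy: one match runs from the first "_avg_" to the last digit.
greedy_result = re.sub(r"_avg_.*[0-9]", "", x)

print(lazy_result)    # var1 var1_a
print(greedy_result)  # var1
```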
