Regular Expressions
Chapter 19
In the upcoming chapters, we'll delve into tools for manipulating text, a crucial element in Unix-like systems like Linux. However, before we can fully grasp the capabilities of these tools, we need to explore a technology closely linked with their most advanced uses: regular expressions.
Throughout our exploration of the command line's numerous features, we've encountered some complex shell functionalities—shell expansion, quoting, keyboard shortcuts, command history, and even the vi editor. Regular expressions follow this trend of complexity and might arguably be the most intricate feature among them. Yet, this complexity doesn't diminish their worth; investing time to learn about them is truly rewarding. A comprehensive understanding will empower us to achieve remarkable tasks, even if their full potential isn't immediately evident.
What Are Regular Expressions?
Put simply, regular expressions are symbolic notations used to identify patterns in text. They resemble the shell's wildcard method of matching file and pathnames, but on a much larger scale. Supported by many command line tools and nearly all programming languages, regular expressions make text manipulation problems easier to solve. However, not all regular expressions are the same; they vary slightly from tool to tool and from language to language. In our discussion, we'll limit ourselves to regular expressions as described in the POSIX standard, which covers most command line tools. By contrast, many programming languages, most notably Perl, use slightly larger and richer sets of notations.
grep
Our primary tool for handling regular expressions will be our familiar companion, grep. The term "grep" originates from "global regular expression print," indicating its association with regular expressions. Fundamentally, grep scans text files to find instances of a specified regular expression and displays any line that matches to standard output.
So far, we have used grep with fixed strings, like so:
[me@linuxmachine ~]$ ls /usr/bin | grep zip
This displays all files within the /usr/bin directory that include the substring "zip" in their names.
The grep program follows this format for accepting options and arguments:
grep [options] regex [file...]
where regex represents a regular expression.
Below are the frequently utilized grep options:
-i
Ignore case. Do not distinguish between upper and lower case characters. May also be specified --ignore-case.
-v
Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match. May also be specified --invert-match.
-c
Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves. May also be specified --count.
-l
Print the name of each file that contains a match instead of the lines themselves. May also be specified --files-with-matches.
-L
Like the -l option, but print only the names of files that do not contain matches. May also be specified --files-without-match.
-n
Prefix each matching line with the number of the line within the file. May also be specified --line-number.
-h
For multi-file searches, suppress the output of filenames. May also be specified --no-filename.
In order to more fully explore grep, let’s create some text files to search:
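One way to build such files is sketched below. It assumes the four usual program directories and a dirlist-*.txt naming pattern; only dirlist-bin.txt is named in the text that follows, so the other filenames are assumptions. Directories that do not exist on a given system are skipped.

```shell
# Save a listing of each program directory into its own text file
# in the current directory.
for dir in /bin /usr/bin /sbin /usr/sbin; do
    # Skip directories that are missing on this system.
    [ -d "$dir" ] || continue
    # Turn the path into a filename: /usr/bin -> dirlist-usr-bin.txt
    name="dirlist$(printf '%s' "$dir" | tr '/' '-').txt"
    ls "$dir" > "$name"
done

# Confirm which list files were created.
ls dirlist*.txt
```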
We can perform a simple search of our list of files like this:
In this instance, grep scans through all the specified files to find the string "bzip," discovering two matches within the file dirlist-bin.txt. If our focus were solely on obtaining the list of files containing matches rather than the matches themselves, we could indicate the -l option:
Conversely, if we wanted to see only a list of the files that contain no matches, we could use the -L option:
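The three searches described above can be tried in isolation with a small sketch. The two list files here are made-up stand-ins created in a scratch directory, so the output will differ from a real system's listings.

```shell
# Work in a scratch directory with two small, made-up list files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' bzip2 bzip2recover gzip > dirlist-bin.txt
printf '%s\n' chroot fdisk > dirlist-sbin.txt

# Plain search: print each matching line, prefixed with its filename.
grep bzip dirlist*.txt

# -l: print only the names of files that contain a match.
with_match=$(grep -l bzip dirlist*.txt)
echo "$with_match"       # dirlist-bin.txt

# -L: print only the names of files that contain no match.
without_match=$(grep -L bzip dirlist*.txt)
echo "$without_match"    # dirlist-sbin.txt
```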
Metacharacters And Literals
Although not immediately obvious, our grep searches have consistently utilized regular expressions, albeit relatively straightforward ones. The regular expression "bzip" is interpreted to require a match where the line in the file contains a minimum of four characters, and within that line, the characters "b," "z," "i," and "p" must appear consecutively, without any other intervening characters. These characters in the "bzip" string are considered literal characters, matching themselves. Beyond literals, regular expressions can encompass metacharacters used to define more intricate matches. Metacharacters in regular expressions include:
^ $ . [ ] { } - ? * + ( ) | \
All remaining characters are treated as literals, except for the backslash character, which is used in specific instances to create meta-sequences and to escape metacharacters, allowing them to be treated as literals rather than interpreted as metacharacters.
Note
Numerous regular expression metacharacters also hold significance within the shell's expansion process. When we use regular expressions with metacharacters in command-line arguments, it's crucial to enclose them in quotes. This prevents the shell from attempting to interpret or expand these characters.
The Any Character
Let's start by examining the dot or period metacharacter, which serves to match any character. When incorporated into a regular expression, it effectively matches any character in that specific position. Here's an illustrative example:
We conducted a search across our files to find any line matching the regular expression ".zip". There are a couple of intriguing observations regarding the outcomes. Firstly, note that the zip program wasn't found. This occurred because the inclusion of the dot metacharacter in our regular expression extended the required match to four characters. As the name "zip" only consists of three characters, it didn't meet the criterion for a match. Furthermore, had any files in our lists contained the file extension .zip, they would have been matched. This is because the period character in the file extension functions as "any character" within the regular expression context.
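The behavior can be reproduced with a short sketch. The names below are a hypothetical sample rather than a real directory listing.

```shell
# Hypothetical sample of names, one per line.
names='bzip2
gzip
unzip
zip
zipinfo
archive.zip'

# '.zip' requires some character immediately before "zip".
matches=$(printf '%s\n' "$names" | grep '.zip')
printf '%s\n' "$matches"
# Prints bzip2, gzip, unzip and archive.zip; "zip" and "zipinfo"
# have no character in front of "zip", so they are not matched.
```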
Anchors
In regular expressions, the caret (^) and dollar sign ($) characters serve as anchors. They dictate that the match must occur exclusively at the start (^) or the end ($) of a line:
In this instance, we scanned through the list of files to locate the string "zip" positioned at the line's start, end, and in cases where it appears both at the beginning and end of a line, essentially when it stands alone on the line. It's important to note that the regular expression ^$—representing a beginning and an end with no content in between—will match blank lines.
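A minimal sketch of the three anchored searches, run against a hypothetical sample of names instead of the real list files:

```shell
# Hypothetical sample of names, one per line.
names='zip
zipinfo
gzip
unzip
bzip2'

at_start=$(printf '%s\n' "$names" | grep '^zip')    # zip, zipinfo
at_end=$(printf '%s\n' "$names" | grep 'zip$')      # zip, gzip, unzip
alone=$(printf '%s\n' "$names" | grep '^zip$')      # zip
printf '%s\n' "$at_start" "$at_end" "$alone"

# ^$ matches blank lines: count them in a four-line input.
blank_lines=$(printf 'a\n\nb\n\n' | grep -c '^$')
echo "$blank_lines"                                 # 2
```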
A Crossword Puzzle Helper
Even with our current limited grasp of regular expressions, we can achieve something practical.
Consider a scenario where my wife seeks help with a crossword puzzle, asking something like, "What’s a five-letter word with 'j' as the third letter and 'r' as the last letter, signifying...?" This prompted an interesting thought. Did you know that your Linux system holds a dictionary? Indeed, it does. You can explore the /usr/share/dict directory, where you might discover one or more dictionary files. These files consist of extensive lists of words, each on a separate line and organized alphabetically. On my system, the words file comprises just over 98,500 words. To assist in finding potential answers to the crossword puzzle query mentioned earlier, we could execute the following:
By using this regular expression, we can identify all the words within our dictionary file that are five letters long, containing 'j' as the third character and 'r' as the final character.
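A sketch of such a search follows. Because dictionary files differ between systems, it runs against a small made-up word list; the commented command shows how the same pattern might be applied to a real dictionary, assuming /usr/share/dict/words exists.

```shell
# Made-up word list standing in for a dictionary file.
words='Major
major
majors
rajah
juror'

# Five characters in total: any two, then "j", any one, then "r".
# -i ignores case, so a capitalized entry matches too.
answers=$(printf '%s\n' "$words" | grep -i '^..j.r$')
printf '%s\n' "$answers"    # Major, major

# Against a real dictionary (path assumed to exist):
#   grep -i '^..j.r$' /usr/share/dict/words
```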
Bracket Expressions And Character Classes
Apart from matching any character at a specific position within our regular expression, we can also match a single character from a defined set by utilizing bracket expressions. These expressions allow us to designate a collection of characters, even those that might typically be interpreted as metacharacters, to be considered for matching. In this instance, let's explore the use of a two-character set:
With the bracket expression [bg]zip, we match any line that contains the string “bzip” or “gzip”.
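As a small illustration, the set [bg] can be tried against a hypothetical sample of names:

```shell
# Hypothetical sample of names, one per line.
names='bzip2
gzip
unzip
zip'

# [bg] matches exactly one character, which must be "b" or "g".
matches=$(printf '%s\n' "$names" | grep '[bg]zip')
printf '%s\n' "$matches"    # bzip2, gzip
```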
Within a set, any number of characters can be included, and metacharacters cease to hold their typical significance when positioned within brackets. However, there are two instances where metacharacters retain distinct meanings within bracket expressions. The caret (^) signifies negation, while the dash (-) is employed to denote a character range.
Negation
When a caret (^) appears as the initial character in a bracket expression, the subsequent characters are interpreted as a set of characters that should not appear at the specified character position. To demonstrate, let's modify our previous example accordingly:
When employing negation, we generate a list of files containing the string "zip," preceded by any character except "b" or "g." Observe that the file named zip wasn't located. Despite the negated character set, a character is still required at the specified position, but it mustn't belong to the excluded set.
It's crucial to note that the caret character only triggers negation when it's the initial character within a bracket expression. Otherwise, it loses its special functionality and becomes an ordinary character within the set.
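Both points can be checked with a short sketch using hypothetical sample data: a leading caret negates the set, while a caret elsewhere in the set is an ordinary character.

```shell
# Hypothetical sample of names, one per line.
names='bzip2
gzip
unzip
funzip
zip'

# Leading caret: any character except "b" or "g" must precede "zip".
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')
printf '%s\n' "$negated"        # unzip, funzip

# Caret not in first position: it is just a member of the set.
literal_caret=$(printf '%s\n' 'a^zip' 'bzip2' 'unzip' | grep '[b^]zip')
printf '%s\n' "$literal_caret"  # a^zip, bzip2
```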
Traditional Character Ranges
To create a regular expression that identifies every file in our lists starting with an uppercase letter, we could form the following expression:
It's simply a matter of putting all 26 uppercase letters in a bracket expression. However, typing all of them by hand is tedious. Thankfully, there's an alternative method:
Through the utilization of a three-character range, we can condense the representation of the 26 letters. This method allows for the expression of any character range, even multiple ranges, as demonstrated by this expression designed to match all filenames commencing with letters and numbers:
Within character ranges, the dash character holds a special function. But how do we incorporate an actual dash character in a bracket expression? The solution is to position it as the initial character within the expression. Take a look at these two examples for clarification:
This pattern will identify any filename that includes an uppercase letter. Whereas:
will identify any filename that contains either a dash, an uppercase "A," or an uppercase "Z".
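The range forms discussed above can be sketched as follows. The names are hypothetical, and LC_ALL=C is set for each grep so that the ranges follow ASCII order regardless of the system locale; that setting is an addition made here for reproducibility.

```shell
# Hypothetical sample of names, one per line.
names='MAKEDEV
Xorg
gzip
7zr
_helper
apt-get'

# A range: names that start with an uppercase letter.
upper=$(printf '%s\n' "$names" | LC_ALL=C grep '^[A-Z]')
printf '%s\n' "$upper"          # MAKEDEV, Xorg

# Several ranges in one set: names that start with a letter or digit.
alnum=$(printf '%s\n' "$names" | LC_ALL=C grep '^[A-Za-z0-9]')
printf '%s\n' "$alnum"          # everything except _helper

# Dash placed first is literal: a dash, an "A", or a "Z".
dash_a_z=$(printf '%s\n' "$names" | LC_ALL=C grep '[-AZ]')
printf '%s\n' "$dash_a_z"       # MAKEDEV, apt-get
```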
POSIX Character Classes
Traditional character ranges offer a straightforward and efficient method for swiftly defining sets of characters. However, they're not universally foolproof. Although we haven't faced any issues yet with our grep usage, potential problems might arise when employing other programs.
In Chapter 4, we explored how wildcards facilitate pathname expansion. During that discussion, we highlighted that character ranges could be employed similarly to their usage in regular expressions. Yet, herein lies the issue:
(Results may vary depending on the Linux distribution, potentially yielding a different file list, or even an empty one. The example provided is from Ubuntu.) This command generates the anticipated outcome: a compilation solely comprising files whose names commence with an uppercase letter. However:
executing this command yields a notably distinct outcome (only a partial excerpt of the results is presented). Why the discrepancy? It's a complex narrative, but here's the abridged version:
During Unix's initial development, the system recognized solely ASCII characters, a characteristic that echoes through this feature. In the ASCII character set, the first 32 characters (ranging from 0 to 31) encompass control codes like tabs, backspaces, and carriage returns. The subsequent 32 characters (from 32 to 63) comprise printable characters, including most punctuation symbols and the numerical digits zero through nine. The following 32 characters (64 to 95) encompass the uppercase letters along with a few additional punctuation symbols. Lastly, the remaining 32 characters (96 to 127) encompass the lowercase letters, more punctuation symbols, and the delete control code. Consequently, systems using ASCII followed a collation order resembling this sequence:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
This deviates from the conventional dictionary order, which typically appears as:
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
As Unix's popularity transcended the United States, the demand emerged for supporting characters beyond those found in U.S. English. To accommodate various languages, the ASCII table expanded to utilize a complete eight bits, introducing characters numbered 128 to 255. This extension facilitated language support for numerous global languages. In tandem with this capability, the POSIX standards introduced the concept of a locale, permitting the selection of a specific character set tailored to a particular location's needs. To ascertain the language setting of our system, we can use this command:
Under this configuration, applications compliant with POSIX standards will adopt a dictionary-based collation order rather than adhering to the ASCII sequence. This elucidates the behavior observed in the aforementioned commands. When interpreted according to dictionary order, a character range [A-Z] incorporates all alphabetic characters except for the lowercase "a," thereby explaining our outcomes.
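The difference in collation order can be observed with sort. In this sketch only the C (ASCII) result is fixed; the second result depends on the locale configured on the system, so no particular output is claimed for it.

```shell
# Four letters in mixed case, one per line.
letters=$(printf '%s\n' b A a B)

# ASCII (C locale) order: all uppercase letters sort before lowercase.
ascii_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')
echo "$ascii_order"    # ABab

# Order under the current locale; the result varies by system.
echo "LANG=${LANG:-unset}"
locale_order=$(printf '%s\n' "$letters" | sort | tr -d '\n')
echo "$locale_order"
```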
To partially address this issue, the POSIX standard incorporates several character classes that offer helpful ranges of characters. These classes are delineated in the table provided below:
[:alnum:]
The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]
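The same as [:alnum:], with the addition of the underscore (_) character. Note that this class is an extension rather than one of the classes defined by POSIX, so some tools may not accept it.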
[:alpha:]
The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]
Includes the space and tab characters.
[:cntrl:]
The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]
The numerals zero through nine.
[:graph:]
The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]
The lowercase letters.
[:punct:]
The punctuation characters. In ASCII, these are: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:print:]
The printable characters. All the characters in [:graph:] plus the space character.
[:space:]
The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]
The uppercase letters.
[:xdigit:]
Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]
Despite the availability of character classes, there remains no straightforward method to articulate partial ranges, such as [A-M].
By employing character classes, we can reiterate our directory listing, resulting in an enhanced outcome:
Keep in mind, though, this isn't an instance of a regular expression; rather, it's the shell executing pathname expansion. We've demonstrated this here because POSIX character classes can be employed for both purposes.
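Both uses can be sketched side by side. The names and the scratch directory below are hypothetical; the first command is a regular expression handled by grep, and the second is pathname expansion performed by the shell.

```shell
# Hypothetical sample of names, one per line.
names='MAKEDEV
Xorg
gzip
7zr'

# Regular expression: lines that start with an uppercase letter.
upper_start=$(printf '%s\n' "$names" | grep '^[[:upper:]]')
printf '%s\n' "$upper_start"    # MAKEDEV, Xorg

# Pathname expansion: the same class used in a shell wildcard.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch MAKEDEV Xorg gzip 7zr
globbed=$(echo [[:upper:]]*)
echo "$globbed"                 # MAKEDEV Xorg
```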
Reverting To Traditional Collation Order
You have the option to switch your system back to the traditional (ASCII) collation order by modifying the value of the LANG environment variable. As demonstrated earlier, the LANG variable stores the language and character set used in your locale, established during the installation of your Linux system when selecting an installation language.
To view the locale settings, use the locale command:
To revert the locale to embrace traditional Unix behaviors, set the LANG variable to POSIX:
Keep in mind that this modification transitions the system to utilize U.S. English (specifically, ASCII) for its character set. Ensure this aligns with your intended configuration.
For a permanent alteration, include this line in your .bashrc file:
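A minimal sketch of the steps described in this section. The export line is the one that would go into .bashrc for a permanent change; the guard around locale is an addition for systems where that command is not installed.

```shell
# Show the current locale settings, if the locale command is available.
if command -v locale >/dev/null 2>&1; then
    locale
fi

# Switch this shell session (and its child processes) to POSIX.
export LANG=POSIX
echo "LANG is now $LANG"
```

If LC_ALL or one of the individual LC_* variables is set, it takes precedence over LANG, so the change may appear to have no effect until those are adjusted as well.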
POSIX Basic Vs. Extended Regular Expressions
Just when it seemed things couldn't become more complex, we uncover that POSIX divides regular expression implementations into two categories: basic regular expressions (BRE) and extended regular expressions (ERE). The functionalities covered thus far are supported by any POSIX-compliant application that implements BRE, including our grep program.
So, what distinguishes BRE from ERE? It boils down to metacharacters. In BRE, the recognized metacharacters are:
^ $ . [ ] *
All other characters are regarded as literals. On the other hand, ERE introduces the following metacharacters (along with their respective functions):
( ) { } ? + |
Here's the interesting part: in BRE, the characters "(", ")", "{", and "}" are treated as metacharacters only when preceded by a backslash, whereas in ERE, preceding any metacharacter with a backslash causes it to be treated as a literal. The peculiarities that arise will be covered in the discussions that follow.
As the upcoming features are part of ERE, we'll require a different grep. Historically, this was accomplished using the egrep program, but the GNU version of grep also supports extended regular expressions when the -E option is employed.
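The backslash reversal between the two flavors can be seen in a short sketch. It uses the brace metacharacters as the example and assumes a grep that accepts the -E option, as the GNU version does.

```shell
# BRE (plain grep): unescaped braces are literal text.
bre_literal=$(echo 'a{2}' | grep 'a{2}')
# BRE: escaped braces become a repetition count (exactly two "a").
bre_repeat=$(echo 'aa' | grep 'a\{2\}')

# ERE (grep -E): unescaped braces are a repetition count.
ere_repeat=$(echo 'aa' | grep -E 'a{2}')
# ERE: escaped braces are literal text.
ere_literal=$(echo 'a{2}' | grep -E 'a\{2\}')

printf '%s\n' "$bre_literal" "$bre_repeat" "$ere_repeat" "$ere_literal"
```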
POSIX
During the 1980s, Unix experienced a surge in popularity as a commercial operating system. However, by 1988, the Unix landscape was in disarray. Numerous computer manufacturers had licensed the Unix source code from AT&T, the creators, and were distributing diverse versions of the operating system alongside their systems. In their pursuit of product differentiation, each manufacturer introduced proprietary alterations and enhancements, leading to a reduction in software compatibility. As customary in the realm of proprietary vendors, each sought to ensnare customers in a strategy known as "lock-in." This tumultuous period in Unix's history is now referred to as "the Balkanization."
In response to this fragmentation, the Institute of Electrical and Electronics Engineers (IEEE) stepped in. In the mid-1980s, they began developing a series of standards intended to define how Unix and Unix-like systems should behave. These standards, formally termed IEEE 1003, define the application programming interfaces (APIs), shell, and utilities expected on a standard Unix-like system. The name "POSIX," which stands for Portable Operating System Interface (with an "X" appended for extra snap), was proposed by none other than Richard Stallman (yes, the Richard Stallman) and was officially adopted by the IEEE.
Alternation
Let's delve into the initial extended regular expression feature known as "alternation." This feature enables matching from a set of expressions, similar to how a bracket expression permits matching a single character from a defined set of characters. Alternation extends this capability to allow matches from various strings or regular expressions.
To illustrate, we'll employ grep alongside echo. Initially, let's attempt a straightforward string match.
Here's a simple example: we're using a pipe to pass the output of echo into grep and observing the outcomes. When a match happens, it gets displayed, and when there's no match, there are no results shown.
Next, we'll introduce alternation, indicated by the vertical-bar metacharacter:
In this instance, the regular expression 'AAA|BBB' signifies "match either the string AAA or the string BBB." It's important to note that since this is an extended feature, we included the -E option with grep (although using the egrep program would achieve the same result), and we enclosed the regular expression in quotes to prevent the shell from interpreting the vertical-bar metacharacter as a pipe operator. It's worth noting that alternation isn't restricted to only two choices:
To integrate alternation with additional regular expression components, we utilize () to delineate the alternation:
The pattern matches filenames in our lists that start with "bz," "gz," or "zip." Had we omitted the parentheses, the meaning of the regular expression would change: it would match any filename that starts with "bz," or contains "gz," or contains "zip."
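The alternation cases above can be sketched as follows. The helper function try and the sample names are hypothetical, introduced only to make the results visible.

```shell
# try (hypothetical helper): report whether a string matches an ERE.
try() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match: $2"
    else
        echo "no match: $2"
    fi
}

try 'AAA|BBB' 'AAA'        # match: AAA
try 'AAA|BBB' 'BBB'        # match: BBB
try 'AAA|BBB|CCC' 'CCC'    # match: CCC
try 'AAA|BBB' 'CCC'        # no match: CCC

# Hypothetical sample of names, one per line.
names='bzip2
gzip
zip
unzip'

# With parentheses, the ^ anchor applies to every alternative.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')
printf '%s\n' "$grouped"      # bzip2, gzip, zip

# Without them, only "bz" is anchored; "gz" and "zip" match anywhere.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')
printf '%s\n' "$ungrouped"    # bzip2, gzip, zip, unzip
```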
Quantifiers
Extended regular expressions offer multiple methods to define the frequency of matching an element.
? - Match An Element Zero Or One Time
This quantifier essentially implies "Make the preceding element optional." Let's consider validating a phone number, deeming it valid if it adheres to either of these formats:
(nnn) nnn-nnnn
nnn nnn-nnnn
In order to create a regular expression for this, we can use:
^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$
Within this expression, we append question marks after the parentheses to signify that they should be matched zero or one time. Since parentheses are typically metacharacters in ERE, we prefix them with backslashes to ensure they are treated as literal characters.
Let’s try it:
In this example, the expression successfully matches both phone number formats but doesn't match one that contains non-numeric characters.
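A sketch of such a test, using made-up phone numbers. The helper function check is hypothetical, and the parentheses in the pattern are escaped with backslashes so that grep -E treats them as literal characters.

```shell
# Optional literal parentheses around the area code.
regex='^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'

# check (hypothetical helper): report whether a string matches $regex.
check() {
    if printf '%s\n' "$1" | grep -Eq "$regex"; then
        echo "valid: $1"
    else
        echo "invalid: $1"
    fi
}

check "(555) 123-4567"    # valid
check "555 123-4567"      # valid
check "AAA 123-4567"      # invalid
```

Because each parenthesis is optional on its own, this pattern also accepts an unbalanced form such as "(555 123-4567".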
* - Match An Element Zero Or More Times
Similar to the ? metacharacter, the * denotes an optional element; however, unlike ?, it allows the item to appear any number of times, not just once. Suppose we aim to identify a string as a sentence—beginning with an uppercase letter, followed by a sequence of upper and lowercase letters along with spaces, and concluding with a period. To match this basic definition of a sentence, a regular expression like the following could be utilized:
[[:upper:]][[:upper:][:lower:] ]*\.
Breaking down the expression: it comprises three components—a bracket expression incorporating the [:upper:] character class, another bracket expression encompassing both [:upper:] and [:lower:] character classes including a space, and an escaped period denoted by a backslash. The second element is trailed by a * metacharacter, allowing any number of upper and lowercase letters alongside spaces to follow the initial uppercase letter in our sentence and still qualify as a match.
The expression successfully matches the first two tests but fails to match the third one because it doesn't contain the necessary leading uppercase character or the trailing period.
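A sketch of three such tests follows. The test strings and the helper function check are hypothetical.

```shell
# Uppercase letter, then any run of letters and spaces, then a period.
regex='[[:upper:]][[:upper:][:lower:] ]*\.'

# check (hypothetical helper): report whether a string matches $regex.
check() {
    if printf '%s\n' "$1" | grep -Eq "$regex"; then
        echo "match: $1"
    else
        echo "no match: $1"
    fi
}

check "This works."      # match
check "This Works."      # match
check "this does not"    # no match
```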
+ - Match An Element One Or More Times
The + metacharacter functions similarly to *, but it necessitates at least one occurrence of the preceding element for a match. Here's a regular expression specifically designed to match lines containing groups of one or more alphabetic characters separated by single spaces:
^([[:alpha:]]+ ?)+$
This expression doesn't match the line "a b 9" because it contains a non-alphabetic character, nor does it match "abc  d" (with two spaces), because more than one space separates "c" and "d".
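The pattern can be exercised with a few hypothetical test lines; the helper function check is likewise an illustration only.

```shell
# One or more groups of letters, each optionally followed by one space.
regex='^([[:alpha:]]+ ?)+$'

# check (hypothetical helper): report whether a string matches $regex.
check() {
    if printf '%s\n' "$1" | grep -Eq "$regex"; then
        echo "match: $1"
    else
        echo "no match: $1"
    fi
}

check "This that"    # match
check "a b c"        # match
check "a b 9"        # no match: contains a digit
check "abc  d"       # no match: two spaces between "c" and "d"
```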
{ } - Match An Element A Specific Number Of Times
The { and } symbols serve as metacharacters to indicate the minimum and maximum required number of matches. These can be specified in four different ways:
{n}
Match the preceding element if it occurs exactly n times.
{n,m}
Match the preceding element if it occurs at least n times, but no more than m times.
{n,}
Match the preceding element if it occurs n or more times.
{,m}
Match the preceding element if it occurs no more than m times.
Returning to our prior example concerning phone numbers, we can utilize this repetition specification method to streamline our initial regular expression:
From:
^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$
To:
^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$
Let’s try it:
Our updated expression effectively validates numbers, accommodating both formats with or without parentheses, and filters out improperly formatted ones.
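A sketch of the shortened expression in use, with made-up phone numbers and a hypothetical helper function check:

```shell
# Same phone pattern, with {n} counts replacing the repeated brackets.
regex='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'

# check (hypothetical helper): report whether a string matches $regex.
check() {
    if printf '%s\n' "$1" | grep -Eq "$regex"; then
        echo "valid: $1"
    else
        echo "invalid: $1"
    fi
}

check "(555) 123-4567"    # valid
check "555 123-4567"      # valid
check "5555 123-4567"     # invalid: four digits in the first group
```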
Putting Regular Expressions To Work
Let's explore some of the commands we're familiar with and examine their application alongside regular expressions.
Validating A Phone List With grep
In our previous instance, we examined individual phone numbers to validate their format. However, a more practical scenario involves assessing a list of numbers. To create this list, we'll perform a sort of magical ritual using command-line instructions. This may seem like magic as we haven't covered most of the involved commands yet, but fear not! We'll explore these commands in future chapters. Here's the incantation:
Executing this command generates a file named phonelist.txt comprising ten phone numbers. Each repetition of the command appends another set of ten numbers to the list. Adjusting the value 10 at the start of the command allows for generating more or fewer phone numbers. Upon inspecting the file's contents, however, a problem becomes evident:
Several numbers within the list are incorrectly formatted, which suits our purpose as we'll employ grep to validate them.
A valuable validation technique involves scanning the file to detect invalid numbers and presenting the resulting list on the screen:
In this instance, we utilize the -v option to create a reverse match, displaying only the lines within the list that don't align with the specified expression. The expression itself encompasses anchor metacharacters at each end, guaranteeing that the number lacks additional characters at either end. Unlike our prior phone number example, this expression mandates the presence of parentheses in a valid number.
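The validation step can be sketched as follows. The list is written to a temporary file and its ten numbers are made up for illustration (two are deliberately malformed); it does not reproduce the randomly generated phonelist.txt.

```shell
# Made-up phone list in a temporary file; two entries are malformed.
phonelist=$(mktemp)
cat > "$phonelist" <<'EOF'
(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440
EOF

# -E: extended syntax; -v: print only lines that do NOT match.
# Parentheses are required here, and the anchors forbid extra text.
invalid=$(grep -Ev '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$' "$phonelist")
printf '%s\n' "$invalid"    # (292) 108-518 and (129) 44-1379
```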
Finding Ugly Filenames With find
The find command allows for testing based on a regular expression. However, a critical distinction exists when employing regular expressions in find compared to grep. While grep prints a line when it contains a string matching an expression, find demands that the pathname precisely matches the regular expression. In the upcoming example, we'll utilize find with a regular expression to identify each pathname containing characters not belonging to this specific set:
[-_./0-9a-zA-Z]
This scan will uncover pathnames that contain spaces and other potentially problematic characters.
Because an exact match of the entire pathname is necessary, we employ .* at both ends of the expression to accommodate zero or more instances of any character. Within the expression's center, a negated bracket expression is utilized, encompassing our set of permissible characters for pathnames.
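A sketch of such a scan, run in a scratch directory with made-up filenames. It assumes a find that supports the -regex test, as the GNU version does; the sort is added only to give a stable output order.

```shell
# Scratch directory with one clean name and two "ugly" ones.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch clean_name.txt 'ugly name.txt' 'what?.txt'

# The whole pathname must match, so .* surrounds the negated set.
ugly=$(find . -regex '.*[^-_./0-9a-zA-Z].*' | LC_ALL=C sort)
printf '%s\n' "$ugly"    # ./ugly name.txt and ./what?.txt
```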
Searching For Files With locate
The locate tool offers support for both basic (using the --regexp option) and extended (via the --regex option) regular expressions. Using this tool, we can execute numerous operations akin to those conducted earlier with our dirlist files:
By employing alternation, we search for pathnames containing bin/bz, bin/gz, or bin/zip.
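Since locate depends on a file database that may not be present, the sketch below applies the same kind of extended expression to a made-up list of pathnames with grep -E. The commented locate invocation is an assumed form of the command, not one preserved in the text.

```shell
# Made-up pathnames, one per line.
paths='/bin/bzip2
/bin/gzip
/usr/bin/zip
/usr/bin/unzip
/usr/share/man/man1/gzip.1.gz'

# Alternation inside parentheses, preceded by the literal "bin/".
found=$(printf '%s\n' "$paths" | grep -E 'bin/(bz|gz|zip)')
printf '%s\n' "$found"    # /bin/bzip2, /bin/gzip, /usr/bin/zip

# Assumed equivalent with locate, where it is installed:
#   locate --regex 'bin/(bz|gz|zip)'
```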
Searching For Text With less And vim
Both less and vim utilize a similar approach for text search. Pressing the / key, followed by a regular expression, initiates the search function. For instance, if we use less to browse through our phonelist.txt file:
then search for our validation expression:
less will highlight the strings that match, leaving the invalid ones easy to spot:
On the flip side, vim supports basic regular expressions, hence our search expression would resemble:
/([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}
Comparatively, the expression remains mostly the same. However, in basic expressions, many characters regarded as metacharacters in extended expressions are seen as literals. They only adopt metacharacter status when preceded by a backslash. The highlighting for matches may vary depending on the specific vim configuration on our system. If highlighting isn't active, try this command mode command:
:set hlsearch
to enable search highlighting.
Note
Vim's capability for text search highlighting can vary based on your distribution. For instance, Ubuntu typically provides a quite basic version of vim by default. On such systems, you might consider using your package manager to install a more comprehensive version of vim for enhanced functionality.
Summary
Throughout this chapter, we've explored several applications of regular expressions. We can find even more if we use regular expressions to search for additional programs that use them. One way to do this is to search the man pages:
zgrep serves as a front end for grep, enabling it to read compressed files. In the provided example, we search the compressed section 1 man page files in their usual location. The command produces a list of files containing either the string "regex" or "regular expression," showing how widely regular expressions are used across programs.
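A sketch of this kind of search, using two tiny made-up compressed files in a scratch directory so that it does not depend on where man pages are installed. It assumes gzip and zgrep are available. On a real system the search might look like the commented command, whose directory is an assumption.

```shell
# Scratch directory with two made-up compressed "man pages".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'grep prints lines matching a regular expression\n' | gzip > grep.1.gz
printf 'ls lists directory contents\n' | gzip > ls.1.gz

# -E: extended syntax; -l: list only the names of files with a match.
# Only the output matters here, so the exit status is ignored.
hits=$(zgrep -El 'regex|regular expression' *.gz || true)
printf '%s\n' "$hits"    # grep.1.gz

# Assumed form on a real system:
#   cd /usr/share/man/man1 && zgrep -El 'regex|regular expression' *.gz
```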
One aspect of basic regular expressions remains unexplored in our discussion—back references. This feature will be the focus of our next chapter.