Text Processing

Chapter 20

Unix-like operating systems heavily rely on text files as a primary means of storing diverse data types. Consequently, a multitude of tools has been developed to manipulate text efficiently. This chapter delves into programs specifically designed for text manipulation, exploring techniques for slicing and reconfiguring text. The subsequent chapter will focus on text processing for formatting, catering to various human consumption needs, such as printing.

Throughout this chapter, we'll revisit familiar tools and encounter new ones, including:

  • cat: Concatenates files and displays the output

  • sort: Arranges lines within text files

  • uniq: Identifies and optionally excludes duplicate lines

  • cut: Eliminates specified segments from each line in files

  • paste: Combines lines from files

  • join: Connects lines from two files using a common field

  • comm: Compares lines between two sorted files

  • diff: Compares files line by line

  • patch: Applies a differential file to an original

  • tr: Translates or removes characters

  • sed: Acts as a stream editor for text filtering and transformation

  • aspell: An interactive spell checker

Applications Of Text

Up to this point, we've familiarized ourselves with a few text editors (nano and vim), explored a host of configuration files, and observed the output generated by dozens of text-based commands. However, what other purposes does text serve? It happens to be utilized for many diverse tasks, as we'll discover.

Documents

Numerous individuals opt for plain text formats when creating documents. While it's clear how a concise text file serves well for basic notes, it's equally feasible to compose extensive documents in text format. A prevalent method involves crafting a sizable document in a text format and subsequently employing a markup language to detail its final formatting. This approach is favored in composing scientific papers, owing to Unix-based text processing systems being among the pioneers in supporting the intricate typographical layout essential for technical writers.

Web Pages

The most widely used form of electronic document globally is likely the web page. These pages are textual documents employing either HTML (Hypertext Markup Language) or XML (Extensible Markup Language) as markup languages to define the visual layout of the document.

Email

Email operates fundamentally as a text-centric platform. Even attachments in non-text formats undergo conversion into a textual representation for transmission. To observe this process firsthand, one can download an email message and view it using a tool like less. Upon inspection, the message typically starts with a header detailing its origin and the processing it underwent throughout its journey, succeeded by the message body containing its content.

Printer Output

Within Unix-like systems, when sending output to a printer, it's transmitted either in plain text form or, if the page includes graphics, it gets transformed into a textual format referred to as PostScript. This PostScript data is then directed to a program responsible for generating the graphical dots necessary for printing.

Program Source Code

Numerous command-line programs present in Unix-like systems were developed to aid in system administration and software development, and text processing tools are no different. A significant portion of these tools is specifically crafted to address issues encountered in software development. The significance of text processing for software developers lies in the fact that all software originates as text. The core component of a program—the source code, which the programmer directly crafts—is consistently in text format.

Revisiting Some Old Friends

In Chapter 6 (Redirection), we explored commands capable of receiving standard input alongside command line arguments. We only briefly introduced these commands at that time. Now, we'll delve deeper into their functionality and explore how they can effectively facilitate text processing tasks.

cat

The cat program offers several intriguing options, many aimed at enhancing the visualization of text content. One such example is the -A option, employed to reveal nonprinting characters within the text. There are instances where identifying embedded control characters in otherwise visible text becomes essential. Among the most common are tab characters (distinct from spaces) and carriage returns, often marking the end-of-line in MS-DOS-style text files. Another prevalent scenario involves files containing lines of text with trailing spaces.

Let's create a test file using cat as a rudimentary word processor. To achieve this, we'll input the cat command (along with specifying a file for redirected output), type our text, press Enter to conclude the line properly, and then use Ctrl-d to signal cat that we've reached the end-of-file. For instance, in this example, we'll input a leading tab character followed by some trailing spaces after the line:
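A reconstructed session along these lines (the leading tab and the trailing spaces are, of course, invisible on screen):

cat > foo.txt
	The quick brown fox jumped over the lazy dog.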

Next, we will use cat with the -A option to display the text:
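cat -A foo.txt
^IThe quick brown fox jumped over the lazy dog.   $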

The displayed results indicate the presence of a tab character represented as ^I within our text. This notation, frequently signifying Control-I, happens to be synonymous with a tab character. Additionally, a $ symbol is visible at the actual end of the line, suggesting the existence of trailing spaces in our text.

MS-DOS Text Vs. Unix Text

Using cat to detect non-printing characters in text is particularly useful for uncovering concealed carriage returns. Where do these hidden carriage returns originate? DOS and Windows operating systems! Unix and DOS define the end of a line differently within text files. Unix denotes line endings with a linefeed character (ASCII 10), whereas MS-DOS and its derivatives utilize the sequence of a carriage return (ASCII 13) followed by a linefeed to conclude each line of text.

Various methods exist for converting files from DOS to Unix format. On numerous Linux systems, dedicated programs such as dos2unix and unix2dos facilitate the conversion of text files to and from DOS format. However, lacking dos2unix on your system shouldn't cause concern. Converting text from DOS to Unix format is straightforward—it involves simply eliminating the troublesome carriage returns. This task can be easily accomplished using a couple of the programs detailed later in this chapter.

cat features options designed for text manipulation as well. Among the most notable are -n, used for numbering lines, and -s, employed to suppress the display of consecutive blank lines. These functionalities can be showcased in the following manner:
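A reconstructed demonstration, assuming foo.txt now holds two text lines separated by two consecutive blank lines:

cat > foo.txt
The quick brown fox


jumped over the lazy dog.

cat -ns foo.txt
     1	The quick brown fox
     2
     3	jumped over the lazy dog.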

In this instance, we generate a revised version of our foo.txt test file, comprising two lines of text divided by two consecutive blank lines. Upon utilizing cat with the -ns options, the surplus blank line is eliminated, and the existing lines are sequentially numbered. Although this may seem like a minor text manipulation, it constitutes a notable process in itself.

sort

The sort utility arranges the content either from standard input or from one or multiple files specified through the command line and then forwards the sorted results to the standard output. Employing a technique similar to the one applied with cat, we can showcase the direct processing of standard input directly from the keyboard:
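A reconstructed session; we type the three letters, each on its own line, then press Ctrl-d:

sort > foo.txt
c
b
a

cat foo.txt
a
b
c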

Upon inputting the command, we enter the letters "c", "b", and "a", concluding by using Ctrl-d to indicate the end-of-file. Subsequently, upon examining the resultant file, we observe that the lines now present themselves in sorted order.

Given that sort has the capability to receive multiple files as command line arguments, it allows the merging of numerous files into a unified, sorted entity. For instance, if there were three text files and the objective was to merge them into a single sorted file, the process could resemble the following:
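A sketch with hypothetical filenames:

sort file1.txt file2.txt file3.txt > final_sorted_list.txt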

sort has several interesting options. Here is a partial list:

Option
Long Option
Description

-b

--ignore-leading-blanks

By default, sorting is executed on the complete line, commencing from the first character in the line. This option instructs sort to disregard leading spaces within lines and compute the sort from the first non-whitespace character on each line.

-f

--ignore-case

Makes sorting case-insensitive.

-n

--numeric-sort

This option conducts sorting based on the numeric interpretation of a string. Its utilization enables sorting to be carried out according to numeric values rather than relying on alphabetic sequences.

-r

--reverse

Arrange in reverse sequence. This function yields results in descending order rather than the usual ascending sequence.

-k

--key=field1[,field2]

Sort based on a key field positioned from field1 to field2, rather than the entire line. Further details are discussed below.

-m

--merge

Consider each argument as the label of an already sorted file. Combine multiple files into a unified sorted output without executing further sorting processes.

-o

--output=file

Send sorted output to file rather than standard output.

-t

--field-separator=char

Specify the character used as the field separator. By default, fields are delimited by spaces or tabs.

While many of the aforementioned options are quite straightforward, some require further explanation. To begin, let's consider the -n option, designed for numeric sorting. This option facilitates sorting values based on their numerical representation. To illustrate its functionality, we can utilize this option in sorting the output of the du command to identify the largest consumers of disk space. Typically, the du command lists summary results in pathname order.
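For example (reconstructed; the directories and sizes shown are illustrative and will vary by system):

du -s /usr/share/* | sort -nr | head
509940	/usr/share/locale-langpack
242660	/usr/share/doc
189788	/usr/share/fonts
152144	/usr/share/gnome
...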

In this instance, we redirect the output into head to restrict the display to the initial ten lines. Employing this method allows us to generate a list sorted numerically, showcasing the ten most significant space consumers.

Utilizing the -nr options facilitates a reverse numerical sort, presenting the largest values at the forefront of the results. This sorting method operates effectively since the numerical values are positioned at the start of each line. However, what if our objective is to sort a list based on a value embedded within the line? For instance, consider the output from an ls -l command:
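An illustrative listing (names, sizes, and dates will differ on your system):

ls -l /usr/bin | head
total 152948
-rwxr-xr-x 1 root root     34824 2021-04-02 11:04 [
-rwxr-xr-x 1 root root    101556 2021-10-23 13:27 a2p
-rwxr-xr-x 1 root root     13036 2021-04-02 11:04 aconnect
-rwxr-xr-x 1 root root     10552 2021-02-25 08:26 acpi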

For now, setting aside the fact that ls can arrange its results by size, we could employ sort to organize this list based on file size:
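sort accepts the key option like this (output omitted; the largest files appear first):

ls -l /usr/bin | sort -nr -k 5 | head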

sort is frequently used to process tabular data, such as the output of the ls command above. If we apply database terminology to the listing, each line is a record, and each record comprises multiple fields, such as the file attributes, link count, file size, filename, and so on. sort possesses the ability to process these individual fields. In database language, we can specify one or more key fields to act as the basis for sorting.

In the previously mentioned example, we indicate the n and r options to conduct a reverse numerical sort and specify -k 5 to prompt sort to utilize the fifth field as the sorting key.

The k option holds significant interest and comes with various functionalities. However, before delving into its intricacies, it's essential to understand how sort defines fields. Consider a straightforward text file comprising a solitary line containing the author's name:
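The file's single line:

William Shotts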

By default, sort perceives this line as encompassing two fields. The initial field comprises the characters "William," while the subsequent field contains " Shotts," indicating that whitespace characters (spaces and tabs) serve as separators between fields. Additionally, these delimiters are encompassed within the field during the sorting process.

Upon revisiting a line from our ls output, it becomes apparent that each line contains eight fields, with the fifth field representing the file size:
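An illustrative line (eight whitespace-separated fields, the fifth being the size in bytes):

-rwxr-xr-x 1 root root 8234216 2022-04-07 17:32 vim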

In our upcoming set of experiments, let's examine a file detailing the history of three prevalent Linux distributions released between 2006 and 2008. Each line within the file consists of three fields: the distribution name, version number, and release date in MM/DD/YYYY format:
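A reconstruction of that file, with fields separated by single tab characters (the release data is shown for illustration):

SUSE	10.2	12/07/2006
Fedora	10	11/25/2008
SUSE	11.0	06/19/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007
SUSE	10.3	10/04/2007
Ubuntu	6.10	10/26/2006
Fedora	7	05/31/2007
Ubuntu	7.10	10/18/2007
Ubuntu	7.04	04/19/2007
SUSE	10.1	05/11/2006
Fedora	6	10/24/2006
Fedora	9	05/13/2008
Ubuntu	6.06	06/01/2006
Ubuntu	8.10	10/30/2008
Fedora	5	03/20/2006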

Through a text editor like vim, let's input this data and save the resulting file as distros.txt. Then, we'll proceed to sort the file and examine the outcomes:
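Reconstructed from the sample data above (output abridged):

sort distros.txt
Fedora	10	11/25/2008
Fedora	5	03/20/2006
Fedora	6	10/24/2006
Fedora	7	05/31/2007
Fedora	8	11/08/2007
Fedora	9	05/13/2008
SUSE	10.1	05/11/2006
...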

Mostly, the sorting worked well, except for the Fedora version numbers. Because sort compares characters rather than numeric values, "10" lands above "5" (and hence above every other version) since the character "1" precedes "5" in the collating sequence.

To address this problem, we need to sort based on multiple keys. Our objective is an alphabetical sort for the first field and a numeric sort for the second field. sort permits multiple uses of the -k option to specify multiple sort keys. Additionally, a key can encompass a range of fields. When no range is specified (as in our previous examples), sort utilizes a key that initiates from the specified field and extends to the end of the line. Below is the syntax for our multi-key sort:
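sort --key=1,1 --key=2n distros.txt

With our sample data, the Fedora versions now fall in proper numeric order (output abridged):

Fedora	5	03/20/2006
Fedora	6	10/24/2006
Fedora	7	05/31/2007
Fedora	8	11/08/2007
Fedora	9	05/13/2008
Fedora	10	11/25/2008
SUSE	10.1	05/11/2006
...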

Although we opted for the long form of the option for clarity, -k 1,1 -k 2n would be precisely equivalent. In the first instance of the key option, we outlined a field range for the initial key. To confine the sort exclusively to the first field, we specified 1,1, signifying "start at field one and conclude at field one." Meanwhile, in the second instance, we designated 2n, indicating that field 2 serves as the sorting key and necessitates a numeric sort.

A letter option can be included at the conclusion of a key specifier to denote the type of sort to be executed. These letters correspond to the same global options available for the sort program: b (ignore leading blanks), n (numeric sort), r (reverse sort), and so forth.

Within our list, the third field contains a date formatted inconveniently for sorting. Computers typically adopt the YYYY-MM-DD format for dates, ensuring chronological sorting ease, whereas our dates adhere to the American MM/DD/YYYY format. How can we arrange this list chronologically?

Thankfully, sort offers a solution. The key option allows specification of offsets within fields, enabling us to define keys within fields:
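sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt

With our sample data, the newest releases come first (output abridged):

Fedora	10	11/25/2008
Ubuntu	8.10	10/30/2008
SUSE	11.0	06/19/2008
Fedora	9	05/13/2008
...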

By utilizing -k 3.7, we direct sort to utilize a sorting key starting at the seventh character within the third field, which signifies the commencement of the year. Simultaneously, we designate -k 3.1 and -k 3.4 to pinpoint the month and day portions within the date. To execute a reverse numeric sort, we append the n and r options. Additionally, the b option is added to suppress leading spaces in the date field. These spaces, varying in number across lines, might influence the sorting outcome.

Certain files deviate from using tabs and spaces as field separators, such as the /etc/passwd file:
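A few typical lines (the exact contents vary by system):

head -n 4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin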

Within this file, the fields are demarcated by colons (:). How might we go about sorting this file using a specific field as the key? sort offers the -t option to designate the field separator character. For instance, to sort the passwd file based on the seventh field, representing the account's default shell, we could proceed as follows:
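A sketch (output varies by system):

sort -t ':' -k 7 /etc/passwd | head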

By specifying the colon character as the field separator, we can sort on the seventh field.

uniq

In contrast to sort, the uniq program is lightweight. It carries out a seemingly simple task: when provided with a sorted file (including standard input), it eliminates any duplicate lines and directs the unique results to standard output. It's commonly employed alongside sort to cleanse the output of redundant entries.

Tip

While uniq stands as a traditional Unix utility frequently paired with sort, the GNU version of sort introduces a -u option that directly eliminates duplicates from the sorted output.

To experiment with this, let's create a text file:
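A reconstructed session:

cat > foo.txt
a
b
c
a
b
c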

Remember to use Ctrl-d to conclude standard input. Once that's done, if we execute uniq on our text file:
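uniq foo.txt
a
b
c
a
b
c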

The output remains unchanged, mirroring our original file as the duplicates were not eliminated. For uniq to effectively perform its function, the input must undergo sorting beforehand:
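sort foo.txt | uniq
a
b
c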

This occurs because uniq solely eliminates duplicate lines that appear consecutively.

uniq offers various options. Here are some commonly used ones:

Option
Description

-c

Output a list of duplicate lines preceded by the number of times the line occurs.

-d

Only output repeated lines, rather than unique lines.

-f n

Ignore n leading fields in each line. Fields are separated by whitespace as they are in sort; however, unlike sort, uniq has no option for setting an alternate field separator.

-i

Ignore case during the line comparisons.

-s n

Skip (ignore) the leading n characters of each line.

-u

Only output lines that are not repeated in the input. Note that this differs from the default behavior, which outputs one copy of every line, merging any duplicates.

In this example, we observe uniq employed to indicate the count of duplicates detected within our text file, utilizing the -c option:
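sort foo.txt | uniq -c
      2 a
      2 b
      2 c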

Slicing And Dicing

The following trio of programs are designed for extracting text columns from files and then reassembling them in practical ways.

cut

The cut program serves to extract a specific text segment from a line and then output that segment to standard output. It can handle multiple file inputs or take input from standard input.

Defining the segment of the line to extract can be a bit cumbersome and is determined through the utilization of the following options:

Option
Description

-c char_list

Retrieve the section of the line specified by char_list. This list could contain one or several numerical ranges separated by commas.

-f field_list

Retrieve one or multiple fields from the line based on the definition provided by field_list. This list might include one or more fields or ranges of fields separated by commas.

-d delim_char

When -f is indicated, delim_char serves as the character used to separate fields. By default, fields are expected to be divided by a solitary tab character.

--complement

Retrieve the complete text line, excluding the segments specified by -c and/or -f.

As observed, the method through which cut extracts text appears somewhat rigid. cut is most effective when extracting text from files generated by other programs rather than text input directly from humans. Let's examine our distros.txt file to determine if it's sufficiently organized for our cut demonstrations. Utilizing cat along with the -A option will allow us to check if the file aligns with our criteria for tab-separated fields.
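Reconstructed from our sample data; ^I marks each tab and $ each line ending (output abridged):

cat -A distros.txt
SUSE^I10.2^I12/07/2006$
Fedora^I10^I11/25/2008$
SUSE^I11.0^I06/19/2008$
...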

Everything seems in order. There are no spaces within the text, just individual tab characters separating the fields. Because the file relies on tabs instead of spaces, we'll employ the -f option to extract a field.
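Here we extract the third field, the release date (output abridged):

cut -f 3 distros.txt
12/07/2006
11/25/2008
06/19/2008
...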

Given that our distros file is delimited by tabs, it's preferable to utilize cut for extracting fields rather than characters. This preference arises from the inconsistency in the number of characters per line when a file is tab-delimited, making it challenging or even impossible to calculate character positions within each line accurately. However, in our preceding example, we've managed to extract a field that fortunately contains data of uniform length. This allows us to demonstrate character extraction by retrieving the year from each line.
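cut -f 3 distros.txt | cut -c 7-10
2006
2008
2008
...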

Executing cut for a second time from our list enables us to extract characters located at positions 7 through 10, representing the year within our date field. The representation of 7-10 denotes a range. For a comprehensive understanding of specifying ranges, the cut manual page provides a detailed description.

Expanding Tabs

Our distros.txt file is optimally structured for cut to extract fields. However, if our goal were to create a file that could be entirely manipulated by cut in terms of characters rather than fields, it would entail replacing the tab characters in the file with an equivalent number of spaces. Fortunately, the GNU Coreutils package offers a tool for this purpose called expand. This utility accepts one or more file arguments or standard input, modifying the text and outputting the result to standard output.

By processing our distros.txt file with expand, we enable the use of cut -c to extract any character range from the file. For instance, we could execute the following command to extract the year of release from our list by expanding the file and employing cut to extract every character from the twenty-third position to the end of each line:
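Assuming default eight-column tab stops, the year in our sample data begins at character position 23 (output abridged):

expand distros.txt | cut -c 23-
2006
2008
2008
...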

Additionally, Coreutils includes the unexpand program, which serves to replace spaces with tabs.

When handling fields, it's feasible to define an alternative field separator instead of relying on the tab character. In this instance, we'll retrieve the initial field from the /etc/passwd file:
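cut -d ':' -f 1 /etc/passwd | head
root
daemon
bin
sys
...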

Through the -d option, we can designate the colon character as the field delimiter.

paste

The paste command functions in contrast to cut. Instead of extracting a column of text from a file, it appends one or more columns of text to a file. This action involves reading multiple files and amalgamating the fields found in each file into a unified stream on standard output. Similar to cut, paste accepts multiple file arguments and/or standard input. To illustrate the functioning of paste, we'll manipulate our distros.txt file to create a chronological compilation of releases.

Building upon our prior use of sort, our initial step involves generating a sorted list of distros by date and saving the outcome in a file named distros-by-date.txt:
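sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt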

Following that, we'll utilize cut to retrieve the initial two fields from the file (the distro name and version), saving that output into a file named distro-versions.txt:
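cut -f 1,2 distros-by-date.txt > distro-versions.txt
head -n 3 distro-versions.txt
Fedora	10
Ubuntu	8.10
SUSE	11.0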

The last step in preparation involves extracting the release dates and saving them into a file named distro-dates.txt:
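cut -f 3 distros-by-date.txt > distro-dates.txt
head -n 3 distro-dates.txt
11/25/2008
10/30/2008
06/19/2008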

We've gathered the necessary components. To finalize the process, employ paste to place the column of dates preceding the distro names and versions, thereby constructing a chronological list. This is achieved straightforwardly by using paste and arranging its arguments in the desired sequence:
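paste distro-dates.txt distro-versions.txt
11/25/2008	Fedora	10
10/30/2008	Ubuntu	8.10
06/19/2008	SUSE	11.0
...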

join

join is similar to paste in that it adds columns to a file, yet it employs a distinct method. A join is an operation typically associated with relational databases, where data from multiple tables sharing a common key field is merged to produce a desired result. The join program essentially executes the same kind of operation, combining data from multiple files based on a common key field.

To illustrate the application of a join operation in a relational database, let's envision a tiny database consisting of two tables, each containing a single record. The CUSTOMERS table comprises three fields: a customer number (CUSTNUM), the customer’s first name (FNAME), and the customer’s last name (LNAME):

CUSTNUM FNAME LNAME

======== ===== ======

4681934 John Smith

The second table, ORDERS, includes four fields: an order number (ORDERNUM), the customer number (CUSTNUM), the quantity (QUAN), and the item ordered (ITEM).

ORDERNUM CUSTNUM QUAN ITEM

======== ======= ==== ====

3014953305 4681934 1 Blue Widget

Both tables share the CUSTNUM field, establishing a linkage between them.

A join operation enables the amalgamation of fields from the two tables, facilitating useful outcomes like preparing an invoice. By utilizing matching values in the CUSTNUM fields of both tables, a join operation could yield a result as follows:

FNAME LNAME QUAN ITEM

===== ===== ==== ====

John Smith 1 Blue Widget

To demonstrate the join program, let's create a couple of files with a shared key. We'll derive these files from our distros-by-date.txt file, constructing a first file that pairs the release dates (serving as our shared key) with the release names:
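A sketch using cut and paste on our earlier files; the distros-key-names.txt filename is our own invention:

cut -f 1 distros-by-date.txt > distro-names.txt
paste distro-dates.txt distro-names.txt > distros-key-names.txt
head -n 3 distros-key-names.txt
11/25/2008	Fedora
10/30/2008	Ubuntu
06/19/2008	SUSE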

and the second file, which contains the release dates and the version numbers:
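cut -f 2 distros-by-date.txt > distro-vernums.txt
paste distro-dates.txt distro-vernums.txt > distros-key-vernums.txt
head -n 3 distros-key-vernums.txt
11/25/2008	10
10/30/2008	8.10
06/19/2008	11.0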

Presently, we possess two files sharing a common key—the "release date" field. It's crucial to note that these files must be sorted based on the key field for join to function accurately.
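Joining the two files on their shared first field might look like this (output abridged; note that join expects its inputs sorted on the key field, and a real run on our reverse-chronological files may warn about unsorted input):

join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
...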

Additionally, by default, join employs whitespace as the input field delimiter and a singular space as the output field delimiter. This functionality can be altered by indicating specific options. For further information, refer to the join manual page for detailed explanations.

Comparing Text

Comparing versions of text files holds significant utility, especially for system administrators and software developers. In scenarios where a system administrator needs to diagnose a system problem by comparing an existing configuration file to a previous version, or when a programmer seeks insights into the modifications made to programs over time, this functionality proves particularly crucial.

comm

The comm program contrasts two text files, revealing the unique lines in each and those shared between them. To showcase its functionality, we'll generate two almost identical text files using cat:
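A reconstructed session:

cat > file1.txt
a
b
c
d

cat > file2.txt
b
c
d
e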

Next, we will compare the two files using comm:
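comm file1.txt file2.txt
a
		b
		c
		d
	e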

As observed, comm generates output in three columns. The initial column comprises lines unique to the first file argument, the second column contains lines unique to the second file argument, and the third exhibits lines shared by both files. 'comm' provides options in the format of -n, where n represents either 1, 2, or 3. When utilized, these options determine which column(s) to suppress. For instance, to exclusively display the lines shared by both files, we would suppress the output of columns one and two:
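comm -12 file1.txt file2.txt
b
c
d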

diff

Similar to the comm program, diff serves the purpose of identifying distinctions between files. However, diff is a considerably more intricate tool, offering support for various output formats and the capacity to handle extensive collections of text files simultaneously. It's frequently employed by software developers to scrutinize alterations among different versions of program source code. This tool possesses the capability to recursively inspect directories of source code, often termed as source trees.

One prevalent application of diff involves creating differential files or patches utilized by programs like patch (which we'll discuss shortly). These patches facilitate the conversion of one version of a file or files into another version.

Let's apply diff to examine our earlier example files:
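diff file1.txt file2.txt
1d0
< a
4a4
> e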

We observe its default output style: a concise representation of the disparities between the two files. In this default format, each set of modifications begins with a change command structured as range operation range. This form delineates the positions and types of alterations necessary to transform the first file into the second file.

Change
Description

r1ar2

Append the lines at the position r2 in the second file to the position r1 in the first file.

r1cr2

Change (replace) the lines at position r1 with the lines at the position r2 in the second file.

r1dr2

Delete the lines in the first file at position r1, which would have appeared at range r2 in the second file.

This structure defines a range as a list separated by commas, encompassing both the starting and ending lines. While this format stands as the default (primarily for POSIX compliance and to maintain backward compatibility with traditional Unix versions of diff), it's not as prevalent as other available formats. Among the more commonly used formats are the context format and the unified format.

When inspected using the context format (via the -c option), the output will appear as follows:
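Timestamps here are illustrative:

diff -c file1.txt file2.txt
*** file1.txt	2023-06-10 14:02:00.000000000 -0400
--- file2.txt	2023-06-10 14:03:00.000000000 -0400
***************
*** 1,4 ****
- a
  b
  c
  d
--- 1,4 ----
  b
  c
  d
+ e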

The output commences by displaying the names of the two files alongside their timestamps. Throughout the rest of the listing, asterisks denote lines belonging to the first file and dashes denote lines belonging to the second. Following this, there are clusters of changes, accompanied by the default number of adjacent context lines. In the initial group, we encounter:

*** 1,4 ****

This signifies lines 1 through 4 in the first file. Later on, we come across:

--- 1,4 ----

This denotes lines 1 through 4 in the second file. Within a change group, lines are prefaced by one of four indicators:

Indicator
Meaning

blank

A line shown for context. It does not indicate a difference between the two files.

-

A line deleted. This line will appear in the first file but not in the second file.

+

A line added. This line will appear in the second file but not in the first file.

!

A line changed. The two versions of the line will be displayed, each in its respective section of the change group.

The unified format, akin to the context format, offers a more streamlined presentation. It's activated using the -u option:
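Again with illustrative timestamps:

diff -u file1.txt file2.txt
--- file1.txt	2023-06-10 14:02:00.000000000 -0400
+++ file2.txt	2023-06-10 14:03:00.000000000 -0400
@@ -1,4 +1,4 @@
-a
 b
 c
 d
+e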

The primary distinction between the context and unified formats lies in the removal of duplicated context lines, rendering the results of the unified format more concise compared to the context format. In the previously illustrated example, we observe file timestamps akin to those in the context format, succeeded by the string @@ -1,4 +1,4 @@. This specifies the lines in the first and second files outlined within the change group. Subsequently, the lines themselves appear, accompanied by the default three lines of context. Each line is initiated by one of three potential characters:

Character
Meaning

blank

This line is shared by both files.

-

This line was removed from the first file.

+

A line added. It appears in the second file but not in the first.

patch

The patch program is employed to implement alterations to text files, accepting output generated from diff. Typically, it's utilized to convert older versions of files into newer ones. Consider the Linux kernel as a prime example. Developed by a vast, loosely organized team of contributors, the kernel receives a continuous flow of small changes to its source code. Despite comprising millions of lines of code, the changes introduced by a single contributor at any given time tend to be rather modest. Consequently, it's impractical for a contributor to dispatch the entire kernel source tree each time a small alteration is made. Instead, a diff file is submitted, encapsulating the transition from the prior version of the kernel to the updated version, featuring the contributor's changes. The recipient can then utilize the patch program to apply these modifications to their own source tree. Employing diff/patch presents two notable advantages:

  1. The diff file retains a significantly smaller size in comparison to the complete source tree.

  2. The diff file precisely displays the modifications made, facilitating swift evaluation by patch reviewers.

It's crucial to note that while diff/patch is commonly associated with source code, it's equally applicable to various text files, including configuration files or any other text-based documents.

To generate a diff file compatible with patch, the GNU documentation (refer to Further Reading below) suggests using diff as follows:

diff -Naur old_file new_file > diff_file

Here, old_file and new_file can represent individual files or directories containing files. The -r option facilitates directory tree recursion.

Subsequently, once the diff file is prepared, it can be applied to patch the old file into the new file using:

patch < diff_file

We'll demonstrate this process using our test file:
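A reconstructed session:

diff -Naur file1.txt file2.txt > patchfile.txt
patch < patchfile.txt
patching file file1.txt
cat file1.txt
b
c
d
e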

In this instance, we generated a diff file called patchfile.txt and subsequently utilized the patch program to implement the patch. It's worth noting that we didn't need to designate a target file for patching because the diff file (formatted in unified format) already encompasses the filenames within the header. Upon applying the patch, we observe that file1.txt now mirrors file2.txt.

The patch command offers an extensive range of options, and additional utility programs exist to analyze and edit patches.

Editing On The Fly

So far, our interactions with text editors have primarily been interactive, involving manual cursor movement and typing changes. Nevertheless, there are non-interactive methods for editing text. It's feasible, for instance, to apply a series of modifications to multiple files using just one command.

tr

The tr utility is designed for character transliteration, functioning akin to a character-based search-and-replace operation. Transliteration involves the process of altering characters from one alphabet to another. For instance, converting characters from lowercase to uppercase is a form of transliteration. We can execute this conversion using tr in the following manner:
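echo "lowercase letters" | tr a-z A-Z
LOWERCASE LETTERS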

As evident, tr operates by receiving input from standard input and producing its outcomes on standard output. It requires two arguments: a set of characters for conversion from, and a corresponding set of characters for conversion to. These character sets can be represented in one of three ways:

  1. An explicit list of characters (e.g., ABCDEFGHIJKLMNOPQRSTUVWXYZ)

  2. A character range (e.g., A-Z). Note that this approach might encounter issues similar to other commands, influenced by locale collation order, and hence, should be used cautiously.

  3. POSIX character classes (e.g., [:upper:]).

Usually, both character sets should have equal length. Yet, it's possible for the first set to be larger than the second, especially if we aim to convert multiple characters to a single character:
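echo "lowercase letters" | tr [:lower:] A
AAAAAAAAA AAAAAAA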

Apart from transliteration, tr provides the ability to outright delete characters from the input stream. In an earlier chapter, we explored the challenge of converting MS-DOS text files into Unix-style text. This conversion necessitates the removal of carriage return characters from the end of each line. Achieving this can be done using tr in the following manner:

tr -d '\r' < dos_file > unix_file

Here, dos_file represents the file intended for conversion, and unix_file denotes the resulting file. This command structure employs the escape sequence \r to denote the carriage return character. To explore a comprehensive list of the sequences and character classes supported by tr, consider trying:
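tr --help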

ROT13: The Not-So-Secret Decoder Ring

A playful application of tr involves executing ROT13 encoding on text. ROT13 represents a simplistic form of encryption reliant on a basic substitution cipher. Referring to ROT13 as "encryption" might be generous; it's more accurately termed as "text obfuscation." This method is occasionally used to obscure potentially offensive content within text. The approach involves shifting each character 13 places up the alphabet. As this shift marks the midway point of the possible 26 characters, applying the algorithm a second time on the text restores it to its original form. Executing this encoding with tr can be demonstrated as follows:
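echo "secret text" | tr a-zA-Z n-za-mN-ZA-M
frperg grkg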

Conducting the same process a second time yields the reverse translation:
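echo "frperg grkg" | tr a-zA-Z n-za-mN-ZA-M
secret text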

Several email programs and Usenet news readers facilitate ROT13 encoding. For more comprehensive insights into ROT13, Wikipedia offers an informative article on the subject: Wikipedia - ROT13

tr possesses an additional capability. With the -s option, tr has the ability to "squeeze" (remove) consecutive occurrences of a character:
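echo "aaabbbccc" | tr -s ab
abccc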

In this example, the string comprises repeated characters. When we designate the set "ab" to tr, it removes the duplicated occurrences of the letters in the set, while retaining the character absent from the set ("c"). It's essential to note that the repeating characters must be contiguous. If they aren't:
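echo "abcabcabc" | tr -s ab
abcabcabc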

the squeezing will have no effect.

sed

sed, short for "stream editor," conducts text modifications on a stream of text, which can either be a set of designated files or standard input. sed is a robust and relatively intricate program, often detailed extensively in comprehensive guides (there are entire books dedicated to it), hence we won't delve into its full scope here. Generally, sed operates by receiving either a solitary editing command (via the command line) or the designation of a script file containing multiple commands. It subsequently executes these commands on each line within the stream of text. Below is a straightforward demonstration showcasing sed in operation:
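echo "front" | sed 's/front/back/'
back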

In this demonstration, we generate a single-word text stream using echo and direct it into sed. Subsequently, sed executes the instruction s/front/back/ on the text within the stream, resulting in the output "back." This command structure is reminiscent of the "substitution" (search-and-replace) command in vi.

Commands within sed typically commence with a single letter. In the above example, the substitution command is denoted by the letter s and is succeeded by the search and replacement strings, separated by the slash character acting as a delimiter. The choice of the delimiter character is arbitrary. Conventionally, the slash character is commonly used, yet sed will accept any character immediately following the command as the delimiter. This same command could be executed in the following manner:
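echo "front" | sed 's_front_back_'
back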

Upon using the underscore character directly following the command, it assumes the role of the delimiter. This flexibility in setting the delimiter can enhance command readability, as we'll soon demonstrate.

In sed, most commands can be preceded by an address, indicating which line(s) of the input stream will undergo editing. In the absence of an address, the editing command is executed on every line within the input stream. The most basic form of an address is a line number. Let's expand on our example by adding one:
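echo "front" | sed '1s/front/back/'
back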

Integrating the address 1 into our command triggers the substitution to be executed solely on the initial line of our single-line input stream. If we indicate another number:
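echo "front" | sed '2s/front/back/'
front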

the editing process doesn't proceed because our input stream lacks a second line, designated as line 2.

There are several ways to express addresses. Among the most common are the following:

Address
Description

n

A line number where n is a positive integer

$

The last line.

/regexp/

Lines that match a POSIX basic regular expression. It's important to note that the regular expression is enclosed within slash characters. Additionally, the regular expression can be enclosed within an alternative character by specifying the expression as \cregexpc, where c represents the alternate character.

addr1,addr2

A range of lines from addr1 to addr2, inclusively. The addresses can encompass any of the single address formats mentioned earlier.

first~step

Identify the line corresponding to the initial number, followed by each subsequent line at regular step intervals. For instance, 1~2 pertains to every odd-numbered line, while 5~5 refers to the fifth line and every fifth line that follows.

addr1,+n

Match addr1 and the following n lines.

addr!

Match all lines except addr, which may be any of the forms above.

Let's showcase various address types using the distros.txt file we used earlier in this chapter. Initially, we'll start with a range of line numbers:
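Reconstructed using the first five lines of our sample data:

sed -n '1,5p' distros.txt
SUSE	10.2	12/07/2006
Fedora	10	11/25/2008
SUSE	11.0	06/19/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007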

Here, we display a range of lines from line 1 to line 5. To achieve this, we utilize the p command, responsible for printing matched lines. However, for this to work as intended, we must include the -n option (no auto-print) to prevent sed from automatically printing every line by default.

Next, let's explore a regular expression:
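sed -n '/SUSE/p' distros.txt
SUSE	10.2	12/07/2006
SUSE	11.0	06/19/2008
SUSE	10.3	10/04/2007
SUSE	10.1	05/11/2006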

By incorporating the slash-delimited regular expression /SUSE/, we can extract the lines that contain it, akin to the functionality of grep.

Lastly, we'll experiment with negation by appending an exclamation point (!) to the address:
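Output abridged:

sed -n '/SUSE/!p' distros.txt
Fedora	10	11/25/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007
Ubuntu	6.10	10/26/2006
...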

Here, we observe the anticipated outcome: all lines within the file except for those that match the regular expression.

Until now, we've explored two of the fundamental sed editing commands, s and p. Below is a more comprehensive roster of the basic editing commands:

Command
Description

=

Output current line number.

a

Append text after the current line.

d

Delete the current line.

i

Insert text in front of the current line.

p

Display the current line. By default, sed prints every line while editing only those that correspond to a specified address within the file. This default behavior can be altered by indicating the -n option.

q

Terminate sed without processing additional lines. When the -n option is not specified, the current line is output.

Q

Exit sed without processing any more lines.

s/regexp/replacement/

Replace the text matched by regexp with the contents of replacement. The replacement field can incorporate the special character &, representing the text matched by regexp. Additionally, replacement can contain the sequences \1 through \9, corresponding to the contents of the respective subexpressions in regexp. Further details on this topic, including back references, are discussed below. Optionally, after the trailing slash following replacement, a flag can be specified to alter the behavior of the s command.

y/set1/set2/

Conduct character transliteration by converting characters from set1 to their corresponding characters in set2. It's essential to note that unlike tr, sed mandates both sets to have identical lengths.

The s command stands out as the most frequently employed editing command. We'll exhibit a fraction of its capabilities by making an edit to our distros.txt file. Previously, we mentioned that the date field in distros.txt wasn't in a "computer-friendly" format. The field is in MM/DD/YYYY format, but it would be more convenient (especially for sorting purposes) if it were in YYYY-MM-DD format. Manual alteration of the file would be both laborious and error-prone. However, with sed, this modification can be executed in a single step:
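sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt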

That command might not win a beauty contest, but it does the job. In a single operation, we've successfully altered the date format in our file. This serves as a prime example of why regular expressions are sometimes humorously called a "write-only" medium—writable but occasionally challenging to decipher. Before we consider fleeing in terror from this command, let's examine how it was built. Initially, we knew that the command would have this fundamental structure:
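sed 's/regexp/replacement/' distros.txt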

Next, let's determine a regular expression that isolates the date. Given its MM/DD/YYYY format positioned at the end of the line, we can employ an expression resembling this:
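[0-9]{2}/[0-9]{2}/[0-9]{4}$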

This pattern corresponds to two digits, a slash, two more digits, another slash, followed by four digits, and ultimately the end of the line. So, that effectively captures the required regexp. Now, to address the replacement aspect, we need to introduce a unique feature seen in some applications utilizing BRE (Basic Regular Expressions). This feature, known as back references, operates in the following manner: Whenever the sequence \n appears in replacement—where n represents a number between 1 to 9—it refers back to the respective subexpression in the preceding regular expression. To establish these subexpressions, we enclose them within parentheses:
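([0-9]{2})/([0-9]{2})/([0-9]{4})$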

We've created three subexpressions: the first encompasses the month, the second the day of the month, and the third the year. With these in place, we can now build the replacement as follows:
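\3-\1-\2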

resulting in the year, followed by a dash, the month, another dash, and finally the day.

Hence, our command appears as follows:
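sed 's/([0-9]{2})/([0-9]{2})/([0-9]{4})$/\3-\1-\2/' distros.txt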

Two issues persist. First, the extra slashes within our regular expression will confuse sed when it interprets the s command. Second, since sed normally accepts only basic regular expressions, several characters in our expression would be interpreted as literals rather than metacharacters. Both concerns can be resolved by judiciously applying backslashes to escape the problematic characters:
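sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt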

And that's it!

Another facet of the s command involves optional flags that can trail the replacement string. Among these, the most significant is the g flag, directing sed to implement the search-and-replace operation globally within a line, rather than just the first instance—which is the default behavior.

Here's an example to illustrate:
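echo "aaabbbccc" | sed 's/b/B/'
aaaBbbccc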

The replacement was executed solely on the initial occurrence of the letter "b," leaving the subsequent instances unaltered. However, by appending the g flag, we can modify all occurrences:
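echo "aaabbbccc" | sed 's/b/B/g'
aaaBBBccc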

Until now, we've solely provided sed with single commands via the command line. However, it's feasible to create more intricate commands within a script file using the -f option. To exemplify, we'll utilize sed along with our distros.txt file to generate a report. The report will contain a title at the beginning, our adjusted dates, and all distribution names converted to uppercase. To achieve this, we'll need to craft a script. Let's open our text editor and input the following:
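The script, reconstructed:

# sed script to produce Linux distributions report

1 i\
\
Linux Distributions Report\

s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/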

We will save our sed script as distros.sed and run it like this:
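With our sample data, the output begins like this (abridged):

sed -f distros.sed distros.txt

Linux Distributions Report

SUSE	10.2	2006-12-07
FEDORA	10	2008-11-25
SUSE	11.0	2008-06-19
...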

As observed, our script successfully generates the expected outcomes. But how does it accomplish this? Let's review our script once more. We can utilize cat to number the lines:
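cat -n distros.sed
     1	# sed script to produce Linux distributions report
     2
     3	1 i\
     4	\
     5	Linux Distributions Report\
     6
     7	s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
     8	y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/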

Line 1 in our script functions as a comment, indicated by the # character—a common practice in Linux system configuration files and programming languages. Comments offer human-readable context and can be placed anywhere within the script, but not within the commands themselves. They aid in identifying and maintaining the script's components.

Line 2 is left blank, serving to enhance readability. Blank lines are often added for better visual organization.

Lines 3 through 6 consist of text to be inserted at address 1, denoting the first line of the input. The i command is followed by a backslash-newline sequence, creating an escaped newline, which acts as a line-continuation character. This sequence, prevalent in various contexts like shell scripts, allows embedding a newline within text without signaling the interpreter (in this case, sed) that the end of the line has been reached. The i (insert), a (append), and c (replace) commands all support multiple lines of text, provided that each line, except the last, ends with a line-continuation character. The sixth line marks the conclusion of our inserted text; it ends with a plain newline rather than a line-continuation character, signifying the end of the i command.

Note

A line-continuation character is created by a backslash immediately followed by a newline, without any intervening spaces.

Line 7 initiates our search-and-replace command. Not accompanied by an address, this command affects every line in the input stream.

Line 8 executes a transliteration, converting lowercase letters into uppercase. Unlike tr, sed's y command doesn't support character ranges (e.g., [a-z]) or POSIX character classes. Similar to the prior command, it applies to each line in the input stream, as no address is given.

People Who Like sed Also Like...

sed is a capable tool able to perform fairly intricate text editing tasks, yet it is mostly used for concise, single-line operations rather than extensive scripts. For larger tasks, users often turn to more comprehensive tools like awk and perl. These go beyond the functionalities of the programs covered here, branching into complete programming languages. perl, especially, finds extensive use in system management, administration, and web development, often replacing shell scripts due to its versatility.

On the other hand, awk is known for its expertise in manipulating tabular data. It shares some similarities with sed, as awk programs commonly process text files line by line, employing a concept akin to sed's approach of addressing followed by an action. While exploring awk and perl is beyond the scope of this book, mastering them proves valuable for Linux command line users.

aspell

We'll take a look at another tool called aspell, an interactive spell checker. It's a successor to the older program named ispell and can generally be used as a direct substitute for it. While it's often employed by other programs needing spell-checking functions, it's also quite effective as a standalone tool from the command line. Its capabilities extend to intelligently checking different types of text files, such as HTML documents, C/C++ programs, emails, and various specialized texts.

To spell-check a text file with basic written content, you could employ it in the following manner:
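aspell check textfile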

where textfile represents the file you wish to check. For instance, let's generate a basic text file called foo.txt that deliberately contains some spelling mistakes:
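A reconstructed session:

cat > foo.txt
The quick brown fox jimped over the laxy dog.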

Next we’ll check the file using aspell:
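aspell check foo.txt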

As aspell is interactive in the check mode, we will see a screen like this:
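A reconstruction; the exact suggestions will vary with your dictionary:

The quick brown fox jimped over the laxy dog.

1) jumped                              6) wimped
2) gimped                              7) camped
3) comped                              8) humped
4) limped                              9) impede
5) pimped                              0) umped
i) Ignore                              I) Ignore all
r) Replace                             R) Replace all
a) Add                                 l) Add Lower
b) Abort                               x) Exit

?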

At the top of the screen, our text is displayed, highlighting a potentially misspelled word. In the center, there are ten suggestions numbered from zero to nine, along with various available actions. Finally, at the bottom, there's a prompt awaiting our input.

Upon pressing the 1 key, aspell substitutes the questionable word with "jumped" and proceeds to the next incorrectly spelled word, "laxy." Choosing "lazy" as the replacement, aspell makes the change and completes its process. When aspell is finished, we can inspect our file and see that the misspellings have been corrected.

By default, unless instructed otherwise using the command-line option --dont-backup, aspell generates a backup file by adding the extension .bak to the original filename.

Demonstrating our proficiency in sed editing, we'll reintroduce our spelling errors to reuse the file.
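sed -i 's/lazy/laxy/; s/jumped/jimped/' foo.txt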

The sed flag -i instructs sed to modify the file directly instead of displaying the edited content on the standard output. It overwrites the file with the applied changes. Additionally, multiple editing commands can be placed on the same line by using semicolons to separate them.

Moving forward, let's explore how aspell manages various text file formats. To modify our file, we'll employ a text editor like vim (those feeling adventurous might consider experimenting with sed) to incorporate HTML markup.
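Our file might now look something like this (a reconstruction):

<html>
	<head>
		<title>Test HTML file</title>
	</head>
	<body>
		<p>The quick brown fox jimped over the laxy dog.</p>
	</body>
</html>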

Presently, attempting to conduct a spell check on our altered file presents a challenge if we proceed in this manner:
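aspell check foo.txt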

we’ll get this:
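A condensed, illustrative screen; aspell stops on the markup itself, here offering corrections for "html":

<html>
	<head>
		<title>Test HTML file</title>
...

1) HTML
...
?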

The content within the HTML tags will be perceived as misspelled by aspell. However, this hurdle can be surmounted by incorporating the -H (HTML) checking-mode option, illustrated as follows:
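aspell -H check foo.txt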

which will result in this:
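Again condensed and illustrative; this time the markup is skipped, and only the prose content, such as "jimped" and "laxy", is flagged:

The quick brown fox jimped over the laxy dog.

1) jumped
...
?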

Only the non-markup sections of the file undergo scrutiny; the content within HTML tags is disregarded and remains unchecked for spelling errors. However, the contents of ALT tags, which benefit from inspection, are checked in this mode.

Note

As a default setting, aspell excludes URLs and email addresses from its text checks, but this can be altered using command line options. Furthermore, it's feasible to define which markup tags undergo scrutiny or are bypassed. For further information, refer to the aspell manual page for specifics.

Summary

In this chapter, we've explored a handful of command line tools that manipulate text. The upcoming chapter will introduce several additional tools. While their day-to-day application might not immediately appear evident or essential, we've aimed to demonstrate semi-practical instances of their utilization. Subsequent chapters will unveil how these tools compose a foundational toolkit capable of addressing various real-world challenges. This significance becomes notably evident when delving into shell scripting, where these tools showcase their true value.
