Text Processing

Chapter 20

Unix-like operating systems heavily rely on text files as a primary means of storing diverse data types. Consequently, a multitude of tools has been developed to manipulate text efficiently. This chapter delves into programs specifically designed for text manipulation, exploring techniques for slicing and reconfiguring text. The subsequent chapter will focus on text processing for formatting, catering to various human consumption needs, such as printing.

Throughout this chapter, we'll revisit familiar tools and encounter new ones, including:

  • cat: Concatenates files and displays the output

  • sort: Arranges lines within text files

  • uniq: Identifies and optionally excludes duplicate lines

  • cut: Eliminates specified segments from each line in files

  • paste: Combines lines from files

  • join: Connects lines from two files using a common field

  • comm: Compares lines between two sorted files

  • diff: Compares files line by line

  • patch: Applies a differential file to an original

  • tr: Translates or removes characters

  • sed: Acts as a stream editor for text filtering and transformation

  • aspell: An interactive spell checker

Applications Of Text

Up to this point, we've familiarized ourselves with a few text editors (nano and vim), explored a host of configuration files, and observed the output generated by dozens of text-based commands. However, what other purposes does text serve? It happens to be utilized for many diverse tasks, as we'll discover.

Documents

Numerous individuals opt for plain text formats when creating documents. While it's clear how a concise text file serves well for basic notes, it's equally feasible to compose extensive documents in text format. A prevalent method involves crafting a sizable document in a text format and subsequently employing a markup language to detail its final formatting. This approach is favored in composing scientific papers, owing to Unix-based text processing systems being among the pioneers in supporting the intricate typographical layout essential for technical writers.

Web Pages

The most widely used form of electronic document globally is likely the web page. These pages are textual documents employing either HTML (Hypertext Markup Language) or XML (Extensible Markup Language) as markup languages to define the visual layout of the document.

Email

Email operates fundamentally as a text-centric platform. Even attachments in non-text formats undergo conversion into a textual representation for transmission. To observe this process firsthand, one can download an email message and view it using a tool like less. Upon inspection, the message typically starts with a header detailing its origin and the processing it underwent throughout its journey, succeeded by the message body containing its content.

Printer Output

Within Unix-like systems, when sending output to a printer, it's transmitted either in plain text form or, if the page includes graphics, it gets transformed into a textual format referred to as PostScript. This PostScript data is then directed to a program responsible for generating the graphical dots necessary for printing.

Program Source Code

Numerous command-line programs present in Unix-like systems were developed to aid in system administration and software development, and text processing tools are no different. A significant portion of these tools is specifically crafted to address issues encountered in software development. The significance of text processing for software developers lies in the fact that all software originates as text. The core component of a program—the source code, which the programmer directly crafts—is consistently in text format.

Revisiting Some Old Friends

In Chapter 6 (Redirection), we explored commands capable of receiving standard input alongside command line arguments. We only briefly introduced these commands at that time. Now, we'll delve deeper into their functionality and explore how they can effectively facilitate text processing tasks.

cat

The cat program offers several intriguing options, many aimed at enhancing the visualization of text content. One such example is the -A option, employed to reveal nonprinting characters within the text. There are instances where identifying embedded control characters in otherwise visible text becomes essential. Among the most common are tab characters (distinct from spaces) and carriage returns, often marking the end-of-line in MS-DOS-style text files. Another prevalent scenario involves files containing lines of text with trailing spaces.

Let's create a test file using cat as a rudimentary word processor. To achieve this, we'll input the cat command (along with specifying a file for redirected output), type our text, press Enter to conclude the line properly, and then use Ctrl-d to signal cat that we've reached the end-of-file. For instance, in this example, we'll input a leading tab character followed by some trailing spaces after the line:
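A reconstructed session along these lines (the leading tab and the trailing spaces are, of course, invisible on screen):

cat > foo.txt
	The quick brown fox jumped over the lazy dog.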

Next, we will use cat with the -A option to display the text:
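cat -A foo.txt
^IThe quick brown fox jumped over the lazy dog.   $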

The displayed results indicate the presence of a tab character represented as ^I within our text. This notation, frequently signifying Control-I, happens to be synonymous with a tab character. Additionally, a $ symbol is visible at the actual end of the line, suggesting the existence of trailing spaces in our text.

MS-DOS Text Vs. Unix Text

Using cat to detect non-printing characters in text is particularly useful for uncovering concealed carriage returns. Where do these hidden carriage returns originate? DOS and Windows operating systems! Unix and DOS define the end of a line differently within text files. Unix denotes line endings with a linefeed character (ASCII 10), whereas MS-DOS and its derivatives utilize the sequence of a carriage return (ASCII 13) followed by a linefeed to conclude each line of text.

Various methods exist for converting files from DOS to Unix format. On numerous Linux systems, dedicated programs such as dos2unix and unix2dos facilitate the conversion of text files to and from DOS format. However, lacking dos2unix on your system shouldn't cause concern. Converting text from DOS to Unix format is straightforward—it involves simply eliminating the troublesome carriage returns. This task can be easily accomplished using a couple of the programs detailed later in this chapter.

cat features options designed for text manipulation as well. Among the most notable are -n, used for numbering lines, and -s, employed to suppress the display of consecutive blank lines. These functionalities can be showcased in the following manner:
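A reconstructed demonstration, assuming foo.txt now holds two text lines separated by two consecutive blank lines:

cat > foo.txt
The quick brown fox


jumped over the lazy dog.

cat -ns foo.txt
     1	The quick brown fox
     2
     3	jumped over the lazy dog.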

In this instance, we generate a revised version of our foo.txt test file, comprising two lines of text divided by two consecutive blank lines. Upon utilizing cat with the -ns options, the surplus blank line is eliminated, and the existing lines are sequentially numbered. Although this may seem like a minor text manipulation, it constitutes a notable process in itself.

sort

The sort utility arranges the content either from standard input or from one or multiple files specified through the command line and then forwards the sorted results to the standard output. Employing a technique similar to the one applied with cat, we can showcase the direct processing of standard input directly from the keyboard:
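A reconstructed session; we type the three letters, each on its own line, then press Ctrl-d:

sort > foo.txt
c
b
a

cat foo.txt
a
b
c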

Upon inputting the command, we enter the letters "c", "b", and "a", concluding by using Ctrl-d to indicate the end-of-file. Subsequently, upon examining the resultant file, we observe that the lines now present themselves in sorted order.

Given that sort has the capability to receive multiple files as command line arguments, it allows the merging of numerous files into a unified, sorted entity. For instance, if there were three text files and the objective was to merge them into a single sorted file, the process could resemble the following:
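A sketch with hypothetical filenames:

sort file1.txt file2.txt file3.txt > final_sorted_list.txt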

sort has several interesting options. Here is a partial list:

Option
Long Option
Description

-b

--ignore-leading-blanks

By default, sorting is executed on the complete line, commencing from the first character in the line. This option instructs sort to disregard leading spaces within lines and compute the sort from the first non-whitespace character on each line.

-f

--ignore-case

Makes sorting case-insensitive.

-n

--numeric-sort

This option conducts sorting based on the numeric interpretation of a string. Its utilization enables sorting to be carried out according to numeric values rather than relying on alphabetic sequences.

-r

--reverse

Arrange in reverse sequence. This function yields results in descending order rather than the usual ascending sequence.

-k

--key=field1[,field2]

Sort based on a key field positioned from field1 to field2, rather than the entire line. Further details are discussed below.

-m

--merge

Consider each argument as the label of an already sorted file. Combine multiple files into a unified sorted output without executing further sorting processes.

-o

--output=file

Send sorted output to file rather than standard output.

-t

--field-separator=char

Specify the character used as the field separator. By default, fields are delimited by spaces or tabs.

While many of the aforementioned options are quite straightforward, some require further explanation. To begin, let's consider the -n option, designed for numeric sorting. This option facilitates sorting values based on their numerical representation. To illustrate its functionality, we can utilize this option in sorting the output of the du command to identify the largest consumers of disk space. Typically, the du command lists summary results in pathname order.
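For example (reconstructed; the directories and sizes shown are illustrative and will vary by system):

du -s /usr/share/* | sort -nr | head
509940	/usr/share/locale-langpack
242660	/usr/share/doc
189788	/usr/share/fonts
152144	/usr/share/gnome
...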

In this instance, we redirect the output into head to restrict the display to the initial ten lines. Employing this method allows us to generate a list sorted numerically, showcasing the ten most significant space consumers.

Utilizing the -nr options facilitates a reverse numerical sort, presenting the largest values at the forefront of the results. This sorting method operates effectively since the numerical values are positioned at the start of each line. However, what if our objective is to sort a list based on a value embedded within the line? For instance, consider the output from an ls -l command:
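An illustrative listing (names, sizes, and dates will differ on your system):

ls -l /usr/bin | head
total 152948
-rwxr-xr-x 1 root root     34824 2021-04-02 11:04 [
-rwxr-xr-x 1 root root    101556 2021-10-23 13:27 a2p
-rwxr-xr-x 1 root root     13036 2021-04-02 11:04 aconnect
-rwxr-xr-x 1 root root     10552 2021-02-25 08:26 acpi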

For now, setting aside the fact that ls can arrange its results by size, we could employ sort to organize this list based on file size:
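sort accepts the key option like this (output omitted; the largest files appear first):

ls -l /usr/bin | sort -nr -k 5 | head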

sort is frequently used to process tabular data, such as the output of the ls command above. If we apply database terminology to the listing, each line is a record, and each record comprises multiple fields, such as the file attributes, link count, file size, filename, and so on. sort possesses the ability to process these individual fields. In database language, we can specify one or more key fields to act as the basis for sorting.

In the previously mentioned example, we indicate the n and r options to conduct a reverse numerical sort and specify -k 5 to prompt sort to utilize the fifth field as the sorting key.

The k option holds significant interest and comes with various functionalities. However, before delving into its intricacies, it's essential to understand how sort defines fields. Consider a straightforward text file comprising a solitary line containing the author's name:
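The file's single line:

William Shotts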

By default, sort perceives this line as encompassing two fields. The initial field comprises the characters "William," while the subsequent field contains " Shotts," indicating that whitespace characters (spaces and tabs) serve as separators between fields. Additionally, these delimiters are encompassed within the field during the sorting process.

Upon revisiting a line from our ls output, it becomes apparent that each line contains eight fields, with the fifth field representing the file size:
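An illustrative line (eight whitespace-separated fields, the fifth being the size in bytes):

-rwxr-xr-x 1 root root 8234216 2022-04-07 17:32 vim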

In our upcoming set of experiments, let's examine a file detailing the history of three prevalent Linux distributions released between 2006 and 2008. Each line within the file consists of three fields: the distribution name, version number, and release date in MM/DD/YYYY format:
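A reconstruction of that file, with fields separated by single tab characters (the release data is shown for illustration):

SUSE	10.2	12/07/2006
Fedora	10	11/25/2008
SUSE	11.0	06/19/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007
SUSE	10.3	10/04/2007
Ubuntu	6.10	10/26/2006
Fedora	7	05/31/2007
Ubuntu	7.10	10/18/2007
Ubuntu	7.04	04/19/2007
SUSE	10.1	05/11/2006
Fedora	6	10/24/2006
Fedora	9	05/13/2008
Ubuntu	6.06	06/01/2006
Ubuntu	8.10	10/30/2008
Fedora	5	03/20/2006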

Through a text editor like vim, let's input this data and save the resulting file as distros.txt. Then, we'll proceed to sort the file and examine the outcomes:
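Reconstructed from the sample data above (output abridged):

sort distros.txt
Fedora	10	11/25/2008
Fedora	5	03/20/2006
Fedora	6	10/24/2006
Fedora	7	05/31/2007
Fedora	8	11/08/2007
Fedora	9	05/13/2008
SUSE	10.1	05/11/2006
...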

Mostly, the sorting worked well, except for the Fedora version numbers. Because sort compares characters rather than numeric values, "10" lands above "5" (and hence above every other version) since the character "1" precedes "5" in the collating sequence.

To address this problem, we need to sort based on multiple keys. Our objective is an alphabetical sort for the first field and a numeric sort for the second field. sort permits multiple uses of the -k option to specify multiple sort keys. Additionally, a key can encompass a range of fields. When no range is specified (as in our previous examples), sort utilizes a key that initiates from the specified field and extends to the end of the line. Below is the syntax for our multi-key sort:
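sort --key=1,1 --key=2n distros.txt

With our sample data, the Fedora versions now fall in proper numeric order (output abridged):

Fedora	5	03/20/2006
Fedora	6	10/24/2006
Fedora	7	05/31/2007
Fedora	8	11/08/2007
Fedora	9	05/13/2008
Fedora	10	11/25/2008
SUSE	10.1	05/11/2006
...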

Although we opted for the long form of the option for clarity, -k 1,1 -k 2n would be precisely equivalent. In the first instance of the key option, we outlined a field range for the initial key. To confine the sort exclusively to the first field, we specified 1,1, signifying "start at field one and conclude at field one." Meanwhile, in the second instance, we designated 2n, indicating that field 2 serves as the sorting key and necessitates a numeric sort.

A letter option can be included at the conclusion of a key specifier to denote the type of sort to be executed. These letters correspond to the same global options available for the sort program: b (ignore leading blanks), n (numeric sort), r (reverse sort), and so forth.

Within our list, the third field contains a date formatted inconveniently for sorting. Computers typically adopt the YYYY-MM-DD format for dates, ensuring chronological sorting ease, whereas our dates adhere to the American MM/DD/YYYY format. How can we arrange this list chronologically?

Thankfully, sort offers a solution. The key option allows specification of offsets within fields, enabling us to define keys within fields:
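sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt

With our sample data, the newest releases come first (output abridged):

Fedora	10	11/25/2008
Ubuntu	8.10	10/30/2008
SUSE	11.0	06/19/2008
Fedora	9	05/13/2008
...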

By utilizing -k 3.7, we direct sort to utilize a sorting key starting at the seventh character within the third field, which signifies the commencement of the year. Simultaneously, we designate -k 3.1 and -k 3.4 to pinpoint the month and day portions within the date. To execute a reverse numeric sort, we append the n and r options. Additionally, the b option is added to suppress leading spaces in the date field. These spaces, varying in number across lines, might influence the sorting outcome.

Certain files deviate from using tabs and spaces as field separators, such as the /etc/passwd file:
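A few typical lines (the exact contents vary by system):

head -n 4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin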

Within this file, the fields are demarcated by colons (:). How might we go about sorting this file using a specific field as the key? sort offers the -t option to designate the field separator character. For instance, to sort the passwd file based on the seventh field, representing the account's default shell, we could proceed as follows:
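A sketch (output varies by system):

sort -t ':' -k 7 /etc/passwd | head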

By specifying the colon character as the field separator, we can sort on the seventh field.

uniq

In contrast to sort, the uniq program is lightweight. It carries out a seemingly simple task: when provided with a sorted file (including standard input), it eliminates any duplicate lines and directs the unique results to standard output. It's commonly employed alongside sort to cleanse the output of redundant entries.

Tip

While uniq stands as a traditional Unix utility frequently paired with sort, the GNU version of sort introduces a -u option that directly eliminates duplicates from the sorted output.

To experiment with this, let's create a text file:
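A reconstructed session:

cat > foo.txt
a
b
c
a
b
c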

Remember to use Ctrl-d to conclude standard input. Once that's done, if we execute uniq on our text file:
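uniq foo.txt
a
b
c
a
b
c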

The output remains unchanged, mirroring our original file as the duplicates were not eliminated. For uniq to effectively perform its function, the input must undergo sorting beforehand:
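sort foo.txt | uniq
a
b
c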

This occurs because uniq solely eliminates duplicate lines that appear consecutively.

uniq offers various options. Here are some commonly used ones:

Option
Description

-c

Output a list of duplicate lines preceded by the number of times the line occurs.

-d

Only output repeated lines, rather than unique lines.

-f n

Ignore n leading fields in each line. Fields are separated by whitespace as they are in sort; however, unlike sort, uniq has no option for setting an alternate field separator.

-i

Ignore case during the line comparisons.

-s n

Skip (ignore) the leading n characters of each line.

-u

Only output lines that are not repeated in the input. Note that this differs from the default behavior, which outputs one copy of every line, merging any duplicates.

In this example, we observe uniq employed to indicate the count of duplicates detected within our text file, utilizing the -c option:
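sort foo.txt | uniq -c
      2 a
      2 b
      2 c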

Slicing And Dicing

The following trio of programs are designed for extracting text columns from files and then reassembling them in practical ways.

cut

The cut program serves to extract a specific text segment from a line and then output that segment to standard output. It can handle multiple file inputs or take input from standard input.

Defining the segment of the line to extract can be a bit cumbersome and is determined through the utilization of the following options:

Option
Description

-c char_list

Retrieve the section of the line specified by char_list. This list could contain one or several numerical ranges separated by commas.

-f field_list

Retrieve one or multiple fields from the line based on the definition provided by field_list. This list might include one or more fields or ranges of fields separated by commas.

-d delim_char

When -f is indicated, delim_char serves as the character used to separate fields. By default, fields are expected to be divided by a solitary tab character.

--complement

Retrieve the complete text line, excluding the segments specified by -c and/or -f.

As observed, the method through which cut extracts text appears somewhat rigid. cut is most effective when extracting text from files generated by other programs rather than text input directly from humans. Let's examine our distros.txt file to determine if it's sufficiently organized for our cut demonstrations. Utilizing cat along with the -A option will allow us to check if the file aligns with our criteria for tab-separated fields.
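Reconstructed from our sample data; ^I marks each tab and $ each line ending (output abridged):

cat -A distros.txt
SUSE^I10.2^I12/07/2006$
Fedora^I10^I11/25/2008$
SUSE^I11.0^I06/19/2008$
...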

Everything seems in order. There are no spaces within the text, just individual tab characters separating the fields. Because the file relies on tabs instead of spaces, we'll employ the -f option to extract a field.
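Here we extract the third field, the release date (output abridged):

cut -f 3 distros.txt
12/07/2006
11/25/2008
06/19/2008
...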

Given that our distros file is delimited by tabs, it's preferable to utilize cut for extracting fields rather than characters. This preference arises from the inconsistency in the number of characters per line when a file is tab-delimited, making it challenging or even impossible to calculate character positions within each line accurately. However, in our preceding example, we've managed to extract a field that fortunately contains data of uniform length. This allows us to demonstrate character extraction by retrieving the year from each line.
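cut -f 3 distros.txt | cut -c 7-10
2006
2008
2008
...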

Executing cut for a second time from our list enables us to extract characters located at positions 7 through 10, representing the year within our date field. The representation of 7-10 denotes a range. For a comprehensive understanding of specifying ranges, the cut manual page provides a detailed description.

Expanding Tabs

Our distros.txt file is optimally structured for cut to extract fields. However, if our goal were to create a file that could be entirely manipulated by cut in terms of characters rather than fields, it would entail replacing the tab characters in the file with an equivalent number of spaces. Fortunately, the GNU Coreutils package offers a tool for this purpose called expand. This utility accepts one or more file arguments or standard input, modifying the text and outputting the result to standard output.

By processing our distros.txt file with expand, we enable the use of cut -c to extract any character range from the file. For instance, we could execute the following command to extract the year of release from our list by expanding the file and employing cut to extract every character from the twenty-third position to the end of each line:
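Assuming default eight-column tab stops, the year in our sample data begins at character position 23 (output abridged):

expand distros.txt | cut -c 23-
2006
2008
2008
...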

Additionally, Coreutils includes the unexpand program, which serves to replace spaces with tabs.

When handling fields, it's feasible to define an alternative field separator instead of relying on the tab character. In this instance, we'll retrieve the initial field from the /etc/passwd file:
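cut -d ':' -f 1 /etc/passwd | head
root
daemon
bin
sys
...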

Through the -d option, we can designate the colon character as the field delimiter.

paste

The paste command functions in contrast to cut. Instead of extracting a column of text from a file, it appends one or more columns of text to a file. This action involves reading multiple files and amalgamating the fields found in each file into a unified stream on standard output. Similar to cut, paste accepts multiple file arguments and/or standard input. To illustrate the functioning of paste, we'll manipulate our distros.txt file to create a chronological compilation of releases.

Building upon our prior use of sort, our initial step involves generating a sorted list of distros by date and saving the outcome in a file named distros-by-date.txt:
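sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt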

Following that, we'll utilize cut to retrieve the initial two fields from the file (the distro name and version), saving that output into a file named distro-versions.txt:
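cut -f 1,2 distros-by-date.txt > distro-versions.txt
head -n 3 distro-versions.txt
Fedora	10
Ubuntu	8.10
SUSE	11.0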

The last step in preparation involves extracting the release dates and saving them into a file named distro-dates.txt:
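cut -f 3 distros-by-date.txt > distro-dates.txt
head -n 3 distro-dates.txt
11/25/2008
10/30/2008
06/19/2008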

We've gathered the necessary components. To finalize the process, employ paste to place the column of dates preceding the distro names and versions, thereby constructing a chronological list. This is achieved straightforwardly by using paste and arranging its arguments in the desired sequence:
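paste distro-dates.txt distro-versions.txt
11/25/2008	Fedora	10
10/30/2008	Ubuntu	8.10
06/19/2008	SUSE	11.0
...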

join

join is similar to paste in that it adds columns to a file, yet it employs a distinct method. A join is an operation typically associated with relational databases, where data from multiple tables sharing a common key field is merged to produce a desired result. The join program essentially executes the same kind of operation, combining data from multiple files based on a common key field.

To illustrate the application of a join operation in a relational database, let's envision a tiny database consisting of two tables, each containing a single record. The CUSTOMERS table comprises three fields: a customer number (CUSTNUM), the customer’s first name (FNAME), and the customer’s last name (LNAME):

CUSTNUM FNAME LNAME

======== ===== ======

4681934 John Smith

The second table, ORDERS, includes four fields: an order number (ORDERNUM), the customer number (CUSTNUM), the quantity (QUAN), and the item ordered (ITEM).

ORDERNUM CUSTNUM QUAN ITEM

======== ======= ==== ====

3014953305 4681934 1 Blue Widget

Both tables share the CUSTNUM field, establishing a linkage between them.

A join operation enables the amalgamation of fields from the two tables, facilitating useful outcomes like preparing an invoice. By utilizing matching values in the CUSTNUM fields of both tables, a join operation could yield a result as follows:

FNAME LNAME QUAN ITEM

===== ===== ==== ====

John Smith 1 Blue Widget

To demonstrate the join program, let's create a couple of files with a shared key. We'll derive these files from our distros-by-date.txt file, constructing a first file that pairs the release dates (serving as our shared key) with the release names:
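A sketch using cut and paste on our earlier files; the distros-key-names.txt filename is our own invention:

cut -f 1 distros-by-date.txt > distro-names.txt
paste distro-dates.txt distro-names.txt > distros-key-names.txt
head -n 3 distros-key-names.txt
11/25/2008	Fedora
10/30/2008	Ubuntu
06/19/2008	SUSE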

and the second file, which contains the release dates and the version numbers:
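cut -f 2 distros-by-date.txt > distro-vernums.txt
paste distro-dates.txt distro-vernums.txt > distros-key-vernums.txt
head -n 3 distros-key-vernums.txt
11/25/2008	10
10/30/2008	8.10
06/19/2008	11.0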

Presently, we possess two files sharing a common key—the "release date" field. It's crucial to note that these files must be sorted based on the key field for join to function accurately.
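Joining the two files on their shared first field might look like this (output abridged; note that join expects its inputs sorted on the key field, and a real run on our reverse-chronological files may warn about unsorted input):

join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
...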

Additionally, by default, join employs whitespace as the input field delimiter and a singular space as the output field delimiter. This functionality can be altered by indicating specific options. For further information, refer to the join manual page for detailed explanations.

Comparing Text

Comparing versions of text files holds significant utility, especially for system administrators and software developers. In scenarios where a system administrator needs to diagnose a system problem by comparing an existing configuration file to a previous version, or when a programmer seeks insights into the modifications made to programs over time, this functionality proves particularly crucial.

comm

The comm program contrasts two text files, revealing the unique lines in each and those shared between them. To showcase its functionality, we'll generate two almost identical text files using cat:
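A reconstructed session:

cat > file1.txt
a
b
c
d

cat > file2.txt
b
c
d
e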

Next, we will compare the two files using comm:
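comm file1.txt file2.txt
a
		b
		c
		d
	e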

As observed, comm generates output in three columns. The initial column comprises lines unique to the first file argument, the second column contains lines unique to the second file argument, and the third exhibits lines shared by both files. 'comm' provides options in the format of -n, where n represents either 1, 2, or 3. When utilized, these options determine which column(s) to suppress. For instance, to exclusively display the lines shared by both files, we would suppress the output of columns one and two:
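comm -12 file1.txt file2.txt
b
c
d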

diff

Similar to the comm program, diff serves the purpose of identifying distinctions between files. However, diff is a considerably more intricate tool, offering support for various output formats and the capacity to handle extensive collections of text files simultaneously. It's frequently employed by software developers to scrutinize alterations among different versions of program source code. This tool possesses the capability to recursively inspect directories of source code, often termed as source trees.

One prevalent application of diff involves creating differential files or patches utilized by programs like patch (which we'll discuss shortly). These patches facilitate the conversion of one version of a file or files into another version.

Let's apply diff to examine our earlier example files:
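diff file1.txt file2.txt
1d0
< a
4a4
> e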

We observe its default output style: a concise representation of the disparities between the two files. In this default format, each set of modifications begins with a change command structured as range operation range. This form delineates the positions and types of alterations necessary to transform the first file into the second file.

Change
Description

r1ar2

Append the lines at the position r2 in the second file to the position r1 in the first file.

r1cr2

Change (replace) the lines at position r1 with the lines at the position r2 in the second file.

r1dr2

Delete the lines in the first file at position r1, which would have appeared at range r2 in the second file.

This structure defines a range as a list separated by commas, encompassing both the starting and ending lines. While this format stands as the default (primarily for POSIX compliance and to maintain backward compatibility with traditional Unix versions of diff), it's not as prevalent as other available formats. Among the more commonly used formats are the context format and the unified format.

When inspected using the context format (via the -c option), the output will appear as follows:
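Timestamps here are illustrative:

diff -c file1.txt file2.txt
*** file1.txt	2023-06-10 14:02:00.000000000 -0400
--- file2.txt	2023-06-10 14:03:00.000000000 -0400
***************
*** 1,4 ****
- a
  b
  c
  d
--- 1,4 ----
  b
  c
  d
+ e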

The output commences by displaying the names of the two files alongside their timestamps. Throughout the rest of the listing, asterisks denote lines belonging to the first file and dashes denote lines belonging to the second. Following this, there are clusters of changes, accompanied by the default number of adjacent context lines. In the initial group, we encounter:

*** 1,4 ****

This signifies lines 1 through 4 in the first file. Later on, we come across:

--- 1,4 ----

This denotes lines 1 through 4 in the second file. Within a change group, lines are prefaced by one of four indicators:

Indicator
Meaning

blank

A line shown for context. It does not indicate a difference between the two files.

-

A line deleted. This line will appear in the first file but not in the second file.

+

A line added. This line will appear in the second file but not in the first file.

!

A line changed. The two versions of the line will be displayed, each in its respective section of the change group.

The unified format, akin to the context format, offers a more streamlined presentation. It's activated using the -u option:
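Again with illustrative timestamps:

diff -u file1.txt file2.txt
--- file1.txt	2023-06-10 14:02:00.000000000 -0400
+++ file2.txt	2023-06-10 14:03:00.000000000 -0400
@@ -1,4 +1,4 @@
-a
 b
 c
 d
+e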

The primary distinction between the context and unified formats lies in the removal of duplicated context lines, rendering the results of the unified format more concise compared to the context format. In the previously illustrated example, we observe file timestamps akin to those in the context format, succeeded by the string @@ -1,4 +1,4 @@. This specifies the lines in the first and second files outlined within the change group. Subsequently, the lines themselves appear, accompanied by the default three lines of context. Each line is initiated by one of three potential characters:

Character
Meaning

blank

This line is shared by both files.

-

This line was removed from the first file.

+

A line added. It appears in the second file but not in the first.

patch

The patch program is employed to implement alterations to text files, accepting output generated from diff. Typically, it's utilized to convert older versions of files into newer ones. Consider the Linux kernel as a prime example. Developed by a vast, loosely organized team of contributors, the kernel receives a continuous flow of small changes to its source code. Despite comprising millions of lines of code, the changes introduced by a single contributor at any given time tend to be rather modest. Consequently, it's impractical for a contributor to dispatch the entire kernel source tree each time a small alteration is made. Instead, a diff file is submitted, encapsulating the transition from the prior version of the kernel to the updated version, featuring the contributor's changes. The recipient can then utilize the patch program to apply these modifications to their own source tree. Employing diff/patch presents two notable advantages:

  1. The diff file retains a significantly smaller size in comparison to the complete source tree.

  2. The diff file precisely displays the modifications made, facilitating swift evaluation by patch reviewers.

It's crucial to note that while diff/patch is commonly associated with source code, it's equally applicable to various text files, including configuration files or any other text-based documents.

To generate a diff file compatible with patch, the GNU documentation (refer to Further Reading below) suggests using diff as follows:

diff -Naur old_file new_file > diff_file

Here, old_file and new_file can represent individual files or directories containing files. The -r option facilitates directory tree recursion.

Subsequently, once the diff file is prepared, it can be applied to patch the old file into the new file using:

patch < diff_file

We'll demonstrate this process using our test file:
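A reconstructed session:

diff -Naur file1.txt file2.txt > patchfile.txt
patch < patchfile.txt
patching file file1.txt
cat file1.txt
b
c
d
e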

In this instance, we generated a diff file called patchfile.txt and subsequently utilized the patch program to implement the patch. It's worth noting that we didn't need to designate a target file for patching because the diff file (formatted in unified format) already encompasses the filenames within the header. Upon applying the patch, we observe that file1.txt now mirrors file2.txt.

The patch command offers an extensive range of options, and additional utility programs exist to analyze and edit patches.

Editing On The Fly

So far, our interactions with text editors have primarily been interactive, involving manual cursor movement and typing changes. Nevertheless, there are non-interactive methods for editing text. It's feasible, for instance, to apply a series of modifications to multiple files using just one command.

tr

The tr utility is designed for character transliteration, functioning akin to a character-based search-and-replace operation. Transliteration involves the process of altering characters from one alphabet to another. For instance, converting characters from lowercase to uppercase is a form of transliteration. We can execute this conversion using tr in the following manner:
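echo "lowercase letters" | tr a-z A-Z
LOWERCASE LETTERS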

As evident, tr operates by receiving input from standard input and producing its outcomes on standard output. It requires two arguments: a set of characters for conversion from, and a corresponding set of characters for conversion to. These character sets can be represented in one of three ways:

  1. An explicit list of characters (e.g., ABCDEFGHIJKLMNOPQRSTUVWXYZ)

  2. A character range (e.g., A-Z). Note that this approach might encounter issues similar to other commands, influenced by locale collation order, and hence, should be used cautiously.

  3. POSIX character classes (e.g., [:upper:]).

Usually, both character sets should have equal length. Yet, it's possible for the first set to be larger than the second, especially if we aim to convert multiple characters to a single character:
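echo "lowercase letters" | tr [:lower:] A
AAAAAAAAA AAAAAAA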

Apart from transliteration, tr provides the ability to outright delete characters from the input stream. In an earlier chapter, we explored the challenge of converting MS-DOS text files into Unix-style text. This conversion necessitates the removal of carriage return characters from the end of each line. Achieving this can be done using tr in the following manner:

tr -d '\r' < dos_file > unix_file

Here, dos_file represents the file intended for conversion, and unix_file denotes the resulting file. This command structure employs the escape sequence \r to denote the carriage return character. To explore a comprehensive list of the sequences and character classes supported by tr, consider trying:
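tr --help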

ROT13: The Not-So-Secret Decoder Ring

A playful application of tr involves executing ROT13 encoding on text. ROT13 represents a simplistic form of encryption reliant on a basic substitution cipher. Referring to ROT13 as "encryption" might be generous; it's more accurately termed as "text obfuscation." This method is occasionally used to obscure potentially offensive content within text. The approach involves shifting each character 13 places up the alphabet. As this shift marks the midway point of the possible 26 characters, applying the algorithm a second time on the text restores it to its original form. Executing this encoding with tr can be demonstrated as follows:
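echo "secret text" | tr a-zA-Z n-za-mN-ZA-M
frperg grkg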

Conducting the same process a second time yields the reverse translation:
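echo "frperg grkg" | tr a-zA-Z n-za-mN-ZA-M
secret text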

Several email programs and Usenet news readers facilitate ROT13 encoding. For more comprehensive insights into ROT13, Wikipedia offers an informative article on the subject: Wikipedia - ROT13

tr possesses an additional capability. With the -s option, tr has the ability to "squeeze" (remove) consecutive occurrences of a character:
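echo "aaabbbccc" | tr -s ab
abccc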

In this example, the string comprises repeated characters. When we designate the set "ab" to tr, it removes the duplicated occurrences of the letters in the set, while retaining the character absent from the set ("c"). It's essential to note that the repeating characters must be contiguous. If they aren't:
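echo "abcabcabc" | tr -s ab
abcabcabc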

the squeezing will have no effect.

sed

sed, short for "stream editor," conducts text modifications on a stream of text, which can either be a set of designated files or standard input. sed is a robust and relatively intricate program, often detailed extensively in comprehensive guides (there are entire books dedicated to it), hence we won't delve into its full scope here. Generally, sed operates by receiving either a solitary editing command (via the command line) or the designation of a script file containing multiple commands. It subsequently executes these commands on each line within the stream of text. Below is a straightforward demonstration showcasing sed in operation:
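echo "front" | sed 's/front/back/'
back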

In this demonstration, we generate a single-word text stream using echo and direct it into sed. Subsequently, sed executes the instruction s/front/back/ on the text within the stream, resulting in the output "back." This command structure is reminiscent of the "substitution" (search-and-replace) command in vi.

Commands within sed typically commence with a single letter. In the above example, the substitution command is denoted by the letter s and is succeeded by the search and replacement strings, separated by the slash character acting as a delimiter. The choice of the delimiter character is arbitrary. Conventionally, the slash character is commonly used, yet sed will accept any character immediately following the command as the delimiter. This same command could be executed in the following manner:
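echo "front" | sed 's_front_back_'
back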

Upon using the underscore character directly following the command, it assumes the role of the delimiter. This flexibility in setting the delimiter can enhance command readability, as we'll soon demonstrate.

In sed, most commands can be preceded by an address, indicating which line(s) of the input stream will undergo editing. In the absence of an address, the editing command is executed on every line within the input stream. The most basic form of an address is a line number. Let's expand on our example by adding one:
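echo "front" | sed '1s/front/back/'
back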

Integrating the address 1 into our command triggers the substitution to be executed solely on the initial line of our single-line input stream. If we indicate another number:
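echo "front" | sed '2s/front/back/'
front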

the editing process doesn't proceed because our input stream lacks a second line, designated as line 2.

There are several ways to express addresses. Among the most common are the following:

Address
Description

n

A line number where n is a positive integer

$

The last line.

/regexp/

Lines that match a POSIX basic regular expression. It's important to note that the regular expression is enclosed within slash characters. Additionally, the regular expression can be enclosed within an alternative character by specifying the expression as \cregexpc, where c represents the alternate character.

addr1,addr2

A range of lines from addr1 to addr2, inclusively. The addresses can encompass any of the single address formats mentioned earlier.

first~step

Identify the line corresponding to the initial number, followed by each subsequent line at regular step intervals. For instance, 1~2 pertains to every odd-numbered line, while 5~5 refers to the fifth line and every fifth line that follows.

addr1,+n

Match addr1 and the following n lines.

addr!

Match all lines except addr, which may be any of the forms above.

Let's showcase various address types using the distros.txt file we used earlier in this chapter. Initially, we'll start with a range of line numbers:
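Reconstructed using the first five lines of our sample data:

sed -n '1,5p' distros.txt
SUSE	10.2	12/07/2006
Fedora	10	11/25/2008
SUSE	11.0	06/19/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007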

Here, we display a range of lines from line 1 to line 5. To achieve this, we utilize the p command, responsible for printing matched lines. However, for this to work as intended, we must include the -n option (no auto-print) to prevent sed from automatically printing every line by default.

Next, let's explore a regular expression:
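sed -n '/SUSE/p' distros.txt
SUSE	10.2	12/07/2006
SUSE	11.0	06/19/2008
SUSE	10.3	10/04/2007
SUSE	10.1	05/11/2006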

By incorporating the slash-delimited regular expression /SUSE/, we can extract the lines that contain it, akin to the functionality of grep.

Lastly, we'll experiment with negation by appending an exclamation point (!) to the address:
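Output abridged:

sed -n '/SUSE/!p' distros.txt
Fedora	10	11/25/2008
Ubuntu	8.04	04/24/2008
Fedora	8	11/08/2007
Ubuntu	6.10	10/26/2006
...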

Here, we observe the anticipated outcome: all lines within the file except for those that match the regular expression.

Until now, we've explored two of the fundamental sed editing commands, s and p. Below is a more comprehensive roster of the basic editing commands:

Command
Description

=

Output current line number.

a

Append text after the current line.

d

Delete the current line.

i

Insert text in front of the current line.

p

Display the current line. By default, sed prints every line while editing only those that correspond to a specified address within the file. This default behavior can be altered by indicating the -n option.

q

Terminate sed without processing additional lines. When the -n option is not specified, the current line is output.

Q

Exit sed without processing any more lines.

s/regexp/replacement/

Replace the text matched by regexp with the contents of replacement. The replacement field can incorporate the special character &, representing the text matched by regexp. Additionally, replacement can contain the sequences \1 through \9, corresponding to the contents of the respective subexpressions in regexp. Further details on this topic, including back references, are discussed below. Optionally, after the trailing slash following replacement, a flag can be specified to alter the behavior of the s command.

y/set1/set2/

Conduct character transliteration by converting characters from set1 to their corresponding characters in set2. It's essential to note that unlike tr, sed mandates both sets to have identical lengths.

The s command stands out as the most frequently employed editing command. We'll exhibit a fraction of its capabilities by making an edit to our distros.txt file. Previously, we mentioned that the date field in distros.txt wasn't in a "computer-friendly" format. The field is in MM/DD/YYYY format, but it would be more convenient (especially for sorting purposes) if it were in YYYY-MM-DD format. Manual alteration of the file would be both laborious and error-prone. However, with sed, this modification can be executed in a single step:
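sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt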

That command might not win a beauty contest, but it does the job. In a single operation, we've successfully altered the date format in our file. This serves as a prime example of why regular expressions are sometimes humorously called a "write-only" medium—writable but occasionally challenging to decipher. Before we consider fleeing in terror from this command, let's examine how it was built. Initially, we knew that the command would have this fundamental structure:
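sed 's/regexp/replacement/' distros.txt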

Next, let's determine a regular expression that isolates the date. Given its MM/DD/YYYY format positioned at the end of the line, we can employ an expression resembling this:
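[0-9]{2}/[0-9]{2}/[0-9]{4}$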

This pattern corresponds to two digits, a slash, two more digits, another slash, followed by four digits, and ultimately the end of the line. So, that effectively captures the required regexp. Now, to address the replacement aspect, we need to introduce a unique feature seen in some applications utilizing BRE (Basic Regular Expressions). This feature, known as back references, operates in the following manner: Whenever the sequence \n appears in replacement—where n represents a number between 1 to 9—it refers back to the respective subexpression in the preceding regular expression. To establish these subexpressions, we enclose them within parentheses:
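([0-9]{2})/([0-9]{2})/([0-9]{4})$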

We've created three subexpressions: the first encompasses the month, the second the day of the month, and the third the year. With these in place, we can now build the replacement as follows:
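\3-\1-\2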

resulting in the year, followed by a dash, the month, another dash, and finally the day.

Hence, our command appears as follows:
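sed 's/([0-9]{2})/([0-9]{2})/([0-9]{4})$/\3-\1-\2/' distros.txt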

Two issues persist. First, the extra slashes within our regular expression will confuse sed when it interprets the s command. Second, since sed normally accepts only basic regular expressions, several characters in our expression would be interpreted as literals rather than metacharacters. Both concerns can be resolved by judiciously applying backslashes to escape the problematic characters:
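sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt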

And that's it!

Another facet of the s command involves optional flags that can trail the replacement string. Among these, the most significant is the g flag, directing sed to implement the search-and-replace operation globally within a line, rather than just the first instance—which is the default behavior.

Here's an example to illustrate:
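echo "aaabbbccc" | sed 's/b/B/'
aaaBbbccc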

The replacement was executed solely on the initial occurrence of the letter "b," leaving the subsequent instances unaltered. However, by appending the g flag, we can modify all occurrences:
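echo "aaabbbccc" | sed 's/b/B/g'
aaaBBBccc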

Until now, we've solely provided sed with single commands via the command line. However, it's feasible to create more intricate commands within a script file using the -f option. To exemplify, we'll utilize sed along with our distros.txt file to generate a report. The report will contain a title at the beginning, our adjusted dates, and all distribution names converted to uppercase. To achieve this, we'll need to craft a script. Let's open our text editor and input the following:
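The script, reconstructed:

# sed script to produce Linux distributions report

1 i\
\
Linux Distributions Report\

s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/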

We will save our sed script as distros.sed and run it like this:
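With our sample data, the output begins like this (abridged):

sed -f distros.sed distros.txt

Linux Distributions Report

SUSE	10.2	2006-12-07
FEDORA	10	2008-11-25
SUSE	11.0	2008-06-19
...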

As observed, our script successfully generates the expected outcomes. But how does it accomplish this? Let's review our script once more. We can utilize cat to number the lines:
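cat -n distros.sed
     1	# sed script to produce Linux distributions report
     2
     3	1 i\
     4	\
     5	Linux Distributions Report\
     6
     7	s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
     8	y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/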

Line 1 in our script functions as a comment, indicated by the # character—a common practice in Linux system configuration files and programming languages. Comments offer human-readable context and can be placed anywhere within the script, but not within the commands themselves. They aid in identifying and maintaining the script's components.

Line 2 is left blank, serving to enhance readability. Blank lines are often added for better visual organization.

Lines 3 through 6 consist of text to be inserted at address 1, denoting the first line of the input. The i command is followed by a backslash-newline sequence, creating an escaped newline, which acts as a line-continuation character. This sequence, prevalent in various contexts like shell scripts, allows embedding a newline within text without signaling the interpreter (in this case, sed) that the end of the line has been reached. The i (insert), a (append), and c (replace) commands all support multiple lines of text, provided that each line, except the last, ends with a line-continuation character. The sixth line marks the conclusion of our inserted text; it ends with a plain newline rather than a line-continuation character, signifying the end of the i command.

Note

A line-continuation character is created by a backslash immediately followed by a newline, without any intervening spaces.

Line 7 initiates our search-and-replace command. Not accompanied by an address, this command affects every line in the input stream.

Line 8 executes a transliteration, converting lowercase letters into uppercase. Unlike tr, sed's y command doesn't support character ranges (e.g., [a-z]) or POSIX character classes. Similar to the prior command, it applies to each line in the input stream, as no address is given.

People Who Like sed Also Like...

sed is a capable tool able to perform fairly intricate text editing tasks, yet it is mostly used for concise, single-line operations rather than extensive scripts. For larger tasks, users often turn to more comprehensive tools like awk and perl. These go beyond the functionalities of the programs covered here, branching into complete programming languages. perl, especially, finds extensive use in system management, administration, and web development, often replacing shell scripts due to its versatility.

On the other hand, awk is known for its expertise in manipulating tabular data. It shares some similarities with sed, as awk programs commonly process text files line by line, employing a concept akin to sed's approach of addressing followed by an action. While exploring awk and perl is beyond the scope of this book, mastering them proves valuable for Linux command line users.

aspell

We'll take a look at another tool called aspell, an interactive spell checker. It's a successor to the older program named ispell and can generally be used as a direct substitute for it. While it's often employed by other programs needing spell-checking functions, it's also quite effective as a standalone tool from the command line. Its capabilities extend to intelligently checking different types of text files, such as HTML documents, C/C++ programs, emails, and various specialized texts.

To spell-check a text file with basic written content, you could employ it in the following manner:
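aspell check textfile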

where textfile represents the file you wish to check. For instance, let's generate a basic text file called foo.txt that deliberately contains some spelling mistakes:
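A reconstructed session:

cat > foo.txt
The quick brown fox jimped over the laxy dog.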

Next we’ll check the file using aspell:
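aspell check foo.txt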

As aspell is interactive in the check mode, we will see a screen like this:
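A reconstruction; the exact suggestions will vary with your dictionary:

The quick brown fox jimped over the laxy dog.

1) jumped                              6) wimped
2) gimped                              7) camped
3) comped                              8) humped
4) limped                              9) impede
5) pimped                              0) umped
i) Ignore                              I) Ignore all
r) Replace                             R) Replace all
a) Add                                 l) Add Lower
b) Abort                               x) Exit

?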

At the top of the screen, our text is displayed, highlighting a potentially misspelled word. In the center, there are ten suggestions numbered from zero to nine, along with various available actions. Finally, at the bottom, there's a prompt awaiting our input.

Upon pressing the 1 key, aspell substitutes the questionable word with "jumped" and proceeds to the next incorrectly spelled word, "laxy." Choosing "lazy" as the replacement, aspell makes the change and completes its process. When aspell is finished, we can inspect our file and see that the misspellings have been corrected.

By default, unless instructed otherwise using the command-line option --dont-backup, aspell generates a backup file by adding the extension .bak to the original filename.

Demonstrating our proficiency in sed editing, we'll reintroduce our spelling errors to reuse the file.
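sed -i 's/lazy/laxy/; s/jumped/jimped/' foo.txt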

The sed flag -i instructs sed to modify the file directly instead of displaying the edited content on the standard output. It overwrites the file with the applied changes. Additionally, multiple editing commands can be placed on the same line by using semicolons to separate them.

Moving forward, let's explore how aspell manages various text file formats. To modify our file, we'll employ a text editor like vim (those feeling adventurous might consider experimenting with sed) to incorporate HTML markup.
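Our file might now look something like this (a reconstruction):

<html>
	<head>
		<title>Test HTML file</title>
	</head>
	<body>
		<p>The quick brown fox jimped over the laxy dog.</p>
	</body>
</html>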

Presently, attempting to conduct a spell check on our altered file presents a challenge if we proceed in this manner:
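aspell check foo.txt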

we’ll get this:
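A condensed, illustrative screen; aspell stops on the markup itself, here offering corrections for "html":

<html>
	<head>
		<title>Test HTML file</title>
...

1) HTML
...
?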

The content within the HTML tags will be perceived as misspelled by aspell. However, this hurdle can be surmounted by incorporating the -H (HTML) checking-mode option, illustrated as follows:
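aspell -H check foo.txt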

which will result in this:
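Again condensed and illustrative; this time the markup is skipped, and only the prose content, such as "jimped" and "laxy", is flagged:

The quick brown fox jimped over the laxy dog.

1) jumped
...
?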

Only the non-markup sections of the file undergo scrutiny; the content within HTML tags is disregarded and remains unchecked for spelling errors. However, the contents of ALT tags, which benefit from inspection, are checked in this mode.

Note

As a default setting, aspell excludes URLs and email addresses from its text checks, but this can be altered using command line options. Furthermore, it's feasible to define which markup tags undergo scrutiny or are bypassed. For further information, refer to the aspell manual page for specifics.

Summary

In this chapter, we've explored a handful of command line tools that manipulate text. The upcoming chapter will introduce several additional tools. While their day-to-day application might not immediately appear evident or essential, we've aimed to demonstrate semi-practical instances of their utilization. Subsequent chapters will unveil how these tools compose a foundational toolkit capable of addressing various real-world challenges. This significance becomes notably evident when delving into shell scripting, where these tools showcase their true value.
