Archiving And Backup
Chapter 18
One key responsibility of a computer system administrator involves ensuring the security of the system's data. An essential method for achieving this is by regularly creating backups of the system's files. Even if you're not overseeing the system, creating duplicates and transferring large sets of files between different locations and devices can be highly beneficial.
In this chapter, we'll explore various commonly used programs designed for managing file collections. These include:
Compression tools:
gzip: Compresses or expands files
bzip2: A file compressor that uses block-sorting techniques
Archiving utilities:
tar: Primarily a tape archiving utility
zip: Packages and compresses files
File synchronization software:
rsync: Facilitates remote file and directory synchronization
Compressing Files
Throughout the evolution of computing, there has been an ongoing endeavor to maximize data storage efficiency, whether that's within memory, storage devices, or network bandwidth. Many modern conveniences, such as portable music players, high-definition TV, and broadband Internet, owe their existence to effective data compression methods.
Data compression involves eliminating redundancy from data. Imagine we have an entirely black picture file measuring 100 pixels by 100 pixels. In terms of data storage, assuming 24 bits (3 bytes) per pixel, the image occupies 30,000 bytes:
100 * 100 * 3 = 30,000
An image composed entirely of a single color contains redundant data. With clever encoding, we could simply record that we have a block of 10,000 black pixels. So, instead of storing 30,000 zeros (black is often represented by zero in image files), we compress the data into the number 10,000 followed by a zero to represent our data. This scheme is called run-length encoding and is one of the most fundamental compression techniques. Today's techniques are far more sophisticated, but the primary objective remains the same: eliminate redundant data.
Compression algorithms, the mathematical strategies employed in compression, generally fall into two categories: lossless and lossy. Lossless compression maintains all the original data. Consequently, when a file is restored from a compressed version, it's an exact replica of the original, uncompressed file. Conversely, lossy compression eliminates data during the compression process to enable greater compression. When a lossy file is restored, it doesn't perfectly match the original; instead, it's a close approximation. Examples of lossy compression include JPEG (for images) and MP3 (for music). In this discussion, we'll exclusively focus on lossless compression, as most computer data cannot withstand any data loss.
gzip
The gzip program compresses one or more files. When executed, it replaces the original file with a compressed version of it. The corresponding gunzip program restores compressed files to their original, uncompressed form. Here's an illustration:
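A runnable sketch of this round trip (the scratch directory and the use of /etc for the listing are our choices, not from the original example):

```shell
# Work in a scratch directory so nothing else is disturbed.
cd "$(mktemp -d)"

ls -l /etc > foo.txt   # create a text file from a directory listing
gzip foo.txt           # replaces foo.txt with foo.txt.gz
ls -l foo.*

gunzip foo.txt.gz      # replaces foo.txt.gz with the restored foo.txt
ls -l foo.*
```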
In this demonstration, we generate a text file called foo.txt using a directory listing. Subsequently, by employing gzip, we replace the initial file with a compressed iteration labeled foo.txt.gz. Upon inspecting the directory listing for foo.*, we observe the transition from the original file to its compressed version, which is approximately one-fifth the size of the original. Additionally, both files retain identical permissions and timestamps.
Following this, we execute the gunzip utility to decompress the file. Upon completion, we note the restoration of the original file, retaining its permissions and timestamp as before, while the compressed version has been replaced.
gzip has many options. Here are a few:
-c
Write output to standard output and keep original files. May also be specified with --stdout and --to-stdout.
-d
Decompress. This causes gzip to act like gunzip. May also be specified with --decompress or --uncompress.
-f
Force compression even if a compressed version of the original file already exists. May also be specified with --force.
-h
Display usage information. May also be specified with --help.
-l
List compression statistics for each file compressed. May also be specified with --list.
-r
If one or more arguments on the command line are directories, recursively compress files contained within them. May also be specified with --recursive.
-t
Test the integrity of a compressed file. May also be specified with --test.
-v
Display verbose messages while compressing. May also be specified with --verbose.
-number
Set amount of compression. number is an integer in the range of 1 (fastest, least compression) to 9 (slowest, most compression). The values 1 and 9 may also be expressed as --fast and --best, respectively. The default value is 6.
Going back to our earlier example:
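A minimal sketch of the sequence described next (file contents are arbitrary):

```shell
cd "$(mktemp -d)"
ls -l /etc > foo.txt

gzip foo.txt           # compress, producing foo.txt.gz
gzip -tv foo.txt.gz    # test the compressed file's integrity, verbosely
gunzip foo.txt.gz      # restore the original file
```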
In this instance, we substituted the file foo.txt with its compressed variant, renamed as foo.txt.gz. Subsequently, we verified the integrity of the compressed file by employing the -t and -v options. Lastly, we decompressed the file, reverting it to its initial state.
Additionally, gzip presents intriguing possibilities through standard input and output manipulation.
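For example (the listed directory is arbitrary):

```shell
cd "$(mktemp -d)"
ls -l /etc | gzip > foo.txt.gz   # a compressed directory listing
```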
This command generates a compressed representation of a directory listing.
The gunzip tool, responsible for decompressing gzip files, operates under the assumption that filenames conclude with the .gz extension. Thus, specifying the extension isn't mandatory, provided the given name doesn't clash with an existing uncompressed file:
If our aim were solely to inspect the contents of a compressed text file, we could achieve this by:
Alternatively, within the gzip package, there exists a program called zcat, which mirrors the functionality of gunzip when used with the -c option. This utility allows for operations similar to the cat command but on gzip compressed files:
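The three cases above can be sketched together (head stands in for less so the example runs non-interactively):

```shell
cd "$(mktemp -d)"
ls -l /etc | gzip > foo.txt.gz

gunzip foo.txt                 # the .gz extension is assumed
gzip foo.txt                   # recompress for the next two examples

# View the compressed text without expanding it on disk:
gunzip -c foo.txt.gz | head -n 3
zcat foo.txt.gz | head -n 3    # zcat is equivalent to gunzip -c
```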
bzip2
Julian Seward's bzip2 program is similar to gzip but uses a different compression algorithm that achieves higher levels of compression at the cost of compression speed. In most respects it works in the same fashion as gzip, and files compressed with it are marked with the .bz2 extension:
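A quick sketch, skipping gracefully if bzip2 is not installed:

```shell
command -v bzip2 >/dev/null 2>&1 || { echo "bzip2 not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
ls -l /etc > foo.txt

bzip2 foo.txt          # replaces foo.txt with foo.txt.bz2
bunzip2 foo.txt.bz2    # restores foo.txt
```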
All of the options discussed for gzip, except for -r, are also supported by bzip2. Note, however, that the compression-level option (-number) has a somewhat different meaning for bzip2.
Moreover, bzip2 offers bunzip2 and bzcat utilities dedicated to decompressing files. Additionally, the bzip2 package includes the bzip2recover program, designed to attempt the recovery of damaged .bz2 files.
Don’t Be Compressive Compulsive
At times, I notice individuals attempting to compress a file that has already undergone compression using an efficient algorithm, like so:
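A demonstration, using random data as a stand-in for an already-compressed file such as a JPEG (file names are ours):

```shell
cd "$(mktemp -d)"
# Random data is effectively incompressible, like a JPEG's contents.
head -c 10000 /dev/urandom > picture.dat
gzip -c picture.dat > once.gz
gzip -c once.gz > twice.gz           # compressing the compressed file
ls -l picture.dat once.gz twice.gz   # twice.gz comes out largest
```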
I'd advise against this action. It's likely to be a futile effort, wasting both time and space! When you apply compression to a file that's already compressed, the result could be a larger file.
The reason behind this lies in the fact that all compression methods involve some additional information added to the file to describe the compression process. If you try to compress a file that contains minimal or no redundant information, the compression won't yield any savings to counterbalance the extra data added as overhead.
Archiving Files
A frequently employed file management process, often coupled with compression, is archiving. Archiving involves consolidating numerous files into a single, larger file. This method is frequently employed during system backups and when transferring older data from a system to long-term storage.
tar
In the world of Unix-like software, the tar utility is the classic tool for archiving files. Its name, short for "tape archive," reveals its roots in creating backup tapes. While it still serves that original purpose, it works equally well with other storage devices.
Frequently encountered filenames include extensions like .tar or .tgz, denoting a "plain" tar archive and a gzipped archive, respectively. A tar archive can encompass a collection of individual files, one or multiple directory structures, or a combination of both. The command structure operates as follows:
tar mode[options] pathname...
Here, mode represents one of the operating modes (a partial list is depicted; refer to the tar man page for a comprehensive list):
c
Create an archive from a list of files and/or directories.
x
Extract an archive.
r
Append specified pathnames to the end of an archive.
t
List the contents of an archive.
The way tar expresses options might seem a bit peculiar, thus it's best explained through examples. Let's start by reconstructing our playground from the preceding chapter:
Next, let’s create a tar archive of the entire playground:
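A self-contained sketch: a scaled-down playground (three directories, two files each) stands in for the 100-directory version from the previous chapter, and then the archive is created:

```shell
cd "$(mktemp -d)"
# Build a small stand-in for the playground directory tree.
for d in dir-001 dir-002 dir-003; do
    mkdir -p "playground/$d"
    : > "playground/$d/file-A"
    : > "playground/$d/file-Z"
done

tar cf playground.tar playground   # mode first; no leading dash required
```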
Executing this command produces a tar archive named playground.tar encompassing the complete playground directory structure. Notably, the mode and the f option, responsible for specifying the tar archive's name, can be combined and do not necessitate a preceding dash. However, it's crucial to note that the mode should always be specified first, preceding any other option.
To display the archive's contents, the following can be executed:
For a more detailed listing, we can add the v (verbose) option:
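Both listing forms can be sketched as follows (the miniature playground is our stand-in):

```shell
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A
tar cf playground.tar playground

tar tf playground.tar    # brief table of contents
tar tvf playground.tar   # detailed, ls -l style listing
```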
Next, we'll extract the playground to a different location. This involves establishing a fresh directory called foo, navigating into it, and then extracting the contents of the tar archive:
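A sketch of the extraction (again with a miniature stand-in archive):

```shell
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A
tar cf playground.tar playground

mkdir foo
cd foo
tar xf ../playground.tar   # extract into the new directory
ls playground
```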
Upon inspecting the contents of ~/foo/playground, it becomes evident that the archive has been successfully extracted, faithfully replicating the original files. However, there's a caveat: unless operating as the superuser, restored files and directories adopt the ownership of the user executing the restoration, not the original owner.
Furthermore, an intriguing behavior of tar pertains to its treatment of pathnames within archives. By default, tar adopts a relative approach to pathnames rather than an absolute one. This means that when creating the archive, tar simply eliminates any leading slash from the pathname. To illustrate, let's recreate our archive, this time specifying an absolute pathname:
Keep in mind that upon pressing the enter key, ~/playground will resolve to /home/me/playground, resulting in an absolute pathname for our demonstration. Subsequently, we'll proceed with extracting the archive as previously done and observe the outcome:
Upon extracting our second archive, it recreated the directory home/me/playground relative to our current working directory, ~/foo, instead of being relative to the root directory as it would have been with an absolute pathname. While this might seem unconventional, it's actually quite practical as it grants the flexibility to extract archives to any desired location rather than being limited to their original placements. To better understand the process, repeating the exercise with the addition of the verbose option (v) will offer a clearer depiction of the operations at hand.
Consider a hypothetical yet practical scenario where tar comes into play. Suppose we aim to transfer the home directory and its contents from one system to another using a spacious USB hard drive. On our contemporary Linux system, this drive typically mounts itself in the /media directory, and let's envision it being recognized with the volume name "BigDisk" upon attachment. To create the tar archive for this purpose, follow these steps:
Once the tar file is composed, we detach the drive and connect it to the second computer. Once more, it is mounted at /media/BigDisk. To extract the archive, the following steps are taken:
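Both steps can be sketched together, with mktemp directories standing in for the /media/BigDisk mount point and each system's root directory (so no real drive or superuser privileges are needed):

```shell
# Stand-ins: $disk plays /media/BigDisk; $root and $root2 play the
# two systems' root directories.
disk=$(mktemp -d)
root=$(mktemp -d)
mkdir -p "$root/home/me"
: > "$root/home/me/notes.txt"    # a hypothetical file to carry across

# First system: archive the home directory onto the "drive".
(cd "$root" && tar cf "$disk/home.tar" home)

# Second system: cd to / first, then extract relative to it.
root2=$(mktemp -d)
(cd "$root2" && tar xf "$disk/home.tar")
ls "$root2/home/me"
```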
The crucial point to note is the necessity to navigate to the / directory initially. This ensures that the extraction occurs relative to the root directory, considering that all pathnames within the archive are specified relatively.
During the extraction process, it's feasible to confine what is extracted from the archive. For instance, to extract a single file from an archive, the process could be executed as follows:
Appending the trailing pathname to the command instructs tar to exclusively restore the designated file. It's possible to specify multiple pathnames. It's crucial to input the full, precise relative pathname as it exists in the archive. Normally, wildcards aren't supported when specifying pathnames. However, the GNU version of tar, commonly prevalent in Linux distributions, accommodates them using the --wildcards option. Here's an illustration utilizing our earlier playground.tar file:
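A self-contained sketch of wildcard extraction (GNU tar only; the miniature playground is our stand-in, so the member paths are shorter than the book's home/me/... paths):

```shell
tar --version 2>/dev/null | grep -q GNU || { echo "GNU tar required; skipping"; exit 0; }
cd "$(mktemp -d)"
for d in dir-001 dir-002; do
    mkdir -p "playground/$d"
    : > "playground/$d/file-A"
    : > "playground/$d/file-B"
done
tar cf playground.tar playground

mkdir foo
cd foo
# Extract only the members matching the wildcard pattern:
tar xf ../playground.tar --wildcards 'playground/dir-*/file-A'
```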
Executing this command will exclusively extract files that correspond to the specified pathname, encompassing the wildcard dir-*.
tar is frequently combined with find to generate archives. In this instance, we'll employ find to generate a collection of files intended for inclusion in an archive:
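A sketch of the find/tar combination (we start from an empty archive so the example is self-contained; -T /dev/null is a GNU tar idiom for that):

```shell
tar --version 2>/dev/null | grep -q GNU || { echo "GNU tar required; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001 playground/dir-002
: > playground/dir-001/file-A
: > playground/dir-002/file-A

tar cf playground.tar -T /dev/null   # start from an empty archive
find playground -name 'file-A' -exec tar rf playground.tar '{}' '+'
tar tf playground.tar
```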
In this instance, find is employed to locate all files within the playground directory named file-A. By utilizing the -exec action, tar is invoked in the append mode (r) to incorporate the matched files into the archive playground.tar.
Employing tar alongside find proves to be an effective method for creating incremental backups of a directory tree or an entire system. This approach allows for matching files newer than a timestamp file, enabling the creation of an archive containing only files more recent than the last one, assuming the timestamp file is updated immediately following each archive creation.
Moreover, tar is capable of leveraging both standard input and output. Here's an extensive example:
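A sketch of the pipeline described below (GNU tar assumed for --files-from=-; the miniature playground is our stand-in):

```shell
tar --version 2>/dev/null | grep -q GNU || { echo "GNU tar required; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A

find playground -name 'file-A' \
    | tar cf - --files-from=- \
    | gzip > playground.tgz
```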
In this illustration, find was employed to generate a list of matching files, subsequently piped into tar. When specifying the filename -, it denotes standard input or output as required. (By the way, this convention of using - to represent standard input/output is also employed by several other programs.)
The --files-from option, also represented as -T, prompts tar to read its list of pathnames from a file rather than the command line. Finally, the archive produced by tar is directed into gzip to create the compressed archive playground.tgz. The .tgz extension is the customary suffix given to gzip-compressed tar files, although the .tar.gz extension is also occasionally used.
While we externally used the gzip program to create our compressed archive, contemporary versions of GNU tar directly support gzip and bzip2 compression via the z and j options, respectively. Refining our earlier example, we can simplify it as follows:
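A sketch of the simplified form (miniature playground stand-in):

```shell
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A

tar czf playground.tgz playground   # z routes the archive through gzip
```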
If our intent was to produce a bzip2 compressed archive instead, the process would involve:
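A sketch of the bzip2 variant, skipping if bzip2 is unavailable:

```shell
command -v bzip2 >/dev/null 2>&1 || { echo "bzip2 not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A

tar cjf playground.tbz playground   # j selects bzip2 compression
```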
By switching the compression option from z to j (and modifying the output file's extension to .tbz, denoting a bzip2 compressed file), we activated bzip2 compression.
Another fascinating application of standard input and output with the tar command is transferring files between systems via a network. Suppose we have two Unix-like systems equipped with tar and ssh. In such a setup, transferring a directory from a remote system (designated as remote-sys in this example) to our local system could be achieved as follows:
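The networked command is shown in the comments below; since a host named remote-sys isn't available here, a second local directory stands in for the remote system, demonstrating the same tar-to-tar pipe:

```shell
# The networked form (not executed here) would be:
#   mkdir remote-stuff; cd remote-stuff
#   ssh remote-sys 'tar cf - Documents' | tar xf -
# Below, $remote is a local stand-in for remote-sys's home directory.
remote=$(mktemp -d)
mkdir -p "$remote/Documents"
: > "$remote/Documents/report.txt"   # hypothetical file

cd "$(mktemp -d)"
mkdir remote-stuff
(cd "$remote" && tar cf - Documents) | (cd remote-stuff && tar xf -)
ls remote-stuff/Documents
```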
In this scenario, we successfully copied a directory named Documents from the remote system, remote-sys, to a directory situated within remote-stuff on the local system. How did we accomplish this? Initially, we initiated the tar program on the remote system using ssh. As a quick refresher, ssh facilitates the execution of a program remotely on a networked computer while displaying the results on the local system—any standard output generated on the remote system is transmitted to the local system for observation.
We utilized this functionality by instructing tar to generate an archive (the c mode) and direct it to the standard output rather than a file (utilizing the f option with the dash argument). Consequently, the archive traversed through the encrypted tunnel provided by ssh to reach the local system. On the local system, we employed tar to unpack an archive (the x mode) retrieved from standard input (once more, using the f option with the dash argument).
zip
The zip utility serves as both a compression tool and an archiver. It employs a file format commonly recognized by Windows users, as it handles .zip files for reading and writing. However, within Linux, gzip remains the prevalent compression program, closely followed by bzip2.
At its most fundamental level, zip is used in the following manner:
zip options zipfile file...
For instance, to create a zip archive of our playground, the command would be:
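A sketch, skipping if zip is not installed (the miniature playground is our stand-in):

```shell
command -v zip >/dev/null 2>&1 || { echo "zip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A

zip -r playground.zip playground
# Output resembles:
#   adding: playground/ (stored 0%)
#   adding: playground/dir-001/ (stored 0%)
#   adding: playground/dir-001/file-A (stored 0%)
```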
Unless we include the -r option for recursion, only the playground directory itself (not its contents) is stored. Although zip adds the .zip extension automatically, we'll include it explicitly for clarity.
Throughout the creation of the zip archive, zip typically showcases a sequence of messages similar to this:
These notifications indicate the status of each file as it's added to the archive. zip employs two storage methods while adding files: it either "stores" a file without compression, as demonstrated here, or it "deflates" the file, facilitating compression. The numerical value showcased after the storage method signifies the level of compression attained. Given that our playground solely consists of empty files, no compression occurs within its contents.
Unpacking the contents of a zip file is simple when utilizing the unzip program:
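A sketch of the round trip, skipping if zip/unzip are unavailable:

```shell
command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1 ||
    { echo "zip/unzip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A
zip -rq playground.zip playground   # -q suppresses the progress messages

mkdir foo
cd foo
unzip ../playground.zip
```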
A distinguishing aspect of zip (unlike tar) is that when specifying an existing archive, it undergoes an update rather than a complete replacement. This signifies the preservation of the existing archive while incorporating new files and replacing matching ones.
Selective listing and extraction of files from a zip archive can be accomplished by specifying them to unzip:
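A sketch of selective listing and extraction (the dir-087/file-Z member is illustrative):

```shell
command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1 ||
    { echo "zip/unzip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-087
: > playground/dir-087/file-Z
zip -rq playground.zip playground

unzip -l playground.zip playground/dir-087/file-Z   # list only, no extraction
mkdir foo
cd foo
unzip ../playground.zip playground/dir-087/file-Z   # extract only this member
```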
Utilizing the -l option prompts unzip to solely display the contents of the archive without performing extraction. In the absence of specified file(s), unzip will list all files contained in the archive. For a more detailed listing, the -v option can be included to augment the verbosity level. It's important to note that in instances where archive extraction clashes with an existing file, the user is prompted prior to file replacement.
Similar to tar, zip can employ standard input and output, albeit with somewhat limited utility. It's feasible to convey a list of filenames to zip using the -@ option via a pipe:
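A sketch of the -@ pipeline (stand-in playground again):

```shell
command -v zip >/dev/null 2>&1 || { echo "zip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001
: > playground/dir-001/file-A

find playground -name "file-A" | zip -@ file-A.zip
```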
In this instance, find is employed to generate a list of files that match the test -name file-A, subsequently piping this list into zip. This action results in the creation of the archive file-A.zip containing the selected files.
While zip does support writing its output to standard output, its utility is limited because very few programs can utilize this output. Unfortunately, the unzip program lacks compatibility with standard input, preventing the seamless use of zip and unzip together for network file copying similar to tar.
However, zip is capable of accepting standard input, allowing it to compress the output generated by other programs:
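For example:

```shell
command -v zip >/dev/null 2>&1 || { echo "zip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
ls -l /etc/ | zip ls-etc.zip -   # the trailing dash means "read standard input"
```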
In this particular example, we direct the output of ls into zip. Similar to tar, zip interprets the trailing dash as a command to "use standard input for the input file."
The unzip program facilitates sending its output to standard output by employing the -p (pipe) option:
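A sketch (head stands in for less so the example runs non-interactively):

```shell
command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1 ||
    { echo "zip/unzip not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
ls -l /etc/ | zip -q ls-etc.zip -

unzip -p ls-etc.zip | head -n 5   # -p pipes the contents to standard output
```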
We've covered some fundamental functionalities of zip and unzip. Both possess numerous options that enhance their versatility, although some are tailored to other systems, making them platform-specific. The man pages for both zip and unzip offer valuable insights and useful examples.
However, the primary purpose of these programs lies in file exchange with Windows systems rather than serving as the primary compression and archiving tools on Linux. For compression and archiving tasks on Linux, tar and gzip are considerably favored.
Synchronizing Files And Directories
A widely adopted approach for maintaining system backups involves synchronizing one or more directories with another directory, either on the local system (typically a removable storage device) or a remote system. For instance, one might possess a local copy of a website in development, periodically syncing it with the "live" version hosted on a remote web server.
In the realm of Unix-like systems, rsync stands as the preferred tool for this purpose. This utility excels in synchronizing both local and remote directories through the rsync remote-update protocol. This protocol enables rsync to rapidly identify differences between two directories and execute the minimum necessary copying to align them. Consequently, rsync proves exceptionally swift and efficient compared to alternative copying programs.
To invoke rsync:
rsync options source destination
Here, the source and destination can be:
A local file or directory
A remote file or directory specified as [user@]host:path
A remote rsync server, identified with a URI of rsync://[user@]host[:port]/path
It's essential to note that either the source or the destination must be a local file, as remote-to-remote copying isn't supported.
Let's experiment with rsync on some local files. To begin, let's clear out our foo directory:
Following that, we'll initiate the synchronization of the playground directory with a corresponding copy located in foo:
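Both steps can be sketched together, skipping if rsync is not installed (miniature playground stand-in):

```shell
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001 foo
: > playground/dir-001/file-A

rm -rf foo/*             # clear out foo first
rsync -av playground foo
```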
We've incorporated both the -a option (for archiving—enabling recursion and file attribute preservation) and the -v option (for verbose output) to mirror the playground directory within foo. As the command executes, a list of copied files and directories will be displayed. Upon completion, a summary message will be presented, resembling this:
demonstrating the extent of the copying conducted. If we were to execute the command once more, we would observe a distinct outcome:
Observe the absence of file listings. This is because rsync recognized that there were no discrepancies between ~/playground and ~/foo/playground, hence it did not necessitate any copying. However, if we were to modify a file within the playground and re-run rsync:
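The repeat-run and changed-file cases can be sketched together:

```shell
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping"; exit 0; }
cd "$(mktemp -d)"
mkdir -p playground/dir-001 foo
: > playground/dir-001/file-A
rsync -a playground foo          # initial copy

rsync -av playground foo         # second run: no files are listed
touch playground/dir-001/file-B  # change something...
rsync -av playground foo         # ...and only the new file is copied
```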
Upon examination, rsync identified the alteration and exclusively copied the updated file.
For a hands-on illustration, envision the hypothetical scenario of the external hard drive utilized previously with tar. Upon connecting the drive to our system, once more mounted at /media/BigDisk, a beneficial system backup can be conducted. Initially, by crafting a directory named /backup on the external drive, and subsequently employing rsync to replicate the crucial components from our system onto the external drive:
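The real command is shown in the comments; stand-in mktemp directories keep the sketch runnable without the drive or root privileges:

```shell
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping"; exit 0; }
# The real command (run as root, with the drive mounted) would be:
#   mkdir /media/BigDisk/backup
#   sudo rsync -av --delete /etc /home /usr/local /media/BigDisk/backup
disk=$(mktemp -d)    # plays /media/BigDisk
src=$(mktemp -d)     # plays the system being backed up
mkdir -p "$src/etc" "$src/home"
: > "$src/etc/hosts.example"   # hypothetical file

mkdir "$disk/backup"
rsync -av --delete "$src/etc" "$src/home" "$disk/backup"
```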
In this instance, we replicated the /etc, /home, and /usr/local directories from our system onto the hypothetical storage device. The --delete option was included to eliminate files on the backup device that no longer existed on the source device (this remains insignificant during the initial backup but proves beneficial for subsequent copies). Repeating the process of attaching the external drive and executing this rsync command could serve as a pragmatic (though not optimal) method of maintaining a backup for a compact system. An alias creation and addition to our .bashrc file could enhance this feature:
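A candidate .bashrc fragment for this (the paths assume the backup scenario above):

```shell
# Add to ~/.bashrc; run "backup" after attaching the external drive.
alias backup='sudo rsync -av --delete /etc /home /usr/local /media/BigDisk/backup'
```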
Simply connect the external drive and execute the backup command to initiate the process.
Using rsync Over A Network
The incredible versatility of rsync extends to copying files across a network. After all, the "r" in rsync stands for "remote." Remote copying can be accomplished in two primary manners. The initial approach involves another system equipped with rsync and a remote shell program like ssh. Consider a scenario where a separate system within our local network possesses ample hard drive capacity. Suppose we intend to execute our backup procedure using this remote system instead of relying on an external drive. Assuming the remote system already has a designated directory, /backup, ready to receive our files, the process would resemble this:
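The command is shown via a variable and echo so the sketch is safe to run without a host actually named remote-sys; to perform the real copy, run the command itself rather than echoing it:

```shell
# Hypothetical remote backup over ssh (remote-sys and /backup are assumed).
cmd='sudo rsync -av --delete --rsh=ssh /etc /home /usr/local remote-sys:/backup'
echo "$cmd"
```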
Two modifications were incorporated into our command to facilitate the network copy. Initially, the --rsh=ssh option was introduced, directing rsync to utilize the ssh program as its remote shell. This method facilitated secure data transfer from the local system to the remote host through an ssh encrypted tunnel. Subsequently, we designated the remote host by prefixing its name (in this scenario, the remote host is named remote-sys) to the destination pathname.
The second method leveraging rsync for file synchronization over a network involves employing an rsync server. rsync can be configured to operate as a daemon, capable of listening to incoming synchronization requests. This configuration is commonly established to enable mirroring of a remote system. For instance, Red Hat Software maintains a substantial repository of software packages under development for its Fedora distribution. During the testing phase of the distribution release cycle, it's advantageous for software testers to mirror this collection. Due to frequent changes in the repository (often occurring more than once a day), maintaining a local mirror through periodic synchronization proves preferable over bulk copying. Georgia Tech houses one of these repositories, which we could mirror using our local rsync copy in conjunction with their rsync server in this manner:
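Shown via echo because the real command needs network access to the mirror; the module path is illustrative and may differ today:

```shell
# Hypothetical mirror of the repository from the rsync server.
cmd='rsync -av --delete rsync://rsync.gtlib.gatech.edu/fedora-linux-core fedora-devel'
echo "$cmd"
```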
In this illustration, we utilize the URI of the remote rsync server, structured with a protocol designation (rsync://), succeeded by the remote host-name (rsync.gtlib.gatech.edu), and concluded with the pathname pointing to the repository.
Summary
We've explored the prevalent compression and archiving programs employed in Linux and other Unix-like operating systems. The preferred approach for archiving files in Unix-like systems involves the tar/gzip combination, while zip/unzip is commonly used for seamless compatibility with Windows systems. Lastly, we delved into the versatile rsync program (a personal favorite), renowned for its efficiency in synchronizing files and directories across different systems.