Searching For Files

Chapter 17

As we've explored our Linux environment, one thing has become evident: a typical Linux system contains an enormous number of files. This raises the question, "How do we find them?" We already know that the Linux file system follows well-established conventions passed down through generations of Unix-like systems, but the sheer number of files can still pose a significant challenge.

This chapter delves into two tools essential for locating files within a system. These tools include:

  • locate – Find files by name

  • find – Search for files in a directory hierarchy

Additionally, we'll explore a command frequently employed alongside file-search commands to manage the generated list of files:

  • xargs – Build and execute command lines from standard input

Moreover, we'll present a couple of commands to aid us in our investigations:

  • touch – Change file times

  • stat – Display file or file system status

locate – Find Files The Easy Way

The locate tool swiftly searches through pathnames in a database and displays all names that match a specified substring. For instance, if we aim to locate programs starting with "zip", presuming these programs are within directories ending with bin/, we can use locate in this manner to find our files:

[me@linuxmachine ~]$ locate bin/zip

The locate tool scans its database of pathnames and displays any that include the string "bin/zip":

/usr/bin/zip
/usr/bin/zipcloak
/usr/bin/zipgrep
/usr/bin/zipinfo
/usr/bin/zipnote
/usr/bin/zipsplit

When the search criteria become more complex, locate can be combined with other tools such as grep to build more interesting searches. For example, locate zip | grep bin lists only the pathnames that contain both "zip" and "bin".

The locate program has been around for many years, and several variants are in common use. The two most often found in modern Linux distributions are slocate and mlocate, usually accessed through a symbolic link named locate. These versions have overlapping option sets, and some include regular expression matching (covered in a later chapter) and wildcard support. Check the locate man page to see which version is installed and what it supports.

Where Does The locate Database Come From?

You may notice that on some distributions locate fails to work just after the system is installed, yet works fine the next day. This is because the locate database is created by another program named updatedb. Usually, updatedb runs periodically as a cron job, that is, a task performed at regular intervals by the cron daemon. Most systems equipped with locate run updatedb once a day. Because the database is not updated continuously, very recent files do not show up in locate results. To overcome this, you can become the superuser and run updatedb manually at the prompt.

find – Find Files The Hard Way

While locate can find a file based solely on its name, find searches a given directory (and its subdirectories) for files based on a variety of attributes. We'll spend a lot of time with find because it has many interesting features that will reappear when we cover programming concepts in later chapters.

In its simplest form, find is given one or more directory names to search. For instance, to produce a list of everything in our home directory:

[me@linuxmachine ~]$ find ~

For most active user accounts, this command produces a long list. Because the list is sent to standard output, we can pipe it into other programs. Let's use wc to count the number of files:

[me@linuxmachine ~]$ find ~ | wc -l
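The counting idea can be tried safely on a throwaway tree instead of a real home directory. The following is a minimal sketch; the sandbox layout and filenames are hypothetical, chosen only for illustration.

```shell
# Build a small temporary tree (hypothetical names) rather than searching ~.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/docs" "$sandbox/pics"
touch "$sandbox/docs/notes.txt" "$sandbox/pics/cat.jpg" "$sandbox/pics/dog.jpg"

# find prints the starting directory and everything below it, one pathname
# per line; wc -l counts those lines (tr strips padding some wc versions add).
total=$(find "$sandbox" | wc -l | tr -d ' ')
echo "find listed $total pathnames"

rm -rf "$sandbox"
```

Note that the count includes the starting directory and the subdirectories, not just regular files.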

The real strength of find lies in its ability to identify files that meet specific criteria. It does this through the (somewhat peculiar) application of options, tests, and actions. We'll look at the tests first.

Tests

Suppose we want a list of directories from our search. To get one, we can add the following test:

[me@linuxmachine ~]$ find ~ -type d | wc -l

Adding the test -type d limits the search to directories. Conversely, we could limit the search to regular files with this test:

[me@linuxmachine ~]$ find ~ -type f | wc -l
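As a rough illustration of the two type tests, this sketch counts directories and regular files separately in a temporary tree. All names are hypothetical.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/docs" "$sandbox/pics"
touch "$sandbox/docs/notes.txt" "$sandbox/pics/cat.jpg" "$sandbox/pics/dog.jpg"

# -type d matches the sandbox itself plus docs and pics.
dirs=$(find "$sandbox" -type d | wc -l | tr -d ' ')
# -type f matches only the three regular files.
files=$(find "$sandbox" -type f | wc -l | tr -d ' ')
echo "directories: $dirs, regular files: $files"

rm -rf "$sandbox"
```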

Here are the common file type tests supported by find:

File Type   Description
b           Block special device file
c           Character special device file
d           Directory
f           Regular file
l           Symbolic link

Additional tests let us search by file size and filename. For instance, let's look for all regular files that match the wildcard pattern "*.JPG" and are larger than one megabyte:

[me@linuxmachine ~]$ find ~ -type f -name "*.JPG" -size +1M

Here, we add the -name test followed by the wildcard pattern. The pattern is enclosed in quotes to prevent pathname expansion by the shell. Next, we add the -size test followed by the string +1M. The leading plus sign indicates that we are looking for files larger than the specified size. A leading minus sign would change the meaning to files smaller than the specified size, and no sign means an exact match. The trailing M indicates that the unit of measurement is megabytes. The following characters may be used to specify units:

Character   Unit
b           512-byte blocks. This is the default if no unit is specified.
c           Bytes
w           2-byte words
k           Kilobytes (units of 1024 bytes)
M           Megabytes (units of 1048576 bytes)
G           Gigabytes (units of 1073741824 bytes)
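To see the -name and -size tests working together, here is a small sketch that creates test files of known sizes in a temporary directory. The filenames are hypothetical, and the M unit is assumed to be supported by the installed find (it is in GNU find).

```shell
sandbox=$(mktemp -d)
# A 2 MiB file and a tiny file with the same extension,
# plus a 2 MiB file with a different extension.
dd if=/dev/zero of="$sandbox/big.JPG" bs=1024 count=2048 2>/dev/null
printf 'tiny' > "$sandbox/small.JPG"
dd if=/dev/zero of="$sandbox/big.png" bs=1024 count=2048 2>/dev/null

# Both tests must be true: name ends in .JPG AND size is over one megabyte.
large_jpgs=$(find "$sandbox" -type f -name '*.JPG' -size +1M)
echo "$large_jpgs"

rm -rf "$sandbox"
```

Only big.JPG satisfies both tests; small.JPG fails the size test and big.png fails the name test.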

find supports a large number of tests. Here is a rundown of the common ones. Note that wherever a numeric argument is required, the same "+" and "-" notation described above applies:

Test              Description
-cmin n           Match files or directories whose content or attributes were last modified exactly n minutes ago. Use -n for less than n minutes ago and +n for more than n minutes ago.
-cnewer file      Match files or directories whose contents or attributes were last modified more recently than those of file.
-ctime n          Match files or directories whose contents or attributes were last modified n*24 hours ago.
-empty            Match empty files and directories.
-group name       Match files or directories belonging to group name. The group may be given as either a group name or a numeric group ID.
-iname pattern    Like the -name test, but case insensitive.
-inum n           Match files with inode number n. This is helpful for finding all the hard links to a particular inode.
-mmin n           Match files or directories whose contents were last modified n minutes ago.
-mtime n          Match files or directories whose contents were last modified n*24 hours ago.
-name pattern     Match files and directories with the specified wildcard pattern.
-newer file       Match files and directories whose contents were modified more recently than the specified file. This is very useful when writing shell scripts that perform file backups: each time a backup is made, update a file (such as a log), and then use find to determine which files have changed since that update.
-nouser           Match files and directories that do not belong to a valid user. This can be used to find files belonging to deleted accounts or to detect activity by attackers.
-nogroup          Match files and directories that do not belong to a valid group.
-perm mode        Match files or directories that have permissions set to the specified mode. The mode may be expressed in either octal or symbolic notation.
-samefile name    Similar to the -inum test. Match files that share the same inode number as file name.
-size n           Match files of size n.
-type c           Match files of type c.
-user name        Match files or directories belonging to user name. The user may be given as either a username or a numeric user ID.

This is not a complete list. The find man page has all the details.

Operators

Even with all the tests that find provides, we may still need a better way to describe the logical relationships between them. For example, what if we needed to determine whether all the files and subdirectories in a directory had secure permissions? We would look for all the files with permissions other than 0600 and all the directories with permissions other than 0700. Fortunately, find provides a way to combine tests using logical operators to create more complex logical relationships. To express this test, we could do this:

[me@linuxmachine ~]$ find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)

That might look puzzling at first glance, but the operators are not complicated once you get to know them. Here is the list:

Operator   Description
-and       Match if the tests on both sides of the operator are true. May be shortened to -a. Note that when no operator is present, -and is implied by default.
-or        Match if a test on either side of the operator is true. May be shortened to -o.
-not       Match if the test following the operator is false. May be abbreviated with an exclamation point (!).
( )        Group tests and operators together to form larger expressions and to control the precedence of logical evaluations. By default, find evaluates from left to right. It is often necessary to override this default order to obtain the desired result. Even when not needed, grouping characters can improve the readability of a command. Note that because parentheses have special meaning to the shell, they must be quoted when used on the command line so that they are passed as arguments to find. Usually the backslash character is used to escape them.

Armed with this list of operators, let's break down our find command. At the highest level, our tests are organized into two groups, divided by an -or operator:

( expression 1 ) -or ( expression 2 )

This structure makes sense because we're seeking files with a specific set of permissions and directories with a different set. If we're looking for both files and directories, why use -or instead of -and? Well, as find traverses through files and directories, each one is assessed to check if it matches the specified tests. We want to determine if it's either a file with inadequate permissions or a directory with inadequate permissions—it can't simultaneously be both. So, if we expand the grouped expressions, it looks like this:

( file with inadequate perms ) -or ( directory with inadequate perms )

Now, how do we define "bad permissions"? In reality, we test for "not good permissions" since we're aware of what qualifies as "good permissions." For files, we define good permissions as 0600 and for directories as 0700. The expression that checks files for "not good" permissions is:

-type f -and -not -perm 0600

And for directories:

-type d -and -not -perm 0700

As mentioned in the operator table above, the -and operator is implicitly understood and can be safely omitted. So, when assembling everything, we arrive at our final command:

find ~ ( -type f -not -perm 0600 ) -or ( -type d -not -perm 0700 )

However, since parentheses hold special meaning for the shell, we must escape them to prevent misinterpretation. Preceding each parenthesis with a backslash character accomplishes this task.
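The escaped form of the permissions check can be exercised on a temporary tree with a known mix of modes. This is a sketch only; the directory and file names are hypothetical.

```shell
sandbox=$(mktemp -d)
chmod 0700 "$sandbox"
mkdir "$sandbox/private" "$sandbox/shared"
touch "$sandbox/private/secret" "$sandbox/shared/readme"
chmod 0700 "$sandbox/private"          # "good" directory
chmod 0755 "$sandbox/shared"           # "bad" directory
chmod 0600 "$sandbox/private/secret"   # "good" file
chmod 0644 "$sandbox/shared/readme"    # "bad" file

# Each parenthesis is escaped with a backslash so the shell passes it to find.
bad=$(find "$sandbox" \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) | sort)
bad_count=$(printf '%s\n' "$bad" | wc -l | tr -d ' ')
echo "$bad"

rm -rf "$sandbox"
```

Only the shared directory and its readme file are reported, since everything else already has the "good" permissions.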

Understanding logical operators also involves grasping how two expressions separated by a logical operator function:

expr1 -operator expr2

In every case, expr1 is evaluated. The operator then determines whether expr2 is evaluated as well. Here's how it works:

Results of expr1   Operator   expr2 is...
True               -and       Always performed
False              -and       Never performed
True               -or        Never performed
False              -or        Always performed

Why does this happen? It improves performance. Take -and, for instance. The expression expr1 -and expr2 cannot be true if expr1 is false, so there is no point in evaluating expr2. Likewise, with the expression expr1 -or expr2, if expr1 is true there is no need to evaluate expr2, because we already know that expr1 -or expr2 is true.

So, it's a speed optimization. But why is this significant? Because we can leverage this behavior to regulate how actions are executed, as we'll soon discover.

Predefined Actions

Time to roll up our sleeves! While having a list of results from our find command is handy, the real goal is to take action on these items. Luckily, find enables actions to be executed based on the search outcomes. There's a collection of preset actions and various methods to implement user-defined actions. Let's begin by exploring a few of the predefined actions:

Action    Description
-delete   Delete the currently matching file.
-ls       Perform the equivalent of ls -dils on the matching file. Output is sent to standard output.
-print    Output the full pathname of the matching file to standard output. This is the default action if no other action is specified.
-quit     Quit once a match has been made.

Just like the tests, there are numerous other actions available. Refer to the find manual page for complete details.

In our very first example, we did this:

[me@linuxmachine ~]$ find ~

which produced a list of every file and subdirectory in our home directory. It produced a list because the -print action is implied when no other action is specified. Our command could therefore also be written as:

[me@linuxmachine ~]$ find ~ -print

We can use find to delete files that meet certain criteria. For instance, to delete files with the file extension .BAK (often used for backup files), we could use this command:

[me@linuxmachine ~]$ find ~ -type f -name '*.BAK' -delete

In this example, every file in the user's home directory and its subdirectories is checked for a filename ending in .BAK. When such files are found, they are deleted.
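A cautious way to try the -delete action is on disposable files, printing the matches before removing them. The sketch below assumes a find that supports -delete (GNU and BSD versions do); the filenames are hypothetical.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/report.txt" "$sandbox/report.BAK" "$sandbox/old.BAK"

# Dry run: the same tests with -print show what would be removed.
find "$sandbox" -type f -name '*.BAK' -print

# Same tests, now with -delete in place of -print.
find "$sandbox" -type f -name '*.BAK' -delete

# Only report.txt should be left.
remaining=$(find "$sandbox" -type f | wc -l | tr -d ' ')
echo "files remaining: $remaining"

rm -rf "$sandbox"
```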

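Warning: Use extreme caution with the -delete action. It removes files immediately and without confirmation. Always test the command first by substituting -print for -delete to confirm the search results.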

Before we go on, let's take another look at how the logical operators affect actions. Consider the following command:

[me@linuxmachine ~]$ find ~ -type f -name '*.BAK' -print

As we have seen, this command looks for every regular file (-type f) whose name ends in .BAK (-name '*.BAK') and outputs the pathname of each matching file to standard output (-print). However, the reason the command behaves this way is determined by the logical relationships between the tests and the action. Remember that an -and relationship is implied between each test and action by default. To make the logical relationships easier to see, we could also express the command this way:

[me@linuxmachine ~]$ find ~ -type f -and -name '*.BAK' -and -print

With our command fully expressed, let's look at how the logical operators affect its execution:

Test/Action     Is Performed Only If...
-print          -type f and -name '*.BAK' are true
-name '*.BAK'   -type f is true
-type f         Is always performed, since it is the first test/action in an -and relationship.

Because the logical relationship between the tests and actions determines which of them are performed, the order of the tests and actions matters. For instance, if we reordered them so that the -print action came first, the command would behave quite differently:

[me@linuxmachine ~]$ find ~ -print -and -type f -and -name '*.BAK'

This version of the command prints every file (the -print action always evaluates to true) and only then tests for file type and the specified file extension.
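The effect of ordering can be measured by counting the lines each form prints. The following sketch uses a hypothetical temporary tree with two .BAK files among five pathnames.

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/sub"
touch "$sandbox/a.BAK" "$sandbox/sub/b.BAK" "$sandbox/keep.txt"

# Tests first: -print is reached only when both tests are true.
tests_first=$(find "$sandbox" -type f -and -name '*.BAK' -and -print | wc -l | tr -d ' ')

# -print first: it runs for every pathname before any test is applied.
print_first=$(find "$sandbox" -print -and -type f -and -name '*.BAK' | wc -l | tr -d ' ')

echo "tests first: $tests_first lines, print first: $print_first lines"
rm -rf "$sandbox"
```

With the tests first, only the two .BAK files are printed; with -print first, all five pathnames (the sandbox, sub, and the three files) appear.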

User-Defined Actions

Apart from the preset actions, we have the option to execute custom commands using the -exec action. Its usage is as follows:

-exec command {} ;

In this structure, command represents the command's name, {} symbolically represents the current pathname, and the semicolon is a necessary delimiter denoting the command's end.

Here's an example that uses -exec to act like the -delete action discussed earlier:

-exec rm '{}' ';'

Again, because the brace and semicolon characters have special meaning to the shell, they must be quoted or escaped.
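Placed in a complete command, the -exec form might look like the sketch below, which removes .BAK files from a temporary directory. The filenames are hypothetical.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/report.txt" "$sandbox/report.BAK" "$sandbox/old.BAK"

# rm is run once per matching file; '{}' is replaced by the pathname,
# and the quoted ';' marks the end of the command.
find "$sandbox" -type f -name '*.BAK' -exec rm '{}' ';'

remaining=$(find "$sandbox" -type f)
echo "$remaining"

rm -rf "$sandbox"
```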

It's also possible to execute a user-defined action interactively. By using the -ok action in place of -exec, the user is prompted before each specified command is executed:

[me@linuxmachine ~]$ find ~ -type f -name 'foo*' -ok ls -l '{}' ';'

In this example, we search for files with names starting with the string "foo" and execute the command ls -l each time one is found. Using the -ok action prompts the user before the ls command is executed.

Improving Efficiency

When the -exec action is employed, it initiates a new instance of the specified command for each matching file found. However, there are instances where consolidating all search results into a single instance of the command is preferable. For instance, instead of executing commands individually like this:

ls -l file1
ls -l file2

we might opt for this approach:

ls -l file1 file2

ensuring the command is executed only once, not multiple times. There are two methods to achieve this: the traditional method utilizing the external command xargs and an alternative method using a new feature within find itself. Let's delve into the alternative method first.

By changing the trailing semicolon character to a plus sign, we activate find's ability to combine the search results into an argument list for a single execution of the desired command. Returning to our example, this command:

[me@linuxmachine ~]$ find ~ -type f -name 'foo*' -exec ls -l '{}' ';'

executes ls each time a matching file is found. By changing the command to:

[me@linuxmachine ~]$ find ~ -type f -name 'foo*' -exec ls -l '{}' +

we get the same results, but the system has to execute the ls command only once.
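One way to observe the difference is to replace ls with a small helper that prints one line per invocation, then count the lines. This is an illustrative sketch; the helper and filenames are hypothetical.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/foo1" "$sandbox/foo2" "$sandbox/foo3"

# The helper prints its argument count once each time it is started,
# so the number of output lines equals the number of invocations.
per_file=$(find "$sandbox" -type f -name 'foo*' -exec sh -c 'echo "$#"' sh '{}' ';' | wc -l | tr -d ' ')
batched=$(find "$sandbox" -type f -name 'foo*' -exec sh -c 'echo "$#"' sh '{}' + | wc -l | tr -d ' ')

echo "with ';' the helper ran $per_file times; with '+' it ran $batched time(s)"
rm -rf "$sandbox"
```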

xargs

The xargs command performs an interesting function. It accepts input from standard input and converts it into an argument list for a specified command. With our example, we would use it like this:

[me@linuxmachine ~]$ find ~ -type f -name 'foo*' -print | xargs ls -l

Here, the output of the find command is piped into xargs, which constructs an argument list for the ls command and then executes it.

Note

Dealing With Funny Filenames

Unix-like systems permit spaces (and even newlines!) within filenames. However, this can pose challenges for programs like xargs, which compile argument lists for other programs. An embedded space is construed as a delimiter, causing the resulting command to interpret each space-separated word as an individual argument. To tackle this issue, find and xargs offer the optional utilization of a null character as an argument separator. In ASCII, a null character is represented by the number zero (in contrast to, for instance, the space character, represented by the number 32 in ASCII). The find command offers the -print0 action, generating null-separated output. Simultaneously, the xargs command includes the --null option, accommodating null-separated input. Consider this example:

find ~ -iname '*.jpg' -print0 | xargs --null ls -l

Employing this technique ensures correct handling of all files, including those with embedded spaces in their names.
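The effect of null separation can be checked by counting how many arguments xargs hands to a command. In this sketch the helper and filenames are hypothetical, -0 is the short form of --null, and the temporary directory path is assumed to contain no whitespace.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/my photo.jpg" "$sandbox/cat.jpg"

# Default splitting: the space in "my photo.jpg" breaks it into two arguments.
naive=$(find "$sandbox" -iname '*.jpg' | xargs sh -c 'echo "$#"' sh)

# Null-separated: each pathname arrives as exactly one argument.
safe=$(find "$sandbox" -iname '*.jpg' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "whitespace-separated: $naive arguments, null-separated: $safe arguments"
rm -rf "$sandbox"
```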

A Return To The Playground

Let's apply find to a (nearly) practical scenario. We'll set up an environment to experiment with what we've learned.

To start, we'll build a playground with lots of subdirectories and files, using just two command lines: a mkdir command and a touch command, each combined with brace expansion.

Behold the power of the command line! Those two lines create a playground directory containing 100 subdirectories, each holding 26 empty files. Try that with a GUI!

The method combines a familiar command (mkdir), an exotic shell expansion (braces), and a new command, touch. Pairing mkdir with the -p option (which causes mkdir to create the parent directories of the specified paths) and brace expansion creates all 100 subdirectories at once.
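The two command lines described above might look like the following sketch. The subdirectory names (dir-001 through dir-100) are an assumption made for illustration, the playground is placed in a temporary directory rather than the home directory, and brace expansion requires bash (a plain POSIX sh does not provide it).

```shell
# Assumes bash: brace expansion is not part of POSIX sh.
sandbox=$(mktemp -d)

# Two lines: 100 directories, then 26 empty files in each.
# (Zero padding in {001..100} needs bash 4 or later; the counts hold either way.)
mkdir -p "$sandbox"/playground/dir-{001..100}
touch "$sandbox"/playground/dir-{001..100}/file-{A..Z}

dir_count=$(find "$sandbox/playground" -type d -name 'dir-*' | wc -l | tr -d ' ')
file_count=$(find "$sandbox/playground" -type f | wc -l | tr -d ' ')
file_a_count=$(find "$sandbox/playground" -type f -name 'file-A' | wc -l | tr -d ' ')
echo "$dir_count directories, $file_count files, $file_a_count copies of file-A"

rm -rf "$sandbox"
```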

The touch command is normally used to set or update the access and modification times of files (the change time is updated as a side effect). However, if a filename argument names a file that does not exist, touch creates an empty file.

In our playground, we created 100 instances of a file named file-A. Let's find them:

[me@linuxmachine ~]$ find playground -type f -name 'file-A'

Unlike ls, find does not produce results in sorted order. Its order is determined by the layout of the storage device. To confirm that we actually have 100 instances of the file, we can check this way:

[me@linuxmachine ~]$ find playground -type f -name 'file-A' | wc -l

Next, let's look at finding files based on their modification times. This is helpful when creating backups or organizing files in chronological order. First, we'll create a reference file against which we will compare modification times:

[me@linuxmachine ~]$ touch playground/timestamp

This creates an empty file named timestamp and sets its modification time to the current time. We can verify this with another handy command, stat, which is a kind of souped-up version of ls. The stat command reveals everything the system knows about a file and its attributes:

[me@linuxmachine ~]$ stat playground/timestamp

If we touch the file again and then examine it with stat, we will see that the file's times have been updated.

Next, let's use find to update some of our playground files:

[me@linuxmachine ~]$ find playground -type f -name 'file-B' -exec touch '{}' ';'

This updates all the files in the playground named file-B. Now we'll use find to identify the updated files by comparing all the files against the reference file timestamp:

[me@linuxmachine ~]$ find playground -type f -newer playground/timestamp

The results contain all 100 instances of file-B. Because we performed a touch on every file named file-B after we updated timestamp, those files are now "newer" than timestamp and can be identified with the -newer test.
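The same -newer workflow can be reproduced deterministically on a smaller tree by giving the files explicit old timestamps with touch -t. This is a sketch under assumed names (three directories instead of 100) and dates picked purely for illustration.

```shell
sandbox=$(mktemp -d)
for d in dir-1 dir-2 dir-3; do
    mkdir "$sandbox/$d"
    # Give every file an old modification time (2020-01-01 00:00).
    touch -t 202001010000 "$sandbox/$d/file-A" "$sandbox/$d/file-B"
done
# The reference file is one day newer than the files.
touch -t 202001020000 "$sandbox/timestamp"

# Update only the file-B copies; their times become "now".
find "$sandbox" -type f -name 'file-B' -exec touch '{}' ';'

# Only the three file-B copies are newer than the reference file.
newer=$(find "$sandbox" -type f -newer "$sandbox/timestamp" | wc -l | tr -d ' ')
echo "files newer than timestamp: $newer"

rm -rf "$sandbox"
```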

Finally, let's go back to the bad permissions test we performed earlier and apply it to the playground:

[me@linuxmachine ~]$ find playground \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)

This command lists all 100 directories and 2600 files in the playground (as well as timestamp and playground itself, for a total of 2702) because none of them meets our definition of "good permissions." With our knowledge of operators and actions, we can add actions to this command to apply new permissions to the files and directories in our playground:

[me@linuxmachine ~]$ find playground \( -type f -not -perm 0600 -exec chmod 0600 '{}' ';' \) -or \( -type d -not -perm 0700 -exec chmod 0700 '{}' ';' \)

In everyday practice, it might be more convenient to execute two commands—one for the directories and another for the files—rather than employing this comprehensive compound command. However, it's beneficial to acknowledge that this method is possible. The key takeaway here is comprehending how operators and actions harmonize to execute practical tasks.
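To confirm that a compound command of this kind does what is intended, one can run the repair and then repeat the check, which should come back empty. The sketch below does this on a small temporary tree with hypothetical names.

```shell
sandbox=$(mktemp -d)
chmod 0700 "$sandbox"
mkdir "$sandbox/dir-1" "$sandbox/dir-2"
touch "$sandbox/dir-1/file-A" "$sandbox/dir-2/file-A"
chmod 0755 "$sandbox/dir-1" "$sandbox/dir-2"
chmod 0644 "$sandbox/dir-1/file-A" "$sandbox/dir-2/file-A"

# Repair: each group tests the type, tests the mode, then runs chmod
# only when both tests were true (the implied -and).
find "$sandbox" \( -type f -not -perm 0600 -exec chmod 0600 '{}' ';' \) \
    -or \( -type d -not -perm 0700 -exec chmod 0700 '{}' ';' \)

# Check: the original test should now match nothing.
still_bad=$(find "$sandbox" \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) | wc -l | tr -d ' ')
echo "items with bad permissions after repair: $still_bad"

rm -rf "$sandbox"
```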

Options

Lastly, there are the options. These options serve to manage the scope of a find search. They can be incorporated alongside other tests and actions when crafting find expressions. Below is a compilation of the frequently utilized ones:

Option             Description
-depth             Direct find to process a directory's files before the directory itself. This option is automatically applied when the -delete action is specified.
-maxdepth levels   Set the maximum number of levels that find will descend into a directory tree when performing tests and actions.
-mindepth levels   Set the minimum number of levels that find will descend into a directory tree before applying tests and actions.
-mount             Direct find not to traverse directories that are mounted on other file systems.
-noleaf            Direct find not to optimize its search based on the assumption that it is searching a Unix-like file system. This is needed when scanning DOS/Windows file systems and CD-ROMs.
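The depth options are easiest to understand on a tree with one file at each level. In this sketch (hypothetical names), the starting directory counts as depth 0, so top.txt is at depth 1 and deep.txt at depth 3.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/a/b"
touch "$sandbox/top.txt" "$sandbox/a/mid.txt" "$sandbox/a/b/deep.txt"

# Options are placed before the tests they govern.
shallow=$(find "$sandbox" -maxdepth 1 -type f)   # do not descend below depth 1
deep=$(find "$sandbox" -mindepth 3 -type f)      # ignore anything above depth 3

echo "maxdepth 1: $shallow"
echo "mindepth 3: $deep"
rm -rf "$sandbox"
```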

Summary

It's evident that locate offers simplicity compared to the complexity of find. Each has its own advantages. Take the opportunity to delve into the diverse capabilities of find. Regular use can significantly enhance your comprehension of Linux file system operations.
