The next part of our processing pipeline is selection.
Here, a single command, grep, is king.
Grep stands for Global Regular Expression Print,
where "global" means throughout a file.
A regular expression is a recipe for matching text strings.
At one time,
the g/re/p sequence was often executed as a command
in ed, Unix's line-oriented editor,
and you can still run it as a colon command in the vim editor.
Users found the editor command so useful
that its code was used to create grep as a stand-alone command.
The grep command is so essential that when, many years ago,
I found myself programming in the COBOL programming language
on a computer and operating system so obscure
that they aren't even listed on Wikipedia,
I wrote a bare-bones version of grep just to be able to do my job.
Let's start by looking at regular expressions.
As always,
remember to try out on your own the regular expressions
and the selection commands we'll be examining.
When we use regular expressions with grep,
by default, they match any part of the line.
However, with special characters, we can have them anchor themselves
at the beginning or at the end of the line.
Let's start with a simple example.
I run "grep baba" on the words dictionary
to search for lines containing the regular expression "baba".
This is a character sequence with the letter "b" followed by "a",
followed again by "b" and "a".
As expected, the output includes all dictionary words
with this character sequence.
Now,
if we're only interested in lines starting with a particular sequence,
we precede it with the caret special character,
which looks like an up arrow.
Searching now in the dictionary for all words starting with "baba",
we only get the words that begin with this sequence.
Another special character, the dollar sign,
can be specified after a character sequence,
to match line endings.
In our case,
there are just two words ending with "baba".
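The three searches above can be sketched as follows. This is a minimal sketch: the sample_words function is a hypothetical miniature word list that stands in for the words dictionary, so the matches differ from those on a real system.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' ababa alibaba baba babacoote bat kababa rabbit
}

# No anchor: "baba" may appear anywhere in the line.
anywhere=$(sample_words | grep 'baba')

# Caret: "baba" must be at the beginning of the line.
at_start=$(sample_words | grep '^baba')

# Dollar sign: "baba" must be at the end of the line.
at_end=$(sample_words | grep 'baba$')

printf 'Anywhere:\n%s\n' "$anywhere"
printf 'At the start:\n%s\n' "$at_start"
printf 'At the end:\n%s\n' "$at_end"
```

The patterns are written in single quotes so that the shell passes the caret and the dollar sign to grep unchanged.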
The dot character matches any character.
For example, running "grep a.a.a.a" on words
outputs all words containing the character "a",
followed by any character,
followed by "a" again,
followed by any character and so on.
This is the result we get.
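Here is a small sketch of the dot search, again on a hypothetical word list (sample_words) rather than the real dictionary.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' alabama banana caravan taramasalata
}

# Each dot stands for exactly one arbitrary character,
# so a match needs four "a"s, each separated by a single character.
matches=$(sample_words | grep 'a.a.a.a')

printf '%s\n' "$matches"
```

Words such as "banana" and "caravan" are not selected, because they contain only three alternating "a"s.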
The special regular expression characters can also be combined.
For instance,
let's suppose we want to search for words starting with "t",
ending with "y" and having any character in between.
I run "grep '^t.y$' words"
and these are the identified dictionary words.
As another example,
I want to count how many four-letter words exist in the dictionary.
We can use dots as placeholders for all four characters.
So I search for words
having any four characters from their beginning to the end.
Piping the output to word count,
we see that there are around 5000 four-letter words in the dictionary.
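The two combined searches above could look like this. It is a sketch on a hypothetical word list (sample_words), so the count is tiny compared with the real dictionary.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' a boy cafe door thy tidy toy tray try
}

# Start of line, "t", any one character, "y", end of line.
t_dot_y=$(sample_words | grep '^t.y$')

# Exactly four characters between the start and the end of the line,
# with the matching lines counted by wc.
four_letter_count=$(sample_words | grep '^....$' | wc -l)

printf 'Words matching ^t.y$:\n%s\n' "$t_dot_y"
printf 'Four-letter words: %d\n' "$four_letter_count"
```

Without the two anchors, the four dots would also match any longer word, because grep by default matches any part of the line.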
The star special character modifies the regular expression preceding it,
so that it matches zero or more times.
As an example, consider finding words in the dictionary
that start and end with "k"
and have a "d" somewhere in between.
I represent this with a regular expression consisting of "k",
followed by any character any number of times, including zero
(this is what the dot followed by the star stands for), followed by "d",
followed again by any character any number of times,
and ending with "k".
We see that the dictionary contains two such words:
"kedlock" and "kodak".
Interestingly,
George Eastman
created the name of the photography company Kodak
by looking for sequences that started and ended with the character "k".
If he had had access to grep in 1888, he could have run this command.
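The star search described above can be sketched like this. The sample_words list is hypothetical, and the anchors are my reading of "start and end with k".

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' dock kayak kedlock kick kodak kudos
}

# "k" at the start, anything, a "d", anything, "k" at the end.
# The ".*" pairs each mean: any character, zero or more times.
k_d_k=$(sample_words | grep '^k.*d.*k$')

# Because the star also allows zero repetitions,
# the made-up string "kdk" matches as well.
zero_repetitions=$(printf 'kdk\n' | grep -c '^k.*d.*k$')

printf '%s\n' "$k_d_k"
printf 'Lines matched for kdk: %d\n' "$zero_repetitions"
```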
Let's see another fun example.
We want to find all words in the dictionary
whose characters follow the alphabetic sequence.
This means that each of their characters
must come no later in the alphabet than the one following it.
In this case, we can use a regular expression,
anchored at both ends of the line,
that contains all characters ordered in alphabetic sequence,
with each one of them occurring zero or more times.
Counting these words,
we see that there are around 700 words following this pattern.
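One way to write such a regular expression is sketched below. The variable name and the sample_words list are hypothetical. The anchors matter here: without them the pattern can match an empty string, and so it would select every line.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' abbey almost banana biopsy chintz hello zebra
}

# Every lowercase letter in alphabetic order, each allowed zero or more times,
# anchored so that the whole line must fit the pattern.
alphabetic='^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$'

ordered=$(sample_words | grep "$alphabetic")
ordered_count=$(sample_words | grep -c "$alphabetic")

printf '%s\n' "$ordered"
printf 'Count: %d\n' "$ordered_count"
```

Note that a repeated letter, as in "abbey", is accepted, because each letter may occur more than once.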
We can specify a set of characters to match
by putting them in square brackets.
As an example, I want to find in the dictionary
any words that contain a digit.
To do that, I run grep '[0-9]', which represents any digit,
specifying the words dictionary.
No output appears, so apparently there are no words with digits.
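The digit search can be sketched as follows, on a hypothetical word list (sample_words) that, unlike the dictionary, does contain a few digits. The second part shows one way to detect the "no output" case through grep's exit status, using the -q option to suppress output.

```shell
# Hypothetical word list; the real dictionary has no entries with digits.
sample_words() {
  printf '%s\n' apple b2b catch22 delta
}

# The set [0-9] matches any single digit anywhere in the line.
with_digits=$(sample_words | grep '[0-9]')
printf '%s\n' "$with_digits"

# When nothing matches, grep prints nothing and exits with a non-zero status.
if printf '%s\n' apple delta | grep -q '[0-9]'; then
  verdict='some words contain digits'
else
  verdict='no words contain digits'
fi
printf '%s\n' "$verdict"
```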
Now, let's find words that include a hard consonant between "k"s, like Kodak.
I specify a regular expression that starts with "k",
continues with any character,
then includes one of the specified hard consonants,
continues with another arbitrary character,
and ends with a "k".
Two such words exist in the dictionary:
"kapok" and "kodak".
To count the proper nouns in the dictionary,
that is, the words that begin with a capital letter,
I type a regular expression that consists of a caret
followed by an uppercase character from A to Z.
There are about 25,000 proper nouns in words.
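The last two searches, the hard consonant between "k"s and the count of capitalized words, might be written as sketched here. The word list (sample_words), the exact set of hard consonants, and the anchors in the first pattern are my assumptions. I also set LC_ALL=C, because in some locales a range such as A-Z can match lowercase letters too.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' Aaron Zurich apple eBay kapok kayak kiosk kodak zebra
}

# "k", any character, one hard consonant (assumed set), any character, "k".
between_ks=$(sample_words | LC_ALL=C grep '^k.[bdgkpt].k$')

# Lines whose first character is an uppercase letter; -c prints the count.
proper_count=$(sample_words | LC_ALL=C grep -c '^[A-Z]')

printf '%s\n' "$between_ks"
printf 'Capitalized words: %d\n' "$proper_count"
```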
Furthermore,
to identify words containing non-alphabetic characters
I specify a caret as the first character within the square brackets,
in order to complement the set.
Whenever the set in square brackets starts with a caret,
this means
"any character except for those specified in the square brackets".
In our case, there are two such names, both containing a dash.
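A complemented set could be used as in the sketch below. The word list (sample_words) and the exact pattern are assumptions. The second search is there only to contrast the two meanings of the caret: inside the brackets it complements the set, outside them it anchors the match at the beginning of the line.

```shell
# Hypothetical miniature word list standing in for the words dictionary.
sample_words() {
  printf '%s\n' Aaron Jean-Christophe Lao-tzu apple zebra
}

# Caret inside the brackets: any character that is NOT a letter.
non_alphabetic=$(sample_words | LC_ALL=C grep '[^a-zA-Z]')

# Caret outside the brackets: a lowercase letter at the start of the line.
starts_lowercase=$(sample_words | LC_ALL=C grep -c '^[a-z]')

printf '%s\n' "$non_alphabetic"
printf 'Words starting in lowercase: %d\n' "$starts_lowercase"
```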
Within square brackets,
we can also specify the name of a character class
enclosed in colons and another pair of square brackets.
This corresponds to the set of all characters belonging to that class.
Standard character classes
include alphanumeric or alphabetic characters,
blank characters, such as space and tab,
digits,
space characters,
and uppercase characters.
For instance, the class "[:space:]" represents the space characters:
the tab, the newline, the vertical tab, the space and so on.
Let's see an example with a character class.
Consider the case where we want to find on a Windows PC
all top level directories whose name contains a space character.
I first run find on /cygdrive/c,
specifying a maxdepth of one, to not descend further,
and looking only for directories.
I then pipe the output to grep
in order to select the directory names with spaces.
The output includes the various Program Files
and the System Volume Information directory.
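The pipeline above can be tried without a Windows machine, as in this sketch. It builds a hypothetical temporary directory tree instead of using /cygdrive/c, and it runs find from inside that tree so that the temporary path itself cannot affect the match.

```shell
# Hypothetical directory tree standing in for the top level of a Windows drive.
root=$(mktemp -d)
mkdir "$root/Program Files" "$root/System Volume Information" \
  "$root/Users" "$root/Windows"

# List only directories, without descending below the first level,
# then keep the names containing a character of the space class.
spaced=$(cd "$root" &&
  find . -maxdepth 1 -type d | grep '[[:space:]]' | LC_ALL=C sort)

printf '%s\n' "$spaced"

# Remove the temporary tree.
rm -rf "$root"
```

The outer pair of square brackets defines the set, and the inner "[:space:]" names the class within it. The sort step is only there to make the order of the output predictable.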
To make all the special characters
we have seen up to now lose their special meaning,
we can precede them with a backslash.
Note that the backslash itself must be escaped or quoted
when used on the shell's command line.
For instance, let's look into the /etc directory
for file and directory names that contain a dot.
As you can see,
I use a find command and pipe the output to grep
specifying a dot preceded by a backslash,
in order to deprive the dot of its special meaning.
In this way,
instead of matching any character with dot,
I match actual dot characters appearing in the file and directory names.
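The effect of the backslash can be sketched as follows. Instead of running find on /etc, the sketch uses a hypothetical list of paths (sample_paths), so that the result does not depend on the machine.

```shell
# Hypothetical list of paths standing in for the output of find on /etc.
sample_paths() {
  printf '%s\n' /etc/cron.d /etc/hosts /etc/resolv.conf /etc/ssh/sshd_config
}

# Quoted: the single quotes protect the backslash from the shell,
# and the backslash makes grep treat the dot as a literal dot.
dotted=$(sample_paths | grep '\.')

# Escaped: the shell turns \\ into a single backslash, so grep again sees \.
dotted_escaped=$(sample_paths | grep \\.)

# Unescaped: the dot matches any character, so every line is counted.
any_character_count=$(sample_paths | grep -c '.')

printf '%s\n' "$dotted"
printf 'Lines matched by an unescaped dot: %d\n' "$any_character_count"
```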
This concludes our unit on selecting data with regular expressions and grep.
Stay with us!