Tuesday, April 29, 2014

Unix: Counting chickens or anything else

http://www.itworld.com/operating-systems/415401/unix-counting-chickens-or-anything-else

Unix tools make it easy to find strings in files, but what if you want to find specific whole words, more complex text patterns, or every instance of a word?
 
Basic Unix commands make it easy to determine whether files contain particular strings. Where would we be without commands like grep? But sometimes when using grep, you can get answers that under- or overreport the presence of what you are looking for. Take a very simple grep command for an example.
 
$ grep word mybigfile | wc -l
98

Commands like this tell you how many lines contain the word you are looking for, but not necessarily how many times that word appears in the file. After all, the word "word" might appear two or more times in a single line and yet be counted only once. Plus, if the word can be part of longer words (as "word" is part of "password" and "sword"), you might get false positives. So you can't depend on the result to give you an accurate count, or even to tell you whether the word appears at all, unless the word you are looking for just isn't going to be part of another word -- like, maybe, chicken.
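
To see both problems at once, here is a small sketch. The sample file and its contents are made up for illustration, and it assumes a grep that supports the -o option (GNU and BSD grep both do), which prints each match on a line of its own.

```shell
#!/bin/bash
# Hypothetical sample file; the contents are invented for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
word by word by word
my password is a sword
no match here
EOF

# Lines that contain the string "word" anywhere, substrings included.
matching_lines=$(grep -c word "$sample")

# Every occurrence of the string, one per output line, then counted.
string_hits=$(grep -o word "$sample" | wc -l)

echo "lines containing the string: $matching_lines"
echo "occurrences of the string:   $string_hits"
rm -f "$sample"
```

Neither number is the one we are after. The line count is too low because the first line holds three copies of the word, and the occurrence count is too high because "password" and "sword" are included.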

Trick #1: grep with -w

If you want to be sure that you count only the lines containing "word" as a word in its own right, add the -w option to your grep command. This option tells grep to match "word" only when it stands on its own, not when it is part of another word.
$ grep -w word mybigfile | wc -l
54
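
As a rough illustration of what -w changes, the sketch below runs it against a made-up sample file. Combining -w with -o to count individual occurrences is my own addition here, and it assumes a grep that supports -o.

```shell
#!/bin/bash
# Hypothetical sample file; the contents are invented for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
word by word by word
my password is a sword
words, words, words
the last word.
EOF

# Lines containing "word" as a whole word ("password", "sword", "words" do not count).
whole_word_lines=$(grep -cw word "$sample")

# Whole-word occurrences, one per output line, then counted.
whole_word_hits=$(grep -ow word "$sample" | wc -l)

echo "lines with the whole word:    $whole_word_lines"
echo "whole-word occurrences total: $whole_word_hits"
rm -f "$sample"
```

One thing to keep in mind is that grep considers only letters, digits, and underscores to be word characters, so -w still finds "word" inside something like "word-count".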

Trick #2: looping through every word

To be sure that you count every instance of the word you are looking for, you might elect to use some technique that examines every word in a file independently. The easiest way to do this is to use a bash for command. After all, any time you use a for command, such as for letter in a b c d e, the command loops once for every argument provided. And, if you use a command such as for letter in `cat mybigfile`, it will loop through every word (i.e., every piece of text on every line) in the file.
$ count=0
$ for word in `cat mybigfile`
> do
>   if [ "$word" == "word" ]; then
>      count=`expr $count + 1`
>   fi
> done
$ echo $count
71
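
One caveat with this loop: the output of cat is subject to filename expansion, so a stray * in the file gets replaced by the names of the files in the current directory. The sketch below is one way to guard against that, not the only one. It assumes a made-up sample file, turns globbing off with set -f for the duration of the loop, and uses the shell's built-in arithmetic in place of expr.

```shell
#!/bin/bash
# Hypothetical sample file; the contents are invented for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
word by word by word
my password is a sword
the last word.
* word
EOF

set -f                      # turn off filename expansion so "*" stays a literal token
count=0
for w in $(cat "$sample"); do
    if [ "$w" = "word" ]; then
        count=$((count + 1))
    fi
done
set +f                      # restore normal globbing

echo "$count"
rm -f "$sample"
```

The loop compares whole whitespace-separated tokens, so "word." with its trailing period is not counted here.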
If you need to do this kind of thing often -- that is, look for particular words in arbitrary files -- then you might want to commit the operation to a script so that you don't have to type the looping and if commands time and time again. Here's an example script that will prompt you for the word you are looking for and the file you want to look through if you don't provide both on the command line.
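#!/bin/bash

# prompt for the word and the file unless both were given as arguments
if [ $# -lt 2 ]; then
    echo -n "look for> "
    read lookfor
    echo -n "file> "
    read filename
else
    lookfor=$1
    filename=$2
fi

# start at zero so that "no matches" prints 0 rather than nothing
count=0

for w in `cat "$filename"`
do
  if [ "$w" == "$lookfor" ]; then
    count=`expr $count + 1`
  fi
done

echo $count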
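
Because the script compares whole whitespace-separated tokens, a word followed by a comma or wrapped in quotes will not match. If that matters for your files, one possible variation is sketched below. The count_word function is a hypothetical helper, not part of the script above, and the sample text is made up. It splits the file into one token per line, trims punctuation from both ends of each token, and lets grep count the exact matches.

```shell
#!/bin/bash
# count_word is a hypothetical helper, not part of the original script.
# $1 = word to look for, $2 = file to search
count_word() {
    tr -s '[:space:]' '\n' < "$2" |                     # one token per line
        sed 's/^[[:punct:]]*//; s/[[:punct:]]*$//' |    # trim punctuation at both ends
        grep -cxF -- "$1"                               # count exact, literal matches
}

# Hypothetical sample file; the contents are invented for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
He said "word" twice: word, word.
A sword is not a word!
password
EOF

result=$(count_word word "$sample")
echo "$result"
rm -f "$sample"
```

Only punctuation at the ends of a token is trimmed, so a word like "don't" is left alone.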

Trick #3: Looking for patterns

More interesting than looking for some specific word is the challenge of looking for various patterns in particular files.
 
Maybe you need to answer questions like "Does this file contain anything that looks like phone numbers, social security numbers, or IP addresses?". And maybe you need to grab a list of what phone numbers, social security numbers, or IP addresses might be contained in the file -- or just verify that none are included.
When looking for patterns, I like to rely on the powers of Perl. Some patterns are relatively easy to construct. Others take a lot more effort. Take, for example, the patterns below. The first represents a social security number -- 3 digits, followed by a hyphen, followed by 2 digits, followed by a hyphen, followed by 4 digits. That's easy. The last represents an IP address (IPv4) with 1 to 3 digits in each of four positions, separated by dots. Note that it does not check that each number is 255 or less. Phone numbers, on the other hand, can take a number of different forms. For example, a number might include the leading 1, and its parts might be separated by hyphens or dots. The middle expression tries to capture the likely permutations, but even this doesn't cover the possible expressions of international numbers.
[0-9]{3}-[0-9]{2}-[0-9]{4}
1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
International numbers might work with [+]?([0-9]*[\.\s\-\(\)]|[0-9]+){3,24} though I'm not sure that I can vouch for all the possible expressions of these numbers as I'm not one who ever makes international calls.
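
Before building a script around a pattern, it can help to try it against a few lines of test data. The sketch below does that with grep -E and a made-up sample file. It rewrites \d as [0-9] because grep -E does not portably understand \d, and it exercises only the social security number and IP address patterns, since the phone number pattern relies on \W, which not every grep accepts. The -o option is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/bash
# Hypothetical sample file; every number in it is invented for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
SSN on file: 123-45-6789
Call 1-212-555-0147 or 212.555.0199
Server 192.168.1.10 forwarded to 10.0.0.7
Nothing sensitive on this line
EOF

# The article's patterns with \d spelled out as [0-9] for grep -E.
ssn_re='[0-9]{3}-[0-9]{2}-[0-9]{4}'
ip_re='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# -o prints only the matching text, one match per line.
ssn_hits=$(grep -Eo "$ssn_re" "$sample")
ip_hits=$(grep -Eo "$ip_re" "$sample")

printf 'SSN-like strings:\n%s\n' "$ssn_hits"
printf 'IP-like strings:\n%s\n' "$ip_hits"
rm -f "$sample"
```

To simply verify that a file contains nothing matching a pattern, grep -Eq with the same expression returns a non-zero exit status when there are no matches.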
The Perl script below looks for IP addresses in whatever file is provided as an argument. By using the "while pattern exists" logic in the inner while loop, it captures multiple IP addresses on a single line if they exist. Each identified IP address is then removed from the line so that the next can be captured in the subsequent pass through the loop. When all addresses have been identified, we move to the next line from the text file.
#!/usr/bin/perl -w

if ( $#ARGV >= 0 ) {
    open FILE,"<$ARGV[0]" or die;
} else {
    print "ERROR: file required\n";
    exit 1;
}

my %IP=();

while ( <FILE> ) {
    while ( /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ ) {
        ($ipaddr)=/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;
        if ( exists $IP{$ipaddr} ) {
            $IP{$ipaddr}++;
        } else {
            $IP{$ipaddr}=1;
        }
        s/\Q$ipaddr\E//;  # remove captured IP address from line (\Q..\E keeps the dots literal)
    }
}

# display list of captured IP addresses and # of occurrences
foreach my $key ( keys %IP )
{
    print "$key $IP{$key}\n";
}
This script stuffs all the identified IP addresses into a hash and counts how many times each appears, so it tells you not just which IP addresses show up in the file, but how often each one does. Notice how it uses the exists test to determine whether an IP address has already been seen before it decides to create a new hash entry or increment the count for an existing one.
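
For a quick check when you don't need a script, a similar tally can be put together from standard commands. This is a sketch of an alternative, not a translation of the Perl script. The log file and its contents are made up, and it assumes a grep with -o. Here sort groups identical addresses together and uniq -c does the counting that the hash does in Perl.

```shell
#!/bin/bash
# Hypothetical log file; the addresses are invented for this illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.7 connected, relayed via 192.168.1.10 and 10.0.0.7
192.168.1.10 disconnected
10.0.0.7 again
no address here
EOF

# Same IP pattern as the Perl script, with \d written as [0-9] for grep -E.
ip_re='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# grep -o emits one address per line; sort groups duplicates; uniq -c counts them;
# awk swaps the columns to print "address count" like the Perl script does.
tally=$(grep -Eo "$ip_re" "$log" | sort | uniq -c | awk '{ print $2, $1 }')

echo "$tally"
rm -f "$log"
```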

Wrap-Up

Identifying text of interest in arbitrary files is generally an easy task, as long as you can pin down exactly what you are looking for and don't miss instances when more than one appears on the same line.
