Tuesday, April 22, 2014

Learn regular expressions to more effectively search through code and the shell

http://www.linuxuser.co.uk/tutorials/regular-expressions-guide

We’re always searching for something – the file where we wrote that recipe (Python or baking); the comment in 100,000 lines of code that points to an unfinished module; the log entry about an iffy connection. Regular expressions (abbreviated as regexps hereafter, but you’ll also see regex and re) are a codified method of searching which, to the unenlightened, suggests line noise. Yet, despite a history that stretches back to Ken Thompson’s 1968 QED editor, they’re still a powerful tool today, thanks to grep – ‘global regular expression print’. Using grep exposes only the limited Basic Regular Expressions (BRE); grep -E (or egrep) gives Extended Regular Expressions (ERE). For other languages, most adopt PCRE (Perl Compatible Regular Expressions), developed in 1997, by Philip Hazel, and understood by many languages, though not always implemented in exactly the same way. We’ll use grep -P when we need to access these. Emacs has its own regexp style but, like grep, has a -P option to use Perl-compatible regexps.
This introduction is mostly aimed at searching from the shell, but you should easily be able to adapt it to standalone Perl scripts, and other languages which use PCRE.
Even the simplest regexp can make you more productive at the command line
Even the simplest regexp can make you more productive at the command line

Resources

Your favourite editor
Perl 5.10 (or later)

Step-by-step

Step 01 Word up!
You’re probably used to searching a text file for occurrences of a word with grep – in that case, the word is the regular expression. More complicated regexps are simply concise ways for searching for parts of words, or character strings, in particular positions.
Step 02 Reserved character
Some characters mean special things in regexp pattern matching: . * [ ] ^ $ \ in Basic Regular Expressions. The ‘.’ matches any character, so using it above doesn’t just find the full stop unless grep’s -F option is used to make the string entirely literal.
Step 03 Atlantic crossing
Extended Regular Expressions add ? | { } ( ) to the metacharacters. grep -E or egrep lets you use them, as above, where ‘standardise|standardize’ can match British or American (and ‘Oxford’) spellings of ‘standardise’.
Step 04 Colourful?
‘|’ gives a choice between the two characters in the parentheses – standardi(s|z)e – saving unnecessary typing. Another way to find both British and American spellings is ‘?’ to indicate one or zero of the preceding element, such as the u in colour.
Step 05 Mmmmm, cooooool
The other quantifiers are + for at least one of the preceding regexps (‘_+’ finds lines with at least one underscore) and * for zero or more (coo*l matches col, cool, coooooooool, but not cl, useful for different spellings of mmmmmmmmm or zzzzzzzzzz).
Step 06 No number
Feeling confident? Good, time for more goodies. [0-9] is short for [0123456789] and matches any element in the square brackets. The ^ inside the brackets is a negation, here matching any non-number but the other ^? …
Step 07 Start to finish
The ^ matches the expression at the beginning of the line; a $ matches the end. Now you can sort your document.text from text.doc and find lines beginning with # or ending in a punctuation mark other than a period.
Step 08 A to Z Guide
The range in [] can be anything from the ASCII character set, so [ \t\r\n\v\f] indicates the whitespace characters (tab, newline et al). [^bd]oom$ matches all words ending in ‘oom’, occurring at the end of the line, except boom and doom.
Step 09 POSIX classes
The POSIX classes for character ranges save a lot of the [A-Za-z0-9], but perhaps most useful is the non-POSIX addition of [:word:] which matches [A-Za-z0-9_], the addition of underscore helping to match identifiers in many programming languages.
Step 10 ASCII style
Where character classes aren’t implemented, knowledge of ASCII’s underpinnings can save you time: so [ -~] is all printable ASCII characters (character codes 32-127) and its inverse [^ -~] is all non-printable ASCII characters.
Step 11 Beyond grep
Find and Locate both work well with regexps. In The Linux Command Line (reviewed in LUD 111), William Shotts gave the great example of find . -regex ‘.*[^-_./0-9a-zA-Z].*’ to find filenames with embedded spaces and other nasties.
Step 12 Nice one Cyril
Speaking of non-standard characters, while [:alpha:] depends on your locale settings, and may only find ASCII text, you can still search for characters of other alphabets – from accented French and Welsh letters to the Greek or Russian alphabet.
Step 13 Ranging repeat
While {4} would match the preceding element if it occurred four times, putting in two numbers gives a range. So, [0-9]{1,3} in the above screenshot finds one-, two- or three- digit numbers – a quick find for dotted quads, although it won’t filter out 256-999.
Step 14 Bye bye, IPv4
FOSDEM was all IPv6 this year, so let’s not waste any more time on IPv4 validation, as the future may actually be here. As can be seen in this glimpse of IPv6 validators, despite some Perl ‘line noise’, it boils down to checking appropriate amounts of hex.
Step 15 Validation
By now regexps should be looking a lot less like line noise, so it’s time to put together a longer one, just building from some of the simpler parts. A common programming task, particularly with web forms, is validating input is in the correct format – such as dates.
In this case we’re looking at validating dates, eg for date-of-birth (future dates could then be filtered using current date). Note that (0[1-9]|[12][0-9]|3[01]) checks numbers 01-31, but won’t prevent 31st February.
Step 16 Back to basics
Now we have the basics, and can string them together, don’t neglect the grep basics – here we’re looking at how many attempts at unauthorised access were made by SSH in a given period. An unnecessary pipe replaced with grep -c.
Step 17 Why vi?
Whatever your position in the venerable and affectionate vi/Emacs war, there will be times and servers where vi is your only tool, so grab yourself a cheat-sheet. Vi and vim mostly follow BRE. Here we see one of the \< \> word boundaries.
Step 18 Boundary guard
As well as ^ and $ for line ends, word boundaries can be matched in regexps with \b – enabling matches on, say, ‘hat’ without matching ‘chatter’. The escape character, \, is used to add a number of extra elements, such as \d for numerical digit.
Step 19 Literally meta
Speaking of boundaries, placing \Q \E around a regexp will treat everything within as literals rather than metacharacters – meaning you can just quote a part of the regexp, unlike grep -F where everything becomes a literal.
Step 20 Lazy = good
Time to think about good practice. * is a greedy operator, expanding something like <.*> by grabbing the last closing tag and anything between, including further tags. <.*?> is non- greedy (lazy), taking the first closing tag.
Step 21 Perl -pie
Aside from grep, Perl remains the most comfortable fit with regexps, as is far more powerful than the former. With perl -pie on the command line, you can perform anything from simple substitutions on one or more files, to…
Step 22 Perl one-liner
…counting the empty lines in a text file (this from Krumin’s Perl One-Liners, see next month’s book reviews). /^$/ matches an empty line; note Perl’s use of // to delimit a regexp; ,, could also be used if / is one of the literals used.
Step 23 A regexp too far
Now you know the basics, you can build slightly more complicated regexps – but, as Jeff Atwood said: “Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint, only when appropriate.”
Step 24 Tagged offender
Finally, know the limitations of regexps. Don’t use on HTML, as they don’t parse complex languages well. Here the legendary StackOverflow reply by Bob Ince to a query on their use with HTML expresses the passion this question engenders.

No comments:

Post a Comment