Michael de Silva's Blog

Software Engineer. Rubyist and Roboticist.

HOWTO Grok Regular Expressions

I'm not ashamed to say that, over the years, I've been more of a consumer of publicly crafted regexes than an author of my own. The texts on the subject always seemed cryptic enough that I never bothered to look into it seriously, and boy-oh-boy do I wish I had done this sooner!

Today, I took that step and am already on my way to enlightenment thanks to:

  1. #regex on irc.freenode.net and ShiningThrough for his help
  2. http://regex101.com/ which is the best online regex tool I've come across to date (thanks ShiningThrough!)
  3. http://regexone.com/ for its interactive lessons on grokking regexes

As I make my way through these lessons, I'm compiling my solutions in a Gist for reference, and it would be great if you could share your solutions as well. Here are a couple of examples:

#### Lesson 9 - http://regex101.com/r/yZ4qG2

Objective:

    match   1.   abc
    match   2.    abc
    match   3.                   abc
    skip    4.abc

Solutions:

    ([\d.]+\s+[a-c]+)           # matches any whitespace
    ([\d.]+[ \t]+[a-c]+)        # specifically matches spaces and tabs only
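
To see how the two approaches behave, here is a minimal Ruby sketch that runs them against the lesson's sample lines. The constant names are my own, and the tab in the second sample is an assumption, since the lesson text does not show which whitespace character is used. Note that inside a character class, quotes and `|` are literal characters, so "space or tab" is written as `[ \t]`.

```ruby
# Lesson 9 patterns: one accepts any whitespace, the other only spaces and tabs.
ANY_WHITESPACE = /([\d.]+\s+[a-c]+)/
SPACES_OR_TABS = /([\d.]+[ \t]+[a-c]+)/

# Sample lines from the objective; the tab in the second line is an assumption.
LESSON_9_LINES = ["1.   abc", "2.\tabc", "3.                   abc", "4.abc"].freeze

LESSON_9_LINES.each do |line|
  puts format("%-28s any=%-5s spaces_or_tabs=%s",
              line.inspect,
              line.match?(ANY_WHITESPACE),
              line.match?(SPACES_OR_TABS))
end
```

Both patterns accept the first three lines and reject `4.abc`. They only differ on other whitespace: `"1.\nabc"` matches the `\s+` version but not the `[ \t]+` one.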

#### Lesson 11 - http://regex101.com/r/hZ7kE1

Objective:

    match   file_a_record_file.pdf
    match   file_yesterday.pdf
    skip    testfile_fake.pdf.tmp    

Solutions:

    ([a-z_]+)\.pdf$
    ([a-z_]+)(?=\.pdf$)         # using positive lookahead
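
The practical difference between the two is what ends up in the overall match. Here is a small Ruby sketch of that, using constant names of my own choosing: the lookahead checks for `.pdf` without consuming it.

```ruby
# Lesson 11 patterns: a consumed suffix versus a positive lookahead.
WITH_SUFFIX    = /([a-z_]+)\.pdf$/
WITH_LOOKAHEAD = /([a-z_]+)(?=\.pdf$)/

name = "file_yesterday.pdf"

consumed = name.match(WITH_SUFFIX)
puts consumed[0]  # whole match includes the suffix: file_yesterday.pdf
puts consumed[1]  # capture group 1: file_yesterday

peeked = name.match(WITH_LOOKAHEAD)
puts peeked[0]    # whole match stops before the suffix: file_yesterday
puts peeked[1]    # capture group 1: file_yesterday

# The .tmp file is skipped by both patterns (match returns nil).
p "testfile_fake.pdf.tmp".match(WITH_SUFFIX)
p "testfile_fake.pdf.tmp".match(WITH_LOOKAHEAD)
```

With the lookahead you can use the whole match directly, without reaching for a capture group.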

There's a whole lot more to regexes and I'm really liking what I've learned thus far, especially the power of lookarounds. Also remember that regex results will vary depending on the engine in question, e.g. whether it's Perl, PCRE, or even Vim's take on regexes, and on how well each language supports a given feature.

Fellow Rubyists should also take note that \A and \z, the start-of-string and end-of-string anchors, are the ones to use in Ruby regexes when you mean to match the whole string. Ruby differs from most other languages in that ^ and $ always match at the start and end of each line, which is the behaviour other engines only enable in "multiline mode". (Ruby's own /m flag does something different: it makes . match newlines.)
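
Here is a short Ruby sketch of why this matters, for instance when validating input. The constant names and the sample strings are made up for illustration.

```ruby
# ^ and $ anchor to lines in Ruby; \A and \z anchor to the whole string.
LINE_ANCHORED   = /^admin$/
STRING_ANCHORED = /\Aadmin\z/

multi_line = "harmless\nadmin"

puts multi_line.match?(LINE_ANCHORED)    # true: the second line is exactly "admin"
puts multi_line.match?(STRING_ANCHORED)  # false: the string as a whole is not "admin"
puts "admin".match?(STRING_ANCHORED)     # true
```

A check written with ^ and $ passes as soon as any single line matches, so \A and \z are the safer choice when validating a whole string.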

You should at least have a read through Exploring Ruby’s Regular Expression Algorithm for details of its fascinating internals.
