>> Regular Expression

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

Regular expressions are difficult to write and maintain. Instead of harping on the problems, I want to explore what Emacs offers to make writing them easier. In particular, I want to tackle the rx macro: the regular s-expression, or lispy regular expression.

(require 's)  ;; All we need is =s-matches-p=
(require 'rx)

;; Creating a regexp that will match -> <File> [<Line>:<Column>] <Suggestion>
(setq this-file-name "blog.org")

(s-matches-p
 (rx bol
     (eval this-file-name)
     space
     "[" (group (one-or-more digit)) ":" (group (one-or-more digit)) "]"
     space
     (group (zero-or-more anything))
     eol)
 "blog.org [17:16] Emacs Lisp, not emacs lisp")

;; Produced regexp, I do not want to write or maintain this by hand
"^blog\\.org[[:space:]]\\[\\([[:digit:]]+\\):\\([[:digit:]]+\\)][[:space:]]\\(\\(?:.\\|
\\)*\\)$"

Although it is less concise, the example above illustrates the selling point of writing regular expressions at a higher level: the result is more understandable, more comfortable to write, and easier to maintain. Moreover, the "lispyness" of the expressions fits the style and heart of Emacs: working with symbolic expressions.
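To go one step beyond a yes-or-no match, the groups in the expression above can be read back with match-string. The following is a minimal sketch using only built-in functions; my/parse-lint-line is a hypothetical helper, and the file name is written inline to keep the block self-contained.

```elisp
(require 'rx)

;; Hypothetical helper, not part of the original snippet.
(defun my/parse-lint-line (line)
  "Return (LINE-NUMBER COLUMN SUGGESTION) parsed from LINE, or nil."
  (save-match-data
    (when (string-match
           (rx bol
               "blog.org"
               space
               "[" (group (one-or-more digit)) ":" (group (one-or-more digit)) "]"
               space
               (group (zero-or-more anything))
               eol)
           line)
      ;; Groups are numbered by the order of their opening parenthesis.
      (list (string-to-number (match-string 1 line))
            (string-to-number (match-string 2 line))
            (match-string 3 line)))))

(my/parse-lint-line "blog.org [17:16] Emacs Lisp, not emacs lisp")
;; => (17 16 "Emacs Lisp, not emacs lisp")
```

A line that does not follow the format simply yields nil, so the helper can be used as a filter as well as a parser.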

The built-in rx macro has no obvious manual, but its documentation is available via describe-function. For such a powerful idea, there are few strong examples on the wiki or the web to promote it. Hackers before users. To be fair, reading the documentation is enough, but examples or recipes would hasten comprehension. That is what this article provides: each of the following sections explores some syntax or construct. (As an aside, the problems I use come from Regular Expressions Cookbook; if you find them intriguing, support the authors and buy the book.)

>> Strings And Quoting

STRING
     matches string STRING literally.

CHAR
     matches character CHAR literally.

(eval FORM)
     evaluate FORM and insert result.  If result is a string,
     ‘regexp-quote’ it.

PROBLEM: What (regular) expression matches this string: The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}

;; Escape the double quote and the backslash here
(setq input "The punctuation characters in the ASCII table are: !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}")

(s-matches-p (rx "The punctuation characters in the ASCII table are: !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}")
             input) ;; Direct use of strings

(not (s-matches-p input input)) ;; Does not match: the syntax characters are not quoted
(s-matches-p (regexp-quote input) input)

(s-matches-p (rx (eval input)) input) ;; More rx

This problem is merely a matter of quoting, or escaping, the syntax characters, provided you know what those characters are. The function regexp-quote, which escapes them, is simple enough, and rx applies it by default to any string it is given. Finally, string variables can be used through the eval syntax, which is like inserting values with backquoting.
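The effect of quoting is easier to see on a shorter string. This sketch compares a raw string, regexp-quote, and rx on the same text; my/text is a hypothetical name chosen for illustration.

```elisp
(require 'rx)

;; Hypothetical sample text containing the syntax characters + and ?.
(defconst my/text "1+1=2?")

;; Used raw, + and ? act as operators, so the text does not match itself.
(string-match-p my/text my/text)                 ;; => nil

;; Quoted, every character is taken literally.
(string-match-p (regexp-quote my/text) my/text)  ;; => 0

;; rx quotes string arguments on its own.
(string-match-p (rx "1+1=2?") my/text)           ;; => 0

(regexp-quote my/text)                           ;; => "1\\+1=2\\?"
```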

>> Variables And Ranges

(any SET ...)
(in SET ...)
(char SET ...)
     matches any character in SET ....  SET may be a character or string.
     Ranges of characters can be specified as ‘A-Z’ in strings.
     Ranges may also be specified as conses like ‘(?A . ?Z)’.

     SET may also be the name of a character class: ‘digit’,
     ‘control’, ‘hex-digit’, ‘blank’, ‘graph’, ‘print’, ‘alnum’,
     ‘alpha’, ‘ascii’, ‘nonascii’, ‘lower’, ‘punct’, ‘space’, ‘upper’,
     ‘word’, or one of their synonyms.

PROBLEM: Create one regular expression to match all common misspellings of calendar, so you can find this word in a document without having to trust the author’s spelling ability. Allow an a or e to be used in each of the vowel positions.

(s-matches-p (rx "c"
                 (any "a" "e")
                 "l"
                 (any "a" "e")
                 "nd"
                 (any "a" "e")
                 "r")
             "celander")

(setq misspelling-pattern `(any "a" "e"))

(s-matches-p (rx "c"
                 (eval misspelling-pattern)
                 "l"
                 (eval misspelling-pattern)
                 "nd"
                 (eval misspelling-pattern)
                 "r")
             "calendar")

"c[ae]l[ae]nd[ae]r" ;; Generated pattern

Aside from demonstrating a simple range construct, the use of sub-patterns through the familiar eval allows us to treat these expressions more modularly, which helps us move away from a monolithic concatenated string.
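As an alternative to eval, newer versions of rx can name a sub-pattern locally with rx-let. This sketch assumes Emacs 27 or later, where rx-let is available; the names my/calendar-regexp and vowel are hypothetical.

```elisp
(require 'rx)

;; Assumes Emacs 27+, which provides `rx-let'.
(defun my/calendar-regexp ()
  "Return a regexp matching calendar with a or e in each vowel position."
  (rx-let ((vowel (any "a" "e")))   ; local name for the sub-pattern
    (rx string-start "c" vowel "l" vowel "nd" vowel "r" string-end)))

(string-match-p (my/calendar-regexp) "celander")  ;; => 0
(string-match-p (my/calendar-regexp) "colendar")  ;; => nil
```

The named sub-pattern reads like any other rx symbol, so no global variable is needed for it.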

PROBLEM: Create a regular expression to match a single hexadecimal character.

(s-matches-p (rx (any "a-f" "A-F" "0-9"))
             "A")
(s-matches-p (rx (in "a-f" "A-F" "0-9"))
             "A") ;; Equivalently

"[0-9A-Fa-f]" ;; Generated pattern


(s-matches-p (rx (char hex-digit))
             "d") ;; More rx
(s-matches-p (rx hex-digit)
             "d") ;; Equivalently

"[[:xdigit:]]" ;; Generated pattern

Lastly, the range syntax accepts the familiar dashes to specify character ranges. Beyond that, the abstraction of special character classes like [:upper:] or [:xdigit:] is nice to know. Other useful constructs such as word-start, line-end, and punctuation exist and are worth exploring.
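As a small taste of those other constructs, here is a sketch that combines word-start and word-end with hex-digit to find a whole hexadecimal literal. The 0x prefix and the name my/hex-literal-regexp are assumptions made for the example.

```elisp
(require 'rx)

;; Hypothetical pattern: a 0x-prefixed hexadecimal literal as a whole word.
(defconst my/hex-literal-regexp
  (rx word-start "0x" (one-or-more hex-digit) word-end))

(string-match-p my/hex-literal-regexp "value = 0xFF;")  ;; => 8
(string-match-p my/hex-literal-regexp "value = 0xZZ;")  ;; => nil
```

Word boundaries follow the syntax table of the current buffer, so results can differ in modes that redefine what counts as a word character.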

>> Alternatives And Depth

(or SEXP1 SEXP2 ...)
(| SEXP1 SEXP2 ...)
     matches anything that matches SEXP1 or SEXP2, etc.  If all
     args are strings, use ‘regexp-opt’ to optimize the resulting
     regular expression.

(zero-or-one SEXP ...)
(optional SEXP ...)
(opt SEXP ...)
     matches zero or one occurrences of A.

(and SEXP1 SEXP2 ...)
(: SEXP1 SEXP2 ...)
(seq SEXP1 SEXP2 ...)
(sequence SEXP1 SEXP2 ...)
     matches what SEXP1 matches, followed by what SEXP2 matches, etc.

(repeat N SEXP)
(= N SEXP ...)
     matches N occurrences.

PROBLEM: Create a regular expression that when applied repeatedly to the text Mary, Jane, and Sue went to Mary's house will match Mary, Jane, Sue and then Mary again.

(s-match-strings-all
 (rx (or "Mary" "Jane" "Sue"))
 "Mary, Jane, and Sue went to Mary's house")

;; Output
'(("Mary") ("Jane") ("Sue") ("Mary"))

;; Generated pattern
"\\(?:Jane\\|Mary\\|Sue\\)"

This simple problem demonstrates the alternation construct, which is related to ranges and classes. Nothing fancy, but the possibility of making it more nuanced exists.
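"Applied repeatedly" means restarting the search where the previous match ended. This sketch spells that loop out with the built-in string-match instead of s-match-strings-all; my/all-matches is a hypothetical helper.

```elisp
(require 'rx)

;; Hypothetical helper showing the repeated-search loop explicitly.
(defun my/all-matches (regexp string)
  "Return every non-overlapping match of REGEXP in STRING, in order."
  (save-match-data
    (let ((start 0)
          (matches '()))
      (while (and (<= start (length string))
                  (string-match regexp string start))
        (push (match-string 0 string) matches)
        ;; Resume after this match; step one further on an empty match
        ;; so the loop always advances.
        (setq start (if (= (match-beginning 0) (match-end 0))
                        (1+ (match-end 0))
                      (match-end 0))))
      (nreverse matches))))

(my/all-matches (rx (or "Mary" "Jane" "Sue"))
                "Mary, Jane, and Sue went to Mary's house")
;; => ("Mary" "Jane" "Sue" "Mary")
```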

PROBLEM: Create a regular expression matching 0 to 255.

(setq range-expression ;; Expression and pattern separated for reuse
      `(or "0"
           (sequence "1" (optional digit (optional digit)))
           (sequence "2" (optional
                          (or
                           (sequence (any "0-4") (optional digit))
                           (sequence "5" (optional (any "0-5")))
                           (any "6-9"))))
           (sequence (any "3-9") (optional digit))))

(setq range-pattern (rx bol (eval range-expression) eol))

;; A test for the regular expression
(require 'cl-lib) ;; Provides cl-every; the old cl library is deprecated
(cl-every (lambda (number)
            (s-matches-p range-pattern (number-to-string number)))
          (number-sequence 0 255))

;; Generated pattern (shown without the bol and eol anchors)
"0\\|1\\(?:[[:digit:]][[:digit:]]?\\)?\\|2\\(?:[0-4][[:digit:]]?\\|5[0-5]?\\|[6-9]\\)?\\|[3-9][[:digit:]]?"

;; Reusing the expression for IPv4 addresses
(setq ip4-pattern (rx bol
                      (repeat 3 (sequence (eval range-expression) "."))
                      (eval range-expression)
                      eol))

(s-matches-p range-pattern "30")  ;; Matches
(s-matches-p ip4-pattern "300")   ;; Does not match: not a dotted quad
;; Testing every combination would take too long; one sample is good enough
(s-matches-p ip4-pattern "61.12.234.30")

;; Generated pattern (shown without the bol and eol anchors)
"\\(?:\\(?:0\\|1\\(?:[[:digit:]][[:digit:]]?\\)?\\|2\\(?:[0-4][[:digit:]]?\\|5[0-5]?\\|[6-9]\\)?\\|[3-9][[:digit:]]?\\)\\.\\)\\{3\\}\\(?:0\\|1\\(?:[[:digit:]][[:digit:]]?\\)?\\|2\\(?:[0-4][[:digit:]]?\\|5[0-5]?\\|[6-9]\\)?\\|[3-9][[:digit:]]?\\)"

The idea of this expression is to match the first digit and then consider the branches. Even without an in-depth explanation, the syntax should make the intent readable, but three new constructs deserve a few words. First, optional (or opt) is the equivalent of the zero-or-one construct. Second, sequence (or seq) is primarily an expression wrapper that groups several forms where a single form is expected. Third, repeat is the counted repetition operator, \{N\} in a raw regexp, applied to the pattern it wraps. New syntax aside, the problem is just an exercise in flexing it.

Also, remember to write tests for regular expressions. I made three mistakes in my first draft, so test before publishing. Interestingly, regular expressions are like functions that can be property checked.

Before I forget, the eval construct is evaluated when the rx macro is expanded, so the variables must already exist at that point; in practice, they have to be set globally via setq before being used. That is why the snippet uses two setters, one for the expression and one for the pattern. As a refactoring, I suggest defining the expression or pattern with defconst or defvar. It is unfortunate that let will not work with eval, but it isn't a huge cost.
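When a let-bound expression is really wanted, one option is rx-to-string, the function counterpart of the macro, which takes the form as ordinary data at run time. This is a sketch under that assumption, reusing the 0 to 255 expression; my/octet-regexp is a hypothetical name.

```elisp
(require 'rx)

;; Hypothetical helper: builds the regexp at run time from a local variable.
(defun my/octet-regexp ()
  "Return a regexp matching a whole string that is a number from 0 to 255."
  (let ((octet '(or "0"
                    (seq "1" (optional digit (optional digit)))
                    (seq "2" (optional
                              (or (seq (any "0-4") (optional digit))
                                  (seq "5" (optional (any "0-5")))
                                  (any "6-9"))))
                    (seq (any "3-9") (optional digit)))))
    ;; Backquote splices the let-bound form into the larger expression.
    (rx-to-string `(seq bol ,octet eol))))

(string-match-p (my/octet-regexp) "255")  ;; => 0
(string-match-p (my/octet-regexp) "256")  ;; => nil
```

The trade-off is that the regexp is rebuilt on every call instead of once at macro expansion, so a pattern used in a hot loop is better stored in a variable.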

>> Groups And Backreferences

(submatch SEXP1 SEXP2 ...)
(group SEXP1 SEXP2 ...)
     like ‘and’, but makes the match accessible with ‘match-end’,
     ‘match-beginning’, and ‘match-string’.

(submatch-n N SEXP1 SEXP2 ...)
(group-n N SEXP1 SEXP2 ...)
     like ‘group’, but make it an explicitly-numbered group with
     group number N.

PROBLEM: Create a regular expression that matches any date in yyyy-mm-dd format and separately captures the year, month, and day. As an extra challenge, make the groups named.

(setq date-pattern
   (rx (group-n 3 (repeat 4 digit))
       "-"
       (group-n 2 (repeat 2 digit))
       "-"
       (group-n 1 (repeat 2 digit))))

(s-match-strings-all date-pattern
                     (format-time-string "%F"))

;; Pattern and output; notice the groups come back as day, month and year, the reverse of their order in the text
"\\(?3:[[:digit:]]\\{4\\}\\)-\\(?2:[[:digit:]]\\{2\\}\\)-\\(?1:[[:digit:]]\\{2\\}\\)"
'(("2017-03-30" "30" "03" "2017"))

Capturing groups are fundamental; however, this is where the syntax needs work. Named groups aren't possible here; instead, we are limited to numbered groups. To be precise, this is not a limitation of the macro but of the Emacs Lisp regex syntax itself; a more domain-specific version could be tuned. This example just shows that not every feature is translated.

The group-n and group syntax is obvious in intention. For group-n, the first argument represents the group number and the rest are the actual expression. Nothing fancy.
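Since named groups are unavailable, one workaround is to keep the names in a small table next to the pattern. This is only a sketch of that idea, not a feature of rx; my/date-groups, my/date-regexp, and my/date-field are hypothetical names.

```elisp
(require 'rx)

;; Hypothetical name-to-number table standing in for named groups.
(defconst my/date-groups '((year . 1) (month . 2) (day . 3)))

(defconst my/date-regexp
  (rx string-start
      (group-n 1 (repeat 4 digit)) "-"
      (group-n 2 (repeat 2 digit)) "-"
      (group-n 3 (repeat 2 digit))
      string-end))

(defun my/date-field (name string)
  "Return field NAME (year, month or day) of STRING as a number, or nil.
STRING must be a yyyy-mm-dd date for a non-nil result."
  (save-match-data
    (when (string-match my/date-regexp string)
      (string-to-number
       (match-string (cdr (assq name my/date-groups)) string)))))

(my/date-field 'month "2017-03-30")  ;; => 3
```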

PROBLEM: Create a regular expression that matches "magical" dates in yyyy-mm-dd format. A date is magical if the year minus the century, the month, and the day of the month are all the same number. For example, 2008-08-08 is a magical date.

(setq magical-pattern
   (rx
    (repeat 2 digit)
    (group-n 1 (repeat 2 digit))
    "-"
    (backref 1)
    "-"
    (backref 1)))

(s-matches-p magical-pattern
             "2008-08-08")

;; Generated pattern
"[[:digit:]]\\{2\\}\\(?1:[[:digit:]]\\{2\\}\\)-\\1-\\1"

This just shows that backreferences are available. The backref syntax matches again whatever text the group with the given number matched. Again, nothing complicated.
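Another common use of backreferences is spotting an accidentally doubled word. The sketch below is an illustration of backref on a different problem, not one taken from the text above; the names are hypothetical.

```elisp
(require 'rx)

;; Hypothetical pattern: a word, whitespace, then the same word again.
(defconst my/doubled-word-regexp
  (rx word-start
      (group-n 1 (one-or-more word))
      (one-or-more space)
      (backref 1)
      word-end))

(defun my/first-doubled-word (string)
  "Return the first word that appears twice in a row in STRING, or nil."
  (save-match-data
    (when (string-match my/doubled-word-regexp string)
      (match-string 1 string))))

(my/first-doubled-word "this is is a test")  ;; => "is"
```

The word-start and word-end anchors keep the pattern from firing on "this is", where "is" merely ends one word and starts the next.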

>> re-builder

To conclude this exploration, a UI exists for testing and experimenting with regular expressions: re-builder. Execute the command re-builder (or regexp-builder) in a buffer containing the text, then execute reb-change-syntax and select rx. The following screencast can be illuminating.

 

This UI can handle raw expressions, but we are interested in how it ties to rx. To elaborate, every time the expression is updated, it highlights all matches in the buffer. Although it is not as dynamic or programmatic, it is handy for quick experiments and checks.
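To skip the reb-change-syntax step each time, the default syntax can be set once. A minimal sketch, assuming it is placed in your init file:

```elisp
;; Assumption: this lives in your init file.
(require 're-builder)

;; Make re-builder start in rx syntax instead of its default.
(setq reb-re-syntax 'rx)
```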

>> Conclusion

This macro is not a replacement for learning regular expressions, since there are nuances that a DSL cannot cover, such as language-specific syntax like PCRE; rather, productivity is the key. As for me, I do not want to write raw regular expressions; I prefer an abstraction that is easier on the eyes and hands.

Finally, I did not discuss all constructs, only the interesting features that drew me in and that will perhaps enchant you as well. Read The Function Documentation.

If this can be done for regular expressions, can it be applied to SQL? An idea still waiting to be written.