This article teaches you how to use Lua's pattern matching language. The pattern matching language (patterns for short) provides advanced tools for searching and replacing recurring patterns in strings. These tools can be used for writing text data parsers, custom formatters and many other things that would otherwise take hundreds of lines of code.
Much of the theory in this article is either copied or rewritten from the Lua reference manual. See the manual's section on patterns for the original text.
An average pattern looks like this:
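A representative sketch, assuming the `[%w_]` set discussed later in this article followed by a `+` repetition. The sample strings and variable names are illustrative only:

```lua
-- Assumed pattern: one or more letters, digits or underscores.
local pattern = "[%w_]+"

-- string.match returns the first part of the string that fits the pattern.
local first = string.match("hi_there = 5", pattern)
local second = string.match("  h0w_are_you()", pattern)

print(first)  --> hi_there
print(second) --> h0w_are_you
```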
That specific pattern could be used for finding variable names (such as "hi_there", "h0w_are_you" etc.). What each character in the pattern does will be explained later in this article.
These functions can be used together with patterns: string.find, string.match, string.gmatch and string.gsub.
I will try to use all of these functions and explain how each of them works in detail.
There are a bunch of special characters that either escape other characters, or modify the pattern in some way.
These characters are: ^ $ ( ) % . [ ] * + - ?
They can also be used in the pattern as normal characters by prefixing them with a "%" character, so "%%" becomes "%", "%[" becomes "[", etc.
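A small sketch of the difference between an escaped and an unescaped special character. The sample strings are made up for illustration:

```lua
-- "%." matches only a literal dot.
local escaped = string.find("version 1.5", "1%.5")    --> 9
-- An unescaped "." matches any character, so "1x5" also matches.
local unescaped = string.find("version 1x5", "1.5")   --> 9
local noMatch = string.find("version 1x5", "1%.5")    --> nil

-- "%%" matches a literal percent sign.
local percent = string.match("50% done", "%d+%%")     --> "50%"

print(escaped, unescaped, noMatch, percent)
```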
Character classes represent a set of characters. They can be either predefined sets or custom sets that can consist of the same predefined sets, ranges or any single characters.
Available character classes (custom and predefined):
|%a|represents all letters (a to z, upper and lower case)|
|%c|represents all control characters (special characters "\t", "\n", etc.)|
|%d|represents all digits (0 to 9)|
|%l|represents all lowercase letters|
|%p|represents all punctuation characters (".", ",", etc.)|
|%s|represents all space characters (a normal space, tab, etc.)|
|%u|represents all uppercase letters|
|%w|represents all alphanumeric characters (all letters and digits)|
|%x|represents all hexadecimal digits (digits 0-9, letters a-f, and letters A-F)|
|%z|represents the character with representation 0 (the null character "\0")|
|%x|(where 'x' is any non-alphanumeric character) represents the character 'x' itself|
|[set]|represents all characters in 'set' as a union. You can see this used in the previous section. '[%w_]' will match any letter, digit or underscore|
|[^set]|represents the complement of the union 'set', so '[^%w_]' matches everything that is not a letter, digit or underscore|
- An upper case version of a predefined character class represents the complement of that class, so %A will match anything that is not a letter.
- Inside a set, the starting and ending points of a range are separated with a hyphen "-", so [0-5] will match a digit from zero to five and [a-c] will match a, b or c.
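The classes, complements and ranges above can be tried out with a few one-character matches. This is a minimal sketch; the sample string is made up:

```lua
local s = "Room 42, Floor 3"

local letter = string.match(s, "%a")            --> "R" (first letter)
local digit = string.match(s, "%d")             --> "4" (first digit)
local nonLetter = string.match(s, "%A")         --> " " (first non-letter)
local lowDigit = string.match(s, "[0-3]")       --> "2" (first digit in the range 0 to 3)
local punctuation = string.match(s, "[^%w%s]")  --> "," (not alphanumeric, not a space)

print(letter, digit, lowDigit, punctuation)
```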
Characters in a string match a pattern in the following ways:
- a single class will match a single character,
- a single class followed by "+" will match one or more repetitions of characters and will match the longest sequence,
- a single class followed by "-" will match zero or more repetitions of characters and will match the shortest sequence,
- a single class followed by "*" will match zero or more repetitions of characters and will match the longest sequence,
- a single class followed by "?" will match one or zero characters,
- %n (where n is a digit between 1 and 9) will match the same text as the nth capture (see next section),
- %bxy (where x and y are two different characters) will match a balanced string that starts with x and ends with y, so "%b()" will match a string that starts with "(" and ends with the matching ")".
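The repetition rules are easiest to see side by side. The following is a sketch with made-up sample strings, not an exhaustive test of every item:

```lua
-- "+" needs at least one character and takes the longest sequence.
local digits = string.match("id: 12345", "%d+")              --> "12345"

-- "*" takes the longest sequence, "-" the shortest.
local greedy = string.match("<b>bold</b>", "<.*>")           --> "<b>bold</b>"
local lazy = string.match("<b>bold</b>", "<.->")             --> "<b>"

-- "?" makes the preceding class optional.
local british = string.match("colour", "colou?r")            --> "colour"
local american = string.match("color", "colou?r")            --> "color"

-- "%b()" matches a balanced pair of parentheses, nesting included.
local call = string.match("print(f(x), 1) -- done", "%b()")  --> "(f(x), 1)"

-- "%1" must match the same text as the first capture (the opening quote).
local quote, text = string.match('say "hi" now', "([\"'])(.-)%1")

print(digits, greedy, lazy, call, text)
```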
Patterns can be anchored like so:
- starting the pattern with "^" will match a string at the beginning,
- ending the pattern with "$" will match a string at the end,
- not anchoring the pattern will match a string at any position.
These two characters only have a special meaning when positioned as stated above (apart from "^" at the start of a set, which negates the set). At any other position in a pattern, they represent themselves.
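A short sketch of anchored and unanchored searches, using string.find and a made-up sample string:

```lua
local s = "hello world"

local atStart = string.find(s, "^hello")     --> 1   ("hello" is at the beginning)
local notAtStart = string.find(s, "^world")  --> nil ("world" is not at the beginning)
local atEnd = string.find(s, "world$")       --> 7   ("world" is at the end)
local anywhere = string.find(s, "world")     --> 7   (unanchored, found at any position)

-- "$" is not the last character of this pattern, so it is a literal dollar sign.
local dollar = string.find("cost: $5", "$5")  --> 7

print(atStart, notAtStart, atEnd, anywhere, dollar)
```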
Patterns can also contain sub-patterns enclosed in "()", which are called captures. Captures are used in functions like string.match and string.gsub to return or substitute a specific part of the match. Examples of how to use these can be found below.
Now I'm going to show you how to actually use all that stuff above. The examples below explain how to use the four functions listed above.
string.find( str, pattern, start, plain ) searches a string for a pattern. str is the string to search, pattern is the pattern string to find, start is the start index and plain is a boolean that, when true, turns off pattern matching and performs a plain text search instead. The function returns the start and end indices (not start index and length) of the matching substring. If the pattern has captures, they will be returned after the indices. If a match couldn't be found, the function returns nil.
The following code will find the first word in the string.
You probably think that this could be done with string.Explode and a few loops, but look, we did it in three lines.
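One way to write those three lines, as a sketch. The sample string and variable names are assumptions:

```lua
local str = "Hello world, how are you?"
local startPos, endPos = string.find(str, "%a+")  -- first run of letters
print(string.sub(str, startPos, endPos))          --> Hello
```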
The following code will check if a string is safe to be used as a file name, by comparing it with a set of restricted characters.
string.find returns nil if no match is found. This means we can use boolean logic to print "unsafe" if a match is made, and "safe" otherwise.
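A minimal sketch of such a check. The restricted set below (the characters Windows disallows in file names) and the helper name checkFileName are assumptions chosen for illustration:

```lua
-- Assumed set of restricted characters: \ / : * ? " < > |
-- "*" and "?" are escaped with "%" so they are read as plain characters.
local restricted = '[\\/:%*%?"<>|]'

-- Hypothetical helper: returns "unsafe" if any restricted character is found.
local function checkFileName(name)
    return string.find(name, restricted) and "unsafe" or "safe"
end

print(checkFileName("notes.txt"))  --> safe
print(checkFileName("a/b?.txt"))   --> unsafe
```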
string.match( str, pattern, start ) looks for the first match of a pattern in a string. str is the string to search, pattern is the pattern to find and start is the start position. If there is a match, the function returns the captures from the pattern; if there are no captures, it returns the whole match. If a match couldn't be found, the function returns nil.
The following code will parse a simple keyvalue line.
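A sketch of such a parser, assuming a line of the form "key = value" where both sides are alphanumeric. The sample line is made up:

```lua
local line = "health = 100"

-- Two captures: the key before "=" and the value after it.
-- "%s*" allows any amount of whitespace around the "=".
local key, value = string.match(line, "(%w+)%s*=%s*(%w+)")

print(key, value)  --> health	100
```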
The following code will check if the string ends with a .lua extension.
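A sketch of the extension check. The helper name isLuaFile is hypothetical:

```lua
-- "%." is a literal dot and "$" anchors the match to the end of the string.
local function isLuaFile(name)
    return string.match(name, "%.lua$") ~= nil
end

print(isLuaFile("init.lua"))      --> true
print(isLuaFile("init.lua.bak"))  --> false
print(isLuaFile("readme.txt"))    --> false
```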
string.gmatch( str, pattern ) finds every match of a pattern in a string. str is the string to search and pattern is the pattern to search for. The function returns an iterator function (a special function used by for loops) that goes through every match in the string and returns the pattern's captures, if there are any, or the whole match if there are no captures. If no match can be found, the function does not return nil but an 'empty' iterator function, so the loop body never runs.
The following code goes through every word in the string.
Any pattern that you use in string.match can also be used in gmatch, but instead of finding only the first match, it will find every match in the string.
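A sketch of such a loop, collecting the words into a table so they can be printed together. The sample sentence is made up:

```lua
local words = {}

-- Each iteration yields the next run of letters in the string.
for word in string.gmatch("The quick brown fox", "%a+") do
    words[#words + 1] = word
end

print(table.concat(words, ", "))  --> The, quick, brown, fox
```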
The following code uses the keyvalue parsing pattern but can now read a list of keyvalues.
The interesting thing is that the string can have any characters as separators between keyvalue pairs.
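A sketch of the list version, reusing an assumed key-value pattern with two captures. The input string deliberately mixes separators to show that anything between pairs is skipped:

```lua
local input = "width=800; height=600, depth = 32"
local settings = {}

-- With two captures, the iterator returns two values per match.
for key, value in string.gmatch(input, "(%w+)%s*=%s*(%w+)") do
    settings[key] = value
end

print(settings.width, settings.height, settings.depth)  --> 800	600	32
```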
string.gsub( str, pattern, repl ) replaces matches of a pattern in a string. str is the string to search in, pattern is the pattern to search for and repl is the value to replace with. The function returns str where all occurrences of pattern have been replaced with the value given by repl and, as the second return value, the total number of matches.
repl can be the following things:
- a string - in which case all occurrences of pattern are replaced with this string; the "%n" item is also supported, with the special case of "%0" representing the whole match,
- a function - in which case the passed function gets called with the match/captures as its argument(s) each time a match occurs, and the match is replaced with the value returned by the function,
- a table - in which case the match is replaced with the value indexed by the first capture (or by the whole match if there are no captures).
If the function or table returns false or nil, the original match is kept and nothing gets replaced.
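A sketch of the string and table forms of repl. The table contents and sample strings are made up for illustration:

```lua
-- String repl: "%0" stands for the whole match.
local wrapped = string.gsub("hello world", "%a+", "[%0]")
print(wrapped)  --> [hello] [world]

-- Table repl: each match is looked up as a key.
local names = { hp = "health", mp = "mana" }
local expanded, count = string.gsub("hp and mp and xp", "%a+", names)

-- "and" and "xp" are not keys in the table, so they are kept as they are,
-- but they still count as matches.
print(expanded)  --> health and mana and xp
print(count)     --> 5
```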
The following code formats a keyvalue pair as an xml node.
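A sketch of that conversion, assuming the same "key = value" shape as before. "%1" and "%2" in the replacement string refer to the first and second capture:

```lua
local line = "health = 100"

-- The key becomes the tag name and the value becomes the node's content.
local xml, count = string.gsub(line, "(%w+)%s*=%s*(%w+)", "<%1>%2</%1>")

print(xml)    --> <health>100</health>
print(count)  --> 1
```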
The following example creates a function that works like the .NET formatting feature.
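A sketch of such a function, assuming .NET-style placeholders of the form {0}, {1} and so on. The name formatString is hypothetical, and placeholders without a matching argument are left untouched:

```lua
-- Hypothetical helper: replaces {n} with the (n+1)th extra argument.
local function formatString(fmt, ...)
    local args = { ... }
    -- The outer parentheses drop gsub's second return value (the match count).
    return (string.gsub(fmt, "{(%d+)}", function(index)
        -- .NET indices start at 0, Lua tables start at 1.
        local value = args[tonumber(index) + 1]
        if value ~= nil then
            return tostring(value)
        end
        -- Returning nothing keeps the original placeholder.
    end))
end

print(formatString("{0} has {1} points", "Player", 42))  --> Player has 42 points
print(formatString("{0} and {5}", "first"))              --> first and {5}
```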
The article is finally over! I hope you learned something new from all of this. Lua's patterns are very powerful when used right. When making an addon that relies heavily on strings, patterns will most likely come in handy. You can find more examples in either the Lua manual or Programming in Lua (PIL).