<cat>Dev.Lua</cat>
<title>Patterns</title>
## What's this article for?
This article teaches you how to use Lua's pattern matching language. The pattern matching language (or `patterns` for short) provides advanced tools for searching and replacing recurring patterns in strings. These tools can be used for writing text data parsers, custom formatters and many other things that would otherwise take hundreds of lines of code.
A lot of the theory in this article is either copied or rewritten from the Lua reference manual. You can see the manual section on patterns [here](http://www.lua.org/manual/5.2/manual.html#6.4.1).
## Getting started
An average pattern looks like this:
```
[%w_]+
```
That specific pattern could be used for finding variable names (such as "hi_there", "h0w_are_you" etc.). What each character in the pattern does will be explained later in this article.
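As a quick taste before the details, here is a minimal sketch that runs this pattern through string.match (covered further down). The sample string is made up for illustration.

```lua
local str = "  h0w_are_you = 5"
-- [%w_]+ grabs the first run of letters, digits and underscores
local name = string.match( str, "[%w_]+" )
print( name ) -- h0w_are_you
```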
These functions can be used together with patterns:
* <page>string.find</page>
* <page>string.match</page>
* <page>string.gmatch</page>
* <page>string.gsub</page>
I will try to use all of these functions and explain how each of them works in detail.
## Special characters
There are a bunch of special characters that either escape other characters, or modify the pattern in some way.
These characters are:
```
^ $ ( ) % . [ ] * + - ?
```
They can also be used in the pattern as normal characters by prefixing them with a "%" character, so "%%" becomes "%", "%[" becomes "[", etc.
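The escaping rule above can be sketched like this. The sample strings are made up for illustration.

```lua
local str = "Price: 50% off (today only)"

-- "%%" matches a literal percent sign
local s, e = string.find( str, "%d+%%" )
print( s, e ) -- 8	10

-- "%(" and "%)" match literal parentheses
local inParens = string.match( str, "%(.+%)" )
print( inParens ) -- (today only)

-- an unescaped "." matches any character, "%." only matches a real dot
local anyPos = string.find( "a.b", "." )
local dotPos = string.find( "a.b", "%." )
print( anyPos, dotPos ) -- 1	2
```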
## Character classes
Character classes represent a set of characters. They can be either predefined sets or custom sets that can consist of the same predefined sets, ranges or any single characters.
Available character classes (custom and predefined):
| Class | Description |
| ----- | ------------ |
| ```%a``` | represents all letters (from a to z upper and lower case) |
| ```%c``` | represents all control characters (special characters "\t", "\n", etc.) |
| ```%d``` | represents all digits (from 0 to 9) |
| ```%g``` | represents all printable characters except space (same as '[\x21-\x7E]') |
| ```%l``` | represents all lowercase letters (any letter that is lower case) |
| ```%p``` | represents all punctuation characters (".", ",", etc.) |
| ```%s``` | represents all space characters (a normal space, tab, etc.) |
| ```%u``` | represents all uppercase letters (any letter that is upper case) |
| ```%w``` | represents all alphanumeric characters (all letters and numbers) |
| ```%x``` | represents all hexadecimal digits (digits 0-9, letters a-f, and letters A-F) |
| ```%z``` | represents the character with representation 0 (the null character "\0") |
| ```%x``` | (where 'x' stands for any non-alphanumeric character, not the hexadecimal class above) represents the character 'x' itself |
| ```%char``` | escapes a character if it doesn't represent any class (e.g. '%[(%d)%]' matches a digit inside of brackets) |
| ```[set]``` | represents all characters in 'set' as a union. You can see this used in the previous section. '[%w_]' will match any letter, digit or underscore |
| ```[^set]``` | represents the complement of the union 'set', so '[^%w_]' matches everything that is not a letter, digit or underscore |
* An upper case version of a predefined character class represents the **opposite** of that class, so `%A` will match anything that is **not** a letter.
* Inside a set, the starting and ending points of a range are separated with a hyphen "-", so `[0-5]` will match a digit from zero to five and `[a-c]` will match a, b or c.
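A few of the classes and sets from the table can be tried out like this. It is only a sketch with a made-up sample string. Each call returns the first single character that fits the class.

```lua
local str = "Room 42B, floor 7"

local digit    = string.match( str, "%d" )      -- first digit
local nonDigit = string.match( str, "%D" )      -- first character that is not a digit
local high     = string.match( str, "[5-9]" )   -- first digit in the range 5 to 9
local abc      = string.match( str, "[A-C]" )   -- first upper case A, B or C
local other    = string.match( str, "[^%w%s]" ) -- first character that is neither alphanumeric nor space

print( digit, nonDigit, high, abc, other ) -- 4	R	7	B	,
```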
### Repetition and anchoring
Characters in a string match a pattern in the following ways:
* a single class will match a single character,
* a single class followed by "+" will match one or more repetitions of characters and will match the longest sequence,
* a single class followed by "-" will match zero or more repetitions of characters and will match the shortest sequence,
* a single class followed by "*" will match zero or more repetitions of characters and will match the longest sequence,
* a single class followed by "?" will match one or zero characters,
* %n (where `n` is a digit between 1 and 9) will match the `n`th capture (see next section),
* %b`xy` will match a balanced substring that starts with `x` and ends with `y`, so "%b()" will match from a "(" up to its matching ")", including any nested pairs.
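The difference between the repetition items is easiest to see side by side. The following is a small sketch with made-up sample strings.

```lua
local str = "<b>bold</b> and <i>italic</i>"

-- "+" is greedy: it runs to the last ">" in the string
local greedy = string.match( str, "<.+>" )
-- "-" is lazy: it stops at the first ">"
local lazy = string.match( str, "<.->" )
print( greedy ) -- <b>bold</b> and <i>italic</i>
print( lazy )   -- <b>

-- "?" makes the "u" optional
local british = string.match( "colour", "colou?r" )
local american = string.match( "color", "colou?r" )
print( british, american ) -- colour	color

-- "%b()" matches a balanced pair of parentheses, nested ones included
local balanced = string.match( "f( a( b ) ) + g()", "%b()" )
print( balanced ) -- ( a( b ) )

-- "%1" must match the same text as the first capture (the opening quote here)
local quote, quoted = string.match( "say 'hi' or \"yo\"", "(['\"])(.-)%1" )
print( quote, quoted ) -- '	hi
```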
Patterns can be anchored like so:
* starting the pattern with "^" will match a string at the beginning,
* ending the pattern with "$" will match a string at the end,
* not anchoring the pattern will match a string at any position.
These two characters only have this meaning when positioned as stated above. At any other position they represent themselves, with one exception: a "^" at the start of a set (`[^set]`) negates the set, as described earlier.
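Anchoring can be sketched with string.find, which returns the start and end indices of the match or `nil`. The sample strings are made up.

```lua
local str = "hello world"

local s1, e1 = string.find( str, "^hello" )  -- anchored at the beginning
local s2     = string.find( str, "^world" )  -- "world" is not at the beginning
local s3, e3 = string.find( str, "world$" )  -- anchored at the end
local s4     = string.find( str, "^hello$" ) -- the whole string would have to be "hello"
local s5, e5 = string.find( "a^b", "a^b" )   -- "^" in the middle is a normal character

print( s1, e1 ) -- 1	5
print( s2 )     -- nil
print( s3, e3 ) -- 7	11
print( s4 )     -- nil
print( s5, e5 ) -- 1	3
```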
## Captures
Patterns can also contain sub-patterns enclosed in "()"; these are called captures. Captures are used in functions like string.match and string.gsub to return or substitute a specific part of the match. Examples of how to use these can be found below.
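As a first, minimal sketch of captures (the date string is made up), each pair of parentheses hands back its own piece of the match. Captures are numbered by the order of their opening parenthesis, and they are always returned as strings.

```lua
local date = "2014-03-27"
-- three captures, separated by escaped hyphens
local year, month, day = string.match( date, "(%d+)%-(%d+)%-(%d+)" )
print( year, month, day ) -- 2014	03	27

-- nested captures: the outer one opens first, so it is returned first
local whole, inner = string.match( "key=value", "(%w+=(%w+))" )
print( whole, inner ) -- key=value	value
```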
## Usage
Now I'm going to show you how to actually use all that stuff above. The examples below explain how to use the four functions listed above.
### string.find
```
string.find( string str, string pattern [, number start [, boolean plain ]] )
```
`str` is the string to search, `pattern` is the pattern string to find, `start` is the start index and `plain` is a boolean indicating whether to use a pattern search or just plain text search. The function returns the start and end indices (not start index and length) of the matching substring. If the pattern has captures, they will be returned after the indices. If a match couldn't be found, the function returns nil.
The following code will find the first word in the string.
```
local str = "1. Don't spam!"
local pattern = "([%a']+)" -- will match a substring that has one or more letters or apostrophes (')
local start, endpos, word = string.find( str, pattern )
print( start, endpos, word )
```
Output:
<br/><br/>
```
4 8 Don't
```
You probably think that this could be done with <page>string.Explode</page> and a few loops, but look, we did it in three lines.
The following code will check if a string is safe to be used as a file name, by comparing it with a set of restricted characters.
```
local str = "cry|*to"
local pattern = '[\\/:%*%?"<>|]' -- a set of all restricted characters
local start = string.find( str, pattern )
print( "String is "..( ( start ~= nil ) and "unsafe" or "safe" ) )
```
Output:
<br/><br/>
```
String is unsafe
```
<page>string.find</page> returns `nil` if no match is found. This means we can use boolean logic to print "unsafe" if a match is made, and "safe" otherwise.
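The `start` and `plain` arguments are not used in the examples above, so here is a short sketch of what they change. The sample string is made up.

```lua
local str = "a.b.c"

-- as a pattern, "." matches any character, so the first character is found
local s1, e1 = string.find( str, "." )
-- with plain = true, "." is searched for as literal text
local s2, e2 = string.find( str, ".", 1, true )
-- start = 3 skips the first dot
local s3, e3 = string.find( str, ".", 3, true )

print( s1, e1 ) -- 1	1
print( s2, e2 ) -- 2	2
print( s3, e3 ) -- 4	4
```

A plain search is handy when the text to look for comes from user input and may contain special characters.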
### string.match
```
string.match( string str, string pattern [, number start] )
```
`str` is the string to search, `pattern` is the pattern to find and `start` is the start position. If there is a match, the function returns the captures from the pattern; if there are no captures, it returns the whole match. If a match couldn't be found, the function will return `nil`.
The following code will parse a simple key-value line.
```
local str = "key= value"
--The following will match "variable name, 0 or more spaces, equals sign, 0 or more spaces, variable name":
local pattern = "([%w_]+)%s*=%s*([%w_]+)"
local k, v = string.match( str, pattern )
print( k, v )
```
Output:
<br/><br/>
```
key value
```
The following code will check if the string ends with a .lua extension.
```
local str = "teel.lua"
local pattern = ".+%.lua$" -- anything until a dot and "lua" at the end of the string
local match = string.match( str, pattern )
print( "String ends with "..( ( match ) and ".lua" or "something else" ) )
```
Output:
<br/><br/>
```
String ends with .lua
```
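The return rules and the `start` argument can be sketched as follows, using a made-up sample string.

```lua
local str = "x=10, y=20"

local first  = string.match( str, "%d+" )     -- no captures: the whole match is returned
local second = string.match( str, "%d+", 5 )  -- searching from index 5 skips "10"
local y      = string.match( str, "y=(%d+)" ) -- one capture: only the capture is returned
local z      = string.match( str, "z=(%d+)" ) -- no match at all

print( first, second, y, z ) -- 10	20	20	nil
```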
### string.gmatch
```
string.gmatch( string str, string pattern )
```
`str` is the string to search and `pattern` is the pattern to search for. The function returns an iterator function (a function that a `for` loop calls repeatedly) that goes through every match in the string and returns the pattern's captures, if there are any, or the whole match if there are no captures. If no match can be found, the function does not return nil; it still returns an iterator, but the loop body never runs.
The following code goes through every word in the string.
```
local str = "This is PATTERNS"
local pattern = "%w+" -- will match any word
for word in string.gmatch( str, pattern ) do
	print( word )
end
```
Output:
<br/><br/>
```
This
is
PATTERNS
```
Any pattern that you use in string.match can also be used in gmatch, but instead of finding only the first match, it will find every match in the string.
The following code reuses the key-value parsing pattern, but can now read a list of key-value pairs.
```
local str = "key = value key2 = value2"
local pattern = "([%w_]+)%s*=%s*([%w_]+)" -- same pattern as above
local tbl = { }
for k, v in string.gmatch( str, pattern ) do
	tbl[ k ] = v
end
PrintTable( tbl )
```
Output:
<br/><br/>
```
key = value
key2 = value2
```
The interesting thing is that the string can have any characters as separators between key-value pairs.
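Two more things are worth a quick sketch: the loop simply does not run when nothing matches, and matches arrive as strings, so numbers need tonumber. The sample strings are made up.

```lua
-- no letters in the string, so the loop body never runs
local count = 0
for word in string.gmatch( "12345", "%a+" ) do
	count = count + 1
end
print( count ) -- 0

-- every run of digits is a match; convert each one before adding
local sum = 0
for n in string.gmatch( "3 apples, 4 pears, 10 plums", "%d+" ) do
	sum = sum + tonumber( n )
end
print( sum ) -- 17
```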
### string.gsub
```
string.gsub( string str, string pattern, string/table/function repl )
```
`str` is the string to search in, `pattern` is the pattern to search for and `repl` is the value to replace with. The function returns `str` where all occurrences of the pattern have been replaced with the value given by `repl` and, as the second return value, the total number of matches.
`repl` can be the following things:
* **a string** - in which case all occurrences of the pattern are replaced with this string; the "%n" item is also supported, with the special case "%0" representing the whole match,
* **a function** - in which case the passed function gets called with the match/captures as its argument(s) each time a match occurs, and the match is replaced with the value returned by the function,
* **a table** - in which case the table is indexed with the first capture (or the whole match if there are no captures) and the value found there is used as the replacement.
If the function or table returns `nil` or `false`, the original match is kept unchanged.
The following code formats a key-value pair as an XML node.
```
local str = "key = value"
local pattern = "([%w_]+)%s*=%s*([%w_]+)"
local replacement = "<%1>%2</%1>"
local output = string.gsub( str, pattern, replacement )
print( output )
```
Output:
<br/><br/>
```
<key>value</key>
```
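The table and function forms of `repl`, as well as "%0" and the second return value, can be sketched like this. The `$name` placeholder syntax and the sample strings are made up for illustration.

```lua
-- Table: each "$word" is looked up by its capture; a missing key leaves the match as it was
local values = { name = "Bob" }
local filled, matches = string.gsub( "Hi $name, you are $age", "%$(%w+)", values )
print( filled, matches ) -- Hi Bob, you are $age	2

-- Function: called with the whole match, because the pattern has no captures
local shout = string.gsub( "one two", "%a+", function( w ) return string.upper( w ) end )
print( shout ) -- ONE TWO

-- String with "%0": the whole match is inserted twice
local doubled = string.gsub( "abc", "%w", "%0%0" )
print( doubled ) -- aabbcc
```

Note that `$age` still counts as a match even though it was not replaced, which is why the count is 2.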
The following example creates a function that works like the .NET formatting feature.
```
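function string.format2( fmt, ... )
	local args = { ... } -- collect the varargs into a table
	-- "{0}" refers to the first argument, so add 1 because Lua tables start at index 1
	-- the extra parentheses drop gsub's second return value (the match count)
	return ( fmt:gsub( "{(%d+)}", function( i ) return args[ tonumber( i ) + 1 ] end ) )
end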
local str = "This is {0}, oh {1}.."
local repl1 = "PATTERNS"
local repl2 = "YEAH"
local output = string.format2( str, repl1, repl2 )
print( output )
```
Output:
<br/><br/>
```
This is PATTERNS, oh YEAH..
```
## Conclusion
The article is finally over! I hope you learned something new from all of this. Lua's patterns are very powerful when used right. When making an addon that heavily relies on strings, patterns will most likely come in handy. You can find some new examples in either the [Lua manual](http://www.lua.org/manual/5.2/) or [PIL](http://www.lua.org/pil/).
Good day!
## See also
* [Lua 5.3 Reference Manual, section 6.4.1 (Patterns)](http://www.lua.org/manual/5.3/manual.html#6.4.1)