Hello. In this video we are going to speak about regex tries.
When we have a list of words to match, we usually concatenate them using a pipe symbol.
A pipe symbol is an alternation operator in regex.
However, when we build dynamic regular expressions from a list of words supplied by some user, and we do not know beforehand what these words look like, we need to be very careful. Why? You know that in an NFA-based regex engine, alternatives are tried from left to right; the first alternative that matches "wins", and the others are not tried at that position.
However, an alternation group is only efficient when no two alternatives can match at the same position inside the string.
If one or more alternatives can match at the same position, backtracking comes into play and can ruin the whole performance. The more alternatives that can match at the same position, the less efficient the pattern.
Now, let's have a look at a concrete example.
You can see an input string like this: "Afoos,foo,food, fool-foolish, foods". So, a naive first attempt to match these words would be to build a regular expression from this list of words that looks like this.
And in a test at regex101.com you can see that we match all these strings (not as whole words inside the string), and it takes about 29 steps.
However, the result is not what we expected: in the word "food", we would like to match "food", not just "foo".
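This first-match behavior is easy to reproduce in Python; here is a small sketch using the word list and input string from the example above:

```python
import re

# Naive pattern: the words joined with "|" in their original order.
pattern = re.compile('foo|food|fool|foolish|foods')

# "foo" is the first alternative tried at every position, so it always
# wins, even inside the longer words "food", "fool", "foolish", "foods".
print(pattern.findall('Afoos,foo,food, fool-foolish, foods'))
# Every match is just "foo"
```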
That's why we might want to try Solution 2.
Here, the words in the list are sorted by length in descending order and joined into the regular expression like this.
Now let's run a test, and we see that "food", "fool", and "foolish" are now matched fully.
However, we see that we also matched "foo" in "Afoos" and this is not expected.
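In Python, the sort-by-length step might look like this (a sketch; the exact code in the video may differ):

```python
import re

words = ['foo', 'food', 'fool', 'foolish', 'foods']
# Longest words first, so "foolish" is tried before "foo".
pattern = '|'.join(sorted(words, key=len, reverse=True))
print(pattern)  # foolish|foods|food|fool|foo

print(re.findall(pattern, 'Afoos,foo,food, fool-foolish, foods'))
# The longer words now match fully, but "foo" still matches inside "Afoos"
```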
We need word boundaries.
Good, Solution #3.
The words are sorted by length in descending order inside the list, and word boundaries are added here.
So in Python we can build this regular expression using the following code, and the resulting pattern looks like this. Good.
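A sketch of how such a pattern can be built (the video's exact code may differ):

```python
import re

words = ['foo', 'food', 'fool', 'foolish', 'foods']
# Escape the words, sort longest-first, and wrap the whole
# alternation in word boundaries.
alternation = '|'.join(sorted(map(re.escape, words), key=len, reverse=True))
pattern = r'\b(?:' + alternation + r')\b'
print(pattern)  # \b(?:foolish|foods|food|fool|foo)\b

print(re.findall(pattern, 'Afoos,foo,food, fool-foolish, foods'))
# "Afoos" no longer matches; each listed word is matched as a whole word
```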
You can see there are correct matches.
"Afoos" does not match, but it still takes 92 steps and you can see that each alternative here starts with the same prefix, "foo", you can see here as well.
So this is where regex tries come into play.
Have a look at this example here and the test.
You can see that it took 67 steps for this regular expression to find all matches in the string. You can see that "foo" is only tried once.
After that, only "l" or "d" needs to be tried; after "d" there is an optional "s", and after "l" there is an optional "ish" substring.
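The trie-style pattern factors out the shared prefix. Here is a sketch (the exact pattern shown in the video may differ slightly):

```python
import re

# "foo" is written only once; what follows is an optional branch:
# "d" with an optional "s", or "l" with an optional "ish".
pattern = r'\bfoo(?:ds?|l(?:ish)?)?\b'

print(re.findall(pattern, 'Afoos,foo,food, fool-foolish, foods'))
# Matches foo, food, fool, foolish, foods — but not "foo" inside "Afoos"
```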
So what is a trie? A trie is a kind of tree data structure used for locating specific keys from within a set.
In our case these keys are, of course, strings, and the links between nodes in the trie are defined not by the entire key but by individual characters or, in our case, subpatterns.
So, this is a trie here for keys "A", "to", "te", "ted", "ten", "i", "in" and "inn".
A regex trie is a regular expression that is built in a tree-like way, so you can have a look here.
How can we build such patterns? Usually, there are specific libraries that help you create regex tries.
For example, in Python, you can use the trieregex library.
You can see it here.
All you need to do is to install this library using "pip install trieregex", and then you can build your patterns using this library in a very simple and short way.
However, there are also other ways to build regex tries, and one of them is described in a StackOverflow answer to the "Speed up millions of regex replacements in Python 3" post, which is really very nice and highly customizable.
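As a rough illustration of that idea, here is a minimal hand-rolled trie-to-regex builder. This is a sketch of the general technique, not the StackOverflow author's exact code; libraries like trieregex handle escaping and edge cases more thoroughly:

```python
import re

def build_trie_regex(words):
    """Insert the words into a character trie, then serialize the trie
    into a regular expression with shared prefixes factored out."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}  # end-of-word marker

    def to_regex(node):
        alternatives = []
        optional = False  # True if a word can end at this node
        for ch, child in sorted(node.items()):
            if ch == '':
                optional = True
            else:
                alternatives.append(re.escape(ch) + to_regex(child))
        if not alternatives:
            return ''
        if len(alternatives) == 1 and not optional:
            return alternatives[0]
        body = '(?:' + '|'.join(alternatives) + ')'
        return body + '?' if optional else body

    return r'\b' + to_regex(trie) + r'\b'

pattern = build_trie_regex(['foo', 'food', 'fool', 'foolish', 'foods'])
print(pattern)
print(re.findall(pattern, 'Afoos,foo,food, fool-foolish, foods'))
```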
Now let's continue with our testing.
I found some arbitrary text on the Internet, and it will be our sample input. To make our test closer to real-life scenarios, I decided to extract all the words from this text.
That's why I'm using this piece of code with "re.findall", and you can see all the extracted words here.
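The extraction step can be as simple as this (a sketch; the exact pattern used in the video is an assumption):

```python
import re

text = 'To be, or not to be, that is the question.'
# Extract runs of word characters; case and duplicates are kept as-is.
words = re.findall(r'\w+', text)
print(words)
```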
So there are a lot of words.
Some of them start with the same letters, some of them don't.
So this is really close to the situations you might come across in real life.
So we build two patterns.
One of them is not trie-based, and the other one is trie-based.
We compile them and you can see the patterns here.
So trie regexes look very cryptic, but in fact that compact, factored form is exactly why they are much more efficient in the end. In this block, I benchmarked the two regular expressions using Python's "re" library.
You can see that I ran each of them 10,000 times, and the first pattern took longer than the second, which means the trie-based pattern is more efficient than the regular expression that is not trie-based.
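A benchmark along these lines can be reproduced with the standard timeit module. This is a sketch using the small word list from earlier; absolute timings vary by machine, and such a tiny list shows a much smaller gap than the full test in the video:

```python
import re
import timeit

words = ['foo', 'food', 'fool', 'foolish', 'foods']
text = 'Afoos,foo,food, fool-foolish, foods ' * 100

# Plain alternation (longest-first) vs. the factored trie pattern.
plain = re.compile(r'\b(?:' + '|'.join(sorted(words, key=len, reverse=True)) + r')\b')
trie = re.compile(r'\bfoo(?:ds?|l(?:ish)?)?\b')

n = 10_000
t_plain = timeit.timeit(lambda: plain.findall(text), number=n)
t_trie = timeit.timeit(lambda: trie.findall(text), number=n)
print(f'plain: {t_plain:.2f}s  trie: {t_trie:.2f}s')
```

Both patterns are equivalent in what they match, so the comparison measures only how efficiently they match it.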
Trie-based regular expressions are very efficient for matching very long lists of words, and even things like emojis.
For example, you can see this post at StackOverflow where all current emojis can be matched and extracted using a huge pattern, but this pattern is still quite efficient when you use it in your code.
You can learn more about this construct if you follow the links in the FURTHER LINKS section, and if you liked my video, please click "Like" and subscribe to my channel if you haven't done it yet.
Thank you for watching and happy regexing.