Does Match regex cache compiled regular expressions?

I was wondering about the internal behavior of Match regex. Does it parse and compile the regular expression for each call, or is there an internal mechanism to cache previously seen regular expressions?

I’m curious because for a large number of iterations (e.g. 60,000+), 4D 17.1 compiled is about 5 times slower than Postgres for the same operations. If there is no caching, that might explain the large difference.

I can’t say anything about the cache, but I compared Match regex with Position using a simple pattern (a plain text separator, for example), both interpreted and compiled. The result was quite unexpected to me, and I think it may interest you:
• interpreted: as expected, regex is a bit slower than Position, ratio ~1.23
• compiled: regex is now much slower, ratio ~12, about 10 times more!
Strange, no?

code used:
https://forums.4d.com/4DBB_Main/x_User/4467/files/30106993.7z
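
For readers who cannot open the archive, a comparison of this kind might look roughly like the sketch below. This is not Arnaud's code: the separator, the sample text and the iteration count are assumptions chosen for illustration. Run it interpreted and compiled to compare the two ratios.

<code 4D>
C_LONGINT($i;$ms;$count;$pos)
C_TEXT($data;$sep)
C_BOOLEAN($match)

  // Assumed sample values, not taken from the linked archive
$sep:=";"
$data:="alpha;beta;gamma"
$count:=100000

  // Time Position on the separator
$ms:=Milliseconds
For ($i;1;$count)
  $pos:=Position($sep;$data)
End for
$ms:=Milliseconds-$ms
ALERT("Position: "+String($ms)+" ms")

  // Time Match regex on the same separator, searching from character 1
$ms:=Milliseconds
For ($i;1;$count)
  $match:=Match regex($sep;$data;1)
End for
$ms:=Milliseconds-$ms
ALERT("Match regex: "+String($ms)+" ms")
</code 4D>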

Hi Arnaud,

That is interesting. My guess is that there is not much difference interpreted because the time is dominated by executing the interpreted code. The compiled code really shows how much faster Position is compared to Match regex.

I came up with a test that convinces me that 4D is caching something for Match regex, so I’ll give up on the idea of making a feature request. I used a UUID as a regular expression and compared the time of using the same UUID versus different ones for 100,000 iterations. In compiled (preemptive mode) using the same regex was up to 21 times faster.

<code 4D>
C_LONGINT($i;$ms;$count)
C_TEXT($data;$regex)
C_BOOLEAN($match)

$data:=Generate UUID
$regex:=Generate UUID
$count:=100000

  // Case 1: every array element holds the same pattern
ARRAY TEXT($aRegex;$count)
For ($i;1;$count)
  $aRegex{$i}:=$regex
End for

$ms:=Milliseconds
For ($i;1;$count)
  $match:=Match regex($aRegex{$i};$data;1)
End for
$ms:=Milliseconds-$ms
ALERT(String($ms))

  // Case 2: every array element holds a different pattern
ARRAY TEXT($aRegex;$count)
For ($i;1;$count)
  $aRegex{$i}:=Generate UUID
End for

$ms:=Milliseconds
For ($i;1;$count)
  $match:=Match regex($aRegex{$i};$data;1)
End for
$ms:=Milliseconds-$ms
ALERT(String($ms))
</code 4D>

I’m not sure if your test code definitively proves that the pattern is cached.

Local variables and “For/End for” loops are so much faster in compiled mode that
I think it is impossible to single out the gain from “Match regex” itself.

Also, a UUID as both the regex pattern and the string to match is far too simple to measure speed.
A more demanding test would be needed to collect meaningful stats.

EXECUTE FORMULA has been enhanced to cache formulas

cf. the “Number of formulas in cache” selector of SET DATABASE PARAMETER: https://doc.4d.com/4Dv17/4D/17.1/SET-DATABASE-PARAMETER.301-4179137.en.html

so it is not a reasonable request to ask for caching regex in like manner.

Hi Miyako,

Please explain in more detail my misunderstanding. The loops are
exactly identical. The only difference is that in one case the array
of regular expressions (100,000 elements) is a single UUID value and
in the other loop all UUIDs are unique values. There is a huge
difference in performance. Why? The lengths in all cases should be
exactly the same.

Keisuke MIYAKO wrote:

so it is not a reasonable request to ask for caching regex in like
manner.

Parsing a complicated regex to generate a finite-state automaton (FSA) can be an expensive operation. If the regex is going to be used for a significant number of operations, doing the compile step once can be a significant performance gain. Other environments I use have an API to support this use case.

John DESOI wrote:

Please explain in more detail my misunderstanding. The loops are
exactly identical.

I am not saying that there is a misunderstanding; I am just not sure.

For example, if patterns are indeed cached,
would we not get a faster result if we repeated the 2nd loop (with the same array of patterns)?

Also, if Match regex implemented the compiled regex feature,

http://icu-project.org/apiref/icu4c/classicu_1_1RegexPattern.html

should we not get the advantage in interpreted mode as well?

Keisuke MIYAKO wrote:

for example, if patterns are indeed cached,
would we not get a faster result if we repeated the 2nd loop (with
the same array of patterns) ?

Only if we know how big the cache is. If it is small and 100,000 values are used, there will be no difference. I repeated the loop with all distinct values and it takes about the same amount of time.
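
One way to probe the cache-size question is to cycle through a small, fixed number of distinct patterns and watch how the timing changes as that number grows. The sketch below is only an illustration of that idea: $distinct and $aPattern are hypothetical names introduced here, and the starting value of 10 is an arbitrary assumption. If a cache exists, the elapsed time should stay close to the single-pattern case until $distinct exceeds the cache capacity.

<code 4D>
C_LONGINT($i;$ms;$count;$distinct)
C_TEXT($data)
C_BOOLEAN($match)

$data:=Generate UUID
$count:=100000
$distinct:=10  // assumed starting point; rerun with 1, 100, 1000, ...

  // Build the small pool of distinct patterns
ARRAY TEXT($aPattern;$distinct)
For ($i;1;$distinct)
  $aPattern{$i}:=Generate UUID
End for

  // Fill the test array by cycling through the pool
ARRAY TEXT($aRegex;$count)
For ($i;1;$count)
  $aRegex{$i}:=$aPattern{(($i-1)%$distinct)+1}
End for

  // Same timing loop as in the original test
$ms:=Milliseconds
For ($i;1;$count)
  $match:=Match regex($aRegex{$i};$data;1)
End for
$ms:=Milliseconds-$ms
ALERT(String($distinct)+" distinct patterns: "+String($ms)+" ms")
</code 4D>

Because the array fill happens outside the timed loop, the measurement covers only the Match regex calls and the loop overhead, which is identical for every value of $distinct.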

Keisuke MIYAKO wrote:

should we not get the advantage in interpreted mode as well?

In the code I posted, the single-value regex is about 4 times faster in interpreted mode. As I mentioned previously in this thread, comparing interpreted is more difficult because interpreted execution can dominate the overall time of the loop iteration.