Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Documentation of REGEXP support in DBD::SQLite?

by ibm1620 (Hermit)
on Nov 13, 2024 at 23:54 UTC ( [id://11162694]=perlquestion: print w/replies, xml ) Need Help??

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed monks,

I discovered that, without installing anything other than DBD::SQLite (version 1.76), I was able to run a SELECT using a REGEXP in the WHERE clause. All of the searching I'd turned up had suggested that built-in REGEXP support in SQLite was either non-existent or limited to the CLI, and that I'd have to install one of several available regex libraries.

I tested against a table of 11M movie titles, in which the title column was indexed. While REGEXP worked, it was considerably slower than LIKE in the following queries:

SELECT * FROM Titles WHERE title LIKE '%batman%'; -- 1.3 sec SELECT * FROM Titles WHERE title REGEXP '(?i)\bbatman\b'; -- 5.9 sec SELECT * FROM Titles WHERE title REGEXP '(?i)\b{wb}batman\b{wb}'; -- 1 +1.0 sec !!
I can't find any documentation for this! Could anyone provide some links? And, why is REGEXP so much slower?

Replies are listed 'Best First'.
Re: Documentation of REGEXP support in DBD::SQLite?
by LanX (Saint) on Nov 14, 2024 at 00:25 UTC
    See REGEXP-function of DBD::SQLite, emphasis added.

    • REGEXP function

      SQLite includes syntactic support for an infix operator REGEXP , but without any implementation.

      The DBD::SQLite driver automatically registers an implementation that performs standard perl regular expression matching , using current locale.

      So for example you can search for words starting with an A with a query like

      SELECT * from table WHERE column REGEXP '\bA\w+'

      If you want case-insensitive searching, use perl regex flags, like this :

      SELECT * from table WHERE column REGEXP '(?i:\bA\w+)'

      The default REGEXP implementation can be overridden through the create_function API described above.

      Note that regexp matching will not use SQLite indices, but will iterate over all rows, so it could be quite costly in terms of performance.

    Update

    Short version: Perl regexes are hooked into sqlite using a callback mechanism. But filtering row-wise. That's much slower than a native implementation.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Thank you for finding that for me!

      Perl regexes are hooked into sqlite using a callback mechanism. But filtering row-wise. That's much slower than a native implementation.

      I'm not quite clear on what you mean by "native implementation" - would linking in libpcre2 (or another) as a custom function be expected to perform better? Or would I be better off using a different DBMS with built-in REGEXP support such as mySQL?

      (It does appear that both mySQL and SQLite REGEXP can't use the index and have to search the data itself.)

        I don't know if they perform better. You have to test it.

        Native means for me without call overhead and efficiently compiled (subset of) regexes.

        Regarding indexes, I can only imagine a small subset of regexes capable to profit from them, unless a lot of special case optimisation was implemented¹.

        It should be quite complicated to achieve this with a pluggable extension...

        But again I don't know ... This you should better ask at a DBM-board.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

        ¹) I'm not even sure a substring search with LIKE %substr% can take advantage from the index, the differences in your benchmark are not in magnitudes, this could be easily explained with "call overhead and efficiently compiled code".

        A real "index search" should be dramatically faster than just factor 4.

        And, as a side note, your regexes were much more complicated than a substr search. Apples and oranges...

        mysql or something with builtin regexps would surely be at least somewhat faster. Btw, leading wildcards like LIKE '%batman%' don't use indices either, however, even with SQLite you can probably speed things up by first doing a rough pass LIKE %batman% then refining that with a regexp either as part of the statement like LIKE '%batman%' AND REGEXP '(?i)\bbatman\b' or just parsing the output of LIKE '%batman%' yourself.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11162694]
Approved by LanX
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2025-07-18 10:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.