Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^2: Documentation of REGEXP support in DBD::SQLite?

by ibm1620 (Hermit)
on Nov 14, 2024 at 20:16 UTC ( [id://11162703]=note: print w/replies, xml ) Need Help??


in reply to Re: Documentation of REGEXP support in DBD::SQLite?
in thread Documentation of REGEXP support in DBD::SQLite?

Thank you for finding that for me!

Perl regexes are hooked into sqlite using a callback mechanism. But filtering row-wise. That's much slower than a native implementation.

I'm not quite clear on what you mean by "native implementation" - would linking in libpcre2 (or another) as a custom function be expected to perform better? Or would I be better off using a different DBMS with built-in REGEXP support such as mySQL?

(It does appear that both mySQL and SQLite REGEXP can't use the index and have to search the data itself.)

  • Comment on Re^2: Documentation of REGEXP support in DBD::SQLite?

Replies are listed 'Best First'.
Re^3: Documentation of REGEXP support in DBD::SQLite?
by LanX (Saint) on Nov 14, 2024 at 20:59 UTC
    I don't know if they perform better. You have to test it.

    Native means for me without call overhead and efficiently compiled (subset of) regexes.

    Regarding indexes, I can only imagine a small subset of regexes capable to profit from them, unless a lot of special case optimisation was implemented¹.

    It should be quite complicated to achieve this with a pluggable extension...

    But again I don't know ... This you should better ask at a DBM-board.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    ¹) I'm not even sure a substring search with LIKE %substr% can take advantage from the index, the differences in your benchmark are not in magnitudes, this could be easily explained with "call overhead and efficiently compiled code".

    A real "index search" should be dramatically faster than just factor 4.

    And, as a side note, your regexes were much more complicated than a substr search. Apples and oranges...

      Regular expressions can be implemented in databases with quite a bit of success. I use them for some stuff in PostgreSQL and it is blazingly fast. It shouldn't matter for complicated the regular expressions are, the simple fact that the database software only has to read a (relatively) small index file from disk instead of everything in the table (full table scan) should still speed up the search. And many databases also keep frequently used indexes in RAM.

      Generally, i think of SQLite more of a desktop-type database that is integrated into a single application with a relatively small amount of data. For serious data crunching, i always choose a serious standalone database like PostgreSQL (hey, it's the same price as SQLite).

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
      Also check out my sisters artwork and my weekly webcomics
        > It shouldn't matter for complicated the regular expressions are,

        Sorry that's bordering at nonsense hubris.

        To take advantage of the index you need to identify character sequences in the regex.° (And erix even showed that it only works for 3+ characters in PG.)

        Now, while it's possible to identify those sequences at the compile phase - at least in Perl using re - one need's to know which ones are

        • mandatory
        • or optional
        • or belonging to more complicated AND/OR clauses.

        And regexes are Turing complete.

        It might be possible with sufficient work involved, but I doubt it's already done.

        You are more than welcome to prove me wrong. Even with a test/benchmark in PG.

        Claiming that the complexity doesn't matter is really quite a bold statement.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

        °) if they even appear in the regex!

Re^3: Documentation of REGEXP support in DBD::SQLite?
by Danny (Chaplain) on Nov 14, 2024 at 22:32 UTC
    mysql or something with builtin regexps would surely be at least somewhat faster. Btw, leading wildcards like LIKE '%batman%' don't use indices either, however, even with SQLite you can probably speed things up by first doing a rough pass LIKE %batman% then refining that with a regexp either as part of the statement like LIKE '%batman%' AND REGEXP '(?i)\bbatman\b' or just parsing the output of LIKE '%batman%' yourself.

      SCNR - have to show off postgres' regex indexing on a quickly generated 10M row table.

      -- table size: 1117 MB -- regex index size: 1727 MB (big index = disadvantage) where txt like '%batman%'; --> 'like' means: search without regex; +uses the trgm index txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 9.911 ms where txt ~ '\ybatman\y'; --> '~' means: regex search; \y: word b +oundary txt + ---------------------------------------------------------------------- +----------- zdfrcj batman igihbky bpwz hzzfmq plyazufcawovvjrlwow xkllfiovgktraw + bmnuyjelj Time: 13.179 ms

      (performance of these regex-indexes (via trigram matching from the pg_trgm extension) is good but the size can be reason not to use them.)

        Thanks for this, I had something like trigrams in mind.

        But for full automatic regex support one would need to analyse the regex and identify the substrings which are in all execution paths ...

        Quite a task.

        Though... Perl does some optimisation when compiling its regexes and it's possible to introspect the result. It might be possible to build on this.

        On a side note: true, the index adds 150% to the datas size. But the speed gain is dramatic enough to justify it in many use cases.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

        Question: how are substrings of size 1 or 2 handled?

        Are they just ignored?

        Or are trigrams-indeces also indexed, such that "ab" is efficiently found in "cab" and "abs" ?

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11162703]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (3)
As of 2025-07-14 02:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.