in reply to Re: In search of an efficient query abstractor
in thread In search of an efficient query abstractor

Hmmm. Good catch, although I thought I had a test case to ensure that was handled correctly. I'll check that :) You seem to know more about regexes than I do. Why would you guess this particular regex is slow? Line-by-line profiling shows that it consumes the vast majority of the time: almost 300 CPU seconds on an 8GB file, while the next most expensive line in this code consumes 89 seconds.

The next-worst offender is

$query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;

followed by

$query =~ s/\s{2,}/ /g;
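
For readers who want to see what these two substitutions do, here is a minimal sketch that applies them to a made-up query. The sub name `abstract_tail` and the sample input are assumptions for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: applies only the two substitutions quoted above.
sub abstract_tail {
    my ($query) = @_;
    # Digits that directly follow "<word char>_" become N;
    # a second "_digits" group turns the whole match into N_N.
    $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;
    # Collapse every run of two or more whitespace characters into one space.
    $query =~ s/\s{2,}/ /g;
    return $query;
}

# Made-up sample input.
print abstract_tail("select * from tbl_2008_12   where  id = 5"), "\n";
# prints: select * from tbl_N_N where id = 5
```

Note that the bare `5` is left alone, because it is not preceded by a word character and an underscore.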

Re^3: In search of an efficient query abstractor
by ikegami (Pope) on Dec 07, 2008 at 20:46 UTC

    Are you doing the whole 8GB file at once? If your string starts with "a  b  c  d", $query =~ s/\s{2,}/ /g; needs to copy 32GB of text. Just for the first 10 characters.

      No, it's done one entry at a time. Each entry is a header with some commented lines, followed by a query. There are special cases, but it generally looks like

      # Time: 071015 21:43:52
      # User@Host: root[root] @ localhost []
      # Query_time: 2 Lock_time: 0 Rows_sent: 1 Rows_examined: 0
      use test;
      select sleep(2) from n;
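
As a rough illustration of the "one entry at a time" layout described above, the sketch below separates the commented header lines from the query text. The sub name `split_entry` is hypothetical, the line breaks in the sample are assumed, and the special cases mentioned above are not handled (for example, a query line that itself begins with `#` would be misfiled as header).

```perl
use strict;
use warnings;

# Hypothetical helper: split one log entry into its "#" header lines
# and the remaining query text.
sub split_entry {
    my ($entry) = @_;
    my (@header, @query);
    for my $line (split /\n/, $entry) {
        if ($line =~ /^#/) { push @header, $line }
        else               { push @query,  $line }
    }
    return (\@header, join("\n", @query));
}

# Sample entry from the post; single quotes keep @Host from interpolating.
my $entry = join "\n",
    '# Time: 071015 21:43:52',
    '# User@Host: root[root] @ localhost []',
    '# Query_time: 2 Lock_time: 0 Rows_sent: 1 Rows_examined: 0',
    'use test;',
    'select sleep(2) from n;';

my ($header, $query) = split_entry($entry);
print scalar(@$header), " header lines\n";
print "$query\n";
```

Only the `$query` part would then be passed to the substitutions under discussion.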
Re^3: In search of an efficient query abstractor
by Corion (Pope) on Dec 07, 2008 at 20:50 UTC

    At least for the last case, Perl has an optimized version:

    $query =~ tr[ ][]s;

    and it should be at least as fast as the s/// version. Another version to try would be s/\s+/ /g - there is no need to use the counting variant {2,}, and skipping single whitespace characters might be slower than just replacing them.

      $query =~ tr[ \n\t\r\f][ ]s; turns out to be a lot faster than any s/// variant. That change moves this line from #4 badness to #28 badness. Still having trouble with the floats, though.
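
To compare the variants from this subthread on your own data, something like the following sketch using the core Benchmark module may help. The sample string and sub names are made up, and timings will vary with input and Perl version. One caveat: the variants are not strictly equivalent. The tr and s/\s+/ /g forms turn a lone newline or tab into a space, while s/\s{2,}/ /g leaves single whitespace characters untouched, and \s can match characters that the tr list does not name.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample input, repeated so each call does a measurable amount of work.
my $sample = "select  a,\n\tb   from t\r\n where  x = 1 " x 200;

# Each sub works on a copy so the sample is never modified.
sub squeeze_counted { my $q = shift; $q =~ s/\s{2,}/ /g;      return $q }
sub squeeze_plus    { my $q = shift; $q =~ s/\s+/ /g;         return $q }
sub squeeze_tr      { my $q = shift; $q =~ tr[ \n\t\r\f][ ]s; return $q }

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'counted' => sub { squeeze_counted($sample) },
    'plus'    => sub { squeeze_plus($sample) },
    'tr'      => sub { squeeze_tr($sample) },
});
```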