Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: Weighted frequency of characters in an array of strings

by BrowserUk (Patriarch)
on Jun 07, 2016 at 19:53 UTC ( [id://1165108]=note: print w/replies, xml ) Need Help??


in reply to Weighted frequency of characters in an array of strings

Either of these should be 7 to 8 times faster than your original. They're much the same on my machine, with the latter winning on later perls:

my %freq; my $inc = 1 / @data; for my $d ( @data ) { # $freq{ substr $d, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1; my $p = 0; $freq{ chop $d }[ $p++ ] += $inc while length $d; }

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

Replies are listed 'Best First'.
Re^2: Weighted frequency of characters in an array of strings
by K_Edw (Beadle) on Jun 08, 2016 at 08:37 UTC

    Thanks! This is significantly faster (and also easier to read in my opinion) when ran on a real dataset:

    Execution Time (Original): 293 seconds

    Execution Time (New): 53 seconds

    5.52x speed improvement.

    Would you be able to explain why this approach is much faster (or how the old one was so slow)?

      Your OP code is slow:

      my %freq; for my $i ( 1 .. @data ) { $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g; }

      For several reasons (in roughly scale of penalty):

      1. It starts the regex engine for every line.
        1. Just starting the regex engine, even with a simple pattern, is expensive.
        2. I'm not sure whether that simple pattern get re-parsed each time; or if the parsing is cached, but its still parsing.
        3. It uses capture brackets, which means that not only does $1 have to be populated, but also @+, @-, $&, $`, $' and possibly %+ ...
      2. It performs the same invariant calculation (1/@data) for every line.

        Moving that out of the loop and adding the invariant is simple and effective.

      3. It does a completely unnecessary calculation ($i-1) for every line.

        Changing the loop parameters to for my $i ( 0 .. $#data ) avoids that repeated subtraction.

      4. It indexes into @data for every line.

        Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

      Most of those can be avoided like this:

      my $inc = 1 / @data; ## do the i +nvariant once: for ( @data ) { ## $_ alias +es each line in turn avoids both the $i-1 and indexing the array $freq{ $1 }[ $-[0] ] += $inc while /(.)/g; ## the rege +x engine operates directly on the $_ alias. }

      But that still incurs the penalty of the regex engine to get the chars 1 at a time. A naive alternative is ... for split //; but that still involves the regex engine, albeit in a faster mode.

      The regex can be avoided completely using ... for unpack '(a1)*', $_;, but that template still requires parsing, albeit a simpler language; but the big penalty here is the construction of a long list of single chars, each in its own scalar that each require over 100 bytes of memory. That is a costly process.

      Contrast that with the substr version $freq{ substr $data, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;. That numerical iterator costs only ~20 bytes regardless of the length of the string.

      My other version trades the (fairly expensive) call to substr and the iterator, for a very cheap operation chop and an increment. And a little obscurity.

      It can however be made to be the fastest of the lot with a couple of simple changes:

      for( @data ) { ## use $_ to alias the +lines my $p = -1; ## Allow pre-increment +which is more efficient $freq{ chop() }[ ++$p ] += $inc while $_; ## implicit arg for cho +p; avoid call to length. }

      That might see you over the 6X mark. Or not.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

        The chop version is wrong, it's trimming chars from the end, but counting them from zero. This fixes that:

        for( @data ) { ## use $_ to alias the +lines my $p = length(); ## Allow pre-decrement +and count backwards $freq{ chop() }[ --$p ] += $inc while $p; ## implicit arg for cho +p; avoid call to length. }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.
      In addition to substr being faster than a regex, your code is computing:
      1/@data
      200,000 times. Computing this value only once before looping over the data and storing the result into a variable is very likely to save you some significant time (but benchmarking and/or profiling would be needed to say how much time). Storing the result of a computation to avoid re-computing it many times is called caching or memoizing (see for example https://en.wikipedia.org/wiki/Memoization) and is a very common optimization technique.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1165108]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (7)
As of 2024-03-28 18:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found