
Re: string occurences

by lhoward (Vicar)
on Jun 12, 2001 at 21:13 UTC (#87875)

in reply to string occurences

Personally, I'd use an in-memory hash. A 400MB logfile in which only part of the data is URLs, with a fair number of duplicates, will probably not use more than 200MB of RAM in a hash... but (as you suspected) if you don't have much RAM to spare you'll need to use some sort of disk-based storage (a hash tied to a DBM file, etc.). Make the URL the hash key and the number of occurrences the hash value. That way you'll only have to read the logfile once.
open F,"<squid_logfile" or die "$!";
my %counts;
#tie the file to a DB hash or something similar if memory is a concern
while(<F>){
  my $url=.... #extract url from a line of data and put it in $url
  $counts{$url}=0 if !defined $counts{$url};
  $counts{$url}++;
}
close F;
#do something with %counts to produce your report.
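For the low-memory case, the tied-hash idea might look roughly like the sketch below. It is only an illustration: count_urls_on_disk is a made-up helper name, the URL extraction is left to a caller-supplied code ref, and SDBM_File is picked only because it ships with Perl (it limits each key/value pair to roughly 1 KB, so DB_File or another DBM may suit long URLs better).

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # bundled with Perl; an assumption, any DBM module would do

# Hypothetical helper: count URLs read from $fh into a DBM file at $dbm_path.
# $extract is a code ref that returns the URL found in one log line (or undef).
sub count_urls_on_disk {
    my ($fh, $dbm_path, $extract) = @_;

    # The hash now lives on disk, so memory use stays small.
    tie my %counts, 'SDBM_File', $dbm_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $dbm_path: $!";

    while (my $line = <$fh>) {
        my $url = $extract->($line);
        next unless defined $url;
        $counts{$url}++;          # same counting logic as the in-memory version
    }

    untie %counts;                # flush and release the DBM file
    return;
}
```

To produce the report afterwards, tie the same file again (read-only is enough) and walk the keys as with any other hash.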

added later

Since I always run with warnings and strict on, I can't get away with the "an undefined hash value is treated as 0 numerically" trick.

Also, because of the way I am using the hash, the "defined" check is good enough, because there will not be a hash entry that is undef.

Replies are listed 'Best First'.
Re: Re: string occurences
by MeowChow (Vicar) on Jun 12, 2001 at 21:20 UTC
    $counts{$url}=0 if !defined $counts{$url};
    That line is unnecessary. Autoincrement on an undefined hash key autovivifies the key and sets the value to 1.
                   s aamecha.s a..a\u$&owag.print
      The problem is that the program will then increment that default hash value of 1.

      Explicitly setting the hash value to 0 for each first-time occurrence of a URL avoids the problem of all counts being too high by 1.

        The problem is that the program will then increment that default hash value of 1.

        I don't see how that is a problem. Right now, this code:

        $counts{$url}=0 if !defined $counts{$url}; $counts{$url}++;
        sets a new hash entry to 0, then increments it; entries that already exist are simply incremented. If you just take out that first line, the results are the same.

        On a side note, exists probably would have been better than defined. But maybe that's just me : )

        The 15 year old, freshman programmer,
        Stephen Rawls
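For anyone unsure about the exists/defined distinction mentioned above, here is a small sketch. The hash contents and the key_state helper are invented for illustration. The two tests only disagree for a key that is present with an undef value, which the counting code never creates.

```perl
use strict;
use warnings;

my %h = (seen => 1, empty => undef);

# Hypothetical helper: report what exists and defined say about one key.
sub key_state {
    my ($hash, $key) = @_;
    return join ',', (exists  $hash->{$key} ? 'exists'  : 'missing'),
                     (defined $hash->{$key} ? 'defined' : 'undef');
}

print "$_: ", key_state(\%h, $_), "\n" for qw(seen empty absent);
# seen: exists,defined
# empty: exists,undef
# absent: missing,undef
```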

        If the hash value is undefined and is then incremented, the new value will be 1, which means the URL has appeared once so far, which is correct. There is no need to explicitly set each first-time occurrence to 0; the counts will be the same either way, and MeowChow is correct.
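A quick way to check this is to run both styles side by side, as in the sketch below. The helper names and sample URLs are made up for the comparison. It also captures warnings, since ++ on an undefined value is exempt from the "uninitialized" warning even with warnings enabled.

```perl
use strict;
use warnings;

# Hypothetical helper: the original style, with the explicit zero.
sub count_with_init {
    my %counts;
    for my $url (@_) {
        $counts{$url} = 0 if !defined $counts{$url};
        $counts{$url}++;
    }
    return \%counts;
}

# Hypothetical helper: the shorter style, relying on ++ treating undef as 0.
sub count_plain {
    my %counts;
    $counts{$_}++ for @_;
    return \%counts;
}

# Collect any warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my @urls  = qw(http://a/ http://b/ http://a/);
my $with  = count_with_init(@urls);
my $plain = count_plain(@urls);

printf "%s init=%d plain=%d\n", $_, $with->{$_}, $plain->{$_}
    for sort keys %$with;
print "warnings: ", scalar(@warnings), "\n";
# http://a/ init=2 plain=2
# http://b/ init=1 plain=1
# warnings: 0
```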

