
(Guildenstern) Re: Re: Taming a memory hog

by Guildenstern (Deacon)
on Nov 10, 2003 at 20:20 UTC (#305955=note)

in reply to Re: Taming a memory hog
in thread Taming a memory hog

I actually did have Tie::File in the mix at one point. I'm not sure if I was using it wrong, but my run times went up several-fold. (I had to cancel the 100,000 run after waiting 2 hours - about 20 times longer than normal.) Maybe when I get some time I'll reinvestigate.

Negated character class uber alles!

Re: (Guildenstern) Re: Re: Taming a memory hog
by Roger (Parson) on Nov 11, 2003 at 02:59 UTC
      100,000 run after waiting 2 hours - about 20 times longer than normal

    OK, does that mean your normal processing time is 120 min / 20 = 6 min for 100,000 records?

    It sounds to me like there might still be room for improvement if you want to impress your client once more. I haven't seen your data, but I am doing daily processing of 20,000,000 records within 10 mins.

    Anyway, I am building my records with a simple split; your record processing might be more complex, and I may just be too fussy. :-)

      This is an interesting problem of dealing with large datasets. Currently I am working with files 20,000,000 lines long and am trying to sort them. Do you have any suggestions about sorting? There seems to be a lot of info out there on large datasets, but I haven't seen much on sorting, especially for datasets too large to hold in memory. Thanks
        Hi codingchemist, that depends on how large and complex your data set is.

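        If you are dealing with straightforward flat files (like CSV-type files), the unix command-line sort is usually the fastest. The syntax is simple too, e.g. sort -t $'\t' -k 3,3 input.txt > output.txt will sort the tab-delimited input file on the 3rd field. (The $'\t' form is bash syntax for a literal tab; a plain "\t" reaches sort as a backslash followed by t, which GNU sort rejects as a multi-character delimiter. Writing -k 3,3 limits the key to the 3rd field, whereas -k 3 alone uses everything from the 3rd field to the end of the line.)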
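If you would rather drive the command-line sort from a Perl script, here is a minimal sketch. It assumes a unix-like system with sort on the PATH; the file names and the three sample records are made up for illustration. Using the list form of system bypasses the shell, and Perl's double-quoted "\t" is already a real tab character, so the delimiter needs no shell quoting.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample data: three tab-delimited records in a scratch directory.
my $dir = tempdir(CLEANUP => 1);
my ($in, $out) = ("$dir/input.txt", "$dir/output.txt");

my @rows = (
    [ 'r1', 'x', 'pear'  ],
    [ 'r2', 'y', 'apple' ],
    [ 'r3', 'z', 'fig'   ],
);

open my $fh, '>', $in or die "Cannot write $in: $!";
foreach my $row (@rows) {
    print {$fh} join("\t", @$row), "\n";
}
close $fh or die "Cannot close $in: $!";

# List form of system: no shell is involved, so "\t" arrives at sort
# as a single tab character. -k 3,3 restricts the key to the 3rd field.
my $status = system('sort', '-t', "\t", '-k', '3,3', '-o', $out, $in);
die "sort failed with status $?" if $status != 0;

open my $sorted, '<', $out or die "Cannot read $out: $!";
my @lines = <$sorted>;
close $sorted;

print @lines;
```

For real data you would replace the scratch files with your own input and output paths and drop the part that writes the sample records.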

        Otherwise you could use a different sorting algorithm, like heap sort.

        Heap sort is usually the slowest of the common O(n log n) sorting algorithms in practice, but unlike merge sort and quicksort it needs neither deep recursion nor auxiliary arrays. This makes it an attractive option for very large data sets of millions of items. There is a description and code example here at the WKU-Linux user group.
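To make the "no recursion, no extra arrays" point concrete, here is a minimal in-place heap sort sketch in Perl. It is not the code from the linked page; the sub names heap_sort and sift_down are my own, and it assumes numeric data that fits in a single in-memory array.

```perl
use strict;
use warnings;

# Restore the max-heap property for the subtree rooted at $start,
# considering only indices up to and including $end. Iterative, no recursion.
sub sift_down {
    my ($aref, $start, $end) = @_;
    my $root = $start;
    while (1) {
        my $child = 2 * $root + 1;          # left child
        last if $child > $end;
        # Use the right child instead if it exists and is larger.
        $child++ if $child < $end && $aref->[$child] < $aref->[$child + 1];
        last if $aref->[$root] >= $aref->[$child];
        @$aref[$root, $child] = @$aref[$child, $root];
        $root = $child;
    }
    return;
}

# Sort an array of numbers in place, ascending.
sub heap_sort {
    my ($aref) = @_;
    my $last = $#$aref;
    # Build the heap, working from the last parent down to the root.
    for (my $i = int(($last - 1) / 2); $i >= 0; $i--) {
        sift_down($aref, $i, $last);
    }
    # Move the current maximum to the end of the unsorted part, then re-heapify.
    for (my $end = $last; $end > 0; $end--) {
        @$aref[0, $end] = @$aref[$end, 0];
        sift_down($aref, 0, $end - 1);
    }
    return $aref;
}

my @data = (5, 3, 9, 1, 7);
heap_sort(\@data);
print "@data\n";    # prints "1 3 5 7 9"
```

The only working storage beyond the array itself is a handful of scalars, which is the property described above. The array still has to fit in memory, though, so for a file that does not, the command-line sort remains the easier route.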

      So, I tried to make my record creation more efficient. Currently, there are three levels of nested foreach, plus some extra logic for special locations in the record. I realized that all of this could be rewritten as a single foreach containing a map. That cut the code that creates each record for output by more than half.

      Then I ran it. As guessed above, 6 minutes is a normal run for 100,000 records. With my nifty new changes, creating 100,000 records takes almost a full minute longer. I don't know if using a map within a foreach is a good idea, but something sure seems to slow it down.

      On a side note, creating every piece of data in a record includes a call to rand, which is probably a large factor in why generating records takes so long.
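One way to see where the time goes is to benchmark the two styles side by side with the core Benchmark module. The sketch below is only an illustration: the real record layout isn't shown in this thread, so the 20-field record, the rand range, and the sub names record_loop and record_map are all assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical record layout: 20 random integer fields per record.
my @fields = (1 .. 20);

# Build one record with an explicit foreach and push.
sub record_loop {
    my @out;
    foreach my $f (@fields) {
        push @out, int(rand(1000));
    }
    return join(',', @out);
}

# Build the same kind of record with a single map.
sub record_map {
    return join(',', map { int(rand(1000)) } @fields);
}

# Run each version for at least one CPU second and print a comparison table.
cmpthese(-1, {
    foreach_push => \&record_loop,
    map_only     => \&record_map,
});
```

Swapping the bodies of the two subs for the real record-building code (and, as a third entry, a version with the rand calls stubbed out) should show whether the map, the loops, or rand itself accounts for the extra minute.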

      Negated character class uber alles!
