Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
When someone is dealing with a large data set you need to trade off programmer efficiency against performance. While we in the Perl world are used to valuing programmer efficiency more, this stops being true with large datasets.

Consider 5 GB of data broken up into 50 byte lines. So there are 100 million lines of data. Suppose we want to store that and retrieve it into DBM::Deep. For the sake of argument let's say that each store or retrieve takes one seek to disk. So that's 200 million seeks to disk.

How long does 200 million seeks to disk take? Well suppose that your disk spins at 6000 rpm. (This is typical.) That means it spins 100 times per second. A disk seek will therefore take between 0 and 0.01 seconds, or 0.005 seconds on average. 200 million seeks therefore takes a million seconds. Which is 11.57 days, or a week and a half.

Now how long does sorting that data take? Well let's assume an absurdly slow disk - 10 MB/s. (Real sorting algorithms keep a few passes in RAM and so will need fewer passes.) Suppose we code up a merge-sort and need 30 passes to disk. Each pass needs to read and wrote 5 GB. We therefore have 300 GB of throughput at 10 MB/s which will take 30,000 seconds, or a bit over 8 hours. (If your machine really takes this long to sort this much data, you should upgrade to a machine from this millennium.)

The moral? Hard drives are not like RAM. DBM::Deep and friends are efficient for programmers, but not for performance. If you have existing complex code that needs to scale, consider using them. But it is worth some programmer effort to stay away from them.


In reply to Re^2: Memory Efficient Alternatives to Hash of Array by tilly
in thread Memory Efficient Alternatives to Hash of Array by neversaint

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2024-03-19 09:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found