Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re^2: Google like scalability using perl?

by dpavlin (Friar)
on Oct 09, 2009 at 01:12 UTC ( #800136=note: print w/ replies, xml ) Need Help??


in reply to Re: Google like scalability using perl?
in thread Google like scalability using perl?

My goal is to process ~130000 bibliographic records (with tree-like structure) with interactive editing of perl code which is applied to each record with interactive response (under three seconds for edit/run/see result cycle).

Right now, execution of code on each database record isn't a bottleneck (code is simple, and even if it isn't I would just boot a few more machines and distribute smaller shards of data to each of them).

However, when nodes start to produce more than ~20000 results merge on master becomes a problem. This is where merge to central Redis from nodes comes into play, removing need for merge step all-together at expense of network traffic. My fingers are itching to try that out, but this will have to wait at least tomorrow morning...


2share!2flame...


Comment on Re^2: Google like scalability using perl?
Re^3: Google like scalability using perl?
by BrowserUk (Pope) on Oct 11, 2009 at 07:51 UTC

    You say that Disks are slow. But network IO is slower!

    Your solution is to keep all the data in memory by distributing it across multiple machines, but for 130,000 bibliographies, a (these days quite modest) 2GB machine could handle 16KB per record. Even with structural overheads that is a fairly expansive bibliography. By moving to even the most basic 64-bit OS on commodity hardware, that can limit can be at least quadrupled.

    Whilst you gain through paralellism of the multiple boxes, you loose through networking (IO and coordination). You say "This is where merge to central Redis from nodes comes into play, removing need for merge step all-together at expense of network traffic.", but one way or another, the results from multiple machines has to end up being gathered back together. And that means it has to transition the network and be written to disk.

    These days, even commodity boxes come with 2 or 4 cores, with 6 or 8 becoming the norm at the next inventory upgrade cycle. If you applied the same 'remote processes' solution you are using with networked boxes within a fully-RAMed, multi-core box, the 'networking' overhead becomes little more than serialisation through shared memory buffers (plus some protocol overhead that you'd have anyway).

    If you use threads instead of processes, each core can process over a segment of shared memory, avoiding all the networking overhead entirely. And by returning references to the shared memory rather than data, to the merging thread, all network IO is avoided completely.

    And for scalability; with 64-bit OSs capable of handling at least 128GB of virtual memory, you'd have plenty of room for growth. Whilst random access to virtual memory greater than physical memory will 'thrash the cache', there are still huge gains to be had by avoiding the networking protocal overhead.

    Please don't take this as a critisism of your sharing your current solution. That's all good. I just can't help think ing that for problems of this scale, and those an order of magnitude or two larger, it is a solution with a quite limited lifespan.

    For a while now I've been looking at a solution of using virtualised memory (memory mapped files) to provide generalised, thread-shared, random read-write access to large datasets (up to ~128GB), and your problem is the first real-world task that has come up here. If you could share your dataset, or at least a reasonably detailed schema sufficient to allow the creation of a realistic testset to be generated, it would be extremely interesting (to me) to see how the two approaches to the problem compare. In terms of both complexity and performance.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      All your observations are totally correct, and I would suggest to everyone to re-read your reply before deciding on architecture for system like this one.

      I had different end-goal: make map over perl structure scale across commodity low-end desktop machines. To be honest I was hoping to be able to do it on single machine, in-memory but it didn't work out for me.

      Dataset isn't huge: 540 Mb with 121887 records with about 4.5K per record.

      dpavlin@klin:~$ ls -alh /data/isi/full.txt -rw-r--r-- 1 dpavlin dpavlin 546M 2009-10-05 14:15 /data/isi/full.txt <code> Structure is quote simple with repetable field names: <code> PT J AU RABIN, DL KALIMO, E MABRY, JH TI WORLD-HEALTH-ORGANIZATION INTERNATIONAL COLLABORATIVE STUDY OF MEDICAL-CARE UTILIZATION - SUMMARY OF METHODOLOGICAL STUDIES AND PRELIMINARY FINDINGS SO SOCIAL SCIENCE & MEDICINE LA English DT Article C1 GEORGETOWN UNIV,SCH MED,DEPT COMMUNITY MED & INT HLTH,WASHINGTON,DC + 20007. SOCIAL INSUR INST,RES INST SOCIAL SECUR,HELSINKI,FINLAND. UNIV VERMONT,COLL MED,DEPT COMMUNITY MED,BURLINGTON,VT. CR 1970, WHO INT COLLABORATIV 1972, MILBANK MEMORIAL F 2, V50 BICE TW, 1969, CROSS NATIONAL MEASU BICE TW, 1971, SOC SCI MED, V5, P283 BICE TW, 1972, MILBANK MEM FUND Q, V50, P57 KALIMO E, 1970, BRIT J PREVENTIVE SO, V24, P229 KALIMO E, 1972, 1 WHO ICS MCU OCC RE KALIMO E, 1972, MED CARE, V10, P95 KOHN R, IN PRESS LOGAN RFL, 1972, MILBANK MEMORIAL F 2, V50, P45 RABIN DL, 1972, MILBANK MEMORIAL F 2, V50, P19 SCHACH E, 1972, MILBANK MEM FD Q, V50, P65 VUKMANOVIC C, 1972, MILBANK MEMORIAL F 2, V50, P5 WHITE KL, 1967, NEW ENGL J MED, V277, P516 WHITE KL, 1969, INT COMPARISONS MEDI WHITE KL, 1972, MILBANK MEMORIAL F 2, V50, P31 NR 16 TC 2 PU PERGAMON-ELSEVIER SCIENCE LTD PI OXFORD PA THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD, ENGLAND OX5 1GB SN 0277-9536 J9 SOC SCI MED JI Soc. Sci. Med. PY 1974 VL 8 IS 5 BP 255 EP 262 PG 8 SC Public, Environmental & Occupational Health; Social Sciences, Biome +dical GA T5180 UT ISI:A1974T518000003 ER
      My basic goal was to write something like this for query:
      # $rec - input record # $out - generated data foreach ( @{ $rec->{C1} } ) { my $country = $1 if m{,\s?([^,]+)\.$}; $country =~ s{^.+USA$}{USA}; $country =~ s{^\w\w\s\d{5}$}{USA}; $country =~ s{^\w\w$}{USA}; $out->{'C1_country+'}->{ uc $country }++; }
      And I wanted to run this query as fast as possible on all my data (which affected my decision not to use disk for processing), so simpliest possible solution I could come up with is to load it all into perl hash.

      This worked well for most fields, until I discovered that I have 4.5 milion entries for CR running following code:

      $out->{'CR+'}->{ $_ }++ foreach @{ $rec->{CR} };
      Perl hash structures are nice, and code is clean and concise, but memory usage for results pushed me to swap (confirming that my anti-disk bias is correct). Having said that, I have used swap very effectively as first line of on-disk offload before, but with a result set which has random access pattern it doesn't mix well (disks are tipically SATA drives, so fast in linear reads, but slow for almost anything else.

      To work around this problem, I implemented conversion from long fields names in CR (err... longer than 4 bytes for int ignoring perl decorations) into simple integer which enabled me to put 4.5 million CR records onto single machine.

      I decided to use MD5 hash for each key, keep mapping of md5 to integer in-memory and replace (memory hungry) key with integer while preserving value. This way, I pushed full key names to disk in form of int -> full_name mapping.

      On disk storage had two iterations, in first one I used BerkeleyDB. It saved so much memory that whole database file could fit in /dev/shm :-)

      But, storing fields on disk did come with huge performance penalty to my query time. It took longer than 3 seconds to complete query and I wasn't prepared to admit defeat. To speed it up, it was logical to shard it across machines. Even better, I could control worst possible query time by changing size of shard for each node to adjust for different systems.

      To implement communication between nodes, I first implemented fancy protocol, but then decided to just ship Storable object directly to socket. Well, not quite directly, since I'm using ssh with compression to speed up network transfer. Since it's mostly bulk (send data to node or receive results) it improves performance on my 100Mb/s network for about 30%.

      Whole idea is to have fast throw-away calculations on data which comes from semi-formatted text files (e.g. Apache logs) so your note about limited lifespan is so very right. This was design decision. With this model, I can start on single machine until I fill up memory (or query becomes too slow) and then spread it across other machines until I have whole dataset available or scale out for query speed.


      2share!2flame...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://800136]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (14)
As of 2014-07-25 19:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (174 votes), past polls