Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Threads From Hell #2: How To Search A Very Huge File [SOLVED]

by karlgoethebier (Monsignor)
on May 23, 2015 at 20:04 UTC ( #1127540=perlquestion: print w/replies, xml ) Need Help??
karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

for learning purposes i started to think about how to parse search a very huge file using the multithreading capabilities of Perl.

As i like trivial examples, i started out with something trivial and created some huge file at first:

karls-mac-mini:monks karl$ ls -hl very_huge.file -rw-r--r-- 1 karl karl 2,0G 23 Mai 19:38 very_huge.file karls-mac-mini:monks karl$ tail very_huge.file Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli Lorem ipsum kizuaheli nose cuke karl karls-mac-mini:monks karl$ wc -l very_huge.file 100000001 very_huge.file

By RTFM i figured out this using MCE::Grep:

#!/usr/bin/env perl use strict; use warnings; use MCE::Grep; use Data::Dump; use Time::HiRes qw (time); MCE::Grep::init( { max_workers => 4 } ); my $start = time; open( my $fh, '<', 'very_huge.file' ); my @result = mce_grep { /karl/ } $fh; close $fh; printf "Took %.3f seconds\n", time - $start; dd \@result; __END__ karls-mac-mini:monks karl$ ./huge.pl Took 29.690 seconds ["nose cuke karl\n"]

Good old grep performs very much better easily:

karls-mac-mini:monks karl$ time grep karl very_huge.file nose cuke karl real 0m2.563s user 0m2.176s sys 0m0.309s

I don't know if this trivial exercise is peinlich parallel, but i'm wondering how to:

  • do this "by hand" (without using MCE::Grep)
  • ...and improve the performance

Thank you very much for any hint and best regards,

Update:

Edit: Striked out nonsense.

Ouch! Perhaps more RTFM would have helped:

PID Prozessname Benutzer % CPU Physikal. Speic Virt. S +peicher 1065 perl karl 12,7 10,3 MB 2, +33 GB 1068 perl karl 83,7 3,9 MB 2, +33 GB 1069 perl karl 84,6 3,9 MB 2, +33 GB 1070 perl karl 83,5 3,9 MB 2, +33 GB 1071 perl karl 84,0 3,9 MB 2, +33 GB

Edit 2: Renamed the thread

Update 3: Many thanks to marioroy and BrowserUk for their patience and their contributions to this interesting thread.

Karl

«The Crux of the Biscuit is the Apostrophe»

Replies are listed 'Best First'.
Re: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Pope) on May 23, 2015 at 21:38 UTC
    i started to think about how to parse a very huge file using the multithreading capabilities of Perl....and improve the performance

    You won't.

    Your processing is entirely limited by how fast you can read the file from disk. And there is no way to make that faster using multitasking!

    Whether real threads; or green threads; or processes; the limitation is the speed of the drive, not the single thread or process doing the reading.

    As your figures from the cute but pointless MCE::Grep show, using multiple cores to issue the reads, simply means that it takes longer than doing the same thing with a single thread/process. Nearly 10 times longer.

    Here are a few numbers:

    wc -l does nothing but read lines and count them. Ie. the minimum processing so it reflects pure IO speed:

    [21:22:49.82] C:\test>wc -l big.csv 167772159 big.csv [21:24:39.11] C:\test>dir big.csv 07/02/2015 13:27 10,737,418,241 big.csv

    1:49.29 for 10,737,418,241 = 98247033/s. Let's (inaccurately) call that 98MB/s

    Now let's see how long (worse case:not found) it takes to search 98MB for a 4-char string:

    $s = 'kar'; $s x= 32749011;; print length $s;; 98247033 $t = time; $s =~ m[karl] and ++$n for 1 .. 1; printf "%.9f\n", time() +- $t;; 0.192370176 $t = time; $s =~ m[karl] and ++$n for 1 .. 10; printf "%.9f\n", ( time +() - $t ) / 10;; 0.1929563999 $t = time; $s =~ m[karl] and ++$n for 1 .. 100; printf "%.9f\n", ( tim +e() - $t ) / 100;; 0.192800162

    So, the IO rate is ~98MB/s, and the time taken to search 98MB is 0.2s. If you could spread the searching over 4 cores (without overhead) you could reduce the latter to 0.05s.

    But you cannot reduce the IO time, so your best possible outcome would be 1.05s/98MB rather than 1.2s/98MB.

    And if you incur any overhead at all, that 0.15 seconds saving just melts away. Incur a lot of overhead -- as MCE::Grep does -- and your overall time grows by a factor of 10.

    There are 2 possibilities to speed your problem up (by a meaningful amount):

    • Make the IO faster:

      Possibilities include using an SSD; or spreading the file across multiple devices (per SAN/NAS etc.).

      This is the same file as above, but from an SSD rather than disk:

      [21:54:00.65] S:\>wc -l big.csv 167772159 big.csv [21:55:19.75] S:\>

      So, 1:19.10 for 10,737,418,241 = 135744857/s call that 35% faster. A vast improvement over the 0.15s/98MB that came from multi-threading; but hardly anything to write home about.

      Now the limitation is my SATA 2 interface card. A SATA 3 card would buy a little more, maybe 50% overall throughput, but again not a huge amount.

      The next possibility is to use PCIe attached SSDs, which I can't demonstrate, but I've used to affect another doubling of throughput. Ie. circa 300MB/s. But it comes with all kinds of problems. First, you've got to get the data on to it, which means copying it from somewhere, usually a disk, which unless you are reusing the same data for many runs completely throws away any gains.

    • If you were actually parsing -- rather than just searching -- and the parsing was very complex -- think huge nested XML -- then you might be able to cut your overall processing time by overlapping the parsing with IO waits.

      That is, if you could arrange to issue the read of the next record, before parsing the current one, then you can utilise the IO wait time to good effect.

      With your example, the best that would achieve, is folding the 0.2s search time into the 1s processing time, thus saving a maximum of 0.2s/98MB overall.

      But the transfer of the data between the thread that reads it, and the thread that processes it, would have to be completely costless; and -- unfortunately and unnecessarily -- that is not the case for shared memory under threads::shared.

    The bottom line is that for your example problem of searching a single huge file for a simple string, threading (or any other form of multi-processing) simply has nothing to offer in terms of performance gain.

    Change any one of the parameter of your task -- multiple data sources; more complex data processing requirements; the need to make multiple passes over the data -- and there's a chance that multi-threading can provide some gains.

    But for the single source, single pass, simple search application that you've described, there are no gains to be had from multi-tasking, regardless of whether you try kernel threads; green threads, or processes; and regardless of what language you use; or what platform you run it on.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Threads From Hell #2: How To Parse A Very Huge File
by marioroy (Priest) on May 24, 2015 at 03:48 UTC

    The assumptions and testing are valid. However, MCE is quite fast at this. MCE has bin/mce_grep (a parallel wrapper for the grep binary) and examples/egrep.pl (100% Perl code). Both run faster than the grep command.

    $ time grep karl very_huge.file nose cuke karl real 0m2.127s user 0m1.845s sys 0m0.283s $ time ./MCE-1.608/bin/mce_grep karl very_huge.file nose cuke karl real 0m1.061s user 0m2.176s sys 0m1.616s $ time ./MCE-1.608/examples/egrep.pl karl very_huge.file nose cuke karl real 0m0.690s user 0m2.165s sys 0m0.362s

    The MCE::Grep has an alternative mode by appending the "_f" suffix and reading the file directly. That runs in 8.5 seconds. The overhead is from calling the code block once per each line. Thus, use egrep.pl residing inside the examples directory.

    # open( my $fh, '<', 'very_huge.file' ); # my @result = mce_grep { /karl/ } $fh; # close $fh; my @result = mce_grep_f { /karl/ } 'very_huge.file';

    The following code snippet parses the 2 GiB file in 1 second.

    use MCE::Loop; use Time::HiRes qw( time ); MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } ); my $start = time; my $pattern = 'karl'; my @result = mce_loop_f { my ($mce, $slurp_ref, $chunk_id) = @_; ## Quickly determine if a match is found. ## Basically, only process slurped chunk if true. if ($$slurp_ref =~ /$pattern/im) { my @matches; open my $MEM_FH, '<', $slurp_ref; binmode $MEM_FH, ':raw'; while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); } close $MEM_FH; MCE->gather(@matches); } } 'very_huge.file'; print join('', @result); printf "Took %.3f seconds\n", time - $start;

      And then there's:

      $ time grep karl very_huge.file nose cuke karl real: 2.127s == user: 1.845s + sys: 0.283s ## CLOSE ENOUGH! $ time ./MCE-1.608/bin/mce_grep karl very_huge.file nose cuke karl real: 1.061s != user: 2.176s + sys: 1.616s ## NOT EVEN CLOSE to: 3. +792s $ time ./MCE-1.608/examples/egrep.pl karl very_huge.file nose cuke karl real: 0.690s != user: 2.165s + sys: 0.362s ## NOR IS THIS CLOSE to: + 2.527s

      Looks dodgy to me.

      And then the claim that:

      The following code snippet parses the 2 GiB file in 1 second.

      Let's examine that. The code: slurps the entire 2GiB file into memory and then starts 4 workers that each get a reference to (1/4 of???) the slurped data and then:

      1. Run a regex against their chunk of the file to see if it contains the sought string;
      2. And if it does, open their referenced chunk of the file as a memory file, and then re-run the regex on a line-by-line basis in order to get the whole lines that contain the sought string;
      3. And then they return those lines to the parent process (via a pipe??) for it to print out.

      So the regex is run against 1.25 times as much data as is contained in the file, and takes "less than one second"; which is less than the 2.127 seconds the real grep takes, despite that it only processes the file's data once?

      Have you heard the song: Funny numbers?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        Chunking is applied following a bank-teller model. Thus, MCE does not read the entire file into memory.

Re: Threads From Hell #2: How To Parse A Very Huge File
by marioroy (Priest) on May 24, 2015 at 04:09 UTC

    MCE also includes wc.pl. I thought why not compare the wc binary with wc.pl.

    ## wc -l very_huge.file $ time wc -l very_huge.file 100000001 very_huge.file real 0m0.781s user 0m0.521s sys 0m0.260s $ time ./MCE-1.608/examples/wc.pl -l very_huge.file 100000001 very_huge.file real 0m0.612s user 0m0.633s sys 0m1.522s ## wc very_huge.file $ time wc very_huge.file 100000001 300000003 2200000015 very_huge.file real 0m21.621s user 0m21.341s sys 0m0.289s $ time ./MCE-1.608/examples/wc.pl very_huge.file 100000001 300000003 2200000015 very_huge.file real 0m6.210s user 0m23.639s sys 0m1.034s ## LANG=C wc very_huge.file $ time LANG=C wc very_huge.file 100000001 300000003 2200000015 very_huge.file real 0m6.860s user 0m6.519s sys 0m0.344s $ time LANG=C ./MCE-1.608/examples/wc.pl very_huge.file 100000001 300000003 2200000015 very_huge.file real 0m1.973s user 0m6.832s sys 0m0.904s

      Okay. From your post:

      $ time wc -l very_huge.file 100000001 very_huge.file real 0m0.781s user 0m0.521s sys 0m0.260s

      So, user code time: 0.521 + system code time: 0.260 = real time: 0.781. Makes sense!

      But then:

      $ time ./MCE-1.608/examples/wc.pl -l very_huge.file 100000001 very_huge.file real 0m0.612s user 0m0.633s sys 0m1.522s

      User code time: 0.633 + system code time: 1.522 = real time: 2.155s != 0.612s doesn't!

      The other examples seem to indicate -- although the "timing information" is equally futzed -- that if you force the real wc to run more slowly, then your fake wc runs less, more slowly?

      Is the lesson of your post that if you force wc to count characters -- which you can determine instantly from ls/dir -- and words -- which nobody is ever interested in; whilst also forcing it to use an 'unnatural' collating sequence that is equally of no interest, then your MCE::Grep can do what nobody wants done, less slowly than the system (which?) provided binary does, what nobody wants done?

      Great advert.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        "...User code time... + system code time... = real time..."

        Yes:

        From the docs:

        "The time utility executes and times the specified utility. After the utility finishes, time writes to the standard error stream, (in seconds): the total time elapsed, the time used to execute the utility process and the time consumed by system overhead."

        Some observations:

        karls-mac-mini:monks karl$ ls -hl very_huge10GB.file -rw-r--r-- 1 karl karl 10G 25 Mai 00:53 very_huge10GB.file karls-mac-mini:monks karl$ time grep karl very_huge10GB.file nose cuke karl nose cuke karl nose cuke karl nose cuke karl nose cuke karl real 2m42.126s user 0m20.437s sys 0m5.645s

         

        karls-mac-mini:monks karl$ ./mce_loop.pl nose cuke karl nose cuke karl nose cuke karl nose cuke karl nose cuke karl Took 150.555 seconds

         

        #!/usr/bin/env perl use Time::HiRes qw( time ); use feature qw(say); my $start = time; say qx (grep karl very_huge10GB.file); printf "Took %.3f seconds\n", time - $start; __END__ karls-mac-mini:monks karl$ ./wrap.pl nose cuke karl nose cuke karl nose cuke karl nose cuke karl nose cuke karl Took 157.265 seconds

        For the grep example 60+60+42=162 which is 2m42s. But user+sys (20+5) is 0m25s. What do i miss?

        Perhaps it's too late tonight. Or too early in the morning?

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        MCE::Grep is not the tool for this. Calling a code block (once per line) is the reason for the overhead. The testing was done on a CentOS VM. However, the MCE::Loop example showed a way that allows Perl to run faster.

        Surely, one is not likely to force a character count. I was mainly comparing the wc binary against the Perl script.

Re: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Pope) on May 23, 2015 at 23:22 UTC
    Update: Ouch! Perhaps more RTFM would have helped:

    Could you explain what that means please?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      "...explain what that means..."

      I'll try it: A disastrous way to express my surprise to see five processes with one thread each instead of one process with five threads?

      Please don't worry to much about this nonsense, i'll strike it out.

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1127540]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2018-08-19 22:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Asked to put a square peg in a round hole, I would:









    Results (188 votes). Check out past polls.

    Notices?