How to optimize a regex on a large file read line by line ?

by John FENDER (Acolyte)
on Apr 16, 2016 at 13:35 UTC

John FENDER has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm currently making some basic tests on parsing huge files for security work, searching them for a basic regex. As the files can be more than 10 GB, I can't load them fully into memory, so I have to read them line by line. My standard test is to count the number of lines and search for the 123456$ regexp. I need to do both: count the number of lines in the file, and run a search and count the number of results. Here is my code:
open (FH, '<', "../Tests/10-million-combos.txt");
$counter=0;
$counter2=0;
while (<FH>) {
    if (/123456$/) {++$counter2;}
    ++$counter;
}
print "Num. Line : $counter - Occ : $counter2\n";
close FH;
It's simple, but for a simple file of 2 GB it takes 12.6 min !!! I suspect I did something wrong, as Perl is a fast language, but I'm not good enough to know what. Please help! Thanks.

Replies are listed 'Best First'.
Re: How to optimize a regex on a large file read line by line ?
by AnomalousMonk (Archbishop) on Apr 16, 2016 at 15:19 UTC
    ... for a simple file of 2 Gb it takes 12,6 min ...

    Wait... Over 12 minutes to process a 2 GB file in the simple way you've shown?!? I put together a 10,000,000 line file of 200 characters per line, with the last six characters '000000' .. '999999', and processing with your code took just over 20 seconds on my laptop (update: although some later runs took just over 40 seconds). (Generating the file only took about 40 seconds!)

    If I understand your 12 minute claim correctly, I have a sneaking suspicion that you're not showing us the code you're actually running. It's important to show real code and not "It's just like as if it was this code..."

    Update: If, however, the time is actually on the order of 12 seconds, I honestly don't think you're going to do a great deal better; such a time would seem pretty good to me.


    Give a man a fish:  <%-{-{-{-<

      I'm currently hiding nothing :).

I have the latest ActiveState Perl installed on my machine (ActivePerl-5.22.1.2201-MSWin32-x64-299574).

I've uploaded to my FTP both files I used for my tests. I'm running Windows 10 Home Edition (it's my personal laptop, as I'm at home these days), with a quad core 3.1 GHz / 16 GB.

To give you an idea: a grep + wc command gives me a result of 10 s; Java or C#, 30 s; C++, 48 s; PHP 7, 50 s; Ruby, 85 s; Python, 346 s; PowerShell, 682 s; VBS, 1031 s; Free Pascal, 72.58 s; VB.NET, 100.63 s...

Maybe something related to the Perl distribution, you think? I will try with another distribution.

        How do you grep line by line?

I suppose grep does the same as I suggested before: reading large chunks into memory and trying to match multiple lines at once.
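For the curious, a serial sketch of that chunked reading (hypothetical chunk size and sub name, not from the thread; it completes the partial line at each chunk end so the $-anchored match still works, and /m makes $ match at every line end):

```perl
use strict;
use warnings;

# Read in big chunks instead of line by line, counting lines and
# end-of-line 123456 matches per chunk.
sub count_chunked {
    my ( $fh, $size ) = @_;
    my ( $lines, $matches ) = ( 0, 0 );
    while ( read( $fh, my $chunk, $size ) ) {
        $chunk .= <$fh> // '';                       # complete the partial line at the chunk end
        $lines   += $chunk =~ tr/\n//;               # count newlines in one pass
        $matches += () = $chunk =~ /123456\r?$/mg;   # /m: $ matches at each line end
    }
    return ( $lines, $matches );
}

if ( my $file = shift @ARGV ) {    # e.g. perl chunked.pl ../Tests/10-million-combos.txt
    open my $fh, '<', $file or die "open: $!";
    my ( $lines, $matches ) = count_chunked( $fh, 8 * 1024 * 1024 );
    close $fh;
    print "Num. Line : $lines - Occ : $matches\n";
}
```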

Another option is to fork into four children, each processing a quarter, to use the full power of your machine.
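A sketch of that fork idea (hypothetical helper, not from the thread; each child takes a byte range of the file, aligns itself to the next line start so no line is counted twice, and reports its counts through a pipe; note that fork is only emulated with threads on Windows):

```perl
use strict;
use warnings;

# Split the file into N byte ranges; each child counts the lines that
# *start* inside its range, so the partition is exact.
sub count_parallel {
    my ( $file, $workers ) = @_;
    my $size = -s $file;
    my ( $lines, $matches, @kids ) = ( 0, 0 );
    for my $i ( 0 .. $workers - 1 ) {
        my $beg = int( $i * $size / $workers );
        my $end = int( ( $i + 1 ) * $size / $workers );
        pipe my $r, my $w or die "pipe: $!";
        my $pid = fork() // die "fork: $!";
        if ( $pid == 0 ) {                       # child
            close $r;
            open my $fh, '<', $file or die "open: $!";
            seek $fh, $beg ? $beg - 1 : 0, 0;
            <$fh> if $beg;                       # skip to the first line starting at or after $beg
            my ( $l, $m ) = ( 0, 0 );
            while ( tell( $fh ) < $end and defined( my $line = <$fh> ) ) {
                $l++;
                $m++ if $line =~ /123456\r?$/;
            }
            print {$w} "$l $m\n";
            exit 0;
        }
        close $w;
        push @kids, $r;
    }
    for my $r ( @kids ) {                        # collect counts from the children
        my ( $l, $m ) = split ' ', scalar <$r>;
        $lines += $l;
        $matches += $m;
    }
    waitpid( -1, 0 ) for 1 .. $workers;
    return ( $lines, $matches );
}

if ( my $file = shift @ARGV ) {
    my ( $lines, $matches ) = count_parallel( $file, 4 );
    print "Num. Line : $lines - Occ : $matches\n";
}
```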

        And btw using lexical variables declared with my should help a little too.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        But do you confirm that the processing time with Perl for the OPed code is in excess of 12 minutes? That's what would be shocking to me.

        Someone else would have to advise about differences between distributions (I'm running Strawberry 5.14.4.1 for my tests (update: on Windows 7)), but I would be flabbergasted by such a performance difference.


        Give a man a fish:  <%-{-{-{-<

By the way, here is the full 2 GB dictionary I'm using for tests:

        http://mab.to/tbT8VsPDm

Please give me your execution times with the same code, and your platform; it's interesting.

Re: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 16, 2016 at 14:32 UTC
Re: How to optimize a regex on a large file read line by line ?
by graff (Chancellor) on Apr 16, 2016 at 16:29 UTC
    Not that this would make a big difference in terms of run-time, but you don't have to keep your own counter for the number of lines in the file. The predefined global variable $. does that for you (cf. the perlvar man page):
    print "Num. Line : $. - Occ : $counter2\n";
    A few other observations...

    I fetched the "10-million-combos.txt.zip" file you cited in one of the replies above, and noticed that it contains just the one text file. In terms of benchmarking, you might find that a command-line operation like this:

    unzip -p 10-million-combos.txt.zip | perlscript
    is likely to be faster than having the perl script read an uncompressed version of the file from disk, because piping output from "unzip -p" involves fetching just 23 MB from disk, as opposed to 112 MB to read the uncompressed version. (Disk access time is always a factor for stuff like this.)

    Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/. UPDATE: actually, there would be 2 matches on a windows system, and I find those two on my machine if I search for /123456\r\n$/.

    I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline, assuming that this would be the fastest possible way to do your regex search-and-count, but then I tried it out on your actual data and got a surprise (running on a macbook pro, osx 10.10.5, 2.2GHz intel core i7, 4GB ram):

$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.30 real         3.25 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.23 real         3.22 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.18 real         3.17 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.89 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.93 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.93 real         1.90 user         0.02 sys
    I ran each command three times in rapid succession, to check for timing differences due to system cache behavior and other unrelated variables. Perl is consistently faster by about 33% (and can report total line count along with match count, which the grep utility cannot do).

    (If I remove the "$" from the regex, looking for 123456 anywhere on any line, I find three matches, and the run times are just a few percent longer overall.)

      "The predefined global variable $. does that for you"

      Wasn't aware of this trick, thanks !

      "Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/."

Ahem, sounds like I did something wrong while zipping the file. Now the 19x MB file containing 10 million passwords is updated the right way. You will find 10000000 lines in it, and 61466 matching the regex 123456$.

      "unzip -p 10-million-combos.txt.zip | perlscript"

Currently I'm working on the txt file only. But it's interesting. I've done your test like this:

echo 1:%time%
unzip -p 10-million-combos.zip | grep 123456$ | wc -l
echo 2:%time%
grep 123456$ 10-million-combos.txt | wc -l
echo 3:%time%
pause

      Result :

1:19:16:46,11
61466
2:19:16:48,43
61466
3:19:16:49,00

0.58 s in plain text, 2.27 s for the zip file piped.

      More now with your command line

zip piped : 3,89
unzip -p "C:\Users\admin\Desktop\10-million-combos.zip" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

plain text : 5,16
type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

perl direct : 2,29
perl "demo.pl"

Fastest on my side stays direct access to the plain text file, using either grep or perl. Amazing to see that perl reading the piped unzip goes faster than plain text access with an inline command... The shell is strange sometimes...

      "I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline"

I'm using the one you can find in the Unix utils; I suppose it's the GNU one ported to Windows. --version gives me: grep (GNU grep) 2.4.2.

      Now grep vs perl
echo %time%& grep 123456$ C:\Users\admin\Desktop\10-million-combos.txt | wc -l& echo %time%

echo %time%& type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"& echo.&echo %time%

echo %time%& perl demo.pl& echo %time%

      Give me :

19:43:28,91 / 61466 / 19:43:29,51 for grep (0,6)
19:45:29,51 / 61466 / 19:45:34,71 for perl (5,2)
19:46:13,27 / 61466 / 19:46:15,47 for perl (direct) (2,2)
        Thanks for showing your comparison of the unzip pipeline vs. reading uncompressed text. I had said that the former would be faster (because of less reading from disk), but without actually testing it. (I think I must have encountered at least a couple situations in the past where some process finished more quickly if I read compressed data from disk, rather than uncompressed, but I don't know what may have been different in those cases.)

        Having now tested it for this situation (multiple times in quick succession to check for consistency), the difference in timing was negligible or slightly favoring reading the uncompressed file, so it seems my initial idea about the role of disk access was wrong: either it really doesn't make any difference, or else whatever difference it makes is washed out by the added overhead of the extra unzip process and/or the pipeline itself.

        (The perl one-liner was still faster than the compiled "grep" utility on my machine, but YMMV - different machines will have different versions / compilations of both Perl and grep.)

I think the matter comes from the huge file. How long does the same request take on your computer with the 1.9 GB dictionary?

      http://mab.to/tbT8VsPDm

Re: How to optimize a regex on a large file read line by line ?
by Athanasius (Archbishop) on Apr 16, 2016 at 13:49 UTC

    Hello John FENDER, and welcome to the Monastery!

    Since you don’t print a result until the loop has finished, it appears that you expect the regex to match only once. In that case, you can cut the time substantially1 by exiting the loop as soon as a match is found:

while (<FH>) {
    ++$counter;
    if (/123456$/) {
        ++$counter2;
        last;
    }
}

    See perlsyn#Loop-Control.

    1By half, on the average, if the matching line appears in a random location within the file.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Hello Athanasius! Thanks for your answer: I don't want to leave my loop until I know how many users with the password 123456$ I have in the file. Cheers.

        Ah yes, I see. In that case, you’re going to have to read through the whole file, and I doubt there’s much you can do to speed up the loop.

        BTW, when I saw the regex /123456$/, I assumed you wanted to match 123456 at the end of a line — that’s what the $ anchor means in a regex. If you want to match a literal $, you need to escape it: m{123456\$} or:

use strict;
use warnings;
use autodie;

...

my $password = '123456$';

open(FH, '<', "../Tests/10-million-combos.txt");

my $counter  = 0;
my $counter2 = 0;

while (<FH>) {
    ++$counter;
    ++$counter2 if /\Q$password/;
}

print "Num. Line : $counter - Occ : $counter2\n";
close FH;

        See quotemeta.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 17, 2016 at 09:23 UTC

    Hello John FENDER,

    The following is a parallel demonstration using MCE::Flow and MCE::Shared.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
    chunk_size => '1m', max_workers => 8, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my ( $numlines, $occurances ) = ( 0, 0 );

    while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
        $numlines++;
        $occurances++ if ( $1 =~ /123456\r/ );
    }

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, $fh;

close $fh;

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

    The following construction reads the plain text file directly if already unzipped.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '1m', max_workers => 8, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my ( $numlines, $occurances ) = ( 0, 0 );

    while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
        $numlines++;
        $occurances++ if ( $1 =~ /123456\r/ );
    }

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, "10-million-combos.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

      Update: Shorten code

      Hello again,

Slurping requires two regular expressions: one for breaking the chunk into actual lines and the other for the query. Below, workers receive an array reference containing some number of lines and run slightly faster, possibly due to the single regex.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
    chunk_size => '1m', max_workers => 8,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines   = @{ $chunk_ref };
    my $occurances = 0;

    for ( @{ $chunk_ref } ) {
        $occurances++ if /123456\r/;
    }

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, $fh;

close $fh;

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

      And finally, the construction for reading the plain text file directly.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '1m', max_workers => 8,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines   = @{ $chunk_ref };
    my $occurances = 0;

    for ( @{ $chunk_ref } ) {
        $occurances++ if /123456\r/;
    }

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, "10-million-combos.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";
        Hello marioroy,

It's impressive! And I've tested your code successfully. Eventually the Strawberry dist works best on my system.

It works well even with the mixed file (multiple kinds of EOL)! Benchmarking the methods, I found 32.53 and 33.07 s for the two codes kindly provided. My current code (which works only on a CR+LF or LF file) did 33.76 s.

It's impressive to see the 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from my code, which doesn't load all the cores like that. Strange!

Very happy anyway; I'm now close to the best performance I could get on my laptop with Perl!

. Grep : 10,71
. Java : 25,95
. C# : 30,05
. Perl : 32,53
. C++ : 41,3
. PHP : 52,31
. Free Pascal : 76,46
. Delphi 7 : 78,14
. VB.NET : 100,15
. Python : 315,13
. PowerShell : 681,93
. VBS : 1031,63
. Ruby : Failed to parse the file correctly.
Re: How to optimize a regex on a large file read line by line ?
by RichardK (Parson) on Apr 16, 2016 at 14:51 UTC

    How long are the lines in your file? and how many lines is it reading in total? Maybe reading it a line at a time is not the best approach for your data set.

How long? Well, it can vary depending on the extract you make and the data you analyze. Some logs are huge, more than 2 GB... For a start: 10000000 lines for the passwords log, 185866729 lines for the dictionary file. The entries are not very long, nothing more than 8 or 16 chars I would say.

        There's no point trying to optimize your code if you're not sure what your data looks like. However index will be faster than a regex if you're only looking for a fixed string.

        As other people have recommended, profile your code and find out where the time is going.
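A sketch of that fixed-string idea (hypothetical helper name, not from the thread; since the OP's pattern is anchored at end of line, comparing the line's tail with substr avoids the regex engine entirely):

```perl
use strict;
use warnings;

# Fixed-string tail test instead of a regex: after stripping the line
# ending, compare the last six characters against '123456'.
sub count_fixed {
    my ( $fh ) = @_;
    my ( $counter, $counter2 ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        ++$counter;
        $line =~ s/\r?\n\z//;    # handle CRLF or LF endings
        ++$counter2
            if length( $line ) >= 6 && substr( $line, -6 ) eq '123456';
    }
    return ( $counter, $counter2 );
}

if ( my $file = shift @ARGV ) {
    open my $fh, '<', $file or die "open: $!";
    my ( $counter, $counter2 ) = count_fixed( $fh );
    close $fh;
    print "Num. Line : $counter - Occ : $counter2\n";
}
```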

Re: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 16, 2016 at 14:34 UTC
    Could you show a small, representative sample of the input, anonymized if necessary?
As I'm working for evaluation purposes on public data at first, it doesn't matter. You can find here both a 10 million entry file with user passwords, and an extract of a 2 GB dictionary, cut at 100 MB. http://john.fender.free.fr/Dev/PerlMonk/QueryOnPerl/
Re: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 18, 2016 at 01:26 UTC

I'd appreciate it if someone would take this "big buffer" approach and adapt it to the test case and get timings for it. I'm stuck on this small tablet so I can't test it myself.

    http://ideone.com/LzaQI0

    I don't even know how to paste it into this post, sorry

      Update: Changed the chunk_size option from '1m' to '24m'. The time drops down to 3.2 seconds via MCE with FS cache purged ( sudo purge ) before running on a Macbook Pro laptop. Previously, this was taking 6.2 seconds for chunk_size => '1m'. The time is ~ 1 second if the file resides in FS cache.

      Update: Added the 'm' modifier to the regex operation.

      Update: Ensuring the file does not live in FS cache, the time is 7.8 seconds running serially and 6.2 seconds running on many cores for the ~ 2 GB plain text file. Once in FS cache, the time is 5.4 seconds serially and 0.9 seconds via MCE.

Update: The unzipping of the file meant that the file resided in FS cache afterwards. One doesn't normally flush FS memory, but I meant to do so before running. I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly. The reason is that workers do not involve the manager process when reading.

Anonymous Monk, the following is a parallel demonstration of the online code. Yes, reading line by line is not necessary; performance increases by 5x over the serial version. This is also faster than the previous parallel demonstrations by many factors.

The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The online serial demonstration completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads. Seeing nearly 6x is really good; I did not expect that.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '24m', max_workers => 8, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines   = $$chunk_ref =~ tr/\n//;
    my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, "Dictionary2GB.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

        How do you handle a chunk that ends in the middle of the pattern? I did it by completing the partial line (see code line with comment "finish partial line").

        Thanks for the timings. If possible, would you please also get a time for the grep+wc on your machine so we can tell how both these solutions compare to it.

The code is incomplete because a match could span two chunks.

You need to seek back by the longest possible match length (here 8) before reading the next chunk.

Actually the correct number is something like min( p, m )

      With p = chunksize - pos

      and m = length of longest possible match

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

The match is only on one line; that's the purpose of the line

        $_ .= <$fh> // '';

        It completes a partial line.
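A serial sketch of that partial-line completion (hypothetical chunk size and helper name, not from the thread; the trailing partial line of each chunk is held back and prepended to the next read, so no candidate match straddles a chunk boundary):

```perl
use strict;
use warnings;

# Chunked reading with the incomplete last line carried over to the
# next chunk; the leftover tail is counted once after the loop.
sub count_heldback {
    my ( $fh, $size ) = @_;
    my ( $lines, $matches, $tail ) = ( 0, 0, '' );
    while ( read( $fh, my $chunk, $size ) ) {
        $chunk = $tail . $chunk;         # prepend leftover from the previous chunk
        $chunk =~ s/([^\n]*)\z//;        # strip and remember the incomplete last line
        $tail  = $1;
        $lines   += $chunk =~ tr/\n//;
        $matches += () = $chunk =~ /123456\r?$/mg;
    }
    if ( length $tail ) {                # file did not end with a newline
        $lines++;
        $matches++ if $tail =~ /123456\r?\z/;
    }
    return ( $lines, $matches );
}

if ( my $file = shift @ARGV ) {
    open my $fh, '<', $file or die "open: $!";
    my ( $lines, $matches ) = count_heldback( $fh, 24 * 1024 * 1024 );
    close $fh;
    print "Num lines : $lines - Occ : $matches\n";
}
```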

Re: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 21, 2016 at 15:26 UTC

Update: The time is 2.2 seconds using the same demonstration below on a Mac running the upcoming MCE 1.706 release. Running with four workers also completes in 2.2 seconds. Basically, we have reached the underlying hardware limitation.

Today, I compared MCE against the 2 GB plain text file residing in FS cache and not. Increasing the chunk_size value is beneficial, especially when the file does not exist in the OS-level FS cache.

With an update to the code, simply increasing the chunk_size value from '1m' to '24m', the total time is now 3.2 seconds.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '24m', max_workers => 8, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines   = $$chunk_ref =~ tr/\n//;
    my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, "Dictionary2GB.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

    One day, I will try another technique inside MCE to see if IO performance can be improved upon.

    Resolved.

      How fast is grep+wc on your machine?

Grep and egrep run slowly on the Mac and I do not know why.

wc -l   :  2.162 seconds
grep -c : 45.316 seconds
I already wanted to remark that since the FS is the bottleneck, I'm not sure parallelizing helps (there's only one FS).

When comparing with grep/wc, please also compare the one-worker case, because grep shouldn't be parallelizing (AFAIK).

BTW: While we never saw the bash script, I suppose we have to call wc twice to get the total number of lines too (which makes comparing even more complicated, because the second wc would need to read the file again).

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

Update: Added serial code. I am happy that IO in MCE is not too far behind. One day I will try another technique. IO aside, any CPU-intensive operations such as regexes do benefit from running with multiple workers.

        Yes, IO will only go as fast as the underlying IO capabilities. MCE does sequential IO, meaning only one worker reads at any given time. The regex operation benefits from having multiple workers. Eventually, IO becomes the bottleneck.

1 worker : 9.437 secs.
2 workers: 4.480 secs.
3 workers: 3.248 secs.
4 workers: 3.236 secs.
8 workers: 3.240 secs.

Below, I removed counting and the regex from the equation, running with 1 worker. It completes as fast as IO allows, in 3.256 seconds.

mce_flow_f {
    chunk_size => '24m', max_workers => 1, use_slurpio => 1,
},
sub { }, 'Dictionary2GB.txt';

        The following serial code, reader only and without MCE, takes 2.864 seconds to read directly from the PCIe-based SSD drive, not from FS cache.

use strict;
use warnings;

my $size = 24 * 1024 * 1024;

open my $fh, '<', 'Dictionary2GB.txt' or die "$!";

while ( read( $fh, my $b, $size ) ) {
    $b .= <$fh>;
}

close $fh;

Update: Am providing updated results due to background processes running previously. I rebooted my laptop and realized that things were running faster. That meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO ( applies to use_slurpio => 1 ). Previously, I was unable to get below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

        I ran the same tests from a Linux VM via Parallels Desktop with the 2 GB plain text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the binary grep command runs much faster under Linux.

## FS cache purged inside Linux and on Mac OS X before running.

wc -l       : 1.732 secs. from virtual disk
grep -c     : 1.912 secs. from virtual disk
total       : 3.644 secs.

wc -l       : 1.732 secs. from virtual disk
grep -c     : 0.884 secs. from FS cache
total       : 2.616 secs.

Perl script : 3.910 secs. non-MCE using 1 core

              MCE 1.705    MCE 1.706
with MCE    : 4.357 secs.  4.015 secs. using 1 core
with MCE    : 3.228 secs.  2.979 secs. using 2 cores
with MCE    : 2.884 secs.  2.624 secs. using 3 cores
with MCE    : 2.908 secs.  2.501 secs. using 4 cores

## Dictionary2GB.txt residing inside FS cache on Linux.

wc -l       : 1.035 secs.
grep -c     : 0.866 secs.
total       : 1.901 secs.

Perl script : 2.314 secs. non-MCE using 1 core

              MCE 1.705    MCE 1.706
with MCE    : 2.344 secs.  2.337 secs. using 1 core
with MCE    : 1.349 secs.  1.345 secs. using 2 cores
with MCE    : 0.961 secs.  0.932 secs. using 3 cores
with MCE    : 0.820 secs.  0.775 secs. using 4 cores

        On Linux, it takes at least 3 workers to run as fast as wc and grep combined with grep reading from FS cache.

        Below, the serial code and MCE code respectively.

use strict;
use warnings;

my $size = 24 * 1024 * 1024;
my ( $numlines, $occurances ) = ( 0, 0 );

open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

while ( read( $fh, my $b, $size ) ) {
    $b .= <$fh> unless ( eof $fh );
    $numlines   += $b =~ tr/\n//;
    $occurances += () = $b =~ /123456\r?$/mg;
}

close $fh;

print "Num lines : $numlines\n";
print "Occurances: $occurances\n";

        Using MCE for running on multiple cores.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '24m', max_workers => 4, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines   = $$chunk_ref =~ tr/\n//;
    my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );
}, "/home/mario/Dictionary2GB.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

        Kind regards, Mario.

Node Type: perlquestion [id://1160637]
Front-paged by Arunbear