PerlMonks  

Optimise file line by line parsing, substitute SPLIT

by go9kata (Novice)
on Jun 03, 2013 at 12:35 UTC ( #1036737=perlquestion )
go9kata has asked for the wisdom of the Perl Monks concerning the following question:

A big text file of more than 2 million lines needs to be parsed. The file has 11 TAB-separated fields, each of which might have a variable size:

<field01><TAB><field02><TAB>...<field11><NEWLINE>

After writing a small piece of code to read the file line by line, I noticed that using the SPLIT function on each line causes a huge time delay. The split function is a very good tool, but in this case I feel it is not the proper thing to use. How can I optimise TAB-delimited file reading?

Here is an example code snippet and some time measurements.

#!/usr/bin/perl
use warnings;
use strict;

my $file = "/Path/to/file/file_X.txt";
my $line_count = 0;

open(my $fh, '<', $file) or die("Can not open file $file to read!\n");
while( !eof($fh) ) {
    # read line
    my $qry_line = <$fh>;
    $line_count++;

    # remove newline character
    chomp($qry_line);
}
close($fh);

print "\n# of lines: " . $line_count . "\n";

With the above code I measure the minimum time I can achieve by just reading the file line by line, and it is:

# of lines: 2620024

real 0m1.379s

user 0m1.162s

sys 0m0.189s

Now if I add a SPLIT call to parse each field into a variable, I get a substantial delay. The following line is added after the chomp call.

# split line to fields
my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k) = split("\t", $qry_line, 11);

After running the above code with the SPLIT function on the same file I got:

# of lines: 2620024

real 0m9.501s

user 0m9.260s

sys 0m0.239s

Now, by parsing the line with SPLIT, I add more than 8 seconds of execution time, which is quite a lot. Do you guys have any suggestions to optimise this as much as possible? The above code is supposed to run over 200+ such files, so those 8 seconds that SPLIT adds make a difference. Thank you in advance for any suggestions!

Similar discussion: http://arstechnica.com/civis/viewtopic.php?f=20&t=378424
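For comparison, here is a minimal, hedged sketch (not from the thread; the data line is invented) of two cheaper alternatives when only the first field or two are needed: a limited split that stops scanning early, and an index()/substr() extraction that avoids building a list at all.

```perl
use strict;
use warnings;

# Hypothetical line in the 11-field TAB-separated format described
# above (field names invented):
my $qry_line = join "\t", map { "field$_" } 1 .. 11;

# A limited split stops scanning after it has produced enough parts;
# here only the first two fields are wanted, so 3 parts suffice:
my ($first, $second, $rest) = split /\t/, $qry_line, 3;

# Pulling a single leading field with index()/substr() avoids
# building a list at all:
my $tab    = index $qry_line, "\t";
my $field1 = substr $qry_line, 0, $tab;
```

Whether either pays off depends on how many of the 11 fields the rest of the program actually uses.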

Replies are listed 'Best First'.
Re: Optimise file line by line parsing, substitute SPLIT
by Tux (Abbot) on Jun 03, 2013 at 14:13 UTC

    If your data is easy to split on TABs, perl's split is exceptionally fast. Building on BrowserUk's example - which is a good one - look at this test:

    $ perl -E'say join"\t",qw(1 aap 2 noot 3 mies 4 wim 5 zus 6) for 1..2_000_000' > test.tsv
    $ wc -l test.tsv
    2000000 test.tsv
    $ cat test.pl
    use 5.016;
    use warnings;
    use Benchmark qw( cmpthese );
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({ sep_char => "\t", auto_diag => 1 });

    cmpthese (3, {
        none => sub {
            open my $fh, "<", "test.tsv";
            while (<$fh>) { }
            },
        splt => sub {
            open my $fh, "<", "test.tsv";
            while (<$fh>) {
                my @f = split m/\t/ => $_, -1;
                }
            },
        tcsv => sub {
            open my $fh, "<", "test.tsv";
            while (my $f = $csv->getline ($fh)) {
                # use $f->[1] instead of $f[1];
                }
            },
        tcsvbc => sub {
            open my $fh, "<", "test.tsv";
            my @f = ("") x 11;
            $csv->bind_columns (\(@f));
            while ($csv->getline ($fh)) { }
            },
        });
    $ perl test.pl
            s/iter  tcsv tcsvbc  splt  none
    tcsv      5.26    --   -32%  -36%  -96%
    tcsvbc    3.57   48%     --   -5%  -93%
    splt      3.38   56%     6%    --  -93%
    none     0.233 2156%  1429% 1347%    --
    $

    Where vsespb misses the point is that, when determining whether a record is to be split or not, one still has to look at the record. That is exactly what BrowserUk was pointing at. Looking at the first character of a record is a cheap operation, not needing split at all; but if you want to base processing decisions on the content of the 11th field, you have to split before you can decide.

    All of the above will start changing when the fields can be quoted and have (valid) TAB characters inside the field, or if fields are allowed to have embedded newlines or when the fields are (much) longer. Compare to e.g.:

    $ perl -E'@x=("x" x 100) x 11;say join"\t",@x for 1..2_000_000' > test.tsv
    $ perl test.pl
            s/iter  tcsv tcsvbc  splt  none
    tcsv      23.3    --   -32%  -74%  -92%
    tcsvbc    15.8   47%     --  -61%  -89%
    splt      6.13  280%   158%    --  -71%
    none      1.78 1206%   788%  244%    --

    Enjoy, Have FUN! H.Merijn

      ... and maybe simply looking for the string-of-interest anywhere in the original string might be enough of a filter. Decide whether the string exists anywhere, before you spend the time needed to decide whether it's in the 11th field, instead of identifying the 11th field before you consider what it contains. If split is deemed to be pricey, split only what you (guess you might) need to.
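The look-before-you-split idea above can be sketched like this (the sample lines and the 'wanted' marker are invented for illustration):

```perl
use strict;
use warnings;

# Reject lines with a cheap index() test, and only pay for split
# on the survivors. Two made-up 11-field records:
my @lines = (
    "a\tb\tc\td\te\tf\tg\th\ti\tj\twanted",
    "a\tb\tc\td\te\tf\tg\th\ti\tj\tother",
);

my @hits;
for my $line (@lines) {
    next if index($line, 'wanted') < 0;   # cheap rejection, no split
    my @f = split /\t/, $line, -1;        # split only the survivors
    push @hits, $line if $f[10] eq 'wanted';
}
```

The index() pre-filter can give false positives (the marker might appear in another field), which is why the split and the exact field check still follow.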

Re: Optimise file line by line parsing, substitute SPLIT
by BrowserUk (Pope) on Jun 03, 2013 at 13:29 UTC

    Pick up a paper back book. Bend it in half in one hand and use your thumb to flick through the pages. Time how long it took.

    Now, using the same book, read the first word on every page. Did it take you longer?

    Reading a file and doing nothing with the lines is the metaphorical equivalent of flicking through the pages.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That's true, but not always.

      Sometimes you don't need to process even 1% of the data.

      You just read it, split, and drop the 99.9% of lines where field1 <> 'abcd' (that's where SQL can help).

      Or you read webserver logs into an in-memory hash (grouped by IP) and then score new site visitors in real time (you only need to access the records for a particular IP) (and SQL would be slower here).

      Or maybe you read a list of files from a text file, read the file listing from disk, and then compare them in memory (no use for SQL).

      Or, the general case: you read data from text files (1M lines) and skip all records which are not already in another in-memory hash (which has 10K entries).

        When you post code that does any one of those things you cite, more quickly than you can read the file and do nothing, I'll stump up for a nice polyurethane "Code Magician of the Year" award and send it to you.



        In this kind of application, try to filter first, e.g. next unless /^abcd/, and only split if you need the fields separated.

Re: Optimise file line by line parsing, substitute SPLIT
by BrowserUk (Pope) on Jun 03, 2013 at 17:11 UTC

    This uses threads to achieve a near perfect 75% reduction on overall runtime using 4 threads:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use threads;
    use threads::shared;

    my @buffers;

    sub thread {
        my $tid = threads->tid;
        print "[$tid] starting";
        my $bufRef = shift;
        open my $fh, '<', $bufRef;
        while( <$fh> ) {
            ## next if ... ## add pre-filter criteria here if applicable
            my @f = split "\t", $_;
            ## do something with the data
        }
        print "[$tid] ending";
    }

    our $N //= 199;
    our $T //= 4;

    my $start = time;
    my @threads;
    for( 0 .. $N-1 ) { ## glob '*.tsv' or whatever
        my $bufRef = \$buffers[ $_ % $T ];
        {
            local( @ARGV, $/ ) = 'numbers.tsv';
            $$bufRef = <>;
        }
        push @threads, async( \&thread, $bufRef );
        while( @threads >= $T ) {
            sleep 1 until threads->list( threads::joinable );
            for( reverse 0 .. $#threads ) {
                next unless $threads[ $_ ]->is_joinable;
                $threads[ $_ ]->join;
                splice @threads, $_, 1;
            }
        }
    }
    $_->join for @threads;
    printf "Took %f seconds\n", time() - $start;
    __END__
    C:\test>1036737-3 -T=1 -N=8
    [1] starting
    [1] ending
    [2] starting
    [2] ending
    [3] starting
    [3] ending
    [4] starting
    [4] ending
    [5] starting
    [5] ending
    [6] starting
    [6] ending
    [7] starting
    [7] ending
    [8] starting
    [8] ending
    Took 160.641885 seconds

    C:\test>1036737-3 -T=4 -N=8
    [1] starting
    [2] starting
    [3] starting
    [4] starting
    [2] ending
    [1] ending
    [5] starting
    [4] ending
    [3] ending
    [6] starting
    [7] starting
    [8] starting
    [5] ending
    [6] ending
    [7] ending
    [8] ending
    Took 39.710515 seconds


      ... thus confirming that, indeed, this is an I/O-bound process that benefits quite substantially from parallelism at (at least ...) a factor of x4. As would be expected.

      And, meaning to take absolutely no thunder away from this most excellent example, I would like to mention here that I've seen quite a few shop-written scripts (at various past-life engagements) that routinely used two command-line parameters: -s start_percent -e end_percent. For exactly this purpose.

      Each of the scripts that supported these two parameters started by determining the size of the file and, from that, the byte position represented by each of the two percentage numbers. They then did a random-access seek to the starting position, then (if not at zero%) read one line of text (presumed to be a fragment) and threw it away. Then they processed lines until, after processing a particular line, they noticed that the file position was now beyond the specified ending position. (Or end-of-file, whichever came first.) Then they ended normally. In this way, each utility could be told to process a segment of the file.

      The bash scripts (or what have you) that executed these programs launched a number of parallel copies (with non-overlapping percentages as appropriate), as shell jobs, then did a wait for all of them to finish. It's exactly the concept that BrowserUK demonstrates here, but implemented (for better or for worse) in the design of the programs and of the shell scripts that invoked them.
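The start-percent/end-percent scheme described above might look roughly like this in Perl (the file name, line contents, and percentages are all invented; this is a sketch, not code from those shops):

```perl
use strict;
use warnings;

# Build a small throwaway TAB-separated file to segment (invented data):
my $file = 'segment_demo.tsv';
open my $out, '>', $file or die "Can not create $file: $!";
print $out join("\t", $_, "x" x 10), "\n" for 1 .. 1000;
close $out;

# Process only the byte range [start_pct%, end_pct%) of the file:
# seek to the start, discard the partial line there, and stop once
# the file position passes the end mark.
sub process_segment {
    my ($file, $start_pct, $end_pct) = @_;
    open my $fh, '<', $file or die "Can not open $file: $!";
    my $size  = -s $fh;
    my $start = int $size * $start_pct / 100;
    my $end   = int $size * $end_pct   / 100;

    seek $fh, $start, 0;              # random-access seek to start byte
    <$fh> if $start > 0;              # throw away the fragment line

    my $count = 0;
    while (my $line = <$fh>) {
        my @f = split /\t/, $line, -1;
        $count++;                     # ... process @f here ...
        last if tell($fh) > $end;     # past the segment: end normally
    }
    close $fh;
    return $count;
}

# Two non-overlapping segments cover every line exactly once.
my $total = process_segment($file, 0, 50) + process_segment($file, 50, 100);
print "lines processed: $total\n";
unlink $file;
```

The discard-the-fragment convention at the start and the process-one-line-past-the-end convention at the finish are complementary, which is what lets adjacent segments hand off cleanly without shared state.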

      Some of these programs weren't Perl, and they always used operating-system calls to advise the OS that they were going to do "sequential-only" reads of those files, thereby requesting big read-ahead buffers. They also explicitly declared the files read-only and shared, thereby encouraging shared use of common buffers. None of the programs themselves spawned threads or processes, yet all of them were thus prepared to be child processes ... of the shell.

      Great post.   Thanks.

        ... thus confirming that, indeed, this is an I/O-bound process that benefits quite substantially from parallelism at (at least...) a factor of x4. As would be expected.

        Even with the code and evidence right there for your inspection, you draw completely the wrong conclusion.

        {blah} In this way, each utility could be told to process a segment of the file.

        Prove it by posting the code. You won't (because you can't). Talk is cheap (and in your case invariably wrong!).

        What makes my code work is that all the IO is sequential. And that requires shared state.



        ... thus confirming that, indeed, this is an I/O-bound process that benefits quite substantially from parallelism at (at least...) a factor of x4. As would be expected.

        No. Throwing memory at the IO speeds up the reading. Throwing threads at that memory speeds up the splitting.

Re: Optimise file line by line parsing, substitute SPLIT
by ramlight (Friar) on Jun 03, 2013 at 13:28 UTC

    Before we can know whether the extra 8 seconds to use split is of any consequence, we need to know how much other processing you are doing on each line. If the time to process one file is on the order of a minute or so, then an extra 8 seconds/file becomes noticeable. If it takes 10 minutes/file, then an extra 8 seconds doesn't make much difference.

    If you don't yet know the answer to this question, then it is too soon to make this kind of optimization.

Re: Optimise file line by line parsing, substitute SPLIT
by hbm (Hermit) on Jun 03, 2013 at 14:40 UTC

    For the little it's worth here, I'd take advantage of $. and $_. Minimize what you are doing millions of times; and then yes, run a handful in parallel.

    open(my $fh, '<', $file) or die("Can not open file $file to read!\n");
    while(<$fh>) {
        chomp;
        my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k) = split("\t", $_, 11);
        # do something real
    }
    print "\n# of lines: $.\n";
    close($fh);
Re: Optimise file line by line parsing, substitute SPLIT
by vsespb (Chaplain) on Jun 03, 2013 at 12:44 UTC

    I had the same problem, and I did not find any solution. One thing I can tell you: if you later validate your records with regexps, it might be better to replace split with a regexp and combine it with the validation, because split is not much faster than a regexp.

    (example of my format/code)

    Also, maybe unpack() is sometimes faster than split?
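The combine-split-with-validation idea above can be sketched as follows (the three-field record layout, numeric id plus uppercase code plus decimal value, is invented):

```perl
use strict;
use warnings;

# One invented TAB-separated record:
my $line = "42\tABC\t3.14";

# Approach 1: split, then validate field by field.
my ($id, $code, $val) = split /\t/, $line, 3;
die "bad record" unless $id =~ /\A\d+\z/ && $code =~ /\A[A-Z]+\z/;

# Approach 2: validate and capture in a single anchored regexp,
# so the line is scanned once instead of split-then-checked.
if ($line =~ /\A(\d+)\t([A-Z]+)\t([\d.]+)\z/) {
    ($id, $code, $val) = ($1, $2, $3);
}
```

Note that unpack() only helps when the fields have fixed widths; for variable-width TAB-separated fields, the combined regexp is the more natural fit.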

Re: Optimise file line by line parsing, substitute SPLIT
by nbtrap (Sexton) on Jun 06, 2013 at 02:04 UTC
    If memory is not lacking, you can slurp the entire file and then process the lines, as in:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $file = "/Path/to/file/file_X.txt";

    open(my $fh, '<', $file) or die("Can not open file $file to read!\n");
    my @lines = <$fh>;
    close($fh);

    for (@lines) {
        chomp;
        ## process the rest here, including calls to split
    }

    print "\n# of lines: " . scalar(@lines) . "\n";
    This has a good chance to speed things up, presumably because perl will make fewer calls to read(2).
Re: Optimise file line by line parsing, substitute SPLIT
by sundialsvc4 (Abbot) on Jun 03, 2013 at 13:26 UTC

    If you were to launch multiple copies of the program in parallel with one another (e.g. just use trailing-& on the Unix shell command-line to start with ...), how much time is added to the overall process if, say, 5 files were processed at the same time?   10?

    My guess is that you can get away with that, and if so, the 8-second difference in time becomes a lot less critical to the completion of the business task. (Using split really is clearer ...) There will be some sort of "sweet spot" where the completion time is about the same ... then a rather abrupt shift (a knee-shaped bend in the performance curve ... a thrash-point) after which completion times will fall through the floor. But you can fool around with this idea using just the command-line job facilities of Unix/Linux. I'd love to know if I guessed right.
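A Perl-side version of this launch-several-copies idea can be sketched with fork (the file names and job limit here are invented; a real script would do the open/split work, or exec the parser, in each child):

```perl
use strict;
use warnings;

# Invented list of input files and a job cap, standing in for the
# shell's trailing-& plus wait:
my @files    = ('file_001.txt', 'file_002.txt', 'file_003.txt');
my $max_jobs = 5;

my @pids;
my $failures = 0;
for my $file (@files) {
    if (@pids >= $max_jobs) {         # throttle: reap one job first
        my $done = shift @pids;
        waitpid $done, 0;
        $failures++ if $?;
    }
    my $pid = fork // die "fork failed: $!";
    if ($pid == 0) {
        # child: open/split/process $file here ...
        exit 0;
    }
    push @pids, $pid;                 # parent: remember the child
}
for my $pid (@pids) {                 # the shell-style final "wait"
    waitpid $pid, 0;
    $failures++ if $?;
}
```

Varying $max_jobs is exactly the experiment described above: somewhere there is a sweet spot, and beyond it the jobs start fighting over the disk.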

Re: Optimise file line by line parsing, substitute SPLIT
by Anonymous Monk on Jun 03, 2013 at 12:51 UTC

    Now by parsing the line with SPLIT I add more than 8sec of execution time, which is quite a lot.

    No it isn't

Re: Optimise file line by line parsing, substitute SPLIT
by Anonymous Monk on Jun 03, 2013 at 12:52 UTC

    Do you guys have any suggestions to have this as much optimised as possible? The above code is suppose to run over 200+ such files, so those 8sec that SPLIT adds kind of make a difference.

    No they don't

Node Type: perlquestion [id://1036737]
Front-paged by Corion