Process large text data in array

by hankcoder (Scribe)
on Mar 10, 2015 at 14:31 UTC

hankcoder has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to find better solutions or at least if possible to reduce the speed of processing large text data which is read from a file and put into an array. I will try to keep everything as simple as possible and post the related code here.

The current test is on a local Windows XP Pro machine running ActivePerl. The live system will be in a Unix/Linux environment.

The text file format is line by line, not fixed length. The current test file contains 300k lines, about 38MB. A sample line format is:

id_1=value|id_2=value|id_3=value|.....

I tested 2 ways of retrieving the file content; both work very fast, in just about 4 sec.

#-------------------------------------------------------#
#-------------------------------------------------------#
sub get_filecontent {
    my @temp_data;
    open (TEMPFILE,"$_[0]");
    @temp_data = <TEMPFILE>;
    close (TEMPFILE);
    return( @temp_data );
}

#-------------------------------------------------------#
#-------------------------------------------------------#
sub get_fileRef {
    my ($fname, $ref_dat) = @_;
    open (TEMPFILE,"$fname");
    @{$ref_dat} = <TEMPFILE>;
    close (TEMPFILE);
}

However, after reading all the data into memory, I need to process it from beginning to end once to get the wanted data lines based on given criteria, and to find the total number of matches. With this processing added, the total time it takes is about 37 sec.

I'm not sure if this speed is normal, but if I can reduce it, that would be really great.

The code used to process the array is here:

#-- filter
my (@new_dat) = ();
foreach my $line (@loaded_data)   #-- loop thru all data
{
    chomp($line);
    my (%trec) = &line2rec($line);
    if ($trec{'active'}) { push(@new_dat, $line); }
}
(@loaded_data) = (@new_dat);   #-- overwrite
(@new_dat) = ();

Subroutine code for converting line2rec:

#---------------------------------------------------#
# LINE2REC
#---------------------------------------------------#
# convert a line into a record by separator |
sub line2rec {
    my ($line) = @_;
    my (@arr)  = split( /\|/, "$line" );
    my (%trec) = &hash_array(@arr);
    return (%trec);
}

#---------------------------------------------------#
#---------------------------------------------------#
sub hash_array {
    my (@arr) = @_;
    my ($line, $name, $value, $len_name);
    my (@parts)  = ();
    my (%hashed) = ();
    foreach $line (@arr) {
        chomp($line);
        if ($line =~ /=/) {
            (@parts) = ();
            (@parts) = split( /\=/, $line );   #-- just in case got more than one = separator
            $name     = "$parts[0]";           #-- use first element as name
            $len_name = length($name)+1;
            $value    = substr( "$line", $len_name, length("$line")-$len_name );
            #-- !! cannot use join; if last char is separator then it will disappear after split
            $hashed{$name} = $value;
        }
    }
    return (%hashed);
}

If I comment out the call to &line2rec($line), the time drops to 10 sec. So I guess this sub's code can be further improved.

Any suggestions are much appreciated. Thanks.

Replies are listed 'Best First'.
Re: Process large text data in array
by BrowserUk (Pope) on Mar 10, 2015 at 14:56 UTC
    However, after read all data into memory, I need to process it from beginning to end once to get wanted data line based on given criteria,

    Why read it all in -- ie. read every line, allocate space for every line, extend the array to accommodate every line -- if you only need to process the array once?

    In other words, why not?:

    while( <TEMPFILE> ) { processLine( $_ ) }

    Also, you are throwing away performance with how you are passing data back from your subroutines. Eg:

    return (%hashed);

    That builds a hash in hash_array(), the return statement converts it to a list on the stack; then back at the call site:

    my (%trec) = &hash_array(@arr);

    You convert that list from the stack back into another hash. Then, you immediately return that hash to the caller of line2rec(), converting it to another list on the stack:

    my (%trec) = &hash_array(@arr); return (%trec); }

    And then back at that call site, you convert that list back into yet another hash:

    my (%trec) = &line2rec($line);

    And all of that in order to test if the line contains the string 'active':

    if ($trec{'active'})

    The whole process can be reduced to (something like; the regex will probably need tweaking to select the appropriate field):

    my @data;
    while( <TEMPFILE> ) {
        /active/ and push @data, $_;
    }

    It'll be *much* faster.
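    For example, if the field appears as active=1 -- which is just a guess at your data layout -- the tweak might be something like:

    my @data;
    while( <TEMPFILE> ) {
        push @data, $_ if /(?:^|\|)active=1(?:\||$)/;
    }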



      BrowserUk, thanks for pointing it out. The process is not only checking for the "active" value; there is more checking, this is only a sample. I built the code into subs so it is easier for me to refer to and debug in the future.

      I would prefer to use a separate sub call to get the file content instead of using

      while( <TEMPFILE> ) { processLine( $_ ) }

      in every part of the code where I retrieve the file content. I'm taking your notes and will do more testing on all of it. Thanks.

        Swapping back and forth (and back and forth again) between the hash and a list is still inefficient.

        Use a hash reference instead, so it won't have to make multiple copies of your hash contents.
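        For example, an untested sketch of the reference-based version (assuming the simple split-based parsing is safe for your data):

        sub line2rec {
            my ($line) = @_;
            chomp $line;
            my %trec = split /[|=]/, $line;   #-- assumes values never contain | or =
            return \%trec;                    #-- hand back one reference, not a copy of every key/value
        }

        my $trec = line2rec($line);
        if ( $trec->{'active'} ) { push(@new_dat, $line); }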

        I share the opinion that it is quite unnecessary to read a 38MB disk-file into virtual memory in order to process it.   In particular, when that file becomes, say, “10 times larger than it now is,” your current approach might begin to fail.   It’s just as easy to pass a file-handle around, and to let that be “your input,” as it is to handle a beefy array.   Also consider, if necessary, defining a sub (perhaps, a reference to an anonymous function ...) that can be used to filter each of the lines as they are read:   the while loop simply goes to the next line when this function, say, returns False.
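        A rough sketch of that idea (untested; the names are purely illustrative):

        sub process_file {
            my ($fh, $keep) = @_;         # $keep is a code-ref deciding which lines to use
            while ( my $line = <$fh> ) {
                chomp $line;
                next unless $keep->($line);
                # ... handle the wanted line here ...
            }
        }

        open my $fh, '<', $file_name or die "Cannot open $file_name: $!";
        process_file( $fh, sub { $_[0] =~ /active/ } );
        close $fh;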

        We know that BrowserUK, in his daily $work, deals with enormous datasets in very high-performance situations.   If he says what he just did about your situation, then, frankly, I would take it as a very well-informed directive to “do it that way.”   :-)

Re: Process large text data in array
by Corion (Pope) on Mar 10, 2015 at 14:38 UTC

    I don't know if this makes things faster, but whenever I'm using split to split up data, I find it's usually easier to match what I want to keep:

    foreach my $line (@arr) {
        if( $line =~ m/^([^=]+)=(.*)/ ) {
            my( $name, $value )= ($1,$2);
            $hashed{ $name }= $value;
        } else {
            warn "Unhandled line '$line'";
        };
    };

    This approach also eliminates your substr gymnastics.
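    Wrapped back into your line2rec, that might look like this (untested):

    sub line2rec {
        my ($line) = @_;
        chomp $line;
        my %hashed;
        foreach my $field ( split /\|/, $line ) {
            if( $field =~ m/^([^=]+)=(.*)/ ) {
                $hashed{ $1 } = $2;
            } else {
                warn "Unhandled field '$field'";
            };
        };
        return %hashed;
    }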

      Corion, your code does speed up the process, but only by a few seconds, making it about 34 sec in total to complete.

Re: Process large text data in array
by hdb (Monsignor) on Mar 10, 2015 at 14:42 UTC

    Turning a line of the format id_1=value|id_2=value|id_3=value|.....
    into a hash can be vastly simplified:

    my $line = "id_1=value|id_2=value|id_3=value";
    my %hash = split /[|=]/, $line;

      This will result in weird behaviour if the string contains more than one equal sign (=) per column:

      foo=bar=baz|bar=bambam

        That is correct! If this case can happen and one insists on splitting on =, then the third parameter of split might be useful:

        @parts = split /=/, $line, 2;

        will return at most two parts, split on the first (if any) equal sign.
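        Combining that with the split on | gives something like this (assuming every field contains an = and that values never contain the | separator):

        my %hashed = map { split /=/, $_, 2 } split /\|/, $line;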

        Just to share with you all: before I store any values into my formatted data line, I do HTML::Entities::encode_numeric to make sure those unsafe characters are encoded.

        id_1=[encoded value]|.....

      hdb, your code is excellent!! The time is reduced to only 21 sec to complete. My previous sub code was rather old, and the previous data format may contain more than 1 delimiter character. But all my current data will have "safe characters" encoding before storing, so I guess it is safe to use your code for my purposes.

      If it is not too much trouble, could you maybe help me improve the reversal of line2rec? Or is that the simplest and fastest it can go?

      #---------------------------------------------------#
      # REC2LINE
      #---------------------------------------------------#
      sub rec2line {
          my (%trec)    = @_;
          my ($newline) = "";
          my ($line);
          foreach $line (keys %trec) {
              if ($newline ne "") { $newline .= "|"; }
              $newline .= "$line=$trec{$line}";
          } # end foreach
          return ("$newline");
      } # end sub

      Thanks again.

        That is what join is for:

        $newline = join "|", map { "$_=$trec{$_}" } keys %trec;
Re: Process large text data in array
by Laurent_R (Canon) on Mar 10, 2015 at 18:55 UTC
    at least if possible to reduce the speed of processing large text data
    Reducing the speed of your processing of large data is very easy (and does not need any of the counterproductive advice given to you so far by other monks): just add calls to the sleep function. For example (untested code example, because I do not have your data):
    my (@new_dat) = ();
    foreach my $line (@loaded_data)   #-- loop thru all data
    {
        chomp($line);
        my (%trec) = &line2rec($line);
        sleep 1;
        if ($trec{'active'}) { push(@new_dat, $line); }
    }
    Serious benchmarking would be needed, but this is likely to reduce the speed by a factor of about 10,000. If this is not enough of an improvement, just change the parameter passed to the sleep builtin to a larger value.

    Je suis Charlie.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 10, 2015 at 15:01 UTC

    Sorry for posting this in a separate reply. I re-post it here so others can view it easily.

    hdb, your code is excellent!! The time is reduced to only 21 sec to complete. My previous sub code was rather old, and the previous data format may contain more than 1 delimiter character. But all my current data will have "safe characters" encoding before storing, so I guess it is safe to use your code for my purposes.

    If it is not too much trouble, could you maybe help me improve the reversal of line2rec? Or is that the simplest and fastest it can go?

    #---------------------------------------------------#
    # REC2LINE
    #---------------------------------------------------#
    sub rec2line {
        my (%trec)    = @_;
        my ($newline) = "";
        my ($line);
        foreach $line (keys %trec) {
            if ($newline ne "") { $newline .= "|"; }
            $newline .= "$line=$trec{$line}";
        } # end foreach
        return ("$newline");
    } # end sub

    Thanks again.

Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 12:10 UTC

    ** Fastest approach tested so far **

    As suggested by BrowserUk, I have done a test using the file-reading method he suggested. The results are absolutely encouraging. From the previous reading + processing time of about 21 sec, it is reduced to just 15 sec or less, even with the data increased from 300k to 400k lines.

    my (@dat) = ();
    open (DATF, "<$file_name");
    while( <DATF> ) {
        my ($line) = $_;
        chomp($line);
        my (%trec) = &line2rec($line);

        # just do some filtering here
        if ($trec{'active'}) { }

        # just testing to move every data line into array
        push (@dat, $line);
    }
    close(DATF);

      Try this out:

      my (@dat) = ();
      my @filters;
      push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
      push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

      open my $DATF, '<', $file_name;
      while( my $line = <$DATF> ) {
          chomp $line;
          foreach my $filter (@filters) {
              $filter->($line) or next;
              push (@dat, $line);
              last;
          }
      }
      close($DATF);

      An alternative is this:

      use threads;
      use Thread::Queue;

      use constant MAXTHREADS => 2;

      my $workQueue = Thread::Queue->new();
      my $outQueue  = Thread::Queue->new();

      my @threads = map { threads->new( \&worker ) } 1..MAXTHREADS;

      open my $DATF, '<', $file_name;
      while ( <$DATF> ) {
          $workQueue->enqueue($_);
      }
      close $DATF;
      $workQueue->end();

      $_->join for @threads;
      $outQueue->end();

      my @dat;
      while ( defined( my $line = $outQueue->dequeue() ) ) {
          push @dat, $line;
      }

      sub worker {
          my @filters;
          push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
          push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };
          while ( defined( my $line = $workQueue->dequeue() ) ) {
              chomp $line;
              foreach my $filter (@filters) {
                  $filter->($line) or next;
                  $outQueue->enqueue($line);
                  last;
              }
          }
      }

      The benefit of multithreading is that you can dial your performance up and down depending on how many resources are available to you. This currently requires you to read the entire file into memory first; however, pushing the read process into a separate thread resolves that issue, and pushing the outqueue processing into a separate thread also helps reduce the memory footprint (assuming you're doing something like writing the data into a filtered output file).

        The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you.

        Sorry, but have you actually run and timed that code?

        Because it will, unfortunately, run anything from 5 to 50 times slower than the single threaded version on any build of Perl, or OS, I am familiar with.


Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 13:17 UTC

    My major concern now is: what will happen if the user interrupts the process in the while loop while the file is still open?

    open (DATF, "<$file_name");
    while( <DATF> )
    {
        #-- do whatever here
        #-- user may interrupt before finishing the while...loop
    }
    close(DATF);

    Should I be worried about this? Currently I'm only using this while loop method for input (reading). As for writing data, the data would likely be corrupted.

    Any suggestions on this? Thanks.

      My major concern now is what will happen when user interrupted the process in the while..loop when file still open?

      The same thing as would happen if he interrupted the program while you were filling the array in your OP code.

      That is: the file will be closed and the program will exit without producing any output. As you are only reading the file, no data will be harmed.

      Of more concern is what happens if you are producing output from within the while loop. Then, if the user interrupts, the output file can contain only partial data.

      To address the latter concern -- and prevent any worries about the former -- install an interrupt handler near the top of your program (or in a BEGIN block):

      $SIG{ INT } = sub{};
      ...
      while( ... ) {
          ...
      }

      That will prevent the user interrupting with ^C. You can do a similar thing for most other signals that the user might use to interrupt.

      Search for "%SIG" in perlvar.
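      For example, to install the same do-nothing handler for a few common ones in one go (a minimal sketch; which signals are actually deliverable depends on your OS and how the script is run):

      $SIG{ $_ } = sub{} for qw( INT TERM HUP );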



        Oh wow, this is something new for me to look into.

        What if the interruption is a disconnected line / closing the browser / stopping the page load? My program is mainly run through a web browser; there is no command-line execution of concern here. Is the interrupt handler you suggested able to capture these, or are they all the same?

        My own theory is that, if it is possible to capture such an interrupt with a custom INT handler function, then I should be able to do some cleanup in that function. E.g.

        sub INT_handler {
            # check for any unfinished jobs
            # close all files
            exit(0);
        }

        $SIG{'INT'} = 'INT_handler';

        The code above is my own untested modification, put together from a Google search.
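        Extending that (still untested) to register the same cleanup for several signals, since the exact signal delivered on a client disconnect can vary with the web server:

        sub cleanup_handler {
            # check for any unfinished jobs
            # close all files
            exit(0);
        }
        $SIG{$_} = \&cleanup_handler for qw(INT TERM PIPE HUP);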
