PerlMonks  

Processing while reading in input

by onlyIDleft (Scribe)
on Sep 19, 2018 at 23:51 UTC ( #1222680=perlquestion )

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

I have a tab-separated, 2-column input file with clustering information: column 1 contains the ID of the cluster representative, while column 2 contains the ID of the cluster member.

Put differently, if there are 1000 elements clustered into 50 clusters, my input file will have 1000 lines, with the ID of the cluster member in column 2 and the ID of the cluster representative in column 1.

Therefore, the first line for each cluster will necessarily contain two identical columns, i.e. the cluster representative and cluster member are identical.

If a cluster has more than one member, then in the next row(s) column 1 still contains the same cluster representative ID, but column 2 contains the ID of a different cluster member.

Please see example below:

Osat_a Osat_a # just one cluster member
Atha_b Atha_b # >1 cluster member, this & next line = 2 members
Atha_b Mtru_c
Fves_d Fves_d # this & next 2 lines = 3 cluster members
Fves_d Osat_e
Fves_d Atha_f
Atha_g Atha_g # just 1 cluster member
Osat_h Osat_h
Osat_h Atha_i
Mtru_j Mtru_j # just 1 cluster member

The input file is very large (~20 GB), which is much more than my machine's RAM. I suppose one way to process such a large input is to break the input file into pieces that can be held in RAM, right? The other way, which I'm hoping to get help with here, is to process the input straight away while reading it in from the file handle, without writing to some large hash or array that crashes my machine! Usually I save the input to a hash or array, so processing while reading in lines would be new to me, hence this request for help.

The output I need to generate from this input should be as follows:

Osat_a
Atha_b, Mtru_c
Fves_d, Osat_e, Atha_f
Atha_g
Osat_h, Atha_i
Mtru_j

Thanks, in advance, for your algorithm advice

Replies are listed 'Best First'.
Re: Processing while reading in input
by tybalt89 (Parson) on Sep 20, 2018 at 00:34 UTC
    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1222680
    use strict;
    use warnings;

    while( <DATA> )
      {
      my ($cluster, $member) = split;
      print $cluster eq $member ? "\n" x ($. > 1) : ', ', $member;
      }
    print "\n";

    __DATA__
    Osat_a Osat_a # just one cluster member
    Atha_b Atha_b # >1 cluster member, this & next line = 2 members
    Atha_b Mtru_c
    Fves_d Fves_d # this & next 2 lines = 3 cluster members
    Fves_d Osat_e
    Fves_d Atha_f
    Atha_g Atha_g # just 1 cluster member
    Osat_h Osat_h
    Osat_h Atha_i
    Mtru_j Mtru_j # just 1 cluster member

      Thank you, tybalt89. Your code worked in terms of generating the expected output.

      Could you please explain in detail the following 2 lines from your code? Thank you!

      my ($cluster, $member) = split;
      print $cluster eq $member ? "\n" x ($. > 1) : ', ', $member;

        Let me presume to answer for tybalt89.

        my ($cluster, $member) = split;

        This relies on the default behavior of split, and is equivalent to
            my ($cluster, $member) = split ' ', $_;
        The ' ' split pattern is a special case explained in the split docs.
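For instance (a small illustrative snippet, not from the thread), the special ' ' pattern also discards leading whitespace and splits on any whitespace run, which a literal / / pattern does not:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "  Fves_d\tOsat_e\n";

# split ' ' (the special case): leading whitespace is skipped, and any
# run of whitespace (spaces, tabs, newlines) acts as one delimiter
my @a = split ' ', $line;    # ('Fves_d', 'Osat_e')

# split / / (a literal one-space pattern): the two leading spaces
# produce two empty leading fields, and the tab is not a delimiter
my @b = split / /, $line;    # ('', '', "Fves_d\tOsat_e\n")

print scalar(@a), " vs ", scalar(@b), "\n";    # 2 vs 3
```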

        print $cluster eq $member ? "\n" x ($. > 1) : ', ', $member;

        This is a bit more tricksy. From the inside out:

        • ($. > 1)    $. is the input line counter (update: see perlvar). ($. > 1) evaluates to either '' (empty string) or 1, and will be 1 for every input line after the first.
        • "\n" x ($. > 1)   Repeats a newline zero times for the first line of input (the empty string is silently promoted to 0 in this special case), and once for every subsequent input line.
        • $cluster eq $member ? Newline_or_Nada : ', '   Ternary expression. If $cluster eq $member is true, output a newline for every input line after the first (see previous item); if false, output the ', ' string.
        • print Ternary_Expression, $member;   Print the result of the ternary expression (see previous item), then the $member string.
        And that's all there is to it (I think).
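The repetition trick can be tried in isolation (a standalone snippet, not part of the original code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A Perl comparison returns 1 for true and '' for false; the false
# value numifies silently to 0, so as a count for the 'x' operator
# it repeats the string zero times.
my $on_first_line = "\n" x (1 > 1);    # ''   (zero newlines)
my $on_later_line = "\n" x (2 > 1);    # "\n" (one newline)

print length($on_first_line), "\n";    # 0
print length($on_later_line), "\n";    # 1
```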

        Update: Minor wording changes.


        Give a man a fish:  <%-{-{-{-<

Re: Processing while reading in input
by AnomalousMonk (Chancellor) on Sep 20, 2018 at 00:29 UTC

    In your example input, all the clusters occur contiguously, i.e., all Osat_a members (just the one), then all the Atha_b members, all Fves_d members, etc. Is this the case in your real data, or might you have data like, e.g.,

    Osat_a Osat_a # just one cluster member
    Atha_b Atha_b # >1 cluster member, this & next line = 2 members
    Fves_d Fves_d # this & next 2 lines = 3 cluster members
    Osat_h Osat_h
    Atha_b Mtru_c
    Fves_d Osat_e
    Atha_g Atha_g # just 1 cluster member
    Fves_d Atha_f
    Osat_h Atha_i
    ...    ...
    where cluster members are promiscuously mingled?

    If the former case (all cluster members contiguous) is true, processing of very large files is easy: just buffer all of a cluster's members until you detect the transition from one cluster to the next, then write out all the buffered members. This could scale to millions of cluster members.
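A minimal sketch of that buffering scheme (illustrative code, not from the thread; it reads a short sample from DATA here, but would read the real file line by line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream line by line; only one cluster's members live in memory at a time.
my ($current, @members);
my $output = '';

sub flush_cluster {
    $output .= join(', ', @members) . "\n" if @members;
    @members = ();
}

while ( my $line = <DATA> ) {
    my ($rep, $member) = split ' ', $line;
    next unless defined $member;
    if ( !defined $current or $rep ne $current ) {
        flush_cluster();          # cluster boundary: write out the buffer
        $current = $rep;
    }
    push @members, $member;
}
flush_cluster();                  # don't forget the last cluster
print $output;

__DATA__
Osat_a Osat_a
Atha_b Atha_b
Atha_b Mtru_c
Fves_d Fves_d
Fves_d Osat_e
Fves_d Atha_f
```

This prints one line per cluster, e.g. "Atha_b, Mtru_c" for the two-member cluster, matching the requested output format.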

    In the latter case, something like LanX's suggestion seems the way to go.


    Give a man a fish:  <%-{-{-{-<

      The input is ordered contiguously. You are correct in your observation.

Re: Processing while reading in input
by AnomalousMonk (Chancellor) on Sep 20, 2018 at 06:16 UTC

    tybalt89's solution processes an input file line-by-line and so has the advantage that it will scale to an input file of any size (well, as long as your HD will hold both the input and output files :).

    It seems to me to have the disadvantage of... terseness, shall we say? Let me offer an alternative that is line-by-line and that also:

    • Makes a gesture in the direction of input validation. (I strongly believe that time spent on data validation is well spent.)
    • Uses easily adapted regexes to validate input data.
    • Makes a gesture toward ignoring input that is not of interest.
    • Is modular and therefore highly adaptable.
    • Incorporates a testing framework for development. (This could be further elaborated by moving the code into its own .pm module and writing a .t file for testing.)
    • While being considerably more verbose, is, I would argue, much more maintainable.
    So, FWIW:
    Script:

    Output:

    c:\@Work\Perl\monks\onlyIDleft>perl process_cluster_info_3.pl
    ok 1 - test output
    1..1
    ok 2 - no warnings
    1..2
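The script itself is not reproduced in this copy of the thread. Purely as an illustration of the approach the bullets describe (strict per-line validation plus a Test::More check; the regex, helper name, and sample data are invented here, not taken from process_cluster_info_3.pl), a sketch might look like:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 1;

# Hypothetical ID pattern for this data set; adapt the regex as needed.
my $id = qr/ [A-Z] [a-z]{3} _ [a-z] /x;

sub process {
    my ($fh) = @_;
    my ($current, @members, @out);
    while ( my $line = <$fh> ) {
        # Validate each line; warn about and skip anything malformed.
        if ( $line !~ m/ \A ($id) \s+ ($id) \s* (?:\#.*)? \s* \z /x ) {
            warn "ignoring malformed line: $line";
            next;
        }
        my ($rep, $member) = ($1, $2);
        if ( !defined $current or $rep ne $current ) {
            push @out, join ', ', @members if @members;
            ($current, @members) = ($rep);
        }
        push @members, $member;
    }
    push @out, join ', ', @members if @members;
    return join "\n", @out;
}

my $input = <<'EOT';
Osat_a Osat_a
Atha_b Atha_b # comment
Atha_b Mtru_c
EOT
open my $fh, '<', \$input or die $!;
is process($fh), "Osat_a\nAtha_b, Mtru_c", 'test output';
```

Keeping the logic in a sub, as above, is what makes it easy to move into a .pm module with its own .t file later.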


    Give a man a fish:  <%-{-{-{-<

Re: Processing while reading in input
by LanX (Archbishop) on Sep 20, 2018 at 00:10 UTC
    Update: The following reply deals with unsorted input. I didn't expect such a trivial case ...


    Your main problem is generating the output file. I'd suggest generating one temporary file per cluster and merging them at the end.* This way, you just need to append to the temporary files and keep track of the clusters.

    Now, for processing the input, you could use a sliding window to read big chunks (like 100 MB), but I don't think it'll make a big difference compared to reading line by line with readline (since the main limitation here is hard-disk speed, and Perl and the OS already read in big chunks behind the scenes).

    But I would certainly group the write operations, e.g. processing n = 1 million lines before writing out. Collect the entries in a hash of arrays ( push @{$hash{$cluster}}, $entry ) and append them to the temporary cluster files ( open has an append mode, '>>' ). Then empty the hash to avoid memory problems and process the next n lines.

    NB: In case the entries have to be unique within a cluster (you haven't been precise about that) you'd need a hash of hashes and a more complicated approach.
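A rough sketch of that chunked, append-to-temp-files approach (all file names and the chunk size are illustrative; a real run would use something like n = 1 million lines per chunk and read the 20 GB input from a file handle rather than DATA):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir        = tempdir( CLEANUP => 1 );   # one temp file per cluster
my $chunk_size = 2;      # tiny here for demonstration; ~1e6 in practice
my %buffer;              # hash of arrays: cluster => [members]
my $count = 0;
my (%seen, @order);      # remember clusters in first-seen order for merging

sub flush_buffer {
    for my $cluster ( keys %buffer ) {
        # '>>' appends, so successive flushes accumulate in the same file
        open my $fh, '>>', "$dir/$cluster.tmp" or die "open: $!";
        print {$fh} map { "$_\n" } @{ $buffer{$cluster} };
        close $fh;
    }
    %buffer = ();        # empty the hash to keep memory bounded
}

while ( my $line = <DATA> ) {
    my ($cluster, $member) = split ' ', $line;
    next unless defined $member;
    push @order, $cluster unless $seen{$cluster}++;
    push @{ $buffer{$cluster} }, $member;
    flush_buffer() if ++$count % $chunk_size == 0;
}
flush_buffer();          # final partial chunk

# Merge step: one output line per cluster, members comma-separated.
my $output = '';
for my $cluster (@order) {
    open my $fh, '<', "$dir/$cluster.tmp" or die "open: $!";
    chomp( my @members = <$fh> );
    close $fh;
    $output .= join(', ', @members) . "\n";
}
print $output;

__DATA__
Atha_b Atha_b
Fves_d Fves_d
Atha_b Mtru_c
Fves_d Osat_e
Fves_d Atha_f
```

Note that the sample DATA is deliberately mingled (not contiguous); the per-cluster temp files make the order of arrival irrelevant.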

    HTH!

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    *) I'm not sure about the most efficient way, OS-wise, to merge large files, but Google or the Monastery should know. I'm critical of this obsession you bio-guys have with creating huge files. I'd rather have the data separated into several smaller files and zip them together.

Node Type: perlquestion [id://1222680]
Approved by Paladin