Beefy Boxes and Bandwidth Generously Provided by pair Networks kudra
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Reading concurrently two files with different number of lines

by frogsausage (Sexton)
on Apr 10, 2013 at 15:24 UTC ( #1027997=perlquestion: print w/ replies, xml ) Need Help??
frogsausage has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've been reading you quite a lot for answers about so many questions and always found what I wanted, until now.

I am trying to read two (very) long files in order to compare them in a smart way: checking if some elements (such as value=3.14 vs value="3.14") are swapped on the same line.

Also:

- there are a lot of lines that I will be willing to discard as soon as I read them. Therefore, I am trying not to store these in memory as each file can go way beyond 100 000 lines each.

- I might append one or more following line (starting with a +) to the previous line starting with a letter if: this first line doesn't match with the one in the other file, if one of the following isn't matching.

Lines can be such as:

ABC a b c value=3.14 + value2="2.04"

or

ABC a b c value=3.14 + value2="2.04" + value3=text

Right now, I am reading them in this very simple way:

while (defined(my $lineA = <FILE_A>) && defined(my $line_b = <FILE_B>) +) { ... compare_line(lineA=$lineA, lineB=$lineB); ... }

When running small test cases, it works really great (swap comparison etc.) However, I have some glitches and I guess that the longest file doesn't have its line read when the end of the shorter file is reached. These glitches are that one of the line starting with a + is the start of a new line in my result print (while it should always be appended after my first line).

I tried changing && to || but it got all messed up. I am thinking of dealing the remaining part of the longest file after the end of the shortest one is reached, however it doesn't sound really clean.

Looking forward reading your thoughts and suggestions!

-F

P.S: running Perl 5.8.8

Comment on Reading concurrently two files with different number of lines
Select or Download Code
Re: Reading concurrently two files with different number of lines
by Anonymous Monk on Apr 10, 2013 at 15:41 UTC

    It might be simplest to loop while (!eof(A) or !eof(B)), and then read a line of each (if available) inside the loop body.

      Thanks for the answer.

      After looking at the perldoc, eof(FH) is only returning if EOF is reached. In that case, it seems that using defined() will do as good as either $lineA or $lineB will be set to undef (and therefore be tested in the loop).

      Is that correct?

Re: Reading concurrently two files with different number of lines
by bioinformatics (Friar) on Apr 10, 2013 at 16:09 UTC
    In the above code, you are limited to the length of the shortest file. Instead, you can read each line separately into a hash with the key being the line number. You can do a single pass comparison for the same key in each hash.

    Example:
    # read in file one and place into a hash my $line_number = 1; while ( <$infile1> ) { chomp; $hash1{$line_number} = $_; $line_number++; } # read in file 2 and place in a hash $line_number = 1; while (<$infile2>) { chomp; $hash2{$line_number) = $_; $line_number++; } # determine the file with the largest number of keys my @keys1 = keys %hash1; my @keys2 = keys %hash2; my $NoK1 = @keys1; my $NoK2 @keys2; if ($NoK1 > $NoK2) { for $key (sort {$a <=> $b} keys %hash1) { if (!defined $hash2{$key}) { next; } elsif ($hash1{$key} eq $hash2{$key}) { #do nothing? next; } else { print "Line number $key in both files is different"; } } # the rest of the code you can figure out from here ...
    Hope that helps!

    Bioinformatics
      Why a hash? If there are no "gaps", you can use an array.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks for the help! I thought about reading the two files separately but the only thing is I can't discard all line-to-line matching files.

      However I will definitely put everything in a hash as it is what I need later on for further comparisons.

      The only thing I am worried about is memory consumption and slowdown. How much is 100 000 lines in a hash structure?

      Thanks for the code example. As for the rest of the code, it already exists, just need to get my datas in a better way to be crunched!
Re: Reading concurrently two files with different number of lines
by bioinformatics (Friar) on Apr 10, 2013 at 16:23 UTC
    If you don't want to read in the file, then you can use  or in your original code above. You'll need an if statement to catch when either of the two $lineX values is undefined, and make the comparison otherwise. The code can increase in complexity as you decide whether you want to see if things are offset in a file, rather than if the same lines have the same values.

    Bioinformatics
Re: Reading concurrently two files with different number of lines
by talexb (Canon) on Apr 10, 2013 at 17:46 UTC

    If a line starting with a '+' is a continuation, would it make sense to add that to the previous line, and then do the comparisons once all of the continuations have been dealt with?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Actually it is what I am doing. For each line in each file, I am comparing if they are equal or not. If yes they are stored in a temporary string, if no, they are pushed into two arrays.

      On the next line, if previous line was different and line is starting with a +, they are appended to the last elements in arrays. If the previous line wasn't different, if the line is still not different, it is joined to the previous temporary string, otherwise it is pushed into the arrays.

      In the end, if the line and all following lines are equal, the temporary string content is discarded.

      However, the more I think about it, the more it seems I could miss a line or mishandle it. Therefore I think I will move into dealing each file first then comparing them.

      Thanks for the head up on this solution. It seems it is what you all are leaning towards to: reading first, comparing later.

Re: Reading concurrently two files with different number of lines
by sundialsvc4 (Monsignor) on Apr 10, 2013 at 18:00 UTC

    It might be useful for you to look at the concept of Finite-State Machine (FSM) algorithms as a source of ideas for generalized solutions to these problems.   The essential notion is that, in the simple-minded case of two files, File-A and File-B, (and setting aside for the moment the “continuation-line problem” for just a moment), you can at any point in time break the problem down into “states”

    1. Initial state:   neither file has been read yet.
    2. You are now at the end-of-file of both files. (Final state.)
    3. You are at the end-of-file of File-A but not File-B.
    4. You are at the end-of-file of File-B but not File-A.
    5. You have current-records from both files at the same time and ... (additional states are defined based on some comparison between the two records)

    The FSM starts in its initial-state and runs until it reaches its final-state.   In each iteration, first it reacts to whatever state it is in, then it determines what new-state it is now in.

    This general line of reasoning can be applied to your problem, as soon as you deal with the slight “twist” ... that you must “peek ahead” one record to determine if you actually have read the complete intended string.   Fortunately, in the case of random-access disk files, that is trivially done:   simply note your current-position in the file, attempt to read the next line, then, if you succeed, see if it is a continuation.   If so, append it and repeat.   If not, reposition the file to the previously-saved position before returning what you have.   All of this logic, which actually is irrelevant to the FSM, can be packaged into a short subroutine.

    Always bear in mind that, whatever problem you are now facing, it almost certainly is not “new.”   FORTRAN had to deal with the strictures of the 80-column punched card in the 1950’s ... and it did so in the same way:   with the “continuation line.”   Although Perl did not appear until 1987, the tricks of this trade have been with us for a much longer time.   (Gaaack!   Suddenly I feel old!!)

      Thank you for the link to Finite-State-Machine. I have made myself a couple of scenarios I could use to resolve my case, each having of course pros and cons.

      In term of following code complexity, it would be much simple to read first and append then compute later.

      As for Fortan, the file format I am dealing with was output by some Fortran code when it was created. However, even though the file can now go further 80 characters per line, it seems the + was there with that limit in mind and has been kept for legacy purposes.

      I am now going to implement that, taking in account all your feedbacks and I will definitely come back to let you know about how it goes! Hopefully it will be for something positive :)

      P.S: you are not old, just some are younger than you, and most probably lots are older than you! However I am of the first kind, never read Fortran.

Re: Reading concurrently two files with different number of lines
by hdb (Parson) on Apr 11, 2013 at 09:17 UTC

    Depending on your files, would it help to first run them through diff, which is very powerful, and then look at the output of that rather than comparing the tow files themselves?

      I don't think this would work as diff compares line by line. Meaning that if I have:
      ABC a b + value=3.14 + value2="name"
      and
      ABC a b + value="other" + value2="name"

      Then I would have either 1 or 2 lines (depending if returning matched or unmatched lines), which is not what I want. I need to have the full 'line' (starting with a letter plus all the following starting by a +) if either of these are different.

      Unless diff understands and deals with continuation lines, which in case I don't know that.

        diff would not understand the continuation lines, but it doesn't necessarily have to. You can break your problem into two parts, and attack them individually:

        1. Normalize your inputs
        2. Compare your inputs

        The first you would do with Perl. Write a script that just handles the line continuations and normalizes your input into a "clean" format.

        You could then use diff to do the actual comparison between the two files. If diff doesn't meet your requirements for some reason, your comparison code would still be cleaner and simpler for not having to deal with the line continuations.

        Christopher Cashell
Re: Reading concurrently two files with different number of lines
by sundialsvc4 (Monsignor) on Apr 11, 2013 at 13:42 UTC

    Certainly one idea that pops into my head is to write a short filter-script which concatenates the continuation-lines into one contiguous string. It reads line-by-line, noting whether the current line is a continuation, if so appends it to the stashed line, otherwise outputs the stashed line (if any) and stashes the current record.

    If you apply this filter to both files in turn, you have now reduced the complexity of the problem considerably because continuation-lines are no longer a concern in the filter-output files.   Now, maybe you can apply tools such as diff to them, and so on.   I’m really warming-up to that idea, as I describe it.

      Actually, I constructed each full line for each file, storing them into an array. At the same time I am pushing them into an array, I am parsing them and storing them into a hash, adding a key containing my line number in the array I just pushed my line into. At the same time, discarded all unwanted lines that don't need to be matched later on.

      Then, it is really easy to compare. First comparing string to string (just like diff) using the line number stored in the hash to know where my full strings are, then I am either discarding the line or going through each key/value pair (created while parsing the file) and decide if they match or not (either discard - deleted from the hash on the fly - or keep). Actually, I could have compared everything using the key/values but, oh well, it is working great though.

      Everything is working really great and I hookup my line reader/concatenation from today with my parsing subroutines from before (with a little adaptation to my new fromat). In the end, my program is twice as short (structures are way less complex) and much more maintainable. And more importantly, running fine.

      Now it is time for QA on it, just to make sure it works in some weird cases I didn't had in my test cases, and it is good to go!

      Thank you all for your suggestions and examples, it really helped!

      -F

      P.S: is the hand-written format used to push users to create a Perl script to automatically add the right tags? :p

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1027997]
Approved by Corion
Front-paged by MidLifeXis
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (13)
As of 2014-04-16 10:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (423 votes), past polls