Comparing strings from different files

by Jalcock501 (Sexton)
on Oct 08, 2013 at 09:41 UTC ( #1057395=perlquestion )
Jalcock501 has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks.

I am trying to write a script that compares strings from different files.

File1:
99HEADER|001|001|
99INSSCH|AVP0| 99POLCOM|||||PIP735628020||||| 99INSFAC|F1_0|| 99INSFAC|F2_N|| 99INSFAC|F3_N|| 99INSFAC|F4_0|| 99INSFAC|F5_0|| 99INSFAC|F6_IM|| 99INSFAC|F8_0|| 99INSFAC|F9_B|| 99INSFAC|F10_0|| 99INSFAC|F11_3|| 99INSFAC|F12_0|| 99INSFAC|F13_Y|| 99INSFAC|F14_N|| 99INSFAC|F15_5|| 99INSFAC|F16_30/11/2011|20111130| 99INSFAC|F17_31|| 99INSFAC|F18_NO|| 99INSFAC|F19_246|| 99INSFAC|F20_B|| 99INSFAC|F21_0|| 99INSFAC|F22_5H|| 99INSFAC|F23_P|| 99INSFAC|F24_M|| 99INSFAC|F25_1598|| 99INSFAC|F26_13|| 99INSFAC|F27_9|| 99INSFAC|F28_17|| 99INSFAC|F29_15|| 99INSFAC|F30_13|| 99INSFAC|F31_12|| 99INSFAC|F32_34|| 99INSFAC|F33_17|| 99INSFAC|F34_5|| 99INSFAC|F35_3|| 99INSFAC|F36_12|| 99INSFAC|F37_12|| 99INSFAC|F38_19/09/2013|20130919| 99INSFAC|F39_3|| 99INSFAC|F43_B901 400|| 99INSFAC|F44_10.00|| 99INSFAC|F47_1.000094||
99HEADER|004|001|
99INSSCH|248| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRA01_1||||||"LNI10708"|
99INSSCH|391| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSGAL1||||||"W=P1|X=335|AB=0|BB=0|JF=569|HB=0|IB=0|AD=1|GC=I1|KD=0|YE=335|BF=B|GF=0|KF=401|0|0|0|GSB=15|HSB=99.2468394|KDB=2377.37934|LDB=A|UDB=0|ETB=155|HTB=51|URB=7"|
99INSSCH|116| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRZ01a||||||"00I10587"| 99INSFAC2|MSRZ01b||||||"335000 B"|
99INSSCH|216| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRZ01a||||||"00I10587"| 99INSFAC2|MSRZ01b||||||"335000 B"|
99HEADER|006|001
99INSSCH|091| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00156||||||I1P82240|CCCN0000|INNA0000|FAAA0570|YANZ1000|
99INSSCH|084| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00050||||||I1001569|
99INSSCH|052| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00124||||||XAAX0800|YPAX8400|ZAAZ0401|VAAA0000|WZZA0000|
99INSSCH|222| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00243||||||XAAX0800|YPAX8400|ZAAZ0401|VAAA0000|WZZA0000|
99TERMIN|
And File2:
E99INSSCH|248| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSRA01_1||||||"LNI10708"|
E99HEADER|004|001|
E99INSSCH|248| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSRA01_1||||||"LNI10708"|
E99HEADER|001|001|
E99INSSCH|AVP0| E99POLCOM|||||PIP735628020||||| E99INSFAC|F1_0|| E99INSFAC|F2_N|| E99INSFAC|F3_N|| E99INSFAC|F4_0|| E99INSFAC|F5_0|| E99INSFAC|F6_IM|| E99INSFAC|F8_0|| E99INSFAC|F9_B|| E99INSFAC|F10_0|| E99INSFAC|F11_3|| E99INSFAC|F12_0|| E99INSFAC|F13_Y|| E99INSFAC|F14_N|| E99INSFAC|F15_5|| E99INSFAC|F16_30/11/2011|20111130| E99INSFAC|F17_31|| E99INSFAC|F18_NO|| E99INSFAC|F19_246|| E99INSFAC|F20_B|| E99INSFAC|F21_0|| E99INSFAC|F22_5H|| E99INSFAC|F23_P|| E99INSFAC|F24_M|| E99INSFAC|F25_1598|| E99INSFAC|F26_13|| E99INSFAC|F27_9|| E99INSFAC|F28_17|| E99INSFAC|F29_15|| E99INSFAC|F30_13|| E99INSFAC|F31_12|| E99INSFAC|F32_34|| E99INSFAC|F33_17|| E99INSFAC|F34_5|| E99INSFAC|F35_3|| E99INSFAC|F36_12|| E99INSFAC|F37_12|| E99INSFAC|F38_19/09/2013|20130919| E99INSFAC|F39_3|| E99INSFAC|F43_B901 400|| E99INSFAC|F44_10.00|| E99INSFAC|F47_1.000094||
E99HEADER|006|001
E99INSSCH|091| E99POLCOM|1||IIL|62|22593465033322||||| E99INSFAC2|C00156||||||I1P82240,CCCN0000,INNA0000,FAAA0570,YANZ1000|
E99HEADER|006|001
E99INSSCH|091| E99POLCOM|1||IIL|62|22593465033322||||| E99INSFAC2|C00156||||||I1P82240,CCCN0000,INNA0000,FAAA0570,YANZ1000|
E99HEADER|004|001|
E99INSSCH|391| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSGAL1||||||"W=P1|X=335|AB=0|BB=0|JF=569|HB=0|IB=0|AD=1|GC=I1|KD=0|YE=335|BF=B|GF=0|KF=401,0,0,0|GSB=15|HSB=99.2468394|KDB=2377.37934|LDB=A|UDB=0|ETB=155|HTB=51|URB=7"|
E99HEADER|004|001|
E99INSSCH|116| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSRZ01a||||||"00I10587"| E99INSFAC2|MSRZ01b||||||"335000 B"|
E99HEADER|004|001|
E99INSSCH|391| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSGAL1||||||"W=P1|X=335|AB=0|BB=0|JF=569|HB=0|IB=0|AD=1|GC=I1|KD=0|YE=335|BF=B|GF=0|KF=401,0,0,0|GSB=15|HSB=99.2468394|KDB=2377.37934|LDB=A|UDB=0|ETB=155|HTB=51|URB=7"|
However, file1 contains records beginning with 99, and file2 contains records that begin with E99. I thought the easiest way would be to put all records that belong to a certain HEADER on one line and compare that line from both files.

But I have run into one tiny problem. I have no idea how to compare strings across files. Here is the code that I have so far:

#!/usr/bin/perl -w
use strict;

my @files = <*.in.sep>;
for(@files) { s/[.]in[.]sep//g }

for my $file (@files) {
    open (IN, "<", "$") || die ("cannot open $file");
    open (OUT,"<", "$file.out.sep") || die ("cannot open search.txt");
    undef $/;
    my $in = <IN>;
    my $out = <OUT>;
    my @in = split /\n/, $in;
    my @out = split /\n/, $out;
    my @final;
    for $a (@in) {
        my @result = grep/^\Q$a\E$/, @out;
        push (@final , @result);
    }
    print "Strings that don't match: \t@final";
}
The last for loop is a bit of a bodge job, as I haven't done this before. Could one of you lovely people please help?



Replies are listed 'Best First'.
Re: Comparing strings from different files
by marinersk (Curate) on Oct 08, 2013 at 15:38 UTC
    If the files are not huge (i.e., one of them will fit in memory), I would go with the approach specified by hippo: read file #1 into a hash, then compare each line in file #2 to the hash.

    If the files are too big for this, a slight modification: Read file #1 and store the seek (or tell) locations in the hash. Then compare each line in file #2 to the corresponding line in file #1, using your hash as a shortcut way to go straight to that line.
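A minimal sketch of the seek/tell variant, for illustration only. It assumes the code after `99INSSCH|` identifies a line uniquely in file #1, which the thread does not confirm. The helper names (`normalize`, `index_offsets`, `compare_by_offset`) and the shortened sample data are made up for the example, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Drop the leading "E" from E99 labels so both files look alike.
# Only an "E" at the start of the line or after a space is touched.
sub normalize {
    my ($line) = @_;
    $line =~ s/(^| )E(?=99[A-Z])/$1/g;
    return $line;
}

# Pass 1: remember where each 99INSSCH line of file 1 starts.
# Keying on the code after "99INSSCH|" is an assumption.
sub index_offsets {
    my ($fh) = @_;
    my %offset;
    while (1) {
        my $pos  = tell $fh;          # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /^99INSSCH\|([^|]*)\|/;
    }
    return \%offset;
}

# Pass 2: for each E99INSSCH line of file 2, seek to its counterpart
# in file 1 and compare. Returns the codes that differ or are missing.
sub compare_by_offset {
    my ($fh1, $fh2) = @_;
    my $offset = index_offsets($fh1);
    my @mismatch;
    while (my $line = <$fh2>) {
        chomp $line;
        next unless $line =~ /^E99INSSCH\|([^|]*)\|/;
        my $code = $1;
        if (!exists $offset->{$code}) {
            push @mismatch, "$code (not in file 1)";
            next;
        }
        seek $fh1, $offset->{$code}, 0 or die "seek failed: $!";
        my $expected = <$fh1>;
        chomp $expected;
        push @mismatch, $code if normalize($line) ne $expected;
    }
    return @mismatch;
}

# Demo on shortened, in-memory data; real code would open the two files.
my $file1 = <<'END';
99HEADER|004|001|
99INSSCH|248| 99POLCOM|3||CAP01| 99INSFAC2|MSRA01_1|"LNI10708"|
99INSSCH|391| 99POLCOM|3||CAP01| 99INSFAC2|MSGAL1|"KF=401|0|0|0"|
END
my $file2 = <<'END';
E99HEADER|004|001|
E99INSSCH|391| E99POLCOM|3||CAP01| E99INSFAC2|MSGAL1|"KF=401,0,0,0"|
E99INSSCH|248| E99POLCOM|3||CAP01| E99INSFAC2|MSRA01_1|"LNI10708"|
END
open my $fh1, '<', \$file1 or die "cannot open in-memory file 1: $!";
open my $fh2, '<', \$file2 or die "cannot open in-memory file 2: $!";
my @mismatch = compare_by_offset($fh1, $fh2);
print "Differs: $_\n" for @mismatch;
```

Only the offsets are held in memory, so the cost is one hash entry per line of file #1 rather than the whole file.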

    Update: Sample of first option:

Re: Comparing strings from different files
by McA (Priest) on Oct 08, 2013 at 09:50 UTC

    You don't need a script. One look showed me: They are different...

    Of course just kidding: You have to explain when two files are equal and when they are different. Where are the records, where are the fields? Even the field names seem to be different.


      Hi McA. The records are the same; it's just the record labels that are slightly different:
      FILE1:
      99INSSCH|248| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRA01_1||||||"LNI10708"|
      FILE2:
      E99INSSCH|248| E99POLCOM|3||CAP01|66|3301R7435459||||| E99INSFAC2|MSRA01_1||||||"LNI10708"|
      Where it begins 99(label) or E99(label) that is a new record.

      I need to compare the records (not the labels) but I'm not sure how.

      Hope this clears things up a wee bit
        1. Open file1 and read it in one record at a time
        2. Each record should be split into the label and the rest
        3. Make the label the hash key and the rest the hash value
        4. Close file1 and open file2 and read it in one record at a time
        5. Each record should be split into the label and the rest
        6. You now have two rows to compare (the current record from file2 and the one stored in the hash keyed on the label from file1), so do that for whatever "compare" means in your task
        7. Close file2
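The steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation. It treats each line as a record and assumes the "label" is the record type plus its first field (e.g. `INSSCH|248`). The sub names, the shortened sample data and the `999` record are all hypothetical. A label that occurs more than once in file1 simply overwrites the earlier entry.

```perl
use strict;
use warnings;

# Steps 2 and 5: split a line into (label, rest). The E99/99 prefix is
# dropped from the label and E99 is rewritten to 99 inside the rest.
sub split_record {
    my ($line) = @_;
    chomp $line;
    my ($label, $rest) = $line =~ /^E?99([A-Z0-9]+\|[^|]*)\|?(.*)$/
        or return;
    $rest =~ s/(^| )E(?=99[A-Z])/$1/g;
    return ($label, $rest);
}

# Steps 1 to 3: read file1 and store rest-of-record keyed on label.
sub load_records {
    my ($fh) = @_;
    my %rest_for;
    while (my $line = <$fh>) {
        my ($label, $rest) = split_record($line) or next;
        $rest_for{$label} = $rest;    # a repeated label overwrites
    }
    return \%rest_for;
}

# Steps 4 to 6: read file2 and compare each record with the stored one.
# "Compare" here means plain string equality.
sub compare_records {
    my ($rest_for, $fh) = @_;
    my @report;
    while (my $line = <$fh>) {
        my ($label, $rest) = split_record($line) or next;
        if (!exists $rest_for->{$label}) {
            push @report, "$label: only in file 2";
        }
        elsif ($rest_for->{$label} ne $rest) {
            push @report, "$label: differs";
        }
    }
    return @report;
}

# Demo on shortened, in-memory data.
my $file1 = <<'END';
99HEADER|006|001
99INSSCH|091| 99POLCOM|1||IIL| 99INSFAC2|C00156|I1P82240|CCCN0000|
99INSSCH|084| 99POLCOM|1||IIL| 99INSFAC2|C00050|I1001569|
END
my $file2 = <<'END';
E99HEADER|006|001
E99INSSCH|091| E99POLCOM|1||IIL| E99INSFAC2|C00156|I1P82240,CCCN0000|
E99INSSCH|084| E99POLCOM|1||IIL| E99INSFAC2|C00050|I1001569|
E99INSSCH|999| E99POLCOM|1||IIL|
END
open my $fh1, '<', \$file1 or die "cannot open in-memory file 1: $!";
my $rest_for = load_records($fh1);
close $fh1;                           # step 4: done with file1
open my $fh2, '<', \$file2 or die "cannot open in-memory file 2: $!";
my @report = compare_records($rest_for, $fh2);
close $fh2;                           # step 7
print "$_\n" for @report;
```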
Re: Comparing strings from different files (merge)
by tye (Sage) on Oct 08, 2013 at 19:54 UTC

    The suggestions to use a hash don't seem sound to me as it looks like you have plenty of records with duplicate "labels". But perhaps there is a unique identifier in there that you are aware of but haven't clearly told us about and a hash would work (if the files easily fit in RAM).

    I would instead sort each file and then do a classic "merge" algorithm between the two sorted files. How to sort the files will require more knowledge about the structure and content than I can deduce from just the example data you have posted.

    - tye        
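A rough sketch of the sort-then-merge idea, under some loud assumptions: whole lines are the sort key, E99 is rewritten to 99 before sorting, and an in-memory `sort` stands in for whatever external sort large files would need. The sub names and shortened sample data are invented for the example.

```perl
use strict;
use warnings;

# Read all lines, rewrite E99 labels as 99, and return them sorted.
sub sorted_lines {
    my ($fh) = @_;
    my @lines = <$fh>;
    chomp @lines;
    s/(^| )E(?=99[A-Z])/$1/g for @lines;
    return [ sort @lines ];
}

# Classic merge walk over two sorted lists: advance whichever side is
# smaller, and advance both when the current lines are equal.
sub merge_diff {
    my ($left, $right) = @_;
    my (@only_left, @only_right);
    my ($i, $j) = (0, 0);
    while ($i < @$left && $j < @$right) {
        my $cmp = $left->[$i] cmp $right->[$j];
        if    ($cmp < 0) { push @only_left,  $left->[$i++]  }
        elsif ($cmp > 0) { push @only_right, $right->[$j++] }
        else             { $i++; $j++ }
    }
    # Whatever remains on either side has no partner.
    push @only_left,  @{$left}[$i .. $#$left];
    push @only_right, @{$right}[$j .. $#$right];
    return (\@only_left, \@only_right);
}

# Demo on shortened, in-memory data.
my $file1 = <<'END';
99INSSCH|248| 99INSFAC2|MSRA01_1|"LNI10708"|
99INSSCH|116| 99INSFAC2|MSRZ01a|"00I10587"|
99INSSCH|216| 99INSFAC2|MSRZ01a|"00I10587"|
END
my $file2 = <<'END';
E99INSSCH|248| E99INSFAC2|MSRA01_1|"LNI10708"|
E99INSSCH|116| E99INSFAC2|MSRZ01a|"00I10587"|
E99INSSCH|248| E99INSFAC2|MSRA01_1|"LNI10708"|
END
open my $fh1, '<', \$file1 or die "cannot open in-memory file 1: $!";
open my $fh2, '<', \$file2 or die "cannot open in-memory file 2: $!";
my ($only_left, $only_right) =
    merge_diff(sorted_lines($fh1), sorted_lines($fh2));
print "only in file 1: $_\n" for @$only_left;
print "only in file 2: $_\n" for @$only_right;
```

Because equal lines are consumed in pairs, a line that appears twice in one file and once in the other leaves one copy unmatched, so duplicates are reported rather than silently merged.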

Re: Comparing strings from different files
by Lennotoecom (Pilgrim) on Oct 08, 2013 at 20:24 UTC
    that is ugly but working
    while( <> ){
        while(/(E99|99)(\w+)(?=\|)|\n/){
            $key .= $`;
            $hash{$key}++;
            $key = '->'.$2;
            $_ = $';
        }
    }
    foreach (sort keys %hash){
        print "$_ $hash{$_}\n" if $hash{$_} >= 2;
    }
    run it like ./ FILE1 FILE2
    The output will be your repeated lines;
    accordingly, testing for == 1 instead will show you the unique lines.
    P.S. And turn off warnings :)
    P.P.S. sorry for ugliness
      under "line" I meant any symbols between E99/99 markers
Re: Comparing strings from different files
by Lennotoecom (Pilgrim) on Oct 08, 2013 at 20:48 UTC
    If you are comparing actual lines from the files
    and not the different "nodes"
    then your task is even easier
    something like that:
    ./ FILE1 FILE2
    %hash = map{ s/E99/99/g; $_ => $hash{$_}++;} <>;
    foreach (sort keys %hash){
        print "$_ $hash{$_}\n" if !$hash{$_};
    }
    This will print your unique lines.
      Thanks for this, it works great. However, I need to run this on loads of files. I've had a play and can't seem to automate the command line arguments. Here's what I have, but it doesn't work (probably because I'm not doing it right):
      #!/usr/bin/perl -w
      use strict;

      my @files = <*.in.sep>;
      my %hash;
      for(@files) { s/[.]in[.]sep//g }

      for my $file (@files) {
          open (my $in, "<", "$") || die ("cannot open $file");
          open (my $out,"<", "$file.out.sep") || die ("cannot open search.txt");
          %hash = map{ s/E99/99/g; $_ => $hash{$_}++;} <$in, $out>;
          foreach (sort keys %hash){
              print "$_ $hash{$_}\n" if !$hash{$_};
          }
      }
      Thanks Jim
        well I tested it on different files,
        but the main idea is:
        foreach (<*.in.sep>){
            $name = $` if /\.in\.sep$/;      # base name: everything before ".in.sep"
            open IN, $_ or die $!;
            open OUT, $name.'.out.sep' or die $!;
            %hash = ();                      # start each pair of files with empty counts
            %hash = map{ s/E99/99/g; $_ => $hash{$_}++;} <IN>, <OUT>;
            close IN;                        # close takes one handle at a time
            close OUT;
            foreach (sort keys %hash){
                print "$_ $hash{$_}\n" if !$hash{$_};
            }
        }
        Correct me if there are mistakes,
        or if it could be optimized.
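For what it's worth, `<$in, $out>` in the earlier attempt is parsed as a file glob rather than as a read from the two handles, which is why nothing useful comes back. Below is a hedged sketch of a `use strict`-friendly loop over the file pairs. The sub names `unique_lines` and `compare_all` are made up, the `.in.sep`/`.out.sep` naming is taken from the posts above, and "unique" is assumed to mean "present in only one file of the pair".

```perl
use strict;
use warnings;

# Return the lines that occur in only one of the two files,
# after rewriting E99 labels as 99.
sub unique_lines {
    my ($in_path, $out_path) = @_;
    my %seen_in;                       # line => bitmask: 1 = .in, 2 = .out
    my $bit = 1;
    for my $path ($in_path, $out_path) {
        open my $fh, '<', $path or die "cannot open $path: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $line =~ s/(^| )E(?=99[A-Z])/$1/g;
            $seen_in{$line} = ($seen_in{$line} // 0) | $bit;
        }
        close $fh;
        $bit <<= 1;
    }
    my @unique = grep { $seen_in{$_} != 3 } keys %seen_in;
    return sort @unique;
}

# Pair every NAME.in.sep in $dir with NAME.out.sep and collect the
# unmatched lines per pair. Pairs with no .out.sep file are skipped.
sub compare_all {
    my ($dir) = @_;
    my %report;
    for my $in_path (glob "$dir/*.in.sep") {
        (my $out_path = $in_path) =~ s/\.in\.sep\z/.out.sep/;
        next unless -e $out_path;
        $report{$in_path} = [ unique_lines($in_path, $out_path) ];
    }
    return \%report;
}

# Run against the current directory.
my $report = compare_all('.');
for my $in_path (sort keys %$report) {
    print "$in_path: $_\n" for @{ $report->{$in_path} };
}
```

Lexical filehandles and a per-pair hash avoid counts leaking from one pair of files into the next.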
Re: Comparing strings from different files
by Laurent_R (Abbot) on Oct 08, 2013 at 18:57 UTC

    If the files are not too big, then use a hash, as per hippo's solution.

    If they are too big, then sort them according to the key that matters to you, and read both files in parallel (but watch out, there are a number of edge cases, the algorithm can be a bit tricky).