PerlMonks
Removing duplicate lines

by vihar (Acolyte)
on Sep 04, 2013 at 18:39 UTC ( #1052413=perlquestion )
vihar has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am trying to remove duplicate lines from a file when I find a specific string match at a specific location. I am opening a set of 8 files. These files are huge. I am reading them line by line and looking at a unique number in the 10th and 11th positions of each line. If this number is repeated in the next line, I am supposed to disregard that line. Later, I connect to a DB and perform some XML parsing on all feeds that come back from the DB for each line. For example, if one of the text files looks like this:
AB000000026JHAHKDFK
AB000000028JHKHKHKJ
AB00000003033AFSFAS
AB000000030HJHKH80J
AB000000030LOIKJUJ8
AB0000000324446KJHK
I am only supposed to keep the following, since 30 is repeated in two more lines:
AB000000026JHAHKDFK
AB000000028JHKHKHKJ
AB00000003033AFSFAS
AB0000000324446KJHK
Here is my code:
use DBI;
use Time::localtime;
use File::Compare;
use XML::Simple; # qw(:strict);
use Data::Dumper;

$user = "213256";
@fileRead = glob '/export/home/$user/Tests/Match/dummy2*';
my @array1;

foreach $file (@fileRead) {
    open(FILE, $file) or die "Can't open `$file': $!";
    @lines = <FILE>;
    close FILE;
    foreach $line ( @lines ) {
        $str = $line;
        $var = substr($str, 10, 2);
        push(@array1, "$var");
        my @unique = grep { ! $seen{$_}++ } @array1;
    }
    ......
}
I am stuck in that last foreach loop, as I am not sure what to do. Like I mentioned, I perform other operations on these lines, but I am just having trouble skipping the lines with repeated codes. I want the code to move on to the next line if the same unique code is found.

Thanks
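[Editorial note] The consecutive-duplicate skip described in the question can be sketched as below. This is a minimal, untested-against-real-feeds sketch: the sample data is the example from the question, held in an in-memory filehandle rather than one of the real files, and the zero-based offset 9 (for the 10th/11th characters) is an assumption based on that sample.

```perl
use strict;
use warnings;

# Sample data standing in for one of the real files (from the question).
my $data = <<'END';
AB000000026JHAHKDFK
AB000000028JHKHKHKJ
AB00000003033AFSFAS
AB000000030HJHKH80J
AB000000030LOIKJUJ8
AB0000000324446KJHK
END

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my @kept;
my $prev = '';
while ( my $line = <$fh> ) {
    chomp $line;
    my $code = substr( $line, 9, 2 );   # 10th/11th characters (substr is zero-based)
    next if $code eq $prev;             # skip a line that repeats the previous code
    $prev = $code;
    push @kept, $line;                  # ...or hand $line to the DB/XML stage here
}
close $fh;

print "$_\n" for @kept;   # prints 4 lines; the two extra "30" lines are dropped
```

Because the comparison is only against the previous line's code, this keeps the first "30" line and drops only the consecutive repeats, matching the expected output shown above.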

Re: Removing duplicate lines
by mtmcc (Hermit) on Sep 04, 2013 at 19:34 UTC
    Could you try explaining your question again? If you just want to skip consecutive lines with identical 10th/11th digits, it's quite straightforward.

    Must each code be unique across all 8 files?

    Thanks.
      No, the code won't be unique across all 8 files. If the code repeats, it will be in the next line, or the line after that, in the same file. I am just trying to do the easy part here (skipping to the next line if the code is repeated)... I just can't get around to it.
      Thanks
Re: Removing duplicate lines
by zork42 (Monk) on Sep 04, 2013 at 19:45 UTC
    Assuming you are only concerned with repeated numbers in consecutive lines, then something like this would do (untested):
use strict;      #### Always have these lines.
use warnings;    #### They will save you lots of time debugging
use DBI;
use Time::localtime;
use File::Compare;
use XML::Simple; # qw(:strict);
use Data::Dumper;

my $user = "213256";
my @filesToProcess = glob "/export/home/$user/Tests/Match/dummy2*";  #### Need ""s. ''s do not do variable interpolation

foreach my $file (@filesToProcess)
{
    open(FILE, $file) or die "Can't open `$file': $!";
    my $prev_var = -1_000_000;      #### set this to a value that will never appear as the numbers you need to check.
                                    #### (Obviously the 2-digit numbers must be in the range 0 to 99)
    while ( my $line = <FILE> )     #### do NOT read entire files into memory if they are big. Much better to process them a line at a time
    {
        chomp $line;
        my $var = substr($line, 10, 2);  #### Do you mean substr($line, 9, 2)? substr() considers the first character to be at offset zero
        if ($var != $prev_var)      #### if number is different to number in previous line, then process it
        {
            $prev_var = $var;
            ... process $line here ....
        }
    }
    close FILE;
}

    ======

    Thought #1:

    In your example the lines are sorted by the 2-digit number. If that is typical, then at most you're only going to end up keeping 100 lines from each huge file.

    Is that what you expect?

    ======

    Thought #2:

    Q: What are you supposed to do if you get repeated numbers, but in non-consecutive lines like below?
    AB000000026JHAHKDFK
    AB00000003033AFSFAS   = "30" line
    AB000000028JHKHKHKJ
    AB000000030HJHKH80J   = "30" line
    AB0000000324446KJHK
    AB000000030LOIKJUJ8   = "30" line


    UPDATE:

    Replaced:
    foreach my $line ( <FILE> )        #### do NOT read entire files into memory if they are big.  Much better to process them a line at a time
    with
    while ( my $line = <FILE> )        #### do NOT read entire files into memory if they are big.  Much better to process them a line at a time

    Embarrassing bug, that! In my defence, the original code had 'foreach' and I probably just missed it.
    Had I written the code from scratch I would (I hope!) have used 'while'. I'm still an idiot though! :)

    Thanks very much to Not_a_Number for pointing this out!

      Small remark.

      foreach my $line ( <FILE> )  #### do NOT read entire files into memory if they are big.

      With foreach, you ARE reading the whole file into memory!

      Use while:

      while ( my $line = <FILE> )

        Doh!

        Thanks for pointing that out!
        I've updated my post to remove my stupidity :)
        ++
      Actually, as of now, they are 2 digits, but in the future that is supposed to expand. I can change my code accordingly. And responding to your second concern: it would never repeat in non-consecutive lines. Thanks for your help!
Re: Removing duplicate lines
by Laurent_R (Vicar) on Sep 04, 2013 at 21:15 UTC

    I have written a module to do this type of thing (and many others, such as comparing files in a very detailed way) on arbitrarily large sorted files. It has detailed documentation and I have been using it happily at my job for the last few months, but I am still not sure how to provide an adequate test suite for different OSes before uploading it to CPAN (basically, I can't figure out how to provide test samples that handle the different end-of-line characters across various OSes). I can provide the module to you if you wish to try it.

    Anyway, the following is an extremely simplified command-line version of the algorithm used by this module for removing duplicates from a file (whatever the definition of duplicate is for your particular problem):

    $ cat test.txt
    AB000000026JHAHKDFK
    AB000000028JHKHKHKJ
    AB00000003033AFSFAS
    AB000000030HJHKH80J
    AB000000030LOIKJUJ8
    AB0000000324446KJHK
    $
    $ perl -ne '$c = substr ($_, 9, 2); print if $c ne $prev_c; $prev_c = $c;' test.txt
    AB000000026JHAHKDFK
    AB000000028JHKHKHKJ
    AB00000003033AFSFAS
    AB0000000324446KJHK
Re: Removing duplicate lines
by Eily (Hermit) on Sep 04, 2013 at 22:04 UTC

    Because this is Perl, there is more than one way to do it. So although you already have two working propositions, I'll add another version with the next keyword. Combined with loop labels and statement modifiers (postfix if, while, etc.), it helps make code that reads like plain old English. And it avoids adding another level to the block hierarchy with an if block.

use strict;
use warnings;

my $previous = '';
LINE: while (my $line = <DATA>) {
    # next LINE unless length $line >= 11;  ## useless if you're absolutely sure that you can't have a shorter line
    # next LINE unless $line =~ m<\S>;      ## Checking if the line isn't blank, same as above
    my $var = substr($line, 9, 2);
    next LINE if $var eq $previous;  # with eq or ne instead of == or != this works even for hexadecimal values, or any string of two characters
    # At this point, lines with the same number as the previous have been skipped
    $previous = $var;
    print $line;
}
__DATA__
AB000000026JHAHKDFK
AB000000028JHKHKHKJ
AB00000003033AFSFAS
AB000000030HJHKH80J
AB000000030LOIKJUJ8
AB0000000324446KJHK
    AB000000026JHAHKDFK
    AB000000028JHKHKHKJ
    AB00000003033AFSFAS
    AB0000000324446KJHK

Re: Removing duplicate lines
by sundialsvc4 (Monsignor) on Sep 04, 2013 at 22:34 UTC

    Your first-attempt code tries to do the impossible: to read the entire file into memory. Fortunately, you don't need to do that.

    The filter that you seek to build only ever needs to consider two lines: "this one" and "the previous one, if any." Thus, you can read and process the file (or files ...) line by line with an algorithm that looks something like this:

    my $previous_line = undef;   # INITIALLY, THERE IS NO 'PREVIOUS LINE'
    while ( my $current_line = <FILE> )
    {
        next if defined($previous_line)
             && ( substr($current_line, 10, 2) eq substr($previous_line, 10, 2) );   # OR WHATEVER-IT-IS ...
        # << "ooh, we survived!!" so, do something magical >>
        $previous_line = $current_line;
    }

    Perl's short-circuit boolean evaluation comes in handy here: the special case of "this is the first line in the file" is marked by $previous_line being undef, and the condition as written expressly omits that case from consideration, evaluating the substr()s only when both values are known to exist.
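    [Editorial note] The defined()/short-circuit idiom described above can be exercised with a small self-contained check. This sketch uses an in-memory filehandle and the sample data from the question (both assumptions, not part of the original post), and the zero-based offset 9 that matches that sample rather than the offset 10 written above:

    ```perl
    use strict;
    use warnings;

    # Sample data from the question, via an in-memory filehandle.
    my $data = <<'END';
    AB000000026JHAHKDFK
    AB000000028JHKHKHKJ
    AB00000003033AFSFAS
    AB000000030HJHKH80J
    AB000000030LOIKJUJ8
    AB0000000324446KJHK
    END
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";

    my @survivors;
    my $previous_line = undef;            # initially, there is no 'previous line'
    while ( my $current_line = <$fh> ) {
        # Short-circuit: the substr()s run only when a previous line exists.
        next if defined($previous_line)
             && substr($current_line, 9, 2) eq substr($previous_line, 9, 2);
        push @survivors, $current_line;   # "ooh, we survived!!"
        $previous_line = $current_line;
    }
    close $fh;

    print @survivors;   # the four lines whose codes differ from the line before
    ```

    Note that $previous_line is only updated for lines that survive, which still handles runs of three or more repeats: every line in the run compares equal to the first surviving "30" line.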

      Thanks everyone for your help!

Node Type: perlquestion [id://1052413]
Approved by Laurent_R