Delete duplicate data in file

by darrengan (Sexton)
on Nov 21, 2005 at 04:25 UTC
darrengan has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

How do I open a file, scan the data row by row, and delete duplicate rows in the file?

I have multiple files in a folder /temp/, named 051119.temp, 051120.temp, and 051121.temp.

Each file has rows in the following format:
data_row_051126120432.data
data_row_051126120630.data
data_row_051126120630.data
data_row_051126122305.data
data_row_051126122305.data

How do I delete the duplicate occurrences of data_row_051126120630.data and data_row_051126122305.data?

Cheers and Thanks

Re: Delete duplicate data in file
by pg (Canon) on Nov 21, 2005 at 04:37 UTC

    It looks like your rows are sorted by timestamp (I will assume this for the rest of this post), so if there are duplicates, they will be next to each other (and a run could be more than 2 rows). All you need to do is (the algorithm):

    open the file;
    open a temp file for output;
    set $lastrow to '';
    while (file not empty) {
        read one row;
        if (this row equals $lastrow) {
            do nothing;
        } else {
            write this row to the output file;
            set $lastrow to this row;
        }
    }
    close both files;
    copy the temp file over the original file;

    The Perl code would be close to this:

    use strict;
    use warnings;

    my $lastrow = "";
    while (my $line = <DATA>) {
        chomp $line;                # strip the trailing newline
        if ($line ne $lastrow) {    # print only rows that differ from the previous row
            print $line, "\n";
            $lastrow = $line;
        }
    }

    __DATA__
    data_row_051126120432.data
    data_row_051126120630.data
    data_row_051126120630.data
    data_row_051126122305.data
    data_row_051126122305.data

    This prints:

    data_row_051126120432.data
    data_row_051126120630.data
    data_row_051126122305.data
Re: Delete duplicate data in file
by ptum (Priest) on Nov 21, 2005 at 04:51 UTC
    I don't really like having to depend on the files being sorted. One alternative way to remove duplicate data is to use a hash to temporarily hold your data. You can read in the data from the files, place it in a hash, and then (eventually) write it back out again. Since hash keys are unique, you'll overwrite any prior duplicate data rows in the hash and end up with only one copy of each unique element, as in the sketch below.
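    A minimal sketch of that idea (untested; the file name is a hypothetical example, and writing the unique rows back over the same file is an assumption):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = '/temp/051119.temp';    # hypothetical example file

    # Read every row, remembering each unique row in order of first appearance
    # (a plain hash alone would lose the original row order).
    open my $in, '<', $file or die "Can't read $file: $!";
    my %seen;
    my @unique;
    while (my $row = <$in>) {
        push @unique, $row unless $seen{$row}++;
    }
    close $in;

    # Write the unique rows back over the original file.
    open my $out, '>', $file or die "Can't write $file: $!";
    print {$out} @unique;
    close $out;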

      It is not a question of whether you "depend on the files being sorted", but of whether it is a fact that the files are sorted.

      If it is (for example, the file could be some sort of log), then holding the entire file in memory is a waste.

Re: Delete duplicate data in file
by steveAZ98 (Monk) on Nov 21, 2005 at 05:47 UTC
    If you can depend on the datestamps, then pg's solution is fine. I like the hash solution better myself for small to medium data files.

    Example:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen = ();
    while (<>) {
        print if !$seen{$_};    # print the line only if we haven't seen it before
        $seen{$_} = 1;          # remember the line
    }

    Then process your file like this:
    ./above_code.pl < data_file_in > data_file_out

    You might want to chomp the data lines so that a final line without a trailing newline still matches its duplicates; see the sketch below. Also, if the data files are huge, then go with pg's solution: this one creates a hash key for every unique line in the file.
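    A chomp-normalising variant might look like this (a sketch, untested):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    while (my $line = <>) {
        chomp $line;                              # normalise away any trailing newline
        print "$line\n" unless $seen{$line}++;    # print only first occurrences
    }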

    Steve
Re: Delete duplicate data in file
by mulander (Monk) on Nov 21, 2005 at 07:07 UTC
    I think that you can take advantage of Tie::File and tie your file to an array, then use the standard method from perldoc to remove duplicate elements from an array:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Tie::File;

    # Tie the file to an array; changes to the array are written back to the file.
    tie my @file, 'Tie::File', 'myfile' or die "Can't tie file: $!";

    my %saw;
    @file = grep(!$saw{$_}++, @file);    # keep only the first occurrence of each row
    untie @file;
    This is example b) from perldoc -q 'How can I remove duplicate elements from a list or array?'. Of course, it could be less efficient than using a single hash and reading line by line, but I think Tie::File might give you other ideas and throw some new light on your problem.

    Note: I did not have time to test the code that I posted above (I'm late for work already :P), so please back up your file before trying this script on it.
Re: Delete duplicate data in file
by Aristotle (Chancellor) on Nov 21, 2005 at 09:21 UTC
    perl -pi -e'$_ = "" if $seen{ $_ }++' temp/*
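    For the curious: -p wraps the code in a read-print loop over the named files, -i edits each file in place, and assigning the empty string to $_ erases every line that %seen has recorded before, so only first occurrences survive. Note that %seen persists across the whole run, so a row seen in one file is also removed from later files.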

    Makeshifts last the longest.

Re: Delete duplicate data in file
by Roy Johnson (Monsignor) on Nov 21, 2005 at 15:32 UTC
    Running it through the Unix uniq command should do what you want.
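    Note that uniq only collapses adjacent duplicate lines, which is fine here since the rows appear to be sorted. A hypothetical invocation (file names assumed):

    uniq 051119.temp > 051119.deduped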

    Caution: Contents may have been coded under pressure.
