Remove Duplicate Lines

by dcb0127 (Novice)
on Sep 18, 2002 at 18:53 UTC

dcb0127 has asked for the wisdom of the Perl Monks concerning the following question:

I have this database (1MB) that I know has duplicate lines. The database layout is ID,NAME,ADDRESS,EMAIL. How do I sort it, remove the duplicates, and print the result to a new file?
#!/usr/bin/perl -w
open(INF, "db.txt") or die "Can't open db.txt: $!";
@data = <INF>;
close(INF);
@sd = sort(@data);
# ...what do I use to remove duplicates...
open(OUTF, ">outdb.txt") or die "Can't write outdb.txt: $!";
print OUTF @sd;
close(OUTF);

Replies are listed 'Best First'.
Re: Remove Duplicate Lines
by perrin (Chancellor) on Sep 18, 2002 at 18:57 UTC
    To remove dupes you put things into a hash: my %unique = map { $_ => 1 } @data; But honestly, there's no reason to use Perl for this. Just sort -u file > new_file
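
    A fuller sketch of that hash approach (using the db.txt/outdb.txt names from the question) might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp all lines; duplicate lines collapse into a single hash key.
    open my $in, '<', 'db.txt' or die "Can't open db.txt: $!";
    my %unique = map { $_ => 1 } <$in>;
    close $in;

    # Write the unique lines back out, sorted.
    open my $out, '>', 'outdb.txt' or die "Can't write outdb.txt: $!";
    print $out sort keys %unique;
    close $out;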
Re: Remove Duplicate Lines
by rbc (Curate) on Sep 18, 2002 at 19:01 UTC
    If you are on a Unix system whose sort command supports the -u
    switch (-u for unique), you could do this:
    $ sort -u db.txt > outdb.txt
    I know that Solaris's sort has -u; Red Hat's Cygwin does not.
    I hope that helps :)
Re: Remove Duplicate Lines
by BrowserUk (Patriarch) on Sep 18, 2002 at 21:45 UTC

    If perchance you need to keep the unique lines of your file in the same order, then this will remove all but the first occurrence of each line and leave the remaining ones in their original order.

    Just redirect the output to a new file on the command line (and uncomment the open line).

    #! perl -sw
    use strict;

    my %lines;
    #open DATA, $ARGV[0] or die "Couldn't open $ARGV[0]: $!\n";
    while (<DATA>) {
        print if not $lines{$_}++;
    }
    __DATA__
    this is a line
    this is another line
    yet another
    and yet another still
    this is a line
    more and more
    and even more
    this is a line
    and this and that
    but not the other cos its a family website:)

    Gives

    C:\test>uniq
    this is a line
    this is another line
    yet another
    and yet another still
    more and more
    and even more
    and this and that
    but not the other cos its a family website:)

    C:\test>

    The caveat, of course, is that with a large file that hash could get kind of big, but maybe that's OK if this is what you need to do.
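
    If the hash does get too big, one way to shrink it (an illustration, not part of the original reply) is to key the hash on a fixed-size digest of each line instead of the line itself:

    #! perl -w
    use strict;
    use Digest::MD5 qw(md5);   # core module; 16-byte binary digest per line

    my %seen;
    while (<>) {
        # Store only the digest as the key, not the whole line.
        print if not $seen{ md5($_) }++;
    }

    The trade-off is a vanishingly small chance of two different lines sharing a digest, in which case the later one would be dropped.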


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
Re: Remove Duplicate Lines
by tstock (Curate) on Sep 18, 2002 at 19:01 UTC
    If this is a one time fix, you can use the unix commands 'sort' and 'uniq', like so:
    sort db.txt | uniq > newfile.db
    'sort' sorts the lines of db.txt; 'uniq' then cuts out the (now adjacent) duplicates, and the '>' redirects the output to a new file instead of the screen (STDOUT).

    tstock
Re: Remove Duplicate Lines
by fglock (Vicar) on Sep 18, 2002 at 19:04 UTC
    for(@sd){ push @out, $_ if (not @out) or ($out[-1] ne $_); };

    Update: added (not @out) to handle the initially empty array.

      @out = @sd[ grep $sd[$_] ne $sd[1+$_], 0 .. $#sd ];
      With inline sort and a perverse twist:
      my $prev;
      @out = grep "$prev" ne ($prev = $_), sort @sd;   # not for production code

      Makeshifts last the longest.

Re: Remove Duplicate Lines
by Anonymous Monk on Aug 01, 2019 at 14:36 UTC
    If you want to remove duplicates from a data file and the file has a header:
    #!/usr/bin/perl
    $ifile=$ARGV[0];
    $ofile=$ARGV[1];
    $header=`sed -n '1p' $ifile`;
    $data=`sed '1d' $ifile | sort -u`;
    open(my $fh, '>', $ofile) or die "Could not open file '$ofile' $!";
    print $fh $header;
    print $fh $data;
    close $fh;
    exit 0

      Let's see:

      • use strict missing
      • use warnings missing
      • Missing my for $ifile, $ofile, $header, $data.
      • no check that the program is called with the correct number of arguments
      • Forking a shell (1) via qx (``) begs for trouble - see Improve pipe open?
      • ... to run sed, just to read the first line of a file
      • ... while making sed read the entire file
      • ... and ignoring all quoting issues by simply not quoting at all - see The problem of "the" default shell
      • ... and ignoring the fact that sed is not available by default on Windows and other operating systems
      • Forking another shell via qx to pipe sed output to sort -u input
      • ... again without any quoting
      • ... again assuming sed is available everywhere
      • ... assuming a POSIX sort is available everywhere. DOS/Windows sort does not understand -u and can't sort and filter out dupes
      • ... reading the entire output of sort -u into memory
      • ... just to write it out again three lines later
      • And finally, exit 0 is redundant

      This is highly inefficient and has several issues with "interesting" filenames.

      In Re: Remove Duplicate Lines, BrowserUk explains how to use perl properly.
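
      A pure-Perl take on the same task (a sketch in the spirit of that post, not code from it; the script name is made up) that keeps the header and needs no external tools:

      #!/usr/bin/perl
      # dedup_header.pl - copy infile to outfile, keeping the header line
      # and only the first occurrence of every other line.
      use strict;
      use warnings;

      die "Usage: $0 infile outfile\n" unless @ARGV == 2;
      my ($ifile, $ofile) = @ARGV;

      open my $in,  '<', $ifile or die "Could not open '$ifile': $!";
      open my $out, '>', $ofile or die "Could not open '$ofile': $!";

      print $out scalar <$in>;   # pass the header through untouched

      my %seen;
      while (<$in>) {
          print $out $_ unless $seen{$_}++;
      }

      close $in;
      close $out or die "Could not close '$ofile': $!";

      Note that unlike sort -u, this keeps the remaining lines in their original order rather than sorting them.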

      Another option - if running on a POSIX-compatible system - is to use sort properly. Without headers, it is trivial:

      sort -u < inputfile > outputfile

      With headers, this will do:

      head -n 1 inputfile > outputfile
      sed '1d' inputfile | sort -u >> outputfile

      This way, head can stop processing the input file after the first line, unlike sed -n '1p'. Directly writing to the outputfile avoids all further overhead of your script.

      Alexander


      (1) Yes, given a sane filename, perl may start the first sed without the help of the default shell. Change the filename to something interesting and perl will start sed via the default shell.
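
      For completeness, the shell can be bypassed deterministically with the list form of pipe open (a sketch; it still assumes a POSIX system where sed exists):

      # List-form pipe open: no shell is involved, whatever the filename.
      open my $sed, '-|', 'sed', '-n', '1p', $ifile
          or die "Could not run sed: $!";
      my $header = <$sed>;
      close $sed;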

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
