PerlMonks  

to avoid redundacy in a file

by Anonymous Monk
on Jul 15, 2002 at 09:17 UTC (#181718=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in which each line has 4 fields. In some rows, 4 fields that already exist appear again in a different order, and I need to eliminate those lines. Example: (1) CB N CA HN and (2) N CB CA HN. To me these 2 lines are the same, and I need only one of them to be present in the final database. My database has no line that is repeated with all fields identical. Example lines in the database are: CB N CA HB1 and CB N CA HB2

Replies are listed 'Best First'.
Re: to avoid redundacy in a file
by amphiplex (Monk) on Jul 15, 2002 at 09:31 UTC
    You could remove all duplicate lines with something like this:
    my %seen; while (<>) { next if $seen{$_}; print; $seen{$_}++; }

    ---- amphiplex
      To avoid redundancy in your code, you could do this:
      my %seen; while (<>) { next if ($seen{$_}++); print; }
      You can do it in one shot, so you might as well. Note that this code eliminates all duplicate lines, not just repeated ones. If you want to just ditch repeats, use this:
      my $last; while (<>) { next if ($_ eq $last); $last = $_; print; }
      Thus lines "A A A B B B A A C C" will be "A B A C" not "A B C" as in the previous bit.
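To make the difference concrete, here is a minimal sketch that runs both approaches over the example sequence. The array and variable names are made up for illustration; the snippets above read from `<>` instead.

```perl
use strict;
use warnings;

# The example sequence from the post, one "line" per element.
my @lines = qw(A A A B B B A A C C);

# Drop every line that has been seen anywhere earlier.
my %seen;
my @unique = grep { !$seen{$_}++ } @lines;

# Drop only a line that repeats the one immediately before it.
my $last;
my @collapsed = grep {
    my $keep = !defined $last || $_ ne $last;
    $last = $_;
    $keep;
} @lines;

print "@unique\n";     # A B C
print "@collapsed\n";  # A B A C
```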
        You can do it in one shot, so you might as well.
        That makes your second snippet
        my $prev; while (<>) { next if ($_ eq $prev); print $prev = $_; }
        :^)

        Wait, we can shorten that..
        my $prev; while (<>) { print $prev = $_ unless $_ eq $prev; }
        Hmm..
        my $prev; $_ ne $prev and print $prev = $_ while <>;
        Err.. sorry, got carried away for a second.. Perl is just too seductive. Sigh. :-)

        Makeshifts last the longest.

      This won't do exactly as the AM wants - some lines will be duplicate to the user, but not to Perl:
      $ more file.txt
      N AB TX NC
      AB N TX NC
      FOO BAR
      N AB TX NC
      $ perl test.pl file.txt
      N AB TX NC
      AB N TX NC
      FOO BAR

      The first two lines of file.txt are "the same" to the user, but not to your program. zejames' solution meets the AM's needs, as it creates a unique key for the hash, based on the AM's definition of a duplicate.
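As a rough illustration of that idea, the sketch below applies an order-insensitive key to the same four lines as file.txt. It is only a sketch under the assumption that fields are whitespace-separated; the variable names are invented here.

```perl
use strict;
use warnings;

# Same contents as the file.txt shown above.
my @lines = ("N AB TX NC", "AB N TX NC", "FOO BAR", "N AB TX NC");

my %seen;
my @kept;
for my $line (@lines) {
    # Sorting the fields makes their order irrelevant to the key.
    my @fields = split ' ', $line;
    my $key    = join ':', sort @fields;
    # Keep only the first line that produces a given key.
    push @kept, $line unless $seen{$key}++;
}

print "$_\n" for @kept;
```

With this key, only "N AB TX NC" and "FOO BAR" survive, since the second and fourth lines hold the same fields as the first.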

      Jason

Re: to avoid redundacy in a file
by zejames (Hermit) on Jul 15, 2002 at 09:34 UTC
    One way to do it
    # We are modifying $^I and @ARGV, so we limit their scope
    # by adding some {} around the code
    {
        local $^I   = '.bak';   # See man perlrun and the -i switch for that trick
        local @ARGV = ('data.txt');
        my %seen;
        while (<>) {
            # The order is not important, so we sort the fields to
            # obtain a unique id
            my $sorted = join ':', sort split ' ';
            print if (! $seen{$sorted}++ );
        }
    }

    HTH
    Update : add comments to the code
    --
    zejames
Re: to avoid redundacy in a file
by thor (Priest) on Jul 15, 2002 at 11:59 UTC
    Depending on your database setup and how you insert rows, you could also impose a unique key constraint. Failing this, you will want to sort the fields of each record and then test for equality, e.g. (warning: untested)
    my %hash;
    while (<>) {
        my $key = join " ", (sort (split " "));
        $hash{$key} = 1;
    }
    # now iterate over the keys of the hash, and either print them out, or
    # do your insert into the database
    Mind you that this is feasible for small files, for certain values of small. If your file is large, you may want to just do the join line, write it to another file, and then let a sort -u do your bidding. That assumes that you are on Unix or one of its derivatives (unless there is sort for Windoze... :)
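For the large-file case, one possible sketch of the "just do the join line" step looks like this. The `normalize` helper and the file names in the usage line are hypothetical, and the output keeps the sorted field order rather than the original one.

```perl
use strict;
use warnings;

# Rewrite a line with its whitespace-separated fields in sorted order,
# so lines holding the same fields become textually identical.
sub normalize {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return join ' ', sort @fields;
}

# Stream line by line; nothing is held in memory, and the external
# sort -u does the actual de-duplication afterwards.
while (my $line = <STDIN>) {
    print normalize($line), "\n";
}
```

It could then be run as something like `perl normalize.pl < data.txt | sort -u > unique.txt`.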

    thor

Node Type: perlquestion [id://181718]
Approved by tadman