PerlMonks  

Re: Removing partially duplicated lines from a file

by perldigious (Priest)
on Jul 26, 2016 at 16:23 UTC ( [id://1168575]=note )


in reply to Removing partially duplicated lines from a file

Hi Sandy_Bio_Perl,

Try something like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    open(my $out_fh, '>', 'output.txt') or die $!;

    my %seen_lines;
    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            print $out_fh "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            print $out_fh "$_\n";
        }
    }

    close $out_fh;
    close $in_fh;

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

Replies are listed 'Best First'.
Re^2: Removing partially duplicated lines from a file
by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:08 UTC

    Thank you. This works well, but I don't understand all your code. For example, why do we need to say

    if ($columns[1] and $columns[1] =~ /^HLA-A/)

    i.e. with the same reference used twice? Also, I would like to send the output to a variable rather than print it to a file. I know this should be a minor change to your great code, but I can't seem to make it work. Could you help please? (My novice-level skills are showing.)

      The line of code you asked about basically says: if $columns[1] is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed use warnings; would end up complaining for any line that didn't have an element at index 1 in @columns. I didn't actually try it without the check, but I assumed that would happen for at least the lines that are all "---".

      As for the code changes you requested:

          #!/usr/bin/perl
          use warnings;
          use strict;

          open(my $in_fh, '<', 'input.txt') or die $!;

          my $output;
          my %seen_lines;
          while (<$in_fh>) {
              chomp;
              my @columns = split;
              if ($columns[1] and $columns[1] =~ /^HLA-A/) {
                  my $HLA_Peptide = $columns[1] . $columns[2];
                  $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
                  $seen_lines{$HLA_Peptide} = 1;
              }
              else {
                  $output .= "$_\n";
              }
          }

          close $in_fh;

          print $output;

      EDIT: I did just try it without that first check and I was correct: it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also be evaluated as false and skipped), but I use this trick a lot in an attempt to appease use warnings; or "-w". I wonder if there is something like exists, which I use a lot for hashes, that is meant for checking whether an array element exists?
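To see the warning described above in isolation, here is a minimal sketch. The helper name warnings_for_line and its $guarded flag are made up for illustration and are not part of the posted code; the sketch only counts the warnings Perl emits when a line with a single field (such as "---") is tested with and without the first check.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): splits a line on whitespace,
# runs the HLA-A test with or without the truthiness guard, and returns
# how many warnings Perl emitted while doing so.
sub warnings_for_line {
    my ($line, $guarded) = @_;
    my @warnings;
    # Collect warnings instead of printing them to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my @columns = split ' ', $line;
    if ($guarded) {
        # The guard short-circuits, so the regex never sees undef.
        my $match = ($columns[1] and $columns[1] =~ /^HLA-A/);
    }
    else {
        # Matching an undefined element triggers an "uninitialized" warning.
        my $match = ($columns[1] =~ /^HLA-A/);
    }
    return scalar @warnings;
}

printf "unguarded: %d warning(s)\n", warnings_for_line('---', 0);
printf "guarded:   %d warning(s)\n", warnings_for_line('---', 1);
```

The guarded form stays silent because "and" stops evaluating as soon as its left side is false, which is exactly why the same element appears twice in the condition.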

        ... that first check ... a better way to avoid that warning ... something like exists ...

        defined is the way I would typically finesse this problem:
            if (defined($columns[1]) && $columns[1] =~ /^HLA-A/) {
                ...
                }
        In the case of your posted code, the empty string and '0' will not, as you say, be tested against the regex, and in this particular case it does not matter because they cannot match anyway. In the general case, I think it's better to use defined because you can better avoid the "It'll never happen... Oh, it does happen..." situations that lead to those wonderful 3 AM debug sessions.
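The difference between the two guards can be laid out side by side. This is only a sketch: check_second_column and its hash keys are hypothetical names, and the third test (comparing the element count) is an extra option not mentioned in the thread, shown as one way to ask "is there an element at index 1 at all?".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reports how each style of
# guard treats the element at index 1 of a whitespace-split line.
sub check_second_column {
    my ($line) = @_;
    my @columns = split ' ', $line;
    return +{
        truthy   => ($columns[1]         ? 1 : 0),  # the original guard
        defined  => (defined $columns[1] ? 1 : 0),  # the defined() guard
        in_range => (@columns > 1        ? 1 : 0),  # element count check
    };
}

for my $line ('1 HLA-A*02:01 PEPTIDE', 'x 0 y', '---') {
    my $r = check_second_column($line);
    printf "%-24s truthy=%d defined=%d in_range=%d\n",
        "'$line'", @{$r}{qw(truthy defined in_range)};
}
```

Only the middle line separates the guards: its second field is the string "0", which is defined and present but false, so the truthiness guard skips it while defined lets it through to the regex.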


        Give a man a fish:  <%-{-{-{-<

        Thank you Perldigious, I am very very grateful

      Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)
          use strict;
          use warnings;

          my $ff = 'robots.txt';
          my $fh;
          my @lines;

          # Read the entire file and
          # store lines in an array
          open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
          {
              local $/;
              @lines = split("\n", <$fh>);
          }
          close $fh;

          # Get rid of duplicate lines
          @lines = sort(@lines);
          my $L;
          my $prev = '';
          foreach $L (@lines) {
              print($L . "\n") if ($prev ne $L);
              $prev = $L;
          }
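Note that the sort-based version changes the order of the lines and removes only lines that are identical in full. If the original order has to be kept, the %seen hash idea from earlier in the thread can drop exact duplicates in a single pass. A minimal sketch, with unique_in_order as a made-up helper name and a small in-memory list standing in for the file contents:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the given lines with
# exact duplicates removed, keeping the first occurrence of each line and
# preserving the input order.
sub unique_in_order {
    my @lines = @_;
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a line is met, so the
    # negation keeps first occurrences and drops later repeats.
    return grep { !$seen{$_}++ } @lines;
}

my @lines = ('b', 'a', 'b', 'c', 'a');
print "$_\n" for unique_in_order(@lines);
```

The trade-off is memory: the hash holds every distinct line, whereas the sorted approach only has to remember the previous line once the sort is done.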
