PerlMonks  

Re^2: Removing partially duplicated lines from a file

by Sandy_Bio_Perl (Beadle)
on Jul 26, 2016 at 21:08 UTC ( [id://1168589]=note )


in reply to Re: Removing partially duplicated lines from a file
in thread Removing partially duplicated lines from a file

Thank you. This works well, but I don't understand all your code. For example, why do we need to say

if ($columns[1] and $columns[1] =~ /^HLA-A/)

i.e. with the same variable referenced twice? Also, I would like to send the output to a variable rather than print it to a file. I know this should be a minor change to your great code, but I can't seem to make it work. Could you help, please? (My novice-level skills are showing.)

Replies are listed 'Best First'.
Re^3: Removing partially duplicated lines from a file
by perldigious (Priest) on Jul 26, 2016 at 21:28 UTC

    The line of code you asked about basically says: if $columns[1] is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed use warnings; would complain for any line that had no element at index 1 of @columns. I didn't actually try it without the check; I just assumed that would happen for at least the lines made up entirely of "---".
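A minimal sketch of what that check guards against, assuming made-up input lines (the real input file is not shown in this thread). It captures warnings with a $SIG{__WARN__} handler so they can be counted; the sample data and variable names are illustrative only.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Made-up sample lines; the first has no element at index 1 after split.
my @sample = ('---', '1 HLA-A*02:01 SLYNTVATL');

my ($unguarded_hits, $guarded_hits) = (0, 0);
for my $line (@sample) {
    my @columns = split ' ', $line;

    # Unguarded match: warns when $columns[1] is undef.
    $unguarded_hits++ if $columns[1] =~ /^HLA-A/;

    # Guarded match: the truth test short-circuits before the regex runs.
    $guarded_hits++ if $columns[1] and $columns[1] =~ /^HLA-A/;
}

print "unguarded hits: $unguarded_hits, guarded hits: $guarded_hits\n";
print "warnings captured: ", scalar(@warnings), "\n";
print @warnings;
```

Both forms find the same line; only the unguarded one should trigger an "uninitialized value" warning on the "---" line.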

    As for the code changes you requested:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    my $output;
    my %seen_lines;

    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            $output .= "$_\n";
        }
    }
    close $in_fh;

    print $output;

    EDIT: I did just try it without that first check and I was correct: it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also fail the truth test), but I use this trick a lot in an attempt to appease use warnings; or "-w". I wonder if there is something like exists, which I use a lot for hashes, that is meant for checking whether an array element exists?

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
      ... that first check ... a better way to avoid that warning ... something like exists ...

      defined is the way I would typically finesse this problem:
          if (defined($columns[1]) && $columns[1] =~ /^HLA-A/) {
              ...
              }
      In the case of your posted code, the empty string and '0' will not, as you say, be tested against the regex, and in this particular case it does not matter because they cannot match anyway. In the general case, I think it's better to use defined because you can better avoid the "It'll never happen... Oh, it does happen..." situations that lead to those wonderful 3 AM debug sessions.
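A small sketch of the general case where the two checks diverge. The single-digit pattern is a hypothetical stand-in chosen because '0' is a legitimate match for it; it is not part of the code posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical pattern for which '0' is a valid match: exactly one digit.
my $single_digit = qr/^\d$/;

my %result;
for my $value (undef, '', '0', '7') {
    my $label = defined $value ? "'$value'" : 'undef';

    # Truth test: '0' and '' are false, so the regex is never tried.
    my $by_truth = ($value and $value =~ $single_digit) ? 1 : 0;

    # defined test: only undef is filtered out before the regex.
    my $by_defined = (defined($value) && $value =~ $single_digit) ? 1 : 0;

    $result{$label} = [$by_truth, $by_defined];
    print "$label: truth test=$by_truth, defined test=$by_defined\n";
}
```

With the truth test, '0' is silently skipped even though it matches the pattern; with defined, it is tested and matches. Neither form warns on undef.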


      Give a man a fish:  <%-{-{-{-<

        Ah, thank you very much, defined sounds like exactly the type of thing I was looking for. I think I even skimmed over the perldoc for it before (I did learn about the existence of defined from a quick mention of it in "Learning Perl") but erroneously disregarded it in this case due to the perldoc bit that says "Use of defined on aggregates (hashes and arrays) is deprecated." Which is my fault for skimming rather than actually RTFM'ing, because the perldoc is pretty clear about what it actually meant by "use... on aggregates" via the examples it gives, and it doesn't mention anything being frowned upon for using defined in a scalar context on a single array element like the defined($columns[1]) you show.
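To illustrate the distinction between testing a whole array and testing one element, here is a hedged sketch using a made-up one-field line. It deliberately avoids calling defined on the array itself, which is the form the perldoc warns about.

```perl
use strict;
use warnings;

# Made-up line with a single field, like the "---" lines mentioned above.
my @columns = split ' ', '---';

# Whole array: evaluate it in boolean context; no defined() needed.
my $has_fields = @columns ? 1 : 0;

# Element count: true only if index 1 lies inside the array.
my $has_index_1 = @columns > 1 ? 1 : 0;

# Single element: defined() on one scalar element is fine.
my $defined_1 = defined($columns[1]) ? 1 : 0;

print "has fields: $has_fields, has index 1: $has_index_1, defined: $defined_1\n";
```

The element-count comparison is one way to ask "does this index exist?" without relying on the value stored there.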

        Small follow-up question. You chose to add parentheses and switch the and to an && instead. I understand the different order of precedence between and vs. &&, but is there a reason you elected to rewrite it that way? Or is it just a case of, "that's just the way I decided to write it"?

        EDIT: Sorry, my brain saw added parentheses where there were none.
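For readers unsure about the precedence difference mentioned here, a minimal sketch with illustrative variable names. Inside an if condition the two operators behave the same; the difference shows up next to an assignment, because and binds more loosely than = while && binds more tightly.

```perl
use strict;
use warnings;

my ($t, $f) = (1, 0);

# '&&' binds tighter than '=': parsed as $with_symbol = ($t && $f)
my $with_symbol = $t && $f;

my $with_word;
{
    # The right operand of 'and' ends up in void context here,
    # so silence that warning category for the demonstration.
    no warnings 'void';

    # 'and' binds looser than '=': parsed as ($with_word = $t) and $f
    $with_word = $t and $f;
}

print "&&: $with_symbol, and: $with_word\n";
```

Here $with_symbol ends up 0 and $with_word ends up 1.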


      Thank you Perldigious, I am very very grateful

Re^3: Removing partially duplicated lines from a file
by harangzsolt33 (Chaplain) on Jul 26, 2016 at 21:44 UTC
    Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)
    use strict;
    use warnings;

    my $ff = 'robots.txt';
    my $fh;
    my @lines;

    # Read the entire file and
    # store lines in an array
    open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
    { local $/; @lines = split("\n", <$fh>); }
    close $fh;

    # Get rid of duplicate lines
    @lines = sort(@lines);
    my $L;
    my $prev = '';
    foreach $L (@lines) {
        print($L . "\n") if ($prev ne $L);
        $prev = $L;
    }
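The code above compares whole lines, so it removes only exact duplicates. A hedged variation on the same sort-then-compare idea, for lines that are duplicated only in part: sort and compare on a key built from columns 1 and 2 (the same columns the earlier reply combines into $HLA_Peptide). The sample lines and the key_of helper are hypothetical, and the result is collected in a variable rather than printed line by line.

```perl
use strict;
use warnings;

# Made-up sample lines: the first two share columns 1 and 2.
my @lines = (
    '1 HLA-A*02:01 AAAAAAAAA 0.5',
    '2 HLA-A*02:01 AAAAAAAAA 0.9',
    '3 HLA-A*01:01 CCCCCCCCC 0.1',
);

# Hypothetical helper: build the comparison key from columns 1 and 2.
sub key_of {
    my @c = split ' ', $_[0];
    return join "\t", map { defined $_ ? $_ : '' } @c[1, 2];
}

# Sort by key so lines with equal keys become neighbours.
my @sorted = sort { key_of($a) cmp key_of($b) } @lines;

my $output = '';
my $prev;
for my $line (@sorted) {
    my $key = key_of($line);
    next if defined $prev && $key eq $prev;   # same key as previous line: skip
    $output .= "$line\n";
    $prev = $key;
}

print $output;
```

Which of several lines sharing a key survives depends on the sort order, and the original line order is lost, so this fits only when sorted output is acceptable.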
