Re^2: Removing partially duplicated lines from a file

Re^3: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 26, 2016 at 21:28 UTC
The line of code you asked about basically says: if `$columns[1]` is true (has any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed `use warnings;` would complain for any line that didn't have an element at index 1 in `@columns`. I didn't actually try it without the check, but I assumed that would happen for at least the lines that are all "---".

As for the code changes you requested:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;
    my $output = '';
    my %seen_lines;

    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            # Key on the HLA allele plus the peptide column
            my $HLA_Peptide = $columns[1] . $columns[2];
            $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            # Pass every other line through unchanged
            $output .= "$_\n";
        }
    }

    close $in_fh;
    print $output;

EDIT: I did just try it without that first check and I was correct: it does throw warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also fail the check), but I use this trick a lot in an attempt to appease `use warnings;` or "-w". I wonder if there is something like `exists`, which I use a lot for hashes, that is meant for checking whether an array element exists?

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
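To make the warning described above concrete, here is a minimal sketch. It assumes a separator line made up of a single run of dashes (the exact input format is not shown in the thread, so the sample string is hypothetical) and collects warnings through `$SIG{__WARN__}` so they can be counted rather than just printed.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go straight to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Hypothetical separator line: one whitespace-delimited field only.
my @columns = split ' ', '----------';

# No guard here: $columns[1] is undef, so the pattern match warns.
my $matched = $columns[1] =~ /^HLA-A/ ? 1 : 0;

print "fields:   ", scalar(@columns), "\n";
print "matched:  $matched\n";
print "warnings: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```

Running this should report one field, no match, and a single "uninitialized value" warning, which is the complaint the guard in the posted code is there to avoid.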
Re^4: Removing partially duplicated lines from a file by AnomalousMonk (Archbishop) on Jul 27, 2016 at 00:55 UTC
> ... that first check ... a better way to avoid that warning ... something like exists ...

`defined` is the way I would typically finesse this problem:

    if (defined($columns[1]) && $columns[1] =~ /^HLA-A/) {
        ...
    }

In the case of your posted code, the empty string and `'0'` will not, as you say, be tested against the regex, and in this particular case it will not matter because they cannot match anyway. In the general case, I think it's better to use `defined` because you can better avoid the "It'll never happen... Oh, it does happen..." situations that lead to those wonderful 3 AM debug sessions.

Give a man a fish: `<%-{-{-{-<`
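As an illustration of the general case mentioned above, the following sketch compares the two guards against a pattern that a false-but-defined value can actually match. The helper names `truth_guard` and `defined_guard` and the `/^\d+$/` pattern are made up for this example; they are not from the posted code.

```perl
use strict;
use warnings;

# Truthiness guard: skips undef, but also skips '0' and ''.
sub truth_guard {
    my ($value) = @_;
    return ($value and $value =~ /^\d+$/) ? 1 : 0;
}

# defined guard: skips only undef, so '0' still reaches the regex.
sub defined_guard {
    my ($value) = @_;
    return (defined($value) && $value =~ /^\d+$/) ? 1 : 0;
}

# Hypothetical field values, including the false-but-defined edge cases.
for my $value ('42', '0', '', undef) {
    my $label = defined($value) ? "'$value'" : 'undef';
    printf "%-6s truth_guard: %d  defined_guard: %d\n",
        $label, truth_guard($value), defined_guard($value);
}
```

The two guards agree on `'42'`, the empty string, and `undef`, and neither warns on `undef`. They disagree only on `'0'`, which the truthiness guard silently drops even though it matches the pattern.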
Re^5: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 27, 2016 at 13:03 UTC
Ah, thank you very much, `defined` sounds like exactly the type of thing I was looking for. I think I even skimmed over the perldoc for it before (I did learn about the existence of `defined` from a quick mention of it in "Learning Perl") but erroneously disregarded it in this case due to the perldoc bit that says "Use of defined on aggregates (hashes and arrays) is deprecated." That is my fault for skimming rather than actually RTFM'ing, because the perldoc is pretty clear about what it means by "use... on aggregates" via the examples it gives, and it doesn't mention anything being frowned upon about using `defined` in scalar context on a single array element like the `defined($columns[1])` you show.

Small follow-up question. You ~~chose to add parentheses and~~ switched the `and` to an `&&` instead. I understand the different order of precedence between `and` and `&&`, but is there a reason you elected to rewrite it that way? Or is it just a case of, "that's just the way I decided to write it"?

EDIT: Sorry, my brain saw added parentheses where there were none.
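For readers following the precedence point, here is a small sketch of where `and` and `&&` behave the same and where they differ. It does not claim to be the reason behind the rewrite asked about above. The sample field value and the `truthy`/`falsy` helpers are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical split result with an HLA-A entry at index 1.
my @columns = ('1', 'HLA-A*02:01');

# Inside a parenthesized condition, both spellings test the same thing.
my $via_word   = ($columns[1] and $columns[1] =~ /^HLA-A/) ? 1 : 0;
my $via_symbol = ($columns[1] && $columns[1] =~ /^HLA-A/) ? 1 : 0;

sub truthy { return 1 }
sub falsy  { return 0 }

# In an unparenthesized assignment they differ, because "=" binds
# tighter than "and" but looser than "&&".
my ($low, $high);
$low  = truthy() and falsy();    # parsed as ($low = truthy()) and falsy();
$high = truthy() && falsy();     # parsed as $high = (truthy() && falsy());

print "condition:  and=$via_word &&=$via_symbol\n";
print "assignment: and=$low &&=$high\n";
```

This prints `and=1 &&=1` for the condition and `and=1 &&=0` for the assignment, so in an `if (...)` test like the one in this thread the choice is a matter of style, while in assignments it changes the result.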
Re^6: Removing partially duplicated lines from a file by AnomalousMonk (Archbishop) on Jul 27, 2016 at 15:36 UTC
Re^7: Removing partially duplicated lines from a file by perldigious (Priest) on Jul 27, 2016 at 15:50 UTC
Re^4: Removing partially duplicated lines from a file by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:35 UTC
Thank you Perldigious, I am very very grateful
Re^3: Removing partially duplicated lines from a file by harangzsolt33 (Chaplain) on Jul 26, 2016 at 21:44 UTC
Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way identical lines fall right next to each other, and you can skip them by comparing each line to the previous one. If the two are the same, the current line is a duplicate and can be skipped. Note that this compares whole lines and changes the order of the output, so it is a good idea mainly if you plan to sort the output later on anyway. Might as well sort it now and eliminate the duplicates in one step. ;-)

    use strict;
    use warnings;

    my $ff = 'robots.txt';

    # Read the entire file and store its lines in an array
    open my $fh, '<', $ff or die "Sorry, can't open file - $ff: $!\n";
    my @lines;
    {
        local $/;    # slurp mode: read the whole file in one go
        @lines = split("\n", <$fh>);
    }
    close $fh;

    # Sort so duplicates become adjacent, then print each line once
    @lines = sort(@lines);
    my $prev;
    foreach my $L (@lines) {
        print($L . "\n") if (!defined($prev) || $prev ne $L);
        $prev = $L;
    }