remove entries with duplicate characters

davi54 has asked for the wisdom of the Perl Monks concerning the following question:

Re: remove entries with duplicate characters by choroba (Cardinal) on Jul 31, 2019 at 20:40 UTC
If the same GN is always consecutive, you can just remember the previous GN while processing the file line by line. If the GN is different, remember the header and print it before printing the sequence; otherwise skip printing both.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $previous_gn = "";
my $header;
while (<>) {
    if (my ($gn) = /^>.* GN=([^ ]+)/) {
        # Header line: keep it only if its GN differs from the previous one.
        if ($gn ne $previous_gn) {
            $previous_gn = $gn;
            $header = $_;
        }
    } else {
        if ($header) {
            print $header, $_;
            undef $header;    # only the first line after the header is printed
        }
    }
}
```

If the headers with the same GN don't have to be consecutive, you need to remember all the GNs seen so far. A hash is the best structure to remember them:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my %seen;
my $header;
my $gn;
while (<>) {
    if (/^>.* GN=([^ ]+)/) {
        $gn = $1;
        # Keep the header only if this GN has not been seen yet.
        $header = exists $seen{$gn} ? undef : $_;
    } elsif ($header) {
        undef $seen{$gn};    # creates the key; only its existence matters
        print $header, $_;
    }
}
```
Re: remove entries with duplicate characters by Laurent_R (Canon) on Jul 31, 2019 at 20:51 UTC
In general terms, the usual way to remove duplicates (which is what you're trying to do, for a certain definition of duplicates) is to read your input records sequentially and to store the values of interest (in your example, the GN "TUFA" value) that you've already seen as keys in a hash. Then, if the GN value has already been seen, just skip the entry; otherwise write it to the output. For example, something along these lines (untested):

```perl
my %seen;
while (my $line = <$IN>) {
    # Capture into a lexical rather than relying on $1, which keeps its
    # old value when the match fails.
    my ($gn) = $line =~ / GN=(\w+) /;
    # Entries without a GN field are passed through unchanged.
    next if defined $gn && $seen{$gn}++;
    print $line;
}
```

That's just the general idea. Details may vary according to the circumstances. If that doesn't help, I agree with 1nickt: please show the code that you have; it will be much easier to help you.
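The hash idea above can be extended to entries that span several lines. The following is only a minimal sketch, not tested against the poster's actual file: it assumes FASTA-style input in which every header line starts with `>` and carries a ` GN=NAME` field, and the sub name `dedupe_by_gn` and the sample data are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns the lines of every
# record whose GN value has not been seen before.
sub dedupe_by_gn {
    my @lines = @_;
    my %seen;
    my $keep = 1;    # whether the current record is being kept
    my @out;
    for my $line (@lines) {
        if ($line =~ /^>/) {
            # Header line: decide once for the whole record.
            my ($gn) = $line =~ / GN=(\S+)/;
            # Records without a GN field are kept (an assumption).
            $keep = defined $gn ? !$seen{$gn}++ : 1;
        }
        push @out, $line if $keep;
    }
    return @out;
}

# Invented sample data: the first two records share GN=tufA.
my @sample = (
    ">sp|A1 Elongation factor Tu OS=X GN=tufA PE=1\n",
    "MSKEKF\n",
    "ERTKPH\n",
    ">sp|A2 Elongation factor Tu OS=Y GN=tufA PE=1\n",
    "MAKEKF\n",
    ">sp|A3 Some protein OS=X GN=rpoB PE=1\n",
    "MVYSYT\n",
);
print dedupe_by_gn(@sample);
```

Because the keep/skip decision is made on the header and then applied to every following line, all sequence lines of a kept record are printed and none of a skipped one.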
Re: remove entries with duplicate characters by 1nickt (Canon) on Jul 31, 2019 at 20:28 UTC
Hi, welcome! There's lots of help here. The thing to do is show what you have, what works, what doesn't, and what error messages you are getting. See SSCCE. Like, can you open the file yet? Can you read a line from it yet? And so on.

The way forward always starts with a minimal test.
Re^2: remove entries with duplicate characters by davi54 (Sexton) on Jul 31, 2019 at 21:07 UTC
Sorry, I should have mentioned this: I'm actually new to Perl. I thought I would ask you guys to get some basic knowledge and then start learning from there.
Re^3: remove entries with duplicate characters by 1nickt (Canon) on Aug 01, 2019 at 04:48 UTC
How new? :-) If you have not yet done so, set aside an afternoon and go through perlintro. There's a section there for virtually each piece of what you need to do in your program. Start with some simple exercises, loops, etc. Don't try to pull off your goal in the first step. See my follow-up to the code you showed here.
Re: remove entries with duplicate characters by davi54 (Sexton) on Jul 31, 2019 at 21:28 UTC
Hey everyone, the following is what I have. I ran the command and it is just processing... It hasn't given me any errors or results yet. Can you please let me know what the issue is?

```perl
#!/usr/bin/perl
use warnings;
use strict;

print 'Enter protein sequence filename: ';
chomp( my $prot_filename = <STDIN> );
open my $PROTFILE, '<', $prot_filename
    or die "Cannot open '$prot_filename' because: $!";

my $out_filename = 'duplicate_gene_entries_in_'.$prot_filename;
open my $OUTFILE, '>', $out_filename
    or die "Cannot open '$out_filename' because: $!";

$/ = ''; # Set paragraph mode

my %seen;
my $header;
my $count_in;
my $count_out;
my $gn;
while (<>) {
    if (/^>.* GN=([^ ]+)/) {
        $gn = $1;
        $header = exists $seen{$gn} ? undef : $_;
    } elsif ($header) {
        undef $seen{$gn};
        print $header, $_;
    }
}
close $OUTFILE;
close $PROTFILE;
printf "%d total records read from '%s'\n",$count_in,$prot_filename;
printf "%d records written to '%s' after removing duplicate entries\n",$count_out,$out_filename;
```
Re^2: remove entries with duplicate characters by Laurent_R (Canon) on Jul 31, 2019 at 21:42 UTC
This: `while (<>) {` is wrong, because you opened your input file on a filehandle, whereas `<>` reads from the files named on the command line or from standard input. So this should be: `while (<$PROTFILE>) {`

Otherwise, I am not sure why you want to set paragraph mode. There may be some other problems, but that should help you get going.
Re^3: remove entries with duplicate characters by davi54 (Sexton) on Jul 31, 2019 at 21:54 UTC
I want to set paragraph mode because each of my entries is multiple lines long and is separated by a new line. When I was reading about Perl, it said I can use paragraph mode to specify that. As I said, I'm new and still learning, so let me know if it's wrong.

Also, when I ran the code after correcting what you suggested, it gave me the following output:

    Use of uninitialized value $count_in in printf at ../remove_duplicate_genes.pl line 34.
    0 total records read from 'enzymes.fasta'
    Use of uninitialized value $count_out in printf at ../remove_duplicate_genes.pl line 35.
    0 records written to 'duplicate_gene_entries_in_enzymes.fasta' after removing duplicate entries
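Reading the script as posted, the two warnings appear because `$count_in` and `$count_out` are declared but never assigned or incremented, and nothing reaches the output file because `print` is called without `$OUTFILE`. In paragraph mode each read also returns a whole entry (header plus sequence lines), so the header regex matches every record and the `elsif` branch is never reached. The sketch below shows one way these pieces could fit together; it assumes entries are separated by blank lines, and the sub name `filter_records` and the sample data are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads blank-line-separated records from $in, writes
# the first record seen for each GN to $out, and returns both counts.
sub filter_records {
    my ($in, $out) = @_;
    local $/ = '';    # paragraph mode, restored when the sub returns
    my %seen;
    my ($count_in, $count_out) = (0, 0);    # start at 0, not undef
    while (my $record = <$in>) {
        $count_in++;
        # In paragraph mode $record holds the header and its sequence lines.
        my ($gn) = $record =~ /^>.* GN=(\S+)/;
        next if defined $gn && $seen{$gn}++;
        print {$out} $record;    # write to the output handle, not STDOUT
        $count_out++;
    }
    return ($count_in, $count_out);
}

# Invented sample input, read through an in-memory filehandle.
my $data = <<'END';
>sp|A1 Elongation factor Tu OS=X GN=tufA PE=1
MSKEKF
ERTKPH

>sp|A2 Elongation factor Tu OS=Y GN=tufA PE=1
MAKEKF

>sp|A3 Some protein OS=X GN=rpoB PE=1
MVYSYT
END

open my $in, '<', \$data or die "Cannot open in-memory input: $!";
my $result = '';
open my $out, '>', \$result or die "Cannot open in-memory output: $!";
my ($read, $written) = filter_records($in, $out);
close $out;
printf "%d total records read\n", $read;
printf "%d records written after removing duplicate entries\n", $written;
```

With the sample data this reports 3 records read and 2 written. For a real file, the two in-memory handles would be replaced by handles opened on the input and output filenames, as in the posted script.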
Re^4: remove entries with duplicate characters by 1nickt (Canon) on Aug 01, 2019 at 04:36 UTC
Re^4: remove entries with duplicate characters by Laurent_R (Canon) on Aug 01, 2019 at 20:46 UTC
Re^4: remove entries with duplicate characters by Anonymous Monk on Aug 02, 2019 at 07:37 UTC

