Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

remove entries with duplicate characters

by davi54 (Sexton)
on Jul 31, 2019 at 20:20 UTC ( [id://11103665]=perlquestion: print w/replies, xml ) Need Help??

davi54 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a very basic question. So, I have a list of protein sequences, where each entry has a header followed by the actual sequence. Say for example, two of the entries have the same word in the header:

>sp|O24310|EFTU_PEA Elongation factor Tu, chloroplastic OS=Pisum sativum OX=3888 GN=TUFA PE=2 SV=1

MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR

>sp|Q43467|EFTU1_SOYBN Elongation factor Tu, chloroplastic OS=Glycine max OX=3847 GN=TUFA PE=3 SV=1

MAVSSATASSKLILLPHASSSSSLNSTPFRSSTTNTHKLTP

So, as highlighted, both these entries have GN=TUFA in their header. So, I want to write a script which can read the value of GN for all entries and removes all the following entries that have the same value for GN, either it be TUFA or any other value. And write the output to a file. Can anyone please help me?

  • Comment on remove entries with duplicate characters

Replies are listed 'Best First'.
Re: remove entries with duplicate characters
by choroba (Cardinal) on Jul 31, 2019 at 20:40 UTC
    If the same GN is always consecutive, you can just remember the previous GN while processing the file line by line. If the the GN is different, remember the header and print it before printing the sequence, otherwise skip printing them.

    #!/usr/bin/perl use warnings; use strict; my $previous_gn = ""; my $header; while (<>) { if (my ($gn) = /^>.* GN=([^ ]+)/) { if ($gn ne $previous_gn) { $previous_gn = $gn; $header = $_; } } else { if ($header) { print $header, $_; undef $header; } } }

    If the header with the same GN don't have to be consecutive, you need to remember all the GN's seen so far. A hash is the best structure to remember them:

    #!/usr/bin/perl use warnings; use strict; my %seen; my $header; my $gn; while (<>) { if (/^>.* GN=([^ ]+)/) { $gn = $1; $header = exists $seen{$gn} ? undef : $_; } elsif ($header) { undef $seen{$gn}; print $header, $_; } }

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: remove entries with duplicate characters
by Laurent_R (Canon) on Jul 31, 2019 at 20:51 UTC
    In general terms, the usual way to remove duplicates (which is what you're trying to do for a certain definition of duplicates) is to read sequentially your input data records and to store the values of interest (in your example the GN "TUFA" value) that you've already seen in a hash (as a hash key). Then, if the GN value has already been seen, just don't consider the entry. Else write it to output.

    For example, something along these lines (untested):

    my %seen; while (my $line = <$IN>) { my $gn = $1 if $line =~ / GN=(\w+) /; next if exists $seen{$1}; $seen{$1} = 1; print $line; }
    That's just the general idea. Details may vary according to the circumstances.

    Otherwise, if that doesn't help, I agree with 1nickt: please show the code that you have, it will be much easier to help you.

Re: remove entries with duplicate characters
by 1nickt (Canon) on Jul 31, 2019 at 20:28 UTC

    Hi, welcome! There's lots of help here. The thing to do is show what you have, what works, what doesn't, what error messages you are getting. See SSCCE. Like, can you open the file yet? Can you read a line from it yet? And so on.


    The way forward always starts with a minimal test.
      Sorry, I should have mentioned this. But, actually I'm new to perl. I thought of asking you guys to have some kind of knowledge and then kind of start learning from there.

        How new? :-) If you have not yet, set aside an afternoon and go through perlintro. There's a section there for virtually each piece of what you need to do in your program. Start with some simple exercises, loops, etc. Don't try to pull off your goal in the first step. See my follow up to the code you showed here.


        The way forward always starts with a minimal test.
Re: remove entries with duplicate characters
by davi54 (Sexton) on Jul 31, 2019 at 21:28 UTC
    Hey everyone, Following is what I have. I ran the command and it is just processing... It hasn't given me any errors or results yet. Can you please let me know what's the issue.

    #!/usr/bin/perl use warnings; use strict; print 'Enter protein sequence filename: '; chomp( my $prot_filename = <STDIN> ); open my $PROTFILE, '<', $prot_filename or die "Cannot open '$prot_filename' because: $!"; my $out_filename = 'duplicate_gene_entries_in_'.$prot_filename; open my $OUTFILE, '>', $out_filename or die "Cannot open '$out_filename' because: $!"; $/ = ''; # Set paragraph mode my %seen; my $header; my $count_in; my $count_out; my $gn; while (<>) { if (/^>.* GN=([^ ]+)/) { $gn = $1; $header = exists $seen{$gn} ? undef : $_; } elsif ($header) { undef $seen{$gn}; print $header, $_; } } close $OUTFILE; close $PROTFILE; printf "%d total records read from '%s'\n",$count_in,$prot_filename; printf "%d records written to '%s' after removing duplicate entries\n" +,$count_out,$out_filename;
      This:
      while (<>) {
      is wrong, because you assigned a file handler to your input file. So, this should be:
      while (<$PROTFILE>) {
      Otherwise, I am not sure why you want to set to paragraph mode.

      There may be some other problems, but that should help you to get going.

        I want to set to paragraph mode because each of my entries is multiple lines long and is separated by a new line. So, I was reading about perl, it said I can use the paragraph mode to specify that. As I said, I'm new and still learning, let me know if it's wrong. Also, when I ran the code after correcting what you suggested, it gave me the following errors:

        Use of uninitialized value $count_in in printf at ../remove_duplicate_genes.pl line 34.

        0 total records read from 'enzymes.fasta'

        Use of uninitialized value $count_out in printf at ../remove_duplicate_genes.pl line 35.

        0 records written to 'duplicate_gene_entries_in_enzymes.fasta' after removing duplicate entries

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11103665]
Approved by Corion
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (2)
As of 2024-04-19 19:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found