Shorten the headers of a file and remove empty lines using perl

by intect (Initiate)
on Jun 13, 2013 at 21:07 UTC (#1038841)
intect has asked for the wisdom of the Perl Monks concerning the following question:

I have a file with headers like this:

>GL13245678

ABDVEADGADE

>GL123456789

BDADSFAEADVASDFGAGWE

>GL1254367890

DGAEGAHAGADRGASGASDG

I want to shorten each header to GL + 6 digits (keeping only the first 6 digits from the original header), like this:

>GL132456

ABDVEADGADE

>GL123456

BDADSFAEADVASDFGAGWE

>GL125436

DGAEGAHAGADRGASGASDG

At the same time, there is an empty line above each header (e.g. GL123456), and I want to remove these empty lines. Can anyone help me with this? Thanks

Re: Shorten the headers of a file and remove empty lines using perl
by davido (Archbishop) on Jun 13, 2013 at 21:11 UTC

    We could write the whole thing for you, but that's not what PerlMonks is about. What have you done to get started, and what part are you having trouble with?

    My solution would differ depending on how large the input file is expected to be (or become). So as you work on providing more information, that's one element I would like to know.


    Dave

      Thank you for your comments. I have just started learning Perl and am not used to scripting yet, so that is the problem: I am not sure how to start this script. The file is very big, about 500 MB. XF
Re: Shorten the headers of a file and remove empty lines using perl
by Preceptor (Chaplain) on Jun 13, 2013 at 21:34 UTC

    What you need to accomplish this is regular expressions. Learning regular expressions can be a bit painful, but they're really incredibly powerful. As a sample, the code below might give you a starting point.

    The relevant documentation is perlre.

    open ( my $input_fh, "<", $input_file );
    open ( my $output_fh, ">", $output_file );
    foreach my $line ( <$input_fh> ) {
        # Skip empty or whitespace-only lines.
        unless ( $line =~ m/\A\s*\Z/ ) {
            # Keep GL plus the first six digits.
            $line =~ s/(GL\d{6})\d+/$1/;
            print $output_fh $line;
        }
    }
    close ( $input_fh );
    close ( $output_fh );

    The essence is: first you test whether a line is blank, and skip it if so. Then you use a 'search and replace' pattern to trim any text that starts with GL and is followed by more than 6 digits down to GL plus the first 6 digits.
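To make those two steps concrete, here is a small sketch that applies the same blank-line test and substitution to the sample records from the question, held in an in-memory list rather than a file. The sub name shorten_header is made up for illustration and is not part of the code posted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): trims a header such as
# ">GL13245678" to "GL" plus its first six digits.
sub shorten_header {
    my ($line) = @_;
    # Capture "GL" and six digits in $1, match any further digits,
    # then replace the whole match with just the captured part.
    $line =~ s/(GL\d{6})\d+/$1/;
    return $line;
}

# Sample records from the question, including the unwanted empty lines.
my @lines = (
    "\n", ">GL13245678\n",   "ABDVEADGADE\n",
    "\n", ">GL123456789\n",  "BDADSFAEADVASDFGAGWE\n",
    "\n", ">GL1254367890\n", "DGAEGAHAGADRGASGASDG\n",
);

for my $line (@lines) {
    next if $line =~ m/\A\s*\Z/;    # skip empty or whitespace-only lines
    print shorten_header($line);
}
```

The pattern is not anchored, so it would also change a non-header line that happened to contain GL followed by seven or more digits. If that is a concern, anchoring it to the header marker, for example s/^(>GL\d{6})\d+/$1/, is one option.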

      How can I supply the input file? "$input_file" is a scalar, not a file. I ran the following script:

      #!/usr/local/bin/perl
      use warnings;
      open ( my $input_fh, "<", $genome );
      open ( my $output_fh, ">", $output_file );
      foreach my $line ( <$input_fh> ) {
          unless ( $line =~ m/\A\s*\Z/ ) {
              $line =~ s/(GL\d{6})\d+/$1/;
              print $output_fh $line;
          }
      }
      close ( $input_fh );
      close ( $output_fh );
      and got the following messages:

      Name "main::genome" used only once: possible typo at header.pl line 3.
      Name "main::output_file" used only once: possible typo at header.pl line 4.
      Use of uninitialized value $genome in open at header.pl line 3.
      Use of uninitialized value $output_file in open at header.pl line 4.
      readline() on closed filehandle $input_fh at header.pl line 5.

      I think the way I open the file is wrong. Can you give me some hints? Thanks XF
        foreach my $line ( <$input_fh> )

        should be

        while (my $line = <$input_fh> )

        The first form is a glob (but I don't know it well enough to explain it to you).

        The second form should work properly for reading your file.

        Update: Yes, Choroba is correct, it is not a glob. My mistake.
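For what it's worth, the practical difference between the two loops is context. In foreach, <$input_fh> is evaluated in list context, so every line is read into memory before the loop body runs; in while, it is evaluated in scalar context and returns one line per iteration. With a file of around 500 MB that difference matters. A minimal sketch follows, using an in-memory string in place of a real file; the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# List context: <$fh> returns every remaining line at once, so the
# whole input sits in memory before the first iteration runs.
sub count_with_foreach {
    my ($fh) = @_;
    my $count = 0;
    foreach my $line (<$fh>) { $count++ }
    return $count;
}

# Scalar context: <$fh> returns one line per call, so only the
# current line is held in memory.
sub count_with_while {
    my ($fh) = @_;
    my $count = 0;
    while ( my $line = <$fh> ) { $count++ }
    return $count;
}

# Stand-in for the real file: Perl can open a filehandle on a string.
my $sample = ">GL13245678\nABDVEADGADE\n\n>GL123456789\nBDADSFAEADVASDFGAGWE\n";

for my $counter ( \&count_with_foreach, \&count_with_while ) {
    open( my $fh, '<', \$sample ) or die "Cannot open in-memory file: $!";
    print $counter->($fh), "\n";    # both loops see the same lines
    close($fh);
}
```

Both loops visit the same lines in the same order, so either produces the same output; only the memory use differs.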

        Ah, I didn't see your problem clearly. First, you should add use strict; alongside the use warnings; that you already have at the top of your program. Then, you have to assign the name of your file to $genome.

        my $genome = 'whateverthename';

        (You must also assign a name to your output file, in $output_file.)
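As an alternative to hard-coding the names, the script could take both file names from the command line via @ARGV. The sketch below is one way to do that under some assumptions: the sub name clean_file and the example file names genome.fa and genome_short.fa are made up, and the or die checks are an addition. Without such checks a failed open goes unnoticed until the first read, which is where the "readline() on closed filehandle" message came from.

```perl
use strict;
use warnings;

# Hypothetical wrapper: reads $in_path line by line and writes the
# cleaned copy (no empty lines, shortened headers) to $out_path.
sub clean_file {
    my ( $in_path, $out_path ) = @_;

    # Stop with a readable message if either file cannot be opened.
    open( my $input_fh,  '<', $in_path )  or die "Cannot read $in_path: $!";
    open( my $output_fh, '>', $out_path ) or die "Cannot write $out_path: $!";

    while ( my $line = <$input_fh> ) {
        next if $line =~ m/\A\s*\Z/;    # remove empty lines
        $line =~ s/(GL\d{6})\d+/$1/;    # keep GL plus the first six digits
        print {$output_fh} $line;
    }

    close($input_fh);
    close($output_fh) or die "Cannot close $out_path: $!";
    return;
}

# Runs only when exactly two file names are given, for example
# (hypothetical names):  perl header.pl genome.fa genome_short.fa
clean_file(@ARGV) if @ARGV == 2;
```

Writing to a separate output file leaves the original untouched, which seems the safer choice while the pattern is still being tested.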
