Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

to delete a set of specific lines in a file

by sugar (Sexton)
on Dec 05, 2008 at 07:52 UTC ( #728215=perlquestion: print w/ replies, xml ) Need Help??
sugar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, How to delete a set of specific lines in a file which occurs twice(the only difference is .f or .r). A sample input and output is given below. sample input:
input is: A0000.f BG_c22 A000X.f BG_c5 A000X.r BG_c5 A002B.f BG_c38 A002B.r BG_c38 A003A.r BG_c38 A0082.r BG_c12 A00AS.f BG_c52 A00B9.f BG_c45 A00B9.r BG_c45 A00DK.f BG_c5 A00F0.f BG_c22 A00F0.r BG_c22 A00F3.f BG_c14 A00FX.f BG_c7 A00FX.r BG_c7 result shud be: A0000.f BG_c22 A003A.r BG_c38 A0082.r BG_c12 A00AS.f BG_c52 A00DK.f BG_c5 A00F3.f BG_c14
The code i have written is:
#!/usr/bin/perl use strict; use warnings; open(FILE,"pe_real_sample.txt") or die "cannot open"; open(OUTFILE,">last.output") or die "cannot open"; my @arr=<FILE>; close(FILE); my $prev1=0;my $prev2=0;my @dels; foreach(@arr){ my @spl=split(".(f|r)",$_); if($prev1 eq $spl[0] && $prev2 eq $spl[2]){ push(@dels,$spl[0]); } $prev1=$spl[0];$prev2=$spl[2]; } my $del= join '|', map quotemeta, @dels; my @arr1=grep !/$del/,@arr; print OUTFILE "@arr1\n";
Well, everything is perfect as far as the file is small. the sample input which i have given works. but then, wen i feed a 10MB data(which contains 3.5 million lines), the program doesnt work, i mean it takes awfully a lot of time but never stops either with a bug or correct result. What do i do? plz help. P.S: This is also with reference to the post "deleting a specific element from an array".

Comment on to delete a set of specific lines in a file
Select or Download Code
Re: to delete a set of specific lines in a file
by dHarry (Abbot) on Dec 05, 2008 at 07:57 UTC

    That's because you slurp the entire file into memory. It's better to step through the file line-by-line. It is more scalable and gives a better performance.

Re: to delete a set of specific lines in a file
by lakshmananindia (Chaplain) on Dec 05, 2008 at 08:32 UTC
  • Read the file line-by-line. Store each line in an array
  • After getting the second line, check whether the line is already there in the array
  • If so then remove the line
  • Using grep to check, will be easier
Re: to delete a set of specific lines in a file
by moritz (Cardinal) on Dec 05, 2008 at 08:49 UTC

    First of all sugar++ for giving us sample input and reference output as well as a clear description of what your problem is.

    In addition to what the others have said: You build a fairly large regex from @dels, which might slow down things (if the number of items is in the millions).

    I'd suggest to use a hash instead of @dels:

    #!/usr/bin/perl use strict; use warnings; my $prev1=0;my $prev2=0;my @dels; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my %dels; while (<$file>) { # the \. prevents . from matching any character my @spl = split(/\.[fr]/, $_); if($prev1 eq $spl[0] && $prev2 eq $spl[1]){ $dels{$spl[0]} = 1; } $prev1=$spl[0];$prev2=$spl[1]; } # reset the file cursor, read from the beginning again seek $file, 0, 0; while (<$file>) { my @spl = split(/\.[fr]/, $_); print unless $dels{$spl[0]}; } close $file;

    This version gets rid of @arr entirely, and replaced the regex and @dels with %dels.

      thank u :) Now the program takes only 2 seconds to output the desired results :)
      Going a step further, we can do this in a single pass:
      #!/usr/bin/perl use strict; use warnings; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my @prev = (); while (my $line = <$file>) { my @spl = split(/(\.[fr])/, $line); if (@prev) { if ($prev[0] ne $spl[0] || $prev[2] ne $spl[2]){ print @prev; } else { @prev = (); next; } } @prev = @spl; } print @prev; close $file;
Re: to delete a set of specific lines in a file (GOLF)
by ccn (Vicar) on Dec 05, 2008 at 10:57 UTC
    #!/usr/bin/perl -- # filter.pl use strict; use warnings; die "Usage: perl filter.pl pe_real_sample.txt > last.output\n" unless @ARGV; my %seen; my @ary; # push into @ary every line which have not be seen yet while (my $line = <>) { $line =~ /[fr] / or die "Can't parse line: $line"; my $key = $` . $'; push(@ary, [$line, $key]) unless $seen{$key}++; } # print all lines from @ary having $seen only once print $_->[0] foreach grep {$seen{$_->[1]} < 2} @ary;

    This script can be written more shortly:

    #!/usr/bin/perl -n /[fr] /, $s{$`.$'}++ || push @a, [$_,$`.$'] }{ $s{$_->[1]}<2 && print +$_->[0] for @a

    or even:

    #!/usr/bin/perl -n /f|r/,$s{$`.$'}++,push@a,[$_,$`.$']}{$s{$$_[1]}<2&&print$$_[0]for@a

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://728215]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2015-07-06 02:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (69 votes), past polls