PerlMonks  

Cleaning up a text file with compact regex

by neversaint (Deacon)
on Apr 24, 2007 at 05:39 UTC (#611652=perlquestion)
neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
Given this file ('file.txt'), how can I use a compact regex to clean it so that from this:
<h3>Warning</h3><blockquote><FONT SIZE=+1 color=#aa0000><B>
Circular contig NC_001224.1 leftmost feature 0 Q0050 left neighbour Q0275
</B></FONT></blockquote><BR><HR SIZE=3>
<h3>Warning</h3><blockquote><FONT SIZE=+1 color=#aa0000><B>
Circular contig NC_001224.1 rightmost feature 18 Q0275 right neighbour Q0045
</B></FONT></blockquote><BR><HR SIZE=3>
>YJR152W|DAL5
TGATTTTGGATATTCATCAAAGGAAACCCTATTAATGGGTTTACCTACAGGTGCTGTTGA
ATTGGTAGGTTGTCCACTTTTTGGTATTCTAGCAGTTTATGCAGCCAATAAGAAGATACC
ATTTTGGAAATATAAGTTGAGTTGGGCTATTTTTGCAGCTGTCTTAGCATTGATTGCTAG
CTGCATGTTAGGGTTTGCAACAAACTCCAAAAAAGCAAGACTGGCTGGTGCTTACCTGTG
GTACATCTCGCCCGTCTCATTTATTTGCGTACTTTCCAATATCAGTGCGAATTCCTCGGG
ATATAGTAAAAAATGGACTGTATCTTCAATAAACTTAGTAGCATATGCTGCAGCTAACTT
GGCAGGACCACAAACCTTTATTGCTAAGCAGGCTCCTAAATATCATGGCGCTAAGGTCGC
TATGGTCGTATGTTATGCTGTTATGATCGTGCTTCTATCTATACTGCTCATCGTCAATTT
AAGGGAAAACAAGAGACGTGATAAGATAGCTGCCGAGAGAGGGTTCCCTGAAGAAACAGA
GAATTTAGAGTTTTCTGATTTGACTGATTTTGAAAATCCAAATTTCAGATACACTTTATG
>YKR039W|GAP1
CCTAGCTGAACAGAGATTTCTGCCAGAAATCTTTTCCTACGTTGACCGTAAGGGTAGACC
ATTGGTGGGAATTGCTGTCACATCTGCATTCGGTCTTATTGCGTTTGTTGCCGCCTCCAA
AAAGGAAGGTGAAGTTTTCAACTGGTTACTAGCCTTGTCTGGGTTGTCATCTCTATTCAC
ATGGGGTGGTATCTGTATTTGTCACATTCGTTTCAGAAAGGCATTGGCCGCCCAAGGAAG
AGGCTTGGATGAATTGTCTTTCAAGTCTCCTACCGGTGTTTGGGGTTCCTACTGGGGGTT
ATTTATGGTTATTATTATGTTCATTGCCCAATTCTACGTTGCTGTATTCCCCGTGGGAGA
TTCTCCAAGTGCGGAAGGTTTCTTCGAAGCTTATCTATCCTTCCCACTTGTTATGGTTAT
GTACATCGGACACAAGATCTATAAGAGGAATTGGAAGCTTTTCATCCCAGCAGAAAAGAT
GGACATTGATACGGGTAGAAGAGAAGTCGATTTAGATTTGTTGAAACAAGAAATTGCAGA
AGAAAAGGCAATTATGGCCACAAAGCCAAGATGGTATAGAATCTGGAATTTCTGGTGTTA
;WARNING invalid query foo
;WARNING invalid query bar
;WARNING invalid query qux
we get simply:
>YJR152W|DAL5
TGATTTTGGATATTCATCAAAGGAAACCCTATTAATGGGTTTACCTACAGGTGCTGTTGA
ATTGGTAGGTTGTCCACTTTTTGGTATTCTAGCAGTTTATGCAGCCAATAAGAAGATACC
ATTTTGGAAATATAAGTTGAGTTGGGCTATTTTTGCAGCTGTCTTAGCATTGATTGCTAG
CTGCATGTTAGGGTTTGCAACAAACTCCAAAAAAGCAAGACTGGCTGGTGCTTACCTGTG
GTACATCTCGCCCGTCTCATTTATTTGCGTACTTTCCAATATCAGTGCGAATTCCTCGGG
ATATAGTAAAAAATGGACTGTATCTTCAATAAACTTAGTAGCATATGCTGCAGCTAACTT
GGCAGGACCACAAACCTTTATTGCTAAGCAGGCTCCTAAATATCATGGCGCTAAGGTCGC
TATGGTCGTATGTTATGCTGTTATGATCGTGCTTCTATCTATACTGCTCATCGTCAATTT
AAGGGAAAACAAGAGACGTGATAAGATAGCTGCCGAGAGAGGGTTCCCTGAAGAAACAGA
GAATTTAGAGTTTTCTGATTTGACTGATTTTGAAAATCCAAATTTCAGATACACTTTATG
>YKR039W|GAP1
CCTAGCTGAACAGAGATTTCTGCCAGAAATCTTTTCCTACGTTGACCGTAAGGGTAGACC
ATTGGTGGGAATTGCTGTCACATCTGCATTCGGTCTTATTGCGTTTGTTGCCGCCTCCAA
AAAGGAAGGTGAAGTTTTCAACTGGTTACTAGCCTTGTCTGGGTTGTCATCTCTATTCAC
ATGGGGTGGTATCTGTATTTGTCACATTCGTTTCAGAAAGGCATTGGCCGCCCAAGGAAG
AGGCTTGGATGAATTGTCTTTCAAGTCTCCTACCGGTGTTTGGGGTTCCTACTGGGGGTT
ATTTATGGTTATTATTATGTTCATTGCCCAATTCTACGTTGCTGTATTCCCCGTGGGAGA
TTCTCCAAGTGCGGAAGGTTTCTTCGAAGCTTATCTATCCTTCCCACTTGTTATGGTTAT
GTACATCGGACACAAGATCTATAAGAGGAATTGGAAGCTTTTCATCCCAGCAGAAAAGAT
GGACATTGATACGGGTAGAAGAGAAGTCGATTTAGATTTGTTGAAACAAGAAATTGCAGA
AGAAAAGGCAATTATGGCCACAAAGCCAAGATGGTATAGAATCTGGAATTTCTGGTGTTA
In principle we want to:
  • Delete first six lines
  • Delete lines that start with ; (semicolon)


---
neversaint and everlastingly indebted.......

Re: Cleaning up a text file with compact regex
by chromatic (Archbishop) on Apr 24, 2007 at 05:51 UTC
    open( my $fh, '<', 'file.txt' )
        or die "Cannot read file.txt: $!\n";

    # clearer idiom, anyone?
    scalar <$fh> for 1 .. 6;

    while (<$fh>) {
        next if /^;/;
        # ... process lines
    }
      while (<$fh>)

      You should always put local($_); before such a loop, because the loop assigns each line to the global variable $_ and does not restore its previous value afterwards.

      Update: http://perl.plover.com/local.html has an example of the dangers of this kind of while loop.
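
To make the concern above concrete, here is a minimal sketch of how an unprotected while (<$fh>) loop can clobber a caller's data. The helper names count_lines_unsafe and count_lines_safe are made up for this illustration, and the example reads from in-memory strings instead of file.txt so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: counts lines but leaves $_ unprotected.
sub count_lines_unsafe {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!\n";
    my $n = 0;
    while (<$fh>) { $n++ }    # assigns each line (and finally undef) to the global $_
    return $n;
}

# Same helper, with $_ localized before the loop.
sub count_lines_safe {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!\n";
    my $n = 0;
    local $_;                 # the caller's $_ is restored when the sub returns
    while (<$fh>) { $n++ }
    return $n;
}

# foreach aliases $_ to each array element, so a callee that writes to $_
# writes straight into the array.
my @clobbered = ( 'DAL5', 'GAP1' );
count_lines_unsafe("a\nb\n") for @clobbered;

my @intact = ( 'DAL5', 'GAP1' );
count_lines_safe("a\nb\n") for @intact;

printf "unsafe: %s\n", join ',', map { defined $_ ? $_ : 'undef' } @clobbered;
printf "safe:   %s\n", join ',', @intact;
```

After the unsafe calls both elements of @clobbered are undef, while @intact keeps its values. In a short top-level script like the one in the parent post nothing else depends on $_, so the risk mostly applies to loops inside subroutines that other code calls.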

Re: Cleaning up a text file with compact regex
by Samy_rio (Vicar) on Apr 24, 2007 at 05:58 UTC

    Try like this,

    TIMTOWTDI

    use strict;
    use warnings;

    my @content = <DATA>;                  # slurp every line of the DATA section
    @content = grep { !/^;/ } @content;    # drop lines that start with a semicolon
    print @content[ 6 .. $#content ];      # skip the first six lines

    __DATA__
    <h3>Warning</h3><blockquote><FONT SIZE=+1 color=#aa0000><B>
    Circular contig NC_001224.1 leftmost feature 0 Q0050 left neighbour Q0275
    </B></FONT></blockquote><BR><HR SIZE=3>
    <h3>Warning</h3><blockquote><FONT SIZE=+1 color=#aa0000><B>
    Circular contig NC_001224.1 rightmost feature 18 Q0275 right neighbour Q0045
    </B></FONT></blockquote><BR><HR SIZE=3>
    >YJR152W|DAL5
    TGATTTTGGATATTCATCAAAGGAAACCCTATTAATGGGTTTACCTACAGGTGCTGTTGA
    ATTGGTAGGTTGTCCACTTTTTGGTATTCTAGCAGTTTATGCAGCCAATAAGAAGATACC
    ATTTTGGAAATATAAGTTGAGTTGGGCTATTTTTGCAGCTGTCTTAGCATTGATTGCTAG
    CTGCATGTTAGGGTTTGCAACAAACTCCAAAAAAGCAAGACTGGCTGGTGCTTACCTGTG
    GTACATCTCGCCCGTCTCATTTATTTGCGTACTTTCCAATATCAGTGCGAATTCCTCGGG
    ATATAGTAAAAAATGGACTGTATCTTCAATAAACTTAGTAGCATATGCTGCAGCTAACTT
    GGCAGGACCACAAACCTTTATTGCTAAGCAGGCTCCTAAATATCATGGCGCTAAGGTCGC
    TATGGTCGTATGTTATGCTGTTATGATCGTGCTTCTATCTATACTGCTCATCGTCAATTT
    AAGGGAAAACAAGAGACGTGATAAGATAGCTGCCGAGAGAGGGTTCCCTGAAGAAACAGA
    GAATTTAGAGTTTTCTGATTTGACTGATTTTGAAAATCCAAATTTCAGATACACTTTATG
    >YKR039W|GAP1
    CCTAGCTGAACAGAGATTTCTGCCAGAAATCTTTTCCTACGTTGACCGTAAGGGTAGACC
    ATTGGTGGGAATTGCTGTCACATCTGCATTCGGTCTTATTGCGTTTGTTGCCGCCTCCAA
    AAAGGAAGGTGAAGTTTTCAACTGGTTACTAGCCTTGTCTGGGTTGTCATCTCTATTCAC
    ATGGGGTGGTATCTGTATTTGTCACATTCGTTTCAGAAAGGCATTGGCCGCCCAAGGAAG
    AGGCTTGGATGAATTGTCTTTCAAGTCTCCTACCGGTGTTTGGGGTTCCTACTGGGGGTT
    ATTTATGGTTATTATTATGTTCATTGCCCAATTCTACGTTGCTGTATTCCCCGTGGGAGA
    TTCTCCAAGTGCGGAAGGTTTCTTCGAAGCTTATCTATCCTTCCCACTTGTTATGGTTAT
    GTACATCGGACACAAGATCTATAAGAGGAATTGGAAGCTTTTCATCCCAGCAGAAAAGAT
    GGACATTGATACGGGTAGAAGAGAAGTCGATTTAGATTTGTTGAAACAAGAAATTGCAGA
    AGAAAAGGCAATTATGGCCACAAAGCCAAGATGGTATAGAATCTGGAATTTCTGGTGTTA
    ;WARNING invalid query foo
    ;WARNING invalid query bar
    ;WARNING invalid query qux

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

      Just a note:

      my @content = <DATA>;

      Slurping genomic data into memory may be expensive in time and resources. In this case it may not be an issue, but data sets run large in bioinformatics, so line or chunk processing is often much more feasible.
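
As a rough illustration of the line-by-line alternative, the sketch below filters a filehandle without ever holding the whole file in memory. The sub name clean_stream and its three-argument interface are assumptions made for this example, not something from the thread, and the demo uses in-memory filehandles so it runs without file.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out one line at a time, dropping the
# first $skip lines and every line that starts with a semicolon.
sub clean_stream {
    my ( $in, $out, $skip ) = @_;
    my $line_no = 0;
    local $_;    # keep the caller's $_ untouched
    while (<$in>) {
        $line_no++;
        next if $line_no <= $skip;    # still inside the header block
        next if /^;/;                 # ";WARNING ..." style lines
        print {$out} $_;
    }
    return;
}

# Demo on a small in-memory "file": six junk lines, one record, one warning.
my $raw = join( '', map {"header line $_\n"} 1 .. 6 )
    . ">YJR152W|DAL5\nTGATTTTGGATATTCATC\n;WARNING invalid query foo\n";

open( my $demo_in, '<', \$raw ) or die "Cannot open in-memory input: $!\n";
clean_stream( $demo_in, \*STDOUT, 6 );
close $demo_in;
```

Only one line is in memory at any time, so the cost stays flat however large the sequence file grows. For a real file, $in would be a handle opened on file.txt and $out either STDOUT or a handle opened for writing.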

Re: Cleaning up a text file with compact regex
by Anonymous Monk on Apr 24, 2007 at 06:48 UTC
    As you asked for a regex...

        $line =~ /^(>.*|[ATCG]*)$/

    the whole line should be captured into $1
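
A small sketch of how that pattern could be applied as a whitelist, keeping only lines that look like a FASTA header or a run of bases. The sub name fasta_lines is invented for this example. Two things follow from the pattern itself: because of the *, an empty line also matches (and contributes an empty string), and a sequence line containing anything other than uppercase A, T, C or G (for instance N or lowercase bases) would be dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured text of every line that is either
# a ">" header or consists solely of the letters A, T, C and G.
sub fasta_lines {
    my @kept;
    for my $line (@_) {
        # Same pattern as in the post; $1 excludes the trailing newline.
        push @kept, $1 if $line =~ /^(>.*|[ATCG]*)$/;
    }
    return @kept;
}

my @sample = (
    "<h3>Warning</h3><blockquote><FONT SIZE=+1 color=#aa0000><B>\n",
    ">YJR152W|DAL5\n",
    "TGATTTTGGATATTCATCAAAGG\n",
    ";WARNING invalid query foo\n",
);

print "$_\n" for fasta_lines(@sample);
```

Unlike the skip-six-lines approaches, this does not depend on how many header lines there are, only on what the wanted lines look like.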
Re: Cleaning up a text file with compact regex
by OfficeLinebacker (Chaplain) on Apr 24, 2007 at 12:32 UTC
    To get to the heart of the matter, as it were (your two principles): as I understand the relative efficiencies and strengths of the tools, one might argue that awk or sed is the 'right tool for the job.'

    Note that while I know how I would do this in Perl, there are already several good answers. I do not know, off the top of my head, how to do it in sed or awk.


    I like computer programming because it's like Legos for the mind.
Re: Cleaning up a text file with compact regex
by ikegami (Pope) on Apr 24, 2007 at 22:19 UTC
    perl -ne "print unless $.<7 || /^;/" infile > outfile
