PerlMonks  

Applying a regex across multiple files

by nuance (Hermit)
on May 25, 2000 at 21:00 UTC ( [id://14810]=perlquestion )

nuance has asked for the wisdom of the Perl Monks concerning the following question:

This shows up as older than the other answers on the original Q & A question, "Apply regex to entire file, not just individual lines ?", because I've been trying to get it posted all afternoon.

OK, Q & A isn't really designed to give you the kind of help you're looking for. Since you haven't logged in I can't contact you directly, but I've moved this to Seekers of Perl Wisdom.

You asked

I'm trying to extract a specific block of recurring text from a daily-updated Web page, and output the result to a local file. I'm happy with my HTML retrieval, but then applying regexes on a line-by-line basis requires waaay too much tweaking on my part. How can I substitute across multiple lines? Preferably across the entire file.

See my answer below; I'm having real problems posting this question.

Replies are listed 'Best First'.
RE: Applying a regex across multiple files
by nuance (Hermit) on May 25, 2000 at 17:51 UTC
    Updated from Q & A: this is a further answer to a post in Q & A, "Apply regex to entire file, not just individual lines ?", which was turning into a conversation better suited to this forum.

    I'm not sure exactly what problem you're still experiencing, but the following seems to do what you're asking for.

    #!/usr/bin/perl -w

    my $raw = 'c:/temp/delete.me';
    my $out = 'c:/temp/newfile';
    my $lines;

    {
        # Read the whole file into $lines in one go
        open (RAW, "<$raw") or die "Couldn't open $raw: $!";
        local $/ = undef;
        $lines = <RAW>;
        close RAW;
    }

    # /m lets ^ match at the start of every line, /g replaces every match
    $lines =~ s/^unwantedstuff//msg;
    $lines =~ s/moreunwantedstuff//msg;

    open (OUT, ">$out") or die "Couldn't open $out for writing: $!";
    print OUT $lines;
    close OUT;

    Note the added /g modifier: if you want to replace all occurrences of a piece of text, you probably need that as well.
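
    As a rough illustration of what /g and /m change, here is a minimal sketch. The sample string is made up for the example and simply stands in for the contents of a slurped file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample text standing in for a slurped file.
    my $text = "unwantedstuff keep this\nunwantedstuff and this\n";

    # Without /g only the first match is removed.
    (my $first_only = $text) =~ s/^unwantedstuff //m;

    # With /g every match is removed; /m lets ^ match at the start of each line.
    (my $all = $text) =~ s/^unwantedstuff //mg;

    print $first_only;
    print $all;
    ```

    The first print should still show one "unwantedstuff" on the second line, while the second print should show neither.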

    I put the part that reads the file into a block so that making the $/ variable local is restricted to that section of the program. That's probably overkill if your entire script is not much bigger than this. If you start writing longer scripts, the fact that you've changed the value of this predefined global could give you problems. If you put it in a separate block as I've done here, it isolates the rest of the program from the effects of that.
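
    The same isolation can be had by putting the slurp in a subroutine. This is only a sketch: slurp_file is a hypothetical helper name, and the temporary file exists purely so that the example runs on its own.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # slurp_file is a hypothetical helper name, not part of the script above.
    sub slurp_file {
        my ($path) = @_;
        open(my $fh, '<', $path) or die "Couldn't open $path: $!";
        local $/ = undef;    # slurp mode, in effect only until this sub returns
        my $contents = <$fh>;
        close $fh;
        return $contents;
    }

    # Write a small throwaway file so the sketch is self-contained.
    my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
    print {$tmp_fh} "line one\nline two\n";
    close $tmp_fh;

    my $lines = slurp_file($tmp_name);
    print length($lines), " characters read\n";

    # Outside the sub, $/ has its default value again.
    print "\$/ is back to a newline\n" if defined $/ && $/ eq "\n";
    ```

    Because local restores the old value when the enclosing block or sub exits, any later line-by-line reads in the program behave as usual.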

    Nuance

    Baldrick, you wouldn't see a subtle plan if it painted itself purple and danced naked on top of a harpsichord, singing "Subtle plans are here again!"

RE: Applying a regex across multiple files
by lhoward (Vicar) on May 25, 2000 at 21:30 UTC
    Just slurp your entire file into a single scalar (from what you describe it sounds like you're doing line-by-line processing with a while loop now). If you're processing like this now:
    while (<>) {
        # examine one line
    }
    you can slurp the whole stream into a scalar with
    my $s=join '',<>;
    and then process with a normal regular expression. Use the s flag on your regular expression so that . also matches newlines and the pattern can span lines.
    if($s=~/foo\s*(\d+)\s*bar/s){ ... }
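
    To show where the s flag actually makes a difference, here is a small sketch. The input string is invented for the example. Note that \s matches newlines by itself, so the pattern above spans lines either way; the flag matters once the pattern relies on "." to cross a line break.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: the number sits on a different line from "foo" and "bar".
    my $s = "foo\n  42\nbar\n";

    # \s already matches newlines, so this pattern spans lines with or without /s.
    my ($num) = $s =~ /foo\s*(\d+)\s*bar/;

    # The /s flag matters when "." has to cross a line break.
    my $without_s = $s =~ /foo.*bar/  ? 1 : 0;
    my $with_s    = $s =~ /foo.*bar/s ? 1 : 0;

    print "num=$num without_s=$without_s with_s=$with_s\n";
    ```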
RE: Applying a regex across multiple files
by Anonymous Monk on May 26, 2000 at 08:51 UTC

    Applying a regex across multiple files

    This is the anonymous monk who posted the question. Thanks to Nuance, juahonen, and lhoward for their help.

    Sorry for the prior (partially unformatted) reply. I expected <CODE></CODE> to provide implied breaks.

    Here's what I came up with. I don't fully grok the OO stuff, but the Owl, Ram, Llama and Camel books help enough to get simple stuff done. This is the first time I've looked for outside help - y'all's responsiveness is great.

    Suggestions for improvements are welcome.

    #!/usr/bin/perl -wT

    # Specify Perl modules
    use Time::localtime;
    use Net::Ping;
    use LWP::UserAgent;

    # Define base scalars
    my $time     = localtime;
    my $targurl  = 'http://wwwsomesite/somefile.html';
    my $targhost = 'www.somesite';
    my $agent    = 'Mozilla/4.5 [en] (X11; I; Linux 2.0.36 i486; Nav)';
    my $raw      = '/tmp/tempfile';
    my $out      = '/var/www/outfile.txt';
    my $pl       = 'thisfilename.pl';
    my $desturl  = 'http://thiswebserver.org';

    # Gen log header
    printf " thisfile.pl : %02d-%02d-%04d %02d:%02d:%02d\n",
        $time->mon+1, $time->mday, $time->year+1900,
        $time->hour, $time->min, $time->sec;
    prin

    janitored by ybiC: Closed unbalanced <code> tag and removed html formatting markup from code.   Did *not* modify apparently-incomplete code listing

RE: Applying a regex across multiple files
by Anonymous Monk on May 26, 2000 at 09:07 UTC

    Same anonymous monk. Third time's a charm. Apologies again for prior attempted responses.

    my $raw = '/tmp/tempfile';
    my $out = '/var/www/outfile.txt';
    my $lines;

    {
        open (RAW, "<$raw") or die "Can't open $raw to strip unwanted stuff\n";
        local $/ = undef;
        $lines = <RAW>;
        close RAW;
    }

    $lines =~ s/unwantedstuff //msg;
    $lines =~ s/moreunwantedstuff //msg;

    open (OUT, ">$out") or die "Can't open $out\n";
    print OUT $lines;
    close OUT;

    janitored by ybiC: Removed extraneous html formatting markup from code

RE: Applying a regex across multiple files
by Anonymous Monk on May 26, 2000 at 08:30 UTC

    This is the anonymous monk who posted the question. Thanks to Nuance, juahonen, and lhoward for their help.

    Here's what I came up with. I don't fully grok the OO stuff, but the Owl, Ram, Llama and Camel books help enough to get simple stuff done. This is the first time I've looked for outside help - y'all's responsiveness is great.

    Suggestions for improvements are welcome.

    #!/usr/bin/perl -wT

    # Specify Perl modules
    use Time::localtime;
    use Net::Ping;
    use LWP::UserAgent;

    # Define base scalars
    my $time     = localtime;
    my $targurl  = 'http://wwwsomesite/somefile.html';
    my $targhost = 'www.somesite';
    my $agent    = 'Mozilla/4.5 [en] (X11; I; Linux 2.0.36 i486; Nav)';
    my $raw      = '/tmp/tempfile';
    my $out      = '/var/www/outfile.txt';
    my $pl       = 'thisfilename.pl';
    my $desturl  = 'http://thiswebserver.org';

    # Gen log header
    printf " thisfile.pl : %02d-%02d-%04d %02d:%02d:%02d\n",
        $time->mon+1, $time->mday, $time->year+1900,
        $time->hour, $time->min, $time->sec;
    print " thisfile.pl : target - $targurl \n";
    print " thisfile.pl : agent - $agent \n";
    print " thisfile.pl : patience, the next two steps may take some time\n";

    # Ping host to make sure it's accessible
    $p = Net::Ping->new( "icmp" )
        or die " thisfile.pl : Can't create new Net::Ping object: $!\n";
    % +0

    janitored by ybiC: Closed unbalanced <code> tag
