Contributed by Anonymous Monk
on May 24, 2000 at 13:59 UTC
Description: I'm trying to extract a specific block of recurring text from a daily-updated Web page and write the result to a local file. I'm happy with my HTML retrieval, but applying regexes line by line requires way too much tweaking on my part. How can I substitute across multiple lines, preferably across the entire file?

Answer: Apply regex to entire file, not just individual lines?
contributed by nuance

You can read the entire file into a scalar variable like this:
my $lines;
{
    open(my $fh, '<', $filename) or die "Can't open $filename: $!\n";
    local $/ = undef;    # slurp mode: read the whole file in one go
    $lines = <$fh>;
    close($fh);
}
Then you can just use your normal regular expression, but you'll probably want to use at least one of the following modifiers (from the perlre manpage):
m
Treat string as multiple lines. That is, change ``^'' and ``$'' from matching at only the very start or end of the string to the start or end of any line anywhere within the string.
s
Treat string as single line. That is, change ``.'' to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force ``^'' to match only at the beginning of the string and ``$'' to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the ``.'' match any character whatsoever, while yet allowing ``^'' and ``$'' to match, respectively, just after and just before newlines within the string.

Answer: Apply regex to entire file, not just individual lines?
contributed by juahonen

After you've opened and read the file (or web page) into an array, join all the lines with join():
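open(my $fh, '<', $filename) or die "Can't open $filename: $!\n";
my @lines = <$fh>;                # list context: one element per line
close($fh);
my $content = join('', @lines);   # glue the lines back into one string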
After this, $content holds the whole file in a single scalar (newlines included), so your existing regexes can match across line boundaries.

Answer: Apply regex to entire file, not just individual lines?
contributed by vxp

You might not want to have your WHOLE file in one variable. Depending on the size of the file, it could eat a LOT of memory. In my experience it is usually enough to set $/ = "\n\n", so that a record ends at two consecutive newlines (a blank line) rather than at one. Note the double quotes: in single quotes, '\n\n' is a literal backslash-n sequence, not two newlines.
I was parsing a bounce file when I did this, which was about 300 MB in size, daily.
That's a LONG 300 MB line.
$/ = "\n\n"; took care of it. I ended up with smaller big lines and was able to do what I wanted without consuming a lot of RAM.
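
The record-at-a-time idea above can be sketched as follows. This is only an illustration under stated assumptions: the sample data, the `count_matching_records` helper and the bounce-style field names are made up, and it uses `$/ = ""` (Perl's paragraph mode, which treats one or more blank lines as a single separator) instead of `"\n\n"`. Either way the separator must not be written in single quotes as `'\n\n'`, since that is a literal backslash-n sequence rather than two newlines.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a large bounce log:
# records are separated by blank lines.
my $sample = "From: a\@example.com\nStatus: 5.1.1\n\n"
           . "From: b\@example.com\nStatus: 4.2.2\n\n"
           . "From: c\@example.com\nStatus: 5.1.1\n";

# Hypothetical helper: count the records that match a pattern,
# holding only one record in memory at a time.
sub count_matching_records {
    my ($fh, $pattern) = @_;
    local $/ = "";    # paragraph mode: each read returns one blank-line-separated record
    my $count = 0;
    while (my $record = <$fh>) {
        $count++ if $record =~ $pattern;
    }
    return $count;
}

# An in-memory filehandle keeps the sketch self-contained;
# a real script would open the log file here instead.
open(my $fh, '<', \$sample) or die "Can't open in-memory file: $!\n";
my $hits = count_matching_records($fh, qr/^From:.*?^Status: 5\./ms);
close($fh);

print "$hits records with a 5.x status\n";
```

Because each record is a short multi-line string, the same /m and /s modifiers apply to it as to a fully slurped file, while memory use stays bounded by the largest record.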

Answer: Apply regex to entire file, not just individual lines?
contributed by dsb

The key is to get the whole file into one scalar (the first 'while' loop). Then the 'g' modifier (in the condition of the second 'while' loop) keeps track of where the last match ended and continues from there until no more matches are found.
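open( my $fh, '<', "filename" ) or die "couldn't open filename: $!\n";
my $data = '';
while ( <$fh> ) {
    $data .= $_;    # append each line to build up the whole file
}
close( $fh );
while ( $data =~ m/PATTERN/g ) {
    # code to run once per match
    # ($1, $2, ... and pos($data) are available here)
}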
-kel
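
To make the loop above concrete, here is a minimal sketch that pulls every multi-line block out of a string and then removes the blocks with one substitution. The BEGIN/END markers and the sample text are assumptions made up for illustration; a real page would need its own delimiters.

```perl
use strict;
use warnings;

# Hypothetical page content; the BEGIN/END markers are invented for this sketch.
my $data = <<'END_TEXT';
header line
BEGIN
first block, line 1
first block, line 2
END
noise
BEGIN
second block
END
footer
END_TEXT

my @blocks;
# /m lets ^ and $ match at line boundaries, /s lets . cross newlines,
# and /g resumes each search where the previous match ended.
while ($data =~ m/^BEGIN\n(.*?)^END$/msg) {
    push @blocks, $1;
}

# The same modifiers work for substitution across lines:
# copy the string, then strip every BEGIN..END block from the copy.
(my $stripped = $data) =~ s/^BEGIN\n.*?^END\n//msg;

print scalar(@blocks), " blocks found\n";
print "--- block ---\n$_" for @blocks;
print "--- remainder ---\n$stripped";
```

The non-greedy `.*?` matters here: with a greedy `.*` and /s, the first match would run from the first BEGIN to the last END and swallow everything in between.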