PerlMonks  

Apply regex to entire file, not just individual lines ?

( #14521=categorized question )
Contributed by Anonymous Monk on May 24, 2000 at 13:59 UTC


Description:

I'm trying to extract a specific block of recurring text from a daily-updated Web page and write the result to a local file. I'm happy with my HTML retrieval, but applying regexes on a line-by-line basis requires way too much tweaking on my part. How can I substitute across multiple lines? Preferably across the entire file.

Answer: Apply regex to entire file, not just individual lines ?
contributed by nuance

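You can read the entire file into a scalar variable like this:

{
    local $/ = undef;    # no input record separator, so <FILE> returns the whole file
    open(FILE, $filename) or die "Can't open $filename: $!\n";
    $lines = <FILE>;     # entire file contents as one string
    close(FILE);
}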
Then you can just use your normal regular expression, but you'll probably want to use at least one of the following modifiers (from the perlre manpage):

m

Treat string as multiple lines. That is, change ``^'' and ``$'' from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

s

Treat string as single line. That is, change ``.'' to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force ``^'' to match only at the beginning of the string and ``$'' to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the ``.'' match any character whatsoever, while yet allowing ``^'' and ``$'' to match, respectively, just after and just before newlines within the string.
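
To make the difference between the two modifiers concrete, here is a minimal sketch. The sample string and the <b>...</b> pattern are made up for illustration; in practice $text would be the slurped file.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for a slurped file.
my $text = "<b>first\nblock</b>\nline two\nline three\n";

# Without /s, "." stops at a newline, so a block spanning two lines does not match.
my $without_s = ($text =~ /<b>(.*?)<\/b>/) ? 1 : 0;

# With /s, "." also matches newlines, so the block is captured.
my ($block) = $text =~ /<b>(.*?)<\/b>/s;

# With /m, "^" and "$" match at the start and end of every line in the string.
my @starts = $text =~ /^(line \w+)$/mg;

print "matched without /s: $without_s\n";
print "block with /s: $block\n";
print "lines found with /m: @starts\n";
```

The two modifiers are independent, so a pattern that needs both behaviours can simply use /ms.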

Answer: Apply regex to entire file, not just individual lines ?
contributed by juahonen

After you've opened and read the file (or web page) into an array, join all lines with join().

open(FILE, $filename) or die "Can't open $filename: $!\n";
@lines = <FILE>;
close(FILE);

$content = join('', @lines);
After this, $content holds the whole file as a single string (the newlines are still in it), so it is easy to apply your existing regexes to it in one go.
Answer: Apply regex to entire file, not just individual lines ?
contributed by vxp

You might not want to have your WHOLE file in one variable. Depending on the size of the file, it could eat a LOT of your memory. In my own experience, it is usually enough to set $/ = "\n\n" (double quotes, so that \n is interpolated as a newline; in single quotes it would be a literal backslash and n), and then the record separator is two newlines, not one. I was parsing a bounce file when I did this, which was about 300 MB in size, daily. That's a LONG 300 MB line. $/ = "\n\n"; took care of it. I ended up with smaller big lines, and was able to do what I wanted to do without consuming a lot of RAM.
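
A rough sketch of that record-at-a-time approach follows. The sample data, the From:/Status: field names and the pattern are invented for illustration, and an in-memory filehandle stands in for the real (large) file.

```perl
use strict;
use warnings;

# Hypothetical data standing in for a large file of blank-line-separated records.
my $data = "From: a\nStatus: bounced\n\nFrom: b\nStatus: ok\n\nFrom: c\nStatus: bounced\n";

# Read from the string as if it were a file (assumption: a real script opens a path here).
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";

my @bounced;
{
    local $/ = "\n\n";   # a record now ends at two consecutive newlines
    while (my $record = <$fh>) {
        # Each $record holds several lines, so the regex can span them with /ms.
        if ($record =~ /^From: (\S+).*?^Status: bounced$/ms) {
            push @bounced, $1;
        }
    }
}
close($fh);

print "bounced: @bounced\n";
```

Only one record is held in memory at a time, so this works as long as no single match has to cross a blank line.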

Answer: Apply regex to entire file, not just individual lines ?
contributed by dsb

The key is to get the whole file into one scalar (the first 'while' loop). Then the 'g' modifier (in the condition of the second 'while' loop) will keep track of where the last match ended and continue from there until no more matches are found.

open( FH, "filename" ) || die "couldn't open: $!\n";
while ( <FH> ) {
    $data .= $_;    # append each line to one big scalar
}
close( FH );
while ( $data =~ m/PATTERN/g ) {
    # executed code
    # executed code...etc.
}
-kel
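
As an illustration of the second loop, here is a small sketch with a concrete pattern in place of PATTERN. The id=/name= data and the pattern are hypothetical; pos() is the built-in that reports where the last /g match on the string ended.

```perl
use strict;
use warnings;

# Hypothetical contents of the accumulated scalar.
my $data = "id=1 name=foo\nid=22 name=bar\nid=333 name=baz\n";

my @found;
# In scalar context, each /g match resumes where the previous one ended.
while ($data =~ /id=(\d+)\s+name=(\w+)/g) {
    push @found, "$1:$2";
    print "matched $1/$2, next search starts at offset ", pos($data), "\n";
}

print "found: @found\n";
```

When the loop body only needs to collect captures, the same pattern in list context (my @all = $data =~ /.../g) returns them all at once.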
