
Apply regex to entire file, not just individual lines ?

Contributed by Anonymous Monk on May 24, 2000 at 13:59 UTC


I'm trying to extract a specific block of recurring text from a daily-updated Web page and write the result to a local file. I'm happy with my HTML retrieval, but applying regexes on a line-by-line basis requires way too much tweaking on my part. How can I substitute across multiple lines, preferably across the entire file?

Answer: Apply regex to entire file, not just individual lines ?
contributed by nuance

You can read the entire file into a scalar variable like this:

{
    open(FILE, '<', $filename) or die "Can't open $filename: $!\n";
    local $/ = undef;    # no record separator, so <FILE> returns the whole file
    $lines = <FILE>;
    close(FILE);
}
Then you can just use your normal regular expression, but you'll probably want to use at least one of the following modifiers (from the perlre manpage):

m

Treat string as multiple lines. That is, change "^" and "$" from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

s

Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while yet allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
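
To make the difference between the two modifiers concrete, here is a minimal sketch. The sample string is made up for illustration and simply stands in for the contents of a slurped file.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for a slurped file.
my $text = "alpha\nbeta <b>bold\ntext</b>\ngamma\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . refuses to cross a newline, so this match fails.
my ($no_s) = $text =~ m{<b>(.*?)</b>};

# With /s, . matches newlines too, so the match can span two lines.
my ($with_s) = $text =~ m{<b>(.*?)</b>}s;

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "without /s: ", (defined $no_s ? $no_s : "(no match)"), "\n";
print "with /s:    ", (defined $with_s ? $with_s : "(no match)"), "\n";
```

Note that the names are a little misleading: /m only affects "^" and "$", and /s only affects ".", so the two can be combined freely.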

Answer: Apply regex to entire file, not just individual lines ?
contributed by juahonen

After you've opened and read the file (or web page) into an array, join all lines with join().

open(FILE, '<', $filename) or die "Can't open $filename: $!\n";
@lines = <FILE>;
close(FILE);

$content = join('', @lines);
After this, $content holds the whole file as a single string (the newlines are still in it), and it is easy to apply your existing regexps to it.
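
As a rough sketch of what that buys you, the following joins a few lines and then substitutes across a line break. The sample lines and the pattern are invented for illustration; in real use @lines would come from the file handle.

```perl
use strict;
use warnings;

# Hypothetical lines, as they would come back from <FILE> in list context.
my @lines = ("<td>old\n", "value</td>\n", "<td>other</td>\n");

my $content = join('', @lines);

# The joined string still contains every newline from the original lines.
my $newlines = () = $content =~ /\n/g;

# /s lets . cross the newline, so one substitution can span two lines.
(my $changed = $content) =~ s{<td>old.*?</td>}{<td>new</td>}s;

print "newlines kept: $newlines\n";
print $changed;
```

The same substitution applied line by line would never match, because no single line contains both the opening and the closing tag.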
Answer: Apply regex to entire file, not just individual lines ?
contributed by vxp

You might not want to have your WHOLE file in one variable. Depending on the size of the file, it could eat a LOT of your memory. In my own experience, it is usually enough to do $/ = "\n\n", so that the record separator is two newlines, not one. (Use double quotes here: in single quotes, '\n\n' is four literal characters and will never match a blank line.) I was parsing a bounce file when I did this, which was about 300 MB in size, daily. That's a LONG 300 MB line. $/ = "\n\n"; took care of it: I ended up with smaller big lines, and was able to do what I wanted without consuming a lot of RAM.
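
Here is a small sketch of that record-at-a-time approach. The bounce-style data and field names are made up, and an in-memory file handle stands in for the large file so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical data; in practice this would be a large file on disk.
my $data = "From: a\nStatus: bounced\n\n"
         . "From: b\nStatus: ok\n\n"
         . "From: c\nStatus: bounced\n";

# An in-memory file handle stands in for the real file.
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";

my @bounced;
{
    local $/ = "\n\n";    # double quotes, so these are real newlines
    while (my $record = <$fh>) {
        # Each $record is one blank-line-separated chunk, so a
        # multi-line pattern only ever has to look at that chunk.
        push @bounced, $1 if $record =~ /^From: (\S+).*^Status: bounced/ms;
    }
}
close($fh);

print "bounced: @bounced\n";
```

Only one record is held in memory at a time. Setting $/ to the empty string ("paragraph mode") is a close relative that treats any run of blank lines as a single separator.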

Answer: Apply regex to entire file, not just individual lines ?
contributed by dsb

The key is to get the whole file into one scalar (the first 'while' loop). Then the 'g' modifier (the condition in the second 'while' loop) keeps the position of the last match and continues from there until no more matches are found.

open( FH, "filename" ) || die "couldn't open\n";
while ( <FH> ) {
    $data .= $_;
}
while ( $data =~ m/PATTERN/g ) {
    # executed code
    # executed code...etc.
}
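
To see the 'g' modifier keeping its place, here is a minimal sketch with a concrete pattern. The data string and the id= format are assumptions made up for the example; pos() is the built-in that reports where the last match on a string ended.

```perl
use strict;
use warnings;

# Hypothetical file contents, already accumulated into one scalar.
my $data = "id=17 name=foo\nid=42 name=bar\nid=99 name=baz\n";

my @ids;
my @ends;
while ( $data =~ m/id=(\d+)/g ) {
    # In scalar context, each pass resumes where the previous match ended.
    push @ids,  $1;
    push @ends, pos($data);
}

print "ids found: @ids\n";
print "match end offsets: @ends\n";
```

When the pattern finally fails, the position is reset, so the same loop can be run again over $data from the start.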
