PerlMonks  

Apply regex to entire file, not just individual lines ?

Contributed by Anonymous Monk on May 24, 2000 at 13:59 UTC


Description:

I'm trying to extract a specific block of recurring text from a daily-updated Web page and write the result to a local file. I'm happy with my HTML retrieval, but applying regexes on a line-by-line basis requires way too much tweaking on my part. How can I substitute across multiple lines, preferably across the entire file?

Answer: Apply regex to entire file, not just individual lines ?
contributed by nuance

You can read the entire file into a scalar variable like this:

{
    local $/ = undef;    # no input record separator: <FILE> returns the whole file
    open(FILE, $filename) or die "Can't open $filename: $!\n";
    $lines = <FILE>;
    close(FILE);
}
Then you can just use your normal regular expression, but you'll probably want to use at least one of the following modifiers (from the perlre manpage):

m

Treat string as multiple lines. That is, change ``^'' and ``$'' from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

s

Treat string as single line. That is, change ``.'' to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force ``^'' to match only at the beginning of the string and ``$'' to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the ``.'' match any character whatsoever, while yet allowing ``^'' and ``$'' to match, respectively, just after and just before newlines within the string.
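
To make the difference between the two modifiers concrete, here is a minimal sketch. The sample text and the <block> markers are made up for illustration; in practice $text would be the slurped file.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for a slurped file.
my $text = "start\n<block>\nline one\nline two\n</block>\nend\n";

# Without /s, "." stops at newlines, so this cannot span the block.
my $without_s = $text =~ /<block>(.*)<\/block>/ ? 1 : 0;

# With /s, "." also matches newlines, so the capture spans several lines.
my ($body) = $text =~ /<block>\n(.*?)<\/block>/s;

# With /m, "^" matches at the start of every line, not just the string.
my @line_starts = $text =~ /^(\w+)/mg;

print "without /s matched: $without_s\n";
print "body: $body";
print "line starts: @line_starts\n";
```

The non-greedy (.*?) matters once /s is on: a greedy (.*) would run to the last closing marker in the file rather than the nearest one.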

Answer: Apply regex to entire file, not just individual lines ?
contributed by juahonen

After you've opened and read the file (or web page) into an array, join all lines with join().

open(FILE, $filename) or die "Can't open $filename: $!\n";
@lines = <FILE>;
close(FILE);

$content = join('', @lines);
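After this, $content holds the whole file as one string (the newlines are still in it), so you can apply your existing regexes to it directly. Add the /s modifier if ``.'' needs to match across newlines.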
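
Since the question asks about substituting across lines, here is a rough sketch of what that could look like once everything is in one string. The HTML snippet and the comment-stripping pattern are assumptions chosen only to show a match that spans several lines.

```perl
use strict;
use warnings;

# Hypothetical page content; in practice this would be the joined $content.
my $content = "<p>keep</p>\n<!-- ad\nbanner\n-->\n<p>also keep</p>\n";

# /s lets "." cross newlines, /g repeats the substitution,
# and ".*?" stops at the nearest "-->" instead of the last one.
(my $cleaned = $content) =~ s/<!--.*?-->\n?//sg;

print $cleaned;
```

Copying into $cleaned first leaves the original $content untouched, which is handy while the pattern is still being tuned.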
Answer: Apply regex to entire file, not just individual lines ?
contributed by vxp

You might not want to have your WHOLE file in one variable. Depending on the size of the file, it could eat a LOT of your memory. In my own experience, it is usually enough to set $/ = "\n\n" so that the record separator is two newlines (a blank line) instead of one. Note the double quotes: in single quotes, '\n\n' is a literal backslash-n-backslash-n, not two newlines. I was parsing a bounce file when I did this, which was about 300 MB in size, daily. That's a LONG 300 MB line. $/ = "\n\n"; took care of it. I ended up with smaller multi-line records, and was able to do what I wanted without consuming a lot of RAM.
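
A minimal sketch of that record-at-a-time approach follows. The record layout (From:/Status: lines separated by blank lines) is invented for illustration and is not the actual bounce file format; an in-memory filehandle is used so the example runs without a real file.

```perl
use strict;
use warnings;

# Hypothetical log with blank-line-separated records, read from an
# in-memory filehandle so the sketch runs without a real file.
my $data = "From: a\@example.com\nStatus: bounced\n\n"
         . "From: b\@example.com\nStatus: ok\n\n"
         . "From: c\@example.com\nStatus: bounced\n";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";

my @bounced;
{
    local $/ = "\n\n";   # double quotes, so \n is a real newline
    while (my $record = <$fh>) {
        chomp $record;   # strips a trailing "\n\n" separator if present
        # The regex sees one multi-line record at a time, not the whole file.
        if ($record =~ /^From: (\S+)\n.*^Status: bounced/ms) {
            push @bounced, $1;
        }
    }
}
close($fh);

print "bounced: @bounced\n";
```

Only one record is held in memory at a time, so this works as long as no single match needs to span a blank line.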

Answer: Apply regex to entire file, not just individual lines ?
contributed by dsb

The key is to get the whole file into one scalar (the first 'while' loop). Then the 'g' modifier (in the condition of the second 'while' loop) remembers where the last match ended and continues from there until no more matches are found.

open( FH, "filename" ) || die "couldn't open\n";
while ( <FH> ) {
    $data .= $_;
}
while ( $data =~ m/PATTERN/g ) {
    # executed code
    # executed code...etc.
}
-kel
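
As a small illustration of how the 'g' loop walks through the string, here is a sketch with a made-up pattern and data (the <item> markers are hypothetical). pos() reports where the next search will begin.

```perl
use strict;
use warnings;

# Hypothetical slurped data; the <item> markup is only for illustration.
my $data = "<item>\nalpha\n</item>\njunk\n<item>\nbeta\n</item>\n";

my @found;
# In scalar context, /g resumes after the previous match on each iteration.
while ($data =~ m/<item>\n(.*?)\n<\/item>/sg) {
    push @found, $1;
    # pos($data) is the offset where the next search will start.
    printf "found '%s', next search starts at offset %d\n", $1, pos($data);
}

print "total matches: ", scalar(@found), "\n";
```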
