http://www.perlmonks.org?node_id=737169

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

My regex is failing to match the text I would like to extract from this HTML:
<div class="detailline_address"> 15480 W 66TH AVE Whiten, CO 86663 </div>
and the regex is:
if ($r =~ m#<div class="detailline_address">\s*?(.*?)\s*?</div>#) { p("Address: $1"); }
where subroutine p simply formats the print command.

Could someone give me a few pointers please?

Re: regex to extract text
by CountZero (Bishop) on Jan 18, 2009 at 17:31 UTC
    m/<div class="detailline_address">\s*(.*)<\/div>/s
    will do the trick.

    "dot" does not match a newline unless you add the s-option, which treats the data to be matched as a single string, so that \n loses its special status.
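
A minimal sketch of the difference the s-option makes. It assumes the address sits on its own line inside the div on the real page (the single-line sample in the question would match either way); the variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumed input: the div from the question, with newlines around the address.
my $html = qq{<div class="detailline_address">\n 15480 W 66TH AVE Whiten, CO 86663 \n</div>};

# Without /s, "." cannot cross the newline, so the pattern never reaches </div>.
my $without_s = $html =~ m{<div class="detailline_address">\s*(.*)</div>} ? 1 : 0;

# With /s, "." also matches newlines and the capture succeeds.
my ($address) = $html =~ m{<div class="detailline_address">\s*(.*)</div>}s;

print "matched without /s: $without_s\n";
print "captured with /s: [$address]\n";
```

Note that the capture still carries the trailing whitespace before </div>; it would need trimming before use.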

    But of course, one should not process HTML with a regexp but with a parser (such as HTML::Parser).

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thanks for the fix and the advice - I'll be straight onto the parser!!
        Note that CountZero's solution (based on your initial attempt, just adding the necessary "s" modifier) is doing a greedy match with '(.*)' -- this means that if there are two or more instances of '</div>' following the address section, the match will extend to the farthest one.

        Using '(.*?)' instead, to specify a non-greedy match, will do what you really want, though as pointed out already, you probably should be getting acquainted with proper HTML parsing. It takes a bit of learning to catch on, but in the long run a parsing module will lead you to quicker and better solutions than what can be done with regex matching.
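
To make the greedy versus non-greedy point concrete, here is a small sketch. The second div is a made-up fragment, added only so that there is more than one '</div>' after the address.

```perl
use strict;
use warnings;

# Hypothetical page fragment: the address div followed by an unrelated div.
my $html = <<'HTML';
<div class="detailline_address"> 15480 W 66TH AVE Whiten, CO 86663 </div>
<div class="other">unrelated</div>
HTML

# Greedy: (.*) runs on to the LAST </div> in the string.
my ($greedy) = $html =~ m{<div class="detailline_address">\s*(.*)</div>}s;

# Non-greedy: (.*?) stops at the FIRST </div>; the extra \s* keeps the
# trailing whitespace out of the capture.
my ($lazy) = $html =~ m{<div class="detailline_address">\s*(.*?)\s*</div>}s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture swallows the start of the second div as well, while the non-greedy one holds just the address.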

      A small point of style, but

      m{(?s)<div class="detailline_address">\s*(.*?)</div>}
      avoids endweight problems by pushing the modifier up front. There's also the case for always using (?msx) at the beginning of your regexes unless there's a damned good reason not to. In this case, the damned good reason not to is: "You're attempting to parse XML with a regular expression! Are you mad?"
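
A sketch of what the leading-modifier style could look like with (?msx), assuming input with newlines around the address. Under x, unescaped whitespace in the pattern is ignored, so the space between "div" and "class" has to be written as \s+ (or escaped); the m flag has no effect here because the pattern uses no ^ or $ anchors.

```perl
use strict;
use warnings;

# Assumed input: the div from the question, with newlines around the address.
my $html = qq{<div class="detailline_address">\n 15480 W 66TH AVE Whiten, CO 86663 \n</div>};

# Modifiers up front; x allows layout and comments inside the pattern.
my $re = qr{(?msx)
    <div \s+ class="detailline_address"> \s*
    (.*?)        # capture the address, as little as possible
    \s* </div>
};

my ($address) = $html =~ $re;
print "Address: $address\n";
```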

Re: regex to extract text
by AnomalousMonk (Archbishop) on Jan 18, 2009 at 18:27 UTC