regex to extract text

jonnyfolk has asked for the wisdom of the Perl Monks concerning the following question:

The following code is ignoring the text I would like to extract from html:

                     <div class="detailline_address">
                             
                                15480 W 66TH AVE
                             
                             Whiten, CO 86663
                     </div>

and the regex is:

if ($r =~ m#<div class="detailline_address">\s*?(.*?)\s*?</div>#) { p("Address: $1"); }

where subroutine p simply formats the print command.

Could someone give me a few pointers please?


Replies are listed 'Best First'.
Re: regex to extract text by CountZero (Bishop) on Jan 18, 2009 at 17:31 UTC
`m/<div class="detailline_address">\s*(.*)<\/div>/s` will do the trick. By default, "dot" does not match a newline; adding the s modifier treats the data to be matched as a single string, so `\n` loses its special status and "dot" matches it too. But of course, one should not deal with HTML through a regexp but by using a parser (such as HTML::Parser).
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
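To see the effect of the s modifier in isolation, here is a minimal sketch using the markup from the question (whitespace simplified). The variable names `$without_s`, `$address` and `$one_line` are made up for the illustration, and the final whitespace cleanup is an extra step that nobody in the thread asked for.

```perl
use strict;
use warnings;

# Sample input copied from the question (indentation simplified).
my $r = <<'HTML';
<div class="detailline_address">

    15480 W 66TH AVE

    Whiten, CO 86663
</div>
HTML

# Without /s, "." cannot cross a newline, so the pattern never reaches </div>.
my $without_s = $r =~ m{<div class="detailline_address">\s*(.*?)\s*</div>} ? 1 : 0;

# With /s, "." also matches "\n", so the capture can span both address lines.
my ($address) = $r =~ m{<div class="detailline_address">\s*(.*?)\s*</div>}s;

# Collapse internal runs of whitespace for display (assumption: one line is wanted).
(my $one_line = $address) =~ s/\s+/ /g;

print "matched without /s: $without_s\n";
print "Address: $one_line\n";
```

The pattern is the same in both matches; only the trailing s changes whether it can succeed on multi-line input.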
Re^2: regex to extract text by jonnyfolk (Vicar) on Jan 18, 2009 at 21:22 UTC
Thanks for the fix and the advice - I'll be straight onto the parser!!
Re^3: regex to extract text by graff (Chancellor) on Jan 19, 2009 at 07:46 UTC
Note that CountZero's solution (based on your initial attempt, just adding the necessary "s" modifier) is doing a greedy match with '(.*)' -- this means that if there are two or more instances of '`</div>`' following the address section, the match will extend to the farthest one. Using '(.*?)' instead, to specify a non-greedy match, will do what you really want, though as pointed out already, you probably should be getting acquainted with proper HTML parsing. It takes a bit of learning to catch on, but in the long run a parsing module will lead you to quicker and better solutions than what can be done with regex matching.
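The greedy versus non-greedy difference only shows up when a second closing tag follows, so this sketch appends a hypothetical second div (the `class="other"` block is invented for the demonstration and is not from the original page).

```perl
use strict;
use warnings;

# Hypothetical input: the address block followed by a second, unrelated div.
my $html = <<'HTML';
<div class="detailline_address">
    15480 W 66TH AVE
    Whiten, CO 86663
</div>
<div class="other">unrelated</div>
HTML

# Greedy: (.*) runs on to the LAST </div> in the string.
my ($greedy) = $html =~ m{<div class="detailline_address">\s*(.*)</div>}s;

# Non-greedy: (.*?) stops at the FIRST </div> after the opening tag.
my ($lazy) = $html =~ m{<div class="detailline_address">\s*(.*?)\s*</div>}s;

print "greedy capture swallowed the second div\n" if $greedy =~ /unrelated/;
print "lazy capture: [$lazy]\n";
```

Even the non-greedy form would be fooled by a div nested inside the address block, which is one more reason the parser advice stands.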
Re^2: regex to extract text by pdcawley (Hermit) on Jan 19, 2009 at 17:02 UTC
A small point of style, but `m{(?s)<div class="detailline_address">\s*(.*?)</div>}` avoids endweight problems by pushing the modifier up front. There's also the case for always using `(?msx)` at the beginning of your regexes unless there's a damned good reason not to. In this case, the damned good reason not to is: "You're attempting to parse XML with a regular expression! Are you mad?"
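A rough sketch of both forms, for comparison. One thing to watch when turning on x: literal whitespace in the pattern is ignored, so the space inside the tag has to be written as `\s+` (or escaped). The variable names and the trailing `\s*` that trims the capture are additions for the illustration.

```perl
use strict;
use warnings;

# Sample input based on the question (indentation simplified).
my $r = qq{<div class="detailline_address">\n  15480 W 66TH AVE\n  Whiten, CO 86663\n</div>\n};

# Inline (?s) at the front behaves like a trailing /s.
my ($inline) = $r =~ m{(?s)<div class="detailline_address">\s*(.*?)\s*</div>};

# With (?msx), whitespace and comments inside the pattern are ignored,
# so the space between "div" and "class" must be spelled \s+.
my ($extended) = $r =~ m{(?msx)
    <div \s+ class="detailline_address">   # opening tag
    \s* (.*?) \s*                          # address text, trimmed
    </div>                                 # closing tag
};

print "inline:   [$inline]\n";
print "extended: [$extended]\n";
```

The m flag is harmless here because the pattern has no `^` or `$` anchors; it is included only because the reply suggests `(?msx)` as a default.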
Re: regex to extract text by AnomalousMonk (Archbishop) on Jan 18, 2009 at 18:27 UTC
See perlre section Modifiers and in particular the s modifier.
