Re^2: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason

by jcb (Parson)
on Jan 13, 2021 at 00:42 UTC ( [id://11126824]=note )


in reply to Re: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason
in thread I match a pattern in regex, yet I don't get the group I wanted to extract for some reason

Note that parsing HTML/XML using regexen is generally a really bad idea.

The reason that it often works (for some definition of "works") is that few dynamic sites actually build and serialize a DOM tree, instead simply inserting details into (textual) templates. Regexen can match the parts of the output that come from the template, thereby selecting the insertions and extracting the desired information.
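As a rough illustration of the "template island" idea, here is a minimal sketch. The page fragment, the class names, and the `@items` structure are all made up for the example; a real site's markup will differ, and this is not meant as a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical page fragment, as a textual template might emit it.
my $html = <<'HTML';
<div class="item">
  <span class="name">Widget</span>
  <span class="price">$9.99</span>
</div>
<div class="item">
  <span class="name">Gadget</span>
  <span class="price">$24.50</span>
</div>
HTML

# The literal markup around each capture group is the part that comes
# from the template; the captures are the values inserted into it.
my @items;
while ($html =~ m{<span class="name">([^<]*)</span>\s*<span class="price">\$([0-9.]+)</span>}g) {
    push @items, { name => $1, price => $2 };
}

printf "%-8s %s\n", $_->{name}, $_->{price} for @items;
```

The regex never looks at the document structure; it only recognizes the fixed text on either side of each inserted value.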

The resulting parsers tend to be somewhat fragile, as any change to the template can invalidate the "islands" on which the regex-based scraper relies, but they can be suitable for tools that are needed quickly and for the short term, or where the inconvenience of adapting the tool when the site changes is acceptable. The upside is that regex-based parsers are relatively easy to write from inspecting the HTML page source, without requiring knowledge of DOM structure and handling, giving them a lower "barrier to entry" for programmers unfamiliar with SGML/XML/DOM concepts.
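To make the fragility concrete, the sketch below shows a small template change (an extra class on the element) causing the pattern to stop matching. The markup, the `$island` pattern, and the `extract_prices` helper are hypothetical; the one design choice it suggests is to fail loudly when nothing matches, rather than quietly returning an empty result.

```perl
use strict;
use warnings;

# Hypothetical pattern, written against the original template.
my $island = qr{<span class="price">\$([0-9.]+)</span>};

# Return every captured price, or die if the island is not found at all.
sub extract_prices {
    my ($html) = @_;
    my @prices = $html =~ /$island/g;
    die "no price island found; has the template changed?\n" unless @prices;
    return @prices;
}

my $old_page = '<span class="price">$9.99</span>';
my $new_page = '<span class="price sale">$7.99</span>';   # template changed

my @ok    = extract_prices($old_page);
my $error = eval { extract_prices($new_page); 1 } ? '' : $@;

print "old template: @ok\n";
print "new template: $error";
```

With the check in place, a template change shows up as an error the first time the tool runs against the new page, which is usually the cheapest point at which to notice it.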
