Re^2: I match a pattern in regex, yet I don't get the group I wanted to extract for some reason

Note that parsing HTML/XML using regexen is generally a really bad idea.

The reason that it often works (for some definition of "works") is that few dynamic sites actually build and serialize a DOM tree, instead simply inserting details into (textual) templates. Regexen can match the parts of the output that come from the template, thereby selecting the insertions and extracting the desired information.
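As a rough illustration of the "template island" idea, here is a minimal sketch in Perl. The HTML fragment, the class names and the item data are all made up for the example; a real page would need its own pattern, worked out from its source.

```perl
use strict;
use warnings;

# Hypothetical page fragment: the markup comes from a fixed template,
# only the name and the price are inserted per item.
my $html = <<'HTML';
<div class="item"><span class="name">Widget</span> <span class="price">$4.50</span></div>
<div class="item"><span class="name">Gadget</span> <span class="price">$12.00</span></div>
HTML

# The literal parts of the pattern are the template "islands";
# the capture groups pick up whatever was inserted between them.
my @items;
while ( $html =~ m{<span class="name">([^<]*)</span>\s*<span class="price">\$([0-9.]+)</span>}g ) {
    push @items, { name => $1, price => $2 };
}

printf "%s costs %s\n", $_->{name}, $_->{price} for @items;
```

Note that the pattern never tries to understand the nesting of the tags. It only anchors on text that the template is assumed to emit verbatim.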

The resulting parsers tend to be somewhat fragile, as any change to the template can invalidate the "islands" on which the regex-based scraper relies, but they can be suitable for tools that are needed quickly and for the short term, or where the inconvenience of adapting the tool when the site changes is acceptable. The upside is that regex-based parsers are relatively easy to write by inspecting the HTML page source, without requiring knowledge of DOM structure and handling, giving them a lower "barrier to entry" for programmers unfamiliar with SGML/XML/DOM concepts.
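The fragility is easy to demonstrate. In the sketch below, `extract_title` and both page snippets are hypothetical; the point is only that a small template change makes the match fail, so a scraper of this kind should check for that instead of trusting the capture.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured title, or undef if the island is gone.
sub extract_title {
    my ($html) = @_;
    # A list-context match yields an empty list on failure, leaving $title undef.
    my ($title) = $html =~ m{<h1 class="title">([^<]*)</h1>};
    return $title;
}

my $old = '<h1 class="title">Release notes</h1>';
my $new = '<h1 class="title headline">Release notes</h1>';   # template gained a class

for my $page ( $old, $new ) {
    my $title = extract_title($page);
    print defined $title
        ? "title: $title\n"
        : "no match, template may have changed\n";
}
```

Assigning the match in list context avoids reading a stale `$1` left over from an earlier successful match, which is a common source of confusing results.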

	PerlMonks