
Ironically, the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.

Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well, until Perl 6, that is...) Parsing is the task of taking structured information and analyzing its structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text it is going to give much more robust solutions. For instance, you avoid different kinds of data being mistaken for each other.
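To make "different kinds of data being mistaken for each other" concrete, here is a minimal sketch. It is in Python rather than Perl only so that it runs with nothing but standard-library modules; the sample markup and the names `links_by_regex`, `LinkCollector` and `links_by_parser` are made up for illustration. The regex finds anything that looks like an `href`, including one inside a comment and one that is just text, while the parser only reports attributes of real `<a>` tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one real link, one link inside an HTML comment,
# and one piece of ordinary text that merely looks like an attribute.
HTML = '''<p>See <a href="/real">the page</a>.</p>
<!-- old link: <a href="/stale">gone</a> -->
<p>Write href="/not-a-link" in your markup.</p>'''


def links_by_regex(html):
    """Pattern hunt: grab every href="..." anywhere in the text."""
    return re.findall(r'href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collect href values only from actual <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Comments and plain text never reach this handler.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_by_regex(HTML))   # three matches, two of them bogus
    print(links_by_parser(HTML))  # only the real link
```

The regex could of course be patched to skip comments, but each such patch is a small piece of parsing done by hand, which is rather the point.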

The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize it). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be done that way, rather than with regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.

However, the two kinds of problems, while different, do overlap. Where they overlap, it isn't necessarily obvious which approach is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...

In reply to Re: Re: Scraping HTML: orthodoxy and reality by tilly
in thread Scraping HTML: orthodoxy and reality by grinder

