http://www.perlmonks.org?node_id=1023593


in reply to Re: How to grab a portion of file with regex
in thread How to grab a portion of file with regex

Instead, as swkronenfeld pointed out, it's better to use the CPAN module HTML::Parser

Not by much. HTML::Parser is very low-level; use a DOM parser that supports XPath instead.

Re^3: How to grab a portion of file with regex
by 7stud (Deacon) on Mar 15, 2013 at 03:37 UTC
    And for HTML files that are 9,000 GB in size?
      There are always limits to everything. I must remind you that I am not the one wanting to parse HTML; I am simply trying to offer guidance. I understand that HTML parsing is a hot topic. However, as a solution to the question asked, HTML::Parser works fine.

      And for HTML files that are 9,000 GB in size?

      Never mind that 9,000 GB HTML files don't exist; you can still use XML::Twig, naturally.

Re^3: How to grab a portion of file with regex
by kielstirling (Scribe) on Mar 15, 2013 at 02:46 UTC
    Well, instead of trolling, why not supply a working example to help?

    It's always the Anonymous Monk, lacking the courage to put a name to a comment.

      Well, instead of trolling, why not supply a working example to help? It's always the Anonymous Monk, lacking the courage to put a name to a comment.

      How is it trolling to point out the shortcomings of a "solution"? Maybe you should look up the definition of troll.

      What courage is required to point out a simple fact about HTML::Parser? Are you under the impression that HTML::Parser is a high-level parser?

      Your "solution" doesn't fetch the portion of the page from class = lastUnit to class = line margin10, so it's incomplete. It is a lot easier, shorter, and simpler to use m{\Q$start\E(.+?)\Q$end\E}i than that low-level HTML::Parser code.
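
A minimal sketch of that regex approach, using core Perl only. The markup and the exact values of $start and $end are assumptions made up for illustration, since the real page is not shown in this thread. The /s modifier is also an addition to the pattern quoted above; without it, . would not match the newlines between the two delimiters.

```perl
use strict;
use warnings;

# Hypothetical markup; the real page from the question is not shown here.
my $html = <<'HTML';
<div class="lastUnit">
  <p>wanted text</p>
</div>
<div class="line margin10">footer</div>
HTML

# Assumed delimiters, based on the class names mentioned above.
my $start = '<div class="lastUnit">';
my $end   = '<div class="line margin10">';

# \Q...\E quotes regex metacharacters inside the delimiters.
# .+? is non-greedy, so the capture stops at the first $end.
# /i ignores case; /s (added here) lets . match newlines.
my ($portion) = $html =~ m{\Q$start\E(.+?)\Q$end\E}is;

print defined $portion ? $portion : "no match\n";
```

This only works as long as the delimiters appear literally in the page; any change in attribute order or quoting breaks the match, which is the usual argument for a real parser.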

      Have you seen Re: How to grab a portion of file with regex (don't)? It's much like at least three different tutorials/walkthroughs/step-by-step instructions on extracting from the DOM with XPath; some even compare and contrast with HTML::Parser.


        You make some valid points. The example in the question didn't seem to need the content of the div.
        I do agree that working with the DOM is a much better way to parse HTML.
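
For comparison, a rough sketch of the DOM-plus-XPath route, assuming the CPAN module HTML::TreeBuilder::XPath is installed (it is not part of core Perl, and nobody in this thread named it specifically). The markup is again hypothetical; only the class name lastUnit comes from the discussion above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, assumed to be installed

# Hypothetical markup; the real page from the question is not shown here.
my $html = <<'HTML';
<html><body>
<div class="lastUnit"><p>wanted text</p></div>
<div class="line margin10">footer</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Select every div whose class attribute is exactly "lastUnit"
# and collect the text content of each one.
my @texts = map { $_->as_text } $tree->findnodes('//div[@class="lastUnit"]');

print "$_\n" for @texts;

$tree->delete;   # release the parsed tree when done
```

The XPath expression addresses the element by its attribute, so it keeps working if whitespace or attribute quoting in the page changes.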