http://www.perlmonks.org?node_id=186786

perl_virgin has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am fetching an HTML page into a scalar variable. The page contains a list of links to various items, each with a unique id. I would like to extract all of those unique ids from the scalar. Any suggestions? Thanks.

Originally posted as a Categorized Question.


Replies are listed 'Best First'.
Re: How do I remove a specific keyword from a HTML page
by kvale (Monsignor) on Aug 01, 2002 at 16:24 UTC
    When trying to extract useful bits from HTML, a good general strategy is to use HTML::Parser to decompose HTML into its constituent elements and extract the parts you want with event handlers.
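
    A minimal sketch of that strategy might look like the following. The original question does not say how the ids are encoded, so this assumes they appear as a numeric id= query parameter in each link's href; the sample HTML and its URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample page; in practice $html would hold the fetched document.
my $html = q{<ul>}
         . q{<li><a href="/item?id=101">First</a></li>}
         . q{<li><a href="/item?id=202&amp;view=full">Second</a></li>}
         . q{<li><a href="/about">About</a></li>}
         . q{</ul>};

my @ids;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag with the lowercased tag name and a hash
    # of attributes (entities in attribute values are already decoded).
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            # Assumption: the id is a numeric "id" query parameter.
            push @ids, $1 if $attr->{href} =~ /[?&;]id=(\d+)/;
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @ids;
```

    If the ids live somewhere else (in the link text, say, or in a path segment), only the regular expression inside the handler needs to change.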

    -Mark

Re: How do I remove a specific keyword from a HTML page
by Anonymous Monk on Aug 14, 2004 at 21:18 UTC
    Regarding the original question, one useful tool you may want to look into is the lynx text browser, used with its -dump option. Combined with Perl regexps, there's not much you can't do with the text and links on the page.
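
    As a rough sketch of that idea: lynx -dump prints the rendered text followed by a numbered "References" list of the page's links, which is easy to match line by line. The sample dump below is made up, and the id= query parameter is an assumption about how the ids are encoded.

```perl
use strict;
use warnings;

# In practice the dump would come from the lynx binary, for example:
#   my $dump = `lynx -dump http://www.example.com/items.html`;
# A made-up sample of the link list is used here so the sketch runs
# without lynx or network access.
my $dump = join "\n",
    'References',
    '',
    '   1. http://www.example.com/item?id=101',
    '   2. http://www.example.com/item?id=202&view=full',
    '   3. http://www.example.com/about',
    '';

# Each reference line looks like "   N. URL"; capture the numeric id
# parameter where one is present (assumed format).
my @ids = $dump =~ /^\s*\d+\.\s+\S*[?&]id=(\d+)/mg;

print "$_\n" for @ids;
```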
Re: How do I remove a specific keyword from a HTML page
by Foggy Bottoms (Monk) on Jul 10, 2003 at 15:38 UTC
         Hi kvale, you said that a good general strategy is to use HTML::Parser to decompose HTML into its constituent elements and extract the parts you want with event handlers.
         Even though this seems like a good way to handle HTML and retrieve data, I'm not convinced it is sufficient or efficient. I have been trying to extract useful information from a web page. By useful information I mean, for example, retrieving only the article itself when reading it on a newspaper website. To do that you need to find the beginning and the end of the article's body. However, the article itself can contain several HTML tags, and I'm afraid your method would simply split the article apart and turn it into nonsense.
         I haven't found any better way than to look at the HTML source itself and find out whether special tags are used. Newspaper webmasters sometimes use HTML comments as hidden markers (<!-- article start-->), but then I need to come up with a template for each newspaper website I analyze.
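
         The marker-and-template idea can be sketched as below. The site name, the marker patterns, and the sample page are all hypothetical. The sketch uses HTML::Parser only to notice the marker comments and to collect text between them, so inline tags inside the article do not break it apart.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical per-site templates: which comment opens and closes the article.
my %template = (
    'example-news' => {
        start => qr/article\s+start/i,
        end   => qr/article\s+end/i,
    },
);

# Made-up page; note the inline <b> tag inside the article body.
my $html = q{<div>Menu</div><!-- article start--><p>First <b>bold</b> }
         . q{paragraph.</p><p>Second paragraph.</p><!-- article end-->}
         . q{<div>Footer</div>};

my $site    = $template{'example-news'};
my $inside  = 0;     # true while between the two marker comments
my $article = '';

my $parser = HTML::Parser->new(
    api_version => 3,
    # Comments toggle the "inside the article" flag.
    comment_h => [ sub {
        my ($text) = @_;
        if    ($text =~ $site->{start}) { $inside = 1 }
        elsif ($text =~ $site->{end})   { $inside = 0 }
    }, 'text' ],
    # Text is kept only while inside; tags themselves are simply skipped.
    text_h => [ sub { $article .= $_[0] if $inside }, 'dtext' ],
    # A closing </p> inside the article becomes a line break.
    end_h  => [ sub { $article .= "\n" if $inside && $_[0] eq 'p' }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print $article;
```

         A new site would then need only a new entry in %template, although sites without such marker comments would still require some other way to locate the article.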
         Do you have any other ideas? I'd greatly appreciate your comments on this.