Answer: How do I remove a specific keyword from a HTML page

Category: regular expressions. Answer contributed by Foggy Bottoms

     Hi kvale, you said that a good general strategy is to use HTML::Parser to decompose the HTML into its constituent elements and extract the parts you want with event handlers.
     Even though this seems like a good way to handle HTML and retrieve data, I'm not convinced it's sufficient, or even efficient, for what I have in mind. I've been wanting to extract useful information from a web page. By useful information I mean this: when you're reading an article on a newspaper website, I want to retrieve the article only. To do that you need to find the beginning and the end of the article's body. However, the article itself can contain several HTML tags, and I'm afraid your method would simply split the article apart and turn it into nonsense.
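     To make the worry about splitting concrete, here is a minimal sketch of how an event-driven parser can keep state between events and stitch the text back together. It uses Python's standard `html.parser` module as a stand-in for HTML::Parser, not HTML::Parser itself. The container id `article` and the names `ArticleText` and `extract_article` are assumptions made up for illustration; a real newspaper page would need its own selector.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text inside <div id="..."> containers, ignoring inline tags."""

    def __init__(self, container_id="article"):
        super().__init__()
        self.container_id = container_id  # hypothetical id, site-specific
        self.depth = 0    # <div> nesting depth inside the container; 0 = outside
        self.chunks = []  # text fragments seen while inside the container

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # inline tags such as <b> or <a> do not change the state
        if self.depth:
            self.depth += 1  # nested <div> inside the container
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1   # entering the container

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1  # reaches 0 when the container itself closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_article(html, container_id="article"):
    parser = ArticleText(container_id)
    parser.feed(html)
    parser.close()
    # Join the fragments, then collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


sample = ('<div id="nav">Home</div>'
          '<div id="article"><p>The <b>quick</b> fox '
          '<a href="#">jumps</a>.</p></div>'
          '<div>Footer</div>')
print(extract_article(sample))  # prints: The quick fox jumps.
```

     The inline tags still fire their own events, but because the handler only tracks whether it is inside the container, the text fragments come back in order and can be rejoined.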
     I haven't found any better way than to look at the HTML source itself and find out whether special tags are used. Newspaper webmasters sometimes use HTML comments as hidden markers (<!-- article start-->), but then I need to come up with a template for each newspaper website I'm analyzing.
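     The comment-marker idea with one template per site could be sketched roughly as follows, again with Python's standard `html.parser` standing in for HTML::Parser. The site key `example-news`, the end marker `article end`, and the names `MARKERS`, `MarkedArticle` and `extract_marked` are hypothetical; only the `article start` marker comes from the example above.

```python
from html.parser import HTMLParser

# Hypothetical per-site templates: the comment text that opens and closes
# the article body. Each real site would need its own entry.
MARKERS = {
    "example-news": ("article start", "article end"),
}


class MarkedArticle(HTMLParser):
    """Collect text found between two marker comments."""

    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end
        self.inside = False  # True while between the two markers
        self.chunks = []

    def handle_comment(self, data):
        marker = data.strip()  # tolerate uneven spacing inside <!-- ... -->
        if marker == self.start:
            self.inside = True
        elif marker == self.end:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract_marked(html, site):
    start, end = MARKERS[site]
    parser = MarkedArticle(start, end)
    parser.feed(html)
    parser.close()
    # Join the fragments, then collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())
```

     Calling `extract_marked(page_source, "example-news")` would return the text between the two comments with the tags stripped. Adding a newspaper then means adding one entry to `MARKERS`, though the markers still have to be found by reading each site's source.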
     Do you have any other ideas? I'd greatly appreciate your comments on this.
