PerlMonks  

Answer: How do I remove a specific keyword from a HTML page

( #273019=categorized answer )

contributed by Foggy Bottoms

     Hi kvale, you said that a good general strategy is to use HTML::Parser to decompose HTML into its constituent elements and extract the parts you want with event handlers.
     Even though this seems like a good way to handle HTML and retrieve data, I'm not convinced it is sufficient or efficient for my case: I want to extract the useful information from a web page. By useful information I mean, for example, retrieving only the article when you are reading one on a newspaper website. To do that you need to find the beginning and the end of the article's body. However, the article itself can contain several HTML tags, and I'm afraid your method would simply split the article apart, turning it into nonsense.
     I haven't found any better way than looking at the HTML code itself to find out whether special tags are used. Newspaper webmasters sometimes use hidden HTML tags, i.e. comments such as <!-- article start-->, but then I need to come up with a template for each newspaper website I analyze.
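
     To make the marker idea concrete, here is a rough sketch of what I have in mind. It is written with Python's standard html.parser rather than HTML::Parser, purely as a stand-in for the same event-handler style (HTML::Parser also has text and comment events). The TEMPLATES table, the site name "example-news" and the "article end" marker are all made up for illustration; real sites would each need their own entry. The text handler only appends to a buffer while between the two markers, so inline tags inside the article do not break it apart.

```python
from html.parser import HTMLParser

# Hypothetical per-site templates: (start marker, end marker) comment texts.
# The site name and the "article end" marker are assumptions for illustration.
TEMPLATES = {
    "example-news": ("article start", "article end"),
}


class ArticleExtractor(HTMLParser):
    """Collect text found between two marker comments, ignoring inline tags."""

    def __init__(self, start_marker, end_marker):
        super().__init__()
        self.start_marker = start_marker
        self.end_marker = end_marker
        self.inside = False   # True while between the two markers
        self.chunks = []      # text fragments seen inside the article

    def handle_comment(self, data):
        # Comment text arrives without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == self.start_marker:
            self.inside = True
        elif marker == self.end_marker:
            self.inside = False

    def handle_data(self, data):
        # Called for every run of text; tags such as <b> or <a> just
        # separate the runs, they do not stop the collection.
        if self.inside:
            self.chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())


def extract_article(html, site):
    start, end = TEMPLATES[site]
    parser = ArticleExtractor(start, end)
    parser.feed(html)
    parser.close()
    return parser.text()


page = """<html><body><div>Menu</div>
<!-- article start-->
<p>The <b>mayor</b> opened the <a href="/bridge">new bridge</a> today.</p>
<!-- article end -->
<div>Footer</div></body></html>"""

print(extract_article(page, "example-news"))
```

     This still leaves me with one template per newspaper, which is exactly the part I would like to avoid.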
     Do you have any other ideas? I'd greatly appreciate your comments on this.
