PerlMonks  

Re^2: Match text from txt to html

by Anonymous Monk
on Sep 05, 2019 at 01:00 UTC (#11105643)


in reply to Re: Match text from txt to html
in thread Match text from txt to html

It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.

No. HTML::Parser is low level; it doesn't give you a tree. An HTML document is a tree (Document Object Model).

You can use XML::Twig, HTML::TreeBuilder::XPath, XML::LibXML ...

Or, as marto shows, Mojo::DOM.
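
To make the tree-based approach concrete, here is a minimal sketch using Mojo::DOM. The sample markup and the variable names are assumptions made up for illustration; they are not taken from the original question or from marto's post.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input; the real document is not shown in this thread.
my $html = '<div><p>Hello <b>world</b></p><p>Second</p></div>';

# Parsing builds the whole tree in memory.
my $dom = Mojo::DOM->new($html);

# Query the tree with a CSS selector and collect the text of each match,
# including text inside nested elements such as <b>.
my @paras = $dom->find('p')->map('all_text')->each;

print "$_\n" for @paras;
```

The selector query is what a tree buys you: once the document is parsed, any element can be reached regardless of where it sits in the source.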

Re^3: Match text from txt to html
by jcb (Chaplain) on Sep 05, 2019 at 01:40 UTC

    Just as a text file is both a set of lines and a stream of bytes, an HTML document is both a tree and a stream of elements. HTML::Parser extracts the latter, which is equivalent to walking the DOM tree in some order. The advantage of using HTML::Parser for an application like this is the same as the advantage of processing a text file line-by-line without reading the whole file into memory.

    While it is unlikely that an HTML document would not fit into memory on a client, our questioner could be building something that runs on a server, with one instance of the program per concurrent client connection. The aggregate memory use can quickly become very large if many clients are active. In this case, building the entire tree in memory is unnecessary because the transformation to be applied is very simple: find and mark occurrences of certain text in a finite sliding window. If this is running on a server, building the DOM tree in memory is both wasteful and foolish, creating an opportunity for easy DoS attacks.

    Put simply, if you do not actually need the DOM tree, do not waste time and memory building it!
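
A rough sketch of that streaming approach, using the HTML::Parser version 3 handler API. The helper name make_marker, the choice of <mark> as the wrapper, and the sample chunks are all assumptions for illustration. It only matches a phrase that lies within a single run of text, and it does not skip the contents of <script> or <style>, so treat it as a starting point rather than a finished filter.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a parser that copies HTML through to $emit,
# wrapping each occurrence of $phrase inside a text run in <mark>...</mark>.
sub make_marker {
    my ($phrase, $emit) = @_;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # "text" is the original source text, so entities stay encoded.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s{(\Q$phrase\E)}{<mark>$1</mark>}g;
            $emit->($text);
        }, 'text' ],
        # Tags, comments and declarations are copied through unchanged.
        default_h => [ sub { $emit->($_[0]) }, 'text' ],
    );
    # Deliver each run of text as one event, even when it spans input chunks.
    $parser->unbroken_text(1);
    return $parser;
}

my $out = '';
my $p = make_marker('foo', sub { $out .= $_[0] });

# Feed the document in pieces, as it might arrive from a socket or a file.
$p->parse($_) for '<p>fo', 'o bar <b>foo</b>', '</p>';
$p->eof;

print "$out\n";
```

Only the current run of text is buffered at any time; no tree is built, and the output callback could write straight to the client instead of appending to a string.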

      Ever used XML::Twig or XML::LibXML? Ever heard of them? They both give you all the DOM goodness in steaming mode, perlmonks is full of examples

        HTML is not XML, and you cannot parse HTML with an XML parser.

        You might also want to make an account, so you can edit your posts and fix your typos, like "steaming mode" in the post above.
