
Re: Munging Rendered HTML While Preserving Formatting

by ViceRaid (Chaplain)
on Jun 28, 2004 at 16:55 UTC ( [id://370248]=note )


in reply to Munging Rendered HTML While Preserving Formatting

HTML::TokeParser::Simple makes this task very simple:

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(*DATA);

while ( my $token = $p->get_token() ) {
    # Only text tokens are edited; tags, attributes and comments pass through.
    if ( $token->is_text() ) {
        $token->[1] =~ s/2004/2006/;
    }
    print $token->as_is;
}

__DATA__
<html>
<head>
</head>
<body>
<h1 id="2004">Euro 2004 : The English were robbed</h1>
<p>We <strong>will</strong> have revenge in the 2006 World Cup!</p>
<!-- Last edited in 2004 -->
</body>
</html>

The 2004 occurring as an HTML attribute and the 2004 in the comment remain unchanged.

Cheers
ViceRaid

Used H:T:Simple's nice ->is_text() method

Re^2: Munging Rendered HTML While Preserving Formatting
by Limbic~Region (Chancellor) on Jun 28, 2004 at 17:44 UTC
    ViceRaid,
    "HTML::TokeParser::Simple makes this task very simple:"

    Not really, but I guess it is my fault for not being clear. If you look, this is the same module that I had mentioned that doesn't meet all the requirements.

    Try changing hello to goodbye in the rendered HTML below:
    <html> <head></head> <body>h<i>e</i>ll<b>o</b></body> </html>
    As idsfa points out, this isn't an easy problem, given that the replacement text may not be the same length as the original text. This makes the problem even more interesting to me - I don't do HTML data munging if at all humanly possible - what are other people doing?

    Cheers - L~R

      I had to do something a bit like this. I worked at a typesetting company where the typesetters used xml-like tagging. It wasn't xml, because it had no requirement to be balanced, well-formed or anything. They had a long book, marked up like that.

      I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text - many corrections and additions had been made to it. My task was to try and insert the index tags into the correct place in the xml-like text.

      So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the book text, and, if found, added the index tag into the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.
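The strip-and-remember-positions idea can be sketched in core Perl. This is only an illustration, not the code used on that job: it strips tags only (not spaces or punctuation), it assumes simple markup where no `>` appears inside an attribute value, and the names `strip_with_positions` and `locate_in_markup` are made up for the example.

```perl
use strict;
use warnings;

# Strip tags from $markup, remembering where each surviving character
# came from. Returns the plain text and an array ref that maps each
# plain-text index to its offset in the original markup.
sub strip_with_positions {
    my ($markup) = @_;
    my $plain = '';
    my @offset_of;
    while ( $markup =~ /\G(?:(<[^>]*>)|([^<]+))/gc ) {
        next if defined $1;                      # a tag: nothing is kept
        my $text  = $2;
        my $start = pos($markup) - length($text);   # where this run began
        $plain .= $text;
        push @offset_of, $start .. $start + length($text) - 1;
    }
    return ( $plain, \@offset_of );
}

# Find $needle in the tag-free text and report the span it occupies in
# the original markup as (start_offset, end_offset); empty list if absent.
sub locate_in_markup {
    my ( $markup, $needle ) = @_;
    return if $needle eq '';
    my ( $plain, $offset_of ) = strip_with_positions($markup);
    my $at = index( $plain, $needle );
    return if $at < 0;
    return ( $offset_of->[$at], $offset_of->[ $at + length($needle) - 1 ] );
}

my $markup = 'h<i>e</i>ll<b>o</b>';
my ( $from, $to ) = locate_in_markup( $markup, 'hello' );
print "match spans offsets $from..$to of the original\n";
```

Because the offsets refer to the untouched original, several insertions can be applied from the last offset to the first without invalidating the earlier ones, which is the reason for reassembling from the back.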

      But with HTML that can be parsed, it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent:

      <i>Apple</i> Juice <b>Apple </b>Juice

      it's unlikely that this would be:

      <h1>Apple</h1> <h1>Juice</h1>

      So I think I'd only do substitutions within one "semantic" tag.

      If the strings are variable length, you've got to talk with the client about what to do about formatting tags. I think in real-world situations it's not likely to be a problem. You'd probably get a spec like s/bug/issue/g, and you'd only want to match whole words. Or you'd get a paragraph to replace: s/I have no comment/I refer you to <a href="s@e.com">my solicitor</a>/g. In that case, you may want to match "I have <b>no comment</b>", but you would still use the replacement string intact.

      qq

      There are problems if the replacement text is longer, shorter, or the same size. If the text is longer, where do you put the extra? If the text is shorter, where do you remove the characters? If the text is the same length, do you break it in the same way?
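To make those three cases concrete, here is a sketch of one possible policy, chosen for illustration rather than because it is the right answer: keep every tag where it is, refill the text runs from left to right with as many characters as they held before, and let the last run absorb any surplus (or end up short or empty). The function name `respread` is hypothetical, and the fragment passed in is assumed to contain exactly the text being replaced once tags are ignored.

```perl
use strict;
use warnings;

# Replace all the text in $markup with $new while leaving tags in place.
# Each text run receives as many characters as it originally held; the
# final run takes whatever is left over, however long or short.
sub respread {
    my ( $markup, $new ) = @_;

    # The capturing group makes split return the tags as well as the text.
    my @pieces = split /(<[^>]*>)/, $markup;

    # Indices of the non-empty pieces that are text rather than tags.
    my @text_idx =
      grep { $pieces[$_] !~ /^</ && length( $pieces[$_] ) } 0 .. $#pieces;
    return $markup unless @text_idx;

    my $last = $text_idx[-1];
    for my $i (@text_idx) {
        my $take = $i == $last ? length($new) : length( $pieces[$i] );
        # Four-argument substr removes the characters it returns from $new.
        $pieces[$i] = substr( $new, 0, $take, '' );
    }
    return join '', @pieces;
}

print respread( 'h<i>e</i>ll<b>o</b>', 'goodbye' ), "\n";   # longer text
print respread( 'h<i>e</i>ll<b>o</b>', 'hi' ),      "\n";   # shorter text
```

With this policy the longer replacement comes out as `g<i>o</i>od<b>bye</b>` and the shorter one as `h<i>i</i><b></b>`, which shows both questions from the post: the extra characters land in whichever run you pick, and a short replacement can leave empty elements behind.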

      This is really only a problem when doing replacements with sentences instead of words. It is pretty unlikely that a word will be split in non-pathological cases. It can be argued that a tag is equivalent to a word break. The problem is actually pretty similar to doing munging across line breaks.

      The only sane approach is to do replacements on individual text blocks. It might be possible to do replacements across multiple words, either by using something like XSLT that works on the tree, or by writing a regexp that treats whitespace and elements as word separators. For XML, this would not be too hard. The other hard part is maintaining the tags when doing the substitution.
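The regexp route might look roughly like the sketch below, written in core Perl with regexes instead of a real parser. It is a sketch under assumptions: tags are simple (`<...>` with no `>` inside attribute values), phrases start and end with word characters, and the helper names `phrase_pattern` and `replace_phrase` are invented for the example. To keep the markup balanced, any tag swallowed by a match is re-emitted right after the replacement text; that is just one way to handle the "maintaining the tags" part.

```perl
use strict;
use warnings;

# Build a pattern matching $phrase as whole words, where the gap between
# two words may be any mix of whitespace and tags.
sub phrase_pattern {
    my ($phrase) = @_;
    my $sep  = qr/(?:\s|<[^>]*>)+/;
    my $body = join $sep, map { quotemeta $_ } split ' ', $phrase;
    return qr/\b$body\b/;
}

# Substitute $replacement for every occurrence of $phrase in $html.
# The replacement is used intact; tags that sat inside the matched span
# are appended after it so that open/close pairs stay balanced.
sub replace_phrase {
    my ( $html, $phrase, $replacement ) = @_;
    my $re = phrase_pattern($phrase);
    $html =~ s{($re)}{
        my $hit  = $1;
        my @tags = $hit =~ /(<[^>]*>)/g;
        $replacement . join( '', @tags );
    }ge;
    return $html;
}

print replace_phrase(
    '<p>I have <b>no comment</b> at this time.</p>',
    'I have no comment',
    'I refer you to my solicitor'
), "\n";
print replace_phrase( 'a debug bug', 'bug', 'issue' ), "\n";
```

Note that this treats a tag as a word break, so it will not find `hello` in `h<i>e</i>ll<b>o</b>`, where the tags fall inside a word.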
