Re: Munging Rendered HTML While Preserving Formatting

Re^2: Munging Rendered HTML While Preserving Formatting by Limbic~Region (Chancellor) on Jun 28, 2004 at 17:44 UTC
ViceRaid, "HTML::TokeParser::Simple makes this task very simple:" Not really, but I guess it is my fault for not being clear. If you look, this is the same module I had mentioned as not meeting all the requirements. Try changing "hello" to "goodbye" in the rendered HTML below: `<html> <head></head> <body>h<i>e</i>ll<b>o</b></body> </html>`
As idsfa points out, this isn't an easy problem, given that the replacement text may not be the same length as the original text. This makes the problem even more interesting to me. I don't do HTML data munging if at all humanly possible, so what are other people doing?
Cheers - L~R
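To make the difficulty concrete, here is a minimal Perl sketch that is not from the thread. The naive `split` regex is only a stand-in for a real tokenizer and is assumed to be good enough for this one sample string. It shows that no single text token contains "hello", even though the rendered text does.

```perl
use strict;
use warnings;

my $html = '<html> <head></head> <body>h<i>e</i>ll<b>o</b></body> </html>';

# Split into tags and text runs. This regex is a deliberately naive
# stand-in for a real tokenizer (an assumption for illustration only).
my @tokens = grep { length } split /(<[^>]*>)/, $html;

# Keep only the non-blank text runs, dropping the tags.
my @text = grep { !/^</ && /\S/ } @tokens;

# Count the text tokens that contain the word on their own.
my $hits_per_token = grep { /hello/ } @text;

# Join the text runs to approximate what the browser renders.
my $rendered = join '', @text;

print "text tokens: ", join('|', @text), "\n";   # text tokens: h|e|ll|o
print "tokens containing 'hello': $hits_per_token\n";   # 0
print "rendered text: $rendered\n";              # rendered text: hello
```

A substitution applied to each text token in turn therefore never fires on this input, which is the requirement a token-by-token approach misses.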
Re^3: Munging Rendered HTML While Preserving Formatting by qq (Hermit) on Jun 28, 2004 at 21:08 UTC
I had to do something a bit like this. I worked at a typesetting company where the typesetters used XML-like tagging. It wasn't XML, because it had no requirement to be balanced, well-formed or anything. They had a long book marked up like that. I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text: many corrections and additions had been made to it.
My task was to try to insert the index tags into the correct places in the XML-like text. So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the book text and, if a match was found, added the index tag to the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.
But with HTML that can be parsed it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent: `<i>Apple</i> Juice` and `<b>Apple </b>Juice`, it's unlikely that this would be: `<h1>Apple</h1> <h1>Juice</h1>`. So I think I'd only do substitutions within one "semantic" tag. If the strings are variable length, you've got to talk with the client about what to do about formatting tags.
I think in real-world situations it's not likely to be a problem. You'd probably get a spec like `s/bug/issue/g`, and you'd only want to match whole words. Or you'd get a paragraph to replace: `s/I have no comment/I refer you to <a href="s@e.com">my solicitor</a>/g`. In that case, you may want to match `"I have <b>no comment</b>"`, but you would still use the replacement string intact.
qq
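The strip, match, and reinsert procedure qq describes could be sketched roughly as follows. This is a simplified illustration, not qq's actual code: `insert_index_tag` is a hypothetical helper, it matches the whole phrase rather than a 100-character prefix, it handles one index tag at a time, and it uses a naive `<[^>]*>` pattern to recognise tags.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the first occurrence of $phrase in $open/$close,
# matching against text stripped of tags, spaces and punctuation.
sub insert_index_tag {
    my ($markup, $phrase, $open, $close) = @_;

    # Build the stripped text and record, for every kept character,
    # its offset in the original markup.
    my ($plain, @pos) = ('');
    while ($markup =~ /\G(?:(<[^>]*>)|(.))/gs) {
        next if defined $1;                  # skip whole tags
        my $ch = $2;
        next if $ch =~ /[\s[:punct:]]/;      # skip spaces and punctuation
        $plain .= $ch;
        push @pos, pos($markup) - 1;         # offset of this character
    }

    # Strip the phrase the same way before searching.
    (my $needle = $phrase) =~ s/[\s[:punct:]]//g;
    return $markup unless length $needle;
    my $at = index $plain, $needle;
    return $markup if $at < 0;               # no match: leave untouched

    my $start = $pos[$at];
    my $end   = $pos[$at + length($needle) - 1] + 1;

    # Insert the closing tag first (working from the back) so that
    # the earlier offset is still valid for the opening tag.
    substr($markup, $end,   0, $close);
    substr($markup, $start, 0, $open);
    return $markup;
}

print insert_index_tag('The <i>quick</i>, brown fox', 'quick brown',
                       '<index1235>', '</index1235>'), "\n";
# prints: The <i><index1235>quick</i>, brown</index1235> fox
```

Note that the inserted tags overlap the existing `<i>` element. That was acceptable for the unbalanced typesetting format described above, but for HTML it is exactly why restricting substitutions to one "semantic" tag is attractive.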
Re^3: Munging Rendered HTML While Preserving Formatting by iburrell (Chaplain) on Jun 28, 2004 at 19:54 UTC
There are problems whether the replacement text is longer, shorter, or the same size. If the text is longer, where do you put the extra? If the text is shorter, where do you remove the characters? If the text is the same length, do you break it in the same way?
This is really only a problem when doing replacements with sentences instead of words. It is pretty unlikely that a word will be split in non-pathological cases. It can be argued that a tag is equivalent to a word break. The problem is actually pretty similar to doing munging across line breaks. The only sane approach is to do replacements on individual text blocks.
It might be possible to do replacements on multiple words, either by using something like XSLT that works on the tree, or by writing a regexp that matches whitespace and elements as word separators. For XML, this would not be too hard. The other hard part is maintaining the tags when doing the substitution.
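The regexp idea, treating whitespace and elements alike as word separators, might look something like the sketch below. `replace_phrase` is a hypothetical helper, the `<[^>]*>` tag pattern is naive, and the words of the phrase are assumed to begin and end with word characters so that `\b` behaves as expected.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a multi-word phrase, allowing any mix of
# whitespace and tags between its words.
sub replace_phrase {
    my ($markup, $phrase, $replacement) = @_;

    # One separator is a run of whitespace characters and/or tags.
    my $sep   = qr/(?:\s|<[^>]*>)+/;
    my @words = map { quotemeta($_) } split ' ', $phrase;
    return $markup unless @words;

    # Words joined by the separator pattern, anchored at word boundaries.
    my $body = join $sep, @words;
    my $re   = qr/\b$body\b/;

    $markup =~ s/$re/$replacement/g;
    return $markup;
}

print replace_phrase('He said: I have <b>no comment</b>.',
                     'I have no comment',
                     'I refer you to my solicitor'), "\n";
# prints: He said: I refer you to my solicitor</b>.
```

The match succeeds across the `<b>` tag, but the output is left with a dangling `</b>`, which illustrates the remaining hard part: deciding what happens to tags that fall inside or straddle the matched span.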

