PerlMonks
Re^3: Munging Rendered HTML While Preserving Formatting
by qq (Hermit) on Jun 28, 2004 at 21:08 UTC ([id://370336])
I had to do something a bit like this. I worked at a typesetting company where the typesetters used xml-like tagging. It wasn't xml, because it had no requirement to be balanced, well formed or anything. They had a long book, marked up like that. I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text; many corrections and additions had since been made to it. My task was to try to insert the index tags into the correct places in the xml-like text.

So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the book text and, if a match was found, added the index tag to the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.

But with html that can be parsed it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent:
it's unlikely that this would be:
So I think I'd only do substitutions within one "semantic" tag. If the strings are variable length, you've got to talk with the client about what to do about formatting tags. I think in real-world situations it's not likely to be a problem. You'd probably get a spec like s/bug/issue/g, and you'd only want to match whole words. Or you'd get a paragraph to replace: s{I have no comment}{I refer you to <a href="s\@e.com">my solicitor</a>}g (braces as delimiters, since the replacement contains a slash, and the @ escaped so it doesn't interpolate). In that case, you may want to match "I have <b>no comment</b>", but you would still use the replacement string intact.
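
For the simple s/bug/issue/g kind of spec, here is a minimal sketch of a whole-word substitution that only touches the text between tags. The helper name replace_in_text is made up for this example, and splitting on /(<[^>]*>)/ is a simplifying assumption that treats anything between < and > as a tag. It doesn't try to match across formatting tags (the "I have <b>no comment</b>" case), and for real pages a proper parser would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: whole-word replacement in the text between tags,
# leaving the tags themselves (names, attributes) untouched.
sub replace_in_text {
    my ($html, $from, $to) = @_;

    # The capturing group makes split return the tags as well as the text.
    my @parts = split /(<[^>]*>)/, $html;

    for my $part (@parts) {
        next if $part =~ /\A<[^>]*>\z/;     # a tag: leave it alone
        $part =~ s/\b\Q$from\E\b/$to/g;     # text: whole words only
    }
    return join '', @parts;
}

print replace_in_text(
    '<p class="bug">The bug is in the <b>debugger</b>.</p>', 'bug', 'issue'
), "\n";
# prints: <p class="bug">The issue is in the <b>debugger</b>.</p>
```

The class="bug" attribute is left alone because it sits inside a tag, and "debugger" is left alone because of the \b word boundaries.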
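
Going back to the typesetting job, the strip, match and reinsert steps could be sketched roughly like this. It is a reconstruction under assumptions, not the code I actually used: the names strip_markup and merge_index_tags are invented, tags are assumed to look like <...>, only the first occurrence of each key is used, and closing index tags are placed ahead of any punctuation stored at the same position. The key width is a parameter (the original job used 100 characters; the demo uses 8).

```perl
use strict;
use warnings;

# Separate a marked-up string into its bare text and the pieces removed
# from it (tags, whitespace, punctuation). $pieces->[$p] lists, in order,
# what originally sat just before bare-text character $p.
sub strip_markup {
    my ($src) = @_;
    my ($bare, @pieces) = ('');
    while ($src =~ /\G(?:(<[^>]*>|[\s[:punct:]])|(.))/gs) {
        if (defined $1) { push @{ $pieces[length($bare)] }, $1 }
        else            { $bare .= $2 }
    }
    return ($bare, \@pieces);
}

# Copy the <indexN> tags from $indexed into $book. $width is the number
# of bare characters after each tag that are used as the search key.
sub merge_index_tags {
    my ($book, $indexed, $width) = @_;
    my ($book_bare, $book_pieces) = strip_markup($book);
    my ($idx_bare,  $idx_pieces)  = strip_markup($indexed);

    for my $pos (0 .. $#$idx_pieces) {
        my @tags = grep { m{^</?index\d+>$} } @{ $idx_pieces->[$pos] || [] };
        for my $tag (@tags) {
            my $key = substr($idx_bare, $pos, $width);
            next if length($key) < $width;   # too little context to trust
            my $found = index($book_bare, $key);
            next if $found < 0;              # text has changed: no match
            if ($tag =~ m{^</}) { unshift @{ $book_pieces->[$found] }, $tag }
            else                { push    @{ $book_pieces->[$found] }, $tag }
        }
    }

    # Reassemble from the back so that earlier positions stay valid.
    for my $pos (reverse 0 .. $#$book_pieces) {
        next unless $book_pieces->[$pos];
        substr($book_bare, $pos, 0) = join '', @{ $book_pieces->[$pos] };
    }
    return $book_bare;
}

print merge_index_tags(
    '<p>The quick, brown fox jumps over the lazy dog.</p>',
    'The <index7>quick brown fox</index7> jumps over a lazy dog.',
    8
), "\n";
# prints: <p>The <index7>quick, brown fox</index7> jumps over the lazy dog.</p>
```

The comma added after "quick" in the book doesn't stop the match, because punctuation is stripped from both sides before comparing. Tags whose key can't be found are silently skipped here; in practice those are the ones you'd want logged for a human to place.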