
Re^3: Munging Rendered HTML While Preserving Formatting

by qq (Hermit)
on Jun 28, 2004 at 21:08 UTC ( [id://370336]=note )


in reply to Re^2: Munging Rendered HTML While Preserving Formatting
in thread Munging Rendered HTML While Preserving Formatting

I had to do something a bit like this. I worked at a typesetting company where the typesetters used xml-like tagging. It wasn't xml, because it had no requirement to be balanced, well formed or anything. They had a long book, marked up like that.

I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text - many corrections and additions had been made to it. My task was to try and insert the index tags into the correct place in the xml-like text.

So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the stripped book text and, if a match was found, added the index tag to the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.
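A rough sketch of that strip / match / reinsert idea, not the original code. The sub names, the sample book string and the 'brownfox' key are all made up for illustration, and a real run would use the first 100 bare characters after each index tag rather than a short key.

```perl
use strict;
use warnings;

# Return the text with tags, whitespace and punctuation removed, plus a map
# from each kept character to its offset in the original string.
sub strip_with_positions {
    my ($marked) = @_;
    my ($bare, @offset) = ('');
    while ($marked =~ /\G(?:(<[^>]*>)|([\s[:punct:]])|(.))/gcs) {
        next unless defined $3;            # a tag, space or punctuation: skip it
        $bare .= $3;
        push @offset, pos($marked) - 1;    # where this character sat originally
    }
    return ($bare, \@offset);
}

# Apply [offset, text] insertions from the highest offset down, so the
# offsets still to be used are never shifted by an earlier insertion.
sub insert_from_back {
    my ($marked, @inserts) = @_;
    for my $ins (sort { $b->[0] <=> $a->[0] } @inserts) {
        substr($marked, $ins->[0], 0, $ins->[1]);
    }
    return $marked;
}

# Hypothetical sample data.
my $book = '<p>The <em>quick</em>, brown fox.</p>';
my ($bare, $offset) = strip_with_positions($book);

my $key    = 'brownfox';                   # bare text following an index tag
my $at     = index($bare, $key);
my $result = $book;
if ($at >= 0) {
    my $start = $offset->[$at];
    my $end   = $offset->[ $at + length($key) - 1 ] + 1;
    $result = insert_from_back($book,
        [ $start, '<index1235>' ], [ $end, '</index1235>' ]);
}
print "$result\n";
```

Two insertions at the same offset come out in whatever order the sort leaves them, so overlapping index tags would need an explicit tie-break.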

But with html that can be parsed it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent:

<i>Apple</i> Juice
<b>Apple </b>Juice

it's unlikely that this would be:

<h1>Apple</h1> <h1>Juice</h1>

So I think I'd only do substitutions within one "semantic" tag.
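One way to picture that distinction is to drop only the stylistic tags before comparing. This is just a sketch: the list of "stylistic" tags is my assumption, and visible_text is a made-up name.

```perl
use strict;
use warnings;

# Remove stylistic tags only (an assumed list), then normalise whitespace.
# "Semantic" tags such as <h1> are left in place, so they still count.
sub visible_text {
    my ($html) = @_;
    $html =~ s{</?(?:b|i|em|strong|u)\b[^>]*>}{}gi;
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

my $one   = visible_text('<i>Apple</i> Juice');
my $two   = visible_text('<b>Apple </b>Juice');
my $three = visible_text('<h1>Apple</h1> <h1>Juice</h1>');

print $one eq $two   ? "equivalent\n" : "different\n";
print $one eq $three ? "equivalent\n" : "different\n";
```

The first pair compares equal once the formatting is gone; the pair of headings does not, because the h1 tags survive.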

If the strings are variable length, you've got to talk with the client about what to do about formatting tags. I think in real world situations it's not likely to be a problem. You'd probably get a spec like s/bug/issue/g, and you'd only want to match whole words. Or you'd get a paragraph to replace: s/I have no comment/I refer you to <a href="s@e.com">my solicitor</a>/g. In that case, you may want to match "I have <b>no comment</b>", but you would still use the replacement string intact.
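Something along these lines might do for the "match through the formatting, replace intact" case. It is a sketch under assumptions: the inline tag list, the sub names and the crude balance check (it counts opens and closes, it does not compare tag names) are all mine, and the phrase has to start and end on a word character for the \b anchors to work.

```perl
use strict;
use warnings;

my $NAMES  = 'b|i|em|strong|u';                  # assumed inline tags
my $INLINE = qr{</?(?:$NAMES)\b[^>]*>}i;

# True if every closing tag in $s has an earlier opening tag and nothing
# is left open at the end.
sub balanced {
    my ($s) = @_;
    my $depth = 0;
    while ($s =~ m{<(/?)\w+[^>]*>}g) {
        $depth += $1 ? -1 : 1;
        return 0 if $depth < 0;
    }
    return $depth == 0;
}

# Decide what to put back for one match: swallow trailing closers only if
# that keeps the markup balanced, and leave the text alone otherwise.
sub pick {
    my ($core, $tail, $replacement) = @_;
    return $replacement         if balanced($core . $tail);
    return $replacement . $tail if balanced($core);
    return $core . $tail;
}

# Replace $phrase even when inline tags sit between its words.
sub replace_phrase {
    my ($html, $phrase, $replacement) = @_;
    my $gap   = qr{(?:\s|$INLINE)+};
    my $core  = join $gap, map { quotemeta } split ' ', $phrase;
    my $close = qr{(?:</(?:$NAMES)>)*}i;
    $html =~ s{\b($core)\b($close)}{ pick($1, $2, $replacement) }ge;
    return $html;
}

my $new = 'I refer you to <a href="s@e.com">my solicitor</a>';
print replace_phrase('<p>I have <b>no comment</b>.</p>',
                     'I have no comment', $new), "\n";
```

A match such as "<b>I have</b> no comment" is left untouched here, since replacing it would strand the opening <b>. Whether to skip, or to widen the match instead, is the sort of thing to settle with the client.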

qq
