PerlMonks
Re: Re: A nice text processing question
by moseley (Acolyte) on Jan 05, 2002 at 20:05 UTC ([id://136544])
Well, I'm starting to believe that HTML::Parser is the way to go. I was trying to avoid it for size reasons (running under mod_perl), and since I'll have many of these to do I'd like the fastest way possible.
Looking at your code, I'm not sure you understood: I'm not trying to remove the tags. Rather, imagine a long string of text in which some words (or groups of words) may be bolded or otherwise marked up. I then split that text into chunks, and a split can land inside a tagged span. One chunk may then hold the opening tag while a later chunk holds the closing tag. A chunk can also fall between two tagged spans, so that it contains the *closing* tag for a span opened in the previous chunk and an *opening* tag that isn't closed until the next chunk.

In other words, starting text:

<tag>This is a -- bunch</tag> of words <tag>where maybe -- some have</tag> tags.

Splitting on the double dash gives a first chunk of:

<tag>This is a

Which should then be corrected to:

<tag>This is a</tag>

I might check on the lwp list, too, since I'll probably move to HTML::Parser.
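For what it's worth, here is a rough sketch of the re-balancing described above, done with a plain regex scan instead of HTML::Parser. This is only an illustration, not anything from the thread: `balance_chunks` is a made-up name, and it assumes simple, properly nested tags (no void elements like a bare <br>, no comments, no ">" inside attribute values). Real-world HTML would probably still want a proper parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): given chunks produced
# by splitting marked-up text, reopen tags carried over from the previous
# chunk and close tags still open at the end of each chunk.
# Assumes simple, properly nested tags.
sub balance_chunks {
    my @chunks = @_;
    my @open;     # stack of tags still open, outermost first
    my @fixed;

    for my $chunk (@chunks) {
        # Tags left open by the previous chunk are reopened here.
        my $prefix = join '', map { $_->{tag} } @open;

        # Walk every tag in this chunk and update the stack.
        while ($chunk =~ m{<(/?)([A-Za-z][\w:-]*)([^>]*)>}g) {
            my ($closing, $name, $rest) = ($1, $2, $3);
            next if $rest =~ m{/\s*$};    # self-closing, e.g. <br/>
            if ($closing) {
                # Pop only if it matches the innermost open tag;
                # stray closing tags are ignored.
                pop @open if @open && lc $open[-1]{name} eq lc $name;
            }
            else {
                push @open, { name => $name, tag => "<$name$rest>" };
            }
        }

        # Close whatever is still open, innermost first.
        my $suffix = join '', map { "</$_->{name}>" } reverse @open;
        push @fixed, $prefix . $chunk . $suffix;
    }
    return @fixed;
}

my $text = '<tag>This is a -- bunch</tag> of words '
         . '<tag>where maybe -- some have</tag> tags.';
my @parts = split /\s*--\s*/, $text;
print "$_\n" for balance_chunks(@parts);
# prints:
# <tag>This is a</tag>
# <tag>bunch</tag> of words <tag>where maybe</tag>
# <tag>some have</tag> tags.
```

The stack keeps the full opening tag text, so attributes (if any) are repeated when a tag is reopened in the next chunk. Whether that is cheaper than HTML::Parser under mod_perl would need benchmarking.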