In HTML , I Want to process only Data and Not tags

sanPerl has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need to change all the data in an HTML file. Since the data sits between tags, I capture it like this. (Please assume that $htmlbuffer contains the contents of the HTML file.)

$htmlbuffer =~ s{>(.*?)<}
{
    my ($mydata) = ($1);          # text between one '>' and the next '<'
    $mydata =~ s/abcd/efgh/gs;
    $mydata =~ s/yyy/zzz/gs;
    ">$mydata<"                   # put the delimiters back around the edited text
}exgs;

What happens here is that the substitution on $htmlbuffer runs its replacement code many times, once for every stretch of text between tags, just because I want to process the data between tags without disturbing the tags themselves. I face the same problem when processing XML data.
This slows the program down. I am sure an expert here can suggest a better way.

Regards,
Sandeep


Replies are listed 'Best First'.
Re: In HTML , I Want to process only Data and Not tags by GrandFather (Saint) on Jul 25, 2006 at 20:39 UTC
No. No way. Never ever (well, not often) try to use simple rexen for parsing markup. Life is too short to spend the months you would likely need to get the bugs out when others have already done it for you. For HTML see HTML::TreeBuilder. For XHTML or XML see XML::Twig. See some of the answers to "how to eliminate all html tags in a given string ??", and in particular "Re: how to eliminate all html tags in a given string ??", for sample code and other related suggestions.
DWIM is Perl's answer to Gödel
Re^2: In HTML , I Want to process only Data and Not tags by revdiablo (Prior) on Jul 25, 2006 at 21:33 UTC
I second the vote for HTML::TreeBuilder, but I also would like to recommend XML::TreeBuilder. It uses the same handy API, which just makes my life so much simpler. There are most likely cases where other modules -- such as XML::Twig -- make more sense, but I don't know of them off the top of my head.
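
A minimal sketch of what the HTML::TreeBuilder suggestion above might look like for the original question. It assumes HTML::TreeBuilder is installed from CPAN; the helper name rewrite_text and the sample markup are made up for illustration. The recursion follows the content_refs_list idiom from the HTML::Element documentation, so only text nodes are edited and tag names and attribute values are left alone.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: apply the OP's two substitutions to text nodes only.
sub rewrite_text {
    my ($elem) = @_;
    for my $item_ref ($elem->content_refs_list) {
        if (ref $$item_ref) {
            rewrite_text($$item_ref);       # child element: recurse into it
        }
        else {
            $$item_ref =~ s/abcd/efgh/g;    # plain text node: edit in place
            $$item_ref =~ s/yyy/zzz/g;
        }
    }
    return;
}

# Sample input (an assumption); note that "abcd" also appears in an attribute.
my $htmlbuffer = '<html><body><p class="abcd">abcd and <b>yyy</b></p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($htmlbuffer);
rewrite_text($tree);

# The empty hash asks as_HTML not to omit any optional end tags.
my $result = $tree->as_HTML('<>&', undef, {});
$tree->delete;    # release the tree when done

print "$result\n";
```

The class attribute keeps its original value while the text inside the paragraph is rewritten. The sketch does not treat the contents of script or style elements specially, so real code may want to skip those.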
Re: In HTML , I Want to process only Data and Not tags by lorn (Monk) on Jul 25, 2006 at 20:25 UTC
if (I understood what you said) { you need to see this page: http://www.stonehenge.com/merlyn/LinuxMag/col49.html }
Lorn -http://lornlab.org -slackwarezine.com.br
Code tags added by GrandFather
Re^2: In HTML , I Want to process only Data and Not tags by duckyd (Hermit) on Jul 25, 2006 at 22:05 UTC
I think you mean to suggest something like: `s/>[^>]+</.../` But in general it's not a good idea to try to roll your own HTML or XML parsing solution when there are plenty of good ones out there.
Re: In HTML , I Want to process only Data and Not tags by n00dles (Novice) on Jul 25, 2006 at 23:57 UTC
This regexp works fine for me. `/^<.*>/`
n00dles lynx.neocyber.info
Re^2: In HTML , I Want to process only Data and Not tags by GrandFather (Saint) on Jul 26, 2006 at 00:22 UTC
I'm sure it does. But what does it work for? As shown, it is a match that doesn't capture anything. It matches a < at the start of a line, followed by as much of anything as it can manage, up to the last > on the line. For example, all of the following match:

    '<>'
    '<tag>'
    "< line of quoted text in an email using '<' instead of the more usual '>'"
    '<tag>the stuff OP wanted to retrieve</tag>'

Note that what is matched isn't even what OP wants to retrieve. OP was after element data: the bit between a start tag and an end tag. BTW, the regex matches the whole last sample line, not just the start tag as you might have expected: `.*` is greedy.
DWIM is Perl's answer to Gödel
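
The greediness point can be checked with a short script. This is only an illustrative sketch using core Perl; the variable names are made up, and the sample line is the last example from the reply above.

```perl
use strict;
use warnings;

my $line = '<tag>the stuff OP wanted to retrieve</tag>';

# Greedy: .* runs on to the last '>' on the line.
my ($greedy) = $line =~ /^(<.*>)/;

# Non-greedy: .*? stops at the first '>'.
my ($lazy) = $line =~ /^(<.*?>)/;

# Negated character class: cannot run past a '>' at all.
my ($class) = $line =~ /^(<[^>]*>)/;

print "greedy: $greedy\n";    # the whole line
print "lazy:   $lazy\n";      # <tag>
print "class:  $class\n";     # <tag>
```

Even the non-greedy and character-class forms only isolate the start tag, not the element data, which is one more reason the replies above point to a real parser.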