Text::Balanced woes..

u914 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(jcwren) Re: Text::Balanced woes.. by jcwren (Prior) on May 27, 2002 at 02:01 UTC
It appears that Text::Balanced does not cope with leading non-whitespace characters that are not balanced tag pairs. The example below works as advertised. However, put ANY leading character or word in front of the opening <B>, and it ceases working. This doesn't seem terribly useful, unless you know you're parsing complete HTML. Note that in a list context, a successful parse returns 6 items. See the docs for which element is which. `#!/usr/bin/perl -w use Text::Balanced qw (extract_tagged); use strict; my $text = " <B>for</B> some trailing text"; my @a = extract_tagged ($text); print scalar (@a), "\n"; print "$_\n" for (@a); exit 0;` --Chris e-mail jcwren
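To make the behaviour described above concrete, here is a minimal sketch that calls `extract_tagged` in list context twice: once with the tag at the start of the string and once with leading words in front of it. The sample strings and variable names are made up for illustration; only `extract_tagged` itself comes from the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Illustrative inputs: the same tag pair, with and without leading words.
my $tag_first = " <B>for</B> some trailing text";
my $with_junk = "leading words <B>for</B> some trailing text";

# List context: on success the six fields are
# (extracted, remainder, prefix, open tag, tag contents, close tag).
my @ok  = extract_tagged($tag_first);
my @bad = extract_tagged($with_junk);

print "tag first:     ",
    (defined $ok[0] ? "matched '$ok[0]', contents '$ok[4]'" : "no match"), "\n";
print "leading words: ",
    (defined $bad[0] ? "matched '$bad[0]'" : "no match"), "\n";
```

The first call should succeed because only whitespace precedes the tag; the second should report no match because the tag does not sit at the current match position.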
Re: (jcwren) Re: Text::Balanced woes.. by u914 (Pilgrim) on May 27, 2002 at 03:05 UTC
I see... thank you very much, Chris.... i never would have thought to try a test excluding everything before the tag... You're right that Text::Balanced isn't very useful like this, i can't imagine that the author meant it to be this way. You can see what i'm trying to do (well, actually i'll be parsing a href links out in an effort to combat chatroom spambots); is there another method you'd suggest? It was looking at the docs that convinced me that Text::Balanced was the right thing for me... i'd be using the 5th (#4) element.. i haven't found another lib that'll supply just the stripped URL inside the tag yet... and while Perl seems super-cool for text handling (i'm a duffer), i'd rather not reinvent the wheel.. in any case, thanks very much for your reply!
(jcwren) Re: Text::Balanced woes.. by jcwren (Prior) on May 27, 2002 at 03:12 UTC
There are several packages based on HTML::Parser, such as HTML::LinkExtor, that shouldn't require you to invent too many wheels. I would take a look at that. I would avoid at all costs using a regular expression to extract links. That's just a path to problems. --Chris e-mail jcwren
Re: Text::Balanced woes.. by TheDamian (Vicar) on May 27, 2002 at 04:09 UTC
<sigh> Everybody expects the `extract_*` subroutines to act like: `$text =~ /extractor/ # i.e. match anywhere in the string`, even though they're clearly documented (and fully intended) to act like: `$text =~ /\G extractor/gc # i.e. match at current pos in string`. If you want to match heterogeneous input, what you really want is the `extract_multiple` subroutine. Like this: `use Text::Balanced ':ALL'; my $text = "this is a test <B>for</B> tags! \n"; my @data = extract_multiple( $text, [ \&extract_tagged ]); use Data::Dumper 'Dumper'; print Dumper [ @data ];` Notes: (1) Your original example had `<B>for</b>`, which `extract_tagged`'s case-sensitive default tag pattern wouldn't recognize anyway. (2) The other monks are correct, you'd be much better off with one of the many HTML:: modules on the CPAN (HTML::TreeBuilder is my personal favorite).
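The difference between the two matching styles can be seen with plain regexes, without Text::Balanced at all. This is only an analogy sketch: the pattern in `$tag` and the variable names are illustrative assumptions, not what the module uses internally.

```perl
use strict;
use warnings;

my $text = "this is a test <B>for</B> tags!";
my $tag  = qr{<B>.*?</B>};    # simplified stand-in for a tag extractor

# Unanchored: finds the tag anywhere in the string.
my $anywhere = $text =~ /$tag/ ? 1 : 0;

# \G-anchored: the tag must start exactly at pos($text), here 0.
pos($text) = 0;
my $at_start = $text =~ /\G$tag/gc ? 1 : 0;

# Move pos to where the tag begins and the anchored match succeeds.
pos($text) = index($text, '<B>');
my $at_pos = $text =~ /\G$tag/gc ? 1 : 0;

print "anywhere=$anywhere at_start=$at_start at_pos=$at_pos\n";
```

The `/c` flag keeps `pos` from being reset when the anchored match fails, which is what lets a caller try one extractor after another at the same position.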
(jcwren) Re: Re: Text::Balanced woes.. by jcwren (Prior) on May 27, 2002 at 04:41 UTC
Damian, I don't consider myself a world-class Perl programmer by any means, but I do believe I'm capable of reading the documentation. That being said, I don't see ANYWHERE that it's "obvious" that it matches from the current string position. Read the documentation, trying to ignore the fact that you wrote it. Tell me where you see that it mentions that, or even reasonably implies that. And keep in mind that someone like myself or 914 may be reading it. jeffa, who is someone I consider an experienced Perl programmer, mentioned in a /msg that he wouldn't have thought of deleting the leading words to see if it would pass. And I only got that idea from mucking around for 1/2 an hour, then running the extgen.pl test case with $DEBUG set. I'm not criticizing the documentation, because there is a lot of good stuff there, but I do think it could be better indicated that they match at the current position. One final detail. The documentation mentions that it matches valid HTML/XML pairs. Well, HTML allows upper and lower case tags, and as such, <B>/</b> should match. Under XML, where tags are required to be lower case (if I remember correctly), then <B>/</B> should fail anyway, as it's not valid XML. Update: chromatic pointed me to this text in the Description section: The various "extract_..." subroutines may be used to extract a delimited string (possibly after skipping a specified prefix string). The search for the string always begins at the current "pos" location of the string's variable (or at index zero, if no "pos" position is defined). However, my interpretation of that is not that matching will only occur at the start of the string, but rather that there is no implicit offset to the search for a matching tag. It also doesn't indicate that white space will be ignored, although that's not of terrible importance. --Chris e-mail jcwren
Re: (jcwren) Re: Re: Text::Balanced woes.. by TheDamian (Vicar) on May 27, 2002 at 07:43 UTC
Well, I think your interpretation is...err...imaginative, but you're without doubt a very smart person so the docs mustn't be clear enough. I'll make sure the next version leaves no room for misinterpretation: The various "extract_..." subroutines may be used to extract a delimited substring, possibly after skipping a specified prefix string. By default, that prefix is optional whitespace, but you can change it to whatever you wish (see below). The substring to be extracted must appear at the current "pos" location of the string's variable (or at index zero, if no "pos" position is defined). In other words, the "extract_..." subroutines don't extract the first occurrence of a substring anywhere in a string (like an unanchored regex would). Rather, they extract an occurrence of the substring appearing immediately at the current matching position in the string (like a "\G"-anchored regex would).
Re: Re: (jcwren) Re: Re: Text::Balanced woes.. by demerphq (Chancellor) on Jun 03, 2002 at 20:03 UTC
Re: Re: Re: Text::Balanced woes.. by runrig (Abbot) on May 27, 2002 at 05:32 UTC
XML is required to be case-sensitive, but not necessarily lower-case; you are right about the HTML though... :-)
Re: Re: Re: Re: Text::Balanced woes.. by TheDamian (Vicar) on May 27, 2002 at 07:23 UTC
Re: Re: Text::Balanced woes.. by u914 (Pilgrim) on May 27, 2002 at 14:55 UTC
Having looked at the HTML:: modules, i really do think this is exactly what i want/need it to do... It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged.... Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, so what happens to the remainder string? i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only one available?" i've got some working code, but am reluctant to post the code here (it is an anti-spambot tool, after all), but i'd be happy to share it via email. Update: i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!).. `# find all the URLs from the page contents, rejecting any from bianca @data = extract_multiple( $response->content, [ sub {extract_tagged($_[0], '<a href="http://', '</a>', undef, {reject => ['bianca.com']} ) } ], undef, 1); # loop thru and strip the URL to its bare address, this is # what's needed to insert into the database for (my $i=0; $i<=$#data; $i++) { my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef); $data[$i] = $temp[4]; }` Thanks again for everyone's help and comments!
Re: Text::Balanced woes.. by Smylers (Pilgrim) on May 28, 2002 at 10:29 UTC
That loop can be simplified: Don't bother doing the counting yourself when Perl will do it for you. You don't actually need the temporary array; you can grab a single element from a list. This is untested, since I don't have sample data handy, but I reckon it does the same as your loop and is a little simpler: `foreach my $datum (@data) { $datum = (extract_tagged($datum, '<a href="http://', '">'))[4]; }` Smylers
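Since the snippet above was posted untested, here is a small self-contained sketch of the same list-slice idea. The two links in `@data` are hypothetical stand-ins for whatever `extract_multiple` returned from the real page contents.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical sample links standing in for the real extract_multiple output.
my @data = (
    '<a href="http://example.com/page">Example</a>',
    '<a href="http://example.org/">Another</a>',
);

foreach my $datum (@data) {
    # '<a href="http://' plays the role of the open tag and '">' the close
    # tag; field 4 of the list-context result is the text between them.
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

print "$_\n" for @data;
```

Because `foreach` aliases `$datum` to each array element, assigning to it rewrites `@data` in place, so no index counter or temporary array is needed.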
Re: Text::Balanced woes.. by Smylers (Pilgrim) on May 28, 2002 at 10:37 UTC
"Having looked at the HTML:: modules, i really do think this [Text::Balanced] is exactly what i want/need it to do..." I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it: upper-case letters: `<a HREF="...">`; single quotes: `<a href='...'>`; other attributes: `<a class="main" href="...">`. Whether these matter depends on your application and your users. But if they do, you are probably better off using a module written explicitly for parsing HTML rather than trying to think of all the possible valid variations. Smylers
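A quick way to check those three variations is to run them through the same `extract_tagged` call used in the loop earlier in the thread. The links below are made-up examples, and `@missed` is just a helper for this sketch.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Made-up links illustrating the three variations listed above.
my @variants = (
    [ 'upper-case attribute' => '<a HREF="http://example.com/">x</a>' ],
    [ 'single quotes'        => q{<a href='http://example.com/'>x</a>} ],
    [ 'extra attribute'      => '<a class="main" href="http://example.com/">x</a>' ],
);

my @missed;
for my $case (@variants) {
    my ($label, $html) = @$case;
    # Same literal open/close patterns as the anti-spambot snippet.
    my $url = (extract_tagged($html, '<a href="http://', '">'))[4];
    push @missed, $label unless defined $url;
    printf "%-22s %s\n", "$label:", defined $url ? $url : '(not matched)';
}
```

Each variant fails at the opening-tag pattern, which is a literal, case-sensitive string here, so none of the URLs is extracted.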
Re: Re: Text::Balanced woes.. by u914 (Pilgrim) on Jun 12, 2002 at 05:45 UTC
Re: Re: Text::Balanced woes.. by u914 (Pilgrim) on May 27, 2002 at 05:27 UTC
OK, thanks everyone! i was not reading the documentation carefully enough, but i think i can force it to do my bidding now... though some of the other options mentioned might be easier. i'll check them out, thanks!
Re: Text::Balanced woes.. by Zaxo (Archbishop) on May 27, 2002 at 03:16 UTC
Here's something that sort of does what you want: `#!/usr/bin/perl -w use strict; use Text::Balanced ('extract_tagged'); my $text = "this is a test <b>for</b> tags! \n"; my ($leading,$extracted, $remainder); $leading = substr $text, 0, index($text,'<'), ''; ($extracted, $remainder) = extract_tagged($text); printf "leading: %s$/", $leading; printf "extracted: %s$/", $extracted; printf "remainder: %s$/", $remainder; exit 0;` That can be iterated over both $extracted and $remainder to parse html. HTML::Parser is probably a better bet, particularly (given your reply to jcwren) with its friend HTML::LinkExtor. Update: Here's code that does what you were trying more exactly. Text::Balanced honors pos: `#!/usr/bin/perl -w use strict; use Text::Balanced ('extract_tagged'); my $text = "this is a test <B>for</B> tags! \n"; pos($text) = index $text, '<'; my $extracted = extract_tagged($text); printf "extracted: %s$/", $extracted; printf "remainder: %s$/", $text; exit 0;` Note scalar context for &extract_tagged. After Compline, Zaxo
Re: Text::Balanced woes.. by Albannach (Monsignor) on May 27, 2002 at 03:08 UTC
I'm not a Text::Balanced guru, but I believe all you need to do is specify a prefix to be skipped, otherwise searching starts at the last `pos($text)` and won't find your tag: `my ($extracted, $remainder) = extract_tagged($text, '<B>', undef, '.*?(?=<B>)');` This should have it ignore everything up to the first `<B>` and then grab what you are after. You may also find it useful to print `$@` which is set if something goes wrong. -- I'd like to be able to assign to an luser
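The prefix suggestion above can be tried out with a short sketch like the following. It reuses the test string from elsewhere in the thread; capturing the third return value in `$prefix` is an addition for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "this is a test <B>for</B> tags! \n";

# Fourth argument: a prefix pattern to skip before the opening tag.
# Here it lazily consumes everything up to (but not including) the first <B>.
my ($extracted, $remainder, $prefix) =
    extract_tagged($text, '<B>', undef, '.*?(?=<B>)');

if (defined $extracted) {
    print "prefix:    '$prefix'\n";
    print "extracted: '$extracted'\n";
}
else {
    # $@ carries the module's failure message when nothing was extracted.
    print "no match: $@\n";
}
```

Note that `.` does not match a newline by default, so this particular prefix only skips leading text on the same line as the tag.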

