in reply to Text::Balanced woes..
<sigh>
Everybody expects the extract_* subroutines to act like:
$text =~ /extractor/ # i.e. match anywhere in the string
even though they're clearly documented (and fully intended) to act like:
$text =~ /\G extractor/gc # i.e. match at current pos in string
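The difference shows up with plain regexes, no Text::Balanced required. This is only a sketch: the `<B>...</B>` pattern is a hypothetical stand-in for an extractor, not what the module does internally.

```perl
use strict;
use warnings;

my $text    = "this is a test <B>for</B> tags!";
my $pattern = qr{<B>.*?</B>};    # hypothetical stand-in for an extractor

# Unanchored: the regex engine scans forward until it finds a match.
my $found_anywhere = ($text =~ /$pattern/) ? 1 : 0;

# \G-anchored: the match must begin exactly at pos($text). That is 0
# here, where the string starts with "this", so the match fails.
pos($text) = 0;
my $found_at_start = ($text =~ /\G$pattern/gc) ? 1 : 0;

# Move pos to where the tag begins and the anchored match succeeds.
pos($text) = index($text, '<B>');
my $found_at_pos = ($text =~ /\G$pattern/gc) ? 1 : 0;

print "anywhere: $found_anywhere, at start: $found_at_start, at pos: $found_at_pos\n";
```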
If you want to match heterogeneous input, what you really want is the extract_multiple subroutine. Like this:
use Text::Balanced ':ALL';
my $text = "this is a test <B>for</B> tags! \n";
my @data = extract_multiple( $text, [ \&extract_tagged ]);
use Data::Dumper 'Dumper';
print Dumper [ @data ];
Notes:
- Your original example had <B>for</b>, which extract_tagged's case-sensitive default tag pattern wouldn't recognize anyway.
- The other monks are correct: you'd be much better off with one of the many HTML:: modules on the CPAN (HTML::TreeBuilder is my personal favorite).
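The first note can be checked with a small sketch. The two sample strings are made up for illustration, and the tag sits at the very start of each one so that the matching position is not the issue here.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_tagged';

# Hypothetical sample strings; the tag is at index 0 in both.
my $same_case  = '<B>for</B> tags!';
my $mixed_case = '<B>for</b> tags!';

# With no explicit delimiters, the closing tag is derived from the
# opening tag, so <B> has to be closed by </B>.
my @same  = extract_tagged($same_case);
my @mixed = extract_tagged($mixed_case);

my $same_ok  = (defined $same[0]  && length $same[0])  ? 1 : 0;
my $mixed_ok = (defined $mixed[0] && length $mixed[0]) ? 1 : 0;

print "<B>...</B> extracted: $same_ok\n";
print "<B>...</b> extracted: $mixed_ok\n";
```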
(jcwren) Re: Re: Text::Balanced woes..
by jcwren (Prior) on May 27, 2002 at 04:41 UTC
|
Damian, I don't consider myself a world-class Perl programmer by any means, but I do believe I'm capable of reading the documentation.
That being said, I don't see ANYWHERE that it's "obvious" that it matches from the current string position. Read the documentation, trying to ignore the fact that you wrote it. Tell me where you see that it mentions that, or even reasonably implies that. And keep in mind that someone like myself or 914 may be reading it.
jeffa, who is someone I consider an experienced Perl programmer, mentioned in a /msg that he wouldn't have thought of deleting the leading words to see if it would pass. And I only got that idea from mucking around for 1/2 an hour, then running the extgen.pl test case with $DEBUG set.
I'm not criticizing the documentation, because there is a lot of good stuff there, but I do think it could be better indicated that they match at the current position.
One final detail. The documentation mentions that it matches valid HTML/XML pairs. Well, HTML allows upper and lower case tags, and as such, <B>/</b> should match. Under XML, where tags are required to be lower case (if I remember correctly), then <B>/</B> should fail anyway, as it's not valid XML.
Update: chromatic pointed me to this text in the Description section
The various "extract_..." subroutines may be used to
extract a delimited string (possibly after skipping a
specified prefix string). The search for the string
always begins at the current "pos" location of the
string's variable (or at index zero, if no "pos" position
is defined).
However, my interpretation of that is not that matching will only occur at the start of the string, but rather that there is no implicit offset to the search for a matching tag. It also doesn't indicate that white space will be ignored, although that's not of terrible importance.
--Chris
|
Well, I think your interpretation is...err...imaginative, but you're without doubt a very smart person so the docs mustn't be clear enough. I'll make sure the next version leaves no room for misinterpretation:
The various "extract_..." subroutines may be used to
extract a delimited substring, possibly after skipping a
specified prefix string. By default, that prefix is
optional whitespace, but you can change it to whatever
you wish (see below).
The substring to be extracted must appear at the
current "pos" location of the string's variable
(or at index zero, if no "pos" position is defined).
In other words, the "extract_..." subroutines *don't*
extract the first occurrence of a substring anywhere
in a string (like an unanchored regex would). Rather,
they extract an occurrence of the substring appearing
immediately at the current matching position in the
string (like a "\G"-anchored regex would).
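A minimal sketch of that behaviour, reusing the sample string from earlier in the thread. It assumes only that a failed extraction leaves the first returned element undefined or empty.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_tagged';

my $text = "this is a test <B>for</B> tags!";

# pos($text) is undefined, so extraction is attempted at index 0.
# The string starts with "this", not a tag, so nothing is extracted.
my ($first_try) = extract_tagged($text);
my $matched_at_zero = (defined $first_try && length $first_try) ? 1 : 0;

# Point pos at the tag and try again: the tag now sits at the
# current matching position, so extraction succeeds.
pos($text) = index($text, '<B>');
my @fields = extract_tagged($text);

print "matched at index 0: $matched_at_zero\n";
print "extracted: $fields[0]\n";    # the whole tagged substring
print "contents:  $fields[4]\n";    # the text between the tags
```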
|
Not to kick a man when he's down ;-) but I think the problem is that your documentation tends to be very tutorial oriented (I'm thinking P::RD and Text::Balanced), which is excellent if you are working through it from beginning to end. But the tutorial style can get in the way when all you want is a quick and dirty. For instance, in Text::Balanced you have the general conventions followed by a page or more for each sub. This is compounded by pod2html, which doesn't index =item blocks. (I patched it to add an index of them at the end, which I find quite helpful.)
Incidentally, this seems to be a failing of many of the better module designers, DBI has IMO similar problems.
Oh, and please don't take this as a negative criticism; it's just that a terse, factual, reference-oriented doc/section can also be very helpful. Adding such a section (as you have already said you will) would be appreciated very much.
And I'm well aware that if all you provided was such a reference text, you'd be inundated with relatively foolish questions...
Yves / DeMerphq
---
Writing a good benchmark isnt as easy as it might look.
|
XML is required to be case-sensitive, but not necessarily lower-case; you are right about the HTML though... :-)
|
Well, that's a fair cop, Guv, since I do suggest it matches HTML tags at one point in the docs. It will be fixed in the next release (though whether I fix it in favour of XML or HTML remains to be seen! ;-).
Re: Re: Text::Balanced woes..
by u914 (Pilgrim) on May 27, 2002 at 14:55 UTC
|
Having looked at the HTML:: modules, i really do think this is exactly what i want/need it to do...
It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged....
Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, what happens to the remainder string?
i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only available?"
i've got some working code, but am reluctant to post the code here (it is an anti-spambot tool, after all), but i'd be happy to share it via email.
update
i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!)..
# find all the URLs from the page contents, rejecting any from bianca
@data = extract_multiple( $response->content,
                          [ sub { extract_tagged($_[0],
                                                 '<a href="http://', '</a>',
                                                 undef,
                                                 {reject => ['bianca.com']} ) } ],
                          undef, 1);
# loop thru and strip the URL to its bare address, this is
# what's needed to insert into the database
for (my $i=0; $i<=$#data; $i++) {
    my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef);
    $data[$i] = $temp[4];
}
Thanks again for everyone's help and comments!
|
That loop can be simplified:
- Don't bother doing the counting yourself when Perl will do it for you.
- You don't actually need the temporary array — you can grab a single element from a list.
This is untested, since I don't have sample data handy, but I reckon it does the same as your loop and is a little simpler:
foreach my $datum (@data) {
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}
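For anyone wanting to try the list-slice version, here is a self-contained sketch. The two links are invented sample data (example.com and example.org are placeholders), standing in for whatever extract_multiple returned.

```perl
use strict;
use warnings;
use Text::Balanced 'extract_tagged';

# Hypothetical sample data standing in for the extracted links.
my @data = (
    '<a href="http://example.com/page">a link</a>',
    '<a href="http://example.org/">another link</a>',
);

# $datum is an alias for each array element, so assigning to it
# updates @data in place. The slice [4] picks out the text between
# the opening and closing delimiters without a temporary array.
foreach my $datum (@data) {
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

print "$_\n" for @data;
```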
Smylers
|
Having looked at the HTML:: modules, i really do think this Text::Balanced is exactly what i want/need it to do...
I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it:
- upper-case letters: <a HREF="...">
- single quotes: <a href='...'>
- other attributes: <a class="main" href="...">
Whether these matter depends on your application and your users. But if they do, you are probably better off using a module explicitly for parsing HTML rather than trying to think of all the possible valid variations.
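A quick way to see which variants slip through is to test the literal opening delimiter against a few links with a plain regex. This is only a sketch: the links are invented, and it checks the opening delimiter on its own rather than running Text::Balanced.

```perl
use strict;
use warnings;

# The literal opening delimiter from the loop above.
my $open = quotemeta '<a href="http://';

# Hypothetical sample links, one per variation listed above.
my @links = (
    '<a href="http://example.com/">plain</a>',
    '<a HREF="http://example.com/">upper-case attribute</a>',
    q{<a href='http://example.com/'>single quotes</a>},
    '<a class="main" href="http://example.com/">extra attribute</a>',
);

my @caught = grep {  /\A$open/ } @links;
my @missed = grep { !/\A$open/ } @links;

print "caught: $_\n" for @caught;
print "missed: $_\n" for @missed;
```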
Smylers
Re: Re: Text::Balanced woes..
by u914 (Pilgrim) on May 27, 2002 at 05:27 UTC
|
OK, thanks everyone!
i was not reading the documentation carefully enough, but i think i can force it to do my bidding now...
though some of the other options mentioned might be easier.
i'll check them out, thanks!