PerlMonks  

Re: Text::Balanced woes..

by TheDamian (Priest)
on May 27, 2002 at 04:09 UTC (#169480)


in reply to Text::Balanced woes..

<sigh>

Everybody expects the extract_* subroutines to act like:

    $text =~ /extractor/         # i.e. match anywhere in the string

even though they're clearly documented (and fully intended) to act like:

    $text =~ /\G extractor/gc    # i.e. match at current pos in string

If you want to match heterogeneous input, what you really want is the extract_multiple subroutine. Like this:

    use Text::Balanced ':ALL';

    my $text = "this is a test <B>for</B> tags! \n";
    my @data = extract_multiple( $text, [ \&extract_tagged ]);

    use Data::Dumper 'Dumper';
    print Dumper [ @data ];

Notes:
  1. Your original example had <B>for</b>, which extract_tagged's case-sensitive default tag pattern wouldn't recognize anyway.

  2. The other monks are correct: you'd be much better off with one of the many HTML:: modules on the CPAN (HTML::TreeBuilder is my personal favorite).
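
For anyone who wants to see the anchoring for themselves, here is a minimal sketch (not part of the original reply). It assumes only the documented list-context return values of extract_tagged, and it uses index() to find the tag purely for illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "this is a test <B>for</B> tags!\n";

# At position 0 the extractor sees "this", not a tag, so the
# first returned element is undef.
my ($match) = extract_tagged($text);
print defined $match ? "matched: $match\n" : "no match at pos 0\n";

# Move the matching position to the start of the tag and try again.
pos($text) = index($text, '<B>');
my ($extracted, $remainder) = extract_tagged($text);
print "matched at pos 15: $extracted\n";
```

The second call succeeds only because pos($text) was moved onto the tag first, which is the "\G"-style behaviour described above.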


(jcwren) Re: Re: Text::Balanced woes..
by jcwren (Prior) on May 27, 2002 at 04:41 UTC

    Damian, I don't consider myself a world-class Perl programmer by any means, but I do believe I'm capable of reading the documentation.

    That being said, I don't see ANYWHERE that it's "obvious" that it matches from the current string position. Read the documentation, trying to ignore the fact that you wrote it. Tell me where you see that it mentions that, or even reasonably implies that. And keep in mind that someone like myself or 914 may be reading it.

    jeffa, who is someone I consider an experienced Perl programmer, mentioned in a /msg that he wouldn't have thought of deleting the leading words to see if it would pass. And I only got that idea from mucking around for 1/2 an hour, then running the extgen.pl test case with $DEBUG set.

    I'm not criticizing the documentation, because there is a lot of good stuff there, but I do think it could be better indicated that they match at the current position.

    One final detail. The documentation mentions that it matches valid HTML/XML pairs. Well, HTML allows upper and lower case tags, and as such, <B>/</b> should match. Under XML, where tags are required to be lower case (if I remember correctly), then <B>/</B> should fail anyway, as it's not valid XML.

    Update: chromatic pointed me to this text in the Description section

           The various "extract_..." subroutines may be used to
           extract a delimited string (possibly after skipping a
           specified prefix string).  The search for the string
           always begins at the current "pos" location of the
           string's variable (or at index zero, if no "pos" position
           is defined).
    

    However, my interpretation of that is not that matching will only occur at the start of the string, but rather that there is no implicit offset to the search for a matching tag. It also doesn't indicate that white space will be ignored, although that's not of terrible importance.

    --Chris

    e-mail jcwren

      XML is required to be case-sensitive, but not necessarily lower-case; you are right about the HTML though... :-)
        Well, that's a fair cop, Guv, since I do suggest it matches HTML tags at one point in the docs. It will be fixed in the next release (though whether I fix it in favour of XML or HTML remains to be seen! ;-).
      Well, I think your interpretation is...err...imaginative, but you're without doubt a very smart person so the docs mustn't be clear enough. I'll make sure the next version leaves no room for misinterpretation:
      The various "extract_..." subroutines may be used to extract a delimited substring, possibly after skipping a specified prefix string. By default, that prefix is optional whitespace, but you can change it to whatever you wish (see below). The substring to be extracted must appear at the current "pos" location of the string's variable (or at index zero, if no "pos" position is defined). In other words, the "extract_..." subroutines *don't* extract the first occurrence of a substring anywhere in a string (like an unanchored regex would). Rather, they extract an occurrence of the substring appearing immediately at the current matching position in the string (like a "\G"-anchored regex would).
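
As a rough illustration of the "prefix" part of that paragraph (a sketch added here, not taken from the module's docs): the fourth argument of extract_tagged is the prefix pattern, and the pattern '[^<]*' below is just an assumption chosen so that everything before the first '<' is skipped.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "this is a test <B>for</B> tags!\n";

# Arguments: text, opening tag pattern, closing tag pattern, prefix pattern.
# undef for the tag patterns keeps the defaults; '[^<]*' replaces the
# default whitespace-only prefix for this sketch.
my ($extracted, $remainder, $prefix) =
    extract_tagged($text, undef, undef, '[^<]*');

print "prefix:    '$prefix'\n";
print "extracted: '$extracted'\n";
```

The tag still has to appear immediately after whatever the prefix pattern consumes, so this stays an anchored match rather than a search.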
        Not to kick a man when he's down ;-) but I think the problem is that your documentation tends to be very tutorial-oriented (I'm thinking of P::RD and Text::Balanced), which is excellent if you are working through it from beginning to end. But the tutorial style can get in the way when all you want is a quick-and-dirty answer. For instance, in Text::Balanced you have the general conventions followed by a page or more for each sub. This is compounded by pod2html, which doesn't index =item blocks. (I patched it to add an index of them at the end, which I find quite helpful.)

        Incidentally, this seems to be a failing of many of the better module designers, DBI has IMO similar problems.

        Oh, and please don't take this as a negative criticism; it's just that a terse, factual, reference-oriented doc/section can also be very helpful. Adding such a section (as you have already said you will) would be appreciated very much.

        And I'm well aware that if all you provided was such a reference text, you'd be inundated with relatively foolish questions...

        Yves / DeMerphq
        ---
        Writing a good benchmark isn't as easy as it might look.

Re: Re: Text::Balanced woes..
by 914 (Pilgrim) on May 27, 2002 at 05:27 UTC
    OK, thanks everyone!

    i was not reading the documentation carefully enough, but i think i can force it to do my bidding now...
    though some of the other options mentioned might be easier.

    i'll check them out, thanks!

Re: Re: Text::Balanced woes..
by 914 (Pilgrim) on May 27, 2002 at 14:55 UTC
    Having looked at the HTML:: modules, i really do think this is exactly what i want/need it to do...

    It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged....

    Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, what happens to the remainder string?

    i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only available?"

    i've got some working code but am reluctant to post it here (it is an anti-spambot tool, after all), though i'd be happy to share it via email.

    update

    i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!)..

        # find all the URLs from the page contents, rejecting any from bianca
        @data = extract_multiple(
            $response->content,
            [ sub { extract_tagged($_[0], '<a href="http://', '</a>', undef,
                                   {reject => ['bianca.com']}) } ],
            undef, 1);

        # loop thru and strip the URL to its bare address, this is
        # what's needed to insert into the database
        for (my $i=0; $i<=$#data; $i++) {
            my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef);
            $data[$i] = $temp[4];
        }

    Thanks again for everyone's help and comments!

      That loop can be simplified:

      1. Don't bother doing the counting yourself when Perl will do it for you.
      2. You don't actually need the temporary array — you can grab a single element from a list.

      This is untested, since I don't have sample data handy, but I reckon it does the same as your loop and is a little simpler:

          foreach my $datum (@data) {
              $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
          }
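
For what it's worth, the same loop can be tried against made-up data. This is only a sketch: the two links below are hypothetical stand-ins for whatever extract_multiple returned, and element 4 of the list is the text between the opening and closing patterns.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical sample data standing in for the extracted links.
my @data = (
    '<a href="http://example.com/page">Example</a>',
    '<a href="http://example.org/">Org</a>',
);

# $datum is an alias for each element, so assigning to it updates @data.
foreach my $datum (@data) {
    # Slice element 4 (text between '<a href="http://' and '">')
    # straight out of the returned list, with no temporary array.
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

print "$_\n" for @data;
```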

      Smylers

      Having looked at the HTML:: modules, i really do think this Text::Balanced is exactly what i want/need it to do...

      I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it:

      • upper-case letters: <a HREF="...">
      • single quotes: <a href='...'>
      • other attributes: <a class="main" href="...">

      Whether these matter depends on your application and your users. But if they do, you are probably better off using a module written explicitly for parsing HTML rather than trying to think of all the possible valid variations.
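
A quick way to check the variations listed above is to run each one through the same opening and closing patterns used in the snippet. This is a sketch with hypothetical sample links, not a test of the real pages:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical sample links, one per variation.
my @samples = (
    [ 'plain'        => '<a href="http://example.com/">x</a>' ],
    [ 'upper-case'   => '<a HREF="http://example.com/">x</a>' ],
    [ 'single quote' => q{<a href='http://example.com/'>x</a>} ],
    [ 'extra attr'   => '<a class="main" href="http://example.com/">x</a>' ],
);

my %caught;
for my $sample (@samples) {
    my ($label, $html) = @$sample;
    # In list context the first element is undef when nothing matched.
    my ($match) = extract_tagged($html, '<a href="http://', '</a>');
    $caught{$label} = defined $match ? 1 : 0;
    printf "%-12s %s\n", $label, $caught{$label} ? 'caught' : 'missed';
}
```

Only the first form matches the literal opening pattern; the other three are skipped, which is harmless for machine-generated pages but not for arbitrary HTML.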

      Smylers

        Actually, the html pages that it works on are generated by another script, so they're pretty consistent.

        i like the loop simplification, though... i often forget about foreach, and the C-ishness is a fiendish habit to break!

        Thanks for the tip!

        :-)
