PerlMonks  

Text::Balanced woes..

by 914 (Pilgrim)
on May 26, 2002 at 23:42 UTC (#169447)
914 has asked for the wisdom of the Perl Monks concerning the following question:

Hello there..

i've not been able to find any answers to this question here or elsewhere on the web, so....

i've recently installed Text::Balanced (ver 1.89, from CPAN) on two different Linux machines and, following the instructions to my utmost, created a rather simplistic test script.

Though 'make test' reported no errors, my script won't work. That leads me to believe that the problem is mine... hence this post.

Can (will?) someone please take a look at this script (pasted below) and tell me why it does not do what i expect? (i expected it to find the text wrapped in the bold tags..)

thanks ever so much!

#!/usr/bin/perl -w
use Text::Balanced qw(extract_tagged);
use strict;
my $text = "this is a test <B>for</b> tags! \n";
my ($extracted, $remainder);
($extracted, $remainder) = extract_tagged($text);
print "extracted: ";
print $extracted;
print "\n";
print "remainder: ";
print $remainder;
print "\n";
exit 0;

(jcwren) Re: Text::Balanced woes..
by jcwren (Prior) on May 27, 2002 at 02:01 UTC

    It appears that Text::Balanced does not cope with leading non-white space characters that are not balanced tag pairs.

    The example below works as advertised. However, put ANY leading character or word in front of the opening <B>, and it ceases working. This doesn't seem terribly useful, unless you know you're parsing complete HTML.

    Note that in a list context, a valid parsing returns 6 items. See the docs for which element is which.

    #!/usr/bin/perl -w
    use Text::Balanced qw (extract_tagged);
    use strict;
    my $text = " <B>for</B> some trailing text";
    my @a = extract_tagged ($text);
    print scalar (@a), "\n";
    print "$_\n" for (@a);
    exit 0;

    --Chris

    e-mail jcwren
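The six values jcwren mentions can be given labels to make them easier to tell apart. This is only a minimal sketch, assuming the element order given in the Text::Balanced documentation; the label names are invented for this illustration and are not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Same input as the example above: only whitespace precedes the tag.
my $text = " <B>for</B> some trailing text";

# Hypothetical labels for the six list-context return values, in order.
my @names = qw(extracted remainder prefix open_tag contents close_tag);

# Hash slice: pair each label with the corresponding returned element.
my %part;
@part{@names} = extract_tagged($text);

printf "%-10s '%s'\n", $_, $part{$_} for @names;
```

The element at index 4 ("contents" here) is the text between the tags, which is the one 914 asks about further down.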

      I see...
      thank you very much, Chris.... i never would have thought to try a test excluding everything before the tag...

      You're right that Text::Balanced isn't very useful like this, i can't imagine that the author meant it to be this way.

      You can see what i'm trying to do (well, actually i'll be parsing a href links out in an effort to combat chatroom spambots), is there another method you'd suggest?

      It was looking at the docs that convinced me that Text::Balanced was the right thing for me... i'd be using the 5th (#4) element.. i haven't found another lib that'll supply just the stripped URL inside the tag yet... and while Perl seems super-cool for text handling (i'm a duffer), i'd rather not rewrite the wheel..

      in any case, thanks very much for your reply!

        There are several packages based on HTML::Parser, such as HTML::LinkExtor, that shouldn't require you to invent too many wheels. I would take a look at that.

        I would avoid at all costs using a regular expression to extract links. That's just a path to problems.

        --Chris

        e-mail jcwren

Re: Text::Balanced woes..
by Albannach (Prior) on May 27, 2002 at 03:08 UTC
    I'm not a Text::Balanced guru, but I believe all you need to do is specify a prefix to be skipped, otherwise searching starts at the last pos($text) and won't find your tag:

    my ($extracted, $remainder) = extract_tagged($text, '<B>', undef, '.*?(?=<B>)');

    This should have it ignore everything up to the first <B> and then grab what you are after. You may also find it useful to print $@ which is set if something goes wrong.

    --
    I'd like to be able to assign to an luser
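Albannach's two suggestions (a custom prefix, and checking $@) can be tried side by side. A minimal sketch, assuming the original poster's string with the closing tag corrected to upper case; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "this is a test <B>for</B> tags! \n";

# Default prefix (optional whitespace): the tag is not at the current
# position, so nothing is extracted and $@ describes the failure.
my ($plain) = extract_tagged($text);
my $error = defined $@ ? "$@" : '';
print "default prefix failed: $error\n"
    unless defined $plain && length $plain;

# Custom prefix: skip everything up to, but not including, the first <B>.
my ($extracted, $remainder, $prefix) =
    extract_tagged($text, '<B>', undef, '.*?(?=<B>)');

print "prefix:    '$prefix'\n";
print "extracted: '$extracted'\n";
print "remainder: '$remainder'\n";
```

The skipped text comes back as the third element, so nothing from the input is lost.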

Re: Text::Balanced woes..
by Zaxo (Archbishop) on May 27, 2002 at 03:16 UTC

    Here's something that sort of does what you want:

    #!/usr/bin/perl -w
    use strict;
    use Text::Balanced ('extract_tagged');
    my $text = "this is a test <b>for</b> tags! \n";
    my ($leading,$extracted, $remainder);
    $leading = substr $text, 0, index($text,'<'), '';
    ($extracted, $remainder) = extract_tagged($text);
    printf "leading: %s$/", $leading;
    printf "extracted: %s$/", $extracted;
    printf "remainder: %s$/", $remainder;
    exit 0;

    That can be iterated over both $extracted and $remainder to parse html. HTML::Parser is probably a better bet, particularly (given your reply to jcwren) with its friend HTML::LinkExtor.

    Update: Here's code that does what you were trying more exactly. Text::Balanced honors pos:

    #!/usr/bin/perl -w
    use strict;
    use Text::Balanced ('extract_tagged');
    my $text = "this is a test <B>for</B> tags! \n";
    pos($text) = index $text, '<';
    my $extracted = extract_tagged($text);
    printf "extracted: %s$/", $extracted;
    printf "remainder: %s$/", $text;
    exit 0;

    Note scalar context for &extract_tagged.

    After Compline,
    Zaxo
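Zaxo's note about scalar context matters because the two contexts treat the input differently. A minimal sketch of the difference, assuming the behaviour described in the Text::Balanced documentation (list context leaves the input alone, scalar context removes the match from it); the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# List context: the pieces are returned and the input is left untouched.
my $list_text = "<B>for</B> tags!";
my ($extracted, $remainder) = extract_tagged($list_text);
print "list:   extracted='$extracted' remainder='$remainder' input='$list_text'\n";

# Scalar context: the match is returned and cut out of the input,
# so the input variable itself ends up holding the remainder.
my $scalar_text = "<B>for</B> tags!";
my $match = extract_tagged($scalar_text);
print "scalar: extracted='$match' input='$scalar_text'\n";
```

That is why the second script above prints $text as the remainder rather than capturing a separate return value.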

Re: Text::Balanced woes..
by TheDamian (Priest) on May 27, 2002 at 04:09 UTC
    <sigh>

    Everybody expects the extract_* subroutines to act like:

    $text =~ /extractor/ # i.e. match anywhere in the string
    even though they're clearly documented (and fully intended) to act like:
    $text =~ /\G extractor/gc # i.e. match at current pos in string
    If you want to match heterogeneous input, what you really want is the extract_multiple subroutine. Like this:
    use Text::Balanced ':ALL';
    my $text = "this is a test <B>for</B> tags! \n";
    my @data = extract_multiple( $text, [ \&extract_tagged ]);
    use Data::Dumper 'Dumper';
    print Dumper [ @data ];

    Notes:
    1. Your original example had <B>for</b>, which extract_tagged's case-sensitive default tag pattern wouldn't recognize anyway.

    2. The other monks are correct, you'd be much better off with one of the many HTML:: modules on the CPAN ( HTML::TreeBuilder is my personal favorite).
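The difference between the two regex forms above can be seen without Text::Balanced at all. A minimal sketch using only core Perl; the pattern is a deliberately naive stand-in for a tag extractor, not what the module uses internally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "this is a test <B>for</B> tags!";

# Unanchored: the regex engine scans forward and finds the tag anywhere.
my $anywhere = ($text =~ m{<B>.*?</B>}) ? 1 : 0;

# \G-anchored at position 0: the tag must start exactly there, so this fails.
pos($text) = 0;
my $at_start = ($text =~ m{\G<B>.*?</B>}gc) ? 1 : 0;

# Move the match position to the tag and the anchored match succeeds.
pos($text) = index($text, '<B>');
my $at_tag = ($text =~ m{\G<B>.*?</B>}gc) ? 1 : 0;

print "unanchored: $anywhere, \\G at 0: $at_start, \\G at tag: $at_tag\n";
```

The extract_* subroutines behave like the second and third matches: they look only at the current pos, after any prefix has been skipped.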

      Damian, I don't consider myself a world-class Perl programmer by any means, but I do believe I'm capable of reading the documentation.

      That being said, I don't see ANYWHERE that it's "obvious" that it matches from the current string position. Read the documentation, trying to ignore the fact that you wrote it. Tell me where you see that it mentions that, or even reasonably implies that. And keep in mind that someone like myself or 914 may be reading it.

      jeffa, who is someone I consider an experienced Perl programmer, mentioned in a /msg that he wouldn't have thought of deleting the leading words to see if it would pass. And I only got that idea from mucking around for 1/2 an hour, then running the extgen.pl test case with $DEBUG set.

      I'm not criticizing the documentation, because there is a lot of good stuff there, but I do think it could be better indicated that they match at the current position.

      One final detail. The documentation mentions that it matches valid HTML/XML pairs. Well, HTML allows upper and lower case tags, and as such, <B>/</b> should match. Under XML, where tags are required to be lower case (if I remember correctly), then <B>/</B> should fail anyway, as it's not valid XML.

      Update: chromatic pointed me to this text in the Description section

             The various "extract_..." subroutines may be used to
             extract a delimited string (possibly after skipping a
             specified prefix string).  The search for the string
             always begins at the current "pos" location of the
             string's variable (or at index zero, if no "pos" position
             is defined).
      

      However, my interpretation of that is not that matching will only occur at the start of the string, but rather that there is no implicit offset to the search for a matching tag. It also doesn't indicate that white space will be ignored, although that's not of terrible importance.

      --Chris

      e-mail jcwren

        XML is required to be case-sensitive, but not necessarily lower-case; you are right about the HTML though... :-)
        Well, I think your interpretation is...err...imaginative, but you're without doubt a very smart person so the docs mustn't be clear enough. I'll make sure the next version leaves no room for misinterpretation:
        The various "extract_..." subroutines may be used to extract a delimited substring, possibly after skipping a specified prefix string. By default, that prefix is optional whitespace, but you can change it to whatever you wish (see below). The substring to be extracted must appear at the current "pos" location of the string's variable (or at index zero, if no "pos" position is defined). In other words, the "extract_..." subroutines *don't* extract the first occurrence of a substring anywhere in a string (like an unanchored regex would). Rather, they extract an occurrence of the substring appearing immediately at the current matching position in the string (like a "\G"-anchored regex would).
      OK, thanks everyone!

      i was not reading the documentation carefully enough, but i think i can force it to do my bidding now...
      though some of the other options mentioned might be easier.

      i'll check them out, thanks!

      Having looked at the HTML:: modules, i really do think this is exactly what i want/need it to do...

      It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged....

      Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, what happens to the remainder string?

      i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only available?"

      i've got some working code, but am reluctant to post it here (it is an anti-spambot tool, after all), but i'd be happy to share it via email.

      update

      i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!)..

      # find all the URLs from the page contents, rejecting any from bianca
      @data = extract_multiple(
          $response->content,
          [ sub { extract_tagged($_[0], '<a href="http://', '</a>', undef,
                                 {reject => ['bianca.com']} ) } ],
          undef, 1);
      # loop thru and strip the URL to its bare address, this is
      # what's needed to insert into the database
      for (my $i=0; $i<=$#data; $i++) {
          my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef);
          $data[$i] = $temp[4];
      }

      Thanks again for everyone's help and comments!

        That loop can be simplified:

        1. Don't bother doing the counting yourself when Perl will do it for you.
        2. You don't actually need the temporary array — you can grab a single element from a list.

        This is untested, since I don't have sample data handy, but I reckon does the same as your loop and is a little simpler:

        foreach my $datum (@data) {
            $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
        }

        Smylers
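Since the loop above was posted untested, here is a self-contained version that can be run. It is only a sketch: the sample HTML and the example.com/example.org addresses are made up, the reject option from 914's code is left out, and the tag strings are the same narrow ones used in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_tagged);

# Hypothetical page content standing in for $response->content.
my $html = 'see <a href="http://example.com/one">one</a> and '
         . '<a href="http://example.org/two">two</a> here';

# First pass: collect each whole <a href="http://...">...</a> element.
# The final argument (1) asks extract_multiple to drop unmatched text.
my @links = extract_multiple(
    $html,
    [ sub { extract_tagged($_[0], '<a href="http://', '</a>') } ],
    undef, 1);

# Second pass: a list slice keeps only element 4, the text between the
# "opening tag" <a href="http:// and the "closing tag" ">.
foreach my $datum (@links) {
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

print "$_\n" for @links;
```

Because foreach aliases $datum to each array element, assigning to it updates @links in place, which is what lets the counter and the temporary array go away.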

        Having looked at the HTML:: modules, i really do think this Text::Balanced is exactly what i want/need it to do...

        I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it:

        • upper-case letters: <a HREF="...">
        • single quotes: <a href='...'>
        • other attributes: <a class="main" href="...">

        Whether these matter depends on your application and your users. But if they do, you are probably better off using a module explicitly for parsing HTML rather than trying to think of all the possible valid variations.

        Smylers
