PerlMonks  

(jcwren) Re: Text::Balanced woes..

by jcwren (Prior)
on May 27, 2002 at 02:01 UTC (#169457)


in reply to Text::Balanced woes..

It appears that Text::Balanced's extract_tagged, called with its default arguments, does not cope with leading non-whitespace characters that are not balanced tag pairs.

The example below works as advertised. However, put ANY leading non-whitespace character or word in front of the opening <B>, and it stops working. This doesn't seem terribly useful, unless you know you're parsing complete HTML.

Note that in a list context, a successful parse returns 6 items. See the docs for which element is which.

#!/usr/bin/perl -w
use Text::Balanced qw (extract_tagged);
use strict;

my $text = " <B>for</B> some trailing text";
my @a = extract_tagged ($text);
print scalar (@a), "\n";
print "$_\n" for (@a);
exit 0;
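
To make the six return values and the leading-text failure easier to see, here is a small sketch that labels each element. The names in @names and the show() helper are my own labels for illustration, not part of the module. The custom prefix passed as the fourth argument is an assumption about what suits this particular input, so check it against the Text::Balanced docs before relying on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Illustrative labels for the six list-context return values (my naming).
my @names = qw(extracted remainder prefix open_tag contents close_tag);

# Print each returned field next to its index and label.
sub show {
    my ($label, @fields) = @_;
    print "$label\n";
    for my $i (0 .. $#names) {
        my $value = defined $fields[$i] ? "'$fields[$i]'" : 'undef';
        printf "  [%d] %-9s %s\n", $i, $names[$i], $value;
    }
}

my $with_space = ' <B>for</B> some trailing text';
my $with_word  = 'word <B>for</B> some trailing text';

# Only whitespace before the tag: the default prefix matches.
show('whitespace before tag', extract_tagged($with_space));

# A word before the tag: the default prefix does not match, so extraction fails.
show('word before tag', extract_tagged($with_word));

# Assumption: a prefix pattern that skips everything up to the first '<'.
# undef for the tag arguments keeps the module's default tag patterns.
show('custom prefix', extract_tagged($with_word, undef, undef, '[^<]*'));
```

Element [4] is the text between the opening and closing tags. A prefix this loose will skip over anything that is not a '<', which may or may not be what you want for real input.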

--Chris

e-mail jcwren


Re: (jcwren) Re: Text::Balanced woes..
by 914 (Pilgrim) on May 27, 2002 at 03:05 UTC
    I see...
    Thank you very much, Chris. I never would have thought to try a test excluding everything before the tag.

    You're right that Text::Balanced isn't very useful like this; I can't imagine that the author meant it to be this way.

    You can see what I'm trying to do (well, actually I'll be parsing a href links out in an effort to combat chatroom spambots). Is there another method you'd suggest?

    It was looking at the docs that convinced me that Text::Balanced was the right thing for me. I'd be using the 5th (#4) element. I haven't found another lib that'll supply just the stripped URL inside the tag yet, and while Perl seems super-cool for text handling (I'm a duffer), I'd rather not reinvent the wheel.

    In any case, thanks very much for your reply!

      There are several packages based on HTML::Parser, such as HTML::LinkExtor, that shouldn't require you to invent too many wheels. I would take a look at that.

      I would avoid at all costs using a regular expression to extract links. That's just a path to problems.
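
As a rough illustration of the HTML::LinkExtor route, here is a minimal sketch that collects the href of every <a> tag in a line of text. The extract_hrefs() helper, the sample chat line, and the example.com URL are made up for this example. It assumes HTML::LinkExtor is installed and that no base URL is needed, so the links come back as plain strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the href value of every <a> tag in some HTML.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;

    # The callback receives the tag name followed by attribute name/value pairs.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    });

    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# Made-up chat line with text before and after the link.
my $line = 'hey check out <a href="http://example.com/spam">this site</a> now';
print "$_\n" for extract_hrefs($line);
```

Because the parser works on the whole string, text before or after the tag does not matter here, which sidesteps the leading-character problem described earlier in the thread.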

      --Chris

      e-mail jcwren
