regexp solutions

programmer.perl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, for many hours I've been trying to extract the text I need from an HTML page (fetched with LWP::UserAgent). I need the text that is not inside HTML tags, i.e. outside the '< >' brackets. I've written a lot of code, but it seems my regexps are not so good :-). I'll be happy if we can solve it together, thanks...

example of html text: <h1>Volume Leaders</h1></div><div class="yfitabs"><div id="yfitt" class="yfitt"><ul><li class="on"><a href="/actives?e=us"><em>US</em></a></li><li><a href="/actives?e=o"><em>NASDAQ</em></a></li><li><a href="/actives?e=aq"><em>AMEX</em></a></li><li><a href="/actives?e=nq"><em class="last">NYSE</em></a></li></ul></div><div id="yfitp" class="yfitabsc"><table cellpadding="0" cellspacing="0"><thead><th class="first">Symbol</th><th class="second">Name</th><th>Last Trade</th><th>Change</th><th>Volume</th><th class="last">Related Info</th></thead><tbody><tr><td class="first"><b><a href="/q?s=BAC">BAC</a></b></td><td class="second name">Bank of America Corporation Com</td><td class="last_trade"><b><span id="yfs_l10_bac">8.06</span></b> <nobr><span id="yfs_t10_bac">Jul 3</span></nobr></td><td><span id="yfs_c10_bac"><img width="10" height="14" style="margin-right:-2px;" border="0" src="http://l.yimg.com/a/i/us/fi/03rd/up_g.gif" alt="Up"> <b style="color:#008800;">0.01</b></span> <span id="yfs_p20_bac"><b style="color:#008800;"> (0.12%)</b></span></td><td><span id="yfs_v00_bac">57,655,357</span></td><td class="last"><a href="/q/bc?s=BAC">Chart</a>, <a href="/q/pr?s=BAC">Profile</a>, <a href="/q?s=BAC">More</a></td></tr><tr><td class="first"><b><a href="/q?s=SIRI">SIRI</a></b></td><td class="second name">Sirius XM Radio Inc.</td><td class="last_trade"><b><span id="yfs_l10_siri">2.04</span></b> <nobr><span id="yfs_t10_siri">Jul 3</span></nobr></td><td><span id="yfs_c10_siri"><img width="10" height="14" style="margin-right:-2px;" border="0" src="http://l.yimg.com/a/i/us/fi/03rd/up_g.gif" alt="Up"> <b style="color:#008800;">0.06</b></span> <span id="yfs_p20_siri"><b style="color:#008800;"> (2.77%)</b></span></td><td><span id="yfs_v00_siri">53,607,894</span></td><td class="last"><a href="/q/bc?s=SIRI">Chart</a>, <a href="/q/pr?s=SIRI">Profile</a>, <a href="/q?s=SIRI">More</a></td></tr><tr><td class="first"><b><a href="/q?s=F">F</a></b></td><td class="second 
name">Ford Motor Company Common Stock</td><td class="last_trade"><b>

EXAMPLE OF CODE:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $ua = LWP::UserAgent->new;

# Define user agent type
$ua->agent('MyApp/0.1 ');

# Request object
my $req = GET 'http://finance.yahoo.com/actives?e=us';

# Make the request
my $res = $ua->request($req);

#my @con = $res->content;
$res->content =~ /(<div class="yfitbg tbtabs">)(<h1>Volume Leaders.*)(Get a)/s;
my $cont = $2;
print $cont, "\n";


exit 0;

Replies are listed 'Best First'.
Re: regexp solutions by Corion (Patriarch) on Jul 04, 2012 at 08:57 UTC
Just use the `->text` method of WWW::Mechanize:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $url = 'http://example.com';
$mech->get( $url );
print $mech->text;
Re: regexp solutions by zentara (Archbishop) on Jul 04, 2012 at 11:47 UTC
You can also use HTML::TokeParser::Simple. I'll leave the regex of the text up to you. :-)

P.S. HTML::Strip may also be useful to you, see "Stripping HTML tags efficiently".

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);
use HTML::TokeParser::Simple;

my $ua = LWP::UserAgent->new;

# Define user agent type
$ua->agent('MyApp/0.1 ');

# Request object
my $req = GET 'http://finance.yahoo.com/actives?e=us';

# Make the request
my $res = $ua->request($req);

my $con = $res->content;
#print "$con\n";

my $p = HTML::TokeParser::Simple->new( \$con );

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    print $token->as_is, "\n";
}

exit 0;

I'm not really a human, but I play one on earth.
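A minimal sketch of the same token loop, run against a fragment of the sample HTML from the question so it can be tried without fetching the page. The only addition to the loop above is the `/\S/` check, which is an assumption about what "needed text" means: it drops the whitespace-only text tokens that sit between tags. The names `$html` and `@texts` are illustrative and do not appear in the original posts.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Fragment copied from the sample HTML in the question (one table row, abridged)
my $html = '<tr><td class="first"><b><a href="/q?s=BAC">BAC</a></b></td>'
         . '<td class="second name">Bank of America Corporation Com</td>'
         . '<td class="last_trade"><b><span id="yfs_l10_bac">8.06</span></b> '
         . '<nobr><span id="yfs_t10_bac">Jul 3</span></nobr></td></tr>';

my $p = HTML::TokeParser::Simple->new( \$html );

my @texts;
while ( my $token = $p->get_token ) {
    next unless $token->is_text;     # skip tags, comments, etc.
    my $text = $token->as_is;
    next unless $text =~ /\S/;       # assumption: whitespace-only text is not wanted
    push @texts, $text;
}

print join( ' | ', @texts ), "\n";
```

With this fragment the loop should keep four pieces of text (the symbol, the company name, the last trade price and the date) and skip the single space between `</b>` and `<nobr>`.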
Re^2: regexp solutions by programmer.perl (Beadle) on Jul 04, 2012 at 14:27 UTC
Thank you, zentara, it is working. It took me a bit of time to download and install the modules (I'm installing modules for the first time :) using the Software Center on Ubuntu 12... The remaining work is about designing the output and making readable lines... Here, the link was for US; I also have to collect data from these sublinks:

NASDAQ: http://finance.yahoo.com/actives?e=o
AMEX: http://finance.yahoo.com/actives?e=aq
NYSE: http://finance.yahoo.com/actives?e=nq

Is it possible to copy-paste this script (the one you wrote for me) below itself and change the URL to another one (one of the three links above)? I'm planning to take data from four URLs in one script; is that possible?
Re^3: regexp solutions by zentara (Archbishop) on Jul 04, 2012 at 14:51 UTC
"Is it possible to copy-paste this script (the one you wrote for me) below itself and change the URL to another one (one of the three links above)? I'm planning to take data from four URLs in one script; is that possible?"

Sure, it should be as simple as putting it all in a loop. Just put your URLs into single-quoted strings, separated by commas, as shown below.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);
use HTML::TokeParser::Simple;

my $ua = LWP::UserAgent->new;

# Define user agent type
$ua->agent('MyApp/0.1 ');

my @requests = (
    'http://finance.yahoo.com/actives?e=us',
    'http://finance.yahoo.com/actives?e=o',
    'http://finance.yahoo.com/actives?e=aq',
    'http://finance.yahoo.com/actives?e=nq',
);

# loop thru them
foreach my $requested ( @requests ) {

    print "STARTING $requested ###########################\n\n\n\n\n";

    # Request object
    my $req = GET $requested;

    # Make the request
    my $res = $ua->request($req);

    my $con = $res->content;
    #print "$con\n";

    my $p = HTML::TokeParser::Simple->new( \$con );

    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next unless $token->is_text;
        print $token->as_is, "\n";
    }

    print "ENDING $requested ###########################\n\n\n\n\n\n";

} # end of loop

exit 0;
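For the "readable lines" part mentioned earlier in the thread, one possible sketch is to collect the text tokens of each table row and emit one tab-separated line per `<tr>`. This is an assumption about the desired layout, not something posted by zentara. The HTML below is an abridged two-row version of the sample from the question, and `$html`, `@rows` and `@cells` are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Abridged from the sample HTML in the question: two table rows, three cells each
my $html = join '',
    '<tbody>',
    '<tr><td class="first"><b><a href="/q?s=BAC">BAC</a></b></td>',
    '<td class="second name">Bank of America Corporation Com</td>',
    '<td><span id="yfs_v00_bac">57,655,357</span></td></tr>',
    '<tr><td class="first"><b><a href="/q?s=SIRI">SIRI</a></b></td>',
    '<td class="second name">Sirius XM Radio Inc.</td>',
    '<td><span id="yfs_v00_siri">53,607,894</span></td></tr>',
    '</tbody>';

my $p = HTML::TokeParser::Simple->new( \$html );

my ( @rows, @cells );
while ( my $token = $p->get_token ) {
    if ( $token->is_end_tag('tr') ) {
        # A row just ended: turn its collected cells into one line
        push @rows, join( "\t", @cells ) if @cells;
        @cells = ();
    }
    elsif ( $token->is_text ) {
        my $text = $token->as_is;
        $text =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
        push @cells, $text if length $text;   # ignore empty text tokens
    }
}

print "$_\n" for @rows;
```

Inside the loop over `@requests`, the same idea would apply to `$con` in place of `$html`; whether the live page still uses this table structure is not something the thread confirms.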
Re^4: regexp solutions by programmer.perl (Beadle) on Jul 05, 2012 at 07:27 UTC
Re^4: regexp solutions by programmer.perl (Beadle) on Jul 05, 2012 at 16:59 UTC
Re^5: regexp solutions by zentara (Archbishop) on Jul 05, 2012 at 17:19 UTC
Re: regexp solutions by Anonymous Monk on Jul 04, 2012 at 09:12 UTC
"Hi, for many hours I've been trying to extract the text I need from an HTML page (fetched with LWP::UserAgent). I need the text that is not inside HTML tags, i.e. outside the '< >' brackets. I've written a lot of code, but it seems my regexps are not so good :-). I'll be happy if we can solve it together, thanks..."

Why not post your code?
Re^2: regexp solutions by programmer.perl (Beadle) on Jul 04, 2012 at 09:24 UTC
I added the code.
