PerlMonks  

Parsing HTML tags with regex

by kye (Initiate)
on Oct 03, 2002 at 01:35 UTC (#202414=perlquestion)
kye has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I would like help with a problem I'm currently having.

I want to parse a page using regexes. I want to get the whole Pure-Dream table row off http://pvpgnservers.ath.cx/ and then parse it so that a variable ends up with

"217.172.178.113 Pure-Dream Europe 0d 00:40 DreamDiver PvPGN BnetD Mod 1.1.6 Linux 42 9"

Then I can split the variable and output it in the shell. This is a stats monitor script that goes to that page and grabs the number of users on the Pure-Dream server. I think I can do the splitting and assign the result to an array; all I'm asking is how to parse that page so that the result is "217.172.178.113 Pure-Dream Europe 0d 00:40 DreamDiver PvPGN BnetD Mod 1.1.6 Linux 42 9".
Oh, and I can't use the HTML::X modules, so please don't ask me why I don't just use modules. I want to do this with regexes so I can learn, too. Thanks!

Currently I have:

    #!/usr/bin/perl
    use LWP::Simple;
    $html = get("http://pvpgnservers.ath.cx");

Thanks in advance.

Directed here by: STrRedWolf
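A note on the splitting step mentioned in the question: a plain whitespace split will not give back the eight table columns, because some columns ("0d 00:40", "PvPGN BnetD Mod 1.1.6 Linux") contain spaces themselves. The user and game counts are the last two tokens, though, so they can be read from the end. A minimal sketch, assuming the string always has the shape quoted above; `last_two_counts` is a made-up helper name, and the row is the sample from the question rather than live data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the last two whitespace-separated tokens,
# which in the sample row are the user count and the game count.
sub last_two_counts {
    my ($row) = @_;
    my @tokens = split ' ', $row;    # split on runs of whitespace
    return @tokens[-2, -1];
}

# Sample row copied from the question (not fetched from the site).
my $row = '217.172.178.113 Pure-Dream Europe 0d 00:40 DreamDiver '
        . 'PvPGN BnetD Mod 1.1.6 Linux 42 9';

my ($users, $games) = last_two_counts($row);
print "users=$users games=$games\n";
```

Keeping the cells in an array from the start, as the replies below do, avoids the ambiguity altogether.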

Re: Parsing HTML tags with regex
by samurai (Monk) on Oct 03, 2002 at 01:45 UTC
    Parsing HTML code (correctly) with hand-crafted regexes is not a feat to be undertaken lightly. It has been known to cause chronic headaches in hobbyists and professionals alike. And then, you have to worry about parsing erroneous HTML code...

    There's a very good reason why people recommend you use an HTML::* module. But I suppose you should go ahead. You'll learn more than just a bit about regexes, you'll learn why CPAN is so important to the community.

    --
    perl: code of the samurai
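To make the headache concrete, here is a small sketch of how a pattern tied to one exact spelling of a cell stops matching after markup changes that a browser would not even notice. The three snippets are invented for illustration (they are not taken from the page), and `cell_text` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A pattern tied to one exact spelling of the markup.
my $strict = qr{<td align=right><font size=1>([^<]+)</font></td>};

# Hypothetical helper: returns the captured cell text, or undef if the
# strict pattern does not match.
sub cell_text {
    my ($html) = @_;
    return $html =~ $strict ? $1 : undef;
}

# Three cells a browser would render identically (made-up samples).
my @variants = (
    '<td align=right><font size=1>42</font></td>',        # as written
    '<td align="right"><font size="1">42</font></td>',    # quoted attributes
    '<TD ALIGN=right><FONT SIZE=1>42</FONT></TD>',        # upper-case tags
);

for my $html (@variants) {
    my $got = cell_text($html);
    printf "%-8s %s\n", (defined $got ? 'match' : 'no match'), $html;
}
```

Only the first variant matches. Each extra case can be patched into the regex, but the pattern grows with every one, which is the cost the reply is warning about.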

      ...and once you're tired of it, check out HTML::TableExtract, which was practically written with your exact problem in mind :-)

      HTH,

      Tim

      Update: Okay, I'm lazy, I didn't post the code to actually do the job (mainly because I think it really is that trivial). But the wonderful blakem submitted this node to another thread describing what I was thinking. So go upvote him instead :-)
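For readers who want to see roughly what an HTML::TableExtract approach looks like, here is a minimal sketch. It is not the code from the node referred to above. The sample table is a trimmed-down stand-in shaped like the rows quoted in this thread, the choice of headers is an assumption, and `row_for` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Trimmed-down sample (assumed shape, not fetched from the site).
my $html = <<'HTML';
<table>
<tr><th>Address</th><th>Description/URL</th><th>Users</th><th>Games</th></tr>
<tr><td><font size=1>211.62.58.113</font></td><td>unknown</td><td>1158</td><td>320</td></tr>
<tr><td><font size=1>217.172.178.113</font></td><td>Pure-Dream</td><td>42</td><td>9</td></tr>
</table>
HTML

# Hypothetical helper: returns the Address, Description and Users cells
# of the row whose Address equals $ip, or an empty list if none matches.
sub row_for {
    my ($page, $ip) = @_;
    # Header strings are matched against the header cells of each table.
    my $te = HTML::TableExtract->new(headers => [qw(Address Description Users)]);
    $te->parse($page);
    $te->eof;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            # Normalise undefined cells and trim surrounding whitespace.
            my @cells = map {
                my $c = defined $_ ? $_ : '';
                $c =~ s/^\s+|\s+$//g;
                $c;
            } @$row;
            return @cells if $cells[0] eq $ip;
        }
    }
    return;
}

print join(' | ', row_for($html, '217.172.178.113')), "\n";
```

Selecting columns by header text means the script keeps working if the site reorders its columns, which a positional regex would not.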

Re: Parsing HTML tags with regex
by BrowserUk (Pope) on Oct 03, 2002 at 04:00 UTC

    Re-instated as requested.

    #! perl -sw
    use strict;
    use LWP::Simple;

    my $html = get("http://pvpgnservers.ath.cx");

    my @stuff = $html =~ m!
        <tr>\s+
        <td><font\ssize=1><a\shref="bnetd://217.172.178.113/">([^<]+?)</a></font></td>\s+
        <td><a\starget="_blank"\shref="http://www.pure-dream.com"><font\ssize=1>([^<]+?)</font></a></td>\s+
        <td><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td><font\ssize=1><a\shref="mailto:webmaster\@pure-dream.com">([^<]+?)</a></font></td>\s+
        <td><font\ssize=1>([^<]+)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        </tr>\s+
        <tr>
    !sx;

    print "@stuff\n";

    __DATA__
    C:\test>202414
    217.172.178.113 Pure-Dream Europe 0d 00:40 DreamDiver PvPGN&nbsp;BnetD Mod 1.1.6 Linux 42 9
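The pattern above spells out every tag and attribute of the row. A looser two-step variant is sketched below, still regex-only: first isolate the `<tr>` that mentions the address, then pull the text out of each `<td>` and strip whatever tags remain. It assumes rows are not nested and every cell is a `<td>...</td>` pair. The sample HTML is abbreviated from the rows quoted in this thread, and `row_cells` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the text of each <td> in the first row
# that contains $ip, or an empty list if no row does.
sub row_cells {
    my ($html, $ip) = @_;
    # Step 1: walk the rows one at a time (non-greedy, case-insensitive).
    while ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my $row = $1;
        next unless index($row, $ip) >= 0;    # plain substring test
        # Step 2: take each cell body, drop inner tags, tidy whitespace.
        my @cells;
        while ($row =~ m{<td\b[^>]*>(.*?)</td>}gis) {
            my $text = $1;
            $text =~ s/<[^>]+>//g;      # strip <font>, <a>, ...
            $text =~ s/&nbsp;/ /g;      # the only entity in the sample
            $text =~ s/^\s+|\s+$//g;    # trim both ends
            push @cells, $text;
        }
        return @cells;
    }
    return;
}

# Abbreviated sample: three of the eight columns, two rows.
my $html = <<'HTML';
<tr>
  <td><font size=1><a href="bnetd://211.62.58.113/">211.62.58.113</a></font></td>
  <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
  <td align=right><font size=1>1158</font></td>
</tr>
<tr>
  <td><font size=1><a href="bnetd://217.172.178.113/">217.172.178.113</a></font></td>
  <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
  <td align=right><font size=1>42</font></td>
</tr>
HTML

my @stuff = row_cells($html, '217.172.178.113');
print join(' ', @stuff), "\n";
```

This survives changes to attributes and inner tags, but it still breaks on nested tables, comments containing `<tr>`, or a `>` inside an attribute value, so the caveats from the other replies apply.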
Re: Parsing HTML tags with regex
by davorg (Chancellor) on Oct 03, 2002 at 13:23 UTC

    I know you don't want to use HTML::foo (tho' you never explain why) but in the interests of having at least one "best practices" answer listed here an HTML::TreeBuilder solution is given below:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $page = get ('http://pvpgnservers.ath.cx/') or die;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($page);
    $tree->eof;    # signal end of input so the tree is complete

    my @trs = $tree->find_by_tag_name('tr');
    my @stuff;
    foreach my $row (@trs) {
      # Keep the cells of the first row whose text starts with the address
      if ($row->as_text =~ /^217\.172\.178\.113/) {
        @stuff = map { ref $_ ? $_->as_text : $_ } $row->content_list;
        last;
      }
    }

    print "@stuff\n";
    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      See also this other node in the other thread; I take a slightly different approach to the problem, but also using HTML::TreeBuilder. TIMTOWTDI, indeed.

      perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Parsing HTML tags with regex
by McD (Chaplain) on Oct 03, 2002 at 13:30 UTC
    This article is a good place to start. It describes how to do what you want, and some of the risks of doing it that way.

    Peace,
    -McD

Re: Parsing HTML tags with regex
by PodMaster (Abbot) on Oct 03, 2002 at 14:14 UTC
    I like HTML::TokeParser a lot, but I LOOOOVE HTML::TokeParser::Simple, so here is an example ('cause the others touted such memory hogs as HTML::TreeBuilder, and HTML::Parser doesn't fit this task):
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple; # friendlier tokens
    use LWP::Simple;

    my $html = get("http://pvpgnservers.ath.cx");

    =head1 MY Test HTML

    The "TH" is the 1st trimmeg, so we gotta "seek" to it.
    Next is a check, to make sure there is a link to index_address.html
    And if that passes, it means the html ain't changed significantly, so
    LOOOOOOOOOOOOOOOP while we got TR's {
        eat a TD and get_trimmed_text 8 times in a row
    }

    my $html = q{<tr>
    <th bgcolor="#808080"><a href="index_adress.html"><font size=2>Address</font></a></th>
    <th bgcolor="#808080"><a href="index_description.html"><font size=2>Description/URL</font></a></th>
    <th bgcolor="#808080"><a href="index_location.html"><font size=2>Location</font></a></th>
    <th bgcolor="#808080"><a href="index_uptime.html"><font size=2>Uptime</font></a></th>
    <th bgcolor="#808080"><a href="index_contact.html"><font size=2>Contact</font></a></th>
    <th bgcolor="#808080"><a href="index_software.html"><font size=2>Software</font></a></th>
    <th bgcolor="#808080"><a href="index_users.html"><font size=2>Users</font></a></th>
    <th bgcolor="#808080"><a href="index_games.html"><font size=2>Games</font></a></th>
    </tr>
    <tr>
    <td><font size=1><a href="bnetd://211.62.58.113/">211.62.58.113</a></font></td>
    <td><a target="_blank" href="unknown"><font size=1>unknown</font></a></td>
    <td><font size=1>unknown</font></td>
    <td align=right><font size=1>0d 03:26</font></td>
    <td><font size=1><a href="mailto:unknown">a PvPGN user</a></font></td>
    <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
    <td align=right><font size=1>1158</font></td>
    <td align=right><font size=1>320</font></td>
    </tr>
    };

    =cut

    my $p = HTML::TokeParser::Simple->new(\$html);

    $p->get_tag('th') or die "crap";

    die "change code, stuff changed"
        unless $p->get_tag('a')->return_attr->{href} =~ /index_adress.html/i;

    while( my $t = $p->get_tag('tr') ) {
        for(1..8){
            $p->get_tag('td'); # cause the next token ain't "text"
            print $p->get_trimmed_text('/td')."\n";
        }
    }
    Here are some other examples of HTML::TokeParser and/or HTML::TokeParser::Simple usage.

    You can get even more by using super search to look for "use HTML::TokeParser" within text.

    Re: Requesting webpages which use cookies and session ids. (rev)
    What holiday is today? <!-- googleholiday.pl -->
    (crazyinsomniac) Re: Getting the Linking Text from a page
    (crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
    download code from scratchpad
    HTML::TokeParser token dumper
    (crazyinsomniac) Re: HTML Link Modifier
    Re: Re: (crazyinsomniac) Re: Extract info from HTML
    (crazyinsomniac) Re: Extract info from HTML
    (crazyinsomniac) Re: parsing HTML
    Re: Parsing HTML tags with regex

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

    Edit by tye to remove PRE tags around very long lines
