
Parsing HTML tags with regex

by kye (Initiate)
on Oct 03, 2002 at 01:35 UTC (#202414, perlquestion)

kye has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I would like help with a problem I'm currently having.

I want to parse a page using regexes. I want to pull the whole Pure-Dream table row out of the page, then parse it so that a variable ends up holding

"Pure-Dream Europe 0d 00:40 DreamDiver PvPGN BnetD Mod 1.1.6 Linux 42 9"

Then I can split the variable and output it in the shell. This is a stats monitor script that goes to that page and grabs the number of users on the Pure-Dream server. I think I can do the splitting and assign the result to an array; all I'm asking is how to parse that page so that the result is "Pure-Dream Europe 0d 00:40 DreamDiver PvPGN BnetD Mod 1.1.6 Linux 42 9". Oh, and I can't use the HTML::X modules, so please don't ask me why I don't just use modules. I want to do this with regexes so I can learn too. Thanks!

Currently I have:

#!/usr/bin/perl
use LWP::Simple;
$html = get("");

Thanks in advance.

Directed here by: STrRedWolf

Replies are listed 'Best First'.
Re: Parsing HTML tags with regex
by BrowserUk (Pope) on Oct 03, 2002 at 04:00 UTC

    Re-instated as requested.

    #! perl -sw
    use strict;
    use LWP::Simple;

    my $html = get("");

    # One capture group per table cell; /x lets the pattern span lines,
    # /s lets it run across the newlines between cells.
    my @stuff = $html =~ m!
        <tr>\s+
        <td><font\ssize=1><a\shref="bnetd://">([^<]+?)</a></font></td>\s+
        <td><a\starget="_blank"\shref=""><font\ssize=1>([^<]+?)</font></a></td>\s+
        <td><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td><font\ssize=1><a\shref="mailto:webmaster\">([^<]+?)</a></font></td>\s+
        <td><font\ssize=1>([^<]+)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        <td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
        </tr>\s+
        <tr>
    !sx;

    print "@stuff\n";

    __DATA__
    C:\test>202414
    Pure-Dream Europe 0d 00:40 DreamDiver PvPGN&nbsp;BnetD Mod 1.1.6 Linux 42 9
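For comparison, here is a looser sketch of the same regex idea. Rather than spelling out every tag in the row, it pulls out each `<tr>...</tr>` block, keeps the first one that mentions the server name, and strips the tags from each cell. The sample HTML, the `example.invalid` addresses and the `row_cells` helper are all assumptions made up for illustration; the real page may well differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample modelled on the rows quoted in this thread;
# the addresses are placeholders, not the real ones.
my $html = <<'HTML';
<tr>
 <td><font size=1><a href="bnetd://example.invalid">example.invalid</a></font></td>
 <td><a target="_blank" href="http://example.invalid/"><font size=1>Pure-Dream</font></a></td>
 <td><font size=1>Europe</font></td>
 <td align=right><font size=1>0d 00:40</font></td>
 <td><font size=1><a href="mailto:webmaster@example.invalid">DreamDiver</a></font></td>
 <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
 <td align=right><font size=1>42</font></td>
 <td align=right><font size=1>9</font></td>
</tr>
HTML

# row_cells is a hypothetical helper: it returns the text of each <td> in
# the first <tr> whose content mentions $name, or an empty list if none does.
sub row_cells {
    my ($html, $name) = @_;
    for my $row ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        next unless $row =~ /\Q$name\E/;
        my @cells = $row =~ m{<td\b[^>]*>(.*?)</td>}gis;
        for (@cells) {
            s/<[^>]+>//g;      # drop the tags left inside the cell
            s/&nbsp;/ /g;      # the only entity used in the sample
            s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        }
        return @cells;
    }
    return;
}

my @stuff = row_cells($html, 'Pure-Dream');

# Skip the address cell and print the rest on one line.
print join(' ', @stuff[1 .. $#stuff]), "\n";
printf "users: %s, games: %s\n", @stuff[-2, -1];
```

This is still regex parsing, so it will misbehave on nested tables, on a `>` inside an attribute value, or on commented-out rows. It only tolerates changes in attributes and whitespace better than a pattern that names every tag.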
Re: Parsing HTML tags with regex
by samurai (Monk) on Oct 03, 2002 at 01:45 UTC
    Parsing HTML code (correctly) with hand-crafted regexes is not a feat to be undertaken lightly. It has been known to cause chronic headaches in hobbyists and professionals alike. And then, you have to worry about parsing erroneous HTML code...

    There's a very good reason why people recommend you use an HTML::* module. But I suppose you should go ahead. You'll learn more than just a bit about regexes; you'll also learn why CPAN is so important to the community.
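A minimal sketch of the kind of headache meant here, using made-up markup: a pattern written against one exact rendering of a cell stops matching as soon as the page quotes an attribute value or changes case, even though a browser displays the same thing. The `$strict` pattern and the three sample strings are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A pattern tied to one exact spelling of the markup (hypothetical example).
my $strict = qr{<td><font\ssize=1>([^<]+)</font></td>};

# Three renderings a browser would display identically.
my @variants = (
    '<td><font size=1>42</font></td>',      # what the pattern expects
    '<td><font size="1">42</font></td>',    # attribute value quoted
    '<TD><FONT SIZE=1>42</FONT></TD>',      # upper-case tags
);

for my $cell (@variants) {
    if (my ($users) = $cell =~ $strict) {
        print "matched:  $cell -> $users\n";
    }
    else {
        print "no match: $cell\n";
    }
}
```

Only the first variant matches. Adding `/i` and optional quotes would fix these two cases, but each fix of that kind handles one more variation and leaves the next one waiting.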

    perl: code of the samurai

      ...and once you're tired of it, check out HTML::TableExtract, which was practically written with your exact problem in mind :-)



      Update: Okay, I'm lazy, I didn't post the code to actually do the job (mainly because I think it really is that trivial). But the wonderful blakem submitted this node to another thread describing what I was thinking. So go upvote him instead :-)

Re: Parsing HTML tags with regex
by davorg (Chancellor) on Oct 03, 2002 at 13:23 UTC

    I know you don't want to use HTML::foo (tho' you never explain why), but in the interests of having at least one "best practices" answer listed here, an HTML::TreeBuilder solution is given below:

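    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $page = get ('') or die;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($page);
    $tree->eof;    # tell the parser the document is complete

    my @trs = $tree->find_by_tag_name('tr');
    my @stuff;

    # Take the first row whose text starts with the server's address
    foreach my $row (@trs) {
      if ($row->as_text =~ /^217\.172\.178\.113/) {
        # content_list mixes elements and plain strings; flatten both to text
        @stuff = map { ref $_ ? $_->as_text : $_ } $row->content_list;
        last;
      }
    }

    print "@stuff\n";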

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      See also this other node in the other thread; I take a slightly different approach to the problem, but also using HTML::TreeBuilder. TIMTOWTDI, indeed.

      perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Parsing HTML tags with regex
by McD (Chaplain) on Oct 03, 2002 at 13:30 UTC
    This article is a good place to start. It describes how to do what you want, and some of the risks of doing it that way.


Re: Parsing HTML tags with regex
by PodMaster (Abbot) on Oct 03, 2002 at 14:14 UTC
    I like HTML::TokeParser a lot, but I LOOOOVE HTML::TokeParser::Simple, so here is an example (because the others touted such memory hogs as HTML::TreeBuilder, and HTML::Parser doesn't fit this task):
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple; # friendlier tokens
    use LWP::Simple;

    my $html = get("");

    =head1 MY Test HTML

    The "TH" is the 1st trimmeg, so we gotta "seek" to it.
    Next is a check, to make sure there is a link to index_address.html
    And if that passes, it means the html ain't changed significantly, so
    LOOOOOOOOOOOOOOOP while we got TR's {
        eat a TD and get_trimmed_text 8 times in a row
    }

    my $html = q{<tr>
     <th bgcolor="#808080"><a href="index_adress.html"><font size=2>Address</font></a></th>
     <th bgcolor="#808080"><a href="index_description.html"><font size=2>Description/URL</font></a></th>
     <th bgcolor="#808080"><a href="index_location.html"><font size=2>Location</font></a></th>
     <th bgcolor="#808080"><a href="index_uptime.html"><font size=2>Uptime</font></a></th>
     <th bgcolor="#808080"><a href="index_contact.html"><font size=2>Contact</font></a></th>
     <th bgcolor="#808080"><a href="index_software.html"><font size=2>Software</font></a></th>
     <th bgcolor="#808080"><a href="index_users.html"><font size=2>Users</font></a></th>
     <th bgcolor="#808080"><a href="index_games.html"><font size=2>Games</font></a></th>
    </tr>
    <tr>
     <td><font size=1><a href="bnetd://"></a></font></td>
     <td><a target="_blank" href="unknown"><font size=1>unknown</font></a></td>
     <td><font size=1>unknown</font></td>
     <td align=right><font size=1>0d 03:26</font></td>
     <td><font size=1><a href="mailto:unknown">a PvPGN user</a></font></td>
     <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
     <td align=right><font size=1>1158</font></td>
     <td align=right><font size=1>320</font></td>
    </tr>
    };

    =cut

    my $p = new HTML::TokeParser::Simple(\$html);

    $p->get_tag('th') or die "crap";

    die "change code, stuff changed"
        unless $p->get_tag('a')->return_attr->{href} =~ /index_adress.html/i;

    while( my $t = $p->get_tag('tr') ) {
        for(1..8){
            $p->get_tag('td'); # cause the next token ain't "text"
            print $p->get_trimmed_text('/td')."\n";
        }
    }

    Here are some other examples of HTML::TokeParser and/or HTML::TokeParser::Simple usage.

    You can get even more by using super search to look for "use HTML::TokeParser" within text.

    Re: Requesting webpages which use cookies and session ids. (rev)
    What holiday is today?
    (crazyinsomniac) Re: Getting the Linking Text from a page
    (crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
    download code from scratchpad
    HTML::TokeParser token dumper
    (crazyinsomniac) Re: HTML Link Modifier
    Re: Re: (crazyinsomniac) Re: Extract info from HTML
    (crazyinsomniac) Re: Extract info from HTML
    (crazyinsomniac) Re: parsing HTML
    Re: Parsing HTML tags with regex

    ** The Third rule of perl club is a statement of fact: pod is sexy.

    Edit by tye to remove PRE tags around very long lines
