[Solved]: Query about regular expression

by Perl300 (Friar)
on Sep 23, 2015 at 21:59 UTC (#1142838)

Perl300 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

Please forgive me for asking something that might be very simple. I have a string variable $form_dump that is basically the output of $query->Dump; from an HTML form submit. Everything in the form ends up in this string variable, which is multiline. Its contents look like:

<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer 25 60 30 60 ^M<br />
calls_waiting 300 500 300 500 ^M<br />
many more rows here
post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
<ul>
<li>300</li>
</ul>
<li><strong>calls_waiting_warning_low</strong></li>
<ul>

What I am trying to do is to get each line between the fourth pair of <li> & </li> and put those lines in an array. So far I have tried:

if ( $form_dump =~ m{(<li>)(.*?)(</li>)}s ) {
    my $inside_li = $2;
    print $fh "Value of \$inside_li: $inside_li \n";
}

This gives the text between the first pair of <li> & </li>, which is <strong>site_user</strong>.

Can I get some clues on how to skip the first three pairs of <li> & </li> and then stop after the text between the fourth pair is pulled? I had tried something like $form_dump =~ m{(<li>)(^average_.*?)(</li>)}s in an attempt to get only the text that starts with "average_", but it doesn't match anything and $inside_li prints blank.

I am doing something wrong or missing something there.

Update: Updated subject line to mark solved.

Replies are listed 'Best First'.
Re: Query about regular expression (HTML Parser)
by jeffa (Bishop) on Sep 23, 2015 at 23:49 UTC

    I would use a parser for that instead. Here's some code to get you started should you choose this path:

    use strict;
    use warnings;
    use Data::Dumper;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new( *DATA );

    my $count = 0;
    my @array;
    while ( my $token = $p->get_token ) {
        next unless $token->is_start_tag('li');
        next unless ++$count > 3;
        while ( $token = $p->get_token ) {
            last if $token->is_end_tag('li');
            my $text = $token->as_is;
            $text =~ s/^\s*//;
            $text =~ s/\s*$//;
            push @array, $text unless $token->is_tag('br');
        }
        last;
    }

    print Dumper \@array;

    __DATA__
    <ul>
    <li><strong>site_user</strong></li>
    <ul>
    <li>user1</li>
    </ul>
    <li><strong>compare_hidden</strong></li>
    <ul>
    <li>average_speed_answer 25 60 30 60 ^M<br />
    calls_waiting 300 500 300 500 ^M<br />
    many more rows here
    post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
    post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
    </li>
    <li><strong>calls_waiting_good_high</strong></li>
    <ul>
    <li>300</li>
    </ul>
    <li><strong>calls_waiting_warning_low</strong></li>
    <ul>

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Too much work :) HTML::TreeBuilder::XPath is less work
      #!/usr/bin/perl --
      use strict;
      use warnings;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->ignore_unknown(0);
      $tree->implicit_tags(0);
      $tree->no_expand_entities(1);
      $tree->ignore_ignorable_whitespace(0);
      $tree->no_space_compacting(1);
      $tree->store_comments(1);
      $tree->store_pis(1);
      $tree->parse(q{
      <ul>
      <li><strong>site_user</strong></li>
      <ul>
      <li>user1</li>
      </ul>
      <li><strong>compare_hidden</strong></li>
      <ul>
      <li>average_speed_answer 25 60 30 60 ^M<br />
      calls_waiting 300 500 300 500 ^M<br />
      many more rows here
      post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
      post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
      </li>
      <li><strong>calls_waiting_good_high</strong></li>
      <ul>
      <li>300</li>
      </ul>
      <li><strong>calls_waiting_warning_low</strong></li>
      <ul>});
      $tree->eof;

      my @li = $tree->findnodes( q{ //li[ contains( ., 'average' ) ] } );
      for my $ll ( @li ) {
          $ll->dump;
          print $ll->as_text, "\n";
      }
      __END__
      <li> @0.1.7.1
        "average_speed_answer 25 60 30 60 ^M"
        <br /> @0.1.7.1.1
        "\x0a calls_waiting 300 500 300 500 ^M"
        <br /> @0.1.7.1.3
        "\x0a many more rows here\x0a post_ivr_calls_handled Wisconsin 50 100 50..."
        <br /> @0.1.7.1.5
        "\x0a post_ivr_calls_handled Wyoming 50 100 50 100 ^M"
        <br /> @0.1.7.1.7
        "\x0a"
      average_speed_answer 25 60 30 60 ^M
      calls_waiting 300 500 300 500 ^M
      many more rows here
      post_ivr_calls_handled Wisconsin 50 100 50 100 ^M
      post_ivr_calls_handled Wyoming 50 100 50 100 ^M
Re: Query about regular expression
by ww (Archbishop) on Sep 23, 2015 at 22:12 UTC

    If the line you want always begins with <li>average_, then match on that (including or excluding the <li>s from your capture, according to your taste) rather than trying to count <li>s.

    If not, you'll have to use a more complex solution... for which you may wish to read in the regex docs about "positive look ahead" and variations.
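A minimal sketch of the suggestion above, which also shows why the `^average_` attempt from the question cannot match. The `$form_dump` value here is a shortened, hypothetical stand-in for the real form dump, and the look-ahead variant is only one possible reading of the "positive look ahead" hint.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the $form_dump string in the question.
my $form_dump = join "\n",
    '<ul>',
    '<li><strong>site_user</strong></li>',
    '<ul> <li>user1</li> </ul>',
    '<li><strong>compare_hidden</strong></li>',
    '<ul> <li>average_speed_answer 25 60 30 60 ^M<br />',
    'calls_waiting 300 500 300 500 ^M<br />',
    '</li>';

# Without /m, ^ only matches at the very start of the string. That can never
# be true right after the literal "<li>", so this attempt always fails.
my $anchored = $form_dump =~ m{<li>(^average_.*?)</li>}s ? 1 : 0;

# Dropping the ^ and keeping "average_" inside the capture works.
my ($inside_li) = $form_dump =~ m{<li>(average_.*?)</li>}s;

# Look-ahead variant: require "average_" right after <li> without consuming it.
my ($via_lookahead) = $form_dump =~ m{<li>(?=average_)(.*?)</li>}s;

print "anchored attempt matched: $anchored\n";
print "captured:\n$inside_li\n";
```

Both working patterns capture the same text here; the look-ahead form is mainly useful when the condition and the captured text need to be expressed separately.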

Re: Query about regular expression
by Anonymous Monk on Sep 24, 2015 at 00:35 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1142838
    use strict;
    use warnings;
    $| = 1;

    $_ = join '', <DATA>;
    my ($fourth) = /(?:.*?<li>(.*?)<\/li>){4}/s;
    print "$fourth\n";

    __DATA__
    <ul>
    <li><strong>site_user</strong></li>
    <ul>
    <li>user1</li>
    </ul>
    <li><strong>compare_hidden</strong></li>
    <ul>
    <li>average_speed_answer 25 60 30 60 ^M<br />
    calls_waiting 300 500 300 500 ^M<br />
    many more rows here
    post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
    post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
    </li>
    <li><strong>calls_waiting_good_high</strong></li>
    <ul>
    <li>300</li>
    </ul>
    <li><strong>calls_waiting_warning_low</strong></li>
    <ul>

    Of course, this will have problems if you have nested li's
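To illustrate both points, here is a small sketch using made-up strings (`$flat` and `$nested` are hypothetical, not taken from the form dump). It shows that a quantified capture group keeps only its last iteration, which is why `{4}` yields the fourth pair, and how a nested `<li>` cuts a lazy match short.

```perl
use strict;
use warnings;

# A capture group inside a repeated group keeps only its last iteration,
# so {4} leaves the contents of the fourth <li>...</li> pair in $1.
my $flat = '<li>a</li><li>b</li><li>c</li><li>d</li><li>e</li>';
my ($fourth) = $flat =~ m{(?:.*?<li>(.*?)</li>){4}}s;
print "flat: $fourth\n";      # prints "flat: d"

# With nesting, the lazy (.*?) stops at the first </li> it meets,
# which belongs to the inner item, so the outer item is truncated.
my $nested = '<li>outer <ul><li>inner</li></ul> tail</li>';
my ($first) = $nested =~ m{<li>(.*?)</li>}s;
print "nested: $first\n";     # prints "nested: outer <ul><li>inner"
```

A regex has no notion of matching open and close tags, which is the usual argument for a parser when nesting is possible.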

Re: Query about regular expression
by Perl300 (Friar) on Sep 24, 2015 at 15:36 UTC
    Thanks for your suggestions, Anonymous Monk, ww and jeffa. I started with ww's suggestion and modified my code as follows:
    if ( $form_dump_after =~ m{(<li>average_)(.*?)(</li>)}s ) {
        $inside_li = "average_".$2;
        print $fh "Value of \$inside_li: \n$inside_li \n";
    }
    It gave me this output:
    average_speed_answer 25 60 30 60 ^M<br />
    calls_waiting 300 500 300 500 ^M<br />
    many more rows here
    post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
    post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
    But I think I'll have to use HTML::TokeParser::Simple for the next tasks. I don't have HTML::TreeBuilder::XPath available, and being in a restricted environment I can't get it easily, so I can't check that out :-(
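Since the original goal was to put those lines in an array, one possible follow-up step is to split the captured string on `<br />`. This is only a sketch: `$inside_li` below is a shortened stand-in for the output shown above, `@rows` is a hypothetical name, and it assumes that `^M` is either a literal caret followed by M or a real carriage return.

```perl
use strict;
use warnings;

# Shortened stand-in for the captured $inside_li string.
my $inside_li = join "\n",
    'average_speed_answer 25 60 30 60 ^M<br />',
    'calls_waiting 300 500 300 500 ^M<br />',
    'post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />',
    '';

my @rows;
for my $row ( split m{<br\s*/?>}i, $inside_li ) {
    $row =~ s/\^M|\r//g;       # assumption: ^M is a literal "^M" or a CR
    $row =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    push @rows, $row if length $row;   # skip the empty trailing piece
}

print "$_\n" for @rows;
```

Each element of `@rows` then holds one row such as `average_speed_answer 25 60 30 60`, ready for a further `split ' '` if the individual fields are needed.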
