[Solved]: Query about regular expression

Perl300 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

Please forgive me for asking something that might be very simple. I have a string variable $form_dump that is basically the output of $query->Dump; from an HTML form submit. Everything in the form goes into this multiline string variable. Its contents look like:

<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 ^M<br />
 calls_waiting  300 500 300 500 ^M<br />
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
<ul>
<li>300</li>
</ul>
<li><strong>calls_waiting_warning_low</strong></li>
<ul>

What I am trying to do is get each line between the fourth pair of <li> & </li> and put those lines in an array. So far I have tried:

if ( $form_dump =~ m{(<li>)(.*?)(</li>)}s ) {
    my $inside_li = $2;
    print $fh "Value of \$inside_li: $inside_li \n";
}

This gives the text between the first pair of <li> & </li>, which is <strong>site_user</strong>.

Can I get some clues on how to skip the first three pairs of <li> & </li> and then stop after the text between the fourth pair is pulled? I tried something like $form_dump =~ m{(<li>)(^average_.*?)(</li>)}s in an attempt to get only the text that starts with "average_", but it doesn't match anything and $inside_li prints blank.

I am doing something wrong or missing something there.

Update: Updated subject line to mark solved.


Replies are listed 'Best First'.
Re: Query about regular expression (HTML Parser) by jeffa (Bishop) on Sep 23, 2015 at 23:49 UTC
I would use a parser for that instead. Here's some code to get you started should you choose this path:

use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new( *DATA );

my $count = 0;
my @array;
while ( my $token = $p->get_token ) {
    next unless $token->is_start_tag('li');
    next unless ++$count > 3;    # skip the first three <li> tags
    while ( $token = $p->get_token ) {
        last if $token->is_end_tag('li');
        my $text = $token->as_is;
        $text =~ s/^\s*//;       # trim leading whitespace
        $text =~ s/\s*$//;       # trim trailing whitespace
        push @array, $text unless $token->is_tag('br');
    }
    last;
}

print Dumper \@array;

__DATA__
<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 ^M<br />
 calls_waiting  300 500 300 500 ^M<br />
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
<ul>
<li>300</li>
</ul>
<li><strong>calls_waiting_warning_low</strong></li>
<ul>

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re^2: Query about regular expression ( HTML::TreeBuilder::XPath) by Anonymous Monk on Sep 24, 2015 at 00:08 UTC
Too much work :) HTML::TreeBuilder::XPath is less work:

#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->implicit_tags(0);
$tree->no_expand_entities(1);
$tree->ignore_unknown(0);
$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(1);
$tree->store_comments(1);
$tree->store_pis(1);
$tree->parse(q{
<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 ^M<br />
 calls_waiting  300 500 300 500 ^M<br />
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
<ul>
<li>300</li>
</ul>
<li><strong>calls_waiting_warning_low</strong></li>
<ul>});
$tree->eof;

my @li = $tree->findnodes( q{ //li[ contains( ., 'average' ) ] } );
for my $ll ( @li ) {
    $ll->dump;
    print $ll->as_text, "\n";
}
__END__
<li> @0.1.7.1
  "average_speed_answer 25 60 30 60 ^M"
  <br /> @0.1.7.1.1
  "\x0a calls_waiting 300 500 300 500 ^M"
  <br /> @0.1.7.1.3
  "\x0a many more rows here\x0a post_ivr_calls_handled Wisconsin 50 100 50..."
  <br /> @0.1.7.1.5
  "\x0a post_ivr_calls_handled Wyoming 50 100 50 100 ^M"
  <br /> @0.1.7.1.7
  "\x0a"
average_speed_answer 25 60 30 60 ^M
 calls_waiting 300 500 300 500 ^M
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M
Re: Query about regular expression by ww (Archbishop) on Sep 23, 2015 at 22:12 UTC
If the line you want always begins with `<li>average_`, then match on that (including or excluding the `<li>` from your capture, according to your taste) rather than trying to count `<li>`s. If not, you'll have to use a more complex solution, for which you may wish to read in the regex docs about "positive lookahead" and its variations.
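A minimal sketch of that suggestion, assuming the dump looks like the sample in the question (the `$form_dump` contents below are a shortened, hypothetical copy). It also shows why the `^average_` attempt matched nothing: without `/m`, `^` matches only at the very start of the string, and even with `/m` it matches only after a newline, so it can never match directly after `<li>`.

```perl
use strict;
use warnings;

# Shortened, hypothetical copy of the dump from the question.
my $form_dump = <<'HTML';
<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 <br />
 calls_waiting  300 500 300 500 <br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
HTML

# '^' cannot match right after '<li>', so this attempt never succeeds.
my $anchored = $form_dump =~ m{<li>(^average_.*?)</li>}s ? 1 : 0;

# Put the literal text inside the capture instead of anchoring it.
my ($inside_li) = $form_dump =~ m{<li>(average_.*?)</li>}s;

print "anchored attempt matched: $anchored\n";
print "captured:\n$inside_li\n";
```

Keeping `average_` inside the capture group avoids having to glue the prefix back on afterwards.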
Re: Query about regular expression by Anonymous Monk on Sep 24, 2015 at 00:35 UTC
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1142838
use strict;
use warnings;

$| = 1;

$_ = join '', <DATA>;
my ($fourth) = /(?:.*?<li>(.*?)<\/li>){4}/s;
print "$fourth\n";

__DATA__
<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 ^M<br />
 calls_waiting  300 500 300 500 ^M<br />
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
<ul>
<li>300</li>
</ul>
<li><strong>calls_waiting_warning_low</strong></li>
<ul>

Of course, this will have problems if you have nested li's.
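The repeated-group pattern works because a capture group inside a quantified group keeps only the text from its last iteration. An alternative sketch, under the same assumption of no nested `<li>` elements, is to collect every item with a list-context `/g` match and index into the result. The `$html` sample is a shortened, hypothetical copy of the dump.

```perl
use strict;
use warnings;

# Shortened, hypothetical copy of the dump from the question.
my $html = <<'HTML';
<ul>
<li><strong>site_user</strong></li>
<ul>
<li>user1</li>
</ul>
<li><strong>compare_hidden</strong></li>
<ul>
<li>average_speed_answer  25 60 30 60 <br />
 calls_waiting  300 500 300 500 <br />
</li>
<li><strong>calls_waiting_good_high</strong></li>
HTML

# In list context, m//g returns the capture from every match.
my @items = $html =~ m{<li>(.*?)</li>}gs;

# Arrays are zero-based, so the fourth pair is at index 3.
my $fourth = $items[3];

# The repeated-group form, for comparison; it yields the same text here.
my ($by_count) = $html =~ m!(?:.*?<li>(.*?)</li>){4}!s;

print "fourth item:\n$fourth\n";
```

Having all items in `@items` also makes it easy to pick a different pair later without rewriting the pattern.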
Re: Query about regular expression by Perl300 (Friar) on Sep 24, 2015 at 15:36 UTC
Thanks for your suggestions, Anonymous Monk, ww and jeffa. I started with ww's suggestion and modified my code as:

if ( $form_dump_after =~ m{(<li>average_)(.*?)(</li>)}s ) {
    $inside_li = "average_".$2;
    print $fh "Value of \$inside_li: \n$inside_li \n";
}

It gave me this output:

average_speed_answer 25 60 30 60 ^M<br />
 calls_waiting 300 500 300 500 ^M<br />
 many more rows here
 post_ivr_calls_handled Wisconsin 50 100 50 100 ^M<br />
 post_ivr_calls_handled Wyoming 50 100 50 100 ^M<br />

But I think I'll have to use HTML::TokeParser::Simple for the next tasks. I don't have HTML::TreeBuilder::XPath available, and being in a restricted environment I can't get it very easily, so I can't check that out :-(
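Since the original goal was one array element per row, here is a core-Perl sketch (no CPAN modules, given the restricted environment) that splits the captured text on the `<br />` tags. It assumes the `^M` shown in the output is a carriage return character; `$inside_li` below is a hypothetical, shortened copy of the captured text.

```perl
use strict;
use warnings;

# Hypothetical shortened copy of the captured text. "\r" stands in for
# the ^M shown in the output (an assumption about the real data).
my $inside_li = "average_speed_answer  25 60 30 60 \r<br />\n"
              . " calls_waiting  300 500 300 500 \r<br />\n"
              . " post_ivr_calls_handled Wyoming 50 100 50 100 \r<br />\n";

my @rows;
for my $row ( split m{<br\s*/?>}i, $inside_li ) {
    $row =~ s/^\s+|\s+$//g;             # trim whitespace, including \r and \n
    push @rows, $row if length $row;    # skip the empty piece after the last <br />
}

print "$_\n" for @rows;
```

Each element of `@rows` then holds one row with its inner spacing intact, ready for a further `split ' '` if the individual fields are needed.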

