HTML::Parser / Regex

MissPerl has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML::Parser / Regex by Mr. Muskrat (Canon) on May 26, 2017 at 21:05 UTC
You are not using strict and warnings.
You have loaded HMTL::Parser instead of HTML::Parser, but then you do not try to use it.
You are trying to search an undefined variable, $text.
You are trying to use captured values but do not have any capture groups.
(May not be a problem in your real code, but) you are defining a different variable in each part of the if/elsif blocks.
The pattern you are using to match the numbers is a bit odd. Take a look at Regexp::Common.
You are trying to use regular expressions to search for slashes without changing the "/" delimiters. See Regexp quote-like operators.

    # partial snippet
    use strict;
    use warnings;
    use Regexp::Common;
    # ...
    while ( my $text = <$f1> ) {
        chomp $text;
        my ($one, $two, $three); # Also these variable names are not very descriptive.
        if ($text =~ m!Employee\sA</th><th>($RE{num}{real})<!) {
            $one = $1;
        }
        # ...
    }
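To illustrate the points about capture groups and delimiters without depending on Regexp::Common, here is a minimal sketch in core Perl only. The sub name `amount_for` and the hand-rolled `$number` pattern are assumptions made for this example; the pattern is deliberately simpler than `$RE{num}{real}`.

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{num}{real} (an assumption, not the module's pattern).
my $number = qr/[-+]?(?:\d+(?:\.\d*)?|\.\d+)/;

# Hypothetical helper: return the number in the cell that follows
# "Employee <label>", or undef if the line does not match.
sub amount_for {
    my ($label, $text) = @_;

    # m{...} instead of /.../ means the slash in </th> needs no escaping.
    # The parentheses form capture group 1, which is what fills $1.
    if ( $text =~ m{Employee\s\Q$label\E</th><th>($number)<} ) {
        return $1;
    }
    return;
}

my $line   = 'Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>';
my $amount = amount_for('A', $line);
print defined $amount ? "A: $amount\n" : "A: no match\n";
```

Without the parentheses around `$number` the match could still succeed, but `$1` would not be set by it.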
Re^2: HTML::Parser / Regex by MissPerl (Sexton) on May 27, 2017 at 06:51 UTC
Hi Mr. Muskrat, Thank you for your reply. I did try to use HTML::Parser, but it ended up pretty ugly, so I did not include that part of the code. Do you have any recommended links for HTML::Parser? Apologies for not mentioning which Perl version I am using in the first place. I am using v5.8.8. I've tried the solution you provided, and it seems that the version I am using doesn't support Regexp::Common. Also, thanks for pointing out the mistakes I made! And I totally forgot to turn on strict and warnings!
Re^3: HTML::Parser / Regex by AnomalousMonk (Archbishop) on May 27, 2017 at 07:30 UTC
"... the version that I am using doesn't support Regexp::Common."
Why do you say that? What errors/system messages do you get? The code I posted here uses Regexp::Common and runs under Perl 5.8.9. Are you sure you have the module installed on your system?
Give a man a fish: `<%-{-{-{-<`
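One way to answer the "is it installed?" question from inside Perl is to attempt a `require` inside an `eval`. The sketch below uses only core Perl; the helper name `module_available` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded from @INC.
sub module_available {
    my ($module) = @_;

    # require with a string wants a file path, so Foo::Bar becomes Foo/Bar.pm.
    (my $file = "$module.pm") =~ s{::}{/}g;

    # eval traps the "Can't locate ..." exception raised for a missing module.
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(strict Regexp::Common)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'NOT installed';
}
```

From a shell, `perl -MRegexp::Common -e 1` performs a similar check: it exits silently when the module loads and reports "Can't locate Regexp/Common.pm in @INC" when it does not.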
Re^4: HTML::Parser / Regex by MissPerl (Sexton) on May 27, 2017 at 08:52 UTC
Re^5: HTML::Parser / Regex by shmem (Chancellor) on May 27, 2017 at 12:13 UTC
Some notes below your chosen depth have not been shown here
Re^3: HTML::Parser / Regex by poj (Abbot) on May 28, 2017 at 17:15 UTC
"... tried to use HTML::Parser, but it ended up pretty ugly,"
What didn't you like about using HTML::Parser?

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::Parser;

    my %inside = ();
    my $tbl = -1;
    my $col;
    my $row;
    my @table = ();

    my $p = HTML::Parser->new(
        handlers => {
            start => [ \&start, 'tagname' ],
            end   => [ \&end,   'tagname' ],
            text  => [ \&text,  'text' ],
        }
    );
    $p->parse_file(\*DATA); # or filename

    # output
    for my $t (0..$#table){
        print "\nTable $t\n";
        for my $r (0..$#{$table[$t]}){
            my $line = join "\t",$r,@{$table[$t][$r]};
            print "$line\n";
        }
    }

    sub start {
        my $tag = shift;
        $inside{$tag} = 1;
        if ($tag eq 'table'){
            ++$tbl;
            $row = -1;
        }
        elsif ($tag eq 'tr'){
            ++$row;
            $col = -1;
        }
        elsif ($tag eq 'th'){
            ++$col;
            $table[$tbl][$row][$col] = ''; # or undef
        }
    }

    sub end {
        my $tag = shift;
        $inside{$tag} = 0;
    }

    sub text {
        my $str = shift;
        if ( $inside{'th'} ){
            $table[$tbl][$row][$col] = $str;
        }
    }

    __DATA__
    </table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
    </tr><tr><th>Employee B</th><th>-5.02</th>
    </tr><tr><th>Employee C</th><th>19</th>
    </tr></table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
    </tr><tr><th>Employee B</th><th></th>
    </tr><tr><th>Employee C</th><th></th>

poj
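Once `@table` is filled in, its rows can be turned into a name-to-amount lookup. The following is a sketch under assumptions: it takes the same array-of-arrays layout that poj's handlers build (`$table[$tbl][$row][$col]`), and the sub name `amounts_by_employee` and the 'no amount' placeholder are invented for illustration. The sample data is typed in by hand so that the block runs without HTML::Parser installed.

```perl
use strict;
use warnings;

# Hypothetical helper: map the "Employee X" rows of one table to their amounts.
# $rows is an array ref of rows; each row is an array ref of cell strings.
sub amounts_by_employee {
    my ($rows) = @_;
    my %amount_for;

    for my $row (@$rows) {
        my ($label, $amount) = @$row;
        next unless defined $label && $label =~ m{\A\s*Employee\s+(.+?)\s*\z};
        my $name = $1;

        # Empty or missing cells get a placeholder (an arbitrary choice).
        $amount_for{$name} =
            ( defined $amount && $amount =~ /\S/ ) ? $amount : 'no amount';
    }
    return \%amount_for;
}

# Hand-written sample in the same shape as $table[0] and $table[1].
my @table = (
    [ [ 'Employee A', '-0.82' ], [ 'Employee B', '-5.02' ], [ 'Employee C', '19' ] ],
    [ [ 'Employee A', '' ],      [ 'Employee B', '' ],      [ 'Employee C', '' ] ],
);

for my $t ( 0 .. $#table ) {
    my $amounts = amounts_by_employee( $table[$t] );
    print "Table $t\n";
    print "  $_ => $amounts->{$_}\n" for sort keys %$amounts;
}
```

Keeping the parsing (HTML::Parser handlers) separate from the interpretation (this helper) means the "Employee" convention lives in one small place.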
Re^4: HTML::Parser / Regex by MissPerl (Sexton) on May 29, 2017 at 01:20 UTC
Re^4: HTML::Parser / Regex by jobormo (Initiate) on Sep 10, 2019 at 07:57 UTC
Re: HTML::Parser / Regex by AnomalousMonk (Archbishop) on May 26, 2017 at 21:18 UTC
"... I still couldn't quite pick up how this string-matching works."
If you haven't already seen them, please take a look at perlrequick, perlretut (and, of course, at perlre for the hard core stuff), and at Pattern Matching, Regular Expressions, and Parsing in the monastery's own Tutorials.
Update: Oh, and please let us know the version of Perl you're working with so we can know the regex features available to you!
Give a man a fish: `<%-{-{-{-<`
Re^2: HTML::Parser / Regex by MissPerl (Sexton) on May 27, 2017 at 06:11 UTC
Hi AnomalousMonk, Thank you for all those links. I think I had overlooked most of them. I'm going to have a look at them. Also, I am using perl v5.8.8. It's pretty old, but yeah..
Re^3: HTML::Parser / Regex by AnomalousMonk (Archbishop) on May 27, 2017 at 07:18 UTC
"... using perl v5.8.8. ... pretty old ..."
Not to worry. The example code I posted here runs under 5.8.9. It's intentionally ambitious; I hope you will come back to it repeatedly as you read more about Perl, regexes, etc.
Give a man a fish: `<%-{-{-{-<`
Re: HTML::Parser / Regex by kcott (Archbishop) on May 27, 2017 at 06:09 UTC
G'day MissPerl, Welcome to the Monastery.
"... after a few days of self-learning perl ..."
I'd recommend reading through "perlintro - Perl introduction for beginners". It's not particularly long (about 10 screenfuls on my monitor) and will walk you through the basics. As you've only just started learning Perl, you have (quite understandably) made a number of novice mistakes: this document should go a long way to clearing up those problems. In addition, it's peppered with links to more detailed information and advanced topics. Return to this document when the need arises, and delve into the specifics as required.
-- Ken
Re^2: HTML::Parser / Regex by MissPerl (Sexton) on May 29, 2017 at 01:25 UTC
Good day Ken! Thank you for the link! I am going through all the useful materials from fellow PerlMonks. Time is ticking, but I'll be sure to come back if I come across something I don't understand. Thanks!!
Re: HTML::Parser / Regex by AnomalousMonk (Archbishop) on May 26, 2017 at 22:03 UTC
"I know that ... some other module might have easier way to do this. But for now, I want to learn and apply HTML::Parser and regex ..."
Ok, so you're committed to drilling all those holes in your head just to prove to yourself for sure that drilling holes in your head is a bad idea. Here's one approach:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use warnings;
     use strict;
     ;;
     use Regexp::Common;
     ;;
     use Data::Dump qw(dd);
     ;;
     my @lines = (
       'Summary</h1><table border=\"1\"><tr><th>Employee John Doe</th><th>-0.82</th>',
       'Summary</h1><table border=\"1\"><tr><th> Employee Fred D. Poe </th><th> -5.03 </th>',
       'Summary</h1><table border=\"1\"><tr><th>Employee Billy-Bob Toe</th><th> </th>',
       'Summary</h1><table border=\"1\"><tr><th>Employee</th><th>999</th>',
       '<th>Employee Prince </th><th> 123</th>',
       '<th>Employee O</th><th> 1.23 </th>',
       );
     ;;
     my $rx_name     = qr{ \S+? (?: \s+ \S+)*? }xms;
     my $rx_th_open  = qr{ \s* < th > \s* }xms;
     my $rx_th_close = qr{ \s* < / th > \s* }xms;
     ;;
     my %per_employee;
     ;;
     LINE:
     for my $line (@lines) {
        my $parsed = my ($name, $amount) = $line =~ m{
            $rx_th_open Employee \s+ ($rx_name) $rx_th_close
            $rx_th_open ($RE{num}{real})?       $rx_th_close
            }xms;
        ;;
        if (not $parsed) {
          warn qq{'$line' failed to parse};
          next LINE;
          }
        ;;
        $amount = 'no amount' unless defined $amount;
        $per_employee{$name} = $amount;
        }
     ;;
     dd \%per_employee;
     "
    'Summary</h1><table border="1"><tr><th>Employee</th><th>999</th>' failed to parse at -e line 1.
    {
      "Billy-Bob Toe" => "no amount",
      "Fred D. Poe" => "-5.03",
      "John Doe" => "-0.82",
      O => "1.23",
      Prince => 123,
    }

(Note that the `$rx_name` regex for an actual, human name is very naive. (Update: See off-site Falsehoods Programmers Believe About Names.))
Update: Significant changes to example code: `$rx_th_open $rx_th_close` regexes made more elegant (?); added rudimentary error handling; added corner and error test cases.
Give a man a fish: `<%-{-{-{-<`
Re: HTML::Parser / Regex by eyepopslikeamosquito (Archbishop) on May 28, 2017 at 06:30 UTC
What you're attempting as a first program is too tough for a complete beginner IMHO ... So, like kcott, I suggest you read perlintro or some of these Learning Perl links. Then write some simpler programs first, to gain some confidence. Feel free to ask more questions if you get stumped. Once you've done that (will probably take a week or two) return to your original problem.

That said, I can see you're very determined to try to solve your real world problem immediately! If so, try running this simple program:

    use strict;
    use warnings;

    my $ca = "california.html";
    open(my $f1, "<" , $ca) or die "Can't open file '$ca': $!";
    while ( my $line = <$f1> ) {
        print "line: $line";
        if ( $line =~ m{Employee +([^<]+)</th><th>([^<]+)} ) {
            my $name = $1;
            my $two = $2;
            print " name='$name' two='$two'\n";
        }
    }
    close ($f1);

on your original test `california.html` file:

    </tr></table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
    </tr><tr><th>Employee B</th><th>-5.02</th>
    </tr><tr><th>Employee C</th><th>19</th>
    </tr></table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
    </tr><tr><th>Employee B</th><th></th>
    </tr><tr><th>Employee C</th><th></th>

which should produce the following output:

    line: </tr></table></body><body bgcolor="black"><h1>
    line: Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
     name='A' two='-0.82'
    line: </tr><tr><th>Employee B</th><th>-5.02</th>
     name='B' two='-5.02'
    line: </tr><tr><th>Employee C</th><th>19</th>
     name='C' two='19'
    line: </tr></table></body><body bgcolor="black"><h1>
    line: Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
    line: </tr><tr><th>Employee B</th><th></th>
    line: </tr><tr><th>Employee C</th><th></th>

Now, take the time to understand how the above program works by reading the introductory Perl links above. Feel free to ask any questions about it.

Please note that I am NOT endorsing the above program as a sound way to solve your real world problem. It is just a simple program, directly related to your real world problem, to help motivate you to learn some Perl basics. For a sound solution to your problem, I suspect HTML-Parser is the way to go.
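The second table in the sample file has empty amount cells, and `([^<]+)` skips those rows because `+` requires at least one character. A small variation, offered as a sketch and not as a recommendation over a real HTML parser, is to use `*` and treat an empty capture as "no amount". The sub name `parse_employee_line` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (name, amount) for a matching line, or an
# empty list if the line has no "Employee ..." cell. The amount is undef
# when the cell is empty.
sub parse_employee_line {
    my ($line) = @_;

    # In list context a successful match returns the captured groups,
    # so $1 and $2 never need to be touched directly.
    my ($name, $amount) =
        $line =~ m{Employee +([^<]+)</th><th>([^<]*)}
        or return;

    $amount = undef if $amount eq '';
    return ($name, $amount);
}

for my $line (
    '</tr><tr><th>Employee B</th><th>-5.02</th>',
    '</tr><tr><th>Employee B</th><th></th>',
    '</tr></table></body><body bgcolor="black"><h1>',
) {
    my ($name, $amount) = parse_employee_line($line) or next;
    print "name='$name' amount='", ( defined $amount ? $amount : 'none' ), "'\n";
}
```

The `or return` and `or next` forms rely on a list assignment evaluating to the number of values on its right-hand side, which is zero when the match fails.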
Re^2: HTML::Parser / Regex by MissPerl (Sexton) on May 29, 2017 at 01:33 UTC
Hi eyepopslikeamosquito, Thanks for your sample code! However, I came across the error "Can't use global $1 in "my"". This isn't the first time I have seen it. I tried googling to get around it, but unfortunately nothing worked. As I am still reading the beginners' material, as far as I know I would need $_ or $1 to scan the current line? And I figured you are the best person I could ask for advice?!
Re^3: HTML::Parser / Regex by kcott (Archbishop) on May 29, 2017 at 03:34 UTC
"... the error "Can't use global $1 in "my"" ..."
Somewhere in your code, you have "`... my $1 ...`". Here's a couple of examples:

    $ perl -e 'my $1;'
    Can't use global $1 in "my" at -e line 1, near "my $1"
    Execution of -e aborted due to compilation errors.
    $ perl -e 'my $1 = 42;'
    Can't use global $1 in "my" at -e line 1, near "my $1 "
    Execution of -e aborted due to compilation errors.

You can find the full description of the problem in "perldiag - Perl diagnostic messages". Until you're familiar with that document, it can be a bit difficult finding the information. In this instance, you'd need to search for "`Can't use global`" (not "`Can't use global $1`"). Doing so locates this:

    Can't use global %s in "%s"
        (F) You tried to declare a magical variable as a lexical variable. This is not allowed, because the magic can be tied to only one location (namely the global variable) and it would be incredibly confusing to have variables in your program that looked like magical variables but weren't.

While you're learning, you may find it useful to use the diagnostics pragma. Put this line near the start of your code: `use diagnostics;`
That will give you a full description, rather than the somewhat terse shortened form.
Important: That pragma is intended as a developer tool. Do not leave it in production code.
-- Ken
Re^3: HTML::Parser / Regex by AnomalousMonk (Archbishop) on May 29, 2017 at 02:59 UTC
"... the error "Can't use global $1 in "my"""
What is the specific code that produces this error? If you don't show us the code, we can only make more or less wild guesses. This just wastes our time and yours. Please see How do I post a question effectively? and How (Not) To Ask A Question.
Update: BTW: I ran the code eyepopslikeamosquito posted here under Perl 5.8.9 and I get the advertised output with no errors or warnings.
Give a man a fish: `<%-{-{-{-<`
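The `Can't use global $1 in "my"` error discussed above comes from declaring `$1` with `my`. For contrast, this is a sketch of the usual pattern: `$1` is left alone as a global, and its value is copied into a lexical variable with an ordinary name. The sub `employee_label` and the sample string are assumptions for illustration, not the code that produced the error.

```perl
use strict;
use warnings;

# Hypothetical helper: return the label after "Employee", or undef.
sub employee_label {
    my ($line) = @_;

    # Wrong:  my $1 = ...;   # compile error: Can't use global $1 in "my"
    # Right:  test the match first, then copy $1 into a lexical variable.
    if ( $line =~ m{<th>Employee\s+([^<]+)</th>} ) {
        my $label = $1;
        return $label;
    }
    return;
}

my $label = employee_label('<tr><th>Employee A</th><th>-0.82</th>');
print "label: $label\n" if defined $label;
```

Copying `$1` right after a successful match also protects the value from being overwritten by a later successful match.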

