MissPerl has asked for the wisdom of the Perl Monks concerning the following question:
Hi fellow Perl Monks,
I am trying to get text/numbers from an HTML file and store them in variables.
I know that HTML::TableExtract or some other module might offer an easier way to do this.
But for now, I want to learn and apply HTML::Parser and regex first.
This is part of my failed attempt at a Perl script. It produced errors like "bareword found" (might be a runaway multi-line) and "Can't use global $1 in my".
At the beginning of the script, it prompts the user for input and stores it in a variable.
For now, I am writing the part of the script that reads the $ca HTML file and finds matches.
The next part of the script will then do the same for the other states' HTML files.
use HMTL::Parser;
my $ca = "california.html";
open (my $f1, "<" , $ca) || die ("Can't open file : california.html");
while (<$f1>){
if (my $text =~ /Employee\sA</th><th>.\d</){
my $one = $1;
}elsif (my $text =~ /Employee\sB</th><th>.\d</){
my $two = $1;
}elsif (my $text =~ /Employee\sC</th><th>.\d</){
my $three = $1;
}
}
close ($f1);
Below are a few lines from two different HTML files.
Employee A/B/C is fixed, but sometimes there will be no value between the <th> tags.
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
</tr><tr><th>Employee B</th><th>-5.02</th>
</tr><tr><th>Employee C</th><th>19</th>
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
</tr><tr><th>Employee B</th><th></th>
</tr><tr><th>Employee C</th><th></th>
I've been trying to get the values into variables so that I can use them later:
$one = -0.82
$two = -5.02
$three = 19
Apologies in advance for the failed attempt at a Perl script that I wrote.
I could understand if it's too painful to watch.
But kindly point out my mistakes and guide me to the correct way.
Thank you so much.
P/S: You could say that I am a slow learner because after a few days of self-learning Perl,
I still couldn't quite pick up how this string-matching works.
I'm using perl v5.8.8
Re: HTML::Parser / Regex
by Mr. Muskrat (Canon) on May 26, 2017 at 21:05 UTC
You are not using strict and warnings.
You have loaded HMTL::Parser instead of HTML::Parser but then you do not try to use it.
You are trying to search an undefined variable $text.
You are trying to use captured values but do not have any capture groups.
(May not be a problem in your real code but) you are defining a different variable in each part of the if/elsif blocks.
The pattern you are using to match the numbers is a bit odd. Take a look at Regexp::Common.
You are trying to use regular expressions to search for slashes without changing the "/" delimiters. See Regexp Quote-Like Operators in perlop.
# partial snippet
use strict;
use warnings;
use Regexp::Common;
# ...
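# Declared before the loop so the values survive it.
my ($one, $two, $three); # Also these variable names are not very descriptive.
while ( my $text = <$f1> ) {
    chomp $text; # chomp's return value is a character count, so it should not drive the loop
    if ($text =~ m!Employee\sA</th><th>($RE{num}{real})<!) {
        $one = $1;
    } # ...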
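For readers without Regexp::Common, the points above (capture groups, a non-"/" delimiter, matching against a defined variable) can be sketched with core Perl only. This is a minimal sketch, not the poster's code: the `$num` pattern is a hand-written assumption that covers only plain signed decimals, `%value_for` is a hypothetical name, and the lines are hard-coded from the question rather than read from california.html.

```perl
use strict;
use warnings;

# Sample lines copied from the question; a real script would read them from the file.
my @lines = (
    'Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>',
    '</tr><tr><th>Employee B</th><th>-5.02</th>',
    '</tr><tr><th>Employee C</th><th>19</th>',
    '</tr><tr><th>Employee A</th><th></th>',    # empty cell: no number to capture
);

# Assumption: optional sign, digits, optional fractional part.
my $num = qr/[-+]?\d+(?:\.\d+)?/;

my %value_for;    # hypothetical name: employee letter => captured number
for my $text (@lines) {
    # m{...} avoids clashing with the "/" in </th>; each (...) is a capture group.
    if ( $text =~ m{Employee\s+(\w+)</th><th>($num)<} ) {
        $value_for{$1} = $2;
    }
}

print "$_ = $value_for{$_}\n" for sort keys %value_for;
```

A single hash keyed by the employee letter replaces the separate $one, $two and $three variables, and the line with the empty cell simply does not match.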
Hi Mr. Muskrat,
Thank you for your reply. I did try to use HTML::Parser, but it ended up pretty ugly, so I did not include that part of the code.
Do you have any recommended links for HTML::Parser?
Apologies for not mentioning which Perl version I am using in the first place.
I am using v5.8.8.
I've tried the solution you provided, but it seems that the version I am using doesn't support Regexp::Common.
Also, thanks for pointing out the mistakes I made! I totally forgot to turn on strict and warnings!
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Parser;
my %inside = ();
my $tbl = -1; my $col; my $row;
my @table = ();
my $p = HTML::Parser->new(
    handlers => {
        start => [ \&start,'tagname' ],
        end => [ \&end, 'tagname' ],
        text => [ \&text, 'text' ],
    }
);
$p->parse_file(\*DATA); # or filename
# output
for my $t (0..$#table){
    print "\nTable $t\n";
    for my $r (0..$#{$table[$t]}){
        my $line = join "\t",$r,@{$table[$t][$r]};
        print "$line\n";
    }
}
sub start {
    my $tag = shift;
    $inside{$tag} = 1;
    if ($tag eq 'table'){
        ++$tbl; $row = -1;
    } elsif ($tag eq 'tr'){
        ++$row; $col = -1;
    } elsif ($tag eq 'th'){
        ++$col;
        $table[$tbl][$row][$col] = ''; # or undef
    }
}
sub end {
    my $tag = shift;
    $inside{$tag} = 0;
}
sub text {
    my $str = shift;
    if ( $inside{'th'} ){
        $table[$tbl][$row][$col] = $str;
    }
}
__DATA__
</table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
</tr><tr><th>Employee B</th><th>-5.02</th>
</tr><tr><th>Employee C</th><th>19</th>
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
</tr><tr><th>Employee B</th><th></th>
</tr><tr><th>Employee C</th><th></th>
poj
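To get from the parsed @table structure to the named values the question asks for, one possible follow-up step is sketched below. It assumes @table ends up shaped as table, then row, then cell, which is what the parser above should build for the sample data; the array is hard-coded here so the snippet runs without HTML::Parser, and `%value_for` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed contents of @table after parsing the sample data:
# $table[$t][$r][$c] is table $t, row $r, cell $c.
my @table = (
    [ [ 'Employee A', '-0.82' ], [ 'Employee B', '-5.02' ], [ 'Employee C', '19' ] ],
    [ [ 'Employee A', ''      ], [ 'Employee B', ''      ], [ 'Employee C', ''   ] ],
);

my %value_for;    # hypothetical name: employee label => value
for my $rows (@table) {
    for my $cells (@$rows) {
        my ( $label, $value ) = @$cells;
        # Skip rows with a missing or empty value cell.
        next unless defined($label) && defined($value) && length($value);
        $value_for{$label} = $value;
    }
}

for my $label ( sort keys %value_for ) {
    print "$label => $value_for{$label}\n";
}
```

Skipping empty cells means the second table in the sample leaves the values from the first one untouched; a script that handles one file per state would more likely keep one hash per file.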
Re: HTML::Parser / Regex
by AnomalousMonk (Archbishop) on May 26, 2017 at 21:18 UTC
... I still couldn't quite pick up how this string-matching works.
If you haven't already seen them, please take a look at perlrequick, perlretut (and, of course, at perlre for the hard core stuff), and at Pattern Matching, Regular Expressions, and Parsing in the monastery's own Tutorials.
Update: Oh, and please let us know the version of Perl you're working with so we can know the regex features available to you!
Give a man a fish: <%-{-{-{-<
Re: HTML::Parser / Regex
by kcott (Archbishop) on May 27, 2017 at 06:09 UTC
G'day MissPerl,
Welcome to the Monastery.
"... after a few days of self-learning perl ..."
I'd recommend reading through "perlintro - Perl introduction for beginners".
It's not particularly long (about 10 screenfuls on my monitor) and will walk you through the basics.
As you've only just started learning Perl, you have (quite understandably) made a number of novice mistakes:
this document should go a long way to clearing up those problems.
In addition, it's peppered with links to more detailed information and advanced topics.
Return to this document when the need arises, and delve into the specifics as required.
Good day Ken!
Thank you for the link!
I am going through all the useful materials from fellow PerlMonks. Time is ticking, but I'll be sure to come back if I come across something I don't understand. Thanks!!
Re: HTML::Parser / Regex
by AnomalousMonk (Archbishop) on May 26, 2017 at 22:03 UTC
I know that ... some other module might have easier way to do this. But for now, I want to learn and apply HTML::Parser and regex ...
Ok, so you're committed to drilling all those holes in your head just to prove to yourself for sure that drilling holes in your head is a bad idea. Here's one approach:
c:\@Work\Perl\monks>perl -wMstrict -le
"use warnings;
use strict;
;;
use Regexp::Common;
;;
use Data::Dump qw(dd);
;;
my @lines = (
'Summary</h1><table border=\"1\"><tr><th>Employee John Doe</th><th>-0.82</th>',
'Summary</h1><table border=\"1\"><tr><th> Employee Fred D. Poe </th><th> -5.03 </th>',
'Summary</h1><table border=\"1\"><tr><th>Employee Billy-Bob Toe</th><th> </th>',
'Summary</h1><table border=\"1\"><tr><th>Employee</th><th>999</th>',
'<th>Employee Prince </th><th> 123</th>',
'<th>Employee O</th><th> 1.23 </th>',
);
;;
my $rx_name = qr{ \S+? (?: \s+ \S+)*? }xms;
my $rx_th_open = qr{ \s* < th > \s* }xms;
my $rx_th_close = qr{ \s* < / th > \s* }xms;
;;
my %per_employee;
;;
LINE:
for my $line (@lines) {
my $parsed =
my ($name, $amount) = $line =~ m{
$rx_th_open Employee \s+ ($rx_name) $rx_th_close
$rx_th_open ($RE{num}{real})? $rx_th_close
}xms;
;;
if (not $parsed) {
warn qq{'$line' failed to parse};
next LINE;
}
;;
$amount = 'no amount' unless defined $amount;
$per_employee{$name} = $amount;
}
;;
dd \%per_employee;
"
'Summary</h1><table border="1"><tr><th>Employee</th><th>999</th>' failed to parse at -e line 1.
{
"Billy-Bob Toe" => "no amount",
"Fred D. Poe" => "-5.03",
"John Doe" => "-0.82",
O => "1.23",
Prince => 123,
}
(Note that the $rx_name regex for an actual, human name is very naive. (Update: See off-site Falsehoods Programmers Believe About Names.))
Update: Significant changes to example code: $rx_th_open $rx_th_close regexes made more elegant (?); added rudimentary error handling; added corner and error test cases.
Give a man a fish: <%-{-{-{-<
Re: HTML::Parser / Regex
by eyepopslikeamosquito (Archbishop) on May 28, 2017 at 06:30 UTC
What you're attempting as a first program is too tough for a complete beginner IMHO ...
So, like kcott, I suggest you read perlintro or some of these
Learning Perl links.
Then write some simpler programs first, to gain some confidence.
Feel free to ask more questions if you get stumped.
Once you've done that (will probably take a week or two) return to your
original problem.
That said, I can see you're very determined to try to solve your real world problem immediately!
If so, try running this simple program:
use strict;
use warnings;
my $ca = "california.html";
open(my $f1, "<" , $ca) or die "Can't open file '$ca': $!";
while ( my $line = <$f1> ) {
print "line: $line";
if ( $line =~ m{Employee +([^<]+)</th><th>([^<]+)} ) {
my $name = $1;
my $two = $2;
print " name='$name' two='$two'\n";
}
}
close ($f1);
on your original test california.html file:
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
</tr><tr><th>Employee B</th><th>-5.02</th>
</tr><tr><th>Employee C</th><th>19</th>
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
</tr><tr><th>Employee B</th><th></th>
</tr><tr><th>Employee C</th><th></th>
which should produce the following output:
line: </tr></table></body><body bgcolor="black"><h1>
line: Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
name='A' two='-0.82'
line: </tr><tr><th>Employee B</th><th>-5.02</th>
name='B' two='-5.02'
line: </tr><tr><th>Employee C</th><th>19</th>
name='C' two='19'
line: </tr></table></body><body bgcolor="black"><h1>
line: Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
line: </tr><tr><th>Employee B</th><th></th>
line: </tr><tr><th>Employee C</th><th></th>
Now, take the time to understand how the above program works by reading the introductory Perl links above.
Feel free to ask any questions about it.
Please note that I am NOT endorsing the above program as a sound way to solve your real world problem.
It is just a simple program, directly related to your real world problem, to help motivate you to learn some Perl basics.
For a sound solution to your problem, I suspect HTML-Parser is the way to go.
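The output above shows that rows with an empty <th></th> cell are silently skipped, because `[^<]+` needs at least one character. A minimal variation that also reports those rows might look like the sketch below; the switch to `[^<]*`, the "(no value)" placeholder and the `@report` array are illustrative assumptions, not part of the program above.

```perl
use strict;
use warnings;

# Two sample lines from the question: one with a value, one with an empty cell.
my @lines = (
    '</tr><tr><th>Employee B</th><th>-5.02</th>',
    '</tr><tr><th>Employee C</th><th></th>',
);

my @report;    # hypothetical name: one summary string per matching line
for my $line (@lines) {
    # [^<]* (zero or more characters) also matches an empty <th></th> cell.
    if ( $line =~ m{Employee +([^<]+)</th><th>([^<]*)</th>} ) {
        my ( $name, $value ) = ( $1, $2 );
        my $shown = length($value) ? $value : '(no value)';
        push @report, "name='$name' value='$shown'";
    }
}

print "$_\n" for @report;
```

Copying $1 and $2 into lexicals straight after the match keeps them safe from being overwritten by a later successful match.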
Hi eyepopslikeamosquito,
Thanks for your sample code!
However, I came across the error "Can't use global $1 in "my"".
This isn't the first time I've seen it. I tried googling for a way around it, but unfortunately nothing worked.
As I am still reading the beginners' material, my understanding is that I need $_ or $1 to scan the current line?
And I figured you are the best person I could ask for advice?!
$ perl -e 'my $1;'
Can't use global $1 in "my" at -e line 1, near "my $1"
Execution of -e aborted due to compilation errors.
$ perl -e 'my $1 = 42;'
Can't use global $1 in "my" at -e line 1, near "my $1 "
Execution of -e aborted due to compilation errors.
You can find the full description of the problem in
"perldiag - Perl diagnostic messages".
Until you're familiar with that document, it can be a bit difficult to find the information.
In this instance, you'd need to search for "Can't use global" (not "Can't use global $1").
Doing so locates this:
Can't use global %s in "%s"
(F) You tried to declare a magical variable as a lexical variable. This is not allowed, because the magic can be tied to only one location (namely the global variable) and it would be incredibly confusing to have variables in your program that looked like magical variables but weren't.
While you're learning, you may find it useful to use
the diagnostics pragma.
Put this line near the start of your code:
use diagnostics;
That will give you a full description, rather than the somewhat terse shortened form.
Important:
That pragma is intended as a developer tool.
Do not leave it in production code.
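As a hedged illustration of what the diagnostic means in practice: the capture variables $1, $2 and so on are set by the regex engine and cannot be declared with my, but their values can be copied into lexicals you declare yourself. The sample line and the variable names below are assumptions for illustration only.

```perl
use strict;
use warnings;

my $line = '<th>Employee A</th><th>-0.82</th>';

# Does not compile: my $1;  or  my ($1, $2) = ...;
# Instead, declare your own lexical and copy the capture into it.
my $value;
if ( $line =~ m{Employee\s+A</th><th>([^<]+)<} ) {
    $value = $1;
}
print defined($value) ? "value = $value\n" : "no match\n";

# A match in list context returns the captures directly, so $1 is not needed at all.
my ( $name, $amount ) = $line =~ m{Employee\s+(\w+)</th><th>([^<]*)<};
print "name = $name, amount = $amount\n" if defined $name;
```

The list-context form is often the tidier of the two, since the captured values never live in the global variables from the caller's point of view.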
... the error "Can't use global $1 in "my""
What is the specific code that produces this error? If you don't show us the code, we can only make more or less wild guesses. This just wastes our time and yours. Please see How do I post a question effectively? and How (Not) To Ask A Question.
Update: BTW: I ran the code eyepopslikeamosquito posted here under Perl 5.8.9 and I get the advertised output with no errors or warnings.
Give a man a fish: <%-{-{-{-<