Perl Array Question, combining HTML::HeadParser and regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl Array Question, combining HTML::HeadParser and regex by kcott (Archbishop) on Feb 01, 2016 at 06:59 UTC
You have several problems with your post: You've shown no representative input. You've shown no actual output. You've shown no expected output. Your problem description is: "I can't get it to work". That's of absolutely no help to us whatsoever. I can see a number of problems with your code. I'll document the first two that leapt out at me. Fix these and, if you still need further help, please address all the points I've raised above (see "How do I post a question effectively?" if you need more help with this). Problem 1: `my $string = quotemeta @TAGS2;` [download] That forces a scalar context which will give you the number of elements in the array, e.g. `$ perl -wE 'my @x = qw{$ % ^}; my $y = quotemeta @x; say $y' 3` [download] Perhaps you need something closer to: `$ perl -wE 'my @x = qw{$ % ^}; my @y = map { quotemeta } @x; say "@y"' \$ \% \^` [download] Problem 2: `while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {` [download] You may be confusing assignment (using `=`): `my $x = ...` [download] with regex matching (using `=~`): `$x =~ /.../` [download] Perhaps your code should look more like: `my $text = ... ... while ($text =~ ...` [download] — Ken	[reply] [d/l] [select]
Re: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 07:37 UTC
Thanks for your response and your constructive criticism. I can definitely see your points. Here are some examples of my html input: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta name="generator" content= "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15. +15), see www.w3.org"> <title>Aberdeen%20Genetic%20purity , News Search \| Ask.com</title> <link rel="shortcut icon" href="http:/"> <link rel="apple-touch-icon" href= "http://www."> <link rel="apple-touch-icon" sizes="76x76" href= "http://www."> <link rel="apple-touch-icon" sizes="120x120" href= "http://www"> <link rel="apple-touch-icon" sizes="152x152" href= "http://www."> <link rel="apple-touch-icon" sizes="180x180" href= "http://www."> <meta name="robots" content="noindex, nofollow"> <style type="text/css"> #midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer + { visibility:visible !important; } </style> <link type="text/css" rel="stylesheet" href= "/st/c/css/ask_news.min.1fbfe92a.css"> <style type="text/css"> .hcsa { visibility:hidden; } .sprite { background-image: url(http://); } .news-top-video-arrow{ background-image: url(http://); } @media (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi) { .sprite { background-image: url(http://); background-size: 111px 181px; } } </style> </head> [download] Here's an example of my output: `Filename Content-Base Title X-Meta-author X-Meta-descripti +on X-Meta-keywords X-Meta-name Test/1.html Aberdeen%20Animal%20trait%20analysis , News Search \| As +k.com` [download] I'm trying to get the regex results to print above after the last entry X-Meta-name. If I use the follow code instead of the array that I'm using: `my $string = quotemeta 'CEO'; while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { print $fh1 $1, ","; }` [download] I can get the following Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.ht +ml Ames Animal trait analysis , News Search \| Ask.com + Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html + both adults and infants. Dr. Kocher has requested who hrough a sepa +rate study. Dr. Lazaridis' samples alon colon and rectal cance +r. Dr. Nelson has requested sto rointestinal microbiome. Dr. Nelso +n and her colleague in a new research study. Dr. Ames is recruitin +g parti ers.</p> <p>In addition Dr. Thibodeau has expanded t sh; who have PKD.</ +p> <p>Dr. Harris' goal is to bette h another study. + </p> <p>Dr. Heit has also asked for ients who've had a clot. Dr. Heit's + goal is to identi To study microvesicles Dr. Jayachandran is + requesti pice caregivers. </p> <p>Dr. Kaur is researching whet 18">Nilufer Taner M.D. Ph.D +.</a> is studying geneti 0027660">Janet E. Olson Ph.D.</a> + and <a href="http: Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html + ompany-overview/hamilton-awards">Awards</a></li> + <l Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html + Ames Biobank , News Search \| Ask.com Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html Am +es%2520Biorepository , News Search \| Ask.com [download] but as you can see I have spacing and formatting issues as I'm trying to get 1 row per file with all items from that file listed on that 1 line I can certainly copy this: `my $string = quotemeta 'CEO'; while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { print $fh1 $1, ","; }` [download] 20 times and replace the $string value with what I'm looking for but that seems like an inelegant and wasteful way as I should be able to do it in an array I've modified my code to the following #!perl use strict; use warnings; use File::Find; use HTTP::Headers; use HTML::HeadParser; use Text::CSV; # config my $dfile = 'all_tags.csv'; my $dir = 'Test'; my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description', 'X-Meta-keywords', 'X-Meta-name',); my @TAGS2 = ('CEO', 'founder', 'professor', 'Dr.', 'Ph.D', 'M.D.', 'company called', 'startup called', 'joins', 'receives funding', 'SBIR', 'receiving the grant', 'seed investment', 'seed fund', 'appointed', 'chosen', 'secures', 'award', 'seed investment', 'awarded', ); # output my $csv = Text::CSV->new({eol => $/}); open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!"; $csv->print($fh1,['Filename',@TAGS]); # parser header my $string = map {quotemeta} @TAGS2; #my $text = while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { $string->print($fh1, ['Filename',@TAGS2]);# regex header } # input find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir); close $fh1 or die "Error closing $dfile: $!"; exit; sub HTML_Files { parse_HTML_Header($File::Find::name) if /\.html?$/; } sub parse_HTML_Header { my $ifile = shift; print "parsing $ifile\n"; open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n"; my $text = do{ local $/; <$fh0> }; close $fh0; my $h = HTTP::Headers->new; my $p = HTML::HeadParser->new($h); $p->parse($text); my @cols = map{ $h->header($_) }@TAGS; $csv->print($fh1, [$ifile,@cols]); my @cols2 = map{ $h->$string($_) }@TAGS2; $string->print($fh1, [$ifile,@cols2]); #my $string = quotemeta 'awarded'; #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { #print $fh1 $1,"\n"; # } } [download] and get the following errors: Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38. parsing Test/1.html Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.	[reply] [d/l] [select]
Re^2: Perl Array Question, combining HTML::HeadParser and regex by poj (Abbot) on Feb 01, 2016 at 08:43 UTC
Try #!perl use strict; use warnings; use File::Find; use HTTP::Headers; use HTML::HeadParser; use Text::CSV; # config my $dfile = 'all_tags.csv'; my $dir = 'Test'; my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description', 'X-Meta-keywords', 'X-Meta-name',); # match words my @WORDS = qw( press founder professor Dr. Ph.D M.D called receives joins timing find two self bottom true amazing forget night next day ); my $words = join '\|',map { quotemeta } @WORDS ; my $regex = qr/.{0,25} (?:$words) .{0,25}/; # output my $csv = Text::CSV->new({eol => $/}); open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!"; $csv->print($fh1,['Search Words',@WORDS]); # header $csv->print($fh1,['Filename',@TAGS,'Search Results']); # header # input find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir); close $fh1 or die "Error closing $dfile: $!"; exit; sub HTML_Files { parse_HTML_Header($File::Find::name) if /\.html?$/; } sub parse_HTML_Header { my $ifile = shift; print "parsing $ifile\n"; open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n"; my $text = do{ local $/; <$fh0> }; close $fh0; my @matches = ($text =~ /($regex)/gisx); #print join "\n",@matches; my $h = HTTP::Headers->new; my $p = HTML::HeadParser->new($h); $p->parse($text); my @cols = map{ $h->header($_) \|\| '' }@TAGS; $csv->print($fh1, [$ifile,@cols,@matches]); } [download] poj	[reply] [d/l]
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 15:50 UTC
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 14:36 UTC
Re^4: Perl Array Question, combining HTML::HeadParser and regex by poj (Abbot) on Feb 01, 2016 at 17:13 UTC
Re: Perl Array Question, combining HTML::HeadParser and regex by ww (Archbishop) on Feb 01, 2016 at 15:52 UTC
In addition to what kcott noted in his reply, your OP would have been a tad better had you included a link to the prior thead Parsing web content with File::Find, HTML::HeadParser and using a regex (in PM markup, `[id://1154144]`), as it casts some light on your issues. See also What shortcuts can I use for linking to other information? *Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:* code verbatim error and/or warning messages *a coherent explanation of what "doesn't work* actually means.**	[reply] [d/l]


more useful options
	PerlMonks