Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Perl Array Question, combining HTML::HeadParser and regex

by Anonymous Monk
on Feb 01, 2016 at 05:57 UTC ( [id://1154161]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I had some great help earlier getting some extra printing out of my script. I thought I only needed one regex but it turns out I need >20. I've tried to cobble together what I think should work with my minimal perl abilities but I can't get it to work. The script looks at several html files in a folder and is supposed to parse header info and regex look up results and write to one file one line per html file. Here's my code (which I've had lot's of help on already, thanks! I'm sure it's my second array and the way I'm trying to print it that's wrong.

#!perl use strict; use warnings; use File::Find; use HTTP::Headers; use HTML::HeadParser; use Text::CSV; # config my $dfile = 'all_tags.csv'; my $dir = 'Test'; my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description', 'X-Meta-keywords', 'X-Meta-name',); my @TAGS2 = ('press', 'founder', 'professor', 'Dr.', 'Ph.D', 'M.D.', 'called', 'receives', 'joins', 'timing', 'find', 'two', 'self', 'bottom', 'true', 'amazing', 'forget', 'night', 'next', 'day', ); # output my $csv = Text::CSV->new({eol => $/}); open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!"; $csv->print($fh1,['Filename',@TAGS]); # header my $string = quotemeta @TAGS2; while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { $string->print($fh1, [@TAGS2,","]); } # input find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir); close $fh1 or die "Error closing $dfile: $!"; exit; sub HTML_Files { parse_HTML_Header($File::Find::name) if /\.html?$/; } sub parse_HTML_Header { my $ifile = shift; print "parsing $ifile\n"; open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n"; my $text = do{ local $/; <$fh0> }; close $fh0; my $h = HTTP::Headers->new; my $p = HTML::HeadParser->new($h); $p->parse($text); my @cols = map{ $h->header($_) }@TAGS; $csv->print($fh1, [$ifile,@cols]); #my $string = quotemeta 'awarded'; #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { #print $fh1 $1,"\n"; # } }

Replies are listed 'Best First'.
Re: Perl Array Question, combining HTML::HeadParser and regex
by kcott (Archbishop) on Feb 01, 2016 at 06:59 UTC

    You have several problems with your post:

    • You've shown no representative input.
    • You've shown no actual output.
    • You've shown no expected output.
    • Your problem description is: "I can't get it to work". That's of absolutely no help to us whatsoever.

    I can see a number of problems with your code. I'll document the first two that leapt out at me. Fix these and, if you still need further help, please address all the points I've raised above (see "How do I post a question effectively?" if you need more help with this).

    Problem 1:

    my $string = quotemeta @TAGS2;

    That forces a scalar context which will give you the number of elements in the array, e.g.

    $ perl -wE 'my @x = qw{$ % ^}; my $y = quotemeta @x; say $y' 3

    Perhaps you need something closer to:

    $ perl -wE 'my @x = qw{$ % ^}; my @y = map { quotemeta } @x; say "@y"' \$ \% \^

    Problem 2:

    while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {

    You may be confusing assignment (using =):

    my $x = ...

    with regex matching (using =~):

    $x =~ /.../

    Perhaps your code should look more like:

    my $text = ... ... while ($text =~ ...

    — Ken

      Thanks for your response and your constructive criticism. I can definitely see your points. Here are some examples of my html input:

      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta name="generator" content= "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15. +15), see www.w3.org"> <title>Aberdeen%20Genetic%20purity , News Search | Ask.com</title> <link rel="shortcut icon" href="http:/"> <link rel="apple-touch-icon" href= "http://www."> <link rel="apple-touch-icon" sizes="76x76" href= "http://www."> <link rel="apple-touch-icon" sizes="120x120" href= "http://www"> <link rel="apple-touch-icon" sizes="152x152" href= "http://www."> <link rel="apple-touch-icon" sizes="180x180" href= "http://www."> <meta name="robots" content="noindex, nofollow"> <style type="text/css"> #midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer + { visibility:visible !important; } </style> <link type="text/css" rel="stylesheet" href= "/st/c/css/ask_news.min.1fbfe92a.css"> <style type="text/css"> .hcsa { visibility:hidden; } .sprite { background-image: url(http://); } .news-top-video-arrow{ background-image: url(http://); } @media (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi) { .sprite { background-image: url(http://); background-size: 111px 181px; } } </style> </head>

      Here's an example of my output:

      Filename Content-Base Title X-Meta-author X-Meta-descripti +on X-Meta-keywords X-Meta-name Test/1.html Aberdeen%20Animal%20trait%20analysis , News Search | As +k.com

      I'm trying to get the regex results to print above after the last entry X-Meta-name. If I use the follow code instead of the array that I'm using:

      my $string = quotemeta 'CEO'; while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { print $fh1 $1, ","; }

      I can get the following

      Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.ht +ml Ames Animal trait analysis , News Search | Ask.com + Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html + both adults and infants. Dr. Kocher has requested who hrough a sepa +rate study. Dr. Lazaridis' samples alon colon and rectal cance +r. Dr. Nelson has requested sto rointestinal microbiome. Dr. Nelso +n and her colleague in a new research study. Dr. Ames is recruitin +g parti ers.</p> <p>In addition Dr. Thibodeau has expanded t sh; who have PKD.</ +p> <p>Dr. Harris' goal is to bette h another study. + </p> <p>Dr. Heit has also asked for ients who've had a clot. Dr. Heit's + goal is to identi To study microvesicles Dr. Jayachandran is + requesti pice caregivers. </p> <p>Dr. Kaur is researching whet 18">Nilufer Taner M.D. Ph.D +.</a> is studying geneti 0027660">Janet E. Olson Ph.D.</a> + and <a href="http: Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html + ompany-overview/hamilton-awards">Awards</a></li> + <l Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html + Ames Biobank , News Search | Ask.com Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html Am +es%2520Biorepository , News Search | Ask.com

      but as you can see I have spacing and formatting issues as I'm trying to get 1 row per file with all items from that file listed on that 1 line

      I can certainly copy this:

      my $string = quotemeta 'CEO'; while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { print $fh1 $1, ","; }

      20 times and replace the $string value with what I'm looking for but that seems like an inelegant and wasteful way as I should be able to do it in an array

      I've modified my code to the following

      #!perl use strict; use warnings; use File::Find; use HTTP::Headers; use HTML::HeadParser; use Text::CSV; # config my $dfile = 'all_tags.csv'; my $dir = 'Test'; my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description', 'X-Meta-keywords', 'X-Meta-name',); my @TAGS2 = ('CEO', 'founder', 'professor', 'Dr.', 'Ph.D', 'M.D.', 'company called', 'startup called', 'joins', 'receives funding', 'SBIR', 'receiving the grant', 'seed investment', 'seed fund', 'appointed', 'chosen', 'secures', 'award', 'seed investment', 'awarded', ); # output my $csv = Text::CSV->new({eol => $/}); open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!"; $csv->print($fh1,['Filename',@TAGS]); # parser header my $string = map {quotemeta} @TAGS2; #my $text = while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { $string->print($fh1, ['Filename',@TAGS2]);# regex header } # input find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir); close $fh1 or die "Error closing $dfile: $!"; exit; sub HTML_Files { parse_HTML_Header($File::Find::name) if /\.html?$/; } sub parse_HTML_Header { my $ifile = shift; print "parsing $ifile\n"; open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n"; my $text = do{ local $/; <$fh0> }; close $fh0; my $h = HTTP::Headers->new; my $p = HTML::HeadParser->new($h); $p->parse($text); my @cols = map{ $h->header($_) }@TAGS; $csv->print($fh1, [$ifile,@cols]); my @cols2 = map{ $h->$string($_) }@TAGS2; $string->print($fh1, [$ifile,@cols2]); #my $string = quotemeta 'awarded'; #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) { #print $fh1 $1,"\n"; # } }

      and get the following errors: Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38. parsing Test/1.html Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.

        Try
        #!perl use strict; use warnings; use File::Find; use HTTP::Headers; use HTML::HeadParser; use Text::CSV; # config my $dfile = 'all_tags.csv'; my $dir = 'Test'; my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description', 'X-Meta-keywords', 'X-Meta-name',); # match words my @WORDS = qw( press founder professor Dr. Ph.D M.D called receives joins timing find two self bottom true amazing forget night next day ); my $words = join '|',map { quotemeta } @WORDS ; my $regex = qr/.{0,25} (?:$words) .{0,25}/; # output my $csv = Text::CSV->new({eol => $/}); open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!"; $csv->print($fh1,['Search Words',@WORDS]); # header $csv->print($fh1,['Filename',@TAGS,'Search Results']); # header # input find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir); close $fh1 or die "Error closing $dfile: $!"; exit; sub HTML_Files { parse_HTML_Header($File::Find::name) if /\.html?$/; } sub parse_HTML_Header { my $ifile = shift; print "parsing $ifile\n"; open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n"; my $text = do{ local $/; <$fh0> }; close $fh0; my @matches = ($text =~ /($regex)/gisx); #print join "\n",@matches; my $h = HTTP::Headers->new; my $p = HTML::HeadParser->new($h); $p->parse($text); my @cols = map{ $h->header($_) || '' }@TAGS; $csv->print($fh1, [$ifile,@cols,@matches]); }
        poj
Re: Perl Array Question, combining HTML::HeadParser and regex
by ww (Archbishop) on Feb 01, 2016 at 15:52 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1154161]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (6)
As of 2024-04-25 13:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found