<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15.
+15), see www.w3.org">
<title>Aberdeen%20Genetic%20purity , News Search |
Ask.com</title>
<link rel="shortcut icon" href="http:/">
<link rel="apple-touch-icon" href=
"http://www.">
<link rel="apple-touch-icon" sizes="76x76" href=
"http://www.">
<link rel="apple-touch-icon" sizes="120x120" href=
"http://www">
<link rel="apple-touch-icon" sizes="152x152" href=
"http://www.">
<link rel="apple-touch-icon" sizes="180x180" href=
"http://www.">
<meta name="robots" content="noindex, nofollow">
<style type="text/css">
#midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer
+ {
visibility:visible !important;
}
</style>
<link type="text/css" rel="stylesheet" href=
"/st/c/css/ask_news.min.1fbfe92a.css">
<style type="text/css">
.hcsa {
visibility:hidden;
}
.sprite {
background-image: url(http://);
}
.news-top-video-arrow{
background-image: url(http://);
}
@media
(-webkit-min-device-pixel-ratio: 2),
(min-resolution: 192dpi) {
.sprite {
background-image: url(http://);
background-size: 111px 181px;
}
}
</style>
</head>
Here's an example of my output:
Filename Content-Base Title X-Meta-author X-Meta-descripti
+on X-Meta-keywords X-Meta-name
Test/1.html Aberdeen%20Animal%20trait%20analysis , News Search | As
+k.com
I'm trying to get the regex results to print above after the last entry X-Meta-name.
If I use the follow code instead of the array that I'm using:
my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
print $fh1 $1, ",";
}
I can get the following
Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.ht
+ml Ames Animal trait analysis , News Search | Ask.com
+
Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html
+
both adults and infants. Dr. Kocher has requested who hrough a sepa
+rate study. Dr. Lazaridis' samples alon colon and rectal cance
+r. Dr. Nelson has requested sto rointestinal microbiome. Dr. Nelso
+n and her colleague in a new research study. Dr. Ames is recruitin
+g parti ers.</p>
<p>In addition Dr. Thibodeau has expanded t sh; who have PKD.</
+p>
<p>Dr. Harris' goal is to bette h another study.
+
</p>
<p>Dr. Heit has also asked for ients who've had a clot. Dr. Heit's
+ goal is to identi To study microvesicles Dr. Jayachandran is
+ requesti pice caregivers.
</p>
<p>Dr. Kaur is researching whet 18">Nilufer Taner M.D. Ph.D
+.</a> is studying geneti 0027660">Janet E. Olson Ph.D.</a>
+ and <a href="http:
Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html
+
ompany-overview/hamilton-awards">Awards</a></li>
+
<l Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html
+ Ames Biobank , News Search | Ask.com
Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html Am
+es%2520Biorepository , News Search | Ask.com
but as you can see I have spacing and formatting issues as I'm trying to get 1 row per file with all items from that file listed on that 1 line
I can certainly copy this:
my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
print $fh1 $1, ",";
}
20 times and replace the $string value with what I'm looking for but that seems like an inelegant and wasteful way as I should be able to do it in an array
I've modified my code to the following
#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;
# config
my $dfile = 'all_tags.csv';
my $dir = 'Test';
my @TAGS = ('Content-Base', 'Title',
'X-Meta-author', 'X-Meta-description',
'X-Meta-keywords', 'X-Meta-name',);
my @TAGS2 = ('CEO', 'founder',
'professor', 'Dr.',
'Ph.D', 'M.D.',
'company called', 'startup called',
'joins', 'receives funding',
'SBIR', 'receiving the grant',
'seed investment', 'seed fund',
'appointed', 'chosen',
'secures', 'award',
'seed investment', 'awarded',
);
# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile
or die "Error opening $dfile: $!";
$csv->print($fh1,['Filename',@TAGS]); # parser header
my $string = map {quotemeta} @TAGS2;
#my $text =
while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
$string->print($fh1, ['Filename',@TAGS2]);# regex header
}
# input
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;
sub HTML_Files {
parse_HTML_Header($File::Find::name) if /\.html?$/;
}
sub parse_HTML_Header {
my $ifile = shift;
print "parsing $ifile\n";
open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
my $text = do{ local $/; <$fh0> };
close $fh0;
my $h = HTTP::Headers->new;
my $p = HTML::HeadParser->new($h);
$p->parse($text);
my @cols = map{ $h->header($_) }@TAGS;
$csv->print($fh1, [$ifile,@cols]);
my @cols2 = map{ $h->$string($_) }@TAGS2;
$string->print($fh1, [$ifile,@cols2]);
#my $string = quotemeta 'awarded';
#while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
#print $fh1 $1,"\n";
# }
}
and get the following errors:
Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38.
parsing Test/1.html
Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.
|