PerlMonks  

RFC: Module for extracting data from generated HTML pages

by Jaap (Curate)
on Jul 26, 2005 at 12:37 UTC ( [id://478157]=perlmeditation )

Oh Wise Ones,

In a sudden burst of sanity I created a module that can extract data from a bunch of generated or otherwise similar HTML pages. It's a bit template-like, but in the reverse direction: instead of filling values into a template, it pulls them back out.

To use it, you edit a copy of one page (the "template" file) and replace each value you want with [% name_of_the_value %]. The module then returns a ref to a hash that maps these names to the corresponding values found in a second document.

Edit: I just found out about Template::Extract so this can be moved to /dev/null


Example:
Let's say I have a bunch of HTML documents that all look kinda like this:
<html>
  <head>
    <title>Mammals</title>
  </head>
  <body>
    <h1>Mammals</h1>
    <h2 id="1">Monkeys</h2>
  </body>
</html>
Now I want to extract certain values from that HTML document.
From the HTML document I create a template that looks like this:
<html>
  <head>
    <title>[% title %]</title>
  </head>
  <body>
    <h1>Mammals</h1>
    <h2 id="[% myidentifier %]">[% animal %]</h2>
  </body>
</html>
Now this piece of code:
#!/usr/bin/perl
use strict;
use warnings;
use ExtractDiff;
use File::Slurp;

my $template = read_file('template.html');
my $document = read_file('document.html');

my $resultRef = ExtractDiff::getValues(\$template, \$document);
foreach (keys %$resultRef) {
    print "$_: $$resultRef{$_}\n";
}
Would produce this:
myidentifier: 1
animal: Monkeys
title: Mammals
The actual code is this:
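package ExtractDiff;

use strict;
use warnings;
use Algorithm::Diff qw(sdiff);

# Compare a template with a document (both passed as scalar refs) and
# return a hash ref mapping each [% name %] to the text found in its place.
sub getValues {
    my $template = shift;
    my $document = shift;
    my %result;

    foreach my $item (sdiff(splitFile($template), splitFile($document))) {
        # Only changed ('c') tokens whose template side holds a placeholder matter.
        if (($item->[0] eq 'c') && ($item->[1] =~ m/\[\%\s*(.+?)\s*\%\]/)) {
            my $name = $1;
            my $templateString = $item->[1];
            my $documentString = $item->[2];
            # Whatever sits between the literal prefix and postfix is the value.
            if ($templateString =~ m/^(.*?)\[\%.*?\%\](.*?)$/) {
                my $prefix = $1;
                my $postfix = $2;
                if ($documentString =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/) {
                    $result{$name} = $1;
                }
            }
        }
    }
    return \%result;
}

# Split markup into a list of tags and the text between them.
sub splitFile {
    my $ref = shift;
    my @file;
    # Keep every non-empty token; a plain truth test would also drop "0".
    push(@file, grep { defined($_) && length($_) } split(/\s*(<.+?>)\s*/, $$ref));
    return \@file;
}

1;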
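The heart of getValues is the prefix/postfix match applied to one pair of tokens. Here is a minimal sketch of that step in isolation, using only core Perl. The helper names tokens and match_token are made up for illustration and are not part of ExtractDiff, and the tokens are simply paired by position instead of being aligned with sdiff, which assumes both inputs split into the same number of tokens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of ExtractDiff): split markup into tags and
# the text between them, keeping every non-empty token.
sub tokens {
    my ($html) = @_;
    return grep { defined($_) && length($_) } split /\s*(<.+?>)\s*/, $html;
}

# Hypothetical helper: given one template token holding a single
# [% name %] placeholder and the document token paired with it,
# return (name, value), or an empty list if the two do not line up.
sub match_token {
    my ($template_token, $document_token) = @_;

    # Literal text before and after the placeholder, plus its name.
    my ($prefix, $name, $postfix) =
        $template_token =~ m/^(.*?)\[\%\s*(.+?)\s*\%\](.*)$/s
        or return;

    # The value is whatever the document has between prefix and postfix.
    my ($value) = $document_token =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/s
        or return;

    return ($name, $value);
}

my @template = tokens('<h2 id="[% myidentifier %]">[% animal %]</h2>');
my @document = tokens('<h2 id="1">Monkeys</h2>');

my %result;
for my $i (0 .. $#template) {
    last if $i > $#document;    # positional pairing is an assumption of this sketch
    my ($name, $value) = match_token($template[$i], $document[$i]) or next;
    $result{$name} = $value;
}

print "$_: $result{$_}\n" for sort keys %result;
```

As in the module, a token is expected to hold at most one placeholder; a tag with two [% ... %] markers would not match, because the second marker ends up as literal text in the postfix.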
Does anybody have any comments on this? Is it handy enough to put on CPAN? What would be a good name?

Replies are listed 'Best First'.
Re: RFC: Module for extracting data from generated HTML pages
by gellyfish (Monsignor) on Jul 26, 2005 at 13:03 UTC

    To be honest, in the first instance I would suggest that you have a discussion with the author of Template::Extract to see whether the features you find lacking in that module, and are trying to provide in yours, can be added there. You may even want to provide a set of patches that implement them.

    /J\
