getting and printing form values etc from html stripping out all else

kalkisong has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on some code that searches several HTML pages on my site, strips out specific lines of HTML, reformats them the way I want, and prints the result to a single page. Right now it sort of works: I can get my lines, but it prints the entire line, and I am not sure how to write the code to get exactly what I want. For example, the current Perl code below finds every line on the HTML pages that contains <img src, but when a <br> exists on the same line, the code prints the entire line, like this:
<img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada
I really just want it to print something like this, on one line: <g:image_link>http://www.mysite/graphics/blue.jpg</g:image_link>
I am also grabbing hidden form fields and trying to print just their values. How do I strip out <br>, <hr>, <b>, etc.?
I am stuck and not really sure what to do next. Thank you for your assistance.

#!/usr/bin/perl -T
use strict;
use warnings;

# Set variables
#############################################################################
my $output_file = "../../cgi/generator/source-output.html";    # chmod 766

# Pages to scan
#############################################################################
my @files = qw| ../../page1.html ../../page2.html |;
my @lines;
for my $filename (@files) {
    # "or" binds more loosely than "||", so die fires when open fails
    open my $fh, '<', $filename
        or die "Cannot open $filename for reading: $!\n";
    push @lines, <$fh>;    # collect every line of every page
    close $fh;
}

# Output generator
#############################################################################
open my $out, '>', $output_file
    or die "Can't open $output_file: $!\n";
print $out "Generator CGI Tester\n\n";

foreach my $line (@lines) {

    # Generator variables
    #########################################################################

    ### Image: prints the whole matching line
    if ( $line =~ m/<img src/ ) {
        print $out $line;
    }

    ### Form action: prints the whole matching line
    if ( $line =~ m/<input type="hidden" name="description"/ ) {
        print $out $line;
    }
}
close $out;

# Redirect browser to generated page
print "Location: $output_file\n\n";
exit;

Replies are listed 'Best First'.
Re: getting and printing form values etc from html stripping out all else by pemungkah (Priest) on Feb 24, 2010 at 21:18 UTC
The big problem is that the tool you're using (regular expressions) is terribly bad at parsing things like HTML. I'd strongly recommend one of the HTML parsing modules (HTML::Parser or HTML::TreeBuilder) to do this job. They take care of the messy business of actually understanding the HTML and let you concentrate on stuff like "I want the contents of this tag".
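As a rough illustration of the "I want the contents of this tag" idea, here is a minimal sketch using HTML::TokeParser, which ships with the HTML::Parser distribution on CPAN. The helper name extract_fields, the sample HTML, and the <g:description> output element are assumptions made up for the example, not something from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Hypothetical helper: returns the src of every <img> and the value of
# every hidden <input name="description"> found in a string of HTML.
sub extract_fields {
    my ($html) = @_;
    my ( @images, @descriptions );

    # Passing a reference to a scalar parses the string itself.
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser: $!\n";

    # get_tag() skips ahead to the next start tag with one of these names
    # and returns an array ref whose first two items are the tag name and
    # a hash ref of its attributes.
    while ( my $tag = $parser->get_tag( 'img', 'input' ) ) {
        my ( $name, $attr ) = @$tag;
        if ( $name eq 'img' ) {
            push @images, $attr->{src} if defined $attr->{src};
        }
        elsif ( lc( $attr->{type} // '' ) eq 'hidden'
            and ( $attr->{name} // '' ) eq 'description' )
        {
            push @descriptions, $attr->{value} // '';
        }
    }
    return ( \@images, \@descriptions );
}

# Sample input (assumed): the img line from the question plus a hidden field.
my $html = <<'HTML';
<img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada
<input type="hidden" name="description" value="Blue widget">
HTML

my ( $images, $descriptions ) = extract_fields($html);
print "<g:image_link>$_</g:image_link>\n"   for @$images;
print "<g:description>$_</g:description>\n" for @$descriptions;    # element name is hypothetical
```

Because the parser hands back attributes by name, it does not matter whether src comes first in the tag or what else (<br>, <hr>, <b>) shares the line.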
Re^2: getting and printing form values etc from html stripping out all else by kalkisong (Initiate) on Feb 24, 2010 at 22:17 UTC
Thank you. I've looked over the links and the info for HTML::Parser etc.; however, I am unsure how to implement it in my current code. I am fairly new to this. I'll keep reading into it; any more assistance will be appreciated.
Re: getting and printing form values etc from html stripping out all else by ww (Archbishop) on Feb 25, 2010 at 02:13 UTC
For your first requirement, a regex is probably safe and effective, since (unless I'm having a Sr. moment) the HTML 4.x standard does not allow an image tag with a literal ">" inside the tag. One way to approach the job, therefore, is to extend your regex with a negated character class, so the match stops just before the tag's closing ">". Here's a sketch, minus file-handling, CGI, etc.:

#!/usr/bin/perl
use strict;
use warnings;
#825146

my @line = <DATA>;
for my $line (@line) {
    chomp $line;
    if ( $line =~ m/(<img .?[^>]+)/ ) {
        print "<g:image_link> " . $1 . "> </g:image_link>\n";
    } else {
        print "\t nope: $line \n";    # you may want to send this to a different file
    }
}
__DATA__
<p><img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada</p>
<p><img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"></p>
<p>foobar with no img</p>
<blockquote><img width="75" height="75" src="blue2.png"></blockquote>

Output:

<g:image_link> <img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" > </g:image_link>
<g:image_link> <img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"> </g:image_link>
	 nope: <p>foobar with no img</p>
<g:image_link> <img width="75" height="75" src="blue2.png"> </g:image_link>

BUT take the advice from pemungkah above: Use a parser! Trying to deal with all the possible unwanted tags in a form with regexen is going to get you deeper and deeper into complexities. And if you're planning to read user input from a form, for heaven's sake, read about untainting. You really don't want to let the fumble-fingered or malicious run around loose in your playground.
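If only the URL is wanted inside <g:image_link>, the same regex idea can be pushed a little further by capturing just the quoted src value. This is only a sketch under assumptions: the helper names image_src and strip_tags are made up, the src value is assumed to be quoted, and the tag stripper is deliberately crude (it will misfire on comments, scripts, or a ">" inside an attribute value).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the quoted src value of the first <img> tag
# in a line, or undef if the line has no such tag.
sub image_src {
    my ($line) = @_;
    # [^>]*? lets src appear anywhere inside the tag; \1 matches the same
    # quote character that opened the value.
    return $line =~ m/<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1/i ? $2 : undef;
}

# Hypothetical helper: crude removal of anything that looks like a tag.
sub strip_tags {
    my ($line) = @_;
    ( my $text = $line ) =~ s/<[^>]*>//g;
    return $text;
}

# Sample lines taken from the sketch above.
my @lines = (
    '<p><img src="http://www.mysite/graphics/blue.jpg" alt="Hey" width="100" height="100" ><br>yada yada</p>',
    '<p><img src="../grapics/blue1.gif" alt="Yo" width="200" height="75"></p>',
    '<p>foobar with no img</p>',
    '<blockquote><img width="75" height="75" src="blue2.png"></blockquote>',
);

for my $line (@lines) {
    my $src = image_src($line);
    print "<g:image_link>$src</g:image_link>\n" if defined $src;
}
```

For the sample lines this prints one <g:image_link> line per image and nothing for the line without one. It stays a stopgap: once several kinds of tags and attributes are involved, a parser module is the more robust route.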
Re: getting and printing form values etc from html stripping out all else by Anonymous Monk on Feb 25, 2010 at 02:54 UTC
HTML::Form

use HTML::Form;
$form = HTML::Form->parse($html, $base_uri);
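To make that one-liner a bit more concrete, here is a hedged sketch of pulling only the hidden field values out of a page with HTML::Form (a CPAN module). The helper name hidden_values, the sample form, and the base URI are assumptions for illustration. Note that HTML::Form only looks at inputs that sit inside a <form> element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;    # CPAN module

# Hypothetical helper: returns name => value pairs for every hidden input
# in every form found in the HTML.
sub hidden_values {
    my ( $html, $base_uri ) = @_;
    my %hidden;

    # In list context parse() returns one HTML::Form object per <form>.
    for my $form ( HTML::Form->parse( $html, $base_uri ) ) {
        for my $input ( $form->inputs ) {
            next unless $input->type eq 'hidden';
            next unless defined $input->name;
            $hidden{ $input->name } = $input->value;
        }
    }
    return %hidden;
}

# Sample form (assumed); the action and field values are made up.
my $html = <<'HTML';
<form action="/cgi/cart" method="post">
<input type="hidden" name="description" value="Blue widget">
<input type="text" name="qty" value="1">
</form>
HTML

my %hidden = hidden_values( $html, 'http://www.mysite/' );
print "$_: $hidden{$_}\n" for sort keys %hidden;
```

Only the values come back, so there is no surrounding markup left to strip.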

