First of all, if you're new to Perl, put the following two lines at the top of every script:
use strict;
use warnings;
These catch many mistakes (misspelled variable names, use of undefined values) before they turn into bugs. Second, why aren't you using a module from CPAN to parse the HTML, e.g. HTML::TreeBuilder? Avoid picking HTML apart with regular expressions. The SGML specifications from which HTML is derived are pretty loose, so for every rule there are half a dozen exceptions (or more!) that most browsers will render happily even though they are a pain to parse. A parser module will not only make your code more robust, it will also make it much more intuitive to read, e.g.:
use strict;
use warnings;
use HTML::TreeBuilder;
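# the HTML itself is expected as the first argument;
# to read from a file instead, use $tree->parse_file($filename)
my $HTML_to_parse = shift (@ARGV);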
my $tree = HTML::TreeBuilder->new;
$tree->parse($HTML_to_parse);
$tree->eof;
my @paragraph_tags = $tree->look_down('_tag', 'p');
foreach my $p (@paragraph_tags) {
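# look_down starts at $p itself, so a count of 1 below
# means this paragraph has no <p> nested inside it.
# note that this variable will "hide" the other
# copy of @paragraph_tags and be garbage collected
# as soon as it goes out of scope (the end of the
# foreach loop)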
my @paragraph_tags = $p->look_down('_tag', 'p');
if (scalar (@paragraph_tags) == 1) {
my $tag = shift (@paragraph_tags);
my @contents = $tag->content_list;
my $content = "";
foreach my $con (@contents) {
# check that we have text and not an object
$content .= $con unless (ref $con);
}
print $content;
}
}
# free the tree explicitly (needed on older HTML::TreeBuilder versions)
$tree->delete;
Just to give you an idea of why using regular expressions to parse HTML is a bad idea, look at this:
<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>
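To make the failure concrete, here is a minimal core-Perl sketch. The pattern in it is a hypothetical tag-only regex made up for illustration (the regular expressions from the original question are not reproduced here), and naive_paragraph_text is likewise an invented helper name.

```perl
use strict;
use warnings;

# Hypothetical tag-only pattern, assumed for illustration:
# it expects a bare <p> with no attributes.
my $naive = qr{<p>(.*?)</p>}s;

# Return the text of the first paragraph the naive pattern finds,
# or undef if it finds none.
sub naive_paragraph_text {
    my ($html) = @_;
    return $html =~ $naive ? $1 : undef;
}

my $plain      = '<p>Plain paragraph.</p>';
my $with_class = '<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>';

for my $html ($plain, $with_class) {
    my $text = naive_paragraph_text($html);
    print defined $text ? "matched: $text\n" : "no match\n";
}
```

The first string matches, the second does not: once a tag carries an attribute, the literal text <p> never appears in the input, so the pattern silently finds nothing.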
Your original regular expressions make no allowance for the class="" attribute, so your code would break on any page whose tags carry attributes. HTML::TreeBuilder takes this in stride and lets you get at the attributes if you ever need them: my %attr = $node->all_external_attr;. So again, don't reinvent the wheel if you don't have to.
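For reference, a short sketch of reading those attributes, assuming HTML::TreeBuilder is installed from CPAN. The helper name paragraph_classes is made up for this example. new_from_content, look_down, all_external_attr and as_text are the module's own methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the class attribute of every <p>
# in document order ('(none)' when a paragraph has no class).
sub paragraph_classes {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @classes;
    for my $p ($tree->look_down(_tag => 'p')) {
        # all_external_attr returns name/value pairs, skipping
        # the parser's internal attributes such as _tag
        my %attr = $p->all_external_attr;
        push @classes, defined $attr{class} ? $attr{class} : '(none)';
    }
    $tree->delete;    # release the tree once we are done with it
    return @classes;
}

my $html = '<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>';
print "class: $_\n" for paragraph_classes($html);
```

No pattern has to anticipate which attributes a tag might carry; the parser hands them over as an ordinary hash.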