RE: another regex Question
by Russ (Deacon) on May 24, 2000 at 09:04 UTC
|
Just a quick response for you...
If you add a ? after the *, it will make the expression
non-greedy. As long as you don't have any nested tables,
this will work. In greedy mode (the default), the regex
will grab as much as it can into that .*, matching everything
from the first correct table tag, through the very last
closing table tag.
my $thispage = q{
<TABLE border=0 cellPadding=2 cellSpacing=0>junk</TABLE>
<TABLE border=0 bogusTag="Don't match this">more junk</TABLE>
<TABLE border=0 cellPadding=2 cellSpacing=0>most junk</TABLE>
};
$thispage =~ s/<TABLE border=0 cellPadding=2 cellSpacing=0>.*?<\/TABLE>//g;
print $thispage;
outputs:
<TABLE border=0 bogusTag="Don't match this">more junk</TABLE>
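For contrast, here is a small sketch of the greedy and non-greedy forms run against the same input. The variable names ($page, $greedy, $lazy) are only for illustration. Note the /s modifier, which is an addition here: without it, . does not match a newline, so a greedy .* can only run past a closing tag when the tables share a line.

```perl
use strict;
use warnings;

# Same three tables as above, one per line.
my $page = join "\n",
    q{<TABLE border=0 cellPadding=2 cellSpacing=0>junk</TABLE>},
    q{<TABLE border=0 bogusTag="Don't match this">more junk</TABLE>},
    q{<TABLE border=0 cellPadding=2 cellSpacing=0>most junk</TABLE>};

# Greedy: .* runs from the first matching open tag to the LAST </TABLE>.
(my $greedy = $page) =~ s{<TABLE border=0 cellPadding=2 cellSpacing=0>.*</TABLE>}{}gs;

# Non-greedy: .*? stops at the first </TABLE> after each matching open tag.
(my $lazy = $page) =~ s{<TABLE border=0 cellPadding=2 cellSpacing=0>.*?</TABLE>}{}gs;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy version swallows the middle table as well, while the non-greedy version leaves it in place.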
There are better answers to this problem, as others have
posted, but I think this "fixes" your regex... :-)
Russ
Re: another regex Question
by athomason (Curate) on May 24, 2000 at 08:15 UTC
|
First read this faq. In a nutshell, HTML parsing, especially something like analyzing arbitrary tables, is pretty difficult. There are modules designed especially for this, though, so check out HTML::Parser and HTML::TokeParser. Also see answers to a similar question here.
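As a rough idea of what the module route can look like, here is a minimal sketch using HTML::TokeParser (a CPAN module, so it has to be installed). The helper name strip_tables and the sample input are made up for illustration. It drops each table whose start tag carries the wanted attribute values, whatever their order or letter case, and it assumes the keys you pass in are lowercase, since the parser lowercases attribute names.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: drop each <table> whose start tag carries all of the
# wanted attribute values, regardless of attribute order or letter case.
sub strip_tables {
    my ($html, %want) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
    my ($out, $skip) = ('', 0);    # $skip counts open tables being dropped

    while (my $tok = $p->get_token) {
        my $type = $tok->[0];
        # Original source text: field 1 for text tokens, last field otherwise.
        my $text = $type eq 'T' ? $tok->[1] : $tok->[-1];

        if ($type eq 'S' && $tok->[1] eq 'table') {
            my $attr = $tok->[2];  # hash ref, attribute names lowercased
            if ($skip) {
                $skip++;           # nested table inside a dropped one
            }
            elsif (!grep { !defined $attr->{$_}
                           || lc($attr->{$_}) ne lc($want{$_}) } keys %want) {
                $skip = 1;         # every wanted attribute matched
            }
        }
        $out .= $text unless $skip;
        $skip-- if $skip && $type eq 'E' && $tok->[1] eq 'table';
    }
    return $out;
}

my $html = q{<TABLE cellspacing=0 BORDER=0 cellPadding="2">junk</TABLE>}
         . q{<TABLE border=0 bogusTag="Don't match this">more junk</TABLE>};
my $result = strip_tables($html, border => 0, cellpadding => 2, cellspacing => 0);
print "$result\n";
```

The first table goes even though its attributes are reordered, quoted, and differently cased, which is the kind of input a literal regex misses.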
|
There's a ready-made subclass of HTML::Parser which should help. Check out HTML::TableExtract
|
Yes, you need to be very careful with HTML, as:
cellpadding, cellPadDing, ceLLpaddinG are all the same. I actually saw a table-extract module which may be of use to you.
Personally, I would not advise doing any HTML parsing yourself. Use modules -- their authors know their stuff!
RE: another regex Question
by Michalis (Pilgrim) on May 24, 2000 at 14:14 UTC
|
I think you should be able to do it like this:
# /i ignores the case of the tag and attribute names; /s lets . match newlines
if (/<TABLE border=0 cellPadding=2 cellSpacing=0>(.*?)<\/TABLE>/is) {
    print $1;
}
Re: another regex Question
by johncoswell (Acolyte) on May 24, 2000 at 17:37 UTC
|
Why not try a split? I process many HTML files with this command. Split the file on every occurrence of <TABLE and you won't have to worry about eating up too many tables.
@parsedfile = split(/<TABLE/i, $file);
This way, you can concentrate on only one table at a time and won't have to worry about greedy regexps.
foreach $line (@parsedfile) {
    # /i so that cellPadding, CELLPADDING, etc. all match
    if ($line =~ /cellpadding=2/i) {
        # do whatever
    }
}
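Here is a self-contained sketch of the same idea with made-up sample data. One change from the snippet above, which is my own variation: splitting on a zero-width lookahead keeps the <TABLE text at the front of each chunk instead of throwing it away. The names $file, @chunks and @hits are just for the example.

```perl
use strict;
use warnings;

# Hypothetical sample input: some leading markup followed by two tables.
my $file = q{<p>intro</p>}
         . q{<TABLE border=0 cellPadding=2 cellSpacing=0>junk</TABLE>}
         . q{<TABLE border=0 bogusTag="x">more junk</TABLE>};

# Split just before each <TABLE (zero-width lookahead), so the tag itself
# stays at the start of its chunk.
my @chunks = split /(?=<TABLE\b)/i, $file;

# Keep only the chunks that mention cellpadding=2, in any letter case.
my @hits = grep { /cellpadding=2/i } @chunks;

print "$_\n" for @hits;
```

Each chunk still includes whatever follows the closing tag up to the next table, so this is best for picking tables out rather than for exact editing.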
John Coswell - http://www.coswell.com
|
You will run into problems with nested tables, won't you?
A more complex solution would involve keeping track of the number of open and close table tags, so you can be sure they match up. For instance, each time you pass an open table tag, increment a counter; each time you pass a close table tag, decrement it. While the counter is 1 or more, you are inside a table; when it returns to 0, you are outside of any table. If it reaches 2 or more, you are inside a nested table.
I don't know how feasible this is, but it might be useful.
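A minimal sketch of that counter, in plain Perl, might look like the following. The function name strip_nested_tables is made up, and it assumes well-formed table tags with no "<table" text hiding in comments, scripts or attribute values. It keeps outermost tables and drops anything at depth 2 or more.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every table nested inside another table,
# keeping the outermost tables and their other content.
sub strip_nested_tables {
    my ($html) = @_;
    my ($out, $depth) = ('', 0);

    # Split on table tags; the capturing group keeps the tags in the list.
    for my $piece (split /(<\/?table\b[^>]*>)/i, $html) {
        if ($piece =~ /^<table\b/i) {
            $depth++;                          # open tag: one level deeper
            $out .= $piece if $depth == 1;     # keep only outermost opens
        }
        elsif ($piece =~ /^<\/table\b/i) {
            $out .= $piece if $depth == 1;     # keep only outermost closes
            $depth-- if $depth > 0;            # never go below zero
        }
        else {
            $out .= $piece if $depth <= 1;     # drop text inside nested tables
        }
    }
    return $out;
}

my $html = '<table><tr><td>outer<table><tr><td>inner</td></tr></table>'
         . 'tail</td></tr></table>';
my $flat = strip_nested_tables($html);
print "$flat\n";
```

Changing the depth tests would give other behaviors, such as dropping only the nested table tags while keeping their contents.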
J. J. Horner
Linux, Perl, Apache, Stronghold, Unix
jhorner@knoxlug.org http://www.knoxlug.org
|
Definitely true. 8^) I guess it depends on whether you use nested tables and need to keep track of the nesting for some purpose. If you have a table nested within a table and you just want to delete the inner table's tags, its contents would be plopped into the outer table's cell without formatting, kind of like how you merge cells in PageMill.
Re: another regex Question
by KM (Priest) on May 24, 2000 at 18:08 UTC
|
Stop steering him in the direction of reinventing the wheel. The advice to look at HTML::Parser, HTML::TokeParser, and the other HTML::* modules is the best answer. Don't try to reinvent what already works.
Cheers,
KM
Re: a regex Question
by BigJoe (Curate) on May 25, 2000 at 00:46 UTC
|
Actually, I am just trying to remove all the embedded tables that are in the HTML file but leave the one main table.