Re: Help constructing a regex that also matches hyphens and parentheses
by wfsp (Abbot) on Apr 05, 2006 at 18:35 UTC
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TokeParser::Simple;
my @array = <DATA>;
for my $line (@array){
    my $tp = HTML::TokeParser::Simple->new(\$line);
    my $cell_data;
    while (my $t = $tp->get_token){
        $cell_data++, next if
            $t->is_start_tag('td') and
            $t->get_attr('height') == 40 and
            $t->get_attr('width') == 40 and
            $t->get_attr('border') == 0;
        print $t->as_is if $cell_data and $t->is_text;
    }
}
__DATA__
<td width='10' height='40' border='0'>cell data 1</td>
<td width='20' height='40' border='0'>cell data 2</td>
<td width='30' height='40' border='0'>cell data 3</td>
<td width='40' height='40' border='0'>cell data 4</td>
<td width='40' height='40' border='0'>cell data 5</td>
<td width='40' height='40' border='0'>cell data 6</td>
<td width='40' height='40' border='0'>cell data 7</td>
<td width='40' height='40' border='0'>cell data 8</td>
Output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
cell data 4
cell data 5
cell data 6
cell data 7
cell data 8
> Terminated with exit code 0.
I believe this gives you greater flexibility/adaptability and, for me, is quicker to write. :-)
Hope that helps
Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 17:09 UTC
None of this matches with a </td> at the end because none of the test lines have a </td>. Otherwise they all test true.
#!/usr/bin/perl
while (<DATA>){
    print if(m#\s+((\w+\s*)+)</td>#g);
}
__DATA__
This is a test
This is a (test)
This is a (test123)
this is a (123)
This-is a test
Just Another Perl Hacker </td>
Re: Help constructing a regex that also matches hyphens and parentheses
by McDarren (Abbot) on Apr 05, 2006 at 17:28 UTC
Would I be right in saying that you wish to match everything between a <td> and a matching </td>?
If that's the case:
m/<td>(.*?)<\/td>/;
Should do the trick.
(Note that the ? makes the .* non-greedy, which avoids slurping up too much if you have multiple <td></td> tags on the one line.)
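To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample row is a hypothetical stand-in, since the original poster's data was not shown:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two cells on one line.
my $row = '<td>one</td><td>two</td>';

# Greedy: .* runs to the LAST </td> on the line.
my ($greedy) = $row =~ m/<td>(.*)<\/td>/;

# Non-greedy: .*? stops at the FIRST </td> it can.
my ($lazy) = $row =~ m/<td>(.*?)<\/td>/;

# With /g in list context, every cell is returned in turn.
my @all = $row =~ m/<td>(.*?)<\/td>/g;

print "greedy: $greedy\n";   # one</td><td>two
print "lazy:   $lazy\n";     # one
print "all:    @all\n";      # one two
```

Note that this only works for a bare <td> with no attributes; cells written as <td width='40'> would need the opening tag loosened, or a parser.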
Cheers,
Darren :)
Re: Help constructing a regex that also matches hyphens and parentheses
by johngg (Canon) on Apr 05, 2006 at 18:44 UTC
I think it would help matters if you were able to post a sample of the data you are trying to match against, not just items in the data you hope to pull out.
Cheers, JohnGG
Re: Help constructing a regex that also matches hyphens and parentheses
by swampyankee (Parson) on Apr 05, 2006 at 17:29 UTC
\w matches neither hyphens nor parentheses; to quote the documentation (perlre): "\w Match a "word" character (alphanumeric plus "_")." You'll have to add them to the regex explicitly. Your regex will also fail for </TD>, which may not be what you want.
However, given your stated goal ("…I need it to match the entire line in $1 before it finds the </td>"), it would seem to me that your regex could just as easily read something like this:
push(@found_items, $1) if $data =~ m!(.*)</td>!i;
which, at first glance, looks as if it may work: everything up to </td> gets slurped into $1. Note that this WAS NOT TESTED.
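As a rough illustration of adding the extra characters explicitly, the sketch below uses a character class in place of the bare (.*). The sample lines and the exact class are assumptions, not the original poster's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the real input was never posted.
my @lines = (
    'This is a (test)</td>',
    'This-is a test</TD>',
    'no closing tag here',
);

my @found_items;
for my $data (@lines) {
    # The class lists word characters, whitespace, parentheses and the
    # hyphen; a hyphen placed last in a class is literal.
    # The /i flag lets the pattern accept </TD> as well as </td>.
    push @found_items, $1 if $data =~ m!([\w\s()-]+)</td>!i;
}

print "$_\n" for @found_items;
# This is a (test)
# This-is a test
```

The third line is skipped because it has no closing tag to anchor on.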
emc
"Being forced to write comments actually improves code, because it is easier to fix a crock than to explain it. " —G. Steele
Re: Help constructing a regex that also matches hyphens and parentheses
by CountZero (Bishop) on Apr 05, 2006 at 21:37 UTC
Verily, it is said that the Only True Way to deal with HTML tags is to run a parser on them, and everything else is an Abomination and will be cast into the Pit!
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Help constructing a regex that also matches hyphens and parentheses
by kutsu (Priest) on Apr 05, 2006 at 17:14 UTC
Check out the section on "character classes" in perlre (they're marked with square brackets, [ ]), as these will allow you to match hyphens and numbers as well. Also, you could write that push(...) while ... as @found_items = $data =~ m#\s+((\w+\s*)+)</td>#g;.
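Here is a small sketch of the list-context form combined with a character class. The data is hypothetical. One detail to watch: in list context a /g match returns every capture group for every match, so the inner group is made non-capturing with (?: ) here to get one item per cell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: two cells, each with leading whitespace before the text.
my $data = "<td> This-is a (test)</td>\n<td> cell 2</td>\n";

# [\w()-] adds parentheses and the hyphen to the word characters.
# The inner group is non-capturing so only the whole cell text is returned.
my @found_items = $data =~ m#\s+((?:[\w()-]+\s*)+)</td>#g;

print "[$_]\n" for @found_items;
# [This-is a (test)]
# [cell 2]
```

Nested quantifiers like ( ...+\s*)+ can backtrack heavily on long lines that fail to match, so this is better suited to short inputs.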
Re: Help constructing a regex that also matches hyphens and parentheses
by Herkum (Parson) on Apr 05, 2006 at 17:09 UTC
Let me ask you this, is there a beginning tag that you can also match against?
If we can be a little more specific about the beginning and end, can we match anything in between and return that; instead of expanding your search parameters to include these other special characters?
Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 17:26 UTC
This is a quick little hack on the matter. I'll look at it more later.
#!/usr/bin/perl
while (<DATA>){
    print if(m#\s+([\w\-]+\s*)+</td>#g);
}
__DATA__
This is a test
This is a (test)
This is a (test123)
this is a (123)
This-is a test
This-is-a test </td>
Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 19:18 UTC
What's with these anonymous monks asking questions and not following up on people's responses?
Yah! P'raps we ought to petition the Pope & Saints to change "Anonymous Monk" to "Anonymous Penitent" 8^) --roboticus
Re: Help constructing a regex that also matches hyphens and parentheses
by ww (Archbishop) on Apr 05, 2006 at 17:32 UTC
I think you may be looking for a "zero-width positive lookahead" such as ([^<\/td>]) or m#...([^</td>])#; depending on your selection of regex delimiters. One of the above will match anything until a </td>.
UPDATE: Thanks, ikegami (see below). The character class is just flat wrong! But (well, actually, " And") so is the rest of my brain_spasm above this line. ... ....argh!
You appear to have solved the issue of where to start such a match in your addendum/update, below
Very wrong. The regexp you posted is neither zero-width, nor positive, nor useful. [^<\/td>]* means "Any number of characters other than <, /, t and d." It wouldn't match asdf, for example, since it contains d.
I think you meant the zero-width negative lookahead and its common usage: (?:(?!<\/td>).)* That means "Any number of characters that don't contain the sequence </td>."
((?:(?!<\/td>).)*)<\/td>
is very similar to
(.*?)<\/td>
but it's possible for the latter to capture much more than anticipated in some circumstances.
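One circumstance where the two differ, sketched with a hypothetical row: when the pattern continues past the closing tag, backtracking can push .*? across a </td>, whereas the lookahead form cannot cross one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical row; the trailing \s*</tr> in both patterns is what
# forces the regex engine to backtrack.
my $row = '<td>one</td><td>two</td></tr>';

# Non-greedy: the first </td> is not followed by </tr>, so .*? is
# extended past it until the whole pattern can match.
my ($lazy) = $row =~ m{<td>(.*?)</td>\s*</tr>};

# Negative lookahead: each character is matched only if </td> does not
# start there, so the capture can never contain a </td>.
my ($tempered) = $row =~ m{<td>((?:(?!</td>).)*)</td>\s*</tr>};

print "lazy:     $lazy\n";       # one</td><td>two
print "tempered: $tempered\n";   # two
```

When nothing follows the closing tag in the pattern, both forms capture the same text.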