PerlMonks  

Help constructing a regex that also matches hyphens and parentheses

by Anonymous Monk
on Apr 05, 2006 at 16:59 UTC ( [id://541438]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

push(@found_items, $1) while $data =~ m#\s+((\w+\s*)+)</td>#g; matches NEARLY everything I want. However, it fails to match if a word has a hyphen, like: Look-a-hyphen. It also fails if there is a set of parentheses with numbers and/or letters in it. I need it to match all of these. A few examples of what I need to match:

This is a test
This is a (test)
This is a (test123)
this is a (123)
This-is a test
Of course my content is dynamic so it won't literally be this, but I need it to match the entire line in $1 before it finds the </td>. In reality, the optional hyphens at the end are probably always going to be on the end. But if it's any easier, I guess I could match it anywhere in the string. If it's not easier, I wouldn't worry about it.

2006-04-06 Retitled by planetscape, as per Monastery guidelines
Original title: 'help with regex'

Replies are listed 'Best First'.
Re: Help constructing a regex that also matches hyphens and parentheses
by wfsp (Abbot) on Apr 05, 2006 at 18:35 UTC
    I often find that what starts out as an apparently trivial task of parsing apparently trivial HTML ends up being very difficult.

    Even in this case I would reach for an HTML parser.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TokeParser::Simple;

    my @array = <DATA>;
    for my $line (@array){
        my $tp = HTML::TokeParser::Simple->new(\$line);
        my $cell_data;
        while (my $t = $tp->get_token){
            $cell_data++, next
                if  $t->is_start_tag('td')
                and $t->get_attr('height') == 40
                and $t->get_attr('width')  == 40
                and $t->get_attr('border') == 0;
            print $t->as_is if $cell_data and $t->is_text;
        }
    }
    __DATA__
    <td width='10' height='40' border='0'>cell data 1</td>
    <td width='20' height='40' border='0'>cell data 2</td>
    <td width='30' height='40' border='0'>cell data 3</td>
    <td width='40' height='40' border='0'>cell data 4</td>
    <td width='40' height='40' border='0'>cell data 5</td>
    <td width='40' height='40' border='0'>cell data 6</td>
    <td width='40' height='40' border='0'>cell data 7</td>
    <td width='40' height='40' border='0'>cell data 8</td>
    Output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    cell data 4
    cell data 5
    cell data 6
    cell data 7
    cell data 8
    > Terminated with exit code 0.

    I believe this gives you greater flexibility/adaptability and, for me, is quicker to write. :-)

    Hope that helps

Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 17:09 UTC
    None of these match, because none of the test strings has a </td> at the end. Otherwise they all test true.
    #!/usr/bin/perl
    while (<DATA>){
        print if(m#\s+((\w+\s*)+)</td>#g);
    }
    __DATA__
    This is a test
    This is a (test)
    This is a (test123)
    this is a (123)
    This-is a test
    Just Another Perl Hacker </td>
Re: Help constructing a regex that also matches hyphens and parentheses
by McDarren (Abbot) on Apr 05, 2006 at 17:28 UTC
    Would I be right in saying that you wish to match everything between a <td> and a matching </td>?

    If that's the case:

    m/<td>(.*?)<\/td>/;
    Should do the trick.

    (Note that the ? makes the .* non-greedy, which avoids slurping up too much if you have multiple <td></td> tags on the one line.)

    Cheers,
    Darren :)
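
To see the difference the ? makes, here is a minimal sketch. The sample row is made up for illustration (it is not the poster's real data) and assumes bare <td> tags with no attributes, as in the regex above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two cells on a single line.
my $row = "<td>This is a (test)</td><td>This-is a test</td>";

# Greedy: .* runs on to the LAST </td> on the line.
my ($greedy) = $row =~ m/<td>(.*)<\/td>/;

# Non-greedy: .*? stops at the FIRST </td> after each <td>.
my @lazy = $row =~ m/<td>(.*?)<\/td>/g;

print "greedy: $greedy\n";       # This is a (test)</td><td>This-is a test
print "lazy:   $_\n" for @lazy;  # This is a (test)  /  This-is a test
```

If the real tags carry attributes (such as width='40'), the literal <td> would have to be widened, for example to <td[^>]*>.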

Re: Help constructing a regex that also matches hyphens and parentheses
by johngg (Canon) on Apr 05, 2006 at 18:44 UTC
    I think it would help matters if you were able to post a sample of the data you are trying to match against, not just items in the data you hope to pull out.

    Cheers,

    JohnGG

Re: Help constructing a regex that also matches hyphens and parentheses
by swampyankee (Parson) on Apr 05, 2006 at 17:29 UTC

    \w matches neither hyphens nor parentheses; to quote the documentation (perlre): "\w Match a "word" character (alphanumeric plus "_")." You'll have to add them to the regex explicitly. Your regex will also fail for </TD>, which may not be what you want.

    However, given your stated goal ("…I need it to match the entire line in $1 before it finds the </td>"), it would seem to me that your regex could just as easily read something like this:

    push(@found_items, $1) if $data =~ m!(.*)</td>!i;
    which, at first glance, looks as if it may work: everything up to </td> gets slurped into $1. Note that this WAS NOT TESTED.

    emc

    "Being forced to write comments actually improves code, because it is easier to fix a crock than to explain it. "
    —G. Steele
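
Since the regex above was posted untested, here is a quick check against one made-up line (the cell attributes are borrowed from the fuller regex posted elsewhere in this thread; the cell text is invented). It suggests that (.*) also slurps the opening tag into $1. The second pattern is only an assumption about what is probably wanted, namely the cell text without the tag; it is not taken from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line, with an upper-case closing tag to exercise /i.
my $data = "<td width='40' height='40' border='0'> This-is a (test)</TD>";

# The regex as posted: everything before </td>, case-insensitively.
my @found_items;
push(@found_items, $1) if $data =~ m!(.*)</td>!i;

# Assumed variant: start after the nearest '>' and allow no further
# '<' or '>' inside the capture, so the opening tag is left out.
my @trimmed;
push(@trimmed, $1) if $data =~ m!>\s*([^<>]*)</td>!i;

print "as posted: $found_items[0]\n";
print "trimmed:   $trimmed[0]\n";   # This-is a (test)
```

Note also that "if" finds at most one cell per string; the original "while ... /g" form would be needed to collect several.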
Re: Help constructing a regex that also matches hyphens and parentheses
by CountZero (Bishop) on Apr 05, 2006 at 21:37 UTC
    Verily, It is sayeth that the Only True Way (™) to deal with HTML-tags is to run a parser on it and everything else is an Abomination and will be cast into the Pit!

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Help constructing a regex that also matches hyphens and parentheses
by kutsu (Priest) on Apr 05, 2006 at 17:14 UTC

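    Check out the section on "character classes" in perlre (they're marked with square brackets, [ ]), as these will allow you to match hyphens and numbers as well. Also you could do that push(...) while ... as @found_items = $data =~ m#\s+((?:\w+\s*)+)</td>#g; (the inner group is made non-capturing here, because in list context a /g match returns every capture group, so a capturing inner group would add its last word to the list as well).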
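
As a rough sketch of the character-class idea applied to the original regex: the class [\w()-] accepts word characters, both parentheses and a hyphen (a hyphen placed last in a class is literal). The test data below is built from the question's five examples wrapped in a made-up <td> cell, so the exact layout is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The five examples from the question, each wrapped in a hypothetical cell.
my @examples = ('This is a test', 'This is a (test)', 'This is a (test123)',
                'this is a (123)', 'This-is a test');
my $data = join '',
    map { "<td width='40' height='40' border='0'> $_</td>\n" } @examples;

# Original regex: \w alone.
my @before;
push(@before, $1) while $data =~ m#\s+((\w+\s*)+)</td>#g;

# Same shape, with \w replaced by a character class.
# The inner group is non-capturing because only $1 is needed.
my @after;
push(@after, $1) while $data =~ m#\s+((?:[\w()-]+\s*)+)</td>#g;

print "before: [$_]\n" for @before;   # [This is a test] and [a test]
print "after:  [$_]\n" for @after;    # all five examples
```

The "before" list shows a side effect worth knowing: for "This-is a test" the original regex does not simply fail, it quietly captures the tail "a test".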

Re: Help constructing a regex that also matches hyphens and parentheses
by Herkum (Parson) on Apr 05, 2006 at 17:09 UTC

    Let me ask you this, is there a beginning tag that you can also match against?

    If we can be a little more specific about the beginning and end, can we match anything in between and return that; instead of expanding your search parameters to include these other special characters?

      This is the whole regex, I shortened it for simplicity in the original post.

      push(@found_items, $1) while $saved_lot =~ m#width='40' height='40' border='0'>\s+((\w+\s*)+)</td>#g;

Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 17:26 UTC
    This is a quick little hack on the matter. I'll look at it more later.
    #!/usr/bin/perl
    while (<DATA>){
        print if(m#\s+([\w\-]+\s*)+</td>#g);
    }
    __DATA__
    This is a test
    This is a (test)
    This is a (test123)
    this is a (123)
    This-is a test
    This-is-a test </td>
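
One thing to watch if this hack is used to fill $1 rather than just to print the line: a capture group that sits inside a repetition keeps only its last iteration. A small sketch with a made-up input line, comparing it with an outer capture around a non-capturing repeat:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input line.
my $line = " This-is-a test</td>";

# Group inside the repeat: $1 holds only the final iteration.
my ($last_only) = $line =~ m#\s+([\w\-]+\s*)+</td>#;

# Outer capture around a non-capturing repeat: $1 holds the whole run.
my ($whole) = $line =~ m#\s+((?:[\w\-]+\s*)+)</td>#;

print "inner group: [$last_only]\n";   # [test]
print "outer group: [$whole]\n";       # [This-is-a test]
```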
Re: Help constructing a regex that also matches hyphens and parentheses
by zer (Deacon) on Apr 05, 2006 at 19:18 UTC
    What's with these anonymous monks asking questions and not following up with people's responses?
      Yah! P'raps we ought to petition the Pope & Saints to change "Anonymous Monk" to "Anonymous Penitent"

      8^)

      --roboticus

Re: Help constructing a regex that also matches hyphens and parentheses
by ww (Archbishop) on Apr 05, 2006 at 17:32 UTC
    I think you may be looking for a "zero-width positive lookahead" such as ([^<\/td>]) or m#...([^</td>])#; depending on your selection of regex delimiters. One of the above will match anything until a </td>.

    UPDATE: Thanks, ikegami (see below). The character class is just flat wrong! But (well, actually, " And") so is the rest of my brain_spasm above this line. ...   ....argh!

    You appear to have solved the issue of where to start such a match in your addendum/update, below

      Very wrong. The regexp you posted is neither zero-width, nor positive, nor useful. [^<\/td>]* means "Any number of characters other than <, /, t and d." It wouldn't match asdf, for example, since it contains d.

      I think you meant the zero-width negative lookahead and its common usage: (?:(?!<\/td>).)* That means "Any number of characters that don't contain the sequence </td>."

      ((?:(?!<\/td>).)*)<\/td>
      is very similar to
      (.*?)<\/td>
      but it's possible for the latter to capture much more than anticipated in some circumstances.
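
One circumstance where the two differ, as a sketch: when more pattern follows the closing tag, the non-greedy version is free to walk past an earlier </td> to make the rest match, while the lookahead version cannot. The row below and the trailing </tr> are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical row: we want the cell that is immediately followed by </tr>.
my $row = "<td>a</td><td>b</td></tr>";

# Non-greedy: starts at the first <td> and extends until </td></tr> fits.
my ($lazy) = $row =~ m{<td>(.*?)</td></tr>};

# Negative lookahead: the capture may never contain </td>, so the match
# at the first <td> fails and the engine moves on to the second one.
my ($tempered) = $row =~ m{<td>((?:(?!</td>).)*)</td></tr>};

print "lazy:     [$lazy]\n";       # [a</td><td>b]
print "tempered: [$tempered]\n";   # [b]
```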
