http://www.perlmonks.org?node_id=872404

lepetitalbert has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

#!/usr/bin/perl
use strict;
use warnings;

my $string = qq|<a href="/cgi-programming-with-perl.zip">cgi-programming-with-perl.zip</a>|;
my $item = 'perl';
$string =~ s/(\s?[-\w\.]*(?<!\/)$item[-\w\.]*\s?)/<b>$1<\/b>/gi;
print $string . "\n";

output:

<a href="/<b>cgi-programming-with-perl.zip</b>"><b>cgi-programming-with-perl.zip</b></a>

I'd like to match

cgi-programming-with-perl.zip

but not

/cgi-programming-with-perl.zip

I think I need a lookbehind, but after a long afternoon I have found no solution.

So any tip, hint, solution would be welcome.

Thanks !

Have a nice day

"There is only one good, namely knowledge, and only one evil, namely ignorance." Socrates

Re: regex match word, don't match word preceded by slash
by kcott (Archbishop) on Nov 19, 2010 at 02:26 UTC

    I've expanded the solution so it's a bit easier to read:

    $string =~ s{
        (?<![-\w./])
        (
            \s?
            [-\w.]*
            $item
            [-\w.]*
            \s?
        )
    }{<b>$1<\/b>}gimsx;

    Test output:

    $ regex_slash_prob.pl
    <a href="/cgi-programming-with-perl.zip"><b>cgi-programming-with-perl.zip</b></a>

    Update:

    While the solution above answers your question, consider the following.

    If the HTML actually looks more like this:

    my $string = qq| <a href="/cgi-programming-with-perl.zip"> cgi-programming-with-perl.zip </a> |;

    Your output will look like:

    <a href="/cgi-programming-with-perl.zip"> <b> cgi-programming-with-perl.zip </b> </a>

    If you'd prefer it to look like:

    <a href="/cgi-programming-with-perl.zip"> <b>cgi-programming-with-perl.zip</b> </a>

    Remove both \s? lines, leaving:

    $string =~ s{
        (?<![-\w./])
        (
            [-\w.]*
            $item
            [-\w.]*
        )
    }{<b>$1<\/b>}gimsx;

    You had them in your original so I left them in thinking they perhaps served some other purpose in the real data you're working on (as the string you posted contained no whitespace at all).
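
    For anyone who wants to try this out, here is a minimal self-contained sketch of the version without \s?. The sub name highlight and the shortened x-perl.zip test string are my own inventions for illustration, and \Q...\E is added on the assumption that the search term should be matched literally; none of that is in the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): wrap every "word"
# containing the search term in <b>...</b>, unless the word follows a slash.
sub highlight {
    my ($string, $item) = @_;
    $string =~ s{
        (?<![-\w./])      # not preceded by a word character, '-', '.' or '/'
        (
            [-\w.]*       # leading part of the word
            \Q$item\E     # the search term, taken literally (assumption)
            [-\w.]*       # trailing part of the word
        )
    }{<b>$1</b>}gix;
    return $string;
}

# The string from the question: no whitespace around the link text.
my $compact = qq|<a href="/cgi-programming-with-perl.zip">cgi-programming-with-perl.zip</a>|;

# A shortened, made-up string with whitespace around the link text.
my $spaced = qq|<a href="/x-perl.zip"> x-perl.zip </a>|;

print highlight($compact, 'perl'), "\n";
print highlight($spaced,  'perl'), "\n";
```

    In both cases only the link text should end up inside the <b> tags, and the whitespace in the second string should stay outside them.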

    -- Ken

      Hi kcott,

      works like a charm !
      thank you very much :)

      Have a nice day !

Re: regex match word, don't match word preceded by slash
by Anonymous Monk on Nov 19, 2010 at 02:50 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = '<html><body>
    <a href="/cgi-programming-with-perl.zip">cgi-programming-with-perl.zip</a>
    <a href="cgi-programming-with-perl.zip">cgi-programming-with-perl.zip</a>
    </body></html>';

    {
        my $tree = HTML::TreeBuilder->new();
        $tree->ignore_ignorable_whitespace(0);
        $tree->no_space_compacting(1);
        $tree->parse( $html )->eof;
        $tree->look_down(
            qw' _tag a href ', qr!^/! ,
            sub {
                $_[0]->push_content(
                    HTML::Element->new('b')->push_content( $_[0]->detach_content ),
                );
                return;
            },
        );
        print $tree->as_HTML('<>&',' ',{}), "\n";
    }
    __END__
    <html>
    <head>
    </head>
    <body>
    <a href="/cgi-programming-with-perl.zip"><b>cgi-programming-with-perl.zip</b></a>
    <a href="cgi-programming-with-perl.zip">cgi-programming-with-perl.zip</a>
    </body>
    </html>

      Hi again kcott,

      I tried so many combinations that I can't remember why those spaces were there! Thank you again.

      I'm trying to understand

      (?<![-\w./])

      So the / is the one preceding the word,
      but I don't get the -\w.
      If someone has 2 minutes left :)

      thank you too Anonymous Monk, I took a look at HTML::TreeBuilder but I wouldn't have found your solution in one afternoon ! ( is that english ? )

      Have a nice day !


        In the href part, if you only have the slash in the lookbehind, the regex engine skips the c (it is preceded by a slash), finds that gi-programming-with-perl.zip is a match, and you end up with: /c<b>gi-programming-with-perl.zip</b>. By saying "not a slash or any other character I'm trying to match", /cgi-programming-with-perl.zip does not match at all. The content, however, has a greater-than sign in that position (which is neither a slash nor a character you're looking for, i.e. [-\w.]), so it does match. In my second example, the whitespace doesn't match [-\w./], so it works there also.
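
        If it helps to see the difference side by side, here is a small sketch that runs both lookbehinds against just the href value. The variable names $slash_only and $full_class are made up for this example, and \Q...\E is added on the assumption that the search term is meant literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $item = 'perl';
my $href = '/cgi-programming-with-perl.zip';

# Lookbehind that only rejects a slash: the engine cannot start at "c"
# (it is preceded by "/"), so it starts one character later, at "g".
(my $slash_only = $href) =~ s{ (?<!/) ( [-\w.]* \Q$item\E [-\w.]* ) }{<b>$1</b>}gix;

# Lookbehind that rejects a slash, a word character, "-" or ".":
# no position inside the file name qualifies, so nothing is replaced.
(my $full_class = $href) =~ s{ (?<![-\w./]) ( [-\w.]* \Q$item\E [-\w.]* ) }{<b>$1</b>}gix;

print "$slash_only\n";   # /c<b>gi-programming-with-perl.zip</b>
print "$full_class\n";   # /cgi-programming-with-perl.zip
```

        The first line printed shows the partial match described above; the second shows the href left untouched.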

        -- Ken