http://www.perlmonks.org?node_id=1038208

virudinesh has asked for the wisdom of the Perl Monks concerning the following question:

I have tried different kinds of matching, but none of them match.

$li = "aaaa g f mj http://www.jkghfdjbh.org jgbmhj gkhkh ";
if ($li =~ m/http://www.[a-z]|[A-Z].[a-z]/i) {
    print "$& \n";
}

Running it gives an error near the //.

I want to fetch only http://www.jkghfdjbh.org from the $li line.

Replies are listed 'Best First'.
Re: how to do match www.jkghfdjbh.org from $li?
by davido (Cardinal) on Jun 11, 2013 at 07:33 UTC

    Have you read perlretut yet? You've got to get through some of that stuff if you want to move forward.

    . is a special metacharacter inside Perl regular expressions. It matches any single character except a newline.

    Character classes match only a single character unless you add a quantifier.

    Alternation is constrained to the entire regular expression, or the first enclosing ( ... ) or (?: ... ) construct.

    Case insensitivity applies to character classes too.

    Combine those issues, and what you have is:

    m/
        www     # match literal 'www'
        .       # match any single character except \n.
        [a-z]   # match any single character between a and z.
        |       # OR
        [A-Z]   # match any single character between A and Z.
        .       # match any single character except \n.
        [a-z]   # match any single character between a and z.
    /ix         # /i makes everything case-insensitive, so there's
                # no difference between [A-Z] and [a-z].

    If you want to accomplish this without learning regular expressions, install the URI::Find distribution, and use its URI::Find::Schemeless module.


    Dave

      Further to davido's point about alternation:
      Because it's a point that often escapes people (it's escaped me often enough), I want to emphasize that the effective low precedence of the | (alternation) regex operator means that the [A-Z].[a-z] portion of the OPed regex matches independently of the rest of that particular regex. E.g., after fixing the // delimiter confusion, but leaving the . (dot) matching as it was:

      >perl -wMstrict -le
      "my $li = 'foo';
       ;;
       print qq{matched '$&' in '$li'} if $li =~ m{http://www.[a-z]|[A-Z].[a-z]}i;
      "
      matched 'foo' in 'foo'

      NB: Don't get into the habit of using the $& $` $' special matching variables in your regexes. See the paragraph in perlre that begins "WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program ..." for a discussion of the cost of using them, and also the following paragraph for a workaround available in Perl version 5.10+. Also see the discussion of these variables in perlretut for workarounds using substr that can be used pre-5.10.
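The two points above (alternation scope and avoiding $&) can be sketched together. This is only an illustration: the variable names are made up for the example, and the grouped pattern is one possible way to constrain the alternation, not necessarily what the original poster intended.

```perl
use strict;
use warnings;

my $li = 'foo';

# Ungrouped: the pattern reads as "http://www.[a-z]" OR "[A-Z].[a-z]",
# so the right-hand alternative can match without any "http://" at all.
my $ungrouped = $li =~ m{http://www.[a-z]|[A-Z].[a-z]}i ? 1 : 0;

# Grouped: (?: ... ) limits the alternation to the two character classes,
# so the "http://www." prefix is required in either case.
my $grouped = $li =~ m{http://www.(?:[a-z]|[A-Z]).[a-z]}i ? 1 : 0;

# A capture group in list context returns the matched text without $&.
my $line = 'aaaa g f mj http://www.jkghfdjbh.org jgbmhj gkhkh ';
my ($url) = $line =~ m{(http://www\.[a-z]+\.[a-z]+)}i;

# Perl 5.10+: with the /p modifier, ${^MATCH} holds what $& would hold.
my $matched;
$matched = ${^MATCH} if $line =~ m{http://www\.[a-z]+\.[a-z]+}ip;

print "ungrouped=$ungrouped grouped=$grouped\n";
print "$url\n";
print "$matched\n";
```

With 'foo' as input, only the ungrouped pattern matches, which is the surprise described above.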

        Lol. I totally missed the /http//:...../ delimiter bug. It starts looking like an intentional attempt to get it wrong just so we will jump around like our hair is on fire trying to fix it. ...because there really is nothing that is right within it. An honest attempt to get it right would contain at least one portion of the RE that isn't a bug. ;)


        Dave

Re: how to do match www.jkghfdjbh.org from $li?
by cdarke (Prior) on Jun 11, 2013 at 10:18 UTC
    In addition to the replies above, your actual error is because of the embedded / in:
    m/http://www.[a-z]|[A-Z].[a-z]/i
    There are several solutions; the simplest is to use some other character as your delimiter, for example:
    m!http://www\.[a-z]+\.[a-z]+!i
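As a rough sketch of the delimiter options, here is the original $li run through three equivalent forms. The capture parentheses and the variable names are additions for illustration; they are not part of the pattern given above.

```perl
use strict;
use warnings;

my $li = "aaaa g f mj http://www.jkghfdjbh.org jgbmhj gkhkh ";

# Option 1: use ! as the delimiter, so / needs no escaping.
my ($with_bang) = $li =~ m!(http://www\.[a-z]+\.[a-z]+)!i;

# Option 2: paired delimiters such as { } work the same way.
my ($with_brace) = $li =~ m{(http://www\.[a-z]+\.[a-z]+)}i;

# Option 3: keep / as the delimiter and escape each embedded slash.
my ($escaped) = $li =~ /(http:\/\/www\.[a-z]+\.[a-z]+)/i;

print "$_\n" for $with_bang, $with_brace, $escaped;
```

All three should print the same URL; the first two are usually easier to read than the escaped-slash form.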
Re: how to do match www.jkghfdjbh.org from $li?
by hdb (Monsignor) on Jun 11, 2013 at 08:14 UTC

    • In order to match a literal dot, use \. or [.]
    • As you have the /i applied to your regex, [a-z] and [A-Z] are really the same, so you only need one of them.
    • In order to match one or more characters, use [a-z]+
    • In order to match exactly three characters like org, use braces like [a-z]{3}
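The four points in the list above can be checked one at a time. The test strings and variable names below are invented for the example.

```perl
use strict;
use warnings;

# Literal dot: an unescaped . also matches 'X'; \. and [.] do not.
my $any_dot     = 'wwwXexample' =~ /www.example/   ? 1 : 0;
my $escaped_dot = 'wwwXexample' =~ /www\.example/  ? 1 : 0;
my $class_dot   = 'www.example' =~ /www[.]example/ ? 1 : 0;

# Under /i, [a-z] already matches upper-case letters.
my $case_insensitive = 'ORG' =~ /^[a-z]+$/i ? 1 : 0;

# Without a quantifier a class matches one character; + matches one or more.
my ($one)  = 'jkghfdjbh' =~ /^([a-z])/;
my ($many) = 'jkghfdjbh' =~ /^([a-z]+)/;

# {3} matches exactly three characters, such as 'org'.
my ($tld) = 'http://www.jkghfdjbh.org' =~ /\.([a-z]{3})$/;

print "any_dot=$any_dot escaped_dot=$escaped_dot class_dot=$class_dot\n";
print "case_insensitive=$case_insensitive\n";
print "one=$one many=$many tld=$tld\n";
```

Note that [a-z]{3} would reject longer endings such as "info", so [a-z]+ may be the safer choice when the ending is not known in advance.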