PerlMonks  

What is missing from the beginning of this string?

by japhy (Canon)
on Oct 07, 2010 at 22:18 UTC ( [id://864102]=perlquestion )

japhy has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to solve the following problem with a regex, and so far I have not been successful. I've mostly resigned myself to solving it without a regex, but I'd like to know if anyone here can come up with a clever solution.

I have a potentially malformed URL. It MAY be missing some of the leading characters of the "http://". That is, it might be tp://www.foo.com/ or ://www.foo.com; then again, it might be fine.

I am trying to determine WHAT I need to supply to the beginning of it. I know I could just do a check for ^http://, ^ttp://, ^tp://, and so on, but that seems so barbaric. Any ideas?


Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)

Replies are listed 'Best First'.
Re: What is missing from the beginning of this string? (direct)
by tye (Sage) on Oct 08, 2010 at 05:12 UTC
    #!/usr/bin/perl -p
    s-^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+\.)-http://-;
    __END__
    http://www.perlmonks.org/
    ttp://www.perlmonks.org/
    tp://www.perlmonks.org/
    p://www.perlmonks.org/
    ://www.perlmonks.org/
    //www.perlmonks.org/
    /www.perlmonks.org/
    www.perlmonks.org/

    produces

    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/
    http://www.perlmonks.org/

    (update:) Replace \. with [./] if you want to support intranet URLs like http://cvs/. Supporting alternate-port intranet URLs like http://wiki:8080/ with just [./:] would cause ftp://... to become http://ftp://... but you could consider (?=\w+([./]|:\d)).

    - tye        
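A minimal sketch of the variant suggested in the update above, using the `(?=\w+([./]|:\d))` lookahead. The sub name `fix_http` and the sample hosts are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# fix_http is a hypothetical helper name; the pattern is the one from the
# post above with the lookahead swapped for the "[./] or :digit" variant.
sub fix_http {
    my ($url) = @_;
    # Optionally consume a damaged tail of "http://", but only when what
    # follows looks like a host: word chars, then ".", "/", or ":<digit>".
    $url =~ s{^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+([./]|:\d))}{http://};
    return $url;
}

# Sample inputs are assumptions chosen to exercise each branch.
print fix_http($_), "\n"
    for 'tp://www.perlmonks.org/', 'cvs/', 'wiki:8080/', 'ftp://ftp.example.org/';
```

Because the pattern is anchored at `^` and the lookahead demands a digit after a colon, a string such as `ftp://ftp.example.org/` should not match at all and is returned unchanged.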

Re: What is missing from the beginning of this string?
by Marshall (Canon) on Oct 07, 2010 at 22:32 UTC
    Why don't you just get rid of the stuff in front of the www.foo.com part? I.e., assume it's "bad" and put "http://" in front of it? Or, for that matter, just leave the http:// off once you've stripped the prefix.
    #!/usr/bin/perl -w
    use strict;

    my @urls = ('tp://www.foo.com/', '://www.foo.com',
                'http//:www.foo.com', 'www.foo.com');
    foreach (@urls) {
        s/^.*?www/www/;
        print "http://$_\n";
    }
    __END__
    prints:
    http://www.foo.com/
    http://www.foo.com
    http://www.foo.com
    http://www.foo.com
    Update: well, this could be more complex as a valid URL does not have to start with www, it could be xyz.tv, then I guess you would want: http://xyz.tv? It helps if you present a representative set of test cases.

    It also helps if you can say something about the context of the application. Here I suppose you are trying to "guess" the user's intention of a manually entered URL? And then auto-magically "fix" it? Sometimes it is better to just try to use what the user entered and if it doesn't work, present an error message about what is acceptable for a URL.

    Just another regex example... I'm sure that other monks can provide even better regexes, but specifying the problem as clearly as you can is important.

    my @urls = ('tp://www.foo.com/', '://www.foo.com',
                'http//:www.foo.com', 'www.foo.com',
                'xxx.tv', 'http//:xxx.tv', 'tp:xx.tv');
    foreach (@urls) {
        s/^(.*?)(\w+\.)/$2/;
        print "http://$_\n";
    }
    __END__
    prints:
    http://www.foo.com/
    http://www.foo.com
    http://www.foo.com
    http://www.foo.com
    http://xxx.tv
    http://xxx.tv
    http://xx.tv
      Your regex solution s/^(.*?)(\w+\.)/$2/ should work perfectly for this.

      The mad scientist in me, though, is still wondering if there's a way to do this sort of thing abstractly: to provide a prefix for a string where the prefix may be only partially present. I'll think about it later. It's Friday.


      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)
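One possible way to approach the abstract version of the problem (a prefix that may be only partially present) is to try each tail of the prefix, longest first. The sketch below is an illustration, not anything posted in the thread; the names `missing_from_front` and `partial_prefix_regex` are hypothetical.

```perl
use strict;
use warnings;

# Return the leading part of $prefix that $str lacks.
# Tails of $prefix are tried longest first, so "tp://x" reports "ht".
sub missing_from_front {
    my ($prefix, $str) = @_;
    for my $skip (0 .. length $prefix) {
        my $tail = substr $prefix, $skip;
        # The empty tail (last iteration) always matches, so the loop
        # always returns; in that case the whole prefix is missing.
        return substr($prefix, 0, $skip) if index($str, $tail) == 0;
    }
}

# Build an anchored regex that matches any non-empty tail of $prefix,
# or nothing, at the start of a string.
sub partial_prefix_regex {
    my ($prefix) = @_;
    my $alt = join '|',
        map { quotemeta substr $prefix, $_ } 0 .. length($prefix) - 1;
    return qr/^(?:$alt)?/;
}

my $re = partial_prefix_regex('http://');
for my $url ('tp://www.foo.com/', '://www.foo.com', 'www.foo.com') {
    (my $fixed = $url) =~ s/$re/http:\/\//;
    printf "%-20s missing '%s' -> %s\n",
        $url, missing_from_front('http://', $url), $fixed;
}
```

This assumes that any string starting with a tail of the prefix really is a truncated prefix. For "http://" that is a fairly safe bet because every tail ends in a slash, but for a prefix like "www" a host such as "w3.org" would be misread as truncated.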
Re: What is missing from the beginning of this string?
by jakeease (Friar) on Oct 08, 2010 at 00:55 UTC

    I posted this in the wrong place first, so it may show twice

    try something like this:

    sub fix_URL {
        use URI;
        my $in  = shift;
        my $url = URI->new($in);
        $url->scheme('http');
        print "input is: $in\n";
        print "fixed url is: $url\n";
    }

    perl> fix_URL 'tp://www.cnn.com'
    input is: tp://www.cnn.com
    fixed url is: http://www.cnn.com
    perl>
      Excellent and elegant solution. The problem spec is still a bit hazy, though. There's a whole host of URLs that are not plain http://; do these also need pseudo-correction? E.g. https:, mailto:, javascript:, etc., even including browser-specific ones such as those used in Mozilla-based browsers. If there's a need to fix all these other kinds of URLs automatically, it would be pretty much impossible.
      the hardest line to type correctly is: stty erase ^H
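If other schemes can appear in the input, one cautious option is to leave anything with a recognized scheme alone and only repair the rest. The following is a sketch under that assumption; the whitelist and the sub name `fix_unless_known_scheme` are hypothetical, and the repair step reuses the `s/^(.*?)(\w+\.)/$2/` substitution posted earlier in this thread.

```perl
use strict;
use warnings;

# Hypothetical whitelist; extend it to whatever schemes your input may contain.
my @KNOWN_SCHEMES = qw(http https ftp mailto javascript);
my $known = join '|', map { quotemeta($_) } @KNOWN_SCHEMES;

sub fix_unless_known_scheme {
    my ($url) = @_;
    # An intact, recognized scheme means there is nothing to repair.
    return $url if $url =~ /^(?:$known):/i;
    # Otherwise drop everything before the first "word." and start over.
    $url =~ s{^(.*?)(\w+\.)}{$2};
    return "http://$url";
}

print fix_unless_known_scheme($_), "\n"
    for 'tp://www.foo.com/', 'https://www.foo.com/', 'mailto:japhy@example.com';
```

Without the scheme check, the bare substitution would turn `mailto:japhy@example.com` into `http://example.com`, which is the kind of over-correction the reply above warns about. A damaged form of another scheme (say `ttps://`) is still rewritten to http:// by this sketch.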
Re: What is missing from the beginning of this string?
by dasgar (Priest) on Oct 07, 2010 at 23:00 UTC

       I have a potentially malformed URL. It MAY be missing ... then again, it might be fine.

    Honestly, I haven't the faintest clue what you're trying to do or why you think there is a problem. Here's what I'm able to gather from your post:

    1. You have some source providing your script/program with URLs.
    2. Your script/program is acting on that information.
    3. Since your code is not always doing what is expected, you believe that there's a chance you're getting invalid URLs.

    Sounds to me like it's time for debugging, which for me means to start adding print statements to figure out what's happening where. For example, print out the URLs, then look to see what kind of issues there are, and then develop a plan to deal with them. Doing this might help you figure out if you're getting invalid URLs and if so, how are they invalid, which in turn helps with figuring out the regex.

    The only other idea right now is to find a module that does the URL validation for you.

      "I haven't the faintest clue what you're trying to do"

      I am trying to correct a potentially malformed "http://" at the beginning of a URL.

      I am not in control of the URLs I am receiving. It is not for me to debug, it is simply for me to correct.


      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)

      Haven't you ever copy-pasted URLs with the mouse from a window to the browser address bar and found that you failed to select the h at the beginning?

Re: What is missing from the beginning of this string?
by ssandv (Hermit) on Oct 08, 2010 at 19:43 UTC

    Use index to find "//", or failing that, "/". (It would be easier if all URLs started with www, but good luck with that.) Depending on the results of the previous tests, prepend the appropriate substr from "http://".
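The index/substr idea above might be sketched as follows. The sub name `missing_prefix` is hypothetical, and the sanity check (the start of the URL must really be a tail of "http://") is an added assumption to avoid being fooled by a "//" that appears later in the path.

```perl
use strict;
use warnings;

my $PREFIX = 'http://';

# Return the part of "http://" that is missing from the front of $url.
sub missing_prefix {
    my ($url) = @_;
    my $pos = index $url, '//';
    my $have;    # how many trailing characters of $PREFIX seem to be present
    if ($pos >= 0) {
        $have = $pos + 2;          # everything up to and including "//"
    }
    elsif (index($url, '/') == 0) {
        $have = 1;                 # only the final "/" survived
    }
    else {
        $have = 0;                 # nothing of the prefix is there
    }
    # Trust the guess only if the URL really starts with that tail.
    if ($have > length $PREFIX
        or substr($url, 0, $have) ne substr($PREFIX, length($PREFIX) - $have))
    {
        $have = 0;
    }
    return substr $PREFIX, 0, length($PREFIX) - $have;
}

for my $url ('tp://www.foo.com/', '://www.foo.com', '/www.foo.com/', 'www.foo.com') {
    print missing_prefix($url), $url, "\n";
}
```

Note that this sketch knows nothing about other schemes: a string like `ftp://host/` fails the tail check and gets the full "http://" prepended.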
