What is missing from the beginning of this string?

japhy has asked for the wisdom of the Perl Monks concerning the following question:

Re: What is missing from the beginning of this string? (direct) by tye (Sage) on Oct 08, 2010 at 05:12 UTC
`#!/usr/bin/perl -p
s-^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+\.)-http://-;
__END__
http://www.perlmonks.org/
ttp://www.perlmonks.org/
tp://www.perlmonks.org/
p://www.perlmonks.org/
://www.perlmonks.org/
//www.perlmonks.org/
/www.perlmonks.org/
www.perlmonks.org/`

produces

`http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/`

(update:) Replace `\.` with `[./]` if you want to support intranet URLs like `http://cvs/`. Supporting alternate-port intranet URLs like `http://wiki:8080/` with just `[./:]` would cause `ftp://...` to become `http://ftp://...`, but you could consider `(?=\w+([./]|:\d))`.

- tye
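For anyone who wants to try the lookahead from the update, here is a minimal sketch that wraps tye's substitution in a sub. The name `add_http` and the sample inputs are made up for illustration; only the regex comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around tye's substitution, using the
# (?=\w+([./]|:\d)) lookahead suggested in the update.
sub add_http {
    my ($url) = @_;
    # The nested optional groups match any tail of "http://" (or nothing);
    # the lookahead requires a host-like word followed by ".", "/" or ":digit".
    $url =~ s{^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+([./]|:\d))}{http://};
    return $url;
}

# Illustrative inputs, not taken from the thread's test set.
print add_http($_), "\n" for qw(cvs/ tp://wiki:8080/ ftp://ftp.example.org/);
```

With this lookahead the first two inputs should gain a full `http://`, while the `ftp://` URL is left alone because `ftp:` is followed by a slash rather than a digit.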
Re: What is missing from the beginning of this string? by Marshall (Canon) on Oct 07, 2010 at 22:32 UTC
Why don't you just get rid of the stuff in front of the www.foo.com part? I.e., assume it's "bad" and put "http://" in front of it? Or, for that matter, just leave the http:// off once you've done step (1).

`#!/usr/bin/perl -w
use strict;

my @urls = ('tp://www.foo.com/', '://www.foo.com', 'http//:www.foo.com', 'www.foo.com');
foreach (@urls) {
    s/^.*?www/www/;
    print "http://$_\n";
}
__END__
prints:
http://www.foo.com/
http://www.foo.com
http://www.foo.com
http://www.foo.com`

Update: well, this could be more complex, as a valid URL does not have to start with www. It could be xyz.tv, and then I guess you would want http://xyz.tv? It helps if you present a representative set of test cases. It also helps if you can say something about the context of the application. Here I suppose you are trying to "guess" the user's intention for a manually entered URL and then auto-magically "fix" it? Sometimes it is better to just try to use what the user entered and, if it doesn't work, present an error message about what is acceptable for a URL.

Just another regex example... I'm sure that other monks can provide even better regexes, but specifying the problem as clearly as you can is important.

`my @urls = ('tp://www.foo.com/', '://www.foo.com', 'http//:www.foo.com', 'www.foo.com', 'xxx.tv', 'http//:xxx.tv', 'tp:xx.tv');
foreach (@urls) {
    s/^(.*?)(\w+\.)/$2/;
    print "http://$_\n";
}
__END__
prints:
http://www.foo.com/
http://www.foo.com
http://www.foo.com
http://www.foo.com
http://xxx.tv
http://xxx.tv
http://xx.tv`
Re^2: What is missing from the beginning of this string? by japhy (Canon) on Oct 08, 2010 at 13:13 UTC
Your regex solution `s/^(.*?)(\w+\.)/$2/` should work perfectly for this. The mad scientist in me, though, is still wondering if there's a way to do this sort of thing abstractly: to provide a prefix for a string where the prefix may be only partially present. I'll think about it later. It's Friday.

Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)
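One possible take on the abstract version: generate the nested optional groups from any prefix instead of writing them by hand. This is only a sketch, not something posted in the thread. The names `suffix_pattern` and `complete_prefix` are hypothetical, and the default guard `(?=\w+\.)` is borrowed from tye's reply as an assumption about what should follow the prefix.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern that matches any tail of $prefix,
# including the empty string and the whole prefix.
sub suffix_pattern {
    my ($prefix) = @_;
    my $re = '';
    # Wrap one more optional layer around the pattern per character,
    # so "ab" becomes (?:(?:a)?b)?
    $re = "(?:$re" . quotemeta($_) . ")?" for split //, $prefix;
    return $re;
}

# Hypothetical helper: replace a partial $prefix at the start of $string
# with the full one. $guard is a lookahead describing what must follow;
# without it, "ftp://..." would turn into "http://ftp://...".
sub complete_prefix {
    my ($string, $prefix, $guard) = @_;
    $guard = qr/(?=\w+\.)/ unless defined $guard;
    my $partial = suffix_pattern($prefix);
    (my $fixed = $string) =~ s/^$partial$guard/$prefix/;
    return $fixed;
}

print complete_prefix('tp://www.perlmonks.org/', 'http://'), "\n";
print complete_prefix('lto:someone', 'mailto:', qr/(?=\w)/), "\n";
```

The guard is the part that has to be chosen per prefix: the generated pattern happily matches the empty string, so something must say when prepending the whole prefix is appropriate.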
Re: What is missing from the beginning of this string? by jakeease (Friar) on Oct 08, 2010 at 00:55 UTC
I posted this in the wrong place first, so it may show twice. Try something like this:

`sub fix_URL {
    use URI;
    my $in  = shift;
    my $url = URI->new($in);
    $url->scheme('http');
    print "input is: $in\n";
    print "fixed url is: $url\n";
}`

`perl> fix_URL 'tp://www.cnn.com'
input is: tp://www.cnn.com
fixed url is: http://www.cnn.com
perl>`
Re^2: What is missing from the beginning of this string? by aquarium (Curate) on Oct 08, 2010 at 03:51 UTC
excellent and elegant solution. the problem spec is still a bit hazy though. there's a whole host of urls that are not plain http://, do these also need pseudo-correction? e.g. https:, mailto:, javascript:, etc. even including browser specific ones such as those used in mozilla based browsers. if there's a need to fix all these other kinds of urls automatically, it would be pretty much impossible.

the hardest line to type correctly is: stty erase ^H
Re: What is missing from the beginning of this string? by dasgar (Priest) on Oct 07, 2010 at 23:00 UTC
"I have a potentially malformed URL. It MAY be missing ... then again, it might be fine."

Honestly, I haven't the faintest clue what you're trying to do or why you think there is a problem. Here's what I'm able to gather from your post: You have some source providing your script/program with URLs. Your script/program is acting on that information. Since your code is not always doing what is expected, you believe that there's a chance you're getting invalid URLs.

Sounds to me like it's time for debugging, which for me means to start adding print statements to figure out what's happening where. For example, print out the URLs, then look to see what kind of issues there are, and then develop a plan to deal with them. Doing this might help you figure out if you're getting invalid URLs and, if so, how they are invalid, which in turn helps with figuring out the regex.

The only other idea right now is to find a module that does the URL validation for you.
Re^2: What is missing from the beginning of this string? by japhy (Canon) on Oct 08, 2010 at 00:47 UTC
"I haven't the faintest clue what you're trying to do"

I am trying to correct a potentially malformed "http://" at the beginning of a URL. I am not in control of the URLs I am receiving. It is not for me to debug, it is simply for me to correct.

Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)
Re^2: What is missing from the beginning of this string? by ambrus (Abbot) on Oct 09, 2010 at 09:22 UTC
Haven't you ever copy-pasted URLs with the mouse from a window to the browser title bar and found that you failed to select the `h` at the beginning?
Re: What is missing from the beginning of this string? by ssandv (Hermit) on Oct 08, 2010 at 19:43 UTC
Use `index` to find "//", or failing that, "/". (It would be easier if all URLs started with www, but good luck on that.) Depending on the results of the previous tests, prepend the appropriate substr from "http://".
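The post leaves the details open, so here is one possible reading of the `index`/`substr` idea as a rough sketch. The sub name `fix_with_index` is made up, and the rule that a "//" only counts as the scheme separator when it sits at the start or right after a colon is an assumption of this sketch, not something ssandv specified.

```perl
use strict;
use warnings;

# Hypothetical sketch: repair a truncated "http://" using index and substr only.
sub fix_with_index {
    my ($url) = @_;
    my $scheme = 'http:';
    my $pos = index($url, '//');

    # "//" at the very start: only the scheme is missing.
    return $scheme . $url if $pos == 0;

    # "//" right after a colon: everything before it is a (possibly cut) scheme.
    if ($pos > 0 && substr($url, $pos - 1, 1) eq ':') {
        my $head = substr($url, 0, $pos);
        # Accept $head only if it is a tail of "http:", e.g. "tp:" or ":".
        my $is_tail = $pos <= length($scheme)
            && $head eq substr($scheme, -$pos);
        # Other schemes such as "ftp:" are returned unchanged.
        return $is_tail ? $scheme . substr($url, $pos) : $url;
    }

    # No usable "//": a single leading "/" means one slash was lost too.
    return "$scheme/$url" if index($url, '/') == 0;
    return "$scheme//$url";
}

print fix_with_index($_), "\n"
    for qw(tp://www.perlmonks.org/ /www.perlmonks.org/ www.perlmonks.org/);
```

This only handles a prefix that was cut off from the left. Scrambled input such as `http//:www.foo.com` from Marshall's test set is not repaired by this approach.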

