Re: Regex to match file extension in URL -- Bundled Extensions

Hi Jazz,

This alternative uses File::Basename to extract the filename and the query string.

Hey! That's cheating! :-)
No, just kidding. Actually you are very right. Using File::Basename is much better than rolling your own regex: you are much less likely to find that the regex doesn't work on some strange OS, and the weirder cases are more likely to be properly handled. (For instance, a really robust regex would match BOTH / and \ as separators.) OTOH it _is_ a worthy educational process to learn how to do this. Tokenizing filenames with a regex is not a trivial exercise, and IMHO it therefore makes a good learning opportunity.
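
For anyone following along, here is a minimal sketch of the File::Basename route. It is not Jazz's code; the helper name url_file_parts and the scheme/host stripping are my own assumptions, and it only treats the last dotted part as the extension.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper (not from the thread): split a URL into
# (filename, extension, query string).
sub url_file_parts {
    my ($url) = @_;
    my ($path, $query) = split /\?/, $url, 2;      # query is everything after the first ?
    $path =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;     # assumption: drop scheme and host
    # fileparse returns (basename, directory, suffix); the suffix pattern
    # here matches only the final ".something".
    my ($name, $dir, $ext) = fileparse($path, qr/\.[^.]*/);
    return ($name, $ext, defined $query ? $query : '');
}

for my $url ('http://perlmonks.com/index.pl?node_id=68135', 'http://www.foobar.com/foo/') {
    my ($name, $ext, $query) = url_file_parts($url);
    print "$url => name='$name' ext='$ext' query='$query'\n";
}
```

With a directory-only URL such as http://www.foobar.com/foo/ both the name and the extension come back empty, which is the sort of edge case a hand-rolled version tends to get wrong.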

The non-trivial nature of tokenizing such a string is illustrated incidentally in the post by crazyinsomniac. Now this is a senior monk, with undoubtedly considerable experience, yet clearly he didn't examine too many cases with either his substr/index solution or his regex solution. When I run his solutions against my earlier posted test data I get some perverse results indeed. (The regex and substr versions don't even produce the same results.)

# selected results of crazyinsomniac's substr impl.
# double spacing converted to single by me.
http://perlmonks.com/index.pl?node_id=68135
        looks like the file name is: index.pl?node_id=68135
               and the extension is: pl?node_id=68135
   We even got a query string, whoa: node_id=68135
      so the true filename would be: index.pl?
and the true file extension would b: pl

http://www.foobar.com/foo/
        looks like the file name is: 
               and the extension is: com/foo/

http://www.foobar.com/foo?test
        looks like the file name is: foo?test
               and the extension is: com/foo?test
   We even got a query string, whoa: test
      so the true filename would be: foo?
and the true file extension would b: com/foo

http:///file.ext
        looks like the file name is: file.ext
               and the extension is: ext

# selected results of crazyinsomniac's regex implementation
# input string added by me
http://www.foobar.com
(, , )
http://www.foobar.com/foo/bar/foo.bar.html
(http, www.foobar.com/, foo/bar/foobar.html)
http:///file.ext
(, , )

Actually, for me there is a moral here: MOST times that I have seen this type of issue attacked with substr() and index(), the result is wrong! There is a notable pain-in-the-ass poster on CLPM (who shall remain nameless, scales and all) who insists on solving every problem she can with substr, index and rindex. Most of these 'solutions' crack under proper test data. On the regex level there is another moral: in my experience, obvious, intuitive regexes don't usually work the way one might wish. :-)
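
To make that concrete, here is a rough sketch of how the "last dot wins" approach goes wrong. This is NOT crazyinsomniac's code, just my guess at the kind of logic that yields output like the above; naive_ext and careful_ext are hypothetical names, and careful_ext is only one possible repair.

```perl
use strict;
use warnings;

# Hypothetical reconstruction, NOT the original poster's code: treat
# everything after the last dot in the whole URL as the "extension".
sub naive_ext {
    my ($url) = @_;
    my $dot = rindex($url, '.');
    return $dot < 0 ? '' : substr($url, $dot + 1);
}

# A more careful sketch: drop the query string, drop scheme and host,
# isolate the last path segment, and only then look for a dot.
sub careful_ext {
    my ($url) = @_;
    (my $path = $url) =~ s/\?.*//s;              # drop the query string
    $path =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;   # assumption: drop scheme and host
    my $file = substr($path, rindex($path, '/') + 1);
    my $dot  = rindex($file, '.');
    return $dot < 0 ? '' : substr($file, $dot + 1);
}

for my $url ('http://www.foobar.com/foo/',
             'http://perlmonks.com/index.pl?node_id=68135') {
    printf "%s\n  naive:   '%s'\n  careful: '%s'\n",
        $url, naive_ext($url), careful_ext($url);
}
```

The naive version finds the dot in the host name when the path has none, which is how you end up with an "extension" of com/foo/. Whichever tool you pick, it is the test data that catches this.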

Note that this code will not handle multi-level extensions, such as .tar.gz

Ahh yes. Originally, as can be seen from the list I provided in my OP, I intended to post two solutions, one along the MS-type lines and one along a more natural 'bundled' extension line. However, I got a bit distracted by using CGI to output that table (yes, it took me a while, Amoe, but that's ok, I was using it to learn basic CGI) and completely forgot to post the other solution. :-)

So in penance I offer two variants of the above regex. One will return all of the extensions bundled together; the other will return ONLY the last two or fewer extensions. This second variant could easily be modified for whatever level of bundling is required. I haven't included the full regex; these two snippets should fit in place over my earlier filename part and extension part, leaving the other parts untouched.

# regex snippet for matching
# at most two bundled extensions
# foobar.gzip      -> foobar,.gzip
# foobar.tar.gzip  -> foobar,.tar.gzip
# foo.bar.tar.gzip -> foo.bar,.tar.gzip
# the snippet should paste into place over
# my earlier matches for filename and extension
                  (            #capture the filename
                      [^./?]   #  does not start with a . or ? or /
                      [^/?]+?  #  all chars not / or ? , (ctd.)
                               #    --leave stuff for rest of regex
                   )?          #we do not have to have a filename

                  (            #capture the extension
                     (?:       #  group but do not capture
                        \.     #     they start with dots you know
                        [^?.]* #     any chars that are not a . or ?
                     ){0,2}    #  anywhere from 0 to 2 exts please.
                   )           #thanks..

# regex snippet for matching
# filename and all bundled extensions
# foobar.gzip      -> foobar,.gzip
# foobar.tar.gzip  -> foobar,.tar.gzip
# foo.bar.tar.gzip -> foo,.bar.tar.gzip
# the snippet should paste into place over
# my earlier matches for filename and extension
                  (            #capture the filename
                      [^./?]   #  does not start with a . or ? or /
                      [^/?.]+? #  all chars not / or ? or .
                               #    --leave stuff for rest of regex
                   )?          #we do not have to have a filename

                  (            #capture the extension
                      \.       #   they start with dots you know
                      [^?]*    #   any chars that are not a ?
                   )?          #they are optional you know
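
If you want to try the two snippets without digging out the full regex, the sketch below wraps each in a stand-in frame. The leading path skipper, the trailing "query string or end" check, and the helper split_name are my assumptions for this demo, not the parts from my earlier post, so treat it as a test rig rather than the real thing. It also expects a bare filename or a URL that actually has a path; a host-only URL will fool the stand-in prefix.

```perl
use strict;
use warnings;

# ASSUMPTION: the first and last lines of each pattern are stand-ins for
# the parts of the earlier regex that are not shown in this post.
my $two_exts = qr{
    ^ (?: [^?]* / )?             # stand-in: skip to the last slash before any query
    ( [^./?] [^/?]+? )?          # filename, as in the first snippet
    ( (?: \. [^?.]* ){0,2} )     # at most two bundled extensions
    (?: \? | \z )                # stand-in: stop at a query string or the end
}x;

my $all_exts = qr{
    ^ (?: [^?]* / )?             # stand-in: skip to the last slash before any query
    ( [^./?] [^/?.]+? )?         # filename, as in the second snippet
    ( \. [^?]* )?                # all bundled extensions
    (?: \? | \z )                # stand-in: stop at a query string or the end
}x;

# Hypothetical helper: return (filename, extension), or an empty list
# when the pattern does not match at all.
sub split_name {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    return (defined $1 ? $1 : '', defined $2 ? $2 : '');
}

for my $file (qw(foobar.gzip foobar.tar.gzip foo.bar.tar.gzip)) {
    printf "%-18s two: %-20s all: %s\n", $file,
        join(',', split_name($two_exts, $file)),
        join(',', split_name($all_exts, $file));
}
```

One thing this rig makes easy to spot: if I read the patterns right, the +? in the filename part demands at least two characters, so a one-character name such as a.gz does not split into a and .gz with either variant. Changing +? to *? should cure that, but test it against your own data first.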

Anyway, Jazz, thanks for the analysis. I didn't know that bit about the @valid_extensions in File::Basename.
Yves

--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
