PerlMonks  

Re: Regex to match file extension in URL

by Jazz (Curate)
on Sep 09, 2001 at 23:50 UTC (#111318)


in reply to Regex to match file extension in URL

This alternative uses File::Basename to extract the filename and the query string. It then uses the extension and parameter capturing portions of demerphq's regex (posted above) to extract the extension and strip the query string, if any.

    #!/usr/bin/perl
    use File::Basename;
    use strict;

    my @files = (
        'http://server.com/subdir/index.html',
        'http://server.com/subdir/dist.tar.gz',
        'http://server.com/whatever.cgi?testing=1',
        'ftp://server.com/pub/whatever.zip',
        'file://local/subdir/testing.txt',
    );

    foreach my $file ( @files ){
        my $suffix = ( fileparse( $file, '\..*$' ) )[2];
        $suffix =~ s/(\.?[^.?]*)?\?.*?$/$1/;
        print $suffix, "\n";
    }

Note that this code will not handle multi-level extensions, such as .tar.gz. The extension for dist.tar.gz will be reported as .gz (same deal with demerphq's code).

For extensions of this type, you'll probably need to create an array that's populated with the valid file extensions. Conveniently, you can pass this array to File::Basename's fileparse to ignore any extension that isn't in the list. Example:

    my @valid_extensions = qw/ .tar.gz .html .zip /;

    foreach my $file ( @files ){
        my $suffix = ( fileparse( $file, @valid_extensions ) )[2];
        print $suffix, "\n";
    }

The above code will list a suffix only for the file types noted in @valid_extensions (not the txt or cgi files).
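
One caveat with the list approach: fileparse() treats each suffix as a regular expression pattern, so the dots in an unescaped entry such as .tar.gz match any character. Below is a minimal sketch of a stricter variant. The helper name known_suffix is hypothetical (it is not part of File::Basename), and stripping the query string before parsing is my own assumption about what is wanted:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

# fileparse() treats each suffix as a regex pattern, so escape the
# literal extensions; otherwise each dot would match any character.
my @valid_extensions = map { quotemeta($_) } qw/ .tar.gz .html .zip /;

# Hypothetical helper (not part of File::Basename): drop any query
# string, then return the recognised suffix, or '' if none matches.
sub known_suffix {
    my ($url) = @_;
    ( my $path = $url ) =~ s/\?.*\z//s;
    return ( fileparse( $path, @valid_extensions ) )[2];
}

for my $url (
    'http://server.com/subdir/dist.tar.gz',
    'http://server.com/subdir/index.html?x=1',
    'file://local/subdir/testing.txt',
) {
    printf "%-45s '%s'\n", $url, known_suffix($url);
}
```

With the escaped list, a name such as xtarygz is no longer mistaken for a .tar.gz file.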

Jasmine


Re: Regex to match file extension in URL -- Bundled Extensions
by demerphq (Chancellor) on Sep 10, 2001 at 14:14 UTC
    Hi Jazz,

    This alternative uses File::Basename to extract the filename and the query string.

    Hey! That's cheating! :-)
    No, just kidding. Actually you are quite right. Using File::Basename is much better than using a roll-your-own regex: you are much less likely to find that the regex doesn't work on some strange OS, and some of the weirder cases are properly handled. (For instance, a really robust regex would match BOTH / and \.) OTOH it _is_ a worthy educational process to learn how to do this. Tokenizing filenames with a regex is not a trivial exercise and IMHO therefore makes a good learning opportunity.

    The non-trivial nature of tokenizing such a string is illustrated incidentally in the post by crazyinsomniac. Now this is a senior monk, with undoubtedly considerable experience, yet clearly he didn't examine too many cases with either his substr/index solution or his regex solution. When I run his solutions against my earlier posted test data I get some perverse results indeed. (The regex and substr versions don't even produce the same results.)

        # selected results of CrazyInsomniacs Substr impl.
        # double spacing converted to single by me.
        http://perlmonks.com/index.pl?node_id=68135
        looks like the file name is: index.pl?node_id=68135
        and the extension is: pl?node_id=68135
        We even got a query string, whoa: node_id=68135
        so the true filename would be: index.pl?
        and the true file extension would b: pl
        http://www.foobar.com/foo/
        looks like the file name is:
        and the extension is: com/foo/
        http://www.foobar.com/foo?test
        looks like the file name is: foo?test
        and the extension is: com/foo?test
        We even got a query string, whoa: test
        so the true filename would be: foo?
        and the true file extension would b: com/foo
        http:///file.ext
        looks like the file name is: file.ext
        and the extension is: ext

        #Selected results of CrazyInsomniacs regex implementation
        #input string added by me
        http://www.foobar.com
        (, , )
        (http, www.foobar.com/, foo/bar/foobar.html)
        http://www.foobar.com/foo/bar/foo.bar.html
        http:///file.ext
        (, , )

    Actually, for me there is a moral here: MOST times that I have seen this type of issue attacked with substr() and index(), the result is wrong! There is a notable pain in the ass poster on CLPM (who shall remain nameless, scales and all) who insists on solving every problem she can with substr and index and rindex. Most of these 'solutions' crack under proper test data. On the regex level there is another moral: obvious, intuitive regexes in my experience don't usually work the way one might wish. :-)
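
    The point about proper test data can be made concrete with a small table-driven check. This is only a sketch: url_extension is a hypothetical function of my own (not crazyinsomniac's code, and not the regex posted earlier in the thread), and the expected values are my assumptions about what a sensible answer is for each of the awkward URLs quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# Hypothetical function under test: returns the last extension of the
# path part of a URL (with its leading dot), or '' if there is none.
sub url_extension {
    my ($url) = @_;
    ( my $path = $url ) =~ s/[?#].*\z//s;          # drop query/fragment
    $path =~ s{\A[a-z][a-z0-9+.-]*://[^/]*}{}i;    # drop scheme and host
    my ($ext) = $path =~ m{(\.[^./]*)\z};          # dot in last segment
    return $ext // '';
}

# The awkward cases from the output above, with assumed expectations.
my @cases = (
    [ 'http://perlmonks.com/index.pl?node_id=68135' => '.pl'   ],
    [ 'http://www.foobar.com/foo/'                  => ''      ],
    [ 'http://www.foobar.com/foo?test'              => ''      ],
    [ 'http:///file.ext'                            => '.ext'  ],
    [ 'http://www.foobar.com'                       => ''      ],
    [ 'http://www.foobar.com/foo/bar/foo.bar.html'  => '.html' ],
);

is( url_extension( $_->[0] ), $_->[1], $_->[0] ) for @cases;
done_testing;
```

    Any candidate solution can be dropped in place of url_extension and run against the same table, which makes disagreements between two implementations obvious.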

    Note that this code will not handle multi-level extensions, such as .tar.gz

    Ahh yes. Originally, as can be seen from the list I provided in my OP, I intended to post two solutions: one along the MS-type lines, and one along a more natural 'bundled' extension line. However, I got a bit distracted by using CGI to output that table (yes, it took me a while, Amoe, but that's OK, I was using it to learn basic CGI) and completely forgot to post the other solution. :-)

    So in penance I offer two variants of the above regex. One will return all of the extensions bundled together; the other will return ONLY the last two (or fewer) extensions. This second variant could easily be modified for whatever level of bundling is required. I haven't included the full regex; these two snippets should fit in place over my earlier filename part and extension part, leaving the other parts untouched.

        # regex snippet for matching
        # at most two bundled extensions
        # foobar.gzip      -> foobar,.gzip
        # foobar.tar.gzip  -> foobar,.tar.gzip
        # foo.bar.tar.gzip -> foo.bar,.tar.gzip
        # the snippet should paste into place over
        # my earlier matches for filename and extension
        (            # capture the filename
          [^./?]     # doesn't start with a . or ? or /
          [^/?]+?    # all chars not / or ? , (ctd.)
                     # --leave stuff for rest of rex
        )?           # we don't have to have a filename
        (            # capture the extension
          (?:        # group but don't capture
            \.       # they start with dots you know
            [^?.]*   # any chars that aren't a . or ?
          ){0,2}     # anywhere from 0 to 2 exts please.
        )            # thanks..

        # regex snippet for matching
        # filename and all bundled extensions
        # foobar.gzip      -> foobar,.gzip
        # foobar.tar.gzip  -> foobar,.tar.gzip
        # foo.bar.tar.gzip -> foo,.bar.tar.gzip
        # the snippet should paste into place over
        # my earlier matches for filename and extension
        (            # capture the filename
          [^./?]     # doesn't start with a . or ? or /
          [^/?.]+?   # all chars not / or ? or .
                     # --leave stuff for rest of rex
        )?           # we don't have to have a filename
        (            # capture the extension
          \.         # they start with dots you know
          [^?]*      # any chars that aren't a ?
        )?           # they are optional you know

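
    Since the snippets are only fragments, here is a hedged sketch of how they might be exercised on their own. It does not reproduce the full regex from the earlier post. Instead it assumes the query string is stripped and the last path segment isolated first, then anchors each fragment with \A and \z. The names $two_ext, $all_ext and split_name are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Variant 1: keep at most the last two extensions together.
my $two_ext = qr{
    \A
    ( [^./?] [^/?]+? )?          # filename (may contain dots)
    ( (?: \. [^?.]* ){0,2} )     # zero, one or two extensions
    \z
}x;

# Variant 2: everything from the first dot onwards is the extension.
my $all_ext = qr{
    \A
    ( [^./?] [^/?.]+? )?         # filename (no dots)
    ( \. [^?]* )?                # all bundled extensions
    \z
}x;

# Hypothetical helper: strip the query string, take the last path
# segment, and return (filename, extension) using the given pattern.
sub split_name {
    my ( $url, $re ) = @_;
    ( my $path = $url ) =~ s/\?.*\z//s;
    my ($segment) = $path =~ m{([^/]*)\z};
    my ( $name, $ext ) = $segment =~ $re or return;
    return ( $name // '', $ext // '' );
}

my $url = 'http://server.com/pub/foo.bar.tar.gzip?x=1';
print join( ',', split_name( $url, $two_ext ) ), "\n";   # foo.bar,.tar.gzip
print join( ',', split_name( $url, $all_ext ) ), "\n";   # foo,.bar.tar.gzip
```

    Changing the {0,2} quantifier in the first pattern adjusts how many extensions are bundled.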
    Anyway, Jazz, thanks for the analysis. I didn't know that bit about the @valid_extensions in File::Basename.

    Yves

    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
