substitution of illegal chars in filename

lahf has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: substitution of illegal chars in filename by jeffa (Bishop) on Oct 04, 2003 at 13:54 UTC
Check out URI::Escape's `uri_unescape()` function. However, before you go slamming this into a `-pi -e` one-liner, consider using an HTML parser instead (if you only want to change, say, `<a>` or `<img>` tags). UPDATE: Hmmm, now I see what you mean in your last paragraph. You want to change the filename referenced in some (HTML?) document AND you want to change that file's name as well? If so, you will need to keep track of the offending files you find (in a hash), and after you have finished cleansing the document, you can iterate through the hash and use something like `rename` to change each name. Hope this helps. :) jeffa L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat)
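For readers following along, here is a minimal sketch of the escaping step, written with core Perl only so it runs without installing anything. `percent_decode` does roughly what `uri_unescape()` does for simple single-byte escapes; `percent_encode` goes the other way. Both function names and the "safe" character set are assumptions made for illustration, not part of URI::Escape.

```perl
use strict;
use warnings;

# Turn %XX escapes back into the characters they stand for,
# e.g. "%3F" becomes "?". Roughly what uri_unescape() does
# for plain single-byte escapes.
sub percent_decode {
    my ($name) = @_;
    $name =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $name;
}

# The reverse: replace anything outside a conservative "safe" set
# with its %XX escape. The safe set here is an assumption.
sub percent_encode {
    my ($name) = @_;
    $name =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $name;
}

print percent_decode('index.cgi%3Fsect'), "\n";   # index.cgi?sect
print percent_encode('index.cgi?sect'), "\n";     # index.cgi%3Fsect
```

Which direction you want depends on the target system: decoding gives names with a literal `?`, encoding gives names that contain only the safe set plus `%`.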
Re: Re: substitution of illegal chars in filename by lahf (Initiate) on Oct 04, 2003 at 14:12 UTC
Wow, I am severely out of my depth now! Cool paradiddles, in 6/4 time no less.
Re: substitution of illegal chars in filename by lahf (Initiate) on Oct 04, 2003 at 14:04 UTC
In the filename: index.cgi%3Fsect = has the alt code; index.cgi?sect = has the reserved character. Linux can read them fine as part of a filename, but say I wanted to put them onto a different system, I would have problems. I was trying to alias the perl command like so:
`alias subs="perl -pi~ -e 's/@1/@2/g' @3"`
but it did something crazy, well, nothing at all, except print out the command and errors:
`$ subs /www/ads/209.50.251.107/ "" legal-USAGetaway.htm`
`Can't open @3: No such file or directory.`
`Can't do inplace edit: /www/ads/209.50.251.107/ is not a regular file.`
`Can't open : No such file or directory.`
Does that mean I'm not escaping the /'s in the command I'm aliasing? And I didn't get why it says it can't open @3. Does it not recognise "" as @2? Am I crazy, or just expecting too much, or which part of the manual does it say that in?
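A note on the errors above: a shell alias is plain text expansion and has no argument placeholders, so `@1`, `@2` and `@3` are passed to perl unchanged and the three words typed after `subs` are simply appended. Perl therefore sees `@3`, the directory, the empty string and the `.htm` file all as files to edit, which matches the three messages. One way around it, sketched below, is to put the logic in a small Perl script instead. The sub name `substitute_in_files` is made up for this example, and the search text is treated as a literal string (via `\Q...\E`) rather than a regex, which also sidesteps the question of escaping the slashes.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Replace every literal occurrence of $from with $to in each file.
# A backup copy with a trailing "~" is kept, as "perl -pi~" would do.
# Returns the total number of replacements made.
sub substitute_in_files {
    my ($from, $to, @files) = @_;
    die "search text must not be empty\n" unless length $from;
    my $total = 0;
    for my $file (@files) {
        unless (-f $file) {
            warn "skipping $file: not a regular file\n";
            next;
        }
        open my $in, '<', $file or die "Can't read $file: $!\n";
        my $text = do { local $/; <$in> } // '';
        close $in;

        # \Q...\E makes slashes, dots and question marks match literally.
        my $count = ($text =~ s/\Q$from\E/$to/g) || 0;
        next unless $count;

        copy($file, "$file~") or die "Can't back up $file: $!\n";
        open my $out, '>', $file or die "Can't write $file: $!\n";
        print {$out} $text;
        close $out or die "Can't close $file: $!\n";
        $total += $count;
    }
    return $total;
}

# Command-line use: search text, replacement, then one or more files.
substitute_in_files(@ARGV) if @ARGV >= 3;
```

Saved under a name of your choosing (say `subs.pl`, a hypothetical name), it would be called as `perl subs.pl '/www/ads/209.50.251.107/' '' legal-USAGetaway.htm`.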
Re: substitution of illegal chars in filename by matsmats (Monk) on Oct 04, 2003 at 17:47 UTC
I might be way off here, but my guess is that you have fetched a lot of pages from the web, and want to click around between them locally - and you're running into trouble with the characters basically at the "cgi?"-files. Am I right? Perhaps, then, you should look into fetching the pages again with something like wget. It's made for just that. The -E switch for wget might do what you're after. Mats
Re: Re: substitution of illegal chars in filename by lahf (Initiate) on Oct 04, 2003 at 18:01 UTC
Cheers Matt, that's exactly what I'm doing on Linux. I have to, as I'm on a 56k modem. wget parses the files, but keeps the filename intact with the ?whatever bit after the filename. The -E switch only adds .html to the end of it, so I end up getting filename.cgi.html for instance, and it's still parsed as cgi. wget is not too great, though, but I don't know what else to use. It's a real pain, but my only other choice is to let my wife have the phone 24 hours. It's a domestic thing. I do it at night; she needs it during the day. shed sum calories, lahf a little, then digest more perl
Re: substitution of illegal chars in filename by graff (Chancellor) on Oct 05, 2003 at 05:20 UTC
Oh yeah, putting things like ampersands and quotes into file names is one of the "features" of wget that tends to put that tool on my "do not use" list. I'd rather spend a little more time probing a web site myself, and using a Perl script with the LWP module to focus on the sets of URLs I really want -- and as I fetch each page, assign a sensible file name (with no shell-magic characters) to save it locally. But trying to maintain the linkages among the hrefs inside each file is a bit more challenging; jeffa's reply has the basic approach: convert all the wget-assigned file names to sensible names first (making sure to avoid collisions), rename the files, and keep the old-new relations in a hash; then, for each file in the harvest, replace all occurrences of a wget-style (cgi-based) file name string with the corresponding sensible name. Tedious, but not so difficult.
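The rename-and-relink approach described above could be sketched roughly as follows, using core Perl only. All of the sub names are invented for this example, and the rule "anything outside letters, digits, dot, underscore and hyphen becomes an underscore" is just one assumed policy. The link rewriting matches old names literally, so hrefs that spell the name differently from the file on disk (for example `%3F` in the page but `?` on disk) would need decoding first.

```perl
use strict;
use warnings;

# One possible "sensible name" policy (an assumption): collapse every run
# of characters outside A-Z a-z 0-9 . _ - into a single underscore.
sub safe_name {
    my ($name) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9._-]+/_/g;
    return $safe;
}

# Build the old => new hash. Names that are already safe keep their name
# and reserve it; the rest get a numeric suffix if their safe name is taken.
sub build_rename_map {
    my @old_names = @_;
    my (%map, %taken);
    for my $old (@old_names) {
        next unless safe_name($old) eq $old;
        $map{$old}   = $old;
        $taken{$old} = 1;
    }
    for my $old (grep { !exists $map{$_} } @old_names) {
        my $base = safe_name($old);
        my ($new, $n) = ($base, 1);
        $new = $base . '_' . $n++ while $taken{$new};
        $taken{$new} = 1;
        $map{$old}   = $new;
    }
    return %map;
}

# Rename the files on disk according to the map.
sub rename_files {
    my (%map) = @_;
    for my $old (keys %map) {
        next if $old eq $map{$old};
        rename $old, $map{$old} or warn "Can't rename $old: $!\n";
    }
}

# Replace every old name found in $text with its new name, in one pass.
# Longer names are tried first so a short name cannot match inside a long one.
sub rewrite_links {
    my ($text, %map) = @_;
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %map;
    return $text unless length $alt;
    $text =~ s/($alt)/$map{$1}/g;
    return $text;
}
```

A typical run would build the map from the files in the download directory, call `rename_files`, then feed each page's contents through `rewrite_links` and write the result back. It is worth testing on a copy of the directory first.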
Re: Re: substitution of illegal chars in filename by lahf (Initiate) on Oct 10, 2003 at 14:06 UTC
I wouldn't say it's so much a feature as an automatic filename; wget has not been given the chance to be clever and save, say, the address of this file with all those %20s, which are supposed to be spaces, and %3As, which are colons I think, and also the ?s as well, which are not automatically replaced by their alt codes. Maybe it would be better to isolate the code in wget so that it changes the names automatically itself. The only thing is, I'm not a coder. I can do the odd thing, but I feel like I'd have to learn the whole language first, which I don't want to do. I just want to know what the things I need are, how to use them, and what other essential things I'd have to put in the script. I already spend hours poring through HTML & PHP, and VB, C, C++, but not Perl yet.

