PerlMonks  

substitution of illegal chars in filename

by lahf (Initiate)
on Oct 04, 2003 at 13:47 UTC ( [id://296504]=perlquestion )

lahf has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

How can I substitute the part of a filename that contains reserved characters, or the escape ("alt") codes that stand in for them?
For instance:
 index.cgi%3Fsect = index.cgi?sect
Windows can't read : or \ properly as part of a filename, but Linux can. Yes, I know Windows translates \ into the directory structure, whereas Linux keeps it as part of the filename, and the same goes for other reserved characters too!

So, I have the very basics of something, e.g.
perl -pi~ -e 's/badcharshere/goodcharshere/g' filename.ext

Now this translates what's inside the file, but not the filename itself, so I end up with bad links!
Is there a script that does this already?
Can anyone point me to it, please?

Thanks ever so much for your time, effort and patience.

lahf

Replies are listed 'Best First'.
Re: substitution of illegal chars in filename
by jeffa (Bishop) on Oct 04, 2003 at 13:54 UTC
    Check out URI::Escape's uri_unescape() function. However, before you go slamming this into a -pi -e one-liner, consider using an HTML parser instead. (If you only want to change, say, <a> or <img> tags.)
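
    As a minimal sketch of what uri_unescape() does with the filename from the question: it assumes the URI::Escape module (part of the URI distribution on CPAN, not core Perl) is installed, and the "safe set" passed to uri_escape() is only an illustrative choice.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# The escaped form as it appears in a link, and the decoded form on disk.
my $escaped = 'index.cgi%3Fsect';
my $decoded = uri_unescape($escaped);
print "$escaped -> $decoded\n";    # index.cgi%3Fsect -> index.cgi?sect

# Going the other way: escape everything outside a conservative safe set.
# The character class here is an assumption chosen for this example.
my $again = uri_escape($decoded, '^A-Za-z0-9._-');
print "$decoded -> $again\n";      # index.cgi?sect -> index.cgi%3Fsect
```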

    UPDATE:
    Hmmm, now I see what you mean in your last paragraph. You want to change the filename referenced in some (HTML?) document AND you want to change that file's name as well? If so, you will need to keep track of the offending files you find (in a hash), and after you have finished cleansing the document, you can iterate through the hash and use something like rename to change each name. Hope this helps. :)
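
    The hash-plus-rename idea could be sketched roughly as follows, using core Perl only. The helper names safe_name and rename_offenders are hypothetical, and replacing every risky character with an underscore is just one possible policy.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: map a file name to one that is legal on Windows as
# well as Linux by turning every character outside a safe set into "_".
sub safe_name {
    my ($name) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9._-]/_/g;
    return $safe;
}

# Rename every offending file in $dir; return a hash of old => new names.
sub rename_offenders {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @names = grep { -f File::Spec->catfile($dir, $_) } readdir $dh;
    closedir $dh;

    my %taken = map { $_ => 1 } @names;    # names already in use
    my %renamed;
    for my $old (sort @names) {
        my $new = safe_name($old);
        next if $new eq $old;              # nothing to fix
        # Avoid collisions by appending a counter until the name is free.
        my ($base, $n) = ($new, 1);
        $new = $base . '_' . $n++ while $taken{$new};
        $taken{$new} = 1;
        rename File::Spec->catfile($dir, $old), File::Spec->catfile($dir, $new)
            or die "Cannot rename $old: $!";
        $renamed{$old} = $new;
    }
    return %renamed;
}

# Example call (not run here, since it renames real files):
#   my %map = rename_offenders('mirror');
#   print "$_ -> $map{$_}\n" for sort keys %map;
```

    The returned hash is what the later link-fixing pass needs: each key is a name as wget saved it, each value the cleaned-up name.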

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Wow, I am severely out of my depth now! Cool paradiddles, in 6/4 time no less
Re: substitution of illegal chars in filename
by lahf (Initiate) on Oct 04, 2003 at 14:04 UTC
    In the filename:

    index.cgi%3Fsect = has the alt code in it
    index.cgi?sect = has the reserved character

    Linux can read both fine as part of a filename, but if I wanted to put them onto a different system, I would have problems.
    I was trying to alias the perl command like so:
    alias subs="perl -pi~ -e 's/@1/@2/g' @3"
    but it did nothing useful; it just printed out errors:
    $ subs /www/ads/209.50.251.107/ "" legal-USAGetaway.htm
    Can't open @3: No such file or directory.
    Can't do inplace edit: /www/ads/209.50.251.107/ is not a regular file.
    Can't open : No such file or directory.
    Does that mean I'm not escaping the /'s in the command I'm aliasing? And I don't get why it says it can't open @3; does it not recognise "" as @2?
    Am I crazy, or just expecting too much? Or which part of the manual covers this?
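
    A note on the errors above: a shell alias is plain text replacement and has no positional parameters, so @1, @2 and @3 stay as written and the three arguments are simply appended as file names, which matches the three messages. One way around it is a small script that takes its own arguments. The sketch below is an assumption about what was intended (a literal search and replace); the name subst_in_file is hypothetical.

```perl
use strict;
use warnings;

# Replace every literal occurrence of $from with $to inside $file, keeping
# a backup copy named "$file~" (as perl -pi~ does). Returns the number of
# replacements made.
sub subst_in_file {
    my ($from, $to, $file) = @_;
    die "FROM must not be empty\n" if $from eq '';

    open my $in, '<', $file or die "Cannot read $file: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    # \Q...\E quotes regex metacharacters, so "/" and "." need no escaping.
    my $count = ($text =~ s/\Q$from\E/$to/g) || 0;
    return 0 unless $count;               # leave unchanged files alone

    rename $file, "$file~" or die "Cannot back up $file: $!";
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $text;
    close $out or die "Cannot close $file: $!";
    return $count;
}

# Command-line use: perl subs.pl FROM TO FILE...
if (@ARGV >= 3) {
    my ($from, $to, @files) = @ARGV;
    printf "%s: %d replacement(s)\n", $_, subst_in_file($from, $to, $_)
        for @files;
}
```

    Saved as subs.pl (a hypothetical name), the earlier attempt would become: perl subs.pl /www/ads/209.50.251.107/ "" legal-USAGetaway.htm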
Re: substitution of illegal chars in filename
by matsmats (Monk) on Oct 04, 2003 at 17:47 UTC

    I might be way off here, but my guess is that you have fetched a lot of pages from the web, and want to click around between them locally - and you're running into trouble with the characters basically at the "cgi?"-files. Am I right?

    Perhaps, then, you should look into fetching the pages again with something like wget. It's made for just that. The -E switch for wget might do what you're after.

    Mats

      Cheers Mats,
      That's exactly what I'm doing on Linux. I have to, as I'm on a 56k modem. wget parses the files, but keeps the filename intact with the ?whatever bit after the filename.
      The -E switch only adds .html to the end of it, so I end up getting filename.cgi.html for instance; it's still parsed as cgi. wget is not too great though, but I don't know what else to use.
      It's a real pain, but my only other choice is to let my wife have the phone 24 hours.
      It's a domestic thing. I do it at night; she needs it during the day.

      shed sum calories, lahf a little, then digest more perl
Re: substitution of illegal chars in filename
by graff (Chancellor) on Oct 05, 2003 at 05:20 UTC
    Oh yeah, putting things like ampersands and quotes into file names is one of the "features" of wget that tends to put that tool on my "do not use" list. I'd rather spend a little more time probing a web site myself, and using a perl script with the LWP module to focus on the sets of urls I really want -- and as I fetch each page, assign a sensible file name (with no shell-magic characters) to save it locally.
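
    The "assign a sensible file name" step might look something like this sketch. It covers only the naming, not the fetching; local_name_for is a hypothetical helper and the rules it applies are one possible choice.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a local file name with no shell-magic
# characters from a URL.
sub local_name_for {
    my ($url) = @_;
    (my $name = $url) =~ s{^[a-z][a-z0-9+.-]*://}{}i;    # drop the scheme
    $name =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;        # decode %XX escapes
    $name =~ s/[^A-Za-z0-9._-]+/_/g;                     # risky runs become "_"
    $name =~ s/^_+|_+$//g;                               # trim stray underscores
    return $name eq '' ? 'index.html' : $name;
}

print local_name_for('http://example.com/index.cgi?sect=news&page=2'), "\n";
# prints: example.com_index.cgi_sect_news_page_2

# With LWP installed, the page could then be saved under that name, e.g.
# with LWP::Simple's getstore($url, $file) (not run in this sketch).
```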

    But trying to maintain the linkages among the href's inside each file is a bit more challenging; jeffa's reply has the basic approach: convert all the wget-assigned file names to sensible names first (making sure to avoid collisions), rename the files, and keep the old-new relations in a hash; then, for each file in the harvest, replace all occurrences of a wget-style (cgi-based) file name string with the corresponding sensible name. Tedious, but not so difficult.
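
    The last step, replacing old names inside each harvested file, could be sketched as below. It assumes the old => new hash already exists and that escaped links use uppercase hex (%3F rather than %3f); relink is a hypothetical name.

```perl
use strict;
use warnings;

# Replace each old file name in $text with its new name. Both the literal
# form (index.cgi?sect) and the percent-escaped form (index.cgi%3Fsect)
# are matched, since links may use either.
sub relink {
    my ($text, %new_for) = @_;
    my %lookup;
    for my $old (keys %new_for) {
        (my $escaped = $old) =~
            s/([^A-Za-z0-9._\/-])/sprintf('%%%02X', ord($1))/ge;
        $lookup{$old}     = $new_for{$old};
        $lookup{$escaped} = $new_for{$old};
    }
    # Longest names first, so a short name never clobbers part of a long one.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %lookup;
    return $text if $alt eq '';
    $text =~ s/($alt)/$lookup{$1}/g;
    return $text;
}

my %new_for = ('index.cgi?sect' => 'index_sect.html');
my $html = '<a href="index.cgi%3Fsect">one</a> <a href="index.cgi?sect">two</a>';
print relink($html, %new_for), "\n";
# prints: <a href="index_sect.html">one</a> <a href="index_sect.html">two</a>
```

    A plain string replacement like this can also touch text outside of links; an HTML parser, as suggested earlier in the thread, would be the more careful route.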

      I wouldn't say it's so much a feature as an automatic filename. wget has not been given the chance to be clever and save, say, the address of this file with all those %20s, which are supposed to be spaces, and %3As, which are colons I think, and also the ?s, which are not automatically replaced by their alt codes. Maybe it would be better to isolate the code in wget and have it change the names automatically itself. The only thing is, I'm not a coder. I can do the odd thing, but I feel like I'd have to learn the whole language first, which I don't want to do. I just want to know what things I need, how to use them, and what other essential things I'd have to put in the script. I already spend hours poring through HTML & PHP, and VB, C and C++, but not Perl yet.
