PerlMonks  

Is there a good way to unify text files something like dos2nix shell script(s) do?

by taint (Chaplain)
on Dec 20, 2013 at 03:30 UTC ( #1067911=perlquestion )
taint has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks.

Today I went looking for a replacement for Padre after some really bad experiences with it. I only found a couple of editors, and they were immature, so no joy. Anyway, to the point. As a rule, I always examine any code before I install it. I look for mixed charsets, mixed line endings, and other "oddities" that I feel make for a bad release. Maybe it's just me, but in this day and age it's hard to imagine anyone releasing source that isn't utf-8, with unified line endings and no trailing/hanging spaces. To me, anything else is just bad policy. Performing the "unification" task within my editor is such a pill, and when confronted with mixed-iso files, I don't trust the frequently used shell scripts that convert/unify line endings not to corrupt documents that aren't all utf-8 encoded (without BOM). So I guess what I'm asking is: does anyone know of a Perl script that safely "unifies" text/source files? Or can anyone suggest a way to do it?

Thanks for all your consideration.

--Chris

EDIT: Line endings and trailing spaces are trivial to change.
Correctly determining the encoding/charset, and converting to utf-8, can be tricky.
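
As a rough illustration of why the detection step is the tricky part, here is a minimal sketch that "tastes" raw bytes using only the core Encode module. The helper name taste_bytes and its four labels are made up for this example. It can confirm that bytes are valid UTF-8, but it cannot say which legacy encoding a non-UTF-8 file uses; that still needs outside knowledge or a guessing module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: classify raw bytes without modifying them.
# Returns 'utf-8-bom', 'ascii', 'utf-8' or 'unknown-8bit'.
sub taste_bytes {
    my ($bytes) = @_;

    # A leading EF BB BF is the UTF-8 encoded byte order mark.
    return 'utf-8-bom' if $bytes =~ /\A\xEF\xBB\xBF/;

    # Pure 7-bit data is already valid UTF-8; nothing to convert.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # Strict decode: FB_CROAK dies on the first malformed sequence,
    # LEAVE_SRC keeps $bytes untouched.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf-8' : 'unknown-8bit';
}

# Usage: print the verdict for each file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or do { warn "skip $file: $!\n"; next };
    my $bytes = do { local $/; <$fh> };
    printf "%-14s %s\n", taste_bytes($bytes // ''), $file;
}
```

A file reported as unknown-8bit is one to leave alone until its real encoding is known.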

Yes. What say about me, is true.

Re: Is there a good way to unify text files something like dos2nix shell script(s) do?
by 2teez (Priest) on Dec 20, 2013 at 05:21 UTC

    Hi taint,
    I know that on a Linux box one can use the split command to break a file into pieces of a given size, e.g. split -b 1024m file_name bfile, and then use cat to assemble the pieces again, e.g. cat bfile* > file_name
    I know it has been used for several file types.
    I don't know if that is what you are looking for, or if Windows has something similar.
    man split and man cat show their usage.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me

      Thanks for the reply, 2teez.

      I'm also on a *NIX box (FreeBSD). It looks like I may not have used the best word (unify) to describe my ultimate goal. That goal is to parse files recursively and, based on their format (iso-*-*, line endings, perhaps trailing spaces), unify them in the sense that they are all the same in those respects. Ultimately (for me) that means utf-8, *NIX line endings, and no trailing spaces. I don't have much difficulty making the conversions; the hard part is "tasting" the file beforehand, so as to convert it w/o buggering it up. For example, take a file in a different (spoken) language that isn't already utf-8: knowing in advance what it is, and converting it to utf-8, can be tricky, even tho I know Perl is pretty good at it.
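
The recursive "tasting" pass described here could be sketched roughly as follows, using the core File::Find module. The names survey_bytes and survey_tree are hypothetical, and the -T text-file test is only a heuristic. The sketch reports what each file contains and changes nothing on disk.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: count line-ending styles and runs of trailing
# whitespace in a chunk of raw bytes.
sub survey_bytes {
    my ($bytes) = @_;
    my %seen = (crlf => 0, lf => 0, cr => 0, trailing_ws => 0);
    $seen{crlf}++ while $bytes =~ /\r\n/g;        # DOS/Windows
    $seen{lf}++   while $bytes =~ /(?<!\r)\n/g;   # *NIX
    $seen{cr}++   while $bytes =~ /\r(?!\n)/g;    # lone CR (old Mac)
    $seen{trailing_ws}++ while $bytes =~ /[ \t]+(?=\r?\n|\r|\z)/g;
    return \%seen;
}

# Hypothetical helper: walk $root and survey every plain text file.
sub survey_tree {
    my ($root) = @_;
    my %report;
    find({
        no_chdir => 1,    # $_ holds the full path of each entry
        wanted   => sub {
            return unless -f $_ && -T $_;    # -T is a heuristic text test
            open my $fh, '<:raw', $_ or do { warn "skip $_: $!\n"; return };
            my $bytes = do { local $/; <$fh> };
            $report{$_} = survey_bytes($bytes // '');
        },
    }, $root);
    return \%report;
}

# Usage: perl survey.pl some/dir
if (@ARGV) {
    my $report = survey_tree($ARGV[0]);
    for my $file (sort keys %$report) {
        my $s = $report->{$file};
        print "$file: crlf=$s->{crlf} lf=$s->{lf} cr=$s->{cr} ",
              "trailing_ws=$s->{trailing_ws}\n";
    }
}
```

A file with more than one non-zero line-ending count has mixed endings and is a candidate for closer inspection before any conversion.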

      I'm still searching, and I haven't found a complete solution yet. I did find a couple of interesting Text::Filter modules that may help in cobbling something up. In fact, they're pretty nice general-purpose filters for a lot of things: Text::Filter and Text::Filter::Chain. Even if I don't use them for this project, I can sure think of a lot of other things to use them with. :)

      Thanks again, 2teez, for the reply.

      --Chris

      Yes. What say about me, is true.
      
        "It looks like I may not have used the best word to describe my ultimate goal". No surprises there then.
Re: Is there a good way to unify text files something like dos2nix shell script(s) do?
by soonix (Curate) on Dec 20, 2013 at 09:03 UTC
Re: Is there a good way to unify text files something like dos2nix shell script(s) do?
by Laurent_R (Parson) on Dec 20, 2013 at 18:54 UTC
    Perhaps a starting point for removing trailing spaces and Windows carriage returns from the a*.pl files in the current directory:
    $ perl -i.bak -pe 's/\s*[\r\n]+$/\n/g;' ./a*.pl
      merci! Laurent_R.

      I use something similar myself. But I'm not inclined to make any changes to any of the files until I've managed to discover their encoding and then (successfully) convert them to utf-8. Otherwise, I often end up with "mojibake".
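
One way to express that ordering (decode first, touch line endings only if the decode succeeded) is sketched below. The helper unify_bytes and its $assumed parameter are hypothetical. Note that a single-byte encoding such as iso-8859-1 accepts any byte sequence, so a wrong $assumed value still produces mojibake; the sketch only guarantees that nothing is rewritten when no decoding works at all.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return normalized UTF-8 bytes, or nothing
# (undef in scalar context) if the input cannot be decoded.
# $assumed is the caller's claim about the legacy encoding, tried
# only when the bytes are not already valid UTF-8.
sub unify_bytes {
    my ($bytes, $assumed) = @_;

    # Step 1: get characters, or bail out before changing anything.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    if (!defined $text) {
        return unless defined $assumed;
        $text = eval { decode($assumed, $bytes, FB_CROAK | LEAVE_SRC) };
        return unless defined $text;
    }

    # Step 2: normalize only after a successful decode.
    $text =~ s/\A\x{FEFF}//;     # drop a leading BOM
    $text =~ s/\r\n?/\n/g;       # CRLF and lone CR become LF
    $text =~ s/[ \t]+$//mg;      # strip trailing spaces and tabs

    # Step 3: back to bytes, now UTF-8 without BOM.
    return encode('UTF-8', $text);
}

# Usage: a Latin-1 line with trailing space and a DOS line ending.
my $out = unify_bytes("caf\xE9 \r\n", 'iso-8859-1');
print defined $out ? "converted\n" : "left alone\n";
```

Returning nothing on failure lets the caller skip the file and keep the original bytes intact.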

      Thanks again, for taking the time to reply, Laurent_R.

      --Chris

      Yes. What say about me, is true.
      
