PerlMonks  

Re: selecting columns from a tab-separated-values file

by mildside (Friar)
on Jan 22, 2013 at 00:00 UTC


in reply to selecting columns from a tab-separated-values file

What OS are you using? This answer might seem a bit odd given this is a Perl forum, but sometimes the best solution isn't Perl. (Blasphemy?)

If you are in a Unix environment, and all you need to do is extract the fields as described above, then you can use the Unix command-line program 'cut', which splits on tabs by default. This does exactly what you want without needing to write any code.

cut -f3,1,6 inputfile.txt > outputfile.txt


Re^2: selecting columns from a tab-separated-values file
by mildside (Friar) on Jan 22, 2013 at 01:22 UTC
    Forgot to mention, it should be fast too!
Re^2: selecting columns from a tab-separated-values file
by Lotus1 (Chaplain) on Jan 22, 2013 at 01:49 UTC

    Very cool command, thanks for posting it.

    I just tried it and noticed that it prints out the fields 3,1,6 but in the order 1,3,6. It went through a file with 10^6 lines in about 7 seconds and 2*10^6 lines in 10 seconds. So 10^9 lines should take about an hour. My test data only had 9 fields not 50.

    I'm running on Windows XP, by the way, with 8 cores. I installed GNU textutils a long time ago and am always surprised when I find out about these things I have but don't know how to use.
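To see the ordering behaviour for yourself, here is a small sketch. The sample file and its contents are made up for illustration; only 'cut -f' itself comes from the thread.

```shell
# Build a small tab-separated sample (hypothetical data, six fields per line).
tmpdir=$(mktemp -d)
printf 'a1\tb1\tc1\td1\te1\tf1\n' >  "$tmpdir/input.tsv"
printf 'a2\tb2\tc2\td2\te2\tf2\n' >> "$tmpdir/input.tsv"

# Ask for fields 3,1,6 and, separately, for fields 1,3,6.
out_316=$(cut -f3,1,6 "$tmpdir/input.tsv")
out_136=$(cut -f1,3,6 "$tmpdir/input.tsv")

# cut treats the list as a set of fields and prints them in file order,
# so this shows columns a, c, f for each line.
printf '%s\n' "$out_316"

# The two field lists give identical output.
[ "$out_316" = "$out_136" ] && echo "same output for both field lists"
```

So the order in which the fields are listed after -f makes no difference to what 'cut' prints.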

      You can't reorder the output fields with 'cut', but if you have 'sed' you can do this:
      $ cat t
      FIRST MIDDLE LAST STRNO STRNAME CITY STATE ZIP
      $ sed -n 's/\(.*\t\).*\t\(.*\t\).*\t.*\t\(.*\t\).*\t.*/\2\1\3/p' t
      LAST FIRST CITY
        You can't reorder the output fields with 'cut'

        Isn't that what I just said? Maybe you intended to reply to mildside.

        ...if you have 'sed' you can do this...

        I have sed but I prefer Perl.

        update: So I couldn't resist trying this sed command and found that it works for the input provided but as soon as you add more fields at the end it breaks.

        Given this input file:

        FIRST MIDDLE LAST STRNO CITY STATE ZIP 1 2 3 4 5
        FIRST MIDDLE LAST STRNO CITY STATE ZIP 1 2 3 4
        FIRST MIDDLE LAST STRNO CITY STATE ZIP

        You get this output:

        ZIP FIRST MIDDLE LAST STRNO CITY 3
        STATE FIRST MIDDLE LAST STRNO 2
        LAST FIRST CITY

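        Each greedy '.*' matches as much as it can, so the first group swallows as many leading fields as possible and the remaining pieces end up matching from the right-hand end of the line and working back. '\1' ends up holding everything on the left that the later pieces don't need. For the first line, \1 holds FIRST MIDDLE LAST STRNO CITY, followed by a trailing tab.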

        Here is a version that works.

        C:\b\perlmonks\commands>sed -n "s/^\([^\t]*\t\)[^\t]*\t\([^\t]*\t\)[^\t]*\t[^\t]*\t\([^\t]*\).*/\2\1\3/p" sedtest.csv
        LAST FIRST CITY
        LAST FIRST CITY
        LAST FIRST CITY
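A side-by-side sketch of the two sed expressions from this thread, run on a single line that has two extra fields at the end. The sample line is made up for illustration, and the '\t' escape assumes GNU sed, as used above.

```shell
# One sample line with ten tab-separated fields (hypothetical data).
line=$(printf 'FIRST\tMIDDLE\tLAST\tSTRNO\tSTRNAME\tCITY\tSTATE\tZIP\t1\t2')

# Greedy version: the first group takes as many leading fields as it can,
# leaving just enough tabs for the rest of the pattern to match.
greedy=$(printf '%s\n' "$line" |
  sed -n 's/\(.*\t\).*\t\(.*\t\).*\t.*\t\(.*\t\).*\t.*/\2\1\3/p')

# Anchored version: [^\t]* cannot cross a tab, so every piece is exactly
# one field counted from the start of the line; the final .* eats the rest.
anchored=$(printf '%s\n' "$line" |
  sed -n 's/^\([^\t]*\t\)[^\t]*\t\([^\t]*\t\)[^\t]*\t[^\t]*\t\([^\t]*\).*/\2\1\3/p')

# Prints STRNAME FIRST MIDDLE LAST ZIP (tab-separated, with a trailing tab).
printf 'greedy:   %s\n' "$greedy"
# Prints LAST FIRST CITY (tab-separated).
printf 'anchored: %s\n' "$anchored"
```

With the greedy pattern, the fields picked depend on how many fields the line has. With the anchored pattern they are always fields 3, 1 and 6.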
Re^2: selecting columns from a tab-separated-values file
by ibm1620 (Beadle) on Jan 22, 2013 at 03:29 UTC
    I'm on CentOS, and (as pointed out), cut won't, er, cut it because I can't reorder the fields. (Also I need to do a little bit of processing.)

      Yes sorry, I'd forgotten about the limitation of cut keeping the original field order (it's fast though!).

      Given you need to do a bit of processing as well I think the post by Kenosis should get you well on the way.
