dbach355 has asked for the wisdom of the Perl Monks concerning the following question:
Hello, I am a bit novice at perl and I may be able to resolve this with shell, but I think perl is likely better solution.
The input file is many fields perhaps 100 that are space delimited. My goal is to output a delmited file with a unique delimiter such as \f. The layout is a bit more complicated that some of the fields have spaces in it. To get around this, the preceeding field contains the number of spaces that are in the field. Below is a short example (not the entire file)
Field names (file is not comma delimited) ssn, employee number, Number of characters of employee name, employee name, hire date, number of characters for address, address, state, number of characters for city, city, zip
123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553
The number of fields is fixed, the number of fixed field lengths and variable field lengths varies. I was trying to figure out how to start processing the file and the first 10 fields are of fixed length, and then the variable ones start and are mixed in with some fixed fields.
I am thinking to do something such as an array, or hash to have the field name and maybe the type of field. ssn,empNo,ncEmpName,empName,hireDate,ncAddr,addr,state,ncCity,city,zip (nc=number of characters) f,f,v,d,f,v,d,f,v,d,f (f -fixed, v-stating characters for next field which is variable, d - data of the variable field)
I would then run a subroutine based on if the field is fixed, variable, or data, but then need some method to set the remaining characters to continue processing
The reading of the first fixed fields is simple, reading the first field that contains the number of characters of the next variable field is also simple. Such as in my example it is easy to read ssn empNo ncEmpName...then My thoughts were to start processing the next as an single character array, and when I know the number of characters such as 11, I would pull those characters for the field, so then I am not sure how to read the reaminder of the data as a new field for input to continue processing.
Sorry, this is likely pretty vague what I am trying to describe. I am looking for some suggestions of how to process, such as using single character array and start processing data that way, or if there are some other methods I am likely not familiar with.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: How to process variables length fields in delimited file.
by liverpole (Monsignor) on Oct 06, 2016 at 02:03 UTC | |
My first approach would be to define, programmatically (ie. with a data structure), what the input file contains on each line. Once that's in a script, you run it and prove to yourself that your data does in fact behave as expected. Since each line is made up of space-delimited items, but some of them are count-prefixed, you could define your line format with an array containing an array reference for each item. Each array reference would hold the LABEL of the item (eg. 'ssn' for social-security, 'emp_num' for employee number, etc.), and a compiled regular expression (that's the qr/.../ syntax) used to parse the item. In cases where the item is prefixed with a count, specifying the length of the item, you could use a string like 'COUNT' instead of a regex. Here's an example for what you've defined: Then you write a subroutine parse_line that you call for each line of your input file. (I would also pass in the line number, in case the line doesn't match your formula, so you can die with an error saying which line was invalid). For each array ref in @line_format you either parse the COUNT, and pull off that number of characters, or you apply the next regex. If the data validates, you assign it into a hash local to the subroutine, with the label as the key. When the subroutine completes successfully, you pass back a reference to that hash. Here's how you might write the parse_line subroutine:
When I call that subroutine with the data you defined for a single line:
This simple program dumps as its result:
So I know I'm on the right track. The next steps would be something like;
Does that help? Edit: fixed whom I'm responding to (thanks choroba) s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/ | [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by GrandFather (Saint) on Oct 06, 2016 at 01:37 UTC | |
If the fixed fields really are fixed length rather than space delimited then you can pull the lines apart using a template like this:
Prints:
For output I'd strongly recommend using a module like Text::CSV to generate correctly formatted CSV files.
Premature optimization is the root of all job security
| [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by choroba (Cardinal) on Oct 06, 2016 at 07:44 UTC | |
Update: used tr instead of s .
| [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by Marshall (Canon) on Oct 06, 2016 at 01:29 UTC | |
My goal is to output a delmited file with a unique delimiter such as \f. The real problem with your format are the embedded spaces. These first 10 columns can be handled in a number of ways. What do the other columns look like? Do they contain embedded spaces, like "John Smith"? Do they have a constant field width perhaps? Your goal is achievable. I just need a bit more info.
Update:
Update with code:
One way to describe the fields and implement this is shown below. I do suspect that your problem can be solved "easier" than this, but without more info about the other ~90 columns, I am unsure. Of course the fixed length fields can have trailing spaces, but that is easy to get rid of: or some such similar formulation. Also, a very long but simple (no back-tracking) regex can execute quite quickly. I doubt that a regex approach will be a performance problem even if the regex is so long that it is incomprehensible to a human. | [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by Tux (Canon) on Oct 06, 2016 at 07:40 UTC | |
If the first 10 fields are of fixed length, I'd use unpack on that part. Using A10 for a 10 characted wide field will strip the trailing spaces. Work on from there.
Enjoy, Have FUN! H.Merijn | [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by johngg (Canon) on Oct 06, 2016 at 11:20 UTC | |
This is similar to GrandFather's approach moving along the line field by field a time but uses the @fieldNames array and a counter to determine whether we have an actual field or the field width of the next field.
The output.
I hope this is of interest. Cheers, JohnGG | [reply] [d/l] [select] |
Re: How to process variable length fields in delimited file.
by shadowsong (Pilgrim) on Oct 06, 2016 at 11:37 UTC | |
Hi dbach355 Seeing as how The number of fields is fixed, the number of fixed field lengths and variable field lengths varies - if all you need is another file with custom delimiters; you can achieve this with a one-liner:
The offset within the @F array denotes each input field in your file; so offset 0 would represent ssn, offset 1 employee number and so on; simply craft your output line how you'd like it... See http://www.perl.com/pub/2004/08/09/commandline.html for additional command line options. Cheers, | [reply] [d/l] |
Re: How to process variable length fields in delimited file.
by dbach355 (Initiate) on Oct 06, 2016 at 13:28 UTC | |
Thank you all for your responses. I will review them for a better understanding. I am very novice in perl, so I would like to read details to get a good understanding of the proposed methods I could have put the exact code to begin with, but I did not want to get to long winded, but at times details are better. One reason I was thinking of using the \f character is I don't care about printing the data ( I say that now), the data once in a readable delimited file will pass to SPLUNK application for end use. The problem in the data is there is about every character in the text. There are maybe 1,0000,000 lines of text a day and from the below message these are text from network devices which include characters such as #@$^|}{[]<> and about every character I could think of. They had tabs in also. I finally grepped the file for several days of output and I did not find a \f. Other possibility is to use multicharacter delimiter such as @#! which is unlikely to be together as standard text. Here is the devil in the details of the true layout and an example of 1 data line. I will review and when I have time, comment on the solution. Thank you all
| [reply] [d/l] |
by shmem (Chancellor) on Oct 06, 2016 at 20:10 UTC | |
Here is the devil in the details of the true layout and an example of 1 data line The squirrel is always in the details, since the devil is a squirrel. But I can't help you here with the data you provided (only one record? seriously?) since in "39 router174.network.microsoft.com" - well, "router174.network.microsoft.com" is just 31 chars long, not 39. Even with a NULL terminator it would be 32 chars long, not 39. Hence, the following is just bull - you know, garbage in => garbage out.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] |
by dbach355 (Initiate) on Dec 20, 2016 at 18:16 UTC | |
You are correct, sorry about giving only 1 record. And some have additional info. Without getting too lengthy, I included 10 records. I am reviewing comments and proceeding. Read more... (83 kB)
| [reply] [d/l] |
Re: How to process variable length fields in delimited file.
by dbach355 (Initiate) on Oct 17, 2016 at 20:02 UTC | |
Thank you all for comments. I have not had change to review and test to see if I understand. The job gets in the way :) And the company is forcing 2 weeks off, so I have been trying to clean up some old tasks. Again, thank you and I will update. David | [reply] |