PerlMonks  

Split file using perl and regexp

by brad_nov (Novice)
on Jan 17, 2013 at 19:34 UTC ( #1013867=perlquestion )
brad_nov has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have huge files, around 400 MB each, which contain CLOB data and come in different scenarios. I am trying to pass the scenario number as a parameter and get the required modified file based on the scenario number and criteria.
Scenario 1: file name : scenario_1.txt
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.10],uid=[23231131]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.11],uid=[3456]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.12],uid=[659784]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.13],uid=[654812]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.14],uid=[323]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.10],uid=[97945641564]}~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~pi=[10.10.10.10.10],uid=[1654594]}~
Now I am trying to write the data to a new file, scenario_1_n.txt, as shown below. Each output line should contain everything up to the last "|", followed by the pi and uid values:
1|1212|34353|56575|||||4|10.10.10.10.10|23231131
1|1212|34353|56575|||||4|10.10.10.10.11|3456
1|1212|34353|56575|||||4|10.10.10.10.12|659784
1|1212|34353|56575|||||4|10.10.10.10.13|654812
. . . .
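For illustration, the scenario 1 transformation described above could be sketched roughly as follows. This is not code from the thread: the helper name reformat_scenario_1 is hypothetical, and it assumes every record carries pi=[...] followed later by uid=[...] after the last "|".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce one scenario-1 record to
# "prefix|pi|uid". Returns nothing when the line does not fit the assumed layout.
sub reformat_scenario_1 {
    my ($line) = @_;
    chomp $line;

    # The greedy (.* \|) runs to the last "|", so $prefix keeps that "|".
    my ($prefix, $clob) = $line =~ m{ \A (.* \|) (.*) \z }xms or return;

    # Assumption: the values always appear as pi=[...] then uid=[...].
    my ($pi, $uid) = $clob =~ m{ pi=\[ ([^\]]*) \] .*? uid=\[ ([^\]]*) \] }xms
        or return;

    return "$prefix$pi|$uid";
}

print reformat_scenario_1(
    '1|1212|34353|56575|||||4|~somedata~some data~~pi=[10.10.10.10.10],uid=[23231131]}~'
), "\n";
```

Lines that do not match are reported by an empty return value, so a caller can decide whether to skip or log them.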
Scenario 2: file name : scenario_2.txt
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.10,391=23231131,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.11,391=3456,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.12,391=659784,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.13,391=654812,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.14,391=323,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.10,391=97945641564,394~
1|1212|34353|56575|||||4|~somedata~some data~~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~390=10.10.10.10.10,391=1654594,394~
Now I am trying to write the data to a new file, scenario_2_n.txt, as shown below. Each output line should contain everything up to the last "|", followed by the data after 390= and 391=:
1|1212|34353|56575|||||4|10.10.10.10.10|23231131
1|1212|34353|56575|||||4|10.10.10.10.11|3456
1|1212|34353|56575|||||4|10.10.10.10.12|659784
1|1212|34353|56575|||||4|10.10.10.10.13|654812
. . . .
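One possible way to read the scenario 2 records is to collect the numeric tag=value pairs into a hash and then pick tags 390 and 391. The sketch below is an illustration only: the helper name reformat_scenario_2 is hypothetical, and it assumes values never contain "," or "~".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce one scenario-2 record to
# "prefix|value of 390|value of 391".
sub reformat_scenario_2 {
    my ($line) = @_;
    chomp $line;

    # The greedy (.* \|) runs to the last "|", so $prefix keeps that "|".
    my ($prefix, $clob) = $line =~ m{ \A (.* \|) (.*) \z }xms or return;

    # Collect every "digits=value" pair after the last "|". Assumption: a
    # value ends at the next "," or "~". If a tag repeats, the last one wins.
    my %tag = $clob =~ m{ (?<! \d) (\d+) = ([^,~]*) }xmsg;
    return unless defined $tag{390} && defined $tag{391};

    return "$prefix$tag{390}|$tag{391}";
}

print reformat_scenario_2(
    '1|1212|34353|56575|||||4|~somedata~some data~~390=10.10.10.10.10,391=23231131,394~'
), "\n";
```

A hash keeps the lookup independent of the order in which the tags appear, which the sample data does not guarantee either way.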
Scenario 3: file name : scenario_3.txt
1|1212|34353|56575|||||4|~somedata~10.10.10.10.10~123546~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.11~546~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.12~3415646~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.13~12156~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.10~15464~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.10~8465~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
1|1212|34353|56575|||||4|~somedata~10.10.10.10.10~15654~~~~~~~~~~~some data~~~~~~~~~~~~~~some data~~~~~~~
Now I am trying to write the data to a new file, scenario_3_n.txt, as shown below. Each output line should contain everything up to the last "|", followed by the data after the second ~ and the third ~:
1|1212|34353|56575|||||4|10.10.10.10.10|123546
1|1212|34353|56575|||||4|10.10.10.10.11|546
1|1212|34353|56575|||||4|10.10.10.10.12|3415646
1|1212|34353|56575|||||4|10.10.10.10.13|12156
. . . .
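Since scenario 3 is purely positional, split may be enough here and no pattern for the values themselves is needed. The following is a rough sketch under that assumption; the helper name reformat_scenario_3 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce one scenario-3 record to
# "prefix|field after 2nd ~|field after 3rd ~".
sub reformat_scenario_3 {
    my ($line) = @_;
    chomp $line;

    # The greedy (.* \|) runs to the last "|", so $prefix keeps that "|".
    my ($prefix, $clob) = $line =~ m{ \A (.* \|) (.*) \z }xms or return;

    # The clob starts with "~", so split yields an empty first field and the
    # wanted values land in fields 2 and 3. The limit of 5 stops splitting
    # once those fields are available.
    my @field = split /~/, $clob, 5;
    return unless @field >= 4;

    return "$prefix$field[2]|$field[3]";
}

print reformat_scenario_3(
    '1|1212|34353|56575|||||4|~somedata~10.10.10.10.10~123546~~~~some data~~~'
), "\n";
```

The limit argument matters for large CLOB fields, because the rest of the line stays in one piece instead of being split into many unused fields.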
Thanks for looking and thanks for your help.

Re: Split file using perl and regexp
by keszler (Priest) on Jan 17, 2013 at 19:36 UTC
    Have you looked into using any of the various CSV modules?
Re: Split file using perl and regexp
by ww (Bishop) on Jan 17, 2013 at 19:43 UTC
    ... or into hiring a programmer?

    You've outlined, by example, a longish spec ...but haven't shown any hint of an attempt to solve your problem, even though you've asked similar questions at least 3 times in the past 60 days.

    Where's your code? Precisely, what's wrong with it?

Re: Split file using perl and regexp
by davido (Archbishop) on Jan 17, 2013 at 19:45 UTC

    Could you explain how this question substantially or conceptually differs from Split a file based on column, which you posted (and followed-up to) yesterday? I thought we already dealt with this.

    What did you mean, in that thread, by "Thanks, got it working"?


    Dave

      I mean I was able to split the file based on the solution given by Kenosis. It's an extension to yesterday's problem. Sorry if I was not clear. Thanks.

        If it's an extension of yesterday's problem, post the code that you're currently using so that we can help in extending it.

        Otherwise it just looks like you're making zero progress on your own, and hoping someone will do free work for you.


        Dave

Re: Split file using perl and regexp
by AnomalousMonk (Monsignor) on Jan 17, 2013 at 22:37 UTC

    Here's an approach based on the observation of certain similarities (common prefix characters) in the data fields of interest in the three different types of data files. No discrimination between the three data file types is needed in the code.

    Some notes of caution:

    • The code shown assumes the data being fed to it is valid. It is intended only as an example of a regex-based approach.
    • The code is critically dependent on the definition of the $rx_oct regex. (I gave it this name because it superficially suggests an IP octet.) The OP shows only limited examples of this sub-field in the range (10 .. 13). You (brad_nov) will have to change this regex to reflect the real data – or else maybe reveal an actual spec!
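    As an example of what changing that regex might look like: if each group were limited to 0..255 without leading zeros, the definition could be tightened as sketched below. That range is purely an assumption for illustration; the thread never states the real one. The helper is_quint is hypothetical as well.

```perl
use strict;
use warnings;

# Assumption for illustration only: each group is a decimal number 0..255
# without leading zeros. The thread never states the real range.
my $rx_oct   = qr{ (?: 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d ) }xms;
my $rx_quint = qr{ $rx_oct (?: \. $rx_oct){4} }xms;

# Hypothetical helper: true when $text is exactly five dot-separated groups.
sub is_quint {
    my ($text) = @_;
    return $text =~ m{ \A $rx_quint \z }xms ? 1 : 0;
}

for my $candidate ('10.10.10.10.10', '10.10.10.10.999', '10.10.10.10') {
    printf "%-16s %s\n", $candidate, is_quint($candidate) ? 'ok' : 'rejected';
}
```

    Only the first candidate passes: the second has a group above 255 and the third has four groups instead of five.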

        >perl -wMstrict -le
        "my @records = (
           '1|1212|34353|56575|||||4|~some~~pi=[10.10.10.10.10],uid=[11]}~',
           '1|1212|34353|56575|||||4|~som~~390=10.10.10.10.11,391=222,394~',
           '1|1212|34353|56575|||||4|~somedata~10.10.10.10.12~3333~~a~~~~',
           );
         ;;
         my $rx_oct   = qr{ \d{1,3} }xms;
         my $rx_quint = qr{ $rx_oct (?: \. $rx_oct){4} }xms;
         ;;
         my $rx_dotted = qr{ (?<! \d) $rx_quint (?! \d) }xms;
         my $rx_int    = qr{ \d+ }xms;
         ;;
         for my $record (@records) {
           print qq{'$record'};
           my ($const, $var) = $record =~ m{ ( \A .+) \| ( .* \z) }xms;
           my (undef, $dotted, $int) =
               $var =~ m{ (\D) ($rx_dotted) .*? \1 ($rx_int) }xms;
           my $new_record = join '|', $const, $dotted, $int;
           print qq{'$new_record' \n};
           }
        "
        '1|1212|34353|56575|||||4|~some~~pi=[10.10.10.10.10],uid=[11]}~'
        '1|1212|34353|56575|||||4|10.10.10.10.10|11'

        '1|1212|34353|56575|||||4|~som~~390=10.10.10.10.11,391=222,394~'
        '1|1212|34353|56575|||||4|10.10.10.10.11|222'

        '1|1212|34353|56575|||||4|~somedata~10.10.10.10.12~3333~~a~~~~'
        '1|1212|34353|56575|||||4|10.10.10.10.12|3333'
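    The one-liner above works on a small in-memory list. For files of around 400 MB, the same patterns could be applied one line at a time so the whole file never has to sit in memory. The sketch below reuses the regexes as posted; the helper name convert_stream and the choice to skip non-matching lines are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Same patterns as in the one-liner above.
my $rx_oct    = qr{ \d{1,3} }xms;
my $rx_quint  = qr{ $rx_oct (?: \. $rx_oct){4} }xms;
my $rx_dotted = qr{ (?<! \d) $rx_quint (?! \d) }xms;
my $rx_int    = qr{ \d+ }xms;

# Hypothetical helper: copy records from $in to $out one line at a time.
# Returns the number of lines that did not match and were skipped.
sub convert_stream {
    my ($in, $out) = @_;
    my $skipped = 0;
    while (my $record = <$in>) {
        chomp $record;
        my ($const, $var) = $record =~ m{ ( \A .+) \| ( .* \z) }xms
            or do { $skipped++; next };
        my (undef, $dotted, $int) =
            $var =~ m{ (\D) ($rx_dotted) .*? \1 ($rx_int) }xms
            or do { $skipped++; next };
        print {$out} join('|', $const, $dotted, $int), "\n";
    }
    return $skipped;
}

# Demo on in-memory filehandles; real use would open the files instead.
my $input = join "\n",
    '1|1212|34353|56575|||||4|~some~~pi=[10.10.10.10.10],uid=[11]}~',
    '1|1212|34353|56575|||||4|~som~~390=10.10.10.10.11,391=222,394~',
    '1|1212|34353|56575|||||4|~somedata~10.10.10.10.12~3333~~a~~~~',
    '';
open my $in,  '<', \$input     or die "open in: $!";
open my $out, '>', \my $output or die "open out: $!";
convert_stream($in, $out);
close $out;
print $output;
```

    For real data, the two open calls would take file names such as scenario_1.txt and scenario_1_n.txt in place of the scalar references.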

    Update: After playing around with this a bit and doing a little, um, testing, I think I would change the definition of $rx_dotted as follows (change to final look-ahead):
        my $rx_dotted = qr{ (?<! \d) $rx_quint (?! [.\d]) }xms;
    This change does not affect behavior for valid records.
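    To see where the two definitions part ways, here is a small comparison. The input with a trailing dot is an invented example of an invalid record, and the helper first_dotted is hypothetical.

```perl
use strict;
use warnings;

my $rx_oct   = qr{ \d{1,3} }xms;
my $rx_quint = qr{ $rx_oct (?: \. $rx_oct){4} }xms;

my $rx_dotted_old = qr{ (?<! \d) $rx_quint (?! \d)    }xms;  # as first posted
my $rx_dotted_new = qr{ (?<! \d) $rx_quint (?! [.\d]) }xms;  # after the update

# Hypothetical helper: first dotted value found in $text, or undef.
sub first_dotted {
    my ($rx, $text) = @_;
    my ($match) = $text =~ m{ ($rx) }xms;
    return $match;
}

for my $text ('~10.10.10.10.12~', '~10.10.10.10.12.~') {
    my $old = first_dotted($rx_dotted_old, $text);
    my $new = first_dotted($rx_dotted_new, $text);
    printf "%-20s old: %-16s new: %s\n",
        $text, $old // '(no match)', $new // '(no match)';
}
```

    Both definitions find the value in the first input. For the second input, only the older definition matches, because the updated look-ahead refuses a value that is immediately followed by a dot.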

Re: Split file using perl and regexp
by Anonymous Monk on Jan 18, 2013 at 01:00 UTC
