
Regex Split and Formatting

by reaper9187 (Scribe)
on Apr 03, 2013 at 11:45 UTC ( #1026834=perlquestion )
reaper9187 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,,
I most desperately seek your help on an issue that has been plaguing me for a while ..
I need to read a text file like so:
"123", "DEF123","this is test","C:\Abhinav\test.jpg"
"456", "DEF456","this is test","C:\Matt\test.jpg"
"726", "DEF726","this is test","C:\Matt\test.jpg"
My purpose is to save the parameters associated with each entry (i.e. the first parameter, which is a number). For example, the output array should look something like:
1. "123", "DEF123","this is test","C:\Matt\test.jpg"
2. "456", "DEF456","this is test","C:\Matt\test.jpg"
3. "726", "DEF726","this is test","C:\Matt\test.jpg"
I'm not very good with regex and hence couldn't figure out a way to do it. Note the newline characters in between quotes within values: those are NOT TO BE REMOVED, hence no chomping allowed.
Could you please guide me with your wisdom ??? Please help .. Thank you ..!!
*Update* : Oh yes .. I forgot to mention.. i'm not allowed to use any modules..

Re: Regex Split and Formatting
by ww (Bishop) on Apr 03, 2013 at 12:02 UTC

    "...not allowed to use any modules?"

    If that means your problem is homework, you should have so labeled it to begin with.

    Further: What did you try? How did it fail?

    We like to know the answers to these questions...because the Monastery's mission is to help you learn; not to do your homework.

    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: Regex Split and Formatting
by ggoebel (Sexton) on Apr 03, 2013 at 12:47 UTC

    Read perlre. If you're in a rush read perlrequick. And if you find the learning curve too steep, take a step back and start with perlretut

    Note, this isn't perfect. It assumes your end of line character is a linefeed. It also doesn't handle linefeeds in the 4th field. But then in the sample provided, it wasn't significant.

    use strict;
    use warnings;

    my $data;
    localscope: {
        local $/;
        $data = <DATA>;
        my $i=1;
        while ($data =~ /\G((?:(?:[^,]*),){3}(?:[^\n]*\n))/g) {
            my $entry = $1;
            print "$i $entry";
            $i++;
        }
    }
    1;
    __DATA__
    "123", "DEF123","this is test","C:\Abhinav\test.jpg"
    "456", "DEF456","this is test","C:\Matt\test.jpg"
    "726", "DEF726","this is test","C:\Matt\test.jpg"

      I would also recommend Jeffrey Friedl's Mastering Regular Expressions book. I haven't read all those pods to the end yet so perhaps they cover all that's needed, but I can say his book helped me a great deal and touches specifically on some of the issues here (assuming you can't be convinced to use a module). For instance, I didn't properly understand how to match across newlines (or, sadly, even that newline is one of the things \s matches -- yikes) or what the /m and /s (and /ms) qualifiers do exactly until I read his book. The information was probably there in the perl pods but my eye must have glazed over it or struggled with the wording.
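
The points above about \s, /s and /m can be checked with a minimal sketch like the following. The sample string is an assumption made up for illustration (a quoted field with a newline inside it); it is not the OP's actual file.

```perl
use strict;
use warnings;

# Assumed sample: the first quoted field deliberately contains a newline.
my $text = qq{"this is\ntest","next"\n};

# \s matches a newline, so this succeeds across the line break.
my $ws_spans_newline = $text =~ /is\stest/ ? 1 : 0;

# Without /s, '.' stops at a newline; with /s it does not.
my $dot_plain = $text =~ /"this.+test"/  ? 1 : 0;
my $dot_s     = $text =~ /"this.+test"/s ? 1 : 0;

# Without /m, ^ anchors only at the start of the string;
# with /m it also anchors right after each embedded newline.
my $caret_plain = $text =~ /^test/  ? 1 : 0;
my $caret_m     = $text =~ /^test/m ? 1 : 0;

print "\\s spans the newline: $ws_spans_newline\n";
print "dot without /s:       $dot_plain\n";
print "dot with /s:          $dot_s\n";
print "^ without /m:         $caret_plain\n";
print "^ with /m:            $caret_m\n";
```

In short, /s changes what '.' matches and /m changes where ^ and $ anchor; neither is needed for \s or a negated character class such as [^"] to cross a line break.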

      It also deals with how to match within quotes (and how to do so efficiently), to the point where Damian Conway's Perl 6 Exegesis 5 document even refers to a certain kind of regex as being "Friedl style".

      If the latest edition is too long for you, the one I read (recently) was the 1st edition, and I can say that it's still valuable, even if it does leave out some newer Perl regex features. Just in case, I read parts of the chapters on DFAs and NFAs in a newer edition from the library and I think perhaps the coverage there was fleshed out and improved some (have to say the car analogy is the single thing I dislike about the book -- maybe I'd be less bothered if it was a bicycle analogy, I dunno), but the bulk of the size increase seemed to come from covering more regex flavours from more languages (ones that I don't personally care about as much as the ones in the 1st edition). Someone correct me if this is a poor impression to put out into the world.

Re: Regex Split and Formatting
by sundialsvc4 (Abbot) on Apr 03, 2013 at 12:49 UTC

    If we are to take you at your word that this is not homework, then the stricture that you are not allowed to use any modules makes no sense.   The purpose, and the power, of Perl, is to leverage what someone else has already done.   You have received excellent suggestions, and you should, I think, go to your manager and explain to him or her that the most expeditious way to solve the problem is to use one of these packages.   (When you want to install a new door in your house, you do not begin by cutting down a tree with an axe ...)

    Remember that packages can be installed locally, even on a per-application basis, without disrupting anything else that is installed.   This is discussed in topics that include non-root user in their titles here, because it is an issue that is constantly dealt with when installing software on a shared-hosting web server.

      I can relate to the OP. My employer is actually afraid of Perl for "security reasons" and generally won't install any modules. Silly, given that Perl's one of the best tools out there in terms of security. Not all the neanderthals are dead...

        I could almost understand that if modules required root access and a sysadmin to install (if only because I'm a sysadmin, and I don't want to be bothered for stuff like that), but with things like local::lib and even just running cpan as a non-privileged user, it's hard to justify or make sense of it.

        Christopher Cashell
Re: Regex Split and Formatting (Text::CSV)
by Anonymous Monk on Apr 03, 2013 at 11:51 UTC
Re: Regex Split
by Ratazong (Prior) on Apr 03, 2013 at 11:51 UTC


    • Advice #1: use a module to parse such files, e.g. Text::CSV (CSV = comma-separated values). Don't use a regex for it.
    • Advice #2: check the documentation of that module how to handle the newlines. E.g. Text::CSV advises you to use binmode.

    HTH, Rata

Re: Regex Split and Formatting
by topher (Scribe) on Apr 03, 2013 at 17:05 UTC
    *Update* : Oh yes .. I forgot to mention.. i'm not allowed to use any modules..

    Parsing CSV isn't your real problem; this is your real problem.

    CSV is one of those things that seems simple and easy to handle at first glance. But once you start digging in and coming across corner cases, you realize it can be very tricky to get right. You mention the embedded newlines, but what about embedded commas (commas inside quoted fields)? Are all fields quoted? Can you guarantee that? What about escaped quotes inside of quoted fields?
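
To make those corner cases concrete, here is a minimal module-free sketch. The helper name parse_csv_records and the sample data are hypothetical, and the sketch assumes that quotes inside quoted fields are escaped by doubling them (""), which the OP never confirmed. It is an illustration of the problem, not a replacement for Text::CSV.

```perl
use strict;
use warnings;

# Hypothetical helper: split a whole CSV text into records,
# each record being an array ref of fields.
sub parse_csv_records {
    my ($text) = @_;
    return () unless length $text;

    my (@records, @fields);
    pos($text) = 0;
    while (1) {
        my $field;
        if ($text =~ /\G[ \t]*"((?:[^"]|"")*)"[ \t]*/gc) {
            # Quoted field: may contain commas and newlines;
            # a doubled quote stands for one literal quote.
            ($field = $1) =~ s/""/"/g;
        }
        elsif ($text =~ /\G([^,\r\n]*)/gc) {
            # Unquoted (possibly empty) field: runs to the next comma or line end.
            $field = $1;
        }
        else {
            die "Cannot parse field at offset " . pos($text) . "\n";
        }
        push @fields, $field;

        next if $text =~ /\G,/gc;    # another field follows in this record

        # Otherwise the record must end here: at a newline or at end of input.
        $text =~ /\G\r?\n/gc
            or pos($text) == length($text)
            or die "Malformed CSV near offset " . pos($text) . "\n";
        push @records, [@fields];
        @fields = ();
        last if pos($text) == length($text);
    }
    return @records;
}

# Assumed sample: an embedded newline, embedded commas and a doubled quote.
my $sample = <<'CSV';
"123", "DEF123","this is
test","C:\Abhinav\test.jpg"
"456", "DEF456","said ""hi"", then left","C:\Matt\test.jpg"
CSV

my @records = parse_csv_records($sample);
for my $i (0 .. $#records) {
    printf "%d. %s\n", $i + 1, join ' | ', map { "[$_]" } @{ $records[$i] };
}
```

Even this small version needs separate branches for quoted and unquoted fields plus explicit error handling, which is the point being made about how quickly "simple" CSV parsing grows.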

    Unless this is a homework assignment, or you have a strange desire to reinvent the wheel (in which case you should be reviewing the code in existing CSV modules), you would be better off copying and pasting Text::CSV into your program than you would be in trying to recreate it.

    Christopher Cashell
      Funny, that when people post saying "I'm not allowed to use modules", I almost never hear them say why.

        I was thinking the exact same thing. I'm not going to say that there are no valid reasons or situations where modules aren't allowed, but it's very hard to think of one that doesn't involve, "because I'm supposed to write this functionality myself as homework".

        Worst-case scenario, you can almost always bundle the modules into your own application, or utilize modern tools like App::FatPacker to do the work for you.

        Programming in Perl without CPAN is like going into a gun fight without bullets.

        Christopher Cashell
Re: Regex Split and Formatting
by hdb (Prior) on Apr 03, 2013 at 11:55 UTC

    You should use Text::CSV!

    use strict;
    use warnings;

    local $/='';
    my @data = split /"\s*"/, <DATA>;
    print join "\n\n\n\n", @data;
    __DATA__
    "123", "DEF123","this is test","C:\Abhinav\test.jpg"
    "456", "DEF456","this is test","C:\Matt\test.jpg"
    "726", "DEF726","this is test","C:\Matt\test.jpg"

Re: Regex Split and Formatting
by reaper9187 (Scribe) on Apr 03, 2013 at 12:09 UTC
    Thank you so much for the prompt reply ...
    I did try using regex but failed miserably.. I got thus far :
    $out_file = "testout.txt";    #this is the file to be edited
    open (OUT, "<$out_file") or die "Can't open $out_file: $!\n";
    $/ = "\n";
    while ($line = <OUT>) {
        my $first = (split /,/, $line)[0];
        $first =~ tr/"//d;
        $first =~ s/^\s+//;
        .......
    }
    close OUT;

    * Tried to use the first numeric element as the matching criterion and then split that line into an array (but that got way too complicated)
    * Then tried using splice to split the entire file into equal-sized arrays (but I did not know how to do that)

    I did not know how to proceed (hence the "...." in the code). The method I'm using is long (since I'm a newbie to Perl and don't understand all the nuances). Appreciate the help ..

    P.S: This is not my homework.. Just something i'm working on currently ...
Re: Regex Split and Formatting
by clueless newbie (Hermit) on Apr 03, 2013 at 13:07 UTC

    Did something similar years ago --- so...

    #!\user\bin\perl -w
    while (@a_Field=GetCSV(@a_Field)) {
        print join(' | ',@a_Field)."\n\n";
    };

    sub GetCSV{
        my($s_Line,$s_Text);
        while ($s_Line=<DATA>) {
            #print "'$s_Line'\n";
            my(@a_Field);
            $s_Text.=$s_Line;
            while ($s_Text =~ m{([^",]*)([,\n])|"((?:[^"]|"")*)"([,\n])}gs) {
                unless (substr($`,-1,1) eq '"') {
                    if (defined $1) {
                        push(@a_Field,$1);
                        if ($2 eq "\n") {
                            return @a_Field;
                        };
                    } else {
                        my($s_Field);
                        ($s_Field=$3)=~s/""/"/g;
                        push(@a_Field,$s_Field);
                        if ($4 eq "\n") {
                            return @a_Field;
                        };
                    };
                };
            };
        };
        return ();
    };
    __DATA__
    10915,"S","Phil","ing Valley Middle School",$0.00,0,,0
    10916,"Tr","Ny",,,,,,"999-999-9999",,,,"715000 works at Re-Max Wishing for housecoat with dogs, no luck",$0.00,0,,0
    10917,"Ro","Ox","3677 As Dr","W","BC","V4T 2W5",,"1111111111",,,,,$0.00,0,,0
    10918,"Sa","Fri",,"K","BC",,,"2222222222",,,,,$0.00,0,,0
    10355,"Val","Woj",,,,,,"3333333333",,,,"Solutions",$0.00,0,,0
    10356,"Ter","Bes",,,,,,"1211211212",,,,,$0.00,0,,0
    10357,"Phi","Har",,,,,,"1231231234",,,,"6 Woodcroft Ave St Catns x1x1x1 999-999-3203 check the address to see if it is still her's.",$0.00,0,,0
    10358,"Ra","Gak",,"Kel",,,,"3453453456",,,,,$0.00,0,,0
    10359,"J","Ru",,"V","BC",,,"7777777777",,,,"st ing to tell J to come again… (555) 899-9999",$0.00,0,,0
    10360,"Li","Sa",,"Win",,,,"4444444444",,,,"LDr Claremont",$0.00,0,,0
    10361,"Ke","Ta",,"K","BC",,,"5555555555",,,,,$0.00,0,,0
    10362,"Kat","son",,"V","BC",,,"6666666666",,,,,$0.00,0,,0

Re: Regex Split and Formatting
by ww (Bishop) on Apr 04, 2013 at 15:56 UTC

    If the first record -- in fact -- is the same as the first record -- as posted -- how do you transform it to the sample output you offer? IOW, how does:

    "123", "DEF123","this is test","C:\Abhinav\test.jpg" # Abhinav

    become:

    1. "123", "DEF123","this is test","C:\Matt\test.jpg" # Matt???

    Were you merely careless about posting your request for a way to resolve your problem? Carelessness like that often wastes the time of those good enough to try to help you!

    And, not just BTW, I'm not persuaded by your claim that this is not homework. I'm not sure it is, either, but the 'no modules' statement -- in the face of advice (to use a CSV approach) from those far wiser about Perl than thee -- is pretty hard to accept as anything other than nonsense.

    Also note: your comment in Re: Regex Split and Formatting that you "failed miserably" is not a precise statement of a problem... other than failure to persevere in the face of discouragement. The code accompanying that generality might serve to offer you some enlightenment, however, were you to add a few strategic print statements to see what you're really getting into $line and $first.
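
A minimal sketch of that suggestion, assuming (as the OP describes but the posted sample does not show) that one quoted value contains a newline. The sample text and the in-memory filehandle are stand-ins for the real testout.txt; the loop body is the OP's own code plus one print.

```perl
use strict;
use warnings;

# Assumed sample: the third field of the first record contains a newline.
my $sample = <<'CSV';
"123", "DEF123","this is
test","C:\Abhinav\test.jpg"
"456", "DEF456","this is test","C:\Matt\test.jpg"
CSV

# Read from the string as if it were the input file.
open my $in, '<', \$sample or die "Can't open in-memory file: $!\n";

my @seen;     # what $first held for each physical line
my $n = 0;    # physical line counter
while (my $line = <$in>) {
    $n++;
    my $first = (split /,/, $line)[0];
    $first =~ tr/"//d;
    $first =~ s/^\s+//;

    # The strategic print: show what was actually read and extracted.
    print "line $n: first=[$first] raw=$line";
    push @seen, $first;
}
close $in;
```

With this sample the second physical line is only the tail of the first record, so $first holds a fragment of a text field rather than a record number. That is the kind of thing such prints reveal about reading a file with embedded newlines one line at a time.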

    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: Regex Split and Formatting
by reaper9187 (Scribe) on Apr 04, 2013 at 05:42 UTC
    Just to clear the air.. I'm developing this tool for a friend of mine. When he told me about the problem, usage of modules was the first thing I suggested to him (by the way, it's not his homework either, if you guys are wondering..!!). He's new to Perl too and has never used them..
    So I thought it would be a challenge to do this without using them. So here I am... I made a lot of mistakes but I got to learn some great new stuff too ...

    When you want to install a new door in your house, you don't cut down a tree with an axe. But it's also worthwhile to learn something new (say, build a door with basic tools rather than just go out and buy a ready-made one.. :) :) )

    And thank you all for the posts .. Learning something new everyday .. :)

    @Anonymous Monk - I know the feeling. My manager too feels the same way about it ..
Re: Regex Split and Formatting
by kcott (Abbot) on Apr 04, 2013 at 11:08 UTC

    G'day reaper9187,

    This does what you want:

    $ perl -Mstrict -Mwarnings -e '
        my $cols = 4;
        my $re = qr{("(?:[^"\\]++|\\.)*+"\s*,*\s*)}m;
        my $file = do { local $/; <> };
        my @params = $file =~ /$re/g;
        print "$_. ", splice(@params, 0, $cols), "\n" for 1 .. @params / $cols;
    '
    "123", "DEF123","this is test","C:\Abhinav\test.jpg"
    "456", "DEF456","this is test","C:\Matt\test.jpg"
    "726", "DEF726","this is test","C:\Matt\test.jpg"
    1. "123", "DEF123","this is test","C:\Abhinav\test.jpg"
    2. "456", "DEF456","this is test","C:\Matt\test.jpg"
    3. "726", "DEF726","this is test","C:\Matt\test.jpg"

    For the regexp, I've made a very minor addition to what's provided in perlre (search for "match a double-quoted string"). I also used splice to access groups of, what you're calling, parameters.
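
For readers who want to see how that pattern fits together, the following sketch spells out the same regexp with /x and comments, then groups the captures with splice. The sample input (with a newline inside one quoted value) and the @records array are assumptions added for illustration. The possessive quantifiers (++ and *+) need Perl 5.10 or later.

```perl
use strict;
use warnings;

my $cols = 4;    # fields per record, as in the one-liner

my $re = qr{
    (                   # capture one parameter plus its trailing separator
        "               # opening quote
        (?: [^"\\]++    # a run of characters that are neither quote nor backslash
          | \\.         # or a backslash followed by one more character
        )*+             # repeated possessively, so no backtracking into the field
        "               # closing quote
        \s* ,* \s*      # optional whitespace and commas; can swallow a newline
    )
}xm;

# Assumed sample: one value contains a newline, which [^"\\] happily matches.
my $file = <<'CSV';
"123", "DEF123","this is
test","C:\Abhinav\test.jpg"
"456", "DEF456","this is test","C:\Matt\test.jpg"
CSV

my @params = $file =~ /$re/g;

# splice removes $cols captures at a time from the front of @params.
my @records;
push @records, join('', splice(@params, 0, $cols)) while @params >= $cols;

printf "%d. %s", $_ + 1, $records[$_] for 0 .. $#records;
```

Note that the grouping relies on every record having exactly $cols quoted fields; a record with a missing or unquoted field would shift everything after it.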

    In future, can I suggest you state something like this in your OP: "I'm aware of Whatever::Module but I want to do this using other_thing because reason_why.". I think that might have saved a lot of questions/answers/updates/etc.

    -- Ken

Node Type: perlquestion [id://1026834]
Front-paged by Corion