
Proper use of split

by th3j4ckl3 (Novice)
on Jun 01, 2012 at 14:50 UTC ( #973794=perlquestion )
th3j4ckl3 has asked for the wisdom of the Perl Monks concerning the following question:

Thank you, fellow monks. This week I decided to learn Perl. I have some data in a text file, as in the example below.

{"temp":70.00,"tmode":2,"fmode":0,"override":0,"hold":0,"t_cool":70.00,"tstate":0,"fstate":0,"time":"day":3,"hour":23,"minute":29},"t_type_post":0}
All of the data is on one line and repeats at time intervals throughout the day. I have been looking for a simple way to learn Perl, and I'd like to use the above as a beginner's exercise in extracting data. The simple program I have created is below.

#! /usr/bin/perl
open (FILE, '/home/julian/tstatcollect');
while (<FILE>) {
    chomp;
    ($temp, $tmode, $fmode, $override,) = split (",");
    #my %newval = split (/[:]/, $temp);
    print "Temperature: $temp\n";
    print "Tmode: $tmode\n";
    print "Fmode: $fmode\n";
    print "Override: $override\n";
    print "________\n";
}
close (FILE);
exit;

Output shows:

Temperature: {"temp":70.00
Tmode: "tmode":2
Fmode: "fmode":0
Override: "override":0
I understand that what it is doing is parsing out the data between each ,. What I was hoping to do is figure out how I can use split to parse the data between each " " and each : and ,. Is split the right way to go about this? Hopefully there is a simple way. Again, remember I just started last week, so as simple as possible would be best. Thanks, Julian.

Replies are listed 'Best First'.
Re: Proper use of split
by kennethk (Abbot) on Jun 01, 2012 at 15:02 UTC
    So, the text you've provided looks like some corrupted JSON to me, in which case you could simply accomplish your parsing task using JSON. CPAN is an amazing resource, and your exploration of Perl would be well served by learning its ways.

    The issue with a simple split that you propose is that, in the general case, the quoted elements of a JSON string can contain your delimiters (, and :). You could do this using a regular expression or splits, but it's much easier to do this sort of thing with a state machine and character-wise parsing. And beyond that, it's much easier to use freely available, well-tested code someone else wrote.

    And having said all that, with the given example, you can do what you want with (undef, $temp, undef,$tmode, undef,$fmode, undef,$override,) = map split (":"), split (",");, which outputs

    Temperature: 70.00
    Tmode: 2
    Fmode: 0
    Override: 0
    ________
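A minimal, self-contained sketch of the one-liner above, assuming the sample line is held in a variable instead of being read from the file. The names $line, @pairs and @fields are illustrative and do not come from the thread. It shows why every other element is thrown away into undef: after both splits, the list alternates between labels and values.

```perl
use strict;
use warnings;

# First part of the sample line from the question (assumption: hard-coded here).
my $line = '{"temp":70.00,"tmode":2,"fmode":0,"override":0,"hold":0}';

# Step 1: split on commas   -> '{"temp":70.00', '"tmode":2', ...
my @pairs = split /,/, $line;

# Step 2: split each piece on the colon and flatten the result
#                           -> '{"temp"', '70.00', '"tmode"', '2', ...
my @fields = map { split /:/ } @pairs;

# Step 3: keep the values, discard the labels by assigning them to undef.
my ( undef, $temp, undef, $tmode, undef, $fmode, undef, $override ) = @fields;

print "Temperature: $temp\n";
print "Tmode: $tmode\n";
print "Fmode: $fmode\n";
print "Override: $override\n";
```

This only works while each comma-separated piece contains exactly one colon, which is the limitation the paragraphs above warn about.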

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Proper use of split
by aaron_baugher (Curate) on Jun 01, 2012 at 15:58 UTC

    As others have said, if this is a known format like JSON (with a couple typos in your example), then your best bet is to use a module that knows how to parse that format. However, if that's not the case, a series of regex matches is one way to do it. For instance, this will pluck out the number following "temp":

    my($temp) = /"temp":([\d.]+)/;

    That should give you a head start on parsing out the other values you want. While a series of regexes is almost surely slower than the split/map solution offered upthread, one advantage it has is that it won't matter if the order of your key/value pairs changes. So choose the solution that best fits your data.
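As a rough sketch of the order-independence point, the same kind of regex can be run once per key. The key list and the reordered sample string below are assumptions made for illustration; they are not taken from the poster's file.

```perl
use strict;
use warnings;

# Same kind of data as in the question, but with the keys deliberately reordered.
my $line = '{"tmode":2,"override":0,"temp":70.00,"fmode":0}';

my %value;
for my $key (qw(temp tmode fmode override)) {
    # \Q...\E quotes the key so it is matched literally.
    my ($v) = $line =~ /"\Q$key\E":([\d.]+)/;
    $value{$key} = $v;    # stays undef if the key is missing
}

print "Temperature: $value{temp}\n";
print "Tmode: $value{tmode}\n";
print "Fmode: $value{fmode}\n";
print "Override: $value{override}\n";
```

Each lookup rescans the string, which is where the speed cost mentioned above comes from.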

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      That brings up a very interesting idea:   if you are looking just for particular keys, such as "temp", a regex that specifically included that string would be a great way to probe the string for precisely that key, ignoring all the others if any.   (The “qr//” construct might come in handy here; see Regex Quote-Like Operators in perldoc perlop.)
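A small sketch of how qr// might be used for this kind of probe. The helper name probe_for is hypothetical, and the pattern assumes the value is a plain number, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: build a precompiled pattern that captures
# the numeric value following one specific key.
sub probe_for {
    my ($key) = @_;
    return qr/"\Q$key\E":([\d.]+)/;
}

my $temp_re   = probe_for('temp');
my $t_cool_re = probe_for('t_cool');

my $line = '{"temp":75.50,"tmode":2,"t_cool":75.00}';

# In list context a successful match returns the captured groups.
my ($temp)   = $line =~ $temp_re;
my ($t_cool) = $line =~ $t_cool_re;

print "temp = $temp, t_cool = $t_cool\n";
```

The compiled patterns can be stored in a hash or array and reused for every line of the file.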

Re: Proper use of split
by Kenosis (Priest) on Jun 01, 2012 at 19:18 UTC

    Hi, th3j4ckl3.

    I'm impressed by what you've accomplished, having decided to learn Perl just this week! I think you've done quite well.

    By the great responses to your query, you may have noticed the multiple solutions to your question. You may also know that this is not at all unusual. Although I think using a Perl Module to parse the data string is an excellent suggestion for some time in your learning, I'm more inclined to think it's more pedagogically sound--at this point, at least--to code it yourself, especially since you've already practically solved this issue.

    Given this, here's another solution that primarily uses your approach:

    use Modern::Perl;

    my $tstatcollect = '/home/julian/tstatcollect';
    open my $file, "<$tstatcollect" or die "Can't open $tstatcollect: $!";
    my $data = <$file>;
    close $file;

    my @dataElem = ( split ',', $data )[ 0 .. 3 ];
    foreach (@dataElem) {
        my ( $lable, $value ) = /"(.*)":(.*)/;
        say "\u$lable: $value";
    }

    Output:

    Temp: 70.00
    Tmode: 2
    Fmode: 0
    Override: 0

    Setting up a programming environment to help debug scripts is invaluable and time saving. The use Modern::Perl pragma includes both use strict; and use warnings;--which you'll often see at the top of scripts. I encourage you to always include these in your scripts.

    You'll notice the or die after the open call. This handles errors, just in case your open fails.

    Since there's only one line in your file, I chose to not use while, but rather just read that single line into a variable. I liked your splitting on a comma. Enclosing split within parentheses creates a list, and the [ 0 .. 3 ] notation requests only elements 0-3 of that list, which are placed into @dataElem.

    Now, we iterate over those split elements, using a regex with captures to grab the label and value from each element. (Side note: instead of assigning the capture results to $lable and $value, we could have used $1 and $2, which directly correspond to the captured contents.)

    Finally, we print the result with say (which appends a "\n" for us), using \u to uppercase the first letter of the label.
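The side note about $1 and $2 could look roughly like this. It is a sketch only: the data string is hard-coded instead of read from the file, and plain print is used so that nothing beyond core Perl is assumed.

```perl
use strict;
use warnings;

# Assumption: the line is hard-coded here rather than read from tstatcollect.
my $data = '{"temp":70.00,"tmode":2,"fmode":0,"override":0,"hold":0}';

my @report;
# The list slice [ 0 .. 3 ] keeps only the first four comma-separated pieces.
for my $elem ( ( split /,/, $data )[ 0 .. 3 ] ) {
    # Use $1 and $2 only after checking that the match succeeded.
    if ( $elem =~ /"(.*)":(.*)/ ) {
        push @report, "\u$1: $2";
    }
}

print "$_\n" for @report;
```

Checking the match first matters because $1 and $2 keep their old values when a match fails.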

    Hope this helps!


    Given your updated data string and wanting to display all fields, you can make the following change in the above script to handle the time elements:

    my @dataElem = ( split ',', $data );
    foreach (@dataElem) {
        next
          if !( my ( $lable, $value ) =
            (/("time":{"?)*"(.*)":([^}]+)/)[ 1, 2 ] );
        say "\u$lable: $value";
    }

    Output (given your new string):

    Temp: 75.50
    Tmode: 2
    Fmode: 0
    Override: 0
    Hold: 0
    T_cool: 75.00
    Tstate: 0
    Fstate: 0
    Day: 4
    Hour: 13
    Minute: 49

    It's an ugly regex, but it's trying to match three items, where the first is optional. Notice the [ 1, 2 ] notation again, which was present in the original script. The matching creates a list, and we're requesting the last two elements of the list--provided there *is* a match.

    With you now wanting to use all the fields of your data string and having updated its format, I think kennethk's use of the JSON Module is an excellent solution for you to consider.

Re: Proper use of split
by RichardK (Parson) on Jun 01, 2012 at 15:05 UTC

    You might find it easier to use a regular expression. Try perlretut for a start.

      Thank you RichardK, tremendously. Hope I wasn't asking for too much. I researched this all week in various searches and I was up till 2 a.m. this morning. I even scanned through the Perl programming book from O'Reilly, but like with anything else, it takes the right direction or the right search in Google. This will give me a push in the right direction for more searching. Thanks for your quick reply. I'll let you all know what I did with it. Julian

Re: Proper use of split
by sundialsvc4 (Abbot) on Jun 01, 2012 at 15:35 UTC

    Assuming for the moment that this is not a well-known format like JSON for which a more complete solution exists, I can think of two general approaches for tackling this problem.   One is to split the string on commas into a list.   The other is to use the “global matching” (g and c) as ultimately described in the section, Using regular expressions in Perl, in perldoc perlretut.   Of the two, I rather like the second one best, especially if the data is consistently numeric.

    “Global matching” lets you apply a regex more than one time to the same string, so that you can take a “winnowing the wheat from the chaff” approach by using a regular expression that corresponds to the “wheat.”   The position of the matching string is established by the pos() function, which has one very important “gotcha”:   that the start-position corresponding to “from the start of the string” is undef, not zero.   (Uh huh... “ouch! it bit me!”)

    As an extemporaneous example, a pattern such as \"([a-z_]+)\"\:([0-9.]+) could be applied and it would return the matched substrings as $1 and $2 ... I repeat, extemporaneous example ... and it would return $1='temp', $2='70.00' the first time, $1='tmode', $2='2' the second time, and so on (if I actually got it right).   It would skip over anything that did not match in search of the next thing that did.   This can be a useful technique, although as with everything else having to do with regular-expressions it demands rigorous testing.   (Beware that if the regular expression does not encompass all of the actual data, any data which doesn’t match will simply be skipped!   For example, I had to edit this post to include an underscore-character ...)
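A short sketch of the global-matching idea, using the pattern from the paragraph above (without the unnecessary backslashes) in a while loop. The sample string is an assumption based on the data in the question.

```perl
use strict;
use warnings;

my $line = '{"temp":70.00,"tmode":2,"fmode":0,"t_cool":70.00}';

my @pairs;
# With /g in scalar context, each pass of the loop resumes where the
# previous match ended, so every key/value pair is visited in turn.
while ( $line =~ /"([a-z_]+)":([0-9.]+)/g ) {
    push @pairs, [ $1, $2 ];
}

printf "%s = %s\n", @$_ for @pairs;
```

Anything the pattern does not describe, such as a nested {...} value or a quoted string value, is silently skipped, which is the caveat raised above.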

      The position of the matching string is established by the pos() function...

      But pos sez (emphasis added):

      Returns the offset of where the last "m//g" search left off for
      the variable in question [...]. Note that 0 is a valid match offset. "undef"
      indicates that the search position is reset (usually due to
      match failure, but can also be because no match has yet been run
      on the scalar).

      IOW,  pos controls the point at which  m//g matching resumes following a previous  m//g match on a given string. If there was no previous  m//g match (either because such a match was not attempted or because it failed), the point at which to resume  m//g matching has no meaning and is literally undef.

      Update: Assuming that the start-position corresponding to “from the start of the string” refers to the  \A assertion, consider the following (note that  print_pos() undefines  pos($_) on each call):

      >perl -wMstrict -le
      "$_ = 'abcdef';
       print_pos('initial');
       ;;
       m{ \A }xms;
       print_pos('\A');
       ;;
       m{ \A }xmsg;
       print_pos('\A/g');
       ;;
       m{ \A }xmsg;  m{ \A }xmsg;
       print_pos('\A/g repeated');
       ;;;;
       sub print_pos {
         printf qq{%14s: pos = %s \n},
           $_[0], defined(pos) ? pos() : 'undef'
           ;
         pos = undef;
         }
      "
             initial: pos = undef
                  \A: pos = undef
                \A/g: pos = 0
       \A/g repeated: pos = undef

      pos($_) is undefined after the initialization of the  $_ scalar as a string. Following the first
          m{ \A }xms;
      statement (non-m//g match),  pos($_) is undefined because no  m//g has yet been done. Following the single
          m{ \A }xmsg;
      global match statement,  pos($_) is 0 because this is the character position after the  \A absolute-beginning-of-the-string assertion. (Remember that  \A is a zero-width assertion and so can be comfortable in the narrow confine between the start of the string and its first character!) This is the position from which a subsequent  m//g would begin matching. Following the repeated
          m{ \A }xmsg;
          m{ \A }xmsg;
      global match statements,  pos($_) is undefined because the second global match failed: it could not find a position at which the  \A assertion was true when searching from character position 0 to the end of the string.
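The same sequence can be written as a small script instead of a one-liner. This is a sketch: the helper name note and the @log array are made up for illustration.

```perl
use strict;
use warnings;

my $s = 'abcdef';
my @log;

# Hypothetical helper: record pos($s), using the string 'undef'
# when no search position is currently set.
sub note { push @log, defined( pos($s) ) ? pos($s) : 'undef' }

note();                        # no m//g match has been run yet
my $first = $s =~ /\A/g;       # succeeds at offset 0
note();
my $second = $s =~ /\A/g;      # fails, so the search position is reset
note();

print join( ', ', @log ), "\n";    # prints: undef, 0, undef
```

A defined value of 0 and an undefined value are therefore different states: the first means a global match stopped at the very start of the string, the second means there is no saved position at all.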

      Ok, you're so smart, so go explain these results:

      >perl -wMstrict -le
      "$_ = 'abcdef';
       print_pos('initial');
       ;;
       m{ \b }xmsg;
       print_pos('single \b/g');
       ;;
       m{ \b }xmsg;  m{ \b }xmsg;
       print_pos('double \b/g');
       ;;
       m{ \b }xmsg;  m{ \b }xmsg;  m{ \b }xmsg;
       print_pos('triple \b/g');
       ;;;;
       sub print_pos {
         printf qq{%14s: pos = %s \n},
           $_[0], defined(pos) ? pos() : 'undef'
           ;
         pos = undef;
         }
      "
             initial: pos = undef
         single \b/g: pos = 0
         double \b/g: pos = 6
         triple \b/g: pos = undef

      (In particular, if the string  'abcdef' has six characters and therefore character positions 0 .. 5 inclusive, what does it mean that a
          m{ \b }xmsg;
      statement finds a 'match' at a  pos of six?)
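One way to read that result, offered as a sketch and not as the poster's own answer: pos reports an offset between characters, so a six-character string has seven possible offsets, 0 through 6, and \b is true both before the first word character and after the last one.

```perl
use strict;
use warnings;

my $s = 'abcdef';

my @offsets;
# Each successful zero-width \b match leaves pos($s) at the boundary it found.
while ( $s =~ /\b/g ) {
    push @offsets, pos($s);
}

print "boundaries at offsets: @offsets\n";    # prints: boundaries at offsets: 0 6
print "length of string: ", length($s), "\n";
```

Offset 6 equals length($s): it is the position just past the final character, not the index of a character.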

Re: Proper use of split
by th3j4ckl3 (Novice) on Jun 01, 2012 at 19:03 UTC

    First let me say thank you for the wealth of good information. I have been doing a lot of reading in areas of various parts of Perl that have been called out in each of your replies. Terms like JSON, etc. Again, I am new and it's a learning process. Using a personal project as a reference point makes it a little bit fun. I think the first response on using the map and undef helped me the most for now in understanding how I can carve up my data, but for some reason my example in the original post got butchered, and I ran into another problem. It looks like the editor field on this website does some strange wrapping of the line I am trying to work with. The end of the data in tstatcollect has a subsection that seems to break the original splits.

    {"temp":75.50,"tmode":2,"fmode":0,"override":0,"hold":0,"t_cool":75.00,"tstate":0,"fstate":0,"time":{"day":4,"hour":13,"minute":49},"t_type_post":0}

    My working code now looks like

    #! /usr/bin/perl
    open (FILE, '/home/julian/tstatcollect');
    while (<FILE>) {
        chomp;
        (undef,$temp, undef,$tmode, undef,$fmode, undef,$override,
         undef,$hold, undef,$tcool, undef,$tstate, undef,$fstate,
         undef,$time, undef,$day, undef,$hour, undef,$minute,
         undef,$t_type_post,) = map split (":"), split (","), ;
        print "Temperature: $temp\n";
        print "Tmode: $tmode\n";
        print "Fmode: $fmode\n";
        print "Override: $override\n";
        print "Hold: $hold\n";
        print "tcool: $tcool\n";
        print "tstate: $tstate\n";
        print "fstate: $fstate\n";
        print "day: $day\n";
        print "hour: $hour\n";
        print "minute: $minute\n";
        print "t_type_post $t_type_post\n";
        print "________\n";
    }
    close (FILE);
    exit;

    With an output now of

    Temperature: 75.50
    Tmode: 2
    Fmode: 0
    Override: 0
    Hold: 0
    tcool: 75.00
    tstate: 0
    fstate: 0
    day: "hour"
    hour: "minute"
    minute: "t_type_post"
    t_type_post "
    It seems as though when the regular format changes to something it doesn't expect, the continued parsing of the data goes to trash? My thought is: do I need to break the data into two forms, one with the first portion and another with the nested date/time portion? I also liked the statement made earlier about collecting the data regardless of where it sits relative to the delimiter. Interesting. I appreciate all the help, and it's giving me a lot of reading and researching to do this weekend. I know there are a ton of resources and books written on the subject, but without a reference point to start from it was hard to know what I was looking for. This gives me more to work with. I'm not looking for answers to the problem as much as I am looking for places to start learning. Thanks again. Julian
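The misalignment described above can be made visible by printing the flattened list with its indexes. This is a sketch on a shortened version of the data line (an assumption, to keep the output small).

```perl
use strict;
use warnings;

# Shortened data line: one ordinary pair, the nested "time" part, one trailing pair.
my $line = '{"fstate":0,"time":{"day":4,"hour":13,"minute":49},"t_type_post":0}';

# Same technique as in the working code: split on commas, then on colons.
my @fields = map { split /:/ } split /,/, $line;

# Show each element with its index.
printf "%2d  %s\n", $_, $fields[$_] for 0 .. $#fields;
```

The piece "time":{"day":4 contains two colons, so it yields three elements instead of two. From that point on, labels sit where values are expected, which is why $day ends up holding "hour".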

      So, the reason that the split approach fails is that it is very sensitive to changes in input structure. If we examine the corrected input stream, you'll note that it is now valid JSON, so the script
      use strict;
      use warnings;
      use JSON;
      use Data::Dumper;

      $_ = '{"temp":75.50,"tmode":2,"fmode":0,"override":0,"hold":0,"t_cool":75.00,"tstate":0,"fstate":0,"time":{"day":4,"hour":13,"minute":49},"t_type_post":0}';
      print Dumper decode_json ($_);

      outputs

      $VAR1 = {
                'fstate' => 0,
                't_cool' => '75',
                'time' => {
                            'hour' => 13,
                            'minute' => 49,
                            'day' => 4
                          },
                'fmode' => 0,
                'tmode' => 2,
                'temp' => '75.5',
                'hold' => 0,
                'override' => 0,
                'tstate' => 0,
                't_type_post' => 0
              };

      Rather than having to fight writing a parser, the module does the heavy lifting, and your script could be rewritten perhaps like:

      use strict;
      use warnings;
      use JSON;

      open (my $fh, '<', '/home/julian/tstatcollect') or die "Open failed:$!";
      while (<$fh>) {
          chomp;
          my $obj = decode_json($_);
          print "Temperature: $obj->{temp}\n";
          print "Tmode: $obj->{tmode}\n";
          print "Fmode: $obj->{fmode}\n";
          print "Override: $obj->{override}\n";
          print "Hold: $obj->{hold}\n";
          print "tcool: $obj->{t_cool}\n";
          print "tstate: $obj->{tstate}\n";
          print "fstate: $obj->{fstate}\n";
          print "day: $obj->{time}{day}\n";
          print "hour: $obj->{time}{hour}\n";
          print "minute: $obj->{time}{minute}\n";
          print "t_type_post $obj->{t_type_post}\n";
          print "________\n";
      }

      which outputs

      Temperature: 75.5
      Tmode: 2
      Fmode: 0
      Override: 0
      Hold: 0
      tcool: 75
      tstate: 0
      fstate: 0
      day: 4
      hour: 13
      minute: 49
      t_type_post 0
      ________
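Because decode_json dies on malformed input, a collector that runs all day may want to skip a bad line and carry on. A minimal sketch, with two assumptions: JSON::PP (bundled with Perl since 5.14) is swapped in for JSON so nothing needs installing, and the two sample lines are hard-coded instead of read from the file.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: core module used in place of JSON

my @lines = (
    '{"temp":75.50,"time":{"day":4,"hour":13,"minute":49}}',
    '{"temp":70.00,"time":"day":3,"hour":23}',    # malformed, like the first sample posted
);

my @temps;
for my $line (@lines) {
    # eval traps the exception decode_json throws on invalid JSON.
    my $obj = eval { decode_json($line) };
    if ( !defined $obj ) {
        warn "Skipping malformed line: $line\n";
        next;
    }
    push @temps, $obj->{temp};
    print "temp $obj->{temp} at hour $obj->{time}{hour}\n";
}
```

Only the first line is decoded; the second triggers the warning and is skipped.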

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        This is excellent, kennethk!

Re: Proper use of split
by th3j4ckl3 (Novice) on Jun 01, 2012 at 19:51 UTC
    As odd as it may seem, I have been in a Unix career for about 25 years. ATT -> SCO -> now in AIX. I haven't really dabbled with using Perl, but I stumbled upon a project that now has me fascinated by the language. I recently purchased a wifi Filtrete thermostat that allows me to curl data from the thermostat itself and present it to me in the format you have here. Like with any of us that have been Unix admins, it's a constant growing experience. If anything, I have learned there are some really good people in this field and you guys have proven that. I overnighted my new 2012 4th edition Perl programming book from Amazon and I should be getting it today. As I tinker with the info I have so far, I'll update here, but you all have given me MORE than enough to go with! I couldn't have expected better treatment from this group's members. Thanks again. I'll let you know how I progress on this project and my adventures as another newb in Perl. Loving it so far. An amazing language. Julian.


      Welcome to Perl and PM. I have worked in AIX for about 24 years, and have used Perl for about the last 15 years.

      Perl just gets better and better.

      You ordered the right book, but you may also want to get the O'Reilly companion book "Perl Cookbook". I only found out about it on PM, and it has a lot of useful examples that are full scripts. You can get source for the scripts from the O'Reilly web site. It may help you appreciate the power of Perl!

      One note of caution about AIX. IBM has sometimes shipped the development version of Perl, and not the stable version. Versions 5.8.8 and 5.12+ or newer are great versions of Perl. I have compiled them with both the gcc 4.2.4 and the xlc compilers without much problem. You can check your version:

      perl -v    # tells you just the version
      perl -V    # that's a capital V, and it tells how Perl was compiled
      This is covered in your book.


      "Well done is better than well said." - Benjamin Franklin

Re: Proper use of split
by th3j4ckl3 (Novice) on Jun 01, 2012 at 20:09 UTC

    And as I post this reply, my UPS guy just showed up with my hardbound copy. Let the adventures begin!

Node Type: perlquestion [id://973794]
Approved by kennethk
Front-paged by Corion