PerlMonks  

Help:getting parts of the strings from a file into managable variables

by my_perl (Initiate)
on Nov 11, 2004 at 17:10 UTC ( [id://407105]=perlquestion )

my_perl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am very new to Perl and would be very thankful for any help you can give me. I am trying to write a program that extracts info from a file and then does some statistical analysis on it. For every UserID I need to collect all of its variables. As you can see, they are distributed throughout the file. Any suggestions on how I should approach this? The file is in the following format:
<UserID>46786<UserID>
<start>2004-10-21TO09:57:25Z</start>
<dev>Some Text</dev>
<var1>some string</var1>
<var2>some string</var2>
<USerID>57864</UserID>
<start>2004-10-25TO09:57:25Z</start>
<dev>Some Text</dev>
<var1>some string</var1>
<UserID>46786<UserID>
<var3>some string</var3>
<var4>some string</var4>
<UserID>98766</UserID>
<start>2004-10-21TO09:57:25Z</start>
<dev>Some Text</dev>
<var1>some string</var1>
<var2>some string</var2>
<var5>some string</var5>
<var6>some string</var6>
<USerID>57864</UserID>
<var4>some string</var4>
<var6>some string</var6>

Replies are listed 'Best First'.
Re: getting parts of the strings from a file into managable variables
by Roy Johnson (Monsignor) on Nov 11, 2004 at 17:24 UTC
    I think this problem is a little advanced for someone "very new" to Perl. Do you understand the basic data structures of perl (perldoc perldata)? Do you have a fair understanding of regular expressions (perldoc perlretut)? How about references (perldoc perlreftut)? If you have a pretty good grasp of those concepts, you should be able to take a stab at this problem.

    Generally, it's recommended that you use a module to parse XML type data files, but a quick and dirty solution might go like this:

    use strict;
    use warnings;

    my %info;
    my $thisuser;
    while (<DATA>) {
        # Skip lines that do not look like <tag>value
        my ($var, $val) = /<([^>]+)>([^<]+)/ or next;
        # Compare case-insensitively: the sample data also spells the tag "USerID"
        if (lc($var) eq 'userid') {
            $thisuser = $val;
        }
        else {
            $info{$thisuser}{$var} = $val;
        }
    }

    use Data::Dumper;
    print Dumper(\%info), "\n";

    __DATA__
    <UserID>46786<UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <USerID>57864</UserID>
    <start>2004-10-25TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <UserID>46786<UserID>
    <var3>some string</var3>
    <var4>some string</var4>
    <UserID>98766</UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <var5>some string</var5>
    <var6>some string</var6>
    <USerID>57864</UserID>
    <var4>some string</var4>
    <var6>some string</var6>

    Caution: Contents may have been coded under pressure.
Re: getting parts of the strings from a file into managable variables
by jZed (Prior) on Nov 11, 2004 at 17:19 UTC
    This format is maddeningly close to XML. I would suggest running a regex to turn it into real XML and then using XML tools to parse it. Try something like
    s~<UserID>(\d+)</UserID>~</record><record UserID="$1">~g;
    Then lop off the first </record> and add an enclosing tag for the entire set and you should have real XML.
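As a rough sketch of that idea (not jZed's actual code), the substitution can be applied to the whole file held in one string. The sample data below assumes every UserID tag is well formed, which the posted data is not, and the root tag name "records" is made up for illustration. Note that the last record also needs its own closing </record> before the enclosing tag is closed.

```perl
use strict;
use warnings;

# Sample input with well-formed UserID tags (an assumption; the data in
# the original post has a few malformed tags that need fixing first).
my $text = <<'END';
<UserID>46786</UserID>
<start>2004-10-21TO09:57:25Z</start>
<var1>some string</var1>
<UserID>57864</UserID>
<var4>some string</var4>
END

# Step 1: each UserID line closes the previous record and opens a new one.
(my $xml = $text) =~ s~<UserID>(\d+)</UserID>~</record><record UserID="$1">~g;

# Step 2: lop off the first </record>, which has no opening tag.
$xml =~ s~\A\s*</record>~~;

# Step 3: close the last record and wrap everything in a root element.
# "records" is a hypothetical root tag name.
$xml = "<records>$xml</record></records>\n";

print $xml;
```

The resulting string could then be handed to whichever XML parser you prefer. Because UserIDs repeat in the posted data, the same UserID attribute may appear on more than one record, so the values would still need merging afterwards.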
Re: getting parts of the strings from a file into managable variables
by Eimi Metamorphoumai (Deacon) on Nov 11, 2004 at 17:29 UTC
    Your data looks a little suspect (lacking a few /, and inconsistent case), but this parses it and builds a nested hash from it.
    my $user;
    my %uservariables;
    while(<DATA>){
        chomp;
        my ($key, $value) = m{^<([^>]+)>(.*)<}i or die $_;
        $key = lc $key; #to adjust for case differences
        if ($key eq "userid"){
            $user = $value;
            next;
        }
        $uservariables{$user}->{$key}=$value;
    }
    use Data::Dumper;
    print Dumper(\%uservariables);
    __DATA__
    <UserID>46786<UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <USerID>57864</UserID>
    <start>2004-10-25TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <UserID>46786<UserID>
    <var3>some string</var3>
    <var4>some string</var4>
    <UserID>98766</UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <var5>some string</var5>
    <var6>some string</var6>
    <USerID>57864</UserID>
    <var4>some string</var4>
    <var6>some string</var6>
      Hi, thanks a lot for the response; it is working great. What do I need to change if there are spaces before the strings and the number of spaces is not constant? Thanks again :) Aida
        Hmmm, you might try changing the regexp to something like
        my ($key, $value) = m{^\s*<([^>]+)>\s*(.*)<}i;
        Although I'm starting to agree with the others that a real XML parser might be the way to go.
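To see what the two added \s* pieces do, here is a small sketch that runs the whitespace-tolerant pattern over a few made-up lines. The sample lines and the %seen hash are assumptions for illustration only; they are not part of the original data.

```perl
use strict;
use warnings;

# Hypothetical sample lines with varying amounts of leading whitespace.
my @lines = (
    "<UserID>46786</UserID>",
    "   <var1>some string</var1>",
    "\t  <var2>  padded value</var2>",
);

my %seen;
for my $line (@lines) {
    # The \s* after ^ skips indentation; the \s* after the tag skips
    # padding in front of the value.
    my ($key, $value) = $line =~ m{^\s*<([^>]+)>\s*(.*)<}i
        or die "Cannot parse: $line";
    $seen{ lc $key } = $value;
    print lc($key), " => '$value'\n";
}
```

Trailing spaces before the closing tag are not removed by this pattern; if that matters, the value can be trimmed separately with something like s/\s+$//.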
      Hi, how would I access these variables?
      For each UserID I would like to print only var1 and its value, and var2 and its value.
      foreach $key (keys %uservariables) {
      ???????
      }
      Thanks
        Something like
        for my $key (keys %uservariables){
            print "User $key has var1: $uservariables{$key}->{var1}, var2: $uservariables{$key}->{var2}\n";
        }
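Not every user in the sample data has both var1 and var2, so the loop above can trigger "uninitialized value" warnings under use warnings. A minimal sketch of one way to guard against that follows; the small %uservariables hash here is a stand-in with illustrative values, and the "(not set)" placeholder and @report array are assumptions, not something from the thread.

```perl
use strict;
use warnings;

# A small stand-in for the %uservariables hash built by the parser;
# the values are illustrative only.
my %uservariables = (
    46786 => { var1 => 'some string', var2 => 'some string' },
    57864 => { var1 => 'some string' },    # no var2 for this user
);

my @report;
for my $user (sort keys %uservariables) {
    my $vars = $uservariables{$user};      # hash reference for this user
    for my $name (qw(var1 var2)) {
        # Fall back to a placeholder when the user has no such variable.
        my $value = exists $vars->{$name} ? $vars->{$name} : '(not set)';
        push @report, "User $user $name: $value";
    }
}
print "$_\n" for @report;
```

Sorting the keys is optional; it only makes the output order predictable, since Perl hashes have no defined order.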
Re: Help:getting parts of the strings from a file into managable variables
by tmoertel (Chaplain) on Nov 11, 2004 at 17:37 UTC
    (Update: Noticed that UserIDs could repeat; changed code to merge values for duplicate UserIDs.)

    Here's one way of doing it that stores the data as a hash of hashes:

    #!/usr/bin/perl

    use warnings;
    use strict;

    my %user_data;
    my $current_user;

    while (<DATA>) {
        if (my ($elem, $content) = m|^ <([^>]+)> (.*) </\1> |x) {
            $current_user = $content if $elem eq "UserID";
            $user_data{$current_user}{$elem} = $content;
        }
        else {
            print "Bad line $.: $_";
        }
    }

    use Data::Dumper;
    print Dumper(\%user_data);

    # $VAR1 = {
    #   '98766' => {
    #     'var6' => 'some string',
    #     'var1' => 'some string',
    #     'dev' => 'Some Text',
    #     'UserID' => '98766',
    #     'var2' => 'some string',
    #     'var5' => 'some string',
    #     'start' => '2004-10-21TO09:57:25Z'
    #   },
    #   '57864' => {
    #     'var6' => 'some string',
    #     'var1' => 'some string',
    #     'dev' => 'Some Text',
    #     'var4' => 'some string',
    #     'UserID' => '57864',
    #     'start' => '2004-10-25TO09:57:25Z'
    #   },
    #   '46786' => {
    #     'var3' => 'some string',
    #     'var1' => 'some string',
    #     'dev' => 'Some Text',
    #     'var4' => 'some string',
    #     'UserID' => '46786',
    #     'var2' => 'some string',
    #     'start' => '2004-10-21TO09:57:25Z'
    #   }
    # };

    __DATA__
    <UserID>46786</UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <UserID>57864</UserID>
    <start>2004-10-25TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <UserID>46786</UserID>
    <var3>some string</var3>
    <var4>some string</var4>
    <UserID>98766</UserID>
    <start>2004-10-21TO09:57:25Z</start>
    <dev>Some Text</dev>
    <var1>some string</var1>
    <var2>some string</var2>
    <var5>some string</var5>
    <var6>some string</var6>
    <UserID>57864</UserID>
    <var4>some string</var4>
    <var6>some string</var6>
    When parsing files, it's a good idea to detect and report errors. A few of your sample lines, for example, had opening and closing tags that didn't match. I fixed them in my example data, but only after the error-reporting code caught them.

    Cheers,
    Tom

      Hi, thank you so much for the prompt response :) I cut and pasted your code, and it does not work for me. It reports every line as bad and prints an empty $VAR1 = {}. Any idea why? Thanks once more.
        To make my code easier to read, I indent it by four spaces when I quote it. (That way, it doesn't get lost in the surrounding flow of text.) As a result, you'll need to unindent it (or at least the __DATA__ portion) before running it.

        Try processing the script through this one liner to remove the leading four spaces:

        perl -i.bak -pe's/^    //' the-script.pl # unindent the-script.pl
        That ought to do it.

        Cheers,
        Tom
