PerlMonks  

Parsing multi-line record with varying data

by VoidWander (Initiate)
on Aug 09, 2013 at 17:45 UTC (#1048809=perlquestion)
VoidWander has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks..

I've been struggling to find a suitable solution (or direction) for a parsing task I'm working on. The dataset looks similar to the following (without the dashes - those are simply used here to encapsulate each entry as an example):

--
rn: uid:<user>, <irrelevant-text>
id-info: <URL> | <ID> | <random-string>
creation-time: 1366069064
--
rn: uid:<user>, <irrelevant-text>
id-info: <URL> | <ID> | <random-string>
id-info: <URL> | <ID> | <random-string>
id-info: <URL> | <ID> | <random-string>
creation-time: 1366069064
--
rn: uid:<user>, <irrelevant-text>
id-info: <URL> | <ID> | <random-string>
id-info: <URL> | <ID> | <random-string>
# random empty line in each entry with 'deletion-time'
deletion-time: 1367949064
creation-time: 1366069064
--

Now what I need to do is go through each entry and count the number of <ID>'s assigned to it, adding it to a hash I figure, like so:

$myHash{totalForEntry}++

eventually ending up so I can print out a table like:

IDS    TOTAL
1      23
2      536
3      51
4      353
..etc

I'm not exactly sure how I can break down each section and count the total IDs assigned only within that entry. I had headed in the direction of 'paragraph mode' but what quickly threw me off of there was the empty line within each entry containing a 'deletion-time', which throws off the input separator.

Here's some very messy code I had started with, lots of nonsense as I was experimenting here and there:

#!/usr/bin/perl
use DateTime;
use strict;
use warnings;

my $LOGFH = 'testdata';
my $/ = '\n';
my $x=0;

open(LOGFH) or die("Couldn't open 'er!");

while(<LOGFH>) {
    #chomp;
    print $_, "\n";
    $x++;
    last if $x == 2;
}
#
    #print $_, "\n";
#    if ($_ =~ m!\|(.*)\|!){
#        $id = $1 . "|";
#        #print $id;
    } elsif ($_ =~ m!creationDate:\s+(\d+)!){
        ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($1);
        $dt = DateTime->from_epoch( epoch => $1 );
        print $dt->year;
    }
    #last;
}

Monks, do you have any pointers for me in regards to which direction I should be taking this to solve the problem? I'm in no way looking for a handout - I like a challenge and have been thinking this over for the past day, but figured a little community input wouldn't hurt! :) Many thanks, VoidWander

Re: Parsing multi-line record with varying data
by BrowserUk (Pope) on Aug 09, 2013 at 17:56 UTC
    I had headed in the direction of 'paragraph mode' but what quickly threw me off of there was the empty line within each entry containing a 'deletion-time', which throws off the input separator.

    Use $/ = 'rn:'; as the input separator and discard the first read. After that each read will be one complete section.
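
    A minimal sketch of that suggestion, assuming the records arrive on a filehandle. The read_records helper and the sample text are made up for illustration. Because the input starts with (or before) the first 'rn:', the first read holds no entry, which is why it gets discarded.

```perl
use strict;
use warnings;

# Hypothetical helper: split the text behind $fh into records on 'rn:'.
sub read_records {
    my ($fh) = @_;
    local $/ = 'rn:';          # each read now ends at the next 'rn:'
    my $first = <$fh>;         # everything up to the first 'rn:'; discarded
    my @records;
    while (my $record = <$fh>) {
        chomp $record;         # chomp strips a trailing 'rn:' (the value of $/)
        push @records, $record;
    }
    return @records;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = "--\nrn: uid:alice, x\nid-info: u | 1 | s\n\ndeletion-time: 1\n--\n"
           . "rn: uid:bob, y\ncreation-time: 2\n--\n";
open(my $fh, '<', \$sample) or die "Couldn't open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records\n";
```

    One caveat with this approach: any 'rn:' that appears elsewhere in an entry (inside a URL, say) would also end a record, so it is worth checking the real data for that.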


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Parsing multi-line record with varying data
by hippo (Curate) on Aug 09, 2013 at 17:57 UTC

    Welcome to the monastery.

    To be honest, there's a lot in that script preventing it even from compiling. It might be best if you could just write a new script which simply manages to open the file correctly, read each line and print each line. Once you have that, save it. Then as you modify it at each step to add extra processes, test it, and if it fails, keep refining your addition until it works.
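
    As a starting point along those lines, here is a minimal sketch that only opens a file, reads each line and prints it. The file name 'testdata' comes from the original script; read_lines is an illustrative name, not anything from the thread.

```perl
use strict;
use warnings;

# Return every line of the file at $path, with trailing newlines removed.
sub read_lines {
    my ($path) = @_;
    # Three-argument open with a lexical filehandle; report why it failed.
    open(my $fh, '<', $path) or die "Couldn't open '$path': $!";
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# Print the file only if it is present, so the sketch runs anywhere.
if (-e 'testdata') {
    print "$_\n" for read_lines('testdata');
}
```

    Note that $/ is a built-in global, so it is assigned with "local $/ = ..." rather than declared with "my", and it is best left alone until plain line-by-line reading works.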

    Good luck.

Re: Parsing multi-line record with varying data
by moritz (Cardinal) on Aug 09, 2013 at 17:59 UTC

    Here is an example that reads through the example file, and prints each uid along with the number of id-info lines attached to it:

    use strict;
    use warnings;
    use 5.010;

    local $/ = "--\n";

    while (<DATA>) {
        if (/^rn: uid:(\S+),/m) {
            my $uid = $1;
            my $count = 0;
            $count++ for /^id-info:/mg;
            say "$uid: $count";
        }
    }

    __DATA__
    --
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    creation-time: 1366069064
    --
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    creation-time: 1366069064
    --
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    # random empty line in each entry with 'deletion-time'
    deletion-time: 1367949064
    creation-time: 1366069064
    --

    I hope it can serve as a starting point for you.

    If you need more help, please provide example (not whited-out) input data along with exactly what output you are expecting.

Re: Parsing multi-line record with varying data
by Kenosis (Priest) on Aug 09, 2013 at 18:11 UTC

    Building upon BrowserUk's suggestion, perhaps the following will be helpful:

    use strict;
    use warnings;
    use Data::Dumper;

    $/ = 'rn:';
    my %hash;
    my $entry = 1;

    while (<DATA>) {
        chomp;
        /\S/ and $hash{ $entry++ } = () = /id-info:/g;
    }

    print Dumper \%hash;

    __DATA__
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    creation-time: 1366069064
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    creation-time: 1366069064
    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | <ID> | <random-string>
    id-info: <URL> | <ID> | <random-string>
    deletion-time: 1367949064
    creation-time: 1366069064
    rn: uid:<user>, <irrelevant-text>
    creation-time: 1366069064

    Output:

    $VAR1 = {
              '4' => 0,
              '1' => 1,
              '3' => 2,
              '2' => 3
            };
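
    The table in the original question (number of IDs per entry against how many entries have that number) is one more step from per-entry counts like these. A sketch, assuming the same 'rn:' separator; tally_id_counts is an illustrative name and the sample data is made up.

```perl
use strict;
use warnings;

# Illustrative helper: map "number of id-info lines in an entry"
# to "number of entries that have that many".
sub tally_id_counts {
    my ($text) = @_;
    my %total;
    open(my $fh, '<', \$text) or die "Couldn't open in-memory file: $!";
    local $/ = 'rn:';
    while (my $entry = <$fh>) {
        chomp $entry;
        next unless $entry =~ /uid:/;            # skip the first read, which holds no entry
        my $ids = () = $entry =~ /^id-info:/mg;  # count the matches via list context
        $total{$ids}++;
    }
    close $fh;
    return %total;
}

# Made-up sample: two entries with one ID each, one entry with two IDs.
my $sample = join '',
    "rn: uid:a, x\nid-info: u | 1 | s\ncreation-time: 1\n",
    "rn: uid:b, x\nid-info: u | 1 | s\nid-info: u | 2 | s\n\ndeletion-time: 2\ncreation-time: 1\n",
    "rn: uid:c, x\nid-info: u | 3 | s\ncreation-time: 1\n";

my %total = tally_id_counts($sample);
printf "%-4s %s\n", 'IDS', 'TOTAL';
printf "%-4d %d\n", $_, $total{$_} for sort { $a <=> $b } keys %total;
```

    Keying the hash on the count itself, rather than on an entry number, is what turns the per-entry figures into the IDS/TOTAL summary.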
Re: Parsing multi-line record with varying data
by VoidWander (Initiate) on Aug 12, 2013 at 01:36 UTC

    Thank you for the advice, Monks. I've made some progress but have also had some difficulties in pursuing my solution further. I have written code that does the majority of what I would like to accomplish, but here's where I'm stuck. For each line with 'id-info:' I need to get a copy of <ID> and assign to it an incremental counter, eventually getting to the point in my code where I can assign a hash as follows: (There are only about 10 unique IDs, just in case you're wondering..)

    $count{$month}{$day}{$ID}++;

    Some of the records appear as follows, which throws me off as I can't seem to access each individual line and instead access the record as a whole, correct?

    rn: uid:<user>, <irrelevant-text>
    id-info: <URL> | 12345 6789 | <random-string>
    id-info: <URL> | 9876543 21 | <random-string>
    id-info: <URL> | 134257 869 | <random-string>
    creation-time: 1366069064

    As you can see, the ID is split and attempting to join / assign the entire ID to a variable is turning out to be a pain. Monks, could I perhaps coax out some advice? Many thanks and infinite appreciation, VoidWander.

    .

    Was thinking something along the lines of this, but it seems like there would be an easier way and this obviously doesn't work...

    if (/id-info:/mg){
        ($_ =~ /.*\|(\d+) \|/mg)
        print $1, "\n";
    }
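
    One possible direction, sketched under stated assumptions: the ID is the second '|'-separated field of each id-info line, the embedded space is meant to be removed, and month and day come from the entry's creation-time taken as UTC. A record read as a whole can still be taken apart line by line with split. count_ids is a hypothetical helper and the sample values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: build $count{$month}{$day}{$ID} from one block of text.
sub count_ids {
    my ($text) = @_;
    my %count;
    # Split on 'rn:' at the start of a line; pieces without a uid are skipped.
    for my $entry (grep { /uid:/ } split /^rn:/m, $text) {
        my ($epoch) = $entry =~ /^creation-time:\s*(\d+)/m or next;
        my ($day, $month) = (gmtime $epoch)[3, 4];
        $month++;                                  # gmtime months are 0-based
        for my $line (split /\n/, $entry) {        # the entry, one line at a time
            next unless $line =~ /^id-info:/;
            my (undef, $id) = split /\s*\|\s*/, $line;
            next unless defined $id;
            $id =~ s/\s+//g;                       # "12345 6789" becomes "123456789"
            $count{$month}{$day}{$id}++;
        }
    }
    return %count;
}

# Made-up sample entry; the URLs and trailing strings are placeholders.
my $sample = <<'END';
rn: uid:someone, text
id-info: http://example.com/a | 12345 6789 | abc
id-info: http://example.com/b | 9876543 21 | def
id-info: http://example.com/c | 12345 6789 | ghi
creation-time: 1366069064
END

my %count = count_ids($sample);
for my $month (sort { $a <=> $b } keys %count) {
    for my $day (sort { $a <=> $b } keys %{ $count{$month} }) {
        for my $id (sort keys %{ $count{$month}{$day} }) {
            print "$month/$day $id: $count{$month}{$day}{$id}\n";
        }
    }
}
```

    Splitting on the pipe first and cleaning the field afterwards avoids having to write one regex that tolerates a space at an unknown position inside the ID. If local dates are wanted, localtime can stand in for gmtime.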

Node Type: perlquestion [id://1048809]
Approved by moritz