
Perl, SQLite3, and Parsing the Chatterbox Feed.

by DigitalKitty (Parson)
on Feb 14, 2008 at 05:51 UTC ( #667880=perlquestion )
DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

With help from: parv, dhoss, Fairy_Nuff, and planetscape, I started writing a chatterbox history tool for educational reasons.

use warnings;
use strict;
use LWP::Simple;
use DBI;

my $data = '';
my $dbh  = '';
my $url  = 'http://www.perlmonks.org/?node_id=207304';
my $pat  = qr{ .*<author>(.*)<\/author>.*<text>(.*)<\/text }xs;

$data = get( $url );
$dbh  = DBI->connect( "dbi:SQLite:dbname=C:\\testdb", "", "" );

while ( ( my($auth, $text) = ( $data =~ m/$pat/gc ) ) ) {
    for( $text ) {
        s/[ ]+/ /g;
        s/^\s+//;
        s/\s+$//;
    }
    printf "%s: %s\n\n" , $auth , $text;
    $dbh->do('insert into monks values(?,?)', undef, $auth, $text );
}


I was hoping some of you could offer suggestions regarding how I might improve the design/functionality of the (currently beta quality) program. At the present time, it only displays the most recent author/comment as opposed to several speakers and their respective comments.

I took the liberty of including my (simple) table design as well:
SQLite 3.5.6
CREATE TABLE monks( monk varchar(25), comment varchar(255) );

Thanks,
~Katie

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by pc88mxer (Vicar) on Feb 14, 2008 at 06:15 UTC
    You definitely need to use less greedy regexes. Instead of:

    my $pat = qr{ .*<author>(.*)<\/author>.*<text>(.*)<\/text }xs;

    use:

    my $pat = qr{ .*?<author>(.*?)<\/author>.*?<text>(.*?)<\/text }xs;
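
    To see the difference, here is a minimal sketch that runs both patterns over a made-up two-message string. The sample data and the helper name pairs are assumptions for illustration; the real feed has more elements per message.

```perl
use strict;
use warnings;

# Hypothetical two-message sample shaped like the feed's <author>/<text> pairs.
my $sample = join '',
    '<message><author>alice</author><text>first</text></message>',
    '<message><author>bob</author><text>second</text></message>';

my $greedy = qr{ .*<author>(.*)<\/author>.*<text>(.*)<\/text }xs;
my $lazy   = qr{ .*?<author>(.*?)<\/author>.*?<text>(.*?)<\/text }xs;

# Collect [author, text] pairs for every match of $pat in $data.
sub pairs {
    my ( $pat, $data ) = @_;
    my @found;
    while ( $data =~ m/$pat/g ) {
        push @found, [ $1, $2 ];
    }
    return @found;
}

my @g = pairs( $greedy, $sample );
my @l = pairs( $lazy,   $sample );

printf "greedy: %d match(es), first author %s\n", scalar @g, $g[0][0];
printf "lazy:   %d match(es), first author %s\n", scalar @l, $l[0][0];
```

    The greedy pattern swallows everything up to the last <author> and so reports a single match (bob), while the non-greedy one finds both messages, starting with alice.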

    Also, I'm not sure you are using the /g option correctly. I've had better luck with:

    while ($data =~ m/$pat/gc) {
        my ($auth, $text) = ($1, $2);
        for( $text ) {
            s/[ ]+/ /g;
            s/^\s+//;
            s/\s+$//;
        }
        printf "%s: %s\n\n" , $auth , $text;
    }
      I don't see why either of you are using /c. It's definitely not useful, and I suspect it's harmful.

        I am responsible for the /c in the match condition and the simultaneous assignment in the while loop (I replied in a hurry and misread the /c description). Here is what works ...

        # Without /g, it would be an endless loop for match will
        # always start at the start of $data.
        while ( $data =~ m/$parse/g ) {
            my ( $auth , $text ) = ( "$1" , "$2" );
            ...
        }

        (Circa 2001-2005, there are some examples of XML::(Twig|Simple) use to parse the chatterbox XML around here somewhere.)
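
        The context the match runs in is what matters here. A small sketch, using a made-up sample string and a simplified pattern (both assumptions, not the real feed), of how m//g behaves in scalar versus list context:

```perl
use strict;
use warnings;

# Hypothetical sample data; the real feed wraps these in more markup.
my $data = '<author>alice</author><text>hi</text>'
         . '<author>bob</author><text>hello</text>';
my $pat  = qr{<author>(.*?)</author><text>(.*?)</text>}s;

# Scalar context: each evaluation finds the next match and advances pos($data).
my @seen;
while ( $data =~ m/$pat/g ) {
    push @seen, "$1: $2";
}

# List context: a single evaluation returns the captures of every match at once.
my @all = ( $data =~ m/$pat/g );

print "$_\n" for @seen;
print scalar(@all), " captures from one list-context match\n";
```

        With the list assignment placed inside the while condition, the match runs in list context, so all matches are consumed in one go and only the first pair of captures lands in the two variables.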

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by McDarren (Abbot) on Feb 14, 2008 at 06:27 UTC
    um, two comments..

    1. You're parsing XML with a regex. Tsk! Tsk!. You should know better than that :p
      Use a proper XML parser such as XML::Twig or XML::Simple.
    2. Given that you're creating a CB history, wouldn't you think it a good idea to include a date/time field in your database? ;)

    Cheers,
    Darren :)

      I'd like to second both of these suggestions, as well as add a couple of my own...

      Have you considered parsing the posts for links in the CB? I imagine it would lend itself to some very interesting correlations down the road:
      "What percentage of posts link to CPAN?
      Which Monk links to his/her scratchpad most often?
      etc..."


      To make this really work well, you would definitely need at least a time value as suggested already (and if you plan to keep more than 24 hours' worth of data, a date value will be necessary as well).
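
      One way to get both date and time is to store the epoch value that the feed already supplies for each message (it appears in the XML::Simple dump elsewhere in this thread) and format it when needed. A minimal sketch using only core modules; the helper name epoch_to_utc is hypothetical, and how the value is stored in the table is left open:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format an epoch value as a UTC "date time" string.
sub epoch_to_utc {
    my ($epoch) = @_;
    return strftime( '%Y-%m-%d %H:%M:%S', gmtime($epoch) );
}

# Epoch taken from the first message in the dump posted in this thread.
print epoch_to_utc(1202970679), "\n";    # 2008-02-14 06:31:19
```

      Keeping the raw epoch as an integer makes range queries and sorting simple, and the formatted string can always be derived from it later.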
Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by hipowls (Curate) on Feb 14, 2008 at 06:36 UTC

    To get the data into a usable form this works. It needs error checking but the idea is sound.

    use XML::Simple;
    use LWP::Simple;
    use Data::Dumper;

    my $url  = 'http://www.perlmonks.org/?node_id=207304';
    my $text = get($url);
    my $ref  = XMLin( $text, ForceArray => ['message'], );

    print Dumper $ref;

    __END__
    $VAR1 = {
        'info' => {
            'sitename'   => 'PerlMonks',
            'count'      => '2',
            'gentimeGMT' => '2008-02-14 06:32:17',
            'lastid'     => '703987',
            'content'    => 'Rendered by the New Chatterbox XML Ticker',
            'xmlmaker'   => 'XML::Fling 1.001',
            'site'       => 'http://perlmonks.org/',
            'xmlstyle'   => 'clean,new',
            'fromid'     => '00703985',
            'ticker_id'  => '207304'
        },
        'message' => [
            {
                'message_id' => '703986',
                'epoch'      => '1202970679',
                'text'       => 'testing',
                'time'       => '01:31:19',
                'date'       => '2008-02-14',
                'user_id'    => '660179',
                'author'     => 'hipowls'
            },
            {
                'message_id' => '703987',
                'epoch'      => '1202970708',
                'text'       => 'just ignore it',
                'time'       => '01:31:48',
                'date'       => '2008-02-14',
                'user_id'    => '660179',
                'author'     => 'hipowls'
            }
        ]
    };

    Update: Added ForceArray => ['message'] so that messages are always in a list even when there is only one.
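
    Once the data is in that shape, walking it is plain hash and array dereferencing. A rough sketch, with the structure hard-coded from the dump above so it runs without fetching anything; the helper name message_rows is hypothetical, and the fallback to an empty list assumes the message key may be absent when the feed reports no messages:

```perl
use strict;
use warnings;

# Structure copied from the Dumper output above (only the fields used here).
my $ref = {
    message => [
        { author => 'hipowls', epoch => '1202970679', message_id => '703986', text => 'testing' },
        { author => 'hipowls', epoch => '1202970708', message_id => '703987', text => 'just ignore it' },
    ],
};

# Turn each message hash into a row of [author, epoch, text], ordered by message_id.
sub message_rows {
    my ($feed) = @_;
    return map  { [ @{$_}{qw(author epoch text)} ] }
           sort { $a->{message_id} <=> $b->{message_id} }
           @{ $feed->{message} || [] };
}

for my $row ( message_rows($ref) ) {
    printf "%s (%s): %s\n", @$row;
}
```

    Each row is then ready to be handed to a prepared insert statement with three placeholders.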

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by holli (Monsignor) on Feb 14, 2008 at 10:27 UTC
    working in "production":
    #!/usr/bin/perl
    use lib qw( /mnt/web4/10/47/51683347/htdocs/lib/site_perl/5.8.5 );
    use warnings;
    use strict;
    use DBI;
    use WWW::Mechanize;
    use XML::Simple;

    my ($sth, $dbh, $xml);
    my $messages = [];
    my $mech = WWW::Mechanize->new();

    while (1) {
        my $resp = $mech->get( 'http://www.perlmonks.org/index.pl?node_id=207304' );
        if ( $resp->is_success ) {
            my $xml = $resp->content;
            my $jatter = XMLin( $xml, ForceArray => ['message'] );
            if ( $jatter->{info}->{count} > 0 ) {
                print STDERR "adding ", scalar @{$jatter->{message}}, "\n";
                unless ( $dbh ) {
                    $dbh = DBI->connect("DBI:mysql:database=DB354211;host=rdbms.strato.de", 'U354211', 'pw354211');
                    $sth = $dbh->prepare('INSERT INTO pmf_jatterboxx (user_id, author, epoch, message_id, message) VALUES (?, ?, ?, ?, ?)');
                }
                for ( @{$jatter->{message}} ) {
                    $sth->execute( $_->{user_id}, $_->{author}, $_->{epoch}, $_->{message_id}, $_->{text} );
                }
            }
            else {
                print STDERR "snooze\n";
            }
        }
        sleep(5);
    }
    note: It is not obvious, but the chatterbox feed somehow notices the caller and returns only the chat lines that are new, even without passing a date flag or anything similar. I am curious how that works.


    holli, /regexed monk/
