PerlMonks  

Perl, SQLite3, and Parsing the Chatterbox Feed.

by DigitalKitty (Parson)
on Feb 14, 2008 at 05:51 UTC ( #667880=perlquestion )
DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

With help from: parv, dhoss, Fairy_Nuff, and planetscape, I started writing a chatterbox history tool for educational reasons.

    use warnings;
    use strict;
    use LWP::Simple;
    use DBI;

    my $data = '';
    my $dbh  = '';
    my $url  = 'http://www.perlmonks.org/?node_id=207304';
    my $pat  = qr{ .*<author>(.*)<\/author>.*<text>(.*)<\/text }xs;

    $data = get( $url );
    $dbh  = DBI->connect( "dbi:SQLite:dbname=C:\\testdb", "", "" );

    while ( ( my($auth, $text) = ( $data =~ m/$pat/gc ) ) ) {
        for( $text ) {
            s/[ ]+/ /g;
            s/^\s+//;
            s/\s+$//;
        }
        printf "%s: %s\n\n" , $auth , $text;
        $dbh->do('insert into monks values(?,?)', undef, $auth, $text );
    }


I was hoping some of you could offer suggestions regarding how I might improve the design/functionality of the (currently beta quality) program. At the present time, it only displays the most recent author/comment as opposed to several speakers and their respective comments.

I took the liberty of including my (simple) table design as well:
SQLite 3.5.6
CREATE TABLE monks( monk varchar(25), comment varchar(255) );

Thanks,
~Katie

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by pc88mxer (Vicar) on Feb 14, 2008 at 06:15 UTC
    You definitely need to use less greedy regexes. Instead of:

    my $pat = qr{ .*<author>(.*)<\/author>.*<text>(.*)<\/text }xs;

    use:

    my $pat = qr{ .*?<author>(.*?)<\/author>.*?<text>(.*?)<\/text }xs;

    Also, I'm not sure you are using the /g option correctly. I've had better luck with:

    while ($data =~ m/$pat/gc) {
        my ($auth, $text) = ($1, $2);
        for( $text ) {
            s/[ ]+/ /g;
            s/^\s+//;
            s/\s+$//;
        }
        printf "%s: %s\n\n" , $auth , $text;
    }

      I don't see why either of you are using /c. It's definitely not useful, and I suspect it's harmful.

        I am responsible for the /c in the match condition and for the simultaneous assignment in the while loop (I replied in a hurry and misread the /c description). Here is what works ...

        # Without /g, it would be an endless loop for match will
        # always start at the start of $data.
        while ( $data =~ m/$parse/g ) {
            my ( $auth , $text ) = ( "$1" , "$2" );
            ...
        }

        (Circa 2001-2005, there are some examples of XML::(Twig|Simple) use to parse the chatterbox XML around here somewhere.)
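
The difference between the two patterns discussed above can be seen on a small sample. This is a minimal sketch, not the poster's code: the two-message `$data` string is made up for illustration, and `$greedy` / `$lazy` are hypothetical names. With the greedy pattern the leading `.*` runs to the end of the input and backtracks to the last `<author>`, which would explain why only the most recent line was shown. The non-greedy pattern, matched with /g in scalar context, steps through the messages one at a time.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the feed; not real chatterbox data.
my $data = <<'XML';
<message><author>alice</author><text>  first   line </text></message>
<message><author>bob</author><text>second line</text></message>
XML

my $greedy = qr{ .*<author>(.*)</author>.*<text>(.*)</text }xs;
my $lazy   = qr{ <author>(.*?)</author>.*?<text>(.*?)</text }xs;

# Greedy: the leading .* swallows everything, then backtracks
# only as far as the LAST <author> element.
my @greedy_hit = $data =~ $greedy;

# Non-greedy with /g in scalar context: each iteration resumes
# where the previous match ended.
my @pairs;
while ( $data =~ /$lazy/g ) {
    my ( $auth, $text ) = ( $1, $2 );
    for ($text) { s/[ ]+/ /g; s/^\s+//; s/\s+$//; }
    push @pairs, [ $auth, $text ];
}

print "greedy: $greedy_hit[0]: $greedy_hit[1]\n";
print "lazy:   $_->[0]: $_->[1]\n" for @pairs;
```

The greedy match reports only the last author and text, while the loop collects one pair per message.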

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by McDarren (Abbot) on Feb 14, 2008 at 06:27 UTC
    um, two comments..

    1. You're parsing XML with a regex. Tsk! Tsk!. You should know better than that :p
      Use a proper XML parser such as XML::Twig or XML::Simple.
    2. Given that you're creating a CB history, wouldn't you think it a good idea to include a date/time field in your database? ;)

    Cheers,
    Darren :)

      I'd like to second both of these suggestions, as well as add a couple of my own...

      Have you considered parsing the posts for links in the CB? I imagine it would lend itself to some very interesting correlations down the road:
      "What percentage of posts link to cpan?
      Which Monk links to his/her scratchpad most often?
      etc..."


      To make this really work well, you would definitely need at least a time value as suggested already (and if you plan to keep more than 24 hours' worth of data, a date value will be necessary as well).
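
The link statistics suggested above could be sketched along these lines. Everything here is an assumption for illustration: the `@rows` sample data is invented, and the sketch assumes that links are stored in the PerlMonks bracket syntax (`[cpan://...]`, `[id://...]`) rather than as rendered HTML.

```perl
use strict;
use warnings;

# Hypothetical chat lines; the bracket link syntax is an assumption
# about how links appear in the stored text.
my @rows = (
    { author => 'alice', text => 'try [cpan://XML::Twig] for that' },
    { author => 'bob',   text => 'see [id://667880] and [cpan://DBI]' },
    { author => 'alice', text => 'no links here' },
);

my %links_by_scheme;
my $posts_with_cpan = 0;

for my $row (@rows) {
    # Capture the scheme of every [scheme://target] link in the line.
    my @schemes = $row->{text} =~ m{\[(\w+)://[^\]]+\]}g;
    $links_by_scheme{$_}++ for @schemes;
    $posts_with_cpan++ if grep { $_ eq 'cpan' } @schemes;
}

printf "%.0f%% of posts link to cpan\n", 100 * $posts_with_cpan / @rows;
printf "%-5s %d\n", $_, $links_by_scheme{$_} for sort keys %links_by_scheme;
```

Per-monk questions (such as who links to a scratchpad most often) would need the same counts keyed by author as well as by scheme.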
Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by hipowls (Curate) on Feb 14, 2008 at 06:36 UTC

    To get the data into a usable form this works. It needs error checking but the idea is sound.

    use XML::Simple;
    use LWP::Simple;
    use Data::Dumper;

    my $url  = 'http://www.perlmonks.org/?node_id=207304';
    my $text = get($url);
    my $ref  = XMLin(
        $text,
        ForceArray => ['message'],
    );
    print Dumper $ref;

    __END__
    $VAR1 = {
        'info' => {
            'sitename'   => 'PerlMonks',
            'count'      => '2',
            'gentimeGMT' => '2008-02-14 06:32:17',
            'lastid'     => '703987',
            'content'    => 'Rendered by the New Chatterbox XML Ticker',
            'xmlmaker'   => 'XML::Fling 1.001',
            'site'       => 'http://perlmonks.org/',
            'xmlstyle'   => 'clean,new',
            'fromid'     => '00703985',
            'ticker_id'  => '207304'
        },
        'message' => [
            {
                'message_id' => '703986',
                'epoch'      => '1202970679',
                'text'       => 'testing',
                'time'       => '01:31:19',
                'date'       => '2008-02-14',
                'user_id'    => '660179',
                'author'     => 'hipowls'
            },
            {
                'message_id' => '703987',
                'epoch'      => '1202970708',
                'text'       => 'just ignore it',
                'time'       => '01:31:48',
                'date'       => '2008-02-14',
                'user_id'    => '660179',
                'author'     => 'hipowls'
            }
        ]
    };

    Update: Added ForceArray => ['message'] so that messages are always in a list even when there is only one.
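
Once the feed is in this shape, turning it into rows for a history table is a matter of walking the `message` array. The sketch below is an illustration only: it uses a hand-copied subset of the dump above instead of a live fetch, `@rows` is a hypothetical name, and formatting `epoch` as a UTC timestamp is just one possible way to get the date/time field suggested earlier in the thread.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Literal copy of the shape XMLin returned above (some fields trimmed).
my $ref = {
    message => [
        { message_id => '703986', epoch => '1202970679',
          author => 'hipowls', text => 'testing' },
        { message_id => '703987', epoch => '1202970708',
          author => 'hipowls', text => 'just ignore it' },
    ],
};

my @rows;

# ForceArray guarantees an array ref when messages exist;
# "|| []" covers a feed with no messages at all.
for my $msg ( @{ $ref->{message} || [] } ) {
    my $stamp = strftime( '%Y-%m-%d %H:%M:%S', gmtime( $msg->{epoch} ) );
    push @rows, [ $msg->{message_id}, $msg->{author}, $stamp, $msg->{text} ];
}

print join( ' | ', @$_ ), "\n" for @rows;
```

Each element of `@rows` could then be passed to a prepared insert statement with matching placeholders.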

Re: Perl, SQLite3, and Parsing the Chatterbox Feed.
by holli (Monsignor) on Feb 14, 2008 at 10:27 UTC
    working in "production":
    #!/usr/bin/perl
    use lib qw( /mnt/web4/10/47/51683347/htdocs/lib/site_perl/5.8.5 );
    use warnings;
    use strict;
    use DBI;
    use WWW::Mechanize;
    use XML::Simple;

    my ($sth, $dbh, $xml);
    my $messages = [];
    my $mech = WWW::Mechanize->new();

    while (1) {
        my $resp = $mech->get( 'http://www.perlmonks.org/index.pl?node_id=207304' );
        if ( $resp->is_success ) {
            my $xml = $resp->content;
            my $jatter = XMLin( $xml, ForceArray => ['message'] );
            if ( $jatter->{info}->{count} > 0 ) {
                print STDERR "adding ", scalar @{$jatter->{message}}, "\n";
                unless ( $dbh ) {
                    $dbh = DBI->connect("DBI:mysql:database=DB354211;host=rdbms.strato.de", 'U354211', 'pw354211');
                    $sth = $dbh->prepare('INSERT INTO pmf_jatterboxx (user_id, author, epoch, message_id, message) VALUES (?, ?, ?, ?, ?)');
                }
                for ( @{$jatter->{message}} ) {
                    $sth->execute( $_->{user_id}, $_->{author}, $_->{epoch}, $_->{message_id}, $_->{text} );
                }
            }
            else {
                print STDERR "snooze\n";
            }
        }
        sleep(5);
    }
    note: It is not obvious, but the chatterbox feed somehow notices the caller and returns only the chat lines that are new, even without being passed a date flag or anything similar. I am curious how that works.
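
Since it is unclear how the feed decides which lines are new, a client may not want to rely on it. A minimal sketch of a client-side guard, assuming `message_id` is unique per chat line: `new_messages` and `%seen` are hypothetical names, and the two polls below use made-up overlapping data rather than real feed responses.

```perl
use strict;
use warnings;

my %seen;    # message_id => count, for every line already handled

# Return only the messages not seen in an earlier poll.
sub new_messages {
    my ($messages) = @_;
    return grep { !$seen{ $_->{message_id} }++ } @$messages;
}

# Two hypothetical polls whose results overlap on message 703987.
my @first  = new_messages( [ { message_id => '703986' }, { message_id => '703987' } ] );
my @second = new_messages( [ { message_id => '703987' }, { message_id => '703988' } ] );

print scalar(@first), " new, then ", scalar(@second), " new\n";
```

A long-running poller would need to prune `%seen` now and then, or let a unique index on `message_id` in the database do the same job.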


    holli, /regexed monk/
