Hello experts!
I am trying to extract data from the web using a web API. The query result I want to extract data from is in RDF/XML format, which cannot be uploaded here, so I put it on Google Docs. Please access the file through this link: https://docs.google.com/document/d/12sQnToF4Vzr3lKl5oxyEVwggCEEMKmeWIcnvBxUJR5g/edit?hl=en_US&authkey=CLfQkZUB
I am trying to extract the title (either dc:title or prism:title), the PMID, the users (creator) with their respective tags (subject), and the authors (foaf:name) from this file using Perl. The rdf.xml file is just an example. If there is more than one user (creator), as in the file, the title, authors and PMID are repeated for every user. I want to get the title, authors and PMID only once and remove the duplicates. I am completely new to Perl programming. The Perl code is as follows:
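use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
my $fn0 = "extracted-connotea-pubmedID-1.txt";
# Input file: one URI per line
open(IN0, $fn0)
    or die "Can't open $fn0: $!\n";
# Output file, written as UTF-8
open(FH, ">:utf8", 'title_pmid_users_tags.txt')
    or die "Can't open title_pmid_users_tags.txt: $!\n";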
###
# Modules Used
###
use lib '../lib';
use WWW::Connotea;
###
# Stage 0: Supply log-in credentials and authenticate
###
###
# Collect posts for the unique URIs that are read in through the file handler.
###
my $c = WWW::Connotea->new( user => 'myusername', password => '........' );
$c->authenticate; ### dies if log-in credentials are incorrect
while (<IN0>)
{
    my $currentURI = $_;    # for each unique URI
    chomp($currentURI);
    my @tags = $c->posts_for(uri => $currentURI);    # To get the posts for the unique URI
    die "No candidate related articles\n" unless @tags;
    print FH "$currentURI\n";
    # To get the title directly from posts_for. It extracts the title from
    # the posts part of the XML file, the element <title>.
    # foreach my $tag (@tags) {
    #     print FH "Title: ";
    #     my $zoo = $tag->title;
    #     print FH $zoo;
    #     print FH "\n";
    # }
    # To get the title indirectly from posts_for, going through the
    # bookmark: just the element <dc:title>.
    # for my $tag (@tags) {
    #     print FH "title: ";
    #     my $boo = $tag->bookmark();
    #     print FH $boo->title();
    #     print FH "\n";
    # }
    # To get the title indirectly from posts_for, going through the
    # bookmark and its citation: the element <prism:title>.
    foreach my $tag (@tags) {
        print FH "title: ";
        my $boo = $tag->bookmark();
        my $zoo = $boo->citation();
        print FH $zoo->title();
        print FH "\n";
    }
    foreach my $tag (@tags) {
        print FH "PMID: ";
        my $boo = $tag->bookmark();
        my $foo = $boo->citation();
        # The regex assumes identifiers of the form "PMID: <digits>"
        for my $bar ($foo->identifiers()) {
            if ($bar =~ /PMID: (\d+)/) {
                print FH "$1\n";
            }
        }
    }
    foreach my $tag (@tags) {
        print FH "User: ";
        my $bar = $tag->user;
        if (ref($bar) eq "ARRAY") {
            foreach my $q (@$bar) {
                print FH $q, ",";
            }
        } else {
            print FH $bar, ",";
        }
        print FH "Tags: ";
        my $foo = $tag->tags;
        if (ref($foo) eq "ARRAY") {
            foreach my $p (@$foo) {
                print FH $p, ",";
            }
        } else {
            # Not an array reference: treat it as a single value and print it directly
            print FH $foo, ",";
        }
        print FH "\n";
    }
}
close IN0;
close FH;
When I run this code on the rdf.xml file I get output like the following. The title and PMID are repeated, and if I extract the authors they will be repeated too (I haven't extracted the authors yet). I just want each title, PMID and author to appear once.
http://www.ncbi.nlm.nih.gov/pubmed/17580848
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
PMID: 17580848
PMID: 17580848
User: guofengye,Tags: guofeg,
User: mblau3,Tags: pubmed,
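A common Perl idiom for this kind of de-duplication is a %seen hash. The following is only a minimal sketch and has not been run against WWW::Connotea: the @posts records and their field names (title, pmid, user, tags) are hypothetical stand-ins for the objects returned by posts_for, filled in with the values from the output above, and unique() is a helper name made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: one record per post, values copied from the
# sample output. The field names are assumptions, not the module's API.
my $title = 'Synthesis and evaluation of tripodal peptide analogues '
          . 'for cellular delivery of phosphopeptides.';
my @posts = (
    { title => $title, pmid => '17580848', user => 'guofengye', tags => ['guofeg'] },
    { title => $title, pmid => '17580848', user => 'mblau3',    tags => ['pubmed'] },
);

# Return the arguments with duplicates removed, keeping first-seen order.
# $seen{$_}++ is 0 (false) the first time a value appears, true afterwards.
sub unique {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Collect the repeated fields across all posts, then keep each value once.
my @titles = unique(map { $_->{title} } @posts);
my @pmids  = unique(map { $_->{pmid} } @posts);

print "title: $_\n" for @titles;
print "PMID: $_\n"  for @pmids;

# Users and tags differ per post, so they are still printed once per post.
for my $post (@posts) {
    print "User: $post->{user},Tags: ", join(',', @{ $post->{tags} }), ",\n";
}
```

In the real script the same idea would presumably mean collecting the titles and PMIDs into arrays inside the loop over @tags and printing the unique values after that loop, instead of printing inside it.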