PerlMonks  

Extracting Unique elements

by smandape1 (Acolyte)
on Jun 14, 2011 at 22:55 UTC [id://909665]

smandape1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Experts! I am trying to extract data from the web using a Web API. The query returns data in RDF-XML format, which cannot be uploaded here, so I have uploaded it to Google Docs; kindly access the file through this link: https://docs.google.com/document/d/12sQnToF4Vzr3lKl5oxyEVwggCEEMKmeWIcnvBxUJR5g/edit?hl=en_US&authkey=CLfQkZUB

From this file I am trying to extract the title (either dc:title or prism:title), the PMID, the users (creator) with their respective tags (subject), and the authors (foaf:name), using Perl. The rdf.xml file is just an example. When there is more than one user (creator), as in that file, the title, authors, and PMID are repeated once for each user. I want the title, authors, and PMID to appear only once, with the duplicates removed. I am completely new to Perl programming. The code is as follows:

    use strict;
    use warnings;

    ###
    # Modules used
    ###
    use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
    use lib '../lib';
    use WWW::Connotea;

    my $fn0 = "extracted-connotea-pubmedID-1.txt";
    open(IN0, $fn0) or die "Can't open $fn0: $!\n";
    open(FH, ">:utf8", 'title_pmid_users_tags.txt');

    ###
    # Stage 0: supply log-in credentials and authenticate
    ###
    my $c = WWW::Connotea->new( user => 'myusername', password => '........' );
    $c->authenticate;    # dies if the log-in credentials are incorrect

    ###
    # Collect the posts for each unique URI read from the input file
    ###
    while (<IN0>) {
        my $currentURI = $_;
        chomp($currentURI);
        my @tags = $c->posts_for(uri => $currentURI);
        die "No candidate related articles\n" unless @tags;
        print FH "$currentURI\n";

        # Alternative ways to get the title (commented out in the original):
        #   $tag->title            directly from the post, the <title> element
        #   $tag->bookmark->title  via the bookmark, the <dc:title> element

        # Title via the bookmark's citation, the <prism:title> element
        foreach my $tag (@tags) {
            print FH "title: ";
            my $boo = $tag->bookmark();
            my $zoo = $boo->citation();
            print FH $zoo->title();
            print FH "\n";
        }
        foreach my $tag (@tags) {
            print FH "PMID: ";
            my $boo = $tag->bookmark();
            my $foo = $boo->citation();
            for my $bar ($foo->identifiers()) {
                if ($bar =~ /PMID: (\d+)/) {
                    print FH "$1\n";
                }
            }
        }
        foreach my $tag (@tags) {
            print FH "User: ";
            my $bar = $tag->user;
            if (ref($bar) eq "ARRAY") {
                print FH "$_," for @$bar;
            }
            else {
                print FH "$bar,";
            }
            print FH "Tags: ";
            my $foo = $tag->tags;
            if (ref($foo) eq "ARRAY") {
                print FH "$_," for @$foo;
            }
            else {
                print FH "$foo,";
            }
            print FH "\n";
        }
    }
    close IN0;
    close FH;
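The duplication in the output comes from looping over all posts for every field: each post for the same URI shares one bookmark and citation, so the title and PMID repeat once per user. A minimal sketch of the restructured logic, using plain hashes as stand-ins for the WWW::Connotea post objects (the field names here are illustrative, not the module's real API):

```perl
use strict;
use warnings;

# Two fake "posts" for one URI, standing in for what posts_for()
# returns: the citation data (title, PMID) is shared, while the
# user and tags differ per post. These hashes are stand-ins for
# the real WWW::Connotea objects.
my @posts = (
    { title => 'Synthesis and evaluation of tripodal peptide analogues',
      pmid  => '17580848', user => 'guofengye', tags => ['guofeg'] },
    { title => 'Synthesis and evaluation of tripodal peptide analogues',
      pmid  => '17580848', user => 'mblau3', tags => ['pubmed'] },
);

# Shared citation data is printed once, from the first post only ...
my $report = "title: $posts[0]{title}\n"
           . "PMID: $posts[0]{pmid}\n";

# ... and only the per-user data, which legitimately differs,
# is printed inside the loop over all posts.
for my $post (@posts) {
    $report .= "User: $post->{user},Tags: "
             . join(',', @{ $post->{tags} }) . ",\n";
}
print $report;
```

With the real module, the same shape would mean calling `$tags[0]->bookmark->citation` once for the title and PMID, and looping over `@tags` only for the user and tag fields.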

When I run this code on the rdf.xml file I get output like the following, with the title and PMID repeated; if I extract the authors they will be repeated too (I haven't extracted them yet). I just want the title, PMID, and author information to appear once.

    http://www.ncbi.nlm.nih.gov/pubmed/17580848
    title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
    title: Synthesis and evaluation of tripodal peptide analogues for cellular delivery of phosphopeptides.
    PMID: 17580848
    PMID: 17580848
    User: guofengye,Tags: guofeg,
    User: mblau3,Tags: pubmed,

Replies are listed 'Best First'.
Re: Extracting Unique elements
by toolic (Bishop) on Jun 14, 2011 at 23:35 UTC
Re: Extracting Unique elements
by wind (Priest) on Jun 14, 2011 at 23:49 UTC

      I tried, but I am unable to use them directly. The problem is that the title gets extracted once per user because of the loop. I want to restrict the loop so that elements like the title, PMID, and the list of authors are extracted only once, and I want to do it while extracting. It seems I could remove the duplicates afterwards, but that gets messy, because some users and tags are duplicates too, and those I want to keep. Can you help, please?

        Just use a %seen hash, as demonstrated in the resource I linked you to. It will let you filter out duplicates as you go, just as easily as removing them after the fact.
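        For reference, the %seen idiom applied to this situation looks like the sketch below: each line is kept only the first time it appears, so the repeated title and PMID lines collapse to one while the distinct User lines all survive. The sample data mirrors the output posted above.

```perl
use strict;
use warnings;

# Sample output lines with the duplicates the original code produces.
my @lines = (
    'title: Synthesis and evaluation of tripodal peptide analogues',
    'title: Synthesis and evaluation of tripodal peptide analogues',
    'PMID: 17580848',
    'PMID: 17580848',
    'User: guofengye,Tags: guofeg,',
    'User: mblau3,Tags: pubmed,',
);

# %seen counts how often each line has appeared; grep keeps a line
# only on its first appearance, so the original order is preserved.
my %seen;
my @unique = grep { !$seen{$_}++ } @lines;
print "$_\n" for @unique;
```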

Node Type: perlquestion [id://909665]
Approved by toolic