http://www.perlmonks.org?node_id=546107

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Pardon my cross-posting from comp.lang.perl.misc at Google Groups. I have no experience with constructing Perl code at all, but I've been reading through PerlMonks and CPAN for most of yesterday and today. Here's my issue that I hope Perl can help me with:

I was given a very large and monotonous task at work yesterday: I have to enter 3,000+ contacts into Outlook. The information I need is found on a password protected site (I have the login and password). On the main page are links to profiles that bear the information I need.

So far, my understanding of this task, short of entering each contact manually, is to devise a script that can spider each link, extract the profiles, convert them all into an Excel or txt file, then export them somehow to Outlook, while making sure the information gets transferred into the proper fields (preferably Excel, because I'd also need to include a notes field with specific comments).
I can open a profile (.asp) and save it as .xls, and this gives me good row numbers to work with (e.g., the info I need is in rows 1-5 and then 7-12). But if I had to save each profile as its own .xls file, then I'm pretty much back where I started in monotony/redundancy land, not to mention memory-overload land.

The info on the profiles looks like this:

Company Name
Address1
Address2
Address3

Phone
Fax
Website

Employees (heading)
Name1, JobTitle
Name2, JobTitle
Name3, JobTitle

Info Section (which can go on for pages following the previous info. Info in this section is not needed.)

I have to ultimately create a vCard for each of the individuals with their job titles, all having the same address, phone, etc info for those fields.
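For illustration, the extraction step might look like the sketch below, assuming each profile has already been saved as plain text in exactly the layout shown (blank lines between the blocks, a literal "Employees" heading before the names); every field name here is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: parse one profile, already saved as plain text in the
# layout above, into a hash plus a list of [name, title] pairs.
# Assumes blank lines separate the blocks and that a literal
# "Employees" heading precedes the names; both are guesses.
sub parse_profile {
    my ($text) = @_;
    my @blocks = split /\n\s*\n/, $text;   # blank-line-separated blocks

    my %rec;
    my @addr = split /\n/, $blocks[0];
    $rec{company} = shift @addr;
    $rec{address} = join ', ', @addr;

    @rec{qw(phone fax website)} = split /\n/, $blocks[1];

    my @emp = split /\n/, $blocks[2];
    shift @emp;                            # drop the "Employees" heading
    $rec{employees} = [ map { [ split /,\s*/, $_, 2 ] } @emp ];

    return \%rec;                          # trailing Info section is ignored
}
```

Each entry in `employees` could then be paired with the shared company fields to build one vCard per person.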

Does anyone know of any code or script snippets that can help me? I am currently looking up how to even start a Perl script, with its norms and expected formats, and I have jEdit. I'm good with theory, bad with actual code-writing. Any advice to get me to the end result (3,000+ vCards that my boss wants) will be much appreciated and welcomed! I don't expect anyone else to write this application for me. I am happy to do it (somehow), but I just want to know I'm going about it appropriately.

Thanks in advance!

PM

2006-04-28 Retitled by planetscape, as per Monastery guidelines
Original title: 'Help, I think only Perl can save me'

Replies are listed 'Best First'.
Re: Extract data from website and transfer it to Outlook
by hesco (Deacon) on Apr 27, 2006 at 20:55 UTC
    It's true. Only Perl can save you.

    But if you are just getting started with the language, you'll want to look at the tutorials hosted at this site and learn how to use perldoc. Good advice above regarding starting with WWW::Mechanize to crawl your existing data. If you ultimately need vCards, don't waste your energy on MS OLE if there are other paths to nirvana, and apparently there are.

    Start your scripts with:

    #!/usr/bin/perl -wT
    use strict;
    use warnings;
    use diagnostics; # comment out for production
    Read up about Taint, and keep bringing your questions back here. You might even consider joining and getting your own login; it will make it easier to track responses to your questions, since a lot of anonymous monks around here post a lot of questions to sift through.

    -- Hugh

      Thanks Hugh!
Re: Extract data from website and transfer it to Outlook
by jdporter (Paladin) on Apr 27, 2006 at 19:51 UTC

    You could use WWW::Mechanize to scrape data from the site, then use Net::vCard to write vCard files directly. That's probably a lot easier than trying to shove the data into Outlook.
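    As a rough, hand-rolled illustration of the vCard side (Net::vCard would wrap this more robustly; its exact API isn't shown here), a minimal vCard 3.0 record can be built with core Perl alone. The field values and the crude name-splitting below are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: emit one minimal vCard 3.0 per employee, sharing the
# company's address/phone. How Outlook maps these fields is another
# question entirely (see the caveat about Outlook below).
sub make_vcard {
    my ($name, $title, $org) = @_;
    my ($first, $last) = split ' ', $name, 2;
    return join "\r\n",
        'BEGIN:VCARD',
        'VERSION:3.0',
        "N:$last;$first;;;",                 # Last;First -- crude split of the name
        "FN:$name",
        "TITLE:$title",
        "ORG:$org->{company}",
        "ADR;TYPE=WORK:;;$org->{address};;;;",
        "TEL;TYPE=WORK,VOICE:$org->{phone}",
        "URL:$org->{website}",
        'END:VCARD',
        '';
}
```

    Writing each result to its own .vcf file gives you the 3,000+ cards the boss asked for.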

    We're building the house of the future together.
      Outlook has ... interesting ... interpretations of how vCard fields map to Outlook fields. You may be better off going with a CSV file, provided you make sure you know how Outlook wants the data.

      Ivan Heffner
      Sr. Software Engineer, DAS Lead
      WhitePages.com, Inc.

        Now that you mention it... I had a dog of a time trying to export a contact from Outlook to a vCard file, and importing it back in. I never did get it to work, and gave up trying.

        We're building the house of the future together.
Re: Extract data from website and transfer it to Outlook
by crashtest (Curate) on Apr 27, 2006 at 21:27 UTC

    If my boss expected me to manually enter 3K addresses into a system, the first thing I'd do is quit. Really.

    Certainly Perl is a great tool for just this sort of thing. But if the information you're trying to access is already on some website, it's probably stored in a database. If you can get someone to give you an extract of this database, you've eliminated most of your work already. All that would be left to do is import the data into Outlook. Is this a possibility you've looked into?

    If that's not feasible, I would look at some tutorials, grab a copy of [id://merlyn]'s Learning Perl, and cobble something together with the help of the modules mentioned above.

    Hope this helps.

      As I was reading this, I was just about to post a similar comment concerning the expectation of the OP's employer that they perform this task by hand.

      Of course, before I quit I would have asked for a contractor to write the code to accomplish the task.

      Ah, the joys of having a pointy-headed manager.

        Thanks guys! You have all been really helpful and really nice! I like this site so much I've taken the good advice and registered myself.

        I've been reading up on WWW::Mechanize. Sadly, the website in question IS in JavaScript. I haven't given up yet; I'm going to spend some more time looking up as much info as I can.

        I also have acquired a .pdf version of the web data. I'm doing the best I can, trying to tackle this with straight Perl or trying to convert and extract the info from the .pdf.

        Hopefully, I will be successful.

        As to quitting...I will not be doing this forever. I will not be doing this forever. I will not be doing this forever.

        Many thanks, and of course further advice is welcomed!

        Telly
Re: Extract data from website and transfer it to Outlook
by punkish (Priest) on Apr 28, 2006 at 02:38 UTC
    I have no experience with constructing Perl code... I think only Perl can save me
    Why do you think so? I am curious. If you have no experience with Perl, what makes you think that it can save you? From my experience, assuming I take your statements at face value, you will probably make your life miserable trying to accomplish this task with Perl. Why not do it with some other programming language that you might already know?

    That said, yes, Perl definitely can assist you in this task. But Outlook fields are a mess. All kinds of esoteric information is stored in them, and you have to make sure fields map properly.

    Is the website which has the information in your control? It is probably being powered by a database. Could you wrangle access to that database? If yes, your job will be much easier. If not, read on...

    Do the following... instead of mucking around with Perl right away, launch MS Winword (assuming you have a reasonably modern copy of it). Open your website URL in Winword and suck the entire website down, traversing each link till you have all the information. Now you will have all the info on your 'puter in one big mongo file.

    Save that file as text, and scan through that looking for patterns. See if you can have Excel parse all that crap into columns.

    Actually, you can also try opening the URL directly in Excel. It supports web queries, where it tries to parse tabular info out of the HTML (if that is applicable to you).

    Truly, Perl may be the most inappropriate tool for you given that you are "currently looking up how to even start a Perl script."

    Good luck. You will need it.

    --

    when small people start casting long shadows, it is time to go to bed
      Thanks Punkish! I'm actually working on multiple avenues toward the end result of getting all these contacts into my boss's Outlook rolodex. There is a .pdf file of this information as well, and I'll be working on converting that to a text file and so forth. If I were fluent in Perl, it would help me abbreviate the 4- or 5-tiered process I'd otherwise have to go through, because the script would ideally do all of it in one pass. But the .pdf might pose its own set of problems. I'll see about this when I try to tackle it tomorrow.

      I'm really trying all avenues at once and I thought this was a good time as any to take a deeper look at Perl.
        I'm really trying all avenues at once and I thought this was a good time as any to take a deeper look at Perl.
        It probably is a good reason to take a deeper look at Perl; whether or not it is a good time is your call, depending on how much of a hurry you are in to finish the current task.

        On another list I subscribe to, a common question is "What language is the best for scripting?" This is for a GIS (mapping) application that can be scripted using many different languages. My answer is, "The best language is the one you know best."

        There is a .pdf file of this information as well and I'll be working on converting that to a text file and so forth

        That is great, esp. if the PDF is actually text and not an image. You can export the text out to RTF, save that as plain text, open it up in Excel, and format it as CSV. All that would remain would be to determine the CSV format that Outlook expects, massage your Excel file accordingly, export it to CSV, and then import to Outlook.
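        If the Excel massaging gets tedious, the CSV step can also be scripted. A minimal sketch, with the caveat that the header names below are guesses: the safe move is to export a CSV from your own copy of Outlook first and reuse its headers verbatim.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: hand-rolled CSV quoting (doubles embedded quotes).
# The header names are guesses; export a CSV from your own Outlook
# and copy its headers exactly before importing anything.
sub csv_field { my ($f) = @_; $f =~ s/"/""/g; return qq{"$f"} }
sub csv_line  { return join(',', map { csv_field($_) } @_) . "\n" }

print csv_line('First Name', 'Last Name', 'Job Title', 'Company', 'Business Phone');
print csv_line('Jane', 'Doe', 'CEO', 'Acme "Widgets" Ltd', '555-1234');
```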

        Once this task is behind you, you can start learning Perl, and then maybe even come back to this task to see how you could do it using Perl.

        --

        when small people start casting long shadows, it is time to go to bed
Re: Extract data from website and transfer it to Outlook
by Anonymous Monk on Apr 29, 2006 at 03:49 UTC
    I have done a couple of web sites like this. I did not use WWW::Mechanize, but instead LWP::Simple or LWP::UserAgent, along with HTML::TokeParser and HTML::LinkExtractor. No one showed me how to do this; I just did some reading and tried it.

    I also have no problem breaking it up into steps. As in: get the data into some format, then use my trusty text editor with grep expressions to reshape the data lines, then go back to a Perl script to put the result somewhere.

    In several cases, I looked at the URL that gets me a page of data (e.g., www.academicsFlow.edu/greatBooks.php?item=1), made a list of values to substitute into that URL, and then ran a Perl script that looped with values 1 through 99 inserted into that line, catching each result. If there are nested links I want to get, I might write a loop that uses the simple mode of HTML::LinkExtractor, iterating from one HREF to the next and saving the results in sequentially named text files.
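    The substitute-values-into-the-URL loop might be sketched like this; the URL is the example pattern from the post (with an assumed http:// scheme), and the actual fetch is left commented out because the real site sits behind a login that WWW::Mechanize or LWP cookie handling would need to deal with first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: generate each profile URL from the example pattern.
# The fetch loop is commented out; the real site needs a login first.
sub make_url { return sprintf 'http://www.academicsFlow.edu/greatBooks.php?item=%d', $_[0] }

# use LWP::Simple qw(getstore);
# for my $i (1 .. 99) {
#     getstore( make_url($i), sprintf('profile_%03d.html', $i) );
# }
```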

    Then I run through the text files: open each one and run a simple state-machine loop that has a bunch of hard-coded knowledge of the page format, looking for and extracting the data fields. Not generalized, not pretty, but it works for the task. I write out each of the values in, say, a tab-delimited format I have defined.
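    A minimal sketch of such a state-machine loop, using the profile layout from the original post; the "Employees" and "Info Section" markers are assumptions standing in for whatever the real pages contain:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: a hard-coded state machine over one saved page.
# Header lines pass through; employee lines become "Name<TAB>Title";
# everything from "Info Section" on is skipped.
sub extract_tsv {
    my ($fh) = @_;
    my $state = 'header';
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line eq 'Employees') { $state = 'employees'; next; }
        last if $line eq 'Info Section';   # rest of the page is not needed
        next if $line eq '';
        if ($state eq 'header') {
            push @out, $line;                               # company, address, phone...
        }
        else {
            push @out, join "\t", split(/,\s*/, $line, 2);  # "Name<TAB>JobTitle"
        }
    }
    return @out;
}
```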

    Lastly, I read the tab-delimited file into whatever the target DB is, like Outlook I suppose. Though I have not gone there :-)

    hth