PerlMonks  

Perl version of 'diff'?

by ninja_byte (Acolyte)
on Jul 19, 2004 at 22:16 UTC ( #375728=perlquestion )
ninja_byte has asked for the wisdom of the Perl Monks concerning the following question:

I have a database filled with about 200,000+ records. Each one has information - domain name, username, host.
This is *not* a dynamic database - I have to manually go through each one of the hosts, extract a master domain list, and parse it from there...
The initial challenge was to get all of the information into a database - that has been handled (albeit crudely).

Now I'm faced with the task of keeping it updated every few days/weeks...
So let's say I have 'host01-original.txt', and I can get 'host01-updated.txt' at any given time. Any suggestions as to the best way of finding only the new additions to the file?

The brute force method of a huge 'grep' loop seems a bit distasteful at present.

The file is currently in the format:
domainname.com:username

Thanks!

Re: Perl version of 'diff'?
by borisz (Canon) on Jul 19, 2004 at 22:23 UTC
    I would put the data into a real database, such as PostgreSQL; perhaps SQLite is enough.
    Boris
Re: Perl version of 'diff'?
by PERLscienceman (Curate) on Jul 19, 2004 at 22:32 UTC
    Greetings Fellow Monk!
    You may want to have a look at the following module on CPAN: Text::Diff. Its description states that it 'performs diffs on files and record sets ... providing a basic set of services akin to GNU diff'.
    I believe that this may be the very nearest thing to a "Perl Version of 'diff'".
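
    A minimal sketch of how Text::Diff might be used here, assuming the module is installed from CPAN. It relies on the module's exported diff() function, which accepts file names or references to strings and produces unified output by default. The added_lines helper is hypothetical, not part of the module. Note that a line-based diff reports a moved line as an addition, so this suits files whose order is stable.

```perl
use strict;
use warnings;
use Text::Diff;    # CPAN module, not part of core Perl

# Hypothetical helper: return the lines that a unified diff marks as added.
# $old and $new may be file names or references to strings.
sub added_lines {
    my ($old, $new) = @_;
    my $diff = diff $old, $new;    # unified format is the default style
    return map  { substr $_, 1 }               # strip the leading '+'
           grep { /^\+/ && !/^\+\+\+ / }       # keep additions, skip the file header
           split /\n/, $diff;
}

# Usage with the file names from the question:
# print "$_\n" for added_lines('host01-original.txt', 'host01-updated.txt');
```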
Re: Perl version of 'diff'?
by Joost (Canon) on Jul 19, 2004 at 22:35 UTC
Re: Perl version of 'diff'?
by runrig (Abbot) on Jul 19, 2004 at 23:18 UTC
    If both files are sorted, and you just want the new lines, then just:
    comm -13 host01-original.txt host01-updated.txt
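
    If the files are not sorted, or a pure-Perl route is preferred, roughly the same result can be had with a hash lookup. This is only a sketch: new_entries and read_lines are illustrative names, and unlike comm it also drops repeated lines in the updated file.

```perl
use strict;
use warnings;

# Return the entries of @$new that do not appear in @$old,
# keeping their order in @$new and dropping repeats.
sub new_entries {
    my ($old, $new) = @_;
    my %seen = map { $_ => 1 } @$old;
    return grep { !$seen{$_}++ } @$new;
}

# Illustrative helper: read a file into a list of chomped lines.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# Usage with the file names from the question:
# print "$_\n" for new_entries(read_lines('host01-original.txt'),
#                              read_lines('host01-updated.txt'));
```

    Memory use grows with the size of the original file, which should be manageable for a few hundred thousand short lines.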
Re: Perl version of 'diff'?
by tachyon (Chancellor) on Jul 20, 2004 at 00:21 UTC

    If you move to a real DB (or even with a slight modification to the current format) there is a useful trick - add an update time field. Unix epoch time is fine. Thus your master database looks like:

    data_field1 data_field2 ... update_time

    With the update time to hand, it is a simple matter to select all the records with an update time > some_value and thus generate a record set that encapsulates all the latest changes, valid from any fixed time point.
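
    As a rough illustration of that selection on a flat file, assume the current format is extended to domain:username:update_time, with the time in epoch seconds. The layout, the changed_since name, and the sample values are all assumptions made for this sketch.

```perl
use strict;
use warnings;

# Assumed record layout: domain:username:update_time (epoch seconds).
# Return the records whose update_time is later than $cutoff.
sub changed_since {
    my ($cutoff, @records) = @_;
    return grep {
        my $updated = (split /:/, $_)[2];
        defined $updated && $updated > $cutoff;
    } @records;
}

# Made-up sample data for illustration only.
my @records = (
    'example.com:alice:1090200000',
    'example.org:bob:1090300000',
);
print "$_\n" for changed_since(1090250000, @records);
```

    In a real database the same idea is a WHERE clause on the update_time column.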

    It looks to me as though your data storage format is going to give you issues but this depends on what you are actually doing. There are essentially 3 types of data changes you may need to deal with:

    1. INSERTS - adding a brand new entry
    2. DELETIONS - removing dross
    3. UPDATES - i.e. what if username@domain.com changes their username? You either need to UPDATE their record or delete the old record and insert the new one.
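
    The three cases can be told apart by comparing two snapshots of the file. The sketch below assumes each domain appears at most once per file, so the domain can serve as the key; parse_snapshot and classify_changes are illustrative names.

```perl
use strict;
use warnings;

# Parse "domain:username" lines into a hash keyed by domain.
# Assumes each domain appears at most once per snapshot.
sub parse_snapshot {
    my %user_for;
    for my $line (@_) {
        my ($domain, $user) = split /:/, $line, 2;
        $user_for{$domain} = $user;
    }
    return \%user_for;
}

# Compare two snapshots and sort the differing domains into three bins.
sub classify_changes {
    my ($old, $new) = @_;
    my (@inserts, @updates);
    for my $domain (sort keys %$new) {
        if (!exists $old->{$domain}) {
            push @inserts, $domain;            # brand new entry
        }
        elsif ($old->{$domain} ne $new->{$domain}) {
            push @updates, $domain;            # username changed
        }
    }
    my @deletions = grep { !exists $new->{$_} } sort keys %$old;
    return {
        inserts   => \@inserts,
        deletions => \@deletions,
        updates   => \@updates,
    };
}
```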

    cheers

    tachyon

Re: Perl version of 'diff'?
by cLive ;-) (Parson) on Jul 30, 2004 at 19:58 UTC
    How about I add a timestamp field to the relevant table and show you the SOAP interface to the DB :)

    cLive ;-)

    Err, for those of you wondering, Ninja_Byte is sitting about 10yds away from me :)

      That'd be cool, but it's supposed to be a master list spanning all types of servers (*nixes, Windows, jails). I have to use a variety of methods to get the list, then grep and awk my way into a standardized text file format.

      From there it's another deal to get it into SQL format.

      From there, the updates. The data is supposed to overlap a bit, with the same person showing up on multiple hosts. I'm going to play with the aforementioned modules and see if I can avoid reinventing the wheel.

      "efficiency via the work of others..."
