PerlMonks  

mapping coordinates- suggestion needed

by baxy77bax (Deacon)
on Oct 14, 2010 at 16:05 UTC (#865294=perlquestion)
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi ppl, I need a clever suggestion for the following problem. What I have is a set of coordinates with some, let's say, user names associated with them:
start,stop
22,25   >uname1
344,360 >uname2
433,540 >uname3
432,532 >uname4
On the other side I have another set of coordinates:
start,stop
21,23   >job_id1
255,345 >job_id2
345,355 >job_id3
356,366 >job_id4
What I'm trying to figure out is which users ran which specific job at specific intervals, so I'm trying to map job_id intervals to uname intervals. The rule is that even if only part of a job_id interval crosses a uname interval, the pair should be reported. The thing is, there are over 20 million such intervals in each group (and intervals overlap), the allowed interval range is the same in both cases, and coordinates span from 1 to 20 million.

Now, what I was thinking is to use the Bit::Vector library to create a vector field, map both sets of coordinates onto that vector space, then see where they overlap and just remove the non-overlapping fields. But then it hit me: how would I track down which unames and job_ids those overlaps belong to? Then I thought about hashing. But how would I find a coordinate among my hash keys that is less than X?

find: my $uname = $hash{ >= $start and <= $stop }   # ??????? (not real Perl)
I mean, I would need to sort the hash keys and then loop through them to find keys that are >= some start key (id) and <= some stop key (id).

And now I'm stuck and crying to you for help.

So let me summarize my problem: I need to map, if possible, job_id intervals to uname intervals and preserve the >uname1 / >job_id2 tags. Keep in mind that these datasets have piled up over the years and are quite large, so a simple loop within a loop would not be a good solution.
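One common way to avoid the loop within a loop is to sort both sets by start coordinate and sweep through them in tandem. The sketch below is not from the post; the only things taken from it are the sample intervals, so treat this as an illustration of the idea rather than a finished solution:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy data in the shape of the post: [start, stop, tag].
my @unames = ( [22,25,'uname1'], [344,360,'uname2'],
               [433,540,'uname3'], [432,532,'uname4'] );
my @jobs   = ( [21,23,'job_id1'], [255,345,'job_id2'],
               [345,355,'job_id3'], [356,366,'job_id4'] );

# Sort both lists by start coordinate.
@unames = sort { $a->[0] <=> $b->[0] } @unames;
@jobs   = sort { $a->[0] <=> $b->[0] } @jobs;

my @report;
my $lo = 0;    # leftmost uname that can still overlap anything
for my $job (@jobs) {
    my ($jstart, $jstop, $jtag) = @$job;
    # Unames that ended before this job started can never match a
    # later job either, because jobs are sorted by start coordinate.
    $lo++ while $lo < @unames and $unames[$lo][1] < $jstart;
    # Scan forward until unames start after this job stops.
    for (my $i = $lo; $i < @unames and $unames[$i][0] <= $jstop; $i++) {
        next if $unames[$i][1] < $jstart;    # no actual overlap
        push @report, "$jtag overlaps $unames[$i][2]";
    }
}
print "$_\n" for @report;
```

Sorting costs O(n log n) and the sweep then only revisits unames that can still overlap the current job. If nearly every interval overlaps nearly every other, the output itself is quadratic in size, so no algorithm can do better in that degenerate case.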

thank you

baxy

PS

The max for the coordinates in both cases is 20000000.

PPS

to moritz

This is a fraction of the real data set, but don't worry about that; the dataset is, as I said, large, and I cannot hand-pick representative real data to illustrate the problem.

These correspond to the job lines:

14230157,14230182,3445:7:3:707:620
3437306,3439308,3445:7:3:990:634
14593103,14593128,3445:7:3:537:287
16948765,16948768,127305:7:3:49:800
12044820,12044845,127303:7:3:686:44
11310494,11310519,127340:7:3:67:320
19408728,19408753,127438:7:3:508:614
17007683,17007685,127439:7:3:481:403

and these are unames:

16820359,16821584,5:7:3:1:5
17979480,17999505,4:7:3:948:200
12491787,14491812,4:7:3:784:575
17389967,18389969,34:7:3:617:920
11671837,19671839,34:7:3:516:921
As I said, this is probably not a good example for illustrating the problem, so please do refer to the example above :)

Replies are listed 'Best First'.
Re: mapping coordinates- suggestion needed
by moritz (Cardinal) on Oct 14, 2010 at 16:32 UTC
    Please see Optimizing a double loop for inspiration - it discusses a slightly different problem, but similar techniques could be used for solving your problem.

    It might help if you included a sample of real data, which would make it easier for us to get a feel for the characteristics of the data.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: mapping coordinates- suggestion needed
by jethro (Monsignor) on Oct 14, 2010 at 17:31 UTC

    Divide and conquer. Create 10000 files. In file 0, write all job_ids that lie partly or completely between 0 and 1999; in file 1, the range 2000 to 3999; and so on.

    After this, for any uname_id you would have to check only about 2000 items, and seldom more, when it overlaps into more than one range (in that case you have to check for multiple finds of the same range, but that check is easy with a result hash).

    If you have to check many uname_ids at a time, you should have the uname_ids sorted. That way you need basically the same job-range (which fits easily into memory) for many consecutive checks. In that case even some preprocessing might make sense, where you create a hash with the 2000 numbers of the range as key and an array of job-ranges that include this number as data.

    Naturally the 10000 files is just one possibility; maybe 1000 files with a range of 20000 numbers, or something in between, might make more sense.
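    The divide-and-conquer scheme above could be sketched like this, with in-memory arrays standing in for the files; the bucket width of 2000 and the helper name overlapping_jobs are illustrative assumptions, not from the post:

```perl
use strict;
use warnings;

my $WIDTH = 2000;   # bucket width, matching the 0..1999 ranges above

my @jobs = ( [21,23,'job_id1'], [255,345,'job_id2'],
             [345,355,'job_id3'], [356,366,'job_id4'] );

# Put every job into each bucket its interval touches; in the real
# setup each bucket would be a file instead of an in-memory array.
my %bucket;
for my $job (@jobs) {
    my ($start, $stop) = @$job[0,1];
    for my $b (int($start/$WIDTH) .. int($stop/$WIDTH)) {
        push @{ $bucket{$b} }, $job;
    }
}

# To check one uname interval, look only at the buckets it touches,
# using a hash to suppress duplicate hits from multi-bucket jobs.
sub overlapping_jobs {
    my ($start, $stop) = @_;
    my %seen;
    for my $b (int($start/$WIDTH) .. int($stop/$WIDTH)) {
        for my $job (@{ $bucket{$b} || [] }) {
            next if $job->[1] < $start or $job->[0] > $stop;
            $seen{ $job->[2] } = 1;
        }
    }
    return sort keys %seen;
}

print join(',', overlapping_jobs(344, 360)), "\n";   # uname2's interval
```

    With sorted uname_ids, consecutive calls hit the same few buckets, which is exactly the locality the reply is after.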

Re: mapping coordinates- suggestion needed
by BrowserUk (Pope) on Oct 14, 2010 at 16:30 UTC

    What is the maximum of the coordinates?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: mapping coordinates- suggestion needed
by BrowserUk (Pope) on Oct 14, 2010 at 16:49 UTC
    PS:max for the coordinates in both cases is 20000000

    Just to confirm: you have two datasets of 20e6 items each, where each item is a pair of coordinates, and each coordinate pair can range from '1,2 >tag' through '19999999,20000000 >tag'?

      Yes, that is exactly the case, where the ">tag" is something that should be connected to the coordinate, so that after mapping I can say tagX is connected to the coordinate associated with unameY.

      And the coordinate does not necessarily need to be two consecutive numbers, but it can be.

        Let's see. If the "coordinates" of the time intervals are in seconds, 20e6 covers about 8 months, which doesn't really tally with your "those datasets have piled up over the years".

        But if they were in minutes, then it represents 39 years. Why would anyone care what user ran what job 38 years ago? They've an above average chance of being dead already.

        And then there is the "rules are that even if only part of the job_id_interval crosses the uname interval, this should be reported" bit. How can a userid be responsible for a particular job if they didn't log on until the last (second|minute|lesser known time unit) of the job's life?

        And then there's that coincidence between there being 20e6 logons, 20e6 jobs, and 20e6 intervals during which those logons and job runs occurred?


Re: mapping coordinates- suggestion needed
by jandrew (Chaplain) on Oct 15, 2010 at 03:38 UTC
    Have you considered loading the two tables into a database and running DBI with the following SQL embedded? I used the test data for two tables, one containing Jobs and one containing Users, with the table headers being Start, Stop, and User or Job.
    SELECT Jobs.Job, Users.User, Count(Users.User) AS CountOfUser
    FROM Jobs LEFT JOIN Users
      ON ( ( Jobs.Start  <= Users.Start and Jobs.Stop  >= Users.Start )
        or ( Jobs.Start  <= Users.Stop  and Jobs.Stop  >= Users.Stop  )
        or ( Users.Start <= Jobs.Start  and Users.Stop >= Jobs.Start  )
        or ( Users.Start <= Jobs.Stop   and Users.Stop >= Jobs.Stop   ) )
    GROUP BY Jobs.Job, Users.User
    HAVING Count(Users.User) = 1;
    I haven't put this into DBI syntax; I just drummed it up quickly in Access, but the SELECT .. FROM .. syntax should copy over. My results on your test data are:
    Job      User    CountOfUser
    job_id1  uname1  1
    job_id2  uname2  1
    job_id3  uname2  1
    job_id4  uname2  1
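    For reference, the database approach above might look like this in DBI syntax. DBD::SQLite and an in-memory database are assumptions for the sketch, and it uses a plain JOIN with the standard overlap test "each interval starts before the other ends" in place of the four OR'd clauses and the LEFT JOIN / HAVING form:

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; DBD::SQLite is assumed to be installed.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE Jobs  (Start INTEGER, Stop INTEGER, Job  TEXT)');
$dbh->do('CREATE TABLE Users (Start INTEGER, Stop INTEGER, User TEXT)');

my $ins = $dbh->prepare('INSERT INTO Jobs VALUES (?,?,?)');
$ins->execute(@$_) for [21,23,'job_id1'], [255,345,'job_id2'],
                       [345,355,'job_id3'], [356,366,'job_id4'];
$ins = $dbh->prepare('INSERT INTO Users VALUES (?,?,?)');
$ins->execute(@$_) for [22,25,'uname1'], [344,360,'uname2'],
                       [433,540,'uname3'], [432,532,'uname4'];

# Two intervals overlap iff each one starts before the other ends;
# this single condition covers partial and complete overlap alike.
my $rows = $dbh->selectall_arrayref(
    'SELECT Jobs.Job, Users.User FROM Jobs JOIN Users
       ON Jobs.Start <= Users.Stop AND Users.Start <= Jobs.Stop
     ORDER BY Jobs.Job, Users.User');
print "$_->[0]\t$_->[1]\n" for @$rows;
```

    On the post's test data this produces the same four Job/User pairs as the table above. Whether 20e6 x 20e6 rows would join in reasonable time without an index on the coordinates is another question.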

Node Type: perlquestion [id://865294]
Approved by Old_Gray_Bear
Front-paged by moritz