PerlMonks  

Re: CSV Cross Referencing

by KurtSchwind (Chaplain)
on Dec 03, 2014 at 14:55 UTC ( [id://1109121] )


in reply to CSV Cross Referencing

So, it looks like you have some nice answers as to how to do a lookup/join across two delimited files. So I won't enter that part of the discussion.

Instead let's focus on the next part: instead of a straight join, it appears you really need a distance function. Depending on how many entries you have, you could solve this in a number of ways. The brute-force method is to use the following formula: sqrt((lat1-lat2)^2 + (long1-long2)^2). That'll give you a course distance. (Course, the earth is round, not flat, and there are better formulas, but I think that this will be good enough for our discussion.)
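For what it's worth, here is a minimal Perl sketch of that formula. The sub name `flat_distance` and the sample numbers are mine, not from the original question. One thing to watch: in Perl the `^` operator is bitwise XOR, so the squares have to be written with `**`.

```perl
use strict;
use warnings;

# Flat-plane distance between two points, measured in degrees.
# (Hypothetical helper name; treats lat/long as plain x/y coordinates.)
sub flat_distance {
    my ($lat1, $long1, $lat2, $long2) = @_;
    return sqrt( ($lat1 - $lat2)**2 + ($long1 - $long2)**2 );
}

# A 3-4-5 triangle makes the arithmetic easy to check by eye.
printf "%.1f\n", flat_distance(0, 0, 3, 4);   # prints 5.0
```

The result is in degrees, not kilometres, so any threshold you compare it against has to be in degrees too.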

So you'll want to compute the distance to each entry and pick the smallest (note that an exact match would equal 0). Compare that distance to some $threshold value and see if it's "close enough". You'll notice that this method is very different from joining or doing lookups per se. It's also possible you could sort/store your file so that a lot of GPS locations can be filtered out immediately because they are so far off.
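A rough sketch of that "closest entry, then check against $threshold" step might look like the following. The record layout (`name`, `lat`, `lon` keys), the sub name `nearest_within` and the sample data are all assumptions for illustration; adapt them to whatever your CSV parsing gives you.

```perl
use strict;
use warnings;

# Return the candidate closest to ($lat, $lon), or undef when even the
# closest one is farther away than $threshold (all values in degrees).
sub nearest_within {
    my ($lat, $lon, $candidates, $threshold) = @_;
    my ($best, $best_dist);
    for my $c (@$candidates) {
        my $d = sqrt( ($lat - $c->{lat})**2 + ($lon - $c->{lon})**2 );
        ($best, $best_dist) = ($c, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return (defined $best_dist && $best_dist <= $threshold) ? $best : undef;
}

# Made-up sample data, standing in for rows from the second file.
my @sites = (
    { name => 'A', lat => 40.00, lon => -75.00 },
    { name => 'B', lat => 40.10, lon => -75.20 },
);
my $hit = nearest_within(40.01, -75.01, \@sites, 0.05);
print $hit ? $hit->{name} : 'no match', "\n";
```

This is the brute-force version: every lookup walks the whole candidate list, which is fine for small files and painful for large ones.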

--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

Replies are listed 'Best First'.
Re^2: CSV Cross Referencing
by SuicideJunkie (Vicar) on Dec 03, 2014 at 15:33 UTC

    Did you mean 'coarse'?

    As a note: there is no need to sqrt(...) for all N^2 of the distances. Instead, simply square your threshold and compare against the squared distances; squaring is a much cheaper operation, and it only needs to be done once.
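To illustrate the squared-threshold idea, a small sketch (the sub names `dist_sq` and `within_threshold` and the 0.05 cutoff are hypothetical):

```perl
use strict;
use warnings;

# Squared flat-plane distance. It orders points the same way the real
# distance does, so the sqrt can be skipped entirely for comparisons.
sub dist_sq {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    return ($lat1 - $lat2)**2 + ($lon1 - $lon2)**2;
}

# Compare against a threshold that the caller has already squared.
sub within_threshold {
    my ($lat1, $lon1, $lat2, $lon2, $threshold_sq) = @_;
    return dist_sq($lat1, $lon1, $lat2, $lon2) <= $threshold_sq;
}

my $threshold    = 0.05;              # hypothetical cutoff, in degrees
my $threshold_sq = $threshold ** 2;   # squared once, outside any loop

print within_threshold(40.00, -75.00, 40.03, -75.03, $threshold_sq)
    ? "close enough\n" : "too far\n";
```

This works because both sides are non-negative, so d <= t holds exactly when d^2 <= t^2.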

    If you have a large number of coordinates, you could also break the world up into a grid of buckets. Each entry then only needs to check distance to the entries in the nearest four adjacent buckets. Use a 2D hash of buckets since most buckets will be unused/empty out in the countryside.

      Cheaper still (which Tux did in his code): just compare the Lat and Lon separately. You only need to calculate the distance (or square of distance) when there are multiple matches.

        That's essentially what the 2d hash does.

        $buckets->{latitude}{longitude}. You're only checking things with a similar latitude, and among those, only the things with a similar longitude. The upside is you don't need to loop. Instead of doing a latitude compare against everything, you immediately get, in O(1), the short list of things of the same latitude. Then instead of doing a compare against the longitudes of everything remaining, you immediately have the short list of things that match both.
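A sketch of the 2D hash of buckets discussed in this subthread, under some assumptions of mine: the bucket size `$cell` is a made-up 0.05 degrees, the record keys (`name`, `lat`, `lon`) and sub names are hypothetical, and the lookup scans the point's own bucket plus its eight neighbours rather than four, which is the conservative choice when the bucket size is about the same as the threshold.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $cell = 0.05;   # hypothetical bucket size, in degrees

# Map a coordinate to an integer bucket index (floor copes with negatives).
sub bucket_of { return floor($_[0] / $cell) }

# Build $buckets->{lat_bucket}{lon_bucket} = [ entries ... ].
# Only buckets that actually contain entries ever get created.
sub build_buckets {
    my (@entries) = @_;
    my %buckets;
    for my $e (@entries) {
        push @{ $buckets{ bucket_of($e->{lat}) }{ bucket_of($e->{lon}) } }, $e;
    }
    return \%buckets;
}

# Collect the entries in the point's own bucket and the eight around it.
sub candidates_near {
    my ($buckets, $lat, $lon) = @_;
    my ($bi, $bj) = (bucket_of($lat), bucket_of($lon));
    my @found;
    for my $i ($bi - 1 .. $bi + 1) {
        next unless exists $buckets->{$i};   # don't autovivify empty rows
        for my $j ($bj - 1 .. $bj + 1) {
            push @found, @{ $buckets->{$i}{$j} || [] };
        }
    }
    return @found;
}

my $buckets = build_buckets(
    { name => 'A', lat => 40.02, lon => -75.03 },
    { name => 'B', lat => 45.00, lon => -70.00 },
);
print join(',', map { $_->{name} } candidates_near($buckets, 40.01, -75.01)), "\n";
```

The short list that comes back still needs a real distance (or squared distance) check against the threshold; the buckets only cut down how many entries get that check.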
