Re: CSV Cross Referencing

So, it looks like you already have some nice answers on how to do a lookup/join across two delimited files, so I won't enter that part of the discussion.

Instead, let's focus on the next part: rather than a straight join, it appears you really need a distance function. Depending on how many entries you have, you could solve this in a number of ways. The brute-force method is to use the formula `sqrt((lat1-lat2)^2 + (long1-long2)^2)`. That'll give you a course distance. (Of course, the earth is round, not flat, and there are better formulas, but I think this will be good enough for our discussion.)
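A minimal sketch of that formula in Perl, in case it helps. The sub name `flat_distance` and the sample coordinates are made up for illustration. Note that `^` in the formula above is math notation; in Perl `^` is bitwise XOR, so the code uses `**`. The result is in degrees, not kilometers.

```perl
use strict;
use warnings;

# Flat-plane approximation: treats latitude/longitude as plain x/y
# coordinates, so the result is in degrees and ignores the earth's curvature.
sub flat_distance {
    my ($lat1, $long1, $lat2, $long2) = @_;
    return sqrt( ($lat1 - $lat2)**2 + ($long1 - $long2)**2 );
}

# Hypothetical sample points.
printf "%.4f\n", flat_distance(52.37, 4.89, 52.40, 4.93);
```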

So you'll want to compute the distance to each entry and pick the smallest (note that an exact match would equal 0). Compare that distance to some $threshold value to see if it's "close enough". You'll notice that this method is very different from joining or doing lookups per se. It's also possible you could sort/store your file in a way that lets you filter out a lot of GPS locations immediately because they are so far off.
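Roughly, the "pick the smallest, then check against $threshold" step could look like the sketch below. The `nearest` sub, the `[name, lat, long]` record layout, and the sample data are all assumptions of mine, not anything from the original files.

```perl
use strict;
use warnings;

# Hypothetical reference entries: [name, lat, long]
my @known = (
    [ 'depot',  52.370, 4.890 ],
    [ 'office', 52.090, 5.120 ],
    [ 'yard',   51.920, 4.480 ],
);

# Returns (entry, distance) for the closest entry, or an empty list
# when even the closest one is farther away than $threshold.
sub nearest {
    my ($lat, $long, $entries, $threshold) = @_;
    my ($best, $best_dist);
    for my $e (@$entries) {
        my $d = sqrt( ($lat - $e->[1])**2 + ($long - $e->[2])**2 );
        ($best, $best_dist) = ($e, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return unless defined $best_dist && $best_dist <= $threshold;
    return ($best, $best_dist);
}

my $threshold = 0.01;    # in degrees; pick whatever "close enough" means for you
if ( my ($hit, $dist) = nearest(52.371, 4.892, \@known, $threshold) ) {
    printf "matched %s at distance %.4f\n", $hit->[0], $dist;
}
else {
    print "no match within threshold\n";
}
```

This is still a full scan per lookup, which is fine for small files.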

--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

Re^2: CSV Cross Referencing by SuicideJunkie (Vicar) on Dec 03, 2014 at 15:33 UTC
Did you mean 'coarse'? As a note, there is no need to `sqrt(...)` for all N^2 of the distances. Instead, simply square your threshold and compare it against the squared distances; that is a much cheaper operation and only needs to be done once. If you have a large number of coordinates, you could also break the world up into a grid of buckets. Each entry then only needs to check its distance to the entries in the nearest four adjacent buckets. Use a 2D hash of buckets, since most buckets will be unused/empty out in the countryside.
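The squared-threshold trick might be sketched like this. The names `dist_squared`, `@file_a` and `@file_b`, and the sample points, are hypothetical.

```perl
use strict;
use warnings;

# Squared distance: same ordering as the real distance, but no sqrt().
sub dist_squared {
    my ($lat1, $long1, $lat2, $long2) = @_;
    return ($lat1 - $lat2)**2 + ($long1 - $long2)**2;
}

my $threshold    = 0.01;
my $threshold_sq = $threshold**2;    # squared once, outside the loops

# Hypothetical points from the two files: [lat, long]
my @file_a = ( [ 52.371, 4.892 ], [ 40.000, 3.000 ] );
my @file_b = ( [ 52.370, 4.890 ], [ 51.920, 4.480 ] );

my @pairs;
for my $p (@file_a) {
    for my $q (@file_b) {
        push @pairs, [ $p, $q ]
            if dist_squared(@$p, @$q) <= $threshold_sq;
    }
}
printf "%d pair(s) within threshold\n", scalar @pairs;
```

Since both sides of the comparison are non-negative, comparing squares gives the same answer as comparing the distances themselves.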
Re^3: CSV Cross Referencing by RonW (Parson) on Dec 03, 2014 at 18:22 UTC
Still cheaper (which Tux did in his code): just compare the Lat and Lon separately. You only need to calculate the distance (or the square of the distance) when there are multiple matches.
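One way to read that, as a sketch only (this is not Tux's code, and the sub names and `[lat, long, name]` layout are my own assumptions): filter with two cheap `abs()` comparisons, and fall back to squared distances only to break ties.

```perl
use strict;
use warnings;

# Keep only entries whose lat AND long are each within $threshold.
# Note this is a square box, so corner hits can be up to
# $threshold * sqrt(2) away.
sub candidates_in_box {
    my ($lat, $long, $entries, $threshold) = @_;
    return grep {
        abs($lat - $_->[0]) <= $threshold
            && abs($long - $_->[1]) <= $threshold
    } @$entries;
}

# Returns the single best entry, or undef if nothing is in the box.
sub closest_in_box {
    my ($lat, $long, $entries, $threshold) = @_;
    my @hits = candidates_in_box($lat, $long, $entries, $threshold);
    return $hits[0] if @hits <= 1;    # zero or one hit: no distance math needed
    my ($best) = sort {
        ($lat - $a->[0])**2 + ($long - $a->[1])**2
            <=> ($lat - $b->[0])**2 + ($long - $b->[1])**2
    } @hits;
    return $best;
}

# Hypothetical entries: [lat, long, name]
my @known = (
    [ 52.370, 4.890, 'depot' ],
    [ 52.375, 4.895, 'gate' ],
    [ 51.920, 4.480, 'yard' ],
);
my $hit = closest_in_box(52.371, 4.892, \@known, 0.01);
print $hit ? "matched $hit->[2]\n" : "no match\n";
```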
Re^4: CSV Cross Referencing by SuicideJunkie (Vicar) on Dec 03, 2014 at 18:43 UTC
That's essentially what the 2D hash does: `$buckets->{latitude}{longitude}`. You're only checking things with a similar latitude, and among those, only the things with a similar longitude. The upside is you don't need to loop. Instead of doing a latitude compare against everything, you immediately have, in O(1), the short list of things of the same latitude. Then, instead of comparing against the longitudes of everything remaining, you immediately have the short list of things that match both.
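A rough sketch of the bucket idea, under a few assumptions of mine: the cell size `$cell`, the helper names, and the sample data are all hypothetical, and the threshold is assumed to be no larger than the cell size. To stay on the safe side this version checks the full 3x3 block of buckets around the point rather than only the nearest four.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $cell = 0.01;    # assumed bucket size in degrees; must be >= the threshold

# Integer bucket index for a coordinate.
sub bucket_key { my ($v) = @_; return floor($v / $cell); }

my %buckets;        # $buckets{lat_index}{long_index} = [ entries... ]

sub add_entry {
    my ($lat, $long, $name) = @_;
    push @{ $buckets{ bucket_key($lat) }{ bucket_key($long) } },
        [ $lat, $long, $name ];
}

# Entries within $threshold of the point, looking only at the 3x3
# block of buckets around it.
sub nearby {
    my ($lat, $long, $threshold) = @_;
    my ($bi, $bj) = ( bucket_key($lat), bucket_key($long) );
    my $threshold_sq = $threshold**2;
    my @found;
    for my $i ( $bi - 1 .. $bi + 1 ) {
        next unless exists $buckets{$i};    # don't autovivify empty rows
        for my $j ( $bj - 1 .. $bj + 1 ) {
            my $list = $buckets{$i}{$j} or next;
            push @found, grep {
                ($lat - $_->[0])**2 + ($long - $_->[1])**2 <= $threshold_sq
            } @$list;
        }
    }
    return @found;
}

add_entry(52.3705, 4.8905, 'depot');
add_entry(51.9200, 4.4800, 'yard');
print "$_->[2]\n" for nearby(52.371, 4.892, 0.01);
```

Only populated buckets take up memory, which is the point of using a hash instead of a fixed 2D array.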

