PerlMonks  

Re^2: Using filepath method to identify an .html page

by Nik
on Jan 22, 2013 at 16:14 UTC (#1014714)


in reply to Re: Using filepath method to identify an .html page
in thread Using filepath method to identify an .html page

Yes, it has to be a 4-digit number to fit the respective column of the database table.

Here I just posted what I want to do, more clearly: http://www.perlmonks.org/?node_id=1014708


Re^3: Using filepath method to identify an .html page
by blue_cowdawg (Monsignor) on Jan 22, 2013 at 16:50 UTC

    OK: How about this:

    $ cat testMD5.pl
    use strict;
    foreach my $url (qw@ /index.html /about/time.html @) {
        hashit($url);
    }
    sub hashit {
        my $url  = shift;
        my @ltrs = split(//, $url);
        my $hash = 0;
        foreach my $ltr (@ltrs) {
            $hash = ($hash + ord($ltr)) % 10000;
        }
        printf "%s: %0.4d\n", $url, $hash;
    }

    which yields:

    $ perl testMD5.pl
    /index.html: 1066
    /about/time.html: 1547

    Keep in mind this is hardly bulletproof. You also need a method to detect hash collisions and a rehash algorithm.
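
    To make the collision-detection point concrete, here is a minimal sketch, not a definitive implementation. It assumes the path is kept next to the pin so a clash can be noticed, and it rehashes by stepping to the next free pin (linear probing, one possible choice). The names `%pin_to_path` and `pin_for` are hypothetical, and the in-memory hash only stands in for the database table.

```perl
use strict;
use warnings;

# Same additive hash as in testMD5.pl, but returning the value instead of printing it.
sub hashit {
    my ($url) = @_;
    my $hash = 0;
    $hash = ($hash + ord($_)) % 10000 for split //, $url;
    return $hash;
}

# Hypothetical stand-in for the database table: pin => path.
my %pin_to_path;

# Return the pin for $path, stepping to the next free pin on a collision.
sub pin_for {
    my ($path) = @_;
    my $pin = hashit($path);
    for (1 .. 10000) {
        my $owner = $pin_to_path{$pin};
        if (!defined $owner) {             # free slot: claim it
            $pin_to_path{$pin} = $path;
            return $pin;
        }
        return $pin if $owner eq $path;    # this path is already registered
        $pin = ($pin + 1) % 10000;         # collision: try the next pin
    }
    die "all 10000 pins are in use\n";
}

printf "%s: %04d\n", $_, pin_for($_) for qw(/abc.html /bca.html /abc.html);
```

    Here /abc.html and /bca.html both hash to 0824, so the second one is moved to 0825, while asking for /abc.html again returns 0824. Note that detecting the clash is only possible because the path is stored alongside the pin.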

    This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

    UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages that you are trying to index into your database. The wider your hash is, the less likely hash collisions become.
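
    As a rough illustration of the width point, the sketch below swaps in a different technique from the additive hash: it derives the pin from an MD5 digest (Digest::MD5 is a core module) and lets you choose how many decimal digits to keep. The helper name `md5_pin` and the choice of taking the first 32 bits of the digest are my assumptions, not anything from the thread.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);    # core module

# Hypothetical helper: a $digits-wide decimal pin taken from the MD5 of the path.
sub md5_pin {
    my ($path, $digits) = @_;
    my $n = hex(substr(md5_hex($path), 0, 8));    # first 32 bits of the digest
    return $n % (10 ** $digits);
}

for my $path (qw(/abc.html /bca.html)) {
    printf "%s: %04d (4 digits) %08d (8 digits)\n",
        $path, md5_pin($path, 4), md5_pin($path, 8);
}
```

    Four digits still leave only 10,000 possible pins however good the hash is; widening the column is what actually lowers the collision rate.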


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      Now that I am thinking about it more and more, I don't have to turn the 'number' back into a 'path'.

      So, what I want is a function foo() that does this:

      foo( "some long string" ) --> 1234

      =====================
      1. User requests a specific HTML page (.htaccess gives my script the absolute path for that .html page)
      2. Turn the 'path' into a 4-digit number and store it in the database as 'pin' (how?)
      3. I store that number in the database. I DON'T EVEN HAVE TO STORE THE HTML PAGE'S PATH IN THE DATABASE ANYMORE!!! This is just great!

      At some later time I want to check the weblog of that .html page:

      1. Request the page as: http://mydomain.gr/index.html?show=log
      2. .htaccess gives my script the absolute path of the requested .html file
      3. Turn the 'path' into a 4-digit number (this is what I'm asking about)
      4. Use the 'pin' variable to select all log records for that specific .html page (based on the 'pin' column)


      Since I have the requested 'path', which has been converted to a 4-digit number stored in the database, I know which page I am requesting detailed data for, so I look up the 'pin' column in the database and thus I know which records to select. NO NEED to store absolute paths anymore, just a 4-digit number for each .html page.

      No need to turn the number back into a path anymore, just the path into a number, to identify the specific .html page.

      Does your solution, which SEEMS GREAT, apply to my specifications?
            ( .htaccess gives my script the absolute path for that .html page)

        How's that?

        I've shown you a simple hash function to convert an arbitrary string into a four digit number. It's up to you to go from there...


        Peter L. Berghold -- Unix Professional
        Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      The perl code will produce the same hash for "abc.html" as for "bca.html"

      In any case, the likelihood of a hash collision for any non-trivial website is substantial. With only 10,000 possible values, even an ideal (uniformly distributed) hash gives about a 40% chance of at least one collision once you hash 100 files.

      If you hash 220 files, the likelihood is about 90%.
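
      These figures follow from the birthday problem. The sketch below is a quick check under the assumption that every file gets one of 10,000 values uniformly at random; the function name `collision_probability` is mine. The additive hash shown earlier in the thread is not uniform (sums of character codes for typical paths fall in a narrow range), so its real collision rate should be expected to be higher.

```perl
use strict;
use warnings;

# Probability that at least two of $n items share a value when each item gets
# one of $slots values uniformly at random (the birthday problem).
sub collision_probability {
    my ($n, $slots) = @_;
    my $p_unique = 1;
    $p_unique *= ($slots - $_) / $slots for 0 .. $n - 1;
    return 1 - $p_unique;
}

printf "%3d files: %.0f%%\n", $_, 100 * collision_probability($_, 10000)
    for (100, 220);
```

      This gives roughly 39% for 100 files and 91% for 220 files.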
            The perl code will produce the same hash for "abc.html" as for "bca.html"

        Which underscores the point I made earlier about adding collision detection and rehashing logic to whatever algorithm you use. One workaround I've seen:

        # handwaving here...
        my @i = split(//, $url);   # put each letter in its own bin
        my $j = 0;                 # initialize our
        my $k = 1;                 # hashing increment values
        my @m = ();                # workspace
        foreach my $n (@i) {
            my $q = ord($n);           # ASCII for character
            $k = ($k + $j) % 10000;    # increment our position weight: 1, 2, 4, 8, ... (kept below 10000)
            $q *= $k;                  # scale the character by its weight
            $j = $k;                   # store that.
            push @m, $q;               # save the weighted value
        }
        my $hashval = 0;               # initialize our hash value
        # Generate that
        $hashval = ($hashval + $_) % 10000 for @m;

        Using that method, ABC.html and CBA.html now have different values because each letter's value is multiplied by a weight that grows from left to right. Merely adding a per-position offset would not be enough, since the offsets add up to the same total whatever order the letters are in.
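
        If you want to check order sensitivity yourself, here is a small self-contained sketch. It compares a plain additive hash with a polynomial rolling hash, which is a different, commonly used technique that I am swapping in for illustration; the multiplier 31 and both function names are my own choices.

```perl
use strict;
use warnings;

# Plain additive hash: the order of the characters does not matter.
sub additive_hash {
    my ($s) = @_;
    my $h = 0;
    $h = ($h + ord($_)) % 10000 for split //, $s;
    return $h;
}

# Polynomial rolling hash (swapped-in technique; 31 is an arbitrary multiplier).
# Each step multiplies the running value, so a character's position matters.
sub rolling_hash {
    my ($s) = @_;
    my $h = 0;
    $h = ($h * 31 + ord($_)) % 10000 for split //, $s;
    return $h;
}

for my $name (qw(ABC.html CBA.html)) {
    printf "%-8s additive=%04d rolling=%04d\n",
        $name, additive_hash($name), rolling_hash($name);
}
```

        The additive column is identical for both names while the rolling column differs. Collisions between unrelated names are still possible with only 10,000 values, so the detection and rehash logic is needed either way.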


        Peter L. Berghold -- Unix Professional
        Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
