
Using filepath method to identify an .html page

by Nik
on Jan 22, 2013 at 15:10 UTC (#1014689=perlquestion)
Nik has asked for the wisdom of the Perl Monks concerning the following question:

What I want to do is associate a number with an HTML page's absolute path, so that I can use that number in my database relations instead of the BIG absolute path string.

So to get an integer out of a string, I would just have to type:

htmlpage = a string representing the absolute path of the requested .html file
# ===========================================================
# produce a hash string based on html page's filepath and convert it to an integer, that will then be used to identify the page itself
# ===========================================================
pin = int( htmlpage )
But would that be unique?



Here is some background information:

This counter script will run in a shared hosting environment, so absolute paths are BIG and are expected to look like this:



/home/nikos/public_html/varsa.gr/articles/html/files/index.html

In addition to that my counter.py script maintains details in a database table that stores information for each and every webpage requested.

My 'visitors' database has 2 tables:

pin --- page ---- hits   (that's to store general information for all html pages)

pin <-refers to-> page

pin ---- host ---- hits ---- useros ---- browser ---- date   (that's to store detailed information for all html pages)

(thousands of records to hold every page's information)
'pin' has to be a number, because if I used the column 'page' instead, just imagine how much space the database would need to hold detailed information for each and every .html page requested by visitors!!!

So I really, really need to associate a (4-digit integer <=> htmlpage's absolute path)

Maybe it can be done by creating a MySQL association between the two columns, but I don't know how such a thing can be done (if it can).

So that's why I need to get a "unique" number out of a string. Please help.

Re: Using filepath method to identify an .html page
by Anonymous Monk on Jan 22, 2013 at 15:16 UTC
    Read int, and then write your own function
      The only thing I know is that:

      a) I only need to get a number out of a string (an absolute path)
      b) That number needs to be unique, because "that" number is an indicator to the actual html file.

      Will the int function get the job done, or does a hashing method need to get involved?

      I don't know HOW this is supposed to be written. I just know I need this:

      number = function_that_returns_a_number_out_of_a_string( absolute_path_of_a_html_file)

      pin = int ( '/home/nikos/public_html/index.html' )
      This fails for me. Is it because it has slashes in it?
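A minimal sketch of why that call fails, written in Python on the assumption that counter.py is a Python script. `int()` only parses strings that are already integer literals; it does not derive a number from arbitrary text, so the slashes are not the specific problem. The helper name `try_int` is made up for this illustration.

```python
def try_int(text):
    """Return int(text), or None when text is not an integer literal."""
    try:
        return int(text)
    except ValueError:
        # int() refuses any string that is not a plain number
        return None


print(try_int("1234"))
print(try_int("/home/nikos/public_html/index.html"))
```

The first call parses cleanly; the second hits the `ValueError` branch because a path is not a number, with or without slashes.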

        Why don't you create a database table with two columns. One column is "the string", and the other column is a unique integer. Most databases have an almost inexhaustible supply of unique integers for such columns.

        Will the int function get the job done, or does a hashing method need to get involved?

        Read int and then you will know, then write your own function

        I don't know HOW this is supposed to be written.

        Keep a database of numbers ( AnyDBM_File ), assign one to each path, and then you're done
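The two-column table suggested above could look roughly like this. It is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for the OP's MySQL database, and the table name `pages` and the helper names `open_db` and `pin_for` are assumptions made for illustration.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and make sure the lookup table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " pin INTEGER PRIMARY KEY,"      # the database hands out unique integers
        " page TEXT NOT NULL UNIQUE)"    # each path is stored exactly once
    )
    return conn


def pin_for(conn, page):
    """Return the pin for a page, creating a new row on first sight."""
    row = conn.execute(
        "SELECT pin FROM pages WHERE page = ?", (page,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO pages (page) VALUES (?)", (page,))
    conn.commit()
    return cur.lastrowid


conn = open_db()
print(pin_for(conn, "/home/nikos/public_html/index.html"))
```

The long path is stored once in this table; the detailed per-visit table then only needs the small integer.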

Re: Using filepath method to identify an .html page
by RichardK (Priest) on Jan 22, 2013 at 15:44 UTC

    You can also trim your paths to be relative rather than absolute, using File::Spec abs2rel to remove the common prefix.

    But as others have said, normalise your data!
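The path-trimming idea above, sketched in Python in case counter.py is a Python script. `posixpath.relpath` plays the role that `File::Spec->abs2rel` plays in Perl. The document root in `DOCROOT` and the helper name `short_path` are assumptions; adjust them to the real common prefix.

```python
import posixpath

# Assumed common prefix shared by every page on the account.
DOCROOT = "/home/nikos/public_html"


def short_path(abs_path, root=DOCROOT):
    """Strip the common prefix so only the site-relative part is stored."""
    return posixpath.relpath(abs_path, start=root)


print(short_path("/home/nikos/public_html/varsa.gr/articles/html/files/index.html"))
# prints: varsa.gr/articles/html/files/index.html
```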

Re: Using filepath method to identify an .html page
by blue_cowdawg (Prior) on Jan 22, 2013 at 16:04 UTC
        What i want to do, is to associate a number to an html page's absolute path

    Does it have to be an integer? How about a hex string?

    $ cat testMD5.pl
    use strict;
    use Digest::MD5 qw/ md5_hex /;
    my $digest=md5_hex("http://www.berghold.net");
    printf "%s\n",$digest
    which yields:
    $ perl testMD5.pl
    84b40875c5bc4da7ae368175025a32f9
    Just a thought...


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      Yes, it has to be a 4-digit number to fit the database table's respective column.

      Here I just posted what I want to do more clearly: http://www.perlmonks.org/?node_id=1014708

        OK: How about this:

        $ cat testMD5.pl
        use strict;
        foreach my $url (qw@ /index.html /about/time.html @) {
            hashit($url);
        }
        sub hashit {
            my $url  = shift;
            my @ltrs = split(//, $url);
            my $hash = 0;
            # sum the character codes, keeping the result within four digits
            foreach my $ltr (@ltrs) {
                $hash = ($hash + ord($ltr)) % 10000;
            }
            printf "%s: %0.4d\n", $url, $hash
        }
        which yields:
        $ perl testMD5.pl
        /index.html: 1066
        /about/time.html: 1547
        Keep in mind this is hardly bulletproof. You also need a method to detect hash collisions and a rehash algorithm.

        This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

        UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages that you are trying to index into your database. The wider your hash is the less likely there will be hash collisions.
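To make the collision warning concrete, here is a rough Python port of the additive hash above, plus a small collision detector. The names `additive_hash` and `find_collisions` are invented for this sketch, and the port assumes ASCII paths, where Python's `ord` matches Perl's. Because the hash only sums character codes, any two paths with the same characters in a different order collide.

```python
def additive_hash(path, buckets=10_000):
    """Sum the character codes of the path, modulo the bucket count."""
    total = 0
    for ch in path:
        total = (total + ord(ch)) % buckets
    return total


def find_collisions(paths):
    """Return (first_path, second_path, hash) for every clash found."""
    seen = {}
    collisions = []
    for p in paths:
        h = additive_hash(p)
        if h in seen and seen[h] != p:
            collisions.append((seen[h], p, h))
        else:
            seen.setdefault(h, p)
    return collisions


print(additive_hash("/index.html"))       # 1066, same as the Perl version
print(additive_hash("/about/time.html"))  # 1547, same as the Perl version
print(find_collisions(["/ab.html", "/ba.html", "/index.html"]))
```

The last line reports `/ab.html` and `/ba.html` as a clash, which is the kind of case a rehash step would have to handle.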


        Peter L. Berghold -- Unix Professional
        Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      The OP seems a bit unwilling to listen, but if you truncate the hash, you meet his spec.

      use Digest::MD5 qw/ md5_hex /;
      my $digest=md5_hex("http://www.berghold.net");
      printf "%d\n", hex(substr($digest, -4)) % 10_000;

      So the one-liner he's looking for would be hex(substr(md5_hex($someurl), -4)) % 10_000. I'm betting that is somewhat collision resistant. The OP still should consider enlarging his int2 column to int4 or int8 or even char(4)
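For a counter script written in Python, the same truncated-MD5 idea might look like this. It is a sketch, not a recommendation: the function name `md5_pin` is made up, and the input is assumed to be encoded as UTF-8 before hashing.

```python
import hashlib


def md5_pin(url):
    """Map a string to 0..9999 using the last four hex digits of its MD5."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # last four hex digits give 0..65535, folded into four decimal digits
    return int(digest[-4:], 16) % 10_000


print(md5_pin("/home/nikos/public_html/index.html"))
```

Note that folding 65536 possible values into 10000 slots is slightly uneven (pins 0 to 5535 come up a little more often), and two different paths can still land on the same pin, so a uniqueness check in the database is still needed.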

Re: Using filepath method to identify an .html page
by Anonymous Monk on Jan 22, 2013 at 16:07 UTC

    Do I understand correctly that you've already got a database table with a column "pin" (I assume a unique integer?) and another column "page" which contains something like the URL of the page? Why not just use that unique identifier? (Or create another database table, as other people here have suggested?)

    Anyways, although I don't think this is the correct solution for your situation, here is an answer to your original question: One concept of generating a number from a string is a Hash function, however, those numbers are generally not unique. Also, int has nothing to do with hash functions (the closest built-in is probably crypt, and that's not what you want, either).

    The problem with your question is this: using a hash function causes you to lose information about the original string, and the numbers you generate won't be guaranteed to be unique anymore. If you want the numbers to be truly unique, then the only way to guarantee that is to keep a list of the original strings around, which you say you don't want to do because of the amount of data that means. Sorry, you can't have it both ways...
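How quickly uniqueness breaks down with only four digits can be estimated with the usual birthday-problem calculation. This Python sketch assumes an idealised hash that spreads pages uniformly over the 10000 possible pins; a real hash can only do worse. The function name `collision_probability` is invented for this example.

```python
def collision_probability(n_pages, n_slots=10_000):
    """Chance that at least two of n_pages share a pin (uniform hash assumed)."""
    if n_pages > n_slots:
        return 1.0  # more pages than pins: a clash is certain
    p_unique = 1.0
    for i in range(n_pages):
        # the i-th page must avoid the i pins already taken
        p_unique *= (n_slots - i) / n_slots
    return 1.0 - p_unique


for n in (50, 100, 125, 500):
    print(n, round(collision_probability(n), 3))
```

Under that assumption the odds of at least one clash pass 50% somewhere between 100 and 125 pages, long before the 10000 pins run out.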

    But maybe we're not quite understanding your existing set-up or what your goal is?

      Please read this; it clarifies what I need to do:

      http://www.perlmonks.org/?node_id=1014708
Re: Using filepath method to identify an .html page
by Anonymous Monk on Jan 22, 2013 at 16:40 UTC

    If you've already got a database table with a column that holds the absolute path (or you can create one), and all you want to do is pick a number for each row, then, assuming you've marked the column that holds the number as UNIQUE, then you could just pick a random number for each row, right? Or, if you haven't filled the table yet, maybe you're looking for a feature like AUTO_INCREMENT? (a similar feature exists in virtually all databases)
