Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^2: Large chunks of text - database or filesystem?

by BrowserUk (Patriarch)
on Mar 19, 2005 at 13:53 UTC ( #440923=note: print w/replies, xml ) Need Help??


in reply to Re: Large chunks of text - database or filesystem?
in thread Large chunks of text - database or filesystem?

Assuming MySQL, BLOB types are variable length. The length you store + 1, 2, 3 or 4 bytes to store the length.

Column type Storage required CHAR(M) M bytes, 1 <= M <= 255 VARCHAR(M) L+1 bytes, where L <= M and 1 <= M <= 255 TINYBLOB, TINYTEXT L+1 bytes, where L < 2^8 BLOB, TEXT L+2 bytes, where L < 2^16 MEDIUMBLOB, MEDIUMTEXT L+3 bytes, where L < 2^24 LONGBLOB, LONGTEXT L+4 bytes, where L < 2^32 ENUM('value1','value2',...)1 or 2 bytes, depending on the number of en +umeration values (65535 values maximum) SET('value1','value2',...) 1, 2, 3, 4 or 8 bytes, depending on the num +ber of set members (64 members maximum)

Which should ease your concerns a little. You could improve things a bit by zipping the data on the way in, and unzipping it on retrieval, but that comes at the cost of being able to use the (somewhat slow and primitive) SQL search facilities.

Then again, you could index (using perl rather than the DB), the words in each post when it is created, which would allow you to locate the posts via keywords very quickly. You could store the index in the DB either as Storable hash or as a table, though the latter might be costly in terms of space and time.

And, once you have built an index of the words in a post, why store the words? If you fully invert your posts, you could probably get away with storing a 24-bit number to represent the "words" in the posts", which would probably represent an overall compression as you would be discarding the spaces.

The same file/table/frozen hash that you use to relate index entries to words, could also store post numbers where the words appear. You now have the basis of fairly well compressed, highly indexed datastore the allows very fast search and retreival.

Just musing :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://440923]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (2)
As of 2022-05-28 19:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (101 votes). Check out past polls.

    Notices?