Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Assuming MySQL, BLOB types are variable length. The length you store + 1, 2, 3 or 4 bytes to store the length.

Column type Storage required CHAR(M) M bytes, 1 <= M <= 255 VARCHAR(M) L+1 bytes, where L <= M and 1 <= M <= 255 TINYBLOB, TINYTEXT L+1 bytes, where L < 2^8 BLOB, TEXT L+2 bytes, where L < 2^16 MEDIUMBLOB, MEDIUMTEXT L+3 bytes, where L < 2^24 LONGBLOB, LONGTEXT L+4 bytes, where L < 2^32 ENUM('value1','value2',...)1 or 2 bytes, depending on the number of en +umeration values (65535 values maximum) SET('value1','value2',...) 1, 2, 3, 4 or 8 bytes, depending on the num +ber of set members (64 members maximum)

Which should ease your concerns a little. You could improve things a bit by zipping the data on the way in, and unzipping it on retrieval, but that comes at the cost of being able to use the (somewhat slow and primitive) SQL search facilities.

Then again, you could index (using perl rather than the DB), the words in each post when it is created, which would allow you to locate the posts via keywords very quickly. You could store the index in the DB either as Storable hash or as a table, though the latter might be costly in terms of space and time.

And, once you have built an index of the words in a post, why store the words? If you fully invert your posts, you could probably get away with storing a 24-bit number to represent the "words" in the posts", which would probably represent an overall compression as you would be discarding the spaces.

The same file/table/frozen hash that you use to relate index entries to words, could also store post numbers where the words appear. You now have the basis of fairly well compressed, highly indexed datastore the allows very fast search and retreival.

Just musing :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

In reply to Re^2: Large chunks of text - database or filesystem? by BrowserUk
in thread Large chunks of text - database or filesystem? by TedPride

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (2)
As of 2022-05-29 07:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (101 votes). Check out past polls.

    Notices?