How stable is the format of this file? Have you done any statistical analysis to test your assumptions? For instance, are section headings always left aligned? Always in caps as in the sample file? There is variability in the dividers between item number and section title (sometimes a colon and sometimes a hyphen). Is this the only variability?
You mention that sometimes section 3 is found within section 1. Do you mean that section 1 is interrupted by section 3 and then resumes? Or that section 3 immediately follows section 1? If section 1 resumes how do you know as a human reader that you have transitioned from the end of section 3 and back to the remainder of section 1?
In general using regexes in natural language documents to identify the boundaries of semantic chunks is not very reliable. Regexes are the textual equivalent of hearing sentences in a language you don't know. As a listener you can identify that certain sound sequences occur but if you hear them in two places you have no way of knowing if both are part of a noun or one is part of a verb and another is part of a noun. And even if it turns out both are part of a noun, you don't know whether they mean the same thing because nouns can sometimes have two meanings.
Using regexes sometimes works if you have a rigid document format and no possibility that markers of section boundaries can occur elsewhere in the document with different meanings and uses. For example, suppose the SEC will only accept documents where (a) the section titles are always marked by the word "ITEM" (all caps) followed by section title section (b) titles never cross line boundaries and are limited to a specific set of values (c) the next line is always a series of hyphens (d) the number of hyphens equals the number of characters in item + title. It would be highly unlikely that such a sequence would appear naturally as part of the regular text of a section. You could then use such a structure to chunk the text.
On the other hand, if "item" can be lower or upper case and there is no SEC mandated format to titles, then you indeed have a problem because there are many uses of the word "item" even in your sample text. Even if it were true that titles are always left aligned, it wouldn't be enough to pick out the section headings. Since section content text is left aligned, there is a significant possibility that "item" as part of context text will be left aligned in at least some of the SEC files. You'd have to do statistical analysis on the rate of false matches, i.e. comparing your algorithm's extraction to a human reader's extraction. Then you would have to check with your client about its acceptability. If your client thinks there are too many false matches, you'll need to have some mechanism to disambiguate between the different contextual uses of "item" and may need to look into setting up some sort of Baysian filter and training corpus.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
Outside of code tags, you may need to use entities for some characters:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||