PerlMonks  

How stable is the format of this file? Have you done any statistical analysis to test your assumptions? For instance, are section headings always left aligned? Always in caps as in the sample file? There is variability in the dividers between item number and section title (sometimes a colon and sometimes a hyphen). Is this the only variability?

You mention that sometimes section 3 is found within section 1. Do you mean that section 1 is interrupted by section 3 and then resumes? Or that section 3 immediately follows section 1? If section 1 resumes, how do you know, as a human reader, that you have moved from the end of section 3 back to the remainder of section 1?

In general, using regexes on natural language documents to identify the boundaries of semantic chunks is not very reliable. Regexes are the textual equivalent of hearing sentences in a language you don't know. As a listener you can identify that certain sound sequences occur, but if you hear them in two places you have no way of knowing whether both are part of a noun, or one is part of a verb and the other part of a noun. And even if it turns out both are part of a noun, you don't know whether they mean the same thing, because nouns can have more than one meaning.

Using regexes sometimes works if you have a rigid document format and no possibility that markers of section boundaries can occur elsewhere in the document with different meanings and uses. For example, suppose the SEC will only accept documents where (a) section titles are always marked by the word "ITEM" (all caps) followed by the section title, (b) titles never cross line boundaries and are limited to a specific set of values, (c) the next line is always a series of hyphens, and (d) the number of hyphens equals the number of characters in "ITEM" plus the title. It would be highly unlikely that such a sequence would appear naturally as part of the regular text of a section. You could then use such a structure to chunk the text.
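To make the hypothetical format concrete, here is a minimal sketch of chunking on it. It is written in Python only for illustration (the regex carries over to Perl unchanged), and the function name split_sections and the exact heading pattern are my own assumptions, not anything the SEC prescribes. It checks conditions (a), (c) and (d); condition (b), the fixed list of allowed titles, is left out because I don't know what that list would be.

```python
import re

# Assumed heading shape: the word ITEM in caps, a space, then the title,
# all on one line. This is the hypothetical format, not a real SEC rule.
HEADING = re.compile(r"^ITEM \S.*$")

def split_sections(text):
    """Return (heading, body) pairs for every heading line that is
    followed by a line of hyphens exactly as long as the heading."""
    lines = text.splitlines()
    sections = []
    current = None
    i = 0
    while i < len(lines):
        line = lines[i]
        underline = lines[i + 1] if i + 1 < len(lines) else ""
        is_heading = (
            HEADING.match(line) is not None
            and len(underline) == len(line)   # condition (d)
            and set(underline) == {"-"}       # condition (c)
        )
        if is_heading:
            current = (line, [])
            sections.append(current)
            i += 2  # skip the heading and its underline
            continue
        if current is not None:
            current[1].append(line)  # ordinary text, even if it starts with ITEM
        i += 1
    return [(heading, "\n".join(body).strip()) for heading, body in sections]
```

Note that a body line that merely begins with "ITEM" stays in the body, because it is the underline of matching length that makes the marker unambiguous, not the word itself.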

On the other hand, if "item" can be lower or upper case and there is no SEC-mandated format for titles, then you do have a problem, because there are many uses of the word "item" even in your sample text. Even if titles were always left-aligned, that wouldn't be enough to pick out the section headings. Since section content text is also left-aligned, there is a significant possibility that "item" as part of content text will begin a line in at least some of the SEC files. You'd have to do statistical analysis on the rate of false matches, i.e. compare your algorithm's extraction to a human reader's extraction. Then you would have to check with your client whether that rate is acceptable. If your client thinks there are too many false matches, you'll need some mechanism to disambiguate between the different contextual uses of "item", and you may need to look into setting up some sort of Bayesian filter and training corpus.
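As a rough sketch of that comparison (again Python, purely illustrative): suppose each heading is recorded as a (file name, line number) pair, once as found by your regex and once as marked by a human reader. That representation and the function name heading_match_rates are assumptions of mine. The rates then fall out of simple set arithmetic.

```python
def heading_match_rates(predicted, reference):
    """Compare headings found by the algorithm against headings
    marked by a human reader. Both arguments are iterables of
    hashable heading locations, e.g. (file name, line number)."""
    predicted = set(predicted)
    reference = set(reference)
    false_matches = predicted - reference   # algorithm said heading, human said no
    missed = reference - predicted          # human said heading, algorithm missed it
    return {
        "false_match_rate": len(false_matches) / len(predicted) if predicted else 0.0,
        "miss_rate": len(missed) / len(reference) if reference else 0.0,
        "false_matches": sorted(false_matches),
        "missed": sorted(missed),
    }
```

The two lists of offending locations are returned alongside the rates because those are the cases you would want to read by hand before deciding whether the regex can be tightened or whether something like a trained filter is needed.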


In reply to Re: Help with negative look ahed by ELISHEVA
in thread Help with negative look ahed by eversuhoshin
