I am going to build a collection of regexes that capture specific (English) language constructs. These can then be used to parse, index, and search texts. If such a regex collection is large and general enough, it should be possible to collect
and organise them without knowing the precise form of the text beforehand. My experience with science-like articles (which are the target) is that the text and style are often repetitive, almost monotonous (not meant
My question is: does something like a natural-language regex collection already exist? I know Regexp::Common etc., but those all seem much more specialized than what I was hoping to find.
I'd be thankful for pointers or further ideas.
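
To make the idea concrete, here is a minimal sketch in Python of the kind of collection I have in mind. The construct names (`definition`, `citation_year`, `result_claim`), the patterns themselves, and the `index_text` helper are all made up for illustration; they are not taken from Regexp::Common or any other existing library, and real patterns would need to be far more tolerant.

```python
import re

# Hypothetical named patterns for constructs that recur in science-like prose.
# Each pattern uses named groups so that hits can be indexed by their parts.
CONSTRUCTS = {
    "definition": re.compile(
        r"\b(?P<term>[A-Z][\w-]*) is defined as (?P<definition>[^.]+)\."
    ),
    "citation_year": re.compile(
        r"\b(?P<author>[A-Z][a-z]+)(?: et al\.)? \((?P<year>(?:19|20)\d{2})\)"
    ),
    "result_claim": re.compile(
        r"\b(?:We|we) (?:show|find|demonstrate|conclude) that (?P<claim>[^.]+)\."
    ),
}


def index_text(text):
    """Return (construct name, start offset, captured groups) for every hit,
    sorted by position in the text."""
    hits = []
    for name, pattern in CONSTRUCTS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.start(), match.groupdict()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    sample = (
        "Entropy is defined as a measure of disorder. "
        "Smith et al. (2004) reported it. "
        "We show that the effect is small."
    )
    for name, offset, groups in index_text(sample):
        print(name, offset, groups)
```

Keeping the patterns in a plain name-to-regex mapping means new constructs can be added without touching the indexing code, which is roughly the property I would hope an existing collection to have.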