http://www.perlmonks.org?node_id=974514

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know of a ready source of (millions of) phrases or sentences?

Of course the internet is full of billions of sentences, but if there is a freely available source without markup, that would save a lot of time and effort.

Some English latin1 would be a good starting point, but any Latin script would also be fine.

It's for testing a fast string similarity algorithm.

Thanks.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re: Random phrases
by pemungkah (Priest) on Jun 05, 2012 at 19:02 UTC
    Project Gutenberg text files? You might need to trim the boilerplate from the start of each book.
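    A minimal sketch of that trimming step, assuming each book carries the usual "*** START OF ..." and "*** END OF ..." marker lines around the actual text. The sub name strip_gutenberg_boilerplate and the file names in the usage line are made up for illustration; older files use different markers and would need their own patterns.

```perl
use strict;
use warnings;

# Return the body of a Project Gutenberg plain-text book without the
# licence boilerplate. ASSUMPTION: the book uses "*** START OF ..." and
# "*** END OF ..." marker lines. If a marker is missing, that end of the
# text is left untouched.
sub strip_gutenberg_boilerplate {
    my ($text) = @_;

    # Drop everything up to and including the START marker line.
    $text =~ s/\A.*?^\*{3}\s*START OF[^\n]*\n//ms;

    # Drop the END marker line and everything after it.
    $text =~ s/^\*{3}\s*END OF.*\z//ms;

    return $text;
}

# Command-line use (hypothetical file names):
#   perl strip_pg.pl book1.txt book2.txt > corpus.txt
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $book = do { local $/; <$fh> };    # slurp the whole book
    close $fh;
    print strip_gutenberg_boilerplate($book);
}
```

    It is worth checking a few files by eye first, since the exact marker wording varies between books.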

      Yes. That will do nicely. Thank you.



Re: Random phrases
by brx (Pilgrim) on Jun 05, 2012 at 18:59 UTC
Re: Random phrases
by Not_a_Number (Prior) on Jun 05, 2012 at 19:50 UTC

    Have you considered NLTK?

    It comes with a selection of plain-text corpora:

    • abc: Australian Broadcasting Commission 2006: Science News, Rural News
    • genesis: Genesis Corpus
    • gutenberg: Project Gutenberg Selections
    • inaugural: US Presidential Inaugural Address Corpus
    • udhr: Universal Declaration of Human Rights Corpus
    • state_union: US Presidential State of the Union Address Corpus

    plus a lot more besides...

Re: Random phrases
by afoken (Chancellor) on Jun 05, 2012 at 18:22 UTC

    Wikipedia, split at [.?!]?
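    Splitting at [.?!] could look roughly like this. It is a naive sketch that works on any plain text: it breaks after sentence-ending punctuation followed by whitespace, so abbreviations such as "Dr. Smith" are split wrongly. The sub name split_sentences is made up for illustration.

```perl
use strict;
use warnings;

# Naive sentence splitter: break after '.', '?' or '!' when followed by
# whitespace. Abbreviations and decimal numbers are NOT handled specially.
sub split_sentences {
    my ($text) = @_;

    $text =~ s/\s+/ /g;           # collapse newlines and runs of spaces
    $text =~ s/^\s+|\s+$//g;      # trim both ends

    # The lookbehind keeps the punctuation attached to its sentence.
    my @sentences = split /(?<=[.?!])\s+/, $text;

    return grep { /\S/ } @sentences;    # drop empty fragments
}

# Command-line use: print one sentence per line from the files named in @ARGV.
if (@ARGV) {
    local $/ = '';    # paragraph mode: read one blank-line-separated block
    while (my $para = <>) {
        print "$_\n" for split_sentences($para);
    }
}
```

    Reading in paragraph mode keeps sentences from running across headings and blank lines.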

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Unless I've missed something, all the Wikipedia dumps are in an XML-tagged format, which would take considerable effort to strip of its markup.



        http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump links to Parse::MediaWikiDump, a tool that can handle those dumps. The documentation for the Parse::MediaWikiDump::page class has an example that dumps title and id for each page. Replace the print with print ${$page->text()} and you get all article texts. Not much work for you, but perhaps for your machine. ;-)
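        Put together, that change might look like the sketch below. It assumes Parse::MediaWikiDump is installed from CPAN and follows the module's documented pages/next/text interface as I remember it; the sub name print_article_texts and the dump file name in the usage line are placeholders.

```perl
use strict;
use warnings;

# Print the text of every page in a MediaWiki XML dump to the currently
# selected file handle. ASSUMPTION: Parse::MediaWikiDump is installed; it
# is loaded at run time so this file compiles without it.
sub print_article_texts {
    my ($dump_file) = @_;

    require Parse::MediaWikiDump;
    my $pages = Parse::MediaWikiDump->new->pages($dump_file);

    while (defined(my $page = $pages->next)) {
        # text() returns a reference to the article text, hence ${ ... }
        print ${ $page->text }, "\n";
    }
    return;
}

# Usage (placeholder file name):
#   perl dump_text.pl pages-articles.xml > articles.txt
print_article_texts($ARGV[0]) if @ARGV;
```

        Note that what comes out is the page source, so wiki markup such as [[links]] and {{templates}} is still in there; only the XML wrapper is gone.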

        BTW:

        This software is being RETIRED - MediaWiki::DumpFile is the official successor to Parse::MediaWikiDump and includes a compatibility library called MediaWiki::DumpFile::Compat that is 100% API compatible and is a near perfect standin for this module. It is faster in all instances where it counts and is actively maintained. Any undocumented deviation of MediaWiki::DumpFile::Compat from Parse::MediaWikiDump is considered a bug and will be fixed.

        Looking at http://search.cpan.org/~triddle/MediaWiki-DumpFile-0.2.1/lib/MediaWiki/DumpFile/FastPages.pm, I see an example that should give you exactly what you want: plain text phrases from a Wikipedia dump, written to STDOUT or whatever is currently select()ed, and optimized for speed.
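        From memory of that synopsis, the loop is roughly the following. Treat it as a sketch: it assumes MediaWiki::DumpFile is installed and that next returns a (title, text) pair and an empty list at the end of the dump. The sub name and the file name in the usage line are placeholders.

```perl
use strict;
use warnings;

# Print the text of every page using the FastPages parser.
# ASSUMPTION: MediaWiki::DumpFile::FastPages is installed and next()
# returns (title, text), or an empty list once the dump is exhausted.
sub print_fast_page_texts {
    my ($dump_file) = @_;

    require MediaWiki::DumpFile::FastPages;
    my $pages = MediaWiki::DumpFile::FastPages->new($dump_file);

    # The list assignment is false when next() returns nothing more.
    while (my ($title, $text) = $pages->next) {
        print $text, "\n";    # goes to STDOUT or the select()ed handle
    }
    return;
}

# Usage (placeholder file name):
#   perl fast_text.pl pages-articles.xml > articles.txt
print_fast_page_texts($ARGV[0]) if @ARGV;
```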

        Alexander
