PerlMonks  

Perl Module for identifying country name

by maheshkumar (Sexton)
on Aug 03, 2012 at 14:50 UTC ( #985253=perlquestion )
maheshkumar has asked for the wisdom of the Perl Monks concerning the following question:

Is there a Perl module, or any other way, to identify the names of the countries that appear in a particular text file? For example, if a text file holds arbitrary data that includes United States, China and Germany, can I find out that the file contains these names?

Re: Perl Module for identifying country name
by talexb (Canon) on Aug 03, 2012 at 14:59 UTC

    It's a little difficult to comprehend what you're asking for, but my guess is that you could achieve your goal by using grep on the file for the country name that you're looking for. Does that help?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Actually, what I want is just to find which country names appear in a text file.

      For grep, I think I would need to specify whether I am looking for United States or Germany, right? That way I could miss the country name Canada if it is in the file.

        You can use a regular expression to find all (English) country names.
        (?-xism:(?:S(?:a(?:int (?:(?:Vincent and the Grenadine|Kitts and Nevi)s|Lucia)|o Tome and Principe|(?:udi Arabi|mo)a|n Marino)|o(?:uth (?:(?:Afric|Kore)a|Sudan)|lomon Islands|malia)|(?:(?:lov(?:ak|en)|yr)i|ri Lank)a|w(?:(?:itzer|azi)land|eden)|e(?:ychelles|negal|rbia)|i(?:erra Leon|ngapor)e|u(?:riname|dan)|pain)|B(?:o(?:(?:snia and Herzegovi|tswa)n|livi)a|a(?:h(?:amas|rain)|ngladesh|rbados)|u(?:r(?:kina Faso|undi|ma)|lgaria)|e(?:l(?:arus|gium|ize)|nin)|r(?:azil|unei)|hutan)|M(?:a(?:l(?:a(?:ysia|wi)|dives|ta|i)|urit(?:ania|ius)|c(?:edonia|au)|rshall Islands|dagascar)|o(?:n(?:(?:tenegr|ac)o|golia)|zambique|ldova|rocco)|icronesia|exico)|C(?:o(?:(?:sta Ric|lombi)a|te d'Ivoire|moros)|a(?:m(?:bodia|eroon)|pe Verde|nada)|(?:entral African|zech) Republic|h(?:i(?:le|na)|ad)|(?:roati|ub)a|yprus)|T(?:u(?:rk(?:menistan|ey)|nisia|valu)|a(?:(?:jikist|iw)an|nzania)|rinidad and Tobago|o(?:nga|go)|imor-Leste|hailand)|A(?:(?:n(?:tigua and Barbud|dorr|gol)|(?:l(?:ban|ger)|ustr(?:al)?)i|r(?:gentin|meni))a|(?:fghanist|zerbaij)an)|P(?:a(?:l(?:estinian Territories|au)|(?:pua New Guine|nam)a|kistan|raguay)|o(?:rtugal|land)|hilippines|eru)|N(?:e(?:therland(?:s Antille)?s|w Zealand|pal)|i(?:ger(?:ia)?|caragua)|or(?:th Korea|way)|a(?:mibia|uru))|G(?:u(?:inea(?:-Bissau)?|(?:atemal|yan)a)|e(?:orgia|rmany)|re(?:nada|ece)|a(?:mbia|bon)|hana)|E(?:(?:(?:quatorial Guin|ritr)e|(?:thiop|ston)i)a|(?:(?:l Salv|cu)ad|ast Tim)or|gypt)|L(?:i(?:(?:b(?:eri|y)|thuani)a|echtenstein)|e(?:banon|sotho)|a(?:tvia|os)|uxembourg)|U(?:nited (?:States of America|Arab Emirates|Kingdom)|zbekistan|kraine|ruguay|ganda)|D(?:e(?:mocratic Republic of the Congo|nmark)|ominica(?:n Republic)?|jibouti)|I(?:r(?:a[nq]|eland)|nd(?:ones)?ia|celand|srael|taly)|K(?:(?:azakh|yrgyz)stan|iribati|osovo|uwait|enya)|R(?:(?:(?:oman|uss)i|wand)a|epublic of the Congo)|H(?:o(?:n(?:g Kong|duras)|ly See)|ungary|aiti)|V(?:enezuela|anuatu|ietnam)|J(?:a(?:maica|pan)|ordan)|F(?:i(?:nland|ji)|rance)|Z(?:imbabwe|ambia)|(?:Yeme|Oma)n|Qatar))

        BTW, you will not find "United States" with this regex since the official name is "United States of America".
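A minimal sketch of how such a pattern might be applied, and of one way around the "United States" limitation. The four-name alternation and the sub name `countries_in` are illustrative stand-ins, not the full pattern from the post; the optional `(?: of America)?` suffix is an assumption about which short form you want to accept.

```perl
use strict;
use warnings;

# Deliberately tiny alternation standing in for the full pattern.
# "(?: of America)?" makes the long form optional, so both spellings match.
my $country_re = qr/\b(?:United States(?: of America)?|Germany|China|Canada)\b/;

# In list context, m//g returns every match in the string
# (the pattern has no capture groups, so whole matches are returned).
sub countries_in {
    my ($text) = @_;
    return $text =~ /$country_re/g;
}

my @found = countries_in('Exports from China to the United States and Germany rose.');
print join(', ', @found), "\n";
```

The same `countries_in` call works unchanged if `$country_re` is replaced by a longer alternation.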

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
Re: Perl Module for identifying country name
by frozenwithjoy (Curate) on Aug 03, 2012 at 15:36 UTC
    It seems like the biggest hurdle is getting a list of countries. To overcome this, you could use Locale::Country to do: @country_names = all_country_names();

    Then you could, for example, put the countries in one hash and the words from the file in another hash and look for overlapping keys. Alternatively, use List::Compare:

    $lc = List::Compare->new( \@country_names, \@words_in_file );
    @countries_in_file = $lc->get_intersection;

    Edit: Now that I think about it a little more, it might be better to use your array of country names to grep through your file contents (after replacing newlines with spaces) to avoid issues with multi-word country names.
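The approach from the edit might look roughly like this. It is only a sketch: the five-name list is a hard-coded stand-in so the example runs without extra modules (with Locale::Country installed, `all_country_names()` could supply the list instead), and `countries_in_text` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Stand-in list; with Locale::Country installed this could instead be
#   my @country_names = all_country_names();
my @country_names = ('United States', 'China', 'Germany', 'Canada', 'New Zealand');

# Return the names from @$names that occur in $text, in list order.
sub countries_in_text {
    my ($text, $names) = @_;
    $text =~ s/\s+/ /g;    # newlines and runs of blanks become one space
    return grep { index($text, $_) >= 0 } @$names;
}

my $text = "Shipments left New\nZealand for Germany last week.";
print "$_\n" for countries_in_text($text, \@country_names);
```

Because this is a plain substring test, a short name that is contained in a longer one (for example "Niger" inside "Nigeria") would also be reported; a regex with word boundaries avoids that.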

      I already used Locale::Country to put all the country names in an array, and I am getting the countries that appear in a file name :)

Re: Perl Module for identifying country name
by TomDLux (Vicar) on Aug 03, 2012 at 16:09 UTC

    You can search for any group of strings you wish to. The problem is: what are the possible values? Will it be the English name or the German one: Germany or Deutschland? Will it be the current name or an older one: Myanmar or Burma? Sri Lanka or Ceylon? Mumbai or Bombay?
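One way to cope with alternative spellings is an alias table that maps every accepted spelling to the name you want reported. The sketch below is hypothetical: the table contents, `%canonical`, and `canonical_countries` are illustrative choices, and deciding which aliases belong in the table is exactly the open question raised above.

```perl
use strict;
use warnings;

# Hypothetical alias table: each spelling maps to the name to report.
my %canonical = (
    'Germany'     => 'Germany',
    'Deutschland' => 'Germany',
    'Myanmar'     => 'Myanmar',
    'Burma'       => 'Myanmar',
    'Sri Lanka'   => 'Sri Lanka',
    'Ceylon'      => 'Sri Lanka',
);

# Longest spellings first, so a longer name wins over a shorter one it contains.
my $alt = join '|', map { quotemeta($_) }
          sort { length($b) <=> length($a) } keys %canonical;
my $alias_re = qr/\b($alt)\b/;

# Return the distinct canonical names mentioned in $text,
# in order of first appearance.
sub canonical_countries {
    my ($text) = @_;
    my (%seen, @out);
    while ( $text =~ /$alias_re/g ) {
        my $name = $canonical{$1};
        push @out, $name unless $seen{$name}++;
    }
    return @out;
}

print join(', ', canonical_countries('Tea from Ceylon, rice from Burma and Sri Lanka')), "\n";
```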

    If you have a file with one value per line, you can use "grep -f countries datafile" to search datafile for every country listed in the countries file. The Perl equivalent is simple:

    • read in the set of countries into an array
    • form into a regular expression which will capture the found string:
      my $re_text = join '|', map {($_)} @countries;
      my $re = qr/$re_text/;
    • and then test each input line against the re:
      while ( my $line = <$fh> ) {
          chomp $line;
          my ($found) = $line =~ /($re)/;   # first country name on the line, or undef
          # Profit!
      }
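The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the poster's exact code: the sub names `build_country_re` and `count_countries` are made up, the two "files" are in-memory strings so the example is self-contained, and `quotemeta`, word boundaries and longest-first sorting are additions that the original steps do not mention.

```perl
use strict;
use warnings;

# Build one regex from a list of names. quotemeta() protects any regex
# metacharacters in a name; sorting by length keeps "Guinea-Bissau" from
# being cut short by "Guinea".
sub build_country_re {
    my @names = sort { length($b) <=> length($a) } @_;
    my $alt = join '|', map { quotemeta($_) } @names;
    return qr/\b(?:$alt)\b/;
}

# Scan a filehandle line by line; return a hash of name => number of hits.
sub count_countries {
    my ($fh, $re) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        $count{$_}++ for $line =~ /$re/g;
    }
    return %count;
}

# In-memory stand-ins for the "countries" file and the data file.
my $countries = "Niger\nNigeria\nCanada\n";
my $data      = "Flights to Nigeria and Canada.\nCanada again, then Niger.\n";

open my $cfh, '<', \$countries or die "cannot read country list: $!";
chomp( my @countries = <$cfh> );
close $cfh;

open my $dfh, '<', \$data or die "cannot read data: $!";
my %count = count_countries( $dfh, build_country_re(@countries) );
close $dfh;

print "$_: $count{$_}\n" for sort keys %count;
```

With real files, the two `open` calls would take file names instead of scalar references; the rest stays the same.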

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      What use is the map {($_)} in your regex-making code? You will create a great many "captures" which are unnecessary.
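A small sketch of the difference being discussed. Note that in Perl 5 `map {($_)}` as literally written returns each name unchanged; the example assumes the intent was to wrap every name in its own pair of parentheses. The country list and variable names are illustrative only.

```perl
use strict;
use warnings;

my @countries = ('Germany', 'China', 'Canada');
my $text      = 'Made in China';

# The alternation itself, with regex metacharacters escaped.
my $alt = join '|', map { quotemeta($_) } @countries;

# One capture group per name, i.e. "(Germany)|(China)|(Canada)":
# a successful match returns one slot per group, and every slot
# except the one that matched is undef.
my $per_name = '(' . join( ')|(', map { quotemeta($_) } @countries ) . ')';
my @slots    = $text =~ /$per_name/;

# A single capture group around the whole alternation: exactly one slot.
my ($found) = $text =~ /($alt)/;

printf "%d slots, %d defined\n", scalar(@slots), scalar( grep { defined($_) } @slots );
print "found: $found\n";
```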

      CountZero

Re: Perl Module for identifying country name
by ww (Bishop) on Aug 03, 2012 at 17:11 UTC
    ... and do you want to know about instances of "los Estados Unidos" ou "les Etats Unis;" about "Bundesrepublik Deutschland" (auf Deutsch) oder "Alemanha" (língua portuguesa) or Chine, perhaps in one of the several written forms of Chinese?

    In other words, what you didn't tell us is "Is the text file guaranteed to be Angličané jazyk? Is Czech or some other language possible?"
