PerlMonks  

Re: Counting matches

by AnomalousMonk (Archbishop)
on May 29, 2017 at 13:17 UTC ( [id://1191506]=note )


in reply to Counting matches

A couple of nice stabs at answers from kcott++ and shmem++, but I think Nicpetbio23! has wandered away!


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^2: Counting matches
by Nicpetbio23! (Acolyte) on May 29, 2017 at 14:08 UTC
    Hi, I want to read the file Genomes_used_Hant.txt into an array, look for each element of the array in NRT2.txt, and return a count. For example, Gloin1 is at line 24 of Genomes_used_Hant.txt:
    Gloin1
    Below is an example of what is in the NRT2.txt file.
    >Gloin1_46659
    MVKLFARPLPIDP....
    >Gloin1_30454
    MIKLFDKPSKELS....
    I would like to return the following in an output file for each element in the array. Since this is not an exact match I expect I need to use a regex.
    Gloin1: 2 occurrences in NRT2.txt

      Here is an SSCCE which matches your spec. Enjoy.

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Test::More;

      my @hant = ('Gloin1');
      my @counts = (2);
      plan tests => scalar @hant;

      my $nrt2 = <<EOT;
      >Gloin1_46659
      MVKLFARPLPIDP....
      >Gloin1_30454
      MIKLFDKPSKELS....
      EOT

      for my $pat (@hant) {
          my $matches = () = $nrt2 =~ /$pat/g;
          is ($matches, shift @counts, "Correct number of matches found for $pat");
      }
        I think that is too specific. I want to count the occurrences in NRT2.txt of every element in Genomes_used_Hant.txt, not just Gloin1.
        Genomes_used_hant.txt
        Laesu1
        Patat1
        Hydru2
        Armost1
        Pacta12
        VKMF3808
        Gaegr1
        Corca1
        Artol1
        Agabivarbur1
        Uncre1
        Armme11
        Suidec1
        Aspka11
        MagorBR32
        Bjead1
        Gymlu1
        CopciAmutBmut1
        Thihy1
        Aspgl1
        Leugo1
        Bacci1
        Schoc1
        Gloin1
        ....ext
        NRT2.txt >ANRT2 MDFAKLLVASPEVNPNNRKALTIPVLNPFNTYGRVFFFSWFGFMLAFLSWYAFPPLLTVTIRDDLDMSQT +QIANSNIIALLATLLVRLICGPLCDRFGPRLVFIGLLLVGSIPTAMAGLVTSPQGLIALRFFIGILGGT +FVPCQVWCTGFFDKSIVGTANSLAAGLGNAGGGITYFVMPAIFDSLIRDQGLPAHKAWRVAYIVPFILI +VAAALGMLFTCDDTPTGKWSERHIWMKEDTQTASKGNIVDLSSGAQSSRPSGPPSIIAYAIPDVEKKGT +ETPLEPQSQAIGQFDAFRANAVASPSRKEAFNVIFSLATMAVAVPYACSFGSELAINSILGDYYDKNFP +YMGQTQTGKWAAMFGFLNIVCRPAGGFLADFLYRKTNTPWAKKLLLSFLGVVMGAFMIAMGFSDPKSEA +TMFGLTAGLAFFLESCNGAIFSLVPHVHPYANGIVSGMVGGFGNLGGIIFAIIFRYSHHDYARGIWILG +VISMAVFISVSWVRPVPKSQMRE >Metac1_3189 MGFNISLLWKTPMVDPINKKARSIPVLNVVDPYGRVFFFSWMGFMLGFWAWYTFPPLLTVTIKKDLHLSA +AEVANSNIVSLCATLLLRFVAGPLCDQFGSRRVYASLLLLGCLPVGLAPLVKTANGLYVSRFFIGILGA +TFVPCQVWCTGFFDKNIVGTANALSGGWGNAGGGITYFIMPAVFDSLVASQGMAPSKAWRVTFVVPLIC +LIACALGMLFLCPDTPLGSWEERSQKLQENLDQYSPTSTTAVNTPHILSEPPSRDVEKAEEFDEDSKFY +KQPSAISLSEAVAIAQAETVVKPSFKDSLPVMLSLQTLFHVATYSCSFGGELAVNSILSSYYKANFPHL +DQTKASNYAAIFGFLNFVTRPLGGVVADILYRMSGQNLWTKKAWITMAGLLSGALLIIVGKVDPSEANG +RDIGTMVGLVTVAAFFIEAGNGANFALVPHVYPAANGVLSGCTGGGGNLGGVVFAIIFRFIDHGSGYAT +ACWVIGVIHIAVNLAVCRIPPLPKGQVGGQ >MagorUS71_00075311 MGINVKFSDLYRAPEVNPITRKARSIPALNVINMYGRVFFFSWFGFMIAFWAWYTFPPLLTVTIRKDLNL +TAAEVANSNIVSLVATLFVRMVAGPLCDLWGSRVVFGGVLLVGAIPLGLAPLIQNATGLYVSRFFIGIL +GGAFVPCQVWSTGFFDKNVVGTANALTGGFGNAGGGITYFIMPAVFDSFVHRMGYTPGQAWRLTFVVPL +VMIIVTGVSLLLLCPDTPTGKWSERHMHAQQMVGQASTTDATNQDKIVDVPGSITDKGPNASNSSEGNS +FVEEKEKTRKEKDEQVGELLDAEAGRVIKSDDAAVQNTDTIAKPTFGESLRVMASPQTLVHVLTYFCSF +GGELAINAILSSYYLKNFPELGQTGASNYAAIFGFLNFITRPLGGVVSDLLYNAAGSGPRGLWLKKGWI +HVCGIATGALLILIGQLNPHHQPTMLGLVIFMAFFHEAGNGANFALIPHVHPHANGLVSGITGAGGNLG +GVVFAVVFRFVGGGTGYATGFWIIGIVHIAINIAMAWIKPLPKGQIGGY >Phchr2_2932727 MVYFPFARPQRSSVAPAETADALDTAAAAQIGHPEKLSLWERLTTVRINPANNKCTTLPILKLNNPYSIN +FHLSWLGFWVAFLSWFAFSPLVPEAVKNDLKLTQKQIGNSNIVSLCSTLLVRVIVGPLCDRFGPRKVMA +GLLIVGAIPSGLAGTVSSAQGLYVIRFFIGILGGTFVPCQAWTTAFYDTSIVGRANALVAGWGNSGGGF +TFIIMVALYDRLRSDGLSPHSAWRAAFAIVPVPILFFVAIITLLVGTDHPNGKWADRHKNAALVPAALT 
+DGSPRGSDDIEAIREIAPTQDGPKEKTENEKDAVNVDVTAVMPSRPPSIRQDSVPLTWKIALDVVLNPL +TWLPALAYMSSFGYELAIDANLANVYFGLYNKTKGFGQTRCGYIASIFGLLNVFSRPLGGYMGDVVYRR +WGVPGKKYLVLALGVLQGALSLAWGLYLDRHAASLAVVIVLMILTAAVDELGNGANFSLVPHCNPSSNG +VMTGIVGAMGNLGGVWFALMFRFQPSPFGKAFWIAGVVTMVTSVLLVVIRVPRK >Thiar1_121068 MGLKFHHLYASPEVNPASLKARSIPFFNPVDIYGRVFFFSWFGFMVAFWAWAAFPPLLTKVIQKELGLTP +AEVANSNIISPCAALLVRLVAGPLCDQFGPRIVFGGLLLVGSIPLGLAPLVHNAAGLYVSRFFIGILGG +AFVPCQVWSTGFFDKNIVGTANALAGGFGTAGGGITYFVMPAVYDAFVSYGHTAGEAWRLAFIVPLAVV +ITTGTALIVLCPDTPTGKWSERHLANSTPDDGSPSHNMTPANCSVIDVPGRITDKLPSPTAPSLSLSSR +QDPESGRQKPSEKNSHLANHKPMLDPESQLPIITLATAANTTKSEVVQKPTLSQAIRVAFSPQAIFHLL +TYMCSFGSELAINMIISSYYVKNFPSLSQTSAATFAALFGFQNFVTRPLGGVVSDLLYNYCGRSLWLKK +LWIVSCGVLAGVFLIVTGRLDPHGEGAMFGLVAVAGVFLQAGNGANFSLVPHVHPFANGILSGLTGAGG +NFGGVVFSVIFRFMDGGTNYAKGFWVIGVVNLVVCLGLSWIPPLPKGQVGGH >Thiar1_767720 MGFKPSDLWRTPEVNPVNKKARSVPILNPIDRHGRVFFFSWMGFMLAFWAWYTFPPLLSVTIKKDLNLTS +EEVANSNIVSLVATLLVRFAAGPLCDLLGSRKVFSLILLVGSIPIGLAPLIKDATGLYIIRFFIGILGG +SFVPCQVWCTGWFDKNVVGTANALSGGWGNAGGGITYFIMPAVYDSLVHRHGHTSGEAWRITFIVPLVC +LITCGLGLLFLCDDTPMGKWSDRHENVQQNLETQGISGKVVAITGNIADREPPSSSTSPSRAPSDIEKA +DPEKPKLTGDLTVTEAIETAQGETVVKPTFRDSLPVVFSLHALFHTATYACSFGGELAVNSILGAYYLK +NFPHLGQTNASNYAALFGFLNFVTRPLGGVVGDMLYNYFGRNLWLKKIWIHVCGLLTGALLILIGMLDP +HDLGTMVGLIVLMAVFHEAGNGANFALVPHVYPHANGVLSGLTGAGGNLGGVVFAIIFRYMDNGTNYAK +GFWVIGIMHIILNLAVCWIPPIPKGQIGGR >Micmi1_478558 MPAIFDHFVGHYKLSPHDAWRRAFFVPFAIIVGTAILMVVLCPDTPVGKWSERHEAVEANLRVIHEQGRH +VPQSSVVYGESTGASPADSTEKVDKLAITKDVEFGKGEVTEIDAEYAHEVIEKPSAKEIFKVFISPQTV +ALMACYFNSFGSELAINSVLGAYYLKNFPKLGQTSSGRWAAMFGLLNVYGRPLGGIISDIIYKYTKGNL +WAKKIWIHFLGVTMGVFMLAIGLANSHNQHTMIGLVAGLAFFMDASNGANFALVPHVHPQANGIVSGFV +GAVGNFGGVIGAIIFRYNVTNYGKSIWILGVIAIVMNLSVAWIRPIPKGQIGGR >Micmi1_311120 MGFNPAVLFKAPQVNPITKKARSIPILNPFNVYGRVFFFSWWGFMVAFLSWYAWSPLIGETIKADLKLTQ +AQIANSNILALVATLLVRCIAGPLCDKFGPRLVFAGVLLAGAVPTAFAFAIKNAAGLIVLRFFVGILGG +SFVPCQVWSTGFFDKNIVGTANSITGGFGNAGGGITYFVMPAIFDTFVNHYGMTKHKAWRMAFFVPFGM 
+IVGTAILMLLLTPDTPVGKWKDRHAAVEANLRAEHEAGRIIPHTGLGEAHHAHGPPLVLDDKKNDSTSD +VEHGTGEVVAVDTEYSHEVVMSPTFKEIVQIALSPQTLTLMACYFCSFGAELAINSILGAYYLKNFPKL +GQSGSGDWAAMFGLLNVVFRPMGGMMSDALYKFTGGKVWSKKILVHVMGVLMGMFMIIIGATDSHNRST +MVGLIAGLAFFLEAGNGANFGLVPHVHPYANGVVSGFTGASGNLGGIIGAIIFRYNGLHYGKSIWIFGI +IAIVLNLAVCWIRPVPKGQIGGR >Aspnid1_6363 MKPTQVLRLAVAAPDVNPQTRKARSIPVLNPFDLYGRVFFFSWIGFLVAFLSWYAFPPLLSVTIKKDLHM +SQDDVANSNIVALLGTFVMRFIAGPLCDRFGPRLVFVGLLICGAVPTAMAGLVTTPQGLIALRFFVGIL +GATFVPCQVWCTGFFDKNIVGTANSLAGGFGNAGGGITYFVMPAIYDSFVHDRGLTPHKAWRVSYIVPF +IIIVSIALAMLFTCPDTPTGKWADREKTSGQSIVDLSSTPNASSANSINISSDEKKAVHPEVTDSEAQV +HVRAGQIESSDAVIEAPTIKRYLSIALDPSALAVAVPYACSFGAELAINSILGAYYLLNFPLLGQTQSG +RWASMFGLVNVVFRPMGGFIADLIYARTNSVWAKKMWLVVLGLAMSGMAILIGFLDPHRESVMFGLVVL +MAFFIAASNGANFAIVPHVHPSANGIVSGIVGGMGNFGGIIFAIVFRYNGTQYHRSLWIIGFIILGCTL +FFSWVRPVPKQNH >Aspnid1_5705 MDFAKLLVASPEVNPNNRKALTIPVLNPFNTYGRVFFFSWFGFMLAFLSWYAFPPLLTVTIRDDLDMSQT +QIANSNIIALLATLLVRLICGPLCDRFGPRLVFIGLLLVGSIPTAMAGLVTSPQGLIALRFFIGILGGT +FVPCQVWCTGFFDKSIVGTANSLAAGLGNAGGGITYFVMPAIFDSLIRDQGLPAHKAWRVAYIVPFILI +VAAALGMLFTCDDTPTGKWSERHIWMKEDTQTASKGNIVDLSSGAQSSRPSGPPSIIAYAIPDVEKKGT +ETPLEPQSQAIGQFDAFRANAVASPSRKEAFNVIFSLATMAVAVPYACSFGSELAINSILGDYYDKNFP +YMGQTQTGKWAAMFGFLNIVCRPAGGFLADFLYRKTNTPWAKKLLLSFLGVVMGAFMIAMGFSDPKSEA +TMFGLTAGLAFFLESCNGAIFSLVPHVHPYANGIVSGMVGGFGNLGGIIFAIIFRYSHHDYARGIWILG +VISMAVFISVSWVRPVPKSQMRE >PenroP1_04323 MAPGFFKRLYVSPEINPSTHKAKSIPVLNPFDKYGRVFFFSWLGFMVAFLSWYAFPPLLNVTIKKDLKMT +QEDVANSNIVALLATLLVRFVAGPLCDRYGPRLVFVGLLLCGAIPTAMAGLVTGPKGLIALRFFIGILG +GTFVPCQVWCTGFFDKSIVGAANSLSGGWGNAGGGITYFVMPAVYDSLVQSRGIPSHKAWRIAYVIPFI +IITAVALCMLVLCEDTPTGKWSERNLWAKDSNGTTSAPNANIVDINSCTSSSGTMTPHNAATIDSEKKG +TQSPHVIDDTPATGQIDIFRQETVVSPTRREALNVAMSLSTMALAIPYACSFGSELAINSMLGSYYTEQ +FPHMSQTKSGQWAAMFGLLNVVCRPAGGLFGDLVYLYTGTAWSKKILIAFLGIGMGAFQLAIGLSNPST +EATMFGLVAGLAFFIEASNGANFALVPHVYPFANGIVSGIVGGLGNLGGIIFAIIFRYNGSNYGRSLWI +IGVISLATNLAVSWIRPIPKSQTLS >Pyrtt1_5571 MPFAISMLWSAPELNPYNKKARSIPVLNPVNKYGRVFFFSWLGFFIAFWSWYAFPPLLSKSIKADMHLSQ 
+DQIANSNIVALCATLLVRFIAGPMCDHFGPRITFASLLFAGAIPTALAGTAHNATGLYFIRFFVGILGG +TFVPCQVWTTGFYDKNVVGSANALVGGWGNSGGGITYFVMPVIYDSLKSNQGLSSHVAWRVSFIVPFVL +ISACAVALLLLTEDTPTGKWSERGVTVVSGDQPNQAGHSIVPTTGALDDKPSTAASLSSNDEKKYENTA +ADVETANGDVQIMDEVQHEVVVKPSLKEGLKVMFSLQTGALCAGYFCSFGGELAINSILGAYYLKNFPY +LGQTQSGRWAAMFGLLNVITRPLGGFIADLLYQTTGHNLWAKKLWINFVGIMTGVMCIIIGKLDPHNLS +EMIGLIALMAIFLEAGNGANFALVPHVHPHANGVLSGIVGATGNFGGIIFAIIFRYHKTNYSQVFWIIG +IMIIALNCAFIWVRPIPKNQIGGR >Sodal1_324937 MGLDYLWKAPEVNPINLKACRRFETTRKIQGCPANKLQQARSVPVLNPFNKYGAAFFFSWMGFMIAFWAW +YTFPPLLTVTIRDDLNLTPAQVANSNIVSLSSTLLMRLLAGPACDKFGSRLVFGGLLLLGALPVGLAPL +VQDATGLYISRFFIGVLGATFVPCQVWCTGFFDKNIVGTANALAGGWGNAGGGITYFVMPAVFDSFRDR +GYSPAVAWRLTFIVPLICIIVCGVGLILCCEDTPMGKWSDRHLHIQENLRNQGVEDATLVNVVNVPGGI +TDRPEPSPAPASADEERNSSTKSRKDESHFDAQAIDLSRAEMLETAQGETVAKPSLRDSLRVAVSPQTI +FHVLTYACSFGGELAINAILSSYYLKNFPHLGQTGASNWAAMFGFLNFVTRPLGGIVGDLLYNYVGRDL +WWKKGWIVLCGVATGVLLVLIGQLDPHHEPTMFGLIFLMAVFHEAGNGANFALVPHVHPAANGVLSGLT +GAGGNLGGVVFAIIFRFMDGGTDYAKGFWVIGCMHIGLNLLVSWIPPLPKGQIGGH >Aspfl1_27006 MDSVKLLFLSPEVNPSNRKARSIPILNPFDKYGRVYFFSWLGFMVAFLSWYAFPPLLTVTIRKDLKMTQP +EVANSNIVALLATLLVRFVAGPLCDRFGPRLVFIGLLLCGSIPTAMAGLVTNAQGLIALRFFVGILGGT +FVPCQVWCTGFFDKKIVGTANSLAAGWGNAGGGITYFVMPAIFDSLVHNQGLPAHKAWRVAYIVPFIII +VVIAVAMFFTCEDTPTGKWSERHLWAEETSRFEGNIVNINSGISSSHPSSPPSTTNIVADLEKKGNPSP +PESIAPMPGQLESLRTDTVVAPTFKEAMNVLLSLSTAAVAIPYACSFGAELAINSILGDFYAENFPYMG +QTKTGQWAAMFGLLNVICRPAGGFIADLLYRHTQSVWSKKILLSFLGVGMGAFQLALGFSNPKSEATMF +GLTAGLAFFLEACNGANFAVVPHVHPFANGIVSGAVGGMGNLGGIIFAIIFRYNGSHYARSLWIIGIIA +IAANLAVSWIRPVPRPQMV >Necha2_90170 MGFQIAHMWKAPEVNPISRKARSVPVLNPVDIYGRVFFFSWMGFMLAFWAWYTFPPLLTVTIKKDLHLTP +AQIANSNIVSLSATFFLRFITGPLCDQFGPRRVFAYLILLGCFPIGLAPLVKNATGLYISRFFIGILGA +TFVPCQVWCTGFFDKNIVGTANALAGGWGNAGGGITYFIMPAVFDSLVAHQGLTPSKAWRVTFIVPLIC +LIVCGVGMLLLCPDSPMGDWDDQAQQVRKNMEEHGVSTSDEITPVGTPDRSGSDEENSAGCEKDVKVGD +HEHSITRNEAMEIAQGEIVVKPSLKEALPVLYSPQTFFHVATYACSFGGELAINAVLSAYLKKNFPHLD +QTKASNYAAIFGFLNFVNRPLGGVIADILYNKFGRNLWLKKGWITVCGLLTGALLILIGRVNPAESNGG 
+TVGTFVGLIVLMSVFHEAGNGANFALVPHVHPFANGILSGLTGGGGNLGGVIFAIIFRFMNQGKDFAMG +FWVIGIIHIALNLAVCWIPPLPKGQVGGH >Conli1_4928 MGVNIKFTDLYKAPDVNPVNRKAHSIPALNPINMYGRVFFFSWFGFMIAFWAWYTFPPLLTVTIKADLHL +TPAQVANSNIVSLVATLFIRFVAGPLCDMYGPRRVFAGTLLVGALPLGLAPLIHNATGLYVSRFFIGIL +GGSFVPCQVWSTGFFDKNIVGTANALTGGFGNAGGGITYFIMPAVYDSFVHMGHTPHQSWRLTFIVPLI +MAIATGLSLLLLCPDTPMGKWSERHLHVQENLQLHGVAERVVDIPGGITDKATPSESHVSDGEEKKPVT +YDHEAALSKSEMIETAQGETIQKPTLREALPVIFSQQTAFHFLTYFCSFGGELAINAILSSYYLKNFPT +LGQTKASNWAAMFGFLNFVTRPLGGVVSDLLYNLAGRNLWLKKGWITTCGVATGALLILIGQLNSHHQS +TMYGLVALMAFFLEAGNGANFALVPHVHPFANGILSGVTGAGGNLGGVVFAVIFRFMGGGTNYAKAFWV +IGVIHIAMNLSVCWIRPLPKGQIGGH ...ext

      Now that you've provided an indication of your data, a much better solution (than my earlier tentative suggestion) presents itself.

      Assuming you have a filehandle, e.g. $matches_fh, to your file of match data (Genomes_used_Hant.txt in your example); and another, e.g. $fasta_fh, to your fasta data (NRT2.txt in your example); you can capture the wanted counts like this:

      my $alt = join '|', reverse sort <$matches_fh>;
      my $re = qr{(?x: ^ > ( $alt ) )};
      my %count;
      /$re/ && ++$count{$1} while <$fasta_fh>;
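The `reverse sort` is presumably there because regex alternation tries its alternatives left to right: if a short name such as `XYZ` came before `XYZ12`, a header like `>XYZ12_1` would be counted under `XYZ`. Here is a minimal sketch of that effect. The names and headers are made up, `count_with` is a hypothetical helper, and the `quotemeta` call is my addition rather than part of the code above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical names: XYZ is a prefix of XYZ1, which is a prefix of XYZ12.
my @names   = ('XYZ', 'XYZ1', 'XYZ12');
my @headers = ('>XYZ12_1', '>XYZ1_2', '>XYZ_3');

# Count header matches using the alternatives in the order given.
sub count_with {
    my @ordered = @_;
    my $alt = join '|', map { quotemeta($_) } @ordered;
    my $re  = qr/^>($alt)/;
    my %count;
    for my $header (@headers) {
        $count{$1}++ if $header =~ $re;
    }
    return \%count;
}

my $naive   = count_with(sort @names);            # XYZ is tried first
my $longest = count_with(reverse sort @names);    # XYZ12 is tried first

# Ascending order lumps every header under the shortest name.
print "ascending:  $_ => $naive->{$_}\n"   for sort keys %$naive;
# Descending order credits each header to the longest matching name.
print "descending: $_ => $longest->{$_}\n" for sort keys %$longest;
```

Note that ordering only helps among the names you actually list: a header whose name merely starts with a listed name would still be counted unless the pattern also requires the delimiter (the underscore in your data) after the name.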

      The code you've presented in a couple of your posts in this thread uses the 3-argument form of open with lexical filehandles: this is very good. You are not, however, checking for I/O errors: this is not good at all. The easiest method is to let Perl do this checking for you with the autodie pragma; the alternative is to do this yourself, as shown in the open documentation.
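To make the two options concrete, here is a small sketch showing both styles side by side. It writes a throwaway file with File::Temp so it can run anywhere; the file contents and variable names are illustrative assumptions, not taken from your scripts.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "Gloin1\n";
close $tmp_fh or die "Cannot close '$tmp_name': $!";

# Option 1: check the return values of open and close yourself.
open my $manual_fh, '<', $tmp_name
    or die "Cannot open '$tmp_name' for reading: $!";
my @manual_lines = <$manual_fh>;
close $manual_fh or die "Cannot close '$tmp_name': $!";

# Option 2: let autodie throw an exception if open or close fails.
my @autodie_lines;
{
    use autodie;    # lexically scoped: only affects this block
    open my $fh, '<', $tmp_name;
    @autodie_lines = <$fh>;
    close $fh;
}

# With autodie in effect, a failed open dies with a descriptive message.
my $error = '';
{
    use autodie;
    eval { open my $fh, '<', "${tmp_name}.does-not-exist"; 1 }
        or $error = "$@";
}

print "Read ", scalar(@manual_lines), " line(s) with manual checks\n";
print "Read ", scalar(@autodie_lines), " line(s) with autodie\n";
print "autodie reported: $error\n" if $error;
```

In a real script you would normally put `use autodie;` once near the top, next to `use strict;` and `use warnings;`, rather than in separate blocks as done here for demonstration.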

      In the test code below, I've used Inline::Files purely for convenience. The count information is in %count: you can format and output this however you want.

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Data::Dump;
      use Inline::Files;

      my $alt = join '|', reverse sort <MATCHES>;
      my $re = qr{(?x: ^ > ( $alt ) )};
      my %count;
      /$re/ && ++$count{$1} while <FASTA>;
      dd \%count;

      __MATCHES__
      Gloin1
      XYZ1
      XYZ
      XYZ12
      __FASTA__
      >Gloin1_1
      unwanted data
      >XYZ_1
      unwanted data
      >XYZ12_1
      unwanted data
      >XYZ1_2
      unwanted data
      >XYZ1_1
      unwanted data
      >XYZ12_3
      unwanted data
      >Gloin1_2
      unwanted data
      >XYZ12_2
      unwanted data

      Output:

      { Gloin1 => 2, XYZ => 1, XYZ1 => 2, XYZ12 => 3 }
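If you want the counts in the "Gloin1: 2 occurrences in NRT2.txt" form you asked for, something like the following sketch may help. It uses in-memory filehandles as stand-ins for the two files and needs only core Perl (5.10 or later, for the `//` operator). The extra name `Absent1` is a made-up entry added to show a zero count; the `chomp` and `quotemeta` steps are a variation on the code above, not a requirement.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-ins for the two files, based on the sample data in the test above.
# 'Absent1' is a hypothetical name that never appears in the FASTA data.
my $matches_data = "Gloin1\nXYZ1\nXYZ\nXYZ12\nAbsent1\n";
my $fasta_data   = '';
$fasta_data .= ">$_\nunwanted data\n"
    for qw(Gloin1_1 XYZ_1 XYZ12_1 XYZ1_2 XYZ1_1 XYZ12_3 Gloin1_2 XYZ12_2);

# In-memory filehandles; in real use, open the actual files instead.
open my $matches_fh, '<', \$matches_data or die "open matches: $!";
open my $fasta_fh,   '<', \$fasta_data   or die "open fasta: $!";

# Keep the names so the report can follow the order of the matches file.
chomp(my @names = <$matches_fh>);

# Longest names first, so XYZ12 is tried before XYZ1 and XYZ.
my $alt = join '|', map { quotemeta($_) } reverse sort @names;
my $re  = qr/^>($alt)/;

my %count;
while (my $line = <$fasta_fh>) {
    $count{$1}++ if $line =~ $re;
}

# One report line per name, including names that were never seen.
my @report;
for my $name (@names) {
    push @report, sprintf '%s: %d occurrences in NRT2.txt',
        $name, $count{$name} // 0;
}
print "$_\n" for @report;
```

To write the report to an output file rather than the terminal, open a filehandle for writing and print the lines to that instead.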

      — Ken

      Something along these lines then:

      use strict;
      use warnings;

      my $inputFile = q{/path/to/NRT2.txt};
      open my $inputFH, q{<}, $inputFile
          or die qq{open: < $inputFile: $!\n};

      my %occurs;
      while ( <$inputFH> ) {
          $occurs{ $1 } ++ if m{^>([^_]+)}
      }

      close $inputFH
          or die qq{close: < $inputFile: $!\n};

      print qq{$_: $occurs{ $_ } occurrences in $inputFile\n}
          for sort keys %occurs;

      A more comprehensive example of your input file would be needed to be sure of the solution.

      Update: Too simplistic; ignore this, as examples of both files are needed before making a stab at a solution.

      Cheers,

      JohnGG
