PerlMonks  

.VCF records cleansing

by solocazzimiei (Initiate)
on Apr 08, 2020 at 09:51 UTC ( #11115215=perlquestion )

solocazzimiei has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, and sorry for the bad post formatting, but PerlMonks doesn't seem to recognize \n.

Here is the situation: I have accumulated lots of .vcf files, each containing plenty of duplicated records, and I'm working on a custom script to rationalize, merge and cleanse over 700k records.

I'm stuck on finding a universal way to split the fields of records like this:

$in[$x] = "NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress:=0A=0Aor. Soroca=0ARepublic of Moldova=0A=0A=0A=0A Footwear. =Children's footwear. Lady's footwear.";

using code like:

$in[$x] =~ /(\w+\W?\w*\W?\w*\W?\w*\W?\w*\W?\w*\W*?\w*\W*?\w*\W*?\w*)\:(=?0?.+:*)/;
$a = $1; $2;
$a =~ s/;X-SYNCMLREF\d+//;
$key{$a} = $a;

This script works fine on almost all VCF labels such as N, FN, ORG, TEL, etc., but for a record like the one above the split happens at the last \: instead of the first one, despite my use of non-greedy quantifiers:

$1 = "NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress"
$2 = "=0A=0Aor. Soroca=0ARepublic of Moldova=0A=0A=0A=0A Footwear. =Children's footwear. Lady's footwear."

Any suggestions? Thanks

Replies are listed 'Best First'.
Re: .VCF records cleansing
by hippo (Chancellor) on Apr 08, 2020 at 10:21 UTC

    Hello, solocazzimiei and welcome to the Monastery. It is quite difficult to read your code in this post and it would help considerably if you could enclose each section of code within <code>...</code> tags.

    While it is not entirely clear what you actually want here, you can use many techniques to separate a key from a value based upon the first colon. Since you are using a regex anyway, here is one such.

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $in = "NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress:=0A=0Aor. Soroca=0ARepublic of Moldova=0A=0A=0A=0A Footwear. =Children's footwear. Lady's footwear.";
    my $want_key = 'NOTE;ENCODING=QUOTED-PRINTABLE';
    my $want_value = "=0AAddress:=0A=0Aor. Soroca=0ARepublic of Moldova=0A=0A=0A=0A Footwear. =Children's footwear. Lady's footwear.";

    $in =~ /^([^:]+):(.*)/;
    is $1, $want_key, 'Key matches';
    is $2, $want_value, 'Value matches';
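For readers wondering why the pattern in the question lands on the second colon: a colon is itself a non-word character, so one of the greedy `\W?` pieces can consume the first colon before the literal `\:` is reached. The sketch below compares the question's pattern with the negated character class on a shortened copy of the sample record. The shortened record and the sub names `loose_key` and `strict_key` are assumptions made for illustration and are not part of the thread.

```perl
use strict;
use warnings;

# Shortened record with the same shape as the one in the question
# (an assumption for illustration).
my $record = 'NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress:=0A=0Aor. Soroca';

# Pattern from the question. The greedy \W? pieces accept any single
# non-word character, including ':', so the first colon ends up inside $1.
sub loose_key {
    my ($line) = @_;
    return $line =~ /(\w+\W?\w*\W?\w*\W?\w*\W?\w*\W?\w*\W*?\w*\W*?\w*\W*?\w*)\:(=?0?.+:*)/
        ? $1
        : undef;
}

# Negated character class: [^:]+ cannot step over a colon, so the key
# always ends at the first one.
sub strict_key {
    my ($line) = @_;
    return $line =~ /^([^:]+):/ ? $1 : undef;
}

print "loose:  ", loose_key($record),  "\n";  # NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress
print "strict: ", strict_key($record), "\n";  # NOTE;ENCODING=QUOTED-PRINTABLE
```

The non-greedy `\W*?` quantifiers in the question's pattern do not help here, because the first colon has already been taken by an earlier greedy `\W?` and the overall match succeeds without any backtracking.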

    Equally, you could use split or even index and substr to achieve the same end.
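A minimal sketch of those two alternatives, assuming each record is a single line and that the key never contains a colon. The sub names `key_value_by_split` and `key_value_by_index` and the shortened sample record are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# split with a LIMIT of 2 cuts at the first colon only; any later
# colons stay in the value.
sub key_value_by_split {
    my ($line) = @_;
    my ($key, $value) = split /:/, $line, 2;
    return ($key, $value);
}

# index finds the position of the first colon; substr takes the parts
# before and after it. Returns an empty list when there is no colon.
sub key_value_by_index {
    my ($line) = @_;
    my $pos = index $line, ':';
    return if $pos < 0;
    return (substr($line, 0, $pos), substr($line, $pos + 1));
}

# Shortened sample record (an assumption for illustration).
my $record = 'NOTE;ENCODING=QUOTED-PRINTABLE:=0AAddress:=0A=0Aor. Soroca';

my ($k1, $v1) = key_value_by_split($record);
my ($k2, $v2) = key_value_by_index($record);

print "$k1 => $v1\n";
print "both approaches agree\n" if $k1 eq $k2 && $v1 eq $v2;
```

The `index`/`substr` version avoids the regex engine entirely, which may matter when looping over several hundred thousand records, though that has not been measured here.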

Re: .VCF records cleansing
by karlgoethebier (Abbot) on Apr 08, 2020 at 16:44 UTC

    VCF might be helpful. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: .VCF records cleansing
by AnomalousMonk (Bishop) on Apr 08, 2020 at 18:02 UTC

      Thanks indeed for your info!<\p>

      Because my .VCFs are not properly uniform & clean, owing to the use of different codecs, I can't use stnd VCF modules and need a custom script.<\p>

      Anyhow, I seem to have found the right script scheme and solved it!!<\p>

      Simon<\p>

        Oh, and another thing...

        ... my .VCFs are not properly uniform & clean ... I can't use stnd VCF modules ...

        Even if you cannot use a module as it stands, you can always look at the source of the module and steal the code, adapting it to your particular needs. That's why CPAN modules are posted publicly with broad, open-source copyrights.


        Give a man a fish:  <%-{-{-{-<

        I'm glad you've found a solution you're happy with!

        BTW: PerlMonks posts are in HTML markup, which <\p> is not. That's why  \n is ignored: HTML renderers usually (always?) ignore whitespace. Please see Writeup Formatting Tips and Markup in the Monastery.


        Give a man a fish:  <%-{-{-{-<

Node Type: perlquestion [id://11115215]
Approved by Corion