Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

Remove tabs only between quotation marks

by bestresearch2 (Initiate)
on Apr 12, 2011 at 14:31 UTC ( #898944=perlquestion: print w/replies, xml ) Need Help??
bestresearch2 has asked for the wisdom of the Perl Monks concerning the following question:

I am reading in a tab delimited text file exported from Excel. Some fields have begin and end with quotation marks because the fields themselves have non-printable characters including tabs. How do I do search/replace for just the tabs inside of strings bounded by quotation marks?

In the following example, I'd like to remove the two tabs next to field3:


Replies are listed 'Best First'.
Re: Remove tabs only between quotation marks
by Fletch (Chancellor) on Apr 12, 2011 at 14:38 UTC

    Try using something like Text::CSV or the like and let it handle the nastiness of parsing the CSV-y data for you (just tell it to use \t as the delimiter rather than a comma) then do your diddling on the chunks it returns.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Remove tabs only between quotation marks
by kennethk (Abbot) on Apr 12, 2011 at 15:01 UTC
    Text::CSV (or another CSV parser) is your best bet for a reliable, robust solution. The issue primarily revolves about that fact that you need to count the number of quotation marks from the start of the string. For example, if your input were (check out Quote and Quote like Operators):


    you would not want to strip the tabs surrounding field4 even though it is surrounded by quotation marks. A regular expression like s/^([^"]*"[^\t"]*)\t+/$1/ would strip the tabs from the first quoted field, but not from all fields. You could wrap that in a while loop and swap the first character class to allow for paired quotes

    while (s/^((?:[^"]|"[^"]*")*"[^\t"]*)\t+/$1/){1}

    but CSV also includes an escape character, and any escaped quotation marks would then break the above. So why fight with all that when there is a well-tested solution at your fingertips?

Re: Remove tabs only between quotation marks
by jethro (Monsignor) on Apr 12, 2011 at 14:38 UTC

    Not that easy with a regex. You could split on " and search only in every second chunk. This only works if " does not occur escaped. inside a field.

    Ah, I see you got better advice in the Chatterbox with Text::CSV

Re: Remove tabs only between quotation marks
by JavaFan (Canon) on Apr 12, 2011 at 16:27 UTC
    $str = join "", map {my $_ = $_; /^"/ && s/\t+//g; $_} $str =~ /"[^"]* +"|[^"]*/g;
    This assumes you cannot have escaped quotes inside your quotes. And that your fields themselves do not contain quotes.
Re: Remove tabs only between quotation marks
by AnomalousMonk (Canon) on Apr 12, 2011 at 19:50 UTC

    Here's a regex solution. Single-quotes are used in place of double-quotes to avoid Windoze command-line escape-ology, but this approach will work with double-quotes just as well. Note: All quotes must be paired. Quoted sub-strings can contain escaped quotes.

    >perl -wMstrict -le "my $string = qq{uuu\tvvv\t'www\t\t'\txxx\t'\t\tyyy'\tzzz\n}; print qq{$string}; ;; $string =~ s{ ( ' [^\\']* (?: \\. [^\\']*)* ' ) } { (my $notabs = $1) =~ tr{\t}{}d; $notabs }xmsge; ;; print qq{$string}; " uuu vvv 'www ' xxx ' yyy' + zzz uuu vvv 'www' xxx 'yyy' zzz

    Update: This version ignores escaped double-quotes both inside and outside paired-quoted sub-strings:

    use warnings; use strict; # check with escaped double-quotes inside and outside # double-quoted sub-strings. my $string = qq{uuu\t\\"vvv\t"notab\t\t"\txxxx\t"\t\tno\\"tab"\tzzz\n}; print qq{$string}; $string =~ s{ ((?<! \\) " [^\\"]* (?: \\. [^\\"]*)* ") } { (my $notabs = $1) =~ tr{\t}{}d; $notabs }xmsge; print qq{$string};


    uuu \"vvv "notab " xxxx " no\"ta +b" zzz uuu \"vvv "notab" xxxx "no\"tab" zzz
Re: Remove tabs only between quotation marks
by wind (Priest) on Apr 14, 2011 at 23:01 UTC

    As has already been stated, Text::CSV is the most robust solutionf or this type or problem:

    use Text::CSV; use strict; use warnings; my $inputstring = qq{field1\tfield2\t\"field3\t\t\"\tfield4\n}; my $csv = Text::CSV->new({eol => "\n", sep_char => "\t", binary => 1}) +; $csv->parse($inputstring) or die "csv error: " . $csv->error_input(); my @columns = $csv->fields(); s/\t//g for (@columns); $csv->combine(@columns) or die "csv error: " . $csv->error_input(); my $string = $csv->string(); print $string;

    Here's the regex solution though using double quotes that allow escaped quotes

    my $inputstring = qq{field1\tfield2\t\"field3\t\t\"\tfield4\n}; $inputstring =~ s{("(?: (?> [^\"\\]+ ) | \\ . )*")}{ (my $str = $1) =~ + s/\t//g; $str }xesg; print $inputstring;
Re: Remove tabs only between quotation marks
by sundialsvc4 (Abbot) on Apr 12, 2011 at 15:53 UTC

    You seriously might wish to consider ... a brute-force loop.   Initialize a string-index, and initialize an in_quotes flag to false.   While that index has not yet passed the end of the string, look for a quote-mark or a tab.   A quote-mark flips the value of the in_quotes flag.   A tab, if in_quotes, is removed and the loop is restarted without incrementing the index (i.e. next) because the string just shrank and we don’t want to overlook any characters.

    Seriously... you won’t go home and tell your significant other “honey, let me tell you about the amazing algorithm I designed today,” but it will do the job perfectly well.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://898944]
Approved by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (9)
As of 2017-03-30 21:46 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (364 votes). Check out past polls.