Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Problem with larger files (and s/)

by Cloudster (Novice)
on Jun 24, 2008 at 21:43 UTC ( #693829=perlquestion: print w/replies, xml ) Need Help??

Cloudster has asked for the wisdom of the Perl Monks concerning the following question:

A boon I crave, oh wise monks! Be merciful upon this newb supplicant!

I wrote my first Perl program a couple of weeks ago and am quite happy with it (now I just have to figure out how to link it in to a web page and my PHP chat program, but Iíll figure that out eventually.) My job is database administration and development. Yesterday I sat down to do something that I do several times a week, reformat a script generated by SQL Server 2000ís Enterprise Manager when I realized that this was another opportunity to learn more of Perl and save quite a bit of time in the future, so I set forth and have fallen flat on my face. (I wrote most of this post this morning, Iíve since made most of it work)

Hereís what a SQL Server script looks like, this would be saved in a file:

CREATE TABLE [dbo].[CustomerBalances] ( [arbh_acct] [int] NOT NULL , [arbh_prop_code] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_ +AS NOT NULL , [arbh_unpd_bal] [decimal](10, 2) NOT NULL , [rCount] [tinyint] NOT NULL , [arbh_ar_cat] [tinyint] NULL ) ON [PRIMARY] GO

What I want to do is remove the square brackets, remove the COLLATE blah blah blah, insert three tabs in front of the data type, and insert a couple of tabs in front of the null option, etcetera. The tabs wouldnít align perfectly in the end because of variance in the field name length, but thatís ok, TextPad is excellent for making that easy.

Iíve got most of that working. I have two final problems, both of which are beyond my skill. First, I canít get the NULL/NOT NULL to parse correctly. I end up with NOT\t\tNULL.

Second, and most critical, is file size. If Iím dealing with a dinky little file like the above (300 bytes), it runs just fine. But if I feed it a 9k script file with 200 lines of code, I get a really weird result. The output file displays like it has an additional space between every character, there are lots of non-ASCII values in the file, and nothing has matched and been reformatted.

I have no idea whatís going on with my program. My intent was that a line would be read, tested to see if it contains a space and a data type name, and if it did, replace that space with three tabs. My result file is very badly mangled, and no longer ASCII. Obviously Iím missing something.

This is what the first part of the file looks like when itís done processing:

CREATE TABLE 搀戀漀 .䌀甀猀琀漀洀攀爀䈀愀氀愀渀挀攀猀 ( ਍ऀ arbh_acct  int 一伀吀 一唀䰀䰀 Ⰰഀഀ 愀爀戀栀开瀀爀漀瀀开挀漀搀攀 瘀愀爀挀栀愀爀 (30) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL , ਍ऀ arbh_unpd_bal  decimal⠀㄀ Ⰰ ㈀⤀ 一伀吀 一唀䰀䰀 Ⰰഀഀ 爀䌀漀甀渀琀 琀椀渀礀椀渀琀 NOT NULL , ਍ऀ arbh_ar_cat  tinyint 一伀吀 一唀䰀䰀 ഀഀ ) ON 倀刀䤀䴀䄀刀夀 ਍䜀伀ഀഀ ਍䌀刀䔀䄀吀䔀 吀䄀䈀䰀䔀  dbo⸀ EmailText ⠀ഀഀ 䴀攀猀猀愀最攀䈀漀搀礀 瘀愀爀挀栀愀爀 (255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL , ਍ऀ InsertSequence  bigint 䤀䐀䔀一吀䤀吀夀 ⠀㄀Ⰰ ㄀⤀ 一伀吀 一唀䰀䰀 ഀഀ ) ON 倀刀䤀䴀䄀刀夀 ਍䜀伀ഀഀ ਍䌀刀䔀䄀吀䔀 吀䄀䈀䰀䔀  dbo⸀ ErrorCodes ⠀ഀഀ 䔀爀爀漀爀䌀漀搀攀 猀洀愀氀氀椀渀琀 NOT NULL , ਍ऀ ErrorDesc  varchar ⠀㐀 ⤀ 䌀伀䰀䰀䄀吀䔀 匀儀䰀开䰀愀琀椀渀㄀开䜀攀渀攀爀愀氀开䌀倀㄀开䌀䤀开䄀匀 一伀吀 一唀䰀䰀 Ⰰഀഀ 䔀爀爀漀爀䰀漀挀愀琀椀漀渀 挀栀愀爀 (1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ਍⤀ 伀一  PRIMARYഀഀ GO ਍ഀഀ

Iíve looked at the source file, and itís definitely ASCII text, not Unicode. Hereís my program, I was trying to have all of the data types in one array and work it from that angle, but thatís beyond my skill right now:

print "\nThis program reformats scripts produced by SQL Server 2000 En +terprise Manager\n"; print "to remove brackets and tab out data types and null settings.\n\ +n"; print "You provide a file name, this program reads it and produces a n +ew file\n"; print "with a .out extension.\n\n"; print "File name to process? (<enter> to end program.) "; chomp($sqlfile = <stdin>); $outfile = $sqlfile . ".out"; $datatypes = " binary/ bigint/ bit/ char/ datetime/ " . " decimal/ float/ image/ int/ money/ " . " nchar/ ntext/ nvarchar/ numeric/ real/ " . " smalldatetime/ smallint/ smallmoney/ sql_variant/ " . " sysname/ text/ timestamp/ tinyint/ varbinary/ varchar/ " . " uniqueidentifier"; open(IN, $sqlfile) || die "cannot open $sqlfile for input: $!"; open(OUT, ">$outfile") || die "cannot open $outfile for output: $!"; while (<IN>) { chomp; #delete or comment out the two following lines #if you need brackets around your object names. s/\[//g; s/\]//g; s/ \(/\(/g; s/ \,/\,/g; s/COLLATE SQL_Latin1_General_CP1_CI_AS//g; s/ON PRIMARY//g; s/\sbinary/\t\t\tbinary/g; s/\sbigint/\t\t\tbigint/g; s/\sbit/\t\t\tbit/g; s/\schar/\t\t\tchar/g; s/\sdatetime/\t\t\tdatetime/g; s/\sdecimal/\t\t\tdecimal/g; s/\sfloat/\t\t\tfloat/g; s/\simage/\t\t\timage/g; s/\sint/\t\t\tint/g; s/\smoney/\t\t\tmoney/g; s/\snchar/\t\t\tnchar/g; s/\sntext/\t\t\tntext/g; s/\snvarchar/\t\t\tnvarchar/g; s/\snumeric/\t\t\tnumeric/g; s/\sreal/\t\t\treal/g; s/\ssmalldatetime/\t\t\tsmalldatetime/g; s/\ssmallint/\t\t\tsmallint/g; s/\ssmallmoney/\t\t\tsmallmoney/g; s/\ssql_variant/\t\t\tsql_variant/g; s/\ssysname/\t\t\tsysname/g; s/\stext/\t\t\ttext/g; s/\stimestamp/\t\t\ttimestamp/g; s/\stinyint/\t\t\ttinyint/g; s/\svarbinary/\t\t\tvarbinary/g; s/\svarchar/\t\t\tvarchar/g; s/\suniqueidentifier/\t\t\tuniqueidentifier/g; s/NOT NULL/\t\tNOT NULL/s; s/NULL/\t\tNULL/s; print "$_\n"; print OUT "$_\n"; } close(OUT) || die "problem closing new $outfile: $!"; close(IN) || die "problem closing original $sqlfile: $!";
Any suggestions would be most welcome. Yes, it's a rather brute-force approach, but I'm new to Perl and it does what I want it to (mostly).

Replies are listed 'Best First'.
Re: Problem with larger files (and s/)
by jds17 (Pilgrim) on Jun 24, 2008 at 22:44 UTC
    Hi, your code did not output strange signs when running it, I would suspect there is something wrong with the input file, but cannot tell without further information.

    I would like to focus on your other problem (inserting the tabs in some places, not in others). My suggestion would be as follows, for completeness I include big parts of what you already have written. It's late so the code is not completely polished, e.g. the first two substitutions should be written with inline comments and there may be more elegant ways to write some of the regexes, but here it is:

    print "\nThis program reformats scripts produced by SQL Server 2000 En +terprise Manager\n"; print "to remove brackets and tab out data types and null settings.\n\ +n"; print "You provide a file name, this program reads it and produces a n +ew file\n"; print "with a .out extension.\n\n"; print "File name to process? (<enter> to end program.) "; chomp($sqlfile = <stdin>); $outfile = $sqlfile . ".out"; open(IN, $sqlfile) || die "cannot open $sqlfile for input: $!"; open(OUT, ">$outfile") || die "cannot open $outfile for output: $!"; while (<IN>) { #remove square brackets s/(\[|\])//g; #remove whitespace before round brackets... s/\s+((\(|\)))/$1/g; #...and commas s/\s+,/,/g; #remove some keywords s/COLLATE SQL_Latin1_General_CP1_CI_AS//g; s/ON PRIMARY//g; #remove duplicate whitespace s/\s+(\s)/$1/g; #THE MOST INTERESTING PART: #For lines not starting with non-whitespace (should hopefully be #the case only for the first line, otherwise you have to track #the line number lest you analyze keywords): #replace (single) whitespace character before word by three #tabs in case the following expression is neither "NULL" #nor "NOT NULL" s/\s+(?!(?:NOT )?NULL)([a-zA-Z]\w*)/\t\t\t$1/g if !/^\S/; print; print OUT; } END { close OUT || die "problem closing new $outfile: $!"; close IN || die "problem closing original $sqlfile: $!"; }
    Some comments:
    1. The most interesting part is the negative lookahead used for inserting the three tabs in the places described above.
    2. You do not need to chomp the input lines since otherwise you need to add newlines afterwards again.

    Hope this helps a bit and gave you some new ideas.

      Thank you! Very interesting code, I look forward to studying and dissecting it.
Re: Problem with larger files (and s/)
by moritz (Cardinal) on Jun 24, 2008 at 22:12 UTC
    open(IN, $sqlfile) || die "cannot open $sqlfile for input: $!"; open(OUT, ">$outfile") || die "cannot open $outfile for output: $!";

    If you are sure that the files are all ASCII, try this instead:

    open(IN, '<:encoding(ASCII)', $sqlfile) or die "cannot open $sqlfile f +or input: $!"; open(OUT, '>:encoding(ASCII)', $outfile or die "cannot open file $outf +ile for writing: $!";

    Are you really sure that the files are in ASCII? which editor did you open it with? Try to inspect the files with a hex editor/dumper and see if it's really ASCII.

      I was certain that the file was ASCII. Correction, I was absolutely certain that it was ASCII.

      It wasn't.

      There are two ways of generating scripts in SQL Server 2000's Enterprise Manager. The script generator can save them to a file, or preview them and you can copy/paste into whatever you want. This is the route that I normally go. I did not notice that the final option screen let you specify OEM, ANSI or Unicode, and it defaults to Unicode.

      *facepalm*
      *bang head against wall*
      *rinse, repeat*

      I couldn't find a hex mode in TextPad, turns out that you specify a binary file type in the Open dialog, and then it quite obviously was Unicode.

Re: Problem with larger files (and s/)
by jethro (Monsignor) on Jun 24, 2008 at 22:40 UTC
    The NOT\t\tNULL issue is trivially corrected with
    if (not s/NOT NULL/\t\tNOT NULL/s) { s/NULL/\t\tNULL/s; }
    (at least I hope so, didn't test it).

    The following regex works too:

    s/(NOT\s+)?NULL/\t\t$1NULL/s
    To find the solution to the other problem I would suggest loading that 9k file into an editor (not word, not wordpad, but you probably know that), cut it by half and feeding it to your program again. If the problem is still there when it is 300 bytes long you know it isn't because of size, but probably the wrong file format.

    If not, put another print statement before all the regexp to see if perl reads the file incorrectly or mangles it in your code (which to me looks perfectly ok).

      Thank you very much for your trivial correction, it was spot-on. I'm learning Perl via an older copy of O'Reilly's Perl CD collection, and had not yet seen an example of using /s as an if test. Works like a charm.

      As far as the Unicode problem, see my reply to the previous reply.

      *facepalm again*
        Sorry, the 'trivially' wasn't meant as in 'trivial problem' but more like 'less sophisticated solution', i.e. without packing it into one regex or using negative lookahead (which I wanted to mention, but I couldn't remember the name, which was ample proof to me that it is a non-trivial solution). ;-)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://693829]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (8)
As of 2019-07-22 11:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    If you were the first to set foot on the Moon, what would be your epigram?






    Results (15 votes). Check out past polls.

    Notices?