PerlMonks  

string gets front truncated

by hsfrey (Beadle)
on Jul 29, 2008 at 23:47 UTC ( [id://700949]=perlquestion )

hsfrey has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading a very long html file in and concatenating it into a very long string. The FRONT of the file gets truncated! The code couldn't be simpler:
    # call as: ConvertFindLaw.pl PilotLifeFindLaw.htm
    use 5.010;
    use Switch;
    use strict;
    my $wholeFile;
    my $INPUT_FILE = shift;
    my $FH;
    open(FH, $INPUT_FILE) or die "Bad $INPUT_FILE: $!"; ;
    foreach my $line (<FH>)
    {   # print $line;    # All lines printed
        # chomp $line;    # this makes no difference to the problem
        $wholeFile .= $line;
    }   # Build a long string
    close(FH);
    print $wholeFile;     # front truncated here
If I print the lines in the loop (commented out above) they are coming in as expected. When I print $wholeFile after the loop, only the last quarter or so of the file appears. The FRONT is somehow lopped off, in the middle of a plain text string. Any idea how this could happen?

Replies are listed 'Best First'.
Re: string gets front truncated
by BrowserUk (Patriarch) on Jul 30, 2008 at 04:31 UTC

    I think the answer is that you have one or more "\r" characters (0x0D, decimal 13) in your file. So the console prints part of the file, returns the (ancient term warning) carriage, i.e. the cursor, to the left margin, and then prints over what it has already printed.

    Try running tr[\r][]d; on it before you print it out.
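    A minimal sketch of that suggestion, assuming the whole file is already held in a scalar. The sub name strip_carriage_returns and the sample data are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text with every "\r" deleted,
# plus the number of carriage returns that were removed.
sub strip_carriage_returns {
    my ($text) = @_;
    my $removed = ( $text =~ tr/\r//d );    # tr returns the count of matched characters
    return ( $text, $removed );
}

# Sample data standing in for the slurped file (not the poster's real input).
my $whole_file = "first part\rsecond part\r\nlast line\n";
my ( $clean, $count ) = strip_carriage_returns($whole_file);

print "removed $count carriage return(s)\n";
print $clean;
```

    If the count is non-zero for the real file, stray carriage returns are a plausible cause of the "missing" front of the output.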


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: string gets front truncated
by MidLifeXis (Monsignor) on Jul 30, 2008 at 00:15 UTC

    You are removing your newline before appending the line to $wholeFile with the chomp command. My guess is that your terminal is responding to the resulting long line.

    Other comments:

    The variable my $FH; is never used.

    use warnings;

    --MidLifeXis

      I get the same result with or without the chomp.
Re: string gets front truncated
by graff (Chancellor) on Jul 30, 2008 at 03:27 UTC
    Try this:
    #!/usr/bin/perl
    use strict;
    use warnings;

    die "Usage: $0 input.file\n" unless ( @ARGV == 1 and -f $ARGV[0] );

    my $input_file = shift;
    my $input_byte_count = -s $input_file;
    my $whole_file;    # declared outside the block so it is still visible below
    {
        local $/;  # set input_record_separator to undef -- switch to "slurp" mode
        open( I, $input_file ) or die "$input_file: $!";
        $whole_file = <I>;
        close I;
    }
    if ( length( $whole_file ) < $input_byte_count ) {
        die "$0: Perl can't read all of $input_file";
    }
    elsif ( length( $whole_file ) > $input_byte_count ) {
        die "$0: Either -s $input_file is lying, or bytes were added during read\n";
    }
    else {
        warn "$0: got exactly $input_byte_count bytes from $input_file\n";
    }

    # now, what was it you need to do with $whole_file? ...
    # to normalize all whitespace to " " (making it just one long line):
    $_ = $whole_file;    # the substitution and print below operate on $_
    s/\s+/ /g;
    print;
    Note that after removing all the line breaks, you might have trouble seeing the whole thing in any sort of shell window. If the input is valid html data, the output should be perfectly viewable in a web browser.
      Thanks! I tried that and got the same problem, though it seemed pre-truncated at a different place.
        Um, if you "tried that", I assume you mean that you ran the code as I posted it. I know that what I posted compiles without errors or warnings; I don't have your data to test it on, but when I save the script as "j.pl" and run it on any file I have like this:
        j.pl some.file
        I consistently get a message like this printed to STDERR:
        j.pl: got exactly nnn bytes from some.file
        where "nnn" turns out to be the actual size of the data file provided as a command-line arg.

        So what sort of message did it print to STDERR when you ran it? If the message was "got exactly nnnnn bytes from your.file", and your.file happens to have nnnnn bytes, then the read was successful, and you are simply having trouble viewing all the data -- that is, the problem is not in the perl script, but instead would be in your display tool, and in how that tool handles this data stream.

        It could perhaps be something in the data file itself that is causing your display tool (terminal window? browser? something else?) to behave in some unexpected way -- e.g. some unexpected control byte is causing it to overwrite or otherwise erase/obliterate part of the data that is being given to it for display.

        Consider looking for other ways to inspect the data so you can see what is going on. Redirect the perl script output to a file, edit that file with some trustworthy editor (emacs, vi, or somesuch), view it with some sort of hex-dump tool, etc.
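        One way to do that kind of inspection from Perl itself, as a rough sketch: list every control character other than tab and newline, with its offset. The helper name find_control_chars and the sample string are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: report each control character (except tab and newline)
# together with its offset, so hidden "\r" or escape bytes become visible.
sub find_control_chars {
    my ($data) = @_;
    my @found;
    while ( $data =~ /([\x00-\x08\x0B-\x1F\x7F])/g ) {
        # pos() is the offset just past the match, so subtract 1
        push @found, sprintf( "offset %d: 0x%02X", pos($data) - 1, ord($1) );
    }
    return @found;
}

# Sample data (an assumption, not the poster's file).
my $sample = "abc\rdef\n\x1Bghi";
print "$_\n" for find_control_chars($sample);
```

        An empty list suggests the display problem lies elsewhere, for example in the console's buffer size.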

        If you ran the script as I posted it, then there should be no line breaks in the output -- just spaces. For fun, you could try changing that last line before the print statement; instead of this:

        s/\s+/ /g;
        do this:
        s/\s+/\n/g;
        to put each non-space token on a separate line. If you have something like the gnu "less" (unix "more") for paging through a long file in a terminal window, that should convince you that the perl script is not losing any of the data.
Re: string gets front truncated
by harishnuti (Beadle) on Jul 30, 2008 at 01:49 UTC

    I don't know what you are trying to achieve, but I had a similar requirement: for an XML file, strip all the XML tags, bring the data onto one line, and calculate a CRC32 checksum on it. This was a requirement from one of our clients.

    Why don't you try the code below?
    # call as: ConvertFindLaw.pl PilotLifeFindLaw.htm
    use 5.010;
    use Switch;
    use strict;
    my $wholeFile;
    my $INPUT_FILE = shift;
    my $FH;
    open(FH, $INPUT_FILE) or die "Bad $INPUT_FILE: $!"; ;
    while (<FH>)
    {   # print $line;          # All lines printed
        chomp $_;               # This makes a difference, if you want it in one line
        $wholeFile .= qq~$_~;   # Just use quoted qq
    }   # Build a long string
    $wholeFile .= qq~\n~;   # This is needed: when you say you want it in one line,
                            # there should be one newline character after the appending
                            # in the while loop; now when you do wc -l on this one,
                            # it will show one line
    close(FH);
    print $wholeFile;       # front truncated here

    * Remember that if you put the resulting string in a file and open it to look at it, some editors limit the number of characters they can display on a single line. For example, on AIX5 my vi editor cannot display more than 2048 characters in a single line.

    As some of the monks have said, unless you specify what you are trying to achieve, it's difficult to suggest a solution.
      > I guess some editors have a limitation in terms of the number of characters they can display in a single line, <
      I'm not reading it in an editor - I'm just printing it out in the DOS box.
      And again, whatever I'm going to use it for, shouldn't this be allowed in perl?
      BTW, it doesn't seem to be caused by a memory shortage - I killed about 6 programs that were running simultaneously, and the front-truncation happened in exactly the same place.
        And again, whatever I'm going to use it for, shouldn't this be allowed in perl?
        Yes, it should and is. I think everyone responding is in no doubt that the problem is something other than the front of the string magically disappearing.

        Try putting this in place of the print (in the chomp-less version), but before the close(FH), because close resets $.:

        printf "total of %d bytes read (%d including carriage returns)\n", length($wholeFile), $. + length($wholeFile);
        and comparing that to the length of the input file as reported by the dir command?
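        A variant of this check, as a sketch only: reading in binary mode avoids any newline translation, so the length of the string can be compared directly with the size on disk. The helper name read_and_compare and the temporary stand-in file are assumptions, since the real input is not available here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file in binary mode and return both the
# number of bytes read and the size reported by -s.
sub read_and_compare {
    my ($path) = @_;
    my $size_on_disk = -s $path;
    open( my $fh, '<', $path ) or die "Bad $path: $!";
    binmode $fh;    # no CRLF translation, so bytes are counted as stored
    my $whole_file = do { local $/; <$fh> };
    close $fh;
    return ( length($whole_file), $size_on_disk );
}

# Stand-in input written to a temporary file.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "line one\r\nline two\r\n";
close $tmp_fh;

my ( $read, $on_disk ) = read_and_compare($tmp_name);
printf "read %d bytes, file size is %d bytes\n", $read, $on_disk;
```

        If the two numbers agree for the real file, nothing was lost on the way in, and the truncation is happening at display time.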
        I'm just printing it out in the DOS box.
        I've seen console windows omit bits of the output when flooded with data. Have you tried doing ConvertFindLaw.pl PilotLifeFindLaw.htm >tempfile and looking to see what's in tempfile?
        In agreement with ysth, the problem is most likely with your DOS window.

        A DOS window has a limit on the size of the buffer it can display. Try increasing it, though even the maximum value might not be enough. So either use Windows PowerShell or print the string to a file.
Re: string gets front truncated
by toolic (Bishop) on Jul 30, 2008 at 00:58 UTC
    What is your ultimate goal? What will you do with this very big string? This seems like an XY Problem. If you can describe your goal, you will receive more accurate help. For example, if you are planning on parsing the HTML file, there are numerous modules on CPAN to help you out.

    Also, the perlfaq5 entry "How can I read in an entire file all at once?" discusses some issues associated with slurping in big files.

      Regardless of my goal, since perl is supposed to have no intrinsic limit on string length, shouldn't I be able to read a long string without it getting truncated? And, isn't it odd that the FRONT is truncated? If a buffer was getting full, I'd expect that the late-arriving data would be dumped. And, if it was starting over to fill it up again from the front, how come it ends at the right place?
        You are assuming that 1) it's getting truncated, 2) perl is truncating it. You can't verify the output by what you see in your console. Example:
        C:\>perl -e "print qq,\r$_, for 1 .. 3"
        3
        C:\>perl -e "print qq,\r$_, for 1 .. 3" |hexdump
        00000000: 0D 31 0D 32 0D 33 -                             | 1 2 3|
        00000006;
      Why does his goal have anything to do with the problem? He is not asking for an alternative; he is asking why his code does not work. You either have the answer or you don't. Don't waste your time and his as well.
Re: string gets front truncated
by GrandFather (Saint) on Jul 30, 2008 at 00:45 UTC

    How much is "very long string" and is 1/4 of that a "magic" number like 2^n or 2^n-1? Could you be overflowing an OS/shell buffer?


    Perl reduces RSI - it saves typing
Re: string gets front truncated
by Krambambuli (Curate) on Jul 30, 2008 at 07:42 UTC
    Is it possible that the input file contains characters in some unusual encoding scheme?
    UTF-8, UTF-16, ...?

    What happens if you halve your problematic input file?

    Does each half print out OK, or does the error 'move' into one of them?


    Krambambuli

Re: string gets front truncated
by ysth (Canon) on Jul 30, 2008 at 03:30 UTC
      The original file was 55kb. I cleaned some stuff out manually, and got it down to 38kb, and had the same problem.
Re: string gets front truncated
by blazar (Canon) on Jul 30, 2008 at 23:01 UTC
    use 5.010; use Switch;

    I personally believe that however irrelevant to your actual problem it may be, it's very awkward to use Switch along with 5.10 features, given that the latter provide in particular a true switch construct called nothing less than... given! (And it has always been a bad idea to resort to Switch.pm especially in "production code" since it is based on a source filter instead.)

    --
    If you can't understand the incipit, then please check the IPB Campaign.
Re: string gets front truncated
by goibhniu (Hermit) on Jul 31, 2008 at 14:32 UTC

    Have you tried slurp mode? perlvar has this example:

    open my $fh, "foo" or die $!;
    local $/; # enable localized slurp mode
    my $content = <$fh>;
    close $fh;

    $/ is the input line separator. By having a local $/ undefined, the single line read, my $content = <$fh>; reads the whole file. If you're not doing anything to each line before putting it into your $wholeFile variable, then you may as well not think of the file as a set of lines and just read the whole file at once. You can see here how what you're trying to do might affect what kind of advice you might get.
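    The same idiom can be wrapped in a small sub so that $/ is undefined only for the one read. This is a sketch under assumptions: the sub name slurp_file and the temporary sample file are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper wrapping the slurp idiom: $/ is undefined only inside
# the do-block, so the rest of the program still reads line by line.
sub slurp_file {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Bad $path: $!";
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}

# Small stand-in HTML file, written to a temporary location.
my ( $out, $name ) = tempfile( UNLINK => 1 );
print {$out} "<html>\n<body>hello</body>\n</html>\n";
close $out;

my $whole_file = slurp_file($name);
print length($whole_file), " characters read\n";
```

    Keeping the local inside a do-block means a later line-by-line read elsewhere in the script is unaffected.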

    If you still think that you're missing the front of your file, then BrowserUk's explanation is probably the kind of thing to look for, and you should prove it to yourself with graff's or ysth's approaches of checking the length of your data or dumping to an output file.

    update: I see where toolic pointed you to the slurp-mode solution in perlfaq5.


    #my sig used to say 'I humbly seek wisdom. '. Now it says:
    use strict;
    use warnings;
    I humbly seek wisdom.
Re: string gets front truncated
by Anonymous Monk on Jul 30, 2008 at 20:12 UTC
    # call as: ConvertFindLaw.pl PilotLifeFindLaw.htm
    try
    ConvertFindLaw.pl PilotLifeFindLaw.htm > output.txt
    and then open output.txt in a text editor. VOILA!

Node Type: perlquestion [id://700949]
Approved by toolic
Front-paged by Corion