PerlMonks  

Processing an encoded file backwards

by LanX (Archbishop)
on Jan 18, 2020 at 18:47 UTC (#11111579=perlquestion)

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Let's say I want a sliding window to search a file from end to start.

I could do this with seek and read in a loop.

Now, seek operates on byte boundaries, but read depends on the encoding layer.

What's the best way, then, to read an encoded file, e.g. one in UTF-8, backwards?

Is read fail-proof when accidentally starting inside a wide character after a seek?

Or is it better to open with :raw, search for the next character (or line) boundary manually, and then decode with Encode?

I'm aware of File::ReadBackwards , but want to understand the mechanisms better and operate on windows and not lines.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

updates
)
DB<1> open $fh,'<','input2'
DB<2> read $fh => $in, 100
DB<3> p tell $fh
114                               # surprise
DB<4> open $fh,'<:raw','input2'
DB<5> p tell $fh
0
DB<6> read $fh => $in, 100
DB<7> p tell $fh
100
DB<8>

) I've been asked what I mean by "sliding window"; please see this sliding window description. There I start from the beginning, but it's often preferable to start from the end. (choroba++ for pm'ing me)

Replies are listed 'Best First'.
Re: Processing an encoded file backwards
by haukex (Chancellor) on Jan 18, 2020 at 20:23 UTC

    I've actually thought about this myself. seek, tell, sysseek, and sysread all operate on bytes, while read operates on bytes or characters depending on the I/O layers. So because we can only seek in bytes, I think the only way to approach it is to first read a chunk of bytes from the end of the file, and then look at what was read to determine whether a UTF-8 encoded character was chopped off. Specifically, check whether the block of data begins with bytes whose two high bits are 10 (i.e. 10xxxxxx), since those are UTF-8 continuation bytes. Strip those bytes from the front of the buffer and leave them for the next read, whose chunk ends where this one now begins. You should then have a buffer that can be correctly decoded as UTF-8 and that you can inspect for how many characters it contains, how many lines, etc., depending on what you actually want your window to be counted in. So I took this opportunity to finally express my idea in code :-)

    sub readbackwards_utf8 {  # returns an iterator
        my ($fn, $window) = @_;
        die "Bad window $window" unless $window>=4;
        open my $fh, '<:raw', $fn or die "open $fn: $!";
        my $curpos = -s $fh;
        return sub {
            if ( $curpos<1 ) { close $fh if $fh; $fh=undef; return }
            my $bytes = $curpos < $window ? $curpos : $window;
            seek($fh, $curpos-=$bytes, 0) or die "seek $curpos $fn: $!";
            read($fh, my $buf, $bytes) == $bytes
                or die "read $bytes bytes at $curpos from $fn: $!";
            while ( (ord(substr $buf, 0, 1) & 0b11000000)==0b10000000 )
                { $buf = substr $buf, 1; $curpos++ }
            utf8::decode($buf);
            return $buf;
        }
    }

    It would be pretty easy to wrap the iterator which the above code returns into another iterator that counts characters and lines, and returns chunks of that size. Of course, this is specific to UTF-8. For encodings with a fixed width, like UTF-32, it would be somewhat easier. UTF-16 is close to that, except that surrogate pairs still require a boundary check.
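The wrapper idea could look roughly like the sketch below. The name rechunk_backwards is hypothetical and not from the thread. It assumes the inner iterator yields already decoded chunks in back-to-front order and returns undef when exhausted, as readbackwards_utf8 does, and it regroups them into windows of a fixed number of characters.

```perl
use strict;
use warnings;

# Hypothetical helper: regroup back-to-front chunks into windows of
# exactly $nchars characters (the last window returned may be shorter).
sub rechunk_backwards {
    my ($iter, $nchars) = @_;
    die "Bad window size $nchars" unless $nchars >= 1;
    my $pending = '';   # text read so far but not yet handed out
    my $done    = 0;    # true once the inner iterator is exhausted
    return sub {
        # Pull earlier chunks until a full window is available.
        while ( !$done && length($pending) < $nchars ) {
            my $chunk = $iter->();
            if ( defined $chunk ) { $pending = $chunk . $pending }
            else                  { $done = 1 }
        }
        return undef unless length $pending;
        my $take = length($pending) < $nchars ? length($pending) : $nchars;
        # Cut the window off the END of the pending text.
        return substr($pending, -$take, $take, '');
    };
}

# Usage with a stand-in iterator over chunks of "abcdefgh", last chunk first.
my @demo_chunks = ("fgh", "cde", "ab");
my $windows     = rechunk_backwards(sub { shift @demo_chunks }, 4);
my @demo_got;
while ( defined( my $w = $windows->() ) ) { push @demo_got, $w }
print join('|', @demo_got), "\n";
```

Because chunks arrive back to front, each new chunk is prepended to the pending text and windows are cut from its end. Counting lines instead of characters would need a different cut rule.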

      Sure, this is the basic approach for UTF-8.

      I was hoping for a more elegant and generic solution using Encode.

      As you can see in my demo in the other answer, Encode uses "\x{FFFD}" to decode the broken character.

      If it does so reliably, this could lead to better code.

      Not sure what other multi-byte encodings are out there...

      Cheers Rolf

      ) It is. From the Encode documentation: "If CHECK is 0, encoding and decoding replace any malformed character with a substitution character. When you encode, SUBCHAR is used. When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is used. If the data is supposed to be UTF-8, an optional lexical warning of warning category "utf8" is given."

      update

      The Flag FB_QUIET seems to be the answer

        As you can see in my demo in the other answer, Encode uses "\x{FFFD}" to decode the broken character. If it does so reliably, this could lead to better code.

        Well, to be purist about it (emphasis mine):

        If CHECK is 0, encoding and decoding replace any malformed character with a substitution character.

        So it doesn't allow you to differentiate between a character that was broken by the read, and an actually malformed input file.
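One way to keep the two cases apart, sketched here under assumptions rather than taken from the thread: first strip the leading continuation bytes that the seek may have cut off, then decode strictly with Encode's FB_CROAK check so that anything still malformed is reported instead of silently becoming U+FFFD. The helper names trim_leading_continuation and decode_strict are hypothetical. The sketch also assumes the end of the chunk falls on a character boundary.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: count and strip leading UTF-8 continuation
# bytes (10xxxxxx). Returns (number stripped, remaining bytes).
sub trim_leading_continuation {
    my ($bytes) = @_;
    my $n = 0;
    $n++ while $n < length($bytes)
        && ( ord( substr($bytes, $n, 1) ) & 0xC0 ) == 0x80;
    return ( $n, substr($bytes, $n) );
}

# Hypothetical helper: strict decode. Returns the decoded string, or
# undef if the bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    return eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
}

# A chunk that starts in the middle of a two-byte character.
my ( $cut, $rest ) = trim_leading_continuation("\x84\xC3\x96\xC3\x9C.\r\n\r\n");
my $ok_chars = decode_strict($rest);

# A chunk containing a byte (0xFF) that is never valid in UTF-8.
my $bad_chars = decode_strict("ab\xFFcd");

printf "stripped %d byte(s), decoded %d char(s), malformed input %s\n",
    $cut, length($ok_chars), defined $bad_chars ? "accepted" : "rejected";
```

With this split, bytes stripped at the front are attributed to the read position, while a failure in decode_strict points at the file itself.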

        Update:

        Not sure what other multi-byte encodings are out there...

        Me neither, but I think UTF-8 and UTF-16 would already cover a lot of what's out there today.

        As you can see in my demo in the other answer

        I don't use the debugger often, so reading its output doesn't come naturally to me ;-)

Re: Processing an encoded file backwards
by LanX (Archbishop) on Jan 18, 2020 at 20:30 UTC
    OK, for better illustration, here is a demo in the debugger.

    File "encode" in UTF-8:

    äöü.
    ÄÖÜ.


    Demo, with some modules preloaded.

    DB<62> open $fr,"<:raw","encode"
    DB<63> p -s $fr                         # 20 bytes
    20
    DB<64> @a=<$fr>                         # slurp
    DB<65> dd \@a                           # Data::Dump::dd shows bytes correctly
    [
      "\xC3\xA4\xC3\xB6\xC3\xBC.\r\n",      # "ä" = UTF8: \xC3\xA4 = codepoint U+00E4 etc
      "\xC3\x84\xC3\x96\xC3\x9C.\r\n",
      "\r\n",
    ]
    DB<66> seek $fr,10,0                    # put readpointer to middle
    DB<67> p tell $fr                       # ok pos = 10
    10
    DB<68> read $fr,$rr,10                  # read last 10 bytes into $rr
    DB<69> dd $rr                           # ouch, first byte is missing utf-8 boundary
    "\x84\xC3\x96\xC3\x9C.\r\n\r\n"
    DB<70> $ru=Encode::decode('utf8',$rr)   # lets decode to internal string
    DB<71> Dump $ru                         # Devel::Peek : utf8-flag is set, first byte translated to \x{fffd}
    SV = PVMG(0x36d3a28) at 0x36d56b8
      REFCNT = 1
      FLAGS = (SMG,POK,IsCOW,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0x36195e8 "\357\277\275\303\226\303\234.\r\n\r\n"\0 [UTF8 "\x{fffd}\x{d6}\x{dc}.\r\n\r\n"]
      CUR = 12
      LEN = 16
      COW_REFCNT = 0
      MAGIC = 0x3630f58
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = -1
    DB<72> dd $ru                           # Data::Dump agrees
    "\x{FFFD}\xD6\xDC.\r\n\r\n"
    DB<73> p length $ru                     # 8 chars = "*ÖÜ.\r\n\r\n" with * for fail
    8
    DB<74> p $ru                            # can't be printed without warning
    Wide character in print at (eval 84)[C:/Perl_524/lib/perl5db.pl:737] line 2, <$fr> line 8.
    ... yadda traceback
    ┐├├.                                    # OK cmd.exe can't handle unicode
    DB<75> @au = split//,$ru
    DB<76> p $au[0]                         # yeah first character causing trouble
    Wide character in print at (eval 86)[C:/Perl_524/lib/perl5db.pl:737] line 2, <$fr> line 8.
    ... yadda traceback
    ┐
    DB<77> p $au[1]
    DB<78> dd $au[1]                        # yep D6 is the codepoint for "Ö" in unicode
    "\xD6"

    Cheers Rolf
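For readers without the debugger or the original file, the session above can be approximated by the standalone script below. This is a sketch under assumptions: the 20 bytes are taken from the dd output, the temporary file stands in for "encode", and the strict 'UTF-8' encoding name is used where the demo used 'utf8'.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Recreate the 20-byte file from the dump (temporary stand-in for "encode").
my $tmp = File::Temp->new;    # removed automatically at exit
binmode $tmp, ':raw';
print {$tmp} "\xC3\xA4\xC3\xB6\xC3\xBC.\r\n\xC3\x84\xC3\x96\xC3\x9C.\r\n\r\n";
close $tmp or die "close: $!";

# Seek to byte 10, which lies inside the two-byte sequence for "Ä".
open my $fh, '<:raw', $tmp->filename or die "open: $!";
seek( $fh, 10, 0 ) or die "seek: $!";
read( $fh, my $raw, 10 ) == 10 or die "short read";
close $fh;

# With the default CHECK of 0 the stray continuation byte is replaced
# by U+FFFD instead of raising an error.
my $chars = Encode::decode( 'UTF-8', $raw );

printf "%d bytes read, %d characters decoded, first is U+%04X\n",
    length($raw), length($chars), ord($chars);
```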
