PerlMonks  

Is there a module for object-oriented substring handling/substitution?

by smls (Friar)
on Jan 24, 2013 at 22:46 UTC (#1015241=perlquestion)
smls has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!

I'm thinking of writing a Perl class for personal use, but would like to check first whether a similar module already exists, in which case I could just use that. I failed to find anything using CPAN search or Google (I can't really come up with any good search terms that express the concept in a few words), so I'm seeking your help.

The idea is to have substrings that are linked to the original text they are a substring of, and keep track of their exact start and end position within it. When the substring is modified, the change automatically propagates to the original text. The substrings could themselves have linked substrings extracted from them, and so on.

Here's a simple example of how it could be used (in this case, to replace {{...}} template markers in the third subsection of a text document with appropriate replacement text):

    my $document = My::String->new( read_file('doc.txt') );
    my @sections = $document->split( qr/^------\n/m );
    my @markers  = $sections[2]->match( qr/\{\{\w+\}\}/ );

    foreach my $marker (@markers) {   # $marker will be a My::String object
        my $old = $marker->text();
        my $new = '';
        # ...
        # Complex code to generate replacement text, depending on the
        # exact contents of $old, goes here
        # ...
        $marker->setText( $new );   # change propagates to $document
    }

    write_file( 'doc.txt', $document->text() );

There may be a better way to design the API, but I hope you get the general idea.
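To make the idea concrete, here is a minimal sketch of one way such a class could work. It is not an existing module: the internals are assumptions, and only `new`, `match`, `text` and `setText` from the example above are covered (`split` is left out). Every range stores an absolute offset and length into one shared root buffer, and `setText` shifts or stretches the other registered ranges. Ranges that only partially overlap an edit are not handled.

```perl
package My::String;
use strict;
use warnings;

# Root constructor: owns the text and a registry of every range cut from it.
sub new {
    my ($class, $text) = @_;
    my $self = bless { start => 0, len => length $text }, $class;
    $self->{root} = { text => $text, ranges => [$self] };
    return $self;
}

# Current text of this range, read from the shared root buffer.
sub text {
    my ($self) = @_;
    return substr $self->{root}{text}, $self->{start}, $self->{len};
}

# Internal helper: register a new range at an offset relative to this one.
sub _sub {
    my ($self, $rel_start, $len) = @_;
    my $child = bless {
        root  => $self->{root},
        start => $self->{start} + $rel_start,
        len   => $len,
    }, ref $self;
    push @{ $self->{root}{ranges} }, $child;
    return $child;
}

# Return one linked range per regex match inside this range.
sub match {
    my ($self, $re) = @_;
    my $text = $self->text;
    my @found;
    while ($text =~ /$re/g) {
        push @found, $self->_sub($-[0], $+[0] - $-[0]);
    }
    return @found;
}

# Replace this range's text, then shift or stretch the other ranges.
sub setText {
    my ($self, $new) = @_;
    my ($start, $old_len) = @{$self}{qw(start len)};
    my $end   = $start + $old_len;
    my $delta = length($new) - $old_len;
    substr $self->{root}{text}, $start, $old_len, $new;
    for my $r (@{ $self->{root}{ranges} }) {
        my $r_end = $r->{start} + $r->{len};
        if ($r == $self or ($r->{start} <= $start and $r_end >= $end)) {
            $r->{len} += $delta;      # range encloses the edit: stretch it
        }
        elsif ($r->{start} >= $end) {
            $r->{start} += $delta;    # range lies after the edit: shift it
        }
    }
    return $self;
}

package main;

my %value = ('{{name}}' => 'Alice', '{{age}}' => '42');   # made-up data
my $doc   = My::String->new("a {{x}} b\n------\nc {{name}} d {{age}}\n");

# Everything after the separator line, as a linked range.
my ($second) = $doc->match(qr/(?<=------\n).*/s);

for my $marker ($second->match(qr/\{\{\w+\}\}/)) {
    $marker->setText($value{ $marker->text });   # lengths change here
}

print $doc->text;
```

The root and its ranges reference each other, so a long-running program would want Scalar::Util's `weaken` on one side; for a short script that reads, edits and writes one file it should not matter.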

Now, I fully realize that the above example could be implemented using s/.../.../e regex substitution (i.e. with Perl code as the replacement pattern). In fact that's how I've done this kind of thing in the past. But it would be messy, even with this simple example.
For more complex examples of dynamic text substitution, with more layers of substrings-within-substrings that need special-case handling, using s/.../.../e regexes really doesn't scale too well in terms of code tidiness and maintainability, and an object-oriented framework of linked substrings like I laid out above might prove to be a better solution.

So, I'm looking for your input...
Does such a module already exist? What might it be called? What do you think of the general idea?
Thanks!

Re: Is there a module for object-oriented substring handling/substitution?
by choroba (Abbot) on Jan 24, 2013 at 23:21 UTC
    Do you know that a reference to substr works as you described?
        my $string = 'abcdefghijklmnopqrstuvwxyz';
        my $substring = \substr $string, 10, 10;
        $$substring = uc $$substring;
        print $string, "\n";
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Do you know that a reference to substr works as you described?

      Only if you use a single substring. I have a partially-written module that does this type of thing and supports multiple simultaneous substrings of the same string such that they cooperate (which leads to some tricky bits which were fun to hash out).

      But tye having an unpublished module isn't really a reason to avoid writing one's own version of something similar.

      - tye        

        They long since lifted the 'only one lvalue ref' limitation. This is 5.10.1:

            $s = 'abcdefghijklmnopqrstuvwxyz';;
            @r = map \substr( $s, $_*4, 2), 0..6;;
            $$_ = uc $$_ for @r;;
            say $s;;
            ABcdEFghIJklMNopQRstUVwxYZ
            say $];;
            5.010001

        It does have its limitations though. (Unsurprisingly) The lvalue refs do not adjust to accommodate replacements that alter the length of the string:

            $$_ = $_ for @r;;
            say $s;;
            REF(REF(REF(REF(REF(REF(REF(0x3e82050)3e821e8)3e82260)3e820e0)11c458)3e820c8)335dc0)cdEFghIJklMNopQRstUVwxYZ
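One common workaround for length-changing replacements (not something proposed in this thread, just a well-known idiom) is to record plain offsets instead of lvalue refs and apply the edits from right to left with 4-argument substr. A minimal sketch, reusing the same string and spans as above; the angle-bracket decoration is only there to force a length change:

```perl
use strict;
use warnings;

my $s = 'abcdefghijklmnopqrstuvwxyz';

# Record [offset, length] pairs instead of taking lvalue references.
my @spans = map { [ $_ * 4, 2 ] } 0 .. 6;

# Walk the spans from right to left: an edit only moves text to its right,
# so the offsets of the spans still waiting to be processed stay valid.
for my $span (reverse @spans) {
    my ($offset, $length) = @$span;
    my $old = substr $s, $offset, $length;
    substr $s, $offset, $length, "<\U$old\E>";   # 4-arg substr replaces in place
}

print "$s\n";   # <AB>cd<EF>gh<IJ>kl<MN>op<QR>st<UV>wx<YZ>
```

This only helps when all spans are known up front and do not overlap; it does not give you persistent handles that survive later edits.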

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Is there a module for object-oriented substring handling/substitution?
by roboticus (Canon) on Jan 25, 2013 at 02:36 UTC

    smls:

    Sorry, I don't know of a module that does that. Just for my own curiosity....what would you use something like that for? I can't think of a reason I would want something like that, so I can't suggest any possible searches to help.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

        Ah, thanks for that link! It led to a nice hour or so of reading & thinking.

        ...roboticus


        Anonymous Monk:

        I don't care much for speed, I care about convenience and elegance.

        Regarding the use-case of creating editors, they might prefer to use their own special-purpose class for performance reasons, with integrated support for efficient feedback on state changes of tracked ranges. For example, the Kate editor (coded in C++ with Qt and KDE libs) uses a light-weight class called MovingRange for keeping track of persistent ranges within an opened document. They created this solution from scratch in 2010, dropping their previously used, more generic framework called "SmartRanges", in part due to performance reasons (see blog post).
        Now, if this is a performance-critical code path that benefits from special-case optimization even in an editor written in C++, it probably will be even more so in an editor written in Perl...

        For my purposes, notification about state changes is not needed, nor is performance a critical consideration, so having a full substring class that stores a copy of its text (rather than just a thin "range" class pointing to a location within the parent string) should not be a problem.

      roboticus:

      Well, there have been several times in the past I would have found a module like this useful. My current use-case, which led to me writing this thread, is updating table values in a wiki page by programmatically editing the page's source code (which is available in MediaWiki format).

      More precisely, the problem at hand is like this:

      Within a wiki page, there is a special section (identified by its section header). This section in turn can have an arbitrary number of subsections (each with a unique subsection header). Each of these subsections contains, among other things, a special table.
      The Perl script is supposed to update the values in a specific column in each of these tables (identified by the word in the column's header cell). Which value goes into a particular table cell in that column depends on the corresponding value in the first column (i.e. the ID column), as well as the title of the subsection that the table belongs to.

      Now, the Perl script should play nice with human editing of the same wiki page. Humans will fill in the remaining columns of the aforementioned tables, as well as the rest of the wiki page, and may freely add formatting, move things (like table rows and columns) around, etc.
      The Perl script must not touch *anything* on that wiki page except for the specific values it substitutes for new values. This also means no whitespace or formatting changes, so using a generic wiki text parser and dumper is out of the question.

      Last but not least, the solution should be elegant and easy to maintain and expand. For example if the wiki page is radically re-factored so that the script breaks, I want to be able to fix the script easily (even if I haven't looked at its Perl source code for months), i.e. without having to write complex five-line regexes from scratch. And in the future I might want to add support for automatically adding new table rows if expected values in the ID column were not found in one of the tables, and things like that - so the design should be flexible enough to account for that.

      In the absence of a module like I described in the OP, I would be using s/.../CODE/e blocks for this, but as I hinted in the OP, this might not provide the desired maintainability and elegance.

        smls:

        Ok, now I understand what you're asking for. I had a slightly different model in mind.

        So you're looking for the ability to do something like:

            # X is regex stuff to detect start of "interesting region", Y detects end
            if ($clob =~ /(.*)(X.*Y)(.*)/) {
                my ($stuff_before, $stuff_to_edit, $stuff_after) = ($1, $2, $3);
                $stuff_to_edit =~ s/foo/bar/g;
                $clob = $stuff_before . $stuff_to_edit . $stuff_after;
            }

        But without all the gymnastics of dismantling and rebuilding the string. I can see where that would be pretty nice since a large $clob would force you to double the storage space and the associated string manipulations.
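For a single region, core Perl can already avoid the dismantle-and-rebuild step: substr returns an lvalue, so a substitution bound to it edits the original string directly. A small sketch, where the literal BEGIN and END markers are stand-ins for the X and Y patterns and the sample text is made up:

```perl
use strict;
use warnings;

my $clob = "foo head\nBEGIN foo and foo END\nfoo tail\n";

# BEGIN and END are placeholders for the X and Y patterns in the post.
if ($clob =~ /BEGIN.*?END/s) {
    # @- and @+ hold the start and end offsets of the last match.
    my ($start, $length) = ($-[0], $+[0] - $-[0]);

    # substr() is an lvalue, so s/// bound to it edits $clob directly,
    # and only inside the matched region.
    substr($clob, $start, $length) =~ s/foo/bar/g;
}

print $clob;
```

The "foo" before and after the region is left alone. This covers one region at a time; it does not give the nested, persistent substrings asked for in the OP.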

        ...roboticus


        OK, now with the knowledge of your use case, I'd rather recommend working with a document tree representing your wiki page as a hash of hashes.

        Much like a DOM-tree, you could traverse it for whatever markup-element ("table") you want.

        Parse the wiki-page into a tree, manipulate the tree and rebuild the page again.

        Otherwise:

        If you insist on attaching persistent meta-information to ranges of characters, then you had better work with arrays of characters. You could tie or bless the scalar elements with whatever info you want. If your user inserts or deletes anything from the array, your meta-information will move accordingly.

        And if you want to go the full "emacs way", you need to implement linked lists. The easiest way is to use two-element arrays: [$value, $successor_ref]
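A rough sketch of that two-element-array linked list, one cell per character. The helper names (`list_from_string`, `insert_after`, `list_to_string`) are made up for illustration. The point is that a reference to a cell stays valid no matter what gets inserted elsewhere in the list:

```perl
use strict;
use warnings;

# Each cell is [ $value, $successor_ref ]; the last cell's successor is undef.
sub list_from_string {
    my ($text) = @_;
    my $head;
    for my $char (reverse split //, $text) {
        $head = [ $char, $head ];
    }
    return $head;
}

# Splice new characters in after $cell without touching any other cell.
sub insert_after {
    my ($cell, $text) = @_;
    for my $char (reverse split //, $text) {
        $cell->[1] = [ $char, $cell->[1] ];
    }
}

# Walk the chain from $cell to the end and collect the values.
sub list_to_string {
    my ($cell) = @_;
    my $out = '';
    for (; $cell; $cell = $cell->[1]) {
        $out .= $cell->[0];
    }
    return $out;
}

my $list   = list_from_string('abd');
my $marked = $list->[1];        # keep a handle on the cell holding 'b'

insert_after($list, '12');      # insert earlier in the list ...
insert_after($marked, 'c');     # ... the handle still points at 'b'

print list_to_string($list), "\n";   # a12bcd
```

Meta-information could be stored as further array elements in each cell, at the cost of one small array per character.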

        EDIT:

        After some meditation, IMHO if you need full interactivity, better stay with the AoH holding the document tree, and a "cursor" pointing to the current element. Whenever the user inserts characters, update the tree at the point the cursor points to.

        You'll also need to store information like "parent", "child", "nextSibling" ...

        Have a look at the DOM or XML modules on CPAN for inspiration.

        Cheers Rolf

Re: Is there a module for object-oriented substring handling/substitution?
by thundergnat (Deacon) on Jan 25, 2013 at 17:14 UTC

    I am not aware of a module to do anything like that specifically, and doubt that one would be more speedy/useful in a general case than just doing regex substitutions directly. Strings in Perl are not just byte arrays like they are in many other languages, so you will likely not gain much by trying to treat them like one. That being said, I can think of reasons when you might want to do something like that anyway.

    Many years ago I wrote a special purpose text editor in perl/Tk to help produce texts for Project Gutenberg. The perl/Tk text widget provides some basic search and replace functionality but has many limitations, so I wrote something similar to what you are asking as a workaround. It was not OO, and was pretty heavily tied to perl/Tk::Text, so it probably isn't useful as drop-in code, but you might be able to glean some useful bits if you end up writing something yourself.

    Be aware that I started writing this in the heady days when perl 5.6 reigned supreme and I was a perl newbie, so many of the design decisions were questionable by modern standards. Also, I have not been associated with it since 2005 or so, so the codebase may have moved on. If you are still curious after all those caveats, the code is still available on sourceforge.

Re: Is there a module for object-oriented substring handling/substitution?
by BrowserUk (Pope) on Jan 26, 2013 at 23:32 UTC
    supported by linked substrings.

    Seems to me you are looking for a complex solution when a simple one will do.

    In order to mark your substrings, you have to know where they start and end|length.

    If you simply unpack the string into an array on those same boundaries, you can edit the individual elements to your heart's content, and when you pack/join them back together, you will have exactly the same effect as your linked substrings without the overhead of all the behind-the-scenes jiggery-pokery required to make the latter work.

    It's simple. And I like simple :)
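As a sketch of that approach, assuming split with a capturing separator as the way to cut the string up (the sample text and the `------` separator are borrowed from the OP's example, the replacement text is made up):

```perl
use strict;
use warnings;

my $text = "intro\n------\nfirst {{a}}\n------\nsecond {{b}}\n";

# Capturing parentheses make split() return the separators too, so joining
# the pieces back together reproduces every untouched character exactly.
my @pieces = split /(^------\n)/m, $text;

# Pieces alternate: section, separator, section, ...
# so index 4 is the third section.
$pieces[4] =~ s/\{\{b\}\}/replacement of a different length/;

my $result = join '', @pieces;
print $result;
```

Length changes inside one element cannot disturb the others, which is the property the linked substrings were meant to provide. Nested levels would need the same split/edit/join dance repeated on an element.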


Re: Is there a module for object-oriented substring handling/substitution?
by Anonymous Monk on Jan 28, 2013 at 09:31 UTC
    https://www.mediawiki.org/wiki/VisualEditor/WikiDom_Specification
    WikiDom is a serialization of Wikitext based on JSON and optimized for transport and adaptive processing.

    The structure is based on two basic types of nodes, branches and leafs.

    Branch nodes have child nodes and leaf nodes have content. A node can not be a branch and a leaf.

    Content objects in leaf nodes use offset annotations for formatting.

Node Type: perlquestion [id://1015241]
Approved by johngg
Front-paged by Arunbear