PerlMonks  

Using a git filter to Cache HTML Resources Locally

by haukex (Canon)
on Oct 07, 2018 at 18:50 UTC (#1223642, CUFP)

So I've been doing quite a bit of web development recently, and several of my HTML files use resources from CDNs, like jQuery or normalize.css. While I'm developing, I refresh pages quite often, and also usually use my browser's development tools to disable caching. This means that I hit the CDNs quite often, and aside from the speed and bandwidth usage, one of the CDNs actually started rate limiting me... oops. In other projects, I'd usually just pull those resources onto my local server, keep them there, and be done with it. But the stuff I'm currently working on is for distribution, so I'd like to keep the CDN URLs in the HTML, and not have to rewrite them by hand.

Enter git filters: Documented in the Git Book Chapter 8.2 and in gitattributes, they provide a way to pipe source code files through an external program on checkout, as well as on diff, checkin, and so on. This allows you to have one version of a file checked into the repository, but to use a filter to make changes to the files that actually end up in the working copy on your machine. These changes are also reversed by the filter and not checked back into the repository on checkin, and don't show up in any commands like git diff, git status, etc.

So in this case, the files I want to have in the repository will have lines that look something like this:

<link rel="stylesheet" href="https://example.com/example.css" />
<script src="https://example.com/example.js"></script>

but when I check these files out into my local working copy, they should get rewritten into something like this:

<link rel="stylesheet" href="_cache/example.css" />
<script src="_cache/example.js"></script>

Of course, Perl to the rescue!

There are two git filters: "smudge", which is applied when files are checked out, and "clean", which is applied when files are staged / checked in, when the working copy is compared against the repository, etc. Each filter script takes its input on STDIN and writes the filtered output to STDOUT. The filters are set up by setting git config filter.filtername.smudge and git config filter.filtername.clean to the commands to be executed. Since the filter script tends to be local to your system and there might be arguments specific to the repository, it's probably best to use git config --local so these are stored on a per-repository basis in .git/config. Then, you set up a .gitattributes file containing a line of the form "pattern  filter=filtername", e.g. "*.html  filter=myfilter". (There are also more advanced ways to implement and configure filters, which are described in the docs I linked to above.)

So now, I can implement my filter with a regular Perl while (<>) { print; } loop, applying whatever transformations I like. My "smudge" filter is the one that does the heavy lifting: it looks for HTML tags like the above that are prefixed with <!--cacheable-->, fetches those URLs into the local cache directory if they haven't been fetched before, and rewrites the tags to point to the local resources. It also records the original URL in a comment so that all the "clean" filter has to do is put that original URL back into the tag. And there we go, problem solved!
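
To illustrate the round trip described above, here is a minimal sketch. It is not the actual htmlrescache code: the sub names smudge_line, clean_line and _cached_tag and the regexes are assumptions made up for illustration, and the download step is left out entirely, so only the rewriting is shown. It also assumes that the marker, the tag and its URL are all on one line and that the URL ends in a file name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed name of the local cache directory (as in the examples above).
my $CACHE_DIR = '_cache';

# smudge: rewrite a marked tag to point at the local copy and record the
# original URL in a comment. Fetching the resource is omitted in this sketch.
sub smudge_line {
    my ($line) = @_;
    $line =~ s{<!--cacheable-->(<(?:link|script)\b[^>]*?\s(?:href|src)=")(https?://[^"\s]+)"}
              {_cached_tag($1, $2)}ge;
    return $line;
}

# Helper (hypothetical): build the replacement text for one matched tag.
sub _cached_tag {
    my ($tag_start, $url) = @_;
    my ($basename) = $url =~ m{([^/]+)\z};
    # No usable file name (URL ends in "/"): put the tag back unchanged.
    return qq{<!--cacheable-->$tag_start$url"} unless defined $basename;
    return qq{<!-- CACHED FROM "$url" -->$tag_start$CACHE_DIR/$basename"};
}

# clean: restore the marker and put the recorded URL back into the tag.
sub clean_line {
    my ($line) = @_;
    $line =~ s{<!-- CACHED FROM "([^"]+)" -->(<(?:link|script)\b[^>]*?\s(?:href|src)=")[^"]*"}
              {<!--cacheable-->$2$1"}g;
    return $line;
}

# Filter mode is the first argument; input comes on STDIN, output goes to STDOUT.
my %filter_for = ( smudge => \&smudge_line, clean => \&clean_line );
if ( my $filter = $filter_for{ $ARGV[0] // '' } ) {
    print( $filter->($_) ) while <STDIN>;
}
```

Saved under a hypothetical name such as filter-sketch.pl, it could be tried with "perl filter-sketch.pl smudge < some.html". Piping that output through the clean mode should give back the original lines, which is the property git relies on to consider the working copy unmodified.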

You can find my code on Bitbucket as "htmlrescache". I've implemented the "clean" and "smudge" filters in one script, and also implemented an "init" command that sets up the git configuration I described above.
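
As a rough idea of what an init step like that might do, here is a minimal sketch, again not the actual htmlrescache code. The filter name htmlcache and the subs config_commands and attributes_line are made up for illustration. It assumes the filter script takes the mode as its first argument, followed by the file name that git substitutes for %f.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical filter name, chosen only for this sketch.
my $FILTER_NAME = 'htmlcache';

# Build the git commands that register the smudge and clean filters in the
# repository-local config. git replaces %f with the path of the file being
# filtered. The script path would need shell quoting if it contains spaces.
sub config_commands {
    my ($script) = @_;
    return map {
        [ 'git', 'config', '--local', "filter.$FILTER_NAME.$_", "$script $_ %f" ]
    } qw(smudge clean);
}

# The .gitattributes line that applies the filter to all HTML files.
sub attributes_line {
    return "*.html  filter=$FILTER_NAME\n";
}

# Usage (inside a git working copy): perl init-sketch.pl /path/to/filter-script
if (@ARGV) {
    my ($script) = @ARGV;
    for my $cmd ( config_commands($script) ) {
        system(@$cmd) == 0 or die "Command failed: @$cmd\n";
    }
    open my $fh, '>>', '.gitattributes' or die ".gitattributes: $!\n";
    print {$fh} attributes_line();
    close $fh or die ".gitattributes: $!\n";
}
```

Since filters only run when git writes files to the working copy, files that are already checked out would presumably need to be checked out again before the smudge filter takes effect.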

For one real-world example, see this HTML file, which contains the line:

<!--cacheable--><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.0/normalize.min.css" integrity="sha256-oSrCnRYXvHG31SBifqP2PM1uje7SJUyX0nTwO2RJV54=" crossorigin="anonymous" />

However, checked out on my local machine, that line shows up as:

<!-- CACHED FROM "https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.0/normalize.min.css" --><link rel="stylesheet" href="_cache/normalize.min.css" integrity="sha256-oSrCnRYXvHG31SBifqP2PM1uje7SJUyX0nTwO2RJV54=" crossorigin="anonymous" />

And yet:

$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean

(I also have an older git filter lying around somewhere that implements a couple of SVN keywords like $Id$ or $Date$ for git, but that's not really ready for publication at the moment. If someone is interested I could maybe find some time to prepare it.)

Replies are listed 'Best First'.
Re: Using a git filter to Cache HTML Resources Locally
by Anonymous Monk on Oct 11, 2018 at 07:25 UTC

    Hi,

    How do you do this without filters?

      Basically you do the same steps, but without involving git:

      1. Look at all HTML files
      2. Identify external resources
      3. Download the external resources
      4. Rewrite HTML files to use the local resources

      You can just invoke the linked htmlrescache program as

      htmlrescache clean my/file.html

      Just make sure you only operate on a copy, not on the original.

        You can just invoke the linked htmlrescache program as
        htmlrescache clean my/file.html

        Sorry, that's not quite right: git filters are given the file contents on STDIN and must write their output to STDOUT. The filename on the command line is only informative for the script, because filters can be used on files that are being added or deleted, so git doesn't make any guarantees that the file even exists. In fact, git filters aren't normally even given the filename; I had to specify %f in the git filter setting. My script just uses the filename to calculate the path of the cache directory relative to the file.

        However, your comment served as the inspiration to update the script so that it now supports new options: -i for inplace editing (use -IEXT to specify an extension for the backup file; uses $^I under the hood), and -G to disable the use of git, so paths are resolved relative to the current working directory instead of the git working directory. So thank you for that! :-)

        So now you can do:

        $ htmlrescache -GI.bak smudge my/file.html   # cache HTML resources
        $ mv my/file.html.bak my/file.html           # restore original file

        or

        $ htmlrescache -Gi smudge my/file.html   # cache HTML resources
        $ htmlrescache -Gi clean my/file.html    # restore original URLs

      How do you do this without filters?

      I'm not quite sure what you're asking specifically... you can edit the files manually, you can use a hacked solution like this, or you can use a different script like this (the now removed predecessor of htmlrescache). This is how you would use htmlrescache standalone.

        What I'm trying to get at is the good practice of local caching without rewriting URLs, e.g. <script src="//cdn..../../../"

        Or using something like url_for('resource.js') in JavaScript or Perl, to load resources based on configuration/environment.

        Something that's one and done, not a physical rewrite on each change.
