Custom File-Extension Path-Based Importers in Python

Python 3.1 introduced importlib, which can be used to modify the behavior of Python’s import system. In fact, standard Python imports have also been ported to use this new library.

As stated in the documentation, one of the reasons for this library’s existence is to allow programmers to write their own importer modules. As one can imagine, though, this functionality is not widely used, because most people have no desire to alter the behavior of standard Python importing.

However, there are definitely use cases. One blog post describes using the system to block certain modules from being imported. Further, Python itself uses this machinery to implement importing modules from zip files. There is also a great USENIX article that describes much of the functionality covered in this post.

In this post, I’d like to describe how one can use pre-existing machinery, namely importlib.machinery.FileFinder to quickly write a path-based Importer module to handle custom imports.

First, some background. Importing in Python is actually pretty straightforward (and pretty elegant). During each import statement, a list of known importers is consulted. Each importer reports whether it can handle the module name provided, and the first importer that can handle it is used to load the module.

Naturally, then, each Importer has two components, a finder and a loader:

find_loader(fullname) indicates whether a module can be loaded based on its name. If the module can be loaded, the loader is returned. If not, None is returned.

load_module(fullname) loads the module, returning the actual module and also doing some other work, such as placing it in sys.modules. The full set of responsibilities is described in PEP 302.

The importers are loaded from two lists, sys.meta_path and sys.path_hooks. The former imports modules based simply on their names, while the latter imports modules based on their names within a certain path.
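For a taste of the first list, here is a minimal sketch of a sys.meta_path entry in the spirit of the module-blocking post mentioned earlier (the class name and the blocked module are my own inventions):

import sys

class BlockListFinder(object):
    def __init__(self, blocked):
        self.blocked = set(blocked)

    def find_module(self, fullname, path=None):
        # Claim the module only if we intend to block it;
        # returning None defers to the next finder in line.
        return self if fullname in self.blocked else None

    def load_module(self, fullname):
        raise ImportError('import of {} is blocked'.format(fullname))

sys.meta_path.insert(0, BlockListFinder(['telnetlib']))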

Using this knowledge, our goal is to then allow something like this to happen:

In the directory of our project, there is a JSON file, records.json, which contains customer records indexed by their full name. We want to seamlessly import this file and use it as if it were a dictionary. If the file doesn’t exist, naturally, we’d like to throw an error.

import records
print(records['Jane Doe'])
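Here, records.json might contain something like the following (the contents are invented for illustration):

{
    "Jane Doe": {"email": "jane@example.com", "status": "active"},
    "John Smith": {"email": "john@example.com", "status": "delinquent"}
}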

This seems pretty simple knowing what we know about how the Python import system works:

  1. Since we are operating on the filesystem, we need information about paths. Therefore, we’d like to create an importer module that can be appended to sys.path_hooks.
  2. Our find_loader implementation should take the module name (records, in this case), append the “.json” extension to it, and then check whether that file exists on the filesystem. If it does, it should return the loader described in (3).
  3. Our load_module implementation should take the module name, append “.json” to it, read the contents from the filesystem, and load the JSON using Python’s json module.

As you might notice, steps 1 and 2 are not necessarily JSON specific. In fact, they’re filesystem specific. Luckily, steps 1 and 2 have already been written and provided by Python in the form of importlib.machinery.FileFinder. We can then utilize this to write our JSON Importer.

FileFinder also has a nice helper function, FileFinder.path_hook, which lets us specify a series of (loader, extensions) pairs and returns a callable suitable for insertion into sys.path_hooks. We then only need to write the loader: a callable which, given a module name and a path, returns an object with a load_module(fullname) method. In our implementation, we use a class’ constructor as this callable (as suggested in PEP 302). We write our loader:

import json
import sys

class JsonLoader(object):
    def __init__(self, name, path):
        # FileFinder instantiates the loader with the module name and the
        # full path to the matched file; we only need the path.
        self.path = path

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]

        with open(self.path, 'r') as f:
            module = json.load(f)

        sys.modules[fullname] = module
        return module

Now we can use the already existing machinery to add this loader into our import system:

from importlib.machinery import FileFinder

json_hook = FileFinder.path_hook( (JsonLoader, ['.json']) )
sys.path_hooks.insert(0, json_hook)

# Need to invalidate the path hook's cache and force reload
sys.path_importer_cache.clear()

import records
print(records['Jane Doe'])

And voila! We have added our new JSON importing functionality. The most important part of the above code block is sys.path_importer_cache.clear(). By the time your code begins running, every path on sys.path that has already been checked for imports has had its finder cached. To ensure that the newly added JSON hook is actually consulted, we simply invalidate the cache.
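You can inspect this cache directly to see which finders have already claimed which paths. The exact contents are machine-specific, but on a typical run it looks something like this:

import sys

for path, finder in sys.path_importer_cache.items():
    print(path, '->', finder)

# /home/user/project -> FileFinder('/home/user/project')
# /usr/lib/python3.4 -> FileFinder('/usr/lib/python3.4')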

The great thing about this code is that FileFinder’s path_hook handles all of the filesystem operations for you. It automatically traverses directories when directories are part of the import statement and automatically verifies extensions. All you have to worry about is the loading logic!

Of course, a solution this specific is only so useful. It’s also possible to generalize what we’ve done.

from importlib.machinery import FileFinder
import json
import sys

class ExtensionImporter(object):
    def __init__(self, extension_list):
        self.extensions = extension_list

    # FileFinder calls this with the module name and the full path to the
    # matched file; we remember the path and act as our own loader.
    def find_loader(self, name, path):
        self.path = path
        return self

    # Subclasses call this first so that already-imported modules are honored.
    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]

        return None

class JsonImporter(ExtensionImporter):
    def __init__(self):
        super(JsonImporter, self).__init__(['.json'])

    def load_module(self, fullname):
        premodule = super(JsonImporter, self).load_module(fullname)
        if premodule is not None:
            return premodule

        try:
            with open(self.path, 'r') as f:
                module = json.load(f)
        except IOError:
            raise ImportError("Couldn't open path: " + self.path)

        sys.modules[fullname] = module
        return module

extension_importers = [JsonImporter()]
hook_list = []
for importer in extension_importers:
    hook_list.append( (importer.find_loader, importer.extensions) )

sys.path_hooks.insert(0, FileFinder.path_hook(*hook_list))
sys.path_importer_cache.clear()

import records
print(records['Jane Doe'])

Now there’s no need to worry about any filesystem details. If we want a new importer based on a file extension, we simply extend the ExtensionImporter class, create a load_module method, and pop it into the extension_importers list.
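For example, a CSV importer, sketched here as a hypothetical (the class and extension choice are mine), takes only a few lines on top of the base class:

import csv

class CsvImporter(ExtensionImporter):
    def __init__(self):
        super(CsvImporter, self).__init__(['.csv'])

    def load_module(self, fullname):
        premodule = super(CsvImporter, self).load_module(fullname)
        if premodule is not None:
            return premodule

        # Expose the .csv file as a list of row lists.
        with open(self.path, 'r') as f:
            module = list(csv.reader(f))

        sys.modules[fullname] = module
        return module

extension_importers = [JsonImporter(), CsvImporter()]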

And thus we have a complete solution for creating custom file-extension-based path importers. Two lessons I learned while writing this post:

  1. Don’t forget to call sys.path_importer_cache.clear().
  2. Appending the finder function to the end of sys.path_hooks doesn’t work, because the stock FileFinder hook that CPython places there accepts any existing directory and therefore shadows anything appended after it. Inserting our hook at the beginning, however, does (see the snippet below).
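The reason becomes clear if you print sys.path_hooks on a stock CPython 3 installation. The exact contents vary by version, but the output looks roughly like this:

import sys

for hook in sys.path_hooks:
    print(hook)

# <class 'zipimport.zipimporter'>
# <function FileFinder.path_hook.<locals>.path_hook_for_FileFinder at 0x...>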

Simple RPC With Thrift

A key aspect of building a server-based (cloud-based in today’s lingo) service is communication. The (often times remote) client needs to communicate with the server. Further, sometimes other processes on the server also need to communicate with each other. There are several ways to accomplish this, one of which is with RPC.

Many budding programmers and hackathoners today will jump straight to “Why not just create a simple REST API using JSON serialization?”. On the surface, there are many good things about this. I’m sure you’ve heard them all:

  1. REST is simple and stateless. The semantics of a REST API are widely used and therefore easy to use once you’ve learned them. With a REST API on your server, it’s super easy to manipulate data and debug operations using tools like Postman.
  2. REST encourages readability. I’m all about readability everywhere in computing. Code should be readable, text should be readable, interfaces should be readable. It only makes sense, then, that an API should be readable as well. REST encourages this. It’s very easy to tell what GET api/users will do.
  3. REST encourages readable serializations. Since the API endpoints are readable, it’s only natural to make the response readable as well. Today, most APIs accomplish this by serializing data in the easy-to-read JSON format. This way, data is easy to read and easy to manipulate.

This is all well and good. Hackathoners and newcomers should not feel discouraged from using REST/JSON to create an interface to their cloud application. There’s one problem that I’m sure you’ve noticed, however.

JSON is heavy. When you’ve implemented a distributed, load-balanced, fully cached, and 100% optimized service, the largest bottleneck is transmission time from the server to the client, especially if the response object is large. In fact, on all of the teams I’ve worked on at various companies (except one), complaints about transmission time for huge serialized objects were extremely common.
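To make “heavy” concrete: a JSON serialization repeats every field name in every record, while a binary protocol can send a small integer field id instead. A quick sketch (the record shape is invented for illustration):

import json

# 1,000 identical records: the repeated key strings account for most of the bytes.
records = [{'customer_name': 'Jane Doe', 'account_balance': 42}] * 1000
payload = json.dumps(records)
print(len(payload))  # roughly 54 KB, the bulk of it field names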

Also, REST is cumbersome. I need more than two hands to count how many times I’ve coded up, from scratch, an interface layer that handles REST-style requests and serves the response in JSON. At one point I even thought it a good idea to create a C# tool that generates a PHP REST interface from a given data schema. Why did I have to do this? Surely, someone else has already done it!

Enter Facebook’s Thrift, which is open source and Apache licensed. For me, Thrift’s biggest selling point is how it solves the problems mentioned above. To use Thrift, you design a schema to represent both your objects and your service. The Thrift compiler then uses your schema to generate a client and a server for you, meaning that you no longer have to handle the communication or serialization problems yourself.

As a contrived example, say that I wanted to make a simple service to get my server uptime. I first design a Thrift schema:

service UptimeService {
    i32 getUptimeInDays();
}

Then, I compile the schema using Thrift:

thrift --gen py uptime.thrift

I am going to use Python for my client and server, so I use the --gen py flag. Thrift has many supported languages, however.

I can then use the generated Python libraries to write my server implementation:

# Generated by `thrift --gen py uptime.thrift`; the gen-py output
# directory must be importable (e.g., added to sys.path) for this to resolve.
import UptimeService
import subprocess

from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer

class UptimeHandler:
    def getUptimeInDays(self):
        # Naive parsing of output like " 12:34:56 up 10 days, ...";
        # check_output() returns bytes in Python 3, so decode first.
        return int(subprocess.check_output('uptime').decode().split()[2])

if __name__ == '__main__':
    handler = UptimeHandler()
    processor = UptimeService.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    server.serve()

And I also use the generated Python libraries to write my client implementation:

# As with the server, the generated gen-py output must be importable.
import UptimeService

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

def main():
    transport = TSocket.TSocket('localhost', 9090)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = UptimeService.Client(protocol)

    transport.open()
    print("Server uptime: {} days".format(client.getUptimeInDays()))
    
    transport.close()

if __name__ == '__main__':
    main()
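One nitpick on my own client: if the RPC raises, the transport is never closed. A slightly more defensive main(), using the same imports as above, wraps the call in try/finally:

def main():
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = UptimeService.Client(protocol)

    transport.open()
    try:
        print("Server uptime: {} days".format(client.getUptimeInDays()))
    finally:
        transport.close()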

Of course, this client implementation only works when run from the server itself. If I wanted to get the uptime remotely, I would just change ‘localhost’ to my domain name.

And that’s it! All of the networking details and serialization are handled by Thrift. Of course, a big portion of this post went on about how JSON is too big and too heavy. If you compare the performance of various Thrift-like libraries, you’ll notice that Thrift is definitely not the fastest: libraries like Avro, Protobuf, and Cap’n Proto are faster and more compact.

However, a major pain point for me across the years has been implementing the actual interface layer. Writing client code to issue an HTTP request and read the response from the server gets old after a while. This is something Thrift handles for you. As you can see from the example above, Thrift takes care of all serialization, deserialization, and message-passing; all you have to worry about is defining a schema, and Thrift gives you the rest for free!

Fez

Hexahedron

Hi there, how are you? I will be your hexahedron today.

So this is just a routine procedure, but I do need someone here just in case something goes wrong…

If something does go wrong, you are going to have to clean up the mess.

Hey wait a minute — can you even understand what I am saying? And what is wrong with your head?

Oh well — you are here now — might as well do this thing. Prepare to have your mind blown.

~~~

All right, welcome to the club. Enjoy your free hat!

I kind of thought maybe this would not work because of your weird head. But everything looks A-Ok from over here.

Thanks for the hand. You can go home now! It was very nice to meet you.

Do Not Give Users Reason to Uninstall

Before I begin, I want to point out that this post is not about some evil method of user retention, nor is it about uninstalling software at the operating-system level. This post is scoped to any single piece of security software and its owner.

During the development of my first piece of security-related software, No-CSRF, I began to notice a lot of thoughts crossing my mind in the vein of “But what if the user forgets to re-enable after they disable?”. I was spending a lot of time making a Chrome extension that would make the user safer as they browsed the World Wide Web; however, the extension was only useful while the user kept it enabled.

I never did anything about this issue, however. The only attempt I made to solve the problem was to ensure that the user could re-enable the extension just as easily as they disabled it. When they disabled it, they would know that they could also re-enable it. [1]

When the first version of the extension was released, however, it was a little too strict. Sites that posed no security threat had their functionality broken, and some sites were blocked completely. In fact, one of my colleagues, who was excited to try the extension and make his browser a small bit safer, ran into many of these cases. After finding that he was unable to pay an online bill because of the extension, he uninstalled it from Chrome completely and urged me to fix it.

This isn’t an issue in itself; however, it exemplifies a problematic attitude amongst users – something along the lines of “If it breaks what I need, then I don’t need it.” Even if No-CSRF was protecting users from many dangerous Cross-Site Request Forgeries, users chose to uninstall it, essentially choosing convenience over security.

Much like the users who uninstalled the extension, others will take far more drastic steps to make things convenient. Another friend of mine, upon having port issues with a dedicated server, decided to disable his firewall entirely. These choices, which jeopardize a user’s security, are made without much thought.

Thus, when it comes to security software, a careful balance must be sought. Although security-software developers would probably like their software to protect users from as many attacks as possible, they should instead balance protection against usability. A security tool that many users will actually keep is one that makes them more secure without changing their workflow.

If the developer chooses to make their software too strict, users may uninstall it, nullifying its security benefits. If users would have kept software with less protection, but protection nonetheless, then that is the software that should be produced.

As soon as users have a reason to uninstall, they will. Do not give them that reason.

Finding the balance between security and usability is difficult. Although it may be tempting to solve the problem by making the security software difficult to disable as a whole, users should never be stripped of this freedom. Thus, I propose that the following tenets be followed when making security-related software:

  1. Do not alter the difficulty of disabling/uninstalling. User freedom is just as important as security.
  2. Do not give the user an unavoidable reason to uninstall your software.
  3. Make disabling on a per-case basis easier than uninstalling.

If these three points are achieved, users will disable the security on a per-case basis whenever a broken workflow is not worth the increased protection, but will have no reason to disable or uninstall the software as a whole.

References
[1] This refers to a custom in-extension disable rather than the browser-level disable.