Closing files

Contrarian view on closing files.

It has become conventional wisdom to always explicitly close file-like objects, via context managers. The google style guide is representative:

Explicitly close files and sockets when done with them. Leaving files, sockets or other file-like objects open unnecessarily has many downsides, including:

They may consume limited system resources, such as file descriptors.

  • Code that deals with many such objects may exhaust those resources unnecessarily if they're not returned to the system promptly after use.
  • Holding files open may prevent other actions being performed on them, such as moves or deletion.
  • Files and sockets that are shared throughout a program may inadvertantly be read from or written to after logically being closed. If they are actually closed, attempts to read or write from them will throw exceptions, making the problem known sooner.

Furthermore, while files and sockets are automatically closed when the file object is destructed, tying the life-time of the file object to the state of the file is poor practice, for several reasons:

  • There are no guarantees as to when the runtime will actually run the file's destructor. Different Python implementations use different memory management techniques, such as delayed Garbage Collection, which may increase the object's lifetime arbitrarily and indefinitely.
  • Unexpected references to the file may keep it around longer than intended (e.g. in tracebacks of exceptions, inside globals, etc).

The preferred way to manage files is using the "with" statement:

with open("hello.txt") as hello_file:
    for line in hello_file:
        print line

In theory

Good points, and why limit this advice to file descriptors? Any resource may be limited or require exclusivity; that's why they're called resources. Similarly one should always explicitly call dict.clear when finished with a dict. After all, "there are no guarantees as to when the runtime will actually run the <object's> destructor. And "code that deals with many such objects may exhaust those resources unnecessarily", such as memory, or whatever else is in the dict.

But in all seriousness, this advice is applying a notably higher standard of premature optimization to file descriptors than to any other kind of resource. There are plenty of Python projects that are guaranteed to run on CPython for a variety of reasons, where destructors are immediately called. And there are plenty of Python projects where file descriptor usage is just a non-issue. It's now depressingly commonplace to see this in setup.py files:

In [ ]:
with open("README.md") as readme:
    long_description = readme.read()

Let's consider a practical example: a load function which is supposed to read and parse data given a file path.

In [ ]:
import csv
import json

def load(filepath):
    """the supposedly bad way"""
    return json.load(open(filepath))

def load(filepath):
    """the supposedly good way"""
    with open(filepath) as file:
        return json.load(file)

def load(filepath):
    """with a different file format"""
    with open(filepath) as file:
        return csv.reader(file)

Which versions work correctly? Are you sure? If it's not immediately obvious why one of these is broken, that's the point. In fact, it's worth trying out before reading on.

...

The csv version returns an iterator over a closed file. It's a violation of procedural abstraction to know whether the result of load is lazily evaluated or not; it's just supposed to implement an interface. Moreover, according to this best practice, it's impossible to write the csv version correctly. As absurd as it sounds, it's just an abstraction that can't exist.

Defiantly clever readers are probably already trying to fix it. Maybe like this:

In [ ]:
def load(filepath):
    with open(filepath) as file:
        yield from csv.reader(file)

No, it will not be fixed. This version only appears to work by not closing the file until the generator is exhausted or collected.

This trivial example has deeper implications. If one accepts this practice, then one must also accept that storing a file handle anywhere, such as on an instance, is also disallowed. Unless of course that object then virally implements it owns context manager, ad infinitum.

Furthermore it demonstrates that often the context is not being managed locally. If a file object is passed another function, then it's being used outside of the context. Let's revisit the json version, which works because the file is fully read. Doesn't a json parser have some expensive parsing to do after it's read the file? It might even throw an error. And isn't it desirable, trivial, and likely that the implementation releases interest in the file as soon as possible?

So in reality there are scenarios where the supposedly good way could keep the file open longer than the supposedly bad way. The original inline version does exactly what it's supposed to do: close the file when all interested parties are done with it. Python uses garbage collection to manage shared resources. Any attempt to pretend otherwise will result in code that is broken, inefficient, or reinventing reference counting.

A true believer now has to accept that json.load is a useless and dangerous wrapper, and that the only correct implementation is:

In [ ]:
def load(filepath):
    with open(filepath) as file:
        contents = file.read()
    return json.loads(contents)

This line of reasoning reduces to the absurd: a file should never be passed or stored anywhere. Next an example where the practice has caused real-world damage.

In practice

Requests is one of the most popular python packages, and officially recommended. It includes a Session object which supports closing via a context manager. The vast majority of real-world code uses the the top-level functions or single-use sessions.

In [ ]:
response = requests.get(...)

with requests.Session() as session:
    response = session.get(...)

Sessions manage the connection pool, so this pattern of usage is establishing a new connection every time. There are popular standard API clients which seriously do this, for every single request to the same endpoint.

Requests' documentation prominently states that "Keep-alive and HTTP connection pooling are 100% automatic". So part of the blame may lay with that phrasing, since it's only "automatic" if sessions are reused. But surely a large part of the blame is the dogma of closing sockets, and therefore sessions, explicitly. The whole point of a connection pool is that it may leave connections open, so users who genuinely need this granularity are working at the wrong abstraction layer. http.client is already builtin for that level of control.

Tellingly, requests' own top-level functions didn't always close sessions. There's a long history to that code, including a version that only closed sessions on success. An older version was causing warnings, when run to check for such warnings, and was being blamed for the appearance of leaking memory. Those threads are essentially debating whether a resource pool is "leaking" resources.

Truce

Prior to with being introduced in Python 2.5, it was not recommended that inlined reading of a file required a try... finally block. Far from it, in the past idioms like open(...).read() and for line in open(...) were lauded for being succinct and expressive. But if all this orphaned file descriptor paranoia was well-founded, it would have been a problem back then too.

Finally, let's address readability. It could be argued (though it rarely is) that showing the reader when the file is closed has inherent value. Conveniently, that tends to align with having opened the file for writing anyway, thereby needing an reference to it. In which case, the readability is approximately equal, and potential pitfalls are more realistic. But readability is genuinely lost when the file would have been opened in a inline expression.

The best practice is unjustifiably special-casing file descriptors, and not seeing its own reasoning through to its logical conclusion. This author proposes advocating for anonymous read-only open expressions. Your setup script is not going to run out of file descriptors because you wrote open("README.md").read().