Split an Iterable

Split an iterable into equal sized chunks.

A common task and interview question, with many variants. It's frequently asked and answered in a way that's suboptimal and only handles one specific case. The goal here is to present definitive, general, and efficient solutions.

The first variant is whether or not the chunks will overlap. Although this could be generalized into a step parameter, it's nearly always the case that step in (1, size).

The second variant is whether the tail of the data should be returned, if it's not of equal size. Clearly when step == 1 that's unlikely to be desirable. However, when step == size, that would seem to be the natural choice. And again when 1 < step < size, the desired behavior isn't clear at all.

The third variant is whether to slice sequences, or support any iterable. Obviously working for any iterable would be ideal, but it's also likely a user would expect slices given a sequence, particularly in the case of strings.

So this author feels it's best to split the problem in 2 distinct cases: a sliding window for overlapping sequences, and chunks for discrete sequences. In each case, supporting iterables and using advanced iterator algebra for a minimal and efficient solution.

Window

In [1]:
import itertools

def window(iterable, size=2):
    """Generate a sliding window of values."""
    its = itertools.tee(iterable, size)
    return zip(*(itertools.islice(it, index, None) for index, it in enumerate(its)))

list(window('abcde'))
Out[1]:
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]

That's simple, and close to optimal. There is slight overhead in iterating an islice object, so a minor variant would be to force the step-wise iteration in advance.

In [2]:
import collections

def window(iterable, size=2):
    """Generate a sliding window of values."""
    its = itertools.tee(iterable, size)
    for index, it in enumerate(its):
        collections.deque(itertools.islice(it, index), 0)  # exhaust iterator
    return zip(*its)

list(window('abcde'))
Out[2]:
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]

Chunks

A lesser-known and under-utilized feature of iter is that in can take a callable (of no arguments) and a sentinel to create an iterator. A perfect use case of the "loop and a half" idiom.

In [3]:
def chunks(iterable, size):
    """Generate adjacent chunks of data"""
    it = iter(iterable)
    return iter(lambda: tuple(itertools.islice(it, size)), ())

list(chunks('abcde', 3))
Out[3]:
[('a', 'b', 'c'), ('d', 'e')]

This should also be optimal, for reasonable sizes.

Sequences with dispatch

Rather than explicitly check isinstance, this is a perfect use case for functools.singledispatch.

In [4]:
import functools
from collections.abc import Sequence

window = functools.singledispatch(window)

@window.register
def _(seq: Sequence, size=2):
    for index in range(len(seq) - size + 1):
        yield seq[index:index + size]

list(window('abcde'))
Out[4]:
['ab', 'bc', 'cd', 'de']
In [5]:
list(window(iter('abcde')))
Out[5]:
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
In [6]:
chunks = functools.singledispatch(chunks)

@chunks.register
def _(seq: Sequence, size):
    for index in range(0, len(seq), size):
        yield seq[index:index + size]

list(chunks('abcde', 3))
Out[6]:
['abc', 'de']
In [7]:
list(chunks(iter('abcde'), 3))
Out[7]:
[('a', 'b', 'c'), ('d', 'e')]