Split an Iterable
Split an iterable into equal sized chunks.¶
A common task and interview question, with many variants. It's frequently asked and answered in a way that's suboptimal and only handles one specific case. The goal here is to present definitive, general, and efficient solutions.
The first variant is whether or not the chunks will overlap. Although this could be generalized into a step
parameter, it's nearly always the case that step in (1, size)
.
The second variant is whether the tail of the data should be returned, if it's not of equal size. Clearly when step == 1
that's unlikely to be desirable. However, when step == size
, that would seem to be the natural choice. And again when 1 < step < size
, the desired behavior isn't clear at all.
The third variant is whether to slice sequences, or support any iterable. Obviously working for any iterable would be ideal, but it's also likely a user would expect slices given a sequence, particularly in the case of strings.
So this author feels it's best to split the problem in 2 distinct cases: a sliding window
for overlapping sequences, and chunks
for discrete sequences. In each case, supporting iterables and using advanced iterator algebra for a minimal and efficient solution.
Window¶
import itertools
def window(iterable, size=2):
"""Generate a sliding window of values."""
its = itertools.tee(iterable, size)
return zip(*(itertools.islice(it, index, None) for index, it in enumerate(its)))
list(window('abcde'))
That's simple, and close to optimal. There is slight overhead in iterating an islice
object, so a minor variant would be to force the step-wise iteration in advance.
import collections
def window(iterable, size=2):
"""Generate a sliding window of values."""
its = itertools.tee(iterable, size)
for index, it in enumerate(its):
collections.deque(itertools.islice(it, index), 0) # exhaust iterator
return zip(*its)
list(window('abcde'))
Chunks¶
A lesser-known and under-utilized feature of iter
is that in can take a callable (of no arguments) and a sentinel to create an iterator. A perfect use case of the "loop and a half" idiom.
def chunks(iterable, size):
"""Generate adjacent chunks of data"""
it = iter(iterable)
return iter(lambda: tuple(itertools.islice(it, size)), ())
list(chunks('abcde', 3))
This should also be optimal, for reasonable sizes.
Sequences with dispatch¶
Rather than explicitly check isinstance
, this is a perfect use case for functools.singledispatch
.
import functools
from collections.abc import Sequence
window = functools.singledispatch(window)
@window.register
def _(seq: Sequence, size=2):
for index in range(len(seq) - size + 1):
yield seq[index:index + size]
list(window('abcde'))
list(window(iter('abcde')))
chunks = functools.singledispatch(chunks)
@chunks.register
def _(seq: Sequence, size):
for index in range(0, len(seq), size):
yield seq[index:index + size]
list(chunks('abcde', 3))
list(chunks(iter('abcde'), 3))