datalad_next.itertools.itemize

datalad_next.itertools.itemize(iterable: Iterable[T], sep: T | None, *, keep_ends: bool = False) → Generator[T, None, None]

Yields only complete items, assembled from the chunks of an iterable

This function consumes chunks from an iterable and yields items defined by a separator. An item might span multiple input chunks. Input chunks can be bytes, bytearray, or str objects. The result type is determined by the type of the first input chunk. The type of the elements in iterable must not change while the generator is being consumed.

Items are defined by a separator given via sep. If sep is None, the line-separators built into str.splitlines() are used, and each yielded item will be a line. If sep is not None, its type must be compatible with the type of the elements in iterable.
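As a rough illustration of splitlines-mode, the sketch below shows one way to assemble lines from chunks with str.splitlines(). It is not the datalad_next implementation: itemize_lines_sketch is a hypothetical name, and holding back the last line until more input arrives is an assumption made here so that a '\r\n' split across two chunks still counts as a single line ending.

```python
from typing import Generator, Iterable, TypeVar

T = TypeVar("T", str, bytes)


def itemize_lines_sketch(
    chunks: Iterable[T], *, keep_ends: bool = False
) -> Generator[T, None, None]:
    """Hypothetical sketch: yield lines assembled from arbitrary chunks."""
    buffer = None
    for chunk in chunks:
        # Merge leftover content from earlier chunks with the new chunk.
        buffer = chunk if buffer is None else buffer + chunk
        lines = buffer.splitlines(keepends=True)
        if not lines:
            continue
        # Hold back the last line: it may be incomplete, or it may end
        # in '\r' with the matching '\n' still to come in the next chunk.
        buffer = lines.pop()
        for line in lines:
            # splitlines() on a single line strips its line ending.
            yield line if keep_ends else line.splitlines()[0]
    if buffer:
        # Whatever is left at the end is the final (possibly unterminated) line.
        yield buffer if keep_ends else buffer.splitlines()[0]
```

For example, list(itemize_lines_sketch(['a\r', '\nb\nc'])) gives ['a', 'b', 'c'], even though the '\r\n' ending is split across the two chunks.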

A separator could, for example, be b'\n', in which case the items would be terminated by Unix line-endings, i.e. each yielded item is a single line. The separator could also be b'\x00' (or '\x00') to split zero-byte-delimited content, like the output of git ls-files -z.

Separators can be longer than one byte or character, e.g. b'\r\n', or b'\n-------------------\n'.

Content after the last separator, possibly merged across input chunks, is always yielded as the last item, even if it is not terminated by the separator.
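The behavior described above for an explicit separator can be approximated with a short buffer-and-split loop. This is a minimal sketch under stated assumptions, not the library's code: itemize_sketch is a hypothetical name, and the sketch assumes that empty trailing content (input that ends exactly on a separator) is not yielded as an extra item.

```python
from typing import Generator, Iterable, TypeVar

T = TypeVar("T", str, bytes)


def itemize_sketch(
    chunks: Iterable[T], sep: T, *, keep_ends: bool = False
) -> Generator[T, None, None]:
    """Hypothetical sketch: yield sep-delimited items assembled from chunks."""
    buffer = None
    for chunk in chunks:
        # Merge leftover content from earlier chunks with the new chunk, so
        # items and separators that span chunk boundaries are seen whole.
        buffer = chunk if buffer is None else buffer + chunk
        parts = buffer.split(sep)
        # Every part except the last one was terminated by a separator.
        for part in parts[:-1]:
            yield part + sep if keep_ends else part
        # The last part is not (yet) terminated; keep it for the next chunk.
        buffer = parts[-1]
    if buffer:
        # Unterminated content after the last separator is the final item.
        yield buffer
```

With a two-character separator split across chunks, list(itemize_sketch(['ab\r', '\ncd\r\ne', 'f'], '\r\n')) gives ['ab', 'cd', 'ef']. Re-appending the separator when keep_ends is True is one extra concatenation per item in this sketch.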

Performance notes:

  • Using None as a separator (splitlines-mode) is slower than providing a specific separator.

  • If a separator other than None is used, running with keep_ends=False is faster than with keep_ends=True.

Parameters:
  • iterable (Iterable[str | bytes | bytearray]) -- The iterable that yields the input data

  • sep (str | bytes | bytearray | None) -- The separator that defines items. If None, the items are determined by the line-separators that are built into str.splitlines().

  • keep_ends (bool) -- If True, the item-separator will remain at the end of a yielded item. If False, items will not contain the separator. Preserving separators implies a runtime cost, unless the separator is None.

Yields:

str | bytes | bytearray -- The items determined from the input iterable. The type of the yielded items depends on the type of the first element in iterable.

Examples

>>> from datalad_next.itertools import itemize
>>> with open('/etc/passwd', 'rt') as f:                            
...     print(tuple(itemize(iter(f.read, ''), sep=None))[0:2])      
('root:x:0:0:root:/root:/bin/bash',
 'systemd-timesync:x:497:497:systemd Time Synchronization:/:/usr/sbin/nologin')
>>> with open('/etc/passwd', 'rt') as f:                            
...     print(tuple(itemize(iter(f.read, ''), sep=':'))[0:10])      
('root', 'x', '0', '0', 'root', '/root',
 '/bin/bash\nsystemd-timesync', 'x', '497', '497')
>>> with open('/etc/passwd', 'rt') as f:                                        
...     print(tuple(itemize(iter(f.read, ''), sep=':', keep_ends=True))[0:10])  
('root:', 'x:', '0:', '0:', 'root:', '/root:',
 '/bin/bash\nsystemd-timesync:', 'x:', '497:', '497:')