datalad_next.itertools.decode_bytes
datalad_next.itertools.decode_bytes(iterable: Iterable[bytes], encoding: str = 'utf-8', backslash_replace: bool = True) -> Generator[str, None, None]
Decode bytes in an iterable into strings.

This function decodes bytes or bytearray into str objects, using the specified encoding. Importantly, the decoding input can be spread across multiple chunks of heterogeneous sizes, for example output read from a process or pieces of a download.

Multi-byte encodings that are spread over multiple byte chunks are supported, and chunks are joined as necessary. For example, the utf-8 encoding for ö is b'\xc3\xb6'. If the encoding is split in the middle because a chunk ends with b'\xc3' and the next chunk starts with b'\xb6', a naive decoding approach like the following would fail:

    >>> [chunk.decode() for chunk in [b'\xc3', b'\xb6']]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <listcomp>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Compared to:

    >>> from datalad_next.itertools import decode_bytes
    >>> tuple(decode_bytes([b'\xc3', b'\xb6']))
    ('ö',)

Input chunks are joined only if that is necessary to decode the bytes properly:

    >>> from datalad_next.itertools import decode_bytes
    >>> tuple(decode_bytes([b'\xc3', b'\xb6', b'a']))
    ('ö', 'a')

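The chunk-joining behaviour described above can be approximated with Python's standard library alone. The following is a minimal sketch, not the implementation used by decode_bytes: the helper name incremental_decode is hypothetical, and it relies on codecs.getincrementaldecoder, which buffers an incomplete multi-byte sequence until the remaining bytes arrive.

```python
import codecs
from typing import Iterable, Iterator


def incremental_decode(chunks: Iterable[bytes],
                       encoding: str = "utf-8") -> Iterator[str]:
    """Hypothetical helper: decode chunks, buffering incomplete sequences."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        # Returns '' while a multi-byte sequence is still incomplete
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if bytes are left undecoded
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


result = tuple(incremental_decode([b"\xc3", b"\xb6", b"a"]))
print(result)  # ('ö', 'a')
```

As in the examples above, the two halves of ö are combined into a single string, while the unrelated chunk b'a' is decoded on its own.
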
If backslash_replace is True, undecodable bytes will be replaced with a backslash-substitution. Otherwise, undecodable bytes will raise a UnicodeDecodeError:

    >>> tuple(decode_bytes([b'\xc3']))
    ('\\xc3',)
    >>> tuple(decode_bytes([b'\xc3'], backslash_replace=False))
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte

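The substituted string shown above has the same form as the output of Python's built-in backslashreplace error handler. The following sketch uses only the standard library and assumes that decode_bytes produces equivalent output. It shows why a substituted result cannot always be mapped back to the original bytes: two different inputs yield the same string.

```python
# Standard-library sketch; assumes the substitution made by decode_bytes
# matches the built-in 'backslashreplace' error handler.
undecodable = b"\xc3"     # one byte: an incomplete UTF-8 sequence
literal_text = b"\\xc3"   # four ASCII bytes: backslash, 'x', 'c', '3'

from_undecodable = undecodable.decode("utf-8", errors="backslashreplace")
from_literal = literal_text.decode("utf-8")

# Both inputs produce the same four-character string
print(from_undecodable == from_literal)  # True
```

If the distinction matters, passing backslash_replace=False and handling the UnicodeDecodeError keeps undecodable input from being silently merged with valid input.
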
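Backslash-replacement of undecodable bytes is an ambiguous mapping, because, for example, the literal text \xc3 (that is, b'\\xc3') can already be present in the input. Its decoded form is identical to the substitution produced for the undecodable byte b'\xc3'.

- Parameters: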
iterable (Iterable[bytes]) -- Iterable that yields bytes that should be decoded.
encoding (str (default: 'utf-8')) -- Encoding to be used for decoding.
backslash_replace (bool (default: True)) -- If True, backslash-escapes are used for undecodable bytes. If False, a UnicodeDecodeError is raised if a byte sequence cannot be decoded.
- Yields:
str -- Decoded strings that are generated by decoding the data yielded by iterable with the specified encoding.
- Raises:
UnicodeDecodeError -- If backslash_replace is False and the data yielded by iterable cannot be decoded with the specified encoding.