Bug report
Bug description:
The problem
- For an in-band, writable PickleBuffer, the pickler does buf = m.tobytes(). That value is always bytes.
- If the buffer is empty, buf is always the b'' singleton, so id(buf) is the same for every such call.
- The pickler then does in_memo = id(buf) in self.memo.
- When in_memo is false, save_bytearray(buf) writes the data as a bytearray and memoizes it under id(buf), which is the id of the b'' singleton.
- Once b'' has been memoized anywhere earlier in the same dump, every later empty in-band tobytes() in that dump sees in_memo as true.
- So any later part of the same pickle that is supposed to store b'' as bytes may instead reuse that memo entry. The unpickler then returns the bytearray, not bytes.
Code that requires bytes (strict isinstance etc.) can fail during unpickling when it receives a bytearray.
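The identity collision described in the list above can be checked in isolation. The following is a small sketch, not the pickler's code: the memo dict and its value are stand-ins for illustration, and the sharing of a single empty bytes object is a CPython implementation detail that other interpreters may not have.

```python
# Sketch: an id()-keyed memo cannot distinguish "empty bytes produced by
# tobytes()" from "the empty bytes literal" when both are the same object.
empty_from_view = memoryview(bytearray()).tobytes()
empty_literal = b""

# On CPython both names are expected to refer to one shared object.
same_object = empty_from_view is empty_literal
print(type(empty_from_view).__name__, same_object)

# Hypothetical stand-in for the pickler memo, keyed on id() like self.memo.
memo = {id(empty_from_view): "stored as bytearray"}
print(id(empty_literal) in memo)
```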
Excerpt from pickler.py, but the same applies to the C implementation:
    if in_memo:
        pb_branch = "_save_bytearray_no_memo"
        self._save_bytearray_no_memo(buf)
    else:
        pb_branch = "save_bytearray"
        self.save_bytearray(buf)
A minimal repro:
    import dill
    from pickle import PickleBuffer

    # A somewhat artificial example that triggers the wrong code path.
    def repro_minimal() -> None:
        # Empty, writable, in-band buffer: pickled first.
        pb = PickleBuffer(memoryview(bytearray()))

        def f():
            pass

        blob = dill.dumps((pb, f), protocol=5)
        # Fails because dill's _create_code receives a bytearray instead of bytes.
        dill.loads(blob)
        print("dill.loads ok")

    repro_minimal()
A less synthetic repro, if you have pandas and PyArrow installed:
    import dill
    import pandas as pd

    def repro_arrow_empty_dataframe() -> None:
        col = "EMPTY_STRING_COLUMN"
        # A PyArrow-backed string column holding a single empty string.
        df = pd.DataFrame({col: pd.Series([""], dtype="string[pyarrow]")})

        def g():
            pass

        blob = dill.dumps((df, g), protocol=5)
        dill.loads(blob)
        print("dill.loads ok")

    repro_arrow_empty_dataframe()
Both result in TypeError: code() argument 16 must be bytes, not bytearray
However, that code() case is only one example; the underlying issue is bytes vs bytearray memo reuse, and similar failures can appear anywhere bytes are required.
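To show that the type swap does not depend on code objects, here is a sketch that uses only the standard library. It assumes that the pure-Python pickler (the private class pickle._Pickler) is the code path involved. The helper name roundtrip_pure_python is made up for this example. On an affected interpreter the second element is expected to come back as a bytearray; on an unaffected one it stays bytes, so the script only prints the types it sees.

```python
import io
import pickle
from pickle import PickleBuffer


def roundtrip_pure_python(obj):
    """Dump with the pure-Python pickler, then load with pickle.loads."""
    out = io.BytesIO()
    pickle._Pickler(out, protocol=5).dump(obj)
    return pickle.loads(out.getvalue())


# Empty, writable buffer first, then a plain empty bytes object.
pb = PickleBuffer(bytearray())
first, second = roundtrip_pure_python((pb, b""))

print("buffer came back as:", type(first).__name__)
print("b'' came back as:", type(second).__name__)
```

If the second line reports bytearray, any consumer that strictly requires bytes would be exposed to the same failure as code().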
CPython versions tested on:
3.13
Operating systems tested on:
macOS