tree f86c5a008b5fcb08111bb56d80863ff869c3ab43
parent ca3240a8c920225d344e64647453173ffb0a77f2
author Gary Pennington <31890086+garyanaplan@users.noreply.github.com> 1623861422 +0100
committer GitHub <noreply@github.com> 1623861422 -0400
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsBcBAABCAAQBQJgyiiuCRBK7hj4Ov3rIwAAyB0IAFbgU7V0CFHVyFv+wqtpv5Ix
 OhEANjnkf16j0K5qwqakTQPGMGfl2d7tZ33gb7STC1LsFNc+EJJbR8Ddc/4L5cJm
 EI3YYCrdMhQPTQZw/Jtcu3YFCyOkpVIiio05VSSkyycKH8nJ1pUJHDNBQ5vl+p5/
 WhqiduLL+iof2+Vrzh16KvqQ3+/dDwiJ5HVWMcSB4ODZ4oRxgYZL0o693YyGjoJH
 CnYDcb7mlnELoEkvGE+qzJ/3QJpERjwVcxn/vOjTQtH8doWw84Nw0yae++eutbQa
 Rt8o5VAyc5pka1u+Sv+xY043xD8oyPrhrUjMKIUy4kEbn/w/Q12ODSwqNoGt9X8=
 =K0Pu
 -----END PGP SIGNATURE-----
 

parquet: improve BOOLEAN writing logic and report error on encoding fail (#443)

* improve BOOLEAN writing logic and report error on encoding fail

When writing BOOLEAN data, writing more than 2048 rows will overflow
the hard-coded 256-byte buffer (2048 bits) set for the bit-writer in the
PlainEncoder. Once this occurs, further attempts to write to the encoder
fail because capacity is exceeded, but the errors are silently ignored.

This fix improves the error detection and reporting at the point of
encoding and modifies the logic for bit_writing (BOOLEANS). The
bit_writer is initially allocated 256 bytes (as at present), then each
time the capacity is exceeded the capacity is incremented by another
256 bytes.
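
A minimal sketch of the grow-on-overflow behaviour described above. The
names SketchBitWriter, put_bool, extend and encode_bools are
hypothetical stand-ins for illustration, not the parquet crate's actual
API; only the 256-byte chunk size comes from this change.

```rust
/// Initial capacity and growth step, in bytes (256 bytes = 2048 booleans).
pub const CHUNK_BYTES: usize = 256;

/// Hypothetical stand-in for the bit writer; not the parquet crate's type.
pub struct SketchBitWriter {
    buffer: Vec<u8>,   // fixed-length backing store, grown only via extend()
    bit_offset: usize, // number of bits written so far
}

impl SketchBitWriter {
    pub fn new(capacity_bytes: usize) -> Self {
        Self { buffer: vec![0; capacity_bytes], bit_offset: 0 }
    }

    pub fn capacity(&self) -> usize {
        self.buffer.len()
    }

    /// Bytes touched so far, rounding a partial byte up.
    pub fn bytes_written(&self) -> usize {
        (self.bit_offset + 7) / 8
    }

    /// Grow the backing store by `extra` zeroed bytes.
    pub fn extend(&mut self, extra: usize) {
        let new_len = self.buffer.len() + extra;
        self.buffer.resize(new_len, 0);
    }

    /// Write one bit, LSB first; returns false when the buffer is full.
    pub fn put_bool(&mut self, value: bool) -> bool {
        let byte = self.bit_offset / 8;
        if byte >= self.buffer.len() {
            return false;
        }
        if value {
            self.buffer[byte] |= 1 << (self.bit_offset % 8);
        }
        self.bit_offset += 1;
        true
    }
}

/// Encode booleans, growing by one chunk on overflow and reporting an
/// error instead of silently dropping values.
pub fn encode_bools(writer: &mut SketchBitWriter, values: &[bool]) -> Result<(), String> {
    for &value in values {
        if !writer.put_bool(value) {
            writer.extend(CHUNK_BYTES);
            if !writer.put_bool(value) {
                return Err("unable to put boolean value".to_string());
            }
        }
    }
    Ok(())
}
```

In this sketch the 2049th value triggers the first growth step, so the
capacity moves from 256 to 512 bytes.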

This certainly resolves the current problem, but it's not exactly a
great fix because the capacity of the bit_writer could now grow
substantially.

Other data types seem to use a more sophisticated mechanism for writing
data, one that needs neither a growing nor a fixed-size buffer. It
would be desirable to make the BOOLEAN type use this same mechanism if
possible, but that level of change is more intrusive and probably
requires greater knowledge of the implementation than I possess.

resolves: #349

* only manipulate the bit_writer for BOOLEAN data

Tacky, but I can't think of a better way to do this without
specialization.

* better isolation of changes

Remove the byte tracking from the PlainEncoder and use the existing
bytes_written() method in BitWriter.

This is neater.

* add test for boolean writer

The test ensures that we can write > 2048 rows to a parquet file and
that when we read the data back, it finishes without hanging (defined as
taking < 5 seconds).

If we don't want that extra complexity, we could remove the
thread/channel stuff and just try to read the file and let the test
runner terminate hanging tests.
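
The thread/channel approach can be sketched roughly as follows, using
only the standard library. The helper name finishes_within is
hypothetical and the closure stands in for "read the file back"; this
is not the test code itself.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a separate thread and report whether it completed
/// within `limit`. A hang (or a panic in `work`) yields false.
pub fn finishes_within<F>(work: F, limit: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        work();
        // The receiver may already have given up; ignore a send error.
        let _ = tx.send(());
    });
    // Err covers both a timeout and a sender dropped without sending.
    rx.recv_timeout(limit).is_ok()
}
```

A test would then assert something like
finishes_within(read_back, Duration::from_secs(5)), where read_back is
whatever closure performs the read. Note that a hung worker thread is
not cancelled; it is simply abandoned when the test process exits.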

* fix capacity calculation error in bool encoding

values.len() reports the number of values to be encoded, and so must
be divided by 8 (bits in a byte) to determine the effect on the byte
capacity of the bit_writer.
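
As a rough illustration of the arithmetic, assuming one bit per
boolean: the helper names below are hypothetical, and rounding up to a
whole byte is an assumption of this sketch rather than something the
change states.

```rust
/// Bytes needed to hold `num_values` booleans at one bit each,
/// rounded up to a whole byte (the rounding is this sketch's choice).
pub fn bytes_for_bools(num_values: usize) -> usize {
    (num_values + 7) / 8
}

/// Conservative check: would appending `num_values` booleans exceed
/// `capacity` bytes, given `bytes_written` bytes already in use?
pub fn would_overflow(bytes_written: usize, num_values: usize, capacity: usize) -> bool {
    bytes_written + bytes_for_bools(num_values) > capacity
}
```

Comparing values.len() directly against a byte capacity would overstate
the space needed by a factor of 8.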