PARQUET-1272: Return correct row count for nested columns in ScanFileContents
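Found while adding list columns to `alltypes_sample` in Arrow's `test_parquet.py`. For nested (repeated) columns, ScanFileContents counted every level read as a row; it now counts only levels whose repetition level is 0, since those mark the start of a new row.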
Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>
Closes #457 from xhochy/PARQUET-1272 and squashes the following commits:
45efe1c [Korn, Uwe] PARQUET-1272: Return correct row count for nested columns in ScanFileContents
diff --git a/src/parquet/file_reader.cc b/src/parquet/file_reader.cc
index 983d2d0..0632872 100644
--- a/src/parquet/file_reader.cc
+++ b/src/parquet/file_reader.cc
@@ -347,9 +347,18 @@
int64_t values_read = 0;
while (col_reader->HasNext()) {
- total_rows[col] +=
+ int64_t levels_read =
ScanAllValues(column_batch_size, def_levels.data(), rep_levels.data(),
values.data(), &values_read, col_reader.get());
+ if (col_reader->descr()->max_repetition_level() > 0) {
+ for (int64_t i = 0; i < levels_read; i++) {
+ if (rep_levels[i] == 0) {
+ total_rows[col]++;
+ }
+ }
+ } else {
+ total_rows[col] += levels_read;
+ }
}
col++;
}
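
The counting rule in the hunk above can be sketched in isolation as follows. This is a minimal illustration, not code from parquet-cpp: `CountRows` is a hypothetical helper, and the level buffer is assumed to be a plain `std::vector<int16_t>` holding the repetition levels returned by one batch read.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): returns the number of rows
// represented by the first `levels_read` entries of `rep_levels`.
int64_t CountRows(const std::vector<int16_t>& rep_levels, int64_t levels_read,
                  int16_t max_repetition_level) {
  if (max_repetition_level == 0) {
    // Flat column: every level read corresponds to exactly one row.
    return levels_read;
  }
  // Nested column: the buffer must cover all levels we are asked to inspect.
  assert(levels_read >= 0 &&
         levels_read <= static_cast<int64_t>(rep_levels.size()));
  int64_t rows = 0;
  for (int64_t i = 0; i < levels_read; i++) {
    // A repetition level of 0 marks the start of a new top-level row.
    if (rep_levels[i] == 0) {
      rows++;
    }
  }
  return rows;
}
```

For example, a list column holding the three rows `[1, 2, 3]`, `[4]`, `[5, 6]` has repetition levels `0 1 1 0 0 1`: six levels are read, but only three of them are 0, so the row count is 3 rather than 6.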