commit | 22bebf8278cbbed08f82f1ec664ed4e4577cd175 | |
---|---|---|
author | Yibo Cai <yibo.cai@arm.com> | Thu Apr 15 17:23:46 2021 +0200 |
committer | Antoine Pitrou <antoine@python.org> | Thu Apr 15 17:23:46 2021 +0200 |
tree | d20dc3639e92a0e5a392a9cf5ab02337d1cad20f | |
parent | 2da0a3724a396616d3f1b52f9f28e04a12fb3bfd | |
ARROW-11568: [C++][Compute] Rewrite mode kernel

Arrow mode kernel performance is bad compared with scipy.stats.mode (based on numpy.unique). The Arrow mode kernel stores value:count pairs in a map, while numpy.unique sorts the input array and then counts adjacent equal values. Per my test, the map approach only wins when there are many duplicated values (length / value_range > 100), which does not look very useful in practice.

This patch rewrites the mode kernel to use the sort-and-count approach for floating-point values and for integers with a wide value range. A 2x performance improvement is observed.

Closes #10009 from cyb70289/11568-mode-optimize

Lead-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
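The two strategies contrasted in the commit message can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the patch: the names `ModeBySort` and `ModeByMap` are hypothetical, the input is assumed to be non-empty and free of NaN values, and breaking ties toward the smallest value is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration only; these are not Arrow APIs.
// Both return {mode, count}. Assumptions: non-empty input, no NaN values,
// ties resolved in favour of the smallest value.

// Sort-and-count: sort a copy of the input, then scan runs of equal values.
template <typename T>
std::pair<T, int64_t> ModeBySort(std::vector<T> values) {  // by value: we sort a copy
  std::sort(values.begin(), values.end());
  T best = values.front();
  int64_t best_count = 0;
  for (std::size_t i = 0; i < values.size();) {
    std::size_t j = i;
    // Advance j to the end of the current run of equal values.
    while (j < values.size() && values[j] == values[i]) ++j;
    const int64_t run = static_cast<int64_t>(j - i);
    // Strict '>' keeps the earliest (smallest) value on ties.
    if (run > best_count) {
      best = values[i];
      best_count = run;
    }
    i = j;
  }
  return {best, best_count};
}

// Map-based counting: accumulate value -> count, then pick the maximum.
template <typename T>
std::pair<T, int64_t> ModeByMap(const std::vector<T>& values) {
  std::unordered_map<T, int64_t> counts;
  for (const T& v : values) ++counts[v];
  T best = values.front();
  int64_t best_count = 0;
  for (const auto& kv : counts) {
    // Iteration order is unspecified, so break ties explicitly.
    if (kv.second > best_count ||
        (kv.second == best_count && kv.first < best)) {
      best = kv.first;
      best_count = kv.second;
    }
  }
  return {best, best_count};
}
```

The sorted variant touches memory sequentially and needs no per-element hashing. That fits the commit's observation that the map only pays off when the input has few distinct values relative to its length.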
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.
Major components of the project include:
Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.
The reference Arrow libraries contain many distinct software components:
The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.
Please read our latest project contribution guide.
Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved: