A Python utility for benchmarking and profiling individual MXNet operator execution.
With this utility, for each MXNet operator you can get the following details:
- Timing
- Memory

NOTE: The reported memory is the pool memory; it does not reflect the exact memory requested by the operator.
Benchmarks are usually done end-to-end for a given network architecture, for example ResNet-50 benchmarks on ImageNet data. This is a good measurement of the overall performance and health of a deep learning framework; however, it does not tell you how individual operators perform in isolation. Hence, this utility provides the functionality for users and developers of deep learning frameworks to easily run benchmarks for individual operators.
Provided you have MXNet installed (any version >= 1.5.1), all you need to do to use the opperf utility is add the path to your cloned MXNet repository to the PYTHONPATH.

Note: To install MXNet, refer to the Installing MXNet page.
```
export PYTHONPATH=$PYTHONPATH:/path/to/incubator-mxnet/
```
The command below runs benchmarks for all MXNet (NDArray) operators with default inputs and saves the final result as JSON in the given file.
```
python incubator-mxnet/benchmark/opperf/opperf.py --output-format json --output-file mxnet_operator_benchmark_results.json
```
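Once the run completes, the saved file can be inspected with standard tooling. Below is a minimal sketch, assuming the saved JSON mirrors the {operator: [results, ...]} dictionary structure shown in the sample outputs later in this document:

```python
import json

# Hypothetical inspection script: load the results file produced above and
# print the average forward time recorded for each operator/input combination.
with open("mxnet_operator_benchmark_results.json") as f:
    results = json.load(f)

for op_name, runs in results.items():
    for run in runs:
        # Keys of the form 'avg_time_forward_<op>' are assumed from the
        # sample outputs shown below; adjust if your file differs.
        for key, value in run.items():
            if key.startswith("avg_time_forward"):
                print(f"{op_name}: {key} = {value} ms")
```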
Other Supported Options:

- output-format: json or md for markdown file output.
- ctx: cpu or gpu. By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
- dtype: float32 by default. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.
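All of these options can be combined in a single invocation; for instance (quoting gpu(2) so the shell does not interpret the parentheses):

```
python incubator-mxnet/benchmark/opperf/opperf.py --ctx 'gpu(2)' --dtype float64 --output-format md --output-file mxnet_operator_benchmark_results.md
```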
For example, to run benchmarks for all NDArray broadcast binary operators (e.g., broadcast_add, broadcast_mod, broadcast_pow), run the following Python script:
```python
#!/usr/bin/python
from benchmark.opperf.nd_operations.binary_operators import run_mx_binary_broadcast_operators_benchmarks

# Run all Binary Broadcast operations benchmarks with default input values
print(run_mx_binary_broadcast_operators_benchmarks())
```
Output for the above benchmark run, on a CPU machine, would look something like the following:
```
{'broadcast_mod': [{'avg_time_forward_broadcast_mod': 28.7063, 'avg_time_mem_alloc_cpu/0': 4194.3042, 'avg_time_backward_broadcast_mod': 12.0954, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}},
                   {'avg_time_forward_broadcast_mod': 2.7332, 'avg_time_mem_alloc_cpu/0': 400.0, 'avg_time_backward_broadcast_mod': 1.1288, 'inputs': {'lhs': (10000, 10), 'rhs': (10000, 10)}},
                   {'avg_time_forward_broadcast_mod': 30.5322, 'avg_time_mem_alloc_cpu/0': 4000.0, 'avg_time_backward_broadcast_mod': 225.0255, 'inputs': {'lhs': (10000, 1), 'rhs': (10000, 100)}}],
 'broadcast_power': [{'avg_time_backward_broadcast_power': 49.5871, 'avg_time_forward_broadcast_power': 18.0954, 'avg_time_mem_alloc_cpu/0': 4194.3042, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}},
                     {'avg_time_backward_broadcast_power': 4.6623, 'avg_time_forward_broadcast_power': 1.8283, 'avg_time_mem_alloc_cpu/0': 400.0, 'inputs': {'lhs': (10000, 10), 'rhs': (10000, 10)}},
                     {'avg_time_backward_broadcast_power': 279.922, 'avg_time_forward_broadcast_power': 24.4621, 'avg_time_mem_alloc_cpu/0': 4000.0, 'inputs': {'lhs': (10000, 1), 'rhs': (10000, 100)}}],
 .....
 .....
```
For example, to run benchmarks for the nd.add operator in MXNet, run the following Python script:
```python
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
                               warmup=10, runs=25)
```
Output for the above benchmark run, on a CPU machine, would look something like the following:
```
{'add': [{'avg_time_mem_alloc_cpu/0': 102760.4453, 'avg_time_forward_broadcast_add': 4.0372, 'avg_time_backward_broadcast_add': 5.3841, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}]}
```
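Since run_performance_test returns this dictionary directly, individual metrics can be pulled out in code. A small sketch, assuming the keys shown in the output above:

```python
# Hypothetical follow-up to the run above: read the average forward and
# backward times for the first (and only) input combination.
first_run = add_res['add'][0]
fwd_ms = first_run['avg_time_forward_broadcast_add']
bwd_ms = first_run['avg_time_backward_broadcast_add']
print(f"forward: {fwd_ms} ms, backward: {bwd_ms} ms")
```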
For example, to run benchmarks for the nd.add and nd.subtract operators in MXNet with the same set of inputs, run the following Python script:
```python
#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test([nd.add, nd.subtract], run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
                               warmup=10, runs=25)
```
Output for the above benchmark run, on a CPU machine, would look something like the following:
```
{'add': [{'avg_time_mem_alloc_cpu/0': 102760.4453, 'avg_time_forward_broadcast_add': 4.0372, 'avg_time_backward_broadcast_add': 5.3841, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}],
 'subtract': [{'avg_time_forward_broadcast_sub': 5.5137, 'avg_time_mem_alloc_cpu/0': 207618.0469, 'avg_time_backward_broadcast_sub': 7.2976, 'inputs': {'lhs': (1024, 1024), 'rhs': (1024, 1024)}}]}
```
Under the hood, the utility executes the NDArray operator using randomly generated data and uses the MXNet profiler to get a summary of the operator execution.
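For reference, here is a minimal sketch of driving the MXNet profiler by hand around a single operator call. This illustrates the general approach only; it is not the exact configuration opperf uses internally:

```python
import mxnet as mx
from mxnet import nd

# Record all events and keep aggregate per-operator statistics.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='add_profile.json')

lhs = nd.random.uniform(shape=(1024, 1024))
rhs = nd.random.uniform(shape=(1024, 1024))

mx.profiler.set_state('run')
out = nd.add(lhs, rhs)
out.wait_to_read()              # force the asynchronous execution to finish
mx.profiler.set_state('stop')

# Print the aggregated per-operator summary.
print(mx.profiler.dumps())
```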
See the design proposal document for more details - https://cwiki.apache.org/confluence/display/MXNET/MXNet+Operator+Benchmarks
NOTE: This utility queries the MXNet operator registry to fetch all operators registered with MXNet, generates inputs, and runs the benchmarks. However, fully automated tests are enabled only for simpler operators such as broadcast operators, element-wise operators, etc. For the purpose of readability and giving more control to users, complex operators such as convolution (2D, 3D), pooling, and recurrent operators are not fully automated but expressed as default rules. See utils/op_registry_utils.py for more details.
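As a rough way to see what the operator registry exposes, you can enumerate the auto-generated NDArray operator namespace; note this is only an approximation of what opperf's own registry helpers do:

```python
import mxnet as mx

# The auto-generated NDArray operators live under mx.nd.op; filtering out
# private names gives a rough list of registered operators.
ops = sorted(name for name in dir(mx.nd.op) if not name.startswith('_'))
print(len(ops), "operators registered, e.g.:", ops[:5])
```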
Optionally, you can use the Python time package as the profiler engine to measure the runtime of each operator. To use the Python timer for all operators, pass the argument --profiler 'python':
```
python incubator-mxnet/benchmark/opperf/opperf.py --profiler='python'
```
To use the Python timer for a specific operator, pass the argument profiler to the run_performance_test method:
```python
add_res = run_performance_test([nd.add, nd.subtract], run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
                               warmup=10, runs=25, profiler='python')
```
By default, the MXNet profiler is used as the profiler engine.
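In essence, the Python timer measures wall-clock time around the operator call. Below is a minimal sketch of the idea (not opperf's exact implementation); because MXNet executes asynchronously, nd.waitall() is needed before reading the clock:

```python
import time

from mxnet import nd

lhs = nd.random.uniform(shape=(1024, 1024))
rhs = nd.random.uniform(shape=(1024, 1024))

nd.waitall()                          # drain any pending work first
start = time.perf_counter()
for _ in range(25):
    out = nd.add(lhs, rhs)
nd.waitall()                          # block until the queued ops complete
avg_ms = (time.perf_counter() - start) / 25 * 1000
print(f"avg forward time: {avg_ms:.4f} ms")
```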
All contributions are welcome. Below is the list of desired features: