| <!DOCTYPE html> |
| <html lang="en"><head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <link href="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mxnet-icon.png" rel="icon" type="image/png"><!-- Begin Jekyll SEO tag v2.6.1 --> |
| <title>Deep Learning Programming Paradigm | Apache MXNet</title> |
| <meta name="generator" content="Jekyll v4.0.0" /> |
| <meta property="og:title" content="Deep Learning Programming Paradigm" /> |
| <meta property="og:locale" content="en_US" /> |
| <meta name="description" content="A flexible and efficient library for deep learning." /> |
| <meta property="og:description" content="A flexible and efficient library for deep learning." /> |
| <link rel="canonical" href="https://mxnet.apache.org/versions/master/api/architecture/program_model" /> |
| <meta property="og:url" content="https://mxnet.apache.org/versions/master/api/architecture/program_model" /> |
| <meta property="og:site_name" content="Apache MXNet" /> |
| <script type="application/ld+json"> |
| {"url":"https://mxnet.apache.org/versions/master/api/architecture/program_model","headline":"Deep Learning Programming Paradigm","description":"A flexible and efficient library for deep learning.","@type":"WebPage","@context":"https://schema.org"}</script> |
| <!-- End Jekyll SEO tag --> |
| <script src="https://medium-widget.pixelpoint.io/widget.js"></script> |
| <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" /> |
| <link rel="stylesheet" href="/versions/master/assets/main.css"><link type="application/atom+xml" rel="alternate" href="https://mxnet.apache.org/versions/master/feed.xml" title="Apache MXNet" /><script> |
| if(!(window.doNotTrack === "1" || navigator.doNotTrack === "1" || navigator.doNotTrack === "yes" || navigator.msDoNotTrack === "1")) { |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); |
| |
| ga('create', 'UA-96378503-1', 'auto'); |
| ga('send', 'pageview'); |
| } |
| </script> |
| |
| <script src="/versions/master/assets/js/jquery-3.3.1.min.js"></script><script src="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.js" defer></script> |
| <script src="/versions/master/assets/js/globalSearch.js" defer></script> |
| <script src="/versions/master/assets/js/clipboard.js" defer></script> |
| <script src="/versions/master/assets/js/copycode.js" defer></script></head> |
| <body><header class="site-header" role="banner"> |
| |
| <script> |
| $(document).ready(function () { |
| |
| // HEADER OPACITY LOGIC |
| |
| function opacity_header() { |
| var value = "rgba(4,140,204," + ($(window).scrollTop() / 300 + 0.4) + ")" |
| $('.site-header').css("background-color", value) |
| } |
| |
| $(window).scroll(function () { |
| opacity_header() |
| }) |
| opacity_header(); |
| |
| // MENU SELECTOR LOGIC |
| $('.page-link').each( function () { |
| if (window.location.href.includes(this.href)) { |
| $(this).addClass("page-current"); |
| } |
| }); |
| }) |
| </script> |
| <div class="wrapper"> |
| <a class="site-title" rel="author" href="/versions/master/"><img |
| src="/versions/master/assets/img/mxnet_logo.png" class="site-header-logo"></a> |
| <nav class="site-nav"> |
| <input type="checkbox" id="nav-trigger" class="nav-trigger"/> |
| <label for="nav-trigger"> |
| <span class="menu-icon"> |
| <svg viewBox="0 0 18 15" width="18px" height="15px"> |
| <path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z M18,13.516C18,14.335,17.335,15,16.516,15H1.484 C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/> |
| </svg> |
| </span> |
| </label> |
| <div class="gs-search-border"> |
| <div id="gs-search-icon"></div> |
| <form id="global-search-form"> |
| <input id="global-search" type="text" title="Search" placeholder="Search" /> |
| <div id="global-search-dropdown-container"> |
| <button class="gs-current-version btn" type="button" data-toggle="dropdown"> |
| <span id="gs-current-version-label">master</span> |
| <svg class="gs-dropdown-caret" viewBox="0 0 32 32" class="icon icon-caret-bottom" aria-hidden="true"> |
| <path class="dropdown-caret-path" d="M24 11.305l-7.997 11.39L8 11.305z"></path> |
| </svg> |
| </button> |
| <ul class="gs-opt-group gs-version-dropdown"> |
| |
| |
| <li class="gs-opt gs-versions active">master</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.7.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.6.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.5.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.4.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.3.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.2.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.1.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.0.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">0.12.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">0.11.0</li> |
| |
| |
| </ul> |
| </div> |
| <span id="global-search-close">x</span> |
| </form> |
| </div> |
| <div class="trigger"> |
| <div id="global-search-mobile-border"> |
| <div id="gs-search-icon-mobile"></div> |
| <input id="global-search-mobile" placeholder="Search..." type="text"/> |
| <div id="global-search-dropdown-container-mobile"> |
| <button class="gs-current-version-mobile btn" type="button" data-toggle="dropdown"> |
| <svg class="gs-dropdown-caret" viewBox="0 0 32 32" class="icon icon-caret-bottom" aria-hidden="true"> |
| <path class="dropdown-caret-path" d="M24 11.305l-7.997 11.39L8 11.305z"></path> |
| </svg> |
| </button> |
| <ul class="gs-opt-group gs-version-dropdown-mobile"> |
| |
| |
| <li class="gs-opt gs-versions active">master</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.7.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.6.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.5.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.4.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.3.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.2.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.1.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">1.0.0</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">0.12.1</li> |
| |
| |
| |
| <li class="gs-opt gs-versions">0.11.0</li> |
| |
| |
| </ul> |
| </div> |
| </div> |
| <a class="page-link" href="/versions/master/get_started">Get Started</a> |
| <a class="page-link" href="/versions/master/blog">Blog</a> |
| <a class="page-link" href="/versions/master/features">Features</a> |
| <a class="page-link" href="/versions/master/ecosystem">Ecosystem</a> |
| <a class="page-link" href="/versions/master/api">Docs & Tutorials</a> |
| <a class="page-link" href="https://github.com/apache/incubator-mxnet">GitHub</a> |
| <div class="dropdown"> |
| <span class="dropdown-header">master |
| <svg class="dropdown-caret" viewBox="0 0 32 32" class="icon icon-caret-bottom" aria-hidden="true"><path class="dropdown-caret-path" d="M24 11.305l-7.997 11.39L8 11.305z"></path></svg> |
| </span> |
| <div class="dropdown-content"> |
| |
| |
| <a class="dropdown-option-active" href="/">master</a> |
| |
| |
| |
| <a href="/versions/1.7.0/">1.7.0</a> |
| |
| |
| |
| <a href="/versions/1.6.0/">1.6.0</a> |
| |
| |
| |
| <a href="/versions/1.5.0/">1.5.0</a> |
| |
| |
| |
| <a href="/versions/1.4.1/">1.4.1</a> |
| |
| |
| |
| <a href="/versions/1.3.1/">1.3.1</a> |
| |
| |
| |
| <a href="/versions/1.2.1/">1.2.1</a> |
| |
| |
| |
| <a href="/versions/1.1.0/">1.1.0</a> |
| |
| |
| |
| <a href="/versions/1.0.0/">1.0.0</a> |
| |
| |
| |
| <a href="/versions/0.12.1/">0.12.1</a> |
| |
| |
| |
| <a href="/versions/0.11.0/">0.11.0</a> |
| |
| |
| </div> |
| </div> |
| </div> |
| </nav> |
| </div> |
| </header> |
| <main class="page-content" aria-label="Content"> |
| <article class="post"> |
| |
| <header class="post-header wrapper"> |
| <h1 class="post-title">Deep Learning Programming Paradigm</h1> |
| <h3></h3></header> |
| |
| <div class="post-content"> |
| <div class="wrapper"> |
| <div class="row"> |
| <div class="col-3 docs-side-bar"> |
| <h3 style="text-transform: capitalize; padding-left:10px">architecture</h3> |
| <ul> |
| |
| <li><a href="/versions/master/api/architecture/exception_handling">Exception Handling in MXNet</a></li> |
| <li><a href="/versions/master/api/architecture/note_data_loading">Efficient Data Loaders</a></li> |
| <li><a href="/versions/master/api/architecture/note_engine">Dependency Engine</a></li> |
| <li><a href="/versions/master/api/architecture/note_memory">Memory Consumption</a></li> |
| <li><a href="/versions/master/api/architecture/overview">MXNet System Architecture</a></li> |
| <li><a href="/versions/master/api/architecture/program_model">Deep Learning Programming Paradigm</a></li> |
| </ul> |
| </div> |
| <div class="col-9"> |
| <!--- Licensed to the Apache Software Foundation (ASF) under one --> |
| <!--- or more contributor license agreements. See the NOTICE file --> |
| <!--- distributed with this work for additional information --> |
| <!--- regarding copyright ownership. The ASF licenses this file --> |
| <!--- to you under the Apache License, Version 2.0 (the --> |
| <!--- "License"); you may not use this file except in compliance --> |
| <!--- with the License. You may obtain a copy of the License at --> |
| |
| <!--- http://www.apache.org/licenses/LICENSE-2.0 --> |
| |
| <!--- Unless required by applicable law or agreed to in writing, --> |
| <!--- software distributed under the License is distributed on an --> |
| <!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY --> |
| <!--- KIND, either express or implied. See the License for the --> |
| <!--- specific language governing permissions and limitations --> |
| <!--- under the License. --> |
| |
| <h1 id="deep-learning-programming-paradigm">Deep Learning Programming Paradigm</h1> |
| |
| <p>However much we might ultimately care about performance, |
| we first need working code before we can start worrying about optimization. |
| Writing clear, intuitive deep learning code can be challenging, |
| and the first thing any practitioner must deal with is the language syntax itself. |
Complicating matters, each of the many deep learning libraries out there
takes its own approach to programming style.</p>
| |
| <p>In this document, we focus on two of the most important high-level design decisions:</p> |
| <ol> |
| <li>Whether to embrace the <em>symbolic</em> or <em>imperative</em> paradigm for mathematical computation.</li> |
| <li>Whether to build networks with bigger (more abstract) or more atomic operations.</li> |
| </ol> |
| |
| <p>Throughout, we’ll focus on the programming models themselves. |
| When programming style decisions may impact performance, we point this out, |
| but we don’t dwell on specific implementation details.</p> |
| |
| <h2 id="symbolic-vs-imperative-programs">Symbolic vs. Imperative Programs</h2> |
| |
| <p>If you are a Python or C++ programmer, then you’re already familiar with imperative programs. |
| Imperative-style programs perform computation as you run them. |
| Most code you write in Python is imperative, as is the following NumPy snippet.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> |
| <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> |
| <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span> |
| <span class="n">c</span> <span class="o">=</span> <span class="n">b</span> <span class="o">*</span> <span class="n">a</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> |
| </code></pre></div></div> |
| <p>When the program executes <code class="highlighter-rouge">c = b * a</code>, it runs the actual numerical computation.</p> |
| |
| <p>Symbolic programs are a bit different. With symbolic-style programs, |
| we first define a (potentially complex) function abstractly. |
| When defining the function, no actual numerical computation takes place. |
| We define the abstract function in terms of placeholder values. |
| Then we can compile the function, and evaluate it given real inputs. |
| In the following example, we rewrite the imperative program from above |
| as a symbolic-style program:</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">A</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span> |
| <span class="n">B</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span> |
| <span class="n">C</span> <span class="o">=</span> <span class="n">B</span> <span class="o">*</span> <span class="n">A</span> |
| <span class="n">D</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="n">Constant</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> |
| <span class="c1"># compiles the function |
| </span> <span class="n">f</span> <span class="o">=</span> <span class="nb">compile</span><span class="p">(</span><span class="n">D</span><span class="p">)</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">A</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span> <span class="n">B</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span><span class="p">)</span> |
| </code></pre></div></div> |
| <p>As you can see, in the symbolic version, when <code class="highlighter-rouge">C = B * A</code> is executed, no computation occurs. |
| Instead, this operation generates a <em>computation graph</em> (also called a <em>symbolic graph</em>) |
| that represents the computation. |
| The following figure shows a computation graph to compute <code class="highlighter-rouge">D</code>.</p> |
| |
| <p><img src="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/prog_model/comp_graph.png" alt="Comp Graph" /></p> |
| |
| <p>Most symbolic-style programs contain, either explicitly or implicitly, a <em>compile</em> step. |
| This converts the computation graph into a function that we can later call. |
| In the above example, numerical computation only occurs in the last line of code. |
| The defining characteristic of symbolic programs is their clear separation |
| between building the computation graph and executing it. |
For neural networks, we typically define the entire model as a single computation graph.</p>
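<p>The <code class="highlighter-rouge">Variable</code>/<code class="highlighter-rouge">compile</code> calls above are pseudocode, not a real library API. As a rough illustration of the two-phase design (build the graph, then execute it), here is a minimal runnable sketch in plain Python; <code class="highlighter-rouge">Node</code> and <code class="highlighter-rouge">compile_graph</code> are hypothetical names invented for this example.</p>

```python
import numpy as np

# A minimal sketch of a symbolic engine: composing nodes records a
# computation graph; nothing is evaluated until the compiled function runs.
class Node:
    def __init__(self, op, inputs, name=None, value=None):
        self.op, self.inputs, self.name, self.value = op, inputs, name, value

    def __mul__(self, other):
        return Node('mul', [self, other])

    def __add__(self, other):
        return Node('add', [self, other])

def Variable(name):
    return Node('var', [], name=name)

def Constant(value):
    return Node('const', [], value=value)

def compile_graph(output):
    """Return a callable that evaluates the graph given named inputs."""
    def evaluate(node, feed):
        if node.op == 'var':
            return feed[node.name]
        if node.op == 'const':
            return node.value
        left, right = (evaluate(n, feed) for n in node.inputs)
        return left * right if node.op == 'mul' else left + right
    return lambda **feed: evaluate(output, feed)

A = Variable('A')
B = Variable('B')
D = B * A + Constant(1)
f = compile_graph(D)                      # no numerical work has happened yet
d = f(A=np.ones(10), B=np.ones(10) * 2)   # evaluation happens only here
```

<p>Because the whole graph for <code class="highlighter-rouge">D</code> exists before any number is computed, an executor like this is free to analyze and rewrite it, which is the basis of the optimizations discussed below.</p>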
| |
| <p>Among other popular deep learning libraries, Torch, Chainer, and Minerva embrace the imperative style. |
| Examples of symbolic-style deep learning libraries include Theano, CGT, and TensorFlow. |
| We might also view libraries like CXXNet and Caffe, which rely on configuration files, as symbolic-style libraries. |
| In this interpretation, we’d consider the content of the configuration file as defining the computation graph.</p> |
| |
| <p>Now that you understand the difference between these two programming models, let’s compare the advantages of each.</p> |
| |
| <h3 id="imperative-programs-tend-to-be-more-flexible">Imperative Programs Tend to be More Flexible</h3> |
| |
| <p>When you’re using an imperative-style library from Python, you are writing in Python. |
Nearly anything that is intuitive to write in Python can be accelerated by calling into the imperative deep learning library at the appropriate places.
| On the other hand, when you write a symbolic program, you may not have access to all the familiar Python constructs, like iteration. |
| Consider the following imperative program, and think about how you can translate this into a symbolic program.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">a</span> <span class="o">=</span> <span class="mi">2</span> |
| <span class="n">b</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> |
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">d</span><span class="p">)):</span>
| <span class="n">d</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> |
| </code></pre></div></div> |
| <p>This wouldn’t be so easy if the Python for-loop weren’t supported by the symbolic API. |
| When you write a symbolic program in Python, you’re <em>not</em> writing in Python. |
| Instead, you’re writing in a domain-specific language (DSL) defined by the symbolic API. |
| The symbolic APIs found in deep learning libraries |
| are powerful DSLs that generate callable computation graphs for neural networks. |
| <!-- In that sense, config-file input libraries are all symbolic. --></p> |
| |
| <p>Intuitively, you might say that imperative programs |
| are more <em>native</em> than symbolic programs. |
| It’s easier to use native language features. |
| For example, it’s straightforward to print out the values |
| in the middle of computation or to use native control flow and loops |
| at any point in the flow of computation.</p> |
| |
| <h3 id="symbolic-programs-tend-to-be-more-efficient">Symbolic Programs Tend to be More Efficient</h3> |
| |
| <p>As we’ve seen, imperative programs tend to be flexible |
| and fit nicely into the programming flow of a host language. |
| So you might wonder, why do so many deep learning libraries |
| embrace the symbolic paradigm? |
| The main reason is efficiency, both in terms of memory and speed. |
| Let’s revisit our toy example from before.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> |
| <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> |
| <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span> |
| <span class="n">c</span> <span class="o">=</span> <span class="n">b</span> <span class="o">*</span> <span class="n">a</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> |
| <span class="o">...</span> |
| </code></pre></div></div> |
| |
| <p><img src="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/prog_model/comp_graph.png" alt="Comp Graph" /></p> |
| |
| <p>Assume that each cell in the array occupies 8 bytes of memory. |
| How much memory do you need to execute this program in the Python console?</p> |
| |
<p>As an imperative program, we need to allocate memory at each line,
leaving us with 4 arrays of size 10.
So we’ll need <code class="highlighter-rouge">4 * 10 * 8 = 320</code> bytes.
| On the other hand, if we built a computation graph, |
| and knew in advance that we only needed <code class="highlighter-rouge">d</code>, |
| we could reuse the memory originally allocated for intermediate values. |
| For example, by performing computations in-place, |
| we might recycle the bits allocated for <code class="highlighter-rouge">b</code> to store <code class="highlighter-rouge">c</code>. |
| And we might recycle the bits allocated for <code class="highlighter-rouge">c</code> to store <code class="highlighter-rouge">d</code>. |
| In the end we could cut our memory requirement in half, |
| requiring just <code class="highlighter-rouge">2 * 10 * 8 = 160</code> bytes.</p> |
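<p>This reuse can be demonstrated directly in NumPy by writing results into existing buffers with the <code class="highlighter-rouge">out=</code> argument. Here it is done by hand; a graph executor could apply the same rewrite automatically once it knows only <code class="highlighter-rouge">d</code> is needed.</p>

```python
import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

# Two arrays allocated so far (2 * 10 * 8 = 160 bytes). Reuse b's buffer
# for c, then reuse that same buffer for d -- no further allocation.
c = np.multiply(b, a, out=b)   # c aliases b's storage
d = np.add(c, 1, out=c)        # d aliases the same storage
```

<p>Note that after this rewrite the original values of <code class="highlighter-rouge">b</code> and <code class="highlighter-rouge">c</code> are gone, which is exactly the trade-off described below.</p>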
| |
| <p>Symbolic programs are more <em>restricted</em>. |
| When we call <code class="highlighter-rouge">compile</code> on D, we tell the system |
| that only the value of <code class="highlighter-rouge">d</code> is needed. |
The intermediate values of the computation,
in this case <code class="highlighter-rouge">c</code>, are then invisible to us.</p>
| |
| <p>We benefit because the symbolic programs |
| can then safely reuse the memory for in-place computation. |
| But on the other hand, if we later decide that we need to access <code class="highlighter-rouge">c</code>, we’re out of luck. |
| So imperative programs are better prepared to encounter all possible demands. |
| If we ran the imperative version of the code in a Python console, |
| we could inspect any of the intermediate variables in the future.</p> |
| |
| <!-- Of course, this is somewhat misleading, because garbage collection can occur in imperative programs and memory could then be reused. |
| However, imperative programs do need to be "prepared to encounter all possible demands," and this limits the optimization you can perform. This is true for non-trivial cases, such |
| as gradient calculation, which we discuss in next section. --> |
| |
| <p>Symbolic programs can also perform another kind of optimization, called operation folding. |
| Returning to our toy example, the multiplication and addition operations |
| can be folded into one operation, as shown in the following graph. |
If the computation runs on a GPU,
one kernel will be executed instead of two.
| In fact, this is one way we hand-craft operations |
| in optimized libraries, such as CXXNet and Caffe. |
| Operation folding improves computation efficiency.</p> |
| |
| <p><img src="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/prog_model/comp_graph_fold.png" alt="Comp Graph Folded" /></p> |
| |
<p>Note that you can’t perform operation folding in imperative programs,
| because the intermediate values might be referenced in the future. |
| Operation folding is possible in symbolic programs |
| because you get the entire computation graph, |
| and a clear specification of which values will be needed and which are not.</p> |
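<p>A rough way to see the payoff: the unfused version makes two passes over memory and materializes <code class="highlighter-rouge">c</code>, while a folded multiply-add does everything in one pass. The function <code class="highlighter-rouge">fused_mul_add</code> below is an illustrative name, not a real MXNet operator; on a GPU the fused version would correspond to a single kernel.</p>

```python
import numpy as np

def unfused(a, b):
    c = b * a        # pass 1: allocates and fills an intermediate array
    return c + 1     # pass 2: reads c again to produce the result

def fused_mul_add(a, b):
    # One pass over the data, no intermediate array for c. This is what
    # folding the two graph nodes into a single operation buys us.
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = b[i] * a[i] + 1
    return out
```

<p>Both produce identical results; the difference is purely in memory traffic and the number of kernel launches.</p>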
| |
| <h3 id="case-study-backprop-and-autodiff">Case Study: Backprop and AutoDiff</h3> |
| |
| <p>In this section, we compare the two programming models |
on the problem of automatic differentiation, or backpropagation.
| Differentiation is of vital importance in deep learning |
| because it’s the mechanism by which we train our models. |
| In any deep learning model, we define a <em>loss function</em>. |
| A <em>loss function</em> measures how far the model is from the desired output. |
| We then typically pass over training examples (pairs of inputs and ground-truth outputs). |
| At each step we update the model’s <em>parameters</em> to minimize the loss. |
| To determine the direction in which to update the parameters, |
| we need to take the derivative of the loss function with respect to the parameters.</p> |
| |
| <p>In the past, whenever someone defined a new model, |
| they had to work out the derivative calculations by hand. |
| While the math is reasonably straightforward, |
| for complex models, it can be time-consuming and tedious work. |
| All modern deep learning libraries make the practitioner/researcher’s job |
| much easier, by automatically solving the problem of gradient calculation.</p> |
| |
| <p>Both imperative and symbolic programs can perform gradient calculation. |
| So let’s take a look at how you might perform automatic differentiation with each.</p> |
| |
| <p>Let’s start with imperative programs. |
| The following example Python code performs automatic differentiation using our toy example:</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">array</span><span class="p">(</span><span class="nb">object</span><span class="p">)</span> <span class="p">:</span> |
| <span class="s">"""Simple Array object that support autodiff."""</span> |
| <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span> |
| <span class="bp">self</span><span class="o">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span> |
| <span class="k">if</span> <span class="n">name</span><span class="p">:</span> |
| <span class="bp">self</span><span class="o">.</span><span class="n">grad</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">g</span> <span class="p">:</span> <span class="p">{</span><span class="n">name</span> <span class="p">:</span> <span class="n">g</span><span class="p">}</span> |
| |
| <span class="k">def</span> <span class="nf">__add__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span> |
| <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span> |
| <span class="n">ret</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">value</span> <span class="o">+</span> <span class="n">other</span><span class="p">)</span> |
| <span class="n">ret</span><span class="o">.</span><span class="n">grad</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">g</span> <span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">g</span><span class="p">)</span> |
| <span class="k">return</span> <span class="n">ret</span> |
| |
| <span class="k">def</span> <span class="nf">__mul__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span> |
| <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">array</span><span class="p">)</span> |
| <span class="n">ret</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">value</span> <span class="o">*</span> <span class="n">other</span><span class="o">.</span><span class="n">value</span><span class="p">)</span> |
| <span class="k">def</span> <span class="nf">grad</span><span class="p">(</span><span class="n">g</span><span class="p">):</span> |
| <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">g</span> <span class="o">*</span> <span class="n">other</span><span class="o">.</span><span class="n">value</span><span class="p">)</span> |
| <span class="n">x</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">other</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">g</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">value</span><span class="p">))</span> |
| <span class="k">return</span> <span class="n">x</span> |
| <span class="n">ret</span><span class="o">.</span><span class="n">grad</span> <span class="o">=</span> <span class="n">grad</span> |
| <span class="k">return</span> <span class="n">ret</span> |
| |
| <span class="c1"># some examples |
| </span> <span class="n">a</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> |
| <span class="n">b</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'b'</span><span class="p">)</span> |
| <span class="n">c</span> <span class="o">=</span> <span class="n">b</span> <span class="o">*</span> <span class="n">a</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> |
| <span class="k">print</span> <span class="n">d</span><span class="o">.</span><span class="n">value</span> |
| <span class="k">print</span> <span class="n">d</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> |
| <span class="c1"># Results |
| </span> <span class="c1"># 3 |
| </span> <span class="c1"># {'a': 2, 'b': 1} |
| </span></code></pre></div></div> |
| |
<p>In this code, each array object contains a grad function (which is actually a closure).
When you run <code class="highlighter-rouge">d.grad</code>, it recursively invokes the grad functions of its inputs,
backpropagates the gradient values, and
returns the gradient value of each input.</p>
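<p>The snippet above is a fragment; the following is a self-contained Python 3 sketch of the same closure-based autodiff, runnable end to end (the <code class="highlighter-rouge">array</code> class and the dict-of-gradients convention follow the pseudocode above):</p>

```python
# A minimal, runnable Python 3 version of the closure-based autodiff
# sketched above; `array` and the dict-of-gradients return format mirror
# the article's pseudocode.

class array:
    def __init__(self, value, name=None):
        self.value = value
        if name:
            # A leaf node's grad returns {name: incoming gradient}.
            self.grad = lambda g: {name: g}
        else:
            self.grad = lambda g: {}

    def __add__(self, other):
        # Adding a plain number: the gradient flows through unchanged.
        assert isinstance(other, (int, float))
        ret = array(self.value + other)
        ret.grad = lambda g: self.grad(g)
        return ret

    def __mul__(self, other):
        assert isinstance(other, array)
        ret = array(self.value * other.value)
        def grad(g):
            # Product rule: route g * other.value to self,
            # and g * self.value to other.
            x = self.grad(g * other.value)
            x.update(other.grad(g * self.value))
            return x
        ret.grad = grad
        return ret

a = array(1, 'a')
b = array(2, 'b')
c = b * a
d = c + 1
print(d.value)    # 3
print(d.grad(1))  # {'b': 1, 'a': 2}
```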
| |
| <p>This might look a bit complicated, so let’s consider |
| the gradient calculation for symbolic programs. |
| The following program performs symbolic gradient calculation for the same task.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">A</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span> |
| <span class="n">B</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span> |
| <span class="n">C</span> <span class="o">=</span> <span class="n">B</span> <span class="o">*</span> <span class="n">A</span> |
| <span class="n">D</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="n">Constant</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> |
| <span class="c1"># get gradient node. |
| </span> <span class="n">gA</span><span class="p">,</span> <span class="n">gB</span> <span class="o">=</span> <span class="n">D</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">wrt</span><span class="o">=</span><span class="p">[</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">])</span> |
| <span class="c1"># compiles the gradient function. |
| </span> <span class="n">f</span> <span class="o">=</span> <span class="nb">compile</span><span class="p">([</span><span class="n">gA</span><span class="p">,</span> <span class="n">gB</span><span class="p">])</span> |
| <span class="n">grad_a</span><span class="p">,</span> <span class="n">grad_b</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">A</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span> <span class="n">B</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span><span class="p">)</span> |
| </code></pre></div></div> |
| |
<p>The grad function of <code class="highlighter-rouge">D</code> generates a backward computation graph
and returns the gradient nodes <code class="highlighter-rouge">gA</code> and <code class="highlighter-rouge">gB</code>,
which correspond to the red nodes in the following figure.</p>
| |
<p><img src="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/prog_model/comp_graph_backward.png" alt="Comp Graph Backward" /></p>
| |
<p>The imperative program actually does the same thing as the symbolic program.
It implicitly saves a backward computation graph in the grad closure.
When you invoke <code class="highlighter-rouge">d.grad</code>, you start from <code class="highlighter-rouge">d(D)</code>,
backtrack through the graph to compute the gradient, and collect the results.</p>
| |
| <p>The gradient calculations in both symbolic |
| and imperative programming follow the same pattern. |
| What’s the difference then? |
| Recall the <em>be prepared to encounter all possible demands</em> requirement of imperative programs. |
| If you are creating an array library that supports automatic differentiation, |
| you have to keep the grad closure along with the computation. |
This means that none of the history variables can be
garbage-collected, because they are referenced by the variable <code class="highlighter-rouge">d</code> by way of the function closure.</p>
| |
| <p>What if you want to compute only the value of <code class="highlighter-rouge">d</code>, |
| and don’t want the gradient value? |
In symbolic programming, you declare this with <code class="highlighter-rouge">f=compile([D])</code>.
| This also declares the boundary of computation, |
| telling the system that you want to compute only the forward pass. |
| As a result, the system can free the memory of previous results, |
| and share the memory between inputs and outputs.</p> |
| |
<p>Imagine running a deep neural network with <code class="highlighter-rouge">n</code> layers.
If you are running only the forward pass,
not the backward (gradient) pass,
you need to allocate only two copies of
temporary space to store the values of the intermediate layers,
instead of <code class="highlighter-rouge">n</code> copies of them.
However, because imperative programs need to be prepared
to encounter all possible demands of getting the gradient,
they have to store the intermediate values,
which requires <code class="highlighter-rouge">n</code> copies of temporary space.</p>
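<p>The two-copies-versus-<code class="highlighter-rouge">n</code>-copies point can be sketched concretely. Assuming simple element-wise layers (hypothetical stand-ins, not a real library API), a forward-only pass can ping-pong between two buffers, while a training pass must retain every intermediate:</p>

```python
# Sketch: forward-only inference can ping-pong between two buffers,
# while training must keep every layer's output for the backward pass.
# The layer functions here are hypothetical stand-ins.

def forward_only(x, layers):
    # Two buffers suffice: each layer reads one and writes the other.
    buffers = [list(x), [0.0] * len(x)]
    src = 0
    for layer in layers:
        dst = 1 - src
        for i, v in enumerate(buffers[src]):
            buffers[dst][i] = layer(v)
        src = dst
    return buffers[src]

def forward_for_training(x, layers):
    # Keep all n intermediate outputs; backprop will need them.
    saved = [list(x)]
    for layer in layers:
        saved.append([layer(v) for v in saved[-1]])
    return saved  # n + 1 activation buffers live at once

layers = [lambda v: v * 2, lambda v: v + 1, lambda v: v * v]
print(forward_only([1.0, 2.0], layers))  # [9.0, 25.0]
```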
| |
| <p>As you can see, the level of optimization depends |
| on the restrictions on what you can do. |
| Symbolic programs ask you to clearly specify |
| these restrictions when you compile the graph. |
On the other hand, imperative programs
| must be prepared for a wider range of demands. |
| Symbolic programs have a natural advantage |
| because they know more about what you do and don’t want.</p> |
| |
| <p>There are ways in which we can modify imperative programs |
| to incorporate similar restrictions. |
| For example, one solution to the preceding |
| problem is to introduce a context variable. |
| You can introduce a no-gradient context variable |
| to turn gradient calculation off.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">with</span> <span class="n">context</span><span class="o">.</span><span class="n">NoGradient</span><span class="p">():</span> |
| <span class="n">a</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> |
| <span class="n">b</span> <span class="o">=</span> <span class="n">array</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'b'</span><span class="p">)</span> |
| <span class="n">c</span> <span class="o">=</span> <span class="n">b</span> <span class="o">*</span> <span class="n">a</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> |
| </code></pre></div></div> |
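<p>One plausible way to implement such a context (the <code class="highlighter-rouge">context.NoGradient</code> name above is hypothetical; this sketch uses a module-level flag that the array constructor and operators consult):</p>

```python
# Sketch of a no-gradient context: a module-level flag that, when off,
# prevents grad closures from being attached, so the history can be
# garbage-collected. `NoGradient` mirrors the hypothetical name above.
from contextlib import contextmanager

_record_grad = True

@contextmanager
def NoGradient():
    global _record_grad
    prev, _record_grad = _record_grad, False
    try:
        yield
    finally:
        _record_grad = prev

class array:
    def __init__(self, value, name=None):
        self.value = value
        # Only attach a grad closure when recording is on.
        self.grad = (lambda g: {name: g}) if (_record_grad and name) else None

    def __mul__(self, other):
        ret = array(self.value * other.value)
        if _record_grad:
            ret.grad = lambda g: {**self.grad(g * other.value),
                                  **other.grad(g * self.value)}
        return ret

with NoGradient():
    a = array(1, 'a')
    b = array(2, 'b')
    c = b * a
print(c.value, c.grad)  # 2 None
```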
| |
| <!-- This provides an imperative program with the ability to impose some restrictions, but reduces efficiency. --> |
| |
<p>However, the program still must be prepared to encounter all possible demands,
which means that you can’t perform in-place calculation
to reuse memory in the forward pass (a trick commonly used to reduce GPU memory usage).
The techniques we’ve discussed generate an explicit backward pass.
Some libraries, such as Caffe and CXXNet, perform backprop implicitly on the same graph;
the approach we’ve discussed in this section applies to them as well.</p>
| |
<p>Most configuration-file-based libraries,
such as CXXNet and Caffe, are designed
to meet one or two generic requirements:
get the activation of each layer,
or get the gradient of all of the weights.
These libraries share the same problem:
the more generic operations the library has to support,
the less optimization (such as memory sharing) you can do
on the same data structure.</p>
| |
| <p>As you can see, the trade-off between restriction |
| and flexibility is the same for most cases.</p> |
| |
| <h3 id="model-checkpoint">Model Checkpoint</h3> |
| |
<p>It’s important to be able to save a model and load it back later.
| There are different ways to <em>save</em> your work. |
| Normally, to save a neural network, |
| you need to save two things: a net configuration |
| for the structure of the neural network and the weights of the neural network.</p> |
| |
| <p>The ability to check the configuration is a plus for symbolic programs. |
| Because the symbolic construction phase does not perform computation, |
| you can directly serialize the computation graph, and load it back later. |
| This solves the problem of saving the configuration |
| without introducing an additional layer.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">A</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span> |
| <span class="n">B</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span> |
| <span class="n">C</span> <span class="o">=</span> <span class="n">B</span> <span class="o">*</span> <span class="n">A</span> |
| <span class="n">D</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="n">Constant</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> |
| <span class="n">D</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'mygraph'</span><span class="p">)</span> |
| <span class="o">...</span> |
| <span class="n">D2</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="s">'mygraph'</span><span class="p">)</span> |
| <span class="n">f</span> <span class="o">=</span> <span class="nb">compile</span><span class="p">([</span><span class="n">D2</span><span class="p">])</span> |
| <span class="c1"># more operations |
| </span> <span class="o">...</span> |
| </code></pre></div></div> |
| |
<p>Because an imperative program executes as it describes the computation,
you have to save the code itself as the <code class="highlighter-rouge">configuration</code>,
or build another configuration layer on top of the imperative language.</p>
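<p>To make the symbolic advantage concrete, here is a minimal sketch (not the actual MXNet serialization format) in which the graph is plain data, so saving and loading reduce to JSON round-tripping, and a tiny interpreter stands in for <code class="highlighter-rouge">compile</code>:</p>

```python
# Sketch: a symbolic graph expressed as plain data round-trips through
# JSON, so the graph *is* its own configuration. The node encoding and
# the tiny interpreter are invented for illustration.
import json

# Build D = (B * A) + 1 as plain data.
A = {'op': 'var', 'name': 'A'}
B = {'op': 'var', 'name': 'B'}
C = {'op': 'mul', 'args': [B, A]}
D = {'op': 'add', 'args': [C, {'op': 'const', 'value': 1}]}

text = json.dumps(D)   # plays the role of D.save('mygraph')
D2 = json.loads(text)  # plays the role of load('mygraph')

def evaluate(node, env):
    # A tiny interpreter standing in for compile([D2]).
    if node['op'] == 'var':
        return env[node['name']]
    if node['op'] == 'const':
        return node['value']
    x, y = (evaluate(a, env) for a in node['args'])
    return x * y if node['op'] == 'mul' else x + y

print(evaluate(D2, {'A': 3, 'B': 2}))  # 7
```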
| |
| <h3 id="parameter-updates">Parameter Updates</h3> |
| |
| <p>Most symbolic programs are data flow (computation) graphs. |
| Data flow graphs describe computation. |
| But it’s not obvious how to use graphs to describe parameter updates. |
| That’s because parameter updates introduce mutation, |
| which is not a data flow concept. |
| Most symbolic programs introduce a special update statement |
| to update persistent state in the programs.</p> |
| |
| <p>It’s usually easier to write parameter updates in an imperative style, |
| especially when you need multiple updates that relate to each other. |
| For symbolic programs, the update statement is also executed as you call it. |
| So in that sense, most symbolic deep learning libraries |
| fall back on the imperative approach to perform updates, |
| while using the symbolic approach to perform gradient calculation.</p> |
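<p>A minimal sketch of this division of labor, assuming a hypothetical <code class="highlighter-rouge">grad_fn</code> that stands in for a symbolically compiled gradient: the momentum update below involves two mutations that depend on each other, which is natural to write imperatively:</p>

```python
# Sketch: gradients come from a (here, hand-written) gradient function,
# while the update rule itself is ordinary imperative code. Momentum is
# a typical example of updates that relate to each other.

def grad_fn(w):
    # Stand-in for a symbolically compiled gradient of f(w) = w**2.
    return 2.0 * w

def sgd_momentum_step(w, velocity, lr=0.1, mu=0.9):
    # Two mutations that depend on each other: easy to state
    # imperatively, awkward as a pure data-flow graph.
    velocity = mu * velocity - lr * grad_fn(w)
    w = w + velocity
    return w, velocity

w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, v)
# w oscillates toward the minimum of f at 0
```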
| |
| <h3 id="there-is-no-strict-boundary">There Is No Strict Boundary</h3> |
| |
| <p>In comparing the two programming styles, |
| some of our arguments might not be strictly true, |
| i.e., it’s possible to make an imperative program |
| more like a traditional symbolic program or vice versa. |
| However, the two archetypes are useful abstractions, |
| especially for understanding the differences between deep learning libraries. |
| We might reasonably conclude that there is no clear boundary between programming styles. |
| For example, you can create a just-in-time (JIT) compiler in Python |
| to compile imperative Python programs, |
| which provides some of the advantages of global |
| information held in symbolic programs.</p> |
| |
| <h2 id="big-vs-small-operations">Big vs. Small Operations</h2> |
| |
| <p>When designing a deep learning library, another important programming model decision |
| is precisely what operations to support. |
| In general, there are two families of operations supported by most deep learning libraries:</p> |
| |
| <ul> |
| <li>Big operations - typically for computing neural network layers (e.g. FullyConnected and BatchNormalize).</li> |
| <li>Small operations - mathematical functions like matrix multiplication and element-wise addition.</li> |
| </ul> |
| |
| <p>Libraries like CXXNet and Caffe support layer-level operations. |
| Libraries like Theano and Minerva support fine-grained operations.</p> |
| |
| <h3 id="smaller-operations-can-be-more-flexible">Smaller Operations Can Be More Flexible</h3> |
| <p>It’s quite natural to use smaller operations to compose bigger operations. |
For example, the sigmoid unit can simply be composed from division, addition, and exponentiation:</p>
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span> |
| </code></pre></div></div> |
| <p>Using smaller operations as building blocks, you can express nearly anything you want. |
| If you’re more familiar with CXXNet- or Caffe-style layers, |
| note that these operations don’t differ from a layer, except that they are smaller.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">SigmoidLayer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="n">EWiseDivisionLayer</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">AddScalarLayer</span><span class="p">(</span><span class="n">ExpLayer</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">),</span> <span class="mf">1.0</span><span class="p">))</span> |
| </code></pre></div></div> |
| <p>This expression composes three layers, |
| with each defining its forward and backward (gradient) function. |
| Using smaller operations gives you the advantage of building new layers quickly, |
| because you only need to compose the components.</p> |
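<p>A runnable sketch of this composition (the layer functions are hypothetical small ops, not a real library API): each small op returns its value together with a backward closure, and chaining the closures yields the sigmoid gradient for free:</p>

```python
# Sketch: each small "layer" defines forward and backward, and composing
# them yields both sigmoid and its gradient. The functions stand in for
# the hypothetical ExpLayer/AddScalarLayer/EWiseDivisionLayer above.
import math

def exp_layer(x):
    y = math.exp(x)
    return y, lambda g: g * y            # d/dx exp(x) = exp(x)

def add_scalar_layer(x, s):
    return x + s, lambda g: g            # gradient passes through

def recip_layer(x):                      # the 1.0 / x division layer
    return 1.0 / x, lambda g: -g / (x * x)

def sigmoid(x):
    # SigmoidLayer(x) = 1 / (1 + exp(-x)), composed from small ops.
    e, de = exp_layer(-x)
    s, ds = add_scalar_layer(e, 1.0)
    y, dy = recip_layer(s)
    # Chain the backward closures (the extra minus comes from the -x input).
    backward = lambda g: -de(ds(dy(g)))
    return y, backward

y, backward = sigmoid(0.0)
print(y, backward(1.0))  # 0.5 0.25, since sigmoid'(0) = 0.25
```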
| |
| <h3 id="big-operations-are-more-efficient">Big Operations Are More Efficient</h3> |
<p>Directly composing the sigmoid this way requires three layer operations, instead of one.</p>
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">SigmoidLayer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="n">EWiseDivisionLayer</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">AddScalarLayer</span><span class="p">(</span><span class="n">ExpLayer</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">),</span> <span class="mf">1.0</span><span class="p">))</span> |
| </code></pre></div></div> |
| <p>This code creates overhead for computation and memory (which could be optimized, with cost).</p> |
| |
<p>Libraries like CXXNet and Caffe take a different approach.
To directly support coarse-grained operations,
such as BatchNormalization and the SigmoidLayer,
each layer’s calculation kernel is hand-crafted,
requiring only one or a few CUDA kernel launches.
This makes these implementations more efficient.</p>
| |
| <h3 id="compilation-and-optimization">Compilation and Optimization</h3> |
| |
| <p>Can small operations be optimized? Of course, they can. |
| Let’s look at the system optimization part of the compilation engine. |
| Two types of optimization can be performed on the computation graph:</p> |
| |
| <ul> |
| <li>Memory allocation optimization, to reuse the memory of the intermediate computations.</li> |
| <li>Operator fusion, to detect sub-graph patterns, such as the sigmoid, and fuse them into a bigger operation kernel.</li> |
| </ul> |
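<p>A toy sketch of operator fusion (the graph encoding and op names are invented for illustration): a pass scans the graph for the sigmoid sub-graph pattern and replaces it with a single fused node:</p>

```python
# Sketch: a fusion pass over a graph expressed as nested tuples.
# It matches div(1.0, add(exp(neg(E)), 1.0)) — the sigmoid pattern —
# and rewrites it to a single fused op node.

def match_sigmoid(node):
    # Returns the inner expression E if the sigmoid pattern matches.
    if (isinstance(node, tuple) and node[0] == 'div'
            and node[1] == ('const', 1.0)
            and isinstance(node[2], tuple) and node[2][0] == 'add'
            and node[2][2] == ('const', 1.0)
            and isinstance(node[2][1], tuple) and node[2][1][0] == 'exp'
            and isinstance(node[2][1][1], tuple) and node[2][1][1][0] == 'neg'):
        return node[2][1][1][1]
    return None

def fuse(node):
    inner = match_sigmoid(node) if isinstance(node, tuple) else None
    if inner is not None:
        return ('fused_sigmoid', fuse(inner))
    if isinstance(node, tuple):
        return (node[0],) + tuple(fuse(c) for c in node[1:])
    return node

graph = ('div', ('const', 1.0),
         ('add', ('exp', ('neg', ('var', 'x'))), ('const', 1.0)))
print(fuse(graph))  # ('fused_sigmoid', ('var', 'x'))
```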
| |
<p>Memory allocation optimization isn’t restricted to graphs of small operations.
You can use it with graphs of bigger operations, too.
However, such optimization matters less
for bigger-operation libraries like CXXNet and Caffe,
because you won’t find an explicit compilation step in them.
There is, however, a (dumb) <code class="highlighter-rouge">compilation step</code> in these libraries
that basically translates the layers into a fixed forward and
backprop execution plan, running each operation one by one.</p>
| |
| <p>For computation graphs with smaller operations, |
| these optimizations are crucial to performance. |
| Because the operations are small, |
| there are many sub-graph patterns that can be matched. |
| Also, because the final, generated operations |
| might not be enumerable, |
| an explicit recompilation of the kernels is required, |
| as opposed to the fixed amount of precompiled kernels |
| in the big operation libraries. |
| This creates compilation overhead for the symbolic libraries |
| that support small operations. |
| Requiring compilation optimization also creates engineering overhead |
| for the libraries that solely support smaller operations.</p> |
| |
<p>As in the case of symbolic vs. imperative programs,
the bigger-operation libraries “cheat”
by asking you to provide restrictions (in the form of common layers),
so that you perform the sub-graph matching yourself.
This moves the compilation overhead to the human brain, which is usually not too bad.</p>
| |
| <h3 id="expression-template-and-statically-typed-language">Expression Template and Statically Typed Language</h3> |
<p>There is always a need to write small operations and compose them.
Libraries like Caffe use hand-crafted kernels to build these bigger blocks;
otherwise, you would have to compose smaller operations using Python.</p>
| |
| <p>There’s a third choice that works pretty well. |
| This is called the expression template. |
| Basically, you use template programming to |
| generate generic kernels from an expression tree at compile time. |
| For details, see <a href="https://github.com/dmlc/mshadow/blob/master/guide/exp-template/README.md">Expression Template Tutorial</a>. |
CXXNet makes extensive use of expression templates,
which enables much shorter, more readable code that matches
the performance of hand-crafted kernels.</p>
| |
<p>The difference between using expression templates and Python kernel generation
is that expression evaluation happens at compile time for C++, with existing types,
so there is no additional runtime overhead.
In principle, this is also possible with other statically typed languages that support templates,
but we’ve seen this trick used only in C++.</p>
| |
| <p>Expression template libraries create a middle ground between Python operations |
| and hand-crafted big kernels by allowing C++ users to craft efficient big |
| operations by composing smaller operations. It’s an option worth considering.</p> |
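<p>Expression templates are a C++ technique, but the core idea can be sketched in Python: operators build an expression tree, and assignment evaluates the whole tree in one fused loop per element, so no temporary arrays are allocated (real expression templates resolve all of this at C++ compile time, with no runtime dispatch):</p>

```python
# Python analogy for the expression-template idea: operators build an
# expression tree, and evaluation walks the tree once per element, so
# B + C * C allocates no intermediate vector for C * C.

class Exp:
    def __add__(self, other): return BinOp(self, other, lambda a, b: a + b)
    def __mul__(self, other): return BinOp(self, other, lambda a, b: a * b)

class BinOp(Exp):
    def __init__(self, lhs, rhs, fn):
        self.lhs, self.rhs, self.fn = lhs, rhs, fn
    def eval(self, i):
        return self.fn(self.lhs.eval(i), self.rhs.eval(i))

class Vec(Exp):
    def __init__(self, data): self.data = data
    def eval(self, i): return self.data[i]
    def assign(self, expr):
        # One fused loop; no temporary vector is ever materialized.
        for i in range(len(self.data)):
            self.data[i] = expr.eval(i)
        return self

A, B, C = Vec([0, 0, 0]), Vec([1, 2, 3]), Vec([4, 5, 6])
A.assign(B + C * C)
print(A.data)  # [17, 27, 39]
```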
| |
| <h2 id="mix-the-approaches">Mix the Approaches</h2> |
| |
| <p>Now that we’ve compared the programming models, which one should you choose? |
| Before delving into that, we should emphasize that depending on the problems you’re trying to solve, |
| our comparison might not necessarily have a big impact.</p> |
| |
| <p>Remember <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s law</a>: |
| If you are optimizing a non-performance-critical part of your problem, |
| you won’t get much of a performance gain.</p> |
| |
| <p>As you’ve seen, there usually is a trade-off between efficiency, |
| flexibility, and engineering complexity. |
| The more suitable programming style depends on the problem you are trying to solve. |
| For example, imperative programs are better for parameter updates, |
| and symbolic programs for gradient calculation.</p> |
| |
| <p>We advocate <em>mixing</em> the approaches. |
| Sometimes the part that we want to be flexible |
| isn’t crucial to performance. |
| In these cases, it’s okay to leave some efficiency on the table |
| to support more flexible interfaces. |
| In machine learning, combining methods usually works better than using just one.</p> |
| |
| <p>If you can combine the programming models correctly, |
| you can get better results than when using a single programming model. |
| In this section, we discuss how to do so.</p> |
| |
| <h3 id="symbolic-and-imperative-programs">Symbolic and Imperative Programs</h3> |
| <p>There are two ways to mix symbolic and imperative programs:</p> |
| |
| <ul> |
| <li>Use imperative programs within symbolic programs as callbacks</li> |
| <li>Use symbolic programs as part of imperative programs</li> |
| </ul> |
| |
| <p>We’ve observed that it’s usually helpful to write parameter updates imperatively, |
| and perform gradient calculations in symbolic programs.</p> |
| |
| <p>Symbolic libraries already mix programs because Python itself is imperative. |
| For example, the following program mixes the symbolic approach with NumPy, which is imperative.</p> |
| |
| <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">A</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'A'</span><span class="p">)</span> |
| <span class="n">B</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span> |
| <span class="n">C</span> <span class="o">=</span> <span class="n">B</span> <span class="o">*</span> <span class="n">A</span> |
| <span class="n">D</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="n">Constant</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> |
| <span class="c1"># compiles the function |
| </span> <span class="n">f</span> <span class="o">=</span> <span class="nb">compile</span><span class="p">(</span><span class="n">D</span><span class="p">)</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">A</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span> <span class="n">B</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span><span class="p">)</span> |
| <span class="n">d</span> <span class="o">=</span> <span class="n">d</span> <span class="o">+</span> <span class="mf">1.0</span> |
| </code></pre></div></div> |
| <p>The symbolic graphs are compiled into a function that can be executed imperatively. |
| The internals are a black box to the user. |
| This is exactly like writing C++ programs and exposing them to Python, which we commonly do.</p> |
| |
<p>Because parameter memory resides on the GPU,
you might not want to use NumPy as the imperative component.
A better choice might be to support a GPU-compatible imperative library
that interacts with symbolically compiled functions,
or to provide a limited amount of update syntax
within the update statement of symbolic program execution.</p>
| |
| <h3 id="small-and-big-operations">Small and Big Operations</h3> |
| |
| <p>There might be a good reason to combine small and big operations. |
| Consider applications that perform tasks such as changing |
| a loss function or adding a few customized layers to an existing structure. |
| Usually, you can use big operations to compose existing |
| components, and use smaller operations to build the new parts.</p> |
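<p>A minimal sketch of this mix (all names and shapes here are invented for illustration): one “big” fused layer stands in for the library-optimized part, while a custom loss is composed by hand from small operations:</p>

```python
# Sketch: a hypothetical model mixing one "big" op (the dense layer,
# standing in for a library-optimized fused kernel) with small ops
# composed by hand for a custom, non-performance-critical loss.

def fully_connected(x, w, b):
    # Stand-in for a big, library-optimized layer: one fused kernel.
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(w, b)]

def custom_loss(pred, target):
    # New research code built from small ops: mean squared error.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

x = [1.0, 2.0]
w = [[0.5, 0.5], [1.0, -1.0]]   # weights for two output units
b = [0.0, 1.0]
pred = fully_connected(x, w, b)
print(pred, custom_loss(pred, [1.0, 0.0]))  # [1.5, 0.0] 0.125
```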
| |
<p>Recall Amdahl’s law. Often, the new components
are not the cause of the computation bottleneck.
Because the performance-critical part is already optimized by
the bigger operations, it’s okay to forgo optimizing the additional small operations,
doing at most a limited amount of memory optimization instead
of full operation fusion, and simply running them directly.</p>
| |
| <h3 id="choose-your-own-approach">Choose Your Own Approach</h3> |
| |
| <p>In this document, we compared multiple approaches |
| to developing programming environments for deep learning. |
| We compared both the usability and efficiency implications of each, |
finding that many of these trade-offs (like imperative vs. symbolic) aren’t necessarily black and white.
| You can choose your approach, or combine the approaches |
| to create more interesting and intelligent deep learning libraries.</p> |
| |
| <h2 id="contribute-to-mxnet">Contribute to MXNet</h2> |
| |
| <p>This document is part of our effort to provide <a href="overview">open-source system design notes</a> |
| for deep learning libraries. If you’re interested in contributing to <em>MXNet</em> or its |
| documentation, <a href="http://github.com/apache/incubator-mxnet">fork us on GitHub</a>.</p> |
| |
| <h2 id="next-steps">Next Steps</h2> |
| |
| <ul> |
| <li><a href="note_engine">Dependency Engine for Deep Learning</a></li> |
| <li><a href="note_memory">Squeeze the Memory Consumption of Deep Learning</a></li> |
| <li><a href="note_data_loading">Efficient Data Loading Module for Deep Learning</a></li> |
| </ul> |
| |
| </div> |
| </div> |
| |
| </div> |
| </div> |
| |
| </article> |
| |
| </main><footer class="site-footer h-card"> |
| <div class="wrapper"> |
| <div class="row"> |
| <div class="col-4"> |
| <h4 class="footer-category-title">Resources</h4> |
| <ul class="contact-list"> |
| <li><a href="/versions/master/community#stay-connected">Mailing lists</a></li> |
| <li><a href="https://discuss.mxnet.io">MXNet Discuss forum</a></li> |
| <li><a href="/versions/master/community#github-issues">Github Issues</a></li> |
| <li><a href="https://github.com/apache/incubator-mxnet/projects">Projects</a></li> |
| <li><a href="https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+Home">Developer Wiki</a></li> |
| <li><a href="/versions/master/community">Contribute To MXNet</a></li> |
| |
| </ul> |
| </div> |
| |
| <div class="col-4"><ul class="social-media-list"><li><a href="https://github.com/apache/incubator-mxnet"><svg class="svg-icon"><use xlink:href="/versions/master/assets/minima-social-icons.svg#github"></use></svg> <span class="username">apache/incubator-mxnet</span></a></li><li><a href="https://www.twitter.com/apachemxnet"><svg class="svg-icon"><use xlink:href="/versions/master/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">apachemxnet</span></a></li><li><a href="https://youtube.com/apachemxnet"><svg class="svg-icon"><use xlink:href="/versions/master/assets/minima-social-icons.svg#youtube"></use></svg> <span class="username">apachemxnet</span></a></li></ul> |
| </div> |
| |
| <div class="col-4 footer-text"> |
| <p>A flexible and efficient library for deep learning.</p> |
| </div> |
| </div> |
| </div> |
| </footer> |
| <footer class="site-footer2"> |
| <div class="wrapper"> |
| <div class="row"> |
| <div class="col-3"> |
| <img src="/versions/master/assets/img/apache_incubator_logo.png" class="footer-logo col-2"> |
| </div> |
| <div class="footer-bottom-warning col-9"> |
| <p>Apache MXNet is an effort undergoing incubation at The Apache Software Foundation (ASF), <span |
| style="font-weight:bold">sponsored by the <i>Apache Incubator</i></span>. Incubation is required |
| of all newly accepted projects until a further review indicates that the infrastructure, |
| communications, and decision making process have stabilized in a manner consistent with other |
| successful ASF projects. While incubation status is not necessarily a reflection of the completeness |
| or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. |
| </p><p>"Copyright © 2017-2018, The Apache Software Foundation Apache MXNet, MXNet, Apache, the Apache |
| feather, and the Apache MXNet project logo are either registered trademarks or trademarks of the |
| Apache Software Foundation."</p> |
| </div> |
| </div> |
| </div> |
| </footer> |
| |
| |
| |
| |
| </body> |
| |
| </html> |