| |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" |
| "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> |
| <title>pyspark.streaming module — PySpark 2.2.1 documentation</title> |
| <link rel="stylesheet" href="_static/nature.css" type="text/css" /> |
| <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> |
| <link rel="stylesheet" href="_static/pyspark.css" type="text/css" /> |
| <script type="text/javascript"> |
| var DOCUMENTATION_OPTIONS = { |
| URL_ROOT: './', |
| VERSION: '2.2.1', |
| COLLAPSE_INDEX: false, |
| FILE_SUFFIX: '.html', |
| HAS_SOURCE: true, |
| SOURCELINK_SUFFIX: '.txt' |
| }; |
| </script> |
| <script type="text/javascript" src="_static/jquery.js"></script> |
| <script type="text/javascript" src="_static/underscore.js"></script> |
| <script type="text/javascript" src="_static/doctools.js"></script> |
| <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> |
| <script type="text/javascript" src="_static/pyspark.js"></script> |
| <link rel="search" title="Search" href="search.html" /> |
| <link rel="next" title="pyspark.ml package" href="pyspark.ml.html" /> |
| <link rel="prev" title="pyspark.sql module" href="pyspark.sql.html" /> |
| </head> |
| <body> |
| <div class="related" role="navigation" aria-label="related navigation"> |
| <h3>Navigation</h3> |
| <ul> |
| <li class="right" style="margin-right: 10px"> |
| <a href="pyspark.ml.html" title="pyspark.ml package" |
| accesskey="N">next</a></li> |
| <li class="right" > |
| <a href="pyspark.sql.html" title="pyspark.sql module" |
| accesskey="P">previous</a> |</li> |
| |
| <li class="nav-item nav-item-0"><a href="index.html">PySpark 2.2.1 documentation</a> »</li> |
| |
| <li class="nav-item nav-item-1"><a href="pyspark.html" accesskey="U">pyspark package</a> »</li> |
| </ul> |
| </div> |
| |
| <div class="document"> |
| <div class="documentwrapper"> |
| <div class="bodywrapper"> |
| <div class="body" role="main"> |
| |
| <div class="section" id="pyspark-streaming-module"> |
| <h1>pyspark.streaming module<a class="headerlink" href="#pyspark-streaming-module" title="Permalink to this headline">¶</a></h1> |
| <div class="section" id="module-pyspark.streaming"> |
| <span id="module-contents"></span><h2>Module contents<a class="headerlink" href="#module-pyspark.streaming" title="Permalink to this headline">¶</a></h2> |
| <dl class="class"> |
| <dt id="pyspark.streaming.StreamingContext"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.</code><code class="descname">StreamingContext</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>batchDuration=None</em>, <em>jssc=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>Main entry point for Spark Streaming functionality. A StreamingContext |
| represents the connection to a Spark cluster, and can be used to create |
| <a class="reference internal" href="#pyspark.streaming.DStream" title="pyspark.streaming.DStream"><code class="xref py py-class docutils literal"><span class="pre">DStream</span></code></a>s from various input sources. It is created from an existing <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code>. |
| After creating and transforming DStreams, the streaming computation can |
| be started and stopped using <cite>context.start()</cite> and <cite>context.stop()</cite>, |
| respectively. <cite>context.awaitTermination()</cite> allows the current thread |
| to wait for the termination of the context, whether by <cite>stop()</cite> or by an exception.</p> |
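| <p>A minimal sketch of this lifecycle (the host, port and batch interval are |
| illustrative, not part of the API):</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| from pyspark import SparkContext |
| from pyspark.streaming import StreamingContext |
| |
| sc = SparkContext(appName="NetworkWordCount") |
| ssc = StreamingContext(sc, batchDuration=1)      # 1-second batches |
| |
| lines = ssc.socketTextStream("localhost", 9999)  # illustrative source |
| lines.pprint() |
| |
| ssc.start()             # start the computation |
| ssc.awaitTermination()  # block until stop() or an exception |
| </pre></div></div> |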
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.addStreamingListener"> |
| <code class="descname">addStreamingListener</code><span class="sig-paren">(</span><em>streamingListener</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.addStreamingListener"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.addStreamingListener" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Add a <cite>org.apache.spark.streaming.scheduler.StreamingListener</cite> object for |
| receiving system events related to streaming.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.awaitTermination"> |
| <code class="descname">awaitTermination</code><span class="sig-paren">(</span><em>timeout=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.awaitTermination"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.awaitTermination" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Wait for the execution to stop.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>timeout</strong> – time to wait in seconds</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.awaitTerminationOrTimeout"> |
| <code class="descname">awaitTerminationOrTimeout</code><span class="sig-paren">(</span><em>timeout</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.awaitTerminationOrTimeout"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.awaitTerminationOrTimeout" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Wait for the execution to stop. Return <cite>True</cite> if the execution has |
| stopped, re-raise any error reported during the execution, or return |
| <cite>False</cite> if the waiting time elapsed before the execution stopped.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>timeout</strong> – time to wait in seconds</td> |
| </tr> |
| </tbody> |
| </table> |
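| <p>A sketch of a common polling pattern built on this method; the 5-second |
| timeout and the <cite>should_stop()</cite> predicate are illustrative assumptions:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| # Poll every 5 seconds until the context stops on its own, or |
| # until an external condition asks us to stop it gracefully. |
| while not ssc.awaitTerminationOrTimeout(5): |
|     if should_stop():  # hypothetical user-defined predicate |
|         ssc.stop(stopSparkContext=True, stopGraceFully=True) |
|         break |
| </pre></div></div> |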
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.binaryRecordsStream"> |
| <code class="descname">binaryRecordsStream</code><span class="sig-paren">(</span><em>directory</em>, <em>recordLength</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.binaryRecordsStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.binaryRecordsStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream that monitors a Hadoop-compatible file system |
| for new files and reads them as flat binary files with records of |
| fixed length. Files must be written to the monitored directory by “moving” |
| them from another location within the same file system. |
| File names starting with . are ignored.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>directory</strong> – Directory to load data from</li> |
| <li><strong>recordLength</strong> – Length of each record in bytes</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.checkpoint"> |
| <code class="descname">checkpoint</code><span class="sig-paren">(</span><em>directory</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.checkpoint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.checkpoint" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Sets the context to periodically checkpoint the DStream operations for master |
| fault-tolerance. The graph will be checkpointed every batch interval.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>directory</strong> – HDFS-compatible directory where the checkpoint data |
| will be reliably stored</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="classmethod"> |
| <dt id="pyspark.streaming.StreamingContext.getActive"> |
| <em class="property">classmethod </em><code class="descname">getActive</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.getActive"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.getActive" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return either the currently active StreamingContext (i.e., if there is a context started |
| but not stopped) or None.</p> |
| </dd></dl> |
| |
| <dl class="classmethod"> |
| <dt id="pyspark.streaming.StreamingContext.getActiveOrCreate"> |
| <em class="property">classmethod </em><code class="descname">getActiveOrCreate</code><span class="sig-paren">(</span><em>checkpointPath</em>, <em>setupFunc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.getActiveOrCreate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.getActiveOrCreate" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Either return the active StreamingContext (i.e., one that is currently started |
| but not stopped), recreate a StreamingContext from checkpoint data, or create |
| a new StreamingContext using the provided setupFunc. If checkpointPath is None |
| or does not contain valid checkpoint data, then setupFunc will be called to |
| create a new context and set up the DStreams.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>checkpointPath</strong> – Checkpoint directory used in an earlier streaming program. Can be |
| None if the intention is to always create a new context when there |
| is no active context.</li> |
| <li><strong>setupFunc</strong> – Function to create a new JavaStreamingContext and setup DStreams</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="classmethod"> |
| <dt id="pyspark.streaming.StreamingContext.getOrCreate"> |
| <em class="property">classmethod </em><code class="descname">getOrCreate</code><span class="sig-paren">(</span><em>checkpointPath</em>, <em>setupFunc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.getOrCreate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.getOrCreate" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Either recreate a StreamingContext from checkpoint data or create a new StreamingContext. |
| If checkpoint data exists in the provided <cite>checkpointPath</cite>, then StreamingContext will be |
| recreated from the checkpoint data. If the data does not exist, then the provided setupFunc |
| will be used to create a new context.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>checkpointPath</strong> – Checkpoint directory used in an earlier streaming program</li> |
| <li><strong>setupFunc</strong> – Function to create a new context and setup DStreams</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
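| <p>A minimal sketch of driver-restart recovery with <cite>getOrCreate</cite>; the |
| checkpoint path is illustrative:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| from pyspark import SparkContext |
| from pyspark.streaming import StreamingContext |
| |
| CHECKPOINT = "hdfs:///tmp/streaming-checkpoint"  # illustrative path |
| |
| def create_context(): |
|     # Called only when no valid checkpoint data exists. |
|     sc = SparkContext(appName="RecoverableApp") |
|     ssc = StreamingContext(sc, 1) |
|     ssc.checkpoint(CHECKPOINT) |
|     # ... create DStreams and output operations here ... |
|     return ssc |
| |
| ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context) |
| ssc.start() |
| ssc.awaitTermination() |
| </pre></div></div> |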
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.queueStream"> |
| <code class="descname">queueStream</code><span class="sig-paren">(</span><em>rdds</em>, <em>oneAtATime=True</em>, <em>default=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.queueStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.queueStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream from a queue of RDDs or lists. In each batch, |
| it will process either one or all of the RDDs returned by the queue.</p> |
| <div class="admonition note"> |
| <p class="first admonition-title">Note</p> |
| <p class="last">Changes to the queue after the stream is created will not be recognized.</p> |
| </div> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>rdds</strong> – Queue of RDDs</li> |
| <li><strong>oneAtATime</strong> – pick one RDD per batch, or pick all of them at once.</li> |
| <li><strong>default</strong> – The default RDD to process if the queue is empty</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
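| <p>A short sketch of using <cite>queueStream</cite> for testing; the data is illustrative:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| # Feed pre-built RDDs into a stream, one RDD per batch. |
| rdd_queue = [ssc.sparkContext.parallelize(range(i * 10, (i + 1) * 10)) |
|              for i in range(3)] |
| stream = ssc.queueStream(rdd_queue, oneAtATime=True) |
| stream.pprint() |
| </pre></div></div> |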
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.remember"> |
| <code class="descname">remember</code><span class="sig-paren">(</span><em>duration</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.remember"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.remember" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Set each DStream in this context to remember the RDDs it generated |
| in the last given duration. DStreams remember RDDs only for a |
| limited duration of time, after which they are released for garbage |
| collection. This method allows the developer to specify how long to |
| remember the RDDs (if the developer wishes to query old data outside |
| the DStream computation).</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>duration</strong> – Minimum duration (in seconds) that each DStream |
| should remember its RDDs</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.socketTextStream"> |
| <code class="descname">socketTextStream</code><span class="sig-paren">(</span><em>hostname</em>, <em>port</em>, <em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>2)</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.socketTextStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.socketTextStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream from a TCP source hostname:port. Data is received |
| using a TCP socket and the received bytes are interpreted as UTF-8 encoded, |
| <code class="docutils literal"><span class="pre">\n</span></code>-delimited lines.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>hostname</strong> – Hostname to connect to for receiving data</li> |
| <li><strong>port</strong> – Port to connect to for receiving data</li> |
| <li><strong>storageLevel</strong> – Storage level to use for storing the received objects</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
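| <p>A minimal sketch; the host and port are illustrative (for local testing the |
| stream can be fed with a tool such as <cite>nc -lk 9999</cite>):</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| lines = ssc.socketTextStream("localhost", 9999) |
| words = lines.flatMap(lambda line: line.split(" ")) |
| words.pprint() |
| </pre></div></div> |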
| </dd></dl> |
| |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.StreamingContext.sparkContext"> |
| <code class="descname">sparkContext</code><a class="headerlink" href="#pyspark.streaming.StreamingContext.sparkContext" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return the SparkContext associated with this StreamingContext.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.start"> |
| <code class="descname">start</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.start"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.start" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Start the execution of the streams.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.stop"> |
| <code class="descname">stop</code><span class="sig-paren">(</span><em>stopSparkContext=True</em>, <em>stopGraceFully=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.stop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.stop" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Stop the execution of the streams, with the option of ensuring that all |
| received data has been processed.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>stopSparkContext</strong> – Stop the associated SparkContext or not</li> |
| <li><strong>stopGraceFully</strong> – Stop gracefully by waiting for the processing |
| of all received data to be completed</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.textFileStream"> |
| <code class="descname">textFileStream</code><span class="sig-paren">(</span><em>directory</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.textFileStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.textFileStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream that monitors a Hadoop-compatible file system |
| for new files and reads them as text files. Files must be written to the |
| monitored directory by “moving” them from another location within the same |
| file system. File names starting with . are ignored.</p> |
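| <p>A brief sketch; the directory is illustrative and must already exist:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| # Each new file atomically moved into the directory becomes part |
| # of one batch; pre-existing files are not re-read. |
| logs = ssc.textFileStream("hdfs:///data/incoming") |
| errors = logs.filter(lambda line: "ERROR" in line) |
| errors.pprint() |
| </pre></div></div> |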
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.transform"> |
| <code class="descname">transform</code><span class="sig-paren">(</span><em>dstreams</em>, <em>transformFunc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.transform"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.transform" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create a new DStream in which each RDD is generated by applying |
| a function on the RDDs of the given DStreams. The order of the RDDs in |
| the transform function parameter will be the same as the order |
| of the corresponding DStreams in the list.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingContext.union"> |
| <code class="descname">union</code><span class="sig-paren">(</span><em>*dstreams</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/context.html#StreamingContext.union"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingContext.union" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create a unified DStream from multiple DStreams of the same |
| type and same slide duration.</p> |
| </dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.DStream"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.</code><code class="descname">DStream</code><span class="sig-paren">(</span><em>jdstream</em>, <em>ssc</em>, <em>jrdd_deserializer</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>A Discretized Stream (DStream), the basic abstraction in Spark Streaming, |
| is a continuous sequence of RDDs (of the same type) representing a |
| continuous stream of data (see <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> in the Spark core documentation |
| for more details on RDDs).</p> |
| <p>DStreams can either be created from live data (such as data from TCP |
| sockets, Kafka, Flume, etc.) using a <a class="reference internal" href="#pyspark.streaming.StreamingContext" title="pyspark.streaming.StreamingContext"><code class="xref py py-class docutils literal"><span class="pre">StreamingContext</span></code></a>, or they can be |
| generated by transforming existing DStreams using operations such as |
| <cite>map</cite>, <cite>window</cite> and <cite>reduceByKeyAndWindow</cite> (see the sketch after the |
| list below). While a Spark Streaming program is running, each DStream |
| periodically generates an RDD, either from live data or by transforming |
| the RDD generated by a parent DStream.</p> |
| <dl class="docutils"> |
| <dt>Internally, a DStream is characterized by a few basic properties:</dt> |
| <dd><ul class="first last simple"> |
| <li>A list of other DStreams that the DStream depends on</li> |
| <li>A time interval at which the DStream generates an RDD</li> |
| <li>A function that is used to generate an RDD after each time interval</li> |
| </ul> |
| </dd> |
| </dl> |
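| <p>A compact sketch of deriving DStreams by transformation; the source, host, |
| port and batch interval are illustrative:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| from pyspark import SparkContext |
| from pyspark.streaming import StreamingContext |
| |
| sc = SparkContext(appName="WordCount") |
| ssc = StreamingContext(sc, 1) |
| |
| lines = ssc.socketTextStream("localhost", 9999)    # input DStream |
| counts = (lines.flatMap(lambda line: line.split(" ")) |
|                .map(lambda word: (word, 1)) |
|                .reduceByKey(lambda a, b: a + b))   # derived DStream |
| counts.pprint() |
| |
| ssc.start() |
| ssc.awaitTermination() |
| </pre></div></div> |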
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.cache"> |
| <code class="descname">cache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.cache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.cache" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Persist the RDDs of this DStream with the default storage level |
| (<code class="xref py py-class docutils literal"><span class="pre">MEMORY_ONLY</span></code>).</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.checkpoint"> |
| <code class="descname">checkpoint</code><span class="sig-paren">(</span><em>interval</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.checkpoint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.checkpoint" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Enable periodic checkpointing of RDDs of this DStream</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>interval</strong> – time in seconds, after each period of that, generated |
| RDD will be checkpointed</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.cogroup"> |
| <code class="descname">cogroup</code><span class="sig-paren">(</span><em>other</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.cogroup"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.cogroup" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying ‘cogroup’ between RDDs of this |
| DStream and <cite>other</cite> DStream.</p> |
| <p>Hash partitioning is used to generate the RDDs with <cite>numPartitions</cite> partitions.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.combineByKey"> |
| <code class="descname">combineByKey</code><span class="sig-paren">(</span><em>createCombiner</em>, <em>mergeValue</em>, <em>mergeCombiners</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.combineByKey"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.combineByKey" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying combineByKey to each RDD.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.context"> |
| <code class="descname">context</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.context"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.context" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return the StreamingContext associated with this DStream</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.count"> |
| <code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.count" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD has a single element |
| generated by counting each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.countByValue"> |
| <code class="descname">countByValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.countByValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.countByValue" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD contains the counts of each |
| distinct value in each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.countByValueAndWindow"> |
| <code class="descname">countByValueAndWindow</code><span class="sig-paren">(</span><em>windowDuration</em>, <em>slideDuration</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.countByValueAndWindow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.countByValueAndWindow" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD contains the count of distinct elements in |
| RDDs in a sliding window over this DStream.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>windowDuration</strong> – width of the window; must be a multiple of this DStream’s |
| batching interval</li> |
| <li><strong>slideDuration</strong> – sliding interval of the window (i.e., the interval after which |
| the new DStream will generate RDDs); must be a multiple of this |
| DStream’s batching interval</li> |
| <li><strong>numPartitions</strong> – number of partitions of each RDD in the new DStream.</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.countByWindow"> |
| <code class="descname">countByWindow</code><span class="sig-paren">(</span><em>windowDuration</em>, <em>slideDuration</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.countByWindow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.countByWindow" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD has a single element generated |
| by counting the number of elements in a window over this DStream. |
| windowDuration and slideDuration are as defined in the window() operation.</p> |
| <p>This is equivalent to window(windowDuration, slideDuration).count(), |
| but will be more efficient if window is large.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.filter"> |
| <code class="descname">filter</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.filter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.filter" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream containing only the elements that satisfy the predicate.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.flatMap"> |
| <code class="descname">flatMap</code><span class="sig-paren">(</span><em>f</em>, <em>preservesPartitioning=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.flatMap"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.flatMap" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying a function to all elements of |
| this DStream, and then flattening the results.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.flatMapValues"> |
| <code class="descname">flatMapValues</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.flatMapValues"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.flatMapValues" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying a flatmap function to the value |
| of each key-value pair in this DStream without changing the key.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.foreachRDD"> |
| <code class="descname">foreachRDD</code><span class="sig-paren">(</span><em>func</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.foreachRDD"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.foreachRDD" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Apply a function to each RDD in this DStream.</p> |
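| <p>A common output-operation sketch, assuming <cite>stream</cite> is an existing DStream; |
| the function may take either the RDD alone or a (time, rdd) pair:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| def process(time, rdd): |
|     # Runs on the driver for each batch; use RDD actions to |
|     # inspect or save the batch's data. |
|     print("batch at {}: {} records".format(time, rdd.count())) |
| |
| stream.foreachRDD(process) |
| </pre></div></div> |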
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.fullOuterJoin"> |
| <code class="descname">fullOuterJoin</code><span class="sig-paren">(</span><em>other</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.fullOuterJoin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.fullOuterJoin" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying ‘full outer join’ between RDDs of this DStream and |
| <cite>other</cite> DStream.</p> |
| <p>Hash partitioning is used to generate the RDDs with <cite>numPartitions</cite> |
| partitions.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.glom"> |
| <code class="descname">glom</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.glom"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.glom" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD is generated by applying glom() |
| to each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.groupByKey"> |
| <code class="descname">groupByKey</code><span class="sig-paren">(</span><em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.groupByKey"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.groupByKey" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying groupByKey on each RDD.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.groupByKeyAndWindow"> |
| <code class="descname">groupByKeyAndWindow</code><span class="sig-paren">(</span><em>windowDuration</em>, <em>slideDuration</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.groupByKeyAndWindow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.groupByKeyAndWindow" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying <cite>groupByKey</cite> over a sliding window; |
| similar to <cite>DStream.groupByKey()</cite>, but computed over a sliding window.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>windowDuration</strong> – width of the window; must be a multiple of this DStream’s |
| batching interval</li> |
| <li><strong>slideDuration</strong> – sliding interval of the window (i.e., the interval after which |
| the new DStream will generate RDDs); must be a multiple of this |
| DStream’s batching interval</li> |
| <li><strong>numPartitions</strong> – Number of partitions of each RDD in the new DStream.</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.join"> |
| <code class="descname">join</code><span class="sig-paren">(</span><em>other</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.join"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.join" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying ‘join’ between RDDs of this DStream and |
| <cite>other</cite> DStream.</p> |
| <p>Hash partitioning is used to generate the RDDs with <cite>numPartitions</cite> |
| partitions.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.leftOuterJoin"> |
| <code class="descname">leftOuterJoin</code><span class="sig-paren">(</span><em>other</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.leftOuterJoin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.leftOuterJoin" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying ‘left outer join’ between RDDs of this DStream and |
| <cite>other</cite> DStream.</p> |
| <p>Hash partitioning is used to generate the RDDs with <cite>numPartitions</cite> |
| partitions.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.map"> |
| <code class="descname">map</code><span class="sig-paren">(</span><em>f</em>, <em>preservesPartitioning=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.map"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.map" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying a function to each element of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.mapPartitions"> |
| <code class="descname">mapPartitions</code><span class="sig-paren">(</span><em>f</em>, <em>preservesPartitioning=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.mapPartitions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.mapPartitions" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD is generated by applying |
| mapPartitions() to each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.mapPartitionsWithIndex"> |
| <code class="descname">mapPartitionsWithIndex</code><span class="sig-paren">(</span><em>f</em>, <em>preservesPartitioning=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.mapPartitionsWithIndex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.mapPartitionsWithIndex" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD is generated by applying |
| mapPartitionsWithIndex() to each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.mapValues"> |
| <code class="descname">mapValues</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.mapValues"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.mapValues" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying a map function to the value of |
| each key-value pair in this DStream without changing the key.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.partitionBy"> |
| <code class="descname">partitionBy</code><span class="sig-paren">(</span><em>numPartitions</em>, <em>partitionFunc=<function portable_hash></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.partitionBy" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a copy of the DStream in which each RDD is partitioned |
| using the specified partitioner.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.persist"> |
| <code class="descname">persist</code><span class="sig-paren">(</span><em>storageLevel</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.persist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.persist" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Persist the RDDs of this DStream with the given storage level</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.pprint"> |
| <code class="descname">pprint</code><span class="sig-paren">(</span><em>num=10</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.pprint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.pprint" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Print the first num elements of each RDD generated in this DStream.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>num</strong> – the number of elements from the first will be printed.</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.reduce"> |
| <code class="descname">reduce</code><span class="sig-paren">(</span><em>func</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.reduce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.reduce" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD has a single element |
| generated by reducing each RDD of this DStream.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.reduceByKey"> |
| <code class="descname">reduceByKey</code><span class="sig-paren">(</span><em>func</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.reduceByKey"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.reduceByKey" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying reduceByKey to each RDD.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.reduceByKeyAndWindow"> |
| <code class="descname">reduceByKeyAndWindow</code><span class="sig-paren">(</span><em>func</em>, <em>invFunc</em>, <em>windowDuration</em>, <em>slideDuration=None</em>, <em>numPartitions=None</em>, <em>filterFunc=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.reduceByKeyAndWindow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.reduceByKeyAndWindow" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying incremental <cite>reduceByKey</cite> over a sliding window.</p> |
| <dl class="docutils"> |
| <dt>The reduced value for a new window is calculated using the old window’s reduced value:</dt> |
| <dd><ol class="first last arabic simple"> |
| <li>reduce the new values that entered the window (e.g., adding new counts)</li> |
| <li>“inverse reduce” the old values that left the window (e.g., subtracting old counts)</li> |
| </ol> |
| </dd> |
| </dl> |
| <p><cite>invFunc</cite> can be None; in that case all the RDDs in the window are |
| re-reduced for each new window, which can be slower than providing <cite>invFunc</cite>.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>func</strong> – associative and commutative reduce function</li> |
| <li><strong>invFunc</strong> – inverse function of <cite>reduceFunc</cite></li> |
| <li><strong>windowDuration</strong> – width of the window; must be a multiple of this DStream’s |
| batching interval</li> |
| <li><strong>slideDuration</strong> – sliding interval of the window (i.e., the interval after which |
| the new DStream will generate RDDs); must be a multiple of this |
| DStream’s batching interval</li> |
| <li><strong>numPartitions</strong> – number of partitions of each RDD in the new DStream.</li> |
| <li><strong>filterFunc</strong> – function to filter expired key-value pairs; |
| only pairs that satisfy the function are retained; |
| set this to None if you do not want to filter</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
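| <p>A sketch of an incremental windowed word count, assuming <cite>words</cite> is an |
| existing DStream; the durations and checkpoint path are illustrative |
| (checkpointing is required when <cite>invFunc</cite> is used):</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| ssc.checkpoint("hdfs:///tmp/checkpoint")  # illustrative path |
| |
| pairs = words.map(lambda w: (w, 1)) |
| windowed = pairs.reduceByKeyAndWindow( |
|     func=lambda a, b: a + b,     # add counts entering the window |
|     invFunc=lambda a, b: a - b,  # subtract counts leaving the window |
|     windowDuration=30,           # seconds; a multiple of the batch interval |
|     slideDuration=10) |
| windowed.pprint() |
| </pre></div></div> |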
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.reduceByWindow"> |
| <code class="descname">reduceByWindow</code><span class="sig-paren">(</span><em>reduceFunc</em>, <em>invReduceFunc</em>, <em>windowDuration</em>, <em>slideDuration</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.reduceByWindow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.reduceByWindow" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD has a single element generated by reducing all |
| elements in a sliding window over this DStream.</p> |
| <p>If <cite>invReduceFunc</cite> is not None, the reduction is done incrementally |
| using the old window’s reduced value:</p> |
| <ol class="arabic simple"> |
| <li>reduce the new values that entered the window (e.g., adding new counts)</li> |
| <li>“inverse reduce” the old values that left the window (e.g., subtracting old counts)</li> |
| </ol> |
| <p>This is more efficient than re-reducing the whole window, which is what |
| happens when <cite>invReduceFunc</cite> is None.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>reduceFunc</strong> – associative and commutative reduce function</li> |
| <li><strong>invReduceFunc</strong> – inverse reduce function of <cite>reduceFunc</cite>; such that for all y, |
| and invertible x: |
| <cite>invReduceFunc(reduceFunc(x, y), x) = y</cite></li> |
| <li><strong>windowDuration</strong> – width of the window; must be a multiple of this DStream’s |
| batching interval</li> |
| <li><strong>slideDuration</strong> – sliding interval of the window (i.e., the interval after which |
| the new DStream will generate RDDs); must be a multiple of this |
| DStream’s batching interval</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.repartition"> |
| <code class="descname">repartition</code><span class="sig-paren">(</span><em>numPartitions</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.repartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.repartition" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream with an increased or decreased level of parallelism.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.rightOuterJoin"> |
| <code class="descname">rightOuterJoin</code><span class="sig-paren">(</span><em>other</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.rightOuterJoin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.rightOuterJoin" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by applying ‘right outer join’ between RDDs of this DStream and |
| <cite>other</cite> DStream.</p> |
| <p>Hash partitioning is used to generate the RDDs with <cite>numPartitions</cite> |
| partitions.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.saveAsTextFiles"> |
| <code class="descname">saveAsTextFiles</code><span class="sig-paren">(</span><em>prefix</em>, <em>suffix=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.saveAsTextFiles"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.saveAsTextFiles" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Save each RDD in this DStream as a text file, using the string |
| representation of its elements.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.slice"> |
| <code class="descname">slice</code><span class="sig-paren">(</span><em>begin</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.slice"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.slice" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return all the RDDs between ‘begin’ and ‘end’ (both inclusive).</p> |
| <p><cite>begin</cite> and <cite>end</cite> can be datetime.datetime objects or Unix timestamps.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.transform"> |
| <code class="descname">transform</code><span class="sig-paren">(</span><em>func</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.transform"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.transform" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD is generated by applying a function |
| to each RDD of this DStream.</p> |
| <p><cite>func</cite> can take a single argument, <cite>rdd</cite>, or two arguments, |
| (<cite>time</cite>, <cite>rdd</cite>).</p> |
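| <p>A small sketch applying an arbitrary RDD operation per batch, assuming |
| <cite>counts</cite> is an existing DStream of (word, count) pairs:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| # Sort each batch's word counts in descending order. |
| sorted_counts = counts.transform( |
|     lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False)) |
| sorted_counts.pprint() |
| </pre></div></div> |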
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.transformWith"> |
| <code class="descname">transformWith</code><span class="sig-paren">(</span><em>func</em>, <em>other</em>, <em>keepSerializer=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.transformWith"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.transformWith" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD is generated by applying a function |
| on each RDD of this DStream and ‘other’ DStream.</p> |
| <p><cite>func</cite> can take two arguments, (<cite>rdd_a</cite>, <cite>rdd_b</cite>), or three |
| arguments, (<cite>time</cite>, <cite>rdd_a</cite>, <cite>rdd_b</cite>).</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.union"> |
| <code class="descname">union</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.union"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.union" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream by unifying data of another DStream with this DStream.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – Another DStream having the same interval (i.e., slideDuration) |
| as this DStream.</td> |
| </tr> |
| </tbody> |
| </table> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.updateStateByKey"> |
| <code class="descname">updateStateByKey</code><span class="sig-paren">(</span><em>updateFunc</em>, <em>numPartitions=None</em>, <em>initialRDD=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.updateStateByKey"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.updateStateByKey" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new “state” DStream where the state for each key is updated by applying |
| the given function on the previous state of the key and the new values of the key.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>updateFunc</strong> – State update function. If this function returns None, then |
| corresponding state key-value pair will be eliminated.</td> |
| </tr> |
| </tbody> |
| </table> |
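| <p>A sketch of a running count per key, assuming <cite>pairs</cite> is an existing |
| DStream of (key, value) pairs; the checkpoint path is illustrative |
| (checkpointing is required for stateful operations):</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| ssc.checkpoint("hdfs:///tmp/checkpoint")  # illustrative path |
| |
| def update(new_values, last_state): |
|     # new_values: values for this key in the current batch; |
|     # last_state: previous state, or None on first appearance. |
|     return sum(new_values) + (last_state or 0) |
| |
| running_counts = pairs.updateStateByKey(update) |
| running_counts.pprint() |
| </pre></div></div> |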
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.DStream.window"> |
| <code class="descname">window</code><span class="sig-paren">(</span><em>windowDuration</em>, <em>slideDuration=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/dstream.html#DStream.window"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.DStream.window" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Return a new DStream in which each RDD contains all the elements seen in a |
| sliding window of time over this DStream.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple"> |
| <li><strong>windowDuration</strong> – width of the window; must be a multiple of this DStream’s |
| batching interval</li> |
| <li><strong>slideDuration</strong> – sliding interval of the window (i.e., the interval after which |
| the new DStream will generate RDDs); must be a multiple of this |
| DStream’s batching interval</li> |
| </ul> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
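| <p>A brief sketch, assuming <cite>words</cite> is an existing DStream; the durations |
| (in seconds) are illustrative and must be multiples of the batch interval:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| # Every 10 seconds, count the elements received in the last 30 seconds. |
| recent = words.window(windowDuration=30, slideDuration=10) |
| recent.count().pprint() |
| </pre></div></div> |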
| </dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.StreamingListener"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.</code><code class="descname">StreamingListener</code><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
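| <p>A minimal sketch of a custom listener; only the callbacks you need have to |
| be overridden, and the body shown is illustrative:</p> |
| <div class="highlight-default"><div class="highlight"><pre> |
| from pyspark.streaming import StreamingListener |
| |
| class BatchLogger(StreamingListener): |
|     def onBatchCompleted(self, batchCompleted): |
|         # batchCompleted wraps the underlying Java event object. |
|         print("A batch finished processing.") |
| |
| ssc.addStreamingListener(BatchLogger()) |
| </pre></div></div> |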
| <dl class="class"> |
| <dt id="pyspark.streaming.StreamingListener.Java"> |
| <em class="property">class </em><code class="descname">Java</code><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.Java"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.Java" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.StreamingListener.Java.implements"> |
| <code class="descname">implements</code><em class="property"> = ['org.apache.spark.streaming.api.java.PythonStreamingListener']</em><a class="headerlink" href="#pyspark.streaming.StreamingListener.Java.implements" title="Permalink to this definition">¶</a></dt> |
| <dd></dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onBatchCompleted"> |
| <code class="descname">onBatchCompleted</code><span class="sig-paren">(</span><em>batchCompleted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onBatchCompleted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onBatchCompleted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when processing of a batch of jobs has completed.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onBatchStarted"> |
| <code class="descname">onBatchStarted</code><span class="sig-paren">(</span><em>batchStarted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onBatchStarted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onBatchStarted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when processing of a batch of jobs has started.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onBatchSubmitted"> |
| <code class="descname">onBatchSubmitted</code><span class="sig-paren">(</span><em>batchSubmitted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onBatchSubmitted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onBatchSubmitted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when a batch of jobs has been submitted for processing.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onOutputOperationCompleted"> |
| <code class="descname">onOutputOperationCompleted</code><span class="sig-paren">(</span><em>outputOperationCompleted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onOutputOperationCompleted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onOutputOperationCompleted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when processing of a job of a batch has completed.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onOutputOperationStarted"> |
| <code class="descname">onOutputOperationStarted</code><span class="sig-paren">(</span><em>outputOperationStarted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onOutputOperationStarted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onOutputOperationStarted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when processing of a job of a batch has started.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onReceiverError"> |
| <code class="descname">onReceiverError</code><span class="sig-paren">(</span><em>receiverError</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onReceiverError"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onReceiverError" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when a receiver has reported an error.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onReceiverStarted"> |
| <code class="descname">onReceiverStarted</code><span class="sig-paren">(</span><em>receiverStarted</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onReceiverStarted"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onReceiverStarted" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when a receiver has been started.</p> |
| </dd></dl> |
| |
| <dl class="method"> |
| <dt id="pyspark.streaming.StreamingListener.onReceiverStopped"> |
| <code class="descname">onReceiverStopped</code><span class="sig-paren">(</span><em>receiverStopped</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/listener.html#StreamingListener.onReceiverStopped"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.StreamingListener.onReceiverStopped" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Called when a receiver has been stopped.</p> |
| </dd></dl> |
| |
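| <p>For example, a minimal sketch of a custom listener registered on an existing |
| StreamingContext <code class="docutils literal"><span class="pre">ssc</span></code>; the event objects wrap the corresponding |
| JVM listener events, so the <code class="docutils literal"><span class="pre">batchInfo()</span></code> accessor is assumed from that API:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming import StreamingListener |
|  |
| class BatchLogger(StreamingListener): |
|     def onBatchCompleted(self, batchCompleted): |
|         # batchCompleted wraps the JVM BatchCompleted event |
|         print("batch completed: %s" % batchCompleted.batchInfo()) |
|  |
| ssc.addStreamingListener(BatchLogger()) |
| </pre></div></div> |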
| </dd></dl> |
| |
| </div> |
| <div class="section" id="module-pyspark.streaming.kafka"> |
| <span id="pyspark-streaming-kafka-module"></span><h2>pyspark.streaming.kafka module<a class="headerlink" href="#module-pyspark.streaming.kafka" title="Permalink to this headline">¶</a></h2> |
| <dl class="class"> |
| <dt id="pyspark.streaming.kafka.Broker"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">Broker</code><span class="sig-paren">(</span><em>host</em>, <em>port</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#Broker"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.Broker" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>Represents the host and port info for a Kafka broker.</p> |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.kafka.KafkaMessageAndMetadata"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">KafkaMessageAndMetadata</code><span class="sig-paren">(</span><em>topic</em>, <em>partition</em>, <em>offset</em>, <em>key</em>, <em>message</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#KafkaMessageAndMetadata"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.KafkaMessageAndMetadata" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>Kafka message and metadata information, including topic, partition, offset, and message.</p> |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.kafka.KafkaMessageAndMetadata.key"> |
| <code class="descname">key</code><a class="headerlink" href="#pyspark.streaming.kafka.KafkaMessageAndMetadata.key" title="Permalink to this definition">¶</a></dt> |
| <dd></dd></dl> |
| |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.kafka.KafkaMessageAndMetadata.message"> |
| <code class="descname">message</code><a class="headerlink" href="#pyspark.streaming.kafka.KafkaMessageAndMetadata.message" title="Permalink to this definition">¶</a></dt> |
| <dd></dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.kafka.KafkaUtils"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">KafkaUtils</code><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#KafkaUtils"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.KafkaUtils" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.kafka.KafkaUtils.createDirectStream"> |
| <em class="property">static </em><code class="descname">createDirectStream</code><span class="sig-paren">(</span><em>ssc</em>, <em>topics</em>, <em>kafkaParams</em>, <em>fromOffsets=None</em>, <em>keyDecoder=<function utf8_decoder></em>, <em>valueDecoder=<function utf8_decoder></em>, <em>messageHandler=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#KafkaUtils.createDirectStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.KafkaUtils.createDirectStream" title="Permalink to this definition">¶</a></dt> |
| <dd><div class="admonition note"> |
| <p class="first admonition-title">Note</p> |
| <p class="last">Experimental</p> |
| </div> |
| <p>Create an input stream that directly pulls messages from a Kafka Broker, starting from specific offsets.</p> |
| <p>This is not a receiver-based Kafka input stream: it pulls messages directly from Kafka |
| in each batch duration and processes them without storing them first.</p> |
| <p>This does not use Zookeeper to store offsets. The consumed offsets are tracked |
| by the stream itself. For interoperability with Kafka monitoring tools that depend on |
| Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. |
| You can access the offsets used in each batch by calling <code class="docutils literal"><span class="pre">offsetRanges()</span></code> on the generated RDDs.</p> |
| <p>To recover from driver failures, you have to enable checkpointing in the StreamingContext. |
| The information on consumed offset can be recovered from the checkpoint. |
| See the programming guide for details (constraints, etc.).</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>ssc</strong> – StreamingContext object.</li> |
| <li><strong>topics</strong> – list of topic_name to consume.</li> |
| <li><strong>kafkaParams</strong> – Additional params for Kafka.</li> |
| <li><strong>fromOffsets</strong> – Per-topic/partition Kafka offsets defining the (inclusive) starting |
| point of the stream.</li> |
| <li><strong>keyDecoder</strong> – A function used to decode key (default is utf8_decoder).</li> |
| <li><strong>valueDecoder</strong> – A function used to decode value (default is utf8_decoder).</li> |
| <li><strong>messageHandler</strong> – A function used to convert KafkaMessageAndMetadata. You can access |
| the message metadata using messageHandler (default is None).</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">A DStream object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
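| <p>For example, a minimal sketch assuming an existing StreamingContext |
| <code class="docutils literal"><span class="pre">ssc</span></code>; the broker addresses and topic name are hypothetical:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.kafka import KafkaUtils |
|  |
| directStream = KafkaUtils.createDirectStream( |
|     ssc, ["my-topic"], {"metadata.broker.list": "broker1:9092,broker2:9092"}) |
| # Each record is a (key, message) pair after decoding |
| directStream.map(lambda kv: kv[1]).pprint() |
| </pre></div></div> |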
| </dd></dl> |
| |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.kafka.KafkaUtils.createRDD"> |
| <em class="property">static </em><code class="descname">createRDD</code><span class="sig-paren">(</span><em>sc</em>, <em>kafkaParams</em>, <em>offsetRanges</em>, <em>leaders=None</em>, <em>keyDecoder=<function utf8_decoder></em>, <em>valueDecoder=<function utf8_decoder></em>, <em>messageHandler=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#KafkaUtils.createRDD"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.KafkaUtils.createRDD" title="Permalink to this definition">¶</a></dt> |
| <dd><div class="admonition note"> |
| <p class="first admonition-title">Note</p> |
| <p class="last">Experimental</p> |
| </div> |
| <p>Create an RDD from Kafka using offset ranges for each topic and partition.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>sc</strong> – SparkContext object</li> |
| <li><strong>kafkaParams</strong> – Additional params for Kafka</li> |
| <li><strong>offsetRanges</strong> – list of OffsetRange objects specifying the topic:partition:[start, end) ranges to consume</li> |
| <li><strong>leaders</strong> – Kafka brokers for each TopicAndPartition in offsetRanges. May be an empty |
| map, in which case leaders will be looked up on the driver.</li> |
| <li><strong>keyDecoder</strong> – A function used to decode key (default is utf8_decoder)</li> |
| <li><strong>valueDecoder</strong> – A function used to decode value (default is utf8_decoder)</li> |
| <li><strong>messageHandler</strong> – A function used to convert KafkaMessageAndMetadata. You can access |
| the message metadata using messageHandler (default is None).</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">An RDD object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
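| <p>For example, a minimal sketch reading offsets [0, 100) of partition 0, assuming an |
| existing SparkContext <code class="docutils literal"><span class="pre">sc</span></code>; the broker address and topic name are hypothetical:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.kafka import KafkaUtils, OffsetRange |
|  |
| # topic "my-topic", partition 0, offsets [0, 100) |
| ranges = [OffsetRange("my-topic", 0, 0, 100)] |
| rdd = KafkaUtils.createRDD( |
|     sc, {"metadata.broker.list": "broker1:9092"}, ranges) |
| print(rdd.count()) |
| </pre></div></div> |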
| </dd></dl> |
| |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.kafka.KafkaUtils.createStream"> |
| <em class="property">static </em><code class="descname">createStream</code><span class="sig-paren">(</span><em>ssc</em>, <em>zkQuorum</em>, <em>groupId</em>, <em>topics</em>, <em>kafkaParams=None</em>, <em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>2)</em>, <em>keyDecoder=<function utf8_decoder></em>, <em>valueDecoder=<function utf8_decoder></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#KafkaUtils.createStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.KafkaUtils.createStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream that pulls messages from a Kafka Broker.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>ssc</strong> – StreamingContext object</li> |
| <li><strong>zkQuorum</strong> – Zookeeper quorum (hostname:port,hostname:port,..).</li> |
| <li><strong>groupId</strong> – The group id for this consumer.</li> |
| <li><strong>topics</strong> – Dict of (topic_name -> numPartitions) to consume. |
| Each partition is consumed in its own thread.</li> |
| <li><strong>kafkaParams</strong> – Additional params for Kafka</li> |
| <li><strong>storageLevel</strong> – RDD storage level.</li> |
| <li><strong>keyDecoder</strong> – A function used to decode key (default is utf8_decoder)</li> |
| <li><strong>valueDecoder</strong> – A function used to decode value (default is utf8_decoder)</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">A DStream object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
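| <p>For example, a minimal sketch assuming an existing StreamingContext |
| <code class="docutils literal"><span class="pre">ssc</span></code>; the Zookeeper quorum, group id, and topic are hypothetical:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.kafka import KafkaUtils |
|  |
| # Consume "my-topic" with one thread for its partitions |
| stream = KafkaUtils.createStream( |
|     ssc, "zk1:2181,zk2:2181", "my-consumer-group", {"my-topic": 1}) |
| stream.map(lambda kv: kv[1]).pprint() |
| </pre></div></div> |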
| </dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.kafka.OffsetRange"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">OffsetRange</code><span class="sig-paren">(</span><em>topic</em>, <em>partition</em>, <em>fromOffset</em>, <em>untilOffset</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#OffsetRange"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.OffsetRange" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>Represents a range of offsets from a single Kafka TopicAndPartition.</p> |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.kafka.TopicAndPartition"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">TopicAndPartition</code><span class="sig-paren">(</span><em>topic</em>, <em>partition</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#TopicAndPartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.TopicAndPartition" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <p>Represents a specific topic and partition for Kafka.</p> |
| </dd></dl> |
| |
| <dl class="function"> |
| <dt id="pyspark.streaming.kafka.utf8_decoder"> |
| <code class="descclassname">pyspark.streaming.kafka.</code><code class="descname">utf8_decoder</code><span class="sig-paren">(</span><em>s</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kafka.html#utf8_decoder"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kafka.utf8_decoder" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Decode UTF-8 encoded bytes to unicode.</p> |
| </dd></dl> |
| |
| </div> |
| <div class="section" id="module-pyspark.streaming.kinesis"> |
| <span id="pyspark-streaming-kinesis-module"></span><h2>pyspark.streaming.kinesis module<a class="headerlink" href="#module-pyspark.streaming.kinesis" title="Permalink to this headline">¶</a></h2> |
| <dl class="class"> |
| <dt id="pyspark.streaming.kinesis.KinesisUtils"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kinesis.</code><code class="descname">KinesisUtils</code><a class="reference internal" href="_modules/pyspark/streaming/kinesis.html#KinesisUtils"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kinesis.KinesisUtils" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.kinesis.KinesisUtils.createStream"> |
| <em class="property">static </em><code class="descname">createStream</code><span class="sig-paren">(</span><em>ssc</em>, <em>kinesisAppName</em>, <em>streamName</em>, <em>endpointUrl</em>, <em>regionName</em>, <em>initialPositionInStream</em>, <em>checkpointInterval</em>, <em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>2)</em>, <em>awsAccessKeyId=None</em>, <em>awsSecretKey=None</em>, <em>decoder=<function utf8_decoder></em>, <em>stsAssumeRoleArn=None</em>, <em>stsSessionName=None</em>, <em>stsExternalId=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kinesis.html#KinesisUtils.createStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kinesis.KinesisUtils.createStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream that pulls messages from a Kinesis stream. This uses the |
| Kinesis Client Library (KCL) to pull messages from Kinesis.</p> |
| <div class="admonition note"> |
| <p class="first admonition-title">Note</p> |
| <p class="last">The given AWS credentials will get saved in DStream checkpoints if checkpointing |
| is enabled. Make sure that your checkpoint directory is secure.</p> |
| </div> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>ssc</strong> – StreamingContext object</li> |
| <li><strong>kinesisAppName</strong> – Kinesis application name used by the Kinesis Client Library (KCL) to |
| update DynamoDB</li> |
| <li><strong>streamName</strong> – Kinesis stream name</li> |
| <li><strong>endpointUrl</strong> – Url of Kinesis service (e.g., <a class="reference external" href="https://kinesis.us-east-1.amazonaws.com">https://kinesis.us-east-1.amazonaws.com</a>)</li> |
| <li><strong>regionName</strong> – Name of region used by the Kinesis Client Library (KCL) to update |
| DynamoDB (lease coordination and checkpointing) and CloudWatch (metrics)</li> |
| <li><strong>initialPositionInStream</strong> – In the absence of Kinesis checkpoint info, this is the |
| worker’s initial starting position in the stream. The |
| values are either the beginning of the stream per Kinesis’ |
| limit of 24 hours (InitialPositionInStream.TRIM_HORIZON) or |
| the tip of the stream (InitialPositionInStream.LATEST).</li> |
| <li><strong>checkpointInterval</strong> – Checkpoint interval for Kinesis checkpointing. See the Kinesis |
| Spark Streaming documentation for more details on the different |
| types of checkpoints.</li> |
| <li><strong>storageLevel</strong> – Storage level to use for storing the received objects (default is |
| StorageLevel.MEMORY_AND_DISK_2)</li> |
| <li><strong>awsAccessKeyId</strong> – AWS AccessKeyId (default is None. If None, will use |
| DefaultAWSCredentialsProviderChain)</li> |
| <li><strong>awsSecretKey</strong> – AWS SecretKey (default is None. If None, will use |
| DefaultAWSCredentialsProviderChain)</li> |
| <li><strong>decoder</strong> – A function used to decode value (default is utf8_decoder)</li> |
| <li><strong>stsAssumeRoleArn</strong> – ARN of IAM role to assume when using STS sessions to read from |
| the Kinesis stream (default is None).</li> |
| <li><strong>stsSessionName</strong> – Name to uniquely identify STS sessions used to read from Kinesis |
| stream, if STS is being used (default is None).</li> |
| <li><strong>stsExternalId</strong> – External ID that can be used to validate against the assumed IAM |
| role’s trust policy, if STS is being used (default is None).</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">A DStream object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
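| <p>For example, a minimal sketch assuming an existing StreamingContext |
| <code class="docutils literal"><span class="pre">ssc</span></code>; the application and stream names are hypothetical, and |
| credentials are resolved through the DefaultAWSCredentialsProviderChain:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream |
|  |
| stream = KinesisUtils.createStream( |
|     ssc, "myKinesisApp", "myStream", |
|     "https://kinesis.us-east-1.amazonaws.com", "us-east-1", |
|     InitialPositionInStream.LATEST, 2)  # checkpoint every 2 seconds |
| stream.pprint() |
| </pre></div></div> |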
| </dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="class"> |
| <dt id="pyspark.streaming.kinesis.InitialPositionInStream"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.kinesis.</code><code class="descname">InitialPositionInStream</code><a class="reference internal" href="_modules/pyspark/streaming/kinesis.html#InitialPositionInStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kinesis.InitialPositionInStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.kinesis.InitialPositionInStream.LATEST"> |
| <code class="descname">LATEST</code><em class="property"> = 0</em><a class="headerlink" href="#pyspark.streaming.kinesis.InitialPositionInStream.LATEST" title="Permalink to this definition">¶</a></dt> |
| <dd></dd></dl> |
| |
| <dl class="attribute"> |
| <dt id="pyspark.streaming.kinesis.InitialPositionInStream.TRIM_HORIZON"> |
| <code class="descname">TRIM_HORIZON</code><em class="property"> = 1</em><a class="headerlink" href="#pyspark.streaming.kinesis.InitialPositionInStream.TRIM_HORIZON" title="Permalink to this definition">¶</a></dt> |
| <dd></dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="function"> |
| <dt id="pyspark.streaming.kinesis.utf8_decoder"> |
| <code class="descclassname">pyspark.streaming.kinesis.</code><code class="descname">utf8_decoder</code><span class="sig-paren">(</span><em>s</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/kinesis.html#utf8_decoder"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.kinesis.utf8_decoder" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Decode UTF-8 encoded bytes to unicode.</p> |
| </dd></dl> |
| |
| </div> |
| <div class="section" id="module-pyspark.streaming.flume"> |
| <span id="pyspark-streaming-flume-module"></span><h2>pyspark.streaming.flume.module<a class="headerlink" href="#module-pyspark.streaming.flume" title="Permalink to this headline">¶</a></h2> |
| <dl class="class"> |
| <dt id="pyspark.streaming.flume.FlumeUtils"> |
| <em class="property">class </em><code class="descclassname">pyspark.streaming.flume.</code><code class="descname">FlumeUtils</code><a class="reference internal" href="_modules/pyspark/streaming/flume.html#FlumeUtils"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.flume.FlumeUtils" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Bases: <code class="xref py py-class docutils literal"><span class="pre">object</span></code></p> |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.flume.FlumeUtils.createPollingStream"> |
| <em class="property">static </em><code class="descname">createPollingStream</code><span class="sig-paren">(</span><em>ssc</em>, <em>addresses</em>, <em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>2)</em>, <em>maxBatchSize=1000</em>, <em>parallelism=5</em>, <em>bodyDecoder=<function utf8_decoder></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/flume.html#FlumeUtils.createPollingStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.flume.FlumeUtils.createPollingStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent. |
| This stream will poll the sink for data and will pull events as they are available.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>ssc</strong> – StreamingContext object</li> |
| <li><strong>addresses</strong> – List of (host, port)s on which the Spark Sink is running.</li> |
| <li><strong>storageLevel</strong> – Storage level to use for storing the received objects</li> |
| <li><strong>maxBatchSize</strong> – The maximum number of events to be pulled from the Spark sink |
| in a single RPC call</li> |
| <li><strong>parallelism</strong> – Number of concurrent requests this stream should send to the sink. |
| Note that having a higher number of requests concurrently being pulled |
| will result in this stream using more threads</li> |
| <li><strong>bodyDecoder</strong> – A function used to decode body (default is utf8_decoder)</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">A DStream object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
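| <p>For example, a minimal sketch assuming an existing StreamingContext |
| <code class="docutils literal"><span class="pre">ssc</span></code> and a Spark Sink listening on a hypothetical host and port:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.flume import FlumeUtils |
|  |
| addresses = [("agent-host-1", 9988)] |
| pollingStream = FlumeUtils.createPollingStream(ssc, addresses) |
| # Each event is a (headers, body) pair |
| pollingStream.map(lambda event: event[1]).pprint() |
| </pre></div></div> |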
| </dd></dl> |
| |
| <dl class="staticmethod"> |
| <dt id="pyspark.streaming.flume.FlumeUtils.createStream"> |
| <em class="property">static </em><code class="descname">createStream</code><span class="sig-paren">(</span><em>ssc</em>, <em>hostname</em>, <em>port</em>, <em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>2)</em>, <em>enableDecompression=False</em>, <em>bodyDecoder=<function utf8_decoder></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/flume.html#FlumeUtils.createStream"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.flume.FlumeUtils.createStream" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Create an input stream that pulls events from Flume.</p> |
| <table class="docutils field-list" frame="void" rules="none"> |
| <col class="field-name" /> |
| <col class="field-body" /> |
| <tbody valign="top"> |
| <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple"> |
| <li><strong>ssc</strong> – StreamingContext object</li> |
| <li><strong>hostname</strong> – Hostname of the slave machine to which the flume data will be sent</li> |
| <li><strong>port</strong> – Port of the slave machine to which the flume data will be sent</li> |
| <li><strong>storageLevel</strong> – Storage level to use for storing the received objects</li> |
| <li><strong>enableDecompression</strong> – Whether the netty server should decompress the input stream</li> |
| <li><strong>bodyDecoder</strong> – A function used to decode body (default is utf8_decoder)</li> |
| </ul> |
| </td> |
| </tr> |
| <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">A DStream object</p> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
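| <p>For example, a minimal sketch assuming an existing StreamingContext |
| <code class="docutils literal"><span class="pre">ssc</span></code>; the hostname and port are hypothetical:</p> |
| <div class="highlight-python"><div class="highlight"><pre>from pyspark.streaming.flume import FlumeUtils |
|  |
| flumeStream = FlumeUtils.createStream(ssc, "localhost", 9999) |
| # Each event is a (headers, body) pair |
| flumeStream.map(lambda event: event[1]).pprint() |
| </pre></div></div> |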
| </dd></dl> |
| |
| </dd></dl> |
| |
| <dl class="function"> |
| <dt id="pyspark.streaming.flume.utf8_decoder"> |
| <code class="descclassname">pyspark.streaming.flume.</code><code class="descname">utf8_decoder</code><span class="sig-paren">(</span><em>s</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/streaming/flume.html#utf8_decoder"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.streaming.flume.utf8_decoder" title="Permalink to this definition">¶</a></dt> |
| <dd><p>Decode UTF-8 encoded bytes to unicode.</p> |
| </dd></dl> |
| |
| </div> |
| </div> |
| |
| |
| </div> |
| </div> |
| </div> |
| <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> |
| <div class="sphinxsidebarwrapper"> |
| <p class="logo"><a href="index.html"> |
| <img class="logo" src="_static/spark-logo-hd.png" alt="Logo"/> |
| </a></p> |
| <h3><a href="index.html">Table Of Contents</a></h3> |
| <ul> |
| <li><a class="reference internal" href="#">pyspark.streaming module</a><ul> |
| <li><a class="reference internal" href="#module-pyspark.streaming">Module contents</a></li> |
| <li><a class="reference internal" href="#module-pyspark.streaming.kafka">pyspark.streaming.kafka module</a></li> |
| <li><a class="reference internal" href="#module-pyspark.streaming.kinesis">pyspark.streaming.kinesis module</a></li> |
| <li><a class="reference internal" href="#module-pyspark.streaming.flume">pyspark.streaming.flume.module</a></li> |
| </ul> |
| </li> |
| </ul> |
| |
| <h4>Previous topic</h4> |
| <p class="topless"><a href="pyspark.sql.html" |
| title="previous chapter">pyspark.sql module</a></p> |
| <h4>Next topic</h4> |
| <p class="topless"><a href="pyspark.ml.html" |
| title="next chapter">pyspark.ml package</a></p> |
| <div role="note" aria-label="source link"> |
| <h3>This Page</h3> |
| <ul class="this-page-menu"> |
| <li><a href="_sources/pyspark.streaming.rst.txt" |
| rel="nofollow">Show Source</a></li> |
| </ul> |
| </div> |
| <div id="searchbox" style="display: none" role="search"> |
| <h3>Quick search</h3> |
| <form class="search" action="search.html" method="get"> |
| <div><input type="text" name="q" /></div> |
| <div><input type="submit" value="Go" /></div> |
| <input type="hidden" name="check_keywords" value="yes" /> |
| <input type="hidden" name="area" value="default" /> |
| </form> |
| </div> |
| <script type="text/javascript">$('#searchbox').show(0);</script> |
| </div> |
| </div> |
| <div class="clearer"></div> |
| </div> |
| <div class="related" role="navigation" aria-label="related navigation"> |
| <h3>Navigation</h3> |
| <ul> |
| <li class="right" style="margin-right: 10px"> |
| <a href="pyspark.ml.html" title="pyspark.ml package" |
| >next</a></li> |
| <li class="right" > |
| <a href="pyspark.sql.html" title="pyspark.sql module" |
| >previous</a> |</li> |
| |
| <li class="nav-item nav-item-0"><a href="index.html">PySpark 2.2.1 documentation</a> »</li> |
| |
| <li class="nav-item nav-item-1"><a href="pyspark.html" >pyspark package</a> »</li> |
| </ul> |
| </div> |
| <div class="footer" role="contentinfo"> |
| © Copyright. |
| Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.6.5. |
| </div> |
| </body> |
| </html> |