
 % Generated by roxygen2: do not edit by hand
 % Please edit documentation in R/dataset-scan.R
 \name{map_batches}
 \alias{map_batches}
 \title{Apply a function to a stream of RecordBatches}
 \usage{
 map_batches(X, FUN, ..., .schema = NULL, .lazy = TRUE, .data.frame = NULL)
 }
 \arguments{
 \item{X}{A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
 \code{dplyr} methods on \code{Dataset}.}

 \item{FUN}{A function or \code{purrr}-style lambda expression to apply to each
 batch. It must return a \code{RecordBatch} or something coercible to one via
 \code{as_record_batch()}.}

 \item{...}{Additional arguments passed to \code{FUN}.}

 \item{.schema}{An optional \code{\link[=schema]{schema()}}. If \code{NULL}, the schema is inferred
 from the first batch.}

 \item{.lazy}{Use \code{TRUE} to evaluate \code{FUN} lazily as batches are read from
 the result; use \code{FALSE} to evaluate \code{FUN} on all batches before returning
 the reader.}

 \item{.data.frame}{Deprecated argument; ignored.}
 }
 \value{
 An \code{arrow_dplyr_query}.
 }
 \description{
 As an alternative to calling \code{collect()} on a \code{Dataset} query, you can
 use this function to access the stream of \code{RecordBatch}es in the \code{Dataset}.
 This lets you perform more complex operations in R that work on chunks of data
 without holding the entire \code{Dataset} in memory at once. You can include
 \code{map_batches()} in a dplyr pipeline and apply further dplyr methods to the
 resulting stream of data in Arrow.
 }
 \details{
 This function is experimental and not recommended for production use. It is
 also single-threaded and runs in R, not C++, so it will not be as fast as core
 Arrow methods.
 }
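 \examples{
 \dontrun{
 # A minimal sketch, not taken from the package sources. It assumes the arrow
 # and dplyr packages are installed, R >= 4.1 for the native pipe, and uses the
 # built-in mtcars data purely for illustration. The derived column name `kpl`
 # is hypothetical.
 library(arrow)
 library(dplyr)

 # Wrap a data frame in an in-memory Dataset so there is something to stream.
 ds <- InMemoryDataset$create(mtcars)

 result <- ds |>
   filter(cyl > 4) |>
   map_batches(function(batch) {
     # Each batch arrives as a RecordBatch; convert it for ordinary R code.
     df <- as.data.frame(batch)
     # Arbitrary R logic on one chunk: miles per gallon to km per litre.
     df$kpl <- df$mpg * 0.425144
     # FUN must return a RecordBatch or something coercible to one.
     as_record_batch(df)
   }) |>
   collect()

 head(result)
 }
 }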