docs/learn/documentation/versioned/hdfs/producer.md - samza - Git at Google

 ---
 layout: page
 title: Isolation
 ---
 <!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 -->

 ### Writing to HDFS from Samza

 The `samza-hdfs` module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use `HdfsSystemProducer`, and three `HdfsWriter`s: One that writes messages of raw bytes to a `SequenceFile` of `BytesWritable` keys and values. Another writes UTF-8 `String`s to a `SequenceFile` with `LongWritable` keys and `Text` values.
 The last one writes out Avro data files including the schema automatically reflected from the POJO objects fed to it.

 ### Configuring an HdfsSystemProducer

 You can configure an HdfsSystemProducer like any other Samza system: using configuration keys and values set in a `job.properties` file.
 You might configure the system producer for use by your `StreamTasks` like this:

 ```
 # set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
 systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

 # define a serializer/deserializer for the hdfs-clickstream system
 # DO NOT define (i.e. comment out) a SerDe when using the AvroDataFileHdfsWriter so it can reflect the schema
 systems.hdfs-clickstream.samza.msg.serde=some-serde-impl

 # consumer configs not needed for HDFS system, reader is not implemented yet

 # Assign a Metrics implementation via a label we defined earlier in the props file
 systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl

 # Assign the implementation class for this system's HdfsWriter
 systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter
 #systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter

 # Set compression type supported by chosen Writer. Only BLOCK compression is supported currently
 # AvroDataFileHdfsWriter supports snappy, bzip2, deflate or none (null, anything other than the first three)
 systems.hdfs-clickstream.producer.hdfs.compression.type=snappy

 # The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters
 # is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
 systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data

 # Assign the implementation class for the HdfsWriter's Bucketer
 systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer

 # Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run.
 systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd

 # Optionally set the max output bytes (records for AvroDataFileHdfsWriter) per file.
 # A new file will be cut and output continued on the next write call each time this many bytes
 # (records for AvroDataFileHdfsWriter) are written.
 systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
 #systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000
 ```

 The above configuration assumes a Metrics and Serde implemnetation has been properly configured against the `some-serde-impl` and `some-metrics-impl` labels somewhere else in the same `job.properties` file. Each of these properties has a reasonable default, so you can leave out the ones you don't need to customize for your job run.
	---
	layout: page
	title: Isolation
	---
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->

	### Writing to HDFS from Samza

	The `samza-hdfs` module implements a Samza Producer to write to HDFS. The current implementation includes a ready-to-use `HdfsSystemProducer`, and three `HdfsWriter`s: One that writes messages of raw bytes to a `SequenceFile` of `BytesWritable` keys and values. Another writes UTF-8 `String`s to a `SequenceFile` with `LongWritable` keys and `Text` values.
	The last one writes out Avro data files including the schema automatically reflected from the POJO objects fed to it.

	### Configuring an HdfsSystemProducer

	You can configure an HdfsSystemProducer like any other Samza system: using configuration keys and values set in a `job.properties` file.
	You might configure the system producer for use by your `StreamTasks` like this:

	```
	# set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
	systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

	# define a serializer/deserializer for the hdfs-clickstream system
	# DO NOT define (i.e. comment out) a SerDe when using the AvroDataFileHdfsWriter so it can reflect the schema
	systems.hdfs-clickstream.samza.msg.serde=some-serde-impl

	# consumer configs not needed for HDFS system, reader is not implemented yet

	# Assign a Metrics implementation via a label we defined earlier in the props file
	systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl

	# Assign the implementation class for this system's HdfsWriter
	systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter
	#systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter

	# Set compression type supported by chosen Writer. Only BLOCK compression is supported currently
	# AvroDataFileHdfsWriter supports snappy, bzip2, deflate or none (null, anything other than the first three)
	systems.hdfs-clickstream.producer.hdfs.compression.type=snappy

	# The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters
	# is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
	systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data

	# Assign the implementation class for the HdfsWriter's Bucketer
	systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer

	# Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run.
	systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd

	# Optionally set the max output bytes (records for AvroDataFileHdfsWriter) per file.
	# A new file will be cut and output continued on the next write call each time this many bytes
	# (records for AvroDataFileHdfsWriter) are written.
	systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
	#systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000
	```

	The above configuration assumes a Metrics and Serde implemnetation has been properly configured against the `some-serde-impl` and `some-metrics-impl` labels somewhere else in the same `job.properties` file. Each of these properties has a reasonable default, so you can leave out the ones you don't need to customize for your job run.