~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~
Pipeline Configuration Guide
Basic Options
Apache Chukwa pipelines are responsible for accepting incoming data from agents,
and for extracting, transforming, and loading that data to destination storage. Most commonly,
a pipeline simply writes all received data to HBase or HDFS.
* HBase
To enable streaming data to HBase, configure the Chukwa pipeline
in <chukwa-agent-conf.xml>.
---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>
---
In this mode, the HBase connection is configured in <chukwa-env.sh>.
HBASE_CONF_DIR should point to the HBase configuration directory so that the
Apache Chukwa agent can load <hbase-site.xml> from the class path.
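For example, <chukwa-env.sh> might contain the following (the path is illustrative; point it at your own HBase client configuration):
---
export HBASE_CONF_DIR="/etc/hbase/conf"
---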
* HDFS
To enable streaming data to HDFS, configure the Chukwa pipeline in
<chukwa-agent-conf.xml>.
---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.parquet.ChukwaParquetWriter</value>
</property>
---
In this mode, data is written to the HDFS cluster defined by the configuration that the
HADOOP_CONF_DIR environment variable points to.
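For example, the agent environment (typically <chukwa-env.sh>) might export the following (the path is illustrative):
---
export HADOOP_CONF_DIR="/etc/hadoop/conf"
---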
<chukwa.pipeline> is the only option that you really need to specify to get a working
pipeline.
Advanced Options
There are some advanced options, not necessarily documented in the
agent conf file, that are helpful in using Apache Chukwa in nonstandard ways.
While Apache Chukwa normally writes data files directly to HDFS, it is possible to
specify an alternate writer class. The option <chukwa.pipeline> specifies
a Java class to instantiate and use as a writer. See the <ChukwaWriter>
javadoc for details.
One particularly useful pipeline class is <PipelineStageWriter>, which
lets you string together a series of <PipelineableWriters>
for pre-processing or post-processing incoming data.
As an example, the <SocketTeeWriter> class allows other programs to have
incoming chunks fed to them over a socket by the Apache Chukwa agent.
Stages in the pipeline should be listed, comma-separated, in the
<chukwa.pipeline> option:
---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.parquet.ChukwaParquetWriter</value>
</property>
---
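A custom stage in such a chain is a <PipelineableWriter> that does its own processing and then
forwards chunks to the next stage. The sketch below is a hypothetical pass-through stage; the
class name <CountingWriter> is illustrative, and the exact method signatures should be verified
against the <ChukwaWriter> and <PipelineableWriter> javadocs of your release.
---
package org.apache.hadoop.chukwa.datacollection.writer;

import java.util.List;

import org.apache.hadoop.chukwa.Chunk;
import org.apache.hadoop.conf.Configuration;

// Hypothetical pass-through stage: counts chunks, then hands them to the next writer.
public class CountingWriter extends PipelineableWriter {

  private long chunkCount = 0;

  @Override
  public void init(Configuration conf) throws WriterException {
    // Nothing to set up for this sketch.
  }

  @Override
  public CommitStatus add(List<Chunk> chunks) throws WriterException {
    chunkCount += chunks.size();
    // 'next' is the downstream stage wired up by PipelineStageWriter via setNextStage().
    return next.add(chunks);
  }

  @Override
  public void close() throws WriterException {
    next.close();
  }
}
---
Such a stage would be listed in <chukwa.pipeline> ahead of a terminal writer such as
ChukwaParquetWriter, exactly like SocketTeeWriter in the example above.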
HBaseWriter
HBaseWriter is the default writer for storing data in HBase. It runs Demux parsers
internally to convert unstructured data into semi-structured data, then loads the
resulting key-value pairs into HBase tables. HBaseWriter has the following configuration options:
* <<hbase.demux.package>> The Demux parser class package. HBaseWriter uses this
package name to locate annotated Demux parser classes and validate the HBase schema against them.
---
<property>
  <name>hbase.demux.package</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor</value>
</property>
---
* <<hbase.writer.verify.schema>> Verify the HBase table schema against the Demux parser
schema; a warning is logged if there is a mismatch between the HBase schema and the
Demux parsers.
---
<property>
  <name>hbase.writer.verify.schema</name>
  <value>false</value>
</property>
---
* <<hbase.writer.halt.on.schema.mismatch>> If this option is set to true and the HBase
table schema does not match the Demux parser schema, the agent will shut itself down.
---
<property>
  <name>hbase.writer.halt.on.schema.mismatch</name>
  <value>false</value>
</property>
---
SolrWriter
<SolrWriter> writes chunks of data to a SolrCloud server. This writer is
designed to write log entries to Solr for full text indexing.
SolrWriter can be enabled through the <chukwa.pipeline> property in <chukwa-agent-conf.xml>.
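For example (the fully qualified writer class name is assumed here and should be verified against your Chukwa release):
---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.solr.SolrWriter</value>
</property>
---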
The Solr-specific settings are the ZooKeeper address used to locate the SolrCloud
leader and the Solr collection in which indexed data is stored.
---
<property>
  <name>solr.cloud.address</name>
  <value>localhost:2181</value>
</property>
<property>
  <name>solr.collection</name>
  <value>chukwa</value>
</property>
---
LocalWriter
<LocalWriter> writes chunks of data to local disk, then uploads each file to HDFS
as a whole file. This writer is designed for high-throughput environments.
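Like the other writers, LocalWriter is selected through the <chukwa.pipeline> property; a sketch of such a configuration follows (the fully qualified class name is assumed and should be verified against your release):
---
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalWriter</value>
</property>
---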
* <<chukwaCollector.localOutputDir>> Location where data is buffered before it is
moved to HDFS.
---
<property>
  <name>chukwaCollector.localOutputDir</name>
  <value>/tmp/chukwa/logs</value>
</property>
---
ChukwaParquetWriter
The <ChukwaParquetWriter> streams chunks of data to HDFS. When a file has been completely
written, it is renamed with a <.done> suffix. ChukwaParquetWriter has the following
configuration options in <chukwa-agent-conf.xml>.
* <<chukwaCollector.outputDir>> Location of the data sink directory
---
<property>
  <name>chukwaCollector.outputDir</name>
  <value>/chukwa/logs/</value>
  <description>Chukwa data sink directory</description>
</property>
---
* <<chukwaCollector.rotateInterval>> File rotation interval, in milliseconds
---
<property>
  <name>chukwaCollector.rotateInterval</name>
  <value>300000</value>
  <description>Chukwa rotate interval (ms)</description>
</property>
---
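Once rotated, completed files appear in the sink directory with the <.done> suffix and can be checked with the HDFS shell, for example (the path follows the <chukwaCollector.outputDir> setting above):
---
hdfs dfs -ls /chukwa/logs/*.done
---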
SocketTeeWriter
The <SocketTeeWriter> allows external processes to watch
the stream of chunks passing through the agent. This allows certain kinds
of real-time monitoring to be done on top of Apache Chukwa.
SocketTeeWriter listens on a port (specified by the conf option
<chukwaCollector.tee.port>, defaulting to 9094). Applications
that want Chunks should connect to that port, and issue a command of the form
<RAW|WRITABLE|HEADER <filter>\n>. Filters use the same syntax
as the {{{./programming.html#Reading+data+from+the+sink+or+the+archive}Dump command}}.
If the filter is accepted, the Writer will respond
<OK\n>.
Subsequently, Chunks matching the filter will be serialized and sent back
over the socket. Specifying "WRITABLE" will cause the chunks to be written
using Hadoop's Writable serialization framework. "RAW" will send the internal
data of the Chunk, without any metadata, prefixed by its length encoded as
a 32-bit int, big-endian. "HEADER" is similar to "RAW", but with a one-line
header in front of the content. Header format is:
---
<hostname> <datatype> <stream name> <offset>
---
separated by spaces.
The filter will be deactivated when the socket is closed.
---
Socket s2 = new Socket("host", SocketTeeWriter.DEFAULT_PORT);
s2.getOutputStream().write("RAW datatype=XTrace\n".getBytes());
DataInputStream dis = new DataInputStream(s2.getInputStream());
dis.readFully(new byte[3]); // read the "OK\n" acknowledgement
while (true) {
  int len = dis.readInt();     // 32-bit big-endian length prefix
  byte[] data = new byte[len];
  dis.readFully(data);         // chunk payload, without metadata
  DoSomethingUsing(data);
}
---