| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| ~~ |
| |
| Pipeline Configuration Guide |
| |
| Basic Options |
| |
| Apache Chukwa pipelines are responsible for accepting incoming data from agents, |
| and for extracting, transforming, and loading the data into destination storage. |
| Most commonly, a pipeline simply writes all received data to HBase or HDFS. |
| |
| * HBase |
| |
| To enable streaming data to HBase, the chukwa pipeline can |
| be configured in <chukwa-agent-conf.xml>: |
| |
| --- |
| <property> |
| <name>chukwa.pipeline</name> |
| <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value> |
| </property> |
| --- |
| |
| In this mode, the HBase connection is configured in <chukwa-env.sh>. |
| HBASE_CONF_DIR should reference the HBase configuration directory so that the |
| Apache Chukwa agent can load <hbase-site.xml> from the classpath. |
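| |
| For example, a minimal <chukwa-env.sh> entry might look like the following |
| (the path shown is an assumption; use your own HBase configuration directory): |
| |
| --- |
| # Hypothetical example: point the agent at the HBase client configuration |
| # so that hbase-site.xml is picked up on the classpath. |
| export HBASE_CONF_DIR="/etc/hbase/conf" |
| --- |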
| |
| * HDFS |
| |
| To enable streaming data to HDFS, the chukwa pipeline can be configured in |
| <chukwa-agent-conf.xml>: |
| |
| --- |
| <property> |
| <name>chukwa.pipeline</name> |
| <value>org.apache.hadoop.chukwa.datacollection.writer.parquet.ChukwaParquetWriter</value> |
| </property> |
| --- |
| |
| In this mode, data is written to the HDFS cluster defined by the HADOOP_CONF_DIR |
| environment variable. |
| |
| The <chukwa.pipeline> property is the only option that you really need to |
| specify to get a working pipeline. |
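| |
| As with HBase, HADOOP_CONF_DIR is typically exported in <chukwa-env.sh> |
| (the path below is an assumption; substitute your Hadoop configuration |
| directory): |
| |
| --- |
| # Hypothetical example: point the agent at the Hadoop client configuration |
| # so the target HDFS cluster can be resolved. |
| export HADOOP_CONF_DIR="/etc/hadoop/conf" |
| --- |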
| |
| Advanced Options |
| |
| There are some advanced options, not necessarily documented in the |
| agent conf file, that are helpful in using Apache Chukwa in nonstandard ways. |
| While Apache Chukwa normally writes data files to HDFS, it is possible to |
| specify an alternate writer class. The option <chukwa.pipeline> specifies |
| a Java class to instantiate and use as a writer. See the <ChukwaWriter> |
| javadoc for details. |
| |
| One particularly useful pipeline class is <PipelineStageWriter>, which |
| lets you string together a series of <PipelineableWriters> |
| for pre-processing or post-processing incoming data; a sketch of a custom |
| stage appears below. |
| As an example, the <SocketTeeWriter> class allows other programs to receive |
| incoming chunks over a socket from the Apache Chukwa agent. |
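| |
| As a concrete illustration, here is a minimal sketch of a custom stage. It |
| assumes the <ChukwaWriter> contract (init, add, close) and the |
| <PipelineableWriter> base class with its protected <next> field, as described |
| in the javadoc; the package and class names are hypothetical. |
| |
| --- |
| package org.example.chukwa; // hypothetical package |
| |
| import java.util.List; |
| import org.apache.hadoop.chukwa.Chunk; |
| import org.apache.hadoop.chukwa.datacollection.writer.PipelineableWriter; |
| import org.apache.hadoop.chukwa.datacollection.writer.WriterException; |
| import org.apache.hadoop.conf.Configuration; |
| |
| public class PassThroughWriter extends PipelineableWriter { |
|   public void init(Configuration conf) throws WriterException { |
|     // Read any stage-specific options from conf here. |
|   } |
| |
|   public CommitStatus add(List<Chunk> chunks) throws WriterException { |
|     // Inspect or transform the chunks, then pass them to the next stage. |
|     return next.add(chunks); |
|   } |
| |
|   public void close() throws WriterException { |
|     next.close(); |
|   } |
| } |
| --- |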
| |
| Stages in the pipeline should be listed, comma-separated, in the |
| <chukwa.pipeline> option: |
| |
| --- |
| <property> |
| <name>chukwa.pipeline</name> |
| <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.parquet.ChukwaParquetWriter</value> |
| </property> |
| --- |
| |
| HBaseWriter |
| |
| HBaseWriter is the default writer for storing data in HBase. It runs Demux |
| parsers internally to convert unstructured data into semi-structured data, then |
| loads the resulting key-value pairs into HBase tables. HBaseWriter has the |
| following configuration options: |
| |
| * <<hbase.demux.package>> Demux parser class package. HBaseWriter scans this |
| package for annotated demux parser classes and uses them to validate the |
| HBase table schema. |
| |
| --- |
| <property> |
| <name>hbase.demux.package</name> |
| <value>org.apache.hadoop.chukwa.extraction.demux.processor</value> |
| </property> |
| --- |
| |
| * <<hbase.writer.verify.schema>> Verify the HBase table schema against the |
| demux parser schema; a warning is logged if there is a mismatch between the |
| HBase schema and the demux parsers. |
| |
| --- |
| <property> |
| <name>hbase.writer.verify.schema</name> |
| <value>false</value> |
| </property> |
| --- |
| |
| * <<hbase.writer.halt.on.schema.mismatch>> If this option is set to true |
| and the HBase table schema does not match the demux parser schema, the agent |
| will shut itself down. |
| |
| --- |
| <property> |
| <name>hbase.writer.halt.on.schema.mismatch</name> |
| <value>false</value> |
| </property> |
| --- |
| |
| SolrWriter |
| |
| <SolrWriter> writes chunks of data to a SolrCloud cluster. This writer is |
| designed to write log entries to Solr for full-text indexing. |
| SolrWriter can be enabled via the <chukwa.pipeline> property in |
| <chukwa-agent-conf.xml>. The Solr-specific settings are the ZooKeeper address |
| used to locate the SolrCloud leader, and the Solr collection in which to store |
| indexed data. |
| |
| --- |
| <property> |
| <name>solr.cloud.address</name> |
| <value>localhost:2181</value> |
| </property> |
| |
| <property> |
| <name>solr.collection</name> |
| <value>chukwa</value> |
| </property> |
| --- |
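| |
| For example, SolrWriter can be placed in the pipeline like any other writer. |
| The class name below assumes the writer lives in the |
| <org.apache.hadoop.chukwa.datacollection.writer.solr> package; check your |
| Chukwa distribution: |
| |
| --- |
| <property> |
| <name>chukwa.pipeline</name> |
| <value>org.apache.hadoop.chukwa.datacollection.writer.solr.SolrWriter</value> |
| </property> |
| --- |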
| |
| LocalWriter |
| |
| <LocalWriter> writes chunks of data to local disk, then uploads the file to |
| HDFS as a whole file. This writer is designed for high-throughput environments. |
| |
| * <<chukwaCollector.localOutputDir>> Location to buffer data before moving |
| data to HDFS. |
| |
| --- |
| <property> |
| <name>chukwaCollector.localOutputDir</name> |
| <value>/tmp/chukwa/logs</value> |
| </property> |
| --- |
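| |
| To use LocalWriter, reference it from the <chukwa.pipeline> property. The |
| class name below assumes the writer lives in the |
| <org.apache.hadoop.chukwa.datacollection.writer.localfs> package; check your |
| Chukwa distribution: |
| |
| --- |
| <property> |
| <name>chukwa.pipeline</name> |
| <value>org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalWriter</value> |
| </property> |
| --- |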
| |
| ChukwaParquetWriter |
| |
| The <ChukwaParquetWriter> streams chunks of data to HDFS. When a file has been |
| completely written, it is renamed with a <.done> suffix. ChukwaParquetWriter |
| has the following configuration options in <chukwa-agent-conf.xml>: |
| |
| * <<chukwaCollector.outputDir>> Location of the data sink directory for |
| collected data |
| |
| --- |
| <property> |
| <name>chukwaCollector.outputDir</name> |
| <value>/chukwa/logs/</value> |
| <description>Chukwa data sink directory</description> |
| </property> |
| --- |
| |
| * <<chukwaCollector.rotateInterval>> File rotation interval, in milliseconds |
| (the value of 300000 shown below is five minutes) |
| |
| --- |
| <property> |
| <name>chukwaCollector.rotateInterval</name> |
| <value>300000</value> |
| <description>Chukwa rotate interval (ms)</description> |
| </property> |
| --- |
| |
| SocketTeeWriter |
| |
| The <SocketTeeWriter> allows external processes to watch |
| the stream of chunks passing through the agent. This allows certain kinds |
| of real-time monitoring to be done on top of Apache Chukwa. |
| |
| SocketTeeWriter listens on a port (specified by the conf option |
| <chukwaCollector.tee.port>, defaulting to 9094). Applications |
| that want Chunks should connect to that port and issue a command of the form |
| <RAW|WRITABLE|HEADER <filter>\n>. Filters use the same syntax |
| as the {{{./programming.html#Reading+data+from+the+sink+or+the+archive}Dump command}}. |
| If the filter is accepted, the writer will respond |
| <OK\n>. |
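| |
| As with the other options, the listening port can be set in |
| <chukwa-agent-conf.xml>; for example, to state the default explicitly: |
| |
| --- |
| <property> |
| <name>chukwaCollector.tee.port</name> |
| <value>9094</value> |
| </property> |
| --- |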
| |
| Subsequently, Chunks matching the filter will be serialized and sent back |
| over the socket. Specifying "WRITABLE" will cause the chunks to be written |
| using Hadoop's Writable serialization framework. "RAW" will send the internal |
| data of the Chunk, without any metadata, prefixed by its length encoded as |
| a 32-bit int, big-endian. "HEADER" is similar to "RAW", but with a one-line |
| header in front of the content. The header consists of the following fields, |
| separated by spaces: |
| |
| --- |
| <hostname> <datatype> <stream name> <offset> |
| --- |
| |
| The filter will be deactivated when the socket is closed. For example, the |
| following client reads RAW chunks: |
| |
| --- |
| // Requires: java.net.Socket, java.io.DataInputStream |
| Socket s2 = new Socket("host", SocketTeeWriter.DEFAULT_PORT); |
| s2.getOutputStream().write("RAW datatype=XTrace\n".getBytes()); |
| DataInputStream dis = new DataInputStream(s2.getInputStream()); |
| dis.readFully(new byte[3]); // read the "OK\n" acknowledgement |
| while (true) { |
|   int len = dis.readInt(); // length prefix: big-endian 32-bit int |
|   byte[] data = new byte[len]; |
|   dis.readFully(data); // raw chunk data, without metadata |
|   doSomethingUsing(data); // placeholder for application logic |
| } |
| --- |
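| |
| In "WRITABLE" mode, each chunk is deserialized with Hadoop's Writable |
| mechanism instead of a length-prefixed byte array. A minimal sketch, assuming |
| <ChunkImpl> provides a static <read(DataInput)> factory as in the Chukwa |
| javadoc: |
| |
| --- |
| Socket s = new Socket("host", SocketTeeWriter.DEFAULT_PORT); |
| s.getOutputStream().write("WRITABLE datatype=XTrace\n".getBytes()); |
| DataInputStream in = new DataInputStream(s.getInputStream()); |
| in.readFully(new byte[3]); // read "OK\n" |
| while (true) { |
|   ChunkImpl chunk = ChunkImpl.read(in); // Writable deserialization |
|   doSomethingUsing(chunk); // placeholder for application logic |
| } |
| --- |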
| |