blob: c0dd9288c6a69fc65145fa1fd57fb7a35c791523 [file] [log] [blame]
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~
Data Model
Chukwa Adaptors emit data in <Chunks>. A Chunk is a sequence of bytes,
with some metadata. Several of these are set automatically by the Agent or
Adaptors. Two of them require user intervention: <cluster name> and
<datatype>. Cluster name is specified in <conf/chukwa-agent-conf.xml>,
and is global to each Agent process. Datatype describes the expected format
of the data collected by an Adaptor instance, and it is specified when that
instance is started.
The following table lists the Chunk metadata fields:
*-------------*-------------------------------------+----------------------------------------------------------:
| Field | Meaning | Source |
*-------------*-------------------------------------+----------------------------------------------------------:
| Source | Hostname where Chunk was generated | Automatic |
*-------------*-------------------------------------+----------------------------------------------------------:
| Cluster | Cluster host is associated with | Specified by user in agent config |
*-------------*-------------------------------------+----------------------------------------------------------:
| Datatype | Format of output | Specified by user when Adaptor started |
*-------------*-------------------------------------+----------------------------------------------------------:
| Sequence ID | Offset of Chunk in stream | Automatic, initial offset specified when Adaptor started |
*-------------*-------------------------------------+----------------------------------------------------------:
| Name | Name of data source | Automatic, chosen by Adaptor |
*-------------*-------------------------------------+----------------------------------------------------------:
Conceptually, each Adaptor emits a semi-infinite stream of bytes, numbered
starting from zero. The sequence ID specifies how many bytes each Adaptor has
sent, including the current chunk. So if an adaptor emits a chunk containing
the first 100 bytes from a file, the sequenceID of that Chunk will be 100.
And the second hundred bytes will have sequence ID 200. This may seem a
little peculiar, but it's actually the same way that TCP sequence numbers work.
Adaptors need to take sequence ID as a parameter so that they can resume
correctly after a crash, and not send redundant data. When starting adaptors,
it's usually save to specify 0 as an ID, but it's sometimes useful to specify
something else. For instance, it lets you do things like only tail the second
half of a file.
HBase Schema
* Metrics
<chukwa> table stores time series data.
** Row Key
*------*------*------------*------------*
| | Day | Metric MD5 | Source MD5 |
*------*------*------------*------------*
| Size | 2 | 6 | 6 |
*------*------*------------*------------*
Row key is composed of 14 bytes data. First 2 bytes are day of the year.
The next 6 bytes are md5 signature of metrics name. The last 6 bytes are
md5 signature of data source. This arrangement helps Chukwa to partition
data evenly across regions base on time.
This arrangement provides a good condensed store for data of the same day
for the same source.
** Column Family
The column family format for Chukwa table are:
*---------------*-----------------------------------------------------------------:
| Column Family | Description |
*---------------*-----------------------------------------------------------------:
| t | Time series data. Column name is timestamp. Value is a string |
*---------------*-----------------------------------------------------------------:
| a | Annotation, string tags associated with time series data. |
*---------------*-----------------------------------------------------------------:
* Metadata
<chukwa_metadata> table is designed to store point lookup data. For example, small
amount of data to describe the metric name mapping for chukwa table. It is also used to
store JSON blob of dashboard data.
** Row Key
*----------------*------------------------------------------------------------------:
| Row Key | Description |
*----------------*------------------------------------------------------------------:
| [Metrics Group]| Metrics Group Name, this allows to fetch all metrics name from |
| | the group can be fetched from loading the row key. |
*----------------*------------------------------------------------------------------:
| chart_meta | All charts are stored in this row. |
*----------------*------------------------------------------------------------------:
| dashboard_meta | All dashboard are stored in this row. |
*----------------*------------------------------------------------------------------:
| widget_meta | All widgets are stored in this row. |
*----------------*------------------------------------------------------------------:
** Special Row
*----------------*------------------------------------------------------------------:
| chart_meta | Cell contains the rendering option and metric series name in |
| | a JSON blob |
*----------------*------------------------------------------------------------------:
| dashboard_meta | Cell describes one dashboard view |
*----------------*------------------------------------------------------------------:
| widget_meta | Cell describes title and URL of a dashboard widget |
*----------------*------------------------------------------------------------------:
** Column Family
*---------------*-------------------------------------------------------------------:
| Column Family | Description |
*---------------*-------------------------------------------------------------------:
| k | Key, associated with a fixed structure for describing key types |
| | and md5 signature of the key used in chukwa table. |
*---------------*-------------------------------------------------------------------:
| c | column for storing JSON blob for special rows. This column is |
| | used to store dashboard, chart, and widget metadata. |
*---------------*-------------------------------------------------------------------:
Key Types for k column Family, the current supported key types are:
*----------*----------------------------------------------------:
| Type | Description |
*----------*----------------------------------------------------:
| metric | This key is a metric name. |
*----------*----------------------------------------------------:
| source | This key is a source name. |
*----------*----------------------------------------------------: