~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~
HDFS Storage Layout
Overview
This document describes how Apache Chukwa data is stored in HDFS and the processes that act on it.
HDFS File System Structure
The general layout of the Apache Chukwa filesystem is as follows.
---
/chukwa/
archivesProcessing/
dataSinkArchives/
demuxProcessing/
finalArchives/
logs/
postProcess/
repos/
rolling/
temp/
---
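As a minimal sketch, the layout above can be captured as a lookup table so that scripts do not hard-code paths. The helper name <chukwa_dirs> and the default root argument are illustrative assumptions, not part of Chukwa itself; only the directory names come from the listing.

```python
import posixpath

# Top-level directory names copied from the layout listing above.
CHUKWA_SUBDIRS = [
    "archivesProcessing", "dataSinkArchives", "demuxProcessing",
    "finalArchives", "logs", "postProcess", "repos", "rolling", "temp",
]


def chukwa_dirs(root="/chukwa"):
    """Map each top-level directory name to its full HDFS path.

    `root` is an assumed parameter; the document only shows /chukwa/.
    """
    return {name: posixpath.join(root, name) for name in CHUKWA_SUBDIRS}
```

For example, <chukwa_dirs()["logs"]> yields the path of the directory that receives the raw <*.chukwa> files.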
Raw Log Collection and Aggregation Workflow
What data is stored where is best described by stepping through the Apache Chukwa workflow.
[[1]] Agents write chunks to <logs/*.chukwa> files until a 64MB chunk size is reached or a given time interval has passed.
* <logs/*.chukwa>
[[2]] Agents close chunks and rename them to <*.done>.
* from <logs/*.chukwa>
* to <logs/*.done>
[[3]] DemuxManager checks for <*.done> files every 20 seconds.
    [[1]] If <*.done> files exist, it moves them into place for demux processing:
        * from: <logs/*.done>
        * to: <demuxProcessing/mrInput>
    [[2]] The Demux MapReduce job is run on the data in <demuxProcessing/mrInput>.
    [[3]] If demux succeeds within 3 attempts, it archives the completed files:
        * from: <demuxProcessing/mrOutput>
        * to: <dataSinkArchives/[yyyyMMdd]/*/*.done>
    [[4]] Otherwise, it moves the completed files to an error folder:
        * from: <demuxProcessing/mrOutput>
        * to: <dataSinkArchives/InError/[yyyyMMdd]/*/*.done>
[[4]] PostProcessManager wakes up every few minutes and aggregates, orders, and de-duplicates record files.
* from: <postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt>
* to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt>
[[5]] HourlyChukwaRecordRolling runs MapReduce jobs at 16 minutes past the hour to roll the 5-minute logs up into hourly files.
* from: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt>
* to: <temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]>
* to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt>
* leaves: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/>
[[6]] DailyChukwaRecordRolling runs MapReduce jobs at 1:30 AM to roll the hourly logs up into daily files.
* from: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt>
* to: <temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]>
* to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt>
* leaves: <repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/>
[[7]] ChukwaArchiveManager runs roughly every half hour, using MapReduce to aggregate the dataSinkArchives data and then remove it.
* from: <dataSinkArchives/[yyyyMMdd]/*/*.done>
* to: <archivesProcessing/mrInput>
* to: <archivesProcessing/mrOutput>
* to: <finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*>
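The path patterns in the workflow above can be sketched as two small helpers: one that picks the <dataSinkArchives> destination after a demux run, and one that builds the <repos> directory at each rolling level. The function names and parameters are hypothetical, paths are relative to the Chukwa root, and flooring <[mm]> to a 5-minute bucket is an assumption drawn from the "5-minute logs" wording rather than something this document states.

```python
from datetime import datetime
import posixpath

MAX_DEMUX_ATTEMPTS = 3  # step 3: demux must succeed "within 3 attempts"


def demux_archive_dir(day, succeeded, attempts):
    """Return the dataSinkArchives directory for a finished demux run."""
    stamp = day.strftime("%Y%m%d")  # the [yyyyMMdd] path component
    if succeeded and attempts <= MAX_DEMUX_ATTEMPTS:
        return posixpath.join("dataSinkArchives", stamp)
    return posixpath.join("dataSinkArchives", "InError", stamp)


def repos_dir(cluster, data_type, ts, level="minute"):
    """Build the repos directory for a record timestamp.

    level is "minute" (PostProcessManager output), "hourly", or "daily".
    """
    parts = ["repos", cluster, data_type, ts.strftime("%Y%m%d")]
    if level in ("minute", "hourly"):
        parts.append(ts.strftime("%H"))
    if level == "minute":
        # Assumption: [mm] is the start of a 5-minute bucket.
        parts.append("%02d" % (ts.minute - ts.minute % 5))
    return posixpath.join(*parts)
```

With a record stamped 2010-03-04 07:23 for a cluster named <c1> and data type <SysLog>, the three levels resolve to <repos/c1/SysLog/20100304/07/20>, <repos/c1/SysLog/20100304/07>, and <repos/c1/SysLog/20100304>.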
Log Directories Requiring Cleanup
The following directories will grow over time and will need to be periodically pruned:
* <finalArchives/[yyyyMMdd]/*>
* <repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt>
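One way to approach pruning is to select the <[yyyyMMdd]> directories that fall outside a retention window and then delete them with whatever HDFS tooling is in use. The sketch below covers only the selection step. The function name, the <keep_days> parameter, and the idea of a fixed retention window are assumptions, since this document does not prescribe a retention policy.

```python
from datetime import date, datetime, timedelta


def expired_day_dirs(day_names, today, keep_days):
    """Return the yyyyMMdd directory names older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in day_names:
        try:
            day = datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # skip entries that are not date directories
        if day < cutoff:
            expired.append(name)
    return expired
```

The caller supplies the directory names, for example from a listing of <finalArchives/>, so the function itself never touches HDFS and can be checked in isolation before anything is deleted.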