# KIF - Kill It (with) Fire
## A simple find-and-fix program with a YAML configuration
Kif is a simple monitoring program that detects programs running amok
and tries to correct them. It can currently scan for:
- Memory usage (MB, GB or % of total mem available)
- No. of open file descriptors
- No. of TCP connections open
- No. of LAN TCP connections open
- Age of process
- State of process (running, waiting, zombie etc)
and act accordingly, either by running a custom command (such as restarting
a service) or by killing the process with any preferred signal. It can also
notify you of issues found and actions taken, by email or HipChat.
See [kif.yaml](kif.yaml) for example configuration and features.
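
The two kinds of action described above (running a list of commands, or killing a process with a chosen signal) might be dispatched roughly as follows. This is a minimal sketch, not Kif's actual code: `act_on_process` is a hypothetical helper, the `rule` dict simply reuses the `runlist`, `kill` and `killwith` keys from kif.yaml, the `debug` flag imitates a detect-only run, and falling back to SIGTERM when `killwith` is absent is an assumption.

```python
import os
import shlex
import signal
import subprocess


def act_on_process(pid, rule, debug=False):
    """Apply a rule's actions to one PID and return a list describing them.

    Hypothetical helper: `rule` is a plain dict using the README's keys
    (`runlist`, `kill`, `killwith`).
    """
    actions = []
    for command in rule.get("runlist", []):
        actions.append(f"run: {command}")
        if not debug:
            # Run without a shell; a non-zero exit status does not raise.
            subprocess.run(shlex.split(command), check=False)
    if rule.get("kill"):
        # Assumption: default to SIGTERM if no signal number is configured.
        signum = int(rule.get("killwith", signal.SIGTERM))
        actions.append(f"kill {pid} with signal {signum}")
        if not debug:
            os.kill(pid, signum)
    return actions
```

Calling it with `debug=True` only reports what would happen, which makes it safe to try against a real rule before running as root.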
### Requirements
- Python 3.6 or higher
- python-yaml
- python-psutil
- asfpy
### Installation and use
- Download Kif
- Make a kif.yaml configuration (see the [example](kif.yaml))
- Install the dependencies with: `pip3 install -r requirements.txt` (or use pipenv)
- Run as root (required to both read usage and restart services).
- Enjoy!
### Installing via pipservice
To install on an infra node, add the following YAML snippet to that node's configuration:
~~~yaml
pipservice:
    kif:
        tag: master
~~~
### Rule syntax
```yaml
rules:
    apache:
        description: 'sample apache process rule'
        # We can specify the exact cmdline and args to scan for:
        procid:
            - '/usr/sbin/apache2'
            - '-k'
            - 'start'
        # We'll use combine: true to combine the resource of multiple processes into one check.
        combine: true
        triggers:
            # Demand no more than 500 LAN connections
            maxlocalconns: 500
            # No more than 25,000 open connections in total
            maxconns: 25000
            # Require < 1GB memory used (could also be 10%, 512mb etc)
            maxmemory: 1gb
            # And finally, no more than 65,000 open file descriptors
            maxfds: 65000
        # If triggered, run this:
        runlist:
            - 'service apache2 restart'
    zombies:
        description: 'Any process caught in zombie mode'
        # use empty procid to catch all
        procid: ''
        triggers:
            # This can be any process state (zombie, sleeping, running, etc)
            state: 'zombie'
        # No runlist here, just kill it with signal 9
        kill: true
        killwith: 9
    puppet:
        description: 'kill -9 puppet agents that are hanging'
        procid: 'puppet agent'
        # Find all processes created more than 1 day ago.
        triggers:
            maxage: 1d
        # Ignore main process
        ignorepidfile: '/var/run/puppet/agent.pid'
        # Kill it with signal 9
        kill: true
        killwith: 9
```
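
Trigger values such as `1gb`, `512mb`, `10%`, `1d` and `30m` have to be turned into numbers before they can be compared with a process's usage. The sketch below shows one way that could work, and how `combine: true` changes a memory check. It is illustrative only: the function names are hypothetical, the use of binary multiples (1gb = 1024³ bytes) is an assumption, and Kif's own parser may accept other units or round differently.

```python
import re

# Assumption: binary multiples (1gb = 1024**3 bytes); Kif may count differently.
_MEMORY_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}
_AGE_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_memory(limit, total_bytes):
    """Convert a limit such as '1gb', '512mb' or '10%' into a byte count."""
    text = str(limit).strip().lower()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(%|kb|mb|gb)", text)
    if not match:
        raise ValueError(f"unrecognised memory limit: {limit!r}")
    number, unit = float(match.group(1)), match.group(2)
    if unit == "%":
        # Percentages are relative to the total memory of the machine.
        return int(total_bytes * number / 100)
    return int(number * _MEMORY_UNITS[unit])


def parse_age(limit):
    """Convert an age such as '30m' or '1d' into seconds."""
    match = re.fullmatch(r"(\d+)\s*([smhd])", str(limit).strip().lower())
    if not match:
        raise ValueError(f"unrecognised age limit: {limit!r}")
    return int(match.group(1)) * _AGE_UNITS[match.group(2)]


def exceeds_memory(usages, limit, total_bytes, combine=False):
    """Return True if the memory limit is exceeded.

    With combine=True the usage of all matching processes is summed before
    the comparison; otherwise each process is checked on its own.
    """
    threshold = parse_memory(limit, total_bytes)
    if combine:
        return sum(usages) > threshold
    return any(usage > threshold for usage in usages)
```

For example, two apache processes using 600 MB each stay under a `1gb` limit individually, but exceed it once `combine` is enabled.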
### Restricting rules to certain machines
To restrict a rule to certain nodes, add the rule to kif.yaml and use the `host_must_match` or `host_must_not_match` definitions to narrow down where it runs, like so:
~~~yaml
zombies_on_gitbox:
    description: 'Any gitweb process caught in zombie mode'
    host_must_match: gitbox.apache.org
    procid: '/usr/bin/git'
    triggers:
        # This can be any process state (zombie, sleeping, running, etc)
        # Or a git process > 30 minutes old.
        state: 'zombie'
        maxage: 30m
    kill: true
    killwith: 9
httpd_but_not_tlpserver:
    description: 'httpd too many backend connections (pool filling up?)'
    host_must_not_match: 'tlp-.+'
    procid: '/usr/sbin/apache2'
    # Use combine: true to combine the resource of multiple processes into one check.
    combine: true
    triggers:
        maxlocalconns: 1000
    runlist:
        - 'service apache2 restart'
~~~
Both `host_must_match` and `host_must_not_match` are regular expressions and must match the full hostname.
Be sure to double-escape regex escape sequences, for instance `\\d` instead of `\d`, or the YAML will break. `host_must_match` and `host_must_not_match` can also be combined to include some nodes and rule out others.
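
The host filtering described above can be pictured as two full-hostname regex checks. The following is a hedged sketch rather than Kif's implementation: `rule_applies_to_host` is a hypothetical function, using `re.fullmatch` is an assumption based on the "must match the full hostname" wording, and so is taking the local name from `socket.getfqdn()`.

```python
import re
import socket


def rule_applies_to_host(rule, hostname=None):
    """Decide whether a rule dict should run on the given host.

    Hypothetical helper: `rule` uses the README's `host_must_match` and
    `host_must_not_match` keys, both treated as full-match regexes.
    """
    # Assumption: fall back to this machine's fully qualified name.
    hostname = hostname or socket.getfqdn()
    must = rule.get("host_must_match")
    must_not = rule.get("host_must_not_match")
    if must and not re.fullmatch(must, hostname):
        return False
    if must_not and re.fullmatch(must_not, hostname):
        return False
    return True
```

A rule with neither key applies everywhere; a rule with both keys applies only to hosts that match the first pattern and do not match the second.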
### Command line arguments
- `--debug`: Run in debug mode - detect but don't try to fix issues.
- `--config $filename`: Path to the config file.
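
For readers wiring these flags into their own wrapper script, the two options could be declared with `argparse` as sketched below. This is an assumption-laden illustration, not Kif's source: `parse_args` is a hypothetical function, and the default config path of `kif.yaml` is a guess, not a documented default.

```python
import argparse


def parse_args(argv=None):
    """Parse the two command line options listed in this README."""
    parser = argparse.ArgumentParser(description="Kif - Kill It (with) Fire")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="detect issues but do not try to fix them",
    )
    parser.add_argument(
        "--config",
        metavar="FILENAME",
        default="kif.yaml",  # assumption: the real default may differ
        help="path to config file",
    )
    return parser.parse_args(argv)
```

Passing a list such as `["--debug", "--config", "/etc/kif.yaml"]` lets the parser be exercised without touching `sys.argv`.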