KIF - Kill It (with) Fire. Janitorial service for ASF Infra

Clone this repo:

Branches

  1. 8a2c12d jenkins ccos uses slave.jar instead of remoting.jar by clambertus · 5 weeks ago master
  2. 2094ed6 Switch to safe_load which is guaranteed to work by Daniel Gruno · 3 months ago
  3. 759caa9 Rename kif.service to pipservice-kif.service by Daniel Gruno · 3 months ago
  4. afdf1f7 Update README.md by Daniel Gruno · 3 months ago
  5. 95903e7 Update README.md by Daniel Gruno · 3 months ago

KIF - Kill It (with) Fire

A simple find-and-fix program with a yaml configuration

Kif is a simple monitoring program that detects programs running amok and tries to correct them. It can currently scan for:

  • Memory usage (MB, GB or % of total mem available)
  • No. of open file descriptors
  • No. of TCP connections open
  • No. of LAN TCP connections open
  • Age of process
  • State of process (running, waiting, zombie etc)

and act accordingly, either running a custom command (such as restarting a service) or killing it with any preferred signal. It can also notify you of issues found and actions taken, either via email or hipchat.

See kif.yaml for example configuration and features.

Requirements

  • python 3.6 or higher
  • python-yaml
  • python-psutil
  • asfpy

Installation and use

  • Download Kif
  • Make a kif.yaml configuration (see the example)
  • Install the dependencies with: pip3 install -r requirements.txt (or use pipenv)
  • Run as root (required to both read usage and restart services).
  • Enjoy!

Installing via pipservice

To install on an infra node, add the following yaml snippet to it:

pipservice:
  kif:
    tag: master

Rule syntax:

rules:
    apache:
        description:     'sample apache process rule'
        # We can specify the exact cmdline and args to scan for:
        procid: 
            - '/usr/sbin/apache2'
            - '-k'
            - 'start'
        # We'll use combine: true to combine the resource of multiple processes into one check.
        combine:            true
        triggers:
            # Demand no more than 500 LAN connections
            maxlocalconns:  500
            # No more than 25,000 open connections in total
            maxconns:       25000
            # Require < 1GB memory used (could also be 10%, 512mb etc)
            maxmemory:      1gb
            # And finally, no more than 65,000 open file descriptors
            maxfds:         65000
        # If triggered, run this:
        runlist:
            - 'service apache2 restart'
            
    zombies:
        description:    'Any process caught in zombie mode'
        # use empty procid to catch all
        procid:         ''
        triggers:
            # This can be any process state (zombie, sleeping, running, etc)
            state:      'zombie'
        # No runlist here, just kill it with signal 9
        kill:           true
        killwith:       9
        
    puppet:
        description:    'kill -9 puppet agents that are hanging'
        procid: 'puppet agent'
        # Find all processes created more than 1 day ago.
        triggers:
            maxage: 1d
        # Ignore main process
        ignorepidfile:  '/var/run/puppet/agent.pid'
        # Kill it with signal 9
        kill:           true
        killwith:       9

Restricting rules to certain machines

To have a specific rule run on certain nodes, please add the rule to kif.yaml, and make use of host_must_match or host_must_not_match definitions to narrow down where to run the rule-set, like so:

  zombies_on_gitbox:
    description:    'Any gitweb process caught in zombie mode'
    host_must_match: gitbox.apache.org
    procid:         '/usr/bin/git'
    triggers:
        # This can be any process state (zombie, sleeping, running, etc)
        # Or a git process > 30 minutes old.
        state:      'zombie'
        maxage:      30m
    kill:           true
    killwith: 9
  
  httpd_but_not_tlpserver:
    description:         'httpd too many backend connections (pool filling up?)'
    host_must_not_match: 'tlp-.+'
    procid:              '/usr/sbin/apache2'
    # Use combine: true to combine the resource of multiple processes into one check.
    combine:             true
    triggers:
        maxlocalconns:   1000
    runlist:
        - 'service apache2 restart'

Both host_must_match and host_must_not_match are regular expressions and must match the full hostname. Be sure to use double escaping for keywords, for instance \\d instead of \d, or the yaml will break. The must/must-not can also be used in combination to include some nodes and rule out others.

Command line arguments

  • --debug: Run in debug mode - detect but don't try to fix issues.
  • --config $filename: path to config file.