website/0.8.2/src/site/markdown/tutorial

Helix Tutorial: Helix Agent (for non-JVM systems)

Not every distributed system is written on the JVM, but many systems would benefit from the cluster management features that Helix provides. To make a non-JVM system work with Helix, you can use the Helix Agent module.

What is Helix Agent?

Helix is built on the following assumption: if your distributed resource is modeled by a finite state machine, then Helix can tell participants when they should transition between states. In the Java API, this means implementing transition callbacks. In the Helix agent API, this means providing commands than can run for each transition.

These commands could do anything behind the scenes; Helix only requires that they exit once the state transition is complete.

Configuring Transition Commands

Here's how to tell Helix which commands to run on state transitions:

Java

Using the Java API, first get a configuration scope (the Helix agent supports both cluster and resource scopes, picking resource first if it is available):

// Cluster scope
HelixConfigScope scope =
    new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER).forCluster(clusterName).build();

// Resource scope
HelixConfigScope scope =
    new HelixConfigScopeBuilder(ConfigScopeProperty.RESOURCE).forCluster(clusterName).forResource(resourceName).build();

Then, specify the command to run for each state transition:

// Get the configuration accessor
ConfigAccessor configAccessor = new ConfigAccessor(_gZkClient);

// Specify the script for OFFLINE --> ONLINE
CommandConfig.Builder builder = new CommandConfig.Builder();
CommandConfig cmdConfig =
    builder.setTransition("OFFLINE", "ONLINE").setCommand("simpleHttpClient.py OFFLINE-ONLINE")
        .setCommandWorkingDir(workingDir)
        .setCommandTimeout("5000L") // optional: ms to wait before failing
        .setPidFile(pidFile) // optional: for daemon-like systems that will write the process id to a file
        .build();
configAccessor.set(scope, cmdConfig.toKeyValueMap());

// Specify the script for ONLINE --> OFFLINE
builder = new CommandConfig.Builder();
cmdConfig =
    builder.setTransition("ONLINE", "OFFLINE").setCommand("simpleHttpClient.py ONLINE-OFFLINE")
        .setCommandWorkingDir(workingDir)
        .build();
configAccessor.set(scope, cmdConfig.toKeyValueMap());

// Specify NOP for OFFLINE --> DROPPED
builder = new CommandConfig.Builder();
cmdConfig =
    builder.setTransition("OFFLINE", "DROPPED")
        .setCommand(CommandAttribute.NOP.getName())
        .build();
configAccessor.set(scope, cmdConfig.toKeyValueMap());

In this example, we have a program called simpleHttpClient.py that we call for all transitions, only changing the arguments that are passed in. However, there is no requirement that each transition invoke the same program; this API allows running arbitrary commands in arbitrary directories with arbitrary arguments.

Notice that that for the OFFLINE --> DROPPED transition, we do not run any command (specifically, we specify the NOP command). This just tells Helix that the system doesn't care about when things are dropped, and it can consider the transition already done.

Command Line

It is also possible to configure everything directly from the command line. Here's how that would look for cluster-wide configuration:

# Specify the script for OFFLINE --> ONLINE
/helix-admin.sh --zkSvr localhost:2181 --setConfig CLUSTER clusterName OFFLINE-ONLINE.command="simpleHttpClient.py OFFLINE-ONLINE",OFFLINE-ONLINE.workingDir="/path/to/script", OFFLINE-ONLINE.command.pidfile="/path/to/pidfile"

# Specify the script for ONLINE --> OFFLINE
/helix-admin.sh --zkSvr localhost:2181 --setConfig CLUSTER clusterName ONLINE-OFFLINE.command="simpleHttpClient.py ONLINE-OFFLINE",ONLINE-OFFLINE.workingDir="/path/to/script", OFFLINE-ONLINE.command.pidfile="/path/to/pidfile"

# Specify NOP for OFFLINE --> DROPPED
/helix-admin.sh --zkSvr localhost:2181 --setConfig CLUSTER clusterName ONLINE-OFFLINE.command="nop"

Like in the Java configuration, it is also possible to specify a resource scope instead of a cluster scope:

# Specify the script for OFFLINE --> ONLINE
/helix-admin.sh --zkSvr localhost:2181 --setConfig RESOURCE clusterName,resourceName OFFLINE-ONLINE.command="simpleHttpClient.py OFFLINE-ONLINE",OFFLINE-ONLINE.workingDir="/path/to/script", OFFLINE-ONLINE.command.pidfile="/path/to/pidfile"

Starting the Agent

There should be an agent running for every participant you have running. Ideally, its lifecycle should match that of the participant. Here, we have a simple long-running participant called simpleHttpServer.py. Its only purpose is to record state transitions.

Here are some ways that you can start the Helix agent:

Java

// Start your application process
ExternalCommand serverCmd = ExternalCommand.start(workingDir + "/simpleHttpServer.py");

// Start the agent
Thread agentThread = new Thread() {
  @Override
  public void run() {
    while(!isInterrupted()) {
      try {
        HelixAgentMain.main(new String[] {
            "--zkSvr", zkAddr, "--cluster", clusterName, "--instanceName", instanceName,
            "--stateModel", "OnlineOffline"
        });
      } catch (InterruptedException e) {
        LOG.info("Agent thread interrupted", e);
        interrupt();
      } catch (Exception e) {
        LOG.error("Exception start helix-agent", e);
      }
    }
  }
};
agentThread.start();

// Wait for the process to terminate (either intentionally or unintentionally)
serverCmd.waitFor();

// Kill the agent
agentThread.interrupt();

Command Line

# Build Helix and start the agent
mvn clean install -DskipTests
chmod +x helix-agent/target/helix-agent-pkg/bin/*
helix-agent/target/helix-agent-pkg/bin/start-helix-agent.sh --zkSvr zkAddr1,zkAddr2 --cluster clusterName --instanceName instanceName --stateModel OnlineOffline

# Here, you can define your own logic to terminate this agent when your process terminates
...

Example

Here is a basic system that uses the Helix agent package.

Notes

As you may have noticed from the examples, the participant program and the state transition program are two different programs. The former is a long-running process that is directly tied to the Helix agent. The latter is a process that only exists while a state transition is underway. Despite this, these two processes should be intertwined. The transition command will need to communicate to the participant to actually complete the state transition and the participant will need to communicate whether or not this was successful. The implementation of this protocol is the responsibility of the system.