blob: 08d8abe02c3ecda133f29345251fbde77949c12a [file] [log] [blame] [view]
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
Health Monitoring REST API
===
Knox provides REST-ful API for monitoring the core service. It primarily exposes the health of the Knox service that includes service status (up/down) as well as other health metrics. This is a work-in-progress feature, which started with an extensible framework to support basic functionalities. In particular, it currently supports the API to A) *ping* the service and B) time-based statistics related to all API calls.
#### Health Monitoring Setup
The basic setup includes two major steps A) add configurations to enable the metrics collection and reporting B) write a topology file and upload it into *topologies* directory.
##### Service Configurations
At first, we need to make sure the gateway configurations to gather and report to JMX are turned on in *gateway-site.xml*. The following two configurations into *gateway-site.xml* will serve the purpose.
```
<property>
<name>gateway.metrics.enabled</name>
<value>true</value>
<description>Boolean flag indicates whether to enable the metrics collection</description>
</property>
<property>
<name>gateway.jmx.metrics.reporting.enabled</name>
<value>true</value>
<description>Boolean flag indicates whether to enable the metrics reporting using JMX</description>
</property>
```
##### health.xml Topology
In order to enable health monitoring REST service, you need to add a new topology file (i.e. *health.xml*). The following is an example that is configured to test the basic functionalities of Knox service. It is highly recommended using more restricted authentication mechanism.
```
<topology>
<gateway>
<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<!--
session timeout in minutes, this is really idle timeout,
defaults to 30 mins, if the property value is not defined,,
current client authentication would expire if client idles continuously for more than this value
-->
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.knox.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapContextFactory</name>
<value>org.apache.knox.gateway.shirorealm.KnoxLdapContextFactory</value>
</param>
<param>
<name>main.ldapRealm.contextFactory</name>
<value>$ldapContextFactory</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://localhost:33389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>
<provider>
<role>authorization</role>
<name>AclsAuthz</name>
<enabled>false</enabled>
<param>
<name>knox.acl</name>
<value>admin;*;*</value>
</param>
</provider>
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>false</enabled>
</provider>
<provider>
<role>hostmap</role>
<name>static</name>
<enabled>true</enabled>
<param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></param>
</provider>
</gateway>
<service>
<role>HEALTH</role>
</service>
</topology>
```
Just as with any Knox service, the gateway providers protect the health monitoring REST service defined above it. In this case, the ShiroProvider is taking care of HTTP Basic Auth using LDAP. Once the user authenticates with LDAP, the request processing continues to the *Health* service that will perform the necessary actions.
The authenticate/federation provider can be swapped out to fit your deployment environment.
After creating the file health.xml with above contents, you need to copy the file to *KNOX_HOME/conf/topologies* directory. If Knox/gateway service is not running, you can start it using "*bin/gateway.sh start*". Otherwise the service would automatically pick this new '*health*' service. When gateway service registers the new service, it displays the following log messages in *log/gateway.log*.
```
2017-08-22 03:44:25,045 INFO knox.gateway (GatewayServer.java:handleCreateDeployment(677)) - Deploying topology health to /home/joe/knox/knox-0.12.0/bin/../data/deployments/health.topo.15e080a91c0
2017-08-22 03:44:25,045 INFO knox.gateway (GatewayServer.java:internalDeactivateTopology(596)) - Deactivating topology health
2017-08-22 03:44:25,119 INFO knox.gateway (DefaultGatewayServices.java:initializeContribution(197)) - Creating credential store for the cluster: health
2017-08-22 03:44:25,142 INFO knox.gateway (GatewayServer.java:internalActivateTopology(566)) - Activating topology health
2017-08-22 03:44:25,142 INFO knox.gateway (GatewayServer.java:internalActivateArchive(576)) - Activating topology health archive %2F
```
##### Verify
Once the health service is active, you can verify it by using the following *curl* command. The '*ping*' end point displays if the service is up. This end point can be utilized for monitoring the basic health of a Knox service.
```
$ curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/health/v1/ping'
HTTP/1.1 200 OK
Date: Tue, 22 Aug 2017 07:09:37 GMT
Set-Cookie: JSESSIONID=1o82bcvoqbhbb1apt7zs8ubybb;Path=/gateway/health;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/health; Max-Age=0; Expires=Mon, 21-Aug-2017 07:09:37 GMT
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/plain; charset=ISO-8859-1
Content-Length: 3
Server: Jetty(9.2.15.v20160210)
OK
```
To retrieve the meaningful metrics details of various service calls, you may need to run multiple REST calls such as the followings. After that, execute the metrics REST call as shown below with a sample output. As shown, metrics output is returned in JSON format.
```
curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'
```
```
$ curl -i -k -u guest:guest-password -X GET 'https://localhost:8445/gateway/health/v1/metrics?pretty=true'
HTTP/1.1 200 OK
Date: Tue, 22 Aug 2017 07:10:44 GMT
Set-Cookie: JSESSIONID=kqntcdaje9uai3pup7ffvfw4;Path=/gateway/health;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/health; Max-Age=0; Expires=Mon, 21-Aug-2017 07:10:44 GMT
Content-Type: application/json
Cache-Control: must-revalidate,no-cache,no-store
Transfer-Encoding: chunked
Server: Jetty(9.2.15.v20160210)
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : {
"client./gateway/health/v1/metrics.GET-requests" : {
"count" : 5,
"max" : 0.624587973,
"mean" : 0.027655743001736188,
"min" : 0.006145587,
"p50" : 0.010020548,
"p75" : 0.010020548,
"p95" : 0.074454725,
"p98" : 0.624587973,
"p99" : 0.624587973,
"p999" : 0.624587973,
"stddev" : 0.0929226225229978,
"m15_rate" : 2.657500857422334E-7,
"m1_rate" : 5.770087852901534E-89,
"m5_rate" : 4.769163772973399E-19,
"mean_rate" : 4.0952378345310894E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/health/v1/ping.GET-requests" : {
"count" : 1,
"max" : 0.017257638000000002,
"mean" : 0.017257638000000002,
"min" : 0.017257638000000002,
"p50" : 0.017257638000000002,
"p75" : 0.017257638000000002,
"p95" : 0.017257638000000002,
"p98" : 0.017257638000000002,
"p99" : 0.017257638000000002,
"p999" : 0.017257638000000002,
"stddev" : 0.0,
"m15_rate" : 0.18710139700632353,
"m1_rate" : 0.0735758882342885,
"m5_rate" : 0.1637461506155964,
"mean_rate" : 0.014990517517814805,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/sandbox/health/v1/.GET-requests" : {
"count" : 1,
"max" : 4.01873E-4,
"mean" : 4.01873E-4,
"min" : 4.01873E-4,
"p50" : 4.01873E-4,
"p75" : 4.01873E-4,
"p95" : 4.01873E-4,
"p98" : 4.01873E-4,
"p99" : 4.01873E-4,
"p999" : 4.01873E-4,
"stddev" : 0.0,
"m15_rate" : 2.536740427767808E-7,
"m1_rate" : 7.074903404511115E-90,
"m5_rate" : 4.081014139447941E-19,
"mean_rate" : 8.179827684854002E-5,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/sandbox/v1/health/.GET-requests" : {
"count" : 1,
"max" : 5.470700000000001E-4,
"mean" : 5.470700000000001E-4,
"min" : 5.470700000000001E-4,
"p50" : 5.470700000000001E-4,
"p75" : 5.470700000000001E-4,
"p95" : 5.470700000000001E-4,
"p98" : 5.470700000000001E-4,
"p99" : 5.470700000000001E-4,
"p999" : 5.470700000000001E-4,
"stddev" : 0.0,
"m15_rate" : 2.413022137213267E-7,
"m1_rate" : 3.341947732164585E-90,
"m5_rate" : 3.512561421726287E-19,
"mean_rate" : 8.149518570285245E-5,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/sandbox/webhdfs/v1/.GET-requests" : {
"count" : 4,
"max" : 0.463745401,
"mean" : 0.024924118143299912,
"min" : 0.016542244,
"p50" : 0.024799078000000002,
"p75" : 0.033933548,
"p95" : 0.033933548,
"p98" : 0.033933548,
"p99" : 0.033933548,
"p999" : 0.033933548,
"stddev" : 0.007284773511002474,
"m15_rate" : 2.120680068580741E-8,
"m1_rate" : 4.7541228609699333E-91,
"m5_rate" : 1.5806080232092864E-20,
"mean_rate" : 2.7314359915623396E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"service./gateway/sandbox/webhdfs/v1/.get-requests" : {
"count" : 3,
"max" : 0.014635496000000001,
"mean" : 0.00342438191233768,
"min" : 0.0020088890000000002,
"p50" : 0.0020088890000000002,
"p75" : 0.005144646,
"p95" : 0.005144646,
"p98" : 0.005144646,
"p99" : 0.005144646,
"p999" : 0.005144646,
"stddev" : 0.0015604555820128599,
"m15_rate" : 1.9913776931949195E-8,
"m1_rate" : 3.1334281325640874E-91,
"m5_rate" : 1.055281734633953E-20,
"mean_rate" : 2.0486339070804923E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
}
}
```
#### REST End Points
As mentioned above, currently Knox provides a few monitoring APIs to start with. The list will gradually grow to support new use-cases.
##### /ping
This end-point can be used to determine if a Knox gateway service is alive or not. It is useful for basic health monitoring of the core service. Although most of the results of REST calls are in JSON format, this one (*/ping*) is in plain text.
Sample response
```
OK
```
##### /metrics
This end-point returns all Knox metrics grouped by individual call type. For example, timer metrics for all *webhdfs* calls are aggregated into one set of metrics and then returned in a separate JSON element. This end-point also supports an option (*/metrics?pretty=true*) to pretty print the metrics output.
A sample response with *pretty=true* is shown below:
```
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : {
"client./gateway/health/v1/ping.GET-requests" : {
"count" : 1,
"max" : 0.017257638000000002,
"mean" : 0.017257638000000002,
"min" : 0.017257638000000002,
"p50" : 0.017257638000000002,
"p75" : 0.017257638000000002,
"p95" : 0.017257638000000002,
"p98" : 0.017257638000000002,
"p99" : 0.017257638000000002,
"p999" : 0.017257638000000002,
"stddev" : 0.0,
"m15_rate" : 0.18710139700632353,
"m1_rate" : 0.0735758882342885,
"m5_rate" : 0.1637461506155964,
"mean_rate" : 0.014990517517814805,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/sandbox/v1/health/.GET-requests" : {
"count" : 1,
"max" : 5.470700000000001E-4,
"mean" : 5.470700000000001E-4,
"min" : 5.470700000000001E-4,
"p50" : 5.470700000000001E-4,
"p75" : 5.470700000000001E-4,
"p95" : 5.470700000000001E-4,
"p98" : 5.470700000000001E-4,
"p99" : 5.470700000000001E-4,
"p999" : 5.470700000000001E-4,
"stddev" : 0.0,
"m15_rate" : 2.413022137213267E-7,
"m1_rate" : 3.341947732164585E-90,
"m5_rate" : 3.512561421726287E-19,
"mean_rate" : 8.149518570285245E-5,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/sandbox/webhdfs/v1/.GET-requests" : {
"count" : 4,
"max" : 0.463745401,
"mean" : 0.024924118143299912,
"min" : 0.016542244,
"p50" : 0.024799078000000002,
"p75" : 0.033933548,
"p95" : 0.033933548,
"p98" : 0.033933548,
"p99" : 0.033933548,
"p999" : 0.033933548,
"stddev" : 0.007284773511002474,
"m15_rate" : 2.120680068580741E-8,
"m1_rate" : 4.7541228609699333E-91,
"m5_rate" : 1.5806080232092864E-20,
"mean_rate" : 2.7314359915623396E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
}
}
```