[docs] Update documentation to cover some recent improvements
diff --git a/README.md b/README.md
index 8f1fbcf..353aeb3 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@
  - Upgrade, suspend and delete deployments
  - Full logging and metrics integration
  - Flexible deployments and native integration with Kubernetes tooling
+ - Flink Job Autoscaler
 
 For the complete feature-set please refer to our [documentation](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/).
 
diff --git a/docs/content/docs/concepts/overview.md b/docs/content/docs/concepts/overview.md
index a67e091..d37d8d3 100644
--- a/docs/content/docs/concepts/overview.md
+++ b/docs/content/docs/concepts/overview.md
@@ -36,7 +36,7 @@
   - Stateful and stateless application upgrades
   - Triggering and managing savepoints
   - Handling errors, rolling-back broken upgrades
-- Multiple Flink version support: v1.13, v1.14, v1.15, v1.16
+- Multiple Flink version support: v1.13, v1.14, v1.15, v1.16, v1.17
 - [Deployment Modes]({{< ref "docs/custom-resource/overview#application-deployments" >}}):
   - Application cluster
   - Session cluster
@@ -52,6 +52,10 @@
 - POD augmentation via [Pod Templates]({{< ref "docs/custom-resource/pod-template" >}})
   - Native Kubernetes POD definitions
   - Layering (Base/JobManager/TaskManager overrides)
+- [Job Autoscaler]({{< ref "docs/custom-resource/autoscaler" >}})
+  - Collect lag and utilization metrics
+  - Scale job vertices to the ideal parallelism
+  - Scale up and down as the load changes
 ### Operations
 - Operator [Metrics]({{< ref "docs/operations/metrics-logging#metrics" >}})
   - Utilizes the well-established [Flink Metric System](https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics)
@@ -101,5 +105,5 @@
 ```
 
 ### AuditUtils can log sensitive information present in the custom resources
-As reported in [FLINK-30306](https://issues.apache.org/jira/browse/FLINK-30306) when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this. 
-Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.
\ No newline at end of file
+As reported in [FLINK-30306](https://issues.apache.org/jira/browse/FLINK-30306), when Flink custom resources change, the operator logs the change, which could include sensitive information. We suggest injecting secrets into the Flink containers at runtime to mitigate this.
+Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but users who only have access to the logs can now see it as well. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.
diff --git a/docs/content/docs/custom-resource/job-management.md b/docs/content/docs/custom-resource/job-management.md
index 2081096..45d0840 100644
--- a/docs/content/docs/custom-resource/job-management.md
+++ b/docs/content/docs/custom-resource/job-management.md
@@ -98,7 +98,7 @@
 The three upgrade modes are intended to support different scenarios:
 
  1. **stateless**: Stateless application upgrades from empty state
- 2. **last-state**: Quick upgrades in any application state (even for failing jobs), does not require a healthy job as it always uses the latest checkpoint information. Manual recovery may be necessary if HA metadata is lost.
+ 2. **last-state**: Quick upgrades in any application state (even for failing jobs); does not require a healthy job as it always uses the latest checkpoint information. Manual recovery may be necessary if HA metadata is lost. To limit how far the job may fall back when picking up the latest checkpoint, you can configure `kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age`. If the latest checkpoint is older than the configured value, a savepoint will be taken instead for healthy jobs (see the sketch after this list).
  3. **savepoint**: Use savepoint for upgrade, providing maximal safety and possibility to serve as backup/fork point. The savepoint will be created during the upgrade process. Note that the Flink job needs to be running to allow the savepoint to get created. If the job is in an unhealthy state, the last checkpoint will be used (unless `kubernetes.operator.job.upgrade.last-state-fallback.enabled` is set to `false`). If the last checkpoint is not available, the job upgrade will fail.
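+
+A minimal sketch of a last-state upgrade setup, assuming the age limit is applied per resource through the deployment's `flinkConfiguration` (the duration value is illustrative):
+
+```
+spec:
+  flinkConfiguration:
+    # If the latest checkpoint is older than this, a savepoint is taken instead (healthy jobs only)
+    kubernetes.operator.job.upgrade.last-state.max.allowed.checkpoint.age: "10 min"
+  job:
+    upgradeMode: last-state
+```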
 
 During stateful upgrades there are always cases which might require user intervention to preserve the consistency of the application. Please see the [manual Recovery section](#manual-recovery) for details.
@@ -214,6 +214,9 @@
 It is therefore very likely that savepoints live beyond the max age configuration.  
 {{< /hint >}}
 
+To disable savepoint cleanup by the operator, set `kubernetes.operator.savepoint.cleanup.enabled: false`.
+When savepoint cleanup is disabled, the operator still collects and populates the savepoint history but does not perform any dispose operations.
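+
+A minimal sketch, assuming the setting is applied per resource through the deployment's `flinkConfiguration` (it could equally be set globally in the operator configuration):
+
+```
+spec:
+  flinkConfiguration:
+    # Keep the savepoint history but never dispose old savepoints
+    kubernetes.operator.savepoint.cleanup.enabled: "false"
+```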
+
 ## Recovery of missing job deployments
 
 When HA is enabled, the operator can recover the Flink cluster deployments in cases when it was accidentally deleted
diff --git a/docs/content/docs/custom-resource/overview.md b/docs/content/docs/custom-resource/overview.md
index eb17b85..5c15c6a 100644
--- a/docs/content/docs/custom-resource/overview.md
+++ b/docs/content/docs/custom-resource/overview.md
@@ -87,7 +87,7 @@
  - `image` : Docker used to run Flink job and task manager processes
  - `flinkVersion` : Flink version used in the image (`v1_13`, `v1_14`, `v1_15`, `v1_16` ...)
  - `serviceAccount` : Kubernetes service account used by the Flink pods
- - `taskManager, jobManager` : Job and Task manager pod resource specs (cpu, memory, etc.)
+ - `taskManager, jobManager` : Job and Task manager pod resource specs (cpu, memory, ephemeralStorage)
  - `flinkConfiguration` : Map of Flink configuration overrides such as HA and checkpointing configs
  - `job` : Job Spec for Application deployments
 
@@ -158,7 +158,7 @@
 
 Standalone cluster deployment simply uses Kubernetes as an orchestration platform that the Flink cluster is running on. Flink is unaware that it is running on Kubernetes and therefore all Kubernetes resources need to be managed externally, by the Kubernetes Operator.
 
-In Standalone mode the Flink cluster doesn't have access to the Kubernetes cluster so this can increase security. If unknown or external code is being ran on the Flink cluster then Standalone mode adds another layer of security. 
+In Standalone mode the Flink cluster does not have access to the Kubernetes cluster, which can improve security. If unknown or external code is being run on the Flink cluster, Standalone mode adds another layer of security.
 
 The deployment mode can be set using the `mode` field in the deployment spec.
 
@@ -169,7 +169,7 @@
 spec:
   ...
   mode: standalone
-    
+
 
 ```
 
@@ -212,7 +212,7 @@
 
 ### Limitations
 
-- The LastState UpgradeMode have not been supported.
+- The last-state upgrade mode is currently not supported for FlinkSessionJobs.
 
 ## Further information
 
@@ -220,4 +220,3 @@
  - [Deployment customization and pod templates]({{< ref "docs/custom-resource/pod-template" >}})
  - [Full Reference]({{< ref "docs/custom-resource/reference" >}})
  - [Examples](https://github.com/apache/flink-kubernetes-operator/tree/main/examples)
-
diff --git a/docs/content/docs/custom-resource/pod-template.md b/docs/content/docs/custom-resource/pod-template.md
index 942dd21..5a428f1 100644
--- a/docs/content/docs/custom-resource/pod-template.md
+++ b/docs/content/docs/custom-resource/pod-template.md
@@ -104,3 +104,33 @@
 When using the operator with Flink native Kubernetes integration, please refer to [pod template field precedence](
 https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#fields-overwritten-by-flink).
 {{< /hint >}}
+
+## Array Merging Behaviour
+
+When layering pod templates (for example, defining both a top-level and a JobManager-specific pod template) the corresponding yamls are merged together.
+
+The default behaviour of the pod template mechanism is to merge arrays by merging the objects in the respective array positions.
+This requires that containers in the podTemplates are defined in the same order, otherwise the results may be undefined.
+
+Default behaviour (merge by position):
+
+```
+arr1: [{name: a, p1: v1}, {name: b, p1: v1}]
+arr1: [{name: a, p2: v2}, {name: c, p2: v2}]
+
+merged: [{name: a, p1: v1, p2: v2}, {name: c, p1: v1, p2: v2}]
+```
+
+The operator supports an alternative array merging mechanism that can be enabled by the `kubernetes.operator.pod-template.merge-arrays-by-name` flag.
+When true, instead of the default positional merging, object array elements that have a `name` property defined will be merged by their name and the resulting array will be a union of the two input arrays.
+
+Merge by name:
+
+```
+arr1: [{name: a, p1: v1}, {name: b, p1: v1}]
+arr1: [{name: a, p2: v2}, {name: c, p2: v2}]
+
+merged: [{name: a, p1: v1, p2: v2}, {name: b, p1: v1}, {name: c, p2: v2}]
+```
+
+Merging by name can be very convenient when merging container specs or when the base and override templates are not defined together.
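+
+As an illustrative sketch (container names, images and fields below are hypothetical), merging by name makes it easy to override the main container defined in the base template while adding a sidecar in the TaskManager template:
+
+```
+spec:
+  podTemplate:
+    spec:
+      containers:
+        # Base definition of the Flink main container
+        - name: flink-main-container
+          env:
+            - name: LOG_LEVEL
+              value: INFO
+  taskManager:
+    podTemplate:
+      spec:
+        containers:
+          # Merged by name with the base flink-main-container entry
+          - name: flink-main-container
+            env:
+              - name: LOG_LEVEL
+                value: DEBUG
+          # New element, appended to the merged array
+          - name: metrics-sidecar
+            image: example.com/metrics-agent:latest
+```
+
+With `kubernetes.operator.pod-template.merge-arrays-by-name: true` the resulting TaskManager pod contains both containers, regardless of the order in which they are listed.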
diff --git a/docs/content/docs/development/roadmap.md b/docs/content/docs/development/roadmap.md
index 9754538..94a8aa4 100644
--- a/docs/content/docs/development/roadmap.md
+++ b/docs/content/docs/development/roadmap.md
@@ -31,6 +31,6 @@
 
 ## What’s Next?
 
-- Standalone deployment mode support [FLIP-225](https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator)
-- Improved scaling and autoscaling support
 - Improved rollback mechanism and stability conditions
+- Autoscaler hardening and improvements
+- Support for in-place job rescaling with Flink 1.18
diff --git a/docs/content/docs/operations/health.md b/docs/content/docs/operations/health.md
new file mode 100644
index 0000000..50ff070
--- /dev/null
+++ b/docs/content/docs/operations/health.md
@@ -0,0 +1,69 @@
+---
+title: "Operator Health Monitoring"
+weight: 3
+type: docs
+aliases:
+- /operations/health.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Operator Health Monitoring
+
+## Health Probe
+
+The Flink Kubernetes Operator provides a built-in health endpoint that serves as the information source for Kubernetes liveness and startup probes.
+
+The liveness and startup probes are enabled by default in the Helm chart:
+
+```
+operatorHealth:
+  port: 8085
+  livenessProbe:
+    periodSeconds: 10
+    initialDelaySeconds: 30
+  startupProbe:
+    failureThreshold: 30
+    periodSeconds: 10
+```
+
+The health endpoint catches startup and informer errors that are exposed by the JOSDK framework. By default, if one of the watched namespaces becomes inaccessible, the health endpoint reports an error and the operator restarts.
+
+In some cases it is desirable to keep the operator running even if some namespaces are inaccessible. To allow the operator to start in this situation, you can disable the `kubernetes.operator.startup.stop-on-informer-error` flag.
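+
+A minimal sketch, assuming the flag is supplied through the operator Helm chart's `defaultConfiguration` values:
+
+```
+defaultConfiguration:
+  create: true
+  append: true
+  flink-conf.yaml: |+
+    # Do not fail the health check when a watched namespace cannot be accessed
+    kubernetes.operator.startup.stop-on-informer-error: false
+```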
+
+## Canary Resources
+
+The canary resource feature allows users to deploy special dummy resources (canaries) into selected namespaces. The operator health probe then monitors that these resources are reconciled in a timely manner. This allows the health probe to catch slowdowns and other general reconciliation issues that would otherwise go undetected.
+
+Canary deployments are identified by a special label: `"flink.apache.org/canary": "true"`. These resources do not need to define a spec; they do not start any pods or consume other cluster resources, and exist purely to verify the operator's reconciliation functionality.
+
+Canary FlinkDeployment:
+
+```
+apiVersion: flink.apache.org/v1beta1
+kind: FlinkDeployment
+metadata:
+  name: canary
+  labels:
+    "flink.apache.org/canary": "true"
+```
+
+The default timeout for reconciling the canary resources is 1 minute, controlled by `kubernetes.operator.health.canary.resource.timeout`. If the operator cannot reconcile the canaries within this time limit, it is marked unhealthy and will be automatically restarted.
+
+Canaries can be deployed into multiple namespaces.
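+
+For example, the same canary manifest can be applied once per watched namespace (the namespace name below is hypothetical):
+
+```
+apiVersion: flink.apache.org/v1beta1
+kind: FlinkDeployment
+metadata:
+  name: canary
+  namespace: team-a
+  labels:
+    "flink.apache.org/canary": "true"
+```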