| <!-- |
| * Licensed to the Apache Software Foundation (ASF) under one |
| * or more contributor license agreements. See the NOTICE file |
| * distributed with this work for additional information |
| * regarding copyright ownership. The ASF licenses this file |
| * to you under the Apache License, Version 2.0 (the |
| * "License"); you may not use this file except in compliance |
| * with the License. You may obtain a copy of the License at |
| * |
| * http://www.apache.org/licenses/LICENSE-2.0 |
| * |
| * Unless required by applicable law or agreed to in writing, software |
| * distributed under the License is distributed on an "AS IS" BASIS, |
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| * See the License for the specific language governing permissions and |
| * limitations under the License. |
| --> |
| |
| # Scheduler Interface Spec |
| |
| Authors: The Yunikorn Scheduler Authors |
| |
| ## Objective |
| |
| To define a standard interface that can be used by different types of resource management systems such as YARN/K8s. |
| |
| ### Goals for minimum viable product (MVP) |
| |
| - Interface and implementation should be resource manager (RM) agnostic. |
| - Interface can handle multiple types of resource managers from multiple zones, and different policies can be configured in different zones. |
| |
| Possible use cases: |
| - A large cluster needs multiple schedulers to achieve horizontally scalability. |
| - Multiple resource managers need to run on the same cluster. The managers grow and shrink according to runtime resource usage and policies. |
| |
| ### Non-Goals for minimum viable product (MVP) |
| |
| - Handle process-specific information: Scheduler Interface only handles decisions for scheduling instead of how containers will be launched. |
| |
| ## Design considerations |
| |
| Highlights: |
| - The scheduler should be as stateless as possible. It should try to eliminate any local persistent storage for scheduling decisions. |
| - When a RM starts, restarts or recovers the RM needs to sync its state with scheduler. |
| |
| ### Architecture |
| |
| ## Generic definitions |
| |
| Interface and messages generic definition. |
| |
| The syntax used for the declarations is `proto3`. The definition currently only provides go related info. |
| ```protobuf |
| syntax = "proto3"; |
| package si.v1; |
| |
| import "google/protobuf/descriptor.proto"; |
| |
| option go_package = "lib/go/si"; |
| |
| extend google.protobuf.FieldOptions { |
| // Indicates that a field MAY contain information that is sensitive |
| // and MUST be treated as such (e.g. not logged). |
| bool si_secret = 1059; |
| } |
| ``` |
| |
| ## Scheduler Interfaces |
| |
| There are two kinds of interfaces, the first one is RPC based communication, the second one is API based. |
| |
| RPC based, the [gRPC](https://grpc.io/) framework is used, will be useful when scheduler has to be deployed as a remote process. |
| For example when we need to deploy scheduler support multiple remote clusters. |
| A second example is when there is a cross language integration, like between Java and Go. |
| |
| Unless specifically required we strongly recommend the use of the API based interface to avoid the overhead of the RPC serialization and de-serialization. |
| |
| ### RPC Interface |
| |
| There are three sets of RPCs: |
| |
| * **Scheduler Service**: RM can communicate with the Scheduler Service and do resource allocation/request update, etc. |
| * **Admin Service**: Admin can communicate with Scheduler Interface and get configuration updated. |
| * **Metrics Service**: Used to retrieve state of scheduler by users / RMs. |
| |
| Currently only the design and implementation for the Scheduler Service is provided. |
| |
| ```protobuf |
| service Scheduler { |
| // Register a RM, if it is a reconnect from previous RM the call will |
| // trigger a cleanup of all in-memory data and resync with RM. |
| rpc RegisterResourceManager (RegisterResourceManagerRequest) |
| returns (RegisterResourceManagerResponse) { } |
| |
| // Update Scheduler status (this includes node status update, allocation request |
| // updates, etc. And receive updates from scheduler for allocation changes, |
| // any required status changes, etc. |
| // Update allocation request |
| rpc UpdateAllocation(stream AllocationRequest) |
| returns (stream AllocationResponse) { } |
| |
| // Update application request |
| rpc UpdateApplication(stream ApplicationRequest) |
| returns (stream ApplicationResponse) { } |
| |
| // Update node info |
| rpc UpdateNode(stream NodeRequest) |
| returns (stream NodeResponse) { } |
| } |
| ``` |
| #### Why bi-directional gRPC |
| |
| The reason of using bi-directional streaming gRPC is, according to performance benchmark: https://grpc.io/docs/guides/benchmarking.html latency is close to 0.5 ms. |
| The same performance benchmark shows streaming QPS can be 4x of non-streaming RPC. |
| Considering scheduler needs both throughput and better latency, we go with streaming API for scheduler related decisions. |
| |
| ### API Interface |
| |
| The API interface only relies on the message definition and not on other generated code as the RPC Interface does. |
| Below is an example of the Scheduler Service as defined in the RPC. The SchedulerAPI is bi-directional and can be a-synchronous. |
| For the asynchronous cases the API requires a callback interface to be implemented in the resource manager. |
| The callback must be provided to the scheduler as part of the registration. |
| |
| |
| ```golang |
| package api |
| |
| import "github.com/apache/yunikorn-scheduler-interface/lib/go/si" |
| |
| type SchedulerAPI interface { |
| // Register a new RM, if it is a reconnect from previous RM, cleanup |
| // all in-memory data and resync with RM. |
| RegisterResourceManager(request *si.RegisterResourceManagerRequest, callback ResourceManagerCallback) (*si.RegisterResourceManagerResponse, error) |
| |
| // Update allocation request |
| UpdateAllocation(request *si.AllocationRequest) error |
| |
| // Update application request |
| UpdateApplication(request *si.ApplicationRequest) error |
| |
| // Update node info |
| UpdateNode(request *si.NodeRequest) error |
| |
| // Notify scheduler to reload configuration and hot-refresh in-memory state based on configuration changes |
| UpdateConfiguration(request *si.UpdateConfigurationRequest) error |
| } |
| |
| // RM side needs to implement this API |
| type ResourceManagerCallback interface { |
| |
| //Receive Allocation Update Response |
| UpdateAllocation(response *si.AllocationResponse) error |
| |
| //Receive Application Update Response |
| UpdateApplication(response *si.ApplicationResponse) error |
| |
| //Receive Node Update Response |
| UpdateNode(response *si.NodeResponse) error |
| |
| // Run a certain set of predicate functions to determine if a proposed allocation |
| // can be allocated onto a node. |
| Predicates(args *si.PredicatesArgs) error |
| |
| // Run predicate functions to determine if a proposed allocation can be allocated |
| // onto a node after preemption. The request contains a list of allocations to |
| // speculatively remove. |
| PreemptionPredicates(args *si.PreemptionPredicatesArgs) *si.PreemptionPredicatesResponse |
| |
| // This plugin is responsible for transmitting events to the shim side. |
| // Events can be further exposed from the shim. |
| SendEvent(events []*si.EventRecord) |
| |
| // Scheduler core can update container scheduling state to the RM, |
| // the shim side can determine what to do incorporate with the scheduling state |
| |
| // update container scheduling state to the shim side |
| // this might be called even the container scheduling state is unchanged |
| // the shim side cannot assume to only receive updates on state changes |
| // the shim side implementation must be thread safe |
| UpdateContainerSchedulingState(request *si.UpdateContainerSchedulingStateRequest) |
| } |
| |
| // RM can additionally implement this API to provide information during state dumps |
| type StateDumpPlugin interface { |
| |
| // This plugin is responsible for returning a JSON representation of the state of the shim |
| GetStateDump() (string, error) |
| } |
| ``` |
| |
| ### Communications between RM and Scheduler |
| |
| Lifecycle of RM-Scheduler communication |
| |
| ``` |
| Status of RM in scheduler: |
| |
| Connection timeout |
| +-------+ +-------+ loss +-------+ +---------+ |
| |init |+---->|Running|+---->|Paused |+---->| Stopped | |
| +-------+ +----+--+ +-------+ +---------+ |
| RM register | ^ |
| with scheduler | | |
| +-----------------------------+ |
| RM voluntarilly |
| Shutdown |
| ``` |
| |
| #### RM register with scheduler |
| |
| When a new RM starts, fails, it will register with scheduler. In some cases, scheduler can ask RM to re-register because of connection issues or other internal issues. |
| |
| ```protobuf |
| message RegisterResourceManagerRequest { |
| // An ID which can uniquely identify a RM **cluster**. (For example, if a RM cluster has multiple manager instances for HA purpose, they should use the same information when do registration). |
| // If RM register with the same ID, all previous scheduling state in memory will be cleaned up, and expect RM report full scheduling state after registration. |
| string rmID = 1; |
| |
| // Version of RM scheduler interface client. |
| string version = 2; |
| |
| // Policy group name: |
| // This defines which policy to use. Policy should be statically configured. (Think about network security group concept of ec2). |
| // Different RMs can refer to the same policyGroup if their static configuration is identical. |
| string policyGroup = 3; |
| |
| // Pass the build information of k8shim to core. |
| map<string, string> buildInfo = 4; |
| |
| // Pass the serialized configuration for this policyGroup to core. |
| string config = 5; |
| |
| // Additional configuration key/value pairs for configuration not related to the policyGroup. |
| map<string, string> extraConfig = 6; |
| } |
| |
| // Upon success, scheduler returns RegisterResourceManagerResponse to RM, otherwise RM receives exception. |
| message RegisterResourceManagerResponse { |
| // Intentionally empty. |
| } |
| ``` |
| |
| #### RM and scheduler updates. |
| |
| Below is overview of how scheduler/RM keep connection and updates. |
| |
| ```protobuf |
| message AllocationRequest { |
| // New allocation requests or replace existing allocation request (if allocationID is same) |
| repeated AllocationAsk asks = 1; |
| |
| // Allocations can be released. |
| AllocationReleasesRequest releases = 2; |
| |
| // ID of RM, this will be used to identify which RM of the request comes from. |
| string rmID = 3; |
| } |
| |
| message ApplicationRequest { |
| // RM should explicitly add application when allocation request also explictly belongs to application. |
| // This is optional if allocation request doesn't belong to a application. (Independent allocation) |
| repeated AddApplicationRequest new = 1; |
| |
| // RM can also remove applications, all allocation/allocation requests associated with the application will be removed |
| repeated RemoveApplicationRequest remove = 2; |
| |
| // ID of RM, this will be used to identify which RM of the request comes from. |
| string rmID = 3; |
| } |
| |
| message NodeRequest { |
| // New node can be scheduled. If a node is notified to be "unscheduable", it needs to be part of this field as well. |
| repeated NodeInfo nodes = 1; |
| |
| // ID of RM, this will be used to identify which RM of the request comes from. |
| string rmID = 2; |
| } |
| |
| message AllocationResponse { |
| // New allocations |
| repeated Allocation new = 1; |
| |
| // Released allocations, this could be either ack from scheduler when RM asks to terminate some allocations. |
| // Or it could be decision made by scheduler (such as preemption or timeout). |
| repeated AllocationRelease released = 2; |
| |
| // Released allocation asks(placeholder), when the placeholder allocation times out |
| repeated AllocationAskRelease releasedAsks = 3; |
| |
| // Rejected allocation requests |
| repeated RejectedAllocationAsk rejected = 4; |
| } |
| |
| message ApplicationResponse { |
| // Rejected Applications |
| repeated RejectedApplication rejected = 1; |
| |
| // Accepted Applications |
| repeated AcceptedApplication accepted = 2; |
| |
| // Updated Applications |
| repeated UpdatedApplication updated = 3; |
| } |
| |
| message NodeResponse { |
| // Rejected Node Registrations |
| repeated RejectedNode rejected = 1; |
| |
| // Accepted Node Registrations |
| repeated AcceptedNode accepted = 2; |
| } |
| |
| message UpdatedApplication { |
| // The application ID that was updated |
| string applicationID = 1; |
| // State of the application |
| string state = 2; |
| // Timestamp of the state transition |
| int64 stateTransitionTimestamp = 3; |
| // Detailed message |
| string message = 4; |
| } |
| |
| message RejectedApplication { |
| // The application ID that was rejected |
| string applicationID = 1; |
| // A human-readable reason message |
| string reason = 2; |
| } |
| |
| message AcceptedApplication { |
| // The application ID that was accepted |
| string applicationID = 1; |
| } |
| |
| message RejectedNode { |
| // The node ID that was rejected |
| string nodeID = 1; |
| // A human-readable reason message |
| string reason = 2; |
| } |
| |
| message AcceptedNode { |
| // The node ID that was accepted |
| string nodeID = 1; |
| } |
| ``` |
| |
| #### Ask for more resources |
| |
| Lifecycle of AllocationAsk: |
| |
| ``` |
| Rejected by Scheduler |
| +-------------------------------------------+ |
| | | |
| | v |
| +-------+---+ Asked +-----------+Scheduler or,+-----------+ |
| |Initial +------->|Pending |+----+----+->|Rejected | |
| +-----------+By RM +-+---------+ Asked by RM +-----------+ |
| + |
| | |
| v |
| +-----------+ |
| |Allocated | |
| +-----------+ |
| ``` |
| |
| Lifecycle of Allocations: |
| |
| ``` |
| +--Allocated by |
| v Scheduler |
| +-----------+ +------------+ |
| |Allocated |+------ |Completed | |
| +---+-------+ Stoppe +------------+ |
| | by RM |
| | +------------+ |
| +--------------->|Preempted | |
| + Preempted by +------------+ |
| | Scheduler |
| | |
| | |
| | +------------+ |
| +--------------->|Expired | |
| Timeout +------------+ |
| (Part of Allocation |
| ask) |
| ``` |
| |
| Common fields for allocation: |
| |
| ```protobuf |
| |
| // A sparse map of resource to Quantity. |
| message Resource { |
| map<string, Quantity> resources = 1; |
| } |
| |
| // Quantity includes a single int64 value |
| message Quantity { |
| int64 value = 1; |
| } |
| ``` |
| |
| Allocation ask: |
| |
| ```protobuf |
| message AllocationAsk { |
| // Allocation key is used by both of scheduler and RM to track allocations. |
| // It doesn't have to be same as RM's internal allocation id (such as Pod name of K8s or ContainerID of YARN). |
| // Allocations from the same AllocationAsk which are returned to the RM at the same time will have the same allocationKey. |
| // The request is considered an update of the existing AllocationAsk if an ALlocationAsk with the same allocationKey |
| // already exists. |
| string allocationKey = 1; |
| // The application ID this allocation ask belongs to |
| string applicationID = 2; |
| // The partition the application belongs to |
| string partitionName = 3; |
| // The amount of resources per ask |
| Resource resourceAsk = 4; |
| // Maximum number of allocations |
| int32 maxAllocations = 5; |
| // Priority of ask |
| int32 priority = 6; |
| // Execution timeout: How long this allocation will be terminated (by scheduler) |
| // once allocated by scheduler, 0 or negative value means never expire. |
| int64 executionTimeoutMilliSeconds = 7; |
| // A set of tags for this spscific AllocationAsk. Allocation level tags are used in placing this specific |
| // ask on nodes in the cluster. These tags are used in the PlacementConstraints. |
| // These tags are optional. |
| map<string, string> tags = 8; |
| // The name of the TaskGroup this ask belongs to |
| string taskGroupName = 9; |
| // Is this a placeholder ask (true) or a real ask (false), defaults to false |
| // ignored if the taskGroupName is not set |
| bool placeholder = 10; |
| // Is this ask the originator of the application? |
| bool Originator = 11; |
| // The preemption policy for this ask |
| PreemptionPolicy preemptionPolicy = 12; |
| } |
| ``` |
| |
| Preemption policy: |
| |
| ```protobuf |
| message PreemptionPolicy { |
| // Opt-out from preemption |
| bool allowPreemptSelf = 1; |
| // Allow preemption of other tasks with same or lower priority |
| bool allowPreemptOther = 2; |
| } |
| ``` |
| |
| Application requests: |
| |
| ```protobuf |
| message AddApplicationRequest { |
| // The ID of the application, must be unique |
| string applicationID = 1; |
| // The queue this application is requesting. The scheduler will place the application into a |
| // queue according to policy, taking into account the requested queue as per the policy. |
| string queueName = 2; |
| // The partition the application belongs to |
| string partitionName = 3; |
| // The user group information of the application owner |
| UserGroupInformation ugi = 4; |
| // A set of tags for the application. These tags provide application level generic inforamtion. |
| // The tags are optional and are used in placing an appliction or scheduling. |
| // Application tags are not considered when processing AllocationAsks. |
| map<string, string> tags = 5; |
| // Execution timeout: How long this application can be in a running state |
| // 0 or negative value means never expire. |
| int64 executionTimeoutMilliSeconds = 6; |
| // The total amount of resources gang placeholders will request |
| Resource placeholderAsk = 7; |
| // Gang scheduling style can be hard (the application will fail after placeholder timeout) |
| // or soft (after the timeout the application will be scheduled as a normal application) |
| string gangSchedulingStyle = 8; |
| } |
| |
| message RemoveApplicationRequest { |
| // The ID of the application to remove |
| string applicationID = 1; |
| // The partition the application belongs to |
| string partitionName = 2; |
| } |
| ``` |
| |
| User information: |
| The user that owns the application. Group information can be empty. If the group information is empty the groups will be resolved by the scheduler when needed. |
| ```protobuf |
| message UserGroupInformation { |
| // the user name |
| string user = 1; |
| // the list of groups of the user, can be empty |
| repeated string groups = 2; |
| } |
| ``` |
| |
| ### Allocation of resources |
| |
| The Allocation message is used in two cases: |
| 1. A recovered allocation send from the RM to the scheduler |
| 2. A newly created allocation from the scheduler. |
| |
| ```protobuf |
| message Allocation { |
| // AllocationKey from AllocationAsk |
| string allocationKey = 1; |
| // Allocation tags from AllocationAsk |
| map<string, string> allocationTags = 2; |
| // UUID of the allocation |
| string UUID = 3; |
| |
| // Resource for each allocation |
| Resource resourcePerAlloc = 5; |
| // Priority of ask |
| int32 priority = 6; |
| // Node which the allocation belongs to |
| string nodeID = 8; |
| |
| // The ID of the application |
| string applicationID = 9; |
| // Partition of the allocation |
| string partitionName = 10; |
| // The name of the TaskGroup this allocation belongs to |
| string taskGroupName = 11; |
| // Is this a placeholder allocation (true) or a real allocation (false), defaults to false |
| // ignored if the taskGroupName is not set |
| bool placeholder = 12; |
| |
| reserved 7; |
| reserved "queueName"; |
| } |
| ``` |
| |
| #### Release previously allocated resources |
| |
| ```protobuf |
| message AllocationReleasesRequest { |
| // The allocations to release |
| repeated AllocationRelease allocationsToRelease = 1; |
| // The asks to release |
| repeated AllocationAskRelease allocationAsksToRelease = 2; |
| } |
| |
| enum TerminationType { |
| UNKNOWN_TERMINATION_TYPE = 0;//TerminationType not set |
| STOPPED_BY_RM = 1; // Stopped or killed by ResourceManager (created by RM) |
| TIMEOUT = 2; // Timed out based on the executionTimeoutMilliSeconds (created by core) |
| PREEMPTED_BY_SCHEDULER = 3; // Preempted allocation by scheduler (created by core) |
| PLACEHOLDER_REPLACED = 4; // Placeholder allocation replaced by real allocation (created by core) |
| } |
| |
| // Release allocation: this is a bidirectional message. The Terminationtype defines the origin, or creator, |
| // as per the comment. The confirmation or response from the receiver is the same message with the same |
| // termination type set. |
| message AllocationRelease { |
| |
| // The name of the partition the allocation belongs to |
| string partitionName = 1; |
| // The application the allocation belongs to |
| string applicationID = 2; |
| // The UUID of the allocation to release, if not set all allocations are released for |
| // the applicationID |
| string UUID = 3; |
| // Termination type of the released allocation |
| TerminationType terminationType = 4; |
| // human-readable message |
| string message = 5; |
| // AllocationKey from AllocationAsk |
| string allocationKey = 6; |
| } |
| |
| // Release ask |
| message AllocationAskRelease { |
| // Which partition to release the ask from, required. |
| string partitionName = 1; |
| // optional, when this is set, filter allocation key by application id. |
| // when application id is set and allocationKey is not set, release all allocations key under the application id. |
| string applicationID = 2; |
| // optional, when this is set, only release allocation ask by specified |
| string allocationKey = 3; |
| // Termination type of the released allocation ask |
| TerminationType terminationType = 4; |
| // For human-readable message |
| string message = 5; |
| } |
| ``` |
| |
| #### Schedulable nodes registration and updates |
| |
| State transition of node: |
| |
| ``` |
| +-----------+ +--------+ +-------+ |
| |SCHEDULABLE|+-------->|DRAINING|+---------->|REMOVED| |
| +-----------+ +--------+ +-------+ |
| ^ Asked by + Aasked by |
| | RM to DRAIN | RM to REMOVE |
| | | |
| +---------------------+ |
| Asked by RM to |
| SCHEDULE again |
| ``` |
| |
| See protocol below: |
| |
| During new node registration with the scheduler, request will be rejected if the node exist already. |
| While updating registered node with the scheduler, request will fail if the node doesn't exist. |
| ```protobuf |
| message NodeInfo { |
| // Action from RM |
| enum ActionFromRM { |
| |
| //ActionFromRM not set |
| UNKNOWN_ACTION_FROM_RM = 0; |
| |
| // Create Node |
| CREATE = 1; |
| |
| // Update node resources, attributes. |
| UPDATE = 2; |
| |
| // Do not allocate new allocations on the node. |
| DRAIN_NODE = 3; |
| |
| // Decomission node, it will immediately stop allocations on the node and |
| // remove the node from schedulable lists. |
| DECOMISSION = 4; |
| |
| // From Draining state to SCHEDULABLE state. |
| // If node is not in draining state, error will be thrown |
| DRAIN_TO_SCHEDULABLE = 5; |
| } |
| |
| // ID of node, the node must exist to be updated |
| string nodeID = 1; |
| |
| // Action to perform by the scheduler |
| ActionFromRM action = 2; |
| |
| // New attributes of node, which will replace previously reported attribute. |
| map<string, string> attributes = 3; |
| |
| // new schedulable resource, scheduler may preempt allocations on the |
| // node or schedule more allocations accordingly. |
| Resource schedulableResource = 4; |
| |
| // when the scheduler is co-exist with some other schedulers, some node |
| // resources might be occupied (allocated) by other schedulers. |
| Resource occupiedResource = 5; |
| |
| // Allocated resources, this will be added when node registered to RM (recovery) |
| repeated Allocation existingAllocations = 6; |
| } |
| ``` |
| |
| #### Feedback from Scheduler |
| |
| Following is feedback from scheduler to RM: |
| |
| When allocation ask rejected by scheduler, information will be shared by scheduler. |
| |
| ```protobuf |
| message RejectedAllocationAsk { |
| string allocationKey = 1; |
| // The ID of the application |
| string applicationID = 2; |
| // A human-readable reason message |
| string reason = 3; |
| } |
| ``` |
| |
| ### Following are constant of spec |
| |
| Scheduler Interface attributes start with the si prefix. Such constants are for example known attribute names for nodes and applications. |
| |
| ```constants |
| // Constants for node attributes |
| const ( |
| ARCH = "si/arch" |
| HostName = "si/hostname" |
| RackName = "si/rackname" |
| OS = "si/os" |
| InstanceType = "si/instance-type" |
| FailureDomainZone = "si/zone" |
| FailureDomainRegion = "si/region" |
| LocalImages = "si/local-images" |
| NodePartition = "si/node-partition" |
| ) |
| |
| // Constants for allocation attributes |
| const ( |
| ApplicationID = "si/application-id" |
| ContainerImage = "si/container-image" |
| ContainerPorts = "si/container-ports" |
| ) |
| ``` |
| |
| Allocation tags are key-value pairs, where the key should contain a domain, and optionally a group part. |
| These parts should precede the name of the key (and should be in that order) and separated by a "/" character. |
| Example allocation key: `kubernetes.io/meta/namespace`. |
| |
| ```constants |
| // Constants for allocation tags |
| const ( |
| // Domains |
| DomainK8s = "kubernetes.io/" |
| DomainYuniKorn = "yunikorn.apache.org/" |
| |
| // Groups |
| GroupMeta = "meta/" |
| GroupLabel = "label/" |
| GroupAnnotation = "annotation/" |
| |
| // Keys |
| KeyPodName = "podName" |
| KeyNamespace = "namespace" |
| KeyRequiredNode = "requiredNode" |
| KeyAllowPreemption = "allowPreemption" |
| |
| // Pods |
| CreationTime = "creationTime" |
| ) |
| ``` |
| |
| Miscellaneous constants for resources and other values. |
| |
| ```constants |
| // Constants for Core and Shim |
| const ( |
| Memory = "memory" |
| CPU = "vcore" |
| AppTagNamespaceResourceQuota = "namespace.resourcequota" |
| AppTagNamespaceResourceGuaranteed = "namespace.resourceguaranteed" |
| AppTagStateAwareDisable = "application.stateaware.disable" |
| NodeReadyAttribute = "ready" |
| ) |
| ``` |
| |
| ### Scheduler plugin |
| |
| SchedulerPlugin is a way to extend scheduler capabilities. Scheduler shim can implement such plugin and register itself to |
| yunikorn-core, so plugged function can be invoked in the scheduler core. |
| |
| ```protobuf |
| message PredicatesArgs { |
| // allocation key identifies a container, the predicates function is going to check |
| // if this container is eligible to be placed ont to a node. |
| string allocationKey = 1; |
| // the node ID the container is assigned to. |
| string nodeID = 2; |
| // run the predicates for alloactions (true) or reservations (false) |
| bool allocate = 3; |
| } |
| |
| message PreemptionPredicatesArgs { |
| // the allocation key of the container to check |
| string allocationKey = 1; |
| // the node ID the container should be attempted to be scheduled on |
| string nodeID = 2; |
| // a list of existing allocations that should be tentatively removed before checking |
| repeated string preemptAllocationKeys = 3; |
| // index of last allocation in starting attempt (first attempt should be 0..startIndex) |
| int32 startIndex = 4; |
| } |
| |
| message PreemptionPredicatesResponse { |
| // whether or not container will schedule on the node |
| bool success = 1; |
| // index of last allocation which was removed before success (ignored during failure) |
| int32 index = 2; |
| } |
| |
| message UpdateContainerSchedulingStateRequest { |
| // container scheduling states |
| enum SchedulingState { |
| //SchedulingState not set |
| UNKNOWN_SCHEDULING_STATE = 0; |
| // the container is being skipped by the scheduler |
| SKIPPED = 1; |
| // the container is scheduled and it has been assigned to a node |
| SCHEDULED = 2; |
| // the container is reserved on some node, but not yet assigned |
| RESERVED = 3; |
| // scheduler has visited all candidate nodes for this container |
| // but non of them could satisfy this container's requirement |
| FAILED = 4; |
| } |
| |
| // application ID |
| string applicartionID = 1; |
| |
| // allocation key used to identify a container. |
| string allocationKey = 2; |
| |
| // container scheduling state |
| SchedulingState state = 3; |
| |
| // an optional plain message to explain why it is in such state |
| string reason = 4; |
| } |
| |
| message UpdateConfigurationRequest { |
| // RM ID to update |
| string rmID = 2; |
| |
| // PolicyGroup to update |
| string policyGroup = 3; |
| |
| // New configuration to update |
| string config = 4; |
| |
| // Additional configuration key/value pairs for configuration not related to the policyGroup. |
| map<string, string> extraConfig = 5; |
| |
| reserved 1; |
| reserved "configs"; |
| } |
| ``` |
| |
| #### Event Plugin |
| |
| The Event Cache is a SchedulerPlugin that exposes events about scheduler objects aiming to help the end user to |
| see these events from the shim side. An event is sent to the shim through the callback as an `EventRecord`. |
| An `EventRecord` consists of the following fields: |
| |
| ```protobuf |
| message EventRecord { |
| enum Type { |
| //EventRecord Type not set |
| UNKNOWN_EVENTRECORD_TYPE = 0; |
| REQUEST = 1; |
| APP = 2; |
| NODE = 3; |
| QUEUE = 4; |
| } |
| |
| enum ChangeType { |
| NONE = 0; |
| SET = 1; |
| ADD = 2; |
| REMOVE = 3; |
| } |
| |
| enum ChangeDetail { |
| DETAILS_NONE = 0; |
| |
| REQUEST_CANCEL = 100; // Request cancelled by the RM |
| REQUEST_ALLOC = 101; // Request allocated |
| REQUEST_TIMEOUT = 102; // Request cancelled due to timeout |
| |
| APP_ALLOC = 200; // Allocation changed |
| APP_REQUEST = 201; // Request changed |
| APP_REJECT = 202; // Application rejected on create |
| APP_NEW = 203; // Application added with state new |
| APP_ACCEPTED = 204; // State change to accepted |
| APP_STARTING = 205; // State change to starting |
| APP_RUNNING = 206; // State change to running |
| APP_COMPLETING = 207; // State change to completing |
| APP_COMPLETED = 208; // State change to completed |
| APP_FAILING = 209; // State change to failing |
| APP_FAILED = 210; // State change to failed |
| |
| NODE_DECOMISSION = 300; // Node removal |
| NODE_READY = 301; // Node ready state change |
| NODE_SCHEDULABLE = 302; // Node schedulable state change (cordon) |
| NODE_ALLOC = 303; // Allocation changed |
| NODE_CAPACITY = 304; // Capacity changed |
| NODE_OCCUPIED = 305; // Occupied resource changed |
| |
| QUEUE_CONFIG = 400; // Managed queue update or removal |
| QUEUE_DYNAMIC = 401; // Dynamic queue update or removal |
| QUEUE_TYPE = 402; // Queue type change |
| QUEUE_MAX = 403; // Max resource changed |
| QUEUE_GUARANTEED = 404; // Guaranteed resource changed |
| QUEUE_APP = 405; // Application changed |
| QUEUE_ALLOC = 406; // Allocation changed |
| |
| ALLOC_CANCEL = 500; // Allocation cancelled by the RM |
| ALLOC_PREEMPT = 501; // Allocation preempted by the core |
| ALLOC_TIMEOUT = 502; // Allocation cancelled due to timeout |
| ALLOC_REPLACED = 503; // Allocation replacement (placeholder) |
| ALLOC_NODEREMOVED = 504; // Allocation cancelled, node removal |
| } |
| |
| // the type of the object associated with the event |
| Type type = 1; |
| // ID of the object associated with the event |
| string objectID = 2; |
| // the detailed message as string |
| string message = 5; |
| // timestamp of the event |
| int64 timestampNano = 6; |
| // the type of the change |
| ChangeType eventChangeType = 7; |
| // details about the change |
| ChangeDetail eventChangeDetail = 8; |
| // the secondary object in the event (eg. allocation UUID, request ID) |
| string referenceID = 9; |
| // the resource value if the change involves setting/modifying a resource |
| Resource resource = 10; |
| |
| reserved 3; |
| reserved "groupID"; |
| reserved 4; |
| reserved "reason"; |
| } |
| ``` |