| % |
| % Licensed to the Apache Software Foundation (ASF) under one |
| % or more contributor license agreements. See the NOTICE file |
| % distributed with this work for additional information |
| % regarding copyright ownership. The ASF licenses this file |
| % to you under the Apache License, Version 2.0 (the |
| % "License"); you may not use this file except in compliance |
| % with the License. You may obtain a copy of the License at |
| % |
| % http://www.apache.org/licenses/LICENSE-2.0 |
| % |
| % Unless required by applicable law or agreed to in writing, |
| % software distributed under the License is distributed on an |
| % "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| % KIND, either express or implied. See the License for the |
| % specific language governing permissions and limitations |
| % under the License. |
| % |
| % Create well-known link to this spot for HTML version |
| \ifpdf |
| \else |
| \HCode{<a name='DUCC_OVERVIEW'></a>} |
| \fi |
| \chapter{DUCC Overview} |
| |
| \section{What is DUCC?} |
| |
| DUCC stands for Distributed UIMA Cluster Computing. DUCC is a cluster management system |
| providing tooling, management, and scheduling facilities to automate the scale-out of |
| applications written to the UIMA framework. |
| |
| Core UIMA provides a generalized framework for applications that process unstructured |
| information such as human language, but does not provide a scale-out mechanism. UIMA-AS provides |
| a scale-out mechanism to distribute UIMA pipelines over a cluster of computing resources, but |
| does not provide job or cluster management of the resources. DUCC defines a formal job model |
| that closely maps to a standard UIMA pipeline. Around this job model DUCC provides cluster |
| management services to automate the scale-out of UIMA pipelines over computing clusters. |
| |
| \section{DUCC Job Model} |
| |
| The Job Model defines the steps necessary to scale out a UIMA pipeline using DUCC. The goal of |
| DUCC is to allow the application logic to remain unchanged. |
| |
| The DUCC Job model consists of standard UIMA components: a Collection Reader (CR), a CAS |
| Multiplier (CM), application logic as implemented in one or more Analysis Engines (AE), and a CAS |
| Consumer (CC). In theory, any CR or CM will work with DUCC, but DUCC is designed for scale-out, and |
| to achieve good scale-out these components must be constructed in a specific way. |
| |
| The Collection Reader builds input CASs and forwards them to the UIMA pipelines. In the DUCC |
| model, the CR is run in a process separate from the rest of the pipeline. In fact, in all but the |
| smallest clusters it is run on a different physical machine than the rest of the pipeline. To |
| achieve scalability, the CR must create very small CASs that do not contain application data, |
| but which contain references to data; for instance, file names. Ideally, the CR should be |
| runnable in a process not much larger than the smallest Java virtual machine. Later sections |
| demonstrate methods for achieving this. |
| |
| Each pipeline must contain at least one CAS Multiplier which receives the CASs from the CR. The |
| CMs encapsulate the knowledge of how to receive the data references in the small CASs received |
| from the CRs and deliver the referenced data to the application pipeline. DUCC packages the CM, |
| AE(s), and CC into a single process, multiple instances of which are then deployed over the |
| cluster. |
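
The division of labor between the CR and the CM described above can be sketched in plain
Python. This is an illustration of the idea only and does not use the UIMA API: the
\texttt{WorkItem} class and both function names are hypothetical stand-ins. The reader emits
small records that carry a file name, and only the multiplier, which runs inside the
scaled-out pipeline process, opens the file.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical stand-in for a small CAS: a reference to data, not the data."""
    path: str


def collection_reader(directory: str) -> Iterator[WorkItem]:
    """Yield one reference per file without opening any file."""
    for name in sorted(os.listdir(directory)):
        yield WorkItem(path=os.path.join(directory, name))


def cas_multiplier(item: WorkItem) -> Iterator[str]:
    """Resolve the reference and emit one document per line of the file."""
    with open(item.path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def demo() -> List[str]:
    """Write two small files, then read them through the reader and multiplier."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in (("a.txt", "one\ntwo\n"), ("b.txt", "three\n")):
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as out:
                out.write(text)
        return [doc
                for item in collection_reader(tmp)
                for doc in cas_multiplier(item)]
```

Because the reader never touches file contents, its memory footprint stays roughly constant
regardless of how large the referenced data is.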
| |
| DUCC does not provide any mechanism for receiving output CASs. Each application must |
| supply its own CAS Consumer which serializes the output of the pipelines for |
| consumption by other entities (as serialized CASs, for example). |
| |
| A DUCC job therefore consists of a small specification containing the following items: |
| |
| \begin{itemize} |
| \item The name of a resource containing the CR descriptor. |
| \item The name of a resource containing the CM descriptor. |
| \item The name of a resource containing the AE descriptor. |
| \item The name of a resource containing the CC descriptor. |
| \item Other information required to parametrize the above and identify the job |
| such as log directory, working directory, desired scale-out, etc. These are |
| described in detail in subsequent sections. |
| \end{itemize} |
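
As a rough illustration of how small such a specification is, the sketch below holds one as a
Python dictionary and renders it as key/value lines. The key names (for example
\texttt{driver\_descriptor\_CR}), the descriptor names, and the directory values are
assumptions made for this example; the authoritative property names are those given in the
subsequent sections.

```python
def render_job_spec(spec: dict) -> str:
    """Render a job specification as key=value lines, one per item."""
    return "\n".join(f"{key}={value}" for key, value in spec.items())


# All keys and values below are illustrative placeholders.
example_spec = {
    "driver_descriptor_CR": "org.example.MyCollectionReader",
    "process_descriptor_CM": "org.example.MyCasMultiplier",
    "process_descriptor_AE": "org.example.MyAnalysisEngine",
    "process_descriptor_CC": "org.example.MyCasConsumer",
    "log_directory": "/home/someuser/ducc/logs",
    "working_directory": "/home/someuser/work",
}

if __name__ == "__main__":
    print(render_job_spec(example_spec))
```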
| |
| On job submission, DUCC examines the job specification and automatically creates a scaled-out |
| UIMA-AS service with a single process executing the CR as a UIMA-AS client and as many |
| processes as possible executing the combined CM, AE, and CC pipeline as UIMA-AS service |
| instances. |
| |
| DUCC provides other facilities in support of scale-out: |
| \begin{itemize} |
| \item The ability to reserve all or part of a node in the cluster. |
| \item Automated management of services required in support of jobs. |
| \item The ability to schedule and execute arbitrary processes on nodes in the |
| cluster. |
| \item Debugging tools and support. |
| \item A web server to display and manage work and cluster status. |
| \item A CLI and a Java API to support the above. |
| \end{itemize} |
| |
| \section{DUCC From UIMA to Full Scale-out} |
| |
| In this section we demonstrate the progression of a simple UIMA pipeline to a fully |
| scaled-out job running under DUCC. |
| |
| \paragraph{UIMA Pipelines} |
| A normal UIMA pipeline |
| contains a Collection Reader, one or more Analysis Engines connected in a pipeline, and a CAS |
| Consumer, as shown in \hyperref[fig:UIMA-pipeline]{Figure~\ref{fig:UIMA-pipeline}}. |
| |
| \begin{figure}[H] |
| \centering |
| % \includegraphics[bb=0 0 575 310, width=5.5in]{images/uima-pipeline.jpg} |
| \includegraphics[width=5.5in]{images/uima-pipeline.jpg} |
| \caption{Standard UIMA Pipeline} |
| \label{fig:UIMA-pipeline} |
| \end{figure} |
| |
| \paragraph{UIMA-AS Scaled Pipeline} |
| With UIMA-AS the CR is separated into a discrete process and a CAS Multiplier is introduced |
| into the pipeline as an interface between the CR and the pipeline, as shown in |
| \hyperref[fig:UIMA-AS-pipeline]{Figure~\ref{fig:UIMA-AS-pipeline}} below. |
| Multiple pipelines are serviced by the |
| CR and are scaled-out over a computing cluster. The difficulty with this model is that each |
| user is individually responsible for finding and scheduling computing nodes, installing |
| communication software such as ActiveMQ, and generally managing the distributed job and |
| associated hardware. |
| |
| \begin{figure}[H] |
| \centering |
| % \includegraphics[bb=0 0 584 341, width=5.5in]{images/uima-as-pipeline.jpg} |
| \includegraphics[width=5.5in]{images/uima-as-pipeline.jpg} |
| \caption{UIMA Pipeline As Scaled by UIMA-AS} |
| \label{fig:UIMA-AS-pipeline} |
| \end{figure} |
| |
| \paragraph{UIMA-AS Pipeline Scaled By DUCC} |
| DUCC is a UIMA-aware and UIMA-AS-aware cluster manager. To scale out work under DUCC, the developer |
| tells DUCC what the parts of the application are, and DUCC does the work to build the |
| scale-out via UIMA-AS, to find and schedule resources, to deploy the parts of the application |
| over the cluster, and to manage the job while it executes. |
| |
| On job submission, the DUCC Command Line Interface (CLI) inspects the XML defining the analytic |
| and generates a UIMA-AS Deployment Descriptor (DD). The DD establishes some number of pipeline |
| threads per process (as indicated in the DUCC job parameters), and generates job-unique queues. |
| |
| Under DUCC, the Collection Reader is executed in a process called the Job Driver (or JD). The |
| pipelines are executed in one or more processes called Job Processes (or JPs). The JD |
| process provides a thin wrapper over the CR to enable communication with DUCC. The JD uses the |
| CR to implement a UIMA-AS client delivering CASs to the multiple (scaled-out) pipelines, |
| as shown in \hyperref[fig:UIMA-AS-pipeline-DUCC]{Figure~\ref{fig:UIMA-AS-pipeline-DUCC}} below. |
| |
| \begin{figure}[H] |
| \centering |
| % \includegraphics[bb=0 0 571 311, width=5.5in]{images/ducc-sequential.jpg} |
| \includegraphics[width=5.5in]{images/ducc-sequential.jpg} |
| \caption{UIMA Pipeline As Automatically Scaled Out By DUCC} |
| \label{fig:UIMA-AS-pipeline-DUCC} |
| \end{figure} |
| |
| \paragraph{UIMA-AS Pipeline with User-Supplied DD Scaled By DUCC} |
| |
| Application programmers may supply their own Deployment Descriptors to control intra-process |
| threading and scale-out. If a DD is supplied in the job parameters, DUCC uses it instead |
| of generating one, as depicted in \hyperref[fig:UIMA-AS-pipeline-DUCC-DD]{Figure~\ref{fig:UIMA-AS-pipeline-DUCC-DD}} below. |
| |
| \begin{figure}[H] |
| \centering |
| % \includegraphics[bb=0 0 571 316,width=5.5in]{images/ducc-parallel.jpg} |
| \includegraphics[width=5.5in]{images/ducc-parallel.jpg} |
| \caption{UIMA Pipeline With User-Supplied DD as Automatically Scaled Out By DUCC} |
| \label{fig:UIMA-AS-pipeline-DUCC-DD} |
| \end{figure} |
| |
| |
| \section{Error Management} |
| DUCC provides a number of facilities to assist error management: |
| |
| \begin{itemize} |
| \item DUCC uses the UIMA-AS error-handling facilities to reflect errors from the Job Processes |
| to the Job Drivers. The JD wrappers implement logic to enforce error thresholds, to identify |
| and log errors, and to reflect job problems in the DUCC Web Server. All error thresholds are |
| configurable both globally and on a per-job basis. |
| |
| \item Error and timeout thresholds are implemented for both the initialization phase of a pipeline |
| and the execution phase. |
| |
| \item Retry-after-error is supported: if a process has a failure on some CAS after |
| initialization is successful, the process is terminated and all affected CASs are retried, up to some |
| configurable threshold. |
| |
| \item DUCC ensures that processes can successfully initialize before fully scaling out a job, |
| to ensure a cluster is not overwhelmed with errant processes. |
| |
| \item Various error conditions encountered while a job is running will prevent the errant job |
| from continuing scale out, and can result in termination of the job. |
| \end{itemize} |
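
The retry-after-error behavior listed above can be sketched as a bounded retry loop. This is
a simplified illustration, not DUCC's implementation: it retries a single work item in
place and does not model process termination, and the function name and the
\texttt{max\_attempts} parameter are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def run_with_retries(items: Iterable[str],
                     process: Callable[[str], None],
                     max_attempts: int) -> Tuple[List[str], List[str]]:
    """Process each item, retrying a failing item up to max_attempts times.

    Returns (completed, failed), where failed holds the items that still
    raised an exception on their final attempt.
    """
    completed: List[str] = []
    failed: List[str] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                process(item)
            except Exception:
                # Give up on this item once the threshold is reached.
                if attempt == max_attempts:
                    failed.append(item)
                continue
            completed.append(item)
            break
    return completed, failed
```

Keeping the threshold as a parameter mirrors the text's point that the limit is configurable
rather than fixed.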
| |
| \section{Cluster and Job Management} |
| DUCC supports management of multiple jobs and multiple users in a distributed cluster: |
| |
| \begin{description} |
| \item[Multiple User Support] DUCC runs all work under the identity of the submitting user. Logs |
| are written with the user's credentials into the user's file space designated at job |
| submission. |
| |
| \item[Fair-Share Scheduling] DUCC provides a Fair-Share scheduler to equitably share |
| resources among multiple users. The scheduler also supports semi-permanent reservation of |
| full or partial machines. |
| |
| \item[Service Management] DUCC provides a Service Manager capable of automatically starting, stopping, and |
| otherwise managing and querying both UIMA-AS and non-UIMA-AS services in support of jobs. |
| |
| \item[Job Lifetime Management and Orchestration] DUCC includes an Orchestrator to manage the |
| lifetimes of all entities in the system. |
| |
| \item[Node Sharing] DUCC allocates processes from one or more users on a node, each with a specified |
| amount of memory. DUCC's preferred mechanism for constraining memory use is Linux |
| Control Groups, or CGroups. For nodes that do not support CGroups, DUCC agents monitor |
| RAM use and kill processes that exceed their share size by a settable fudge factor. |
| |
| \item[DUCC Agents] DUCC Agents manage each node's local resources and all |
| processes started by DUCC. Each node in a cluster has exactly one Agent. The Agent |
| \begin{itemize} |
| \item Monitors and reports node capabilities (memory, etc.) and performance data (CPU busy, |
| swap, etc.). |
| \item Starts, stops, and monitors all processes on behalf of users. |
| \item Patrols the node for ``foreign'' (non-DUCC) processes, reporting them to the |
| Web Server, and optionally reaping them. |
| \item Ensures job processes do not exceed their declared memory requirements |
| through the use of Linux CGroups. |
| \end{itemize} |
| |
| \item[DUCC Web Server] DUCC provides a web server displaying all aspects of the system: |
| \begin{itemize} |
| \item All jobs in the system, their current state, resource usage, etc. |
| |
| \item All reserved resources and associated information (owner, etc.), |
| including the ability to request and cancel reservations. |
| |
| \item All services, including the ability to start, stop, and modify |
| service definitions. |
| |
| \item All nodes in the system and their status, usage, etc. |
| |
| \item The status of all DUCC management processes. |
| |
| \item Access to documentation. |
| \end{itemize} |
| |
| |
| \item[Cluster Management Support] DUCC provides system management support to: |
| \begin{itemize} |
| \item Start, stop, and query full DUCC systems. |
| |
| \item Start, stop, and quiesce individual DUCC components. |
| |
| \item Add and delete nodes from the DUCC system. |
| |
| \item Discover DUCC processes (e.g. after partial failures). |
| |
| \item Find and kill errant job processes belonging to individual users. |
| |
| \item Monitor and display inter-DUCC messages. |
| \end{itemize} |
| \end{description} |
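
The fallback memory check described under Node Sharing above, for nodes without CGroups, can
be sketched as follows. The source does not define how the fudge factor is applied, so this
sketch assumes it is a fractional margin over the share size (0.05 meaning 5\% over); the
function names and the shape of the usage table are likewise hypothetical.

```python
from typing import Dict, List


def exceeds_share(resident_bytes: int, share_bytes: int,
                  fudge_factor: float) -> bool:
    """Return True when usage exceeds the share plus the allowed margin.

    Assumption: fudge_factor is a fraction of the share size.
    """
    return resident_bytes > share_bytes * (1.0 + fudge_factor)


def processes_to_kill(usage: Dict[int, int], share_bytes: int,
                      fudge_factor: float) -> List[int]:
    """Return the process ids, in ascending order, whose usage is over the limit."""
    return [pid for pid, resident in sorted(usage.items())
            if exceeds_share(resident, share_bytes, fudge_factor)]
```

For example, with a share of 1000 bytes and a margin of 0.05, a process using 1000 bytes is
left alone while one using 1200 bytes is selected.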
| |
| |
| \section{Security Measures} |
| The following DUCC security measures are provided: |
| |
| \begin{description} |
| \item[command line interface] The CLI employs HTTP to send requests |
| to the DUCC controller. The CLI creates and employs public and private |
| security keys in the user's home directory for authentication of HTTP |
| requests. The controller validates requests via these same security keys. |
| \item[webserver] The webserver facilitates operational control and |
| therefore authentication is desirable. |
| \begin{itemize} |
| \item[\textit{user}] Each user has the ability to control certain aspects of |
| only his/her active submissions. |
| \item[\textit{admin}] Each administrator has the ability to control certain |
| aspects of any user's active submissions, as well as modification of some |
| DUCC operational characteristics. |
| \end{itemize} |
| A simple interface is provided so |
| that an installation can plug in a site-specific authentication mechanism |
| comprising userid and password. |
| \item[ActiveMQ] TBD. |
| \end{description} |
| |
| \section{Security Issues} |
| The following DUCC security issues should be considered: |
| |
| \begin{description} |
| \item[submit transmission `sniffed'] In the event that the DUCC submit |
| command is `sniffed', the user authentication mechanism is compromised |
| and user masquerading is possible. That is, the userid encryption mechanism |
| can be exploited such that user A can submit a job pretending to be user B. |
| \item[user \textit{ducc} password compromised] In the event that the \textit{ducc} |
| user password is compromised, the root-privileged command |
| \textbf{ducc\_ling} can be used to become any other user except root. |
| \item[user \textit{root} password compromised] In the event that the |
| \textit{root} user password is compromised, DUCC provides no protection. |
| That is, compromising the root user is equivalent to compromising the DUCC |
| user password. |
| \end{description} |
| |