In Distributed Runtime Environment, we learned the basic concepts of task slots and resources. TaskManager Resource gives more detail on TaskManager resource configuration and explains how to specify resources for each operator. Taken together, these show that quantitative resource management helps a lot in avoiding OutOfMemory errors. At the same time, however, it makes job scheduling problems harder to debug, so we enhanced some pages of the Flink dashboard to make debugging easier.
As shown in the figure above, the resource section of the Flink cluster overview contains total/available slots and total/available resources (CPU cores, heap memory, direct memory, native memory, managed memory, and network memory).
Each SlotRequest has a corresponding ResourceProfile, which describes the resource requirements of its tasks. As shown in the figure above, all slots that have been requested but not yet fulfilled appear in a list of pending slots, together with the request time and requested resources of each pending slot.
Each TaskManager represents a subset of the resources of the Flink cluster. A slot can only be assigned to one TaskManager; it cannot be allocated across multiple TaskManagers. When we submit a job to an existing session, a pending SlotRequest may not be allocated due to resource fragmentation, even though the Flink cluster as a whole has enough resources to fulfill it. So we need the TaskManagers resource list (including total/available amounts of all kinds of resources) to help diagnose this situation.
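The fragmentation case can be illustrated with a small model. The sketch below is not Flink code: the dictionary keys, the `fits` and `cluster_total` helpers, and all numbers are hypothetical, chosen only to show how a request can be covered by the cluster-wide total and still fit on no single TaskManager.

```python
# Hypothetical model of resource fragmentation; names and numbers are illustrative only.

def fits(request, available):
    """Return True if every requested resource amount is covered by `available`."""
    return all(available.get(kind, 0) >= amount for kind, amount in request.items())


def cluster_total(task_managers):
    """Sum the available amount of each resource kind over all TaskManagers."""
    total = {}
    for tm in task_managers:
        for kind, amount in tm.items():
            total[kind] = total.get(kind, 0) + amount
    return total


# Available resources per TaskManager (assumed values).
task_managers = [
    {"cpu_cores": 0.5, "heap_mb": 128},
    {"cpu_cores": 0.5, "heap_mb": 128},
]

# One pending slot request (assumed values).
request = {"cpu_cores": 0.1, "heap_mb": 256}

fits_in_total = fits(request, cluster_total(task_managers))
fits_on_one_tm = any(fits(request, tm) for tm in task_managers)

print(fits_in_total)   # True: the summed resources cover the request
print(fits_on_one_tm)  # False: no single TaskManager has 256 MB of heap available
```

Because a slot cannot span TaskManagers, only the second check matters for allocation, which is why the per-TaskManager list is needed in addition to the cluster-wide totals.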
1. Go to the Running Jobs Page of the Flink dashboard and check whether all tasks are running. If not, go to the next step.
2. Go to the Job Details Page and switch to the Exceptions tab. Check whether Root Exceptions or Exception History is empty. If it is empty, go to the next step. Otherwise, try to fix all of these exceptions except TimeoutException before continuing.
3. Switch to the Pending Slots tab and check whether the pending list is empty. If not, the tasks may not be running due to insufficient resources.
4. Go to the Overview Page and check whether all kinds of available resources are enough to fulfill the pending slots. As shown in figure 2, each pending slot request needs <0.1 Core, 32MB Direct Memory, 256MB Heap Memory, 32MB Native Memory>. The available UserNative memory is zero in figure 1, so the two SlotRequests cannot be fulfilled and remain pending.
5. Go to the Task Managers Page, where we should find that no TaskManager has enough available resources to fulfill the pending slot requests.
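The last two checks above amount to comparing each resource kind of a pending request with what is available. As a rough sketch (not part of the Flink dashboard or API), the hypothetical `shortfalls` helper below lists the resource kinds that fall short. The request values come from the example above; the available values, apart from the zero native memory, are placeholders assumed for illustration.

```python
# Illustrative check only; this is not Flink code.

def shortfalls(request, available):
    """Return {kind: missing amount} for every resource kind that is insufficient."""
    return {
        kind: needed - available.get(kind, 0)
        for kind, needed in request.items()
        if available.get(kind, 0) < needed
    }


# Pending slot request from the example:
# <0.1 Core, 32MB Direct Memory, 256MB Heap Memory, 32MB Native Memory>.
request = {"cpu_cores": 0.1, "direct_mb": 32, "heap_mb": 256, "native_mb": 32}

# Available resources: native memory is zero, as in the example.
# The other amounts are placeholders assumed for this sketch.
available = {"cpu_cores": 1.0, "direct_mb": 64, "heap_mb": 512, "native_mb": 0}

missing = shortfalls(request, available)
print(missing)  # {'native_mb': 32}
```

An empty result would mean the request fits the given availability. Running the same comparison against each TaskManager's available resources, rather than the cluster totals, corresponds to the final check on the Task Managers Page.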