Flink HA Solution

A Flink cluster has only one JobManager, which makes the JobManager a single point of failure (SPOF). Flink can run in three modes: Flink On Yarn, Flink Standalone, and Flink Local. Flink On Yarn and Flink Standalone are cluster-based, whereas Flink Local runs on a single node. Flink On Yarn and Flink Standalone provide an HA mechanism that recovers the JobManager from failures and thereby eliminates the SPOF risk. This section describes the HA mechanism of Flink On Yarn.

Flink's HA mode and job exception recovery both depend heavily on ZooKeeper. To enable these two functions, configure ZooKeeper in the flink-conf.yaml file in advance as follows:

high-availability: zookeeper
high-availability.zookeeper.quorum:  ZooKeeper IP address:2181
high-availability.storageDir: hdfs:///flink/recovery
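
Before starting a cluster, it can help to confirm that all three settings above are present. The following is a minimal sketch in Python, assuming the configuration is available as plain `key: value` text. The helper names `parse_flat_conf` and `missing_ha_keys` are hypothetical and are not part of Flink or MRS, and the parsing is deliberately simpler than Flink's own configuration loader.

```python
# Hypothetical helper names; only the three configuration keys come from this section.
REQUIRED_HA_KEYS = (
    "high-availability",
    "high-availability.zookeeper.quorum",
    "high-availability.storageDir",
)


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values such as
        # 'hdfs:///flink/recovery' or 'host:2181' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """Return the required HA keys that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_HA_KEYS if not conf.get(key)]
```

An empty list from `missing_ha_keys` means that all three keys have a value. It does not verify that the ZooKeeper quorum is reachable or that the HDFS directory exists.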

Flink On Yarn

The Flink JobManager and the Yarn ApplicationMaster run in the same process, and the Yarn ResourceManager monitors the ApplicationMaster. If the ApplicationMaster becomes abnormal, Yarn restarts it and restores all JobManager metadata from HDFS. During the recovery, existing tasks cannot run and new tasks cannot be submitted. ZooKeeper stores JobManager metadata, such as information about jobs, for use by the new JobManager. A TaskManager failure is detected and handled by the Akka DeathWatch mechanism on the JobManager. When a TaskManager fails, a new container is requested from Yarn and a new TaskManager is created.

For more information about the HA solution of Flink on Yarn, visit https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html.

For details about how to set yarn-site.xml, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/jobmanager_high_availability.html.

Standalone

In standalone mode, multiple JobManagers can be started, and ZooKeeper elects one of them as the leader JobManager; the others remain on standby. If the leader JobManager fails, a standby JobManager takes over the leadership. Figure 1 shows the leader/standby JobManager switchover process.

Figure 1 Switchover process
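
The leader/standby switchover can be illustrated with a toy model. The sketch below is not Flink's or ZooKeeper's implementation. It assumes a simplified election rule in which each JobManager receives an increasing sequence number when it registers and the live candidate with the lowest number leads. The class name `ToyElection` and the JobManager names are hypothetical.

```python
class ToyElection:
    """Toy leader election: the live candidate that registered earliest leads."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # JobManager name -> sequence number

    def register(self, name):
        """Add a JobManager as a candidate (leader if it is the only one, else standby)."""
        self._candidates[name] = self._next_seq
        self._next_seq += 1

    def fail(self, name):
        """Remove a failed JobManager from the candidate set."""
        self._candidates.pop(name, None)

    def leader(self):
        """Return the current leader, or None if no JobManager is alive."""
        if not self._candidates:
            return None
        return min(self._candidates, key=self._candidates.get)


election = ToyElection()
for jm in ("jm-1", "jm-2", "jm-3"):
    election.register(jm)

first_leader = election.leader()   # jm-1 registered first
election.fail("jm-1")              # the leader fails
second_leader = election.leader()  # a standby takes over
```

In this model a failed JobManager that comes back registers again with a new, higher sequence number, so it rejoins as a standby rather than reclaiming leadership.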

Restoring TaskManager

A TaskManager failure is detected and handled by the Akka DeathWatch mechanism on the JobManager. If a TaskManager fails, the JobManager creates a new TaskManager and migrates services to it.

Restoring JobManager

The Flink JobManager and the Yarn ApplicationMaster run in the same process, and the Yarn ResourceManager monitors the ApplicationMaster. If the ApplicationMaster becomes abnormal, Yarn restarts it and restores all JobManager metadata from HDFS. During the recovery, existing tasks cannot run and new tasks cannot be submitted.

Restoring Jobs

To restore jobs, ensure that a restart policy is configured in the Flink configuration files. The supported restart policies are fixed-delay, failure-rate, and none. Jobs can be restored only when the policy is set to fixed-delay or failure-rate. However, if the restart policy is set to none and checkpointing is configured for a job, the restart policy is automatically set to fixed-delay and restart-strategy.fixed-delay.attempts (the number of retries) is set to Integer.MAX_VALUE.
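
The fallback rule in the preceding paragraph can be written as a small decision function. This is a sketch of the rule as stated in this section, not code taken from Flink. The function names are hypothetical, and `JAVA_INTEGER_MAX_VALUE` stands in for Java's Integer.MAX_VALUE.

```python
JAVA_INTEGER_MAX_VALUE = 2**31 - 1  # value of Integer.MAX_VALUE in Java

RESTORABLE_POLICIES = ("fixed-delay", "failure-rate")


def effective_restart_policy(configured, checkpointing_enabled, attempts=None):
    """Return (policy, attempts) after applying the checkpoint fallback rule."""
    if configured == "none" and checkpointing_enabled:
        # With checkpointing configured, 'none' falls back to fixed-delay
        # with the maximum number of retries.
        return ("fixed-delay", JAVA_INTEGER_MAX_VALUE)
    return (configured, attempts)


def can_restore_jobs(policy):
    """Jobs can be restored only under fixed-delay or failure-rate."""
    return policy in RESTORABLE_POLICIES
```

For example, `effective_restart_policy("none", True)` yields a fixed-delay policy, so jobs can still be restored, whereas `effective_restart_policy("none", False)` leaves the policy as none.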

For details about the three strategies, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/task_failure_recovery.html. The following is an example of the restart policy configuration:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
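
To make the effect of these three settings concrete, the sketch below simulates a fixed-delay policy in plain Python. It is an illustration only, under the assumption that `attempts` counts restarts after the first run. `run_with_fixed_delay` and the `job` callable are hypothetical names, and the `sleep` parameter exists so that the delay can be replaced in tests.

```python
import time


def run_with_fixed_delay(job, attempts, delay_seconds, sleep=time.sleep):
    """Run job(); after a failure, wait delay_seconds and restart.

    At most `attempts` restarts are made. Returns (result, restarts_used).
    If the job still fails after the last restart, the error is re-raised.
    """
    restarts = 0
    while True:
        try:
            return job(), restarts
        except Exception:
            if restarts >= attempts:
                raise  # restart budget exhausted: the job is declared failed
            restarts += 1
            sleep(delay_seconds)
```

With `attempts=3` and `delay_seconds=10`, matching the configuration above, a job that keeps failing is run four times in total (the first run plus three restarts) before the failure is reported.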

Jobs will be restored in the following scenarios:

  • If a JobManager fails, all of its jobs are stopped. They are recovered after another JobManager is created and running.

  • If a TaskManager fails, all tasks on that TaskManager are stopped. They are restarted once resources become available.

  • When a task of a job fails, the job is restarted.

    Note

    For details about how to configure the restart policy of a job, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/jobmanager_high_availability.html.

last updated: 2025-07-09 15:07 UTC - commit: cb943fa3145d5c3e150bb4fa1a987d24c3077fe9
© T-Systems International GmbH