

Configuring High-Reliability Flink Jobs (Automatic Restart upon Exceptions)

Scenario

To configure high reliability for a Flink application, set the corresponding parameters when creating your Flink jobs.

Procedure

  1. Create an SMN topic and add an email address or mobile number as a subscription to the topic. You will receive a subscription notification by email or SMS message. Click the confirmation link to complete the subscription.

  2. Log in to the DLI console, create a Flink SQL job, write the SQL statements, and configure the running parameters. This example describes only the key parameters; set the other parameters based on your requirements. For details about how to create a Flink SQL job, see .

    Note

    The reliability configuration of a Flink Jar job is the same as that of a Flink SQL job and is therefore not described separately in this section.

    1. Set CUs, Job Manager CUs, and Max Concurrent Jobs based on the following formula (a worked sizing sketch is provided after this procedure):

      Total number of CUs = Number of manager CUs + (Total number of concurrent operators / Number of slots of a TaskManager) x Number of TaskManager CUs

      For example, with a total of 9 CUs (1 of which is a manager CU) and a maximum of 16 concurrent jobs, the number of compute-specific CUs is 8.

      If you do not configure TaskManager specifications, a TaskManager occupies 1 CU by default and has no slot. To ensure high reliability, set the number of slots of each TaskManager to 2 according to the preceding formula.

      Set the maximum number of concurrent jobs to twice the number of CUs.

    2. Select Save Job Log and select an OBS bucket. If you are not authorized to access the bucket, click Authorize. This allows job logs to be saved to your OBS bucket. If a job fails, the logs can be used for fault locating.

    3. Select Alarm Generation upon Job Exception and select the SMN topic created in step 1. This allows DLI to send a notification to your email address or phone when a job exception occurs, so that you are informed of exceptions promptly.

    4. Select Enable Checkpointing and set the checkpoint interval and mode as needed (a PyFlink sketch showing roughly equivalent settings is provided after this procedure). This function ensures that a failed Flink task can be restored from the latest checkpoint.

      Note

      • Checkpoint interval indicates the time between two consecutive checkpoint triggers. Checkpointing reduces real-time computing performance, so allow for the recovery duration when configuring the interval. The checkpoint interval should be greater than the checkpointing duration; 5 minutes is recommended.

      • Exactly once mode ensures that each piece of data is consumed exactly once, whereas At least once mode ensures that each piece of data is consumed at least once. Select the mode that suits your requirements.

    5. Select Auto Restart upon Exception and Restore Job from Checkpoint, and set the number of retry attempts as needed.

    6. Configure Dirty Data Policy. You can select Ignore, Trigger a job exception, or Save based on your service requirements.

    7. Select a queue, and then submit and run the job.

  3. Log in to the Cloud Eye console. In the navigation pane on the left, choose Cloud Service Monitoring > Data Lake Insight. Locate the target Flink job and click Create Alarm Rule.

    DLI provides various monitoring metrics for Flink jobs. You can define alarm rules on different metrics as required for fine-grained job monitoring.
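
For reference, the CU sizing formula from step 2.1 can be checked with a short calculation. The following Python sketch is illustrative only; the function name and the defaults of 2 slots and 1 CU per TaskManager are assumptions for this example, not values taken from any DLI API.

  def total_cus(manager_cus: int,
                concurrent_operators: int,
                slots_per_taskmanager: int = 2,
                cus_per_taskmanager: int = 1) -> int:
      """Estimate the total CUs for a Flink job.

      Total CUs = manager CUs
                  + (concurrent operators / slots per TaskManager)
                    x CUs per TaskManager
      """
      taskmanagers = concurrent_operators / slots_per_taskmanager
      return int(manager_cus + taskmanagers * cus_per_taskmanager)

  # Example from step 2.1: 1 manager CU, 16 concurrent operators,
  # 2 slots and 1 CU per TaskManager -> 8 compute CUs, 9 CUs in total.
  print(total_cus(manager_cus=1, concurrent_operators=16))  # 9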
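
The checkpointing and restart options in steps 2.4 and 2.5 are configured in the DLI console. For readers more familiar with the Flink API, the following standalone PyFlink sketch shows roughly equivalent settings; it is an illustrative sketch, not part of the DLI procedure, and the 5-minute interval, exactly-once mode, and 3 retry attempts are example values.

  from pyflink.common import RestartStrategies
  from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

  env = StreamExecutionEnvironment.get_execution_environment()

  # Checkpoint every 5 minutes in exactly-once mode (compare step 2.4).
  env.enable_checkpointing(5 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE)

  # Restart automatically after an exception, up to 3 attempts with a
  # 10-second delay between attempts (compare step 2.5).
  env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10000))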
