Tracing

When the calls between enterprise microservices are complex, APM Agents sample some requests, and intercept corresponding requests and subsequent call information. For example, in the scenario where service A calls service B and then calls service C, after service A receives a request, APM determines whether to trace the request based on the intelligent sampling algorithm.

Intelligent Sampling Algorithm

APM uses the intelligent sampling algorithm to determine whether to trace requests.

  • If a request needs to be traced, a trace ID is generated and details (events) about some important methods (generally the tree structure with the parent-child relationship) under service A are intercepted. At the same time, the trace ID is transparently transmitted to service B. The important methods under service B are also intercepted. The trace ID is also transparently transmitted to service C. Some methods under service C are intercepted in a similar way as those under services B and A. Each node respectively reports event information and an association relationship can be formed based on the trace ID. In this way, you can view the call details of the entire request based on the trace ID.

  • If a request does not need to be traced, no trace ID is generated. Service B does not receive the trace ID and uses the same algorithm as service A to determine whether to perform tracing.

  • After data is reported, APM stores not only all event details, but also the root event (called span) information of each service for subsequent trace search. Generally, you search for the span information and then obtain the overall trace details based on the trace ID in the span information.

  • By default, the intelligent sampling policy is used. There are three types of URLs: error URLs, slow URLs (use the default 800 ms or customize a threshold), and normal URLs. The sampling ratio of each type of URL is calculated separately. For APM, statistics are collected and reported every minute. In the first collection period, all URLs are regarded as normal for sampling. In the second collection period, URLs are classified into error, slow, and normal URLs based on the statistics collected in the previous period.

    • Sampling rate of error URLs: If the CPU usage is less than 30%, 100 records are collected per minute. If the CPU usage is greater than or equal to 30% but less than 60%, 50 records are collected per minute. If the CPU usage is greater than or equal to 60%, 10 records are collected per minute. At least two records are collected for each URL.

    • Sampling rate of slow URLs: If the CPU usage is less than 30%, 100 records are collected per minute. If the CPU usage is greater than or equal to 30% but less than 60%, 50 records are collected per minute. If the CPU usage is greater than or equal to 60%, 10 records are collected per minute. At least two records are collected for each URL.

    • Sampling rate of normal URLs: If the CPU usage is less than 30%, 20 records are collected per minute. If the CPU usage is greater than or equal to 30% but less than 60%, 10 records are collected per minute. If the CPU usage is greater than or equal to 60%, 5 records are collected per minute. At least one record is collected for each URL.

The advantage of the preceding algorithm is that once the trace information is generated, the link is complete, helping you make correct decisions. If a large number of URLs are called, abnormal requests may fail to be collected. In this case, you can collect metrics to locate system exceptions.

Viewing Trace Details

Viewing Basic Information About the Trace Filtered Based on the Search Criteria

In the displayed trace list, click image2 next to the target trace to view its basic information, as shown in the following figure.

**Figure 1** Basic information about a trace

Figure 1 Basic information about a trace

Parameter description:

  1. HTTP method of the trace.

  2. REST URL of the trace. A REST URL contains a variable name, for example, /apm/get/{id}. You can click the URL to go to the trace details page.

  3. Start time of the trace.

  4. HTTP status code returned by the trace.

  5. Response time of the trace.

  6. Trace ID.

  7. Component to which the trace belongs.

  8. Environment to which the trace belongs.

  9. Host of the instance to which the trace belongs.

  10. IP address of the instance to which the trace belongs.

  11. Actual URL of the trace.

You can also click a specific URL on the monitoring item view page, for example, the table view of the URL monitoring item. In this way, you can quickly search for required trace information based on preset search criteria.

Viewing the Complete Information About the Trace, Including Local Method Stacks and Remote Call Relationships

Click the name of a trace to view its details, as shown in the following figure.

  • The upper part is the sequence diagram of the trace, which shows complete call relationships between components. This diagram contains the information about the client and server corresponding to each call. The lower the line is, the later a call occurs.

  • The lower part lists the method stack details of the trace. Each line indicates a method call. You can view the detailed method call relationships of the trace. By default, only component methods supported by JavaAgents are displayed. To display application methods, configure the application methods to be intercepted during JavaMethod configuration.

    **Figure 2** Call relationship

    Figure 2 Call relationship

    Parameter description:

  1. Component and environment to which the called API belongs

  2. Response time (unit: ms) of the client. You can hover the mouse pointer over this digit to view more details.

  3. Response time (unit: ms) of the server.

  4. Key parameter of the method in the trace method stack. For example, for a Tomcat entry method, a real URL is displayed. For a MySQL call method, an executed SQL statement is displayed.

  5. Extended data of the trace method. Generally, parameters related to the method are displayed.