Avoiding APM Pitfalls: A Crucial Update for Elasticsearch Users
If you’re using APM and considering an upgrade to any version between 8.13.0 and 8.14.2, this post is for you.
A newly documented issue on Elasticsearch’s “Known Issues” page might still catch you off guard, especially if you’re sending a significant amount of OpenTelemetry (OTel) data to your clusters. The bug may not manifest exactly as described, which can lead to confusion and wasted time.
If you are seeing a large number of “Context Deadline Exceeded” errors in your OTel farms sending data to the Elasticsearch APM Server, then this bug might be affecting you.
Spotting the Issue: “Context Deadline Exceeded” Errors
The bug is officially listed as “Too many small bulk requests” in the Elasticsearch output. However, in our experience, it presented differently.
In our OTel environments, we encountered this error:
2024-05-27T12:21:17.440Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "otlp/elastic", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "interval": "3.126608528s"}
According to the OTel documentation, this is a retryable error from the APM Server. A quick Google search led us to numerous articles suggesting tweaks to the OTel agent and server, which only wasted more time.
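To give a sense of what those articles suggest, here is a minimal sketch of the exporter section of an OTel Collector config; the endpoint is hypothetical, and timeout and retry_on_failure are standard exporterhelper settings. In our case, raising these values only delayed the failures, because the bottleneck was on the APM Server side:

exporters:
  otlp/elastic:
    # Hypothetical APM Server endpoint, for illustration only
    endpoint: "apm-server.example.com:8200"
    # Raising the per-request timeout just postpones the DeadlineExceeded error
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m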
Adding to the frustration, when we began graphing the performance and audit metrics for our traces, metrics, and logs, we found we were losing up to 90% of data in some cases.
The Solution: A Simple Configuration Fix
After much trial and error, we stumbled upon the fix by accident. We initially thought the issue was related to configuration, so we focused on benchmarking and testing both APM and Elasticsearch (more on this in another post). During this process, we discovered a configuration setting that resolved all our error logs:
output.elasticsearch.flush_bytes: 1mib
It turns out that a performance regression in the affected APM Server versions causes it to send bulk requests of just 24KB. At that size, shipping the same volume of data takes roughly 40 times as many bulk requests as it would at 1MiB (1MiB / 24KB ≈ 43), which is likely why the backpressure surfaced in our collectors as timeouts.
We applied this fix using the eck-stack Helm chart by passing the configuration via the config: block in values.yaml. Depending on your deployment setup, you may need to set this configuration elsewhere.
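For reference, here is a minimal sketch of what that looks like in values.yaml, assuming the eck-stack chart exposes an eck-apm-server sub-chart whose spec: maps onto the ECK ApmServer resource (the exact key names may differ across chart versions):

eck-apm-server:
  spec:
    config:
      # Workaround for the small-bulk-request regression in 8.13.0 - 8.14.2
      output.elasticsearch.flush_bytes: 1mib

Once deployed, you can verify the setting made it through by inspecting the generated resource with kubectl get apmserver -o yaml.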
We never did see the telltale “Too many small bulk requests” errors logged at any tier of our cluster, which is what the “Known Issues” page says to look out for, so hopefully this points you in the right direction.
That said, “Context Deadline Exceeded” can be caused by any number of issues, so treat this as one possibility to rule out.
Conclusion: Stay Safe with the Right Configuration
Although this bug is resolved in version 8.14.3, if you’re running or planning to run Elasticsearch APM Server versions 8.13.0 – 8.14.2, make sure to set output.elasticsearch.flush_bytes manually to avoid these issues.
For more details on this regression, you can visit Elasticsearch’s Known Issues page.