As systems grow in scale, there comes a point where optimization becomes essential—not only to improve performance but also to avoid potential errors or inefficiencies that could arise from neglect. Over the past year, I’ve been working to scale OpenTelemetry (OTel) within my organization, and as we approached 30TB of data per day, it became clear that we could no longer ignore the inefficiencies or the sporadic issues we were encountering while sending data to Elasticsearch.

The time had come to benchmark Elasticsearch and the OTel farms.

Here, I’ll share the results from our internal benchmarking tests, focusing specifically on the performance of OpenTelemetry agents versus Elastic APM Server. The goal was to uncover key optimizations and bottlenecks that could help others who are dealing with similar scaling challenges.


Our Test Environment

These tests were designed to pinpoint the performance limits of the OpenTelemetry agent compared to Elastic APM Server. For this post, Elasticsearch performance isn’t a key factor—we’ll assume you already know how to scale your Elasticsearch cluster to handle large data volumes.

  • OTel Collector Version: 0.93.0
  • Elastic APM Server & Elasticsearch Version: 8.13.4
  • Test Setup: One agent, one APM server, and one Elasticsearch node (with no replicas).

The Testing Process

We used a simple weather forecast generator to simulate a high-traffic environment. The load consisted of approximately 20,000 requests per second—equating to 1,200,000 requests per minute—for 10 minutes. The objective was to identify peak performance points and where both systems might begin to struggle.
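Our generator was an internal tool, but for illustration, a roughly equivalent load profile can be described with a YAML-driven load tool such as Artillery. The target host and endpoint below are placeholders rather than our actual service, and the numbers simply mirror the profile described above.

  # Illustrative load profile only; our internal weather forecast generator was custom.
  # Assumes Artillery (a YAML-driven load testing tool) and placeholder endpoints.
  config:
    target: "http://weather-forecast.internal:8080"   # placeholder host
    phases:
      - duration: 600        # 10 minutes
        arrivalRate: 20000   # roughly 20,000 requests per second
  scenarios:
    - flow:
        - get:
            url: "/weatherforecast"                   # placeholder endpoint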


Key Findings

Elastic APM Server

Through more than 100 tests with various configurations, we discovered that significant tweaking was necessary to get the best performance from Elastic APM Server. Here are the settings we found crucial to tune, along with the values we landed on (consolidated into a sample apm-server.yml sketch after this list):

  • max_event_size: 1457600
  • max_header_size: 485760
  • logging.level: warning
  • output.elasticsearch.bulk_max_size: 10240
  • output.elasticsearch.flush_bytes: 2MB
  • output.elasticsearch.flush_interval: 5s
  • output.elasticsearch.workers: 10
  • queue.mem.events: 102400
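For reference, here is a minimal sketch of how these values might sit in apm-server.yml. The exact nesting is an assumption based on the standard Beats-style layout, so verify it against your own deployment before applying it.

  apm-server:
    max_event_size: 1457600   # bytes
    max_header_size: 485760   # bytes

  logging:
    level: warning

  output.elasticsearch:
    bulk_max_size: 10240
    flush_bytes: 2MB          # setting this explicitly also avoids the 8.13.0 to 8.13.4 bug noted below
    flush_interval: 5s
    workers: 10

  queue.mem:
    events: 102400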

Issues Identified:

  • There is a bug in Elastic APM Server versions 8.13.0 to 8.13.4. To avoid it, be sure to set output.elasticsearch.flush_bytes.
  • Elastic APM Server does not scale well vertically. In our development environment, we ran ten 6-core APM servers, each reaching around 70% CPU utilization. When we consolidated to two servers with 32 cores each, we couldn’t push utilization beyond roughly 3 cores, even with increased load. After extensive configuration tweaking, we concluded that APM Server doesn’t efficiently use additional cores. As a result, we settled on a maximum of 6-core APM pods and autoscale horizontally when needed; a sketch of that setup follows this list.
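As an illustration of that sizing decision, here is a minimal Kubernetes sketch of capping each APM Server pod at 6 cores and autoscaling horizontally on CPU. The names, replica counts, and threshold are hypothetical, not our production manifests.

  # Container resources in the APM Server Deployment (excerpt, hypothetical)
  resources:
    requests:
      cpu: "6"
    limits:
      cpu: "6"
  ---
  # Hypothetical HorizontalPodAutoscaler scaling on CPU utilization
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: apm-server
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: apm-server
    minReplicas: 2
    maxReplicas: 20
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70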

For more configuration options, refer to the official Elastic APM Server documentation.


OpenTelemetry Collector

For the OpenTelemetry Collector, we made several key discoveries that improved performance, building on what we had learned from tuning Elastic APM Server.

Best-Performing Settings:

  batch:
    send_batch_size: 2048
    timeout: 1s

  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 0

The memory_limiter processor proved vital in memory-constrained environments, as it helps prevent the collector pods from restarting due to memory shortages. However, it should be used with caution. When the soft memory usage threshold is hit, the processor rejects data until memory levels fall below the threshold—a challenging situation if the incoming data rate remains high. The best solution is to ensure your collectors have enough memory to avoid hitting these limits.

In our case, we set spike_limit_percentage: 0 to prevent unexpected restarts. With these settings, garbage collection kicked in as expected without prematurely triggering issues. A 1-second memory check interval was crucial to avoid restarts during periods of high traffic.
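To tie this together, here is a minimal sketch of how these processors fit into a collector pipeline. The OpenTelemetry Collector documentation recommends placing memory_limiter first in the processor chain so it can reject data before any other processing; the otlp receiver and exporter names and the endpoint below are assumptions for illustration, not our exact pipeline.

  receivers:
    otlp:                      # assumed receiver
      protocols:
        grpc:

  exporters:
    otlp:                      # assumed exporter, e.g. pointing at an OTLP-capable backend
      endpoint: "apm-server.example:8200"   # placeholder endpoint

  processors:
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
      spike_limit_percentage: 0
    batch:
      send_batch_size: 2048
      timeout: 1s

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]   # memory_limiter first, then batch
        exporters: [otlp]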

Batch Processor Optimization: The batch processor plays a crucial role in minimizing network traffic by batching requests before export; the send_batch_size: 2048 and timeout: 1s settings shown above worked best for us.

If you see frequent “received message after decompression larger than max…” errors, try lowering send_batch_size.
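As a hedged sketch of that tweak: lower send_batch_size on the sending side, and if the receiving end is another collector, its OTLP gRPC receiver also exposes max_recv_msg_size_mib, which can be raised instead (gRPC’s default limit is 4 MiB). The values below are illustrative, not the ones we settled on.

  # Sending collector: smaller batches (illustrative value)
  processors:
    batch:
      send_batch_size: 1024
      timeout: 1s

  # Receiving collector, if applicable: allow larger gRPC messages
  receivers:
    otlp:
      protocols:
        grpc:
          max_recv_msg_size_mib: 32   # illustrative value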


Conclusion and Contributions

These tests highlight the importance of fine-tuning configurations to get the most out of OpenTelemetry and Elastic APM Server at scale. In our case, we saw significant performance improvements, particularly with memory management and batch processing in the OpenTelemetry Collector, as well as optimizing the Elastic APM Server’s core settings.

Special thanks to my colleague, Emile Van Reenen, for his tireless efforts in running over 100 benchmark tests.

If you’re facing similar challenges, I hope these insights help you fine-tune your own setups for better performance and resource efficiency.