Trace Sampling at server side

When we run a distributed tracing system, the trace bring us detailed info, but cost a lot at storage. Open server side trace sampling mechanism, the metrics of service, service instance, endpoint and topology are all accurate as before, but only don't save all the traces into storage.

Of course, even you open sampling, the traces will be kept as consistent as possible. Consistent means, once the trace segments have been collected and reported by agents, the backend would do their best to don't break the trace. See Recommendation to understand why we called it as consistent as possible and do their best to don't break the trace.

Set the sample rate

In agent-analyzer module, you will find sampleRate setting.

agent-analyzer:
  default:
    ...
    sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
    forceSampleErrorSegment: ${SW_FORCE_SAMPLE_ERROR_SEGMENT:true} # When sampling mechanism activated, this config would make the error status segment sampled, ignoring the sampling rate.

sampleRate is for you to set sample rate to this backend. The sample rate precision is 1/10000. 10000 means 100% sample in default.

forceSampleErrorSegment is for you to open force save some error segment when sampling mechanism active. When sampling mechanism activated, this config would make the error status segment sampled, ignoring the sampling rate.

Recommendation

You could set different backend instances with different sampleRate values, but we recommend you to set the same.

When you set the rate different, let's say

Backend-InstanceA.sampleRate = 35
Backend-InstanceB.sampleRate = 55

And we assume the agents reported all trace segments to backend, Then the 35% traces in the global will be collected and saved in storage consistent/complete, with all spans. 20% trace segments, which reported to Backend-InstanceB, will saved in storage, maybe miss some trace segments, because they are reported to Backend-InstanceA and ignored.

Note

When you open sampling, the actual sample rate could be over sampleRate. Because currently, all error segments will be saved, meanwhile, the upstream and downstream may not be sampled. This feature is going to make sure you could have the error stacks and segments, but don't guarantee you would have the whole trace.

Also, the side effect would be, if most of the accesses are fail, the sampling rate would be closing to 100%, which could crash the backend or storage clusters.