While I was at Google, circa 2015-2016, working on an ads project, I happened to be on-call when our system started doing something wonky. Per our playbook, I called the SRE for the sub-system we were using (Spanner? something else, I don't remember) to check what was up.
They asked me to enable tracing for 30s (we had a Chrome extension that sent a common URL parameter enabling full tracing (100%) in your web server for a short amount of time), and then I repeated the operations our internal customers were complaining about.
This produced quite a hefty trace, but only for 30 secs - enough for them to track down where the issue might be coming from. And it was end-to-end: from me doing something in the browser, down to our server/backend, down to their systems, etc.
That's how I learned how important tracing is for cases like this - but you can't have it at 100% ON all the time. Not even 1%, I think...
Oh yeah, tracing can be extremely useful, precisely because it should be end to end.
As for the numbers, that's why all tracing collectors and receivers support downsampling out of the box. Recording only 1% or 10% of all traces, or 10% of successful ones and 100% of failures, is a good way to make use of tracing without overburdening storage.
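A minimal sketch of that "100% of failures, 10% of successes" policy in Python (this is a tail-based decision made after a trace finishes; the `trace` dict shape and function name are illustrative, not any particular collector's API):

```python
import random

def should_keep(trace):
    """Decide whether to record a finished trace.

    Always keep failed traces so every error is debuggable,
    but downsample successful ones to ~10% to bound storage.
    (Hypothetical sketch; real collectors expose this as
    sampler/processor configuration rather than user code.)
    """
    if trace["status"] != "ok":       # keep 100% of failures
        return True
    return random.random() < 0.10     # keep ~10% of successes

# Failures always survive; successes survive roughly 1 in 10.
random.seed(0)
kept = sum(should_keep({"status": "ok"}) for _ in range(10_000))
```

In practice you'd configure this in the collector pipeline (e.g. a tail-sampling processor with separate policies for error status and a probabilistic fallback) rather than hand-rolling it, but the decision logic is this simple.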