Measuring Latency (2015)

(bravenewgeek.com)

34 points | by dempedempe 10 hours ago

5 comments

  • rdtsc 9 hours ago
    10 years old and still relevant. Gil created a wrk fork https://github.com/giltene/wrk2 to handle coordinated omission better. I used his fork for many years, but I think he stopped updating it after a while.

    Good load testing tools will have a mode that sends requests at a fixed rate, regardless of outstanding responses, to handle coordinated omission. k6, for instance, defines these modes as "open" and "closed": https://grafana.com/docs/k6/latest/using-k6/scenarios/concep.... They mention the term "coordinated omission" on the page; however, I feel like they could have given a nod to Gil for inventing the term.
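
    To make the open vs. closed distinction concrete, here is a toy simulation (my own sketch, not how k6 or wrk2 actually work): the simulated server stalls for one second; the closed-loop tester stops sending while it waits, so it barely samples the stall, while the fixed-rate open-loop tester counts all the queueing delay behind it.

    ```python
    import numpy as np

    RATE_HZ, DURATION_S = 100, 10            # intended load: 100 req/s for 10 s
    NORMAL_MS, STALL_START_S, STALL_MS = 5, 4.0, 1_000

    def service_time_ms(t):
        """5 ms normally; a request arriving during the stall waits until it clears."""
        stall_end = STALL_START_S + STALL_MS / 1_000
        if STALL_START_S <= t < stall_end:
            return (stall_end - t) * 1_000 + NORMAL_MS
        return NORMAL_MS

    # Closed model: the next request goes out only after the previous response
    # returns, so almost no samples land inside the stall.
    closed, t = [], 0.0
    while t < DURATION_S:
        lat = service_time_ms(t)
        closed.append(lat)
        t += lat / 1_000 + 1 / RATE_HZ       # response time + pacing gap

    # Open model: requests are scheduled at a fixed rate no matter what, and
    # latency is measured from the intended send time, so the queue that builds
    # up behind the stall is fully counted.
    open_model, backlog_until = [], 0.0
    for t in np.arange(0, DURATION_S, 1 / RATE_HZ):
        start = max(t, backlog_until)
        finish = start + service_time_ms(start) / 1_000
        open_model.append((finish - t) * 1_000)
        backlog_until = finish

    for name, lats in (("closed", closed), ("open", open_model)):
        print(f"{name:>6}: {len(lats):4d} samples, p99 = {np.percentile(lats, 99):7.1f} ms")
    ```

    The closed run reports a p99 of about 5 ms because it sent almost nothing during the stall; the open run reports a p99 close to a full second.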

  • 10000truths 7 hours ago
    The table is a bit misleading. Most of the resources of a website are loaded concurrently and are not on the critical path of the "first contentful paint", so latency does not compound as quickly as the table implies. For web apps, much of the end-to-end latency hides lower in the networking stack. Here's the worst-case latency for a modern Chrome browser performing a cold load of an SPA website:

    DNS-over-HTTPS-over-QUIC resolution: 2 RTTs

    TCP handshake: 1 RTT

    TLS v1.2 handshake: 2 RTTs

    HTTP request/response (HTML): 1 RTT

    HTTP request/response (bundled JS that actually renders the content): 1 RTT

    That's 7 round trips. If your connection crosses a continent, that's easily a 1-2 second time-to-first-byte for the content you actually care about. And no amount of bandwidth will decrease that, since the bottlenecks are the speed of light and router hop latencies. Weak 4G/WiFi signal and/or network congestion will worsen that latency even further.
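
    To put rough numbers on how those serial round trips compound, here is a back-of-the-envelope sketch; the per-trip RTT figures are illustrative assumptions, not measurements.

    ```python
    # RTT counts mirror the list above; per-trip RTTs are illustrative assumptions.
    ROUND_TRIPS = {
        "DoH-over-QUIC DNS resolution": 2,
        "TCP handshake": 1,
        "TLS 1.2 handshake": 2,
        "HTTP request/response (HTML)": 1,
        "HTTP request/response (render-blocking JS bundle)": 1,
    }

    def worst_case_ttfb_ms(rtt_ms: float) -> float:
        """Serial round trips simply add: total = number of RTTs * per-trip RTT."""
        return sum(ROUND_TRIPS.values()) * rtt_ms

    for rtt in (20, 80, 150):  # ms: same metro, cross-country, cross-continent
        print(f"RTT {rtt:>3} ms -> ~{worst_case_ttfb_ms(rtt):,.0f} ms before useful content arrives")
    ```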

    • jiggawatts 5 hours ago
      The reason a CDN is so effective at improving the perceived performance of a web site is that it shortens (and hence cuts the speed-of-light delay of) these first 7 round trips by moving the static parts of the web app (HTML+JS) to the "edge", which is just a bunch of cache boxes scattered around the world.

      The user no longer has to connect to the central app server; they can connect to their nearest edge cache box, which is probably a lot closer to them (1-10 ms is typical).

      Note that stateful API calls will still need to go back to the central app server, potentially an intercontinental hop.
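
      As a rough illustration of that split (the RTT values below are assumptions, and the API call is assumed to reuse its already-established connection):

      ```python
      # Static assets terminate at a nearby edge; one stateful API call still
      # pays the full trip to the origin. Numbers are illustrative.
      EDGE_RTT_MS, ORIGIN_RTT_MS = 10, 150   # "1-10 ms is typical" vs. cross-continent
      STATIC_RTTS, API_RTTS = 7, 1           # handshakes + HTML/JS vs. one API round trip

      without_cdn = (STATIC_RTTS + API_RTTS) * ORIGIN_RTT_MS
      with_cdn = STATIC_RTTS * EDGE_RTT_MS + API_RTTS * ORIGIN_RTT_MS
      print(f"no CDN:   ~{without_cdn} ms before the app is usable")
      print(f"with CDN: ~{with_cdn} ms (static at the edge, API still at the origin)")
      ```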

      • 10000truths 4 hours ago
        Indeed, at some point, you can't lower tail latencies any further without moving closer to your users. But of the 7 round trips that I mentioned above, you have control over 3 of them: 2 round trips can be eliminated by supporting HTTP/3 over QUIC (and adding HTTPS DNS records to your zone file), and 1 round trip can be eliminated by server-side rendering. That's a 40-50% reduction before you even need to consider a CDN setup, and depending on your business requirements, it may very well be enough.
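
        A quick tally of that round-trip budget (my own sketch of the arithmetic, assuming the cold-load breakdown above): HTTP/3 over QUIC, advertised via an HTTPS DNS record so the browser connects over h3 immediately, collapses the TCP and TLS handshakes into one round trip, and server-side rendering removes the separate JS-bundle fetch.

        ```python
        baseline = {
            "DoH-over-QUIC DNS": 2,
            "TCP handshake": 1,
            "TLS 1.2 handshake": 2,
            "HTML fetch": 1,
            "JS bundle fetch": 1,
        }
        optimized = {
            "DoH-over-QUIC DNS": 2,
            "QUIC handshake (transport + TLS 1.3)": 1,  # replaces TCP + TLS 1.2
            "HTML fetch (server-side rendered)": 1,     # no render-blocking JS fetch
        }
        b, o = sum(baseline.values()), sum(optimized.values())
        print(f"{b} RTTs -> {o} RTTs, a {100 * (b - o) / b:.0f}% reduction")
        ```

        That works out to 7 -> 4 round trips, i.e. roughly the 40-50% reduction mentioned above.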
  • Fripplebubby 10 hours ago
    > This is partly a tooling problem. Many of the tools we use do not do a good job of capturing and representing this data. For example, the majority of latency graphs produced by Grafana, such as the one below, are basically worthless. We like to look at pretty charts, and by plotting what’s convenient we get a nice colorful graph which is quite readable. Only looking at the 95th percentile is what you do when you want to hide all the bad stuff. As Gil describes, it’s a “marketing system.” Whether it’s the CTO, potential customers, or engineers—someone’s getting duped. Furthermore, averaging percentiles is mathematically absurd. To conserve space, we often keep the summaries and throw away the data, but the “average of the 95th percentile” is a meaningless statement. You cannot average percentiles, yet note the labels in most of your Grafana charts. Unfortunately, it only gets worse from here.

    I think this is getting a bit carried away. I don't have any argument against the observation that the average of a p95 is not something that makes sense mathematically, but if you actually understand what it is, it is absolutely still meaningful. With time series data there is always some time denominator, so it really means (say) "the p95 per minute, averaged over the last hour", which is, or at least can be, meaningful and useful at a glance (see the sketch below for how that averaged number relates to the true hourly p95).

    Also, the claim that "[o]nly looking at the 95th percentile is what you do when you want to hide all the bad stuff" is very context dependent. As long as you understand what it actually means, I don't see the harm in it. The author's point is that, because loading a single webpage results in 40 or so requests, you are much more likely to hit a p99, so you should really care about p99 and up - more power to you, if that's contextually appropriate, then it is absolutely right, but it really only applies to a web server serving webpage assets, which is only one kind of software you might be writing. I think it is definitely important to know, for one given "eyeball" waiting on your service to respond, what the actual flow is - whether it's just one request, or multiple concurrent requests, or some kind of dependency graph of calls to your service all needed in sequence - but I don't really think that challenges the commonsense notion of latency, does it?
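
    For what it's worth, here is a small sketch (data invented) of what "the p95 per minute averaged over the last hour" does and doesn't capture: each minute's percentile gets equal weight regardless of how many requests that minute served, so an overloaded minute barely moves the averaged number even though it dominates the true hourly p95.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # 59 quiet minutes (1k requests, ~20 ms median) and one overloaded minute
    # with 20x the traffic and 10x the latency.
    minutes = [rng.lognormal(3.0, 0.3, 1_000) for _ in range(59)]
    minutes.append(rng.lognormal(3.0, 0.3, 20_000) * 10)

    avg_of_p95 = np.mean([np.percentile(m, 95) for m in minutes])
    true_p95 = np.percentile(np.concatenate(minutes), 95)
    print(f"average of per-minute p95: {avg_of_p95:.0f} ms")
    print(f"p95 over the whole hour:   {true_p95:.0f} ms")
    ```

    Both numbers are legitimate summaries; they just answer different questions.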

    • camel_gopher 9 hours ago
      Nearly all time series databases store single-value aggregations (think p95) over a time period. A select few store actual serialized distributions (Atlas from Netflix, Apica IronDB, some bespoke implementations). Latency tooling is sorely overlooked, mostly because the good tooling is complex and requires corresponding visualization tooling. Most vendors have some implementation of heat map or histogram visualization, but either the math is wrong or the UI can’t handle a non-trivial volume of samples. Unfortunately it’s been a race to the bottom for latency measurement tooling, with the users losing.

      Source: I’ve done this a lot
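
      To make "store the distribution, not the percentile" concrete, here is a minimal sketch (fixed log-spaced buckets, a simplification of what HdrHistogram-style tooling does): per-window histograms can be summed and then queried for any percentile after the fact, which is exactly what a stored p95 cannot.

      ```python
      import numpy as np

      BUCKETS = np.logspace(0, 4, 50)    # bucket edges: 1 ms .. 10 s, log-spaced

      def to_histogram(latencies_ms):
          """Collapse raw samples into mergeable per-bucket counts."""
          counts, _ = np.histogram(latencies_ms, bins=BUCKETS)
          return counts

      def percentile_from_histogram(counts, q):
          """Approximate the q-th percentile as the upper edge of the covering bucket."""
          cumulative = np.cumsum(counts)
          idx = np.searchsorted(cumulative, q / 100 * cumulative[-1])
          return BUCKETS[idx + 1]

      rng = np.random.default_rng(1)
      per_minute = [to_histogram(rng.lognormal(3.0, 0.5, 5_000)) for _ in range(60)]
      hourly = np.sum(per_minute, axis=0)  # unlike percentiles, histograms merge by addition
      print(f"hourly p99 ≈ {percentile_from_histogram(hourly, 99):.0f} ms")
      ```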

      • Fripplebubby 8 hours ago
        I take it as a given that what is stored and graphed is an information-destroying aggregate, but I think that aggregate is actually still useful + meaningful
        • camel_gopher 7 hours ago
          Someone smart I know coined it as “wrong but useful”
  • hakkikonu 3 hours ago
    "How NOT to Measure Latency" by Gil Tene https://www.youtube.com/watch?v=lJ8ydIuPFeU
  • tomhow 10 hours ago
    One previous discussion at time of publication:

    A summary of how not to measure latency - https://news.ycombinator.com/item?id=10732469 - Dec 2015 (3 comments)