Low Latency Microservices, a Retrospective
In this article, I share some important lessons from our experience and what we learned after five years of developing and supporting low latency microservices.
Join the DZone community and get the full member experience.
Join For FreeI wrote an article on low latency microservices almost five years ago now. In that time, my company has worked with a number of tier-one investment banks to implement and support those systems. What has changed in that time and what lessons have we learned?
Read this article and learn what we learned after five years of developing and supporting low latency microservices.
Separation of Concerns Give Better Testability
Microservices repeatedly demonstrated that testing and debugging business components were much easier with simple, stand-alone components with clear contracts between microservices.
While unit tests were still used to start with, in 2017 we moved almost entirely to behavior-driven development of microservices. Unit tests are still used for lower-level libraries and utilities. As our microservices are all based on Kappa Architecture, all our behavior-driven tests are modeled as a series of events in and out of the service.
An input test might look like this:
---
oms: OMS1 # the service to receive this message
newOrder: {
eventId: orderevent1,
eventTime: 2017-04-27T07:26:40.9836487,
triggerTime: 2017-05-05T12:56:40.7534873,
instrument: USDCHF,
settlementTime: 2017-04-26T09:46:40,
market: FXALL_MID,
orderId: orderid1,
hedgerName: Hedger1,
user: MidHedger,
orderType: MARKET,
side: SELL,
quantity: 500E3,
maxShowQuantity: 500E3,
timeInForceType: GTC,
timeInForceExpireTime: 2018-01-01T01:00:00
}
---
# more messages
While the output looks very similar as this is an Order Management Service and its job is to filter and track orders.
---
newOrder: {
eventId: orderevent1,
eventTime: 2017-04-27T07:26:40.9836487,
triggerTime: 2017-05-05T12:56:40.7534873,
instrument: USDCHF,
settlementTime: 2017-04-26T09:46:40,
market: FXALL_MID,
orderId: orderid1,
hedgerName: Hedger1,
orderType: MARKET,
quantity: 500E3,
user: MidHedger,
side: SELL,
maxShowQuantity: 500E3,
timeInForceType: GTC,
timeInForceExpireTime: 2018-01-01T01:00:00
}
---
# more results
Building variations on tests to explore all the things which could go wrong and check how they are handled is easy.
What We Needed to Add
Beyond implementing what we envisioned five years ago, there were some features we discovered we needed to add.
A deterministic clock
To ensure our services produced the same results every time, whether in tests or between production and any redundant system, we made the time input. This appeared in our test like this:
periodicUpdate: 2017-04-27T07:26:51
---
This ensured that all time-outs or events triggered by the clock could be tested, but also ensure each redundant system did the same things at the same point and produced the same output.
Nanosecond timestamps
We started with millisecond timestamps but quickly found we needed greater resolution switching to microseconds timestamps, we now use nanosecond resolution timestamps.
We found that nanosecond timestamps were more useful if we could also ensure some level of uniqueness. We added support for ensuring nanosecond timestamps are unique on a single host in a low latency way. This takes well under 100 nanoseconds. That way a timestamp can be used as a unique id for tracing events through a system (adding a pre-set host id if needed).
Ability to store complex data as primitives
For testing purposes, all messages appear as text, however for performance reasons all data is written/read in a binary form. The typical latency of a persisted message between microservices is less than a microsecond so how the objects are stored can make a big difference.
The use of String and LocalDateTime was very expensive for our use case. To reduce the impact of storing this information, we developed a number of strategies for:
- Encoding Strings and dates in
long
fields. - Object pooling Strings
- Storing text in a mutable/reusable field
Ultimately, this lead to the support of Trivially Copyable objects, a concept adopted from C++, where the majority (or entire) Java object could be copied as a memory copy without any serialization logic. This allowed us to support passing complex market data with around 50 fields between microservices in different processes at well under a microsecond most of the time.
Ability to run microservices as a single process or a single thread
One problem with microservices is that they are less productive when it comes to integration tests and development. To solve this we created a means of running multiple microservices in a single container/JVM, or even in a single thread. This gave our customers the flexibility of running parts of the system as a compound microservice in the test environment while they could still deploy individual microservices independently in production.
Simplifying restarts with idempotency
Idempotency allows an operation to be harmlessly attempted multiple times. Restarts are simplified by replaying any messages you are not sure if they were processed fully.
In low latency systems, transnationality needs to be as simple as possible. This suits most needs, however when you get a complex transaction, idempotency can significantly reduce the complexity of recovery.
Multiple replay strategies
Having a complete deterministic persisted record of every message made restarts and failover simpler, however, we found that different use cases needed different strategies as to how that is done. We have learned a lot about real replayability/restartability requirements from customers' real use cases and the trade-offs they make for service start time vs accuracy etc.
One key feature is ensuring upstream messages are replicated before downstream ones are to simplify rebuilding the current state.
What Became Less Important
Some of the features we thought would be vital five years ago, turned out to be not always required as clients used our technology in broader contexts.
Ultra-low garbage collection
Five years ago, we were entirely focused on no minor collections over a day of processing. However, many clients wanted the productivity gains we offered but didn't have stringent performance requirements and occasional garbage collections were not a concern. Creating microservices naturally supports having latency-critical services as well as less latency-critical services in the same system.
This required a shift in some of our thinking which assumed GCs never occurred normally. We significantly reduced our use of WeakReferences for example.
Low throughput data processing
The lowest latencies tend to be at around 1% of the peak capacity. At lower throughputs, you tend to get either your hardware trying to power save, increasing latency when an event does occur, and at higher throughputs, you increase the chance of a message coming in while you are still processing previous ones, adding to latency.
Again, we saw a broadening of use cases to very low message rates which showed up behavior that is only seen when the machine is spending most of its time waiting. We added performance test tools to see how our code behaves when the CPU is running cold.
We also saw the need to support higher throughputs of messages for hours at a time (rather than seconds or minutes). For example, if we had a microservice that could process a million messages per second, we would test the latency at 1% of this as this was considered the normal volume. It also wasn't possible to get high-performance, high-volume drives that could sustain this rate for hours without filling up. Today, we are testing the latency of systems sustaining long bursts of one million messages per second for many hours at a time.
If you are looking for such a drive you can test on a desktop, I suggest looking at the Corsair MP600 Pro series
Make Your Infrastructure as Fast as Your Application Needs
Over the last five years, the requirements for core systems have been more stringent, however, as we need to integrate with existing systems, we have seen the need to easily support systems that don't have the same requirements (and would rather it be easy and natural to work with).
For the more stringent systems, the latencies clients care about:
- Have moved from the 99%ile (worst 1 in 100) to the 99.9%ile or 99.99%ile
- The latencies they are looking to achieve, wire to wire (as measured on the network), is more around the 99.99%ile at 20 microseconds or 99.9%ile at 100 microseconds
- Have a clearer idea of the worst-case latencies they need to see. Many are looking for low milliseconds end to end worst case
At the same time, our client needs to integrate with existing systems where all they need is for that to be as easy as possible.
Make the Message Format a Configuration Consideration
Five years ago I imagined we would need to support all sorts of formats however a relatively small number turned out to be really useful. Have a look:
- An existing format. Make using a corporate standard format easy, no need to use ours
- Text format, YAML seems the best one, esp for readability and typed data
- The binary form of YAML. The good trade-off between ease of use and performance
- Typed JSON for use over WebSockets to Javascript based UIs
- Marshaling of fields as binary
- Direct memory copy objects e.g., Trivially Copyable, to maximize speed
- FIX protocol
We regularly use a single DTO in multiple formats depending on what is most appropriate without the need to copy data between DTO specialized for a given format.
Conclusion
For the use case of our clients, I believe most of the concerns around replacing microservices with a monolith have been solved. However, having the option to run multiple microservices as if it was a monolith handles those cases where it is easier to work with e.g., testing and debugging multiple services.
Opinions expressed by DZone contributors are their own.
Comments