Monitoring with DataDog
Recently I found myself sending more and more business metrics to Datadog, a Software as a Service solution that promises to collect all your data points and build business metrics, displaying them as graphs and triggering alerts whenever they get to critically low (or high) levels.
The goals
The more your automated tests raise their level of abstraction, the more they become oriented to external quality (what the customer wants and does) instead of internal quality (low coupling, high cohesion of the software design). The largest end-to-end tests that we have in place at Onebip connect several different projects on an integration server and run everything from the creation of a purchase or subscription to its renewal and termination (events that would happen months after creation).
However, even end-to-end tests cannot guarantee that our applications work against external resources, such as merchants, mobile carriers, and ISPs. The only way to catch integration problems is monitoring. These problems, like a mobile carrier experiencing an outage, may be due to our errors or to external conditions, but they should nevertheless be discovered as early as possible.
The infrastructure
Datadog is the only data-collection service that passed the stress tests of SLL, our solution architect. It ships as a UDP server that you pay for based on the number of machines you want to run it on; a preproduction and a production server are a common choice to start out. The server collects data locally and periodically uploads it to Datadog in bursts, where you can access it via a web application or via APIs in case you want to query it from your build.
UDP fits the goals of metric collection, since it gives you a silent server that decouples the sending of metrics from the rest of the business logic:
- UDP packets are simply lost if no process is listening for them. No errors are raised if the server crashes, is not running, or is not installed (for instance, on development machines).
- The monitoring code, which you write, should be as decoupled and asynchronous as possible. The part that talks over the network is already externalized in the Datadog server, but you don't want the user to wait because you have to send some strange number.
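The fire-and-forget behaviour described in the list above can be sketched in a few lines of PHP. This is only an illustration: the function name sendMetric, the metric name, and the address 127.0.0.1:8125 (the usual default for the agent's StatsD-compatible UDP listener) are assumptions, not code taken from our projects.

```php
<?php
// Hypothetical helper: send one StatsD-style counter increment over UDP.
// Returns true if the datagram was handed to the OS, false otherwise.
// Network failures are reported through the return value, not exceptions.
function sendMetric(string $name, int $value = 1, string $host = '127.0.0.1', int $port = 8125): bool
{
    // UDP has no handshake, so this succeeds even when nothing is
    // listening on the other side.
    $socket = @stream_socket_client("udp://$host:$port", $errno, $errstr, 1);
    if ($socket === false) {
        return false;
    }
    // "name:value|c" is the StatsD wire format for a counter.
    $written = @fwrite($socket, "$name:$value|c");
    fclose($socket);
    return $written !== false;
}

// Behaves the same whether or not an agent is running locally.
sendMetric('payments.mo.received');
```

The caller can ignore the return value entirely, which is the point: a missing or crashed agent costs one lost packet and nothing else.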
So the internal part (sending via UDP) is performed in Listener objects that implement the Observer pattern. These objects still have to be wrapped in all-encompassing try/catch constructs so that errors in the monitoring part never influence the business logic. Again, you don't want a payment to fail because of an exception in how monitoring DateTime objects are built.
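As a rough illustration of the Observer arrangement, here is a minimal sketch in which the business object only notifies whoever is attached and knows nothing about metrics. The names PaymentService, CountingListener and paymentCompleted are hypothetical, and the listener counts in memory instead of sending over UDP so that the snippet stays self-contained.

```php
<?php
// Hypothetical business object: it notifies listeners, nothing more.
class PaymentService
{
    private $listeners = [];

    // Untyped on purpose, so any object exposing paymentCompleted() fits.
    public function attach($listener): void
    {
        $this->listeners[] = $listener;
    }

    public function complete(string $country, string $carrier): string
    {
        // ... the real business logic would run here ...
        foreach ($this->listeners as $listener) {
            $listener->paymentCompleted($country, $carrier);
        }
        return 'ok';
    }
}

// Hypothetical monitoring listener; a real one would send a metric.
class CountingListener
{
    public $counts = [];

    public function paymentCompleted(string $country, string $carrier): void
    {
        $key = "$country/$carrier";
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
    }
}

$service = new PaymentService();
$monitor = new CountingListener();
$service->attach($monitor);
$service->complete('IT', 'Vodafone');
```

Note that nothing here protects the payment from a listener that throws; that protection is a separate concern.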
For PHP we built a SilentListener class to wrap all of our objects:
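```php
class SilentListener
{
    private $wrapped;

    public function __construct($wrapped)
    {
        $this->wrapped = $wrapped;
    }

    // Forward every call to the wrapped listener, swallowing exceptions.
    public function __call($method, $args)
    {
        try {
            call_user_func_array(array($this->wrapped, $method), $args);
        } catch (Exception $e) {
            $this->log($e);
        }
    }

    // The original log() is not shown; error_log() is an assumption.
    private function log(Exception $e)
    {
        error_log($e->getMessage());
    }
}
```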
An example
In some countries, we receive payments through mobile-originated messages (MO), a fancy term for SMS sent by the end user. So a simple way to monitor whether we are receiving payments or the server has exploded is to upload a metric counting them every time we receive one (pseudo-JSON format to show you the data):
{ counter: 1 }
However, we can be more precise than this: an external outage or an integration problem may happen at a lower level than the whole application. For example, MOs can be delayed in Argentina, by a single carrier, while the rest of the world is still working fine.
So our data points look like this:
{ counter: 1, tags: { country: "IT", carrier: "Vodafone", merchant: "Tasty Cookies, Inc.", } }
and in turn graphs on Datadog or calls to its API can set up filters so that we can, if necessary, view only the data related to any combination of country, carrier and merchant.
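To make the tagged data point concrete, here is a sketch of how it could be rendered in the DogStatsD datagram format, where tags follow the counter as a comma-separated list after "|#". The function name formatTaggedCounter, the metric name, and the way unsafe characters are normalized are assumptions for illustration. Some normalization is needed in any case, because a merchant name containing a comma would otherwise be split into two tags.

```php
<?php
// Hypothetical helper: render a counter with tags in the DogStatsD
// datagram format "name:value|c|#key:value,key:value".
function formatTaggedCounter(string $name, int $value, array $tags): string
{
    $parts = [];
    foreach ($tags as $key => $tagValue) {
        // Commas separate tags on the wire, so runs of unsafe characters
        // are collapsed to one underscore (an assumed policy).
        $safe = preg_replace('~[^A-Za-z0-9_./-]+~', '_', (string) $tagValue);
        $parts[] = $key . ':' . trim($safe, '_');
    }
    $datagram = "$name:$value|c";
    if ($parts !== []) {
        $datagram .= '|#' . implode(',', $parts);
    }
    return $datagram;
}

echo formatTaggedCounter('payments.mo.received', 1, [
    'country'  => 'IT',
    'carrier'  => 'Vodafone',
    'merchant' => 'Tasty Cookies, Inc.',
]), PHP_EOL;
// payments.mo.received:1|c|#country:IT,carrier:Vodafone,merchant:Tasty_Cookies_Inc.
```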
The nice thing, SLL says, is that you just start sending data from production, and only after you have data points available do you build a graph or an alert system based on what appear to be the most important tags. For example, a big merchant may benefit from some dedicated monitoring, while minor countries such as Vietnam should be monitored as a whole since their traffic is far lower than that of the others.