8 Ways Mobile Observability Is the Opposite of Backend Observability
Backend systems run under highly controlled conditions, while mobile apps face enormous variability across user experiences, devices, network conditions, and more.
Using a hammer to pound a screw into the wall will work, but it's not a great way to get the job done, and it will probably leave you with damage you wish hadn't occurred. Similarly, using monitoring tools initially designed for the observability of backend applications to monitor your mobile applications will leave you wishing you had reached for the screwdriver instead of the proverbial hammer.
Often the observability challenges for mobile applications are pretty much the opposite of what they are for backend monitoring. Let's take a look at 8 examples where that is the case.
To make this more concrete, we will use a typical e-commerce mobile application and the backend application that handles its requests as an example to illustrate these differences. However, the comparisons in these examples are broadly applicable to other types of mobile applications and backend systems that you are running.
Duration of Interactions
For high-traffic services in your backend application, you want requests to take milliseconds on average, and you want to scale to handle thousands of requests per second. You don't maintain state between requests, and it's uncommon for prior requests to cause bugs in the current request.
Some examples of service calls here include:
- Getting a list of specific products.
- Completing a purchase.
- Fetching a list of alternative products for a given product.
The data needed to troubleshoot issues here is likely within the request or the supporting infrastructure. You can trace the individual calls and connected service calls, and then inspect them to look for failure points.
However, in your e-commerce mobile app, a single session lasts from multiple seconds to minutes, or even hours. If you want to understand why purchases are failing, the problem could stem from many application and device factors:
- Did the user background the app between adding items to the cart and attempting to complete the purchase? Certain data might be lost during such app state transitions.
- Did the app run out of memory in a product list view due to excessive loading and retention of product images not appropriately scaled for the device?
- Did the app not complete the payment processing in a timely manner, so the user force quit the app?
- Did the device lose network connectivity and fail to gracefully recover?
- Did the app crash during the purchase flow, and the user decided to purchase elsewhere?
Since mobile is such a dynamic environment, a drop in purchases could have many root causes that fall outside of the actual service calls. The span over which errors can be introduced is far greater than in backend interactions. How you visualize and interpret data becomes very different when your expectation is that issues can evolve over minutes and not over milliseconds.
Session Complexity
When you envision what a complete session is for your backend application, it frequently boils down to responding to a request from the client. The external variables at play are mostly your infrastructure’s health and capacity.
In your e-commerce mobile app, a complete user experience can span multiple sessions across varying lengths of time. The user could launch, background, and then launch the app again over multiple days to complete a single purchase. Key functionality can also take place while the app is backgrounded, such as sending push notifications and pulling fresh products and deals so that the user is always getting the most up-to-date data whenever they launch the app.
Some challenges when troubleshooting apps with complex interactions include:
- Stitching together multiple app sessions to get the complete user experience context.
- Understanding how app performance is impacted by different launch conditions like cold starts versus reused app processes.
- Tracking problems with failed or outdated app states that were loaded far earlier than when the resulting error happened.
App sessions also cannot be easily modeled as a series of traces, so there are data and visualization challenges when dealing with longer, more complex experiences.
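As a rough illustration of the stitching challenge, here is a minimal Python sketch that groups one device's sessions into a single "journey" as long as the idle gap between them stays under a threshold. The `AppSession` fields, the `device_id` identifier, and the three-day default gap are all assumptions made for this example, not part of any particular SDK.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class AppSession:
    device_id: str   # hypothetical stable install identifier
    start: float     # epoch seconds
    end: float       # epoch seconds
    events: list


def stitch_sessions(sessions, max_gap_seconds=3 * 24 * 3600):
    """Group each device's sessions into journeys, splitting on long idle gaps."""
    journeys = []
    ordered = sorted(sessions, key=lambda s: (s.device_id, s.start))
    for _, device_sessions in groupby(ordered, key=lambda s: s.device_id):
        current = []
        for session in device_sessions:
            # Start a new journey when the user was away for too long.
            if current and session.start - current[-1].end > max_gap_seconds:
                journeys.append(current)
                current = []
            current.append(session)
        if current:
            journeys.append(current)
    return journeys
```

With this shape, an add-to-cart session on Monday and a purchase session on Tuesday land in the same journey, so a failure in the second session can be read alongside the state created in the first.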
Uncontrolled Devices
You control the infrastructure that your backend applications run on. As such, it would be a rookie mistake for a DevOps team to, for example, not be aware of servers that are about to run out of disk space, and most people would forgive the backend monitoring agent for not working as expected if a server ran out of disk space.
That is not the case for your e-commerce mobile app that runs on devices that you have no control over. People buy devices with the least amount of storage they think they can get away with and promptly fill them up with apps and media. You have to build resilient SDKs that can gracefully handle these situations and still report as complete a picture as possible. You have to find the right balance between retaining relevant information on the device – you may not have network connectivity to send it right away – and making the lack of disk space worse by excessively adding to the data stored on the device.
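One way to picture that balance is a telemetry buffer with a hard byte budget that discards the oldest events first and counts what it dropped. The sketch below is written in Python for brevity and keeps data in memory; a real SDK would persist to disk in the platform's own language. The class name, the budget, and the oldest-first eviction policy are assumptions for illustration.

```python
import json
from collections import deque


class BoundedEventBuffer:
    """Keeps pending telemetry under a fixed byte budget by dropping the oldest events."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.dropped = 0  # reported later so gaps in the data are visible
        self._events = deque()

    def add(self, event):
        payload = json.dumps(event).encode("utf-8")
        if len(payload) > self.max_bytes:
            self.dropped += 1  # a single oversized event can never fit
            return False
        # Evict oldest events until the new payload fits in the budget.
        while self.used_bytes + len(payload) > self.max_bytes:
            oldest = self._events.popleft()
            self.used_bytes -= len(oldest)
            self.dropped += 1
        self._events.append(payload)
        self.used_bytes += len(payload)
        return True

    def drain(self):
        """Return all pending payloads for upload and reset the buffer."""
        pending = list(self._events)
        self._events.clear()
        self.used_bytes = 0
        return pending
```

Tracking the `dropped` count matters as much as the cap itself: it lets the backend distinguish "nothing happened" from "something happened but the device had no room to keep it."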
Heterogeneous Devices
Not only do you have no control over devices, but they are also far from homogeneous. In a backend environment, you are likely to have a small set of different machine types. An Android app, by contrast, will run on tens of thousands of device models across a variety of OS versions, so you end up with more complicated variables when analyzing the collected data. Cardinality for certain dimensions will grow in ways that just would not be seen in backend applications.
Some examples of device-specific issues include:
- Your developers and QA team have modern devices for testing, which can handle the size of the product images in a list view. However, many customer devices have less RAM and end up with an out-of-memory crash.
- A manufacturer introduced a bug in their custom Android version, so customers encounter a crash that only affects your app on specific OS version/manufacturer combinations.
- The UI stutters on some devices because they have old CPU and GPU chipsets that cannot handle the complexity of your application.
With so many combinations of device variables, your engineering team needs deep insights into affected user segments to avoid costly issue investigations. Otherwise, they will spend time looking for root causes in code when seeing the holistic picture of impact would streamline their resolution efforts.
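To make the idea of affected user segments concrete, the following Python sketch computes a crash rate per manufacturer and OS version pair and sorts the worst segments first. The record fields (`manufacturer`, `os_version`, `crashed`) and the `min_sessions` cutoff are hypothetical names chosen for this example.

```python
from collections import Counter


def crash_rate_by_segment(sessions, min_sessions=1):
    """Return {(manufacturer, os_version): crash_rate}, worst segments first."""
    totals, crashes = Counter(), Counter()
    for s in sessions:
        segment = (s["manufacturer"], s["os_version"])
        totals[segment] += 1
        if s["crashed"]:
            crashes[segment] += 1
    rates = {
        seg: crashes[seg] / count
        for seg, count in totals.items()
        if count >= min_sessions  # ignore segments too small to trust
    }
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))
```

A crash that looks like background noise in the overall rate can dominate a single segment, which is usually the fastest hint that the cause is a device or OS quirk rather than a general code defect.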
Network Connectivity
Your e-commerce backend application operates with the explicit assumption of constant connectivity. Failures frequently are a capacity problem, which can be alleviated by sizing your infrastructure to handle traffic spikes. Outright losing connectivity occurs mostly during cloud provider outages, which are exceedingly rare.
However, constant network connectivity in mobile is never guaranteed. Maybe your app has a great network connection when it starts, then completely drops the network connection, and then gets it back but experiences significant lag and packet loss. Your mobile app observability solution needs to provide insight into how your app deals with these challenging conditions.
Some examples of network connectivity issues include:
- The app cannot launch without connectivity because the download of critical data is required to enter the main application flow.
- The device loses connection as the user tries to make a purchase, but the user is not greeted with a prompt about the issue. To the user, it still looks like the app is attempting to complete the purchase. They get frustrated and force quit because they don’t know the source of the issue.
- The app does not effectively cache images, so customers in locales where bandwidth is a scarce resource stop using your application.
Since problems can occur during connectivity switches, you need visibility into entire user experiences to track down problems. A great example is content refreshes or data synchronizations that are scheduled as background tasks. Understanding where failures happen under specific network conditions allows your engineering team to understand the root cause of network-related problems and introduce fixes that gracefully handle the loss of network connectivity.
Data Delays
Many backend observability tools will only accept data that is delayed by minutes or at most a few hours. This works fine for backend applications since the expectation is for the servers to not lose connectivity. The opposite is true in mobile, where the expectation is for connectivity to be lost intermittently and for a significant percentage of data to be delayed.
As an example, your engineering team notices a spike in crashes, then launches an investigation and puts out a fix in a new version. You notice the crash rate go down, and everyone is happy.
However, users on the previous version that crashed, who were too frustrated to immediately relaunch your app after it crashed, have decided to give it another go a day or two later. They launch the app, which sends a crash report from the device. If your observability tool marks those crashes as having just occurred, you might think the issue is still ongoing, even though you released a fix for it.
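The difference comes down to which timestamp you bucket by. The Python sketch below counts crash reports per UTC day using either the time the crash occurred on the device or the time the report was received. The field names `occurred_at` and `received_at` are assumptions for this example, and the sketch also assumes the device clock is reasonably accurate.

```python
from collections import Counter
from datetime import datetime, timezone


def crashes_per_day(reports, timestamp_field="occurred_at"):
    """Count crash reports per UTC day using the chosen epoch-seconds field."""
    counts = Counter()
    for report in reports:
        moment = datetime.fromtimestamp(report[timestamp_field], tz=timezone.utc)
        counts[moment.date().isoformat()] += 1
    return dict(counts)
```

Calling `crashes_per_day(reports, "received_at")` on delayed reports spreads old crashes across later days, while the default `occurred_at` view places them back on the day they actually happened, before the fix shipped.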
Ecosystem Limitations
When you build a backend application, you get to choose the environment that it runs in. The limitations on what you monitor and how you monitor it are largely dictated by the overhead it introduces and the time it takes to implement it.
On mobile, you are operating in ecosystems defined by the device manufacturers or maintainers of the ecosystem, and you need creative solutions to work around their restrictions and get the data that you need. Certain metrics are forced upon you, such as the crash and Application Not Responding (ANR) rates that, in Android, impact your ranking and discoverability on the Google Play Store.
The tricky part here is that the ecosystems have the ability to collect data from a system perspective, while you only have the ability to collect data from the perspective of your application. That means you have to get pretty inventive to find ways to collect the data that helps you solve certain problems, such as ANRs on Android.
To provide a bit more color here, ANRs occur when an Android app has a prolonged app freeze that causes the OS to display a prompt that asks the user if they want to terminate your app. Effectively, the app freezes for so long that the user is forced to crash their current app session.
From a data collection perspective, the Google Play Console treats ANRs exactly like a crash, capturing a single stack trace at the end of the session. However, app freezes are not deterministic and can stem from endless combinations of initial conditions that led to the main thread being blocked, including:
- Third-party SDKs (like ad vendors) conflicting with each other
- Loading heavy resources like large images or unoptimized ads
- Data synchronizations hitting slow backend service calls
- Heavy animations or UI work
- Slow responses to system broadcasts
With so many variables at play, your best bet is to capture data as soon as the app freezes and then examine these code patterns across your users to find the most common causes. Backend observability solutions are simply not built for these types of nuanced mobile data capture.
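As a sketch of the "examine patterns across your users" step, suppose each freeze produces a sample of the main thread's stack, captured while the app was frozen. The Python function below ranks the app-owned frame that was on the stack most often. The sample format (a list of frame strings, innermost first) and the `com.example.shop` package prefix are hypothetical, and this covers only the aggregation, not the on-device capture.

```python
from collections import Counter


def top_blocking_frames(freeze_samples, app_prefix="com.example.shop", limit=3):
    """Rank the frame blamed for each freeze sample by how often it appears."""
    counts = Counter()
    for frames in freeze_samples:
        if not frames:
            continue
        # Prefer the innermost app-owned frame; fall back to the innermost
        # frame overall when no app code is on the stack (e.g. an SDK thread).
        owner = next((f for f in frames if f.startswith(app_prefix)), frames[0])
        counts[owner] += 1
    return counts.most_common(limit)
```

Grouping by the app-owned frame rather than the innermost system call is a design choice: many different freezes end in the same framework method, so the caller is often the more useful grouping key.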
Deploying New Code
If you discover an issue in your backend application, code can be consistently redeployed with all instances running new code. That means, if you spot an issue that’s preventing the system from completing purchases, the biggest delay is in tracking down the root cause and writing the code to fix it.
In mobile, you can’t control when people upgrade their app version. There will be a long tail of old versions out in the wild. It is not unusual for a large, established application to have over a hundred different versions used in a single day. As such, it’s vital that you minimize the number of users who download bad app versions. Slow rollouts and real-time visibility into user experiences can help you proactively address issues before they become widespread. Your mobile observability solution should surface signals that allow for early issue detection for every type of broken experience, including:
- Performance issues like slow startups or purchase flows
- Stability issues like crash, error, or ANR spikes
- User frustration issues like abandons and force quits
- Device resource issues like excessive memory, CPU, and battery consumption
- Network issues like failing first and third-party endpoints
Mobile is so complex that engineering teams frequently must add logs and release new versions to build enough context to uncover root causes. This approach is riddled with guesswork, resulting in additional releases – some of which will introduce, rather than solve, problems – out in the wild. You want your mobile observability solution to provide complete visibility so that your engineering team can get to solutions faster and without sacrificing feature velocity.
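For the slow-rollout idea in this section, a simple early-detection signal is to compare the new version's crash rate against the current baseline before widening the rollout. The Python sketch below is one possible gate. The dictionary fields, the 1.5x ratio, and the 500-session minimum are illustrative assumptions, not recommended values.

```python
def should_halt_rollout(new_version, baseline, max_ratio=1.5, min_sessions=500):
    """Return True when the new version's crash rate clearly exceeds the baseline's."""
    if new_version["sessions"] < min_sessions:
        return False  # not enough data yet to judge a staged rollout
    new_rate = new_version["crashes"] / new_version["sessions"]
    base_rate = baseline["crashes"] / max(baseline["sessions"], 1)
    # Note: with a crash-free baseline, any crash in the new version trips the gate.
    return new_rate > base_rate * max_ratio
```

The same comparison can be repeated for the other signals listed above, such as ANR rate, slow startups, or force quits, so that a release is paused on any broken experience and not only on crashes.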
Closing Thoughts
At first glance, the challenges of achieving observability in a mobile application may not seem all that different from doing so for a backend application – collect some data, store it in a database, visualize it in a dashboard – but, on closer inspection, the nuances of each of those steps are quite different for the two domains. Trying to use the same tool for monitoring your mobile application as you do for your backend application is better than having no visibility, but it will ultimately deprive you of the full clarity of what is happening in your mobile app. Your developers will take longer to figure out how to solve the problems that exist, if they can even detect that they are occurring.
If you rely on a backend observability approach for your mobile applications, there are mobile-first approaches that can eliminate toil and guesswork while integrating across your existing tech stack for full-stack visibility. In addition, given the different challenges in collecting data for mobile apps versus backend systems, open-source communities and governing groups are actively working on what mobile telemetry standards should be in order to power the future of mobile observability.