Is DataOps the Future of the Modern Data Stack?
As data needs scale, teams need to start prioritizing reliability. Here’s why DataOps might be the answer—and how you can get started.
Before DevOps took the software engineering world by storm, developers were left in the dark once their applications were up and running.
Instead of being the first to know when outages occurred, engineers would only find out when customers or stakeholders complained of “laggy websites” or one too many 503 pages.
Unfortunately, this led to the same mistakes occurring repeatedly as developers lacked insight into application performance and didn’t know where to start looking to debug their code if something failed.
The solution? The now widely adopted concept of DevOps: an approach that mandates collaboration and continuous iteration between developers (Dev) and operations (Ops) teams throughout the software development and deployment process.
By the mid-2010s, large data-first companies such as Netflix, Uber, and Airbnb had begun applying continuous integration / continuous deployment (CI/CD) principles to their data teams, even building open source tools to support the practice, and DataOps was born.
In fact, if you’re a data engineer, you’re probably already applying DataOps processes and technologies to your stack, whether or not you realize it.
Over the past few years, DataOps has grown in popularity among data teams of all sizes as a framework that enables quick deployment of data pipelines while still delivering reliable and trustworthy data that is readily available.
DataOps can benefit any organization, which is why we put together a guide to help clear up any misconceptions you might have around the topic.
In this guide, we’ll explain what DataOps is, how it differs from DevOps, the framework and best practices behind it, and how your organization can benefit from adopting it.
What Is DataOps?
DataOps is a discipline that merges data engineering and data science teams to support an organization’s data needs, in a similar way to how DevOps helped scale software engineering.
Similar to how DevOps applies CI/CD to software development and operations, DataOps entails a CI/CD-like, automation-first approach to building and scaling data products. At the same time, DataOps makes it easier for data engineering teams to provide analysts and other downstream stakeholders with reliable data to drive decision-making.
DataOps vs. DevOps
While DataOps draws many parallels with DevOps, there are important distinctions between the two.
The key difference is DevOps is a methodology that brings development and operations teams together to make software development and delivery more efficient, while DataOps focuses on breaking down silos between data producers and data consumers to make data more reliable and valuable.
Over the past several years, DevOps teams have become integral to most engineering organizations, removing silos between software developers and IT as they facilitate the seamless and reliable release of software to production. DevOps rose in popularity as organizations grew and the tech stacks that powered them increased in complexity.
To keep a constant pulse on the overall health of their systems, DevOps engineers leverage observability to monitor, track, and triage incidents to prevent application downtime.
Software observability consists of three pillars:
- Logs: A record of an event that occurred at a given timestamp, along with context about that specific event.
- Metrics: A numeric representation of data measured over a period of time.
- Traces: Represent events that are related to one another in a distributed environment.
Together, the three pillars of observability give DevOps teams the ability to predict future behavior and trust their applications.
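To make these pillars a bit more concrete, here is a minimal Python sketch of a service emitting all three signals. The service name, the `record_metric` helper, and the payload fields are illustrative stand-ins, not any particular vendor’s API.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout-service")


def record_metric(name: str, value: float, tags: dict) -> None:
    # Metric: a numeric measurement sampled over time. This is a stand-in for
    # a real metrics client (StatsD, Prometheus, etc.); here we simply log it.
    logger.info(json.dumps({"metric": name, "value": value, "tags": tags}))


def handle_request(payload: dict) -> None:
    # Trace: a shared ID that ties together related events across services.
    trace_id = str(uuid.uuid4())

    start = time.time()
    # ... the application's real work would happen here ...
    latency_ms = (time.time() - start) * 1000

    # Log: a timestamped record of a single event, with context attached.
    logger.info(json.dumps({
        "event": "request_handled",
        "trace_id": trace_id,
        "payload_keys": sorted(payload),
    }))

    record_metric("request_latency_ms", latency_ms, tags={"service": "checkout"})


handle_request({"cart_id": "abc-123", "items": 2})
```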
Similarly, the discipline of DataOps helps teams remove silos and work more efficiently to deliver high-quality data products across the organization.
DataOps professionals also leverage observability to decrease downtime as companies begin to ingest large amounts of data from various sources.
Data observability is an organization’s ability to fully understand the health of the data in their systems. It reduces the frequency and impact of data downtime (periods of time when your data is partial, erroneous, missing, or otherwise inaccurate) by monitoring and alerting teams to incidents that may otherwise go undetected for days, weeks, or even months.
Like software observability, data observability includes its own set of pillars:
- Freshness: Is the data recent? When was it last updated?
- Distribution: Is the data within accepted ranges? Is it in the expected format?
- Volume: Has all the data arrived? Was any of the data duplicated or removed from tables?
- Schema: What’s the schema, and has it changed? Were the changes to the schema made intentionally?
- Lineage: Which upstream and downstream dependencies are connected to a given data asset? Who relies on that data for decision-making, and what tables is that data in?
By gaining insight into the state of data across these pillars, DataOps teams can understand and proactively address the quality and reliability of data at every stage of its lifecycle.
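To make these pillars concrete, here is a minimal, hedged sketch of rule-based checks for freshness, volume, and schema. It assumes a sqlite3-style DB-API connection and a hypothetical `orders` table whose `updated_at` column stores ISO-8601 UTC timestamps; a dedicated data observability platform would typically learn these thresholds automatically rather than hard-coding them.

```python
from datetime import datetime, timedelta

# Assumes a sqlite3-style connection and a hypothetical `orders` table whose
# `updated_at` column stores ISO-8601 UTC timestamps.

def check_freshness(conn, table: str, max_age: timedelta) -> bool:
    """Freshness: has the table been updated recently enough?"""
    last_update = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()[0]
    if last_update is None:
        return False
    return datetime.utcnow() - datetime.fromisoformat(last_update) <= max_age

def check_volume(conn, table: str, min_rows: int) -> bool:
    """Volume: did roughly the expected amount of data arrive?"""
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return row_count >= min_rows

def check_schema(conn, table: str, expected_columns: set) -> bool:
    """Schema: do the columns still match what downstream consumers expect?"""
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT 0")
    actual_columns = {col[0] for col in cursor.description}
    return actual_columns == expected_columns
```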
The DataOps Framework
Like the DevOps lifecycle, the DataOps framework is a continuous cycle of stages:
- Planning: Partnering with the product, engineering, and business teams to set KPIs, SLAs, and SLIs for the quality and availability of data (more on this in the next section).
- Development: Building the data products and machine learning models that will power your data application.
- Integration: Integrating the code and/or data product into your existing tech and data stack. (For example, you might integrate a dbt model with Airflow so the dbt model runs automatically; a minimal sketch appears at the end of this section.)
- Testing: Testing your data to make sure it matches business logic and meets basic operational thresholds (such as uniqueness of your data or no null values).
- Release: Releasing your data into a test environment.
- Deployment: Merging your data into production.
- Operate: Feeding your data into applications such as Looker or Tableau dashboards and into the data loaders that feed machine learning models.
- Monitor: Continuously monitoring and alerting for any anomalies in the data.
This cycle will repeat itself over and over again. However, by applying similar principles of DevOps to data pipelines, data teams can better collaborate to identify, resolve, and even prevent data quality issues from occurring in the first place.
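As one concrete, hedged example of how the integration, testing, and deployment stages can fit together, the sketch below uses Apache Airflow to run a dbt model and then its tests. The project path and model name are placeholders; your orchestrator and layout will differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumes a dbt project at /opt/dbt/analytics; both the path and the model
# name (`orders`) are placeholders for illustration only.
with DAG(
    dag_id="dataops_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_model = BashOperator(
        task_id="dbt_run_orders",
        bash_command="cd /opt/dbt/analytics && dbt run --select orders",
    )

    test_model = BashOperator(
        task_id="dbt_test_orders",
        bash_command="cd /opt/dbt/analytics && dbt test --select orders",
    )

    # Only consider the data promoted once it has passed the project's tests.
    run_model >> test_model
```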
Five Best Practices of DataOps
Following the lead of our friends in software development, data teams are beginning to treat data as a product.
Data is a crucial part of an organization’s decision-making process, and applying a product management mindset to how you build, monitor, and measure data products helps ensure those decisions are based on accurate, reliable insights.
After speaking with hundreds of data teams over the last few years, we’ve boiled down five key DataOps best practices that can help you adopt this “data as a product” approach.
1. Gain Stakeholder Alignment on KPIs Early, and Revisit Them Periodically.
Since you are treating data like a product, internal stakeholders are your customers. As a result, it’s critical to align early with key data stakeholders and agree on who uses data, how they use it, and for what purposes. It is also essential to develop Service Level Agreements (SLAs) for key datasets. Agreeing on what good data quality looks like with stakeholders helps you avoid spinning cycles on KPIs or measurements that don’t matter.
After you and your stakeholders align, you should periodically check in with them to ensure priorities are still the same. Brandon Beidel, a Senior Data Scientist at Red Ventures, meets with every business team at his company weekly to discuss his teams’ progress on SLAs.
“I would always frame the conversation in simple business terms and focus on the ‘who, what, when, where, and why,’” Brandon told us. “I’d especially ask questions probing the constraints on data freshness, which I’ve found to be particularly important to business stakeholders.”
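One lightweight way to make that alignment durable is to write the agreed SLAs down as code or configuration the whole team can review and measure against. The tables, owners, thresholds, and consumers below are hypothetical; they only show the shape such an agreement might take.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class FreshnessSLA:
    """An agreement with stakeholders about how fresh a dataset must be."""
    table: str
    owner: str
    max_staleness: timedelta  # the agreed threshold
    consumers: list           # who relies on this data for decisions

# Hypothetical SLAs agreed on with business stakeholders.
SLAS = [
    FreshnessSLA("analytics.daily_revenue", "data-eng", timedelta(hours=6),
                 ["finance", "executive dashboard"]),
    FreshnessSLA("analytics.marketing_spend", "data-eng", timedelta(hours=24),
                 ["growth team"]),
]

def freshness_sli(hours_since_update: float, sla: FreshnessSLA) -> bool:
    """The SLI: a simple pass/fail measurement against the agreed threshold."""
    return timedelta(hours=hours_since_update) <= sla.max_staleness
```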
2. Automate as Many Tasks as Possible
One of the primary focuses of DataOps is data engineering automation. Data teams can automate rote tasks that typically take hours to complete, such as unit testing, hard coding ingestion pipelines, and workflow orchestration.
By using automated solutions, your team reduces the likelihood of human errors entering data pipelines and improves reliability while aiding organizations in making better and faster data-driven decisions.
3. Embrace a “Ship and Iterate” Culture
Speed is of the essence for most data-driven organizations. And, chances are, your data product doesn’t need to be 100 percent perfect to add value. My suggestion? Build a basic MVP, test it out, evaluate your learnings, and revise as necessary.
My firsthand experience has shown that successful data products can be built faster by testing and iterating in production, with live data. Teams can collaborate with relevant stakeholders to monitor, test, and analyze patterns to address any issues and improve outcomes. If you do this regularly, you’ll have fewer errors and decrease the likelihood of bugs entering your data pipelines.
4. Invest in Self-Service Tooling
A key benefit of DataOps is removing the silos that leave data stranded between business stakeholders and data engineers. In order to do this, business users need the ability to self-serve their own data needs.
Rather than data teams fulfilling ad hoc requests from business users (which ultimately slows down decision-making), business stakeholders can access the data they need when they need it. Mammad Zadeh, the former VP of Engineering for Intuit, believes that self-service tooling plays a crucial role in enabling DataOps across an organization.
“Central data teams should make sure the right self-serve infrastructure and tooling are available to both producers and consumers of data so that they can do their jobs easily,” Mammad told us. “Equip them with the right tools, let them interact directly, and get out of the way.”
5. Prioritize Data Quality, Then Scale
Maintaining high data quality while scaling is not an easy task. So start with your most important data assets—the information your stakeholders rely on to make important decisions.
If inaccurate data in a given asset could mean lost time, resources, and revenue, cover that data and the pipelines that fuel those decisions with data quality capabilities like testing, monitoring, and alerting. Then, continue to build out your capabilities to cover more of the data lifecycle. (And going back to best practice #2, keep in mind that data monitoring at scale will usually involve automation.)
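As a rough, made-up illustration of this “most important assets first” approach, a tiered monitoring configuration might look like the following. All table names, checks, and alert channels are invented.

```python
# Tier-1 tables get the full set of checks first; coverage expands to lower
# tiers as the data quality practice matures.
MONITORING_TIERS = {
    "tier_1": {  # powers revenue reporting and executive dashboards
        "tables": ["analytics.daily_revenue", "analytics.orders"],
        "checks": ["freshness", "volume", "schema", "null_rates", "anomaly_detection"],
        "alerting": "page the on-call data engineer",
    },
    "tier_2": {  # internal analysis; issues matter but are less urgent
        "tables": ["analytics.web_sessions"],
        "checks": ["freshness", "volume"],
        "alerting": "post to the team Slack channel",
    },
}
```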
Four Ways Organizations Can Benefit From DataOps
While DataOps exists to eliminate data silos and help data teams collaborate, teams can realize four other key benefits when implementing DataOps.
1. Better Data Quality
Companies can apply DataOps across their pipelines to improve data quality. This includes automating routine tasks like testing and introducing end-to-end observability with monitoring and alerting across every layer of the data stack, from ingestion to storage to transformation to BI tools.
This combination of automation and observability reduces opportunities for human error and empowers data teams to proactively respond to data downtime incidents quickly—often before stakeholders are aware anything’s gone wrong.
With these DataOps practices in place, business stakeholders gain access to better data quality, experience fewer data issues, and build up trust in data-driven decision-making across the organization.
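As a simple, hedged illustration of that combination, here is a naive automated volume monitor that flags today’s load when it deviates sharply from recent history, so the team hears about it before stakeholders do. The z-score heuristic and the row counts are invented for the example; production detectors are considerably more sophisticated.

```python
import statistics

def is_volume_anomaly(daily_row_counts: list, threshold: float = 3.0) -> bool:
    """Flag today's load if it deviates strongly from recent history.

    `daily_row_counts` holds one row count per day, most recent last.
    """
    history, today = daily_row_counts[:-1], daily_row_counts[-1]
    if len(history) < 7:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

# Example: a sudden drop in loaded rows triggers an alert.
counts = [10_230, 10_118, 10_402, 9_987, 10_305, 10_221, 10_154, 512]
if is_volume_anomaly(counts):
    print("Data volume anomaly detected: alert the data team")
```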
2. Happier and More Productive Data Teams
On average, data engineers and scientists spend at least 30% of their time firefighting data quality issues. A key part of DataOps is creating an automated and repeatable process, which in turn frees up engineering time.
Automating tedious engineering tasks such as continuous code quality checks and anomaly detection can improve engineering processes while reducing the amount of technical debt inside an organization.
DataOps leads to happier team members who can focus their valuable time on improving data products, building out new features, and optimizing data pipelines to accelerate the time to value for an organization’s data.
3. Faster Access to Analytic Insights
DataOps automates engineering tasks such as testing and anomaly detection that typically take countless hours to perform. As a result, DataOps brings speed to data teams, fostering faster collaboration between data engineering and data science teams.
Shorter development cycles for data products reduce costs (in terms of engineering time) and allow data-driven organizations to reach their goals faster. This is possible since multiple teams can work side-by-side on the same project to deliver results simultaneously.
In my experience, the collaboration that DataOps fosters between different teams leads to faster insight, more accurate analysis, improved decision-making, and higher profitability. If DataOps is adequately implemented, teams can access data in real-time and adjust their decision-making instead of waiting for the data to be available or requesting ad-hoc support.
4. Reduce Operational and Legal Risk
As organizations strive to increase the value of data by democratizing access, it’s inevitable that ethical, technical, and legal challenges will also arise. Government regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have already changed the way companies handle data, and they introduce complexity just as companies are striving to get data directly into the hands of more teams.
DataOps, and data observability in particular, can help address these concerns by providing more visibility and transparency into what users are doing with data, which tables data feeds into, and who has access to data upstream or downstream.
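As a toy illustration of how lineage provides that visibility, the snippet below walks a small hand-written lineage graph to find every downstream asset affected if a source table holds personal data subject to, say, a GDPR deletion request. All table names are invented.

```python
# A toy lineage graph: each asset maps to the downstream assets that read from it.
LINEAGE = {
    "raw.users": ["staging.users"],
    "staging.users": ["analytics.user_profiles"],
    "analytics.user_profiles": ["dashboards.marketing_audience"],
}

def downstream_assets(asset: str, graph: dict = LINEAGE) -> set:
    """Every asset that would be affected if `asset` contained personal data
    that needed to be deleted or restricted."""
    affected = set()
    stack = list(graph.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in affected:
            affected.add(current)
            stack.extend(graph.get(current, []))
    return affected

print(downstream_assets("raw.users"))
# {'staging.users', 'analytics.user_profiles', 'dashboards.marketing_audience'}
```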
Implementing DataOps at Your Company
The good news about DataOps? Companies adopting a modern data stack and other best practices are likely already applying DataOps principles to their pipelines.
For example, more companies are hiring DataOps engineers to drive the adoption of data for decision-making—but these job descriptions include duties likely already being handled by data engineers at your company. DataOps engineers are typically responsible for:
- Developing and maintaining a library of deployable, tested, and documented automation design scripts, processes, and procedures.
- Collaborating with other departments to integrate source systems with data lakes and data warehouses.
- Creating and implementing automation for testing data pipelines (a minimal example follows this list).
- Proactively identifying and fixing data quality issues before they affect downstream stakeholders.
- Driving the awareness of data throughout the organization, whether through investing in self-service tooling or running training programs for business stakeholders.
- Staying familiar with data transformation, testing, and data observability platforms that increase data reliability.
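To make the pipeline-testing responsibility more concrete, here is a minimal, hypothetical test of the kind a DataOps engineer might automate in CI, using pytest and an in-memory SQLite database. The `dedupe_orders` transformation and the table names are invented for illustration.

```python
import sqlite3

import pytest

# A toy transformation of the kind a pipeline test might cover: deduplicate
# raw order events into one row per order ID.
def dedupe_orders(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE orders AS
        SELECT order_id, MAX(amount) AS amount
        FROM raw_orders
        GROUP BY order_id
    """)

@pytest.fixture
def conn():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                     [(1, 9.99), (1, 9.99), (2, 25.00)])
    return conn

def test_orders_are_unique_and_complete(conn):
    dedupe_orders(conn)
    counts = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT order_id) FROM orders"
    ).fetchone()
    assert counts[0] == counts[1] == 2  # one row per order, no duplicates
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
    ).fetchone()[0]
    assert nulls == 0                   # no null values slipped through
```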
Even if other team members are currently overseeing these functions, having a specialized role dedicated to architecting how the DataOps framework comes to life will increase accountability and streamline the process of adopting these best practices.
And no matter what job titles your team members hold, just as you can’t have DevOps without application observability, you can’t have DataOps without data observability.
Data observability tools use automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues. This leads to healthier pipelines, more productive teams, and happier customers.