Machine Learning in Software Development — Techniques and Tools
The ability to version-control ML models, automate testing, and provide better feedback.
To learn about the current and future state of machine learning (ML) in software development, we gathered insights from IT professionals at 16 solution providers. We asked, "What machine learning techniques and tools are most effective for the SDLC?" Here's what we learned:
Tools
- MLflow, Bugspots, Helium, and Appvance are some pretty powerful tools. I particularly like MLflow for its ease of use and ability to version-control ML models.
- We adopted MLflow for our ML data platform management system, a real-time, transactional operational database for in-database ML, to track the workflow of our data scientists. If you adopt a culture of experimentation and create 50 experiments a day, each running and producing a different result, you need to keep track of each one. You need the ability to tag runs with parameters and metrics so you can go back and see why one model performed better than another (see the MLflow tracking sketch after this list).
- We’re building those tools as part of our platform, using open source tools like scikit-learn, PyTorch, and TensorFlow alongside tools we build ourselves.
- A lot of the modern test automation tools give you self-healing tests, automated tests, and automated crawlers that find bugs, as well as logging systems that surface anomalies for security alerts. Most of the focus is on maintenance.
- Tools simplify infrastructure and data engineering for developers. With ML, an explosion of things needs to happen, including easy integration into the application. Debugging is more difficult because ML models are living entities and drift occurs as the data and the learning change. The biggest challenge is the debuggability of the code and the application, so make sure you have traceability of your model decisions and evaluate model performance over time (see the traceability sketch after this list).
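To make the tracking workflow above concrete, here is a minimal sketch using MLflow's Python tracking API; the experiment name, tag, parameter, and metric values are hypothetical placeholders, not taken from any contributor.

```python
# Minimal MLflow experiment-tracking sketch (illustrative names and values only).
import mlflow

mlflow.set_experiment("churn-model-experiments")  # hypothetical experiment name

for learning_rate in (0.01, 0.1):
    with mlflow.start_run():
        # Tag and parameterize the run so it can be compared later.
        mlflow.set_tag("team", "data-platform")
        mlflow.log_param("learning_rate", learning_rate)

        # ... train and validate a model here ...
        validation_accuracy = 0.92  # placeholder result

        # Logged metrics let you go back and see why one run beat another.
        mlflow.log_metric("validation_accuracy", validation_accuracy)
```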
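Along the same lines, here is a minimal sketch of decision traceability, assuming a scikit-learn-style model and a hypothetical JSON-lines trace file: log each prediction with its model version and inputs so performance and drift can be evaluated over time.

```python
# Hypothetical decision-traceability helper: append each prediction,
# with its model version and inputs, to a JSON-lines trace file.
import json
import time

MODEL_VERSION = "recommender-v7"  # hypothetical version identifier

def predict_with_trace(model, features, trace_file="predictions.jsonl"):
    prediction = model.predict([features])[0]
    record = {
        "timestamp": time.time(),
        "model_version": MODEL_VERSION,
        "features": list(features),
        "prediction": float(prediction),  # assumes a numeric prediction
    }
    with open(trace_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return prediction

# Joining this trace with observed outcomes later gives per-version
# performance over time, which is where drift shows up first.
```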
Feedback
- The most effective technique is to define the task at hand as clearly as possible and immediately come up with an automatic evaluation method. Following this step, you ought to collect and label a small dataset for your problem, overfit to that dataset with any method, and try to close the whole production loop: dataset collection, training, evaluation, deployment. A majority of the time, you’ll realize that your evaluation method is not actually what you intended for your product, and you’ll have to go through these stages again (see the overfitting sketch after this list).
- The answer for everything is DevOps, but a better answer is thinking in terms of providing useful feedback loops. We tend to focus on ceremony and mechanics without instrumenting ops in a way that gives developers value from the metrics. To prevent analysis paralysis, include ML at the ops level to give developers the information they need: surface anomaly rates that diverge from projections and build anomaly detection models based on the code. Ops ends up creating better feedback data for developers (see the anomaly-detection sketch after this list).
- Python is the default language for scripting the frameworks. There are a lot of models that can be used, or you can build your own. Reinforcement learning (including Q-learning and deep adversarial approaches), semi-supervised learning, and closed-loop ML techniques have proven beneficial in different phases of the SDLC. When organizations build models, the underlying premise is that the model’s accuracy and efficiency are based on certain assumptions and depend on the training data set it is privy to. If there is a change in data patterns or unanticipated scenarios, the model’s accuracy and efficiency may diminish over time. For example, in a manufacturing plant, a model can be deployed to detect defects on parts being manufactured and assembled on the assembly line. Over time, the model’s ability to accurately identify the errors may diminish. This creates severe challenges if the software uses traditional analytics exclusively. However, when equipped with closed-loop functionality, smart agents can auto-detect this degradation and trigger a re-learning and re-training process to improve the accuracy and performance of the models automatically, leading to increased productivity, efficiency, and cost savings. The closed-loop ML technique for the SDLC can use reinforcement or unsupervised algorithms to train, test, and validate ML models to improve accuracy. After the initial deployment, the model can self-learn, self-adjust, and detect variations in its own accuracy and performance as needed. In short, it tunes itself so that the output stays optimal (see the closed-loop sketch after this list).
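A minimal sketch of the first approach above, assuming scikit-learn and a tiny hand-labeled dataset with made-up values: choose the automatic evaluation up front, then confirm you can at least overfit the small set before building out the rest of the loop.

```python
# Overfit-a-small-labeled-dataset check (all data values are placeholders).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Small hand-labeled dataset.
X = [[0.1, 1.2], [0.4, 0.9], [1.5, 0.2], [1.7, 0.1], [0.2, 1.0], [1.6, 0.3]]
y = [0, 0, 1, 1, 0, 1]

# Automatic evaluation method, chosen before any modeling.
def evaluate(model, features, labels):
    return f1_score(labels, model.predict(features))

model = LogisticRegression().fit(X, y)

# If you cannot even overfit the training set, revisit the task
# definition or the evaluation method before going further.
print("F1 on the small labeled set:", evaluate(model, X, y))
```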
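For the ops-level feedback loop, here is a sketch assuming scikit-learn's IsolationForest and hypothetical error-rate and latency metrics: flag samples that diverge from the baseline so developers get an actionable signal rather than raw logs.

```python
# Ops-metric anomaly detection sketch (metric names and values are hypothetical).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Baseline per-minute ops metrics: [error_rate, p95_latency_ms].
baseline = np.column_stack([
    rng.normal(0.01, 0.002, 500),
    rng.normal(120, 10, 500),
])
detector = IsolationForest(contamination=0.01, random_state=0).fit(baseline)

recent = np.array([[0.012, 125.0], [0.09, 480.0]])  # second sample diverges
print(detector.predict(recent))  # -1 marks an anomaly worth surfacing to devs
```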
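And a minimal sketch of the closed-loop technique, with a hypothetical accuracy threshold and a retrain callable standing in for whatever retraining pipeline the team actually runs: score a recent labeled window and trigger retraining when accuracy degrades.

```python
# Closed-loop retraining trigger (threshold and retrain() are hypothetical).
ACCURACY_THRESHOLD = 0.90

def closed_loop_step(model, recent_features, recent_labels, retrain):
    """Score a recent labeled window; retrain if accuracy has degraded."""
    predictions = model.predict(recent_features)
    correct = sum(p == y for p, y in zip(predictions, recent_labels))
    accuracy = correct / len(recent_labels)

    if accuracy < ACCURACY_THRESHOLD:
        # Drift detected: re-learn and re-train on the latest labeled data.
        model = retrain(recent_features, recent_labels)
    return model, accuracy
```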
Other
- ML is becoming standardized across the SDLC — people are learning how to use it, getting vision into where things are going, and becoming more distributed.
- We're seeing more around deep learning and specific ML methods.
- It depends on the business case. Classic data science is needed to understand the right algorithm and ensure data management. You may need to choose a model that’s almost as good but computationally less expensive. Incorporate a desirability function to consider the cost of planning and deployment.
- Techniques I am seeing include learning techniques such as concept learning, decision trees, neural networks (and convolutional neural networks), if/then rules, reinforcement learning, inductive logic programming, and the like.
- Here are the main elements:
- 1) Ensuring business requirements and expectations are set from the beginning. This helps define the ROI for the project and what you’re looking to solve for (i.e., better customer engagement, reduce churn, etc.).
- 2) Converting the business problem into a technical problem. This lets you define what data is needed, the approach, where to start, etc., so you can set the scope of the solution. You take the business problem of improving customer satisfaction or gaining market share and turn it into a data science problem: prediction of customer conversion or churn, user segmentation, product recommendation, etc., which is something you can solve with data and a model.
- 3) Establishing what data is actually available to solve the problem. This can be one of the biggest limiting factors of applying ML in the SDLC. There needs to be sufficient and relevant data to solve the problem, and there needs to be a base level of normalization. Given the technical problem, you need to identify which entities can be relevant features to plug into the model.
- 4) Designing the iteration process. Given your toolkit, start with the simplest approach possible and see how it performs. Based on those results, you have a sense of direction for where to go and how to add complexity.
- 5) Experimentation and quality. Design experiments so you can test performance, make modifications, re-evaluate, then rinse and repeat. Make sure you pick the right metrics so you measure what really matters (see the experiment sketch after this list).
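A minimal sketch of steps 4 and 5, assuming scikit-learn and a synthetic, imbalanced dataset standing in for a churn problem: measure the simplest possible baseline with the metric that matters, then check whether added complexity actually pays off.

```python
# Baseline-first experiment sketch (dataset and model choices are illustrative).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced churn dataset.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("baseline", DummyClassifier(strategy="prior")),
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
]:
    model.fit(X_train, y_train)
    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {score:.3f}")
```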
Here’s who we heard from:
- Dipti Borkar, V.P. Products, Alluxio
- Adam Carmi, Co-founder & CTO, Applitools
- Dr. Oleg Sinyavskiy, Head of Research and Development, Brain Corp
- Eli Finkelshteyn, CEO & Co-founder, Constructor.io
- Senthil Kumar, VP of Software Engineering, FogHorn
- Ivaylo Bahtchevanov, Head of Data Science, ForgeRock
- John Seaton, Director of Data Science, Functionize
- Irina Farooq, Chief Product Officer, Kinetica
- Elif Tutuk, AVP Research, Qlik
- Shivani Govil, EVP Emerging Tech and Ecosystem, Sage
- Patrick Hubbard, Head Geek, SolarWinds
- Monte Zweben, CEO, Splice Machine
- Zach Bannor, Associate Consultant, SPR
- David Andrzejewski, Director of Engineering, Sumo Logic
- Oren Rubin, Founder & CEO, Testim.io
- Dan Rope, Director of Data Science, and Michael O’Connell, Chief Analytics Officer, TIBCO