DIY Fail: Taking on Model Deployment Alone
Many of us who work in software know the panic on the other end of that 3am wake-up call: a production system is down. If you don’t know that feeling, just ask your DevOps or IT person and watch them sweat. When different micro-services work together to deliver a product, things break. This contrasts with how Data Scientists approach our modeling work.

Data exploration and model development are often conducted in an environment that is very different from “production”. Small samples of data can often reside on a laptop, with no need for extra computing resources. Even when data is accessed remotely, exploratory analysis and hypothesis testing rarely require someone on “pager-duty” 24/7. And many of the most challenging problems do not arise until a data science team tries to deploy a machine learning pipeline into production, at which point progress slows, reliance on software and IT increases, and economic value stagnates.

Businesses expect more and more of their data science teams. As those expectations grow and mature, data science teams need to provide more than ad-hoc modeling and analysis. Companies are beginning to realize that they need robust, maintainable, automated deployment pipelines with up-times that rival operational data stores. What is the pathway from proof-of-concept to a production pipeline of tasks including data ingestion, cleaning, engineering, model training, and deployment? Many companies are currently addressing this challenge head-on, and are asking themselves: should we build it or should we buy it?

If you’re a Data Scientist-Engineer hybrid like me, you might jump at the opportunity to rally the team and build an “in house” solution. But deploying machine learning solutions to a stable, scalable, and enterprise-grade production environment is hard. You will quickly discover some difficult truths:

1. Local machine learning is very different from machine learning in production.

A model prototyped in a local environment has a different set of dependencies when it gets to production. Whether the production environment is hosted on the cloud, or on a cluster of physical servers sitting in a chilled closet down the hall, some new questions must be considered:

Data Storage & Scale: Where is the data coming from and going to? How does the model get access to it? Are there enough compute resources to go around?

Scheduling: How often do the jobs need to run? When does the model need to train or score? How do the processes depend on one another?

Output: The whole reason a model goes into production is to drive business insights and decisions. What application, software, report, HTML page, or mobile app consumes the output of the model(s)? How is it going to get there?

Backend Infrastructure: Now that the model isn’t an ad hoc experiment anymore, what infrastructure is in place to make sure that it’s resilient to network issues, node failures, or corrupted data input?

2. Maintenance costs dominate ML pipeline expenses in production.

Did you accidentally leave an instance or two running while you went on vacation? That could be a costly mistake! Several cloud services and deployment platforms are not forgiving in their variable pricing plans. Lack of visibility into month-over-month expenses is a hard sell to business leaders in any organization.

Infrastructure costs and person-hours alone are huge maintenance expenses. According to a report published by McKinsey Global Institute (“Artificial Intelligence: The Next Digital Frontier?”, June 2017), $30 billion was spent in 2016 alone on AI R&D and building deployment solutions. Keeping the data pipes clean for machine learning is a challenging job, requiring a full stack: data store, reliable queues, server-side communication, websocket connections, and a colony of clusters, all requiring configuration and orchestration.
Contracts between these micro-services will be breached, nodes will fail, and those 3am alerts will start pouring in (if they haven’t already!).

3. Model performance degrades over time.

As the world changes around us, assumptions integrated into our machine learning models break. Especially in systems where a feedback loop is generated, data inputs may reflect a different world than they did when the model was first built. Sometimes a model breaks because an external data API is no longer available. Sometimes it fails because a source of bias crept its way into the training data. When a model is in production, it must be carefully monitored so that any harmful real-world consequences of its predictions are minimized. Failures are especially costly in the industrial space, where severe errors in predicting machinery failures can cost hundreds of thousands of dollars, if not millions, over time.

4. A typical Data Scientist is not an expert in cloud infrastructure or backend engineering.

As a Data Scientist myself, I will be the first to admit that our breed doesn’t speak the same language as Software Engineers. The development lifecycle for model building also differs from that of building a scalable microservice. This is understandable, because the two rely on fundamentally different tools, languages, concepts, and environments. However, when it comes to deploying end-to-end machine learning pipelines that deliver prescriptive insights, gaps in understanding and communication can cause severe delays in realizing value.

At Metis Machine, we’ve witnessed the horror stories of model monitoring and deployment with customers in many domains. Frequently, we encounter young and energetic organizations who have the optimism to take over the world. However, if they haven’t stepped into the world of ML deployment, they haven’t been seasoned by the war yet.

With Skafos, friction is eliminated; our platform enables teams of data scientists to drastically shorten their time to market by providing tools and workflows that are familiar and easy. Serverless ML production deployment is as simple as “git push”. That time saved may be the difference-maker in a competitive tech industry.

We have the battle scars from wrestling infrastructure, smoothing out the deployment pipeline, and moving through stages of monitoring maturity internally. Those insights go right into our product and into the hands of our user community, making Skafos a better platform for deploying and managing machine learning pipelines. So, before you decide to build it yourself, remember this: no one wants to be woken up at 3am.

DIY Fail: Taking on Model Deployment Alone was originally published in Metis Machine on Medium, where people are continuing the conversation by highlighting and responding to this story.
Building Convergent Data Sets
At Metis Machine we build machine learning systems. These systems tend to be distributed in nature and require more data than you might imagine. Occasionally, the partitions the data lives on become unavailable for different reasons: third-party router drops, software upgrades, etc. Consistency must still be maintained for many of our applications, and one of the ways we maintain it involves conflict-free replicated data types.

Let’s explore building one. Here’s a simple example.

We’ll build a shopping cart persisted by a multi-node replicated data store. Our data store for this case has three nodes. One of these nodes has just gone down due to a network partition within a hosting provider. Our user has decided they wish to check out, so how can we know which of the remaining nodes has the latest user shopping cart?

One potential solution is to merge the carts we have together and use the resulting data structure as the user’s shopping cart. The problem with this approach is that if the user has recently removed an item from their shopping cart, a node that missed the removal still holds the item, and the merge brings it back as orphaned data. To the user this looks like an error, and it becomes a source of unnecessary returns or support calls.

The first thing we have to consider is how we add and remove items from the shopping cart. There’s a useful conflict-free replicated data type (CRDT) called a grow-only set. Its properties allow you to add to it or merge it with other grow-only sets, effectively forming the union of the two sets. What follows is what one might look like:

The grow-only set could be used to store items we’ve added to the cart, but if we want to remove items, this won’t work. It turns out we can keep a grow-only set for removals, too. If we also record a timestamp along with each item, then we can use the information in both sets to determine how our latest cart should look.

The shopping cart approach we are settling on shows one property of CRDTs.
They can often be combined to form other, more complex CRDTs; our cart is a 2-phase grow-only set. The approach above gives us a way to retrieve the latest cart from the two sets. But what if an item was added and removed at the same time? Although this is unlikely, we can choose to bias our algorithm towards additions or removals. In the case of a shopping cart, the best approach might be to favor removals for the reasons described earlier in this article. In the case of a tie, we’ll treat the removal set as our source of truth.

To tie everything together, a simple test:

When we put all these pieces together into a single file and run it, we’ll see the output as:

One improvement we can make to this implementation is to deal with time better. Clocks might vary from server to server, so it’s best to take time from another source, or to use a different concept of time entirely. It is usually a better idea to order items by some notion other than wall-clock time. One choice for this implementation is a Lamport timestamp (https://en.wikipedia.org/wiki/Lamport_timestamps).

At Metis Machine, we employ CRDTs in a variety of applications within our Skafos platform. Interested in reading the original paper on CRDTs? Click here.

Building Convergent Data Sets was originally published in Metis Machine on Medium, where people are continuing the conversation by highlighting and responding to this story.
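For readers who want something concrete, here is a minimal Python sketch of the cart described in this article: a grow-only set, plus a cart built from one grow-only set for additions and one for removals, with ties resolved in favor of removal. It is not the article’s original listing. The names `GSet`, `Cart`, and `contents`, and the use of plain integers as timestamps, are assumptions made for illustration. In the CRDT literature, a timestamped add/remove pair of sets like this is usually called a last-write-wins element set.

```python
from dataclasses import dataclass, field


@dataclass
class GSet:
    """Grow-only set: elements can be added or merged in, never removed."""
    items: set = field(default_factory=set)

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Merge is a plain set union, so it is commutative,
        # associative, and idempotent.
        return GSet(self.items | other.items)


@dataclass
class Cart:
    """Cart built from two grow-only sets of (item, timestamp) pairs."""
    adds: GSet = field(default_factory=GSet)
    removes: GSet = field(default_factory=GSet)

    def add(self, item, ts):
        self.adds.add((item, ts))

    def remove(self, item, ts):
        self.removes.add((item, ts))

    def merge(self, other):
        return Cart(self.adds.merge(other.adds),
                    self.removes.merge(other.removes))

    def contents(self):
        # Find the newest add and the newest removal for each item.
        latest_add, latest_remove = {}, {}
        for item, ts in self.adds.items:
            latest_add[item] = max(ts, latest_add.get(item, ts))
        for item, ts in self.removes.items:
            latest_remove[item] = max(ts, latest_remove.get(item, ts))
        # An item is in the cart only if its newest add is strictly
        # newer than its newest removal, so a tie goes to the removal.
        return {item for item, ts in latest_add.items()
                if item not in latest_remove or ts > latest_remove[item]}


# Two surviving replicas of the same cart (hypothetical data).
node_a, node_b = Cart(), Cart()
node_a.add("book", 1)
node_b.add("book", 1)      # the add reached both nodes
node_a.add("lamp", 2)      # only node A saw this add
node_b.remove("book", 3)   # only node B saw this removal
node_b.add("pen", 4)
node_b.remove("pen", 4)    # add and removal at the same timestamp

merged = node_a.merge(node_b)
print(sorted(merged.contents()))
```

Because the timestamps here are only compared with each other, a logical counter such as a Lamport timestamp could be passed in place of wall-clock time without changing the structure.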