Data Science
Before I start talking about data science platforms, let me give you a short introduction to what data science is. In essence, it is a field that uses tools taken from computer science, statistics, and machine learning to extract insights from data. In other words, if you have some data and you want to make decisions or predictions based on it, you use data science. However, extracting information from big data sets can be expensive. To implement any type of big-data project, a company must first build a data infrastructure. Think of it as the different pieces of technology that run all the tools a data scientist needs. The issue is that for many years, building such an infrastructure was like building a car from parts. It was possible, but you needed people with highly specialized skills, and it took a lot of money and time. Fortunately, this is changing. What we have seen in the past few years is the appearance of platforms that automate this process. Take, for example, the various cloud-based platforms that make it much easier to develop and maintain big-data infrastructures, from my own team to others in the market like Amazon Web Services (AWS), Google, Microsoft, and Anaconda.
Automated big-data platforms are only part of the story. Although they enable us to set up and maintain data infrastructures more easily, somebody still needs to write the code that cleans the data and experiments with machine learning models. This process can be quite time-consuming and needlessly complex. To understand what I mean, let me walk you through the data science workflow.
Usually, every data science project consists of three parts: data processing, modelling, and deployment. When it comes to how time is spent in each of these parts, we have this rule of thumb: 50-60 percent is spent on just processing the data, and the rest is spent on modelling and deployment. As you can imagine, this is a very inefficient use of a data scientist’s time. But it’s necessary, because real-world data is messy and algorithms can’t deal with messy and unstructured data sets. Modelling, on the other hand, is an iterative process. For any specific problem, it’s impossible to know beforehand which exact algorithm is going to be the best. As a result, a data scientist has to try out many different algorithms until they arrive at a well-performing one.
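To make the first two parts of that workflow a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the toy data, the two candidate models, and every function name (`clean`, `fit_mean`, `fit_line`, `select_model`) are hypothetical and not taken from any particular platform. It drops messy records first, then fits each candidate and keeps the one with the lowest error on held-out data.

```python
import statistics


def clean(rows):
    """Data processing: keep only rows whose values can be read as numbers."""
    cleaned = []
    for x, y in rows:
        try:
            cleaned.append((float(x), float(y)))
        except (TypeError, ValueError):
            continue  # skip missing or malformed records
    return cleaned


def fit_mean(train):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y


def fit_line(train):
    """Second candidate: a straight line fitted by ordinary least squares."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, data):
    return statistics.fmean([(model(x) - y) ** 2 for x, y in data])


def select_model(candidates, train, validation):
    """Modelling: fit every candidate and keep the lowest validation error."""
    scores = {
        name: mean_squared_error(fit(train), validation)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (made up for illustration), including a few messy records.
raw = [(1, 2.1), (2, "4.0"), (3, None), (4, 8.2), ("n/a", 3.0),
       (5, 9.9), (6, 12.1), (7, 13.8), (8, 16.2)]

data = clean(raw)
train, validation = data[:5], data[5:]
best_name, scores = select_model(
    {"mean": fit_mean, "line": fit_line}, train, validation
)
print(best_name, scores)
```

In a real project the candidate list would be much longer and the cleaning step far more involved, which is exactly where the time goes.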
Another source of inefficiency is that it can be difficult to take a model and package it in a way that can be launched in production. Many times, the machine learning pipeline that was built during the modelling part needs to be broken apart and reconstructed to make sure it’s production-safe. To address all these inefficiencies, in the past few years analytics vendors have started developing products that take the entirety of this workflow and integrate it into one end-to-end platform. Think of these platforms as operating systems for data science. The big innovation that these platforms bring is threefold. First, they automate a lot of the data processing part. Second, they make it very easy to keep track of all the developed models and their parameters. And third, they make it easier to launch algorithms and models into production.