Is there a way to accelerate Machine Learning?

For many companies today, the demand for data and analytics is everywhere in the enterprise. Projects have been created to improve customer engagement, reduce risk and optimize business operations. Data sources are also exploding, with new data coming from both inside and outside the enterprise in many different varieties. Yet although analytics are needed everywhere, the current approaches to developing them are slow and expensive.

To support this demand, the modern analytical environment has expanded well beyond the traditional data warehouse to become an analytical ecosystem comprising multiple data stores and platforms optimized for different kinds of analytical workloads. This ecosystem includes data warehouses, NoSQL databases, Hadoop, cloud storage, and streaming analytics platforms made up of technologies such as Kafka, Apache Spark and Apache Flink.

In addition, companies have created, and are continuing to create, data science teams to focus on specific business problems and analytical workloads on different underlying platforms. These teams therefore operate across the analytical ecosystem on data housed in multiple data stores, in what is turning out to be a distributed logical data lake rather than a single centralized data store.

The demand for new data and new kinds of analytics on high-volume, multi-structured data has also created a need to scale at low cost, which in turn has triggered the rapid adoption of technologies such as Hadoop and Apache Spark to scale data and analytical processing.

While the adoption of these technologies was slow to begin with, they have now become mainstream, with many different types of analytics being developed in data science projects and used to analyze data in both traditional and big data environments. The types of analysis being undertaken include supervised and unsupervised machine learning, text analysis, graph analysis, deep learning and artificial intelligence.

The fastest growing of these is most probably machine learning – the use of both supervised and unsupervised techniques to classify (predict) and to describe patterns in data, e.g. to cluster similar customers into groups for customer segmentation. With supervised learning, data is first prepared and then fed into an algorithm to train it to correctly predict an outcome, such as customer churn or equipment failure. The data is often split into training and test sets, with a further subset held back altogether to see how a model performs on totally unseen data once trained. There are a number of algorithms that can be used for prediction, and data scientists will typically develop multiple models, each using a different algorithm, and compare results to find the most accurate.
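
To make this concrete, here is a minimal sketch of that workflow using Python and scikit-learn on a synthetic churn-style dataset. The dataset, choice of algorithms and split ratios are all illustrative assumptions, not a recommendation of any particular product or configuration.

```python
# Minimal sketch of the supervised workflow described above, using
# scikit-learn on a synthetic "churn-style" dataset. All names and
# parameter choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Split off a holdout set that is never seen during development, then
# split the remainder into training and test data (60/20/20 overall).
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_dev, y_dev, test_size=0.25, random_state=42)

# Train multiple candidate models, each using a different algorithm,
# and compare test accuracy to find the most accurate predictor.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Finally, check the winning model on the totally unseen holdout data.
best = max(candidates.values(), key=lambda m: accuracy_score(y_test, m.predict(X_test)))
print("holdout accuracy:", accuracy_score(y_hold, best.predict(X_hold)))
```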

Unsupervised machine learning runs an algorithm on data without any labelled outcomes to find patterns in the data. Good examples are clustering to group together similar records, or association detection for market basket analysis.
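
For example, clustering customers into segments might look something like the sketch below; the feature columns and the choice of four segments are purely illustrative assumptions.

```python
# Minimal sketch of unsupervised clustering for customer segmentation,
# using scikit-learn's KMeans on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Pretend feature columns: annual spend, visits per month, tenure in years.
customers = rng.normal(size=(1000, 3)) * [500.0, 3.0, 2.0] + [2000.0, 10.0, 5.0]

# Scale the features so no single column dominates the distance metric,
# then group similar customers together without any labelled outcomes.
segments = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(customers)
)
print("customers per segment:", np.bincount(segments))
```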

However, the current approaches to developing machine learning models are fraught with problems. For a start, there is a real shortage of skilled data scientists who, even if you can hire them, are likely to be head-hunted pretty quickly. Analytical model development is also often slow, especially on big data platforms where data scientists often prefer to develop everything manually by writing code in popular languages like Java, R, Scala and Python. This results in data science becoming a bottleneck, with a growing backlog of analytics still waiting to be built.

The cost of development is also higher than it should or could be, because data science is often fractured, with different people using inconsistent development approaches, libraries and tools. Skills end up thinly spread across too many technologies, which limits re-use and sharing of datasets and models. Maintenance then becomes a problem, costs increase and there is no integrated programme of analytical activity. All of this limits agility and adds to complexity. In addition, with the pace of development being slow, existing analytics running in production can become stale because people are tied up on the new model development in the aforementioned backlog.

Furthermore, fractured and overly complex data science can also lead to unmanaged self-service data preparation if teams all adopt different approaches to preparing data. This is especially the case if everything is hand-coded, with no metadata lineage and no way to know how data was prepared. The problem is that this encourages re-invention rather than reuse, so a governed environment is preferable. Productivity and governance suffer, and time to value is impacted. Meanwhile, the backlog of analytics yet to be built continues to grow, and we lose the ability to prevent, optimize and respond. Opportunities are missed and business problems become unavoidable.

So how do you solve this? The answer is to accelerate data science by automating the development of predictive models. This is sometimes referred to as machine learning automation. New technologies are emerging to do this, e.g. DataRobot and Tellmeplus, as well as IBM Watson ML, and Google is also headed down this road. Machine learning automation allows you to rapidly build and compare predictive models, enabling less specialized business analysts to become 'citizen data scientists'.

It also allows you to integrate automatically built and custom-built models into a common champion/challenger programme so that you can co-ordinate and govern all machine learning projects from a single place.
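
Conceptually, a champion/challenger comparison boils down to something like the sketch below, where automatically built candidates and a hand-built model are scored side by side and the best one becomes the champion. This is only a toy illustration of the idea; commercial machine learning automation platforms do far more, and the model list and metric here are assumptions.

```python
# Toy illustration of a champion/challenger loop: score each candidate
# with cross-validation and promote whichever one wins. Model choices
# and the AUC metric are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

candidates = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=5),
    GradientBoostingClassifier(),  # stands in for a custom-built model
]

champion, champion_score = None, -1.0
for model in candidates:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(type(model).__name__, "AUC:", round(score, 3))
    if score > champion_score:  # the challenger beats the current champion
        champion, champion_score = model, score

print("champion:", type(champion).__name__)
```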

If you are looking to evaluate tools that automate machine learning, some of the key requirements you might want to consider include:

  • Project management and alignment with business strategy
  • Collaborative development and sharing
  • Integration with an information catalog to make it easy for data scientists to find data
  • Providing access to data on big data and small data platforms inside or outside the enterprise
  • Support for easy exploration and profiling of data
  • Built-in and integrated 3rd party self-service data preparation
  • Ability to automate or manually define datasets to train and validate models
  • The ability to automate variable selection for input to a machine learning algorithm (see the sketch after this list)
  • The ability to automatically create, train and validate multiple candidate models using different algorithms to find the best predictor
  • Support for interactive training for better accuracy
  • The ability to include algorithms from 3rd party libraries to integrate technologies and train candidate models from a common platform
  • The ability to integrate custom models built in various languages in interactive notebooks like Zeppelin and Jupyter
  • The ability to easily compare the predictive accuracy of multiple candidate models
  • The ability to select and easily deploy models for consumption by other applications and tools
  • The ability to easily deploy models to run in different execution environments, e.g. in-cloud, in-Spark, in-database, in-Hadoop or in-stream
  • The ability to set up a machine learning ‘factory’ to not only industrialise development but also to automate the maintenance and refresh of existing models 
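
As promised above, here is a minimal sketch of automated variable selection, using scikit-learn's SelectKBest inside a pipeline so that the selection step is refitted on training data only. The value of k and the dataset are illustrative assumptions.

```python
# Minimal sketch of automated variable selection: a univariate filter
# (SelectKBest) wired into a pipeline ahead of the model, so selection
# is learned on each training fold. k=10 is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=40, n_informative=8, random_state=1)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),  # keep the 10 best inputs
    ("model", LogisticRegression(max_iter=1000)),
])

score = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc").mean()
print("AUC with automated variable selection:", round(score, 3))
```
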
So is there a way to accelerate machine learning to reduce time to value? The answer is well and truly yes. Machine learning automation is well worth a look.

About Mike Ferguson

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an independent analyst and consultant he specializes in business intelligence, analytics, data management and big data. With over 35 years of IT experience, he has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.