At Intuity we are used to working in iterative processes, where prototyping makes ideas graspable and explorable. We wanted to use the same process for machine learning.
The internet is full of explanations of machine learning and discussions about its usefulness. There is no need to repeat this here. Only to be clear, when we talk about machine learning in this article, we mean statistical machine learning, and more specifically supervised learning. In less technical terms: Suppose you produce and sell handmade chocolate. Since you only have a small market stall, you can only display three of your ten varieties of chocolate. So each week you have to decide which chocolate to produce and to take to the marketplace. Over the years you have recorded data in a spreadsheet, for example for each date and the market town the number of chocolate bars that you have sold of each variety. You would like to predict the number of bars per variety that you are likely to sell at the market next week using machine learning.
In the most narrow sense of machine learning — and the one that has dominated machine learning research for decades — you would feed your spreadsheet into some machine learning tool. The result is a function that predicts the number of chocolate bars you are likely to sell at a specified time and place of a certain type of chocolate.
Unfortunately, this will rarely work. Imagine in the chocolate example that one day a rich lover of pecans passed your stall and bought every bar of your pecan chocolate. This is a rare event that is unlikely to happen again. The row in your spreadsheet that contains the data of that day may mislead a learning algorithm to predict a much higher pecan chocolate demand than it would without this row. Therefore, it is often advisable to preprocess data before feeding them into the learning algorithm. Besides outlier removal, such preprocessing can include data scaling, balancing or treatment of unknown values. Such additional operations on data are now supported by machine learning frameworks or other statistics tools.
The data preprocessing operates on the rows of your table, but you also have to consider the columns. In our example, we may have recorded the date of the market day. But what can we expect to predict out of the data from 5 April 2015, when we are interested in our planning for 13 November 2018? More interesting would be information like the weekday or the season as those are more general and applicable to days in the future. We could even consider to add new columns to our table that include the weather of that day or any events that took place in the market town. While there is some automatic support for selecting features (i.e. the columns we want to use) from predefined ones, for the definition of the features and addition of data, expert domain knowledge is irreplaceable.
Even more generally, we may have to reconsider our original question. Do we really want to predict the number of bars of each variety to be sold? Could we reshape the problem so that it outputs directly the three varieties that we should take to market next week? Or maybe there are other considerations involved that we cannot directly put into a machine learning algorithm. The chocolate we sell must first be produced. So our decision may also depend on the availability and price of fresh ingredients. We may add another learning problem that predicts such business considerations and we include both aspects into our decision.
We have now come a long way from simply throwing data onto a learning algorithm to designing a learning problem including the question to ask, the features to record in the data set, the preprocessing of data, and ultimately the learning algorithm. All these decisions are correlated: the tasks you can solve depend on the availability of data, but you can also decide to gather specific data; different types of questions require different classes of algorithms; each learning algorithm needs specific types of preprocessing. Obviously, the question and data collection depend on human knowledge, but even the choice of preprocessing and learning algorithm need to be determined by a combination of expert knowledge and experimentation.
If we treat machine learning as a design process, we can use design methods to solve it, in particular we can use prototyping to iterate through different combinations of data preprocessing, feature selection and learning algorithm, to enable a dialogue with domain experts to solve the overall problem.
As with any other prototype, a machine learning prototype should be quick, reflect the properties of a final product, and be graspable and explorable. Our Artificial Intelligence Insight Tools visualize the data and the machine learning results to quickly give domain experts an intuition of what is possible within the current conception of the task. Our machine learning experts can quickly iterate through different combinations of preprocessing and learning algorithms and thus provide a preview of a fully engineered learning solution. Together, domain experts and machine learning experts reconsider the task, put it into the business context and develop an overall solution. Whether this solution consists of a single machine learning task, a combination of machine learning tasks or some completely different approach, our client has saved the effort to collect a huge number of data without even knowing whether these are the right data or whether they address the right question.
The results of our prototypical machine learning runs can directly be deployed into an overall agile development process. This is the path to lead machine learning from an isolated esoteric technique into the mainstream software development process.