How to select a machine learning algorithm for your problem
Selecting a suitable machine learning algorithm for your problem can be difficult. If you had unlimited time, you could try all of them, but the time available to solve a problem is usually limited. Before you start, you can ask yourself the questions below. Depending on your answers and your situation, you can shortlist a few algorithms to try on your data.
- Explainability: Does your model have to be explained to a non-technical audience (e.g. your customers)? Some machine learning models, such as neural networks and ensemble models, can be very accurate but are hard to build and explain, so a non-technical audience may struggle to understand them. On the other hand, kNN, linear regression, and decision tree learning algorithms produce models that are not always the most accurate, but the way they make their predictions is very straightforward. So decide up front whether you will need to explain your model to a non-technical audience. If so, prefer the simplest model that is still accurate enough for your task.
- In-memory vs. out-of-memory: Can your dataset be fully loaded into the RAM of your server or personal computer? This is an important question to answer before selecting a machine learning model. If the answer is yes, you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms, which can improve the model gradually as more data is fed to them.
- Number of features and examples: How many training examples does your dataset contain, and how many features does each example have? Some algorithms, including neural networks and gradient boosting, can handle a huge number of examples and thousands of features. Others, like SVM, are more modest in their capacity and can become slow when the number of examples and features is large.
- The nonlinearity of the data: Is your data linearly separable, or can it be modeled using a linear model? If yes, SVM with a linear kernel, logistic regression, or linear regression can be good choices. Otherwise, non-linear SVM, deep neural networks, or ensemble algorithms might work better.
- Training speed: How much time is a learning algorithm allowed to use to build a model? In other words, is your training time tightly limited, or do you have plenty of it? Deep neural networks are known to be slow to train. Simple algorithms like logistic regression, linear regression, or decision trees are faster. Specialized libraries contain very efficient implementations of some algorithms. Some algorithms, such as random forests, benefit from the availability of multiple CPU cores, so their model building time can be significantly reduced on a machine with dozens of cores.
- Prediction speed: How fast does the model have to be when generating predictions? Algorithms like SVMs, linear and logistic regression, and (some types of) neural networks are extremely fast at prediction time. Others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, are slower.
- Cross-validation (the scientific approach): You can also select the best machine learning model from a set of candidates by testing each of them on validation sets. Train the different models on your data and pick the best one on the basis of its error or accuracy on data held out from training. This technique is often considered the most reliable because it is empirical: it suggests the model that performs best on held-out data, not just on the data it was trained on.
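
The cross-validation idea in the last point can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not a recommended implementation: the data is synthetic (a noisy straight line), the three candidates (a mean baseline, one-feature linear regression, and 1-nearest-neighbour regression) are chosen only for illustration, and the helper names `k_fold_indices`, `fit_mean`, `fit_linear`, `fit_1nn`, and `cv_mse` are hypothetical. In practice you would more likely use a library's cross-validation utilities.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b


def fit_1nn(xs, ys):
    """1-nearest-neighbour regression: copy the target of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cv_mse(fit, xs, ys, k=5, seed=0):
    """Average mean squared error of `fit` over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(mean(errors))
    return mean(scores)


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

candidates = {
    "mean baseline": fit_mean,
    "linear regression": fit_linear,
    "1-NN": fit_1nn,
}

# Every candidate is scored on the same folds, so the comparison is fair.
results = {name: cv_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(results, key=results.get)

for name, mse in results.items():
    print(f"{name}: cross-validated MSE = {mse:.3f}")
print("selected:", best)
```

Because the synthetic data is linear, the linear model should come out ahead here; on your own data the ranking may differ, which is the reason to measure rather than guess. The other questions in the list (explainability, memory, training and prediction speed) still apply when deciding which candidates to include in the comparison.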