Challenges in Machine Learning
In the last article, ‘A-Z of Machine Learning,’ we learned what machine learning is and how it works. But machine learning is not a cakewalk where you simply hand some data to a model and it gives you a prediction; trust me, ML is more than that. You need to take care of many things before and while building your model. Most of these challenges appear at the enterprise level, and you will encounter them while working on real models.
Data Collection
The very first challenge you might face concerns data collection. If you are learning machine learning, you can get data easily at the initial or academic level: platforms like Kaggle and the UCI Repository provide datasets for POCs and academic projects.
But if you are in an organization and need to build an ML solution, data gathering can become a real challenge.
If you are in a small organization, there is a good chance you will have to do this task yourself.
There are several tools, techniques, and processes that can help you get data: APIs from which you can fetch data, web scraping to extract data from websites, databases you can query directly, and data warehouses (OLAP systems) that can supply data for your solution.
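As a hedged illustration of the first two routes (the requests, beautifulsoup4, and pandas packages are assumed to be installed, and the URLs are hypothetical placeholders, not real endpoints), here is roughly what fetching from an API and scraping a web page can look like:

```python
# Sketch of two common data-collection routes: a JSON API and a scraped HTML table.
import requests
import pandas as pd
from bs4 import BeautifulSoup

# 1) Fetching records from a (hypothetical) JSON API that returns a list of objects.
response = requests.get("https://api.example.com/v1/records", timeout=10)
response.raise_for_status()
api_df = pd.DataFrame(response.json())

# 2) Scraping a simple HTML table from a (hypothetical) public page.
#    Assumes the page contains one plain <table> with a header row.
page = requests.get("https://example.com/stats", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in soup.find_all("tr")
]
scraped_df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as header
```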
So there are many ways to accumulate data, but it is never as easy as downloading a dataset at the academic level.
At the enterprise level you have to work across several different sources to accumulate the data. But if you are in a big company with a full data science team, this data gathering task is usually handled by data engineers, who are responsible for fetching the data from the real world and handing it over to you.
Insufficient Data
As discussed under Data Collection, you may face data gathering problems, but even once you have fetched the data there is a chance it is insufficient for your problem statement, which means you will need to spend more time shaping that data to fit your needs.
Assume you build two models, M1 and M2. You train M1 on 10,000 rows of data and M2 on 1,000,000 rows, and M1 gives an accuracy of 75% while M2 gives only 68-70%.
Which model performs better? The answer is M2, because it was trained on a much larger amount of data than M1, so we consider M2 the more reliable model irrespective of the accuracy scores.
In ML you need a good amount of data both to build a model and to rely on it. And "a good amount of data" does not mean you can feed the model anything: the data should be large in quantity and good in quality; only then is it trustworthy for your model.
Note: if you have 1,000,000 rows of data that are 95% impure or of bad quality, and on the other side 100,000 rows that are completely clean, then the 100,000 clean rows are better for the model than the 1,000,000 noisy ones.
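One way to check whether more data would actually help your model is a learning curve. Here is a rough sketch (scikit-learn assumed, with a synthetic dataset standing in for yours) that reports the cross-validated accuracy as the training set grows:

```python
# Learning curve: does validation accuracy keep improving as we add rows?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training rows -> mean CV accuracy {score:.3f}")
```

If the curve is still climbing at the largest size, collecting more data is likely to pay off; if it has flattened, quality matters more than quantity.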
Non-Representative Data
Non-representative data means the data is biased in nature. Let's understand this with an example.
Suppose you have to predict the upcoming elections, i.e., who is going to win, and there are three political parties, AAA, BBB, and CCC, contending. For the prediction you run a survey across 10 different states or regions.
Out of these 10 states, AAA is the ruling party in 6, BBB is the ruling party in 2, and CCC forms the state government in the remaining 2.
You have to predict who will win the central election, and from this survey you get biased data that says AAA will win. Why does this happen?
Because you selected 6 out of 10 states where AAA forms the state government, there is a high probability that the people of those 6 states answer in favor of AAA (people may weigh many different parameters when choosing their favorite party, but for this example let's consider just this one), and so you conclude that "AAA is going to win the upcoming central election."
This is known as sample noise: the weightage of AAA-ruled states is higher than that of the BBB and CCC states in your survey.
Now suppose, to remove the sample noise, you select some groups of people from the same scenario, but those groups turn out to be party workers of their respective parties who only speak well of their own party. This leads to sample bias, because those groups of people are biased toward their parties.
Both sample noise and sample bias indicate that the data is non-representative for your problem statement.
So, how do you overcome this sample noise and sample bias? You can conduct the survey across 6 states, where 2 states have AAA, 2 have BBB, and 2 have CCC as their ruling parties.
You also select groups of people with different mindsets and perspectives; for example, a group of 100 people that includes supporters of AAA, BBB, and CCC, so that the group as a whole is not biased toward any one party.
Now you have representative data, and from this you can make a prediction about the upcoming election.
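Here is a small sketch of one way to guard against the sample bias described above (pandas assumed; the column names and numbers are made up for illustration): sample the same number of respondents from every state so that no one party's states dominate the survey.

```python
# Balance the survey by drawing an equal sample from each state.
import pandas as pd

survey = pd.DataFrame({
    "state":  ["S1"] * 600 + ["S2"] * 200 + ["S3"] * 200,   # lopsided raw survey
    "answer": ["AAA"] * 600 + ["BBB"] * 200 + ["CCC"] * 200,
})

balanced = (
    survey.groupby("state", group_keys=False)
          .apply(lambda g: g.sample(n=150, random_state=42))  # 150 respondents per state
)
print(balanced["state"].value_counts())  # every state now has equal weight
```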
Poor Quality of Data
Here you will find that your data has a lot of missing values, duplicate values, outliers, and incorrect values.
In short, you have noisy data, and before you can use it for model building you need to clean it first.
To remove the noise you have to preprocess the data, and preprocessing involves many steps. These steps take up a huge part of the machine learning life cycle; on average we spend 60-70% of our time on preprocessing the data.
The ML life cycle is not only about model building: before model building we have to do a lot of work, and model building itself takes up only a small part of the life cycle.
Most of your statistics and probability knowledge will come in handy here, because without statistics and probability you cannot preprocess the data according to your needs.
You have to do a lot of EDA (Exploratory Data Analysis) to understand the data better, so that you can remove or impute the noise and make your data training-ready.
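A minimal preprocessing sketch along these lines (pandas assumed; the tiny DataFrame and its columns are hypothetical) covering the kinds of noise mentioned above, namely duplicates, impossible values, and missing values:

```python
# Basic cleaning: drop duplicates, flag impossible values, impute missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cgpa": [7.2, np.nan, 8.1, 8.1, 42.0],   # 42.0 is an impossible CGPA
    "iq":   [110, 95, 102, 102, 105],        # row 3 duplicates row 2
})

df = df.drop_duplicates()                      # remove exact duplicate rows
df.loc[df["cgpa"] > 10, "cgpa"] = np.nan       # treat impossible values as missing
df = df.fillna(df.median(numeric_only=True))   # impute missing values with the median
print(df)
```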
Irrelevant Features
This is different from poor quality of data. Once you have imputed or removed all the noise, you start looking for the valuable features for your ML algorithm, because there is a high chance that some of the features in your data are irrelevant and do nothing for your model.
There is a famous saying about machine learning: "Garbage In, Garbage Out." ML is not magic; it is pure math.
So whatever data we give to the model, the model behaves according to that data.
That is why it is really important to focus on the data before giving it to the ML algorithm.
Let’s understand this with a use case.
Assume you have to build an ML solution for predicting students' placement results, i.e., which students will get placed and which will not. For this you have a dataset with features (input columns) such as the student's IQ, the student's CGPA, the student's grade given by teachers, and the student's location, plus a label column called Placement.
- The IQ feature gives the IQ (intelligence quotient) level of each student.
- The CGPA feature gives the academic grade of each student.
- The Grade feature gives the behavioral marking of each student, assigned by teachers.
- The Location feature gives the residence or geographical location of each student.
- Placement is the label column, with Yes/No values indicating which students get placed and which do not.
In this use case you can see there is one irrelevant feature in the data that has nothing to do with the placement of students: the student's location. Why do I say this?
Because you can predict a student's placement from their academic grade (CGPA), their IQ level, and their behavioral marking (Grade).
These are the features responsible for the prediction, and the student's location is irrelevant to the problem statement. That is why taking care of irrelevant features is important: they can hurt the model's accuracy and even its results.
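As a hedged sketch of how you might check this numerically (scikit-learn assumed; the tiny dataset mirrors the hypothetical placement example and is far too small for real conclusions), a quick mutual-information score per feature can flag columns such as location that carry little signal:

```python
# Score each feature's relationship with the label before training.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.DataFrame({
    "iq":        [110, 95, 120, 100, 130, 90, 115, 105],
    "cgpa":      [8.1, 6.5, 9.0, 7.0, 9.5, 6.0, 8.5, 7.5],
    "grade":     [4, 2, 5, 3, 5, 2, 4, 3],
    "location":  [1, 2, 3, 1, 2, 3, 1, 2],   # encoded city id (hypothetical)
    "placement": [1, 0, 1, 0, 1, 0, 1, 0],
})

X, y = df.drop(columns="placement"), df["placement"]
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(X.columns, scores):
    print(f"{name:<9} mutual information = {score:.3f}")

# A near-zero score for `location` would support dropping it:
X = X.drop(columns="location")
```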
Overfitting
Overfitting is one of the most common challenges in machine learning: the ML algorithm memorizes the entire dataset while training.
This happens on the training data, where the algorithm memorizes every individual data point and pattern.
An overfitted model is a bad outcome, because such a model will never give accurate or good predictions when we test it on other data (the testing data).
The model only understands the training data, when what it should understand is the underlying rule or pattern of the data.
We know machine learning is math, and in math we understand the concept behind an equation instead of memorizing the whole solution (a few students excepted). The same applies to machine learning: if the algorithm memorizes the data points instead of understanding the underlying concept, that algorithm or model is known as an overfitted model.
There are parameters and techniques that prevent a model from overfitting; you will learn about them while working on model building. Overfitting is one of the most crucial things we have to take care of.
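Here is a minimal sketch (assuming scikit-learn is installed) of what overfitting looks like in practice: an unconstrained decision tree scores almost perfectly on the training data but noticeably worse on the held-out test data, and limiting its depth (a simple form of regularization) narrows that gap.

```python
# Overfitting demo: unconstrained tree vs. depth-limited tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No depth limit: the tree keeps splitting until it memorizes every training point.
overfit_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Train accuracy:", overfit_tree.score(X_train, y_train))  # typically ~1.0
print("Test accuracy :", overfit_tree.score(X_test, y_test))    # noticeably lower

# Limiting depth trades a little training accuracy for better generalization.
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("Train accuracy:", pruned_tree.score(X_train, y_train))
print("Test accuracy :", pruned_tree.score(X_test, y_test))
```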
Underfitting
Underfitting is the opposite of overfitting: here the algorithm does not "think" enough before forming an opinion. In overfitting the model memorizes the entire data; in underfitting the reverse happens.
As the name suggests, the model does not learn the required patterns in the data; it picks up far fewer data points than required and forms its opinions about predictions from those.
This kind of model works poorly on the training data as well as the testing data, which means it is a complete disaster for both.
Just as some parameters help us prevent overfitting, we can also prevent underfitting with the right parameterization while building the model.
To sum up overfitting and underfitting in simple words: an overfitted model is a disaster for the testing data, while an underfitted model is a disaster for both the training and the testing data.
We have to find the middle ground: a model that is neither overfitted nor underfitted but lies between these two extremes. That is the accurate one.
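As a rough sketch of that middle ground (scikit-learn assumed; the data is synthetic), the snippet below compares an underfit, a reasonable, and an overfit model on the same noisy curve, using the polynomial degree as the complexity knob:

```python
# Model complexity sweep: too simple, about right, too complex.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:>2}  train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")
```

Degree 1 scores poorly on both sets (underfitting), degree 15 scores well on training but poorly on test (overfitting), and the middle degree does reasonably on both.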
Software Issue
The whole point of building an ML model is to help users with their needs, but the model cannot do this by itself. With machine learning we can build models easily, but to make a model user-friendly we need an intermediate layer that works like a bridge between the user and the model.
Most of the time we expose the ML model as an API, and that API serves the users. Embedding an ML model inside software is a fairly complex task, because there is no single platform that handles the integration of ML models and software for you.
To embed the model in any software we have to take care of both parts separately: first we build the model, and then we build a system from scratch that is compatible with that model, which means we need separate software engineering for this.
That software engineering is done by software developers, and ML engineers act as mediators between the model and the software developers, helping them work with the model according to its nature.
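As a minimal sketch of that API pattern (FastAPI, uvicorn, and joblib assumed installed; model.joblib is a hypothetical file produced by your training script), the model is loaded once and exposed behind a single /predict endpoint:

```python
# Serve a trained model over HTTP so other software can call it like any service.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical artifact, trained offline

class StudentFeatures(BaseModel):
    iq: float
    cgpa: float
    grade: float

@app.post("/predict")
def predict(features: StudentFeatures):
    # scikit-learn style models expect a 2-D array of rows.
    row = [[features.iq, features.cgpa, features.grade]]
    return {"placement": int(model.predict(row)[0])}

# Run with: uvicorn main:app --reload   (assuming this file is saved as main.py)
```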
Offline Learning
In ML, offline learning is a type of learning where we build the model offline and then deploy it to online servers. If we want to retrain the model, we first take it off the server, retrain it on our local system (offline), and then re-deploy it to the servers. This process is known as offline learning.
There are some consequences to offline learning. First, deployment is costly, because for deployment we use cloud platforms like AWS, GCP, Azure, and others, and the cloud is expensive.
If we use offline learning and deploy the model on a server, and that server goes down for some reason while new data keeps arriving, the model cannot predict on the new data, and because of the outage we also cannot pull the model off for retraining on that new data.
We have to wait until the server is restarted, and this is the main concern with offline learning and deployment.
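For completeness, here is a rough sketch of the offline-learning cycle described above (scikit-learn and joblib assumed; file paths and the helper name are placeholders): train locally on the full dataset, serialize the model to disk, ship that file to the server, and repeat the whole cycle whenever enough new data accumulates.

```python
# Offline (batch) learning cycle: retrain locally, save, then swap on the server.
import joblib
from sklearn.linear_model import LogisticRegression

def train_offline(X, y, path="model_v2.joblib"):
    """Retrain from scratch on the full (old + new) dataset and save the artifact."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, path)
    return path

# On the server side, the old artifact is swapped for the new one, for example
# during a maintenance window:
#   model = joblib.load("model_v2.joblib")
```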