Data Preprocessing with Python for Machine Learning

Quintillions of bytes of data are generated every day, so manipulating this data has become a challenging task. Data scientists aim to manage and analyze that data and to build predictive models from it. Machine learning algorithms learn from data and generate useful predictive models that can help predict future results. The main requirement of these algorithms is data: the accuracy and efficiency of a model depend on the algorithm as well as on the quality of the dataset, so data preparation is necessary for developing predictive models. Data can come in many forms, such as images, text, audio, and tables, and different types of data require different preprocessing techniques. In this blog, we describe data preprocessing techniques for image and numerical datasets. Python is one of the best tools in the data science field, and it supports many libraries for machine learning as well as data preprocessing. There are some important steps in data preprocessing:
1. Importing the Dataset
2. Checking for Missing Data in the Dataset
3. Encoding Categorical Data
4. Splitting the Dataset into Training and Test Sets
5. Feature Scaling
In this blog, we will explain these data preprocessing techniques and show how Python can be used for each purpose.
Pandas: Pandas is a powerful, fast, open-source Python library for data preprocessing and data analysis. Most of the time, data is stored in CSV files or Excel spreadsheets, and pandas is an important tool for reading and processing this data. Pandas supports many built-in functions, so accessing data becomes easy. It is also an excellent tool for detecting and removing outliers. All the data preprocessing steps can be done with this library alone, but doing so may require intensive coding and consume a lot more time.
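As a small illustration of outlier removal with pandas, the interquartile-range (IQR) rule can be applied to a numeric column. This is only a sketch; the 'income' column and its values are hypothetical:
import pandas as pd

# hypothetical data with one extreme 'income' value (illustration only)
df = pd.DataFrame({'income': [62000, 71000, 68000, 950000, 74000, 66000]})

# keep only rows whose income lies within 1.5 * IQR of the middle 50%
q1, q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
iqr = q3 - q1
df_clean = df[df['income'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]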
NumPy: NumPy is a Python package mostly used for scientific computation in data science and other fields. It is basically used to create multi-dimensional arrays. With the help of the NumPy library, matrix operations and other operations on large multi-dimensional arrays become possible. While training deep learning models, there is a need to store images in large arrays, and this is where NumPy plays an important role: it can store a large image dataset in a single array, which makes preprocessing image data an easy task. This library is therefore also important for preprocessing complex data such as audio signals and images.
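For example, a minimal sketch of collecting several images into one NumPy array might look like this (random arrays stand in for real image data):
import numpy as np

# 100 stand-in RGB images of size 64x64 (random values, illustration only)
images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(100)]
dataset = np.stack(images)                    # single array of shape (100, 64, 64, 3)
dataset = dataset.astype('float32') / 255.0   # scale pixel values to [0, 1]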
Scikit-learn: Scikit-learn is a simple, easy-to-use, and efficient library for building machine learning models in Python. It can also be used for data preprocessing, as it provides many built-in functions and tools for this purpose. With this library, a one-line command is often sufficient for large and complex transformations.
Keras Utilities: Keras is the high-level API for building TensorFlow models, but it can also be used for image data preprocessing. Keras provides a tool called ImageDataGenerator for processing images: with it, the brightness range, zoom range, rotation range, and shift range can be adjusted, and image resizing and normalization can be performed. A train/validation split can also be done using Keras utilities, as sketched below.
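A minimal sketch of such a pipeline follows; the 'data/' directory (with one subfolder per class) is a hypothetical example:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# augmentation, normalization, and a train/validation split in one tool
datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # normalize pixel values
    rotation_range=20,            # random rotations up to 20 degrees
    zoom_range=0.2,               # random zoom
    width_shift_range=0.1,        # random horizontal shift
    height_shift_range=0.1,       # random vertical shift
    brightness_range=(0.8, 1.2),  # random brightness adjustment
    validation_split=0.2)         # reserve 20% of the images for validation

# flow_from_directory also resizes every image to target_size
train_gen = datagen.flow_from_directory(
    'data/', target_size=(64, 64), batch_size=32, subset='training')
val_gen = datagen.flow_from_directory(
    'data/', target_size=(64, 64), batch_size=32, subset='validation')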
Now we will discuss some data preprocessing techniques and the Python syntax that can be used for each of them.
1. Importing the dataset: As mentioned earlier, most datasets are stored in CSV format, so we can use the pandas library to import and load them. For illustration, we will use a sample dataset containing zones and online shopping trends. The following syntax can be used to import the dataset.
import pandas as pd

dataset1 = pd.read_csv('SAMPLE.csv')
dataset1.head()  # show the first 5 rows of the dataset
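Before handling missing values, it is useful to check where they occur. Two standard pandas methods give a quick overview:
dataset1.info()          # column types and non-null counts
dataset1.isnull().sum()  # number of missing values per column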

2. Handling Missing Data: To build a machine learning model, we need to take care of missing values. Generally, there are three methods to treat them. The first is to predict the missing values by training a predictive model; the second is to remove the rows containing the missing values, although this is not a good approach for smaller datasets; the third and most common method is to replace a missing value with the mean of the remaining values in the same column. The following commands use the third method. Scikit-learn provides a tool called SimpleImputer for this purpose.
# replace missing values with the mean of the other values in the column
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
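If dropping rows (the second method above) is acceptable, pandas can do it in one line, and SimpleImputer also supports other strategies such as the median. Both alternatives are sketched here:
# alternative 1: drop every row that contains a missing value (shrinks the dataset)
dataset1 = dataset1.dropna()

# alternative 2: impute with the column median instead of the mean
imputer_med = SimpleImputer(missing_values=np.nan, strategy='median')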
 

You can now see that the missing values in the dataset have been replaced with mean values, and the dataset no longer contains any null values.
3. Handling Categorical Data: We cannot train a model on string inputs. In most raw datasets, categorical values are stored as strings, but for training a machine learning model we need to convert them into numerical values or arrays. There are two tools available in the scikit-learn library for this. One is LabelEncoder, which is mostly used to encode the target variable: it assigns a number to each category of the variable.
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 0]
 
array([1, 0, 2, 0, 2, 1, 0, 1, 2, 1], dtype=object)
 
You can now see that the numbers 0, 1, and 2 have been assigned to the categorical values. The other tool is OneHotEncoder, which creates a binary NumPy array for each category. Keras includes a similar utility, since the last dense layer of a neural network expects targets in this form. The following illustration demonstrates the one-hot encoder.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
n = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(n)
encoded_n = enc.transform(n).toarray()
encoded_n
array([[0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 1., 0.]])
 
You can now see the one-hot encoded data in the form of a NumPy array.
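On the Keras side, the analogous utility is to_categorical, which one-hot encodes integer class labels; the labels below are illustrative:
from tensorflow.keras.utils import to_categorical

y = [1, 0, 2, 0, 2]                          # integer class labels (illustration only)
y_onehot = to_categorical(y, num_classes=3)  # binary matrix, one column per class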
4. Train-test split of the dataset: While building machine learning and deep learning models, two datasets are required: a training dataset and a testing (or validation) dataset. We train the model on the training dataset, but we also need to verify its accuracy, so a validation dataset is necessary for evaluating the model. We can use the train_test_split tool from the scikit-learn module for this purpose. Generally, 80%-20%, 90%-10%, or 70%-30% train-test split ratios are used; here we use a 70%-30% ratio. You can use the following commands for this purpose.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=33)  # 30% test split
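For classification problems with imbalanced classes, the same function can preserve the class proportions in both splits through its stratify parameter; this variation is a sketch, not part of the original example:
# optional: keep the class distribution identical in the train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=33, stratify=Y)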
5. Feature Scaling: The last preprocessing step discussed here is variable transformation, or feature scaling, which is one of the most important steps. Consider two columns in the dataset, age and income: age is in the range of 30-60, while income is in the range of 60,000 to 100,000. These are not on the same scale, which may cause issues in the trained machine learning model, since the model will tend to weight its predictions toward the columns with larger values. So we need to bring these columns onto the same scale; this method is called feature scaling. Scikit-learn provides a built-in class called StandardScaler for doing this. We can transform the data onto the same scale using the following Python commands.
# feature scaling
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train[:, 1:3] = sc_X.fit_transform(X_train[:, 1:3])
X_test[:, 1:3] = sc_X.transform(X_test[:, 1:3])
 
array([[1, -0.1208766377101645, 0.4505577782776152],
       [0, -1.5853435945833165, -1.353180217034444],
       [0, 0.13947304351172932, -0.9734459022319053],
       [2, -1.146003507521371, -0.7835787448306358],
       [1, 0.9042502321010419, 0.9252256717807886],
       [1, 1.4900370148503028, 1.5897607226852315],
       [2, 0.3184634493517811, 0.14466069135334744]], dtype=object)

Now we can see that all the values are on the same scale.
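StandardScaler standardizes each column to zero mean and unit variance. If values bounded to a fixed range such as [0, 1] are preferred, scikit-learn's MinMaxScaler is a common alternative; this sketch reuses the same column slice as above:
from sklearn.preprocessing import MinMaxScaler

# scale features into the [0, 1] range instead of standardizing them
mm_X = MinMaxScaler()
X_train[:, 1:3] = mm_X.fit_transform(X_train[:, 1:3])
X_test[:, 1:3] = mm_X.transform(X_test[:, 1:3])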
We have now applied several data preprocessing techniques to the dataset, and it is ready to be fed into a machine learning model. We cannot use a raw dataset directly to train a machine learning model; some operations need to be performed on it first. So we have seen that data preprocessing is an important step in building predictive models. This is all about data preprocessing with Python. You may take some help from the scikit-learn website for more operations and data preprocessing methods. You can download the full Jupyter notebook and dataset from this link.