Quintillions of bytes of data are generated every day, which makes manipulating this data a challenging task. Data scientists aim to manage and analyze that data and use it to build predictive models. Machine learning algorithms can learn from data and produce useful predictive models, which can help predict future results. The main requirement of these machine learning algorithms is data. The accuracy and efficiency of a model depend on the algorithm as well as the quality of the dataset, so data preparation is necessary before developing predictive models. Data can come in many forms, such as images, text, audio, tables, etc., and different types of data require different preprocessing techniques. In this blog, we describe data preprocessing techniques for image datasets and numerical datasets. Python is one of the best tools in the data science field, and it supports many libraries for machine learning as well as data preprocessing. There are some important steps in data preprocessing:
1. Importing the Dataset
2. Checking for Missing Data in the Dataset
3. Encoding Categorical Data
4. Feature Scaling
5. Splitting the Dataset into Training Set and Test Set
In this blog, we are going to explain these data preprocessing techniques and show how Python can be used for each purpose.
Pandas: Pandas is a powerful, fast, open-source Python library for data preprocessing and data analysis. Most of the time, data is stored as a CSV file or an Excel spreadsheet, and pandas is an important tool for reading and processing this data. Pandas supports many built-in functions, so accessing the data becomes easy. For detecting and removing outliers, pandas is an excellent tool. All the data preprocessing steps can be done with this library alone, but doing everything by hand in pandas may require intense coding and consume a lot more time.
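For example, outlier removal takes only a few lines of pandas. Here is a minimal sketch using the IQR rule on a made-up income column:

import pandas as pd

# Made-up example: remove outliers from an 'income' column using the IQR rule
df = pd.DataFrame({'income': [45000, 52000, 61000, 58000, 990000]})

q1, q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
iqr = q3 - q1
# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the 990000 row is filtered out
df_clean = df[df['income'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df_clean)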
NumPy: NumPy is a
python package mostly used for scientific calculations or computations in data
science and other fields. It is basically used to create larger dimensional
arrays. With the help of NumPy library, matrix operations and some operations
on larger dimensional arrays is possible. While training deep learning models
there is a need to store images in larger arrays. So here NumPy tool plays an
important role. It can store a larger image dataset in a single array. So data
preprocessing on image data becomes an easy task due to this tool. So this
library is also important for preprocessing complex data such as audio signals
and images.
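As an illustration, here is a minimal sketch of holding a whole batch of images in one NumPy array; the shapes are made up for the example:

import numpy as np

# Illustrative shapes: 10 RGB images of 64x64 pixels stored in one 4-D array
images = np.zeros((10, 64, 64, 3), dtype=np.uint8)

# Vectorized preprocessing over the whole batch, e.g. scaling pixels to [0, 1]
scaled = images.astype(np.float32) / 255.0
print(scaled.shape)  # (10, 64, 64, 3)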
Scikit-learn: Scikit-learn is one of the simplest and most efficient libraries for building machine learning models in Python. Scikit-learn can also be used for data preprocessing: many built-in functions and tools are available in this library for that purpose. With this tool, a one-line command is often sufficient for a large or complex transformation.
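As a small illustration of such a one-line transformation, the following sketch scales each row of a made-up matrix to unit length with a single call:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # made-up input
# One line scales each row to unit L2 norm
X_normalized = normalize(X)
print(X_normalized)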
Keras Utilities: Keras is the high-level API for building TensorFlow models, but it can also be used for image data preprocessing. There is a tool called ImageDataGenerator in Keras for processing images. Using this tool, the brightness range, zoom range, rotation range, and shift range can be adjusted, and image resizing and standard normalization can also be done. A train-test split can be performed with Keras utilities as well, as shown in the sketch below.
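Here is a minimal sketch of ImageDataGenerator; the images/ directory is a hypothetical folder of class subdirectories, and the parameter values are just examples:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation and normalization settings; validation_split reserves 20% for validation
datagen = ImageDataGenerator(
    rescale=1.0 / 255,           # standard normalization of pixel values
    rotation_range=20,           # rotation range in degrees
    width_shift_range=0.1,       # horizontal shift range
    height_shift_range=0.1,      # vertical shift range
    zoom_range=0.2,              # zoom range
    brightness_range=[0.8, 1.2], # brightness range
    validation_split=0.2,        # train-validation split
)

# 'images/' is a hypothetical directory; target_size resizes every image
train_gen = datagen.flow_from_directory(
    'images/', target_size=(64, 64), batch_size=32, subset='training')
val_gen = datagen.flow_from_directory(
    'images/', target_size=(64, 64), batch_size=32, subset='validation')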
Now we will discuss some data preprocessing techniques and the Python syntax that can be used for each of them.
1. Importing the dataset: As we noted earlier, most datasets are stored in CSV format, so we can use the pandas library to import and load them. For illustration, we will use one sample dataset containing zones and online shopping trends. The following syntax can be used to import the dataset.
import pandas as pd

dataset1 = pd.read_csv('SAMPLE.csv')
dataset1.head()  # show the first 5 rows of the dataset
2. Handling of Missing Data: To build a machine learning model, we need to take care of missing values. Generally, there are three methods to treat them. The first is to predict the missing values by training a predictive model. The second is to remove the rows containing missing values, but this is not the right way to handle a smaller dataset. The third and most general method is to take the mean of the remaining values in the same column of the dataset; this is the most common way to treat missing values. The following command lines implement this third method, using a tool called SimpleImputer in the scikit-learn library.
import numpy as np
from sklearn.impute import SimpleImputer

# replace missing values with the mean of all the other values in the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
You can now see that the missing values in the dataset are replaced with mean values, and the dataset no longer contains any null value.
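If you instead prefer the second method (removing rows with missing values), a minimal pandas sketch looks like this, assuming dataset1 is the DataFrame loaded earlier:

# Alternative: drop every row that contains at least one missing value
dataset1_clean = dataset1.dropna()
print(dataset1_clean.isnull().sum())  # confirm no nulls remain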
3. Handling of Categorical Data: We cannot train a model with string inputs. In most raw datasets, categorical values are stored as string variables, but for training a machine learning model we need to convert them into numerical values or arrays. There are two tools available in the scikit-learn library for doing this. One is LabelEncoder, which is mostly used to encode the target variable: it assigns a number according to the category of the variable.
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 0]
array([1, 0, 2, 0, 2, 1, 0, 1, 2, 1], dtype=object)
You can now see that the numbers 0, 1, and 2 are assigned to the categorical values. The other tool is OneHotEncoder, which creates a binary NumPy array for each category. The Keras library also includes this kind of encoding, since the last dense layer of a neural network expects this type of data. The following illustration demonstrates the implementation of the one-hot encoder.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
n = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(n)
encoded_n = enc.transform(n).toarray()
encoded_n
array([[0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 1., 0.]])
You can now see the one-hot encoded data in the form of a NumPy array.
4. Train-test split of the dataset: While building machine learning and deep learning models, two datasets are required: one is the training dataset and the other is the testing (or validation) dataset. We train the model on the training dataset, but we also need to verify its accuracy, so a validation dataset is necessary to validate the model and calculate its accuracy. We can use the train_test_split tool available in the scikit-learn module for this purpose. Generally, 80%-20%, 90%-10%, or 70%-30% train-test split ratios are used; here we use a 70%-30% ratio. You can use the following commands for this purpose.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=33)  # 30% test split
5. Feature Scaling: Here we are talking about variable transformation, or feature scaling, which is one of the important steps in data preprocessing. You can see two columns in the dataset: one is age and the other is income. Age is in the range of 30 to 60 and income is in the range of 60000 to 100000, so they are not on the same scale. This may cause issues in your trained machine learning model, because the model will tend to weight its predictions toward the larger-valued columns. So we need to convert these columns to the same scale; this method is called feature scaling. There is a built-in class called StandardScaler in scikit-learn for doing this, which standardizes each feature by subtracting its mean and dividing by its standard deviation. We can transform the data to the same scale with the following Python commands.
from sklearn.preprocessing import StandardScaler

# feature scaling: standardize the numeric columns (indices 1 and 2)
sc_X = StandardScaler()
X_train[:, 1:3] = sc_X.fit_transform(X_train[:, 1:3])
X_test[:, 1:3] = sc_X.transform(X_test[:, 1:3])
array([[1, -0.1208766377101645, 0.4505577782776152],
       [0, -1.5853435945833165, -1.353180217034444],
       [0, 0.13947304351172932, -0.9734459022319053],
       [2, -1.146003507521371, -0.7835787448306358],
       [1, 0.9042502321010419, 0.9252256717807886],
       [1, 1.4900370148503028, 1.5897607226852315],
       [2, 0.3184634493517811, 0.14466069135334744]], dtype=object)
Now we can see that all the values are on the same scale.
We have now applied several data preprocessing techniques to the dataset, and it is ready to be fed into a machine learning model. We cannot use a raw dataset directly to train a machine learning model; some operations need to be performed on it first. So we have seen that data preprocessing is an important step in building predictive models. This is all about data preprocessing with Python. You can consult the scikit-learn website for more operations and data preprocessing methods. You can download the full Jupyter notebook and dataset from this link:
https://drive.google.com/open?id=1kcnwDXb-o5aAxLJPkvEBQW-RJUQ3KWjc
You can directly view this Notebook from here:
https://datapreprocessingwithpython.blogspot.com/p/import-numpy-as-np-import-pandas-as-pd.html


