Explore the Fluent Ways of Handling Large Datasets for Machine Learning Using Python
'Large dataset' is an integral part of machine learning and data science. But handling such a large quantity of data is not easy: it can exhaust your system's RAM, so the ML algorithm no longer fits in memory. If you are stuck with the same problem, you have landed on the right page. This blog will walk you through an easy process for handling large datasets for machine learning as fluently as possible.
What is a large dataset?
Suppose you have conducted a survey for your market research studies. Obviously, the collection of responses from the survey will not be tiny (say, 100 responses). Business surveys usually include tons of individual responses (microdata) and many factors (dependent and independent variables) as raw data for analysis and manipulation. Such an expanded set of data is called a large dataset.
How does a large dataset differ from big data?
As described in the above definition, a large dataset is a high-volume collection of micro and raw data that a data analyst has to deal with directly.
On the other hand, big data is a technical buzzword that indicates a set of data of monstrous volume. As the word 'monstrous' suggests, handling such data goes beyond human hands alone; rather, it needs the intervention of AI and deep learning.
Can Python handle a large dataset?
The simplest answer is 'yes'.
With the collaboration of different Python libraries like Pandas, NumPy, Matplotlib, Dask, etc., Python can easily and reliably handle a large set of data.
However, depending on the data science project scenario and the volume of the dataset, the entire data analysis process may be completed either within a single computer processing unit or through a distributed computing system.
How does machine learning deal with a large dataset?
Although directly feeding the dataset into your algorithm can crash your system, with Python programming you can carry out your data analysis and ML modelling quite easily.
With the tricky use of pandas, machine learning can cope with large dataset analysis.
Yes, you heard right. I said 'tricky' because, even though pandas can build a powerful DataFrame and handle multi-format data files such as JSON, doc, txt, CSV, etc., its performance starts flickering as the volume of the dataset grows. Hence, to keep data manipulation, filtering, and other analysis running smoothly, you need to follow a few tricky steps.
1. Targeting the right data types only.
If you rely on pandas' default settings for data import, it will infer data types automatically. However, these inferred types are often not the most memory-friendly choice for your data analysis project. So what you need to do is set the right data types manually.
For a simple instance, your chosen dataset may comprise a column named ‘studentID’ holding only the values 1 and 2. But the default pandas behaviour assigns it the data type ‘int64’, which consumes more memory. You can simply switch it to a boolean category if that brings the benefit of lower memory consumption.
In addition, you can alter the following columns to make your dataset more memory friendly:
● ‘dropoff_latitude’
● ‘dropoff_longitude’
● ‘pickup_latitude’
● ‘pickup_longitude’
The above columns have to be switched from ‘float64’ to ‘float32’.
If your dataset contains a payment column, 'payment_type', it may be switched to the 'categorical' type.
Below is a programming example for a school bus service.
import pandas as pd
from sys import getsizeof

data = pd.read_csv("dataset/filename.csv")

size = getsizeof(data)/(1024*1024)
print("Initial Size: %.4f MB"%size)

# Altering studentID to boolean
data.studentID = data.studentID.apply(lambda x: x==2)

# Changing pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude to float32
location_columns = ['pickup_latitude','pickup_longitude', 'dropoff_latitude','dropoff_longitude']
data[location_columns] = data[location_columns].astype('float32')

# Altering payment_type to categorical
data.payment_type = data.payment_type.astype('category')

size = getsizeof(data)/(1024*1024)
print("Size after reduction: %.4f MB"%size)
2. Divide your dataset into multiple chunks
While you are proceeding with your data analysis, you certainly don't need the entire dataset all at once. Rather, you work on different sub-parts of the dataset over time. So, while you are working on a specific part of the dataset, why waste memory by loading the rest? Instead, make a practice of splitting (chunking) your dataset as per your requirement. Then, depending on the size of your system's Random Access Memory (RAM), you can identify your best-fit chunk size.
Here you need to use two commands as follows:
● ‘read_csv()’ followed by
● ‘chunksize’.
Below is an example of optimised programming associated with dataset chunking.
import pandas as pd
import psutil

# Loading the training dataset by chunking the dataframe
memory_timestep_1 = psutil.virtual_memory()

data_iterator = pd.read_csv("dataset/filename.csv", chunksize=100000)
fare_amount_sum_chunk = 0  # fare = bus fare
for data_chunk in data_iterator:
    fare_amount_sum_chunk += data_chunk['fare_amount'].sum()

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired with chunking the dataframe: %.4f MB"%memory_used_pd)

# Training dataset loading with pandas (no chunking)
memory_timestep_3 = psutil.virtual_memory()

training_data_pd = pd.read_csv("dataset/filename.csv")
fare_amount_sum_pd = training_data_pd['fare_amount'].sum()

memory_timestep_4 = psutil.virtual_memory()
# psutil.virtual_memory() reports the system memory consumption; index 3 is the 'used' field
memory_used_pd = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired without chunking the dataframe: %.4f MB"%memory_used_pd)
The output will show the acquired memory for the DataFrame with and without chunking, respectively.
3. Column dropping
Columns consume a lot of memory. As with other types of data, in the case of tables too, we don't need all of the columns at the same time. In such cases, you can temporarily drop all those momentarily worthless columns and proceed with a lower-memory data analysis task.
To apply column dropping with 'read_csv()', you need to use the parameter 'usecols'.
Below is an example of Python code with the column-dropping parameter in use.
import pandas as pd
import psutil

# Training dataset loading with selected columns only
memory_timestep_1 = psutil.virtual_memory()

columns = ['fare_amount', 'trip_distance']  # trip_distance is the distance between student pickup and drop location
data_1 = pd.read_csv("dataset/filename.csv", usecols=columns)

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired by sampling columns: %.4f MB"%memory_used_pd)

# Training dataset loading with the help of pandas (all columns)
memory_timestep_3 = psutil.virtual_memory()

data_2 = pd.read_csv("dataset/filename.csv")

memory_timestep_4 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired without sampling columns: %.4f MB"%memory_used_pd)
The output will show the acquired memory for the DataFrame with and without sampling columns, respectively.
Now, you may ask whether, while working on a data science project, you really have to do all of this manual programming manipulation. In fact, the example I have given here is the simplest one; actual machine learning problems are far more complex than this.
Hence, the question becomes: if data science offers apparently smarter solutions to everything, then why not to these memory management issues associated with machine learning?
Yes, you are right. A handy alternative solution is there. 'Dask' is a Python library that becomes the magic remedy for the obstacles associated with large datasets for machine learning. Let's have a quick look at how to use Dask for handling such memory management issues.
How to use Dask for handling large datasets for machine learning?
What is Dask?
Have you ever heard of 'parallel computing'? If you are well aware of Python programming, certainly you know this term.
Well, parallel computing is a special kind of computing architecture that consists of multiple smaller, simultaneous computation sub-processes. Such sub-processes can be executed on different processors at the same time. Each smaller sub-process is generated by splitting a larger and more complicated data processing problem.
Dask is an example of such a parallel computation library. The use of Dask offers the impressive benefit of fast computation at the lowest possible memory consumption.
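To picture how such splitting works, here is a minimal, purely illustrative sketch using Dask's delayed interface (the square and add functions below are made-up toy tasks, not part of any real dataset):

import dask

# Two toy sub-tasks created by splitting a bigger calculation (illustrative only)
@dask.delayed
def square(x):
    return x * x

@dask.delayed
def add(a, b):
    return a + b

# This line only builds the task graph; no calculation has run yet
total = add(square(3), square(4))

# compute() executes the graph, running independent sub-tasks in parallel where possible
print(total.compute())  # prints 25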
How is Dask different from pandas?
Although Dask consumes less memory, in actuality Dask's operation is lazier than pandas'. Yes, it's a shocking truth.
Still, Dask provides apparently quick output because it only reads the required values and builds the corresponding task graph. And once the requisite values are used, they get deleted from memory. In fact, this is the ultimate reason why Dask consumes less memory than pandas.
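As a quick sketch of this lazy behaviour (reusing the hypothetical 'dataset/filename.csv' file and 'fare_amount' column from the earlier examples), nothing is actually read from disk until compute() is called:

import dask.dataframe as ddf

# Lazy read: this only records what to do and builds a task graph
lazy_df = ddf.read_csv("dataset/filename.csv")

# Still lazy: the sum just adds another node to the task graph
lazy_sum = lazy_df['fare_amount'].sum()

# compute() finally streams the file partition by partition,
# keeping only what is currently needed in memory
print("Total fare amount:", lazy_sum.compute())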
Use of Dask in collaboration with pandas
The DataFrame used here is a single Dask DataFrame, which is made up of multiple smaller pandas DataFrames (partitions).
To check the efficacy of Dask, you can simply load the same dataset individually using Dask and pandas and compare the memory consumption.
Below is the Python code for revealing the memory consumption of sample training data (associated with the file name 'filename.csv') using pandas and Dask, respectively.
import pandas as pd
import dask.dataframe as ddf
import psutil

# Loading the sample training dataset using pandas
memory_timestep_1 = psutil.virtual_memory()

training_data_pd = pd.read_csv("dataset/filename.csv")

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired using pandas: %.4f MB"%memory_used_pd)

# Loading the sample training dataset using dask
memory_timestep_3 = psutil.virtual_memory()

training_data_ddf = ddf.read_csv("dataset/filename.csv")

memory_timestep_4 = psutil.virtual_memory()
memory_used_ddf = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired using dask: %.4f MB"%memory_used_ddf)
Now, everything we have discussed so far was about a normal dataset containing numbers and characters. But what if your dataset includes tons of large images? Well, images are massive memory killers, so working on an image-heavy dataset becomes challenging in ML projects.
However, in such cases, Keras becomes the saviour.
Image data optimisation with Keras
Keras offers a feature called 'ImageDataGenerator'. Using this feature, you can import all of your images stored on the disk as usual but in several batches.
Suppose your dataset is associated with two directories:
● XYZ
● ABC
(‘XYZ’ and ‘ABC’ are the categories of the dataset.)
Each of the directories is further divided into other subdirectories as follows.
For illustration, we have two kinds of images: one subdirectory contains baby boys, and the other contains baby girls.
Once the images are organised this way, you can load the data using the following method.
flow_from_directory()
Below is a sample Python programme for the same.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Object generation for ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range=20,     # randomly rotate images by up to 20 degrees
    horizontal_flip=True   # randomly flip images horizontally
)

# Generator development using the flow_from_directory method
data_generator = datagen.flow_from_directory(
    directory="/content/dataset/XYZ_set/XYZ_set",  # specify your dataset directory
    batch_size=16,  # specify the number of images you want to load at a time
)

# Load a batch using next()
images, labels = next(data_generator)

nrows = 4
ncols = 4
fig = plt.figure(figsize=(10,10))
for i in range(16):
    fig.add_subplot(nrows, ncols, i+1)
    plt.imshow(images[i].astype('uint8'))
    plt.axis(False)
plt.show()
The output will display a batch of images drawn from the two defined classes (baby boys and baby girls).
However, in some cases you need to filter for more custom data. For example, suppose that in the above dataset you need the custom data for baby girl images only.
The best solution in such cases will be to apply the ‘tf.keras.utils.Sequence’ class.
Here you need to follow a simple 2-step method:
Step 1:
__getitem__
Step 2:
__len__
These will lower the RAM consumption, and to fit your image dataset within the available memory, you can use the following method (a minimal training sketch is shown after the sample programme below).
model.fit()
Below is a sample programme that will show optimised data loading for the baby girls dataset only.
import tensorflow as tf
import cv2
import numpy
import os
import matplotlib.pyplot as plt

class CustomDataGenerator(tf.keras.utils.Sequence):

    def __init__(self, batch_size, dataset_directory):
        self.batch_size = batch_size
        self.directory = dataset_directory
        self.list_IDs = os.listdir(self.directory)

    # Returns the number of batches to generate
    def __len__(self):
        return len(self.list_IDs) // self.batch_size

    # Returns the batch at a given index
    # Create your own logic for how you want to load your data
    def __getitem__(self, index):
        batch_IDs = self.list_IDs[index*self.batch_size : (index+1)*self.batch_size]
        images = []
        for id in batch_IDs:
            path = os.path.join(self.directory, id)
            image = cv2.imread(path)
            image = cv2.resize(image, (100,100))
            images.append(image)
        return images

babygirl_data_generator = CustomDataGenerator(
    batch_size = 16,
    dataset_directory = "/content/dataset/XYZ_set/XYZ_set/babygirls"
)

# Get a batch of images
images = next(iter(babygirl_data_generator))

nrows = 4
ncols = 4
fig = plt.figure(figsize=(10,10))
for i in range(16):
    fig.add_subplot(nrows, ncols, i+1)
    plt.imshow(images[i].astype('uint8'))
    plt.axis(False)
plt.show()
The output will show only the baby girl images in optimised form.
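To then actually train a model within the available memory, the same Sequence object can be handed straight to model.fit(). Below is a minimal, hedged sketch: the small CNN architecture is a made-up placeholder, and it assumes the generator's __getitem__ is extended to return a (batch_images, batch_labels) pair of NumPy arrays, which is the form model.fit() expects.

import tensorflow as tf

# Placeholder CNN for 100x100 RGB images; this architecture is an assumption,
# not part of the original example
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Keras pulls one batch at a time from the Sequence via __len__ and __getitem__,
# so only a single batch lives in RAM during training
# (assumes __getitem__ returns (images, labels) as NumPy arrays)
model.fit(babygirl_data_generator, epochs=5)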
So, this is a brief overview of optimising large data for ML in Python. Now, if you want to learn more, you can join our data science and AI courses. Our courses come with market-competent learning modules for both programming and statistics. We offer practical training on TensorFlow, Keras, Pandas, and all the other most in-demand Python libraries. In addition, we offer live industrial project opportunities directly from domain-specific product-based MNCs. The project experience certificate will help you secure a job with companies paying high data science salaries in India.
Although Learnbay is based in Bangalore, our courses are available across different cities like Patna, Kolkata, Mumbai, Delhi, Hyderabad, and Chennai.
To know more and get the latest updates about our courses, blogs, and data science tricks and tips, follow us on: LinkedIn, Twitter, Facebook, YouTube, Instagram, Medium.