Explore the Fluent Ways of Handling Large Datasets for Machine Learning Using Python
'Large dataset' is an integral part of machine learning and data science. But handling such a large quantity of data is not easy: it can exhaust your system's RAM, so the ML algorithm no longer fits in memory. If you are stuck with the same problem, you have landed on the right page. This blog will walk you through an easy process for handling large datasets for machine learning as fluently as possible.
What is a large dataset?
Suppose you have conducted a survey for your market research studies. Obviously, the collection of responses from the survey will not be tiny (say, 100 responses). Business surveys usually include tons of individual responses (microdata) and many factors (dependent and independent variables) as raw data for analysis and manipulation. Such an expanded set of data is called a large dataset.
How does a large dataset differ from big data?
As described in the above definition, a large dataset is a high-volume collection of micro and raw data that a data analyst has to deal with directly.
On the other hand, big data is a technical buzzword that indicates a set of data of monstrous volume. As the word 'monstrous' suggests, handling such data goes beyond human hands alone; rather, it needs the intervention of AI and deep learning.
Can Python handle a large dataset?
The simplest answer is 'yes'.
With the collaboration of different Python libraries like Pandas, NumPy, Matplotlib, Dask, etc., Python can easily and reliably handle a large set of data.
However, depending on the data science project scenario and the volume of the dataset, the entire data analysis process may be completed either within a single computer processing unit or through a distributed computing system.
How does machine learning deal with a large dataset?
Although directly feeding the dataset into your algorithm can crash your system, with Python programming you can carry out your data analysis and ML modelling quite easily.
With the tricky use of pandas, machine learning can cope with large dataset analysis.
Yes, you heard right. I said 'tricky' because, even though pandas can build a powerful DataFrame and handle multi-format data files such as JSON, doc, txt, CSV, etc., its performance starts flickering as the volume of the dataset grows. Hence, to keep data manipulation, filtering, and other analysis running smoothly, you need to follow a few tricky steps.
1. Targeting the right data types only.
If you rely on pandas' default settings for data import, it will infer data types automatically. However, these inferred types are often not the most memory-friendly choice for your data analysis project. So what you need to do is set the right data types manually.
For a simple instance, your chosen dataset may comprise a column named ‘studentID’ holding only the values 1 and 2. But the default pandas behaviour assigns it the data type ‘int64’, which consumes more memory. You can simply switch it to a boolean category if that brings the benefit of lower memory consumption.
In addition, you can alter the following columns to make your dataset more memory friendly:
● ‘dropoff_latitude’
● ‘dropoff_longitude’
● ‘pickup_latitude’
● ‘pickup_longitude’
The above columns have to be switched from ‘float64’ to ‘float32’.
If your dataset contains a payment column, 'payment_type', it may be switched to the 'categorical' type.
Below is a programming example for a school bus service.
import pandas as pd
from sys import getsizeof

data = pd.read_csv("dataset/filename.csv")

size = getsizeof(data)/(1024*1024)
print("Initial Size: %.4f MB"%size)

# Altering studentID to boolean
data.studentID = data.studentID.apply(lambda x: x==2)

# Changing pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude to float32
location_columns = ['pickup_latitude','pickup_longitude', 'dropoff_latitude','dropoff_longitude']
data[location_columns] = data[location_columns].astype('float32')

# Altering payment_type to categorical
data.payment_type = data.payment_type.astype('category')

size = getsizeof(data)/(1024*1024)
print("Size after reduction: %.4f MB"%size)
2. Divide your dataset into multiple chunks
While you are proceeding with your data analysis, you certainly don't need the entire dataset all at once. Rather, you work on different sub-parts of the dataset over time. So, while you are working on a specific part of the dataset, why waste memory by loading the rest? Instead, make a practice of splitting (chunking) your dataset as per your requirement. Then, depending on the size of your system's Random Access Memory (RAM), you can identify your best-fit chunk size.
Here you need to use two commands as follows:
● ‘read_csv()’ followed by
● ‘chunksize’.
Below is an example of optimised programming associated with dataset chunking.
import pandas as pd
import psutil

# Loading the training dataset by chunking the dataframe
memory_timestep_1 = psutil.virtual_memory()

data_iterator = pd.read_csv("dataset/filename.csv", chunksize=100000)
fare_amount_sum_chunk = 0  # fare = bus fare
for data_chunk in data_iterator:
    fare_amount_sum_chunk += data_chunk['fare_amount'].sum()

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired with chunking the dataframe: %.4f MB"%memory_used_pd)

# Training dataset loading with pandas (no chunking)
memory_timestep_3 = psutil.virtual_memory()

training_data_pd = pd.read_csv("dataset/filename.csv")
fare_amount_sum_pd = training_data_pd['fare_amount'].sum()

memory_timestep_4 = psutil.virtual_memory()
# psutil.virtual_memory() reports the system memory consumption; index 3 is the 'used' field
memory_used_pd = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired without chunking the dataframe: %.4f MB"%memory_used_pd)
The output will show the acquired memory for the DataFrame with and without chunking, respectively.
3. Column dropping
Columns consume a lot of memory. As with other types of data, in the case of tables too, we don't need all of the columns at the same time. In such cases, you can temporarily drop all those momentarily worthless columns and proceed with a lower-memory data analysis task.
To apply column dropping with 'read_csv()', you need to use the parameter 'usecols'.
Below is an example of Python code with the column-dropping parameter in use.
import pandas as pd
import psutil

# Training dataset loading with selected columns only
memory_timestep_1 = psutil.virtual_memory()

columns = ['fare_amount', 'trip_distance']  # trip_distance is the distance between student pickup and drop location
data_1 = pd.read_csv("dataset/filename.csv", usecols=columns)

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired by sampling columns: %.4f MB"%memory_used_pd)

# Training dataset loading with the help of pandas (all columns)
memory_timestep_3 = psutil.virtual_memory()

data_2 = pd.read_csv("dataset/filename.csv")

memory_timestep_4 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired without sampling columns: %.4f MB"%memory_used_pd)
The output will show the acquired memory for the DataFrame with and without sampling columns, respectively.
Now, you may ask whether, while working on a data science project, you really have to do all of this manual programming manipulation. In fact, the example I have given here is the simplest one; actual machine learning problems are far more complex than this.
Hence, the question becomes: if data science offers apparently smarter solutions to everything, then why not to these memory management issues associated with machine learning?
Yes, you are right. A handy alternative solution is there. 'Dask' is a Python library that becomes the magic remedy for the obstacles associated with large datasets for machine learning. Let's have a quick look at how to use Dask for handling such memory management issues.
How to use Dask for handling large datasets for machine learning?
What is Dask?
Have you ever heard of 'parallel computing'? If you are well aware of Python programming, certainly you know this term.
Well, parallel computing is a special kind of computing architecture that consists of multiple smaller, simultaneous computation sub-processes. Such sub-processes can be executed on different processors at the same time. Each smaller sub-process is generated by splitting a larger and more complicated data processing problem.
Dask is an example of such a parallel computation library. The use of Dask offers the impressive benefit of fast computation at the lowest possible memory consumption.
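To picture how such splitting works, here is a minimal, purely illustrative sketch using Dask's delayed interface (the square and add functions below are made-up toy tasks, not part of any real dataset):

import dask

# Two toy sub-tasks created by splitting a bigger calculation (illustrative only)
@dask.delayed
def square(x):
    return x * x

@dask.delayed
def add(a, b):
    return a + b

# This line only builds the task graph; no calculation has run yet
total = add(square(3), square(4))

# compute() executes the graph, running independent sub-tasks in parallel where possible
print(total.compute())  # prints 25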
How is Dask different from pandas?
Although Dask consumes less memory, in actuality Dask's operation is lazier than pandas'. Yes, it's a shocking truth.
Still, Dask provides apparently quick output because it only reads the required values and builds the corresponding task graph. And once the requisite values are used, they get deleted from memory. In fact, this is the ultimate reason why Dask consumes less memory than pandas.
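As a quick sketch of this lazy behaviour (reusing the hypothetical 'dataset/filename.csv' file and 'fare_amount' column from the earlier examples), nothing is actually read from disk until compute() is called:

import dask.dataframe as ddf

# Lazy read: this only records what to do and builds a task graph
lazy_df = ddf.read_csv("dataset/filename.csv")

# Still lazy: the sum just adds another node to the task graph
lazy_sum = lazy_df['fare_amount'].sum()

# compute() finally streams the file partition by partition,
# keeping only what is currently needed in memory
print("Total fare amount:", lazy_sum.compute())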
Use of Dask in collaboration with pandas
The DataFrame used here is a single Dask DataFrame, which is made up of multiple smaller pandas DataFrames (partitions).
To check the efficacy of Dask, you can simply load the same dataset individually using Dask and pandas and compare the memory consumption.
Below is the Python code for revealing the memory consumption of sample training data (associated with the file name 'filename.csv') using pandas and Dask, respectively.
import pandas as pd
import dask.dataframe as ddf
import psutil

# Loading the sample training dataset using pandas
memory_timestep_1 = psutil.virtual_memory()

training_data_pd = pd.read_csv("dataset/filename.csv")

memory_timestep_2 = psutil.virtual_memory()
memory_used_pd = (memory_timestep_2[3] - memory_timestep_1[3])/(1024*1024)
print("Memory acquired using pandas: %.4f MB"%memory_used_pd)

# Loading the sample training dataset using dask
memory_timestep_3 = psutil.virtual_memory()

training_data_ddf = ddf.read_csv("dataset/filename.csv")

memory_timestep_4 = psutil.virtual_memory()
memory_used_ddf = (memory_timestep_4[3] - memory_timestep_3[3])/(1024*1024)
print("Memory acquired using dask: %.4f MB"%memory_used_ddf)
Now, everything we have discussed so far was about a normal dataset containing numbers and characters. But what if your dataset includes tons of large images? Well, images are massive memory killers, so working on an image-heavy dataset becomes challenging in ML projects.
However, in such cases, Keras becomes the saviour.
Image data optimisation with Keras
Keras offers a feature called 'ImageDataGenerator'. Using this feature, you can import all of your images stored on the disk as usual but in several batches.
Suppose your dataset is associated with two directories:
● XYZ
● ABC
(‘XYZ’ and ‘ABC’ are the categories of the dataset.)
Each of the directories is further divided into other subdirectories as follows.
For illustration, we have two kinds of images: one subdirectory contains baby boys, and the other contains baby girls.
Once the images are organised this way, you can load the data using the following method.
flow_from_directory()
Below is a sample Python programme for the same.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Object generation for ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range=20,     # randomly rotate images by up to 20 degrees
    horizontal_flip=True   # randomly flip images horizontally
)

# Generator development using the flow_from_directory method
data_generator = datagen.flow_from_directory(
    directory="/content/dataset/XYZ_set/XYZ_set",  # specify your dataset directory
    batch_size=16,  # specify the number of images you want to load at a time
)

# Load a batch using next()
images, labels = next(data_generator)

nrows = 4
ncols = 4
fig = plt.figure(figsize=(10,10))
for i in range(16):
    fig.add_subplot(nrows, ncols, i+1)
    plt.imshow(images[i].astype('uint8'))
    plt.axis(False)
plt.show()
The output will display a batch of images drawn from the two defined classes (baby boys and baby girls).
However, in some cases you need to filter for more custom data. For example, suppose that in the above dataset you need the custom data for baby girl images only.
The best solution in such cases will be to apply the ‘tf.keras.utils.Sequence’ class.
Here you need to follow a simple 2-step method:
Step 1:
__getitem__
Step 2:
__len__
These will lower the RAM consumption, and to fit your image dataset within the available memory, you can use the following method (a minimal training sketch is shown after the sample programme below).
model.fit()
Below is a sample programme that will show optimised data loading for the baby girls dataset only.
import tensorflow as tf
import cv2
import numpy
import os
import matplotlib.pyplot as plt

class CustomDataGenerator(tf.keras.utils.Sequence):

    def __init__(self, batch_size, dataset_directory):
        self.batch_size = batch_size
        self.directory = dataset_directory
        self.list_IDs = os.listdir(self.directory)

    # Returns the number of batches to generate
    def __len__(self):
        return len(self.list_IDs) // self.batch_size

    # Returns the batch at a given index
    # Create your own logic for how you want to load your data
    def __getitem__(self, index):
        batch_IDs = self.list_IDs[index*self.batch_size : (index+1)*self.batch_size]
        images = []
        for id in batch_IDs:
            path = os.path.join(self.directory, id)
            image = cv2.imread(path)
            image = cv2.resize(image, (100,100))
            images.append(image)
        return images

babygirl_data_generator = CustomDataGenerator(
    batch_size = 16,
    dataset_directory = "/content/dataset/XYZ_set/XYZ_set/babygirls"
)

# Get a batch of images
images = next(iter(babygirl_data_generator))

nrows = 4
ncols = 4
fig = plt.figure(figsize=(10,10))
for i in range(16):
    fig.add_subplot(nrows, ncols, i+1)
    plt.imshow(images[i].astype('uint8'))
    plt.axis(False)
plt.show()
The output will show only the baby girl images in optimised form.
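To then actually train a model within the available memory, the same Sequence object can be handed straight to model.fit(). Below is a minimal, hedged sketch: the small CNN architecture is a made-up placeholder, and it assumes the generator's __getitem__ is extended to return a (batch_images, batch_labels) pair of NumPy arrays, which is the form model.fit() expects.

import tensorflow as tf

# Placeholder CNN for 100x100 RGB images; this architecture is an assumption,
# not part of the original example
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Keras pulls one batch at a time from the Sequence via __len__ and __getitem__,
# so only a single batch lives in RAM during training
# (assumes __getitem__ returns (images, labels) as NumPy arrays)
model.fit(babygirl_data_generator, epochs=5)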
So, this is a brief overview of optimising large data for ML in Python. Now, if you want to learn more, you can join our data science and AI courses. Our courses come with market-competent learning modules for both programming and statistics. We offer practical training on TensorFlow, Keras, Pandas, and all the other most in-demand Python libraries. In addition, we offer live industrial project opportunities directly from domain-specific product-based MNCs. The project experience certificate will help you secure a job with companies paying high data science salaries in India.
Although Learnbay is based in Bangalore, our courses are available across different cities like Patna, Kolkata, Mumbai, Delhi, Hyderabad, and Chennai.
To know more and get the latest updates about our courses, blogs, and data science tricks and tips, follow us on: LinkedIn, Twitter, Facebook, YouTube, Instagram, Medium.