How can data scientists use parallelism?

Finally, my program is running! Should I go for a coffee?

Data scientists have laptops with quad-core or octa-core processors and turbo boost technology. We routinely use servers with more cores and computing power. But are we really using the raw power we have at hand?

Instead of putting those resources to work, we often wait for time-consuming processes to complete, sometimes for hours, even as urgent deliverables approach their deadline. Can we do better?

This post describes how to use multiprocessing and Joblib to parallelize your code. These libraries allow you to use multiple cores on your machine, speeding up code execution.

Parallel processing in data science

Parallel processing is a technique that divides a large process into multiple smaller parts, each of which is processed by a separate processor. Data scientists need to add this method to their toolkit to reduce the time it takes to run large processes and to deliver results to clients quickly.
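To make the idea concrete, here is a minimal sketch of the split, process, and combine pattern using only the standard library. The helper names split_into_chunks and sum_of_squares, the data, and the choice of four parts are illustrative assumptions, not part of any library.

```python
from multiprocessing import Pool


def split_into_chunks(items, n_parts):
    """Split items into n_parts contiguous chunks of nearly equal size."""
    size, remainder = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done on one part: a hypothetical stand-in for a real computation."""
    return sum(x ** 2 for x in chunk)


if __name__ == "__main__":
    data = list(range(1_000))
    chunks = split_into_chunks(data, 4)
    # Each chunk is handed to a separate worker process.
    with Pool(4) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
    # Combine the partial results into the final answer.
    print(sum(partial_sums))
```

The rest of this post lets the library do the splitting, but the underlying pattern is the same.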

Using multiprocessing with single-parameter functions

We start with a common problem: we have a large list of items and want to apply a function to every element of it.

Why do you want to do this? It may seem like a trivial matter, but this is what we do on a daily basis in data science. For example, you may have a model and want to run it several times with different hyperparameters. Or you are creating a new feature in a big dataframe and you need to use the apply method to apply a function to the dataframe row by row. By the end of this post, you will be able to parallelize most of the use cases you face in data science with this simple construct.

So let’s go back to our hypothetical problem and apply the square function to every element in the list.

def f(x):
    # Square a single number.
    return x**2

Of course, you can use simple Python to execute this function on every element of the list.

result = [f(x) for x in list(range(100000))]

But the code is running in sequence. This means that only one core of our machine is doing all the work. In theory, this load can be shared with all cores of the machine. To do this, you can use multiprocessing to apply this function to all elements of a particular list in parallel, using the eight cores of a powerful computer.

from multiprocessing import Pool
pool = Pool(8)
result = pool.map(f,list(range(100000)))
pool.close()

These lines create a multiprocessing pool of eight workers and use it to map the function over the list.
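The hard-coded 8 assumes an eight-core machine. As a small sketch, the pool size can instead be derived from the machine with os.cpu_count() from the standard library. Here pick_worker_count is a hypothetical helper introduced only for illustration.

```python
import os
from multiprocessing import Pool


def f(x):
    return x ** 2


def pick_worker_count(n_tasks, max_workers=None):
    """Never start more workers than there are tasks or available cores."""
    available = max_workers if max_workers is not None else (os.cpu_count() or 1)
    return max(1, min(available, n_tasks))


if __name__ == "__main__":
    numbers = list(range(100_000))
    workers = pick_worker_count(len(numbers))
    # The __main__ guard matters on platforms that start workers by
    # re-importing this module (for example Windows and macOS).
    with Pool(workers) as pool:
        result = pool.map(f, numbers)
    print(workers, result[:5])
```

Calling Pool() with no argument also sizes the pool from the number of CPUs the machine reports, so an explicit count is mostly useful when you want to cap it.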

Let’s see how this code works.

from multiprocessing import Pool
import time
import plotly.express as px
import plotly
import pandas as pd

def f(x):
    return x**2

def runner(list_length):
    print(f"Size of List:{list_length}")
    t0 = time.time()
    result1 = [f(x) for x in list(range(list_length))]
    t1 = time.time()
    print(f"Without multiprocessing we ran the function in {t1 - t0:0.4f} seconds")
    time_without_multiprocessing = t1-t0
    t0 = time.time()
    pool = Pool(8)
    result2 = pool.map(f,list(range(list_length)))
    pool.close()
    t1 = time.time()
    print(f"With multiprocessing we ran the function in {t1 - t0:0.4f} seconds")
    time_with_multiprocessing = t1-t0
    return time_without_multiprocessing, time_with_multiprocessing

if __name__ ==  '__main__':
    times_taken = []
    for i in range(1,9):
        list_length = 10**i
        time_without_multiprocessing, time_with_multiprocessing = runner(list_length)
        times_taken.append([list_length, 'No Multiproc', time_without_multiprocessing])
        times_taken.append([list_length, 'Multiproc', time_with_multiprocessing])

    timedf = pd.DataFrame(times_taken,columns = ['list_length', 'type','time_taken'])
    fig =  px.line(timedf,x = 'list_length',y='time_taken',color="type",log_x=True)
    plotly.offline.plot(fig, filename="comparison_bw_multiproc.html")

As you can see, when the list is short, the multiprocessing runtime is slightly longer, but as the list grows, it does not increase as fast as the runtime without multiprocessing. This shows that using multiprocessing for cheap operations does not make much sense, because starting the workers and passing data between processes incurs some overhead.

In practice, we don’t use multiprocessing for functions that finish in milliseconds, but we do use it for much larger calculations that can take seconds or even hours.
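Part of that overhead comes from sending items to the workers and sending results back. Pool.map accepts a chunksize argument that ships items in batches, which can reduce this cost for cheap functions. The sketch below is a minimal illustration. The square_all wrapper and the batch size of 10,000 are arbitrary assumptions, and whether batching helps depends on your machine and workload.

```python
from multiprocessing import Pool


def f(x):
    return x ** 2


def square_all(numbers, workers=4, chunksize=10_000):
    """Map f over numbers, sending work to the pool in batches of chunksize."""
    with Pool(workers) as pool:
        return pool.map(f, numbers, chunksize=chunksize)


if __name__ == "__main__":
    result = square_all(range(100_000))
    print(result[:5], len(result))
```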

Now let’s try a more expensive calculation, one that takes at least 2 seconds per call. Here we are using time.sleep as a proxy for the calculation.

from multiprocessing import Pool
import time
import plotly.express as px
import plotly
import pandas as pd

def f(x):
    time.sleep(2)
    return x**2


def runner(list_length):
    print(f"Size of List:{list_length}")
    t0 = time.time()
    result1 = [f(x) for x in list(range(list_length))]
    t1 = time.time()
    print(f"Without multiprocessing we ran the function in {t1 - t0:0.4f} seconds")
    time_without_multiprocessing = t1-t0
    t0 = time.time()
    pool = Pool(8)
    result2 = pool.map(f,list(range(list_length)))
    pool.close()
    t1 = time.time()
    print(f"With multiprocessing we ran the function in {t1 - t0:0.4f} seconds")
    time_with_multiprocessing = t1-t0
    return time_without_multiprocessing, time_with_multiprocessing

if __name__ ==  '__main__':
    times_taken = []
    for i in range(1,10):
        list_length = i
        time_without_multiprocessing, time_with_multiprocessing = runner(list_length)
        times_taken.append([list_length, 'No Multiproc', time_without_multiprocessing])
        times_taken.append([list_length, 'Multiproc', time_with_multiprocessing])

    timedf = pd.DataFrame(times_taken,columns = ['list_length', 'type','time_taken'])
    fig =  px.line(timedf,x = 'list_length',y='time_taken',color="type")
    plotly.offline.plot(fig, filename="comparison_bw_multiproc.html")

[Figure: runtime versus list length, with and without multiprocessing]

As you can see, the difference in this case is much more noticeable. This function takes much longer without multiprocessing than with it. Again, this makes perfect sense. This is because when you start multiprocessing, eight workers start working in parallel instead of running the tasks in sequence, and each task takes two seconds.
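The arithmetic behind that gap can be sketched directly. Assuming every task takes the same fixed time and ignoring process startup overhead, n tasks on w workers finish in ceil(n / w) rounds. The function names below are illustrative.

```python
import math


def sequential_seconds(n_tasks, task_seconds):
    """Idealized wall time when tasks run one after another."""
    return n_tasks * task_seconds


def parallel_seconds(n_tasks, n_workers, task_seconds):
    """Idealized wall time when tasks run in rounds of n_workers at a time."""
    rounds = math.ceil(n_tasks / n_workers)
    return rounds * task_seconds


# Compare both estimates for a few list lengths, with 8 workers and 2-second tasks.
for n in (1, 8, 9):
    print(n, sequential_seconds(n, 2), parallel_seconds(n, 8, 2))
```

With eight workers and 2-second tasks, up to eight tasks ideally take about 2 seconds in total, while a ninth task forces a second round and about 4 seconds. The sequential version takes 2 seconds per task throughout. Measured times will be somewhat higher because of pool startup.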

Other articles by Rahul Agarwal: 5 Python Dunder methods you should know

Multiprocessing with multiple parameter functions

A natural extension of the code above is a function that takes multiple parameters. For example, you may need to tune a particular model over several hyperparameters. You can do something like this:

import random
import time

def model_runner(n_estimators, max_depth):
    # Some code that runs and fits our model here using the
    # hyperparams in the argument.
    # Proxy for this code with sleep.
    time.sleep(random.choice([1, 2, 3]))
    # Return some model evaluation score
    return random.choice([1, 2, 3])

How do you execute such a function? This can be done in two ways.

Using Pool.map and * magic

def multi_run_wrapper(args):
   return model_runner(*args)
pool = Pool(4)
hyperparams = [[100,4],[150,5],[200,6],[300,4]]
results = pool.map(multi_run_wrapper,hyperparams)
pool.close()

This code uses *args to unpack each list of hyperparameters into the arguments of model_runner.

Using pool.starmap

Starting with Python 3.3, you can do this even more easily using the starmap method.

pool = Pool(4)
hyperparams = [[100,4],[150,5],[200,6],[300,4]]
results = pool.starmap(model_runner,hyperparams)
pool.close()
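When the hyperparameter list is a full grid, it can be generated rather than typed out. Here is a minimal sketch that combines itertools.product with starmap. The toy_score function is a hypothetical, deterministic stand-in for model_runner so that the run is reproducible, and the grid values are arbitrary.

```python
from itertools import product
from multiprocessing import Pool


def toy_score(n_estimators, max_depth):
    """Hypothetical stand-in for fitting a model and returning its score."""
    return n_estimators // 100 - abs(max_depth - 5)


def build_grid(n_estimators_values, max_depth_values):
    """Every (n_estimators, max_depth) combination, as a list of tuples."""
    return list(product(n_estimators_values, max_depth_values))


if __name__ == "__main__":
    grid = build_grid([100, 200, 300], [4, 5, 6])
    with Pool(4) as pool:
        # starmap unpacks each tuple into toy_score(n_estimators, max_depth).
        scores = pool.starmap(toy_score, grid)
    # Pair each score with its parameters and pick the best one.
    best_score, best_params = max(zip(scores, grid))
    print(best_params, best_score)
```

Because starmap returns results in the same order as its input, zipping the scores back onto the grid keeps each score attached to the parameters that produced it.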

Using Joblib with a single-parameter function

Joblib is another library that provides a simple helper class for writing parallel for loops using multiprocessing. I find it much easier to use than the multiprocessing module. Running a parallel process is as easy as writing a single line with Parallel and delayed.

from joblib import Parallel, delayed
import time
def f(x):
    time.sleep(2)
    return x**2
results = Parallel(n_jobs=8)(delayed(f)(i) for i in range(10))

Let’s compare Joblib’s Parallel with the multiprocessing module using the same function we used earlier.

from multiprocessing import Pool
import time
import plotly.express as px
import plotly
import pandas as pd
from joblib import Parallel, delayed

def f(x):
    time.sleep(2)
    return x**2


def runner(list_length):
    print(f"Size of List:{list_length}")
    t0 = time.time()
    result1 = Parallel(n_jobs=8)(delayed(f)(i) for i in range(list_length))
    t1 = time.time()
    print(f"With joblib we ran the function in {t1 - t0:0.4f} seconds")
    time_without_multiprocessing = t1-t0
    t0 = time.time()
    pool = Pool(8)
    result2 = pool.map(f,list(range(list_length)))
    pool.close()
    t1 = time.time()
    print(f"With multiprocessing we ran the function in {t1 - t0:0.4f} seconds")
    time_with_multiprocessing = t1-t0
    return time_without_multiprocessing, time_with_multiprocessing

if __name__ ==  '__main__':
    times_taken = []
    for i in range(1,16):
        list_length = i
        # runner returns the Joblib time first, then the multiprocessing time.
        time_with_joblib, time_with_multiprocessing = runner(list_length)
        times_taken.append([list_length, 'Joblib', time_with_joblib])
        times_taken.append([list_length, 'Multiproc', time_with_multiprocessing])

    timedf = pd.DataFrame(times_taken,columns = ['list_length', 'type','time_taken'])
    fig =  px.line(timedf,x = 'list_length',y='time_taken',color="type")
    plotly.offline.plot(fig, filename="comparison_bw_multiproc.html")

[Figure: runtime versus list length, Joblib compared with multiprocessing]

You can see that the runtimes are about the same. Even better, the Joblib code looks much simpler than the multiprocessing approach.

Use Joblib with multiple parameter functions

With Joblib, calling a function that takes multiple arguments is as easy as passing those arguments in the delayed call. This is a minimal example.

from joblib import Parallel, delayed
import time
def f(x,y):
    time.sleep(2)
    return x**2 + y**2
params = [[x,x] for x in range(10)]
results = Parallel(n_jobs=8)(delayed(f)(x,y) for x,y in params)

Save time with multiprocessing

Multiprocessing is a valuable technique that every data scientist should at least know about. It won’t solve all your problems, and you should still work on optimizing your functions. Still, having it in your toolkit can save you a lot of time spent waiting for code to finish or staring at the screen, when you could already have the results and be presenting them to the business.

Also, if you want to learn more about Python 3, there is a wonderful course from the University of Michigan. Do check it out.


