Dask¶

Objectives

Understand Dask’s role as a parallel/distributed computing framework for large datasets
Master chunking and partitioning concepts for memory-efficient data processing
Work with Dask Arrays, DataFrames, and Bags for different data structures
Implement lazy evaluation using .compute() to trigger actual computations
Use the delayed() function to build task graphs for custom workflows
Create Dask objects from existing data sources and read multiple files efficiently
Apply indexing and slicing operations on Dask collections
Leverage built-in methods (mean, unique, etc.) that mirror NumPy and Pandas APIs

Instructor note

Start with the chunking concept using small examples before scaling to larger arrays
Emphasize that Dask’s performance advantage appears with larger-than-memory datasets, not small ones
Show the difference between lazy (task graph created) vs eager (.compute() called) execution
Highlight how .delayed() enables building custom parallel workflows beyond standard operations
Demonstrate pattern matching with glob for reading multiple CSV files as a single dataset
Explain the importance of knowing partitions/divisions for efficient indexing operations
Have students observe overhead costs by timing operations with and without .compute()
Connect Dask concepts to real-world scenarios (ETL, big data analysis, distributed computing)

Dask is an open-source Python library that enables parallel and distributed computing, making it easier to scale data analysis from a single machine to large cloud-based clusters. It integrates seamlessly with the existing Python and PyData ecosystem, offering familiar APIs that mirror those of popular libraries such as Pandas, NumPy, and scikit-learn.

Designed for scalability, Dask can utilize all the resources of your local machine or expand to distributed systems in the cloud.

Dask primarily provides three core interfaces for working with data: Dask DataFrames, Dask Arrays, and Dask Bags.

Key Features

Scalability: Handles datasets larger than available memory dask.org
Performance: 50% faster than Apache Spark on standard benchmarks dask.org
Flexibility: Runs on laptops or scales to thousands of nodes

Basic Concepts of Dask¶

On a high-level, you can think of Dask as a wrapper that extends the capabilities of traditional tools like pandas, NumPy, and Spark to handle larger-than-memory datasets.

When faced with large objects like larger-than-memory arrays (vectors) or matrices (dataframes), Dask breaks them up into chunks, also called partitions.

For example, consider the array of 12 random numbers in both NumPy and Dask:

!pip install dask pandas pyarrow graphviz # fist intall dask and pandas using pip

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: dask in /home/gth/.local/lib/python3.11/site-packages (2025.4.1)
Requirement already satisfied: pandas in /home/gth/.local/lib/python3.11/site-packages (2.2.3)
Requirement already satisfied: pyarrow in /home/gth/.local/lib/python3.11/site-packages (20.0.0)
Collecting graphviz
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: click>=8.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (8.1.7)
Requirement already satisfied: cloudpickle>=3.0.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (3.0.0)
Requirement already satisfied: fsspec>=2021.09.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (2023.10.0)
Requirement already satisfied: packaging>=20.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (23.2)
Requirement already satisfied: partd>=1.4.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.4.2)
Requirement already satisfied: pyyaml>=5.3.1 in /apps/all/PyYAML/6.0.1-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.0.1)
Requirement already satisfied: toolz>=0.10.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.0.0)
Requirement already satisfied: importlib_metadata>=4.13.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.8.0)
Requirement already satisfied: numpy>=1.23.2 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2.2.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.7 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2025.2)
Requirement already satisfied: zipp>=0.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from importlib_metadata>=4.13.0->dask) (3.17.0)
Requirement already satisfied: locket in /home/gth/.local/lib/python3.11/site-packages (from partd>=1.4.0->dask) (1.0.0)
Requirement already satisfied: six>=1.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.3

[notice] A new release of pip is available: 24.3.1 -> 25.1
[notice] To update, run: pip install --upgrade pip

import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import numpy as np
import pandas as pd
import dask

import numpy as np
narr = np.random.rand(12)

narr

array([0.88535136, 0.2347785 , 0.10454781, 0.14247243, 0.1455551 ,
       0.70315651, 0.70865184, 0.62603594, 0.29620262, 0.26905289,
       0.71273231, 0.76479225])

The next code creates a Dask array darr from a NumPy array narr, splitting it into chunks of size 3. This allows efficient parallel computation and memory management.

darr = da.from_array(narr, chunks=3)
darr

	Array	Chunk
Bytes	96 B	24 B
Shape	(12,)	(3,)
Dask graph	4 chunks in 1 graph layer
Data type	float64 numpy.ndarray

The image above shows that the Dask array contains four chunks as we set chunks to 3. Under the hood, each chunk is a NumPy array in itself.

Now, let’s consider a much larger example. We will create two 10k by 100k arrays (1 billion elements) and perform element-wise multiplication in both libraries while measuring the performance:

# Create the NumPy arrays
arr1 = np.random.rand(10_000, 100_000)
arr2 = np.random.rand(10_000, 100_000)
# Create the Dask arrays
darr1 = da.from_array(arr1, chunks=(1_000, 10_000))
darr2 = da.from_array(arr2, chunks=(1_000, 10_000))

%%time
result_np = np.multiply(arr1, arr2)

CPU times: user 748 ms, sys: 649 ms, total: 1.4 s
Wall time: 1.4 s

%%time
result_dask = da.multiply(darr1, darr2)
result_dask = result_dask.compute()  # This triggers the actual computation

CPU times: user 3.93 s, sys: 2.53 s, total: 6.46 s
Wall time: 2.42 s

Dask shines for larger-than-memory arrays or truly parallel workloads, but for single-node, RAM-fit tasks, NumPy is often faster due to lower overhead.

Dask uses a similar approach of chunking and distributing these chunks across all available cores on your machine for other objects as well.

Integration Speedup Chart

This is a short overview of Dask

Dask DataFrames¶

Dask DataFrames are a powerful tool for processing large-scale tabular data by parallelizing pandas operations docs.dask.org. They allow you to work with datasets that are too large to fit into memory by breaking them down into smaller chunks and processing them in parallel.

import dask
from dask import delayed
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

%%time
# Sequential code
a=[1,2,3]
results = []
for x in a:
    y = inc(x)
    results.append(y)
    
total = sum(results)

CPU times: user 636 µs, sys: 912 µs, total: 1.55 ms
Wall time: 3 s

This block uses Dask’s delayed to create lazy tasks.

The computation (inc(x)) is not executed immediately; instead, Dask constructs a task graph.

Total time: The sleep(1) calls still happen sequentially (since we’re not triggering parallelism here yet), so the time should still be around 3 seconds.

%%time
# Sequential code

results = []
for x in a:
    y = delayed(inc)(x)
    results.append(y)
    
total = sum(results)

CPU times: user 0 ns, sys: 767 µs, total: 767 µs
Wall time: 731 µs

the first block runs sequentially, calling the inc(x) function for each element in the list a, resulting in a total execution time of about 3 seconds. In the second block, Dask’s delayed is used to create lazy tasks for the same function calls.

%%time
# Sequential code

results = []
for x in a:
    y = delayed(inc)(x)
    results.append(y)
    
total = delayed(sum)(results)

CPU times: user 424 μs, sys: 142 μs, total: 566 μs
Wall time: 582 μs

here we are using Dask’s delayed function to create a task graph for a sequence of computations (incrementing each value in the list a and summing the results). The tasks are not executed immediately but are delayed until explicitly triggered. However, since no compute() function is called, the computation is still done sequentially, and the tasks are just set up. To actually perform the computations in parallel, you would need to call total.compute().

The delayed function in Dask is used to defer the execution of a function, creating a lazy task instead of running the function immediately.

import pandas as pd
df=pd.read_csv("dataset/dask_dataset.csv")
df

	Timestamp	Dask APIs	Interactive or Batch?	Local machine or Cluster?	How often do you use Dask?	What Dask resources have you used for support in the last six months?	Which would help you most right now?	Is Dask stable enough for you?	What common feature requests do you care about most? [Better Numpy/Pandas support]	What common feature requests do you care about most? [Better Scikit-Learn/ML support]	...	What common feature requests do you care about most? [Ease of deployment]	What common feature requests do you care about most? [Cloud integration]	What common feature requests do you care about most? [Managing many users]	What common feature requests do you care about most? [GPUs]	Python 2 or 3?	If you use a cluster, how do you launch Dask?	Preferred Cloud?	What are some other libraries that you often use with Dask?	How easy is it for you to upgrade to newer versions of Python libraries	Do you use Dask as part of a larger group?
0	2019-06-25 06:59:49	Array;DataFrame	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation;Cluster with...	Occasionally	Documentation;Tutorial;Examples at examples.da...	Performance improvements	Yes	Somewhat useful	Somewhat useful	...	Somewhat useful	Not relevant for me	Somewhat useful	Somewhat useful	3.0	SSH;HPC resource manager (SLURM, PBS, SGE, LSF...	Amazon Web Services (AWS);Google Cloud Platfor...	NaN	4.0	My team or research group also use Dask
1	2019-06-25 07:01:12	DataFrame;ML	Interactive: I use Dask with Jupyter or IPyth...	Large workstation	Occasionally	Documentation;Stack Overflow #Dask tag	More examples in my field	Yes	Critical to me	Critical to me	...	Somewhat useful	Somewhat useful	Somewhat useful	Critical to me	3.0	NaN	Google Cloud Platform (GCP)	NaN	3.0	I use Dask mostly on my own
2	2019-06-25 07:02:25	DataFrame;Futures;Xarray	Interactive: I use Dask with Jupyter or IPyth...	Cluster of 2-10 machines	Occasionally	Documentation;Stack Overflow #Dask tag	Bug fixes	No	Critical to me	Somewhat useful	...	Critical to me	Critical to me	Critical to me	Somewhat useful	3.0	I don't know, someone else does this for me	Amazon Web Services (AWS)	Datashader	3.0	My team or research group also use Dask
3	2019-06-25 07:07:08	Bag;DataFrame;Delayed;Futures	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Cluster of 2-10 machines	I use Dask all the time, even when I sleep	Documentation;Tutorial;Examples at examples.da...	Performance improvements	Yes	Critical to me	Critical to me	...	Somewhat useful	Somewhat useful	Somewhat useful	Somewhat useful	3.0	SSH	Amazon Web Services (AWS);Microsoft Azure	NaN	4.0	My team or research group also use Dask
4	2019-06-25 07:11:32	Array;DataFrame;Delayed	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation	Just looking for now	Documentation;Stack Overflow #Dask tag;Tutoria...	More examples in my field	Yes	Somewhat useful	NaN	...	Somewhat useful	NaN	NaN	NaN	3.0	NaN	Microsoft Azure	OpenCV	4.0	I use Dask mostly on my own
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
254	2019-07-21 07:19:54	Delayed;Futures	Batch: I submit scripts that run in the future	Personal laptop	Occasionally	Documentation	More examples in my field	Yes	Somewhat useful	Not relevant for me	...	Somewhat useful	Somewhat useful	Not relevant for me	Not relevant for me	3.0	SSH	Google Cloud Platform (GCP)	Pandas	4.0	I use Dask mostly on my own
255	2019-07-22 08:23:11	DataFrame;Futures;Xarray	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation	Just looking for now	Documentation;Examples at examples.dask.org	More examples in my field	Yes	Critical to me	Somewhat useful	...	Somewhat useful	Somewhat useful	Somewhat useful	Critical to me	3.0	NaN	Microsoft Azure	NaN	3.0	My team or research group also use Dask
256	2019-07-23 08:22:35	Futures	Batch: I submit scripts that run in the future	Cluster with 10-100 machines	Every day	Documentation;GitHub issue trackers (reading p...	New features	Yes	Somewhat useful	Somewhat useful	...	Critical to me	Not relevant for me	Not relevant for me	Somewhat useful	3.0	SSH	NaN	Pybullet, LMDB, pytorch.	4.0	My team or research group also use Dask
257	2019-07-23 09:08:42	Delayed;Futures	Batch: I submit scripts that run in the future	Cluster with 100+ machines	Every day	Documentation;GitHub issue trackers (reading p...	Bug fixes	No	Somewhat useful	Not relevant for me	...	Somewhat useful	Not relevant for me	Not relevant for me	Critical to me	3.0	SSH	We have our own cluster for the lab	torch, lmdb	3.0	My team or research group also use Dask
258	2019-07-23 14:18:22	Array;DataFrame;ML	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation	Occasionally	Documentation;Stack Overflow #Dask tag	More examples in my field	Yes	Somewhat useful	Critical to me	...	Critical to me	Not relevant for me	Not relevant for me	Critical to me	3.0	NaN	NaN	NaN	3.0	My team or research group also use Dask

259 rows × 24 columns

ddf = dd.read_csv("dataset/dask_dataset.csv") 
ddf

Dask DataFrame Structure:

	Timestamp	Dask APIs	Interactive or Batch?	Local machine or Cluster?	How often do you use Dask?	What Dask resources have you used for support in the last six months?	Which would help you most right now?	Is Dask stable enough for you?	What common feature requests do you care about most? [Better Numpy/Pandas support]	What common feature requests do you care about most? [Better Scikit-Learn/ML support]	What common feature requests do you care about most? [Integrate with Deep Learning Frameworks]	What common feature requests do you care about most? [Support for new libraries in my field]	What common feature requests do you care about most? [Improve Scaling ]	What common feature requests do you care about most? [Dashboard / Diagnostics]	What common feature requests do you care about most? [Ease of deployment]	What common feature requests do you care about most? [Cloud integration]	What common feature requests do you care about most? [Managing many users]	What common feature requests do you care about most? [GPUs]	Python 2 or 3?	If you use a cluster, how do you launch Dask?	Preferred Cloud?	What are some other libraries that you often use with Dask?	How easy is it for you to upgrade to newer versions of Python libraries	Do you use Dask as part of a larger group?
npartitions=1
	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	float64	string	string	string	float64	string
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: to_string_dtype, 2 expressions

ddf.dtypes 

Timestamp                                                                                          string[pyarrow]
Dask APIs                                                                                          string[pyarrow]
Interactive or Batch?                                                                              string[pyarrow]
Local machine or Cluster?                                                                          string[pyarrow]
How often do you use Dask?                                                                         string[pyarrow]
What Dask resources have you used for support in the last six months?                              string[pyarrow]
Which would help you most right now?                                                               string[pyarrow]
Is Dask stable enough for you?                                                                     string[pyarrow]
What common feature requests do you care about most?  [Better Numpy/Pandas support]                string[pyarrow]
What common feature requests do you care about most?  [Better Scikit-Learn/ML support]             string[pyarrow]
What common feature requests do you care about most?  [Integrate with Deep Learning Frameworks]    string[pyarrow]
What common feature requests do you care about most?  [Support for new libraries in my field]      string[pyarrow]
What common feature requests do you care about most?  [Improve Scaling ]                           string[pyarrow]
What common feature requests do you care about most?  [Dashboard / Diagnostics]                    string[pyarrow]
What common feature requests do you care about most?  [Ease of deployment]                         string[pyarrow]
What common feature requests do you care about most?  [Cloud integration]                          string[pyarrow]
What common feature requests do you care about most?  [Managing many users]                        string[pyarrow]
What common feature requests do you care about most?  [GPUs]                                       string[pyarrow]
Python 2 or 3?                                                                                               int64
If you use a cluster, how do you launch Dask?                                                      string[pyarrow]
Preferred Cloud?                                                                                   string[pyarrow]
What are some other libraries that you often use with Dask?                                        string[pyarrow]
How easy is it for you to upgrade to newer versions of Python libraries                                    float64
Do you use Dask as part of a larger group?                                                         string[pyarrow]
dtype: object

we created Dask DataFrame using dask.dataframe.read_csv. Dask DataFrames are lazy, meaning they don’t load the data into memory immediately. Instead, they create a task graph for operations that will be executed when triggered. To view or manipulate the data, you would need to call methods like .head() or .compute() to perform the actual computation and load the data into memory or run operations.

ddf.head()

	Timestamp	Dask APIs	Interactive or Batch?	Local machine or Cluster?	How often do you use Dask?	What Dask resources have you used for support in the last six months?	Which would help you most right now?	Is Dask stable enough for you?	What common feature requests do you care about most? [Better Numpy/Pandas support]	What common feature requests do you care about most? [Better Scikit-Learn/ML support]	...	What common feature requests do you care about most? [Ease of deployment]	What common feature requests do you care about most? [Cloud integration]	What common feature requests do you care about most? [Managing many users]	What common feature requests do you care about most? [GPUs]	Python 2 or 3?	If you use a cluster, how do you launch Dask?	Preferred Cloud?	What are some other libraries that you often use with Dask?	How easy is it for you to upgrade to newer versions of Python libraries	Do you use Dask as part of a larger group?
0	2019-06-25 06:59:49	Array;DataFrame	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation;Cluster with...	Occasionally	Documentation;Tutorial;Examples at examples.da...	Performance improvements	Yes	Somewhat useful	Somewhat useful	...	Somewhat useful	Not relevant for me	Somewhat useful	Somewhat useful	3.0	SSH;HPC resource manager (SLURM, PBS, SGE, LSF...	Amazon Web Services (AWS);Google Cloud Platfor...	<NA>	4.0	My team or research group also use Dask
1	2019-06-25 07:01:12	DataFrame;ML	Interactive: I use Dask with Jupyter or IPyth...	Large workstation	Occasionally	Documentation;Stack Overflow #Dask tag	More examples in my field	Yes	Critical to me	Critical to me	...	Somewhat useful	Somewhat useful	Somewhat useful	Critical to me	3.0	<NA>	Google Cloud Platform (GCP)	<NA>	3.0	I use Dask mostly on my own
2	2019-06-25 07:02:25	DataFrame;Futures;Xarray	Interactive: I use Dask with Jupyter or IPyth...	Cluster of 2-10 machines	Occasionally	Documentation;Stack Overflow #Dask tag	Bug fixes	No	Critical to me	Somewhat useful	...	Critical to me	Critical to me	Critical to me	Somewhat useful	3.0	I don't know, someone else does this for me	Amazon Web Services (AWS)	Datashader	3.0	My team or research group also use Dask
3	2019-06-25 07:07:08	Bag;DataFrame;Delayed;Futures	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Cluster of 2-10 machines	I use Dask all the time, even when I sleep	Documentation;Tutorial;Examples at examples.da...	Performance improvements	Yes	Critical to me	Critical to me	...	Somewhat useful	Somewhat useful	Somewhat useful	Somewhat useful	3.0	SSH	Amazon Web Services (AWS);Microsoft Azure	<NA>	4.0	My team or research group also use Dask
4	2019-06-25 07:11:32	Array;DataFrame;Delayed	Interactive: I use Dask with Jupyter or IPyth...	Personal laptop;Large workstation	Just looking for now	Documentation;Stack Overflow #Dask tag;Tutoria...	More examples in my field	Yes	Somewhat useful	<NA>	...	Somewhat useful	<NA>	<NA>	<NA>	3.0	<NA>	Microsoft Azure	OpenCV	4.0	I use Dask mostly on my own

5 rows × 24 columns

/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
  df = reader(bio, **kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  df = reader(bio, **kwargs)

Mulitple Files¶

Dask provides powerful pattern matching capabilities for reading multiple files simultaneously, allowing you to work with collections of files as if they were a single dataset.

from glob import glob
import os
filepath = glob("dataset/*.csv")
ddf = dd.read_csv(os.path.join('dataset','*.csv'),
                 parse_dates={'comp': [0, 1]}, # Here we parse the year, month and day into date
                 dtype={'Interactive or Batch?': str});

/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
  head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)

ddf

Dask DataFrame Structure:

	comp	Interactive or Batch?	Local machine or Cluster?	How often do you use Dask?	What Dask resources have you used for support in the last six months?	Which would help you most right now?	Is Dask stable enough for you?	What common feature requests do you care about most? [Better Numpy/Pandas support]	What common feature requests do you care about most? [Better Scikit-Learn/ML support]	What common feature requests do you care about most? [Integrate with Deep Learning Frameworks]	What common feature requests do you care about most? [Support for new libraries in my field]	What common feature requests do you care about most? [Improve Scaling ]	What common feature requests do you care about most? [Dashboard / Diagnostics]	What common feature requests do you care about most? [Ease of deployment]	What common feature requests do you care about most? [Cloud integration]	What common feature requests do you care about most? [Managing many users]	What common feature requests do you care about most? [GPUs]	Python 2 or 3?	If you use a cluster, how do you launch Dask?	Preferred Cloud?	What are some other libraries that you often use with Dask?	How easy is it for you to upgrade to newer versions of Python libraries	Do you use Dask as part of a larger group?
npartitions=2
	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	string	float64	string	string	string	float64	string
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: to_string_dtype, 2 expressions

Creating a Dask Object¶

You can create a Dask object from scratch by supplying existing data and optionally including information about how the chunks should be structured.

array¶

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a

	Array	Chunk
Bytes	781.25 kiB	78.12 kiB
Shape	(200, 500)	(100, 100)
Dask graph	10 chunks in 1 graph layer
Data type	int64 numpy.ndarray

The Dask array a is a lazy, chunked version of the NumPy array, meaning operations on it will be performed in smaller blocks and computed in parallel only when explicitly triggered (e.g., using .compute()). This is especially useful for large datasets that don’t fit into memory all at once.

Dataframe¶

This code creates a time-based Pandas DataFrame with 2,400 rows and converts it into a Dask DataFrame split into 10 partitions. This enables parallel and memory-efficient processing of the data.

index = pd.date_range("2021-09-01", periods=2400, freq="1h")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
ddf3 = dd.from_pandas(df, npartitions=10)
ddf3

Dask DataFrame Structure:

	a	b
npartitions=10
2021-09-01 00:00:00	int64	string
2021-09-11 00:00:00	...	...
...	...	...
2021-11-30 00:00:00	...	...
2021-12-09 23:00:00	...	...

Dask Name: frompandas, 1 expression

Now we have a Dask DataFrame with 2 columns and 2400 rows composed of 10 partitions where each partition has 240 rows. Each partition represents a piece of the data.

ddf3.divisions

(Timestamp('2021-09-01 00:00:00'),
 Timestamp('2021-09-11 00:00:00'),
 Timestamp('2021-09-21 00:00:00'),
 Timestamp('2021-10-01 00:00:00'),
 Timestamp('2021-10-11 00:00:00'),
 Timestamp('2021-10-21 00:00:00'),
 Timestamp('2021-10-31 00:00:00'),
 Timestamp('2021-11-10 00:00:00'),
 Timestamp('2021-11-20 00:00:00'),
 Timestamp('2021-11-30 00:00:00'),
 Timestamp('2021-12-09 23:00:00'))

access the parition number

ddf3.partitions[3].head()

	a	b
2021-10-01 00:00:00	720	a
2021-10-01 01:00:00	721	b
2021-10-01 02:00:00	722	c
2021-10-01 03:00:00	723	a
2021-10-01 04:00:00	724	d

Indexing¶

Dask provides several powerful methods for indexing large-scale data efficiently across distributed partitions. Understanding how to effectively use these indexing methods is crucial for working with big data in Dask.

Indexing Dask collections feels just like slicing NumPy arrays or pandas DataFrame.

array¶

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a[:50, 200]

	Array	Chunk
Bytes	400 B	400 B
Shape	(50,)	(50,)
Dask graph	1 chunks in 2 graph layers
Data type	int64 numpy.ndarray

Dataframe¶

Dask DataFrames support column-based indexing (e.g., ddf[‘col’]), but row-based indexing like .loc or .iloc requires a sorted, known index set using set_index().

ddf3.b

Dask Series Structure:
npartitions=10
2021-09-01 00:00:00    string
2021-09-11 00:00:00       ...
                        ...  
2021-11-30 00:00:00       ...
2021-12-09 23:00:00       ...
Dask Name: getitem, 2 expressions
Expr=df['b']

ddf3["2021-10-01": "2021-10-09 5:00"]

Dask DataFrame Structure:

	a	b
npartitions=1
2021-10-01 00:00:00.000000000	int64	string
2021-10-09 05:00:59.999999999	...	...

Dask Name: loc, 2 expressions

Computation¶

Dask is lazily evaluated. The result from a computation isn’t computed until you ask for it. Instead, a Dask task graph for the computation is produced.

Anytime you have a Dask object and you want to get the result, call compute:

array¶

a[:50, 200].compute()

array([  200,   700,  1200,  1700,  2200,  2700,  3200,  3700,  4200,
        4700,  5200,  5700,  6200,  6700,  7200,  7700,  8200,  8700,
        9200,  9700, 10200, 10700, 11200, 11700, 12200, 12700, 13200,
       13700, 14200, 14700, 15200, 15700, 16200, 16700, 17200, 17700,
       18200, 18700, 19200, 19700, 20200, 20700, 21200, 21700, 22200,
       22700, 23200, 23700, 24200, 24700])

Dataframe¶

ddf3["2021-10-01": "2021-10-09 5:00"].compute()

	a	b
2021-10-01 00:00:00	720	a
2021-10-01 01:00:00	721	b
2021-10-01 02:00:00	722	c
2021-10-01 03:00:00	723	a
2021-10-01 04:00:00	724	d
...	...	...
2021-10-09 01:00:00	913	b
2021-10-09 02:00:00	914	c
2021-10-09 03:00:00	915	a
2021-10-09 04:00:00	916	d
2021-10-09 05:00:00	917	d

198 rows × 2 columns

.compute() triggers the actual execution of a Dask computation and returns the final result as a regular in-memory object (like a Pandas DataFrame or Series).

Methods¶

Dask collections match existing numpy and pandas methods, so they should feel familiar. Call the method to set up the task graph, and then call compute to get the result.

array¶

a.mean()

a.mean().compute()

np.sin(a)

np.sin(a).compute()

a.T

a.T.compute()

array([[    0,   500,  1000, ..., 98500, 99000, 99500],
       [    1,   501,  1001, ..., 98501, 99001, 99501],
       [    2,   502,  1002, ..., 98502, 99002, 99502],
       ...,
       [  497,   997,  1497, ..., 98997, 99497, 99997],
       [  498,   998,  1498, ..., 98998, 99498, 99998],
       [  499,   999,  1499, ..., 98999, 99499, 99999]], shape=(500, 200))

Dataframe¶

ddf3.a.mean()

ddf3.a.mean().compute()

ddf3.b.unique()

ddf3.b.unique().compute()

  e
  c
  d
  a
  b
Name: b, dtype: string

exercice¶

For Dask arrays, calculate the mean along axis=1 of the sum of the x array and its transpose.

import dask.array as da
import numpy as np
import dask

xd = da.random.normal( 10 , 0.1 , size=( 30_000 , 30_000 ), chunks=( 300 , 300 ))

creates a Dask array of random values with a normal distribution, mean of 10, and standard deviation of 0.1, of shape (30,000, 30,000). The array is chunked into blocks of 300×300 elements for parallel processing.

Solution¶

x_sum = xd + xd.T
res = x_sum.mean(axis= 1 )
res.compute()

array([19.99894119, 19.99777179, 19.99950783, ..., 20.00016687,
       19.99939267, 19.99915841], shape=(30000,))

Conclusion¶

Keypoints

Dask breaks large datasets into chunks (partitions) for parallel processing on available cores
Lazy evaluation means operations are recorded in a task graph until .compute() is called
Dask APIs mirror Pandas, NumPy, and scikit-learn for familiar syntax and easier adoption
Multiple files can be loaded simultaneously using glob patterns and wildcard matching
.delayed() defers function execution to build complex computation graphs
Row-based indexing on DataFrames requires .set_index() for efficient operations
Column-based operations on DataFrames are more efficient than row-based indexing
Methods like .mean(), .unique() work on Dask objects and only execute when .compute() is called