Dask

Dask is an open-source Python library that enables parallel and distributed computing, making it easier to scale data analysis from a single machine to large cloud-based clusters. It integrates seamlessly with the existing Python and PyData ecosystem, offering familiar APIs that mirror those of popular libraries such as Pandas, NumPy, and scikit-learn.

Designed for scalability, Dask can utilize all the resources of your local machine or expand to distributed systems in the cloud.

Dask primarily provides three core interfaces for working with data: Dask DataFrames, Dask Arrays, and Dask Bags.

Key Features

  • Scalability: Handles datasets larger than available memory dask.org

  • Performance: 50% faster than Apache Spark on standard benchmarks dask.org

  • Flexibility: Runs on laptops or scales to thousands of nodes

Basic Concepts of Dask

On a high-level, you can think of Dask as a wrapper that extends the capabilities of traditional tools like pandas, NumPy, and Spark to handle larger-than-memory datasets.

When faced with large objects like larger-than-memory arrays (vectors) or matrices (dataframes), Dask breaks them up into chunks, also called partitions.

For example, consider the array of 12 random numbers in both NumPy and Dask:

!pip install dask pandas pyarrow graphviz # fist intall dask and pandas using pip
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: dask in /home/gth/.local/lib/python3.11/site-packages (2025.4.1)
Requirement already satisfied: pandas in /home/gth/.local/lib/python3.11/site-packages (2.2.3)
Requirement already satisfied: pyarrow in /home/gth/.local/lib/python3.11/site-packages (20.0.0)
Collecting graphviz
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: click>=8.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (8.1.7)
Requirement already satisfied: cloudpickle>=3.0.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (3.0.0)
Requirement already satisfied: fsspec>=2021.09.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (2023.10.0)
Requirement already satisfied: packaging>=20.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (23.2)
Requirement already satisfied: partd>=1.4.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.4.2)
Requirement already satisfied: pyyaml>=5.3.1 in /apps/all/PyYAML/6.0.1-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.0.1)
Requirement already satisfied: toolz>=0.10.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.0.0)
Requirement already satisfied: importlib_metadata>=4.13.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.8.0)
Requirement already satisfied: numpy>=1.23.2 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2.2.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.7 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2025.2)
Requirement already satisfied: zipp>=0.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from importlib_metadata>=4.13.0->dask) (3.17.0)
Requirement already satisfied: locket in /home/gth/.local/lib/python3.11/site-packages (from partd>=1.4.0->dask) (1.0.0)
Requirement already satisfied: six>=1.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.3

[notice] A new release of pip is available: 24.3.1 -> 25.1
[notice] To update, run: pip install --upgrade pip
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import numpy as np
import pandas as pd
import dask
import numpy as np
narr = np.random.rand(12)

narr
array([0.88535136, 0.2347785 , 0.10454781, 0.14247243, 0.1455551 ,
       0.70315651, 0.70865184, 0.62603594, 0.29620262, 0.26905289,
       0.71273231, 0.76479225])

The next code creates a Dask array darr from a NumPy array narr, splitting it into chunks of size 3. This allows efficient parallel computation and memory management.

darr = da.from_array(narr, chunks=3)
darr
Array Chunk
Bytes 96 B 24 B
Shape (12,) (3,)
Dask graph 4 chunks in 1 graph layer
Data type float64 numpy.ndarray
12 1

The image above shows that the Dask array contains four chunks as we set chunks to 3. Under the hood, each chunk is a NumPy array in itself.

Now, let’s consider a much larger example. We will create two 10k by 100k arrays (1 billion elements) and perform element-wise multiplication in both libraries while measuring the performance:

# Create the NumPy arrays
arr1 = np.random.rand(10_000, 100_000)
arr2 = np.random.rand(10_000, 100_000)
# Create the Dask arrays
darr1 = da.from_array(arr1, chunks=(1_000, 10_000))
darr2 = da.from_array(arr2, chunks=(1_000, 10_000))
%%time
result_np = np.multiply(arr1, arr2)
CPU times: user 748 ms, sys: 649 ms, total: 1.4 s
Wall time: 1.4 s
%%time
result_dask = da.multiply(darr1, darr2)
result_dask = result_dask.compute()  # This triggers the actual computation
CPU times: user 3.93 s, sys: 2.53 s, total: 6.46 s
Wall time: 2.42 s

Dask shines for larger-than-memory arrays or truly parallel workloads, but for single-node, RAM-fit tasks, NumPy is often faster due to lower overhead.

Dask uses a similar approach of chunking and distributing these chunks across all available cores on your machine for other objects as well.

Integration Speedup Chart

This is a short overview of Dask

Dask DataFrames

Dask DataFrames are a powerful tool for processing large-scale tabular data by parallelizing pandas operations docs.dask.org. They allow you to work with datasets that are too large to fit into memory by breaking them down into smaller chunks and processing them in parallel.

import dask
from dask import delayed
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y
%%time
# Sequential code
a=[1,2,3]
results = []
for x in a:
    y = inc(x)
    results.append(y)
    
total = sum(results)
CPU times: user 636 µs, sys: 912 µs, total: 1.55 ms
Wall time: 3 s

This block uses Dask’s delayed to create lazy tasks.

The computation (inc(x)) is not executed immediately; instead, Dask constructs a task graph.

Total time: The sleep(1) calls still happen sequentially (since we’re not triggering parallelism here yet), so the time should still be around 3 seconds.

%%time
# Sequential code

results = []
for x in a:
    y = delayed(inc)(x)
    results.append(y)
    
total = sum(results)
CPU times: user 0 ns, sys: 767 µs, total: 767 µs
Wall time: 731 µs

the first block runs sequentially, calling the inc(x) function for each element in the list a, resulting in a total execution time of about 3 seconds. In the second block, Dask’s delayed is used to create lazy tasks for the same function calls.

%%time
# Sequential code

results = []
for x in a:
    y = delayed(inc)(x)
    results.append(y)
    
total = delayed(sum)(results)
CPU times: user 424 μs, sys: 142 μs, total: 566 μs
Wall time: 582 μs

here we are using Dask’s delayed function to create a task graph for a sequence of computations (incrementing each value in the list a and summing the results). The tasks are not executed immediately but are delayed until explicitly triggered. However, since no compute() function is called, the computation is still done sequentially, and the tasks are just set up. To actually perform the computations in parallel, you would need to call total.compute().

The delayed function in Dask is used to defer the execution of a function, creating a lazy task instead of running the function immediately.

import pandas as pd
df=pd.read_csv("dataset/dask_dataset.csv")
df
Timestamp Dask APIs Interactive or Batch? Local machine or Cluster? How often do you use Dask? What Dask resources have you used for support in the last six months? Which would help you most right now? Is Dask stable enough for you? What common feature requests do you care about most? [Better Numpy/Pandas support] What common feature requests do you care about most? [Better Scikit-Learn/ML support] ... What common feature requests do you care about most? [Ease of deployment] What common feature requests do you care about most? [Cloud integration] What common feature requests do you care about most? [Managing many users] What common feature requests do you care about most? [GPUs] Python 2 or 3? If you use a cluster, how do you launch Dask? Preferred Cloud? What are some other libraries that you often use with Dask? How easy is it for you to upgrade to newer versions of Python libraries Do you use Dask as part of a larger group?
0 2019-06-25 06:59:49 Array;DataFrame Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation;Cluster with... Occasionally Documentation;Tutorial;Examples at examples.da... Performance improvements Yes Somewhat useful Somewhat useful ... Somewhat useful Not relevant for me Somewhat useful Somewhat useful 3.0 SSH;HPC resource manager (SLURM, PBS, SGE, LSF... Amazon Web Services (AWS);Google Cloud Platfor... NaN 4.0 My team or research group also use Dask
1 2019-06-25 07:01:12 DataFrame;ML Interactive: I use Dask with Jupyter or IPyth... Large workstation Occasionally Documentation;Stack Overflow #Dask tag More examples in my field Yes Critical to me Critical to me ... Somewhat useful Somewhat useful Somewhat useful Critical to me 3.0 NaN Google Cloud Platform (GCP) NaN 3.0 I use Dask mostly on my own
2 2019-06-25 07:02:25 DataFrame;Futures;Xarray Interactive: I use Dask with Jupyter or IPyth... Cluster of 2-10 machines Occasionally Documentation;Stack Overflow #Dask tag Bug fixes No Critical to me Somewhat useful ... Critical to me Critical to me Critical to me Somewhat useful 3.0 I don't know, someone else does this for me Amazon Web Services (AWS) Datashader 3.0 My team or research group also use Dask
3 2019-06-25 07:07:08 Bag;DataFrame;Delayed;Futures Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Cluster of 2-10 machines I use Dask all the time, even when I sleep Documentation;Tutorial;Examples at examples.da... Performance improvements Yes Critical to me Critical to me ... Somewhat useful Somewhat useful Somewhat useful Somewhat useful 3.0 SSH Amazon Web Services (AWS);Microsoft Azure NaN 4.0 My team or research group also use Dask
4 2019-06-25 07:11:32 Array;DataFrame;Delayed Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation Just looking for now Documentation;Stack Overflow #Dask tag;Tutoria... More examples in my field Yes Somewhat useful NaN ... Somewhat useful NaN NaN NaN 3.0 NaN Microsoft Azure OpenCV 4.0 I use Dask mostly on my own
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
254 2019-07-21 07:19:54 Delayed;Futures Batch: I submit scripts that run in the future Personal laptop Occasionally Documentation More examples in my field Yes Somewhat useful Not relevant for me ... Somewhat useful Somewhat useful Not relevant for me Not relevant for me 3.0 SSH Google Cloud Platform (GCP) Pandas 4.0 I use Dask mostly on my own
255 2019-07-22 08:23:11 DataFrame;Futures;Xarray Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation Just looking for now Documentation;Examples at examples.dask.org More examples in my field Yes Critical to me Somewhat useful ... Somewhat useful Somewhat useful Somewhat useful Critical to me 3.0 NaN Microsoft Azure NaN 3.0 My team or research group also use Dask
256 2019-07-23 08:22:35 Futures Batch: I submit scripts that run in the future Cluster with 10-100 machines Every day Documentation;GitHub issue trackers (reading p... New features Yes Somewhat useful Somewhat useful ... Critical to me Not relevant for me Not relevant for me Somewhat useful 3.0 SSH NaN Pybullet, LMDB, pytorch. 4.0 My team or research group also use Dask
257 2019-07-23 09:08:42 Delayed;Futures Batch: I submit scripts that run in the future Cluster with 100+ machines Every day Documentation;GitHub issue trackers (reading p... Bug fixes No Somewhat useful Not relevant for me ... Somewhat useful Not relevant for me Not relevant for me Critical to me 3.0 SSH We have our own cluster for the lab torch, lmdb 3.0 My team or research group also use Dask
258 2019-07-23 14:18:22 Array;DataFrame;ML Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation Occasionally Documentation;Stack Overflow #Dask tag More examples in my field Yes Somewhat useful Critical to me ... Critical to me Not relevant for me Not relevant for me Critical to me 3.0 NaN NaN NaN 3.0 My team or research group also use Dask

259 rows × 24 columns

ddf = dd.read_csv("dataset/dask_dataset.csv") 
ddf
Dask DataFrame Structure:
Timestamp Dask APIs Interactive or Batch? Local machine or Cluster? How often do you use Dask? What Dask resources have you used for support in the last six months? Which would help you most right now? Is Dask stable enough for you? What common feature requests do you care about most? [Better Numpy/Pandas support] What common feature requests do you care about most? [Better Scikit-Learn/ML support] What common feature requests do you care about most? [Integrate with Deep Learning Frameworks] What common feature requests do you care about most? [Support for new libraries in my field] What common feature requests do you care about most? [Improve Scaling ] What common feature requests do you care about most? [Dashboard / Diagnostics] What common feature requests do you care about most? [Ease of deployment] What common feature requests do you care about most? [Cloud integration] What common feature requests do you care about most? [Managing many users] What common feature requests do you care about most? [GPUs] Python 2 or 3? If you use a cluster, how do you launch Dask? Preferred Cloud? What are some other libraries that you often use with Dask? How easy is it for you to upgrade to newer versions of Python libraries Do you use Dask as part of a larger group?
npartitions=1
string string string string string string string string string string string string string string string string string string float64 string string string float64 string
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
ddf.dtypes 
Timestamp                                                                                          string[pyarrow]
Dask APIs                                                                                          string[pyarrow]
Interactive or Batch?                                                                              string[pyarrow]
Local machine or Cluster?                                                                          string[pyarrow]
How often do you use Dask?                                                                         string[pyarrow]
What Dask resources have you used for support in the last six months?                              string[pyarrow]
Which would help you most right now?                                                               string[pyarrow]
Is Dask stable enough for you?                                                                     string[pyarrow]
What common feature requests do you care about most?  [Better Numpy/Pandas support]                string[pyarrow]
What common feature requests do you care about most?  [Better Scikit-Learn/ML support]             string[pyarrow]
What common feature requests do you care about most?  [Integrate with Deep Learning Frameworks]    string[pyarrow]
What common feature requests do you care about most?  [Support for new libraries in my field]      string[pyarrow]
What common feature requests do you care about most?  [Improve Scaling ]                           string[pyarrow]
What common feature requests do you care about most?  [Dashboard / Diagnostics]                    string[pyarrow]
What common feature requests do you care about most?  [Ease of deployment]                         string[pyarrow]
What common feature requests do you care about most?  [Cloud integration]                          string[pyarrow]
What common feature requests do you care about most?  [Managing many users]                        string[pyarrow]
What common feature requests do you care about most?  [GPUs]                                       string[pyarrow]
Python 2 or 3?                                                                                               int64
If you use a cluster, how do you launch Dask?                                                      string[pyarrow]
Preferred Cloud?                                                                                   string[pyarrow]
What are some other libraries that you often use with Dask?                                        string[pyarrow]
How easy is it for you to upgrade to newer versions of Python libraries                                    float64
Do you use Dask as part of a larger group?                                                         string[pyarrow]
dtype: object

we created Dask DataFrame using dask.dataframe.read_csv. Dask DataFrames are lazy, meaning they don’t load the data into memory immediately. Instead, they create a task graph for operations that will be executed when triggered. To view or manipulate the data, you would need to call methods like .head() or .compute() to perform the actual computation and load the data into memory or run operations.

ddf.head()
Timestamp Dask APIs Interactive or Batch? Local machine or Cluster? How often do you use Dask? What Dask resources have you used for support in the last six months? Which would help you most right now? Is Dask stable enough for you? What common feature requests do you care about most? [Better Numpy/Pandas support] What common feature requests do you care about most? [Better Scikit-Learn/ML support] ... What common feature requests do you care about most? [Ease of deployment] What common feature requests do you care about most? [Cloud integration] What common feature requests do you care about most? [Managing many users] What common feature requests do you care about most? [GPUs] Python 2 or 3? If you use a cluster, how do you launch Dask? Preferred Cloud? What are some other libraries that you often use with Dask? How easy is it for you to upgrade to newer versions of Python libraries Do you use Dask as part of a larger group?
0 2019-06-25 06:59:49 Array;DataFrame Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation;Cluster with... Occasionally Documentation;Tutorial;Examples at examples.da... Performance improvements Yes Somewhat useful Somewhat useful ... Somewhat useful Not relevant for me Somewhat useful Somewhat useful 3.0 SSH;HPC resource manager (SLURM, PBS, SGE, LSF... Amazon Web Services (AWS);Google Cloud Platfor... <NA> 4.0 My team or research group also use Dask
1 2019-06-25 07:01:12 DataFrame;ML Interactive: I use Dask with Jupyter or IPyth... Large workstation Occasionally Documentation;Stack Overflow #Dask tag More examples in my field Yes Critical to me Critical to me ... Somewhat useful Somewhat useful Somewhat useful Critical to me 3.0 <NA> Google Cloud Platform (GCP) <NA> 3.0 I use Dask mostly on my own
2 2019-06-25 07:02:25 DataFrame;Futures;Xarray Interactive: I use Dask with Jupyter or IPyth... Cluster of 2-10 machines Occasionally Documentation;Stack Overflow #Dask tag Bug fixes No Critical to me Somewhat useful ... Critical to me Critical to me Critical to me Somewhat useful 3.0 I don't know, someone else does this for me Amazon Web Services (AWS) Datashader 3.0 My team or research group also use Dask
3 2019-06-25 07:07:08 Bag;DataFrame;Delayed;Futures Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Cluster of 2-10 machines I use Dask all the time, even when I sleep Documentation;Tutorial;Examples at examples.da... Performance improvements Yes Critical to me Critical to me ... Somewhat useful Somewhat useful Somewhat useful Somewhat useful 3.0 SSH Amazon Web Services (AWS);Microsoft Azure <NA> 4.0 My team or research group also use Dask
4 2019-06-25 07:11:32 Array;DataFrame;Delayed Interactive: I use Dask with Jupyter or IPyth... Personal laptop;Large workstation Just looking for now Documentation;Stack Overflow #Dask tag;Tutoria... More examples in my field Yes Somewhat useful <NA> ... Somewhat useful <NA> <NA> <NA> 3.0 <NA> Microsoft Azure OpenCV 4.0 I use Dask mostly on my own

5 rows × 24 columns

/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
  df = reader(bio, **kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  df = reader(bio, **kwargs)

Mulitple Files

Dask provides powerful pattern matching capabilities for reading multiple files simultaneously, allowing you to work with collections of files as if they were a single dataset.

from glob import glob
import os
filepath = glob("dataset/*.csv")
ddf = dd.read_csv(os.path.join('dataset','*.csv'),
                 parse_dates={'comp': [0, 1]}, # Here we parse the year, month and day into date
                 dtype={'Interactive or Batch?': str});
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
  head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)
ddf
Dask DataFrame Structure:
comp Interactive or Batch? Local machine or Cluster? How often do you use Dask? What Dask resources have you used for support in the last six months? Which would help you most right now? Is Dask stable enough for you? What common feature requests do you care about most? [Better Numpy/Pandas support] What common feature requests do you care about most? [Better Scikit-Learn/ML support] What common feature requests do you care about most? [Integrate with Deep Learning Frameworks] What common feature requests do you care about most? [Support for new libraries in my field] What common feature requests do you care about most? [Improve Scaling ] What common feature requests do you care about most? [Dashboard / Diagnostics] What common feature requests do you care about most? [Ease of deployment] What common feature requests do you care about most? [Cloud integration] What common feature requests do you care about most? [Managing many users] What common feature requests do you care about most? [GPUs] Python 2 or 3? If you use a cluster, how do you launch Dask? Preferred Cloud? What are some other libraries that you often use with Dask? How easy is it for you to upgrade to newer versions of Python libraries Do you use Dask as part of a larger group?
npartitions=2
string string string string string string string string string string string string string string string string string float64 string string string float64 string
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_string_dtype, 2 expressions

Creating a Dask Object

You can create a Dask object from scratch by supplying existing data and optionally including information about how the chunks should be structured.

array

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a
Array Chunk
Bytes 781.25 kiB 78.12 kiB
Shape (200, 500) (100, 100)
Dask graph 10 chunks in 1 graph layer
Data type int64 numpy.ndarray
500 200

The Dask array a is a lazy, chunked version of the NumPy array, meaning operations on it will be performed in smaller blocks and computed in parallel only when explicitly triggered (e.g., using .compute()). This is especially useful for large datasets that don’t fit into memory all at once.

Dataframe

This code creates a time-based Pandas DataFrame with 2,400 rows and converts it into a Dask DataFrame split into 10 partitions. This enables parallel and memory-efficient processing of the data.

index = pd.date_range("2021-09-01", periods=2400, freq="1h")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
ddf3 = dd.from_pandas(df, npartitions=10)
ddf3
Dask DataFrame Structure:
a b
npartitions=10
2021-09-01 00:00:00 int64 string
2021-09-11 00:00:00 ... ...
... ... ...
2021-11-30 00:00:00 ... ...
2021-12-09 23:00:00 ... ...
Dask Name: frompandas, 1 expression

Now we have a Dask DataFrame with 2 columns and 2400 rows composed of 10 partitions where each partition has 240 rows. Each partition represents a piece of the data.

ddf3.divisions
(Timestamp('2021-09-01 00:00:00'),
 Timestamp('2021-09-11 00:00:00'),
 Timestamp('2021-09-21 00:00:00'),
 Timestamp('2021-10-01 00:00:00'),
 Timestamp('2021-10-11 00:00:00'),
 Timestamp('2021-10-21 00:00:00'),
 Timestamp('2021-10-31 00:00:00'),
 Timestamp('2021-11-10 00:00:00'),
 Timestamp('2021-11-20 00:00:00'),
 Timestamp('2021-11-30 00:00:00'),
 Timestamp('2021-12-09 23:00:00'))

access the parition number

ddf3.partitions[3].head()
a b
2021-10-01 00:00:00 720 a
2021-10-01 01:00:00 721 b
2021-10-01 02:00:00 722 c
2021-10-01 03:00:00 723 a
2021-10-01 04:00:00 724 d

Indexing

Dask provides several powerful methods for indexing large-scale data efficiently across distributed partitions. Understanding how to effectively use these indexing methods is crucial for working with big data in Dask.

Indexing Dask collections feels just like slicing NumPy arrays or pandas DataFrame.

array

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a[:50, 200]
Array Chunk
Bytes 400 B 400 B
Shape (50,) (50,)
Dask graph 1 chunks in 2 graph layers
Data type int64 numpy.ndarray
50 1

Dataframe

Dask DataFrames support column-based indexing (e.g., ddf[‘col’]), but row-based indexing like .loc or .iloc requires a sorted, known index set using set_index().

ddf3.b
Dask Series Structure:
npartitions=10
2021-09-01 00:00:00    string
2021-09-11 00:00:00       ...
                        ...  
2021-11-30 00:00:00       ...
2021-12-09 23:00:00       ...
Dask Name: getitem, 2 expressions
Expr=df['b']
ddf3["2021-10-01": "2021-10-09 5:00"]
Dask DataFrame Structure:
a b
npartitions=1
2021-10-01 00:00:00.000000000 int64 string
2021-10-09 05:00:59.999999999 ... ...
Dask Name: loc, 2 expressions

Computation

Dask is lazily evaluated. The result from a computation isn’t computed until you ask for it. Instead, a Dask task graph for the computation is produced.

Anytime you have a Dask object and you want to get the result, call compute:

array

a[:50, 200].compute()
array([  200,   700,  1200,  1700,  2200,  2700,  3200,  3700,  4200,
        4700,  5200,  5700,  6200,  6700,  7200,  7700,  8200,  8700,
        9200,  9700, 10200, 10700, 11200, 11700, 12200, 12700, 13200,
       13700, 14200, 14700, 15200, 15700, 16200, 16700, 17200, 17700,
       18200, 18700, 19200, 19700, 20200, 20700, 21200, 21700, 22200,
       22700, 23200, 23700, 24200, 24700])

Dataframe

ddf3["2021-10-01": "2021-10-09 5:00"].compute()
a b
2021-10-01 00:00:00 720 a
2021-10-01 01:00:00 721 b
2021-10-01 02:00:00 722 c
2021-10-01 03:00:00 723 a
2021-10-01 04:00:00 724 d
... ... ...
2021-10-09 01:00:00 913 b
2021-10-09 02:00:00 914 c
2021-10-09 03:00:00 915 a
2021-10-09 04:00:00 916 d
2021-10-09 05:00:00 917 d

198 rows × 2 columns

.compute() triggers the actual execution of a Dask computation and returns the final result as a regular in-memory object (like a Pandas DataFrame or Series).

Methods

Dask collections match existing numpy and pandas methods, so they should feel familiar. Call the method to set up the task graph, and then call compute to get the result.

array

a.mean()

a.mean().compute()

np.sin(a)

np.sin(a).compute()

a.T

a.T.compute()
array([[    0,   500,  1000, ..., 98500, 99000, 99500],
       [    1,   501,  1001, ..., 98501, 99001, 99501],
       [    2,   502,  1002, ..., 98502, 99002, 99502],
       ...,
       [  497,   997,  1497, ..., 98997, 99497, 99997],
       [  498,   998,  1498, ..., 98998, 99498, 99998],
       [  499,   999,  1499, ..., 98999, 99499, 99999]], shape=(500, 200))

Dataframe

ddf3.a.mean()

ddf3.a.mean().compute()

ddf3.b.unique()

ddf3.b.unique().compute()
0    e
0    c
1    d
0    a
0    b
Name: b, dtype: string

exercice

For Dask arrays, calculate the mean along axis=1 of the sum of the x array and its transpose.

import dask.array as da
import numpy as np
import dask
xd = da.random.normal( 10 , 0.1 , size=( 30_000 , 30_000 ), chunks=( 300 , 300 ))

creates a Dask array of random values with a normal distribution, mean of 10, and standard deviation of 0.1, of shape (30,000, 30,000). The array is chunked into blocks of 300×300 elements for parallel processing.

Solution

x_sum = xd + xd.T
res = x_sum.mean(axis= 1 )
res.compute()
array([19.99894119, 19.99777179, 19.99950783, ..., 20.00016687,
       19.99939267, 19.99915841], shape=(30000,))