Dask¶
Dask is an open-source Python library that enables parallel and distributed computing, making it easier to scale data analysis from a single machine to large cloud-based clusters. It integrates seamlessly with the existing Python and PyData ecosystem, offering familiar APIs that mirror those of popular libraries such as Pandas, NumPy, and scikit-learn.
Designed for scalability, Dask can utilize all the resources of your local machine or expand to distributed systems in the cloud.
Dask primarily provides three core interfaces for working with data: Dask DataFrames, Dask Arrays, and Dask Bags.
Key Features
Scalability: Handles datasets larger than available memory dask.org
Performance: 50% faster than Apache Spark on standard benchmarks dask.org
Flexibility: Runs on laptops or scales to thousands of nodes
Basic Concepts of Dask¶
On a high-level, you can think of Dask as a wrapper that extends the capabilities of traditional tools like pandas, NumPy, and Spark to handle larger-than-memory datasets.
When faced with large objects like larger-than-memory arrays (vectors) or matrices (dataframes), Dask breaks them up into chunks, also called partitions.
For example, consider the array of 12 random numbers in both NumPy and Dask:
!pip install dask pandas pyarrow graphviz # fist intall dask and pandas using pip
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: dask in /home/gth/.local/lib/python3.11/site-packages (2025.4.1)
Requirement already satisfied: pandas in /home/gth/.local/lib/python3.11/site-packages (2.2.3)
Requirement already satisfied: pyarrow in /home/gth/.local/lib/python3.11/site-packages (20.0.0)
Collecting graphviz
Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: click>=8.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (8.1.7)
Requirement already satisfied: cloudpickle>=3.0.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (3.0.0)
Requirement already satisfied: fsspec>=2021.09.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (2023.10.0)
Requirement already satisfied: packaging>=20.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (23.2)
Requirement already satisfied: partd>=1.4.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.4.2)
Requirement already satisfied: pyyaml>=5.3.1 in /apps/all/PyYAML/6.0.1-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.0.1)
Requirement already satisfied: toolz>=0.10.0 in /home/gth/.local/lib/python3.11/site-packages (from dask) (1.0.0)
Requirement already satisfied: importlib_metadata>=4.13.0 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from dask) (6.8.0)
Requirement already satisfied: numpy>=1.23.2 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2.2.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.7 in /home/gth/.local/lib/python3.11/site-packages (from pandas) (2025.2)
Requirement already satisfied: zipp>=0.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from importlib_metadata>=4.13.0->dask) (3.17.0)
Requirement already satisfied: locket in /home/gth/.local/lib/python3.11/site-packages (from partd>=1.4.0->dask) (1.0.0)
Requirement already satisfied: six>=1.5 in /apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.3
[notice] A new release of pip is available: 24.3.1 -> 25.1
[notice] To update, run: pip install --upgrade pip
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import numpy as np
import pandas as pd
import dask
import numpy as np
narr = np.random.rand(12)
narr
array([0.88535136, 0.2347785 , 0.10454781, 0.14247243, 0.1455551 ,
0.70315651, 0.70865184, 0.62603594, 0.29620262, 0.26905289,
0.71273231, 0.76479225])
The next code creates a Dask array darr from a NumPy array narr, splitting it into chunks of size 3. This allows efficient parallel computation and memory management.
darr = da.from_array(narr, chunks=3)
darr
|
|
|||||||||||||||
The image above shows that the Dask array contains four chunks as we set chunks to 3. Under the hood, each chunk is a NumPy array in itself.
Now, let’s consider a much larger example. We will create two 10k by 100k arrays (1 billion elements) and perform element-wise multiplication in both libraries while measuring the performance:
# Create the NumPy arrays
arr1 = np.random.rand(10_000, 100_000)
arr2 = np.random.rand(10_000, 100_000)
# Create the Dask arrays
darr1 = da.from_array(arr1, chunks=(1_000, 10_000))
darr2 = da.from_array(arr2, chunks=(1_000, 10_000))
%%time
result_np = np.multiply(arr1, arr2)
CPU times: user 748 ms, sys: 649 ms, total: 1.4 s
Wall time: 1.4 s
%%time
result_dask = da.multiply(darr1, darr2)
result_dask = result_dask.compute() # This triggers the actual computation
CPU times: user 3.93 s, sys: 2.53 s, total: 6.46 s
Wall time: 2.42 s
Dask shines for larger-than-memory arrays or truly parallel workloads, but for single-node, RAM-fit tasks, NumPy is often faster due to lower overhead.
Dask uses a similar approach of chunking and distributing these chunks across all available cores on your machine for other objects as well.

This is a short overview of Dask
Dask DataFrames¶
Dask DataFrames are a powerful tool for processing large-scale tabular data by parallelizing pandas operations docs.dask.org. They allow you to work with datasets that are too large to fit into memory by breaking them down into smaller chunks and processing them in parallel.
import dask
from dask import delayed
from time import sleep
def inc(x):
sleep(1)
return x + 1
def add(x, y):
sleep(1)
return x + y
%%time
# Sequential code
a=[1,2,3]
results = []
for x in a:
y = inc(x)
results.append(y)
total = sum(results)
CPU times: user 636 µs, sys: 912 µs, total: 1.55 ms
Wall time: 3 s
This block uses Dask’s delayed to create lazy tasks.
The computation (inc(x)) is not executed immediately; instead, Dask constructs a task graph.
Total time: The sleep(1) calls still happen sequentially (since we’re not triggering parallelism here yet), so the time should still be around 3 seconds.
%%time
# Sequential code
results = []
for x in a:
y = delayed(inc)(x)
results.append(y)
total = sum(results)
CPU times: user 0 ns, sys: 767 µs, total: 767 µs
Wall time: 731 µs
the first block runs sequentially, calling the inc(x) function for each element in the list a, resulting in a total execution time of about 3 seconds. In the second block, Dask’s delayed is used to create lazy tasks for the same function calls.
%%time
# Sequential code
results = []
for x in a:
y = delayed(inc)(x)
results.append(y)
total = delayed(sum)(results)
CPU times: user 424 μs, sys: 142 μs, total: 566 μs
Wall time: 582 μs
here we are using Dask’s delayed function to create a task graph for a sequence of computations (incrementing each value in the list a and summing the results). The tasks are not executed immediately but are delayed until explicitly triggered. However, since no compute() function is called, the computation is still done sequentially, and the tasks are just set up. To actually perform the computations in parallel, you would need to call total.compute().
The delayed function in Dask is used to defer the execution of a function, creating a lazy task instead of running the function immediately.
import pandas as pd
df=pd.read_csv("dataset/dask_dataset.csv")
df
| Timestamp | Dask APIs | Interactive or Batch? | Local machine or Cluster? | How often do you use Dask? | What Dask resources have you used for support in the last six months? | Which would help you most right now? | Is Dask stable enough for you? | What common feature requests do you care about most? [Better Numpy/Pandas support] | What common feature requests do you care about most? [Better Scikit-Learn/ML support] | ... | What common feature requests do you care about most? [Ease of deployment] | What common feature requests do you care about most? [Cloud integration] | What common feature requests do you care about most? [Managing many users] | What common feature requests do you care about most? [GPUs] | Python 2 or 3? | If you use a cluster, how do you launch Dask? | Preferred Cloud? | What are some other libraries that you often use with Dask? | How easy is it for you to upgrade to newer versions of Python libraries | Do you use Dask as part of a larger group? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-25 06:59:49 | Array;DataFrame | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation;Cluster with... | Occasionally | Documentation;Tutorial;Examples at examples.da... | Performance improvements | Yes | Somewhat useful | Somewhat useful | ... | Somewhat useful | Not relevant for me | Somewhat useful | Somewhat useful | 3.0 | SSH;HPC resource manager (SLURM, PBS, SGE, LSF... | Amazon Web Services (AWS);Google Cloud Platfor... | NaN | 4.0 | My team or research group also use Dask |
| 1 | 2019-06-25 07:01:12 | DataFrame;ML | Interactive: I use Dask with Jupyter or IPyth... | Large workstation | Occasionally | Documentation;Stack Overflow #Dask tag | More examples in my field | Yes | Critical to me | Critical to me | ... | Somewhat useful | Somewhat useful | Somewhat useful | Critical to me | 3.0 | NaN | Google Cloud Platform (GCP) | NaN | 3.0 | I use Dask mostly on my own |
| 2 | 2019-06-25 07:02:25 | DataFrame;Futures;Xarray | Interactive: I use Dask with Jupyter or IPyth... | Cluster of 2-10 machines | Occasionally | Documentation;Stack Overflow #Dask tag | Bug fixes | No | Critical to me | Somewhat useful | ... | Critical to me | Critical to me | Critical to me | Somewhat useful | 3.0 | I don't know, someone else does this for me | Amazon Web Services (AWS) | Datashader | 3.0 | My team or research group also use Dask |
| 3 | 2019-06-25 07:07:08 | Bag;DataFrame;Delayed;Futures | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Cluster of 2-10 machines | I use Dask all the time, even when I sleep | Documentation;Tutorial;Examples at examples.da... | Performance improvements | Yes | Critical to me | Critical to me | ... | Somewhat useful | Somewhat useful | Somewhat useful | Somewhat useful | 3.0 | SSH | Amazon Web Services (AWS);Microsoft Azure | NaN | 4.0 | My team or research group also use Dask |
| 4 | 2019-06-25 07:11:32 | Array;DataFrame;Delayed | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation | Just looking for now | Documentation;Stack Overflow #Dask tag;Tutoria... | More examples in my field | Yes | Somewhat useful | NaN | ... | Somewhat useful | NaN | NaN | NaN | 3.0 | NaN | Microsoft Azure | OpenCV | 4.0 | I use Dask mostly on my own |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 254 | 2019-07-21 07:19:54 | Delayed;Futures | Batch: I submit scripts that run in the future | Personal laptop | Occasionally | Documentation | More examples in my field | Yes | Somewhat useful | Not relevant for me | ... | Somewhat useful | Somewhat useful | Not relevant for me | Not relevant for me | 3.0 | SSH | Google Cloud Platform (GCP) | Pandas | 4.0 | I use Dask mostly on my own |
| 255 | 2019-07-22 08:23:11 | DataFrame;Futures;Xarray | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation | Just looking for now | Documentation;Examples at examples.dask.org | More examples in my field | Yes | Critical to me | Somewhat useful | ... | Somewhat useful | Somewhat useful | Somewhat useful | Critical to me | 3.0 | NaN | Microsoft Azure | NaN | 3.0 | My team or research group also use Dask |
| 256 | 2019-07-23 08:22:35 | Futures | Batch: I submit scripts that run in the future | Cluster with 10-100 machines | Every day | Documentation;GitHub issue trackers (reading p... | New features | Yes | Somewhat useful | Somewhat useful | ... | Critical to me | Not relevant for me | Not relevant for me | Somewhat useful | 3.0 | SSH | NaN | Pybullet, LMDB, pytorch. | 4.0 | My team or research group also use Dask |
| 257 | 2019-07-23 09:08:42 | Delayed;Futures | Batch: I submit scripts that run in the future | Cluster with 100+ machines | Every day | Documentation;GitHub issue trackers (reading p... | Bug fixes | No | Somewhat useful | Not relevant for me | ... | Somewhat useful | Not relevant for me | Not relevant for me | Critical to me | 3.0 | SSH | We have our own cluster for the lab | torch, lmdb | 3.0 | My team or research group also use Dask |
| 258 | 2019-07-23 14:18:22 | Array;DataFrame;ML | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation | Occasionally | Documentation;Stack Overflow #Dask tag | More examples in my field | Yes | Somewhat useful | Critical to me | ... | Critical to me | Not relevant for me | Not relevant for me | Critical to me | 3.0 | NaN | NaN | NaN | 3.0 | My team or research group also use Dask |
259 rows × 24 columns
ddf = dd.read_csv("dataset/dask_dataset.csv")
ddf
| Timestamp | Dask APIs | Interactive or Batch? | Local machine or Cluster? | How often do you use Dask? | What Dask resources have you used for support in the last six months? | Which would help you most right now? | Is Dask stable enough for you? | What common feature requests do you care about most? [Better Numpy/Pandas support] | What common feature requests do you care about most? [Better Scikit-Learn/ML support] | What common feature requests do you care about most? [Integrate with Deep Learning Frameworks] | What common feature requests do you care about most? [Support for new libraries in my field] | What common feature requests do you care about most? [Improve Scaling ] | What common feature requests do you care about most? [Dashboard / Diagnostics] | What common feature requests do you care about most? [Ease of deployment] | What common feature requests do you care about most? [Cloud integration] | What common feature requests do you care about most? [Managing many users] | What common feature requests do you care about most? [GPUs] | Python 2 or 3? | If you use a cluster, how do you launch Dask? | Preferred Cloud? | What are some other libraries that you often use with Dask? | How easy is it for you to upgrade to newer versions of Python libraries | Do you use Dask as part of a larger group? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1 | ||||||||||||||||||||||||
| string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | float64 | string | string | string | float64 | string | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ddf.dtypes
Timestamp string[pyarrow]
Dask APIs string[pyarrow]
Interactive or Batch? string[pyarrow]
Local machine or Cluster? string[pyarrow]
How often do you use Dask? string[pyarrow]
What Dask resources have you used for support in the last six months? string[pyarrow]
Which would help you most right now? string[pyarrow]
Is Dask stable enough for you? string[pyarrow]
What common feature requests do you care about most? [Better Numpy/Pandas support] string[pyarrow]
What common feature requests do you care about most? [Better Scikit-Learn/ML support] string[pyarrow]
What common feature requests do you care about most? [Integrate with Deep Learning Frameworks] string[pyarrow]
What common feature requests do you care about most? [Support for new libraries in my field] string[pyarrow]
What common feature requests do you care about most? [Improve Scaling ] string[pyarrow]
What common feature requests do you care about most? [Dashboard / Diagnostics] string[pyarrow]
What common feature requests do you care about most? [Ease of deployment] string[pyarrow]
What common feature requests do you care about most? [Cloud integration] string[pyarrow]
What common feature requests do you care about most? [Managing many users] string[pyarrow]
What common feature requests do you care about most? [GPUs] string[pyarrow]
Python 2 or 3? int64
If you use a cluster, how do you launch Dask? string[pyarrow]
Preferred Cloud? string[pyarrow]
What are some other libraries that you often use with Dask? string[pyarrow]
How easy is it for you to upgrade to newer versions of Python libraries float64
Do you use Dask as part of a larger group? string[pyarrow]
dtype: object
we created Dask DataFrame using dask.dataframe.read_csv. Dask DataFrames are lazy, meaning they don’t load the data into memory immediately. Instead, they create a task graph for operations that will be executed when triggered. To view or manipulate the data, you would need to call methods like .head() or .compute() to perform the actual computation and load the data into memory or run operations.
ddf.head()
| Timestamp | Dask APIs | Interactive or Batch? | Local machine or Cluster? | How often do you use Dask? | What Dask resources have you used for support in the last six months? | Which would help you most right now? | Is Dask stable enough for you? | What common feature requests do you care about most? [Better Numpy/Pandas support] | What common feature requests do you care about most? [Better Scikit-Learn/ML support] | ... | What common feature requests do you care about most? [Ease of deployment] | What common feature requests do you care about most? [Cloud integration] | What common feature requests do you care about most? [Managing many users] | What common feature requests do you care about most? [GPUs] | Python 2 or 3? | If you use a cluster, how do you launch Dask? | Preferred Cloud? | What are some other libraries that you often use with Dask? | How easy is it for you to upgrade to newer versions of Python libraries | Do you use Dask as part of a larger group? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-25 06:59:49 | Array;DataFrame | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation;Cluster with... | Occasionally | Documentation;Tutorial;Examples at examples.da... | Performance improvements | Yes | Somewhat useful | Somewhat useful | ... | Somewhat useful | Not relevant for me | Somewhat useful | Somewhat useful | 3.0 | SSH;HPC resource manager (SLURM, PBS, SGE, LSF... | Amazon Web Services (AWS);Google Cloud Platfor... | <NA> | 4.0 | My team or research group also use Dask |
| 1 | 2019-06-25 07:01:12 | DataFrame;ML | Interactive: I use Dask with Jupyter or IPyth... | Large workstation | Occasionally | Documentation;Stack Overflow #Dask tag | More examples in my field | Yes | Critical to me | Critical to me | ... | Somewhat useful | Somewhat useful | Somewhat useful | Critical to me | 3.0 | <NA> | Google Cloud Platform (GCP) | <NA> | 3.0 | I use Dask mostly on my own |
| 2 | 2019-06-25 07:02:25 | DataFrame;Futures;Xarray | Interactive: I use Dask with Jupyter or IPyth... | Cluster of 2-10 machines | Occasionally | Documentation;Stack Overflow #Dask tag | Bug fixes | No | Critical to me | Somewhat useful | ... | Critical to me | Critical to me | Critical to me | Somewhat useful | 3.0 | I don't know, someone else does this for me | Amazon Web Services (AWS) | Datashader | 3.0 | My team or research group also use Dask |
| 3 | 2019-06-25 07:07:08 | Bag;DataFrame;Delayed;Futures | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Cluster of 2-10 machines | I use Dask all the time, even when I sleep | Documentation;Tutorial;Examples at examples.da... | Performance improvements | Yes | Critical to me | Critical to me | ... | Somewhat useful | Somewhat useful | Somewhat useful | Somewhat useful | 3.0 | SSH | Amazon Web Services (AWS);Microsoft Azure | <NA> | 4.0 | My team or research group also use Dask |
| 4 | 2019-06-25 07:11:32 | Array;DataFrame;Delayed | Interactive: I use Dask with Jupyter or IPyth... | Personal laptop;Large workstation | Just looking for now | Documentation;Stack Overflow #Dask tag;Tutoria... | More examples in my field | Yes | Somewhat useful | <NA> | ... | Somewhat useful | <NA> | <NA> | <NA> | 3.0 | <NA> | Microsoft Azure | OpenCV | 4.0 | I use Dask mostly on my own |
5 rows × 24 columns
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
df = reader(bio, **kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:77: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
df = reader(bio, **kwargs)
Mulitple Files¶
Dask provides powerful pattern matching capabilities for reading multiple files simultaneously, allowing you to work with collections of files as if they were a single dataset.
from glob import glob
import os
filepath = glob("dataset/*.csv")
ddf = dd.read_csv(os.path.join('dataset','*.csv'),
parse_dates={'comp': [0, 1]}, # Here we parse the year, month and day into date
dtype={'Interactive or Batch?': str});
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.
head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)
/home/gth/.local/lib/python3.11/site-packages/dask/dataframe/io/csv.py:594: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
head = reader(BytesIO(b_sample), nrows=sample_rows, **head_kwargs)
ddf
| comp | Interactive or Batch? | Local machine or Cluster? | How often do you use Dask? | What Dask resources have you used for support in the last six months? | Which would help you most right now? | Is Dask stable enough for you? | What common feature requests do you care about most? [Better Numpy/Pandas support] | What common feature requests do you care about most? [Better Scikit-Learn/ML support] | What common feature requests do you care about most? [Integrate with Deep Learning Frameworks] | What common feature requests do you care about most? [Support for new libraries in my field] | What common feature requests do you care about most? [Improve Scaling ] | What common feature requests do you care about most? [Dashboard / Diagnostics] | What common feature requests do you care about most? [Ease of deployment] | What common feature requests do you care about most? [Cloud integration] | What common feature requests do you care about most? [Managing many users] | What common feature requests do you care about most? [GPUs] | Python 2 or 3? | If you use a cluster, how do you launch Dask? | Preferred Cloud? | What are some other libraries that you often use with Dask? | How easy is it for you to upgrade to newer versions of Python libraries | Do you use Dask as part of a larger group? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=2 | |||||||||||||||||||||||
| string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | string | float64 | string | string | string | float64 | string | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Creating a Dask Object¶
You can create a Dask object from scratch by supplying existing data and optionally including information about how the chunks should be structured.
array¶
import numpy as np
import dask.array as da
data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a
|
|
|||||||||||||||
The Dask array a is a lazy, chunked version of the NumPy array, meaning operations on it will be performed in smaller blocks and computed in parallel only when explicitly triggered (e.g., using .compute()). This is especially useful for large datasets that don’t fit into memory all at once.
Dataframe¶
This code creates a time-based Pandas DataFrame with 2,400 rows and converts it into a Dask DataFrame split into 10 partitions. This enables parallel and memory-efficient processing of the data.
index = pd.date_range("2021-09-01", periods=2400, freq="1h")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
ddf3 = dd.from_pandas(df, npartitions=10)
ddf3
| a | b | |
|---|---|---|
| npartitions=10 | ||
| 2021-09-01 00:00:00 | int64 | string |
| 2021-09-11 00:00:00 | ... | ... |
| ... | ... | ... |
| 2021-11-30 00:00:00 | ... | ... |
| 2021-12-09 23:00:00 | ... | ... |
Now we have a Dask DataFrame with 2 columns and 2400 rows composed of 10 partitions where each partition has 240 rows. Each partition represents a piece of the data.
ddf3.divisions
(Timestamp('2021-09-01 00:00:00'),
Timestamp('2021-09-11 00:00:00'),
Timestamp('2021-09-21 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-11 00:00:00'),
Timestamp('2021-10-21 00:00:00'),
Timestamp('2021-10-31 00:00:00'),
Timestamp('2021-11-10 00:00:00'),
Timestamp('2021-11-20 00:00:00'),
Timestamp('2021-11-30 00:00:00'),
Timestamp('2021-12-09 23:00:00'))
access the parition number
ddf3.partitions[3].head()
| a | b | |
|---|---|---|
| 2021-10-01 00:00:00 | 720 | a |
| 2021-10-01 01:00:00 | 721 | b |
| 2021-10-01 02:00:00 | 722 | c |
| 2021-10-01 03:00:00 | 723 | a |
| 2021-10-01 04:00:00 | 724 | d |
Indexing¶
Dask provides several powerful methods for indexing large-scale data efficiently across distributed partitions. Understanding how to effectively use these indexing methods is crucial for working with big data in Dask.
Indexing Dask collections feels just like slicing NumPy arrays or pandas DataFrame.
array¶
import numpy as np
import dask.array as da
data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a[:50, 200]
|
|
|||||||||||||||
Dataframe¶
Dask DataFrames support column-based indexing (e.g., ddf[‘col’]), but row-based indexing like .loc or .iloc requires a sorted, known index set using set_index().
ddf3.b
Dask Series Structure:
npartitions=10
2021-09-01 00:00:00 string
2021-09-11 00:00:00 ...
...
2021-11-30 00:00:00 ...
2021-12-09 23:00:00 ...
Dask Name: getitem, 2 expressions
Expr=df['b']
ddf3["2021-10-01": "2021-10-09 5:00"]
| a | b | |
|---|---|---|
| npartitions=1 | ||
| 2021-10-01 00:00:00.000000000 | int64 | string |
| 2021-10-09 05:00:59.999999999 | ... | ... |
Computation¶
Dask is lazily evaluated. The result from a computation isn’t computed until you ask for it. Instead, a Dask task graph for the computation is produced.
Anytime you have a Dask object and you want to get the result, call compute:
array¶
a[:50, 200].compute()
array([ 200, 700, 1200, 1700, 2200, 2700, 3200, 3700, 4200,
4700, 5200, 5700, 6200, 6700, 7200, 7700, 8200, 8700,
9200, 9700, 10200, 10700, 11200, 11700, 12200, 12700, 13200,
13700, 14200, 14700, 15200, 15700, 16200, 16700, 17200, 17700,
18200, 18700, 19200, 19700, 20200, 20700, 21200, 21700, 22200,
22700, 23200, 23700, 24200, 24700])
Dataframe¶
ddf3["2021-10-01": "2021-10-09 5:00"].compute()
| a | b | |
|---|---|---|
| 2021-10-01 00:00:00 | 720 | a |
| 2021-10-01 01:00:00 | 721 | b |
| 2021-10-01 02:00:00 | 722 | c |
| 2021-10-01 03:00:00 | 723 | a |
| 2021-10-01 04:00:00 | 724 | d |
| ... | ... | ... |
| 2021-10-09 01:00:00 | 913 | b |
| 2021-10-09 02:00:00 | 914 | c |
| 2021-10-09 03:00:00 | 915 | a |
| 2021-10-09 04:00:00 | 916 | d |
| 2021-10-09 05:00:00 | 917 | d |
198 rows × 2 columns
.compute() triggers the actual execution of a Dask computation and returns the final result as a regular in-memory object (like a Pandas DataFrame or Series).
Methods¶
Dask collections match existing numpy and pandas methods, so they should feel familiar. Call the method to set up the task graph, and then call compute to get the result.
array¶
a.mean()
a.mean().compute()
np.sin(a)
np.sin(a).compute()
a.T
a.T.compute()
array([[ 0, 500, 1000, ..., 98500, 99000, 99500],
[ 1, 501, 1001, ..., 98501, 99001, 99501],
[ 2, 502, 1002, ..., 98502, 99002, 99502],
...,
[ 497, 997, 1497, ..., 98997, 99497, 99997],
[ 498, 998, 1498, ..., 98998, 99498, 99998],
[ 499, 999, 1499, ..., 98999, 99499, 99999]], shape=(500, 200))
Dataframe¶
ddf3.a.mean()
ddf3.a.mean().compute()
ddf3.b.unique()
ddf3.b.unique().compute()
0 e
0 c
1 d
0 a
0 b
Name: b, dtype: string
exercice¶
For Dask arrays, calculate the mean along axis=1 of the sum of the x array and its transpose.
import dask.array as da
import numpy as np
import dask
xd = da.random.normal( 10 , 0.1 , size=( 30_000 , 30_000 ), chunks=( 300 , 300 ))
creates a Dask array of random values with a normal distribution, mean of 10, and standard deviation of 0.1, of shape (30,000, 30,000). The array is chunked into blocks of 300×300 elements for parallel processing.
Solution¶
x_sum = xd + xd.T
res = x_sum.mean(axis= 1 )
res.compute()
array([19.99894119, 19.99777179, 19.99950783, ..., 20.00016687,
19.99939267, 19.99915841], shape=(30000,))