Fast and Distributed
Python Query Engine
Daft is a framework for ETL, analytics, and ML/AI at scale. Its familiar Python DataFrame API is designed to surpass Spark in both performance and ease of use.
CAPABILITIES
Blazing efficiency, designed for multimodal data.
Integrate with ML/AI libraries.
Daft plugs directly into your ML/AI stack through efficient zero-copy integrations with essential Python libraries such as PyTorch and Ray. It also lets you request GPUs as a resource for running models.
Go Distributed and Out-of-Core.
Daft runs locally with a lightweight multithreaded backend. When your local machine is no longer sufficient, it scales seamlessly to run out-of-core on a distributed cluster.
Execute Complex Operations.
Daft can handle User-Defined Functions (UDFs) on DataFrame columns, allowing you to apply complex expressions and operations on Python objects with full flexibility required for ML/AI.
Native Support for Cloud Storage.
Daft's Rust-based I/O engine reads and writes data directly from cloud object storage such as AWS S3, with record-setting performance on formats such as Apache Parquet.
Deliver Unmatched Speed.
Underneath its Python API, Daft is built in blazing-fast Rust. Rust powers Daft's vectorized execution and async I/O, allowing Daft to outperform frameworks such as Spark.
USE CASES
Daft provides a familiar and easy-to-use Python DataFrame API for:
Daft exposes a powerful type system that can represent complex datatypes such as JSON, URLs, Images, and Tensors. Operations can be expressed using Daft's Expressions API, making it easy to manipulate these complex datatypes; execution is deferred to Daft's blazing-fast Rust core engine.
import daft

df = daft.from_pydict(
    {
        "image_urls": ["a", "b", "c"],
    }
)

df = df.with_column(
    "data",
    df["image_urls"].url.download()
)
df = df.with_column(
    "images",
    df["data"].image.decode()
)
df.show()
Daft supports large-scale tabular batch data processing with its familiar DataFrame interface. Its Rust I/O engine is heavily tuned for cloud-based workloads, and boasts record-setting efficiency when reading and writing data in formats such as Apache Parquet. Daft also applies powerful optimizations with its built-in query optimizer, ensuring that your query is executed efficiently when run on data at terabyte scales.
import daft

df = daft.read_parquet(
    "s3://source-bucket/**/*.parquet"
)
df = df.sort("foo")

df.write_parquet("s3://destination-bucket/")
Daft provides tight integration with frameworks such as PyTorch and Ray to efficiently ingest data into your data-hungry ML model training workloads. Its blazing-fast I/O and kernels maximize GPU utilization by pipelining your data through downloading, pre-processing, and random per-epoch shuffling. Daft also leverages Apache Arrow memory formats, allowing for zero-copy data transfer between data loading and model training.
import daft

df = daft.read_json("s3://my-json-files/**/*.json")
df = df.with_column(
    "features",
    run_featurization(df["data"])
)

pytorch_dataset = df.to_torch_iter_dataset()
Daft is built for use in interactive environments such as Jupyter notebooks. This lets you explore your data interactively with just your local development environment. Daft's Expressions API also lets you perform operations such as groupbys, aggregations, and joins.
import daft

df = daft.from_pydict(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [i for i in range(4)],
    }
)
grouped_df = df.groupby(df["A"])
aggregated_df = grouped_df.agg(
    [
        (grouped_df["B"].alias("B_sum"), "sum"),
    ]
)
aggregated_df.collect()
COMMUNITY
Get updates,
contribute code, or say hi!
We hold contributor syncs on the last Thursday of every month to discuss new features and technical deep dives. Add it to your calendar here
Daft Engineering Blog
Join us as we explore innovative ways to handle vast datasets, optimize performance, and revolutionize your data workflows!
Take the next step with an easy tutorial.
MNIST Digit Classification
Use a simple deep learning model to run classification on the MNIST image dataset.
Running LLMs on the Red Pajamas Dataset
Perform a similarity search on Stack Exchange questions using language models and embeddings.
Querying Images with UDFs
Query the Open Images dataset to retrieve the top N “reddest” images using NumPy and Pillow inside Daft UDFs.
Image Generation on GPUs
Generate images from text prompts using a deep learning model (Mini DALL-E) and Daft UDFs.