Fast and Distributed
Python Query Engine
Daft is a framework for ETL, analytics, and ML/AI at scale. Its familiar Python DataFrame API is designed to surpass Spark in both performance and ease of use.
CAPABILITIES
Blazing efficiency, designed for multimodal data.
Integrate with ML/AI libraries.
Daft plugs directly into your ML/AI stack through efficient zero-copy integrations with essential Python libraries such as PyTorch and Ray. It also lets you request GPUs as a resource for running models.
Go Distributed and Out-of-Core.
Daft runs locally with a lightweight multithreaded backend. When your local machine is no longer sufficient, it scales seamlessly to run out-of-core on a distributed cluster.
Execute Complex Operations.
Daft can handle User-Defined Functions (UDFs) on DataFrame columns, allowing you to apply complex expressions and operations on Python objects with full flexibility required for ML/AI.
Native Support for Cloud Storage.
Daft's Rust-based I/O engine reads and writes data directly from cloud object storage such as AWS S3, with record-setting performance on formats such as Apache Parquet.
Deliver Unmatched Speed.
Underneath its Python API, Daft is built in blazing-fast Rust. Rust powers Daft's vectorized execution and async I/O, allowing Daft to outperform frameworks such as Spark.
USE CASES
Daft provides a familiar and easy-to-use Python DataFrame API for:
Daft exposes a powerful type system that can represent complex datatypes such as JSON, URLs, Images, and Tensors. Operations can be expressed using Daft's Expressions API, making it easy to manipulate these complex datatypes; execution is deferred to Daft's blazing-fast Rust core engine.
import daft

df = daft.from_pydict(
    {
        "image_urls": ["a", "b", "c"],
    }
)

df = df.with_column(
    "data",
    df["image_urls"].url.download()
)
df = df.with_column(
    "images",
    df["data"].image.decode()
)
df.show()
Daft supports large-scale tabular batch data processing with its familiar DataFrame interface. Its Rust I/O engine is heavily tuned for cloud-based workloads, and boasts record-setting efficiency when reading and writing data in formats such as Apache Parquet. Daft also applies powerful optimizations with its built-in query optimizer, ensuring that your query is executed efficiently when run on data at terabyte scales.
import daft

df = daft.read_parquet(
    "s3://source-bucket/**/*.parquet"
)
df = df.sort("foo")

df.write_parquet("s3://destination-bucket/")
Daft provides tight integration with frameworks such as PyTorch and Ray to efficiently ingest data into your data-hungry ML model training workloads. Its blazing-fast I/O and kernels maximize GPU utilization by pipelining your data through downloading, pre-processing, and random per-epoch shuffling. Daft also leverages Apache Arrow memory formats, allowing for zero-copy data transfer between data loading and model training.
import daft

df = daft.read_json("s3://my-json-files/**/*.json")
df = df.with_column(
    "features",
    run_featurization(df["data"])
)

pytorch_dataset = df.to_torch_iter_dataset()
Daft is built for use in interactive environments such as Jupyter notebooks. This lets you explore your data interactively with just your local development environment. Daft's Expressions API also lets you perform operations such as groupbys, aggregations, and joins.
import daft

df = daft.from_pydict(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [i for i in range(4)],
    }
)
grouped_df = df.groupby(df["A"])
aggregated_df = grouped_df.agg(
    [
        (grouped_df["B"].alias("B_sum"), "sum"),
    ]
)
aggregated_df.collect()
COMMUNITY
Get updates,
contribute code, or say hi!
We hold contributor syncs on the last Thursday of every month to discuss new features and technical deep dives. Add it to your calendar here
Daft Engineering Blog
Join us as we explore innovative ways to handle vast datasets, optimize performance, and revolutionize your data workflows!
Take the next step with an easy tutorial.
MNIST Digit Classification
Use a simple deep learning model to run classification on the MNIST image dataset.
Running LLMs on the Red Pajamas Dataset
Perform a similarity search on Stack Exchange questions using language models and embeddings.
Querying Images with UDFs
Query the Open Images dataset to retrieve the top N “reddest” images using NumPy and Pillow inside Daft UDFs.
Image Generation on GPUs
Generate images from text prompts using a deep learning model (Mini DALL-E) and Daft UDFs.