Software Engineering Team CU Dept. of Biomedical Informatics

Tip of the Week: Data Engineering with SQL, Arrow and DuckDB

Tip of the Week: Data Engineering with SQL, Arrow and DuckDB

Each week we seek to provide a software tip of the week geared towards helping you achieve your software goals. Views expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out to #software-engineering on Slack or email DBMISoftwareEngineering at olucdenver.onmicrosoft.com

Apache Arrow is a language-independent and high performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high performance way to work with Arrow data where otherwise one may be forced to language-specific data structures or transforms.

TLDR (too long, didn’t read); DuckDB may be used to access and transform Arrow-based data from multiple data formats through SQL. Using Arrow and DuckDB provides a cross-language way to access and manage data. Data development with these tools may also enable improvements in performance, understandability, or long term maintainability of your code.

Reduce Wasted Conversion Effort with Arrow

flowchart TB
    Python:::outlined <--> Arrow
    R:::outlined <--> Arrow
    C++:::outlined <--> Arrow
    Java:::outlined <--> Arrow
    others...:::outlined <--> Arrow

    classDef outlined fill:#fff,stroke:#333

Arrow provides a multi-language data format which prevents you from needing to convert to other formats when dealing with multiple in-memory or serialized data formats. For example, this means that a Python and an R package may use the same in-memory or file-based data without conversion (where normally a Python Pandas dataframe and R data frame may require a conversion step in between).

flowchart TB
    subgraph Python
      Pandas:::outlined
      Polars:::outlined
      dict[Python dict]:::outlined
      list[Python list]:::outlined
    end

    Pandas <--> Arrow
    Polars <--> Arrow
    dict <--> Arrow
    list <--> Arrow

  classDef outlined fill:#fff,stroke:#333

The same stands for various libraries within one language - Arrow enables interchange between various language library formats (for example, a Python Pandas dataframe and Python dictionary are two distinct in-memory formats which may require conversions). Conversions to or from these formats can involve data type or other inferences which are costly to productivity. You can save time and effort by avoiding conversions using Arrow.

Using SQL to Join or Transform Arrow Data via DuckDB

flowchart LR
    subgraph duckdb["DuckDB Processing"]
        direction BT
        SQL[SQL] --> DuckDB[DuckDB Client]
    end
    parquet1[example.parquet] --> duckdb
    sqlite[example.sqlite] --> duckdb
    csv[example.csv] --> duckdb
    arrow["in-memory Arrow"] --> duckdb
    pandas["in-memory Pandas"] --> duckdb
    duckdb --> Arrow
    Arrow --> Other[Other work...]

DuckDB provides a management client and relational database format (similar to SQLite databases) which may be handled with Arrow. SQL may be used with the DuckDB client to filter, join, or change various data types. Due to Arrow’s cross-language properties, there is no additional cost to using SQL through DuckDB to return data for implementation within other purpose-built data formats. DuckDB provides client API’s in many languages (for example, Python, R, and C++), making it possible to write DuckDB client code with SQL to manage data without having to use manually written sub-procedures.

flowchart TB
  subgraph duckdb["DuckDB Processing"]
        direction BT
        SQL[SQL] --> DuckDB[DuckDB Client]
    end
    Python:::outlined <--> duckdb
    R:::outlined <--> duckdb
    C++:::outlined <--> duckdb
    Java:::outlined <--> duckdb
    others...:::outlined <--> duckdb
    duckdb <--> Arrow

    classDef outlined fill:#fff,stroke:#333

Using SQL to perform these operations with Arrow provides an opportunity for your data code to be used (or understood) within other languages without additional rewrites. SQL also provides you access to roughly 48 years worth of data management improvements without being constrained by imperative language data models or schema (reference: SQL Wikipedia: First appeared: 1974).

Example with SQL to Join Arrow Data with DuckDB in Python

Jupyter notebook example screenshot with DuckDB and Arrow data handling
Jupyter notebook example screenshot with DuckDB and Arrow data handling

The following example notebook shows how to use SQL to join data from multiple sources using the DuckDB client API within Python. The example includes DuckDB querying a remote CSV, local Parquet file, and Arrow in-memory tables.

Linked Example

Additional Resources

Please see the following the additional resources.

Previous post
Tip of the Week: Diagrams as Code
Next post
Tip of the Week: Remove Unused Code to Avoid Software Decay