Tip of the Week: Data Engineering with SQL, Arrow and DuckDB
Tip of the Week: Data Engineering with SQL, Arrow and DuckDB
Each week we seek to provide a software tip of the week geared towards helping you achieve your software goals. Views expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out to #software-engineering on Slack or email DBMISoftwareEngineering at olucdenver.onmicrosoft.com
Apache Arrow is a language-independent and high performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high performance way to work with Arrow data where otherwise one may be forced to language-specific data structures or transforms.
TLDR (too long, didn’t read); DuckDB may be used to access and transform Arrow-based data from multiple data formats through SQL. Using Arrow and DuckDB provides a cross-language way to access and manage data. Data development with these tools may also enable improvements in performance, understandability, or long term maintainability of your code.
Reduce Wasted Conversion Effort with Arrow
flowchart TB Python:::outlined <--> Arrow R:::outlined <--> Arrow C++:::outlined <--> Arrow Java:::outlined <--> Arrow others...:::outlined <--> Arrow classDef outlined fill:#fff,stroke:#333
Arrow provides a multi-language data format which prevents you from needing to convert to other formats when dealing with multiple in-memory or serialized data formats. For example, this means that a Python and an R package may use the same in-memory or file-based data without conversion (where normally a Python Pandas dataframe and R data frame may require a conversion step in between).
flowchart TB subgraph Python Pandas:::outlined Polars:::outlined dict[Python dict]:::outlined list[Python list]:::outlined end Pandas <--> Arrow Polars <--> Arrow dict <--> Arrow list <--> Arrow classDef outlined fill:#fff,stroke:#333
The same stands for various libraries within one language - Arrow enables interchange between various language library formats (for example, a Python Pandas dataframe and Python dictionary are two distinct in-memory formats which may require conversions). Conversions to or from these formats can involve data type or other inferences which are costly to productivity. You can save time and effort by avoiding conversions using Arrow.
Using SQL to Join or Transform Arrow Data via DuckDB
flowchart LR subgraph duckdb["DuckDB Processing"] direction BT SQL[SQL] --> DuckDB[DuckDB Client] end parquet1[example.parquet] --> duckdb sqlite[example.sqlite] --> duckdb csv[example.csv] --> duckdb arrow["in-memory Arrow"] --> duckdb pandas["in-memory Pandas"] --> duckdb duckdb --> Arrow Arrow --> Other[Other work...]
DuckDB provides a management client and relational database format (similar to SQLite databases) which may be handled with Arrow. SQL may be used with the DuckDB client to filter, join, or change various data types. Due to Arrow’s cross-language properties, there is no additional cost to using SQL through DuckDB to return data for implementation within other purpose-built data formats. DuckDB provides client API’s in many languages (for example, Python, R, and C++), making it possible to write DuckDB client code with SQL to manage data without having to use manually written sub-procedures.
flowchart TB subgraph duckdb["DuckDB Processing"] direction BT SQL[SQL] --> DuckDB[DuckDB Client] end Python:::outlined <--> duckdb R:::outlined <--> duckdb C++:::outlined <--> duckdb Java:::outlined <--> duckdb others...:::outlined <--> duckdb duckdb <--> Arrow classDef outlined fill:#fff,stroke:#333
Using SQL to perform these operations with Arrow provides an opportunity for your data code to be used (or understood) within other languages without additional rewrites. SQL also provides you access to roughly 48 years worth of data management improvements without being constrained by imperative language data models or schema (reference: SQL Wikipedia: First appeared: 1974).
Example with SQL to Join Arrow Data with DuckDB in Python
The following example notebook shows how to use SQL to join data from multiple sources using the DuckDB client API within Python. The example includes DuckDB querying a remote CSV, local Parquet file, and Arrow in-memory tables.
Additional Resources
Please see the following the additional resources.