8 releases (4 breaking)
0.5.1 | Oct 14, 2024 |
---|---|
0.5.0 | Oct 11, 2024 |
0.4.0 | Oct 3, 2024 |
0.3.0 | Sep 27, 2024 |
0.1.0 | Jun 28, 2024 |
#740 in Database interfaces
1,028 downloads per month
Used in pyo3-geoarrow
215KB
5K
SLoC
pyo3-arrow
Lightweight Apache Arrow integration for pyo3. Designed to make it easier for Rust libraries to add interoperable, zero-copy Python bindings.
Specifically, pyo3-arrow implements zero-copy FFI conversions between Python objects and Rust representations using the arrow
crate. This relies heavily on the Arrow PyCapsule Interface for seamless interoperability across the Python Arrow ecosystem.
Usage
We can wrap a function to be used in Python with just a few lines of code.
When you use a struct defined in pyo3_arrow
as an argument to your function, it will automatically convert user input to a Rust arrow
object via zero-copy FFI. Then once you're done, call to_arro3
or to_pyarrow
to export the data back to Python.
use pyo3::prelude::*;
use pyo3_arrow::error::PyArrowResult;
use pyo3_arrow::PyArray;
/// Take elements by index from an Array, creating a new Array from those
/// indexes.
#[pyfunction]
pub fn take(py: Python, values: PyArray, indices: PyArray) -> PyArrowResult<PyObject> {
// We can call py.allow_threads to ensure the GIL is released during our
// operations
// This example just wraps `arrow_select::take::take`
let output_array =
py.allow_threads(|| arrow_select::take::take(values.as_ref(), indices.as_ref(), None))?;
// Construct a PyArray and export it to the arro3 Python Arrow
// implementation
Ok(PyArray::new(output_array, values.field().clone()).to_arro3(py)?)
}
Then on the Python side, we can call this function (exported via arro3.compute.take
):
import pyarrow as pa
from arro3.compute import take
arr = pa.array([2, 3, 0, 1])
output = take(arr, arr)
output
# <arro3.core._rust.Array at 0x10787b510>
pa.array(output)
# <pyarrow.lib.Int64Array object at 0x10aa11000>
# [
# 0,
# 1,
# 2,
# 3
# ]
In this example, we use pyarrow to create the original array and to view the result, but the use of pyarrow is not required. It does, at least, show how the Arrow PyCapsule Interface makes it seamless to share these Arrow objects between Python Arrow implementations.
Using Arrow data as input
Just include one of the pyo3-arrow structs in your function signature, and user input will be transparently converted
This uses the Arrow PyCapsule Interface. But note that that only defines three methods, and pyo3-arrow
contains more the three structs. Several structs are overloaded and use the same underlying transport mechanism.
For example, PySchema
and PyField
both use the __arrow_c_schema__
mechanism, but with different behavior. The former expects the transported field to be a struct type, and its children get unpacked to be the fields of the schema, while the latter has no constraint and passes a field through as-is. PySchema
will error if the passed field is not of struct type.
Struct name | Unpacks struct field |
---|---|
PySchema |
Yes |
PyField |
No |
PyArray
and PyRecordBatch
both use the __arrow_c_array__
mechanism:
Struct name | Unpacks StructArray to RecordBatch |
---|---|
PyRecordBatch |
Yes |
PyArray |
No |
PyTable
, PyChunkedArray
, and PyRecordBatchReader
all use the __arrow_c_stream__
mechanism:
Struct name | Unpacks StructArray to RecordBatch |
Materializes in memory |
---|---|---|
PyTable |
Yes | Yes |
PyRecordBatchReader |
Yes | No |
PyChunkedArray |
No | Yes |
PyArrayReader |
No | No |
Returning Arrow data back to Python
Using your own classes
If you're exporting your own Arrow-compatible classes to Python, you can implement the relevant Arrow PyCapsule Interface methods directly on your own classes.
You can use the helper functions to_array_pycapsules
, to_schema_pycapsule
, and to_stream_pycapsule
in the ffi
module to simplify exporting your data.
To export stream data, add a method to your class with the following signature:
use arrow_array::ArrayRef;
use arrow_schema::FieldRef;
use pyo3_arrow::ffi::{to_stream_pycapsule, ArrayIterator};
use pyo3::types::PyCapsule;
fn __arrow_c_stream__<'py>(
&'py self,
py: Python<'py>,
requested_schema: Option<Bound<'py, PyCapsule>>,
) -> PyResult<Bound<'py, PyCapsule>> {
let field: FieldRef = ...;
let arrays: Vec<ArrayRef> = ...;
let array_reader =
Box::new(ArrayIterator::new(arrays.into_iter().map(Ok), field));
to_stream_pycapsule(py, array_reader, requested_schema)
}
Exporting schema or array data is similar, just with the __arrow_c_schema__
and __arrow_c_array__
methods instead.
If you don't wish to export your own classes, refer to one of the solutions below.
Using arro3.core
arro3.core
is a very minimal Python Arrow implementation, designed to be lightweight (<1MB) and relatively stable. In comparison, pyarrow is on the order of ~100MB.
You must depend on the arro3-core
Python package; then you can use the to_arro3
method of each exported Arrow object to pass the data into an arro3.core
class.
Rust struct | arro3 class |
---|---|
PyField |
arro3.core.Field |
PySchema |
arro3.core.Schema |
PyArray |
arro3.core.Array |
PyArrayReader |
arro3.core.ArrayReader |
PyRecordBatch |
arro3.core.RecordBatch |
PyChunkedArray |
arro3.core.ChunkedArray |
PyTable |
arro3.core.Table |
PyRecordBatchReader |
arro3.core.RecordBatchReader |
Using pyarrow
pyarrow
, the canonical Python Arrow implementation, is a very large dependency. It's roughly 100MB in size on its own, plus 35MB more for its hard dependency on numpy. However, numpy
is very likely already in the user environment, and pyarrow
is quite common as well, so requiring a pyarrow
dependency may not be a problem.
In this case, you must depend on pyarrow
and you can use the to_pyarrow
method of Python structs to return data to Python. This requires pyarrow>=14
(pyarrow>=15
is required to return PyRecordBatchReader
).
Rust struct | pyarrow class |
---|---|
PyField |
pyarrow.Field |
PySchema |
pyarrow.Schema |
PyArray |
pyarrow.Array |
PyRecordBatch |
pyarrow.RecordBatch |
PyChunkedArray |
pyarrow.ChunkedArray |
PyTable |
pyarrow.Table |
PyRecordBatchReader |
pyarrow.RecordBatchReader |
pyarrow
does not have the equivalent of a PyArrayReader
, but if the materialized data fits in memory, you can convert a PyArrayReader
to a PyChunkedArray
and pass that to pyarrow
.
Using nanoarrow
nanoarrow
is an alternative Python library for working with Arrow data. It's similar in goals to arro3, but is written in C instead of Rust. Additionally, it has a smaller type system than pyarrow
or arro3
, with logical arrays and record batches both represented by the nanoarrow.Array
class.
In this case, you must depend on nanoarrow
and you can use the to_nanoarrow
method of Python structs to return data to Python.
Rust struct | nanoarrow class |
---|---|
PyField |
nanoarrow.Schema |
PySchema |
nanoarrow.Schema |
PyArray |
nanoarrow.Array |
PyRecordBatch |
nanoarrow.Array |
PyArrayReader |
nanoarrow.ArrayStream |
PyChunkedArray |
nanoarrow.ArrayStream |
PyTable |
nanoarrow.ArrayStream |
PyRecordBatchReader |
nanoarrow.ArrayStream |
Version compatibility
pyo3-arrow | pyo3 | arrow-rs |
---|---|---|
0.1.x | 0.21 | 52 |
0.2.x | 0.21 | 52 |
0.3.x | 0.21 | 53 |
0.4.x | 0.21 | 53 |
0.5.x | 0.22 | 53 |
Buffer Protocol
pyo3-arrow will automatically interpret Python objects that implement the Python Buffer Protocol. This is implemented as part of the FromPyObject
impl on PyArray
. So if your function accepts PyArray
, it will automatically accept buffer protocol input. This conversion is zero-copy.
Multi-dimensional buffer protocol objects are interpreted as nested fixed size lists.
Buffer protocol support is behind a buffer_protocol
feature flag (turned on by default), as it requires either the abi3-py311
pyo3 feature or building non-abi3 wheels.
Why not use arrow-rs's Python integration?
arrow-rs has some existing Python integration, but there are a few reasons why I created pyo3-arrow
:
- arrow-rs's Python FFI integration does not support Arrow extension types, because it omits field metadata when constructing an
Arc<dyn Array>
. pyo3-arrow gets around this by storing both anArrayRef
(Arc<dyn Array>
) and aFieldRef
(Arc<Field>
) in aPyArray
struct. - arrow-rs has no ability to work with an Arrow stream of bare arrays that are not record batches, and so arrow-rs has no way to interop with a
pyarrow.ChunkedArray
orpolars.Series
. - pyo3-arrow will automatically convert Python buffer protocol objects (such as numpy arrays,
memoryview
s,bytes
objects, and more) toPyArray
. This conversion is zero copy. - arrow-rs only supports returning data to pyarrow. Pyarrow is a very large dependency (its unpacked Linux wheels are 130MB, not including a required dependency on Numpy) and some projects may wish not to use it. Now that the Arrow PyCapsule interface exists, it's possible to have a modular approach, where a very small library contains core Arrow objects, and works seamlessly with other libraries.
- pyo3-arrow implements schema negotiation for the PyCapsule Interface on data export, allowing other Python libraries to request Arrow data types they know how to handle.
- In my opinion arrow-rs is too tightly connected to pyo3 and pyarrow. pyo3 releases don't line up with arrow-rs's release cadence, which means it could be a bit of a wait to use the latest pyo3 version with arrow-rs, especially with arrow-rs waiting longer to release breaking changes.
Dependencies
~17–25MB
~371K SLoC