How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python

By

Introduction

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, triggering countless garbage-collector allocations, and then discarding it all to build the DataFrame. That overhead is now gone. The mssql-python driver, thanks to a community contribution by Felix Graßl (@ffelixg), can fetch SQL Server data directly as Apache Arrow structures. This guide walks you through using Arrow support in mssql-python to make your data pipelines faster and more memory-efficient, whether you work with Polars, Pandas, DuckDB, or any other Arrow-native library.

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com

What You Need

  • Python 3.8+ installed on your system.
  • mssql-python driver (version 1.2.0 or later) with Arrow support. Install via pip install mssql-python[arrow].
  • A target library that can consume Arrow data: Polars, Pandas (with ArrowDtype), DuckDB, or similar.
  • SQL Server database with appropriate credentials and network access.

Step-by-Step Guide

Step 1: Install Required Packages

First, ensure you have the latest mssql-python package with Arrow support. Open your terminal and run:

pip install mssql-python[arrow] polars

This installs both the driver and Polars as an example consumer. For Pandas or DuckDB, adjust accordingly.

Step 2: Import Libraries

Create a Python script and import the necessary modules:

import mssql
import polars as pl
from mssql import connect

The mssql module provides the driver; polars will be used to verify the zero-copy integration.

Step 3: Establish a Connection

Use your SQL Server credentials to create a connection. Replace placeholders with your actual server, database, username, and password:

connection = connect(
    server="your_server.database.windows.net",
    database="your_database",
    username="your_username",
    password="your_password"
)
cursor = connection.cursor()

Step 4: Execute a Query and Fetch as Arrow

Execute a SELECT query. The key new method is fetcharrow(), which returns an Arrow Table directly—no Python objects per row are created:

cursor.execute("SELECT TOP 1000000 * FROM large_table")
arrow_table = cursor.fetcharrow()

This call runs entirely in C++, writing values into contiguous, typed Arrow buffers. Nulls are tracked with a compact bitmap—no None objects per cell.

Step 5: Convert to Your DataFrame Library

Because Arrow uses a stable ABI (the Arrow C Data Interface), conversion is instantaneous—zero-copy:

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com
  • To Polars: df = pl.from_arrow(arrow_table)
  • To Pandas: df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)
  • To DuckDB: duckdb.sql("SELECT * FROM arrow_table")

No serialization, no copies, no re-parsing—just a pointer exchange.

Step 6: Perform Operations Without Python Overhead

Once data is in an Arrow-native DataFrame, operations like filters, joins, and aggregations also work in-place on the same shared memory buffers. For example, in Polars:

result = df.filter(pl.col("datetime_col") > "2023-01-01").group_by("category").agg(pl.col("value").sum())

No intermediate Python objects are materialized at any stage—this is the foundation for high-throughput pipelines.

Tips and Best Practices

  • Speed gains are most noticeable with temporal types (like DATETIME and DATETIMEOFFSET) because per-value Python-side conversions are eliminated entirely.
  • Memory usage drops dramatically: A column of one million integers becomes a single contiguous C array, not a million individual Python objects. Great for large datasets.
  • Seamless interoperability with other Arrow-native tools—Polars, Pandas (with ArrowDtype), DuckDB, and even tools like Hugging Face datasets—means you can mix and match libraries without conversion costs.
  • Ensure you're using a recent version of mssql-python (1.2.0+) to have the fetcharrow() method available.
  • For best results, avoid fetching rows individually—always use bulk fetch methods like fetcharrow() when working with Arrow.
  • Monitor your garbage collector; with Arrow, you'll see far less GC pressure, making your overall application more predictable.

Related Articles

Recommended

Discover More

8 Key Facts About the Philippines' Offshore Wind Revolution and Its 11 TWh PromiseWidening Math Gender Gap: Post-Pandemic Data Shows Girls Falling Behind Boys Globally8 Critical Steps to Operationalize Responsible AI at Enterprise ScaleBreaking: Microsoft Overhauls Windows Insider Program as Windows 11 25H2 ShipsHow to Respond to a Suspected Hantavirus Outbreak on a Cruise Ship: A Step-by-Step Guide