What is polars best used for?

polars is best used for high-performance, in-memory data manipulation and analysis of large datasets on a single machine. It excels in scenarios requiring fast data transformations, aggregations, and filtering using an expression-based querying system.

How does polars compare to pandas?

polars is designed for higher performance than pandas on large datasets, thanks to its Rust core, Arrow-based columnar memory format, multithreaded execution, and optional lazy evaluation with query optimization. pandas executes eagerly, runs most operations on a single thread, and has traditionally stored its data in NumPy arrays, which can be less efficient for some analytical workloads. Both offer DataFrame APIs for data manipulation.

Does polars support lazy evaluation?

Yes, polars fully supports lazy evaluation. This means that operations are built into an optimized execution plan and only run when results are explicitly requested, allowing for performance optimizations like predicate pushdown.

What programming languages can I use with polars?

polars primarily offers official SDKs for Python and Rust. There are also community-driven bindings available for Node.js, extending its usability to JavaScript environments.

Is polars free to use?

Yes, polars is an open-source project released under the MIT License, making it free to use for both commercial and non-commercial applications without any licensing fees.

Can polars handle datasets larger than RAM?

polars is primarily an in-memory library, but its lazy API can work with datasets larger than available RAM. Scanning a columnar format such as Apache Parquet lets polars read only the columns and rows a query needs, and its streaming execution mode can process the data in batches rather than loading it all at once.

What is the core difference between polars' eager and lazy APIs?

The eager API executes operations immediately, returning results directly. The lazy API constructs an optimized execution plan without immediate computation, only running the plan when a specific action (like `collect()`) is called, enabling more advanced optimizations.

polars — High-Performance DataFrame Library for Data Analysis

Overview

polars is an open-source DataFrame library designed for high-performance data processing and analysis. Developed in Rust, it provides language bindings for Python, Node.js, and other environments, enabling developers to perform complex data manipulations with efficiency. The library's architecture leverages a columnar memory model, which optimizes data access patterns and memory usage, particularly beneficial when working with large datasets. This design choice contributes to polars' performance characteristics, often exceeding those of row-oriented DataFrame implementations for specific operations.

A key aspect of polars' design is its emphasis on lazy evaluation. This means that operations are not executed immediately but rather built into an execution plan. The plan is then optimized and executed only when a result is explicitly requested, for example, when data needs to be collected or written. This approach allows polars to perform query optimizations, such as predicate pushdown and projection pushdown, reducing the amount of data processed and improving overall execution speed. For data professionals accustomed to tools like SQL or pandas, polars offers an intuitive, expression-based API that facilitates declarative data transformations.

polars is particularly well-suited for scenarios requiring fast, in-memory processing of substantial data volumes. Its capabilities extend to various data analysis tasks, including data cleaning, transformation, aggregation, and joining. The library supports a wide range of data types and provides robust functionalities for handling missing values, filtering, and sorting. While it excels in single-machine, in-memory operations, its performance characteristics make it a viable option for many tasks that might otherwise be considered for distributed computing frameworks. The project was founded in 2020 and has gained traction within the data science community for its speed and efficiency, offering a compelling alternative for data manipulation workflows.

Developers who need to process gigabytes of data on a single machine without resorting to distributed systems often find polars to be an effective solution. Its design prioritizes speed and memory efficiency, making it valuable for interactive data exploration as well as batch processing. The Python API, in particular, aims for a familiar feel for users transitioning from other DataFrame libraries, while introducing new paradigms like its expression system for more flexible and optimized query construction. The library's active development and growing ecosystem contribute to its utility for modern data engineering and analysis tasks.

Key features

Columnar Memory Model: Stores data in columns rather than rows, enhancing cache efficiency and performance for analytical queries by reducing memory access overhead.
Lazy Evaluation: Computations are deferred and optimized into an execution plan before being run, enabling performance enhancements like predicate pushdown and projection pushdown. This is similar to how query optimizers work in relational databases.
Expression-Based API: Provides a declarative way to define data transformations and aggregations, allowing for more concise and optimizable code compared to imperative approaches. The polars user guide on expressions details this approach.
High Performance (Rust-powered): Built on Rust, polars leverages its performance characteristics for speed and memory safety, making it efficient for large-scale data operations.
Multi-language Support: Offers official SDKs for Python and Rust, along with community-driven bindings for Node.js, expanding its accessibility across different development environments.
Parallel Processing: Utilizes multiple CPU cores automatically for many operations, speeding up computations on modern hardware.
Out-of-Core Capabilities: While primarily in-memory, polars can handle datasets larger than available RAM by integrating with file formats like Apache Parquet, which allows for efficient reading of subsets of data.
Rich Type System: Supports a wide array of data types, including numerical, string, boolean, date/time, and categorical types, with robust type inference and casting capabilities.
Missing Data Handling: Provides comprehensive tools for identifying, filtering, and imputing missing values within DataFrames.

Pricing

polars is an open-source project released under the MIT License. It is free to use for both commercial and non-commercial purposes, with no associated licensing fees or proprietary tiers.

Feature	Availability	Cost	Notes
Core Library	All users	Free	Includes all DataFrame functionalities, lazy and eager APIs.
Python Bindings	All users	Free	Available via PyPI.
Rust Crate	All users	Free	Available via crates.io for Rust projects.
Node.js Bindings	All users	Free	Community-driven, available via npm.
Community Support	All users	Free	Support through GitHub issues and community forums.

For more details on the open-source licensing, refer to the polars license documentation.

Common integrations

Apache Parquet: polars offers direct and highly optimized reading and writing of Parquet files, a columnar storage format, which is crucial for large-scale data workflows. The polars user guide on Parquet I/O provides usage examples.
CSV Files: Seamlessly reads and writes Comma Separated Values (CSV) files, supporting various delimiters, encodings, and parsing options. The polars CSV input/output documentation explains these features.
JSON Files: Provides functionality to read and write data in JSON format, often used for semi-structured data exchange.
Databases (via connectors): While not having native connectors for all databases, polars can easily integrate with database results through existing Python database connectors (e.g., psycopg2 for PostgreSQL, pyodbc for various ODBC-compliant databases) by converting query results into DataFrames.
NumPy: Can convert between polars DataFrames/Series and NumPy arrays, facilitating interoperability with the broader Python scientific computing ecosystem. The NumPy documentation provides an overview of the array library.
Pandas: Offers methods to convert polars DataFrames to pandas DataFrames and vice-versa, allowing users to leverage both libraries within the same workflow. This is particularly useful for migrating existing pandas code or using pandas-specific functionalities.
Arrow (Apache Arrow): polars uses the Apache Arrow columnar memory format, which enables efficient, often zero-copy, data exchange with other Arrow-compatible systems.

Alternatives

pandas: A widely used Python library for data manipulation and analysis, offering a DataFrame object. While powerful and mature, pandas executes operations eagerly and mostly on a single thread, which can make it slower than polars for large datasets and some operations.
Apache Spark: A unified analytics engine for large-scale data processing, often used for distributed computing across clusters. Spark's DataFrame API provides similar functionalities to polars but is designed for distributed environments, making it more complex for single-machine, in-memory tasks.
DuckDB: An in-process SQL OLAP database management system designed for analytical queries. DuckDB focuses on SQL as its primary interface and is highly optimized for analytical workloads, serving as an alternative for SQL-centric data analysis tasks.

Getting started

To begin using polars in Python, first install the library using pip:

pip install polars

Once installed, you can create a DataFrame and perform some basic operations. The following example demonstrates creating a DataFrame, adding a new column, filtering rows, and grouping data:

import polars as pl

# Create a sample DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 28, 22],
    "city": ["New York", "London", "Paris", "New York", "London"],
    "score": [85, 92, 78, 95, 88]
}
df = pl.DataFrame(data)

print("Original DataFrame:")
print(df)

# Add a new column based on existing data (e.g., age in months)
df_with_months = df.with_columns(
    (pl.col("age") * 12).alias("age_months")
)
print("\nDataFrame with 'age_months' column:")
print(df_with_months)

# Filter rows where age is greater than 25
df_filtered = df.filter(pl.col("age") > 25)
print("\nFiltered DataFrame (age > 25):")
print(df_filtered)

# Group by 'city' and calculate the average score
df_grouped = df.group_by("city").agg(
    pl.col("score").mean().alias("average_score")
)
print("\nGrouped DataFrame (average score by city):")
print(df_grouped)

# Demonstrate lazy evaluation with a more complex chain
lazy_df = (
    pl.LazyFrame(data)
    .filter(pl.col("score") > 80)
    .with_columns(pl.col("age").rank(method="average").alias("age_rank"))
    .group_by("city")
    .agg(
        pl.col("age_rank").mean().alias("avg_age_rank_in_city"),
        pl.col("name").count().alias("city_count")
    )
    .sort("city")
)

print("\nLazy DataFrame plan (before execution):")
print(lazy_df)

print("\nCollecting results of lazy DataFrame:")
print(lazy_df.collect())

This example demonstrates the creation of a DataFrame, basic column manipulation using with_columns, row selection with filter, and aggregation with group_by. It also introduces the concept of lazy evaluation using pl.LazyFrame, where operations are chained and only executed when .collect() is called. This separation of plan definition and execution is a core feature for performance optimization in polars. For further learning, the polars user guide provides comprehensive documentation and examples.
