Overview
polars is an open-source DataFrame library designed for high-performance data processing and analysis. Developed in Rust, it provides language bindings for Python, Node.js, and other environments, so developers can perform complex data manipulations efficiently. The library's architecture uses a columnar memory model, which optimizes data access patterns and memory usage and is particularly beneficial when working with large datasets. This design choice contributes to polars' performance, which often exceeds that of row-oriented DataFrame implementations for specific operations.
A key aspect of polars' design is its emphasis on lazy evaluation. In the lazy API, operations are not executed immediately but are instead recorded in an execution plan. The plan is then optimized and executed only when a result is explicitly requested, for example, when data needs to be collected or written. This approach allows polars to perform query optimizations, such as predicate pushdown (applying filters as early as possible) and projection pushdown (reading only the columns a query needs), reducing the amount of data processed and improving overall execution speed. An eager API that runs each operation immediately is also available. For data professionals accustomed to tools like SQL or pandas, polars offers an expression-based API that supports declarative data transformations.
polars is particularly well-suited for scenarios requiring fast, in-memory processing of substantial data volumes. Its capabilities extend to various data analysis tasks, including data cleaning, transformation, aggregation, and joining. The library supports a wide range of data types and provides robust functionalities for handling missing values, filtering, and sorting. While it excels in single-machine, in-memory operations, its performance characteristics make it a viable option for many tasks that might otherwise be considered for distributed computing frameworks. The project was founded in 2020 and has gained traction within the data science community for its speed and efficiency, offering a compelling alternative for data manipulation workflows.
Developers who need to process gigabytes of data on a single machine without resorting to distributed systems often find polars to be an effective solution. Its design prioritizes speed and memory efficiency, making it valuable for interactive data exploration as well as batch processing. The Python API, in particular, aims for a familiar feel for users transitioning from other DataFrame libraries, while introducing new paradigms like its expression system for more flexible and optimized query construction. The library's active development and growing ecosystem contribute to its utility for modern data engineering and analysis tasks.
Key features
- Columnar Memory Model: Stores data in columns rather than rows, enhancing cache efficiency and performance for analytical queries by reducing memory access overhead.
- Lazy Evaluation: Computations are deferred and optimized into an execution plan before being run, enabling performance enhancements like predicate pushdown and projection pushdown. This is similar to how query optimizers work in relational databases.
- Expression-Based API: Provides a declarative way to define data transformations and aggregations, allowing for more concise and optimizable code compared to imperative approaches. The polars user guide on expressions details this approach.
- High Performance (Rust-powered): Written in Rust, polars benefits from the language's speed and memory safety, making it efficient for large-scale data operations.
- Multi-language Support: Offers official SDKs for Python and Rust, along with community-driven bindings for Node.js, expanding its accessibility across different development environments.
- Parallel Processing: Utilizes multiple CPU cores automatically for many operations, speeding up computations on modern hardware.
- Out-of-Core Capabilities: While primarily in-memory, polars can handle some datasets larger than available RAM by lazily scanning file formats like Apache Parquet, so that only the columns and rows a query needs are read.
- Rich Type System: Supports a wide array of data types, including numerical, string, boolean, date/time, and categorical types, with robust type inference and casting capabilities.
- Missing Data Handling: Provides comprehensive tools for identifying, filtering, and imputing missing values within DataFrames.
Pricing
polars is an open-source project released under the MIT License. It is free to use for both commercial and non-commercial purposes, with no associated licensing fees or proprietary tiers.
| Feature | Availability | Cost | Notes |
|---|---|---|---|
| Core Library | All users | Free | Includes all DataFrame functionalities, lazy and eager APIs. |
| Python Bindings | All users | Free | Available via PyPI. |
| Rust Crate | All users | Free | Available via crates.io for Rust projects. |
| Node.js Bindings | All users | Free | Community-driven, available via npm. |
| Community Support | All users | Free | Support through GitHub issues and community forums. |
For more details on the open-source licensing, refer to the polars license documentation.
Common integrations
- Apache Parquet: polars offers direct and highly optimized reading and writing of Parquet files, a columnar storage format, which is crucial for large-scale data workflows. The polars user guide on Parquet I/O provides usage examples.
- CSV Files: Seamlessly reads and writes Comma Separated Values (CSV) files, supporting various delimiters, encodings, and parsing options. The polars CSV input/output documentation explains these features.
- JSON Files: Provides functionality to read and write data in JSON format, often used for semi-structured data exchange.
- Databases (via connectors): While not having native connectors for all databases, polars can easily integrate with database results through existing Python database connectors (e.g., psycopg2 for PostgreSQL, pyodbc for various ODBC-compliant databases) by converting query results into DataFrames.
- NumPy: Can convert between polars DataFrames/Series and NumPy arrays, facilitating interoperability with the broader Python scientific computing ecosystem. The NumPy documentation provides an overview of the array library.
- Pandas: Offers methods to convert polars DataFrames to pandas DataFrames and vice-versa, allowing users to leverage both libraries within the same workflow. This is particularly useful for migrating existing pandas code or using pandas-specific functionalities.
- Arrow (Apache Arrow): polars uses the Apache Arrow columnar memory format, enabling efficient data exchange with other Arrow-compatible systems, often without copying or serializing the data.
Alternatives
- pandas: A widely used Python library for data manipulation and analysis, offering a DataFrame object. pandas also stores data column-wise, but it executes operations eagerly and runs most of them on a single core, which can make it less performant than polars for large datasets and specific operations.
- Apache Spark: A unified analytics engine for large-scale data processing, often used for distributed computing across clusters. Spark's DataFrame API provides similar functionalities to polars but is designed for distributed environments, making it more complex for single-machine, in-memory tasks.
- DuckDB: An in-process SQL OLAP database management system designed for analytical queries. DuckDB focuses on SQL as its primary interface and is highly optimized for analytical workloads, serving as an alternative for SQL-centric data analysis tasks.
Getting started
To begin using polars in Python, first install the library using pip:
pip install polars
Once installed, you can create a DataFrame and perform some basic operations. The following example demonstrates creating a DataFrame, adding a new column, filtering rows, and grouping data:
import polars as pl

# Create a sample DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 28, 22],
    "city": ["New York", "London", "Paris", "New York", "London"],
    "score": [85, 92, 78, 95, 88],
}
df = pl.DataFrame(data)
print("Original DataFrame:")
print(df)

# Add a new column based on existing data (e.g., age in months)
df_with_months = df.with_columns(
    (pl.col("age") * 12).alias("age_months")
)
print("\nDataFrame with 'age_months' column:")
print(df_with_months)

# Filter rows where age is greater than 25
df_filtered = df.filter(pl.col("age") > 25)
print("\nFiltered DataFrame (age > 25):")
print(df_filtered)

# Group by 'city' and calculate the average score
df_grouped = df.group_by("city").agg(
    pl.col("score").mean().alias("average_score")
)
print("\nGrouped DataFrame (average score by city):")
print(df_grouped)

# Demonstrate lazy evaluation with a more complex chain
lazy_df = (
    pl.LazyFrame(data)
    .filter(pl.col("score") > 80)
    .with_columns(pl.col("age").rank(method="average").alias("age_rank"))
    .group_by("city")
    .agg(
        pl.col("age_rank").mean().alias("avg_age_rank_in_city"),
        pl.col("name").count().alias("city_count"),
    )
    .sort("city")
)
print("\nLazy DataFrame plan (before execution):")
print(lazy_df)
print("\nCollecting results of lazy DataFrame:")
print(lazy_df.collect())
This example demonstrates the creation of a DataFrame, basic column manipulation using with_columns, row selection with filter, and aggregation with group_by. It also introduces the concept of lazy evaluation using pl.LazyFrame, where operations are chained and only executed when .collect() is called. This separation of plan definition and execution is a core feature for performance optimization in polars. For further learning, the polars user guide provides comprehensive documentation and examples.