Overview
Pandas is an open-source Python library designed for data manipulation and analysis. It provides data structures and functions needed to work with structured data, making it a foundational tool for data scientists and analysts using Python. The library's primary data structures, DataFrame and Series, are optimized for performance and ease of use, enabling efficient handling of large datasets. A DataFrame is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table. A Series is a one-dimensional labeled array capable of holding any data type.
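As a minimal sketch of how the two structures relate, the following builds a Series and then a DataFrame that shares its index. The fruit labels and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array; here the labels are strings.
prices = pd.Series([1.50, 0.25, 3.00], index=["apple", "banana", "cherry"])

# A DataFrame holds columns of potentially different types under one shared index.
# The list-valued columns must match the length of the Series index.
inventory = pd.DataFrame(
    {
        "price": prices,
        "in_stock": [True, False, True],
        "quantity": [10, 0, 4],
    }
)

# Selecting a single column returns a Series.
quantity = inventory["quantity"]

print(inventory.dtypes)          # one dtype per column
print(type(quantity).__name__)   # Series
print(prices["banana"])          # label-based lookup
```

Each column keeps its own dtype, which is what lets a single DataFrame mix floating-point, boolean, and integer data.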
Pandas excels in various data-related tasks, including data cleaning and preparation, where it offers robust tools for handling missing data, filtering, and transforming datasets. For exploratory data analysis, pandas facilitates quick data inspection, aggregation, and visualization setup, allowing users to gain insights rapidly. Its capabilities extend to time series analysis, providing specialized functions for date and time indexing, frequency conversion, and moving window calculations. Furthermore, pandas is frequently used to prepare data for statistical modeling and machine learning algorithms, serving as an essential preprocessing step. The library's design emphasizes intuitive syntax for common operations, supported by extensive documentation and a large, active community.
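The time series capabilities mentioned above (date indexing, frequency conversion, and moving windows) can be sketched roughly as follows. The daily readings are made-up values, and the window size and weekly frequency are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical daily readings indexed by date (2024-01-01 is a Monday).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
readings = pd.Series(range(10), index=dates, dtype="float64")

# Frequency conversion: downsample daily values to weekly means.
# "W" groups into weeks that end on Sunday.
weekly_mean = readings.resample("W").mean()

# Moving window calculation: 3-day rolling mean.
# The first two entries are NaN because the window is not yet full.
rolling_mean = readings.rolling(window=3).mean()

# Date-based indexing: slice by date strings (both endpoints are included).
first_week = readings.loc["2024-01-01":"2024-01-07"]

print(weekly_mean)
print(rolling_mean.head())
print(len(first_week))
```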
Developed by Wes McKinney in 2008, pandas filled a need for a high-performance, flexible tool for quantitative analysis in Python. It has since become a cornerstone of the Python data science ecosystem, often used in conjunction with other libraries like NumPy for numerical operations, Matplotlib and Seaborn for visualization, and scikit-learn for machine learning. Its interoperability with these tools enhances its utility across the entire data analysis pipeline. The library is distributed under a BSD 3-Clause license, reflecting its commitment to open-source principles and community contributions.
Key features
- DataFrame Object: A two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). It can store data of different types in each column, similar to a spreadsheet or a database table.
- Series Object: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is the primary building block for DataFrames.
- Data Alignment: Automatically aligns data by labels, preventing common errors that arise from misaligned data when performing operations like merging or joining.
- Missing Data Handling: Comprehensive tools for identifying, removing, or imputing missing data (`NaN` values), crucial for data cleaning and preparation.
- Data Reshaping and Pivoting: Functions for reshaping data, such as `pivot`, `melt`, `stack`, and `unstack`, allowing flexible transformation of data layouts for analysis.
- Group By Functionality: Powerful `groupby` engine for splitting data into groups based on one or more keys, applying a function to each group independently, and then combining the results. This is fundamental for aggregate analysis.
- Time Series Functionality: Robust tools for working with time-stamped data, including date range generation, frequency conversion, moving window statistics, and date/time indexing.
- Input/Output Tools: Facilitates reading and writing data in various formats, including CSV, Excel, SQL databases, HDF5, JSON, and Parquet, simplifying data ingestion and export tasks.
- High Performance: Performance-critical routines are implemented in Cython and C, and pandas builds on NumPy's optimized array structures, which keeps common operations fast on large datasets.
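Several of the features listed above can be illustrated in one short tour. This is a sketch under simple assumptions: the `sales` table, its column names, and all values are hypothetical, and the CSV round trip uses an in-memory buffer rather than a real file.

```python
import io

import pandas as pd

# Data alignment: arithmetic matches values by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned_sum = a + b  # labels present in only one operand yield NaN

# Missing data handling: detect, drop, or fill NaN values.
missing_count = aligned_sum.isna().sum()
dropped = aligned_sum.dropna()
filled = aligned_sum.fillna(0)

# Group by: split rows by key, aggregate each group, combine the results.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [100, 150, 80, 120],
    }
)
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Reshaping: pivot long data to wide form, then melt it back to long form.
wide_form = sales.pivot(index="region", columns="quarter", values="revenue")
long_form = wide_form.reset_index().melt(
    id_vars="region", var_name="quarter", value_name="revenue"
)

# Input/output: round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
sales.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(aligned_sum)
print(revenue_by_region)
print(wide_form)
```

Note that `a + b` produces a result indexed by the union of both label sets, so only `y` and `z` carry numeric values. Whether to drop or fill the resulting gaps depends on the analysis.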
Pricing
| Product | Pricing Model | Details | As Of Date |
|---|---|---|---|
| pandas library | Free and Open-Source | Available under the BSD 3-Clause License. No licensing fees for commercial or non-commercial use. | 2026-06-19 |
Common integrations
- NumPy: Provides the fundamental array objects and numerical computing routines that pandas builds upon. Pandas objects like Series and DataFrames are built on top of NumPy arrays, ensuring efficient numerical operations.
- Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python. Pandas has built-in plotting methods that wrap Matplotlib functions, allowing for quick data visualization directly from DataFrames and Series.
- SciPy: An ecosystem of open-source software for mathematics, science, and engineering. Pandas often integrates with SciPy for advanced scientific computing, statistical analysis, and optimization tasks.
- scikit-learn: A popular machine learning library in Python. Pandas DataFrames are commonly used as input for scikit-learn models, providing structured data for training and prediction.
- Jupyter Notebook/Lab: Interactive computing environments where pandas is extensively used for data exploration, analysis, and visualization in a cell-by-cell execution model.
- Dask: A flexible library for parallel computing in Python. Dask DataFrames provide a parallelized version of pandas DataFrames, enabling scaling of pandas workflows to larger-than-memory datasets or distributed computing environments.
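Of the integrations above, the NumPy one is the easiest to show without extra dependencies. The sketch below moves data from pandas to NumPy and back. The column names and measurements are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0],
        "weight_kg": [50.0, 60.0, 70.0],
    }
)

# Export the values as a plain two-dimensional NumPy array.
values = df.to_numpy()

# NumPy universal functions apply elementwise and return a labeled Series.
log_heights = np.log(df["height_cm"])

# Compute with NumPy, then wrap the result back into a labeled DataFrame.
centered = pd.DataFrame(values - values.mean(axis=0), columns=df.columns)

print(values.shape)
print(log_heights)
print(centered)
```

The same pattern, handing an array or DataFrame of numeric features to another library, is how pandas data typically reaches tools such as scikit-learn.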
Alternatives
- Polars: A DataFrame library written in Rust that offers high performance and memory efficiency, particularly for large datasets, and supports lazy evaluation.
- NumPy: The foundational package for numerical computing with Python, providing powerful N-dimensional array objects and functions for linear algebra, Fourier transforms, and random number generation.
- Dask: A flexible library for parallel computing with Python, offering Dask DataFrames that mimic pandas DataFrames but can operate on larger-than-memory datasets or across clusters.
- Apache Spark (PySpark): A unified analytics engine for large-scale data processing, offering a DataFrame API for Python that allows distributed data manipulation and analysis across clusters.
- R data frames: The native structure for tabular data in the R programming language (`data.frame`), providing functionality similar to pandas DataFrames for data manipulation and statistical analysis.
Getting started
To begin using pandas, you first need to install it. The most common method is via pip:
```shell
pip install pandas
```
Once installed, you can import the library and start working with DataFrames. The following example demonstrates how to create a DataFrame, perform basic operations like selecting columns, and filter rows.
```python
import pandas as pd

# Create a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
    'Occupation': ['Engineer', 'Artist', 'Student', 'Doctor', 'Designer']
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\n")

# Select a single column
print("Selected 'Name' column:")
print(df['Name'])
print("\n")

# Select multiple columns
print("Selected 'Name' and 'Age' columns:")
print(df[['Name', 'Age']])
print("\n")

# Filter rows where Age is greater than 25
print("Rows where Age > 25:")
print(df[df['Age'] > 25])
print("\n")

# Add a new column
df['Salary'] = [70000, 60000, 30000, 95000, 65000]
print("DataFrame with 'Salary' column:")
print(df)
print("\n")

# Calculate the average age
print(f"Average Age: {df['Age'].mean()}")
```
This code snippet initializes a DataFrame, prints its contents, and then demonstrates how to select specific columns, filter data based on conditions, and add a new column. Finally, it calculates a basic statistic (mean age), illustrating the ease of performing analytical operations with pandas. For more detailed guides and advanced features, consult the pandas API reference documentation.