Pandas (the name derives from "panel data") is one of the best-known Python projects for data analysis and data manipulation. It can import data from spreadsheets, a wide range of SQL databases, and other data sources such as HDF5, and has strong support for working with JSON, XML, and HTML.
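As a rough illustration of the import features mentioned above, the sketch below reads the same small table once from JSON text and once from an SQL query. The table name, column names, and values are made up for the example, and an in-memory SQLite database stands in for a real database server.

```python
import io
import sqlite3

import pandas as pd

# JSON: parse a list of records held in an in-memory text buffer.
json_text = '[{"city": "Oslo", "temp_c": 4.5}, {"city": "Rome", "temp_c": 15.0}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: build a throwaway SQLite database (hypothetical "readings" table)
# and load the result of a query straight into a DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Oslo", 4.5), ("Rome", 15.0)],
)
conn.commit()
from_sql = pd.read_sql_query(
    "SELECT city, temp_c FROM readings ORDER BY city", conn
)
conn.close()

print(from_json)
print(from_sql)
```

Other readers such as read_excel(), read_hdf(), read_xml(), and read_html() follow the same pattern but may need optional dependencies to be installed.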
The primary data structures are the Series (a one-dimensional labeled array holding data of any type) and the DataFrame (a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns).
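A minimal sketch of the two structures follows. The labels, column names, and values are invented for illustration only.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
s = pd.Series([1.5, 2.5, 4.0], index=["a", "b", "c"], name="weight")

# A DataFrame: a table whose columns are Series sharing one row index.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]},
    index=["r1", "r2", "r3"],
)

# Selecting a single column gives back a Series.
age_column = df["age"]

print(s)
print(df)
print(type(age_column).__name__)
```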
Pandas works together with NumPy, and can leverage constructs such as the Not-A-Number np.nan. It is typically imported as:
import numpy as np
import pandas as pd
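With both libraries imported, np.nan can mark missing values inside a Series. The short sketch below, using made-up values, shows how Pandas detects, skips, and fills such gaps.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Boolean mask that is True where a value is missing.
missing = s.isna()

# Reductions such as sum() skip NaN by default (skipna=True).
total = s.sum()

# Replace missing values with a chosen default.
filled = s.fillna(0.0)

print(missing.tolist())
print(total)
print(filled.tolist())
```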
One of the main differences between Pandas and NumPy is that Pandas can work with labelled data and spreadsheet-like tabular data. Another is that Pandas has dedicated features for dealing with time series and very large data sets, as well as advanced data analytics and data cleansing tools.
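As a small example of the time series features, the sketch below builds a daily series with a date index, aggregates it into two-day bins, and selects a date range by label. The dates and values are arbitrary.

```python
import pandas as pd

# Six consecutive days used as the index of a Series.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample: sum the values in consecutive two-day bins.
two_day = daily.resample("2D").sum()

# Label-based slicing with date strings includes both end points.
window = daily.loc["2024-01-02":"2024-01-04"]

print(two_day)
print(window)
```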
The Pandas DataFrame is more flexible than a NumPy ndarray:
NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame.
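The effect of this dtype search can be seen in a short sketch. The column names and values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "count": [1, 2, 3],        # integer column
        "ratio": [0.5, 1.5, 2.5],  # floating-point column
        "label": ["x", "y", "z"],  # text column
    }
)

# Each column keeps its own dtype.
print(df.dtypes)

# Integer and float columns can share a floating-point dtype.
numeric = df[["count", "ratio"]].to_numpy()

# Adding the text column forces the general-purpose object dtype.
mixed = df.to_numpy()

print(numeric.dtype)
print(mixed.dtype)
```

Converting a DataFrame with mixed column types therefore tends to produce an object array, which loses most of the speed advantages of NumPy.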
Pandas also offers plots using Matplotlib.
Pandas has support for multi-level hierarchical indexing with MultiIndex.
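A minimal MultiIndex sketch, assuming a made-up sales series indexed by year and quarter, might look like this:

```python
import pandas as pd

# Two index levels: an outer "year" and an inner "quarter".
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
sales = pd.Series([10, 12, 14, 18], index=index)

# Selecting on the outer level returns a Series indexed by quarter.
y2024 = sales.loc["2024"]

# xs() takes a cross-section on an inner level.
q1 = sales.xs("Q1", level="quarter")

# A full tuple of labels addresses a single value.
single = sales.loc[("2024", "Q2")]

print(y2024)
print(q1)
print(single)
```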
To get a feel for the DataFrame and how to access its data, read "10 minutes to pandas". Note how it emphasises that, while selection using direct [start:stop:step] slicing notation is supported, production code should use the data access indexers at, iat, loc, and iloc. These are used with square brackets rather than called like functions, for example df.loc[row, column].
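The four access styles can be sketched on a small, made-up table as follows. Label-based access uses the index and column names, while position-based access uses integer offsets.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]},
    index=["r1", "r2", "r3"],
)

# loc: label-based selection of rows and columns.
by_label = df.loc["r2", "age"]

# iloc: integer-position-based selection (row 1, column 1).
by_position = df.iloc[1, 1]

# at / iat: fast access to a single scalar by label or by position.
scalar_by_label = df.at["r3", "name"]
scalar_by_position = df.iat[0, 1]

# loc also accepts a boolean mask to filter rows.
over_30 = df.loc[df["age"] > 30, "name"]

print(by_label, by_position, scalar_by_label, scalar_by_position)
print(over_30.tolist())
```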
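Pandas provides vectorised operations, such as arithmetic and comparisons on whole columns, that operate on entire arrays at once and leverage C optimisations, so they are much faster than Python loops. The apply() and map() methods also work on whole columns and are tidier than explicit loops, but they call a Python function once per element or row, so they are usually slower than a truly vectorised expression.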
Pandas does not directly support parallel processing, but additional libraries such as pandarallel, parallel-pandas, and Modin enable parallel processing with a Pandas-like API.
The RAPIDS cuDF pandas accelerator mode (cudf.pandas) offers GPU acceleration with zero code change.
See also the section about PySpark.