Pandas Introduction: A Powerful Tool for Data Exploration
Pandas is a Python package that provides fast, flexible, and expressive data structures that make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data simple and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python.
Pandas is well suited to many kinds of data:
- Tabular data with columns of different types, as in a SQL table or Excel spreadsheet.
- Ordered and unordered time series data (not necessarily fixed-frequency).
- Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels.
- Any other form of observational or statistical data set. The data does not need to be labeled at all to be placed into a pandas data structure.
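To make the list above a little more concrete, here is a minimal sketch of three of those cases. All values, dates, and column names are made up for illustration.

```python
import pandas as pd

# Unlabeled data: a plain list gets a default integer index (0, 1, 2).
readings = pd.Series([3.2, 4.8, 5.1])

# Time series data: the same values indexed by illustrative daily timestamps.
dates = pd.date_range("2024-01-01", periods=3, freq="D")
daily = pd.Series([3.2, 4.8, 5.1], index=dates)

# Tabular data with columns of different types, as in a SQL table.
table = pd.DataFrame({
    "name": ["a", "b", "c"],    # strings
    "count": [1, 2, 3],         # integers
    "score": [0.5, 1.5, 2.5],   # floats
})

print(readings.index)
print(daily.index)
print(table.dtypes)
```

Each column of `table` keeps its own dtype, which is what "heterogeneously typed" means in practice.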
Pandas’ two core data structures, Series (1-dimensional) and DataFrame (2-dimensional), address the great majority of common use cases in finance, statistics, social science, and many fields of engineering.
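As a rough illustration of how the two structures relate, the sketch below builds a small DataFrame and pulls a Series out of it. The labels and numbers are invented for the example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.0, 12.5, 11.0], index=["mon", "tue", "wed"], name="price")

# A DataFrame is a two-dimensional table whose columns share one index.
frame = pd.DataFrame(
    {"price": [10.0, 12.5, 11.0], "volume": [100, 150, 120]},
    index=["mon", "tue", "wed"],
)

# Selecting one column of a DataFrame gives back a Series.
price_column = frame["price"]

print(prices.ndim, frame.ndim)
print(type(price_column).__name__)
```

Selecting a single row with `frame.loc["tue"]` also returns a Series, this time indexed by the column names.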
DataFrame provides everything that R’s data.frame provides and much more. Pandas is built on top of NumPy and is intended to integrate well with the many other third-party libraries used in scientific computing.
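One way to see the NumPy connection is to pass a DataFrame straight to a NumPy function and to convert it to a plain array. This is only a small sketch with made-up numbers, not a full tour of the interoperability.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [16.0, 25.0, 36.0]})

# NumPy ufuncs operate on pandas objects and keep the row and column labels.
roots = np.sqrt(frame)

# to_numpy() hands the values to code that expects a plain NumPy array.
array = frame.to_numpy()

print(roots)
print(array.shape)
```

Libraries that accept NumPy arrays can usually be fed the result of `to_numpy()` when they do not understand DataFrames directly.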
Here is a simple example of using Pandas to load and manipulate a CSV file:
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Print the first 5 rows of the DataFrame
print(df.head())
# Select a subset of the columns
subset = df[['column_a', 'column_b']]
# Select a single column
column = df['column_a']
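# Select a single row by position (here, the first row)
row = df.iloc[0]
# Select the rows where column_a is positive, using a boolean mask
positive_rows = df[df['column_a'] > 0]
# Apply a function to each row (for simple arithmetic, df['column_a'] + df['column_b'] is faster)
df['column_c'] = df.apply(lambda r: r['column_a'] + r['column_b'], axis=1)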
# Save the DataFrame to a CSV file
df.to_csv('processed_data.csv', index=False)
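The example above assumes a file called data.csv already exists. If you want to try the same steps without one, the sketch below reads an in-memory stand-in instead. The three rows of data are invented, and the name `positive_rows` is just an illustrative choice.

```python
import io

import pandas as pd

# Stand-in for data.csv: made-up values with the column names the example expects.
csv_text = """column_a,column_b
1,10
-2,20
3,30
"""

df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: keep only the rows where column_a is positive.
positive_rows = df[df["column_a"] > 0]

# Vectorised column arithmetic, equivalent to the row-wise apply for this sum.
df["column_c"] = df["column_a"] + df["column_b"]

# With no path given, to_csv returns the CSV text instead of writing a file.
output = df.to_csv(index=False)
print(output)
```

Adding the two columns directly avoids calling a Python function once per row, which matters as the data grows.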
Jupyter Notebooks provide an excellent environment for using pandas for data exploration and modelling, although pandas works just as well in ordinary scripts written in a text editor.
Jupyter Notebooks let us execute the code in a single cell rather than rerunning the whole file. When working with large datasets and complex transformations, this saves a significant amount of time, because data loaded in an earlier cell stays in memory. Notebooks also make it simple to display pandas DataFrames and plots inline. In fact, this post was written entirely in a Jupyter Notebook.