Introduction
Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of Python. It is best known for the extensive set of features it provides:
- Selecting and filtering rows and columns
- Proper handling of missing data
- Applying operations across rows and columns
- Merging data sets together
- Grouping and applying aggregation functions
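As a rough illustration of the features listed above, the following sketch touches each one in turn. The tables `sales` and `regions` and all of their values are hypothetical, made up only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
sales = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "units": [10.0, np.nan, 7.0, 3.0],
})
regions = pd.DataFrame({
    "city": ["Oslo", "Bergen"],
    "region": ["east", "west"],
})

# Selecting and filtering: keep the "units" of rows with more than 5 units
large_orders = sales.loc[sales["units"] > 5, "units"]

# Handling missing data: replace the missing "units" value with 0
filled = sales.fillna({"units": 0.0})

# Applying an operation across a column
doubled = filled["units"] * 2

# Merging two data sets on the shared "city" column
merged = filled.merge(regions, on="city")

# Grouping and applying an aggregation function
units_per_region = merged.groupby("region")["units"].sum()
print(units_per_region)
```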
Pandas offers two classes, Series and DataFrame, which are 1D and 2D labeled arrays, respectively, capable of holding data of any type (strings, floats, Python objects, etc.). A pandas Series can be thought of as 1-dimensional data in which each value is associated with an index (label). The DataFrame can be thought of as the extension of Series to 2D, where each value has a corresponding column name and index. Simply put, a DataFrame is a table similar to an Excel spreadsheet.
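The relationship between the two classes can be sketched as follows. The names `prices`, `stock`, and `fruit`, and the values they hold, are assumptions chosen for illustration.

```python
import pandas as pd

labels = ["apple", "pear", "plum"]

# A Series: one value per index label
prices = pd.Series([1.5, 2.0, 0.75], index=labels)
stock = pd.Series([10, 0, 25], index=labels)

# A DataFrame: several named columns that share the same index
fruit = pd.DataFrame({"price": prices, "stock": stock})

# Selecting a single column of a DataFrame gives back a Series
price_column = fruit["price"]
print(type(price_column).__name__)  # Series

# A single cell is addressed by index label and column name
print(fruit.loc["pear", "stock"])   # 0
```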
Series
import numpy as np
import pandas as pd
tempArr = np.arange(0, 6)
tempSeries = pd.Series(tempArr)
tempSeries
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
The indices of a pandas Series can be specified when defining the Series:
tempIndices = list('abcdef')
tempSeries = pd.Series(tempArr, index=tempIndices)
tempSeries.head(3)
a    0
b    1
c    2
dtype: int64
tempSeries.tail()
b    1
c    2
d    3
e    4
f    5
dtype: int64
Note: head(n) shows the first n rows of a Series or DataFrame, where the default value of n is 5. Similarly, tail(n) shows the last n rows, again with a default of 5.
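A brief sketch of passing an explicit n to both methods; the Series `letters` is a hypothetical stand-in defined only for this example.

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(0, 6), index=list("abcdef"))

first_two = letters.head(2)    # rows labeled a and b
last_two = letters.tail(2)     # rows labeled e and f
default_head = letters.head()  # n defaults to 5: rows a through e

print(first_two)
print(last_two)
```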
The indices can also be assigned by modifying the index attribute:
tempSeries.index = list('fedcba')
tempSeries
f    0
e    1
d    2
c    3
b    4
a    5
dtype: int64
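Assigning to the index attribute only renames the labels; the values stay where they are. The sketch below contrasts this with reindex, which is not covered in the text above and is shown here only as a point of comparison. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(0, 6), index=list("abcdef"))

# Assigning to .index relabels the rows without moving any values
relabeled = values.copy()
relabeled.index = list("fedcba")
print(relabeled["f"])  # 0, the value that used to be labeled 'a'

# reindex keeps each label attached to its value and reorders the rows
reordered = values.reindex(list("fedcba"))
print(reordered["f"])  # 5
```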
DataFrame
tempArrEven = np.arange(0, 10, 2)
tempArrOdd = np.arange(1, 11, 2)
tempDataFrame = pd.DataFrame(np.array([tempArrEven, tempArrOdd]).T)
tempDataFrame
| | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | 5 |
| 3 | 6 | 7 |
| 4 | 8 | 9 |
Note: a DataFrame object is always 2-dimensional, even when it holds a single column, whereas a Series object is always 1-dimensional. See the following example:
print(pd.DataFrame(tempArr).shape)
print(pd.Series(tempArr).shape)
(6, 1)
(6,)
Column names and index labels can be assigned to a DataFrame when it is defined, in the same way as the index of a Series:
tempDataFrame = pd.DataFrame(data=np.array([tempArrEven, tempArrOdd]).T,
                             columns=['even_numbers', 'odd_numbers'],
                             index=list('abcde'))
tempDataFrame
| | even_numbers | odd_numbers |
|---|---|---|
| a | 0 | 1 |
| b | 2 | 3 |
| c | 4 | 5 |
| d | 6 | 7 |
| e | 8 | 9 |
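Once column names and index labels are in place, values can be looked up by label rather than by position. A minimal sketch, rebuilding the same table under the hypothetical name `numbers`:

```python
import numpy as np
import pandas as pd

evens = np.arange(0, 10, 2)
odds = np.arange(1, 11, 2)
numbers = pd.DataFrame(
    data=np.array([evens, odds]).T,
    columns=["even_numbers", "odd_numbers"],
    index=list("abcde"),
)

# Label-based lookup: row 'c', column 'odd_numbers'
print(numbers.loc["c", "odd_numbers"])   # 5

# Position-based lookup of the same cell: third row, second column
print(numbers.iloc[2, 1])                # 5

# A whole column, selected by name
print(numbers["even_numbers"].tolist())  # [0, 2, 4, 6, 8]
```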