Pandas

Pandas
Load Data (csv)

`read_csv()`: read data from csv and tsv files

One of the most valuable tools provided by pandas is the ability to load data from a great variety of file formats and databases, like SQL (*.db), Excel (*.xls), and comma-separated values (*.csv) files.

General Usage

import pandas as pd
dataPath = 'https://raw.githubusercontent.com/alineu/pyDataScintist-Notebooks/main/data/'
df = pd.read_csv(dataPath+'BostonHousing_min.csv')
df.head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

sep: Delimiter to use - the default value is ','

# Not specifying the delimiter when reading a tsv file is problematic
pd.read_csv(dataPath+'BostonHousing_min.tsv').head()

	crim\t"zn"\t"indus"\t"chas"\t"nox"\t"rm"\t"age"\t"dis"\t"rad"\t"tax"\t"ptratio"\t"b"\t"lstat"\t"medv"
0	0.00632\t18\t2.31\t"0"\t0.538\t6.575\t65.2\t4....
1	0.02731\t0\t7.07\t"0"\t0.469\t6.421\t78.9\t4.9...
2	0.02729\t0\t7.07\t"0"\t0.469\t7.185\t61.1\t4.9...
3	0.03237\t0\t2.18\t"0"\t0.458\t6.998\t45.8\t6.0...
4	0.06905\t0\t2.18\t"0"\t0.458\t7.147\t54.2\t6.0...

pd.read_csv(dataPath+'BostonHousing_min.tsv', 
            sep='\t').head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

Note: sep accepts regular expressions as well e.g. sep='\s{2}\t'

header: specifies what row to start reading the csv - the deafult value is 0

# header=0 is the default behavior
pd.read_csv(dataPath+'BostonHousing_min.csv', 
            header=0).head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

# header=n where n is an integer reads the file starting at the nth line
pd.read_csv(dataPath+'BostonHousing_min.csv', 
            header=3).head()

	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
0	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
1	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2
2	0.02985	0.0	2.18	0.458	6.430	58.7	6.0622	3	222	18.7	394.12	5.21	28.7
3	0.08829	12.5	7.87	0.524	6.012	66.6	5.5605	5	311	15.2	395.60	12.43	22.9
4	0.14455	12.5	7.87	0.524	6.172	96.1	5.9505	5	311	15.2	396.90	19.15	27.1

Note: header=None treats the first line of the file as data, and the column names will be assigned as a list of integers

pd.read_csv(dataPath+'BostonHousing_min.csv', 
            header=None).head()

	0	1	2	3	4	5	6	7	8	9	10	11	12	13
0	crim	zn	indus	chas	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
1	0.00632	18	2.31	0	0.538	6.575	65.2	4.09	1	296	15.3	396.9	4.98	24
2	0.02731	0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	396.9	9.14	21.6
3	0.02729	0	7.07	0	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
4	0.03237	0	2.18	0	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4

# So header=None is convenient if the file does not have a header
pd.read_csv(dataPath+'BostonHousing_no-header_min.csv', 
            header=None).head()

	0	1	2	4	5	6	7	8	9	10	11	12	13
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

prefix: string to be prepended to the column names. The common use is when the file has no header so the specified prefix is prepended to the integers that have automatically been assigned as column names.

pd.read_csv(dataPath+'BostonHousing_no-header_min.csv', 
            prefix='col_').head()

	col_0	col_1	col_2	col_4	col_5	col_6	col_7	col_8	col_9	col_10	col_11	col_12	col_13
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

skiprows: can be used as

a list to specify the line numbers to be skipped when reading a file, e.g. skiprows=[1, 5, 8]
an int that specifies where to start reading the file, e.g. skiprows=5. skiprows=n and header=n produce the same effect.
a callable function which is evaluated against the row indices e.g. skiprows=lambda x: x%3==2

pd.read_csv(dataPath+'BostonHousing.csv', 
            skiprows=range(2,20)).head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.72580	0.0	8.14	0.538	5.727	69.5	3.7965	4	307	21.0	390.95	11.28	18.2
2	1.25179	0.0	8.14	0.538	5.570	98.1	3.7979	4	307	21.0	376.57	21.02	13.6
3	0.85204	0.0	8.14	0.538	5.965	89.2	4.0123	4	307	21.0	392.53	13.83	19.6
4	1.23247	0.0	8.14	0.538	6.142	91.7	3.9769	4	307	21.0	396.90	18.72	15.2

pd.read_csv(dataPath+'BostonHousing_min.csv', 
            skiprows=lambda x: x%3==2).head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
2	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
3	0.02985	0.0	2.18	0.458	6.430	58.7	6.0622	3	222	18.7	394.12	5.21	28.7
4	0.08829	12.5	7.87	0.524	6.012	66.6	5.5605	5	311	15.2	395.60	12.43	22.9

names: List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

col_names = [f'col_{i}' for i in range(len(df.columns))]
col_names

['col_0',
 'col_1',
 'col_2',
 'col_3',
 'col_4',
 'col_5',
 'col_6',
 'col_7',
 'col_8',
 'col_9',
 'col_10',
 'col_11',
 'col_12',
 'col_13']

pd.read_csv(dataPath+'BostonHousing_min.csv', 
            names=col_names, 
            header=0).head()

	col_0	col_1	col_2	col_4	col_5	col_6	col_7	col_8	col_9	col_10	col_11	col_12	col_13
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

Note: names is handy if you want to override a file that has header while prefix is often used to modify the column names of a file with no header.

usecols: List of column names or column ids to be included in the DataFrame.

# Using column names
pd.read_csv(dataPath+'BostonHousing_min.csv', 
            usecols=['age', 'tax']).head()

	age	tax
0	65.2	296
1	78.9	242
2	61.1	242
3	45.8	222
4	54.2	222

# Using column ids
pd.read_csv(dataPath+'BostonHousing_min.csv', 
            usecols=[0, 5]).head()

	crim	rm
0	0.00632	6.575
1	0.02731	6.421
2	0.02729	7.185
3	0.03237	6.998
4	0.06905	7.147

nrows: Number of lines to read. It is useful when reading pieces of a large file.

pd.read_csv(dataPath+'BostonHousing_min.csv', 
            nrows=2, 
            usecols=[0, 5]).head()

	crim	rm
0	0.00632	6.575
1	0.02731	6.421

dtype: a dictionary that enables the user to specify the data type. Otherwise pandas will automatically infer the data.

df = pd.read_csv(dataPath+'BostonHousing_min.csv', 
                 usecols=['age', 'tax'])
df.dtypes

age    float64
tax      int64
dtype: object

df.head()

	age	tax
0	65.2	296
1	78.9	242
2	61.1	242
3	45.8	222
4	54.2	222

# Change the "tax" column data type from int to float
df = pd.read_csv(dataPath+'BostonHousing_min.csv',  
                 usecols=['age', 'tax'], 
                 dtype={'tax': float})
df.dtypes

age    float64
tax      float64
dtype: object

# Note the floating point in the "tax" column
df.head()

	age	tax
0	65.2	296.0
1	78.9	242.0
2	61.1	242.0
3	45.8	222.0
4	54.2	222.0

Link to the ipython notebook

Pandas

read_csv(): read data from csv and tsv files

General Usage

Useful Arguments

`read_csv()`: read data from csv and tsv files