EDAhelper.preprocess

preprocess reads data from formats such as txt, json, and csv and returns it as a pandas.DataFrame. To use preprocess in a project:

# import function
from EDAhelper.EDAhelper import preprocess
import pandas as pd

Read CSV data from a URL

file_path = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
df = preprocess(file_path)
df.iloc[:5, :5]
   PassengerId  Survived  Pclass                                               Name     Sex
0            1         0       3                            Braund, Mr. Owen Harris    male
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female
2            3         1       3                             Heikkinen, Miss. Laina  female
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female
4            5         0       3                           Allen, Mr. William Henry    male

Read local data

file_path = '../tests/data_preprocess.csv'
preprocess(file_path)
   Unnamed: 0  col_1  col_2 col_3
0           0    NaN    1.0     a
1           1    1.0    2.0     b
2           2    1.0    NaN     c
3           3    3.0    5.0     d
4           4    0.0    NaN   NaN
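
The Unnamed: 0 column in the output above is the name pandas gives to a first column with an empty header, which usually means the CSV was written together with its row index. The sketch below reproduces that effect with plain pandas rather than preprocess. The in-memory buffer and the small frame are stand-ins chosen for illustration, not the actual contents of the test file.

```python
import io

import pandas as pd

# Assumed stand-in data: a small frame written out together with its index.
original = pd.DataFrame({"col_1": [1.0, 3.0], "col_2": [2.0, 5.0]})
csv_text = original.to_csv()  # the index is written as a first column with no header

# Read back without index_col: the old index appears as "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# Read back with index_col=0: the first column becomes the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(restored.columns))
```

Passing index_col=0, as the later examples do, is one way to keep that column out of the data.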

Read data with different methods for handling missing values

preprocess(file_path, method='mean', index_col=0)
   col_1     col_2 col_3
0   1.25  1.000000     a
1   1.00  2.000000     b
2   1.00  2.666667     c
3   3.00  5.000000     d
4   0.00  2.666667   NaN

preprocess(file_path, method='median', index_col=0)
   col_1  col_2 col_3
0    1.0    1.0     a
1    1.0    2.0     b
2    1.0    2.0     c
3    3.0    5.0     d
4    0.0    2.0   NaN
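
The outputs above are consistent with filling NaN values in numeric columns with the column mean or median while leaving the non-numeric col_3 untouched. The following is a minimal sketch of that behavior in plain pandas. The helper fill_numeric is hypothetical and is not EDAhelper's implementation; the frame is rebuilt inline from the values shown earlier.

```python
import numpy as np
import pandas as pd

# Same values as the test data shown above, built inline.
df = pd.DataFrame(
    {
        "col_1": [np.nan, 1.0, 1.0, 3.0, 0.0],
        "col_2": [1.0, 2.0, np.nan, 5.0, np.nan],
        "col_3": ["a", "b", "c", "d", np.nan],
    }
)


def fill_numeric(frame: pd.DataFrame, method: str) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs in numeric columns with a column statistic."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    if method == "mean":
        stats = filled[numeric_cols].mean()
    elif method == "median":
        stats = filled[numeric_cols].median()
    else:
        raise ValueError(f"unsupported method: {method!r}")
    # fillna with a Series fills each column with the statistic of the same name.
    filled[numeric_cols] = filled[numeric_cols].fillna(stats)
    return filled


mean_filled = fill_numeric(df, "mean")
median_filled = fill_numeric(df, "median")
print(mean_filled)
print(median_filled)
```

Both statistics skip NaN by default, so col_1 is averaged over its four observed values and col_2 over its three.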

Read data with a chosen pandas reader and extra pandas settings

preprocess(file_path, read_func=pd.read_csv, index_col=1)
       Unnamed: 0  col_2 col_3
col_1
NaN             0    1.0     a
1.0             1    2.0     b
1.0             2    NaN     c
3.0             3    5.0     d
0.0             4    NaN   NaN
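
The call above suggests that the reader function and any extra keyword arguments are handed on to pandas. One way such forwarding could work is sketched below. The wrapper load and the inline CSV and JSON strings are hypothetical, introduced only for illustration, and say nothing about how preprocess is written internally.

```python
import io

import pandas as pd


def load(source, read_func=pd.read_csv, **kwargs):
    """Hypothetical wrapper: call the chosen pandas reader with extra options."""
    return read_func(source, **kwargs)


# Assumed sample inputs, held in memory instead of on disk.
csv_text = "col_1,col_2\n1.0,2.0\n3.0,5.0\n"
json_text = '[{"col_1": 1.0, "col_2": 2.0}, {"col_1": 3.0, "col_2": 5.0}]'

# index_col is forwarded to pd.read_csv, so col_1 becomes the index.
from_csv = load(io.StringIO(csv_text), index_col=0)

# Swapping the reader handles a different format with the same wrapper.
from_json = load(io.StringIO(json_text), read_func=pd.read_json)

print(from_csv)
print(from_json)
```

With this design, any option the chosen pandas reader accepts can be passed through without the wrapper naming it.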