EDAhelper.preprocess
preprocess reads data in formats such as txt, json, and csv and returns it as a pandas.DataFrame. To use preprocess in a project:
# import function
from EDAhelper.EDAhelper import preprocess
import pandas as pd
Read CSV data from a URL
file_path = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
df = preprocess(file_path)
df.iloc[:5, :5]
| | PassengerId | Survived | Pclass | Name | Sex |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male |
Read local data
file_path = '../tests/data_preprocess.csv'
preprocess(file_path)
| | Unnamed: 0 | col_1 | col_2 | col_3 |
|---|---|---|---|---|
| 0 | 0 | NaN | 1.0 | a |
| 1 | 1 | 1.0 | 2.0 | b |
| 2 | 2 | 1.0 | NaN | c |
| 3 | 3 | 3.0 | 5.0 | d |
| 4 | 4 | 0.0 | NaN | NaN |
Read data with different methods for dealing with missing values
preprocess(file_path, method='mean', index_col=0)
| | col_1 | col_2 | col_3 |
|---|---|---|---|
| 0 | 1.25 | 1.000000 | a |
| 1 | 1.00 | 2.000000 | b |
| 2 | 1.00 | 2.666667 | c |
| 3 | 3.00 | 5.000000 | d |
| 4 | 0.00 | 2.666667 | NaN |
preprocess(file_path, method='median', index_col=0)
| | col_1 | col_2 | col_3 |
|---|---|---|---|
| 0 | 1.0 | 1.0 | a |
| 1 | 1.0 | 2.0 | b |
| 2 | 1.0 | 2.0 | c |
| 3 | 3.0 | 5.0 | d |
| 4 | 0.0 | 2.0 | NaN |
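The filled values in the two tables above can be reproduced with plain pandas. The following is a minimal sketch, not the EDAhelper implementation: `fill_numeric` is a hypothetical helper, and the frame is typed in by hand from the tables above. It assumes that only numeric columns are filled, which is consistent with the NaN that remains in col_3.

```python
import numpy as np
import pandas as pd

# Values copied by hand from the tables above (illustration only).
df = pd.DataFrame({
    "col_1": [np.nan, 1.0, 1.0, 3.0, 0.0],
    "col_2": [1.0, 2.0, np.nan, 5.0, np.nan],
    "col_3": ["a", "b", "c", "d", np.nan],
})


def fill_numeric(frame, method):
    """Hypothetical helper: fill NaN in numeric columns with the column mean or median."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    if method == "mean":
        fill_values = out[numeric_cols].mean()
    elif method == "median":
        fill_values = out[numeric_cols].median()
    else:
        raise ValueError(f"unsupported method: {method!r}")
    # fillna with a Series fills each column using the value under its label.
    out[numeric_cols] = out[numeric_cols].fillna(fill_values)
    return out


mean_filled = fill_numeric(df, "mean")
median_filled = fill_numeric(df, "median")
```

Non-numeric columns such as col_3 pass through unchanged, so their missing entries stay NaN.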
Read data with extra pandas settings
preprocess(file_path, read_func=pd.read_csv, index_col=1)
| col_1 | Unnamed: 0 | col_2 | col_3 |
|---|---|---|---|
| NaN | 0 | 1.0 | a |
| 1.0 | 1 | 2.0 | b |
| 1.0 | 2 | NaN | c |
| 3.0 | 3 | 5.0 | d |
| 0.0 | 4 | NaN | NaN |
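One way to picture how a reader function and extra keyword arguments could be passed through is the sketch below. `read_with` is a hypothetical stand-in, not EDAhelper code, and the CSV text is an in-memory copy of the values shown in the tables above, so the example runs without the test file.

```python
import io

import pandas as pd

# In-memory copy of the sample data (assumed layout, typed in for illustration).
CSV_TEXT = (
    ",col_1,col_2,col_3\n"
    "0,,1.0,a\n"
    "1,1.0,2.0,b\n"
    "2,1.0,,c\n"
    "3,3.0,5.0,d\n"
    "4,0.0,,\n"
)


def read_with(source, read_func=pd.read_csv, **kwargs):
    """Hypothetical wrapper: call the chosen pandas reader with any extra settings."""
    return read_func(source, **kwargs)


# index_col=1 makes the second column (col_1) the index, as in the table above.
df = read_with(io.StringIO(CSV_TEXT), read_func=pd.read_csv, index_col=1)
```

Because the keyword arguments are forwarded untouched, any option the chosen reader accepts (for example `sep` or `usecols` for `pd.read_csv`) could be supplied the same way.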