Python data cleaning cheat sheet

# 2. Import libraries and modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

import joblib

# 3. Load red wine data
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

# 4. Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

# 5. Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100,
                                               random_state=123))

# 6. Declare hyperparameters to tune
# Note: 'auto' was removed from max_features in scikit-learn 1.3;
# None (use all features) is the equivalent setting for regressors.
hyperparameters = {'randomforestregressor__max_features': [None, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

# 7. Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

# 8. Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)

# 9. Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

# 10. Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')

How do you clean data in Python?

Pythonic data cleaning with Pandas and NumPy typically covers:

Dropping columns in a DataFrame
Changing the index of a DataFrame
Tidying up fields in the data
Combining str methods with NumPy to clean columns
Cleaning the entire dataset using the applymap function
Renaming columns and skipping rows
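The first few of those steps can be sketched as follows. This is a minimal illustration on a hypothetical messy DataFrame (the column names and values are invented for the example, not taken from a real dataset):

```python
import pandas as pd

# Hypothetical messy dataset for illustration
df = pd.DataFrame({
    'Identifier': [101, 102, 103],
    'Place of Publication': ['London', 'Oxford; London', 'Paris'],
    'Date of Publication': ['1879 [1878]', '1868', 'c. 1900'],
    'Unnamed Notes': ['', '', ''],
})

# Dropping a column that carries no useful information
df = df.drop(columns=['Unnamed Notes'])

# Changing the index to a unique identifier column
df = df.set_index('Identifier')

# Tidying up a field with str methods: keep the first four-digit year
df['Date of Publication'] = df['Date of Publication'].str.extract(
    r'^\D*(\d{4})', expand=False)

print(df)
```

`str.extract` with a regular expression is one common way to normalize a free-text field; `applymap` (or `DataFrame.map` in recent pandas) applies a function element-wise across the whole frame for similar whole-dataset cleanups.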

Is Pandas good for data cleaning?

Pandas offers a diverse range of built-in functions for cleaning and manipulating datasets prior to analysis. It lets you drop incomplete rows and columns, fill missing values, and improve the readability of a dataset by renaming categories.
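Those three operations look roughly like this in practice. The DataFrame and the category codes here are hypothetical, chosen only to show `dropna`, `fillna`, and `replace`:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values, for illustration
df = pd.DataFrame({'price': [9.99, np.nan, 4.50, np.nan],
                   'category': ['bk', 'bk', 'dvd', None]})

# Drop rows where every value is missing
df = df.dropna(how='all')

# Fill remaining price gaps with the column mean
df['price'] = df['price'].fillna(df['price'].mean())

# Improve readability by renaming category codes
df['category'] = df['category'].replace({'bk': 'book', 'dvd': 'DVD'})

print(df)
```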

What is Pandas cheat sheet?

The Pandas cheat sheet guides you through the basics of the Pandas library: its data structures, I/O, selection, dropping indices or columns, sorting and ranking, retrieving basic information about the data structures you are working with, applying functions, and data alignment.
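A few of those basics in one place, using a small hypothetical DataFrame (the country codes and population figures are illustrative only):

```python
import pandas as pd

# Hypothetical DataFrame to illustrate the basics the cheat sheet covers
df = pd.DataFrame({'country': ['BE', 'IN', 'BR'],
                   'population': [11.19, 1303.17, 207.85]},
                  index=['a', 'b', 'c'])

# Selection: by label and by position
print(df.loc['b', 'population'])   # label-based selection
print(df.iloc[0, 0])               # position-based selection

# Dropping a column (returns a new DataFrame by default)
slim = df.drop('country', axis=1)

# Sorting and ranking
print(df.sort_values(by='population', ascending=False))
print(df['population'].rank())

# Basic information about the structure
print(df.shape)
print(df.dtypes)
```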

How do you manipulate data in Python?

In machine learning, a model requires a dataset to operate on, i.e. to train and test. But data rarely comes fully prepared and ready to use: many rows and columns contain discrepancies such as "NaN" / "Null" / "NA" values, which must be handled before the data can be fed to a model.
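One common way to handle such discrepancies is to normalize the different missing-value spellings to real NaN and then fill or drop them. A minimal sketch, with invented data and placeholder spellings chosen for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where missing values appear under several spellings
raw = pd.DataFrame({'age': ['25', 'NA', '31', 'Null'],
                    'score': [88.0, np.nan, 92.5, 79.0]})

# Normalize the placeholder strings to real NaN, then coerce to numeric
raw = raw.replace({'NA': np.nan, 'Null': np.nan})
raw['age'] = pd.to_numeric(raw['age'])

# Inspect what is missing, then fill per-column
print(raw.isna().sum())
clean = raw.fillna({'age': raw['age'].median(),
                    'score': raw['score'].mean()})
print(clean)
```

(`pd.read_csv` can also do the first step at load time via its `na_values` parameter.)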