Data cleaning is an important task that comes before the algorithm step in machine learning and deep learning. The 7 steps I usually follow are summarized below:
- Step 1: read csv
- Step 2: preview data
- Step 3: check null values for every column
- Step 4: complete null values
- Step 5: feature engineering
  - Step 5.1: delete some features
  - Step 5.2: create new features
- Step 6: encode categorical columns
  - Step 6.1: Sklearn LabelEncoder
  - Step 6.2: Pandas get_dummies
- Step 7: check the cleaned data
1 Read in the data

Obvious, I know, but the first step is to read in the data.

    import pandas as pd

    data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
    data_raw

Result:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 12 columns

2 Preview the data

    data_raw.info()
    data_raw.describe(include='all')

Result:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Hakkarainen, Mr. Pekka Pietari | male | NaN | NaN | NaN | 1601 | NaN | G6 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
| 25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
| 50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
| 75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
| max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |

3 Check for null values

    data1 = data_raw.copy(deep=True)
    data1.isnull().sum()

Result:

    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age            177
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         2
    dtype: int64

The Age column has 177 null values and Cabin has 687. With only 891 rows in total, Cabin probably has little value left! Embarked has 2.

4 Fill in null values
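Before picking a fill strategy for each column, it can help to turn the null counts into a share of all rows. In the counts above, Cabin is missing 687 of 891 values (about 77%), Age 177 of 891 (about 20%), and Embarked 2 of 891. The sketch below shows one way to compute that on a small made-up frame; the toy values and the `DROP_THRESHOLD` cutoff are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# isnull().mean() gives the fraction of missing entries in each column.
null_ratio = toy.isnull().mean().sort_values(ascending=False)
print(null_ratio)

# Hypothetical cutoff: columns missing more than half their values
# become candidates for dropping instead of filling.
DROP_THRESHOLD = 0.5
drop_candidates = null_ratio[null_ratio > DROP_THRESHOLD].index.tolist()
print(drop_candidates)
```

With a cutoff like this, a column such as Cabin would be flagged for removal while Age and Embarked would be left to fill, which matches the choices made in the steps here.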
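    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which recent pandas versions warn about and may not
    # propagate to data1.
    data1['Age'] = data1['Age'].fillna(data1['Age'].median())
    data1['Embarked'] = data1['Embarked'].fillna(data1['Embarked'].mode()[0])
    data1.isnull().sum()

Check after filling:

    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age              0
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         0
    dtype: int64

5 Feature engineering

5.1 Drop 3 columns:

    drop_column = ['PassengerId', 'Cabin', 'Ticket']
    data1.drop(drop_column, axis=1, inplace=True)

5.2 Add 3 columns

Add a FamilySize column:

    # Siblings/spouses + parents/children + the passenger themselves
    data1['FamilySize'] = data1['SibSp'] + data1['Parch'] + 1
    data1

Printed result:

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | S | 2 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C | 2 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | S | 1 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | S | 2 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | S | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 13.0000 | S | 1 |
| 887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.0000 | S | 1 |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 28.0 | 1 | 2 | 23.4500 | S | 4 |
| 889 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.0000 | C | 1 |
| 890 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.7500 | Q | 1 |

891 rows × 10 columns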
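The outline at the top also lists Step 6 (encoding categorical columns with Sklearn's LabelEncoder and pandas get_dummies) and Step 7 (a final check), which the text here does not reach. As a rough sketch of what those steps might look like, the example below runs both encoders on a small made-up frame. The toy values, the `_Code` column suffix, and the choice of Sex and Embarked as the columns to encode are assumptions for illustration, not the author's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in with two categorical columns; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "FamilySize": [2, 2, 1, 1],
})

# Step 6.1: LabelEncoder maps each category to an integer, in sorted label order.
toy["Sex_Code"] = LabelEncoder().fit_transform(toy["Sex"])            # female -> 0, male -> 1
toy["Embarked_Code"] = LabelEncoder().fit_transform(toy["Embarked"])  # C -> 0, Q -> 1, S -> 2

# Step 6.2: get_dummies creates one indicator column per category instead.
dummies = pd.get_dummies(toy[["Sex", "Embarked"]])
print(list(dummies.columns))

# Step 7: a final check that the encoded frame has no nulls left.
encoded = pd.concat([toy[["FamilySize", "Sex_Code", "Embarked_Code"]], dummies], axis=1)
print(encoded.isnull().sum())
print(encoded.dtypes)
```

Integer codes keep the frame narrow but imply an ordering between categories, while indicator columns avoid that at the cost of more columns; which one suits a given model is a judgment call.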
