5.Pandas分组
温馨提示:这篇文章已超过388天没有更新,请注意相关的内容是否还可用!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
sns.set(style='whitegrid')
warnings.filterwarnings('ignore')
pd.show_versions()
INSTALLED VERSIONS ------------------ commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673 python : 3.11.5.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22621 machine : AMD64 processor : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Chinese (Simplified)_China.936 pandas : 2.1.4 numpy : 1.24.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.3.1 Cython : None pytest : 7.4.0 hypothesis : None sphinx : 5.0.2 blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : 1.1 pymysql : 1.0.2 psycopg2 : None jinja2 : 3.1.2 IPython : 8.15.0 pandas_datareader : None bs4 : 4.12.2 bottleneck : 1.3.5 dataframe-api-compat: None fastparquet : None fsspec : 2023.10.0 gcsfs : None matplotlib : 3.8.0 numba : 0.58.1 numexpr : 2.8.7 odfpy : None openpyxl : 3.0.10 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : 2023.10.0 scipy : 1.11.4 sqlalchemy : 2.0.21 tables : 3.8.0 tabulate : 0.9.0 xarray : 2023.6.0 xlrd : 2.0.1 zstandard : 0.19.0 tzdata : 2023.3 qtpy : 2.4.1 pyqt5 : None
1. 数据分组
分组的核心操作分为三步:
(图片来源网络,侵删)
- 分组:选择分组的方式,创建GroupBy对象;
- 应用:使用聚合、转换、过滤等类型的应用函数;
- 合并:将各应用函数结果合并。
df_english=pd.DataFrame(
{
"student": ["John", "James", "Jennifer"],
"gender": ["male", "male", "female"],
"score": [20, 30, 30],
"subject": "english"
}
)
df_math=pd.DataFrame(
{
"student": ["John", "James", "Jennifer"],
"gender": ["male", "male", "female"],
"score": [90, 100, 95],
"subject": "math"
}
)
df = pd.concat([df_english, df_math],ignore_index=True)
df = df.sort_values(['student','subject']).reset_index(drop=True)
df
| student | gender | score | subject | |
|---|---|---|---|---|
| 0 | James | male | 30 | english |
| 1 | James | male | 100 | math |
| 2 | Jennifer | female | 30 | english |
| 3 | Jennifer | female | 95 | math |
| 4 | John | male | 20 | english |
| 5 | John | male | 90 | math |
1.1 列筛选
# 不指定列 df.groupby(by='student').count()
| gender | score | subject | |
|---|---|---|---|
| student | |||
| James | 2 | 2 | 2 |
| Jennifer | 2 | 2 | 2 |
| John | 2 | 2 | 2 |
# 指定列 df.groupby(by='student')['score'].count()
student James 2 Jennifer 2 John 2 Name: score, dtype: int64
1.2 排序
groupby()默认有排序,如果不想要排序可以设置sort=False关闭。
df.groupby(by=['student', 'score'])['subject'].agg(['count', 'size'])
| count | size | ||
|---|---|---|---|
| student | score | ||
| James | 30 | 1 | 1 |
| 100 | 1 | 1 | |
| Jennifer | 30 | 1 | 1 |
| 95 | 1 | 1 | |
| John | 20 | 1 | 1 |
| 90 | 1 | 1 |
df.groupby(by=['student', 'score'], sort=False)['subject'].agg(['count', 'size'])
| count | size | ||
|---|---|---|---|
| student | score | ||
| James | 30 | 1 | 1 |
| 100 | 1 | 1 | |
| Jennifer | 30 | 1 | 1 |
| 95 | 1 | 1 | |
| John | 20 | 1 | 1 |
| 90 | 1 | 1 |
1.3 索引重置
groupby()会把分组列作为繁华dataframe的索引,如果不想这样,可以设置as_index=False参数。
df.groupby(by='student')['score'].agg(['mean', 'max', 'min'])
| mean | max | min | |
|---|---|---|---|
| student | |||
| James | 65.0 | 100 | 30 |
| Jennifer | 62.5 | 95 | 30 |
| John | 55.0 | 90 | 20 |
df.groupby(by='student', as_index=False)['score'].agg(['mean', 'max', 'min'])
| student | mean | max | min | |
|---|---|---|---|---|
| 0 | James | 65.0 | 100 | 30 |
| 1 | Jennifer | 62.5 | 95 | 30 |
| 2 | John | 55.0 | 90 | 20 |
1.4 空值统计
groupby()分组时默认忽略空值,但实际中有时需要对变量空值进行统计,可设置参数dropna=False。
dd = pd.DataFrame({
'A':['a1','a1','a2','a3'],
'B':['b1',None,'b2','b3'],
'C':[1,2,3,4],
'D':[5,None,9,10]})
dd
| A | B | C | D | |
|---|---|---|---|---|
| 0 | a1 | b1 | 1 | 5.0 |
| 1 | a1 | None | 2 | NaN |
| 2 | a2 | b2 | 3 | 9.0 |
| 3 | a3 | b3 | 4 | 10.0 |
# 默认忽略空值
dd.groupby('B').sum()
| A | C | D | |
|---|---|---|---|
| B | |||
| b1 | a1 | 1 | 5.0 |
| b2 | a2 | 3 | 9.0 |
| b3 | a3 | 4 | 10.0 |
# 分组空值加入统计
dd.groupby('B', dropna=False).sum()
| A | C | D | |
|---|---|---|---|
| B | |||
| b1 | a1 | 1 | 5.0 |
| b2 | a2 | 3 | 9.0 |
| b3 | a3 | 4 | 10.0 |
| NaN | a1 | 2 | 0.0 |
1.5 多级索引
除数据列外,我们也可以通过索引层级来实现分组。
df = df.set_index(['student', 'subject']) df
| gender | score | ||
|---|---|---|---|
| student | subject | ||
| James | english | male | 30 |
| math | male | 100 | |
| Jennifer | english | female | 30 |
| math | female | 95 | |
| John | english | male | 20 |
| math | male | 90 |
# 指定level=0分组 df.groupby(level=0).count()
| gender | score | |
|---|---|---|
| student | ||
| James | 2 | 2 |
| Jennifer | 2 | 2 |
| John | 2 | 2 |
# 指定level=1分组 df.groupby(level=1).count()
| gender | score | |
|---|---|---|
| subject | ||
| english | 3 | 3 |
| math | 3 | 3 |
1.6 分组聚合
即在分组对象后应用聚合函数,有内置方法和自定义方法两种。
| 方法 | 概述 |
|---|---|
| any() | 计算组中的任何数是否为真 |
| all() | 计算组中的任何数是否全都为真 |
| count() | 计算组中非NA的数量 |
| cov()* | 计算组的协方差 |
| first() | 计算每组第一个出现的值 |
| idxmax()* | 最大值的索引 |
| idxmin()* | 最小值的索引 |
| last() | 每组最后一个出现的值 |
| max() | 每组的最大值 |
| min() | 每组的最小值 |
| mean() | 每组的平均数 |
| median() | 每组的中位数 |
| nunique() | 每组中唯一值的数量 |
| prod() | 每组的乘积 |
| quantile() | 每组中值的给定分位数 |
| sem() | 计算每组中值的平均值的标准差 |
| size() | 值的数量 |
| skew()* | 值的偏差 |
| std() | 值的标准偏差 |
| sum() | 值的总和 |
| var() | 值的方差 |
# 集中趋势 # 按学生分组计算平均分数 df.groupby(['student'])['score'].mean() # 按学生分组计算中位数分数 df.groupby(['student'])['score'].median() # 按学生分组计算指定分位数 df.groupby(['student'])['score'].quantile()
student James 65.0 Jennifer 62.5 John 55.0 Name: score, dtype: float64
# 离散趋势 # 按学生分组计算最大分数 df.groupby(['student'])['score'].max() # 按学生分组计算最小分数 df.groupby(['student'])['score'].min() # 按学生分组计算方差 df.groupby(['student'])['score'].var() # 按学生分组计算协方差 # df.groupby(['student'])['score'].cov() # 按学生分组计算偏差 df.groupby(['student'])['score'].skew() # 按学生分组计算标准差 df.groupby(['student'])['score'].std()
student James 49.497475 Jennifer 45.961941 John 49.497475 Name: score, dtype: float64
# 计数 # 按学生分组计算非空数量 df.groupby(['student'])['score'].count() # 按学生分组计算所有值数量 df.groupby(['student'])['score'].size() # 按学生分组计算唯一值数量 df.groupby(['student'])['score'].nunique() # 按学生分组计算求和 df.groupby(['student'])['score'].sum() # 按学生分组计算求乘积 df.groupby(['student'])['score'].prod()
student James 3000 Jennifer 2850 John 1800 Name: score, dtype: int64
# 排序 # 按学生分组返回第一个score df.groupby(['student'])['score'].first() # 按学生分组返回最后一个score df.groupby(['student'])['score'].last() # 按学生分组返回第n个score df.groupby(['student'])['score'].nth(1)
student subject James math 100 Jennifer math 95 John math 90 Name: score, dtype: int64
2. 分箱
分箱时将连续数值变量变为离散分类变量的一种方法,常用于特征工程,这对于组的划分有更高的要求,而不是简单的将相同值划分为同一组。
分箱的分为很多种,简单的有等频和等距,如果涉及到算法的,常见的有决策树、卡方、最优分箱等,这里只给出最简单的能由pandas实现的等频和等距两种方法。
df = pd.read_csv('data/titanic.csv')
df.sample(5)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 414 | 415 | 1 | 3 | Sundman, Mr. Johan Julian | male | 44.0 | 0 | 0 | STON/O 2. 3101269 | 7.925 | NaN | S |
| 728 | 729 | 0 | 2 | Bryhl, Mr. Kurt Arnold Gottfrid | male | 25.0 | 1 | 0 | 236853 | 26.000 | NaN | S |
| 186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | female | NaN | 1 | 0 | 370365 | 15.500 | NaN | Q |
| 449 | 450 | 1 | 1 | Peuchen, Major. Arthur Godfrey | male | 52.0 | 0 | 0 | 113786 | 30.500 | C104 | S |
| 382 | 383 | 0 | 3 | Tikkanen, Mr. Juho | male | 32.0 | 0 | 0 | STON/O 2. 3101293 | 7.925 | NaN | S |
2.1 等距分箱
即按相同距离将连续的数值划分成不同的箱体,每个箱体的间距相等,但箱体内的数据并不一定相等,用pd.cut()实现。
pd.cut(df['Age'], bins=6)
0 (13.683, 26.947]
1 (26.947, 40.21]
2 (13.683, 26.947]
3 (26.947, 40.21]
4 (26.947, 40.21]
...
886 (26.947, 40.21]
887 (13.683, 26.947]
888 NaN
889 (13.683, 26.947]
890 (26.947, 40.21]
Name: Age, Length: 891, dtype: category
Categories (6, interval[float64, right]): [(0.34, 13.683]
pd.cut(df['Age'], bins=6).value_counts().sort_index()
Age
(0.34, 13.683] 71
(13.683, 26.947] 248
(26.947, 40.21] 245
(40.21, 53.473] 100
(53.473, 66.737] 43
(66.737, 80.0] 7
Name: count, dtype: int64
2.2 自定义分箱
当bins为标量序列时,cut会按照我们自定义的边界点进行分箱,不再是等距分箱。
pd.cut(df['Age'], bins=[0, 10, 30, 50, 100])
0 (10.0, 30.0]
1 (30.0, 50.0]
2 (10.0, 30.0]
3 (30.0, 50.0]
4 (30.0, 50.0]
...
886 (10.0, 30.0]
887 (10.0, 30.0]
888 NaN
889 (10.0, 30.0]
890 (30.0, 50.0]
Name: Age, Length: 891, dtype: category
Categories (4, interval[int64, right]): [(0, 10]
pd.cut(df['Age'], bins=[0, 10, 30, 50, 100]).value_counts().sort_index()
Age
(0, 10] 64
(10, 30] 345
(30, 50] 241
(50, 100] 64
Name: count, dtype: int64
2.3 等频分箱
等频要求分组内的数量相等,在pandas中用qcut实现。
pd.qcut(df['Age'], q=10).value_counts().sort_index()
Age
(0.419, 14.0] 77
(14.0, 19.0] 87
(19.0, 22.0] 67
(22.0, 25.0] 70
(25.0, 28.0] 61
(28.0, 31.8] 66
(31.8, 36.0] 91
(36.0, 41.0] 53
(41.0, 50.0] 78
(50.0, 80.0] 64
Name: count, dtype: int64
2.4 分箱合并
2.4.1 分类合并
当分类型变量有很多值但分布比较分散的时候,我们有时需要做合并的处理。
df['Embarked'].value_counts(dropna=False)
Embarked
S 644
C 168
Q 77
NaN 2
Name: count, dtype: int64
where条件筛选
where可接受的筛选条件必须是布尔型的,基本判断逻辑是:如果对象不满足匹配条件,就被赋值为默认空值或other参数指定的值,满足条件的保持原值不变。
cond = df['Embarked'].isin(['S','C'])
df['Embarked'].where(cond, other='other', inplace=True)
df['Embarked'].value_counts(dropna=False)
Embarked
S 644
C 168
other 79
Name: count, dtype: int64
loc筛选反
df.loc[~df['Embarked'].isin(['S','C']), 'Embarked'] = 'other'
df['Embarked'].value_counts(dropna=False)
Embarked
S 644
C 168
other 79
Name: count, dtype: int64
2.4.2 数值合并
有时候我们会对分箱结果不满意,例如其中连着的两个分箱内数量太少不满足统计最小要求,因此选择合并。
# 先分箱
a, b = pd.cut(df['Age'], bins=8, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
[ 0.34042 10.3675 20.315 30.2625 40.21 50.1575 60.105 70.0525
80. ]
--------
Age
(0.34, 10.368] 64
(10.368, 20.315] 115
(20.315, 30.263] 230
(30.263, 40.21] 155
(40.21, 50.158] 86
(50.158, 60.105] 42
(60.105, 70.052] 17
(70.052, 80.0] 5
NaN 177
Name: count, dtype: int64
# (60.105, 70.052]和(60.105, 70.052]箱体数量太少,选择合并
bins = [0.34042,10.3675,20.315,30.2625,40.21,50.1575,60.105,80]
a, b = pd.cut(df['Age'], bins=bins, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
[ 0.34042 10.3675 20.315 30.2625 40.21 50.1575 60.105 80. ]
--------
Age
(0.34, 10.368] 64
(10.368, 20.315] 115
(20.315, 30.262] 230
(30.262, 40.21] 155
(40.21, 50.158] 86
(50.158, 60.105] 42
(60.105, 80.0] 22
NaN 177
Name: count, dtype: int64
免责声明:我们致力于保护作者版权,注重分享,被刊用文章因无法核实真实出处,未能及时与作者取得联系,或有版权异议的,请联系管理员,我们会立即处理! 部分文章是来自自研大数据AI进行生成,内容摘自(百度百科,百度知道,头条百科,中国民法典,刑法,牛津词典,新华词典,汉语词典,国家院校,科普平台)等数据,内容仅供学习参考,不准确地方联系删除处理! 图片声明:本站部分配图来自人工智能系统AI生成,觅知网授权图片,PxHere摄影无版权图库和百度,360,搜狗等多加搜索引擎自动关键词搜索配图,如有侵权的图片,请第一时间联系我们,邮箱:ciyunidc@ciyunshuju.com。本站只作为美观性配图使用,无任何非法侵犯第三方意图,一切解释权归图片著作权方,本站不承担任何责任。如有恶意碰瓷者,必当奉陪到底严惩不贷!
