5.Pandas分组

2024-02-29 1597阅读

温馨提示：这篇文章已超过388天没有更新，请注意相关的内容是否还可用！

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
sns.set(style='whitegrid')
warnings.filterwarnings('ignore')

pd.show_versions()

INSTALLED VERSIONS
------------------
commit              : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python              : 3.11.5.final.0
python-bits         : 64
OS                  : Windows
OS-release          : 10
Version             : 10.0.22621
machine             : AMD64
processor           : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
byteorder           : little
LC_ALL              : None
LANG                : None
LOCALE              : Chinese (Simplified)_China.936
pandas              : 2.1.4
numpy               : 1.24.3
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 68.2.2
pip                 : 23.3.1
Cython              : None
pytest              : 7.4.0
hypothesis          : None
sphinx              : 5.0.2
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : 4.9.3
html5lib            : 1.1
pymysql             : 1.0.2
psycopg2            : None
jinja2              : 3.1.2
IPython             : 8.15.0
pandas_datareader   : None
bs4                 : 4.12.2
bottleneck          : 1.3.5
dataframe-api-compat: None
fastparquet         : None
fsspec              : 2023.10.0
gcsfs               : None
matplotlib          : 3.8.0
numba               : 0.58.1
numexpr             : 2.8.7
odfpy               : None
openpyxl            : 3.0.10
pandas_gbq          : None
pyarrow             : 11.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : 2023.10.0
scipy               : 1.11.4
sqlalchemy          : 2.0.21
tables              : 3.8.0
tabulate            : 0.9.0
xarray              : 2023.6.0
xlrd                : 2.0.1
zstandard           : 0.19.0
tzdata              : 2023.3
qtpy                : 2.4.1
pyqt5               : None

1. 数据分组

分组的核心操作分为三步：

（图片来源网络，侵删）

分组：选择分组的方式，创建GroupBy对象；
应用：使用聚合、转换、过滤等类型的应用函数；
合并：将各应用函数结果合并。

df_english=pd.DataFrame(
     {
         "student": ["John", "James", "Jennifer"],
         "gender": ["male", "male", "female"],
         "score": [20, 30, 30],
         "subject": "english"
     }
 )
 
df_math=pd.DataFrame(
     {
         "student": ["John", "James", "Jennifer"],
         "gender": ["male", "male", "female"],
         "score": [90, 100, 95],
         "subject": "math"
     }
 )
df = pd.concat([df_english, df_math],ignore_index=True)
df = df.sort_values(['student','subject']).reset_index(drop=True)
df

	student	gender	score	subject
0	James	male	30	english
1	James	male	100	math
2	Jennifer	female	30	english
3	Jennifer	female	95	math
4	John	male	20	english
5	John	male	90	math

1.1 列筛选

# 不指定列
df.groupby(by='student').count()

	gender	score	subject
student
James	2	2	2
Jennifer	2	2	2
John	2	2	2

# 指定列
df.groupby(by='student')['score'].count()

student
James       2
Jennifer    2
John        2
Name: score, dtype: int64

1.2 排序

groupby()默认有排序，如果不想要排序可以设置sort=False关闭。

df.groupby(by=['student', 'score'])['subject'].agg(['count', 'size'])

		count	size
student	score
James	30	1	1
100	1	1
Jennifer	30	1	1
95	1	1
John	20	1	1
90	1	1

df.groupby(by=['student', 'score'], sort=False)['subject'].agg(['count', 'size'])

		count	size
student	score
James	30	1	1
100	1	1
Jennifer	30	1	1
95	1	1
John	20	1	1
90	1	1

1.3 索引重置

groupby()会把分组列作为繁华dataframe的索引，如果不想这样，可以设置as_index=False参数。

df.groupby(by='student')['score'].agg(['mean', 'max', 'min'])

	mean	max	min
student
James	65.0	100	30
Jennifer	62.5	95	30
John	55.0	90	20

df.groupby(by='student', as_index=False)['score'].agg(['mean', 'max', 'min'])

	student	mean	max	min
0	James	65.0	100	30
1	Jennifer	62.5	95	30
2	John	55.0	90	20

1.4 空值统计

groupby()分组时默认忽略空值，但实际中有时需要对变量空值进行统计，可设置参数dropna=False。

dd = pd.DataFrame({
      'A':['a1','a1','a2','a3'],
      'B':['b1',None,'b2','b3'],
      'C':[1,2,3,4],
      'D':[5,None,9,10]})
dd

	A	B	C	D
0	a1	b1	1	5.0
1	a1	None	2	NaN
2	a2	b2	3	9.0
3	a3	b3	4	10.0

# 默认忽略空值
dd.groupby('B').sum()

	A	C	D
B
b1	a1	1	5.0
b2	a2	3	9.0
b3	a3	4	10.0

# 分组空值加入统计
dd.groupby('B', dropna=False).sum()

	A	C	D
B
b1	a1	1	5.0
b2	a2	3	9.0
b3	a3	4	10.0
NaN	a1	2	0.0

1.5 多级索引

除数据列外，我们也可以通过索引层级来实现分组。

df = df.set_index(['student', 'subject'])
df

		gender	score
student	subject
James	english	male	30
math	male	100
Jennifer	english	female	30
math	female	95
John	english	male	20
math	male	90

# 指定level=0分组
df.groupby(level=0).count()

	gender	score
student
James	2	2
Jennifer	2	2
John	2	2

# 指定level=1分组
df.groupby(level=1).count()

	gender	score
subject
english	3	3
math	3	3

1.6 分组聚合

即在分组对象后应用聚合函数，有内置方法和自定义方法两种。

方法	概述
any()	计算组中的任何数是否为真
all()	计算组中的任何数是否全都为真
count()	计算组中非NA的数量
cov()*	计算组的协方差
first()	计算每组第一个出现的值
idxmax()*	最大值的索引
idxmin()*	最小值的索引
last()	每组最后一个出现的值
max()	每组的最大值
min()	每组的最小值
mean()	每组的平均数
median()	每组的中位数
nunique()	每组中唯一值的数量
prod()	每组的乘积
quantile()	每组中值的给定分位数
sem()	计算每组中值的平均值的标准差
size()	值的数量
skew()*	值的偏差
std()	值的标准偏差
sum()	值的总和
var()	值的方差

# 集中趋势
# 按学生分组计算平均分数
df.groupby(['student'])['score'].mean()
# 按学生分组计算中位数分数
df.groupby(['student'])['score'].median()
# 按学生分组计算指定分位数
df.groupby(['student'])['score'].quantile()

student
James       65.0
Jennifer    62.5
John        55.0
Name: score, dtype: float64

# 离散趋势
# 按学生分组计算最大分数
df.groupby(['student'])['score'].max()
# 按学生分组计算最小分数
df.groupby(['student'])['score'].min()
# 按学生分组计算方差
df.groupby(['student'])['score'].var()
# 按学生分组计算协方差
# df.groupby(['student'])['score'].cov()
# 按学生分组计算偏差
df.groupby(['student'])['score'].skew()
# 按学生分组计算标准差
df.groupby(['student'])['score'].std()

student
James       49.497475
Jennifer    45.961941
John        49.497475
Name: score, dtype: float64

# 计数
# 按学生分组计算非空数量
df.groupby(['student'])['score'].count()
# 按学生分组计算所有值数量
df.groupby(['student'])['score'].size()
# 按学生分组计算唯一值数量
df.groupby(['student'])['score'].nunique()
# 按学生分组计算求和
df.groupby(['student'])['score'].sum()
# 按学生分组计算求乘积
df.groupby(['student'])['score'].prod()

student
James       3000
Jennifer    2850
John        1800
Name: score, dtype: int64

# 排序
# 按学生分组返回第一个score
df.groupby(['student'])['score'].first()
# 按学生分组返回最后一个score
df.groupby(['student'])['score'].last()
# 按学生分组返回第n个score
df.groupby(['student'])['score'].nth(1)

student   subject
James     math       100
Jennifer  math        95
John      math        90
Name: score, dtype: int64

2. 分箱

分箱时将连续数值变量变为离散分类变量的一种方法，常用于特征工程，这对于组的划分有更高的要求，而不是简单的将相同值划分为同一组。

分箱的分为很多种，简单的有等频和等距，如果涉及到算法的，常见的有决策树、卡方、最优分箱等，这里只给出最简单的能由pandas实现的等频和等距两种方法。

df = pd.read_csv('data/titanic.csv')
df.sample(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
414	415	1	3	Sundman, Mr. Johan Julian	male	44.0	0	STON/O 2. 3101269	7.925	NaN	S
728	729	0	2	Bryhl, Mr. Kurt Arnold Gottfrid	male	25.0	1	236853	26.000	NaN	S
186	187	1	3	O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)	female	NaN	1	370365	15.500	NaN	Q
449	450	1	1	Peuchen, Major. Arthur Godfrey	male	52.0	0	113786	30.500	C104	S
382	383	0	3	Tikkanen, Mr. Juho	male	32.0	0	STON/O 2. 3101293	7.925	NaN	S

2.1 等距分箱

即按相同距离将连续的数值划分成不同的箱体，每个箱体的间距相等，但箱体内的数据并不一定相等，用pd.cut()实现。

pd.cut(df['Age'], bins=6)

0      (13.683, 26.947]
1       (26.947, 40.21]
2      (13.683, 26.947]
3       (26.947, 40.21]
4       (26.947, 40.21]
             ...       
886     (26.947, 40.21]
887    (13.683, 26.947]
888                 NaN
889    (13.683, 26.947]
890     (26.947, 40.21]
Name: Age, Length: 891, dtype: category
Categories (6, interval[float64, right]): [(0.34, 13.683]  
pd.cut(df['Age'], bins=6).value_counts().sort_index()
 
Age
(0.34, 13.683]       71
(13.683, 26.947]    248
(26.947, 40.21]     245
(40.21, 53.473]     100
(53.473, 66.737]     43
(66.737, 80.0]        7
Name: count, dtype: int64
 
2.2 自定义分箱 
当bins为标量序列时，cut会按照我们自定义的边界点进行分箱，不再是等距分箱。 
pd.cut(df['Age'], bins=[0, 10, 30, 50, 100])
 
0      (10.0, 30.0]
1      (30.0, 50.0]
2      (10.0, 30.0]
3      (30.0, 50.0]
4      (30.0, 50.0]
           ...     
886    (10.0, 30.0]
887    (10.0, 30.0]
888             NaN
889    (10.0, 30.0]
890    (30.0, 50.0]
Name: Age, Length: 891, dtype: category
Categories (4, interval[int64, right]): [(0, 10]  
pd.cut(df['Age'], bins=[0, 10, 30, 50, 100]).value_counts().sort_index()
 
Age
(0, 10]       64
(10, 30]     345
(30, 50]     241
(50, 100]     64
Name: count, dtype: int64
 
2.3 等频分箱 
等频要求分组内的数量相等，在pandas中用qcut实现。 
pd.qcut(df['Age'], q=10).value_counts().sort_index()
 
Age
(0.419, 14.0]    77
(14.0, 19.0]     87
(19.0, 22.0]     67
(22.0, 25.0]     70
(25.0, 28.0]     61
(28.0, 31.8]     66
(31.8, 36.0]     91
(36.0, 41.0]     53
(41.0, 50.0]     78
(50.0, 80.0]     64
Name: count, dtype: int64
 
2.4 分箱合并 
2.4.1 分类合并 
当分类型变量有很多值但分布比较分散的时候，我们有时需要做合并的处理。 
df['Embarked'].value_counts(dropna=False)
 
Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64
 
where条件筛选 
where可接受的筛选条件必须是布尔型的，基本判断逻辑是：如果对象不满足匹配条件，就被赋值为默认空值或other参数指定的值，满足条件的保持原值不变。 
cond = df['Embarked'].isin(['S','C'])
df['Embarked'].where(cond, other='other', inplace=True)
df['Embarked'].value_counts(dropna=False)
 
Embarked
S        644
C        168
other     79
Name: count, dtype: int64
 
loc筛选反 
df.loc[~df['Embarked'].isin(['S','C']), 'Embarked'] = 'other'
df['Embarked'].value_counts(dropna=False)
 
Embarked
S        644
C        168
other     79
Name: count, dtype: int64
 
2.4.2 数值合并 
有时候我们会对分箱结果不满意，例如其中连着的两个分箱内数量太少不满足统计最小要求，因此选择合并。 
# 先分箱
a, b = pd.cut(df['Age'], bins=8, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
 
[ 0.34042 10.3675  20.315   30.2625  40.21    50.1575  60.105   70.0525
 80.     ]
--------

Age
(0.34, 10.368]       64
(10.368, 20.315]    115
(20.315, 30.263]    230
(30.263, 40.21]     155
(40.21, 50.158]      86
(50.158, 60.105]     42
(60.105, 70.052]     17
(70.052, 80.0]        5
NaN                 177
Name: count, dtype: int64
 
# (60.105, 70.052]和(60.105, 70.052]箱体数量太少，选择合并
bins = [0.34042,10.3675,20.315,30.2625,40.21,50.1575,60.105,80]
a, b = pd.cut(df['Age'], bins=bins, retbins=True)
print(b)
print('-'*8)
a.value_counts(dropna=False).sort_index()
 
[ 0.34042 10.3675  20.315   30.2625  40.21    50.1575  60.105   80.     ]
--------

Age
(0.34, 10.368]       64
(10.368, 20.315]    115
(20.315, 30.262]    230
(30.262, 40.21]     155
(40.21, 50.158]      86
(50.158, 60.105]     42
(60.105, 80.0]       22
NaN                 177
Name: count, dtype: int64

VPS购买请点击我

免责声明：我们致力于保护作者版权，注重分享，被刊用文章因无法核实真实出处，未能及时与作者取得联系，或有版权异议的，请联系管理员，我们会立即处理! 部分文章是来自自研大数据AI进行生成,内容摘自(百度百科,百度知道,头条百科,中国民法典,刑法,牛津词典,新华词典,汉语词典,国家院校,科普平台)等数据,内容仅供学习参考,不准确地方联系删除处理! 图片声明：本站部分配图来自人工智能系统AI生成,觅知网授权图片,PxHere摄影无版权图库和百度，360，搜狗等多加搜索引擎自动关键词搜索配图，如有侵权的图片，请第一时间联系我们，邮箱：ciyunidc@ciyunshuju.com。本站只作为美观性配图使用,无任何非法侵犯第三方意图,一切解释权归图片著作权方,本站不承担任何责任。如有恶意碰瓷者,必当奉陪到底严惩不贷!

5.Pandas分组

1. 数据分组

1.1 列筛选

1.2 排序

1.3 索引重置

1.4 空值统计

1.5 多级索引

1.6 分组聚合

2. 分箱

2.1 等距分箱

相关阅读

怎么把织梦的模板替换?

dedecms怎么调用特定的栏目文档?

怎么抓包一个网页?

wap怎么封装app?

目录[+]