当前位置:网站首页>选取某个分类占比最大的值,作为该分类的唯一值
选取某个分类占比最大的值,作为该分类的唯一值
2022-07-16 06:28:00 【bingbangx】
在日常的数据需求,我们的数据需求方——甲方巴巴可能会有这样的要求。从数据库中提取两部分数据进行对比。然而对比数据又不是唯一值,怎么确定唯一值是什么呢?就要求我们自己来计算。
比如,我最近接到的一个数据需求。要求将**店铺,7月份,每个sku编码,单价数量占比最高的单价,来作为该sku编码唯一值,用于比对时的衡量标准。
我想到了两个解决方案:
1、使用EXcel来处理
首先数据清洗,计算sku编码占比。只选择某个店铺某个月份的数据,并提取sku编码,单价,将所有数据去重,使用countifs()函数计数,再计算占比,字段呈现:sku编码,单价,个数,个数占比。
其次找出sku占比的最大值。提取sku编码去重,使用maxifs()函数,找出最大的占比值,字段呈现:sku编码,最大占比。
找出sku编码占比最大的单价。字段:使用vlookup()函数,合并sku编码&最大占比,查找单价,字段呈现:sku编码,单价。
2、Python来处理
data_7= data[(data['月份'] =='7月') & (data['店铺'] =='**订单')] #多个条件的时候使用&,并且条件之间使用括号括起来关联
data_71 = data_7.groupby(by = ['sku编码','单价'])['订单号'].count().reset_index(drop = False) #统计每个sku编码的数据量
data_71 = data_71.rename(columns ={'订单号':'数量'}) #重命名
data_71['占比'] = data_71['数量'] /data_71['数量'].sum() #计算每个sku编码,单价的数据量占比
#使用降序排列,并且相同数字的,按照1234…命名。这样可以避免同一排名的对应多个
data_71['ranks'] = data_71.groupby(['sku编码'])['占比'].rank(method ='first',ascending =False).astype(int)
# data_71.reset_index(drop =False) #重置索引
data_72 = data_71[data_71['ranks'] == 1] # 取排名第1的行,作为对比数据
data_72['占比'] = data_72['占比'].apply(lambda x: format(x,'.2%'))
print(data_72)
print(len(data_72))结果呈现:

至此,我们就完成了sku编码唯一值的提取。可以进行下面的操作了~~~~~
边栏推荐
- Antd a-upload limit the number of uploads when multiple is true
- CUDA and cudnn installation tutorial (super detailed) Uninstalling CUDA, installing nsight visual studio edition of CUDA fails, and there is no CUDA option in vs2019+cuda11.1 new project
- Wechat selection and voting of finished works of applet graduation project (6) opening defense ppt
- 为什么电商平台要重点关注非法经营罪,这与二清又有什么关联?
- Learning path PHP -- post can't get the requested data
- Change and invariance of hierarchical automated test model
- Writing a new app with swift5 requires some considerations
- How to write ADB SQL in dataworks
- STM32 application development practice tutorial: human machine interface application development with interactive function
- Stress testing tools (commonly used) and sendfile process
猜你喜欢

Operation of simulated examination platform for 2022 G2 boiler stoker examination

(CVPR-2022)用于改进步态识别的拉格朗日运动分析和视角嵌入

05 gulimall VirtualBox set fixed IP for virtual machine

Xsslabs customs clearance

Conditional ternary operator...

Change and invariance of hierarchical automated test model

QT learning diary 16 - qfile file reading and writing

STM32应用开发实践教程:具备交互功能的人机界面应用开发

00 后博士获聘南大特任副研究员,曾 4 岁上小学,14 岁考入南大!

The secret of black industry that the risk controller cannot know
随机推荐
保研机试备考十五:字符串处理 集大成者
Total sequencing problem
Eureka read-write lock fantasy, too top!
AcWing 3540. Binary search tree binary sort tree BST
Wechat selection and voting of finished works of applet graduation project (6) opening defense ppt
AcWing 3433. Eat candy recursively | find rules
Mitmproxy getting started installation
微机原理与技术接口 实验三 循环结构
HMM & MEMM & CRF
STM32 application development practice tutorial: multi computer communication application development based on CAN bus
(CVPR-2022)用于改进步态识别的拉格朗日运动分析和视角嵌入
Is there a trap for Huatai Securities to open an account? Is it safe to open an account online?
C library function - sscanf() usage
STM32应用开发实践教程:基于 CAN 总线的多机通信应用开发
How to choose databases and tables and newsql?
JVM memory model
Learning path PHP -- post can't get the requested data
The first one in the whole network: how to avoid app crash due to calling unimplemented class methods
Gee (6): set the number of decimal places reserved for the calculated value / image
Stonedb announces open source, why is the integrated real-time HTAP architecture the current best solution