add some pandas usage code

master
wanglei 2021-01-31 19:00:17 +08:00
parent 92ff42bd14
commit a3170fbd18
3 changed files with 322 additions and 0 deletions

@ -0,0 +1,191 @@
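## 1. Introduction
The join is arguably the most important operation in a relational database, and in practice it is also where problems most often show up and where optimization is most often needed. If we think of a DataFrame as a table, join operations naturally come up as well, and they are very common. Let's take a close look at how `join` works in pandas.
## 2. The join method signature
The `join` method is declared in the pandas source as follows: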
```
def join(
    self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False
) -> "DataFrame":
    """
    Join columns of another DataFrame.

    Join columns with `other` DataFrame either on index or on a key
    column. Efficiently join multiple DataFrame objects by index at once by
    passing a list.

    Parameters
    ----------
    other : DataFrame, Series, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame.
    on : str, list of str, or array-like, optional
        Column or index level name(s) in the caller to join on the index
        in `other`, otherwise joins index-on-index. If multiple
        values given, the `other` DataFrame must have a MultiIndex. Can
        pass an array as the join key if it is not already contained in
        the calling DataFrame. Like an Excel VLOOKUP operation.
    how : {'left', 'right', 'outer', 'inner'}, default 'left'
        How to handle the operation of the two objects.
        * left: use calling frame's index (or column if on is specified)
        * right: use `other`'s index.
        * outer: form union of calling frame's index (or column if on is
          specified) with `other`'s index, and sort it.
          lexicographically.
        * inner: form intersection of calling frame's index (or column if
          on is specified) with `other`'s index, preserving the order
          of the calling's one.
    lsuffix : str, default ''
        Suffix to use from left frame's overlapping columns.
    rsuffix : str, default ''
        Suffix to use from right frame's overlapping columns.
    sort : bool, default False
        Order result DataFrame lexicographically by the join key. If False,
        the order of the join key depends on the join type (how keyword).

    Returns
    -------
    DataFrame
        A dataframe containing columns from both the caller and `other`.
```
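In short, the signature is `join(self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False)`, where:

* `other`: DataFrame, Series, or list of DataFrame; the other object(s) to join with.
* `on`: the column(s) of the calling DataFrame to match against the index of `other`, loosely comparable to the ON clause in SQL. If omitted, the join is index-on-index.
* `how`: {'left', 'right', 'outer', 'inner'}, default 'left'; similar to the join types in SQL.
* `lsuffix`: suffix appended to overlapping column names from the left DataFrame.
* `rsuffix`: suffix appended to overlapping column names from the right DataFrame.
* `sort`: whether to sort the result lexicographically by the join key.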
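
To make the `how` parameter more concrete, here is a minimal sketch of the four join types on an index-on-index join. The two small DataFrames are hypothetical sample data invented for this illustration; they are indexed by name so that `join` can align them without `on`.

```python
import pandas as pd

# Hypothetical sample data, indexed by name (not taken from the examples below).
age_df = pd.DataFrame({"age": [18, 28, 24]}, index=["lili", "lucy", "tracy"])
score_df = pd.DataFrame({"score": ["B", "A"]}, index=["tracy", "tony"])

left = age_df.join(score_df)                 # default: keep every row of the caller
right = age_df.join(score_df, how="right")   # keep every row of `other`
inner = age_df.join(score_df, how="inner")   # keep only index values present in both
outer = age_df.join(score_df, how="outer")   # keep index values present in either

for name, frame in [("left", left), ("right", right), ("inner", inner), ("outer", outer)]:
    print(name)
    print(frame)
```

Rows that have no match on the other side are filled with NaN, just as a SQL outer join fills them with NULL.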
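## 3. Joining on a specific column
In practice, the most common kind of join is on a column that the two tables share. Let's start with a simple example:
```
import pandas as pd


def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    result = age_df.join(score_df, on='name')
    print(result)
```
The code above raises the following error:
```
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
```
The reason is that `join` always matches against the index of `other`. With `on='name'`, the string values in `age_df['name']` are compared with the default integer index of `score_df`, hence the complaint about object and int64 columns. If that is not clear yet, the following test code shows what the index is: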
```
def joindemo2():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    print(age_df)
    age_df.set_index('name', inplace=True)
    print(age_df)
```
Running this code prints:
```
    name  age
0   lili   18
1   lucy   28
2  tracy   24
3   mike   36
       age
name
lili    18
lucy    28
tracy   24
mike    36
```
By default, a DataFrame's index is a sequence of integers increasing from 0; the leading numbers 0, 1, 2, 3 in the first printout are that index. Once we set `name` as the index, the printed structure changes and the incrementing numbers disappear.
To implement the join we originally wanted, we can write:
```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    # Make `name` the index of both frames so the join can match on it.
    age_df.set_index('name', inplace=True)
    score_df.set_index('name', inplace=True)
    result = age_df.join(score_df, on='name')
    print(result)
```
The output is:
```
       age score
name
lili    18   NaN
lucy    28   NaN
tracy   24     B
mike    36     B
```
The default is a left join, so every row of `age_df` is kept and names without a score get NaN. This is exactly what we wanted.
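
As a variation on the approach above, it should be enough to set the index on `other` only, because `on` names a column of the caller and matches it against the index of `other`. The following is a minimal sketch under that assumption, reusing the same sample data; it is one possible way to write the join, not the only one.

```python
import pandas as pd

age_df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "mike"],
                       "age": [18, 28, 24, 36]})
score_df = pd.DataFrame({"name": ["tony", "mike", "akuda", "tracy"],
                         "score": ["A", "B", "C", "B"]})

# Only `other` needs `name` as its index; the caller keeps `name` as a regular column.
# set_index() without inplace=True returns a new frame and leaves score_df untouched.
result = age_df.join(score_df.set_index("name"), on="name")
print(result)
```

With this form, `name` stays an ordinary column in the result, which can be convenient when later steps still need it as data.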
## 4. Joining on the default auto-increment index
Next, let's try joining on the default auto-incrementing index.
```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    result = age_df.join(score_df)
    print(result)
```
This code also raises an error:
```
ValueError: columns overlap but no suffix specified: Index(['name'], dtype='object')
```
Both DataFrames contain a `name` column, so this is where the `lsuffix` and `rsuffix` parameters come in:
```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    result = age_df.join(score_df, lsuffix='_left', rsuffix='_right')
    print(result)
```
The rows are now paired purely by index position:
```
  name_left  age name_right score
0      lili   18       tony     A
1      lucy   28       mike     B
2     tracy   24      akuda     C
3      mike   36      tracy     B
```
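## 5. Difference from merge
pandas also has a `merge` method that can do the same job as `join`. For the specific differences between the two, see the following link:
https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas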
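
For a quick comparison, here is a minimal sketch of the section 3 join written with `pd.merge`. Unlike `join`, `merge` matches on columns by default, so no `set_index` call is needed; the sample data is the same as in the earlier examples.

```python
import pandas as pd

age_df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "mike"],
                       "age": [18, 28, 24, 36]})
score_df = pd.DataFrame({"name": ["tony", "mike", "akuda", "tracy"],
                         "score": ["A", "B", "C", "B"]})

# merge() joins column-on-column; how="left" keeps every row of age_df.
result = pd.merge(age_df, score_df, on="name", how="left")
print(result)
```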

@ -0,0 +1,38 @@
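In pandas, we often need to create a new column from an existing column of a DataFrame. A common example is deriving a grade level from a score. Let's see how to do that.
```
import pandas as pd


def getlevel(score):
    if score < 60:
        return "bad"
    elif score < 80:
        return "mid"
    else:
        return "good"


def test():
    data = {'name': ['lili', 'lucy', 'tracy', 'tony', 'mike'],
            'score': [85, 61, 75, 49, 90]
            }
    df = pd.DataFrame(data=data)
    # Either form works
    # df['level'] = df.apply(lambda x: getlevel(x['score']), axis=1)
    df['level'] = df.apply(lambda x: getlevel(x.score), axis=1)
    print(df)
```
Running the code above prints:
```
    name  score level
0   lili     85  good
1   lucy     61   mid
2  tracy     75   mid
3   tony     49   bad
4   mike     90  good
```
The key to this is the DataFrame's `apply` method.
The code adds a new column named `level`, derived from the `score` column: scores below 60 are `bad`, scores from 60 up to (but not including) 80 are `mid`, and scores of 80 or above are `good`.
Here `axis=1` means the function is applied to each row, so `x` is one row of the DataFrame and `x.score` is that row's score.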
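
When the new column depends on a single numeric column, as it does here, binning with `pd.cut` is a possible alternative to a row-wise `apply`. The sketch below assumes the same thresholds as `getlevel`; the choice of `pd.cut` and the use of infinite outer bin edges are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "tony", "mike"],
                   "score": [85, 61, 75, 49, 90]})

# right=False makes the bins left-closed:
# [-inf, 60) -> bad, [60, 80) -> mid, [80, inf) -> good, matching getlevel().
df["level"] = pd.cut(df["score"],
                     bins=[-np.inf, 60, 80, np.inf],
                     right=False,
                     labels=["bad", "mid", "good"])
print(df)
```

Note that the resulting `level` column has a categorical dtype rather than plain strings; `df["level"].astype(str)` converts it if strings are needed.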

@ -0,0 +1,93 @@
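## 1. MD5 in Python
Hashing a string with MD5 to anonymize it is a common technique in data processing, and Python 3.x ships an MD5 implementation in its standard library. Let's look at how to use it.
```
import hashlib


def test():
    s = "123"
    m = hashlib.md5()
    for i in range(5):
        m.update(s.encode("utf8"))
        result = m.hexdigest()
        print(result)
```
The output of the code above is: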
```
202cb962ac59075b964b07152d234b70
4297f44b13955235245b2497399d7a93
f5bb0c8de146c67b44babbf4e6584cc0
101193d7181cc88340ae5b2b17bba8a1
e277dd1e05688a22e377e25a3dae5de1
```
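We wanted to compute the MD5 of the string "123" five times. In theory the five results should be identical, yet every output is different.
The reason lies in the `update` method.
When `update` is called repeatedly on the same hash object, say first with string a and then with string b, the second digest is actually the MD5 of a+b.
Here is a simple example to confirm this:
```
def test():
    s = "123123"
    m = hashlib.md5()
    m.update(s.encode("utf8"))
    result = m.hexdigest()
    print(result)
```
The output is: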
```
4297f44b13955235245b2497399d7a93
```
This is identical to the second output of the earlier for loop; in other words, that output was the MD5 of "123123".
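
To get the same digest every time, one straightforward option is to create a fresh hash object for each input. The sketch below illustrates this; `md5_hex` is a hypothetical helper name introduced for this example only.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of `text`, using a new hash object per call."""
    return hashlib.md5(text.encode("utf8")).hexdigest()


# Five independent hashes of the same string now agree.
digests = [md5_hex("123") for _ in range(5)]
print(digests[0])
print(len(set(digests)) == 1)

# update() accumulates input: two updates with "123" equal one hash of "123123".
m = hashlib.md5()
m.update(b"123")
m.update(b"123")
print(m.hexdigest() == md5_hex("123123"))
```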
## 2. The hashlib module
The hashlib module contains the commonly used hash algorithms. Its source lists the following:
```
# This tuple and __get_builtin_constructor() must be modified if a new
# always available algorithm is added.
__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
'blake2b', 'blake2s',
'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
'shake_128', 'shake_256')
```
Let's pick a few common ones and test them:
```
from hashlib import md5
from hashlib import sha256
from hashlib import sha512

hash_functions = [md5, sha256, sha512]


def get_hash_code(s):
    # `s` must be bytes; returns (algorithm name, hex digest, digest length) tuples.
    result = []
    for function in hash_functions:
        hash_obj = function(s)
        hash_hex = hash_obj.hexdigest()
        result.append((hash_obj.name, hash_hex, len(hash_hex)))
    return result


if __name__ == '__main__':
    s = "123"
    result = get_hash_code(s.encode("utf-8"))
    for each in result:
        print(each)
```
The final output is:
```
('md5', '202cb962ac59075b964b07152d234b70', 32)
('sha256', 'a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3', 64)
('sha512', '3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2', 128)
```
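
The algorithms can also be selected by name at runtime with `hashlib.new`, and `hashlib.algorithms_guaranteed` reports the names that are always available. The following is a minimal sketch along those lines; `hex_digests` is a hypothetical helper written for this illustration.

```python
import hashlib


def hex_digests(data: bytes, names=("md5", "sha256", "sha512")):
    """Map each algorithm name to the hex digest of `data`."""
    return {name: hashlib.new(name, data).hexdigest() for name in names}


result = hex_digests("123".encode("utf-8"))
for name, digest in result.items():
    print(name, digest, len(digest))

# Names that hashlib guarantees on every platform.
print(sorted(hashlib.algorithms_guaranteed))
```

Choosing by name is handy when the algorithm comes from configuration rather than from an import statement.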