add some pandas usage code
parent
92ff42bd14
commit
a3170fbd18
@ -0,0 +1,191 @@
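## 1. Introduction

Join is arguably the most central operation in a relational database, and in practice it is also where problems show up most easily and where optimization is most often needed. If we think of a DataFrame as a table, joins naturally come up there too, and they are very common. Let's take a close look at how join works in pandas.
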
## 2. Signature of the join method

The signature of the join method in the pandas source is as follows:

```
def join(
    self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False
) -> "DataFrame":
    """
    Join columns of another DataFrame.

    Join columns with `other` DataFrame either on index or on a key
    column. Efficiently join multiple DataFrame objects by index at once by
    passing a list.

    Parameters
    ----------
    other : DataFrame, Series, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame.
    on : str, list of str, or array-like, optional
        Column or index level name(s) in the caller to join on the index
        in `other`, otherwise joins index-on-index. If multiple
        values given, the `other` DataFrame must have a MultiIndex. Can
        pass an array as the join key if it is not already contained in
        the calling DataFrame. Like an Excel VLOOKUP operation.
    how : {'left', 'right', 'outer', 'inner'}, default 'left'
        How to handle the operation of the two objects.

        * left: use calling frame's index (or column if on is specified)
        * right: use `other`'s index.
        * outer: form union of calling frame's index (or column if on is
          specified) with `other`'s index, and sort it.
          lexicographically.
        * inner: form intersection of calling frame's index (or column if
          on is specified) with `other`'s index, preserving the order
          of the calling's one.
    lsuffix : str, default ''
        Suffix to use from left frame's overlapping columns.
    rsuffix : str, default ''
        Suffix to use from right frame's overlapping columns.
    sort : bool, default False
        Order result DataFrame lexicographically by the join key. If False,
        the order of the join key depends on the join type (how keyword).

    Returns
    -------
    DataFrame
        A dataframe containing columns from both the caller and `other`.

```

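In short, the signature is `def join(self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False)`, where:

other: DataFrame, Series, or list of DataFrame. This is the other object to join with.

on: the column(s) or index level(s) of the calling DataFrame to join on. It plays a role similar to ON in SQL, but the values are always matched against the index of `other`. If omitted, the join is index-on-index.

how: {'left', 'right', 'outer', 'inner'}, default 'left'. Similar to the join types in SQL.

lsuffix: suffix appended to overlapping column names from the left DataFrame.

rsuffix: suffix appended to overlapping column names from the right DataFrame.

sort: if True, sort the result lexicographically by the join key. If False, the order depends on the join type.
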
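To make the `how` options concrete, here is a minimal sketch. The frames `left` and `right` and the helper `joined_keys` are made up for illustration and are not part of pandas; the sketch only shows which index labels survive each join type.

```python
import pandas as pd

# Hypothetical sample frames, both indexed by a person's name.
left = pd.DataFrame({"age": [18, 28, 24]}, index=["lili", "lucy", "tracy"])
right = pd.DataFrame({"score": ["B", "A"]}, index=["tracy", "tony"])


def joined_keys(how):
    """Return the index labels of left.join(right, how=how) as a list."""
    return list(left.join(right, how=how).index)


# left keeps the caller's labels, right keeps other's labels,
# inner keeps the intersection, outer keeps the union.
for how in ("left", "right", "inner", "outer"):
    print(how, joined_keys(how))
```

With these frames, the left join keeps `lili`, `lucy`, `tracy`, and the inner join keeps only `tracy`, the one label present in both.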
## 3. Joining on a specified column

In practice, the most common kind of join is on a column that the two frames share. Let's start with a simple example:

```
import pandas as pd

def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})

    # `on='name'` is matched against score_df's index, not its 'name' column
    result = age_df.join(score_df, on='name')
    print(result)
```

The code above raises the following error:

```
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
```

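The reason is that `join` matches the caller's `on` column against the index of `other`, not against a column of the same name. Here `score_df` still has the default integer index, so pandas ends up comparing the string column `name` with an int64 index. The following test code shows what the index looks like:
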
```
def joindemo2():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})
    print(age_df)
    age_df.set_index('name', inplace=True)
    print(age_df)
```

Running this code prints:

```
    name  age
0   lili   18
1   lucy   28
2  tracy   24
3   mike   36
       age
name
lili    18
lucy    28
tracy   24
mike    36
```

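A DataFrame's default index is a sequence of integers starting at 0; the 0, 1, 2, 3 at the start of each row are that index. Once we set `name` as the index, the printed structure changes and the incrementing integers disappear.

To make the original join work, we can use the following code:
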
```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})

    # Make `name` the index of both frames so the join keys line up
    age_df.set_index('name', inplace=True)
    score_df.set_index('name', inplace=True)
    # `on='name'` now refers to the index level named 'name' in age_df
    result = age_df.join(score_df, on='name')
    print(result)
```

The output is:

```
       age score
name
lili    18   NaN
lucy    28   NaN
tracy   24     B
mike    36     B
```

`how` defaults to a left join, so every row of `age_df` is kept and names without a score get `NaN`. This is the result we wanted.

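A variation that may be convenient when `name` should stay an ordinary column in the result: only the right-hand frame needs `name` as its index, because `on` is matched against the index of `other`. The following is a minimal sketch using the same sample data, not the only way to do it.

```python
import pandas as pd

age_df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "mike"],
                       "age": [18, 28, 24, 36]})
score_df = pd.DataFrame({"name": ["tony", "mike", "akuda", "tracy"],
                         "score": ["A", "B", "C", "B"]})

# set_index returns a new frame here, so score_df itself is left unchanged.
# age_df keeps its default integer index and its `name` column.
result = age_df.join(score_df.set_index("name"), on="name")
print(result)
```

The result has the columns `name`, `age`, and `score`, with `NaN` scores for the names missing from `score_df`.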
## 4. Joining on the default auto-increment index

Next, let's try joining on the default auto-increment index:

```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})

    result = age_df.join(score_df)
    print(result)
```

This code also raises an error:

```
ValueError: columns overlap but no suffix specified: Index(['name'], dtype='object')
```

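This is where the `lsuffix` and `rsuffix` parameters come in: both frames have a `name` column, and the suffixes tell pandas how to rename the overlapping columns.
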
```
def joindemo():
    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                           'age': [18, 28, 24, 36]})
    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                             'score': ['A', 'B', 'C', 'B']})

    result = age_df.join(score_df, lsuffix='_left', rsuffix='_right')
    print(result)
```

```
  name_left  age name_right score
0      lili   18       tony     A
1      lucy   28       mike     B
2     tracy   24      akuda     C
3      mike   36      tracy     B
```

## 5. Differences from merge

pandas also has a merge method that can do the same job as join. For the specific differences, see the following link:

https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas

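As a rough illustration of one difference: `merge` matches on columns by default, whereas `join` matches against the index of `other`. The sketch below reuses the sample frames from section 3 and assumes a left join is wanted; it shows one way to get the same rows without touching either index.

```python
import pandas as pd

age_df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "mike"],
                       "age": [18, 28, 24, 36]})
score_df = pd.DataFrame({"name": ["tony", "mike", "akuda", "tracy"],
                         "score": ["A", "B", "C", "B"]})

# merge compares the `name` column of both frames directly,
# so no set_index call is needed on either side.
result = pd.merge(age_df, score_df, on="name", how="left")
print(result)
```

Because the key column is shared rather than duplicated, no `lsuffix` or `rsuffix` is needed here.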
@ -0,0 +1,38 @@
It is common in pandas to need a new DataFrame column derived from an existing one. A typical example is assigning a grade level based on a score. Let's see how to do that.

```
import pandas as pd


def getlevel(score):
    if score < 60:
        return "bad"
    elif score < 80:
        return "mid"
    else:
        return "good"


def test():
    data = {'name': ['lili', 'lucy', 'tracy', 'tony', 'mike'],
            'score': [85, 61, 75, 49, 90]
            }
    df = pd.DataFrame(data=data)
    # Either form works
    # df['level'] = df.apply(lambda x: getlevel(x['score']), axis=1)
    df['level'] = df.apply(lambda x: getlevel(x.score), axis=1)

    print(df)
```

Running the code above gives:

```
    name  score level
0   lili     85  good
1   lucy     61   mid
2  tracy     75   mid
3   tony     49   bad
4   mike     90  good
```

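This relies mainly on the DataFrame `apply` method.

The code above adds a new column named `level`, derived from the `score` column: scores below 60 are `bad`, scores from 60 up to but not including 80 are `mid`, and scores of 80 or above are `good`.

`axis=1` means the function is applied to each row: every row is passed to the lambda as a Series, and the returned values form the new column.
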
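Since the level depends on a single column, two alternatives are worth sketching: `Series.apply`, which passes each score directly, and `pd.cut`, which bins the whole column at once. The column name `level_cut` and the bin edges below are choices made for this illustration, picked to mirror `getlevel`.

```python
import pandas as pd


def getlevel(score):
    if score < 60:
        return "bad"
    elif score < 80:
        return "mid"
    else:
        return "good"


df = pd.DataFrame({"name": ["lili", "lucy", "tracy", "tony", "mike"],
                   "score": [85, 61, 75, 49, 90]})

# Series.apply passes each score directly, so no lambda or axis is needed.
df["level"] = df["score"].apply(getlevel)

# right=False makes each bin left-closed: [-inf, 60), [60, 80), [80, inf).
df["level_cut"] = pd.cut(df["score"],
                         bins=[float("-inf"), 60, 80, float("inf")],
                         labels=["bad", "mid", "good"],
                         right=False)
print(df)
```

Both columns hold the same labels for this data; `level_cut` is a categorical column rather than plain strings.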
@ -0,0 +1,93 @@
## 1. md5 in Python

Hashing a string with md5 to pseudonymize it is a common technique in data processing, and Python 3.x ships an md5 implementation in its standard library. Let's look at how to use it.

```
import hashlib

def test():
    s = "123"
    m = hashlib.md5()
    for i in range(5):
        m.update(s.encode("utf8"))
        result = m.hexdigest()
        print(result)
```

The output of the code above is:

```
202cb962ac59075b964b07152d234b70
4297f44b13955235245b2497399d7a93
f5bb0c8de146c67b44babbf4e6584cc0
101193d7181cc88340ae5b2b17bba8a1
e277dd1e05688a22e377e25a3dae5de1
```

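We wanted to md5 the string 123 five times, and in theory all five results should be identical, yet the printed values differ.

The cause is the update method:

When update is called repeatedly on the same hashlib object, with string a the first time and string b the second, the second digest is actually the md5 of a+b.

A simple example confirms this:
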
```
def test():
    s = "123123"
    m = hashlib.md5()
    m.update(s.encode("utf8"))
    result = m.hexdigest()
    print(result)
```

The output is:

```
4297f44b13955235245b2497399d7a93
```

This matches the second line printed by the for loop earlier, i.e. it is the md5 of "123123".

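If the goal is the same digest for the same input every time, one straightforward approach is to create a fresh md5 object per string. The helper name `md5_hex` below is made up for this sketch.

```python
import hashlib


def md5_hex(text):
    """Hash `text` with a fresh md5 object so earlier inputs cannot carry over."""
    return hashlib.md5(text.encode("utf8")).hexdigest()


# A single shared object accumulates input: after two updates with "123"
# it has effectively hashed "123123".
shared = hashlib.md5()
shared.update("123".encode("utf8"))
shared.update("123".encode("utf8"))

print(md5_hex("123"))                            # same value on every call
print(shared.hexdigest() == md5_hex("123123"))   # True
```

Calling `md5_hex("123")` any number of times returns the first digest from the loop output, `202cb962ac59075b964b07152d234b70`.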
## 2. The hashlib module

The hashlib module contains the commonly used hash algorithms. The ones listed in its source are:

```
# This tuple and __get_builtin_constructor() must be modified if a new
# always available algorithm is added.
__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
                      'blake2b', 'blake2s',
                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
                      'shake_128', 'shake_256')
```

Let's test a few of the common ones:

```
from hashlib import md5
from hashlib import sha256
from hashlib import sha512

hash_functions = [md5, sha256, sha512]

def get_hash_code(s):
    result = []
    for function in hash_functions:
        hash_obj = function(s)
        hash_hex = hash_obj.hexdigest()
        result.append((hash_obj.name, hash_hex, len(hash_hex)))
    return result


if __name__ == '__main__':
    s = "123"
    result = get_hash_code(s.encode("utf-8"))
    for each in result:
        print(each)
```

The final output is:

```
('md5', '202cb962ac59075b964b07152d234b70', 32)
('sha256', 'a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3', 64)
('sha512', '3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2', 128)
```

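When the algorithm is chosen by name at runtime, `hashlib.new` together with `hashlib.algorithms_guaranteed` can be used instead of importing each constructor. The helper `hex_digest` below is a hypothetical name for this sketch; it deliberately rejects the `shake_*` algorithms because their `hexdigest` requires an explicit length argument.

```python
import hashlib


def hex_digest(name, data):
    """Hash `data` (bytes) with the algorithm called `name`."""
    # shake_128 / shake_256 need a length for hexdigest(), so skip them here.
    if name not in hashlib.algorithms_guaranteed or name.startswith("shake_"):
        raise ValueError(f"unsupported algorithm: {name}")
    return hashlib.new(name, data).hexdigest()


for name in ("md5", "sha256", "sha512"):
    digest = hex_digest(name, b"123")
    print(name, len(digest), digest)
```

The hex digest lengths are 32, 64, and 128 characters respectively, matching the tuples printed earlier.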