diff --git a/code-languages/python/pandas join操作详解.md b/code-languages/python/pandas join操作详解.md
new file mode 100644
index 0000000..08ed0e9
--- /dev/null
+++ b/code-languages/python/pandas join操作详解.md
@@ -0,0 +1,191 @@
+## 1. Introduction
+Join is the single most central operation in relational databases, and in practice it is also the one most prone to problems and most often in need of tuning. If we think of a DataFrame as a table, joins naturally come up there too, and very often. Below we take a close look at how join is used in pandas.
+
+## 2. The join method signature
+The `join` method is declared in the pandas source as follows:
+
+```
+    def join(
+        self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False
+    ) -> "DataFrame":
+        """
+        Join columns of another DataFrame.
+
+        Join columns with `other` DataFrame either on index or on a key
+        column. Efficiently join multiple DataFrame objects by index at once by
+        passing a list.
+
+        Parameters
+        ----------
+        other : DataFrame, Series, or list of DataFrame
+            Index should be similar to one of the columns in this one. If a
+            Series is passed, its name attribute must be set, and that will be
+            used as the column name in the resulting joined DataFrame.
+        on : str, list of str, or array-like, optional
+            Column or index level name(s) in the caller to join on the index
+            in `other`, otherwise joins index-on-index. If multiple
+            values given, the `other` DataFrame must have a MultiIndex. Can
+            pass an array as the join key if it is not already contained in
+            the calling DataFrame. Like an Excel VLOOKUP operation.
+        how : {'left', 'right', 'outer', 'inner'}, default 'left'
+            How to handle the operation of the two objects.
+
+            * left: use calling frame's index (or column if on is specified)
+            * right: use `other`'s index.
+            * outer: form union of calling frame's index (or column if on is
+              specified) with `other`'s index, and sort it.
+              lexicographically.
+            * inner: form intersection of calling frame's index (or column if
+              on is specified) with `other`'s index, preserving the order
+              of the calling's one.
+        lsuffix : str, default ''
+            Suffix to use from left frame's overlapping columns.
+        rsuffix : str, default ''
+            Suffix to use from right frame's overlapping columns.
+        sort : bool, default False
+            Order result DataFrame lexicographically by the join key. If False,
+            the order of the join key depends on the join type (how keyword).
+
+        Returns
+        -------
+        DataFrame
+            A dataframe containing columns from both the caller and `other`.
+
+```
+
+    def join(self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False)
+    where
+    other: DataFrame, Series, or list of DataFrame; the other object(s) to join with.
+    on: column(s) of the caller to match against the index of `other`; similar to ON in SQL.
+    how: {'left', 'right', 'outer', 'inner'}, default 'left'; similar to the join types in SQL.
+    lsuffix: suffix appended to overlapping column names from the left DataFrame.
+    rsuffix: suffix appended to overlapping column names from the right DataFrame.
+    sort: sort the result lexicographically by the join key.
+
+## 3. Joining on a specified column
+In practice the most common join is on a column the two frames share. Let's start with a simple example:
+
+```
+import pandas as pd
+
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df, on='name')
+    print(result)
+```
+
+The code above fails with the following error:
+
+```
+ValueError: You are trying to merge on object and int64 columns.
If you wish to proceed you should use pd.concat
+```
+
+The cause is that `join` matches against the DataFrame index: the `name` column of `age_df` (strings) is compared with the index of `score_df`, which is still the default integer index. The following test code makes the index visible:
+
+```
+def joindemo2():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+    print(age_df)
+    age_df.set_index('name', inplace=True)  # replace the default integer index with the name column
+    print(age_df)
+```
+
+Running this code prints:
+
+```
+    name  age
+0   lili   18
+1   lucy   28
+2  tracy   24
+3   mike   36
+       age
+name
+lili    18
+lucy    28
+tracy   24
+mike    36
+```
+
+By default a DataFrame's index is a sequence of integers counting up from 0; the leading 0, 1, 2, 3 in the first printout are that index. Once we set `name` as the index, the printed structure changes and the leading integers disappear.
+
+
+To get the join we originally wanted, set `name` as the index on both frames first:
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    age_df.set_index('name', inplace=True)
+    score_df.set_index('name', inplace=True)
+    result = age_df.join(score_df, on='name')
+    print(result)
+```
+
+The output is:
+
+```
+       age score
+name
+lili    18   NaN
+lucy    28   NaN
+tracy   24     B
+mike    36     B
+```
+
+The default is a left join, so every row of `age_df` is kept and names with no score get NaN, which is what we wanted. Because both frames are now indexed by `name`, `on='name'` refers to the index level here, and `age_df.join(score_df)` gives the same result.
+
+## 4. Joining on the default auto-incrementing index
+
+Next, let's try joining on the default auto-incrementing index.
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df)
+    print(result)
+```
+
+This code also raises an error:
+
+```
+ValueError: columns overlap but no suffix specified: Index(['name'], dtype='object')
+```
+
+Both frames have a `name` column, so the result would contain two columns with the same name. This is where the `lsuffix` and `rsuffix` parameters come in:
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df, lsuffix='_left', rsuffix='_right')
+    print(result)
+```
+
+```
+  name_left  age
name_right score
+0      lili   18       tony     A
+1      lucy   28       mike     B
+2     tracy   24      akuda     C
+3      mike   36      tracy     B
+```
+
+## 5. Difference from merge
+pandas also has a `merge` method that can do what `join` does. For the specific differences, see the following link:
+https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas
diff --git a/code-languages/python/pandas根据现有列新添加一列.md b/code-languages/python/pandas根据现有列新添加一列.md
new file mode 100644
index 0000000..8a3bc93
--- /dev/null
+++ b/code-languages/python/pandas根据现有列新添加一列.md
@@ -0,0 +1,40 @@
+With a pandas DataFrame you often need to create a new column from an existing one. A common example is deriving a grade level from a score; here is how to do it.
+
+```
+import pandas as pd
+
+def getlevel(score):
+    if score < 60:
+        return "bad"
+    elif score < 80:
+        return "mid"
+    else:
+        return "good"
+
+
+def test():
+    data = {'name': ['lili', 'lucy', 'tracy', 'tony', 'mike'],
+            'score': [85, 61, 75, 49, 90]
+            }
+    df = pd.DataFrame(data=data)
+    # Either form works
+    # df['level'] = df.apply(lambda x: getlevel(x['score']), axis=1)
+    df['level'] = df.apply(lambda x: getlevel(x.score), axis=1)
+
+    print(df)
+```
+
+Running the code above prints:
+
+```
+    name  score level
+0   lili     85  good
+1   lucy     61   mid
+2  tracy     75   mid
+3   tony     49   bad
+4   mike     90  good
+```
+
+This relies mainly on the DataFrame `apply` method.
+The code adds a new column named `level` derived from the `score` column: below 60 is bad, 60 up to but not including 80 is mid, and 80 or above is good.
+`axis=1` makes `apply` call the function once per row, so `x` is a single row; the rows themselves are unchanged and the frame gains one column.
\ No newline at end of file
diff --git a/code-languages/python/python md5算法调用与hashlib模块.md b/code-languages/python/python md5算法调用与hashlib模块.md
new file mode 100644
index 0000000..2b7a98b
--- /dev/null
+++ b/code-languages/python/python md5算法调用与hashlib模块.md
@@ -0,0 +1,93 @@
+## 1. MD5 in Python
+Hashing a string with MD5 to anonymize it is a common step in data processing, and Python 3.x ships an MD5 implementation in the standard library. Let's look at how it is used.
+
+```
+import hashlib
+
+def test():
+    s = "123"
+    m = hashlib.md5()
+    for i in range(5):
+        m.update(s.encode("utf8"))  # feeds more data into the same hash object
+        result = m.hexdigest()
+        print(result)
+```
+
+The code above prints:
+
+```
+202cb962ac59075b964b07152d234b70
+4297f44b13955235245b2497399d7a93
+f5bb0c8de146c67b44babbf4e6584cc0
+101193d7181cc88340ae5b2b17bba8a1
+e277dd1e05688a22e377e25a3dae5de1
+```
+
+The intent was to hash the string "123" five times. Five hashes of the same input should be identical, yet every line of the output is different.
+
+The cause is the `update` method:
+When `update` is called repeatedly on the same hashlib object, with string a the first time and string b the second time, the digest after the second call is the MD5 of a+b.
+
+A simple example confirms this:
+
+```
+def test():
+    s = "123123"
+    m = hashlib.md5()
+    m.update(s.encode("utf8"))
+    result = m.hexdigest()
+    print(result)
+```
+
+The output is:
+
+```
+4297f44b13955235245b2497399d7a93
+```
+
+This matches the second line printed by the for loop above, i.e. the MD5 of "123123".
+
+## 2. The hashlib module
+The hashlib module bundles the common hash algorithms. Its source lists the following as always supported:
+
+```
+# This tuple and __get_builtin_constructor() must be modified if a new
+# always available algorithm is added.
+__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
+                      'blake2b', 'blake2s',
+                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
+                      'shake_128', 'shake_256')
+```
+
+Let's test a few of the common ones:
+
+```
+from hashlib import md5
+from hashlib import sha256
+from hashlib import sha512
+
+hash_functions = [md5, sha256, sha512]
+
+def get_hash_code(s):
+    result = []
+    for function in hash_functions:
+        hash_obj = function(s)  # s must be bytes, not str
+        hash_hex = hash_obj.hexdigest()
+        result.append((hash_obj.name, hash_hex, len(hash_hex)))
+    return result
+
+
+if __name__ == '__main__':
+    s = "123"
+    result = get_hash_code(s.encode("utf-8"))
+    for each in result:
+        print(each)
+```
+
+The final output is:
+
+```
+('md5', '202cb962ac59075b964b07152d234b70', 32)
+('sha256', 'a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3', 64)
+('sha512', '3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2', 128)
+```
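A note on the `update` behaviour described in the MD5 file of this diff: the sketch below is one possible way to get the same digest on every iteration, assuming the goal is an independent hash per input. The helper names `md5_hex` and `incremental_md5_hex` are hypothetical and not part of the files in this diff. It creates a fresh `hashlib.md5` object for each input, and it also checks that repeated `update` calls on one object hash the concatenation of everything fed so far.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hypothetical helper: MD5 hex digest of text, using a fresh hash object per call."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def incremental_md5_hex(chunks) -> str:
    """Hypothetical helper: feed every chunk into ONE hash object via update()."""
    m = hashlib.md5()
    for chunk in chunks:
        # Each update() appends to the data already seen by m.
        m.update(chunk.encode("utf-8"))
    return m.hexdigest()


if __name__ == "__main__":
    # A new hash object per call, so all five digests are identical.
    for _ in range(5):
        print(md5_hex("123"))

    # Two updates with "123" are equivalent to hashing "123123" once.
    print(incremental_md5_hex(["123", "123"]) == md5_hex("123123"))
```

When the same object has to be reused on purpose, for example to hash a large file in pieces, the accumulating behaviour of `update` is the desired one; the fresh-object form fits the case where each input needs its own digest.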