add some pandas usage code

2021-01-31 19:00:17 +08:00 · 2021-01-31 19:00:17 +08:00 · a3170fbd18
parent 92ff42bd14
commit a3170fbd18
3 changed files with 322 additions and 0 deletions
--- a/code-languages/python/pandas
+++ b/code-languages/python/pandas
@ -0,0 +1,191 @@
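+## 1. Introduction
+Join is arguably the single most important operation in a relational database; in practice it is also where problems most often appear and where optimization is most often needed. If we think of a DataFrame as a table, joins naturally come up there too, and very frequently. Let's take a close look at how join works in pandas.  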
+
+## 2. Signature of the join method
+In the pandas source code, the `join` method is declared as follows:  
+
+```
+    def join(
+        self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False
+    ) -> "DataFrame":
+        """
+        Join columns of another DataFrame.
+
+        Join columns with `other` DataFrame either on index or on a key
+        column. Efficiently join multiple DataFrame objects by index at once by
+        passing a list.
+
+        Parameters
+        ----------
+        other : DataFrame, Series, or list of DataFrame
+            Index should be similar to one of the columns in this one. If a
+            Series is passed, its name attribute must be set, and that will be
+            used as the column name in the resulting joined DataFrame.
+        on : str, list of str, or array-like, optional
+            Column or index level name(s) in the caller to join on the index
+            in `other`, otherwise joins index-on-index. If multiple
+            values given, the `other` DataFrame must have a MultiIndex. Can
+            pass an array as the join key if it is not already contained in
+            the calling DataFrame. Like an Excel VLOOKUP operation.
+        how : {'left', 'right', 'outer', 'inner'}, default 'left'
+            How to handle the operation of the two objects.
+
+            * left: use calling frame's index (or column if on is specified)
+            * right: use `other`'s index.
+            * outer: form union of calling frame's index (or column if on is
+              specified) with `other`'s index, and sort it.
+              lexicographically.
+            * inner: form intersection of calling frame's index (or column if
+              on is specified) with `other`'s index, preserving the order
+              of the calling's one.
+        lsuffix : str, default ''
+            Suffix to use from left frame's overlapping columns.
+        rsuffix : str, default ''
+            Suffix to use from right frame's overlapping columns.
+        sort : bool, default False
+            Order result DataFrame lexicographically by the join key. If False,
+            the order of the join key depends on the join type (how keyword).
+
+        Returns
+        -------
+        DataFrame
+            A dataframe containing columns from both the caller and `other`.
+
+```  
+
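+`def join(self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False)`  
+where:  
+`other`: a DataFrame, a Series, or a list of DataFrames to join with the caller.  
+`on`: column or index level name(s) in the caller to match against the index of `other`; if omitted, the join is index-on-index. Unlike SQL's `ON`, it never refers to a column of `other`.  
+`how`: {'left', 'right', 'outer', 'inner'}, default 'left'; analogous to the SQL join types.  
+`lsuffix`: suffix appended to overlapping column names from the left DataFrame.  
+`rsuffix`: suffix appended to overlapping column names from the right DataFrame.  
+`sort`: if True, sort the result lexicographically by the join key.  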
+
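To make the `how` options concrete, here is a small sketch that is not part of the original post. It assumes two sample frames that are indexed by `name` up front (via `set_index`) so that the join is index-on-index; `index_by_how` is a name introduced only for this illustration.

```python
import pandas as pd

# Sample data, indexed by `name` so that join() matches index-on-index.
age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                       'age': [18, 28, 24, 36]}).set_index('name')
score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                         'score': ['A', 'B', 'C', 'B']}).set_index('name')

# Record which names survive under each join type.
index_by_how = {}
for how in ('left', 'right', 'inner', 'outer'):
    joined = age_df.join(score_df, how=how)
    index_by_how[how] = list(joined.index)
    print(how, index_by_how[how])
```

`left` keeps every name from `age_df`, `right` every name from `score_df`, `inner` only the names present in both, and `outer` the union of the two.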
+## 3. Joining on a specific column
+In practice the most common kind of join is on a column that both frames share. Let's start with a simple example:  
+
+```
+import pandas as pd
+
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df, on='name')
+    print(result)
+```  
+
+The code above fails with the following error:  
+
+```
+ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
+```  
+
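+The reason is that `join` always matches against the index of `other`. With `on='name'`, pandas compares the `name` column of `age_df` (object dtype) with the index of `score_df`, which is still the default integer index (int64), hence the dtype mismatch. The following test code makes the role of the index clearer.  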
+
+```
+def joindemo2():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+    print(age_df)
+    age_df.set_index('name', inplace=True)
+    print(age_df)
+```  
+
+Running this code prints:  
+
+```
+    name  age
+0   lili   18
+1   lucy   28
+2  tracy   24
+3   mike   36
+       age
+name      
+lili    18
+lucy    28
+tracy   24
+mike    36
+```  
+
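+By default a DataFrame's index is a sequence of integers increasing from 0; the numbers 0, 1, 2, 3 in the first printout are that index. Once `name` is set as the index, the structure of the printed DataFrame changes: the integers disappear and `name` takes their place.  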
+
+
+To get the join we wanted at the start, the code can be written like this:  
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
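+    # Use `name` as the index of both frames so that the join keys line up.
+    age_df.set_index('name', inplace=True)
+    score_df.set_index('name', inplace=True)
+    # Here `on='name'` refers to the caller's index level; omitting `on`
+    # joins index-on-index, which is equivalent in this case.
+    result = age_df.join(score_df, on='name')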
+    print(result)
+```  
+
+The output is:  
+
+```
+       age score
+name            
+lili    18   NaN
+lucy    28   NaN
+tracy   24     B
+mike    36     B
+```  
+
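+`join` defaults to a left join, so every row of `age_df` is kept and names without a score get NaN. This is the result we were after.  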
+
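A lighter variant, sketched here as an assumption rather than as the post's own approach: only `other` needs `name` as its index, because `on='name'` already tells pandas to use the caller's `name` column as the key. This keeps `name` as an ordinary column in the result. The variable name `joined` is illustrative.

```python
import pandas as pd

age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                       'age': [18, 28, 24, 36]})
score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                         'score': ['A', 'B', 'C', 'B']})

# Index only the right-hand frame; the caller's `name` column is the join key.
joined = age_df.join(score_df.set_index('name'), on='name')
print(joined)
```

Neither input frame is modified, since `set_index` is called without `inplace=True`.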
+## 4. Joining on the default auto-increment index
+
+Next, let's try joining on the default auto-increment index.  
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df)
+    print(result)
+```  
+
+This code also raises an error:  
+
+```
+ValueError: columns overlap but no suffix specified: Index(['name'], dtype='object')
+```  
+
+Both frames contain a `name` column, so this is where the `lsuffix` and `rsuffix` parameters are needed:  
+
+```
+def joindemo():
+    age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
+                           'age': [18, 28, 24, 36]})
+    score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
+                             'score': ['A', 'B', 'C', 'B']})
+
+    result = age_df.join(score_df, lsuffix='_left', rsuffix='_right')
+    print(result)
+```
+
+```
+  name_left  age name_right score
+0      lili   18       tony     A
+1      lucy   28       mike     B
+2     tracy   24      akuda     C
+3      mike   36      tracy     B
+```  
+
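+## 5. Differences from merge
+pandas also has a `merge` method, which can do what `join` does as well. For the detailed differences between the two, see the following link:  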
+https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas
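As a rough illustration (not taken from the linked answer): `merge` joins column-on-column by default, so the example from section 3 needs no `set_index` call. The sketch below assumes the same sample data; `merged` and `inner` are illustrative names.

```python
import pandas as pd

age_df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'mike'],
                       'age': [18, 28, 24, 36]})
score_df = pd.DataFrame({'name': ['tony', 'mike', 'akuda', 'tracy'],
                         'score': ['A', 'B', 'C', 'B']})

# Left join on the shared `name` column; no index manipulation needed.
merged = age_df.merge(score_df, on='name', how='left')
print(merged)

# merge() takes the same `how` values; an inner join keeps only shared names.
inner = age_df.merge(score_df, on='name', how='inner')
print(inner)
```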
--- a/code-languages/python/pandas根据现有列新添加一列.md
+++ b/code-languages/python/pandas根据现有列新添加一列.md
@ -0,0 +1,38 @@
+With a pandas DataFrame, it is common to create a new column based on an existing one. A typical example is deriving a grade level from a score. Let's see how to do that.  
+
+```
+import pandas as pd
+
+
+def getlevel(score):
+    if score < 60:
+        return "bad"
+    elif score < 80:
+        return "mid"
+    else:
+        return "good"
+
+
+def test():
+    data = {'name': ['lili', 'lucy', 'tracy', 'tony', 'mike'],
+            'score': [85, 61, 75, 49, 90]
+            }
+    df = pd.DataFrame(data=data)
+    # Either form works
+    # df['level'] = df.apply(lambda x: getlevel(x['score']), axis=1)
+    df['level'] = df.apply(lambda x: getlevel(x.score), axis=1)
+
+    print(df)
+```  
+
+Running the code above prints:  
+
+```
+    name  score level
+0   lili     85  good
+1   lucy     61   mid
+2  tracy     75   mid
+3   tony     49   bad
+4   mike     90  good
+```  
+
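+The key to this is the DataFrame's `apply` method.  
+The code adds a new column named `level`, derived from the `score` column: scores below 60 map to bad, scores from 60 up to (but not including) 80 map to mid, and scores of 80 or above map to good.  
+`axis=1` means the function is applied to each row, so `x` in the lambda is a single row and `x.score` is that row's score; the rows are unchanged and only a column is added.  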
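Since only one column is involved, the same result can also be obtained without a row-wise `apply`. The following sketch shows two alternatives under that assumption: `Series.apply` on the `score` column, and `pd.cut` with explicit bin edges. The column names `level_apply` and `level_cut` are made up for this example.

```python
import pandas as pd


def getlevel(score):
    if score < 60:
        return "bad"
    elif score < 80:
        return "mid"
    else:
        return "good"


df = pd.DataFrame({'name': ['lili', 'lucy', 'tracy', 'tony', 'mike'],
                   'score': [85, 61, 75, 49, 90]})

# Apply the function to the single column instead of to every row.
df['level_apply'] = df['score'].apply(getlevel)

# Bin the scores: [-inf, 60) -> bad, [60, 80) -> mid, [80, inf) -> good.
bins = [float('-inf'), 60, 80, float('inf')]
df['level_cut'] = pd.cut(df['score'], bins=bins, right=False,
                         labels=['bad', 'mid', 'good'])
print(df)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not be what downstream code expects.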
--- a/md5算法调用与hashlib模块.md
+++ b/md5算法调用与hashlib模块.md
@ -0,0 +1,93 @@
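+## 1. md5 in Python
+Hashing a string with md5 to anonymize it is a common technique in data processing, and Python 3 ships an md5 implementation in the standard library's `hashlib` module. Let's look at how to use it.  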
+
+```
+import hashlib
+
+def test():
+    s = "123"
+    m = hashlib.md5()
+    for i in range(5):
+        m.update(s.encode("utf8"))
+        result = m.hexdigest()
+        print(result)
+```  
+
+The code above prints:  
+
+```
+202cb962ac59075b964b07152d234b70
+4297f44b13955235245b2497399d7a93
+f5bb0c8de146c67b44babbf4e6584cc0
+101193d7181cc88340ae5b2b17bba8a1
+e277dd1e05688a22e377e25a3dae5de1
+```  
+
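+We wanted to take the md5 of the string 123 five times, and in theory all five results should be identical, yet the outputs differ.  
+
+The cause lies in the `update` method:  
+when `update` is called repeatedly on the same hash object, the data accumulates. If the first call passes string a and the second passes string b, the digest after the second call is the md5 of a+b.  
+
+A simple example confirms this:  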
+
+```
+def test():
+    s = "123123"
+    m = hashlib.md5()
+    m.update(s.encode("utf8"))
+    result = m.hexdigest()
+    print(result)
+```  
+
+The output is:  
+
+```
+4297f44b13955235245b2497399d7a93
+```  
+
+This matches the second line printed by the earlier for loop, i.e. the md5 of "123123".  
+
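To get the same digest on every iteration, one option is to create a fresh hash object for each input instead of reusing one. A minimal sketch; `md5_hex` is a helper name introduced here, not part of `hashlib`.

```python
import hashlib


def md5_hex(text):
    """Return the md5 hex digest of `text`, using a new hash object per call."""
    return hashlib.md5(text.encode("utf8")).hexdigest()


# Five independent hashes of "123" are all identical.
digests = [md5_hex("123") for _ in range(5)]
print(digests)

# Feeding "123" twice into one object hashes the concatenation "123123".
m = hashlib.md5()
m.update(b"123")
m.update(b"123")
print(m.hexdigest() == md5_hex("123123"))
```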
+## 2. The hashlib module
+The `hashlib` module contains the commonly used hash algorithms. The source code lists the following:  
+
+```
+# This tuple and __get_builtin_constructor() must be modified if a new
+# always available algorithm is added.
+__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
+                      'blake2b', 'blake2s',
+                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
+                      'shake_128', 'shake_256')
+```  
+
+Let's pick a few common ones and test them:  
+
+```
+from hashlib import md5
+from hashlib import sha256
+from hashlib import sha512
+
+hash_functions = [md5, sha256, sha512]
+
+def get_hash_code(s):
+    result = []
+    for function in hash_functions:
+        hash_obj = function(s)
+        hash_hex = hash_obj.hexdigest()
+        result.append((hash_obj.name, hash_hex, len(hash_hex)))
+    return result
+
+
+if __name__ == '__main__':
+    s = "123"
+    result = get_hash_code(s.encode("utf-8"))
+    for each in result:
+        print(each)
+```  
+
+The final output is:  
+
+```
+('md5', '202cb962ac59075b964b07152d234b70', 32)
+('sha256', 'a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3', 64)
+('sha512', '3c9909afec25354d551dae21590bb26e38d53f2173b8d3dc3eee4c047e7ab1c1eb8b85103e3be7ba613b31bb5c9c36214dc9f14a42fd7a2fdb84856bca5c44c2', 128)
+```
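The constructors can also be looked up by name with `hashlib.new`, and `hashlib.algorithms_guaranteed` lists the names available on every platform. The sketch below is an illustrative variant of `get_hash_code`, not the post's own code; `hex_digest` and `names` are names chosen for this example.

```python
import hashlib


def hex_digest(name, data):
    """Hash `data` (bytes) with the algorithm called `name`."""
    return hashlib.new(name, data).hexdigest()


# Keep only names that hashlib guarantees on every platform.
names = [n for n in ('md5', 'sha256', 'sha512')
         if n in hashlib.algorithms_guaranteed]

for name in names:
    h = hashlib.new(name, b"123")
    # The hex string is twice as long as the digest size in bytes.
    print(name, h.digest_size, len(h.hexdigest()))
```

This explains the lengths 32, 64 and 128 in the output above: each byte of the digest is written as two hexadecimal characters.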