add some pandas code

2021-01-30 20:34:22 +08:00 · 2021-01-30 20:34:22 +08:00 · 36769abda5
parent c1c769f630
commit 36769abda5
6 changed files with 969 additions and 0 deletions
--- a/code-languages/python/pandas
+++ b/code-languages/python/pandas
@ -0,0 +1,108 @@
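+## 1. Binning
+Binning comes up constantly in practice. Given a set of continuous values, we split the range into several segments and treat each segment as a category; this process is called binning. In essence, binning discretizes a continuous variable.  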
+
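+A common example:  
+Age is the most familiar case. Suppose ages range from 0 to 120. We treat 0-5 as infants, 6-15 as juveniles, 16-30 as youth, 31-50 as middle-aged, 51-60 as late middle-aged, and over 60 as elderly. This divides continuous ages into six categories (infant, juvenile, youth, middle-aged, late middle-aged, elderly), or six "bins", each representing one category.  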
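To make the idea concrete, here is a minimal pure-Python sketch of the age example using the standard bisect module. The English label names, the inclusive upper bounds, and the helper age_bin are illustrative assumptions, not part of pandas.

```python
from bisect import bisect_left

# Inclusive upper bound of every bin except the last one ("elderly"),
# which has no upper bound.
UPPER_BOUNDS = [5, 15, 30, 50, 60]
LABELS = ["infant", "juvenile", "youth", "middle-aged", "late middle-aged", "elderly"]


def age_bin(age):
    """Return the label of the bin that `age` falls into."""
    # bisect_left gives the index of the first upper bound >= age,
    # which is exactly the index of the matching bin.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


ages = [3, 15, 16, 45, 60, 61, 99]
binned = [age_bin(a) for a in ages]
print(binned)
```

Each continuous age ends up as one of six discrete labels, which is all that binning means.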
+
+## 2. The cut method
+pandas provides both cut and qcut for binning. Let's look at cut first.  
+
+```
+import pandas as pd
+
+def t1():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    cut = pd.cut(scores, 3)
+    print(cut)
+```  
+
+The code above splits scores into three equal-width intervals. The result is:  
+
+```
+[(77.667, 99.0], (34.936, 56.333], (77.667, 99.0], (77.667, 99.0], (56.333, 77.667], (34.936, 56.333], (77.667, 99.0], (56.333, 77.667]]
+Categories (3, interval[float64]): [(34.936, 56.333] < (56.333, 77.667] < (77.667, 99.0]]
+```  
+
+The first line of the output shows which bin each original value falls into; the second line describes the three bins.  
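The bin edges in that output can be reproduced by hand. The sketch below assumes, consistent with the printed intervals, that cut splits the range from min to max into equal widths and then lowers the first edge by 0.1% of the range so that the minimum falls inside a right-closed interval. It is a plain-Python illustration, not pandas' actual implementation.

```python
scores = [80, 55, 78, 99, 60, 35, 82, 57]
n_bins = 3

lo, hi = min(scores), max(scores)
width = (hi - lo) / n_bins

# Equal-width edges from the minimum to the maximum.
edges = [lo + i * width for i in range(n_bins + 1)]
edges[-1] = float(hi)  # pin the last edge to the maximum despite float rounding

# Assumption: the lowest edge is moved down by 0.1% of the range,
# so that the minimum (35) lies inside the first right-closed interval.
edges[0] = lo - (hi - lo) * 0.001

rounded = [round(e, 3) for e in edges]
print(rounded)
```

The rounded edges line up with the three intervals printed by pd.cut(scores, 3).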
+
+```
+def t2():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    bins = [0, 60, 80, 100]
+    cut = pd.cut(scores, bins)
+    print(cut)
+
+    print(cut.codes)
+    print(cut.categories)
+    print(pd.value_counts(cut))
+```  
+
+The output is:  
+
+```
+[(60, 80], (0, 60], (60, 80], (80, 100], (0, 60], (0, 60], (80, 100], (0, 60]]
+Categories (3, interval[int64]): [(0, 60] < (60, 80] < (80, 100]]
+[1 0 1 2 0 0 2 0]
+IntervalIndex([(0, 60], (60, 80], (80, 100]],
+              closed='right',
+              dtype='interval[int64]')
+(0, 60]      4
+(80, 100]    2
+(60, 80]     2
+dtype: int64
+```  
+
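+Here the bin edges are specified explicitly, so the intervals are (0, 60], (60, 80], and (80, 100]; by default each interval is open on the left and closed on the right.  
+The value_counts method counts how many values fall into each interval.  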
+
+```
+def t3():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    bins = [0, 60, 80, 100]
+    cut = pd.cut(scores, bins, labels=["low", "mid", "high"])
+    print(pd.value_counts(cut))
+    print()
+
+    cut2 = pd.cut(scores, bins, labels=["low", "mid", "high"], right=False)
+    print(pd.value_counts(cut2))
+```  
+
+```
+low     4
+high    2
+mid     2
+dtype: int64
+
+high    3
+low     3
+mid     2
+dtype: int64
+```  
+
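+Here the labels parameter gives each bin a name.  
+With right=False, the right end of each interval changes from closed (the default) to open and the left end becomes closed, so the intervals are [0, 60), [60, 80), and [80, 100). That is why the scores 60 and 80 each move up one bin in the second count.  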
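The effect of right=False can be checked without pandas. The following sketch mimics the two edge conventions with bisect_left and bisect_right; it assumes every score lies strictly between 0 and 100, so only the interior edges 60 and 80 matter.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

scores = [80, 55, 78, 99, 60, 35, 82, 57]
inner_edges = [60, 80]  # interior edges of bins = [0, 60, 80, 100]
labels = ["low", "mid", "high"]

# right=True (default): (0, 60], (60, 80], (80, 100].
# A score equal to an edge belongs to the bin on its left.
right_closed = Counter(labels[bisect_left(inner_edges, s)] for s in scores)

# right=False: [0, 60), [60, 80), [80, 100).
# A score equal to an edge belongs to the bin on its right.
left_closed = Counter(labels[bisect_right(inner_edges, s)] for s in scores)

print(dict(right_closed))
print(dict(left_closed))
```

Only the two scores sitting exactly on an edge (60 and 80) change bins, which accounts for the difference between the two value_counts results above.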
+
+## 3. The qcut method
+
+```
+def t4():
+    scores = [x**2 for x in range(11)]
+    cut = pd.qcut(scores, 5)
+    print(cut)
+    print()
+    print(pd.value_counts(cut))
+```  
+
+```
+[(-0.001, 4.0], (-0.001, 4.0], (-0.001, 4.0], (4.0, 16.0], (4.0, 16.0], ..., (16.0, 36.0], (36.0, 64.0], (36.0, 64.0], (64.0, 100.0], (64.0, 100.0]]
+Length: 11
+Categories (5, interval[float64]): [(-0.001, 4.0] < (4.0, 16.0] < (16.0, 36.0] < (36.0, 64.0] <
+                                    (64.0, 100.0]]
+
+(-0.001, 4.0]    3
+(64.0, 100.0]    2
+(36.0, 64.0]     2
+(16.0, 36.0]     2
+(4.0, 16.0]      2
+dtype: int64
+```  
+
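+Unlike cut, which splits by value range, qcut splits by quantiles, so that each bin holds roughly the same number of values. The call above divides the input into five bins of nearly equal size; with 11 values an exactly even split is impossible, so the first bin holds 3 values and the others hold 2.  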
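As a rough illustration of what "splitting by quantiles" means, the cut points can be computed with the standard library. This sketch assumes linear interpolation between data points (method="inclusive" in statistics.quantiles), which reproduces the edges printed above; it is not how pandas is implemented internally.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

scores = [x ** 2 for x in range(11)]

# Four cut points split the data into five quantile-based bins.
cut_points = quantiles(scores, n=5, method="inclusive")
print(cut_points)

# Right-closed bins: a value equal to a cut point goes to the lower bin.
codes = [bisect_left(cut_points, s) for s in scores]
counts = Counter(codes)
print(sorted(counts.items()))
```

The bin widths differ a lot (4, 12, 20, 28, 36), while the counts per bin stay nearly equal, which is the opposite trade-off from cut.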
--- a/code-languages/python/pandas
+++ b/code-languages/python/pandas
@ -0,0 +1,105 @@
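+## 1. Grouping with groupby
+Grouping is a frequent need in day-to-day data analysis: split the data into groups by one or more fields, then analyze each group, for example by computing its size, maximum, minimum, or mean. In SQL this is the well-known GROUP BY operation.  
+pandas has a corresponding groupby operation. Let's see how to use it.  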
+
+## 2. The data structure behind groupby
+Start with the following code:  
+
+```
+def ddd():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    g = df.groupby('level')
+    print(g)
+    print()
+    print(list(g))
+```  
+
+The output is:  
+
+```
+<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10f6f96d0>
+
+[('L1',   level  num
+0    L1   10
+1    L1   20
+2    L1   30), ('L2',   level  num
+3    L2   20
+4    L2   15), ('L3',   level  num
+5    L3   10
+6    L3   12)]
+```  
+
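+A groupby call returns a DataFrameGroupBy object; printing it directly shows only its type and memory address.  
+To inspect the data, we convert it with list() and get a list of tuples, one per group. The first element of each tuple is the value of level, and the second element is the DataFrame holding all rows of that group.  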
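Because the grouped object yields (key, sub-DataFrame) pairs, it can also be iterated directly instead of being converted to a list. A small sketch on the same data; the variable names summary and l2 are only for illustration.

```python
import pandas as pd

levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
nums = [10, 20, 30, 20, 15, 10, 12]
df = pd.DataFrame({"level": levels, "num": nums})

summary = {}
for name, group in df.groupby("level"):
    # name is the group key; group is the sub-DataFrame for that key.
    summary[name] = group["num"].tolist()
print(summary)

# get_group fetches a single group by its key.
l2 = df.groupby("level").get_group("L2")
print(l2)
```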
+
+## 3. Basic usage of groupby
+```
+def group1():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    scores = [100, 200, 300, 200, 150, 100, 120]
+    df = pd.DataFrame({"level": levels, "num": nums, "score": scores})
+    result = df.groupby('level').agg({'num': 'sum', 'score': 'mean'})
+    allnum = result['num'].sum()
+    result['rate'] = result['num'].map(lambda x: x / allnum)
+    print(result)
+```  
+
+The final output:  
+
+```
+       num  score      rate
+level                      
+L1      60    200  0.512821
+L2      35    175  0.299145
+L3      22    110  0.188034
+```  
+
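+The example above shows the basic usage of groupby.  
+Grouping the DataFrame by level, then summing the num column and averaging the score column, gives result.  
+We also want each group's share of the overall num total. So we first compute the sum of num, then use map to add a column to result holding that share.  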
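The share column does not strictly need map and a lambda: dividing a column by a scalar is already element-wise. The following is one possible variant of the same computation, written with named aggregation; it is a sketch of an alternative, not the approach used above.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
    "score": [100, 200, 300, 200, 150, 100, 120],
})

# Named aggregation: output column = (input column, aggregation).
result = df.groupby("level").agg(num=("num", "sum"), score=("score", "mean"))

# Series / scalar is vectorized, so no per-element lambda is needed.
result["rate"] = result["num"] / result["num"].sum()
print(result)
```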
+
+## 4. Using transform
+
+Now for a more involved example.  
+
+```
+def t10():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    ret = df.groupby('level')['num'].mean().to_dict()
+    df['avg_num'] = df['level'].map(ret)
+    print(ret)
+    print(df)
+```  
+
+```
+{'L1': 20.0, 'L2': 17.5, 'L3': 11.0}
+  level  num  avg_num
+0    L1   10     20.0
+1    L1   20     20.0
+2    L1   30     20.0
+3    L2   20     17.5
+4    L2   15     17.5
+5    L3   10     11.0
+6    L3   12     11.0
+```  
+
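+Here, after grouping by level, we want to add a column to the dataset that gives each row the mean of its level.  
+The approach above first computes the mean of each group, converts the result to a dict, and then uses map to attach each group's mean to the rows.  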
+
+```
+def trans():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    df['avg_num'] = df.groupby('level')['num'].transform('mean')
+    print(df)
+```  
+With transform, the code becomes simpler and more direct, as shown above.  
+
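+What transform does: it applies the function to each group and returns an object with the same index as the original, filled with the transformed values (a Series here, because a single column was selected). Assigning that result is then equivalent to adding a column to the original DataFrame.  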
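The difference between aggregating and transforming is easiest to see side by side. In this sketch the column name share_in_level is a made-up example; it holds each row's share of its own group's total.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

grouped = df.groupby("level")["num"]

per_group = grouped.sum()           # one value per group (3 rows)
per_row = grouped.transform("sum")  # one value per original row (7 rows)

# Because per_row shares df's index, it combines with df's columns directly.
df["share_in_level"] = df["num"] / per_row
print(df)
```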
+
--- a/code-languages/python/pandas
+++ b/code-languages/python/pandas
@ -0,0 +1,328 @@
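+## 0. Preface
+
+The basic data structures in pandas are Series and DataFrame. Operating on every element, or on every row or column, is a common need in data processing. pandas has the built-in methods map, applymap, and apply for this. The sections below walk through some basic, routine, and more advanced uses with concrete examples.  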
+
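+## 1. The map method
+map is a basic and important operation in data processing. It generally operates on elements one at a time. Here are a few examples.  
+
+First, one point to be clear about: in the pandas version used here, map works only on a Series, not on a DataFrame. In other words, DataFrame has no map method. (pandas 2.1 later added DataFrame.map and deprecated applymap.)  
+
+Part of the source of Series.map is shown below:  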
+```
+    def map(self, arg, na_action=None):
+        """
+        Map values of Series according to input correspondence.
+
+        Used for substituting each value in a Series with another value,
+        that may be derived from a function, a ``dict`` or
+        a :class:`Series`.
+
+        Parameters
+        ----------
+        arg : function, collections.abc.Mapping subclass or Series
+            Mapping correspondence.
+        na_action : {None, 'ignore'}, default None
+            If 'ignore', propagate NaN values, without passing them to the
+            mapping correspondence.
+
+        Returns
+        -------
+        Series
+            Same index as caller.
+
+        See Also
+        --------
+        Series.apply : For applying more complex functions on a Series.
+        DataFrame.apply : Apply a function row-/column-wise.
+        DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
+
+        Notes
+        -----
+        When ``arg`` is a dictionary, values in Series that are not in the
+        dictionary (as keys) are converted to ``NaN``. However, if the
+        dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
+        provides a method for default values), then this default is used
+        rather than ``NaN``.
+```  
+
+The main parameter of map is arg, which can be a function, a dict (or other Mapping), or a Series; it is applied to each element.  
+
+An example:  
+
+```
+import numpy as np
+import pandas as pd
+
+def test():
+    genders = ["male", "male", "female", "unknown", "female"]
+    levels = ["L1", "L2", "L1", "L1", "L2"]
+    df = pd.DataFrame({"gender": genders, "level": levels})
+
+    gender_dic = {"male": "男", "female": "女", "unknown": "未知"}
+    print(df)
+    print("\n\n")
+    df["gender"] = df["gender"].map(gender_dic)
+    print(df)
+```  
+
+The output is:  
+
+```
+    gender level
+0     male    L1
+1     male    L2
+2   female    L1
+3  unknown    L1
+4   female    L2
+
+
+
+  gender level
+0      男    L1
+1      男    L2
+2      女    L1
+3     未知    L1
+4      女    L2
+```  
+
+The code above maps the values male, female, and unknown in the gender column to their Chinese equivalents 男, 女, and 未知.  
+
+```
+def test():
+    x = [i for i in range(1, 11)]
+    y = [2*i + 0.5 for i in x]
+    df = pd.DataFrame({'x': x, 'y': y})
+    x2 = df['x']
+    print(x2.map(lambda i: "%.2f" % i))
+    print(x2.map(lambda i: "{:.2f}".format(i)))
+```  
+
+```
+0     1.00
+1     2.00
+2     3.00
+3     4.00
+4     5.00
+5     6.00
+6     7.00
+7     8.00
+8     9.00
+9    10.00
+Name: x, dtype: object
+0     1.00
+1     2.00
+2     3.00
+3     4.00
+4     5.00
+5     6.00
+6     7.00
+7     8.00
+8     9.00
+9    10.00
+Name: x, dtype: object
+```  
+
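+This version formats each value of x as a string with two decimal places, which is why the resulting dtype is object rather than float.  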
+
+Whether the mapping is a dict or a function, map passes each value in turn to the dict lookup or the function and collects the mapped result.  
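One detail from the docstring quoted earlier is worth seeing in action: with a dict, values that are not keys become NaN. The sketch below uses a made-up value "other" and shows two common ways to supply a default instead.

```python
import pandas as pd

s = pd.Series(["male", "female", "other"])
mapping = {"male": "M", "female": "F"}

# "other" is not a key of the dict, so it maps to NaN.
mapped = s.map(mapping)

# Option 1: fill the gaps after mapping.
filled = s.map(mapping).fillna("unknown")

# Option 2: map with a function that carries its own default.
via_func = s.map(lambda g: mapping.get(g, "unknown"))

print(mapped)
print(filled)
```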
+
+## 2. The applymap method
+As noted above, DataFrame has no map method here. To get map-like behavior on every element of a DataFrame, use applymap.    
+
+```
+def t8():
+    x = [i for i in range(1, 11)]
+    y = [2*i + 0.5 for i in x]
+    df = pd.DataFrame({'x': x, 'y': y})
+    print(df)
+    print()
+    print(df.applymap(lambda i: "%.2f" % i))
+```  
+
+```
+    x     y
+0   1   2.5
+1   2   4.5
+2   3   6.5
+3   4   8.5
+4   5  10.5
+5   6  12.5
+6   7  14.5
+7   8  16.5
+8   9  18.5
+9  10  20.5
+
+       x      y
+0   1.00   2.50
+1   2.00   4.50
+2   3.00   6.50
+3   4.00   8.50
+4   5.00  10.50
+5   6.00  12.50
+6   7.00  14.50
+7   8.00  16.50
+8   9.00  18.50
+9  10.00  20.50
+
+```  
+
+The earlier example applied map to the x column alone, formatting its values with two decimal places. To format both x and y in the DataFrame at once, use applymap.  
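Since pandas 2.1 the element-wise DataFrame method is called map, and applymap is deprecated. Code that has to run on both older and newer versions can pick whichever exists; the helper name elementwise below is just an illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2.5, 4.5, 6.5]})

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise applymap.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

formatted = elementwise(lambda v: "%.2f" % v)
print(formatted)
```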
+
+
+## 3. The apply method
+apply is similar to map; the main difference is that apply can take more complex functions.  
+
+```
+    def apply(self, func, convert_dtype=True, args=(), **kwds):
+        """
+        Invoke function on values of Series.
+
+        Can be ufunc (a NumPy function that applies to the entire Series)
+        or a Python function that only works on single values.
+
+        Parameters
+        ----------
+        func : function
+            Python function or NumPy ufunc to apply.
+        convert_dtype : bool, default True
+            Try to find better dtype for elementwise function results. If
+            False, leave as dtype=object.
+        args : tuple
+            Positional arguments passed to func after the series value.
+        **kwds
+            Additional keyword arguments passed to func.
+
+        Returns
+        -------
+        Series or DataFrame
+            If func returns a Series object the result will be a DataFrame.
+
+        See Also
+        --------
+        Series.map: For element-wise operations.
+        Series.agg: Only perform aggregating type operations.
+        Series.transform: Only perform transforming type operations.
+
+```  
+
+Looking at the source of apply, the method signature is:  
+
+```
+    def apply(self, func, convert_dtype=True, args=(), **kwds):
+```  
+
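+Compared with map, apply accepts not only func but also extra positional arguments as a tuple (args) and keyword arguments (**kwds), so it can call more complex functions.  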
+
+Here are a few examples.  
+
+```
+def square(x):
+    return x**2
+
+def test():
+    s = pd.Series([20, 21, 12], index = ['London', 'New York', 'Helsinki'])
+    s1 = s.apply(lambda x: x**2)
+    s2 = s.apply(square)
+    s3 = s.apply(np.log)
+
+    print(s1)
+    print()
+    print(s2)
+    print()
+    print(s3)
+```  
+
+The output is:  
+
+```
+London      400
+New York    441
+Helsinki    144
+dtype: int64
+
+London      400
+New York    441
+Helsinki    144
+dtype: int64
+
+London      2.995732
+New York    3.044522
+Helsinki    2.484907
+dtype: float64
+```  
+
+This usage is simple and works the same as map.  
+
+Now a more complex example:
+
+```
+def BMI(series):
+    weight = series['weight']
+    height = series['height'] / 100
+    BMI_Rate = weight / height**2
+    return BMI_Rate
+
+def test():
+    heights = [180, 175, 169, 158, 185]
+    weights = [75, 72, 68, 60, 76]
+    age = [30, 18, 26, 42, 34]
+    df = pd.DataFrame({"height": heights, "weight": weights, "age": age})
+    print(df)
+    print()
+    df['BMI'] = df.apply(BMI, axis=1)
+    print(df)
+```  
+
+The output is:
+
+```
+   height  weight  age
+0     180      75   30
+1     175      72   18
+2     169      68   26
+3     158      60   42
+4     185      76   34
+
+   height  weight  age        BMI
+0     180      75   30  23.148148
+1     175      72   18  23.510204
+2     169      68   26  23.808690
+3     158      60   42  24.034610
+4     185      76   34  22.205990
+```  
+
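+The data contains height and weight, and BMI is computed as weight (kg) divided by the square of height (m).  
+The apply call specifies axis=1, which means the function is applied to each row. If that is hard to remember, think of it this way: axis=1 collapses the column dimension and keeps the row dimension, so the function receives one row at a time. At runtime, apply simply calls BMI on each row.  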
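For a formula this simple, the same result can also be obtained with plain column arithmetic, which avoids calling a Python function once per row. A sketch comparing the two; the names row_wise and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [180, 175, 169, 158, 185],  # cm
    "weight": [75, 72, 68, 60, 76],       # kg
})


def bmi(row):
    # row is a Series holding one row of df (because axis=1 is used below).
    return row["weight"] / (row["height"] / 100) ** 2


row_wise = df.apply(bmi, axis=1)
vectorized = df["weight"] / (df["height"] / 100) ** 2

print(vectorized)
```

Row-wise apply remains useful when the per-row logic involves branching or several columns in ways that column arithmetic cannot express easily.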
+
+```
+def subtract_custom_value(x, custom_value):
+    return x - custom_value
+
+def test():
+    s = pd.Series([20, 21, 12], index = ['London', 'New York', 'Helsinki'])
+    print(s)
+    print()
+    s1 = s.apply(subtract_custom_value, args=(5,))
+    print(s1)
+```  
+
+The output is:  
+
+```
+London      20
+New York    21
+Helsinki    12
+dtype: int64
+
+London      15
+New York    16
+Helsinki     7
+dtype: int64
+```  
+
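+This code subtracts 5 from every value. Because the extra argument 5 has to be passed in, map cannot do this directly; it accepts no extra arguments, so the function would have to be wrapped first (for example in a lambda).  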
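The options for passing the extra argument can be compared directly. In this sketch, the first two calls use apply's own args and keyword forwarding, while the last two show how map can still be used once the function is wrapped with a lambda or functools.partial.

```python
from functools import partial

import pandas as pd


def subtract_custom_value(x, custom_value):
    return x - custom_value


s = pd.Series([20, 21, 12], index=["London", "New York", "Helsinki"])

via_args = s.apply(subtract_custom_value, args=(5,))         # positional extra argument
via_kwargs = s.apply(subtract_custom_value, custom_value=5)  # keyword extra argument
via_lambda = s.map(lambda x: subtract_custom_value(x, 5))    # wrap the call for map
via_partial = s.map(partial(subtract_custom_value, custom_value=5))

print(via_args)
```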
+
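+## 4. Summary
+1. map is a basic operation on Series; in the pandas version used here, DataFrame has no map method.  
+2. To apply a map-style operation to every element of a DataFrame, use applymap.  
+3. apply is more flexible: it works on both Series and DataFrame, and extra arguments can be passed in as a tuple.  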
--- a/code-languages/scala/scala正则表达式
+++ b/code-languages/scala/scala正则表达式
@ -0,0 +1,308 @@
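+## 0. Introduction  
+On the last working day before the holiday, I was stuck for quite a while writing a simple regular expression. Looking back, my understanding of regular expressions was not deep enough, so I used the holiday to go through them in some detail and write down what I learned.  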
+
+## 1. findFirstIn findFirstMatchIn
+Commonly used Regex methods include findFirstIn, findFirstMatchIn, and similar. Let's start with an example to see how the two differ.  
+
+```
+  @Test
+  def test() = {
+    val s = "你好，今天是2021年1月2日18点30分"
+    val pattern = """今天是\d+年\d+月\d+日""".r
+    val result1 = pattern.findFirstIn(s)
+    println(result1)
+    val result2 = pattern.findFirstMatchIn(s) match {
+      case Some(data) => {
+        println("data type is: " + data.getClass.getSimpleName)
+        data group 0
+      }
+      case _ => "empty"
+    }
+    println(result2)
+  }
+```  
+
+Output:  
+
+```
+Some(今天是2021年1月2日)
+data type is: Match
+今天是2021年1月2日
+```  
+
+
+A quick look at the source:  
+
+```
+  /** Return an optional first matching string of this `Regex` in the given character sequence,
+   *  or None if there is no match.
+   *
+   *  @param source The text to match against.
+   *  @return       An [[scala.Option]] of the first matching string in the text.
+   *  @example      {{{"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"}}}
+   */
+  def findFirstIn(source: CharSequence): Option[String] = {
+    val m = pattern.matcher(source)
+    if (m.find) Some(m.group) else None
+  }
+```  
+
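+findFirstIn is a method of scala.util.matching.Regex. Its input, source, is of type CharSequence, an interface whose most common implementation is String.  
+The return type is Option[String]. In our example the pattern matched, so the returned value is a Some[String].  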
+
+```
+  /** Return an optional first match of this `Regex` in the given character sequence,
+   *  or None if it does not exist.
+   *
+   *  If the match is successful, the [[scala.util.matching.Regex.Match]] can be queried for
+   *  more data.
+   *
+   *  @param source The text to match against.
+   *  @return       A [[scala.Option]] of [[scala.util.matching.Regex.Match]] of the first matching string in the text.
+   *  @example      {{{("""[a-z]""".r findFirstMatchIn "A simple example.") map (_.start) // returns Some(2), the index of the first match in the text}}}
+   */
+  def findFirstMatchIn(source: CharSequence): Option[Match] = {
+    val m = pattern.matcher(source)
+    if (m.find) Some(new Match(source, m, groupNames)) else None
+  }
+```  
+
+The source of findFirstMatchIn differs little from findFirstIn; the main difference is that it returns Option[Match].  
+
+## 2. Match MatchData
+Here is the source of Match:  
+
+```
+  /** Provides information about a successful match. */
+  class Match(val source: CharSequence,
+              private[matching] val matcher: Matcher,
+              val groupNames: Seq[String]) extends MatchData {
+
+    /** The index of the first matched character. */
+    val start = matcher.start
+
+    /** The index following the last matched character. */
+    val end = matcher.end
+
+    /** The number of subgroups. */
+    def groupCount = matcher.groupCount
+
+    private lazy val starts: Array[Int] =
+      ((0 to groupCount) map matcher.start).toArray
+    private lazy val ends: Array[Int] =
+      ((0 to groupCount) map matcher.end).toArray
+
+    /** The index of the first matched character in group `i`. */
+    def start(i: Int) = starts(i)
+
+    /** The index following the last matched character in group `i`. */
+    def end(i: Int) = ends(i)
+
+    /** The match itself with matcher-dependent lazy vals forced,
+     *  so that match is valid even once matcher is advanced.
+     */
+    def force: this.type = { starts; ends; this }
+  }
+```  
+
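+The first comment line is the key: it states the main purpose of the Match class, "Provides information about a successful match". When a match succeeds, this class gives us information about it, such as the start and end positions of the match.  
+Match extends MatchData, so let's look at the source of MatchData as well:  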
+
+```
+ trait MatchData {
+
+    /** The source from which the match originated */
+    val source: CharSequence
+
+    /** The names of the groups, or an empty sequence if none defined */
+    val groupNames: Seq[String]
+
+    /** The number of capturing groups in the pattern.
+     *  (For a given successful match, some of those groups may not have matched any input.)
+     */
+    def groupCount: Int
+
+    /** The index of the first matched character, or -1 if nothing was matched */
+    def start: Int
+
+    /** The index of the first matched character in group `i`,
+     *  or -1 if nothing was matched for that group.
+     */
+    def start(i: Int): Int
+	...
+
+    /** The matched string in group `i`,
+     *  or `null` if nothing was matched.
+     */
+    def group(i: Int): String =
+      if (start(i) >= 0) source.subSequence(start(i), end(i)).toString
+      else null
+	...
+
+    /** Returns the group with given name.
+     *
+     *  @param id The group name
+     *  @return   The requested group
+     *  @throws   NoSuchElementException if the requested group name is not defined
+     */
+    def group(id: String): String = nameToIndex.get(id) match {
+      case None => throw new NoSuchElementException("group name "+id+" not defined")
+      case Some(index) => group(index)
+    }
+```  
+The most used and most important method in MatchData is probably group, whose main job is to extract capture groups.  
+
+## 3. Extracting groups
+
+```
+  @Test
+  def test() = {
+    val s = "你好，今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
+    val result = pattern.findFirstMatchIn(s)
+    val year = result match {
+      case Some(data) => data group 1
+      case _ => "-1"
+    }
+    println(year)  // prints 2021
+  }
+```  
+
+This is a typical case of group extraction: call group on the Match returned by findFirstMatchIn to take the first capture group of the match, which gives the year.  
+
+## 4. Another way to extract groups
+In practice there is another common way to extract groups.  
+
+```
+  @Test
+  def test() = {
+    val s = "你好，今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
+    val pattern(year, month, day) = s
+    println(s"year is $year.\n" +
+      f"month is $month.\n" + raw"day is $day")
+  }
+```  
+
+The code above looks perfectly fine, yet it throws an error at runtime. This is exactly where I was stuck for a long time.  
+
+```
+scala.MatchError: 你好，今天是2021年1月2日18点30分 (of class java.lang.String)
+
+	at com.xiaomi.mifi.pdata.common.T4.t8(T4.scala:114)
+	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+	at java.lang.reflect.Method.invoke(Method.java:498)
+	...
+```  
+
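+I was baffled at the time and could not see where the problem was. Only after reading the source carefully did it make sense. If we click through in the IDE on the line  
+```val pattern(year, month, day) = s```  
+to view the source, we find that it actually calls the unapplySeq method.  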
+
+```
+  def unapplySeq(s: CharSequence): Option[List[String]] = s match {
+    case null => None
+    case _    =>
+      val m = pattern matcher s
+      if (runMatcher(m)) Some((1 to m.groupCount).toList map m.group)
+      else None
+  }
+```  
+
+There is a key comment above this method:  
+
+```
+  /** Tries to match a [[java.lang.CharSequence]].
+   *
+   *  If the match succeeds, the result is a list of the matching
+   *  groups (or a `null` element if a group did not match any input).
+   *  If the pattern specifies no groups, then the result will be an empty list
+   *  on a successful match.
+   *
+   *  This method attempts to match the entire input by default; to find the next
+   *  matching subsequence, use an unanchored `Regex`.
+```  
+
+By default this method tries to match the entire input. To match a substring, use an unanchored Regex.  
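For readers who know Python better than Scala, the same distinction exists in Python's re module. The sketch below is only an analogy, not Scala code: re.fullmatch behaves like the default (anchored) extractor and re.search like an unanchored one.

```python
import re

s = "你好，今天是2021年1月2日18点30分"
pattern = re.compile(r"今天是(\d+)年(\d+)月(\d+)日")

# fullmatch must consume the whole input, so the surrounding text makes it fail.
anchored = pattern.fullmatch(s)

# search looks for the first matching substring anywhere in the input.
unanchored = pattern.search(s)
year, month, day = unanchored.groups()

print(anchored)
print(year, month, day)
```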
+
+With a small change to the code above:  
+```
+  @Test
+  def test() = {
+    val s = "你好，今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r.unanchored
+    val pattern(year, month, day) = s
+    println(s"year is $year.\n" +
+      f"month is $month.\n" + raw"day is $day")
+  }
+```  
+
+we get the expected result:  
+
+```
+year is 2021.
+month is 1.
+day is 2
+```  
+
+## 5. findAllIn findAllMatchIn  
+findAllIn corresponds to findFirstIn, and findAllMatchIn corresponds to findFirstMatchIn; both return all matches rather than just the first.  
+Start with an example:  
+
+```
+  @Test
+  def t9() = {
+    val dateRegex =  """(\d{4})-(\d{2})-(\d{2})""".r
+    val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
+    val result =  dateRegex.findAllIn(dates)
+    val array =  for (each <- result) yield each
+    println(array)
+    println(array.mkString("\t"))
+  }
+```  
+
+```
+non-empty iterator
+2004-01-20	2005-02-28	1998-01-15	2009-10-25
+```  
+
+The declaration of findAllIn is:  
+```
+  /** Return all non-overlapping matches of this `Regex` in the given character 
+   *  sequence as a [[scala.util.matching.Regex.MatchIterator]],
+   *  which is a special [[scala.collection.Iterator]] that returns the
+   *  matched strings but can also be queried for more data about the last match,
+   *  such as capturing groups and start position.
+   ....
+
+  def findAllIn(source: CharSequence) = new Regex.MatchIterator(source, this, groupNames)
+```  
+
+It returns a MatchIterator. According to the comment, MatchIterator is a special scala.collection.Iterator, which is why println(array) prints only non-empty iterator.  
+
+To get the year of every match, use findAllMatchIn. It yields all the Match objects, and the year can then be extracted from each one with group.  
+
+```
+  @Test
+  def t10() = {
+    val dateRegex =  """(\d{4})-(\d{2})-(\d{2})""".r
+    val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
+    val result = dateRegex.findAllMatchIn(dates)
+    val array = for(each <- result) yield each.group(1)
+    println(array)
+    println(array.mkString("\t"))
+  }
+```  
+
+The final output is:  
+
+```
+non-empty iterator
+2004	2005	1998	2009
+```
+
+
+
+
--- a/tools/linux-shell/统计文本去重行数.md
+++ b/tools/linux-shell/统计文本去重行数.md
@ -0,0 +1,27 @@
+A common task: count the number of distinct lines in a text file.  
+
+The following command does it:  
+
+```
+sort xxxfile | uniq | wc -l
+```  
+
+So does this one:  
+
+```
+sort -u xxxfile | wc -l
+```  
+
+A brief explanation:  
+
+The -u option of sort is documented as follows:  
+
+```
+     -u, --unique
+             Unique keys.  Suppress all lines that have a key that is equal to an already processed one.  This option, similarly to -s, implies a stable sort.  If used with -c or -C,
+             sort also checks that there are no lines with duplicate keys.
+```  
+
+So the -u option of sort has deduplication built in.  
+
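+uniq, on the other hand, only detects duplicate lines when they are adjacent. To count distinct lines, first sort the file so that equal lines sit next to each other, then run uniq to remove the duplicates.  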
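Why the sort step matters can be simulated in a few lines of Python. The uniq function below is a simplified stand-in for the shell command (it collapses runs of adjacent equal lines, nothing more), and the sample lines are made up.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent equal lines, like the uniq command."""
    return [key for key, _ in groupby(lines)]


lines = ["b", "a", "b", "a", "a"]

without_sort = len(uniq(lines))       # only the trailing "a", "a" run collapses
with_sort = len(uniq(sorted(lines)))  # equal lines are adjacent after sorting
distinct = len(set(lines))            # reference answer

print(without_sort, with_sort, distinct)
```

Without sorting, the count is too high because the repeated lines are not next to each other.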
--- a/traditional-algorithm/tree/史上最全macos安装xgboost教程.md
+++ b/traditional-algorithm/tree/史上最全macos安装xgboost教程.md
@ -0,0 +1,93 @@
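+## 0. Preface
+I had never managed to get xgboost installed on my macOS machine. Recently I needed it for work, so I set out to install it.  
+I expected this to be simple, but it took some twists and turns, so I am writing the process down.  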
+
+## 1. Direct installation fails
+At first I simply ran  
+
+```
+pip install xgboost
+```  
+to install it. The installation itself went fine, but a problem appeared as soon as I tried to use it.  
+
+```
+import xgboost as xgb
+```  
+Importing xgboost immediately raised an error:  
+
+```
+xgboost.core.XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded.
+Likely causes:
+  * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libgomp.so for UNIX-like OSes)
+  * You are running 32-bit Python on a 64-bit OS
+....
+```  
+
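+I looked into it, and the cause is roughly this:  
+XGBoost supports multithreading, that is, training with multiple CPU threads;  
+however, the default Apple clang compiler does not support OpenMP, so building with the default compiler disables multithreading.  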
+
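+## 2. Solution 1
+Searching online, most of the suggested fixes follow the same routine:  
+upgrade Homebrew, install a newer gcc through Homebrew, git clone the xgboost source, build it, and then install.  
+
+It turned out that every step, whether upgrading Homebrew, installing gcc, or cloning the source, was painfully difficult. Those who have been there will understand.  
+
+So this approach is workable in principle but hellishly hard in practice, and I gave up on it.  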
+
+## 3. Solution 2
+While searching, I found someone who offered a single command that solves the problem:  
+
+```
+conda install py-xgboost
+```  
+
+Several posts reported that this method is simple, blunt, and effective, so I gave it a try.  
+But conda let me down.  
+
+```
+Solving environment: failed with initial frozen solve. Retrying with flexible solve.
+......
+```  
+
+
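+## 4. Getting conda working again
+The conda failure was fairly clearly a problem with the package source (channel). Cue another sigh...  
+I searched for a long time and tried many mirrors, but none of them worked.  
+Finally I read the anaconda page of the Tsinghua open source mirror site carefully and, just to see what would happen, pasted the configuration from that page into my local .condarc file:  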
+
+```
+channels:
+  - defaults
+show_channel_urls: true
+channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
+default_channels:
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
+custom_channels:
+  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+```  
+
+
+Link to the anaconda page of the Tsinghua open source mirror site:  
+[Tsinghua anaconda mirror](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/)  
+
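+Seeing this made me a little wistful: China's IT industry is booming, yet something this important and this basic is maintained voluntarily by students at a university, out of their own interest...  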
+
+## 5. Success
+With the conda configuration updated, I ran the install command again:  
+
+```
+conda install py-xgboost
+```  
+
+That did it: xgb code now runs normally on my machine.  
+When I have time, I will look into what is special about py-xgboost.