From 36769abda5424cce56ce767e489fbfebd5da3a12 Mon Sep 17 00:00:00 2001
From: wanglei
Date: Sat, 30 Jan 2021 20:34:22 +0800
Subject: [PATCH] add some pandas code

---
 .../pandas cut qcut 分箱算法详解.md               | 108 ++++++
 .../python/pandas groupby 用法详解.md             | 105 ++++++
 .../pandas map applymap apply方法详解.md          | 328 ++++++++++++++++++
 ...llMatchIn Match MatchData 提取分组.md          | 308 ++++++++++++++++
 tools/linux-shell/统计文本去重行数.md             |  27 ++
 .../史上最全macos安装xgboost教程.md               |  93 +++++
 6 files changed, 969 insertions(+)
 create mode 100644 code-languages/python/pandas cut qcut 分箱算法详解.md
 create mode 100644 code-languages/python/pandas groupby 用法详解.md
 create mode 100644 code-languages/python/pandas map applymap apply方法详解.md
 create mode 100644 code-languages/scala/scala正则表达式 findFirstIn findAllIn findFirstMatchIn findAllMatchIn Match MatchData 提取分组.md
 create mode 100644 tools/linux-shell/统计文本去重行数.md
 create mode 100644 traditional-algorithm/tree/史上最全macos安装xgboost教程.md

diff --git a/code-languages/python/pandas cut qcut 分箱算法详解.md b/code-languages/python/pandas cut qcut 分箱算法详解.md
new file mode 100644
index 0000000..2bcf132
--- /dev/null
+++ b/code-languages/python/pandas cut qcut 分箱算法详解.md
@@ -0,0 +1,108 @@
+## 1. Binning
+The need to bin data comes up all the time in practice. Given a set of continuous values, we split them into several segments and treat each segment as a category; this process is called binning. Binning is essentially the discretization of continuous values.
+
+A familiar example is binning ages. Suppose ages range from 0 to 120. We might treat 0-5 as infants, 6-15 as juveniles, 16-30 as young adults, 31-50 as middle-aged, 51-60 as middle-aged-to-senior, and over 60 as seniors. The continuous ages are thereby divided into six categories — six "bins" — with each bin standing for one category.
+
+## 2. The cut method
+pandas offers both cut and qcut for binning. Let us look at cut first.
+
+```
+def t1():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    cut = pd.cut(scores, 3)
+    print(cut)
+```
+
+This splits scores into three equal-width intervals. The result is:
+
+```
+[(77.667, 99.0], (34.936, 56.333], (77.667, 99.0], (77.667, 99.0], (56.333, 77.667], (34.936, 56.333], (77.667, 99.0], (56.333, 77.667]]
+Categories (3, interval[float64]): [(34.936, 56.333] < (56.333, 77.667] < (77.667, 99.0]]
+```
+
+The first line of the output shows which bin each original value falls into; the second line describes the three bins themselves.
+
+```
+def t2():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    bins = [0, 60, 80, 100]
+    cut = pd.cut(scores, bins)
+    print(cut)
+
+    print(cut.codes)
+    print(cut.categories)
+    print(pd.value_counts(cut))
+```
+
+The output is:
+
+```
+[(60, 80], (0, 60], (60, 80], (80, 100], (0, 60], (0, 60], (80, 100], (0, 60]]
+Categories (3, interval[int64]): [(0, 60] < (60, 80] < (80, 100]]
+[1 0 1 2 0 0 2 0]
+IntervalIndex([(0, 60], (60, 80], (80, 100]],
+              closed='right',
+              dtype='interval[int64]')
+(0, 60]      4
+(80, 100]    2
+(60, 80]     2
+dtype: int64
+```
+
+Here explicit bin edges are passed in, so the intervals are (0, 60], (60, 80] and (80, 100]. codes gives each value's bin index, categories lists the intervals, and value_counts counts how many values land in each interval.
+
+```
+def t3():
+    scores = [80, 55, 78, 99, 60, 35, 82, 57]
+    bins = [0, 60, 80, 100]
+    cut = pd.cut(scores, bins, labels=["low", "mid", "high"])
+    print(pd.value_counts(cut))
+    print()
+
+    cut2 = pd.cut(scores, bins, labels=["low", "mid", "high"], right=False)
+    print(pd.value_counts(cut2))
+```
+
+```
+low     4
+high    2
+mid     2
+dtype: int64
+
+high    3
+low     3
+mid     2
+dtype: int64
+```
+
+Here the labels parameter gives each bin a label name.
+If right=False is passed, the intervals become open on the right (and closed on the left) instead of the default right-closed form.
+
+## 3. The qcut method
+
+```
+def t4():
+    scores = [x**2 for x in range(11)]
+    cut = pd.qcut(scores, 5)
+    print(cut)
+    print()
+    print(pd.value_counts(cut))
+```
+
+```
+[(-0.001, 4.0], (-0.001, 4.0], (-0.001, 4.0], (4.0, 16.0], (4.0, 16.0], ..., (16.0, 36.0], (36.0, 64.0], (36.0, 64.0], (64.0, 100.0], (64.0, 100.0]]
+Length: 11
+Categories (5, interval[float64]): [(-0.001, 4.0] < (4.0, 16.0] < (16.0, 36.0] < (36.0, 64.0] <
+                                    (64.0, 100.0]]
+
+(-0.001, 4.0]    3
+(64.0, 100.0]    2
+(36.0, 64.0]     2
+(16.0, 36.0]     2
+(4.0, 16.0]      2
+dtype: int64
+```
+
+Unlike cut, which bins by value (equal-width intervals by default, or explicit edges), qcut bins by sample quantiles, so each bin receives roughly the same number of observations. The call above splits the input into five bins of (nearly) equal size.
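+
+As a quick sanity check of the equal-frequency behavior (my own sketch, not part of the original examples), retbins=True additionally returns the quantile edges qcut computed, and labels works just as it does with cut:
+
+```
+def t5():
+    scores = [x**2 for x in range(11)]
+    # retbins=True also returns the computed quantile edges as an array
+    cut, edges = pd.qcut(scores, 5, labels=["q1", "q2", "q3", "q4", "q5"], retbins=True)
+    print(pd.value_counts(cut))
+    print(edges)
+```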
diff --git a/code-languages/python/pandas groupby 用法详解.md b/code-languages/python/pandas groupby 用法详解.md
new file mode 100644
index 0000000..034d5e7
--- /dev/null
+++ b/code-languages/python/pandas groupby 用法详解.md
@@ -0,0 +1,105 @@
+## 1. Grouping with groupby
+Grouping comes up constantly in day-to-day data analysis: split the data into groups by one or more fields, then analyze each group further — count the rows per group, compute the per-group maximum, minimum, mean, and so on. In SQL this is the famous GROUP BY. pandas has a corresponding groupby operation; let us see how it is used.
+
+## 2. The data structure behind groupby
+First look at the following code:
+
+```
+def ddd():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    g = df.groupby('level')
+    print(g)
+    print()
+    print(list(g))
+```
+
+The output is:
+
+```
+<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
+
+[('L1',   level  num
+0    L1   10
+1    L1   20
+2    L1   30), ('L2',   level  num
+3    L2   20
+4    L2   15), ('L3',   level  num
+5    L3   10
+6    L3   12)]
+```
+
+A groupby call returns a DataFrameGroupBy object; printing it directly only shows its memory address.
+To inspect the data conveniently, we convert it with list, and find that it is a list of tuples: the first element of each tuple is the value of level, and the second element is the sub-dataframe belonging to that group.
+
+## 3. Basic usage of groupby
+```
+def group1():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    scores = [100, 200, 300, 200, 150, 100, 120]
+    df = pd.DataFrame({"level": levels, "num": nums, "score": scores})
+    result = df.groupby('level').agg({'num': 'sum', 'score': 'mean'})
+    allnum = result['num'].sum()
+    result['rate'] = result['num'].map(lambda x: x / allnum)
+    print(result)
+```
+
+The final output:
+
+```
+       num  score      rate
+level
+L1      60    200  0.512821
+L2      35    175  0.299145
+L3      22    110  0.188034
+```
+
+The example above shows the basic usage of groupby.
+We group the dataframe by level, then sum the num column and average the score column, which gives result.
+We also want each group's share of num in the overall total. So we first compute the total of num, then use map to add a rate column holding each group's share.
+
+## 4. Using transform
+
+Now a more involved example.
+
+```
+def t10():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    ret = df.groupby('level')['num'].mean().to_dict()
+    df['avg_num'] = df['level'].map(ret)
+    print(ret)
+    print(df)
+```
+
+```
+{'L1': 20.0, 'L2': 17.5, 'L3': 11.0}
+  level  num  avg_num
+0    L1   10     20.0
+1    L1   20     20.0
+2    L1   30     20.0
+3    L2   20     17.5
+4    L2   15     17.5
+5    L3   10     11.0
+6    L3   12     11.0
+```
+
+Here, after grouping by level, we want to add a column giving each row the average num of its level.
+The solution above first computes each group's mean, converts it to a dict, and then uses map to attach each group's mean to each row.
+
+```
+def trans():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    df['avg_num'] = df.groupby('level')['num'].transform('mean')
+    print(df)
+```
+With transform, as shown above, the code becomes simpler and more direct.
+
+What transform does: it applies the function to each group and returns a result indexed exactly like the original dataframe, filled with the transformed values — which in effect adds a column to the original dataframe.
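+
+As a further illustration (my own sketch, not from the original post), the share-of-total rate computed in section 3 with agg + map can also be written with transform, without building an intermediate result table:
+
+```
+def trans2():
+    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
+    nums = [10, 20, 30, 20, 15, 10, 12]
+    df = pd.DataFrame({"level": levels, "num": nums})
+    # per-group sum broadcast back onto each row, divided by the overall sum
+    df['rate'] = df.groupby('level')['num'].transform('sum') / df['num'].sum()
+    print(df)
+```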
diff --git a/code-languages/python/pandas map applymap apply方法详解.md b/code-languages/python/pandas map applymap apply方法详解.md
new file mode 100644
index 0000000..05d52f9
--- /dev/null
+++ b/code-languages/python/pandas map applymap apply方法详解.md
@@ -0,0 +1,328 @@
+## 0. Preface
+
+pandas' basic data structures are Series and DataFrame. Operating on every element, or on every row or column, is an extremely common need in data processing, and pandas has map, applymap and apply built in for exactly that. Below we walk through some basic, everyday, and fancier usages with concrete examples.
+
+## 1. The map method
+map is a bread-and-butter operation in data processing; its importance needs no elaboration. map operates on elements one by one. Let us look at a few examples.
+
+First, be clear about one thing: map can only be used on a Series, not on a DataFrame. In other words, DataFrame has no map method.
+
+Part of the source of Series.map:
+```
+    def map(self, arg, na_action=None):
+        """
+        Map values of Series according to input correspondence.
+
+        Used for substituting each value in a Series with another value,
+        that may be derived from a function, a ``dict`` or
+        a :class:`Series`.
+
+        Parameters
+        ----------
+        arg : function, collections.abc.Mapping subclass or Series
+            Mapping correspondence.
+        na_action : {None, 'ignore'}, default None
+            If 'ignore', propagate NaN values, without passing them to the
+            mapping correspondence.
+
+        Returns
+        -------
+        Series
+            Same index as caller.
+
+        See Also
+        --------
+        Series.apply : For applying more complex functions on a Series.
+        DataFrame.apply : Apply a function row-/column-wise.
+        DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
+
+        Notes
+        -----
+        When ``arg`` is a dictionary, values in Series that are not in the
+        dictionary (as keys) are converted to ``NaN``. However, if the
+        dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
+        provides a method for default values), then this default is used
+        rather than ``NaN``.
+```
+
+map's key parameter is arg — a function, dict, or Series — which defines the correspondence applied to every element.
+
+An example:
+
+```
+import numpy as np
+import pandas as pd
+
+def test():
+    genders = ["male", "male", "female", "unknown", "female"]
+    levels = ["L1", "L2", "L1", "L1", "L2"]
+    df = pd.DataFrame({"gender": genders, "level": levels})
+
+    gender_dic = {"male": "男", "female": "女", "unknown": "未知"}
+    print(df)
+    print("\n\n")
+    df["gender"] = df["gender"].map(gender_dic)
+    print(df)
+```
+
+The output:
+
+```
+    gender level
+0     male    L1
+1     male    L2
+2   female    L1
+3  unknown    L1
+4   female    L2
+
+
+
+  gender level
+0      男    L1
+1      男    L2
+2      女    L1
+3     未知    L1
+4      女    L2
+```
+
+The code above maps "male" in the gender column to 男, "female" to 女, and "unknown" to 未知.
+
+```
+def test():
+    x = [i for i in range(1, 11)]
+    y = [2*i + 0.5 for i in x]
+    df = pd.DataFrame({'x': x, 'y': y})
+    x2 = df['x']
+    print(x2.map(lambda i: "%.2f" % i))
+    print(x2.map(lambda i: "{:.2f}".format(i)))
+```
+
+```
+0     1.00
+1     2.00
+2     3.00
+3     4.00
+4     5.00
+5     6.00
+6     7.00
+7     8.00
+8     9.00
+9    10.00
+Name: x, dtype: object
+0     1.00
+1     2.00
+2     3.00
+3     4.00
+4     5.00
+5     6.00
+6     7.00
+7     8.00
+8     9.00
+9    10.00
+Name: x, dtype: object
+```
+
+This formats each x as a string with two decimal places (note the resulting dtype is object).
+
+Whether the mapping is given as a dict or as a function, map feeds each value one at a time into it and collects the mapped results.
+
+## 2. The applymap method
+As mentioned above, DataFrame has no map method. To get map-like, element-wise behavior on a DataFrame, use applymap.
+
+```
+def t8():
+    x = [i for i in range(1, 11)]
+    y = [2*i + 0.5 for i in x]
+    df = pd.DataFrame({'x': x, 'y': y})
+    print(df)
+    print()
+    print(df.applymap(lambda i: "%.2f" % i))
+```
+
+```
+    x     y
+0   1   2.5
+1   2   4.5
+2   3   6.5
+3   4   8.5
+4   5  10.5
+5   6  12.5
+6   7  14.5
+7   8  16.5
+8   9  18.5
+9  10  20.5
+
+        x      y
+0    1.00   2.50
+1    2.00   4.50
+2    3.00   6.50
+3    4.00   8.50
+4    5.00  10.50
+5    6.00  12.50
+6    7.00  14.50
+7    8.00  16.50
+8    9.00  18.50
+9   10.00  20.50
+
+```
+
+The earlier example applied map to the single column x. If we want to format both x and y of the dataframe to two decimal places at once, applymap does the job.
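+
+One more small sketch of my own (not from the original post): since applymap hands every single cell to the function, a type check inside the function is a common pattern when the columns have mixed types:
+
+```
+def t9():
+    df = pd.DataFrame({'x': [1, 2, 3], 'y': [2.5, 4.5, 6.5]})
+    # format only the floats; leave the integers untouched
+    print(df.applymap(lambda v: "%.2f" % v if isinstance(v, float) else v))
+```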
+
+## 3. The apply method
+apply is similar to map in spirit; the main difference is that apply can take functions with more complex behavior.
+
+```
+    def apply(self, func, convert_dtype=True, args=(), **kwds):
+        """
+        Invoke function on values of Series.
+
+        Can be ufunc (a NumPy function that applies to the entire Series)
+        or a Python function that only works on single values.
+
+        Parameters
+        ----------
+        func : function
+            Python function or NumPy ufunc to apply.
+        convert_dtype : bool, default True
+            Try to find better dtype for elementwise function results. If
+            False, leave as dtype=object.
+        args : tuple
+            Positional arguments passed to func after the series value.
+        **kwds
+            Additional keyword arguments passed to func.
+
+        Returns
+        -------
+        Series or DataFrame
+            If func returns a Series object the result will be a DataFrame.
+
+        See Also
+        --------
+        Series.map: For element-wise operations.
+        Series.agg: Only perform aggregating type operations.
+        Series.transform: Only perform transforming type operations.
+
+```
+
+Looking at the source, the method signature is:
+
+```
+    def apply(self, func, convert_dtype=True, args=(), **kwds):
+```
+
+Compared with map, apply accepts not only func but also extra arguments passed as a tuple, which makes functions with more complex behavior possible.
+
+A few examples:
+
+```
+def square(x):
+    return x**2
+
+def test():
+    s = pd.Series([20, 21, 12], index = ['London', 'New York', 'Helsinki'])
+    s1 = s.apply(lambda x: x**2)
+    s2 = s.apply(square)
+    s3 = s.apply(np.log)
+
+    print(s1)
+    print()
+    print(s2)
+    print()
+    print(s3)
+```
+
+The output:
+
+```
+London      400
+New York    441
+Helsinki    144
+dtype: int64
+
+London      400
+New York    441
+Helsinki    144
+dtype: int64
+
+London      2.995732
+New York    3.044522
+Helsinki    2.484907
+dtype: float64
+```
+
+This usage is straightforward — exactly the same as map.
+
+Now a more involved example:
+
+```
+def BMI(series):
+    weight = series['weight']
+    height = series['height'] / 100
+    BMI_Rate = weight / height**2
+    return BMI_Rate
+
+def test():
+    heights = [180, 175, 169, 158, 185]
+    weights = [75, 72, 68, 60, 76]
+    age = [30, 18, 26, 42, 34]
+    df = pd.DataFrame({"height": heights, "weight": weights, "age": age})
+    print(df)
+    print()
+    df['BMI'] = df.apply(BMI, axis=1)
+    print(df)
+```
+
+The output:
+
+```
+   height  weight  age
+0     180      75   30
+1     175      72   18
+2     169      68   26
+3     158      60   42
+4     185      76   34
+
+   height  weight  age        BMI
+0     180      75   30  23.148148
+1     175      72   18  23.510204
+2     169      68   26  23.808690
+3     158      60   42  24.034610
+4     185      76   34  22.205990
+```
+
+The data contain heights and weights, and we compute BMI = weight / height².
+The apply call specifies axis=1, i.e. it operates row by row. If this is hard to remember, think of it this way: axis=1 collapses the column dimension and keeps the row dimension, so the function operates on each row. At run time, apply is calling BMI once per row of data.
+
+```
+def subtract_custom_value(x, custom_value):
+    return x - custom_value
+
+def test():
+    s = pd.Series([20, 21, 12], index = ['London', 'New York', 'Helsinki'])
+    print(s)
+    print()
+    s1 = s.apply(subtract_custom_value, args=(5,))
+    print(s1)
+```
+
+The output:
+
+```
+London      20
+New York    21
+Helsinki    12
+dtype: int64
+
+London      15
+New York    16
+Helsinki     7
+dtype: int64
+```
+
+This subtracts 5 from each value. Because an extra argument (the 5) has to be passed in, map is powerless here — apply's args parameter is what makes it work.
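+
+Incidentally (my own sketch, not in the original post): the **kwds in the signature means keyword arguments are forwarded to func as well, so the same call can also be written as:
+
+```
+# equivalent to s.apply(subtract_custom_value, args=(5,))
+s.apply(subtract_custom_value, custom_value=5)
+```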
+
+## 4. Summary
+1. map is the basic element-wise operation for Series; DataFrame has no map method.
+2. To apply a map-style operation to every element of a dataframe, use applymap.
+3. apply is the most flexible: it works on both Series and DataFrame, and it accepts extra arguments passed as a tuple.
\ No newline at end of file
diff --git a/code-languages/scala/scala正则表达式 findFirstIn findAllIn findFirstMatchIn findAllMatchIn Match MatchData 提取分组.md b/code-languages/scala/scala正则表达式 findFirstIn findAllIn findFirstMatchIn findAllMatchIn Match MatchData 提取分组.md
new file mode 100644
index 0000000..247b5a2
--- /dev/null
+++ b/code-languages/scala/scala正则表达式 findFirstIn findAllIn findFirstMatchIn findAllMatchIn Match MatchData 提取分组.md
@@ -0,0 +1,308 @@
+## 0. Background
+On the last working day before the holiday, I got stuck for quite a while writing a simple regular expression. Reflecting afterwards, my understanding of Scala's regex API was not deep enough, so I used some holiday time to go through the relevant material carefully and write it down.
+
+## 1. findFirstIn and findFirstMatchIn
+Commonly used regex methods include findFirstIn, findFirstMatchIn and their relatives. Let us start with an example that shows the difference between the two.
+
+```
+  @Test
+  def test() = {
+    val s = "你好,今天是2021年1月2日18点30分"
+    val pattern = """今天是\d+年\d+月\d+日""".r
+    val result1 = pattern.findFirstIn(s)
+    println(result1)
+    val result2 = pattern.findFirstMatchIn(s) match {
+      case Some(data) => {
+        println("data type is: " + data.getClass.getSimpleName)
+        data group 0
+      }
+      case _ => "empty"
+    }
+    println(result2)
+  }
+```
+
+The output:
+
+```
+Some(今天是2021年1月2日)
+data type is: Match
+今天是2021年1月2日
+```
+
+
+A quick look at the source:
+
+```
+  /** Return an optional first matching string of this `Regex` in the given character sequence,
+   *  or None if there is no match.
+   *
+   *  @param source The text to match against.
+   *  @return       An [[scala.Option]] of the first matching string in the text.
+   *  @example      {{{"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"}}}
+   */
+  def findFirstIn(source: CharSequence): Option[String] = {
+    val m = pattern.matcher(source)
+    if (m.find) Some(m.group) else None
+  }
+```
+
+findFirstIn is a method of scala.util.matching.Regex. Its input is a source of type CharSequence, an interface whose most common implementation is String.
+The return type is Option[String]. In our example the pattern matched, so we got back a Some[String].
+
+```
+  /** Return an optional first match of this `Regex` in the given character sequence,
+   *  or None if it does not exist.
+   *
+   *  If the match is successful, the [[scala.util.matching.Regex.Match]] can be queried for
+   *  more data.
+   *
+   *  @param source The text to match against.
+   *  @return       A [[scala.Option]] of [[scala.util.matching.Regex.Match]] of the first matching string in the text.
+   *  @example      {{{("""[a-z]""".r findFirstMatchIn "A simple example.") map (_.start) // returns Some(2), the index of the first match in the text}}}
+   */
+  def findFirstMatchIn(source: CharSequence): Option[Match] = {
+    val m = pattern.matcher(source)
+    if (m.find) Some(new Match(source, m, groupNames)) else None
+  }
+```
+
+findFirstMatchIn's source differs little from findFirstIn's; the key difference is the return type, Option[Match].
+
+## 2. Match and MatchData
+The source of Match:
+
+```
+  /** Provides information about a successful match. */
+  class Match(val source: CharSequence,
+              private[matching] val matcher: Matcher,
+              val groupNames: Seq[String]) extends MatchData {
+
+    /** The index of the first matched character. */
+    val start = matcher.start
+
+    /** The index following the last matched character. */
+    val end = matcher.end
+
+    /** The number of subgroups. */
+    def groupCount = matcher.groupCount
+
+    private lazy val starts: Array[Int] =
+      ((0 to groupCount) map matcher.start).toArray
+    private lazy val ends: Array[Int] =
+      ((0 to groupCount) map matcher.end).toArray
+
+    /** The index of the first matched character in group `i`. */
+    def start(i: Int) = starts(i)
+
+    /** The index following the last matched character in group `i`. */
+    def end(i: Int) = ends(i)
+
+    /** The match itself with matcher-dependent lazy vals forced,
+     *  so that match is valid even once matcher is advanced.
+     */
+    def force: this.type = { starts; ends; this }
+  }
+```
+
+The first comment line is crucial — it states the most important job of the Match class: "Provides information about a successful match." When a match succeeds, this class exposes information about it, such as the start and end positions of the match.
+Match extends MatchData; here is part of MatchData's source:
+
+```
+  trait MatchData {
+
+    /** The source from which the match originated */
+    val source: CharSequence
+
+    /** The names of the groups, or an empty sequence if none defined */
+    val groupNames: Seq[String]
+
+    /** The number of capturing groups in the pattern.
+     *  (For a given successful match, some of those groups may not have matched any input.)
+     */
+    def groupCount: Int
+
+    /** The index of the first matched character, or -1 if nothing was matched */
+    def start: Int
+
+    /** The index of the first matched character in group `i`,
+     *  or -1 if nothing was matched for that group.
+     */
+    def start(i: Int): Int
+    ...
+
+    /** The matched string in group `i`,
+     *  or `null` if nothing was matched.
+     */
+    def group(i: Int): String =
+      if (start(i) >= 0) source.subSequence(start(i), end(i)).toString
+      else null
+    ...
+
+    /** Returns the group with given name.
+     *
+     *  @param id The group name
+     *  @return   The requested group
+     *  @throws   NoSuchElementException if the requested group name is not defined
+     */
+    def group(id: String): String = nameToIndex.get(id) match {
+      case None => throw new NoSuchElementException("group name "+id+" not defined")
+      case Some(index) => group(index)
+    }
+```
+The most used and most important method in MatchData is group: its main purpose is extracting capturing groups.
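+
+A small illustration (my own sketch, not from the original) of the positional information Match exposes:
+
+```
+  @Test
+  def t5() = {
+    val s = "你好,今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
+    pattern.findFirstMatchIn(s) foreach { m =>
+      println(m.start + " " + m.end)  // offsets of the whole match within s
+      println(m.groupCount)           // 3 capturing groups
+    }
+  }
+```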
+
+## 3. Extracting groups
+
+```
+  @Test
+  def test() = {
+    val s = "你好,今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
+    val result = pattern.findFirstMatchIn(s)
+    val year = result match {
+      case Some(data) => data group 1
+      case _ => "-1"
+    }
+    println(year) // prints 2021
+  }
+```
+
+The example above is a typical way to extract groups: call the group method on the Match returned by findFirstMatchIn to pull out the first capturing group of the match result, which yields the year.
+
+## 4. Another way to extract groups
+In practice there is another common way to extract groups.
+
+```
+  @Test
+  def test() = {
+    val s = "你好,今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
+    val pattern(year, month, day) = s
+    println(s"year is $year.\n" +
+      f"month is $month.\n" + raw"day is $day")
+  }
+```
+
+This code looks perfectly normal, nothing wrong at first sight — yet at run time it throws. This is exactly where I got stuck for a long time.
+
+```
+scala.MatchError: 你好,今天是2021年1月2日18点30分 (of class java.lang.String)
+
+	at com.xiaomi.mifi.pdata.common.T4.t8(T4.scala:114)
+	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+	at java.lang.reflect.Method.invoke(Method.java:498)
+	...
+```
+
+I was baffled at first and had no idea where the problem was. Only after reading the source did it become clear. Clicking through
+```val pattern(year, month, day) = s```
+in the IDE shows that what actually gets called is unapplySeq:
+
+```
+  def unapplySeq(s: CharSequence): Option[List[String]] = s match {
+    case null => None
+    case _    =>
+      val m = pattern matcher s
+      if (runMatcher(m)) Some((1 to m.groupCount).toList map m.group)
+      else None
+  }
+```
+
+Above this method sits a crucial piece of documentation:
+
+```
+  /** Tries to match a [[java.lang.CharSequence]].
+   *
+   *  If the match succeeds, the result is a list of the matching
+   *  groups (or a `null` element if a group did not match any input).
+   *  If the pattern specifies no groups, then the result will be an empty list
+   *  on a successful match.
+   *
+   *  This method attempts to match the entire input by default; to find the next
+   *  matching subsequence, use an unanchored `Regex`.
+```
+
+So by default this method tries to match the entire input; to match a subsequence, you need an unanchored Regex.
+
+With a small change to the code:
+```
+  @Test
+  def test() = {
+    val s = "你好,今天是2021年1月2日18点30分"
+    val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r.unanchored
+    val pattern(year, month, day) = s
+    println(s"year is $year.\n" +
+      f"month is $month.\n" + raw"day is $day")
+  }
+```
+
+we get the expected result:
+
+```
+year is 2021.
+month is 1.
+day is 2
+```
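+
+One more common idiom (my own sketch, not from the original): the same extractor can be used inside a match expression, which avoids the MatchError entirely when the input might not match:
+
+```
+  val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r.unanchored
+  s match {
+    case pattern(year, month, day) => println(s"$year-$month-$day")
+    case _ => println("no match")
+  }
+```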
+
+## 5. findAllIn and findAllMatchIn
+findAllIn corresponds to findFirstIn, and findAllMatchIn corresponds to findFirstMatchIn: they return all matches.
+An example first:
+
+```
+  @Test
+  def t9() = {
+    val dateRegex = """(\d{4})-(\d{2})-(\d{2})""".r
+    val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
+    val result = dateRegex.findAllIn(dates)
+    val array = for (each <- result) yield each
+    println(array)
+    println(array.mkString("\t"))
+  }
+```
+
+```
+non-empty iterator
+2004-01-20	2005-02-28	1998-01-15	2009-10-25
+```
+
+findAllIn's signature is as follows:
+```
+  /** Return all non-overlapping matches of this `Regex` in the given character
+   *  sequence as a [[scala.util.matching.Regex.MatchIterator]],
+   *  which is a special [[scala.collection.Iterator]] that returns the
+   *  matched strings but can also be queried for more data about the last match,
+   *  such as capturing groups and start position.
+   ....
+
+  def findAllIn(source: CharSequence) = new Regex.MatchIterator(source, this, groupNames)
+```
+
+It returns a MatchIterator; from the documentation, MatchIterator is a special kind of scala.collection.Iterator, which is why println(array) prints "non-empty iterator".
+
+If we instead want all the years that matched, we can use findAllMatchIn: first obtain all the Match objects, then extract the year group from each one.
+
+```
+  @Test
+  def t10() = {
+    val dateRegex = """(\d{4})-(\d{2})-(\d{2})""".r
+    val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
+    val result = dateRegex.findAllMatchIn(dates)
+    val array = for(each <- result) yield each.group(1)
+    println(array)
+    println(array.mkString("\t"))
+  }
+```
+
+The final output:
+
+```
+non-empty iterator
+2004	2005	1998	2009
+```
diff --git a/tools/linux-shell/统计文本去重行数.md b/tools/linux-shell/统计文本去重行数.md
new file mode 100644
index 0000000..58695f1
--- /dev/null
+++ b/tools/linux-shell/统计文本去重行数.md
@@ -0,0 +1,27 @@
+A common need: count the number of distinct lines in a text file.
+
+You can use:
+
+```
+sort xxxfile | uniq | wc -l
+```
+
+or, equivalently:
+
+```
+sort -u xxxfile | wc -l
+```
+
+A brief explanation.
+
+sort's -u option is documented as follows:
+
+```
+     -u, --unique
+             Unique keys.  Suppress all lines that have a key that is equal to an already processed one.  This option, similarly to -s, implies a stable sort.  If used with -c or -C,
+             sort also checks that there are no lines with duplicate keys.
+```
+
+So sort's -u option has deduplication built in.
+
+uniq, by contrast, only detects duplicate lines when they are adjacent. That is why, to count distinct lines, you must sort first and then deduplicate with uniq — only then does the pipeline achieve the deduplicated count.
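+
+As an aside (my own addition, not in the original note): awk can do the same in a single pass without sorting, by remembering each line as a hash key — often faster on large files:
+
+```
+awk '!seen[$0]++' xxxfile | wc -l
+```
\ No newline at end of file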
diff --git a/traditional-algorithm/tree/史上最全macos安装xgboost教程.md b/traditional-algorithm/tree/史上最全macos安装xgboost教程.md
new file mode 100644
index 0000000..c3ee5fe
--- /dev/null
+++ b/traditional-algorithm/tree/史上最全macos安装xgboost教程.md
@@ -0,0 +1,93 @@
+## 0. Preface
+I had never managed to get xgboost installed on my macOS machine. Recently I needed it for work, so I tried again.
+I expected it to be trivial; it turned out to take some twists and turns, so I am writing them down.
+
+## 1. Direct install fails
+At first I simply ran
+
+```
+pip install xgboost
+```
+
+and the installation itself went fine. The problem appeared when using it:
+
+```
+import xgboost as xgb
+```
+
+failed immediately on import with
+
+```
+xgboost.core.XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded.
+Likely causes:
+  * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libgomp.so for UNIX-like OSes)
+  * You are running 32-bit Python on a 64-bit OS
+....
+```
+
+A bit of digging suggests the cause is roughly this:
+the xgboost model itself supports multi-threaded training, i.e. training with multiple CPU threads;
+but the default Apple clang compiler does not support OpenMP, so building with the default compiler disables multi-threading — and the OpenMP runtime the library expects is missing.
+
+## 2. Fix attempt 1
+Searching around, most write-ups follow the same recipe:
+first upgrade homebrew, then install a newer gcc via homebrew, then git clone the xgboost source, build it, and install.
+
+In practice, every single step — upgrading homebrew, installing gcc, cloning the source — turned out to be next to impossible from where I sit; those who know, know.
+
+So this route is feasible in principle, but it is nightmare-mode difficulty, and I gave up on it.
+
+## 3. Fix attempt 2
+While searching I found someone claiming a single line solves the problem:
+
+```
+conda install py-xgboost
+```
+
+Several posts reported that this blunt approach just works, so I gave it a try.
+And then conda dropped the ball:
+
+```
+Solving environment: failed with initial frozen solve. Retrying with flexible solve.
+......
+```
+
+
+## 4. Getting conda back on track
+conda's failure was fairly clearly a package-source problem. Another sigh...
+I hunted around and tried N different mirrors; none of them worked.
+Finally I read the anaconda page of the Tsinghua open-source mirror carefully and, just to see what would happen, pasted its official configuration into my local .condarc:
+
+```
+channels:
+  - defaults
+show_channel_urls: true
+channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
+default_channels:
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
+  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
+custom_channels:
+  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
+```
+
+
+The Tsinghua open-source mirror's anaconda page:
+[清华anaconda镜像](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/)
+
+Reading this, I could not help a small sigh: China's IT industry is booming, yet something this important and this basic is maintained by university students out of their own enthusiasm...
+
+## 5. Done
+With conda's configuration fixed, run the install command again:
+
+```
+conda install py-xgboost
+```
+
+and it works — xgboost code now runs fine locally.
+When I find time I will look into what is special about this py-xgboost package.
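+
+A quick way to confirm the install works (my own sketch — a tiny fit on random data, nothing more):
+
+```
+import numpy as np
+import xgboost as xgb
+
+X = np.random.rand(100, 5)
+y = np.random.randint(2, size=100)
+
+# a tiny binary classifier; if this fits and predicts, the native library loaded fine
+model = xgb.XGBClassifier(n_estimators=10)
+model.fit(X, y)
+print(model.predict(X[:5]))
+```
\ No newline at end of file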