add some pandas code
parent
c1c769f630
commit
36769abda5
|
@ -0,0 +1,108 @@
|
|||
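## 1. Binning

Binning is needed all the time in practice. Given a set of continuous values, we split the range into several segments and treat each segment as one category; this process is called binning. In essence, binning is the discretization of continuous values.

A common example:

The most familiar case is binning ages. Suppose ages range from 0 to 120. We treat 0-5 as infants, 6-15 as juveniles, 16-30 as young adults, 31-50 as middle-aged, 51-60 as late middle-aged, and over 60 as elderly. This turns continuous ages into six categories, or six "bins", each of which represents one category.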
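
The age example can be sketched with pandas' cut function, which the next section introduces. The sample ages and the English label names are made up for illustration, and the sketch assumes age 50 belongs to the middle-aged bin:

```python
import pandas as pd

# Hypothetical sample ages, chosen so that one value lands in each bin
ages = [3, 12, 25, 40, 55, 70]

# Bin edges for the ranges 0-5, 6-15, 16-30, 31-50, 51-60, and over 60
bins = [0, 5, 15, 30, 50, 60, 120]
labels = ["infant", "juvenile", "young adult",
          "middle-aged", "late middle-aged", "elderly"]

# include_lowest=True makes the first bin include its left edge, so age 0 is covered
age_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(list(age_groups))
```

Each age is replaced by the label of the bin it falls into, which is exactly the "continuous value to category" step described above.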

## 2. The cut method

pandas provides both the cut and qcut methods for binning. We start with cut.

```
import pandas as pd

def t1():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    # Split the value range into 3 equal-width bins
    cut = pd.cut(scores, 3)
    print(cut)
```

The code above splits scores into three intervals. The result is:

```
[(77.667, 99.0], (34.936, 56.333], (77.667, 99.0], (77.667, 99.0], (56.333, 77.667], (34.936, 56.333], (77.667, 99.0], (56.333, 77.667]]
Categories (3, interval[float64]): [(34.936, 56.333] < (56.333, 77.667] < (77.667, 99.0]]
```

The first line of the output shows which bin each original value falls into; the second line describes the three bins.
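
To see where the bin edges come from, the following sketch asks cut to return them via its retbins parameter. It reuses the scores from the example; the variable names are my own. The inner edges are one bin width apart, and the lowest edge is nudged slightly below the minimum so that the smallest score is still included in the first right-closed interval:

```python
import pandas as pd

scores = [80, 55, 78, 99, 60, 35, 82, 57]

# retbins=True additionally returns the computed bin edges
binned, edges = pd.cut(scores, 3, retbins=True)
print(edges)

# The inner edges are one bin width apart: (max - min) / 3
width = (max(scores) - min(scores)) / 3
print(width, edges[2] - edges[1])
```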
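
```
def t2():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    # Explicit bin edges instead of a bin count
    bins = [0, 60, 80, 100]
    cut = pd.cut(scores, bins)
    print(cut)

    print(cut.codes)        # index of the bin each value falls into
    print(cut.categories)   # the bins themselves
    # Series.value_counts works across pandas versions
    # (the top-level pd.value_counts is deprecated in newer releases)
    print(pd.Series(cut).value_counts())
```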

The output is:

```
[(60, 80], (0, 60], (60, 80], (80, 100], (0, 60], (0, 60], (80, 100], (0, 60]]
Categories (3, interval[int64]): [(0, 60] < (60, 80] < (80, 100]]
[1 0 1 2 0 0 2 0]
IntervalIndex([(0, 60], (60, 80], (80, 100]],
              closed='right',
              dtype='interval[int64]')
(0, 60]      4
(80, 100]    2
(60, 80]     2
dtype: int64
```

The code above specifies the bin edges explicitly, so the resulting intervals are (0, 60], (60, 80], and (80, 100].

The value_counts method counts how many values fall into each interval.

```
def t3():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    bins = [0, 60, 80, 100]
    # labels gives each bin a readable name
    cut = pd.cut(scores, bins, labels=["low", "mid", "high"])
    print(pd.Series(cut).value_counts())
    print()

    # right=False makes the intervals left-closed and right-open
    cut2 = pd.cut(scores, bins, labels=["low", "mid", "high"], right=False)
    print(pd.Series(cut2).value_counts())
```

```
low     4
high    2
mid     2
dtype: int64

high    3
low     3
mid     2
dtype: int64
```

The code above passes the labels parameter, which gives each bin a label name.

With right=False, the right end of each interval changes from closed (the default) to open, and the left end becomes closed.
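
The effect of right is easiest to see on values that sit exactly on a bin edge. A minimal sketch, reusing the bins and labels from the example (the variable names are my own):

```python
import pandas as pd

bins = [0, 60, 80, 100]
labels = ["low", "mid", "high"]
edge_scores = [60, 80]  # values that sit exactly on a bin edge

# Default right=True: intervals are (0, 60], (60, 80], (80, 100]
closed_right = list(pd.cut(edge_scores, bins, labels=labels))
# right=False: intervals are [0, 60), [60, 80), [80, 100)
closed_left = list(pd.cut(edge_scores, bins, labels=labels, right=False))

print(closed_right)
print(closed_left)
```

This is why the counts in the second output differ: the scores 60 and 80 each move up one bin when right=False.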

## 3. The qcut method

```
def t4():
    scores = [x**2 for x in range(11)]
    # Split into 5 quantile-based bins
    cut = pd.qcut(scores, 5)
    print(cut)
    print()
    print(pd.Series(cut).value_counts())
```

```
[(-0.001, 4.0], (-0.001, 4.0], (-0.001, 4.0], (4.0, 16.0], (4.0, 16.0], ..., (16.0, 36.0], (36.0, 64.0], (36.0, 64.0], (64.0, 100.0], (64.0, 100.0]]
Length: 11
Categories (5, interval[float64]): [(-0.001, 4.0] < (4.0, 16.0] < (16.0, 36.0] < (36.0, 64.0] <
                                    (64.0, 100.0]]

(-0.001, 4.0]    3
(64.0, 100.0]    2
(36.0, 64.0]     2
(16.0, 36.0]     2
(4.0, 16.0]      2
dtype: int64
```

Unlike cut, which splits by value (equal-width intervals when given a bin count), qcut splits by the number of observations (quantiles). The call above divides the input into five bins that hold roughly the same number of values; with 11 values, the first bin gets 3 and the others get 2 each.
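
To make the contrast concrete, the sketch below bins the same scores with both functions and counts the values per bin. Using numpy's bincount on the category codes is my own choice for tallying, not something from the original post:

```python
import numpy as np
import pandas as pd

scores = [x**2 for x in range(11)]

# .codes holds the bin index of each value; bincount tallies the values per bin
cut_counts = np.bincount(pd.cut(scores, 5).codes, minlength=5).tolist()
qcut_counts = np.bincount(pd.qcut(scores, 5).codes, minlength=5).tolist()

print(cut_counts)   # equal-width bins: the counts are uneven
print(qcut_counts)  # quantile bins: the counts are nearly equal
```

Because the squared values crowd toward the low end, equal-width bins put most of them in the first bin, while the quantile bins adapt their widths to keep the counts balanced.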
|
|
@ -0,0 +1,105 @@
|
|||
## 1. Grouping with groupby

Grouping is a frequent need in day-to-day data analysis. Concretely, we split the data into groups by one or more fields and then analyze each group further, for example by counting the rows per group or computing the maximum, minimum, or mean within each group. In SQL, this is the well-known GROUP BY operation.

pandas has a corresponding groupby operation. Let's see how to use it.

## 2. The data structure behind groupby

First, look at the following code:

```
import pandas as pd

def ddd():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    g = df.groupby('level')
    print(g)
    print()
    # list() materializes the (key, sub-DataFrame) pairs
    print(list(g))
```

The output is as follows:

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10f6f96d0>

[('L1',   level  num
0    L1   10
1    L1   20
2    L1   30), ('L2',   level  num
3    L2   20
4    L2   15), ('L3',   level  num
5    L3   10
6    L3   12)]
```

After groupby we get a DataFrameGroupBy object. Printing it directly only shows its type and memory address.

To inspect the data more easily, we convert it with list and find that it is a list of tuples, one per group. The first element of each tuple is the value of level; the second is the DataFrame holding all rows of that group.
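
Since the grouped object yields (key, sub-DataFrame) pairs, it can also be iterated directly instead of being converted to a list first. A minimal sketch on the same data (the variable names are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})
g = df.groupby("level")

# Iterating yields the same (group key, sub-DataFrame) pairs that list(g) shows
group_sizes = {name: len(group) for name, group in g}
print(group_sizes)

# get_group returns the sub-DataFrame for a single key
l2_nums = g.get_group("L2")["num"].tolist()
print(l2_nums)
```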

## 3. Basic usage of groupby

```
def group1():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    scores = [100, 200, 300, 200, 150, 100, 120]
    df = pd.DataFrame({"level": levels, "num": nums, "score": scores})
    # Sum num and average score within each level
    result = df.groupby('level').agg({'num': 'sum', 'score': 'mean'})
    allnum = result['num'].sum()
    # Share of each group's num in the overall total
    result['rate'] = result['num'].map(lambda x: x / allnum)
    print(result)
```

The final output:

```
       num  score      rate
level
L1      60    200  0.512821
L2      35    175  0.299145
L3      22    110  0.188034
```

The example above shows the basic usage of groupby.

We group the DataFrame by level, then sum the num column and average the score column to get result.

We also want each group's share of the overall num total. So we first compute the total of num, then use map to add a column to result holding that share.
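
As a side note, the share column does not strictly need map: dividing the column by the scalar total gives the same values. A short sketch of that alternative on the same data, offered as one possible variant rather than a correction:

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
    "score": [100, 200, 300, 200, 150, 100, 120],
})
result = df.groupby("level").agg({"num": "sum", "score": "mean"})

# Dividing a column by a scalar is applied to every row at once
result["rate"] = result["num"] / result["num"].sum()
print(result)
```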

## 4. Using transform

Now let's look at a more involved example.

```
def t10():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    # Mean of num per level, as a plain dict: {level: mean}
    ret = df.groupby('level')['num'].mean().to_dict()
    # Look up each row's level in the dict
    df['avg_num'] = df['level'].map(ret)
    print(ret)
    print(df)
```

```
{'L1': 20.0, 'L2': 17.5, 'L3': 11.0}
  level  num  avg_num
0    L1   10     20.0
1    L1   20     20.0
2    L1   30     20.0
3    L2   20     17.5
4    L2   15     17.5
5    L3   10     11.0
6    L3   12     11.0
```

Here, after grouping by level, we want to add a column to the dataset that gives each row the mean of its level.

The approach above first computes the mean of each group, converts it to a dict, and then uses map to attach each group's mean to the rows.

```
def trans():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    # transform broadcasts each group's mean back to the group's rows
    df['avg_num'] = df.groupby('level')['num'].transform('mean')
    print(df)
```

With transform, the code becomes simpler and more direct, as shown above.

What transform does: it calls the function on each group and returns an object with the same index as the original, filled with the transformed values. Here a single column is selected, so the result is a Series aligned with df, which effectively adds a new column to the original DataFrame.
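
Because the result of transform is aligned with the original rows, it can be combined with other columns directly. The sketch below, with a column name of my own choosing, computes each row's share of its level's total:

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

# transform("sum") returns one value per original row: the total of that row's level
level_total = df.groupby("level")["num"].transform("sum")
df["share_in_level"] = df["num"] / level_total
print(df)
```

Within every level the shares add up to 1, which is a quick sanity check for this kind of per-group normalization.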
|
||||
|
|
@ -0,0 +1,328 @@
|
|||
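## 0. Preface

The basic data structures in pandas are Series and DataFrame. Operating on every element, or on every row or column, is a common need in data processing. pandas has the built-in map, applymap, and apply methods for this. Below we work through practical examples, from basic to more advanced usage.

## 1. The map method

map is a basic operation in data processing, and its importance needs no emphasis. It generally operates on elements one at a time. Let's look at a few examples.

One point first: in the pandas version used here, map works only on a Series, not on a DataFrame. In other words, DataFrame has no map method. (Newer pandas releases add DataFrame.map as the replacement for applymap.)

Part of the source of Series.map is shown below:
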
```
def map(self, arg, na_action=None):
    """
    Map values of Series according to input correspondence.

    Used for substituting each value in a Series with another value,
    that may be derived from a function, a ``dict`` or
    a :class:`Series`.

    Parameters
    ----------
    arg : function, collections.abc.Mapping subclass or Series
        Mapping correspondence.
    na_action : {None, 'ignore'}, default None
        If 'ignore', propagate NaN values, without passing them to the
        mapping correspondence.

    Returns
    -------
    Series
        Same index as caller.

    See Also
    --------
    Series.apply : For applying more complex functions on a Series.
    DataFrame.apply : Apply a function row-/column-wise.
    DataFrame.applymap : Apply a function elementwise on a whole DataFrame.

    Notes
    -----
    When ``arg`` is a dictionary, values in Series that are not in the
    dictionary (as keys) are converted to ``NaN``. However, if the
    dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
    provides a method for default values), then this default is used
    rather than ``NaN``.
```

The main parameter of map is arg, which can be a function, a dict (or other mapping), or a Series; it is applied to each element.

An example:

```
import numpy as np
import pandas as pd

def test():
    genders = ["male", "male", "female", "unknown", "female"]
    levels = ["L1", "L2", "L1", "L1", "L2"]
    df = pd.DataFrame({"gender": genders, "level": levels})

    # Mapping from the English values to their Chinese equivalents
    gender_dic = {"male": "男", "female": "女", "unknown": "未知"}
    print(df)
    print("\n\n")
    df["gender"] = df["gender"].map(gender_dic)
    print(df)
```

The output is:

```
    gender level
0     male    L1
1     male    L2
2   female    L1
3  unknown    L1
4   female    L2



  gender level
0      男    L1
1      男    L2
2      女    L1
3     未知    L1
4      女    L2
```

The code above maps the values in the gender column: male to 男, female to 女, and unknown to 未知.

```
def test():
    x = [i for i in range(1, 11)]
    y = [2*i + 0.5 for i in x]
    df = pd.DataFrame({'x': x, 'y': y})
    x2 = df['x']
    # Two equivalent ways to format each value with two decimals
    print(x2.map(lambda i: "%.2f" % i))
    print(x2.map(lambda i: "{:.2f}".format(i)))
```

```
0     1.00
1     2.00
2     3.00
3     4.00
4     5.00
5     6.00
6     7.00
7     8.00
8     9.00
9    10.00
Name: x, dtype: object
0     1.00
1     2.00
2     3.00
3     4.00
4     5.00
5     6.00
6     7.00
7     8.00
8     9.00
9    10.00
Name: x, dtype: object
```

The code above formats each value of x as a string with two decimal places (note that the resulting dtype is object, not float).

Whether the mapping is a dict or a function, map passes the values in one by one and collects the mapped results.
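
One detail from the docstring quoted earlier is worth seeing in action: with a dict, values that have no matching key become NaN. A small sketch with made-up data; the fillna step is just one possible way to handle the gaps:

```python
import pandas as pd

genders = pd.Series(["male", "female", "other"])
gender_dic = {"male": "M", "female": "F"}  # "other" deliberately has no entry

mapped = genders.map(gender_dic)
print(mapped)

# One way to handle the gaps: replace the missing results with a default value
filled = mapped.fillna("unknown")
print(filled.tolist())
```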

## 2. The applymap method

As mentioned above, DataFrame has no map method here. To get map-like behavior on every element of a DataFrame, use applymap.

```
def t8():
    x = [i for i in range(1, 11)]
    y = [2*i + 0.5 for i in x]
    df = pd.DataFrame({'x': x, 'y': y})
    print(df)
    print()
    # Format every element of every column
    print(df.applymap(lambda i: "%.2f" % i))
```

```
    x     y
0   1   2.5
1   2   4.5
2   3   6.5
3   4   8.5
4   5  10.5
5   6  12.5
6   7  14.5
7   8  16.5
8   9  18.5
9  10  20.5

       x      y
0   1.00   2.50
1   2.00   4.50
2   3.00   6.50
3   4.00   8.50
4   5.00  10.50
5   6.00  12.50
6   7.00  14.50
7   8.00  16.50
8   9.00  18.50
9  10.00  20.50
```

The earlier example applied map to the x column to format its values with two decimal places. To format both x and y in the DataFrame the same way at once, use applymap.
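
A version note, hedged: newer pandas releases provide DataFrame.map for this element-wise operation and deprecate applymap. The sketch below is one way to write code that runs on either side of that change; the helper name elementwise is my own:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [2.5, 4.5]})

# Use DataFrame.map where it exists (newer pandas), otherwise fall back to applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(lambda v: "%.2f" % v)
print(formatted)
```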

## 3. The apply method

apply is similar to map; the main difference is that apply can take more complex functions.

```
def apply(self, func, convert_dtype=True, args=(), **kwds):
    """
    Invoke function on values of Series.

    Can be ufunc (a NumPy function that applies to the entire Series)
    or a Python function that only works on single values.

    Parameters
    ----------
    func : function
        Python function or NumPy ufunc to apply.
    convert_dtype : bool, default True
        Try to find better dtype for elementwise function results. If
        False, leave as dtype=object.
    args : tuple
        Positional arguments passed to func after the series value.
    **kwds
        Additional keyword arguments passed to func.

    Returns
    -------
    Series or DataFrame
        If func returns a Series object the result will be a DataFrame.

    See Also
    --------
    Series.map: For element-wise operations.
    Series.agg: Only perform aggregating type operations.
    Series.transform: Only perform transforming type operations.
```

Looking at the source of apply, the method signature is:

```
def apply(self, func, convert_dtype=True, args=(), **kwds):
```

Compared with map, apply accepts not only func but also extra positional arguments as a tuple (args) and keyword arguments, so it can call more complex functions.

Here are a few examples:

```
def square(x):
    return x**2

def test():
    s = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])
    s1 = s.apply(lambda x: x**2)   # anonymous function
    s2 = s.apply(square)           # named function
    s3 = s.apply(np.log)           # NumPy ufunc

    print(s1)
    print()
    print(s2)
    print()
    print(s3)
```

The output is:

```
London      400
New York    441
Helsinki    144
dtype: int64

London      400
New York    441
Helsinki    144
dtype: int64

London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
```

This usage is simple and works the same way as map.

Now a slightly more complex example:

```
def BMI(series):
    # series is one row of the DataFrame
    weight = series['weight']
    height = series['height'] / 100   # scale height by 1/100 before squaring
    BMI_Rate = weight / height**2
    return BMI_Rate

def test():
    heights = [180, 175, 169, 158, 185]
    weights = [75, 72, 68, 60, 76]
    age = [30, 18, 26, 42, 34]
    df = pd.DataFrame({"height": heights, "weight": weights, "age": age})
    print(df)
    print()
    # axis=1: call BMI once per row
    df['BMI'] = df.apply(BMI, axis=1)
    print(df)
```

The output is:

```
   height  weight  age
0     180      75   30
1     175      72   18
2     169      68   26
3     158      60   42
4     185      76   34

   height  weight  age        BMI
0     180      75   30  23.148148
1     175      72   18  23.510204
2     169      68   26  23.808690
3     158      60   42  24.034610
4     185      76   34  22.205990
```

The data contains height and weight, and the BMI is computed as weight divided by the square of height (the code first divides height by 100).

The apply call specifies axis=1, which means the function is applied to each row. If that is hard to remember, think of it this way: axis=1 collapses the column dimension and keeps the row dimension, so the operation runs over each row's data. At run time, apply simply calls BMI on every row.
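
For a formula like this one, the same result can also be computed column-wise without apply. The sketch below compares the two on the same data; it is offered as an alternative to be aware of, not as a replacement for the row-wise example:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [180, 175, 169, 158, 185],
    "weight": [75, 72, 68, 60, 76],
})

# Row-wise version, equivalent to the BMI function used with axis=1
bmi_apply = df.apply(lambda row: row["weight"] / (row["height"] / 100) ** 2, axis=1)
# Column-wise version: whole columns are combined in a single expression
bmi_vectorized = df["weight"] / (df["height"] / 100) ** 2

# Largest absolute difference between the two results
max_diff = (bmi_apply - bmi_vectorized).abs().max()
print(max_diff)
```

apply with axis=1 remains the more general tool when the per-row logic cannot be expressed as column arithmetic.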

```
def subtract_custom_value(x, custom_value):
    return x - custom_value

def test():
    s = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])
    print(s)
    print()
    # args supplies the extra positional argument custom_value=5
    s1 = s.apply(subtract_custom_value, args=(5,))
    print(s1)
```

The output is:

```
London      20
New York    21
Helsinki    12
dtype: int64

London      15
New York    16
Helsinki     7
dtype: int64
```

This code subtracts 5 from every value. Because the extra argument 5 has to be passed to the function, map cannot do this directly: it only accepts the mapping itself.
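
Two related variations, sketched on the same Series: apply also forwards keyword arguments, and map can still be used if the extra argument is fixed inside a lambda. The variable names are my own:

```python
import pandas as pd

def subtract_custom_value(x, custom_value):
    return x - custom_value

s = pd.Series([20, 21, 12], index=["London", "New York", "Helsinki"])

# Extra argument passed by keyword instead of through args
by_keyword = s.apply(subtract_custom_value, custom_value=5)
# map only receives the element, so the extra argument is fixed inside a lambda
by_map = s.map(lambda x: subtract_custom_value(x, 5))

print(by_keyword.tolist(), by_map.tolist())
```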
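
## 4. Summary

1. map is a basic Series operation; DataFrame has no map method in the pandas version used here.

2. To apply a map-like operation to every element of a DataFrame, use applymap.

3. apply is more flexible: it works on both Series and DataFrame, and extra arguments can be passed in as a tuple.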
|
|
@ -0,0 +1,308 @@
|
|||
## 0. Introduction

On the last working day before the holiday, I got stuck for quite a while writing a simple regular expression. Looking back, my understanding of regular expressions was not deep enough, so I used the holiday to go through the topic in some detail and write it down.

## 1. findFirstIn findFirstMatchIn

Commonly used regex methods include findFirstIn, findFirstMatchIn, and similar. Let's start with an example that shows the difference between the two.

```
@Test
def test() = {
  val s = "你好,今天是2021年1月2日18点30分"
  val pattern = """今天是\d+年\d+月\d+日""".r
  // findFirstIn returns the matched text as Option[String]
  val result1 = pattern.findFirstIn(s)
  println(result1)
  // findFirstMatchIn returns Option[Match]; group 0 is the whole match
  val result2 = pattern.findFirstMatchIn(s) match {
    case Some(data) => {
      println("data type is: " + data.getClass.getSimpleName)
      data group 0
    }
    case _ => "empty"
  }
  println(result2)
}
```

Output:

```
Some(今天是2021年1月2日)
data type is: Match
今天是2021年1月2日
```

A quick look at the source:

```
/** Return an optional first matching string of this `Regex` in the given character sequence,
 *  or None if there is no match.
 *
 *  @param source The text to match against.
 *  @return       An [[scala.Option]] of the first matching string in the text.
 *  @example      {{{"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"}}}
 */
def findFirstIn(source: CharSequence): Option[String] = {
  val m = pattern.matcher(source)
  if (m.find) Some(m.group) else None
}
```

findFirstIn is a method of scala.util.matching.Regex. Its input is source, of type CharSequence (an interface), whose most common implementation is String.

The return type is Option[String]. In our example the match succeeds, so the returned value is a Some[String].

```
/** Return an optional first match of this `Regex` in the given character sequence,
 *  or None if it does not exist.
 *
 *  If the match is successful, the [[scala.util.matching.Regex.Match]] can be queried for
 *  more data.
 *
 *  @param source The text to match against.
 *  @return       A [[scala.Option]] of [[scala.util.matching.Regex.Match]] of the first matching string in the text.
 *  @example      {{{("""[a-z]""".r findFirstMatchIn "A simple example.") map (_.start) // returns Some(2), the index of the first match in the text}}}
 */
def findFirstMatchIn(source: CharSequence): Option[Match] = {
  val m = pattern.matcher(source)
  if (m.find) Some(new Match(source, m, groupNames)) else None
}
```

The source of findFirstMatchIn differs little from that of findFirstIn; the main difference is that the return type is Option[Match].

## 2. Match MatchData

Here is the source of Match:

```
/** Provides information about a successful match. */
class Match(val source: CharSequence,
            private[matching] val matcher: Matcher,
            val groupNames: Seq[String]) extends MatchData {

  /** The index of the first matched character. */
  val start = matcher.start

  /** The index following the last matched character. */
  val end = matcher.end

  /** The number of subgroups. */
  def groupCount = matcher.groupCount

  private lazy val starts: Array[Int] =
    ((0 to groupCount) map matcher.start).toArray
  private lazy val ends: Array[Int] =
    ((0 to groupCount) map matcher.end).toArray

  /** The index of the first matched character in group `i`. */
  def start(i: Int) = starts(i)

  /** The index following the last matched character in group `i`. */
  def end(i: Int) = ends(i)

  /** The match itself with matcher-dependent lazy vals forced,
   *  so that match is valid even once matcher is advanced.
   */
  def force: this.type = { starts; ends; this }
}
```

The first comment line is the key: it states the main purpose of the Match class, "Provides information about a successful match". When a match succeeds, this class gives us information about it, such as the start and end positions of the match.

Match extends MatchData, so let's look at the source of MatchData as well:

```
trait MatchData {

  /** The source from which the match originated */
  val source: CharSequence

  /** The names of the groups, or an empty sequence if none defined */
  val groupNames: Seq[String]

  /** The number of capturing groups in the pattern.
   *  (For a given successful match, some of those groups may not have matched any input.)
   */
  def groupCount: Int

  /** The index of the first matched character, or -1 if nothing was matched */
  def start: Int

  /** The index of the first matched character in group `i`,
   *  or -1 if nothing was matched for that group.
   */
  def start(i: Int): Int
  ...

  /** The matched string in group `i`,
   *  or `null` if nothing was matched.
   */
  def group(i: Int): String =
    if (start(i) >= 0) source.subSequence(start(i), end(i)).toString
    else null
  ...

  /** Returns the group with given name.
   *
   *  @param id The group name
   *  @return   The requested group
   *  @throws   NoSuchElementException if the requested group name is not defined
   */
  def group(id: String): String = nameToIndex.get(id) match {
    case None        => throw new NoSuchElementException("group name "+id+" not defined")
    case Some(index) => group(index)
  }
```

The most used and most important method in MatchData is probably group, whose main purpose is to extract capture groups.

## 3. Extracting groups

```
@Test
def test() = {
  val s = "你好,今天是2021年1月2日18点30分"
  val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
  val result = pattern.findFirstMatchIn(s)
  val year = result match {
    case Some(data) => data group 1   // first capture group: the year
    case _ => "-1"
  }
  println(year) // prints 2021
}
```

The example above is a typical case of group extraction: we call the group method on the Match returned by findFirstMatchIn to extract the first group of the match, which gives us the year.

## 4. Another way to extract groups

In practice there is another common way to extract groups.

```
@Test
def test() = {
  val s = "你好,今天是2021年1月2日18点30分"
  val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r
  // Pattern-match the string against the regex to bind the groups
  val pattern(year, month, day) = s
  println(s"year is $year.\n" +
    f"month is $month.\n" + raw"day is $day")
}
```

The code above looks perfectly normal, but it actually throws an error. This is where I was stuck for a long time.

```
scala.MatchError: 你好,今天是2021年1月2日18点30分 (of class java.lang.String)

	at com.xiaomi.mifi.pdata.common.T4.t8(T4.scala:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	...
```

At the time I was baffled and had no idea where the problem was. Only after reading the source carefully did I understand. If we click through in the IDE on the line `val pattern(year, month, day) = s` to view the source, we find that it actually calls the unapplySeq method.

```
def unapplySeq(s: CharSequence): Option[List[String]] = s match {
  case null => None
  case _    =>
    val m = pattern matcher s
    if (runMatcher(m)) Some((1 to m.groupCount).toList map m.group)
    else None
}
```

There is a key comment above this method:

```
/** Tries to match a [[java.lang.CharSequence]].
 *
 *  If the match succeeds, the result is a list of the matching
 *  groups (or a `null` element if a group did not match any input).
 *  If the pattern specifies no groups, then the result will be an empty list
 *  on a successful match.
 *
 *  This method attempts to match the entire input by default; to find the next
 *  matching subsequence, use an unanchored `Regex`.
```

By default this method matches against the entire input; to match a substring, use an unanchored Regex.

With a small change to the code above:

```
@Test
def test() = {
  val s = "你好,今天是2021年1月2日18点30分"
  // .unanchored lets the pattern match a substring of the input
  val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r.unanchored
  val pattern(year, month, day) = s
  println(s"year is $year.\n" +
    f"month is $month.\n" + raw"day is $day")
}
```

we get the expected result:

```
year is 2021.
month is 1.
day is 2
```

## 5. findAllIn findAllMatchIn

findAllIn corresponds to findFirstIn, and findAllMatchIn corresponds to findFirstMatchIn; they return all matches.

First an example:

```
@Test
def t9() = {
  val dateRegex = """(\d{4})-(\d{2})-(\d{2})""".r
  val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
  val result = dateRegex.findAllIn(dates)
  // Each element is the full text of one match
  val array = for (each <- result) yield each
  println(array)
  println(array.mkString("\t"))
}
```

```
non-empty iterator
2004-01-20	2005-02-28	1998-01-15	2009-10-25
```

The definition of findAllIn is:

```
/** Return all non-overlapping matches of this `Regex` in the given character
 *  sequence as a [[scala.util.matching.Regex.MatchIterator]],
 *  which is a special [[scala.collection.Iterator]] that returns the
 *  matched strings but can also be queried for more data about the last match,
 *  such as capturing groups and start position.
 ....

def findAllIn(source: CharSequence) = new Regex.MatchIterator(source, this, groupNames)
```

It returns a MatchIterator. According to the comment, MatchIterator is a special scala.collection.Iterator, which is why println(array) prints only non-empty iterator.

To get every matched year, we can use findAllMatchIn. It yields all the Match objects, and we then extract the year from each one via its capture group.

```
@Test
def t10() = {
  val dateRegex = """(\d{4})-(\d{2})-(\d{2})""".r
  val dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"
  val result = dateRegex.findAllMatchIn(dates)
  // group(1) is the year of each match
  val array = for (each <- result) yield each.group(1)
  println(array)
  println(array.mkString("\t"))
}
```

The final output is:

```
non-empty iterator
2004	2005	1998	2009
```
|
||||
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,27 @@
|
|||
A common task: count the number of lines in a text file after removing duplicates.

You can use the following command:

```
sort xxxfile | uniq | wc -l
```

Or this one:

```
sort -u xxxfile | wc -l
```

A brief explanation:

The -u option of sort is documented as follows:

```
-u, --unique
        Unique keys. Suppress all lines that have a key that is equal to an already processed one. This option, similarly to -s, implies a stable sort. If used with -c or -C,
        sort also checks that there are no lines with duplicate keys.
```

So the -u option of sort has deduplication built in.

uniq, on the other hand, detects duplicate lines only when they are adjacent. So to count distinct lines, sort the file first and then run uniq; only then are all duplicates removed.
|
|
@ -0,0 +1,93 @@
|
|||
## 0. Preface

I had never managed to get xgboost installed on my macOS machine. Recently I needed it for work, so I set out to install it.

I expected this to be simple, but it turned out to involve some detours, so I am writing them down.

## 1. Direct installation fails

At first I simply ran

```
pip install xgboost
```

to install it. The installation itself went fine, but a problem showed up when I tried to use it.

```
import xgboost as xgb
```

Importing xgboost failed immediately:

```
xgboost.core.XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded.
Likely causes:
  * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libgomp.so for UNIX-like OSes)
  * You are running 32-bit Python on a 64-bit OS
....
```

I looked into it; the cause is roughly this:

XGBoost supports multithreading, that is, training with multiple CPU threads.

However, the default Apple clang compiler does not support OpenMP, so multithreading is disabled when the default compiler is used.

## 2. Solution 1

I searched for solutions online; most of them follow this routine:

Upgrade Homebrew, install a newer gcc through Homebrew, git clone the xgboost source, build it, and then install.

In my case every one of these steps, whether upgrading Homebrew, installing gcc, or cloning the source, was painfully hard.

So this approach is feasible but brutally difficult, and I gave up on it.

## 3. Solution 2

While searching, I found someone who offered a one-line fix:

```
conda install py-xgboost
```

Several posts reported that this method is simple and effective, so I gave it a try.

But then conda let me down.

```
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
......
```

## 4. Getting conda working again

The conda problem was fairly clearly a problem with the package source (channel). Another sigh...

I spent a long time trying many mirrors, and none of them worked.

Finally I read the anaconda page of the Tsinghua open source mirror site carefully and, just to see what would happen, pasted the configuration from that page into my local .condarc file:

```
channels:
  - defaults
show_channel_urls: true
channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
```

The anaconda page of the Tsinghua open source mirror site:

[Tsinghua anaconda mirror](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/)

Seeing this made me reflect a little: China's IT industry is booming, yet something this important and fundamental is maintained voluntarily by university students out of personal interest...

## 5. Done

With the conda configuration updated, I ran the install command again:

```
conda install py-xgboost
```

This time it worked, and xgb code now runs normally on my machine.

When I have time, I will look into what is special about py-xgboost.
|