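SMOTE: a simple illustrated explanation, an implementation of the algorithm, and quick usage with R and Python packages
I. How SMOTE works
SMOTE stands for Synthetic Minority Over-sampling Technique. Rather than directly resampling (duplicating) the minority class, it uses an algorithm to synthesize new minority samples.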
SMOTE step 1: pick a positive (minority) sample
(circled in red in the figure)
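SMOTE step 2: find the K nearest neighbors of that positive sample (say K = 3)
In standard SMOTE the neighbors are searched among the positive (minority) samples, as the implementation in section III does; only variants such as borderline2 also allow negative neighbors.
(circled in green in the figure)
SMOTE step 3: pick one of the K neighbors at random
SMOTE step 4: pick a random point on the line segment between the positive sample and the randomly chosen neighbor. This point is the new synthetic positive sample.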
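The four steps above can be sketched for a single synthetic point with NumPy alone. This is only a minimal illustration, assuming Euclidean distance and neighbors taken from the minority samples; the helper name `smote_one_point` and the toy data are made up for this example and are not part of any library.

```python
import numpy as np

def smote_one_point(minority, i, k=3, rng=None):
    """Create one synthetic sample from minority[i] (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[i]                               # step 1: the chosen positive sample
    dist = np.linalg.norm(minority - base, axis=1)   # Euclidean distance to every minority sample
    dist[i] = np.inf                                 # never pick the sample itself as a neighbor
    neighbours = np.argsort(dist)[:k]                # step 2: indices of the k nearest neighbors
    chosen = minority[rng.choice(neighbours)]        # step 3: one neighbor picked at random
    gap = rng.random()                               # a single scalar in [0, 1)
    return base + gap * (chosen - base)              # step 4: a point on the connecting segment

# Toy minority class: four points close together and one far away
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [8.0, 8.0]])
new_point = smote_one_point(minority, i=0, k=3, rng=np.random.default_rng(0))
print(new_point)
```

Because one scalar `gap` scales the whole difference vector, the new point always lies on the segment between the two samples, so here it stays inside the square spanned by the four nearby points.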
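II. Using existing packages
2.1 SMOTE with an R package
Main parameters:
perc.over = a: how many positive samples to generate. Final number of positive samples: (1 + a / 100) * N, where N is the current number of positive samples
perc.under = b: how many negative samples to draw. Final number of negative samples: (a / 100 * b / 100) * N
k = x: each new positive sample is generated from one of the x nearest samples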
library(DMwR)
# pos = (1 + perc.over/100) * N  (N = original number of positive samples)
# neg = (perc.over/100 * perc.under/100) * N
# SMOTE oversampling
newdata <- SMOTE(tp ~ ., data_in,
                 perc.over = 300, k = 5, perc.under = 200)
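To make the two formulas concrete, here is a small arithmetic sketch in Python. It only evaluates the formulas quoted in the comments above; the helper name `dmwr_smote_counts` is hypothetical, and the exact rounding used inside DMwR is not reproduced.

```python
def dmwr_smote_counts(n_pos, perc_over, perc_under):
    """Expected class sizes after DMwR-style SMOTE (hypothetical helper)."""
    n_new = int(perc_over / 100 * n_pos)           # synthetic positives that get generated
    n_pos_final = n_pos + n_new                    # (1 + perc.over/100) * N
    n_neg_final = int(perc_under / 100 * n_new)    # (perc.over/100 * perc.under/100) * N
    return n_pos_final, n_neg_final

# The call above with perc.over = 300 and perc.under = 200, assuming 10 positive samples
counts = dmwr_smote_counts(n_pos=10, perc_over=300, perc_under=200)
print(counts)  # (40, 60)
```

With 10 original positives, 30 synthetic positives are added (40 in total) and 60 negatives are drawn, regardless of how many negatives the original data contained.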
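2.2 SMOTE with a Python package
imblearn.over_sampling.SMOTE(
    sampling_strategy = 'auto',
    random_state = None,      ## random seed
    k_neighbors = 5,          ## each new positive sample is generated from one of the 5 nearest samples
    m_neighbors = 10,         ## only used when kind is "borderline1", "borderline2" or "svm"
    out_step = 0.5,           ## only used when kind = "svm"
    kind = "regular",         ## "regular": minority samples are picked at random
                              ## "borderline1": the random neighbor b belongs to the same (minority) class as sample a
                              ## "borderline2": the random neighbor b may belong to any class
                              ## "svm": fit an SVM classifier to obtain support vectors, then generate new minority samples from them
    svm_estimator = SVC(),    ## SVM classifier to use
    n_jobs = 1,               ## number of parallel jobs; -1 uses all CPUs
    ratio = None              ## older name for sampling_strategy
)
This signature matches older imbalanced-learn releases (around 0.4). In later releases the kind, m_neighbors, out_step and svm_estimator options were moved to the separate BorderlineSMOTE and SVMSMOTE classes, and ratio was replaced by sampling_strategy.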
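The borderline variants use `m_neighbors` to decide which minority samples sit near the class boundary. The sketch below assumes the rule from the Borderline-SMOTE paper (a minority sample is "noise" if all of its m nearest neighbors are majority samples, "danger" if at least half are, and "safe" otherwise); it is not how imbalanced-learn implements it internally, and the helper name `borderline_groups` and the toy data are invented for illustration.

```python
import numpy as np

def borderline_groups(X, y, minority_label, m=10):
    """Label each minority sample 'noise', 'danger' or 'safe' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                             # ignore the sample itself
        nearest = np.argsort(dist)[:m]               # m nearest neighbors of any class
        n_major = int(np.sum(y[nearest] != minority_label))
        if n_major == m:
            groups[int(i)] = "noise"                 # surrounded by the majority class
        elif n_major >= m / 2:
            groups[int(i)] = "danger"                # close to the class boundary
        else:
            groups[int(i)] = "safe"                  # deep inside the minority class
    return groups

# One-dimensional toy data: label 1 is the minority class
X = [[-5.0], [0.0], [1.0], [2.0], [3.0], [3.5], [4.0], [10.0], [10.5], [11.0], [11.5]]
y = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = borderline_groups(X, y, minority_label=1, m=3)
print(groups)
```

Under this assumption only the "danger" samples (here the points at 3.5 and 4.0) would be used as starting points for interpolation, which concentrates the synthetic samples along the boundary.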
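from imblearn.over_sampling import SMOTE
# Older releases also accepted n_jobs = -1 here
sm = SMOTE(random_state = 42)
# fit_sample was renamed to fit_resample in later imbalanced-learn releases
x, y = sm.fit_resample(x_val, y_val)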
III. Implementing the algorithm
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# author: Scc_hy
# 2018-11-17
# SMOTE
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


class TWO_SMOTE():
    """
    SMOTE oversampling by interpolation for imbalanced binary classification.
    """
    def __init__(self,
                 K_neighbors = 5,
                 N_need = 200,
                 random_state = 42):
        self.K_neighbors = K_neighbors
        self.N_need = N_need
        self.random_state = random_state  # was hard-coded to 42

    def get_param_describe(self):
        print(
            "Algorithm parameters:\n" +
            "K_neighbors: number of nearest minority neighbors to choose from\n" +
            "N_need: positive samples to add (N_need // 100 * a, a = number of minority samples)\n" +
            "random_state: random seed\n" +
            "\nover_sample parameters:\n" +
            "x_data: the full data set to oversample (numeric DataFrame)\n" +
            "y_label: class labels (numeric array or Series)\n"
        )

    def div_data(self, x_data, y_label):
        """
        Split the data by class into a minority part and a majority part.
        """
        y_label = np.asarray(y_label)
        tp = set(y_label)
        tp_less = [a for a in tp if sum(y_label == a) < sum(y_label != a)][0]
        data_less = x_data.loc[y_label == tp_less, :]
        data_more = x_data.loc[y_label != tp_less, :]
        tp.remove(tp_less)
        return data_less, data_more, tp_less, list(tp)[0]

    def get_SMOTE_sample(self, x_data, y_label):
        """
        Return the minority samples and the seed samples to interpolate from.
        """
        data_less, data_more, tp_less, tp_more = self.div_data(x_data, y_label)
        n_integ = self.N_need // 100
        if n_integ == 0:
            raise ValueError("N_need must be at least 100; please re-enter N_need")
        data_less = data_less.reset_index(drop = True)
        # Repeat every minority sample n_integ times: one synthetic point per copy
        data_out = pd.concat([data_less] * n_integ, ignore_index = True)
        return data_less, data_out, tp_less

    def over_sample(self, x_data, y_label):
        """
        Simple implementation of the SMOTE algorithm.
        """
        data_less, sample, tp_less = self.get_SMOTE_sample(x_data, y_label)
        # Fit on the original minority samples only; fitting on the repeated
        # copies would return duplicates of the query point as neighbors
        knn = NearestNeighbors(n_neighbors = self.K_neighbors + 1, n_jobs = -1).fit(data_less.values)
        rng = np.random.RandomState(self.random_state)  # seed once, not in every iteration
        new_rows = []
        for i in range(len(sample)):  # 1. pick a positive sample
            point = sample.iloc[i, :].values
            # 2. find its K nearest minority neighbors
            #    (K + 1 are requested because the nearest hit is the point itself)
            k_sample_index = knn.kneighbors(point.reshape(1, -1), return_distance = False)
            # 3. pick one of the K neighbors at random
            choice_all = k_sample_index.flatten()[1:]
            choosed = rng.choice(choice_all)
            # 4. pick a point between the positive sample and that neighbor;
            #    a single scalar gap keeps it on the connecting line segment
            diff = data_less.iloc[choosed, :].values - point
            gap = rng.rand()
            new_rows.append(point + gap * diff)
        new = pd.DataFrame(new_rows, columns = x_data.columns)
        label_out = np.r_[np.asarray(y_label), np.repeat(tp_less, len(new))]
        new_sample = pd.concat([x_data, new])
        new_sample.reset_index(inplace = True, drop = True)
        return new_sample, label_out


if __name__ == "__main__":
    iris = load_iris()
    irisdf = pd.DataFrame(data = iris.data, columns = iris.feature_names)
    y_label = iris.target
    # Build an imbalanced binary data set: all of class 1, first 10 rows of class 2
    iris_1 = irisdf.loc[y_label == 1, :]
    iris_2 = irisdf.loc[y_label == 2, :]
    iris_2imb = pd.concat([iris_1, iris_2.iloc[:10, :]])
    label_2imb = np.r_[y_label[y_label == 1], y_label[y_label == 2][:10]]
    iris_2imb.reset_index(inplace = True, drop = True)
    smt = TWO_SMOTE()
    x_new, y_new = smt.over_sample(iris_2imb, label_2imb)
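That is a simple implementation of SMOTE. It does not yet handle variables that only take the values 0 and 1; this will be added in a later update.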
![](https://pic.pimg.tw/zzuyanan/1488615166-1259157397.png)
![](https://pic.pimg.tw/zzuyanan/1482887990-2595557020.jpg)