SMOTE: A Simple Illustrated Explanation, an Algorithm Implementation, and Quick Package Usage in R and Python

Knowledge 11-30

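I. How SMOTE works

SMOTE stands for Synthetic Minority Over-sampling Technique. Rather than resampling the minority class directly, it uses an algorithm to synthesize new minority samples.

SMOTE step 1. Pick a positive sample

In the figure, this sample is marked by the red circle.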


SMOTE step 2. Find the K nearest neighbours of that positive sample (assume K = 3)


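In this illustration the neighbours can be positive or negative samples. Standard SMOTE searches only among minority-class samples, as the implementation in Section III does.

In the figure, the neighbours are marked by the green circles.

SMOTE step 3. Randomly pick one sample from the K nearest neighbours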


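In this illustration the chosen neighbour can be a positive or a negative sample.

SMOTE step 4. Pick a random point on the line segment between the positive sample and the randomly chosen neighbour. This point is the new synthetic positive sample.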
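The four steps above can be sketched for a single synthetic point in plain NumPy. This is a minimal sketch, not a reference implementation: the function name `smote_one_sample` and the toy array `X_min` are made up for illustration, and it assumes that neighbours are searched only among the minority (positive) samples with Euclidean distance.

```python
import numpy as np

def smote_one_sample(X_min, i, k=3, rng=None):
    """Create one synthetic point from minority sample i (steps 1 to 4)."""
    rng = np.random.default_rng() if rng is None else rng
    x = X_min[i]                                  # step 1: pick a positive sample
    dist = np.linalg.norm(X_min - x, axis=1)      # distance to every minority sample
    dist[i] = np.inf                              # never pick the sample itself
    neighbours = np.argsort(dist)[:k]             # step 2: indices of the k nearest neighbours
    j = rng.choice(neighbours)                    # step 3: pick one neighbour at random
    gap = rng.random()                            # step 4: random position in [0, 1)
    return x + gap * (X_min[j] - x)               # point on the segment from x to X_min[j]

# toy minority class (hypothetical data): four nearby points and one far away
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0], [8.0, 8.0]])
new_point = smote_one_sample(X_min, i=0, k=3, rng=np.random.default_rng(0))
print(new_point)
```

With K = 3 the far-away point [8, 8] is never among the neighbours of sample 0, so the new point always falls between [1, 1] and one of its three closest minority samples.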


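II. Using existing packages

2.1 SMOTE in R with a package

Main parameters:

perc.over = a: how many positive samples to generate. The final number of positive samples is (1 + a / 100) * N, where N is the current number of positive samples.

perc.under = b: how many negative samples to draw. The final number of negative samples is (a / 100 * b / 100) * N.

k = x: each new positive sample is generated from one of the x nearest samples.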

library(DMwR)

# pos = (1 + perc.over/100) * N    (N = original number of positive samples)
# neg = (perc.over/100 * perc.under/100) * N

# SMOTE oversampling
newdata <- SMOTE(tp ~ ., data_in,
                 perc.over = 300, k = 5, perc.under = 200)
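To check the two count formulas in the comments above against concrete numbers, here is a small Python sketch. The helper name `dmwr_smote_counts` is hypothetical, and the sketch assumes perc.over is a multiple of 100; it only does the arithmetic and does not call DMwR.

```python
def dmwr_smote_counts(n_pos, perc_over, perc_under):
    """Class sizes after over-sampling, following the two formulas above."""
    n_new = n_pos * perc_over // 100          # synthetic positives created
    n_pos_out = n_pos + n_new                 # (1 + perc.over/100) * N
    n_neg_out = n_new * perc_under // 100     # (perc.over/100 * perc.under/100) * N
    return n_pos_out, n_neg_out

# 10 original positives with perc.over = 300 and perc.under = 200
print(dmwr_smote_counts(10, 300, 200))        # (40, 60)
```

So with the settings used in the R call, 10 positive samples become 40, and 60 negative samples are kept.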

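2.2 SMOTE in Python with a package

The signature below is from older imbalanced-learn releases. Later releases moved the kind variants into separate classes (BorderlineSMOTE, SVMSMOTE) and replaced ratio with sampling_strategy.

imblearn.over_sampling.SMOTE(
    sampling_strategy = 'auto',
    random_state = None,      ## random seed
    k_neighbors = 5,          ## generate each positive sample from one of the 5 nearest samples
    m_neighbors = 10,         ## used when kind is one of {"borderline1", "borderline2", "svm"}
    out_step = 0.5,           ## used when kind = "svm"
    kind = "regular",         ## pick minority samples at random
                              ##   borderline1: the random neighbour b belongs to the same class as minority sample a
                              ##   borderline2: the random neighbour b can belong to any class
                              ##   svm: fit an SVM classifier, then generate new minority samples from its support vectors
    svm_estimator = SVC(),    ## which SVM classifier to use
    n_jobs = 1,               ## number of parallel jobs; -1 uses all CPUs
    ratio = None
)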

from imblearn.over_sampling import SMOTE

# older releases also accept n_jobs = -1 here
sm = SMOTE(random_state = 42)

# fit_resample is the current name of fit_sample
x, y = sm.fit_resample(x_val, y_val)
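With the default sampling_strategy = 'auto', an over-sampler resamples every class except the majority class until it matches the majority count. The bookkeeping can be sketched in plain Python; the helper `auto_oversampling_targets` and the label list are made up for illustration and are not part of imbalanced-learn.

```python
from collections import Counter

def auto_oversampling_targets(y):
    """Synthetic samples needed per class so that every class matches the largest one."""
    counts = Counter(y)
    n_majority = max(counts.values())
    return {label: n_majority - n for label, n in counts.items() if n < n_majority}

labels = [0] * 50 + [1] * 10                  # 50 negatives, 10 positives
print(auto_oversampling_targets(labels))      # {1: 40}
```

For a 50 to 10 split, this means 40 synthetic positive samples, after which both classes have 50 rows.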

III. Implementing the algorithm

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# author: Scc_hy
# 2018-11-17
# SMOTE

from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


class TWO_SMOTE():
    """
    SMOTE over-sampling by interpolation for imbalanced binary classification
    """
    def __init__(self,
                 K_neighbors = 5,
                 N_need = 200,
                 random_state = 42):
        self.K_neighbors = K_neighbors
        self.N_need = N_need
        self.random_state = random_state

    def get_param_describe(self):
        print(
            "Algorithm parameters:\n" +
            "K_neighbors: number of nearest positive samples to choose from\n" +
            "N_need: number of positive samples to add (N_need // 100 * a, a = current positives)\n" +
            "random_state: random seed\n" +
            "\nover_sample parameters:\n" +
            "x_data: the full data set to over-sample (numeric DataFrame)\n" +
            "y_label: class labels (numeric array or Series)\n"
        )

    def div_data(self, x_data, y_label):
        """
        Split the data by class
        """
        y_label = np.asarray(y_label)
        tp = set(y_label)
        # the minority class is the label with fewer rows than all other rows combined
        tp_less = [a for a in tp if sum(y_label == a) < sum(y_label != a)][0]
        data_less = x_data.loc[y_label == tp_less, :]
        data_more = x_data.loc[y_label != tp_less, :]
        tp.remove(tp_less)
        return data_less, data_more, tp_less, list(tp)[0]

    def get_SMOTE_sample(self, x_data, y_label):
        """
        Get the positive samples to interpolate from, repeated N_need // 100 times
        """
        data_less, data_more, tp_less, tp_more = self.div_data(x_data, y_label)
        n_integ = self.N_need // 100
        if n_integ == 0:
            raise ValueError("N_need must be at least 100")
        # one copy of the minority data per 100% of over-sampling requested
        data_out = pd.concat([data_less] * n_integ, ignore_index = True)
        return data_out, tp_less

    def over_sample(self, x_data, y_label):
        """
        A simple implementation of the SMOTE algorithm
        """
        sample, tp_less = self.get_SMOTE_sample(x_data, y_label)
        # the first block of `sample` is the original minority data; fit on it only,
        # so that the repeated copies are never returned as neighbours
        n_less = int(np.sum(np.asarray(y_label) == tp_less))
        data_less = sample.iloc[:n_less, :]
        knn = NearestNeighbors(n_neighbors = self.K_neighbors + 1, n_jobs = -1).fit(data_less.values)
        rng = np.random.RandomState(self.random_state)  # seed once, not on every iteration
        new_rows = []
        for i in range(len(sample)):  # 1. pick a positive sample
            base = sample.iloc[i, :].values
            # 2. find its nearest neighbours within the minority class
            k_sample_index = knn.kneighbors(base.reshape(1, -1),
                                            return_distance = False).flatten()
            # drop the sample itself and keep the K nearest others
            choice_all = k_sample_index[k_sample_index != i % n_less][:self.K_neighbors]
            # 3. pick one of the K neighbours at random
            choosed = rng.choice(choice_all)
            # 4. pick a point between the positive sample and the chosen neighbour
            diff = data_less.iloc[choosed, :].values - base
            gap = rng.rand()  # one scalar gap keeps the point on the connecting line
            new_rows.append(base + gap * diff)
        new = pd.DataFrame(new_rows, columns = x_data.columns)
        label_out = np.r_[np.asarray(y_label), [tp_less] * len(new)]
        new_sample = pd.concat([x_data, new])
        new_sample.reset_index(inplace = True, drop = True)
        return new_sample, label_out


if __name__ == "__main__":
    iris = load_iris()
    irisdf = pd.DataFrame(data = iris.data, columns = iris.feature_names)
    y_label = iris.target
    # build an imbalanced binary data set: 50 rows of class 1, 10 rows of class 2
    iris_1 = irisdf.loc[y_label == 1, :]
    iris_2 = irisdf.loc[y_label == 2, :]
    iris_2imb = pd.concat([iris_1, iris_2.iloc[:10, :]])
    label_2imb = np.r_[y_label[y_label == 1], y_label[y_label == 2][:10]]
    iris_2imb.reset_index(inplace = True, drop = True)
    smt = TWO_SMOTE()
    x_new, y_new = smt.over_sample(iris_2imb, label_2imb)
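Step 4 in Section I places the new point on the line segment between the two samples, which requires one scalar gap shared by all features. Drawing a separate gap for each feature instead puts the point somewhere inside the bounding box of the two samples, usually off the line. A small sketch of the difference, with made-up coordinates and a hypothetical helper `on_segment`:

```python
import numpy as np

def on_segment(a, b, p, tol=1e-9):
    """Return True if p lies on the straight segment from a to b (a != b)."""
    d = b - a
    t = np.dot(p - a, d) / np.dot(d, d)       # position of p projected onto the segment
    return bool(-tol <= t <= 1 + tol and np.allclose(a + t * d, p, atol=tol))

a = np.array([1.0, 1.0])                      # positive sample
b = np.array([3.0, 5.0])                      # chosen neighbour

p_scalar = a + 0.25 * (b - a)                          # one gap shared by all features
p_per_feature = a + np.array([0.25, 0.75]) * (b - a)   # a different gap for each feature

print(on_segment(a, b, p_scalar))             # True
print(on_segment(a, b, p_per_feature))        # False
```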


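That is a simple implementation of SMOTE. It does not yet handle features that take only the values 0 and 1; a later update will address this.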


TAG: 程序員小新人學習
