[ML] 6. Naive Bayes

eunnys 2023. 11. 22. 19:44

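▶ Iris classification
- Naive Bayes delivers high accuracy and speed when the data is sparse and high-dimensional, as with text
- Applications: spam mail classification, document topic classification, computer network intrusion detection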
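To make the text use case concrete, here is a minimal sketch of a spam filter on bag-of-words counts. The four messages and their labels are made-up toy data (an assumption, not from this post), and MultinomialNB with CountVectorizer is just one common pairing for this kind of sparse input.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = normal mail
texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "project report attached for review",
]
labels = [1, 1, 0, 0]

# Each document becomes a sparse row of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new message using the same vocabulary
pred = clf.predict(vectorizer.transform(["claim your free prize"]))
print(issparse(X), X.shape[0], pred)
```

The feature matrix stays sparse from start to finish, which is a large part of why Naive Bayes is fast on text.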

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load and split the data
x, y = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)

# Build the model
model = GaussianNB()  # used when the explanatory variables are continuous
model.fit(x_train, y_train)

# Evaluate the model
y_hat = model.predict(x_test)
con_mat = confusion_matrix(y_test, y_hat)
print(con_mat)

[[15  0  0]
 [ 0 15  0]
 [ 0  0 15]]
 
# No misclassifications

▶ Poisonous mushroom classification
- The explanatory variables are categorical data

from sklearn.naive_bayes import MultinomialNB 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_curve, roc_auc_score
import numpy as np
import pandas as pd

# Load and inspect the data
df = pd.read_csv('./Dataset/mushrooms.csv')
print(df.shape) # (8124, 23)
display(df.head())

df.info() # every column is categorical data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   type                      8124 non-null   object
 1   cap_shape                 8124 non-null   object
 2   cap_surface               8124 non-null   object
 3   cap_color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill_attachment           8124 non-null   object
 7   gill_spacing              8124 non-null   object
 8   gill_size                 8124 non-null   object
 9   gill_color                8124 non-null   object
 10  stalk_shape               8124 non-null   object
 11  stalk_root                8124 non-null   object
 12  stalk_surface_above_ring  8124 non-null   object
 13  stalk_surface_below_ring  8124 non-null   object
 14  stalk_color_above_ring    8124 non-null   object
 15  stalk_color_below_ring    8124 non-null   object
 16  veil_type                 8124 non-null   object
 17  veil_color                8124 non-null   object
 18  ring_number               8124 non-null   object
 19  ring_type                 8124 non-null   object
 20  spore_print_color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB

## Data encoding and splitting ##

x = df.drop('type', axis=1)
y = df['type']

## Label encoding ##

# (1) Label encoding with sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder

x = x.apply(lambda col:LabelEncoder().fit_transform(col))

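# (2) Label encoding with pandas map (an alternative to (1); either one is enough)
def labeling(col):
    # Sort the column's unique values alphabetically and map each value to its index
    map_data = {v: i for i, v in enumerate(np.sort(col.unique()))}
    return map_data

x = x.apply(lambda col: col.map(labeling(col)))  # apply the user-defined function to every column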

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=10, stratify=y)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# (6499, 22) (1625, 22) (6499,) (1625,)
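The two label-encoding approaches above should give identical codes, because LabelEncoder also assigns integers in sorted order of the unique values. A quick check on a hypothetical two-column toy frame (the values are made up for illustration, not taken from mushrooms.csv):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with string categories
toy = pd.DataFrame({
    "cap_shape": ["flat", "bell", "convex", "bell"],
    "odor": ["none", "almond", "none", "foul"],
})

# (1) LabelEncoder, fitted separately on each column
enc_sklearn = toy.apply(lambda col: LabelEncoder().fit_transform(col))

# (2) dict built from the sorted unique values, applied with map
def labeling(col):
    return {v: i for i, v in enumerate(np.sort(col.unique()))}

enc_map = toy.apply(lambda col: col.map(labeling(col)))

print(enc_map["cap_shape"].tolist())  # [2, 0, 1, 0]
print((enc_sklearn.values == enc_map.values).all())
```

Note that these integer codes carry an arbitrary order (bell < convex < flat), which is one reason to consider one-hot encoding for nominal features.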

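## One-hot encoding ##

# Start again from the original categorical x (x = df.drop('type', axis=1)):
# by default pd.get_dummies only encodes object/category columns, so a label-encoded (integer) x would pass through unchanged
print(x.shape, y.shape) # shape of the original data
# (8124, 22) (8124,)

x = pd.get_dummies(x) # one-hot encode every column
print(x.shape) # a new column is created for each unique value of each original column
# (8124, 117)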

y = y.map({'edible':0, 'poisonous':1})

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=10, stratify=y)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# (6499, 117) (1625, 117) (6499,) (1625,)
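One detail worth checking when one-hot encoding: with default arguments, `pd.get_dummies` leaves numeric columns alone and only expands string/categorical ones. The toy frame below is hypothetical and only illustrates that behavior; passing `columns=` forces the listed columns to be expanded regardless of dtype.

```python
import pandas as pd

# Hypothetical toy data: one string column, one integer-coded column
toy = pd.DataFrame({
    "odor": ["none", "almond", "foul"],
    "ring_number": [1, 2, 1],
})

# Default: only the string column is expanded, the integer column passes through
default_dummies = pd.get_dummies(toy)
print(list(default_dummies.columns))

# Explicit column list: the integer-coded column is expanded as well
all_dummies = pd.get_dummies(toy, columns=["odor", "ring_number"])
print(list(all_dummies.columns))
```

This is why the (8124, 117) shape in the post is obtained from the original string-valued x rather than from a label-encoded copy.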

# Build the model
model = MultinomialNB() # models discrete count features; used here on the one-hot encoded nominal variables
model.fit(x_train, y_train)
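As a side note, scikit-learn also ships `CategoricalNB`, which works directly on integer-coded categories instead of one-hot columns. This is a different estimator from the one used in this post, shown only as a possible alternative. The sketch uses hypothetical toy data, with `OrdinalEncoder` standing in for the label-encoding step.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 1 = poisonous, 0 = edible
toy = pd.DataFrame({
    "odor": ["foul", "foul", "none", "none", "almond", "foul"],
    "gill_size": ["narrow", "narrow", "broad", "broad", "broad", "broad"],
})
target = [1, 1, 0, 0, 0, 1]

# Integer-code each column (categories are numbered in sorted order)
encoder = OrdinalEncoder()
X = encoder.fit_transform(toy).astype(int)

clf = CategoricalNB()  # estimates P(category | class) per feature, with Laplace smoothing
clf.fit(X, target)

new_sample = pd.DataFrame({"odor": ["foul"], "gill_size": ["narrow"]})
pred = clf.predict(encoder.transform(new_sample).astype(int))
print(pred)
```

With real data, categories that appear only in the test split need extra care, since the fitted tables only cover categories seen during training.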

# Evaluate the model

model.classes_
# array([0, 1], dtype=int64)

y_hat = model.predict(x_test)
print('Actual:   ', np.array(y_test[:5]))
print('Predicted:', y_hat[:5])

Actual:    [0 0 1 0 0]
Predicted: [0 0 1 0 0]

cf_mat = confusion_matrix(y_test, y_hat)  # , labels=['poisonous', 'edible'])
# labels: sets the class order of the rows/columns (first = negative, second = positive); only relevant when y keeps its string labels
print(cf_mat)

print(f'Accuracy: {accuracy_score(y_test, y_hat):.3f}')
# print(f'Precision: {precision_score(y_test, y_hat, pos_label="edible"):.3f}')
print(f'Precision: {precision_score(y_test, y_hat):.3f}')
# pos_label: the label treated as positive (default 1; with string labels, omitting it raises an error)
print(f'AUC: {roc_auc_score(y_test, model.predict_proba(x_test)[:,1]):.3f}')

[[841   1]
 [ 74 709]]
Accuracy: 0.954
Precision: 0.999
AUC: 0.997
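The reported scores can be reproduced by hand from the confusion matrix above, which also makes it easy to read off recall (not printed in the post, derived here from the same four numbers). With 0 = edible and 1 = poisonous, the rows are actual classes and the columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
cf_mat = np.array([[841, 1],
                   [74, 709]])
tn, fp, fn, tp = cf_mat.ravel()

accuracy = (tp + tn) / cf_mat.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)           # of predicted poisonous, how many really are
recall = tp / (tp + fn)              # of actually poisonous, how many were caught

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
# accuracy=0.954, precision=0.999, recall=0.905
```

Precision is almost perfect, but the 74 false negatives are poisonous mushrooms predicted as edible, so recall is arguably the number to watch for this problem.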
