[ML] 7. KNN


eunnys 2023. 11. 22. 19:53

▶ Iris species classification

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

x, y = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)

# Create and train the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)
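
To make it clearer what n_neighbors=3 means, here is a minimal from-scratch sketch of the k-nearest-neighbors vote. It assumes Euclidean distance and a simple majority vote; the function knn_predict and the toy arrays are made up for illustration, and scikit-learn's own implementation (neighbor search, tie handling) may behave differently in edge cases.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_query, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    preds = []
    for q in x_query:
        # Euclidean distance from this query point to every training sample
        dists = np.sqrt(((x_train - q) ** 2).sum(axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(dists)[:k]
        # Most common label among those k neighbors
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds)

# Hypothetical toy data: two well-separated clusters
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(toy_x, toy_y, np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(toy_pred)  # [0 1]
```

There is no real training step here: the model only stores the data, and all the work happens at prediction time.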

# Predict and evaluate
y_hat = model.predict(x_test)
con_mat = confusion_matrix(y_test, y_hat)
print(con_mat)

[[15  0  0]
 [ 0 15  0]
 [ 0  0 15]]
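
The matrix above can be read directly: in scikit-learn's convention rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. As a small sketch, the accuracy and per-class recall can be recovered from it by hand (the matrix is retyped here so the snippet runs on its own):

```python
import numpy as np

# Confusion matrix printed above (rows = true class, columns = predicted class)
con_mat = np.array([[15, 0, 0],
                    [0, 15, 0],
                    [0, 0, 15]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(con_mat) / con_mat.sum()

# Recall per class: diagonal entry divided by the number of true samples in that row
recall_per_class = np.diag(con_mat) / con_mat.sum(axis=1)

print(accuracy)          # 1.0
print(recall_per_class)  # [1. 1. 1.]
```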

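# Visualize the data distribution
import matplotlib.pyplot as plt

plt.scatter(x[:,0], x[:,1], c=y, s=150)
# horizontal axis: column 0 of every row, vertical axis: column 1 (sepal length and sepal width)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()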

# Find the best k
import numpy as np
max_k = 10
acc_score = np.zeros(max_k) # vector of 10 zeros, one slot per candidate k
for k in range(1, max_k+1):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    y_hat = model.predict(x_test)
    acc = accuracy_score(y_test, y_hat)
    acc_score[k-1] = acc

# argmax returns the first maximum, so when several k tie the smallest one is reported
max_index = np.argmax(acc_score)
print(acc_score)
print(f'The best k is {max_index+1} with an accuracy of {acc_score[max_index]:.3f}.')

[1.         1.         1.         0.97777778 1.         1.
 1.         1.         1.         0.97777778]
The best k is 1 with an accuracy of 1.000.
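
The loop above scores every k on the test set, so the reported accuracy is also the number used to pick k, which tends to make it optimistic. A common alternative, sketched here as one option rather than the post's method, is to choose k by cross-validation on the training data only and touch the test set once at the end. The choice of cv=5 and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=10, stratify=y)

# Mean 5-fold cross-validated accuracy for each candidate k, using training data only
k_values = list(range(1, 11))
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), x_train, y_train, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(cv_scores))]

# Refit on the full training set with the chosen k, then evaluate once on the test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(x_train, y_train)
test_acc = final_model.score(x_test, y_test)
print(f'best k by cross-validation: {best_k}, test accuracy: {test_acc:.3f}')
```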

▶ MNIST handwritten digit classification
- MNIST (Modified National Institute of Standards and Technology): a database of handwritten digits

# Load the data
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, parser='pandas')
print(type(mnist))
# <class 'sklearn.utils._bunch.Bunch'>

# Check the shape of the data
# There are 70,000 images in total, each with 784 features (one per pixel of a 28x28 image).

x = mnist.data
y = mnist.target
print(x.shape, y.shape) # (70000, 784) (70000,)
print(type(x)) # <class 'pandas.core.frame.DataFrame'>
display(x.head()) # each value is a pixel intensity: 0 is background, larger values (up to 255) are ink

# Render a digit as an image
import matplotlib.pyplot as plt
# Render the first sample
img_data = x.iloc[0].values.reshape(28,28) # reshape the 784-value row into a 28x28 2-D array
plt.imshow(img_data, cmap='binary')
plt.axis('off')
plt.show()

# Split into training and test data
# The MNIST dataset is already ordered as training data (first 60,000 rows) followed by test data (last 10,000 rows).

x_train, x_test, y_train, y_test = x[:60000], x[60000:], y[:60000], y[60000:]

# Create and train the model
knn = KNeighborsClassifier(n_neighbors=5) # n_neighbors chosen arbitrarily
knn.fit(x_train, y_train)

# Predict and evaluate
y_hat = knn.predict(x_test)
print(f'Accuracy: {accuracy_score(y_test, y_hat):.3f}')

Accuracy: 0.969

pred_proba = knn.predict_proba(x_test)
print(f'AUC: {roc_auc_score(y_test, pred_proba, multi_class="ovr"):.3f}')

AUC: 0.995
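
With multi_class="ovr", the AUC is computed one class at a time (that class versus all the others) and, under the default macro averaging, the per-class values are averaged with equal weight. The sketch below checks this by hand. It uses the small iris dataset as a stand-in so it runs without downloading MNIST, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.3, random_state=10, stratify=y)

clf = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr)
proba = clf.predict_proba(x_te)  # one probability column per class, ordered as clf.classes_

# One-vs-rest by hand: a binary AUC for "class c vs. everything else", for each class
per_class_auc = [
    roc_auc_score((y_te == c).astype(int), proba[:, i])
    for i, c in enumerate(clf.classes_)
]
manual_auc = float(np.mean(per_class_auc))

# The same quantity from the library call used above
library_auc = roc_auc_score(y_te, proba, multi_class="ovr")
print(f'manual: {manual_auc:.6f}, library: {library_auc:.6f}')
```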

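# Predict the first image rendered earlier (it belongs to the training set, so this is only a sanity check)
knn.predict(x.iloc[[0]]) # pass a one-row DataFrame so the column names match those seen during fit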

array(['5'], dtype=object)
