
[Machine Learning] Data Analysis Functions and Notes (Preparing for Kaggle)

사족보행 개발자 2025. 4. 19. 16:45

📘 Data Analysis Notes

1. Loading Data

① Loading data

import pandas as pd

# Load a CSV file
df = pd.read_csv('file.csv')

# Load an Excel file
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

# Load a JSON file
df = pd.read_json('file.json')

# CSV with a custom delimiter (tab, semicolon, etc.)
df = pd.read_csv('file.csv', sep='\t')  # tab-separated file

② Checking basic information

df.head()      # first 5 rows
df.tail()      # last 5 rows
df.sample(10)  # 10 random rows

df.shape       # number of rows and columns
df.columns     # column names
df.dtypes      # data type of each column

df.info()      # data types, non-null counts (reveals missing values), memory usage
df.describe(include='all')  # summary of all variables (numeric and categorical)

2. EDA (Exploratory Data Analysis)

① Exploring numeric data

import matplotlib.pyplot as plt
import seaborn as sns

# Select numeric columns only ('number' also catches int32, float32, etc.)
numeric_cols = df.select_dtypes(include='number').columns

# Check distributions with histograms
for col in numeric_cols:
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# Correlation analysis
plt.figure(figsize=(10,8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Boxplot (outlier detection)
for col in numeric_cols:
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

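How to interpret

Histogram: check whether the data is roughly normal or skewed
- Left-skewed (tail on the left), right-skewed (tail on the right)
Correlation: a strong correlation between variables (absolute value of 0.7 or more) may signal multicollinearity
Boxplot: points beyond the whiskers are outlier candidates and targets for later preprocessing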
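
The rules of thumb above can also be checked numerically instead of by eye. The following is a minimal sketch on made-up data (the `toy` DataFrame and its column names are assumptions for illustration only): it computes skewness per column and lists the column pairs whose absolute correlation reaches 0.7.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only here to make the sketch runnable
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=500),  # nearly a copy of x
    "income": rng.exponential(scale=1.0, size=500),   # long right tail
})

# Skewness: above 0 means a right tail, below 0 a left tail
skewness = toy.skew()

# Absolute Pearson correlations, keeping each pair once (upper triangle only)
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr_pairs = upper.stack()
high_corr_pairs = high_corr_pairs[high_corr_pairs >= 0.7]

print(skewness)
print(high_corr_pairs)
```

The 0.7 cutoff is the same rule of thumb as in the notes; it is a starting point for inspection rather than a strict test.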

② Exploring categorical data

# Select categorical columns only
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# Frequency analysis
for col in cat_cols:
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f'Frequency of {col}')
    plt.show()

# Cross analysis (between two categorical variables)
sns.countplot(x='cat_col_1', hue='cat_col_2', data=df)  # placeholder column names
plt.title('Cross Analysis')
plt.show()

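How to interpret

Check whether each categorical variable is imbalanced (look for skewed classes)
Use cross analysis to see how distributions differ under specific conditions (explore dependence between variables)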
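
The same two checks can be done as tables rather than plots. This is a small sketch on invented data (the `plan` and `churned` columns are hypothetical): `value_counts(normalize=True)` gives the share of each category, and a row-normalised `pd.crosstab` shows how one variable is distributed within each level of the other.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "plan":    ["free", "free", "free", "free", "free", "free", "pro", "pro"],
    "churned": ["no", "no", "no", "yes", "yes", "no", "no", "no"],
})

# Share of each category: a very uneven split points to class imbalance
plan_share = toy["plan"].value_counts(normalize=True)

# Row-normalised cross-tabulation: distribution of `churned` within each plan
cross = pd.crosstab(toy["plan"], toy["churned"], normalize="index")

print(plan_share)
print(cross)
```

If the rows of the cross-tabulation look very different from each other, the two variables are probably not independent.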

③ Exploring time series data

df['date'] = pd.to_datetime(df['date'])  # convert to datetime, then set as index
df.set_index('date', inplace=True)

# Overall time series trend
df['numeric_col'].plot(figsize=(12,6))  # placeholder column name
plt.title('Time Series Trend')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Group by period (monthly mean, daily mean, etc.)
# 'M' = month end; pandas 2.2+ prefers the alias 'ME'
df.resample('M')['numeric_col'].mean().plot(kind='bar')
plt.title('Monthly Mean Values')
plt.show()

How to interpret

Identify the long-term trend and repeating patterns (seasonality)
Spot irregular points (outliers, anomalies)
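
One simple way to separate trend from irregular points is a rolling mean. The sketch below uses a synthetic daily series with a weekly cycle and one injected spike (all of it assumed data, and the 7-day window and 3-sigma threshold are arbitrary illustrative choices): the centred rolling mean approximates the trend, and points whose residual is unusually large are flagged.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: upward trend + weekly cycle + noise (illustrative only)
idx = pd.date_range("2024-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
t = np.arange(120)
values = 0.1 * t + np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.2, size=120)
values[60] += 10  # inject one obvious anomaly
ts = pd.Series(values, index=idx)

# A centred 7-day rolling mean averages out the weekly cycle and exposes the trend
trend = ts.rolling(window=7, center=True).mean()

# Residual = observation minus trend; flag points far from the typical residual
resid = ts - trend
z = (resid - resid.mean()) / resid.std()
anomalies = ts[z.abs() > 3]

print(anomalies)
```

The first and last three values of `trend` are NaN because the centred window is incomplete there, so those points are never flagged.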

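3. Choosing Preprocessing Steps Based on the EDA

🔸 Handling missing values

# Count missing values per column
df.isnull().sum()

# Impute numeric missing values (median or mean)
for col in numeric_cols:
    median = df[col].median()
    df[col] = df[col].fillna(median)  # assign back; inplace=True on a column is unreliable

# Impute categorical missing values (most frequent value, i.e. the mode)
for col in cat_cols:
    mode = df[col].mode()[0]
    df[col] = df[col].fillna(mode)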

🔸 Handling outliers (IQR method)

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5*IQR
    upper_bound = Q3 + 1.5*IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
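
The loop above drops every row that falls outside the bounds in any numeric column, which can remove a lot of data on small datasets. As an alternative (not what the loop above does), values can be capped at the bounds instead. A minimal sketch on a made-up `price` column:

```python
import pandas as pd

# Hypothetical toy data: 300 is an obvious outlier
toy = pd.DataFrame({"price": [10, 12, 11, 13, 12, 11, 300]})

q1 = toy["price"].quantile(0.25)
q3 = toy["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag outliers without removing them
is_outlier = (toy["price"] < lower) | (toy["price"] > upper)

# Cap values at the IQR bounds; the number of rows stays the same
toy["price_capped"] = toy["price"].clip(lower=lower, upper=upper)

print(toy)
```

Whether to drop or cap depends on the data: capping keeps the row for the other features, while dropping is cleaner when the whole record is suspect.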

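🔸 Variable transformation (log transform, etc.)

import numpy as np
for col in numeric_cols:
    # Strong right skew; log1p also needs non-negative values to stay well defined here
    if df[col].skew() > 1 and df[col].min() >= 0:
        df[col+'_log'] = np.log1p(df[col])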

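🔸 Encoding methods

# One-hot encoding
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# Label encoding (when order is meaningful)
# Use this instead of the one-hot step above, not after it:
# get_dummies has already replaced the cat_cols columns.
# Note: LabelEncoder numbers categories in sorted order, not by real-world rank.
from sklearn.preprocessing import LabelEncoder
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])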

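How to interpret

The difference between one-hot encoding and label encoding comes down to one question: does order matter?
- Machine learning and deep learning models treat the data as plain numbers.
- So while 1 and 2 are just labels to us, a model may read them as 1st place and 2nd place, or 1 point and 2 points.
- For that reason, label encoding is fine when the categories are ordered, but when the category itself matters more than any order, as with the weekdays 'Mon, Tue, Wed, Thu, Fri', use one-hot encoding.
One-hot encoding can produce a sparse matrix in which most values are 0, which wastes memory if stored densely (handled as shown below)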

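from sklearn.preprocessing import OneHotEncoder

# `data` is assumed to be the DataFrame and `features` the list of categorical columns
ohe = OneHotEncoder()
encoded_data = ohe.fit_transform(data[features])  # returns a SciPy CSR sparse matrix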

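Converting this result back into a DataFrame wastes time and memory, so merge it with scipy's hstack instead.
After that, pass it to a model much as you would a regular DataFrame (many scikit-learn estimators accept sparse input)!

from scipy import sparse

# Stack the remaining columns (assumed numeric) with the encoded block.
# The original categorical columns are dropped: object columns cannot become a sparse matrix.
numeric_part = sparse.csr_matrix(data.drop(columns=features).to_numpy(dtype=float))
total_data = sparse.hstack([numeric_part, encoded_data], format='csr')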
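
To make the "does order matter?" rule concrete, here is a small sketch with two invented columns (`size` is ordered, `weekday` is not). It uses scikit-learn's `OrdinalEncoder` with an explicit category order, since `LabelEncoder` would number the sizes alphabetically (large, medium, small), and `pd.get_dummies` for the unordered column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordered categories
    "weekday": ["Mon", "Wed", "Tue", "Mon"],          # unordered categories
})

# Ordered feature: state the order explicitly so that small < medium < large
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
toy["size_encoded"] = size_encoder.fit_transform(toy[["size"]]).ravel()

# Unordered feature: one indicator column per category
weekday_dummies = pd.get_dummies(toy["weekday"], prefix="weekday")

print(toy)
print(weekday_dummies)
```

With this encoding the model can use "medium is between small and large", while no artificial ranking is imposed on the weekdays.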

4. Modeling Methods (Classification / Regression / Time Series)

🔹 Classification

Algorithms: Logistic Regression, Random Forest, XGBoost, LightGBM
Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC

🔹 Regression

Algorithms: Linear Regression, Random Forest Regressor, XGBoost, LightGBM
Metrics: RMSE, MAE, R²

🔹 Time Series

Algorithms: Prophet, ARIMA, LSTM
Metrics: RMSE, MAPE, MAE

# Example (random forest)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
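
The example above covers classification. For the regression metrics listed earlier in this section (RMSE, MAE, R²), a minimal sketch might look like the following; the data comes from scikit-learn's `make_regression`, so the numbers are illustrative only and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)

rmse = np.sqrt(mean_squared_error(yr_test, yr_pred))  # penalises large errors more
mae = mean_absolute_error(yr_test, yr_pred)           # average absolute error
r2 = r2_score(yr_test, yr_pred)                       # share of variance explained

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE; a big gap between the two suggests that a few large errors dominate.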

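5. Ensemble Modeling Methods

Ensemble methods fall into three broad categories.

Bagging
Boosting
Stacking

📌 1. Bagging (Bootstrap Aggregating)

Definition
- Samples the data randomly with replacement, trains multiple models on those samples, then combines their predictions by voting (classification) or averaging (regression).
Advantages
- Helps reduce overfitting by lowering variance
- Gives stable results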
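
The definition above can be spelled out by hand. This is a minimal sketch, not how Random Forest is implemented internally: it draws bootstrap samples with NumPy, fits one decision tree per sample, and combines the trees by majority vote. The data and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 vote can never tie
all_preds = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Majority vote: labels are 0/1, so the mean is the share of trees voting 1
votes = np.mean(all_preds, axis=0)
bagged_pred = (votes >= 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()

print(f"bagged accuracy: {bagged_acc:.3f}")
```

For regression the vote would simply be replaced by the average of the individual predictions.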

✅ Main algorithms

Random Forest
Extra Trees

📍 Code example (Random Forest)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

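📌 2. Boosting

Definition
- Trains several weak learners sequentially, with each new model correcting the errors of the previous ones
Advantages
- Can reach high accuracy
- Strong performance in many real-world settings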
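
The "each model corrects the previous errors" idea can be sketched by hand for regression with squared error, where the error to correct is simply the residual. This is a simplified illustration of the gradient boosting idea, not the XGBoost or LightGBM implementation; the data, learning rate, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residual = y_toy - prediction  # what the ensemble still gets wrong
    weak = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)  # weak learner
    prediction += learning_rate * weak.predict(X_toy)  # small corrective step
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error keeps shrinking with every round, which is exactly why boosting needs a validation set to decide when to stop.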

✅ Main algorithms

AdaBoost
Gradient Boosting Machine (GBM)
XGBoost
LightGBM
CatBoost

📍 Code example (XGBoost)

import xgboost as xgb
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))

📍 Code example (LightGBM)

import lightgbm as lgb
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42)
lgb_model.fit(X_train, y_train)

y_pred = lgb_model.predict(X_test)
print(classification_report(y_test, y_pred))

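📌 3. Stacking

Definition
- Trains several different models in parallel, then a single meta model combines their predictions to produce the final result
Advantages
- Can reach higher performance by combining the strengths of different algorithms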
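
To show what the meta model actually sees, here is a hand-rolled sketch of stacking under a few assumptions: synthetic data, two arbitrary base models, and out-of-fold probabilities from `cross_val_predict` as meta-features. It illustrates the idea and is not meant as a replacement for `StackingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold probabilities: each training row is predicted by a model that never saw it
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training data to build the test-time meta-features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# The meta model learns how to weight the base models' predictions
meta_model = LogisticRegression().fit(meta_train, y_tr)
stacked_acc = meta_model.score(meta_test, y_te)

print(f"stacked accuracy: {stacked_acc:.3f}")
```

Using out-of-fold predictions matters: if the meta model were trained on predictions the base models made for their own training rows, it would learn from overly optimistic inputs.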

✅ Main components

Base Model: RandomForest, XGBoost, LightGBM, etc.
Meta Model: Logistic Regression, XGBoost, LightGBM, etc.

📍 Code example (simple stacking implementation)

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(n_estimators=100, random_state=42))
]

# Meta model: Logistic Regression
stack_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5
)

stack_model.fit(X_train, y_train)

y_pred = stack_model.predict(X_test)
print(classification_report(y_test, y_pred))

🗂 Summary of ensemble algorithms

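Algorithm	Method	Overfitting resistance	Characteristics
Random Forest	Bagging	High	Stable, simple, general-purpose
AdaBoost	Boosting	Medium	Puts more weight on misclassified samples
Gradient Boosting	Boosting	Medium	Learns sequentially in the direction that reduces the error
XGBoost	Boosting	Medium to high	Fast and accurate
LightGBM	Boosting	Medium to high	Fast training, suited to large datasets
CatBoost	Boosting	High	Handles categorical variables automatically, robust performance
Stacking	Stacking	Medium to high	Combines the strengths of different models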

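🚩 Notes for use in an actual exam

Bagging: recommended when the data is noisy and unstable
Boosting: strong performance when the data is relatively clean, but watch out for overfitting
Stacking: useful for squeezing out maximum performance by combining different algorithms; watch the complexity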
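
One way to watch for overfitting in boosting is to compare training and validation scores as trees are added. The sketch below assumes scikit-learn's `GradientBoostingClassifier` and its `staged_predict` method on synthetic, deliberately noisy data; with XGBoost or LightGBM the same idea is usually handled through their own early-stopping options.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees
train_acc = [accuracy_score(y_tr, p) for p in gbm.staged_predict(X_tr)]
val_acc = [accuracy_score(y_val, p) for p in gbm.staged_predict(X_val)]

# Number of trees at which validation accuracy peaked
best_n = int(np.argmax(val_acc)) + 1

print(f"best n_estimators by validation accuracy: {best_n}")
print(f"final train acc={train_acc[-1]:.3f}, final val acc={val_acc[-1]:.3f}")
```

A training score that keeps rising while the validation score flattens or drops is the typical sign that later trees are fitting noise.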

📌 Evaluation metrics for ensemble algorithms

Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
Regression: RMSE, MAE, MAPE, R²
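
For reference, the classification metrics above map onto scikit-learn functions as sketched below. The labels and scores are hand-made toy values; the 0.5 threshold is an assumption. Note that ROC-AUC takes the predicted scores or probabilities, not the thresholded labels.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

# Hand-made toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
y_hat = (y_score >= 0.5).astype(int)  # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)  # of predicted positives, how many are right
rec = recall_score(y_true, y_hat)      # of actual positives, how many are found
f1 = f1_score(y_true, y_hat)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)   # uses scores, not thresholded labels

print(acc, prec, rec, f1, auc)
```

With a real model, `y_score` would typically come from `model.predict_proba(X_test)[:, 1]`.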
