TIL - 06.10

TIL 2024. 6. 10. 21:21

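Don't try to do too much

Do what you can

EDA, Outlier, Missing Value, Encoding, Scaling

What's different from before

countplot : visualizes the frequency of categorical data

x : categorical variable // y : count of each category

=> e.g. number of occurrences per day of the week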
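
A minimal sketch of what countplot puts on each axis, using a made-up `day` column (the data and column name are assumptions for illustration). The bar heights are just the per-category row counts, which pandas can compute directly:

```python
import pandas as pd

# Hypothetical sample data: one row per observation, tagged with its weekday.
df = pd.DataFrame({"day": ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"]})

# countplot puts the category on x and the number of rows per category on y;
# value_counts() computes those same bar heights.
day_counts = df["day"].value_counts()
print(day_counts)

# With seaborn installed, the plot itself would be drawn with:
# import seaborn as sns
# sns.countplot(data=df, x="day")
```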

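ESD (Extreme Studentized Deviate)

Assuming the data follow a normal distribution, values 3 or more standard deviations away from the mean are treated as outliers (about 0.15% in each tail, so roughly 0.3% of the data)

(Normality assumption : not applicable when even a log transform does not fix the skew, or when the sample size is small)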
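
A rough sketch of the 3-standard-deviation rule with NumPy. The data are synthetic (roughly normal values plus two planted extremes), and the variable names are my own:

```python
import numpy as np

# Synthetic data: roughly normal values plus two planted extreme points.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), [120.0, -30.0])

# ESD-style rule: flag anything 3+ standard deviations from the mean.
mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std
is_outlier = (values < lower) | (values > upper)
esd_outliers = values[is_outlier]

print(f"bounds: [{lower:.1f}, {upper:.1f}], flagged: {is_outlier.sum()}")
```

Note that the extreme points themselves inflate the mean and standard deviation, which is one reason the rule gets shaky on small or heavily skewed samples.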

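IQR (Interquartile Range)

Usable under the same conditions as above

Boxplot : a chart that shows the distribution of the data including its quartiles, also called a box-and-whisker plot

- Quartiles : the values that split the ordered data into four equal parts (Q1 {25%}, Q2 {50%}, Q3 {75%})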
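
A small sketch of IQR-based outlier detection. The 1.5 x IQR fences are the common boxplot convention, not something stated in the notes above, and the numbers are made up:

```python
import numpy as np

# Made-up values with one obvious extreme point.
values = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 95])

# Q1 and Q3 are the 25th and 75th percentiles; IQR is their difference.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the box is an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr, iqr_outliers)
```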

Outliers are a subjective call

=> The criterion changes with the domain and business context // Anomaly Detection : finds cases whose pattern looks different from the rest

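Categorical data

Fill with the mode (most frequent value)

Functions : df.dropna(axis=0) drops rows, df.dropna(axis=1) drops columns, df.fillna(value) fills missing values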
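
A quick sketch of both functions on a toy frame (the column names and values are made up). Numeric columns get the mean here, the categorical column gets the mode:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value and one missing category.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 30.0],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 13.0],
})

rows_dropped = df.dropna(axis=0)  # drop rows that contain any missing value
cols_dropped = df.dropna(axis=1)  # drop columns that contain any missing value

# Fill instead of dropping: mean for numeric, mode for categorical.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])

print(filled)
```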

Algorithms : sklearn.impute.SimpleImputer (mean, median, most frequent) [statistics_ : lets you check the imputed value] / IterativeImputer (multivariate, regression-based) / KNNImputer (KNN algorithm)

KNN = K-Nearest Neighbors

from sklearn.impute import SimpleImputer
si = SimpleImputer()  # default strategy is 'mean'
si.fit(titanic_df[['Age']])

# value that will be used to fill missing entries
si.statistics_

titanic_df['Age_si_mean'] = si.transform(titanic_df[['Age']])
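
To compare the imputers without needing the Titanic file, here is a self-contained sketch on a toy frame (`toy_df` and its values are made up). SimpleImputer fills with the column mean, while KNNImputer borrows the value from the most similar row, here judged by Fare:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data: the third passenger's Age is missing.
toy_df = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 70.0],
    "Fare": [10.0, 20.0, 21.0, 40.0],
})

# Mean imputation: statistics_ holds the fill value learned for each column.
si = SimpleImputer(strategy="mean")
si.fit(toy_df[["Age"]])
toy_df["Age_si_mean"] = si.transform(toy_df[["Age"]]).ravel()

# KNN imputation: take Age from the single nearest row (closest Fare here).
knn = KNNImputer(n_neighbors=1)
knn_filled = knn.fit_transform(toy_df[["Age", "Fare"]])
toy_df["Age_knn"] = knn_filled[:, 0]

print(si.statistics_)
print(toy_df)
```

The mean is pulled up by the 70-year-old, while KNN picks the age of the passenger with the closest fare, so the two fills differ.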

LabelEncoding

sklearn.preprocessing.LabelEncoder

# fit / transform / fit_transform / inverse_transform
# classes_ : the classes (categories) the encoder learned

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
oe = OneHotEncoder()

# LabelEncoder expects 1-D input, so pass a single column (Series)
le.fit(titanic_df['sex'])

le.classes_
# array(['female', 'male'], dtype=object)

titanic_df['sex_le'] = le.transform(titanic_df['sex'])
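
A self-contained sketch of the same round trip on a made-up Series, including inverse_transform to get the original labels back:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column.
sex = pd.Series(["male", "female", "female", "male"], name="sex")

le = LabelEncoder()
sex_le = le.fit_transform(sex)           # classes are sorted, then numbered from 0
restored = le.inverse_transform(sex_le)  # map the integers back to the labels

print(le.classes_, sex_le, restored)
```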

OneHotEncoding

pd.get_dummies

sklearn.preprocessing.OneHotEncoder
# same methods as LabelEncoder
# categories_ : the classes (categories) the encoder learned

oe.fit(titanic_df[['Embarked']])

embarked_csr = oe.transform(titanic_df[['Embarked']])
# <891x4 sparse matrix of type '<class 'numpy.float64'>' with 891 stored elements in Compressed Sparse Row format>

embarked_csr_df = pd.DataFrame(embarked_csr.toarray(), columns=oe.get_feature_names_out())
# columns = Embarked_C, Embarked_Q, Embarked_S, ..

Standardization

sklearn.preprocessing.StandardScaler
# fit / transform
# mean_ / scale_ / var_ : mean, standard deviation, and variance of the data
# n_features_in_ : number of features passed to fit
# feature_names_in_ : names of the features passed to fit
# n_samples_seen_ : number of samples passed to fit

Normalization

sklearn.preprocessing.MinMaxScaler
# data_min_ / data_max_ / data_range_ : min, max, and (max - min) range of the original data
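
A minimal sketch with made-up values. With the default settings, MinMaxScaler maps the minimum to 0 and the maximum to 1:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
toy_df = pd.DataFrame({"Fare": [10.0, 20.0, 30.0, 50.0]})

mm_sc = MinMaxScaler()
fare_mm = mm_sc.fit_transform(toy_df[["Fare"]])  # (x - data_min_) / data_range_

print(mm_sc.data_min_, mm_sc.data_max_, mm_sc.data_range_)
print(fare_mm.ravel())
```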

Robust scaling

sklearn.preprocessing.RobustScaler
# center_ : median of the training data
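
A sketch with one planted outlier (values made up). RobustScaler centers on the median and, by default, divides by the IQR, so the outlier barely affects how the ordinary values are scaled:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up data with one extreme value.
toy_df = pd.DataFrame({"Fare": [10.0, 20.0, 30.0, 40.0, 1000.0]})

rb_sc = RobustScaler()
fare_rb = rb_sc.fit_transform(toy_df[["Fare"]])  # (x - center_) / scale_

print(rb_sc.center_)  # median of the training data
print(rb_sc.scale_)   # IQR (Q3 - Q1) with the default quantile_range
print(fare_rb.ravel())
```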
