- Reasons for data preprocessing
- Accuracy
- Completeness
- Consistency
- Schema: metadata describing the data's attributes and their types
- Timeliness
- Trustworthiness
- Interpretability
Data Preprocessing
1. Data Selection and Integration
- select only the data you need from a DB or file, or integrate data from multiple sources
- issues
- schema integration: same attribute with different names
- entity identification: same entity but with different values
- detecting and removing redundant data
- handle
- check redundancy (ex. correlation analysis)
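A minimal sketch of redundancy checking via correlation analysis, assuming the integrated data sits in a pandas DataFrame; the column names and the 0.95 threshold are hypothetical choices, not part of the original notes.

```python
import pandas as pd

# Hypothetical integrated data: "height_cm" and "height_in" describe
# the same attribute under different names, so they should be highly correlated.
df = pd.DataFrame({
    "height_cm": [170, 165, 180, 175],
    "height_in": [66.9, 65.0, 70.9, 68.9],
    "weight_kg": [65, 55, 80, 72],
})

corr = df.corr()   # Pearson correlation between numeric attributes
threshold = 0.95   # hypothetical cutoff for "looks redundant"

for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} look redundant (corr = {corr.loc[a, b]:.2f})")
```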
2. Data Cleaning
“dirty” data
- incomplete (missing): lacking attribute values
- → ignore the tuples: drop them entirely; reduces the amount of data
- → fill missing values manually: reliable, but time-consuming
- → fill automatically (see the sketch after this list)
- global constant (ex. “unknown”)
- mean
- estimated by ML
- → use methods that can handle missing values
ex. decision trees
- noisy: noise, errors, outliers
- → smoothing: “smooth out” the values with averaging, binning, or clustering (see the sketch after this list)
- → remove automatically: outlier detection
- → do nothing: noise is data too!
- inconsistent: incorrect data type (ex. age=”15” stored as a string)
- intentional: disguised missing data
: fields may contain arbitrary placeholder values (ex. everyone who left the name field blank is recorded as “Hong Gil-dong”)
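A minimal sketch of the automatic-fill and smoothing-by-binning options above, assuming pandas; the toy columns (“age”, “income”), the -1 “unknown” marker, and the three-bin split are hypothetical choices.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: "age" has a missing value, "income" is noisy.
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 52, 38],
    "income": [3100, 2900, 4050, 8000, 4100, 3900],
})

# Fill automatically: global constant vs. attribute mean.
df["age_const"] = df["age"].fillna(-1)                # constant "unknown" marker
df["age_mean"]  = df["age"].fillna(df["age"].mean())  # mean imputation

# Smoothing by bin means: partition "income" into equal-frequency bins
# and replace each value with the mean of its bin.
bins = pd.qcut(df["income"], q=3)  # 3 bins, hypothetical choice
df["income_smoothed"] = df.groupby(bins, observed=True)["income"].transform("mean")

print(df)
```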
3. Data Reduction
Data Reduction: Sampling
- Random sampling
- sampling without replacement
- sampling with replacement
- stratified sampling
: partition the data set and draw samples from each partition (see the sketch below)
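A minimal sketch of the three sampling schemes, assuming pandas; the toy DataFrame and the “label” column used as the stratification key are hypothetical.

```python
import pandas as pd

# Hypothetical data set with a class label to stratify on.
df = pd.DataFrame({
    "value": range(10),
    "label": ["a"] * 6 + ["b"] * 4,
})

# Random sampling without replacement (each row picked at most once).
without_repl = df.sample(n=4, replace=False, random_state=0)

# Random sampling with replacement (a row may be picked multiple times).
with_repl = df.sample(n=4, replace=True, random_state=0)

# Stratified sampling: partition by "label", then draw the same fraction
# from each partition so the class proportions are preserved.
stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)

print(stratified)
```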