- Reasons for data preprocessing
- Accuracy
- Completeness
- Consistency
- Schema: metadata describing the data's attributes and their types
- Timeliness
- Trustworthiness
- Interpretability
Data Preprocessing
1. Data Selection and Integration
- select only the data you need from a DB or file, or integrate data from multiple sources
- issues
- schema integration: same attribute with different names
- entity identification: same entity but with different values
- detecting and removing redundant data
- handle
- check redundancy (ex. correlation analysis)
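A minimal sketch of redundancy checking via correlation analysis, assuming the integrated data sits in a pandas DataFrame; the column names and the 0.95 threshold are hypothetical choices, not part of the original notes.

```python
import pandas as pd

# Hypothetical integrated data: "height_cm" and "height_in" describe
# the same attribute under different names, so they should be highly correlated.
df = pd.DataFrame({
    "height_cm": [170, 165, 180, 175],
    "height_in": [66.9, 65.0, 70.9, 68.9],
    "weight_kg": [65, 55, 80, 72],
})

corr = df.corr()   # Pearson correlation between numeric attributes
threshold = 0.95   # hypothetical cutoff for "looks redundant"

for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} look redundant (corr = {corr.loc[a, b]:.2f})")
```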
2. Data Cleaning
“dirty” data
- incomplete (missing): lacking attribute values
- → ignore the tuples: drop them entirely; reduces the amount of data
- → fill missing values manually: reliable, but time-consuming
- → fill automatically (see the sketch after this list)
- global constant (ex. “unknown”)
- mean
- estimated by ML
- → use methods that can handle missing values
ex. decision trees
- noisy: noise, errors, outliers
- → smoothing: “smooth out” the values with averaging, binning, or clustering (see the sketch after this list)
- → remove automatically: outlier detection
- → do nothing: noise is data too!
- inconsistent: incorrect data type (ex. age=”15” stored as a string)
- intentional: disguised missing data
: fields may contain arbitrary placeholder values (ex. everyone who left the name field blank is recorded as “Hong Gil-dong”)
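A minimal sketch of the automatic-fill and smoothing-by-binning options above, assuming pandas; the toy columns (“age”, “income”), the -1 “unknown” marker, and the three-bin split are hypothetical choices.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: "age" has a missing value, "income" is noisy.
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 52, 38],
    "income": [3100, 2900, 4050, 8000, 4100, 3900],
})

# Fill automatically: global constant vs. attribute mean.
df["age_const"] = df["age"].fillna(-1)                # constant "unknown" marker
df["age_mean"]  = df["age"].fillna(df["age"].mean())  # mean imputation

# Smoothing by bin means: partition "income" into equal-frequency bins
# and replace each value with the mean of its bin.
bins = pd.qcut(df["income"], q=3)  # 3 bins, hypothetical choice
df["income_smoothed"] = df.groupby(bins, observed=True)["income"].transform("mean")

print(df)
```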
3. Data Reduction
Data Reduction: Sampling
- Random sampling
- sampling without replacement
- sampling with replacement
- stratified sampling
: partition the data set and draw samples from each partition (see the sketch below)
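A minimal sketch of the three sampling schemes, assuming pandas; the toy DataFrame and the “label” column used as the stratification key are hypothetical.

```python
import pandas as pd

# Hypothetical data set with a class label to stratify on.
df = pd.DataFrame({
    "value": range(10),
    "label": ["a"] * 6 + ["b"] * 4,
})

# Random sampling without replacement (each row picked at most once).
without_repl = df.sample(n=4, replace=False, random_state=0)

# Random sampling with replacement (a row may be picked multiple times).
with_repl = df.sample(n=4, replace=True, random_state=0)

# Stratified sampling: partition by "label", then draw the same fraction
# from each partition so the class proportions are preserved.
stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)

print(stratified)
```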