
What is Data Transformation

Common data transformation techniques

Min-Max Scaling

Definition and purpose: Min-max scaling is a popular data normalization technique that transforms features to fall within a specific range, typically between 0 and 1.

Equation:

$$X' = \frac{X - a}{b - a} \times (b' - a') + a'$$

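Where $a$ and $b$ are the minimum and maximum of the original data $X$, and $a'$ and $b'$ are the lower and upper bounds of the target range (typically 0 and 1).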

When to use: When you need to bring data into a specific range while preserving the shape of the original distribution.
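The min-max equation can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `min_max_scale`, its parameters `new_min` and `new_max`, and the sample prices are all made up for the example.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Linearly rescale x so its minimum maps to new_min and its maximum to new_max."""
    x = np.asarray(x, dtype=float)
    a, b = x.min(), x.max()  # observed minimum and maximum of the original data
    if a == b:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return (x - a) / (b - a) * (new_max - new_min) + new_min

# Hypothetical resale prices, for illustration only.
prices = np.array([300_000, 450_000, 520_000, 800_000])
scaled = min_max_scale(prices)  # values: 0.0, 0.3, 0.44, 1.0
```

Because the mapping is linear, the relative spacing between values is unchanged. A single extreme value stretches the range `b - a`, however, and compresses everything else toward one end.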


The histogram (top) and PDF (bottom) of the average subzone resale price before (left) and after (right) min-max scaling.


Standardization

Definition and purpose: Standardization (aka Z-score normalization) is a data transformation technique that scales features to have zero mean and unit variance. Its purpose is to make different features comparable by transforming them to a common scale.

Equation:

$$X' = \frac{X - \mu}{\sigma}$$

Where $\mu$ is the mean and $\sigma$ is the standard deviation of the original data $X$.

When to use:

The histogram (top) and PDF (bottom) of the average subzone resale price before (left) and after (right) standardization.


The PDF (top) of the average subzone resale price of 3-room and 5-room HDB flats before (left) and after (right) standardization. The two scatter plots at the bottom show the relationship between 3-room and 5-room HDB flats.
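Standardization as defined above can be sketched as follows. The helper name `standardize` and the two price arrays are hypothetical. The sketch uses the population standard deviation (NumPy's default, `ddof=0`); the sample standard deviation (`ddof=1`) is an equally common convention, and the text does not say which one is intended.

```python
import numpy as np

def standardize(x):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0); an assumption
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return (x - mu) / sigma

# Hypothetical average resale prices for two flat types, for illustration only.
three_room = np.array([310_000, 350_000, 420_000, 390_000])
five_room = np.array([560_000, 640_000, 790_000, 700_000])

z3 = standardize(three_room)
z5 = standardize(five_room)
# Both results now have mean 0 and standard deviation 1,
# so the two flat types can be compared on a common scale.
```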

Log Transformation

Definition and Purpose: Log transformation is a data transformation technique that applies the natural logarithm to the original data. It helps to reduce skewness in the data and stabilize variance for statistical analysis or modeling.

Equation:

$$X_{\text{log}} = \log(X)$$

Where $X$ is the original data, which must be strictly positive, and $\log$ is the natural logarithm.

When to Use:

Resale prices before (left) and after (right) log transformation.

Ridership (arriving passengers) before (top) and after (bottom) log transformation.
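A minimal sketch of the log transformation, using made-up values. The natural logarithm is only defined for strictly positive data, so the second half uses `np.log1p`, which computes $\log(1 + X)$. That is a common workaround for count data containing zeros, but it is a different transformation from the equation above and is shown here only as an assumption about how such data might be handled.

```python
import numpy as np

# Hypothetical resale prices (strictly positive), for illustration only.
prices = np.array([310_000.0, 420_000.0, 560_000.0, 1_250_000.0])
log_prices = np.log(prices)    # natural logarithm, as in the equation above
restored = np.exp(log_prices)  # the transformation is invertible

# Hypothetical ridership counts that include a zero.
ridership = np.array([0.0, 12.0, 150.0, 2_300.0, 48_000.0])
log_ridership = np.log1p(ridership)           # log(1 + x), defined at x = 0
restored_ridership = np.expm1(log_ridership)  # inverse of log1p
```

The transformed values span a much narrower range than the originals, which is why the long right tail is pulled in.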

Box-Cox

Definition and Purpose: Box-Cox transformation is a power transformation technique that can stabilize variance and make data more normally distributed. It is a family of transformations that includes the logarithmic ($\lambda = 0$) and square root ($\lambda = 0.5$) transformations as special cases.

Equation:

$$X_{\text{box-cox}}(\lambda) = \frac{X^\lambda - 1}{\lambda} \quad \text{if} \quad \lambda \neq 0$$
$$X_{\text{box-cox}}(\lambda) = \log(X) \quad \text{if} \quad \lambda = 0$$

When to Use:

Histogram of Box-Cox transformed values, with $\lambda$ set to: 0, 0.25, 0.5, 1, 2, 3.

How to determine which $\lambda$ is the best?
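One common answer is to pick the $\lambda$ that maximizes the log-likelihood of the transformed data under a normal model. The sketch below does this with a simple grid search on synthetic, strictly positive data. The function names, the grid range, and the data are all assumptions made for illustration. In practice, `scipy.stats.boxcox` called without a `lmbda` argument returns a maximum-likelihood estimate directly.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < 1e-8:  # treat lambda as 0 and use the log branch
        return np.log(x)
    return (x ** lam - 1.0) / lam

def box_cox_log_likelihood(x, lam):
    """Log-likelihood of lambda, up to a constant, assuming the transformed data are normal."""
    x = np.asarray(x, dtype=float)
    y = box_cox(x, lam)
    return (lam - 1.0) * np.log(x).sum() - 0.5 * x.size * np.log(y.var())

# Synthetic right-skewed data, for illustration only.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)

candidates = np.linspace(-2.0, 2.0, 401)  # grid of lambda values to try
scores = [box_cox_log_likelihood(x, lam) for lam in candidates]
best_lambda = candidates[int(np.argmax(scores))]
print(f"lambda with the highest log-likelihood: {best_lambda:.2f}")
```

Since the synthetic data are log-normal, the selected value should land close to 0, the logarithmic special case.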

Ranking Data Transformation

Definition and Purpose: Ranking data transformation replaces original values with their corresponding ranks, preserving the order or hierarchy of the data. It can help in data visualization, feature engineering, or non-parametric statistical tests.

How it works: There is no explicit equation for ranking data transformation. The process involves sorting the values and assigning ranks based on their positions in the sorted list. In case of ties, you can use methods like average, min, or max ranking.
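The sort-and-assign procedure, including the three tie-handling methods, can be sketched in plain Python. The function name `rank_values` and the sample data are hypothetical. pandas provides the same options through `Series.rank(method=...)`.

```python
def rank_values(values, ties="average"):
    """Assign 1-based ranks to values; tied values share a rank chosen by `ties`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` across the run of values tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        low, high = start + 1, end + 1  # 1-based positions covered by the tie group
        if ties == "average":
            shared = (low + high) / 2
        elif ties == "min":
            shared = low
        elif ties == "max":
            shared = high
        else:
            raise ValueError(f"unknown tie method: {ties!r}")
        for position in range(start, end + 1):
            ranks[order[position]] = float(shared)
        start = end + 1
    return ranks

# Hypothetical values containing two pairs of ties.
prices = [420, 310, 420, 560, 310, 390]
print(rank_values(prices, "average"))  # [4.5, 1.5, 4.5, 6.0, 1.5, 3.0]
print(rank_values(prices, "min"))      # [4.0, 1.0, 4.0, 6.0, 1.0, 3.0]
print(rank_values(prices, "max"))      # [5.0, 2.0, 5.0, 6.0, 2.0, 3.0]
```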

When to Use: When working with ordinal data or categorical data that can be ordered, or when performing non-parametric statistical tests that require ranked data.

The relationship between the ranks and the original data, and the frequency distribution of the ranks.


What is the expected frequency distribution of the ranks? Why does the observed distribution not exactly match it?

Other Techniques

Choosing the Right Technique

A simple guide

  1. Assess data quality and characteristics: Examine your dataset for issues like missing values, outliers, non-linearity, and non-normal distributions. Identifying these challenges will inform the choice of transformation technique.

  2. Understand variable types: Determine whether your variables are categorical or numerical. This will help you choose appropriate transformation methods, as some techniques are designed specifically for certain variable types.

  3. Evaluate the goals of your analysis: Consider the objectives of your data analysis and the specific requirements of your chosen modeling technique. Different transformations can address particular challenges or better align with specific analysis goals.

  4. Investigate common transformation techniques: Explore various methods like logarithmic, square root, Box-Cox, or Z-score (standardization) transformations for numerical data, and one-hot encoding or ordinal encoding for categorical data.

  5. Compare the pros and cons: Weigh the advantages and disadvantages of different techniques, considering factors like interpretability, ease of implementation, and potential impact on model performance.

  6. Iterate and validate: Test multiple transformation techniques and evaluate their effects on your data and model performance. This will help you identify the most effective approach for your specific situation.

  7. Consult resources and domain experts (e.g., GIS experts): Leverage the knowledge of experienced professionals, peer-reviewed literature, and other credible sources to inform your decision-making process.
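As one possible way to carry out the "iterate and validate" step, the sketch below applies several candidate transformations to the same data and compares their skewness. Using skewness as the criterion is an assumption made for this example, and so are the synthetic data and the helper name `skewness`. A real analysis would also check the effect on model performance.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "prices", for illustration only.
rng = np.random.default_rng(42)
data = rng.lognormal(mean=13.0, sigma=0.5, size=1_000)

candidates = {
    "none": data,
    "square root": np.sqrt(data),
    "log": np.log(data),
}
report = {name: skewness(values) for name, values in candidates.items()}
for name, value in report.items():
    print(f"{name:12s} skewness = {value:+.3f}")

# Pick the transformation whose skewness is closest to zero.
best = min(report, key=lambda name: abs(report[name]))
print("closest to symmetric:", best)
```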