
What is the first step of getting your hands dirty?

Table 1: Iris dataset.

| ID | Variety | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
|----|---------|-------------------|------------------|-------------------|------------------|
| 0  | Setosa  | 5.1 | 3.5 | 1.4 | 0.2 |
| 1  | Setosa  | 4.9 | 3.0 | 1.4 | 0.2 |
| 2  | Setosa  | 4.7 | 3.2 | 1.3 | 0.2 |
| 3  | Setosa  | 4.6 | 3.1 | 1.5 | 0.2 |
| 4  | Setosa  | 5.0 | 3.6 | 1.4 | 0.2 |
| …  | …       | …   | …   | …   | …   |

First Step: Draw the frequency distribution.

Below is an example of Iris’ petal length from week 1, when we talked about the analysis of differences.

The frequency distributions were drawn separately for different species of Iris, which provided a visual comparison between the three species.
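
As a rough sketch of this step (not the exact notebook code from week 1), the per-species histograms could be drawn as follows; the seaborn-bundled copy of the Iris data and its column names (`petal_length`, `species`) are assumptions:

```python
# Sketch of the first step: draw the frequency distribution of petal
# length separately for each Iris species.
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")  # columns: sepal_length, ..., petal_length, species

fig, ax = plt.subplots(figsize=(7, 4))
for species, group in iris.groupby("species"):
    ax.hist(group["petal_length"], bins=20, alpha=0.5, label=species)

ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Frequency")
ax.legend(title="Species")
plt.show()
```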

Are they different?

The differences in petal length between the three species.

Define Differences

In statistics, “difference” typically refers to the extent to which two or more groups, samples, or observations differ from one another. This can be quantified and assessed through various statistical measures and analyses, depending on the nature of the data and research questions.

How to determine if they are different or not so different?

Spotting The Differences

The differences in the 4 variables between species.

Hypothesis Testing

Hypothesis testing is a statistical method used to evaluate claims or assertions about a population parameter by analyzing sample data.

The goal is to determine if the observed differences between groups or samples are statistically significant or merely due to chance.

What does ‘due to chance’ mean?

In another chapter, we talked about Normal Distribution, and why ‘Normal’ is ‘Normal’.

When we get a set of samples, their distribution tends to form a bell shape, i.e., high in the middle (the central tendency) and dropping from the middle toward both sides (the spread). But why?

The Normal Distribution.

The variance, or spread, of a dataset can result from various factors, including random processes, natural variation, or systematic differences. Despite this variability, data points often cluster around a central value due to underlying patterns or tendencies within the population.

In other words, values that fall at some positive or negative distance from the mean may simply reflect random factors. This also applies when we draw a value and obtain an extreme one far from the center: such values can occur purely by chance.

Anything can happen by chance, so when we see one value that lies far from the sample mean, that alone does not guarantee it is ‘different’ from the distribution.
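
To see this concretely, here is a small illustrative simulation (the parameters mean = 170 and std = 10 anticipate the class-height example below and are assumptions): even when every value is drawn from the same normal distribution, a few draws land far from the mean purely by chance.

```python
# Illustrative sketch: values far from the mean can occur by chance alone.
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=170, scale=10, size=1000)  # hypothetical heights

print("sample mean:", round(sample.mean(), 1))
print("sample std :", round(sample.std(ddof=1), 1))

# Count draws more than 2 standard deviations away from the sample mean;
# roughly 5% of draws are expected to fall this far out by chance.
far = np.abs(sample - sample.mean()) > 2 * sample.std(ddof=1)
print("draws beyond 2 SD:", far.sum(), "out of", sample.size)
```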

Figure 5: A data point compared to the distribution.

Let’s say we have a class of students whose heights are measured and show a normal distribution (Figure 5), i.e., mean = 170, std = 10.

At this point, you can view these students as a sample drawn from the population (the university). The heights of the university’s students should (or can be assumed to) form a normal distribution.

Now, you get a student from the class next door, and his height is 135. Does this mean that this person is different from the current class?

Strategy for measuring difference of an individual data point from a distribution
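
A common strategy is the z-score: express the point as a number of standard deviations from the mean, and then look up how likely a value at least that extreme would be under the assumed normal distribution. A minimal sketch with the numbers from the example (mean 170, std 10, observed height 135):

```python
# Sketch: measure how far a single observation (height = 135) lies from
# the class distribution (mean = 170, std = 10) using a z-score.
from scipy import stats

mean, std, value = 170, 10, 135
z = (value - mean) / std                 # z = -3.5
p_two_sided = 2 * stats.norm.sf(abs(z))  # chance of a value at least this extreme

print(f"z-score: {z:.1f}")
print(f"two-sided tail probability: {p_two_sided:.5f}")  # about 0.00047
```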

Comparison of two distributions

The distributions of two groups of students.

Next, get a series of students’ heights from the class next door, perhaps the same number of students as in the first class, which generates another distribution (also assumed to be normal).

Then, how to compare them?

Central values and spread: Central values can be represented by measures of central tendency, such as the mean, median, or mode. In many situations, particularly when data follow a normal distribution, the combination of central tendency and spread can provide a helpful summary of the overall distribution and its characteristics.

Shape and other measurements: The shape is assumed to follow a normal distribution.

Strategy for comparison
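
One standard strategy for comparing the means of two (approximately normal) distributions is the two-sample t-test. The sketch below uses simulated heights for the two classes; the group means, standard deviations, and sample sizes are illustrative assumptions, not values from the lecture.

```python
# Sketch: two-sample t-test comparing the mean heights of two classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
class_a = rng.normal(loc=170, scale=10, size=40)  # first class
class_b = rng.normal(loc=165, scale=10, size=40)  # class next door

t_stat, p_value = stats.ttest_ind(class_a, class_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the two mean heights differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```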

Key Concepts

Hypothesis testing is a crucial tool in statistical inference and helps researchers make data-driven decisions about the relationships or differences between variables.

Null Hypothesis ($\text{H}_0$): the default claim of “no effect” or “no difference”, assumed true until the sample data provide evidence against it.

Alternative Hypothesis ($\text{H}_1$ or $\text{H}_a$): the claim that there is an effect or a difference; it is supported when the data are inconsistent with $\text{H}_0$.

Test Statistic: a value computed from the sample (e.g., a t or F statistic) that measures how far the observed data depart from what $\text{H}_0$ predicts.

Significance Level ($\alpha$): the threshold probability of rejecting $\text{H}_0$ when it is actually true, commonly set to 0.05.

p-value: the probability, assuming $\text{H}_0$ is true, of observing a result at least as extreme as the one obtained from the sample.

Statistical Significance: a result is called statistically significant when the p-value is smaller than $\alpha$, in which case $\text{H}_0$ is rejected.

About Type 1 and Type 2 errors

A Type 1 error means rejecting $\text{H}_0$ when it is actually true (a false positive); a Type 2 error means failing to reject $\text{H}_0$ when it is actually false (a false negative).

Type 1 and type 2 errors. Table from Wikipedia

Hypothesis for Testing the Differences

The null hypothesis reflects the idea of “no effect” or “no difference.”

Hypothesis for testing differences between two means:

\begin{align}
\text{H}_0 &: \mu_0 = \mu_1 \\
\text{H}_1 &: \mu_0 \neq \mu_1
\end{align}

Hypothesis for testing differences between more than two means:

\begin{align}
\text{H}_0 &: \mu_0 = \mu_1 = \mu_2 = ... = \mu_i \\
\text{H}_1 &: \text{at least 2 of the } \mu_i \text{ are different}
\end{align}

The null hypothesis is something that we DO NOT NEED to prove, because it cannot be proven anyway.
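
To connect the two sets of hypotheses above with actual tests: a two-sample t-test addresses the two-mean case (as sketched earlier), while a one-way ANOVA addresses the more-than-two-means case. Below is a minimal sketch of the ANOVA on Iris petal lengths, again assuming the seaborn-bundled copy of the data:

```python
# Sketch: one-way ANOVA testing H0 that the mean petal length is equal
# across the three Iris species.
import seaborn as sns
from scipy import stats

iris = sns.load_dataset("iris")
groups = [g["petal_length"].to_numpy() for _, g in iris.groupby("species")]

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")

# A very small p-value leads us to reject H0: at least two species
# have different mean petal lengths.
```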

For Spatial Patterns Detection