 ### ID3

ID3

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step.

Most generally ID3 is only used for classification problems with nominal features only.

Metrics in ID3

ID3 algorithm selects the best feature at each step while building a Decision tree.

Information Gain calculates the reduction in the entropy and measures how well a given feature separates or classifies the target classes. The feature with the highest Information Gain is selected as the best one.

Entropy is the measure of disorder and the Entropy of a dataset is the measure of disorder in the target feature of the dataset.

In the case of binary classification (where the target column has only two types of classes) entropy is 0 if all values in the target column are homogenous(similar) and will be 1 if the target column has equal number values for both the classes.

We denote our dataset as S, entropy is calculated as:

Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
Where  n is the total number of classes in the target column (in our case n = 2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target column” to the “total number of rows” in the dataset.

Information Gain for a feature column A is calculated as:

IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))
where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the number of rows in Sᵥ and likewise |S| is the number of rows in S.

ID3 Steps

1. Calculate the Information Gain of each feature.

2. Considering that all rows don’t belong to the same class, split the dataset S into subsets using the feature for which the Information Gain is maximum.

3. Make a decision tree node using the feature with the maximum Information gain.

4. If all rows belong to the same class, make the current node as a leaf node with the class as its label.

5. Repeat for the remaining features until we run out of all features, or the decision tree has all leaf nodes.

Example:

+----+-------+-------+------------------+----------+

| ID | Fever | Cough | Breathing issues | Infected |

+----+-------+-------+------------------+----------+

| 1 | NO | NO | NO | NO |

+----+-------+-------+------------------+----------+

| 2 | YES | YES | YES | YES |

+----+-------+-------+------------------+----------+

| 3 | YES | YES | NO | NO |

+----+-------+-------+------------------+----------+

| 4 | YES | NO | YES | YES |

+----+-------+-------+------------------+----------+

| 5 | YES | YES | YES | YES |

+----+-------+-------+------------------+----------+

| 6 | NO | YES | NO | NO |

+----+-------+-------+------------------+----------+

| 7 | YES | NO | YES | YES |

+----+-------+-------+------------------+----------+

| 8 | YES | NO | YES | YES |

+----+-------+-------+------------------+----------+

| 9 | NO | YES | YES | YES |

+----+-------+-------+------------------+----------+

| 10 | YES | YES | NO | YES |

+----+-------+-------+------------------+----------+

| 11 | NO | YES | NO | NO |

+----+-------+-------+------------------+----------+

| 12 | NO | YES | YES | YES |

+----+-------+-------+------------------+----------+

| 13 | NO | YES | YES | NO |

+----+-------+-------+------------------+----------+

| 14 | YES | YES | NO | NO

From the total of 14 rows in our dataset S, there are 8 rows with the target value YES and 6 rows with the target value NO. The entropy of S is calculated as:

Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99