Python Basics: Building a Decision Tree from Scratch

ithorizon, 6 months ago (10-19), #Backend Development

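I. Introduction to Decision Trees

A decision tree is a very popular machine learning algorithm that can be used for both classification and regression tasks. Learning a decision tree amounts to finding the best split point in the data, partitioning the dataset into subsets, and recursively building subtrees until a stopping condition is met. This article shows how to build a simple decision tree classifier from scratch in Python.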

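II. Environment Setup

First, make sure Python is installed on your computer. Next, install the following libraries:

numpy: numerical computing

scikit-learn: machine learning algorithms and tools

Install them with pip: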

pip install numpy scikit-learn

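III. Building the Decision Tree Algorithm

The steps for building the decision tree algorithm are as follows:

1. Define the dataset

To build a decision tree, we need a dataset. Here we use the simple Iris dataset, which contains 150 samples, each with 4 features and 1 label. For brevity, the code below writes out only a few rows; the remaining rows are elided and marked with # ...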


import numpy as np
# Define the dataset (only a few rows are written out)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              # ...
              [6.7, 3.0, 5.2, 2.3],
              [6.3, 2.5, 5.0, 1.9],
              [6.4, 2.8, 5.6, 2.1]])
y = np.array([0, 0, 0, # ...
              1, 1, 1])
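
The listing above writes out only a few rows. Since scikit-learn was installed in the setup step, one way to get all 150 samples is its bundled copy of the Iris data. The sketch below assumes that sklearn.datasets.load_iris is an acceptable data source for this tutorial; the names X_full and y_full are illustrative. Note that scikit-learn encodes the three species as 0, 1 and 2, which may differ from the hand-typed labels above.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of typing rows by hand.
X_full, y_full = load_iris(return_X_y=True)

print(X_full.shape)       # (150, 4): 150 samples, 4 features
print(np.unique(y_full))  # [0 1 2]: three species encoded as integers
```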

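2. Define the decision tree class

Next, we define a simple decision tree class with the following methods: a constructor, best-split selection, tree construction, and prediction.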


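class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # None means no depth limit
        self.tree = None            # set by fit()
    def _choose_best_split(self, X, y):
        # Delegates to find_best_split, which is defined in step 3
        return find_best_split(X, y)
    def _build_tree(self, X, y, depth=0):
        # Tree-building logic omitted here; see step 4
        pass
    def fit(self, X, y):
        # Store the root so that predict() can traverse the tree
        self.tree = self._build_tree(X, y)
    def predict(self, X):
        # Classify each sample independently
        predictions = []
        for sample in X:
            predictions.append(self._predict(sample))
        return predictions
    def _predict(self, sample):
        # Prediction logic omitted here; see step 5
        pass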

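3. Choose the best split point

There are many ways to choose the best split point; here we use Gini impurity as the criterion. The lower the Gini impurity, the purer the dataset. Each candidate split is scored by the impurity of the two subsets it produces, with each subset weighted by its share of the samples, so that a tiny pure subset cannot dominate the score.


def gini_impurity(y):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / counts.sum()
    return 1 - np.sum(probabilities ** 2)
def find_best_split(X, y):
    best_idx, best_value, best_score = None, None, float('inf')
    n_samples = len(y)
    for idx in range(X.shape[1]):
        for value in np.unique(X[:, idx]):
            left = X[:, idx] <= value
            n_left = left.sum()
            if n_left == n_samples:
                continue  # nothing would go right, so this is not a split
            # Weight each side's impurity by its share of the samples
            score = (n_left * gini_impurity(y[left])
                     + (n_samples - n_left) * gini_impurity(y[~left])) / n_samples
            if score < best_score:
                best_idx, best_value, best_score = idx, value, score
    return best_idx, best_value, best_score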
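
A few worked values may make the criterion more concrete. The sketch below recomputes Gini impurity for a pure set and a balanced two-class set, then scores one candidate split. It assumes the common CART-style convention of averaging the two child impurities weighted by subset size; the helper name gini and the chosen threshold are illustrative, and the feature values are the third column of the six rows written out in step 1.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = gini(np.array([0, 0, 0, 0]))   # one class: 1 - 1**2 = 0
mixed = gini(np.array([0, 0, 1, 1]))  # two balanced classes: 1 - 2 * 0.25 = 0.5

# Score the candidate split "feature <= 1.4" on the six rows from step 1.
y = np.array([0, 0, 0, 1, 1, 1])
feature = np.array([1.4, 1.4, 1.3, 5.2, 5.0, 5.6])
mask = feature <= 1.4
n = len(y)

# Size-weighted average of the two child impurities (assumed convention).
weighted = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / n

print(pure, mixed, weighted)
```

Both children are pure for this threshold, so the weighted score is zero, the lowest value the criterion can take.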

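4. Build the tree

The tree is built recursively. At each step we check whether the maximum depth has been reached or whether the dataset is already pure; in either case the node becomes a leaf that predicts the majority class.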


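def _build_tree(self, X, y, depth=0):
    # Majority class of the samples that reach this node
    # (NumPy arrays have no .mode() method, so count the labels instead)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    # Stop at the maximum depth or when the node is pure
    if depth == self.max_depth or gini_impurity(y) == 0:
        return majority
    idx, value, _ = self._choose_best_split(X, y)
    # Fall back to a leaf when no split separates the samples
    if idx is None or np.all(X[:, idx] <= value):
        return majority
    left = X[:, idx] <= value
    left_branch = self._build_tree(X[left], y[left], depth + 1)
    right_branch = self._build_tree(X[~left], y[~left], depth + 1)
    return {'idx': idx, 'value': value, 'left': left_branch, 'right': right_branch}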
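
The nested dictionary returned by _build_tree can be hard to read once it grows. The following sketch prints such a structure as indented text. The helper format_tree is hypothetical (it is not part of the class above), and example_tree is written by hand with illustrative thresholds rather than learned from data.

```python
def format_tree(node, indent=""):
    """Return an indented text view of a nested-dict tree (hypothetical helper)."""
    if not isinstance(node, dict):
        # Leaves are plain class labels
        return f"{indent}predict {node}\n"
    text = f"{indent}feature[{node['idx']}] <= {node['value']}?\n"
    text += format_tree(node['left'], indent + "  ")   # samples that pass the test
    text += format_tree(node['right'], indent + "  ")  # samples that fail the test
    return text

# Hand-written tree in the same shape that _build_tree returns (values are illustrative).
example_tree = {
    'idx': 2, 'value': 1.4,
    'left': 0,
    'right': {'idx': 3, 'value': 1.9, 'left': 1, 'right': 2},
}

print(format_tree(example_tree), end="")
# feature[2] <= 1.4?
#   predict 0
#   feature[3] <= 1.9?
#     predict 1
#     predict 2
```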

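5. Prediction

Finally, we implement the prediction method. It uses the constructed tree to classify a new sample by walking from the root down to a leaf.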


def _predict(self, sample):
    node = self.tree
    while isinstance(node, dict):
        if sample[node['idx']] <= node['value']:
            node = node['left']
        else:
            node = node['right']
    return node
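
To see the traversal without training anything, the sketch below runs the same loop on a hand-written one-split tree and records the path taken. The function name predict_one, the path bookkeeping, and the threshold in stump are illustrative assumptions, not part of the class above.

```python
def predict_one(tree, sample):
    """Walk a nested-dict tree and return (label, path taken)."""
    node, path = tree, []
    while isinstance(node, dict):
        go_left = sample[node['idx']] <= node['value']
        path.append((node['idx'], node['value'], 'left' if go_left else 'right'))
        node = node['left'] if go_left else node['right']
    return node, path

# Hand-written tree with a single split (illustrative values).
stump = {'idx': 2, 'value': 1.4, 'left': 0, 'right': 1}

label_a, path_a = predict_one(stump, [5.0, 3.6, 1.4, 0.2])
label_b, path_b = predict_one(stump, [6.5, 2.8, 4.6, 2.0])
print(label_a, path_a)  # 0 [(2, 1.4, 'left')]
print(label_b, path_b)  # 1 [(2, 1.4, 'right')]
```

A value equal to the threshold goes left, because the test is <= rather than <.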

IV. Using the Decision Tree for Classification

Now that we have a simple decision tree class, we can use it to classify samples from the Iris dataset.


# Create a decision tree instance
tree = DecisionTree(max_depth=3)
# Train the decision tree
tree.fit(X, y)
# Predict new samples
new_samples = np.array([[5.0, 3.6, 1.4, 0.2],
                         [6.5, 2.8, 4.6, 2.0]])
predictions = tree.predict(new_samples)
# Print the predictions
print(predictions)
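
Because the pieces are introduced step by step, it can help to see them assembled into one runnable script. The following is a minimal sketch of one reasonable way to do that, not a definitive implementation. It keeps the article's names (gini_impurity, find_best_split, DecisionTree) but makes a few choices of its own: splits are scored by size-weighted impurity, leaves hold the majority class, fit returns self, and predict returns a NumPy array. It trains only on the six rows written out in step 1; the elided rows are not reconstructed.

```python
import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def find_best_split(X, y):
    """Return (feature index, threshold, score) minimising weighted Gini impurity."""
    best = (None, None, float('inf'))
    n = len(y)
    for idx in range(X.shape[1]):
        for value in np.unique(X[:, idx]):
            mask = X[:, idx] <= value
            n_left = mask.sum()
            if n_left == n:  # everything on the left: not a real split
                continue
            score = (n_left * gini_impurity(y[mask])
                     + (n - n_left) * gini_impurity(y[~mask])) / n
            if score < best[2]:
                best = (idx, value, score)
    return best

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(np.asarray(X), np.asarray(y))
        return self

    def _build_tree(self, X, y, depth=0):
        values, counts = np.unique(y, return_counts=True)
        majority = values[np.argmax(counts)]
        # Leaf: depth limit reached or only one class left
        if depth == self.max_depth or len(values) == 1:
            return majority
        idx, value, _ = find_best_split(X, y)
        if idx is None:  # no threshold separates the samples
            return majority
        mask = X[:, idx] <= value
        return {'idx': idx, 'value': value,
                'left': self._build_tree(X[mask], y[mask], depth + 1),
                'right': self._build_tree(X[~mask], y[~mask], depth + 1)}

    def predict(self, X):
        return np.array([self._predict(sample) for sample in np.asarray(X)])

    def _predict(self, sample):
        node = self.tree
        while isinstance(node, dict):
            go_left = sample[node['idx']] <= node['value']
            node = node['left'] if go_left else node['right']
        return node

# Only the six rows written out in step 1.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [6.7, 3.0, 5.2, 2.3],
              [6.3, 2.5, 5.0, 1.9],
              [6.4, 2.8, 5.6, 2.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = DecisionTree(max_depth=3).fit(X, y)
new_samples = np.array([[5.0, 3.6, 1.4, 0.2],
                        [6.5, 2.8, 4.6, 2.0]])
print(model.predict(new_samples))  # [0 1]
```

With only six rows, a single split on the first feature already separates the two labels, so the learned tree has one internal node and two leaves.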

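V. Summary

This article showed how to build a simple decision tree from scratch in Python. We first defined a dataset, then created a decision tree class with methods for choosing the best split point, building the tree, and making predictions. Finally, we used the tree to classify new samples. This simple tree still leaves plenty of room for improvement, such as adding pruning strategies and better support for continuous-valued features. Even so, we hope this tutorial gives you a starting point.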
