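With SMS data at such an enormous scale, how to mine more meaningful information from the data to protect people from spam messages, and so ensure a better user experience, has become an urgent problem.
Experiment requirements:
- The task provides basic code for data loading, a baseline model, and model training.
- You need to complete the core model-building code and tune the model as well as you can.
- A single model inference must take no more than 10 seconds.
- Implement the task with logistic regression and a decision tree, and compare the two methods' similarities, differences, and performance.
You may use Python libraries such as Pandas, NumPy, and Sklearn for feature processing and the Sklearn framework to train the classifier; you may also write a deep learning model. Pay attention to the versions of the Python packages (libraries) you use.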
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline
import time
# 1. Load the data
# Assumes the data is stored in SMSSpamCollection.txt with tab-separated columns
data = pd.read_csv('SMSSpamCollection.txt', sep='\t', header=None, names=['label', 'message'])
# 2. Preprocess the data
# Map the labels to 0 and 1: 0 means ham, 1 means spam
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
# 3. Features and labels: the raw message text is TF-IDF vectorized inside the pipelines below
X = data['message']
y = data['label']
# 4. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5. Build the models with logistic regression and a decision tree
# Create the TF-IDF + logistic regression pipeline
lr_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# Create the TF-IDF + decision tree pipeline
dt_pipeline = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=42))
# Train the logistic regression model and record the training time
start_time = time.time()
lr_pipeline.fit(X_train, y_train)
lr_train_time = time.time() - start_time
# Train the decision tree model and record the training time
start_time = time.time()
dt_pipeline.fit(X_train, y_train)
dt_train_time = time.time() - start_time
# 6. Predict and evaluate
# Logistic regression predictions
lr_predictions = lr_pipeline.predict(X_test)
# Decision tree predictions
dt_predictions = dt_pipeline.predict(X_test)
# Print the evaluation reports
print("Logistic Regression Performance:")
print(classification_report(y_test, lr_predictions))
print(f"Logistic Regression Training Time: {lr_train_time:.4f} seconds")
print("Decision Tree Performance:")
print(classification_report(y_test, dt_predictions))
print(f"Decision Tree Training Time: {dt_train_time:.4f} seconds")
# 7. Compare inference time
start_time = time.time()
lr_pipeline.predict(X_test[:5])  # predict the first 5 test messages
lr_inference_time = time.time() - start_time
start_time = time.time()
dt_pipeline.predict(X_test[:5])  # predict the first 5 test messages
dt_inference_time = time.time() - start_time
print(f"Logistic Regression Inference Time (5 samples): {lr_inference_time:.4f} seconds")
print(f"Decision Tree Inference Time (5 samples): {dt_inference_time:.4f} seconds")
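The requirements ask for the model to be tuned, and the script imports GridSearchCV without using it. Below is a minimal sketch of one way to tune both pipelines. It assumes a small made-up corpus in place of SMSSpamCollection.txt so that it runs on its own. The helper `tune`, the parameter grids, and the choice of F1 as the scoring metric are illustrative assumptions, not part of the original task code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in corpus (hypothetical, NOT the real SMS dataset).
ham = [
    "are we still meeting for lunch today",
    "i will call you when i get home",
    "can you pick up some milk on the way",
    "the meeting has moved to three pm",
    "happy birthday hope you have a great day",
    "see you at the station at six",
    "thanks for dinner last night it was lovely",
    "did you finish the report for monday",
    "running late will be there in ten minutes",
    "let me know when you land safely",
    "do you want to watch a film tonight",
    "mum says dinner is ready come home",
]
spam = [
    "win a free prize now call this number",
    "urgent you have won cash claim now",
    "free entry to win tickets text now",
    "congratulations you won a free holiday call now",
    "claim your free cash prize today",
    "exclusive offer win free phone text win",
    "you have been selected for a cash prize",
    "call now to claim your free gift",
    "free ringtone text now to claim",
    "winner claim your prize call now urgent",
    "cheap loans approved call now free quote",
    "text win to claim your free voucher now",
]
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)  # 0 = ham, 1 = spam


def tune(classifier, param_grid, texts, labels):
    """Grid-search a TF-IDF + classifier pipeline with 3-fold CV, scored by F1 on spam."""
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])
    search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
    search.fit(texts, labels)
    return search


# Parameter names follow the "<step name>__<parameter>" convention of Pipeline.
searches = {
    "logistic_regression": tune(
        LogisticRegression(max_iter=1000),
        {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]},
        texts,
        labels,
    ),
    "decision_tree": tune(
        DecisionTreeClassifier(random_state=42),
        {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__max_depth": [3, 5, None]},
        texts,
        labels,
    ),
}

for name, search in searches.items():
    print(f"{name}: best CV F1 = {search.best_score_:.3f}, params = {search.best_params_}")
```

On the real data, `X_train` and `y_train` would take the place of `texts` and `labels`, and each fitted search object could then be evaluated on `X_test` in the same way as the untuned pipelines. F1 is used here instead of accuracy on the assumption that spam is the minority class, where accuracy alone can look good even if most spam is missed.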