天池智能运维比赛-故障预测

问题描述

比赛官网地址
给定一段时间的内存系统日志,内存故障地址数据以及故障标签数据,参赛者应提出自己的解决方案,以预测每台服务器是否会发生DRAM故障。具体来说,参赛者需要从组委会提供的数据中挖掘出和DRAM故障相关的特征,并采用合适的机器学习算法予以训练,最终得到可以预测DRAM故障的最优模型。数据处理方法和算法不限,但选手应该综合考虑算法的效果和复杂度,以构建相对高效的解决方案

数据描述

memory_sample_kernel_log_*.csv是从Linux内核日志中收集的与DRAM故障相关的信息,共28列。其中,24列是布尔值。每个布尔列代表一个故障文本模板,其中True表示该故障文本模板出现在内核日志中。请注意,这里提供的模板并不保证都和DRAM故障相关,参赛者应自行判断选用哪些模板信息。下表仅列出除模版外的四列信息,每列的含义如下:

列名    字段类型    描述
serial_number    string    服务器代号
manufacturer    integer    server manufacturer id
vendor    integer    memory vendor id
collect_time    string    日志上报时间

memory_sample_failure_tag_*.csv为故障标签表,共5列。每列含义如下:

列名    字段类型    描述
serial_number    string    服务器代号
manufacturer    integer    server manufacturer id
vendor    integer    memory vendor id
failure_time    string    内存的故障时间,与上述3表里的collect_time不同
tag    integer    内存的故障类型代号

初赛训练集数据范围20190101 至 20190531。初赛A/B榜的测试集为memory_sample_mce_log_a/b.csv, memory_sample_address_log_a/b.csv, memory_sample_kernel_log_a/b.csv, A榜数据范围为20190601~20190630整月的日志数据,B榜数据范围为20190701~20190731整月的日志数据,选手根据测试集数据按时间维度,预测服务器是否会在未来7天内发生内存故障。初赛测试集不提供故障label。

复赛阶段,测试集的数据格式和初赛阶段相同,测试集数据范围为20190801~20190810,但是测试集数据不会提供给参赛选手。选手需要在docker代码中从指定的数据集目录中读取测试集内容,进行特征工程和模型预测,最后输出的格式也有变化,输出预测未来7天会发生内存故障的机器集合,且附带预测时间间隔(docker代码中需包含本地训练好的模型,预测时间间隔具体含义见评价指标(复赛))。

评价指标

在这里插入图片描述

代码

仅供参考,输出预测结果成绩约47分左右,A榜排名为44/1350。
在这里插入图片描述

01 - 导入库

1
2
3
4
5
6
7
8
9
10
11
import os
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import torch.utils.data as Data
from torch.autograd import Variable
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

02- 数据预处理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#路径
kernel_log_data_path = 'memory_sample_kernel_log_round1_a_train.csv'# 内核日志路径
failure_tag_data_path = 'memory_sample_failure_tag_round1_a_train.csv'# 故障标签表路径
PARENT_FOLDER = 'data' # 数据的相对路径目录

# 计算每个agg_time区间的和
# path:内核日志路径
# agg_time:聚合时间粒度+AGG_UNIT类型
def etl(path, agg_time):
data = pd.read_csv(os.path.join(PARENT_FOLDER, path))#读取内核日志(os.path.join为路径组合方法)
#print(data)
data['collect_time'] = pd.to_datetime(data['collect_time']).dt.ceil(agg_time)# 将数据转为日期类型,向上取整
group_data = data.groupby(['serial_number','collect_time'],as_index=False).agg('sum')
return group_data

# 设置训练集聚合时间粒度 h/min/s
AGG_VALUE = 5
AGG_UNIT = 'min'
AGG_TIME= str(AGG_VALUE)+AGG_UNIT

训练数据准备

1
etl(kernel_log_data_path, AGG_TIME)

serial_number collect_time 1_hwerr_f 1_hwerr_e 2_hwerr_c 2_sel 3_hwerr_n 2_hwerr_s 3_hwerr_m 1_hwerr_st ... 3_hwerr_r _hwerr_cd 3_sup_mce_note 3_cmci_sub 3_cmci_det 3_hwerr_pi 3_hwerr_o 3_hwerr_mce_l manufacturer vendor
0 server_1 2019-01-01 00:05:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 4 0.0
1 server_1 2019-01-01 00:10:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2 0.0
2 server_1 2019-01-01 00:20:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 4 0.0
3 server_1 2019-01-01 00:25:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 4 0.0
4 server_1 2019-01-01 00:30:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 6 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
490464 server_9998 2019-04-19 21:50:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6 6.0
490465 server_9998 2019-04-20 22:20:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6 6.0
490466 server_9998 2019-04-23 07:40:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4 4.0
490467 server_9998 2019-04-23 08:05:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6 6.0
490468 server_9998 2019-04-23 15:50:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6 6.0

490469 rows × 28 columns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
# 目前仅使用kernel数据
group_min = etl(kernel_log_data_path, AGG_TIME)
failure_tag = pd.read_csv(os.path.join(PARENT_FOLDER,failure_tag_data_path))
failure_tag['failure_time']= pd.to_datetime(failure_tag['failure_time'])

# 为数据打标
merged_data = pd.merge(group_min,failure_tag[['serial_number','failure_time']],how='left',on=['serial_number'])
merged_data['failure_tag']=(merged_data['failure_time'].notnull()) & ((merged_data['failure_time']
-merged_data['collect_time']).dt.seconds <= AGG_VALUE*60)
merged_data['failure_tag']= merged_data['failure_tag']+0
feature_data = merged_data.drop(['serial_number', 'collect_time','manufacturer','vendor','failure_time'], axis=1)

# 负样本下采样
sample_0 = feature_data[feature_data['failure_tag']==0].sample(frac=0.05)#返回的比例
sample = sample_0.append(feature_data[feature_data['failure_tag']==1])

# 转为torch
X_train = torch.from_numpy(sample.iloc[:,:-1].values).type(torch.FloatTensor)
y_train = torch.from_numpy(sample['failure_tag'].values).type(torch.LongTensor)

# 查看训练数据结构
X_train.shape,y_train.shape
(torch.Size([24851, 24]), torch.Size([24851]))

A榜预测数据准备

1
2
3
4
5
6
7
# 测试数据准备
group_data_test = etl('memory_sample_kernel_log_round1_a_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number','collect_time']])
group_min_test = group_data_test.drop(['serial_number', 'collect_time','manufacturer','vendor'], axis=1)
# 转为torch
X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
torch.Size([115629, 24])

B榜预测数据准备

1
2
3
4
5
6
7
# 测试数据准备
group_data_test = etl('memory_sample_kernel_log_round1_b1_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number','collect_time']])
group_min_test = group_data_test.drop(['serial_number', 'collect_time','manufacturer','vendor'], axis=1)
# 转为torch
X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
torch.Size([210672, 24])
1
2
3
4
5
6
7
8
9
#预测数据集
torch_dataset = Data.TensorDataset(X_test)
testloader = Data.DataLoader(
dataset=torch_dataset,
batch_size=X_test.size(0),
shuffle=False,
drop_last=False,
num_workers=0
)

构建Dataloader

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#训练集
# # 转换成torch可以识别的Dataset
torch_dataset = Data.TensorDataset(X_train,y_train)
#将dataset 放入DataLoader
trainloader = Data.DataLoader(
dataset=torch_dataset,
batch_size=1242,
shuffle=True,
drop_last=True,
num_workers=0
)


#预测数据集
torch_dataset = Data.TensorDataset(X_test)
testloader = Data.DataLoader(
dataset=torch_dataset,
batch_size=X_test.size(0),
shuffle=False,
drop_last=False,
num_workers=0
)

03 - 配置网络

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
class Model(nn.Module):
def __init__(self,D_in,H,D_out):
super(Model,self).__init__()
self.hidden1 = torch.nn.Linear(D_in,H)
self.hidden2 = torch.nn.Linear(H,H)
self.predict = torch.nn.Linear(H,D_out)

def forward(self, input):
out = self.hidden1(input)
out = F.relu(out)
out = self.hidden2(out)
out = F.relu(out)
out = self.predict(out)
# out = F.softmax(out,dim=1)#行概率归一
return out

#定义模型传参
model = Model(24,15,2)
print(model)

# 使用GPU
if torch.cuda.is_available():
model.cuda()
print("GPU")
else:
print("CPU")

#定义参数
epochs = 2000
learn_rate = 0.1
momentum = 0.5

#定义损失函数
#loss_fn = torch.nn.MSELoss(reduction='sum')
loss_fn = torch.nn.CrossEntropyLoss()

#定义优化器
optimizer = torch.optim.Adam(model.parameters(),lr=learn_rate)
#optimizer = torch.optim.SGD(model.parameters(),lr=learn_rate,momentum =momentum)


Model(
  (hidden1): Linear(in_features=24, out_features=15, bias=True)
  (hidden2): Linear(in_features=15, out_features=15, bias=True)
  (predict): Linear(in_features=15, out_features=2, bias=True)
)
GPU

04 - 训练

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
%%time

for epoch in tqdm(range(epochs)):

model.train()
for i, (X_train,y_train) in enumerate(trainloader):
#数据放入GoU
if torch.cuda.is_available():
X_train = Variable(X_train).cuda()
y_train = Variable(y_train).cuda()

#前向传播
out = model(X_train)

#执行计算损失函数
loss = loss_fn(out,y_train)
# loss = F.nll_loss(out,y)

#执行梯度归零
optimizer.zero_grad()

#执行反向传播
loss.backward()

#执行优化器
optimizer.step()

#输出误差 精度
if i%1 == 0:
print("Train Epoch: {}, Iteration {}, Loss: {}".format(epoch+1,i,loss.item()))

#softmax行提取(1为DRAM故障)
# out = torch.max(out,dim = 1)[1]
pre = torch.max(F.softmax(out),dim = 1)[1]

05 -预测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#读取存储的模型
#model.load_state_dict(torch.load('5b2485.pt'))```

​```python
model.eval()
for (X_test,) in testloader:
if torch.cuda.is_available():
X_test = Variable(X_test).cuda()
out = model(X_test)
#pre = torch.max(out,1)[1]
pre = torch.max(F.softmax(out), 1)[1]
print("data ok")

for i in range(X_test.size(0)):
if pre[i]==1:
print(pre[i])

保存预测结果

1
2
3
4
5
6
7
b = pre.cpu().numpy()
group_min_sn_test['predict']=b
# 输出预测结果
group_min_sn_test=group_min_sn_test[group_min_sn_test['predict']==1]
group_min_sn_res = group_min_sn_test.drop('predict',axis=1)
group_min_sn_res.to_csv('memory_predit_res_nn Bpre.csv', header=False, index=False)
print("Save OK")

保存模型

1
#torch.save(model.state_dict(), '5b2485.pt')