Problem description (competition website)

Given a period of memory system logs, memory failure address data, and failure label data, participants are asked to propose their own solution for predicting whether each server will experience a DRAM failure. Specifically, participants need to mine DRAM-failure-related features from the data provided by the organizers and train them with a suitable machine learning algorithm to obtain the best model for predicting DRAM failures. Data processing methods and algorithms are unrestricted, but contestants should weigh effectiveness against complexity to build a reasonably efficient solution.
Data description

memory_sample_kernel_log_*.csv contains DRAM-failure-related information collected from the Linux kernel log, with 28 columns in total. Of these, 24 columns are boolean; each boolean column corresponds to a failure text template, and True means that template appeared in the kernel log. Note that the templates provided are not guaranteed to all be related to DRAM failures; participants should decide for themselves which templates to use. Besides the templates, there are four additional columns: serial_number, collect_time, manufacturer, and vendor (these are the columns the baseline code below works with).
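To get a quick feel for those columns, the following sketch (file path and column split assumed from the baseline code later in this post) loads the kernel log and counts how often each template fires:

```python
import pandas as pd

# Minimal sketch, assuming the round-1 training file used by the baseline below.
kernel_log = pd.read_csv('data/memory_sample_kernel_log_round1_a_train.csv')

# Separate the 24 boolean template columns from the four non-template columns.
meta_cols = ['serial_number', 'collect_time', 'manufacturer', 'vendor']
template_cols = [c for c in kernel_log.columns if c not in meta_cols]

print(len(template_cols))  # expected: 24 template columns
# How often each failure text template appears in the log.
print(kernel_log[template_cols].sum().sort_values(ascending=False))
```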
memory_sample_failure_tag_*.csv is the failure label table, with 5 columns; the baseline below uses only serial_number and failure_time from it.
The preliminary-round training set covers 20190101 to 20190531. The test sets for the preliminary A/B leaderboards are memory_sample_mce_log_a/b.csv, memory_sample_address_log_a/b.csv, and memory_sample_kernel_log_a/b.csv. Leaderboard A covers the full month of logs from 20190601 to 20190630, and leaderboard B covers 20190701 to 20190731. Based on the test data, contestants predict, over time, whether each server will experience a memory failure within the next 7 days. No failure labels are provided for the preliminary test sets.
In the semi-final, the test set has the same format as in the preliminary round and covers 20190801 to 20190810, but it is not released to contestants. Instead, the code submitted as a Docker image must read the test data from a designated dataset directory, perform feature engineering and model inference, and produce output in a changed format: the set of machines predicted to experience a memory failure within the next 7 days, together with the prediction time interval (the Docker image must include a locally trained model; see the semi-final evaluation metric for the exact meaning of the prediction time interval).
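As a rough illustration of that workflow (my own sketch: every path, file name, and the output format below are assumptions, not the official specification), a semi-final entry script might look like this:

```python
import os
import pandas as pd
import torch

# Assumed layout: the real mount point, input file names, and output format
# are defined by the competition platform, not by this sketch.
DATA_DIR = os.environ.get('DATA_DIR', '/tcdata')
MODEL_PATH = 'model_scripted.pt'   # TorchScript model exported locally with torch.jit.save
OUTPUT_PATH = 'result.csv'

def main():
    # Read the hidden test kernel log from the designated dataset directory.
    data = pd.read_csv(os.path.join(DATA_DIR, 'memory_sample_kernel_log.csv'))

    # Re-apply the same feature engineering used during training (5-minute aggregation).
    data['collect_time'] = pd.to_datetime(data['collect_time']).dt.ceil('5min')
    grouped = data.groupby(['serial_number', 'collect_time'], as_index=False).agg('sum')
    meta = grouped[['serial_number', 'collect_time']]
    feats = grouped.drop(['serial_number', 'collect_time', 'manufacturer', 'vendor'], axis=1)

    # Load the locally trained model packaged inside the image and run inference.
    model = torch.jit.load(MODEL_PATH)
    model.eval()
    with torch.no_grad():
        pred = model(torch.from_numpy(feats.values).float()).argmax(dim=1).numpy()

    # Output the servers predicted to fail within 7 days, together with the timestamp
    # of the triggering sample, from which the prediction time interval is derived.
    meta[pred == 1].to_csv(OUTPUT_PATH, header=False, index=False)

if __name__ == '__main__':
    main()
```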
Evaluation metric
Code

For reference only: the predictions produced by this baseline score about 47 points, ranking 44/1350 on leaderboard A.
01 - Import libraries

```python
import os
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import torch.utils.data as Data
from torch.autograd import Variable
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
```
02 - Data preprocessing

```python
kernel_log_data_path = 'memory_sample_kernel_log_round1_a_train.csv'
failure_tag_data_path = 'memory_sample_failure_tag_round1_a_train.csv'
PARENT_FOLDER = 'data'

def etl(path, agg_time):
    """Aggregate the kernel log per server over fixed time windows."""
    data = pd.read_csv(os.path.join(PARENT_FOLDER, path))
    # Round collect_time up to the end of its aggregation window (e.g. 5 minutes).
    data['collect_time'] = pd.to_datetime(data['collect_time']).dt.ceil(agg_time)
    # Sum the template counts within each (serial_number, window) group.
    group_data = data.groupby(['serial_number', 'collect_time'], as_index=False).agg('sum')
    return group_data

AGG_VALUE = 5
AGG_UNIT = 'min'
AGG_TIME = str(AGG_VALUE) + AGG_UNIT
```
Training data preparation

```python
etl(kernel_log_data_path, AGG_TIME)
```
|        | serial_number | collect_time        | 1_hwerr_f | 1_hwerr_e | 2_hwerr_c | 2_sel | 3_hwerr_n | 2_hwerr_s | 3_hwerr_m | 1_hwerr_st | ... | 3_hwerr_r | _hwerr_cd | 3_sup_mce_note | 3_cmci_sub | 3_cmci_det | 3_hwerr_pi | 3_hwerr_o | 3_hwerr_mce_l | manufacturer | vendor |
| ------ | ------------- | ------------------- | --------- | --------- | --------- | ----- | --------- | --------- | --------- | ---------- | --- | --------- | --------- | -------------- | ---------- | ---------- | ---------- | --------- | ------------- | ------------ | ------ |
| 0      | server_1      | 2019-01-01 00:05:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 0.0 |
| 1      | server_1      | 2019-01-01 00:10:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 0.0 |
| 2      | server_1      | 2019-01-01 00:20:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 0.0 |
| 3      | server_1      | 2019-01-01 00:25:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 0.0 |
| 4      | server_1      | 2019-01-01 00:30:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 0.0 |
| ...    | ...           | ...                 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 490464 | server_9998   | 2019-04-19 21:50:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6.0 |
| 490465 | server_9998   | 2019-04-20 22:20:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6.0 |
| 490466 | server_9998   | 2019-04-23 07:40:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 4.0 |
| 490467 | server_9998   | 2019-04-23 08:05:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6.0 |
| 490468 | server_9998   | 2019-04-23 15:50:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6.0 |

490469 rows × 28 columns
```python
group_min = etl(kernel_log_data_path, AGG_TIME)

failure_tag = pd.read_csv(os.path.join(PARENT_FOLDER, failure_tag_data_path))
failure_tag['failure_time'] = pd.to_datetime(failure_tag['failure_time'])

# Attach each server's failure time to its aggregated log windows.
merged_data = pd.merge(group_min, failure_tag[['serial_number', 'failure_time']],
                       how='left', on=['serial_number'])

# Label a window as positive when a failure is recorded within AGG_VALUE minutes of it.
merged_data['failure_tag'] = (merged_data['failure_time'].notnull()) & \
    ((merged_data['failure_time'] - merged_data['collect_time']).dt.seconds <= AGG_VALUE * 60)
merged_data['failure_tag'] = merged_data['failure_tag'] + 0

feature_data = merged_data.drop(
    ['serial_number', 'collect_time', 'manufacturer', 'vendor', 'failure_time'], axis=1)

# Keep 5% of the negative windows plus all positive windows to reduce the class imbalance.
sample_0 = feature_data[feature_data['failure_tag'] == 0].sample(frac=0.05)
sample = pd.concat([sample_0, feature_data[feature_data['failure_tag'] == 1]])

X_train = torch.from_numpy(sample.iloc[:, :-1].values).type(torch.FloatTensor)
y_train = torch.from_numpy(sample['failure_tag'].values).type(torch.LongTensor)
X_train.shape, y_train.shape
```
(torch.Size([24851, 24]), torch.Size([24851]))
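One caveat worth flagging: `.dt.seconds` returns only the seconds component of the timedelta (the day component is dropped), so the condition above does not strictly mean "within 5 minutes of the failure"; and in any case the task asks about failures within the next 7 days. A hedged sketch of a wider labeling window (my own variant, not part of the original baseline) could use `dt.total_seconds()`:

```python
# Sketch of an alternative label: positive if the failure occurs within the
# 7 days following the window (not part of the original baseline above).
delta = merged_data['failure_time'] - merged_data['collect_time']
merged_data['failure_tag_7d'] = (
    merged_data['failure_time'].notnull()
    & (delta.dt.total_seconds() >= 0)
    & (delta <= pd.Timedelta(days=7))
).astype(int)
```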
Leaderboard A prediction data preparation

```python
group_data_test = etl('memory_sample_kernel_log_round1_a_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number', 'collect_time']])
group_min_test = group_data_test.drop(
    ['serial_number', 'collect_time', 'manufacturer', 'vendor'], axis=1)

X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
```
torch.Size([115629, 24])
Leaderboard B prediction data preparation

```python
group_data_test = etl('memory_sample_kernel_log_round1_b1_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number', 'collect_time']])
group_min_test = group_data_test.drop(
    ['serial_number', 'collect_time', 'manufacturer', 'vendor'], axis=1)

X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
```
torch.Size([210672, 24])
```python
torch_dataset = Data.TensorDataset(X_test)
testloader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=X_test.size(0),
    shuffle=False,
    drop_last=False,
    num_workers=0
)
```
Build the DataLoaders

```python
torch_dataset = Data.TensorDataset(X_train, y_train)
trainloader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=1242,
    shuffle=True,
    drop_last=True,
    num_workers=0
)

torch_dataset = Data.TensorDataset(X_test)
testloader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=X_test.size(0),   # a single batch holding the whole test set
    shuffle=False,
    drop_last=False,
    num_workers=0
)
```
03 - Configure the network

```python
class Model(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Model, self).__init__()
        self.hidden1 = torch.nn.Linear(D_in, H)
        self.hidden2 = torch.nn.Linear(H, H)
        self.predict = torch.nn.Linear(H, D_out)

    def forward(self, input):
        out = self.hidden1(input)
        out = F.relu(out)
        out = self.hidden2(out)
        out = F.relu(out)
        out = self.predict(out)
        return out

model = Model(24, 15, 2)
print(model)

if torch.cuda.is_available():
    model.cuda()
    print("GPU")
else:
    print("CPU")

epochs = 2000
learn_rate = 0.1
momentum = 0.5  # unused: the Adam optimizer below does not take this value
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)
```
Model(
(hidden1): Linear(in_features=24, out_features=15, bias=True)
(hidden2): Linear(in_features=15, out_features=15, bias=True)
(predict): Linear(in_features=15, out_features=2, bias=True)
)
GPU
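Even after keeping only 5% of the negative windows, positives remain rare, so one optional tweak (my own suggestion, not part of this baseline) is to weight the loss toward the positive class:

```python
# Optional, illustrative tweak: class-weighted loss to counter the remaining imbalance.
# The weights below are placeholders, not tuned values.
class_weights = torch.tensor([1.0, 10.0])
if torch.cuda.is_available():
    class_weights = class_weights.cuda()
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
```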
04 - Training

```python
%%time
for epoch in tqdm(range(epochs)):
    model.train()
    for i, (X_batch, y_batch) in enumerate(trainloader):  # mini-batches from the trainloader
        if torch.cuda.is_available():
            X_batch = Variable(X_batch).cuda()
            y_batch = Variable(y_batch).cuda()

        out = model(X_batch)
        loss = loss_fn(out, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 1 == 0:
            print("Train Epoch: {}, Iteration {}, Loss: {}".format(epoch + 1, i, loss.item()))

    # Class predictions for the last batch of the epoch.
    pre = torch.max(F.softmax(out, dim=1), dim=1)[1]
```
05 - Prediction

```python
model.eval()
for (X_test,) in testloader:
    if torch.cuda.is_available():
        X_test = Variable(X_test).cuda()
    out = model(X_test)
    pre = torch.max(F.softmax(out, dim=1), 1)[1]
    print("data ok")

for i in range(X_test.size(0)):
    if pre[i] == 1:
        print(pre[i])
```
Save the prediction results
```python
b = pre.cpu().numpy()
group_min_sn_test['predict'] = b
# Keep only the windows predicted as failures, then drop the helper column.
group_min_sn_test = group_min_sn_test[group_min_sn_test['predict'] == 1]
group_min_sn_res = group_min_sn_test.drop('predict', axis=1)
group_min_sn_res.to_csv('memory_predit_res_nn Bpre.csv', header=False, index=False)
print("Save OK")
```
Save the model
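A minimal sketch of this step using the standard PyTorch state-dict APIs (the file name model.pt is my assumption, not one from the competition):

```python
# Persist the trained weights so they can be reused later.
torch.save(model.state_dict(), 'model.pt')

# To reuse the model (e.g. inside the semi-final Docker image), rebuild the
# architecture and load the saved weights.
model_restored = Model(24, 15, 2)
model_restored.load_state_dict(torch.load('model.pt'))
model_restored.eval()
```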