Problem description See the competition's official website for the full problem statement.
Data description memory_sample_kernel_log_*.csv contains DRAM-failure-related information collected from the Linux kernel log, 28 columns in total. Of these, 24 are boolean columns; each boolean column represents a fault text template, where True means that template appeared in the kernel log. Note that the templates provided are not guaranteed to be DRAM-related; contestants should judge for themselves which templates to use. The table below lists only the four non-template columns (serial_number, collect_time, manufacturer, vendor); the meaning of each is as follows:
memory_sample_failure_tag_*.csv is the failure label table, with 5 columns. The meaning of each column is as follows:
The preliminary-round training set covers 20190101 to 20190531. The preliminary A/B leaderboard test sets are memory_sample_mce_log_a/b.csv, memory_sample_address_log_a/b.csv, and memory_sample_kernel_log_a/b.csv; the A-board data cover the full month 20190601–20190630 and the B-board data cover 20190701–20190731. Based on the test data, contestants predict along the time dimension whether each server will suffer a memory failure within the next 7 days. The preliminary test sets provide no failure labels.
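The "fails within the next 7 days" labeling rule above can be sketched on toy data. Everything below (server names, timestamps, column names other than serial_number/collect_time/failure_time) is illustrative, not from the real dataset:

```python
import pandas as pd

# Toy observations and failure records: an observation is a positive example
# if the same server fails within the following 7 days.
obs = pd.DataFrame({
    "serial_number": ["server_1", "server_2"],
    "collect_time": pd.to_datetime(["2019-06-01 00:05:00", "2019-06-03 12:00:00"]),
})
failures = pd.DataFrame({
    "serial_number": ["server_1"],
    "failure_time": pd.to_datetime(["2019-06-05 08:00:00"]),
})

merged = obs.merge(failures, how="left", on="serial_number")
delta = merged["failure_time"] - merged["collect_time"]
# NaT comparisons are False, so servers that never fail get label 0.
merged["label"] = ((delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(days=7))).astype(int)
print(merged["label"].tolist())  # → [1, 0]
```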
In the semifinal, the test data format is the same as in the preliminary round and the test set covers 20190801–20190810, but the data are not handed to contestants. The submitted docker code must read the test set from a designated data directory, perform feature engineering and model prediction, and the output format also changes: the submission is the set of machines predicted to fail within the next 7 days, together with a prediction time interval (the docker image must contain the locally trained model; see the semifinal evaluation metric for the exact meaning of the prediction interval).
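Reading the hidden test set from a mounted directory might look like the sketch below. The directory path, environment variable, and file-name pattern are assumptions for illustration; the real paths are specified by the competition's docker environment:

```python
import os
import glob
import pandas as pd

# Hypothetical mount point; the real one comes from the competition docs.
TEST_DATA_DIR = os.environ.get("TEST_DATA_DIR", "/tcdata")

def load_test_tables(data_dir):
    """Read every kernel-log CSV found under data_dir into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(data_dir, "memory_sample_kernel_log*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```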
Evaluation metric
Code For reference only; this solution scores about 47 points, ranking 44/1350 on the A board.
01 - Import libraries

```python
import os
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import torch.utils.data as Data
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
```
02 - Data preprocessing

```python
kernel_log_data_path = 'memory_sample_kernel_log_round1_a_train.csv'
failure_tag_data_path = 'memory_sample_failure_tag_round1_a_train.csv'
PARENT_FOLDER = 'data'

def etl(path, agg_time):
    """Read a kernel-log CSV, ceil timestamps to the aggregation window,
    and sum template counts per (server, window)."""
    data = pd.read_csv(os.path.join(PARENT_FOLDER, path))
    data['collect_time'] = pd.to_datetime(data['collect_time']).dt.ceil(agg_time)
    group_data = data.groupby(['serial_number', 'collect_time'], as_index=False).agg('sum')
    return group_data

AGG_VALUE = 5
AGG_UNIT = 'min'
AGG_TIME = str(AGG_VALUE) + AGG_UNIT  # '5min'
```
Training data preparation

```python
etl(kernel_log_data_path, AGG_TIME)
```
  
    
```
        serial_number         collect_time  1_hwerr_f  1_hwerr_e  ...  3_hwerr_mce_l  manufacturer  vendor
0            server_1  2019-01-01 00:05:00        0.0        0.0  ...            0.0             4     0.0
1            server_1  2019-01-01 00:10:00        0.0        0.0  ...            0.0             2     0.0
2            server_1  2019-01-01 00:20:00        0.0        0.0  ...            0.0             4     0.0
3            server_1  2019-01-01 00:25:00        0.0        0.0  ...            0.0             4     0.0
4            server_1  2019-01-01 00:30:00        0.0        0.0  ...            0.0             6     0.0
...               ...                  ...        ...        ...  ...            ...           ...     ...
490464    server_9998  2019-04-19 21:50:00        0.0        0.0  ...            0.0             6     6.0
490465    server_9998  2019-04-20 22:20:00        0.0        0.0  ...            0.0             6     6.0
490466    server_9998  2019-04-23 07:40:00        0.0        0.0  ...            0.0             4     4.0
490467    server_9998  2019-04-23 08:05:00        0.0        0.0  ...            0.0             6     6.0
490468    server_9998  2019-04-23 15:50:00        0.0        0.0  ...            0.0             6     6.0

490469 rows × 28 columns
```
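The core of the `etl` step above, rounding each `collect_time` up to a 5-minute boundary and summing template counts per (server, window), can be checked on toy data. The rows and the single template column here are made up for illustration:

```python
import pandas as pd

# Toy log: the first two events fall into the same 5-minute window after
# ceiling, so their counts are summed into one row (mirrors etl above).
raw = pd.DataFrame({
    "serial_number": ["server_1", "server_1", "server_1"],
    "collect_time": pd.to_datetime([
        "2019-01-01 00:01:10",   # ceils to 00:05:00
        "2019-01-01 00:03:55",   # ceils to 00:05:00
        "2019-01-01 00:06:00",   # ceils to 00:10:00
    ]),
    "3_sup_mce_note": [1, 1, 1],
})
raw["collect_time"] = raw["collect_time"].dt.ceil("5min")
agg = raw.groupby(["serial_number", "collect_time"], as_index=False).sum()
print(agg["3_sup_mce_note"].tolist())  # → [2, 1]
```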
 
```python
group_min = etl(kernel_log_data_path, AGG_TIME)

failure_tag = pd.read_csv(os.path.join(PARENT_FOLDER, failure_tag_data_path))
failure_tag['failure_time'] = pd.to_datetime(failure_tag['failure_time'])

# Attach each server's failure time to its aggregated windows.
merged_data = pd.merge(group_min, failure_tag[['serial_number', 'failure_time']],
                       how='left', on=['serial_number'])

# Label a window positive if the failure occurs within AGG_VALUE minutes of it.
# Use total_seconds(): .dt.seconds ignores the day component of the timedelta,
# which would mislabel windows that are whole days away from the failure.
delta = (merged_data['failure_time'] - merged_data['collect_time']).dt.total_seconds()
merged_data['failure_tag'] = ((merged_data['failure_time'].notnull())
                              & (delta >= 0)
                              & (delta <= AGG_VALUE * 60)).astype(int)

feature_data = merged_data.drop(['serial_number', 'collect_time', 'manufacturer',
                                 'vendor', 'failure_time'], axis=1)

# Downsample negatives to 5% to reduce the class imbalance.
sample_0 = feature_data[feature_data['failure_tag'] == 0].sample(frac=0.05)
sample = pd.concat([sample_0, feature_data[feature_data['failure_tag'] == 1]])

X_train = torch.from_numpy(sample.iloc[:, :-1].values).type(torch.FloatTensor)
y_train = torch.from_numpy(sample['failure_tag'].values).type(torch.LongTensor)
X_train.shape, y_train.shape
```
(torch.Size([24851, 24]), torch.Size([24851]))
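`train_test_split` is imported above but never used; holding out a validation set would make the downsampling ratio and decision threshold easier to tune. A minimal sketch, shown on random stand-in tensors with the same shapes as `X_train`/`y_train`:

```python
import torch
from sklearn.model_selection import train_test_split

# Stand-in data; in the notebook this would be X_train / y_train.
X = torch.randn(1000, 24)
y = torch.randint(0, 2, (1000,))

# Stratified 80/20 split keeps the positive-class ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X.numpy(), y.numpy(), test_size=0.2, random_state=42, stratify=y.numpy()
)
X_tr, y_tr = torch.from_numpy(X_tr), torch.from_numpy(y_tr)
X_val, y_val = torch.from_numpy(X_val), torch.from_numpy(y_val)
print(X_tr.shape, X_val.shape)  # → torch.Size([800, 24]) torch.Size([200, 24])
```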
A-board test data preparation

```python
group_data_test = etl('memory_sample_kernel_log_round1_a_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number', 'collect_time']])
group_min_test = group_data_test.drop(['serial_number', 'collect_time',
                                       'manufacturer', 'vendor'], axis=1)
X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
```
torch.Size([115629, 24])
B-board test data preparation

```python
group_data_test = etl('memory_sample_kernel_log_round1_b1_test.csv', AGG_TIME)
group_min_sn_test = pd.DataFrame(group_data_test[['serial_number', 'collect_time']])
group_min_test = group_data_test.drop(['serial_number', 'collect_time',
                                       'manufacturer', 'vendor'], axis=1)
X_test = torch.from_numpy(group_min_test.values).type(torch.FloatTensor)
X_test.shape
```
torch.Size([210672, 24])
Build the DataLoaders

```python
torch_dataset = Data.TensorDataset(X_train, y_train)
trainloader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=1242,
    shuffle=True,
    drop_last=True,
    num_workers=0,
)

torch_dataset = Data.TensorDataset(X_test)
testloader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=X_test.size(0),  # one batch holding the entire test set
    shuffle=False,
    drop_last=False,
    num_workers=0,
)
```
03 - Configure the network

```python
class Model(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Model, self).__init__()
        self.hidden1 = torch.nn.Linear(D_in, H)
        self.hidden2 = torch.nn.Linear(H, H)
        self.predict = torch.nn.Linear(H, D_out)

    def forward(self, input):
        out = self.hidden1(input)
        out = F.relu(out)
        out = self.hidden2(out)
        out = F.relu(out)
        out = self.predict(out)
        return out

model = Model(24, 15, 2)
print(model)

if torch.cuda.is_available():
    model.cuda()
    print("GPU")
else:
    print("CPU")

epochs = 2000
learn_rate = 0.1      # quite high for Adam; worth tuning down
momentum = 0.5        # unused: Adam takes no momentum argument here
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)
```
Model(
  (hidden1): Linear(in_features=24, out_features=15, bias=True)
  (hidden2): Linear(in_features=15, out_features=15, bias=True)
  (predict): Linear(in_features=15, out_features=2, bias=True)
)
GPU
04 - Training

```python
%%time
for epoch in tqdm(range(epochs)):
    model.train()
    # Use fresh batch names to avoid shadowing the X_train/y_train tensors.
    for i, (X_batch, y_batch) in enumerate(trainloader):
        if torch.cuda.is_available():
            X_batch = X_batch.cuda()
            y_batch = y_batch.cuda()

        out = model(X_batch)
        loss = loss_fn(out, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 1 == 0:
            print("Train Epoch: {}, Iteration {}, Loss: {}".format(epoch + 1, i, loss.item()))

    pre = torch.max(F.softmax(out, dim=1), dim=1)[1]
```
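The loop above only prints the cross-entropy loss, which says little about how many failures are actually caught; precision and recall on the positive class are closer to the competition's F1-style metric. A sketch of computing both from logits, using the same `softmax` + `max` decision rule as the training loop (the logits and labels here are toy values):

```python
import torch
import torch.nn.functional as F

# Toy logits for 4 samples and their true labels.
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-0.2, 0.9], [1.0, 0.1]])
labels = torch.tensor([0, 1, 0, 0])

pred = torch.max(F.softmax(logits, dim=1), dim=1)[1]  # predicted classes

tp = ((pred == 1) & (labels == 1)).sum().item()
fp = ((pred == 1) & (labels == 0)).sum().item()
fn = ((pred == 0) & (labels == 1)).sum().item()
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(precision, recall)  # → 0.5 1.0
```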
05 - Prediction

```python
model.eval()
with torch.no_grad():
    for (X_test,) in testloader:
        if torch.cuda.is_available():
            X_test = X_test.cuda()
        out = model(X_test)
        pre = torch.max(F.softmax(out, dim=1), 1)[1]
print("data ok")

for i in range(X_test.size(0)):
    if pre[i] == 1:
        print(pre[i])
```
Save the prediction results
```python
b = pre.cpu().numpy()
group_min_sn_test['predict'] = b
group_min_sn_test = group_min_sn_test[group_min_sn_test['predict'] == 1]
group_min_sn_res = group_min_sn_test.drop('predict', axis=1)
group_min_sn_res.to_csv('memory_predit_res_nn Bpre.csv', header=False, index=False)
print("Save OK")
```
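The saved file can contain many rows per server, one for every 5-minute window flagged positive. If the submission expects each server at most once, keeping only the earliest flagged window is one option; a sketch on toy prediction output (check the official submission format before relying on this):

```python
import pandas as pd

# Toy positive predictions: server_1 flagged in two windows, server_2 in one.
res = pd.DataFrame({
    "serial_number": ["server_1", "server_1", "server_2"],
    "collect_time": pd.to_datetime([
        "2019-07-02 00:10:00", "2019-07-01 08:30:00", "2019-07-05 12:00:00",
    ]),
})

# Keep the earliest positive window per server.
dedup = (res.sort_values("collect_time")
            .drop_duplicates("serial_number", keep="first")
            .reset_index(drop=True))
print(dedup)
```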
Save the model
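This section has no code in the original; a common pattern, which also fits the semifinal requirement of shipping a locally trained model inside the docker image, is to save the weights with `state_dict` and reload them at prediction time. A minimal sketch; the `Model` class mirrors the one defined above, while the file name is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.hidden1 = nn.Linear(D_in, H)
        self.hidden2 = nn.Linear(H, H)
        self.predict = nn.Linear(H, D_out)

    def forward(self, x):
        x = F.relu(self.hidden1(x))
        x = F.relu(self.hidden2(x))
        return self.predict(x)

model = Model(24, 15, 2)
torch.save(model.state_dict(), "memory_nn.pt")  # illustrative file name

# Later (e.g. inside the docker image): rebuild the architecture, load weights.
restored = Model(24, 15, 2)
restored.load_state_dict(torch.load("memory_nn.pt"))
restored.eval()
```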