Multi-GPU training with Keras

2023-07-23 05:44:05

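By default, Keras trains on a single GPU. Even if os.environ['CUDA_VISIBLE_DEVICES'] lists two GPUs, the process merely fills the memory of both. Once tf.GPUOptions(allow_growth=True) is set, it becomes clear that only the first GPU is actually used and the second one sits completely idle.

To use multiple GPUs, follow these steps:

(1) Import the multi_gpu_model function: from keras.utils import multi_gpu_model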

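(2) After defining the model, use multi_gpu_model to specify how many GPUs it should be trained on:

model = Model(...)  # define the model architecture
model_parallel = multi_gpu_model(model, gpus=n)  # n is the number of GPUs to use
model_parallel.compile(...)  # note: compile model_parallel, not model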

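With the code above, model is the original model kept on the CPU, while model_parallel replicates it onto each GPU to compute gradients. With a batch size of 128 and n=2, each GPU processes 128/2=64 images on its own. The results from the two GPUs are then merged on the CPU, the model weights are updated, and the updated model is copied back to the GPUs for the next round of training.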
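The batch-splitting arithmetic described here can be illustrated without any GPU. The following is a minimal sketch, not the actual multi_gpu_model implementation: it uses a made-up one-parameter linear model in plain Python to show that splitting a batch of 128 into two shards of 64, computing the gradient on each shard, and averaging gives the same gradient as the full batch. The names grad_mse and shard, and the toy data, are assumptions for illustration only.

```python
# Toy illustration of data parallelism (not Keras code).
# Model: y_hat = w * x, loss: mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def shard(seq, n):
    """Split seq into n equal contiguous parts (len(seq) must be divisible by n)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]

batch_size, n_gpus = 128, 2
xs = [i / batch_size for i in range(batch_size)]   # made-up inputs
ys = [3.0 * x + 0.5 for x in xs]                   # made-up targets
w = 1.0

# Each "GPU" sees 128 / 2 = 64 samples and computes its own gradient.
x_shards, y_shards = shard(xs, n_gpus), shard(ys, n_gpus)
shard_grads = [grad_mse(w, xs_i, ys_i) for xs_i, ys_i in zip(x_shards, y_shards)]

# The per-shard gradients are merged (averaged) before the weight update.
merged_grad = sum(shard_grads) / n_gpus
full_grad = grad_mse(w, xs, ys)
```

Because the shards are the same size, merged_grad and full_grad agree up to floating-point rounding, which is why the split does not change the result of a training step.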

(3) As the above shows, training is still run on model_parallel:

model_parallel.fit(...)  # note: model_parallel

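(4) When saving, model_parallel carries information about the number of GPUs used during training. If model_parallel is saved directly, the saved model can only be loaded in a setup with the same number of GPUs; otherwise it cannot be used. To make later use easier, save only the CPU model, that is, model:

model.save(...)  # note: model, not model_parallel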

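If a checkpoint callback is used, it also saves model_parallel by default, because training runs on model_parallel. To save model from a callback, define the callback yourself:

import keras

class OwnCheckpoint(keras.callbacks.Callback):
    def __init__(self, model):
        super(OwnCheckpoint, self).__init__()
        self.model_to_save = model

    def on_epoch_end(self, epoch, logs=None):  # the logs parameter must be present
        self.model_to_save.save('model_advanced/model_%d.h5' % epoch)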

Once defined, use it as follows:

checkpoint = OwnCheckpoint(model)
model_parallel.fit_generator(..., callbacks=[checkpoint])

With that, the problem is solved.

Supplementary notes: keras fit_generator and multi-GPU training

1. Environment

Keras is used with TensorFlow as the backend, on Python 3.6. Multi-GPU training fails with TensorFlow 1.14.

2. Code

2.1 Selecting the visible GPUs

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5'
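With CUDA_VISIBLE_DEVICES set, the visible GPUs are renumbered from 0 inside the process, so physical GPUs 4 and 5 appear as devices 0 and 1. The helper below is a hypothetical illustration of that renumbering (visible_device_map is not part of CUDA, TensorFlow, or Keras). It only parses the environment variable, touches no GPU, and assumes a simple comma-separated list of integer indices.

```python
import os

def visible_device_map(env=None):
    """Map logical device index -> physical GPU index from CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; assumes comma-separated integer IDs.
    """
    env = os.environ if env is None else env
    value = env.get('CUDA_VISIBLE_DEVICES', '')
    physical = [int(tok) for tok in value.split(',') if tok.strip()]
    return dict(enumerate(physical))

# Set the variables before importing TensorFlow/Keras so they take effect.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'   # order GPUs by PCI bus ID
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5'

mapping = visible_device_map()
print(mapping)  # {0: 4, 1: 5}
```

This is also why gpus=2 is the right argument for multi_gpu_model in this setup: the process sees exactly two devices, whatever their physical indices are.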

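2.2 Custom generator function

import cv2
import numpy as np
import pandas as pd

def img_image_generator(path_img, path_lab, batch_size, data_list):
    while True:  # the generator must loop indefinitely for fit_generator
        # data_list is a CSV file such as 'train_list.csv'; column 1 holds the file names
        file_list = pd.read_csv(data_list, sep=',', usecols=[1]).values.tolist()
        file_list = [i[0] for i in file_list]
        cnt = 0
        X = []
        Y1 = []
        for file_i in file_list:
            # read the image and its label as grayscale and scale both to [0, 1]
            x = cv2.imread(path_img + '/' + file_i, cv2.IMREAD_GRAYSCALE)
            x = x.astype('float32')
            x /= 255.
            y = cv2.imread(path_lab + '/' + file_i, cv2.IMREAD_GRAYSCALE)
            y = y.astype('float32')
            y /= 255.
            X.append(x.reshape(256, 256, 1))
            Y1.append(y.reshape(256, 256, 1))
            cnt += 1
            if cnt == batch_size:
                cnt = 0
                # the label batch is yielded twice (the model here appears to have two outputs)
                yield (np.array(X), [np.array(Y1), np.array(Y1)])
                X = []
                Y1 = []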
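The batching logic of this generator can be checked in isolation, without any images. The following is a simplified sketch under assumptions: batch_forever is a hypothetical stand-in that follows the same control flow as img_image_generator (loop forever, collect items, yield once batch_size is reached, start each pass with an empty buffer) but works on any list. It shows that items left over at the end of a pass are dropped when the list length is not a multiple of batch_size.

```python
from itertools import islice

def batch_forever(items, batch_size):
    """Yield fixed-size batches endlessly, mirroring the generator's control flow."""
    while True:
        batch = []                      # buffer is reset at the start of each pass
        for item in items:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # any partial batch left here is discarded before the next pass

files = ['a', 'b', 'c', 'd', 'e']       # made-up file names
first_four = list(islice(batch_forever(files, 2), 4))
print(first_four)  # [['a', 'b'], ['c', 'd'], ['a', 'b'], ['c', 'd']]
```

In this toy run the fifth item 'e' never reaches a batch. Whether dropping the remainder matters depends on the dataset; shuffling the file list on each pass would be one way to let every sample be seen eventually.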

2.3 Calling the generator and training

generator_train = img_image_generator(path1, path2, 4, pathcsv_train)
generator_test = img_image_generator(path1, path2, 4, pathcsv_test)
model.fit_generator(generator_train, steps_per_epoch=237*2, epochs=50,
                    callbacks=callbacks_list,
                    validation_data=generator_test, validation_steps=60*2)
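Because the generator never stops, fit_generator relies on steps_per_epoch and validation_steps to decide where an epoch ends. A common choice is the number of full batches in one pass over the data. The sketch below assumes that convention; full_batches is a hypothetical helper, and the sample counts are back-calculated from the call above rather than taken from the original dataset.

```python
def full_batches(n_samples, batch_size):
    """Number of complete batches in one pass over n_samples items."""
    return n_samples // batch_size

batch_size = 4
steps_per_epoch = 237 * 2          # value used in the call above
validation_steps = 60 * 2

# Samples consumed per epoch if every step draws one batch of 4.
train_samples_per_epoch = steps_per_epoch * batch_size
val_samples_per_epoch = validation_steps * batch_size
print(train_samples_per_epoch, val_samples_per_epoch)  # 1896 480

# Going the other way: from a sample count back to a step count.
steps_check = full_batches(train_samples_per_epoch, batch_size)
```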

3. Multi-GPU training

3.1 Replicating the model

model_parallel = multi_gpu_model(model, gpus=2)

3.2 Defining the checkpoint

from keras.callbacks import ModelCheckpoint

class ParallelModelCheckpoint(ModelCheckpoint):
    def __init__(self, model, filepath, monitor='val_out_final_score', verbose=0,
                 save_best_only=False, save_weights_only=False, mode='auto', period=1):
        self.single_model = model
        super(ParallelModelCheckpoint, self).__init__(
            filepath, monitor, verbose, save_best_only, save_weights_only, mode, period)

    def set_model(self, model):
        # ignore the model passed in during training and keep the original single model
        super(ParallelModelCheckpoint, self).set_model(self.single_model)

Usage:

model_checkpoint = ParallelModelCheckpoint(model=model, filepath=filepath, monitor='val_loss',
                                           verbose=1, save_best_only=True, mode='min')
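Overriding set_model works because Keras hands each callback the model being trained (here model_parallel) by calling set_model before training starts, and ModelCheckpoint later saves whatever self.model refers to. The following dependency-free sketch imitates that hand-off with stand-in classes. FakeModel, BaseCheckpoint, ParallelCheckpoint, and run_training are hypothetical and only mimic the relevant calls; they are not the Keras implementation.

```python
class FakeModel:
    """Stand-in for a Keras model; records the paths it was saved to."""
    def __init__(self, name):
        self.name = name
        self.saved_paths = []

    def save(self, path):
        self.saved_paths.append(path)

class BaseCheckpoint:
    """Stand-in for ModelCheckpoint: saves self.model at the end of each epoch."""
    def __init__(self, filepath):
        self.filepath = filepath
        self.model = None

    def set_model(self, model):
        self.model = model

    def on_epoch_end(self, epoch, logs=None):
        self.model.save(self.filepath % epoch)

class ParallelCheckpoint(BaseCheckpoint):
    """Same trick as ParallelModelCheckpoint: always keep the template model."""
    def __init__(self, model, filepath):
        super().__init__(filepath)
        self.single_model = model

    def set_model(self, model):
        super().set_model(self.single_model)   # ignore the model passed in

def run_training(training_model, callbacks, epochs):
    """Mimics the order of callback calls made during training."""
    for cb in callbacks:
        cb.set_model(training_model)
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_end(epoch)

template = FakeModel('model')
parallel = FakeModel('model_parallel')
run_training(parallel, [ParallelCheckpoint(template, 'model_%d.h5')], epochs=2)
print(template.saved_paths)  # ['model_0.h5', 'model_1.h5']
print(parallel.saved_paths)  # []
```

Compared with a hand-written callback, subclassing keeps the built-in options such as monitor and save_best_only while still saving the single model.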

3.3 Points to note

When saving the model, save the original model; do not save model_parallel.

