
How to use PyTorch's DataLoader together with skorch's GridSearchCV

Stack Overflow user
Asked on 2020-11-01 02:18:54
1 answer · 1.7K views · 0 followers · Score: 0

I am running a PyTorch ANN model (for a classification task) and using scikit-learn's GridSearchCV, via skorch, to search for the optimal hyperparameters.

When I run GridSearchCV with n_jobs=1 (i.e. one hyperparameter combination at a time), it runs very slowly.

When I set n_jobs greater than 1, I get an out-of-memory error. So I am now trying to see whether I can use PyTorch's DataLoader to split the dataset into batches and avoid the out-of-memory problem. According to another PyTorch forum thread (https://discuss.pytorch.org/t/how-to-use-skorch-for-data-that-does-not-fit-into-memory/70081/2), it seems we can use SliceDataset for this. My code is below:

# Setting up the artificial neural net model
class TabularModel(nn.Module):
    # Initialize parameters embeds, emb_drop, bn_cont and layers
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        
        # Create empty list for each layer in the neural net
        layerlist = []
        # Number of all embedded columns for categorical features 
        n_emb = sum((nf for ni, nf in emb_szs))
        # Number of inputs for each layer 
        n_in = n_emb + n_cont
        
        for i in layers:
            # Set the linear function for the weights and biases, wX + b
            layerlist.append(nn.Linear(n_in, i)) 
            # Apply the ReLU activation function
            layerlist.append(nn.ReLU(inplace=True))
            # Batch-normalise the activation outputs
            layerlist.append(nn.BatchNorm1d(i))
            # Randomly zero some of the normalised activations (dropout)
            layerlist.append(nn.Dropout(p))
            # Reassign number of inputs for the next layer
            n_in = i
        # Append last layer
        layerlist.append(nn.Linear(layers[-1], out_sz))          
        # Create sequential layers
        self.layers = nn.Sequential(*layerlist)
    
    # Function for feedforward
    def forward(self, x_cat_cont):
        x_cat = x_cat_cont[:,0:cat_train.shape[1]].type(torch.int64)
        x_cont = x_cat_cont[:,cat_train.shape[1]:].type(torch.float32)

        # Create empty list for embedded categorical features
        embeddings = []
        # Embed categorical features
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        # Concatenate embedded categorical features
        x = torch.cat(embeddings, 1)
        # Apply dropout rates to categorical features
        x = self.emb_drop(x)
        
        # Batch normalize continuous features
        x_cont = self.bn_cont(x_cont)
        
        # Concatenate categorical and continuous features
        x = torch.cat([x, x_cont], 1)
        
        # Feed categorical and continuous features into neural net layers
        x = self.layers(x)
        return x


# Use cross entropy loss function since this is a classification problem
# Assign class weights to the loss function
criterion_skorch = nn.CrossEntropyLoss
# Use Adam solver with learning rate 0.001
optimizer_skorch = torch.optim.Adam

from skorch import NeuralNetClassifier

# Random seed chosen to ensure results are reproducible by using the same initial random weights and biases, 
# and applying dropout rates to the same random embedded categorical features and neurons in the hidden layers
torch.manual_seed(0)

net = NeuralNetClassifier(module=TabularModel,
                          module__emb_szs=emb_szs,
                          module__n_cont=con_train.shape[1],
                          module__out_sz=2,
                          module__layers=[30],
                          module__p=0.0,
                          criterion=criterion_skorch,
                          criterion__weight=cls_wgt,
                          optimizer=optimizer_skorch,
                          optimizer__lr=0.001,
                          max_epochs=150,
                          device='cuda'
                          )

from sklearn.model_selection import GridSearchCV

param_grid = {'module__layers': [[30], [50,20]],
              'module__p': [0.0],
              'max_epochs': [150, 175]
             }

from torch.utils.data import TensorDataset, DataLoader
from skorch.helper import SliceDataset

# cat_con_train and y_train is a PyTorch tensor
tsr_ds = TensorDataset(cat_con_train.cpu(), y_train.cpu())

torch.manual_seed(0)   # Set random seed for shuffling results to be reproducible
d_loader = DataLoader(tsr_ds, batch_size=100000, shuffle=True)

d_loader_slice_X = SliceDataset(d_loader, idx=0)
d_loader_slice_y = SliceDataset(d_loader, idx=1)

models = GridSearchCV(net, param_grid, scoring='roc_auc', n_jobs=2).fit(d_loader_slice_X, d_loader_slice_y)

However, when I run this code, I get the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-df3fc792ad5e> in <module>()
    104 
--> 105 models = GridSearchCV(net, param_grid, scoring='roc_auc', n_jobs=2).fit(d_loader_slice_X, d_loader_slice_y)
    106 

6 frames
/usr/local/lib/python3.6/dist-packages/skorch/helper.py in __getitem__(self, i)
    230     def __getitem__(self, i):
    231         if isinstance(i, (int, np.integer)):
--> 232             Xn = self.dataset[self.indices_[i]]
    233             Xi = self._select_item(Xn)
    234             return self.transform(Xi)

TypeError: 'DataLoader' object does not support indexing
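The last line of the traceback is the key: SliceDataset indexes its underlying object with integers (`self.dataset[self.indices_[i]]`), which a map-style Dataset such as TensorDataset supports but a DataLoader does not, since a DataLoader only supports iteration. A minimal pure-Python sketch of the distinction (the `Mini*` classes are hypothetical stand-ins, not the real torch API):

```python
class MiniDataset:
    """Supports integer indexing, like TensorDataset."""
    def __init__(self, items):
        self.items = items
    def __getitem__(self, i):
        return self.items[i]
    def __len__(self):
        return len(self.items)

class MiniLoader:
    """Supports batched iteration only, like DataLoader: no __getitem__."""
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
    def __iter__(self):
        for start in range(0, len(self.dataset), self.batch_size):
            stop = min(start + self.batch_size, len(self.dataset))
            yield [self.dataset[i] for i in range(start, stop)]

ds = MiniDataset(list(range(10)))
loader = MiniLoader(ds, batch_size=4)

print(ds[3])        # 3 -- integer indexing works on the Dataset
try:
    loader[3]       # no __getitem__ -> TypeError, as in the traceback
except TypeError as e:
    print("indexing the loader fails:", e)
```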

How do I fix this? Is there a way to use PyTorch's DataLoader together with skorch's GridSearchCV (i.e. is there a way to load the data into GridSearchCV in batches, so as to avoid the out-of-memory problem when I set n_jobs greater than 1 in GridSearchCV)?

Many thanks in advance!


1 Answer

Stack Overflow user

Answered on 2020-11-11 23:49:00

The first thing to do is to find out where your memory runs out. You have a very large batch size and, presumably, a single GPU. If you have more than one GPU, you are in luck: you can follow the steps to parallelize the grid search over multiple GPUs using skorch + dask.

If you only have one GPU, then your GPU's RAM is evidently the bottleneck: it cannot hold two instances of your model at once. You can:

  • Reduce the model size (fewer parameters)
  • Reduce the batch size (the data takes up less space)

Which route you take is up to you, though.

Score: 0
Original content provided by Stack Overflow; translation supported by Tencent Cloud.
Original link:

https://stackoverflow.com/questions/64628130
