I am currently following the steps to run DeepLab, training the xception_65 backbone on the Cityscapes dataset, but unfortunately I run into a segmentation fault. I cannot pin the error down: training on the PASCAL dataset, for example, works fine. I have checked several versions and combinations of paths, TensorFlow, drivers, and so on. Even when I run the train.py script without GPU support, I get the same segmentation fault. I carried out the same steps on another computer, and there it worked. Does anyone know where the problem might be?
My setup:
By running:
python3 "${WORK_DIR}"/train.py \
--logtostderr \
--training_number_of_steps=${NUM_ITERATIONS} \
--train_split="train_fine" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="769,769" \
--train_batch_size=1 \
--fine_tune_batch_norm=False \
--dataset="cityscapes" \
--tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_cityscapes_train/model.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${CITYSCAPES_DATASET}"
I get the following output:
I1119 16:52:49.856512 139832269989696 learning.py:768] Starting Queues.
Fatal Python error: Segmentation fault
Thread 0x00007f2cd086b700 (most recent call first):
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 296 in wait
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/queue.py", line 170 in get
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 890 in _bootstrap
Thread 0x00007f2d3cc7e740 (most recent call first):
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490 in train_step
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775 in train
File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 466 in main
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 299 in run
File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 472 in <module>
Segmentation fault (core dumped)
A backtrace obtained with gdb shows: GDB output
Posted on 2020-01-22 09:52:46
I ran into the same problem. I managed to solve it by doing two things:

1. Renaming the tfrecord files (e.g. train-00000-of-00010.tfrecord) so that they match --train_split="train".
2. Changing data_generator.py around line 72: splits_to_sizes={'train_fine': 2975 becomes splits_to_sizes={'train': 2975.

The trick is to use the same name in the .sh script that launches training, in data_generator.py, and in the tfrecord folder (for me, train).
Posted on 2020-09-11 08:23:09
My problem looked like yours, and I realized that --dataset_dir should point to the directory containing the Cityscapes tfrecord data, not to the cityscapes directory itself.
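To illustrate why the directory matters: the data generator joins dataset_dir with a split-based glob pattern, so the shards must sit directly inside dataset_dir. A small self-contained sketch (the '%s-*' pattern matches the shape of _FILE_PATTERN in data_generator.py; the directory and file names here are made up for the demo):

```python
import glob
import os
import tempfile

_FILE_PATTERN = '%s-*'  # same shape as _FILE_PATTERN in data_generator.py

def files_for_split(dataset_dir, split_name):
    """Returns the tfrecord shards for a split, the way data_generator.py globs them."""
    return sorted(glob.glob(os.path.join(dataset_dir, _FILE_PATTERN % split_name)))

# Throwaway directory standing in for the cityscapes tfrecord folder.
tfrecord_dir = tempfile.mkdtemp()
for shard in ('train-00000-of-00010.tfrecord', 'train-00001-of-00010.tfrecord'):
    open(os.path.join(tfrecord_dir, shard), 'w').close()

print(len(files_for_split(tfrecord_dir, 'train')))       # 2: shards found
print(len(files_for_split(tfrecord_dir, 'train_fine')))  # 0: name mismatch, empty glob
```

Pointing dataset_dir one level too high, or using a split name that does not match the shard prefixes, both yield an empty glob, so the input queue has nothing to read. That is consistent with the crash appearing right after "Starting Queues."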
The code in data_generator.py that retrieves the data:
def _get_all_files(self):
  """Gets all the files to read data from.

  Returns:
    A list of input files.
  """
  file_pattern = _FILE_PATTERN
  file_pattern = os.path.join(self.dataset_dir,
                              file_pattern % self.split_name)
  return tf.gfile.Glob(file_pattern)
Posted on 2019-11-20 14:42:41
I still do not know what causes the segmentation fault, but my solution was to define a new dataset for Cityscapes in data_generator.py.
https://stackoverflow.com/questions/58938886