Problems encountered while working on rl_abs

Problem 1

Running train_abstractor.py produced this error:
nohup: ignoring input
start training with the following hyper-parameters:
{'net': 'base_abstractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'n_hidden': 256, 'bidirectional': True, 'n_layer': 1}, 'traing_params': {'optimizer': ('adam', {'lr': 0.001}), 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}
Start training
/root/anaconda3/envs/jjenv_pytorch/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train_abstractor.py", line 220, in <module>
    main(args)
  File "train_abstractor.py", line 166, in main
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 211, in train
    log_dict = self._pipeline.train_step()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 107, in train_step
    log_dict.update(self._grad_fn())
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 20, in f
    grad_norm = grad_norm.item()
AttributeError: 'float' object has no attribute 'item'
Switching from PyTorch 0.4.1 back to 0.4.0 resolved the problem.
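Downgrading sidesteps the issue, but judging from the traceback, the call site in training.py could also be patched to run on both versions: clip_grad_norm_ apparently hands back a plain Python float under 0.4.1 and a tensor under 0.4.0, so guarding the .item() call covers both cases. A minimal sketch with a toy parameter, not the repo's exact code:

    import torch
    from torch.nn.utils import clip_grad_norm_

    # toy parameter with a gradient, standing in for the network's params
    p = torch.nn.Parameter(torch.randn(4))
    p.grad = torch.randn(4)

    grad_norm = clip_grad_norm_([p], max_norm=2.0)
    # 0.4.1 returns a float here, 0.4.0 a tensor -- only call .item() on a tensor
    if isinstance(grad_norm, torch.Tensor):
        grad_norm = grad_norm.item()
    print(grad_norm)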

Problem 2

While running train_full_rl.py, the following error appeared mid-training:
train step: 6993, reward: 0.4023
train step: 6994, reward: 0.4021
train step: 6995, reward: 0.4025
train step: 6996, reward: 0.4023
train step: 6997, reward: 0.4024
train step: 6998, reward: 0.4025
train step: 6999, reward: 0.4024
train step: 7000, reward: 0.4024
start running validation...
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 182, in train
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 216, in train
    stop = self.checkpoint()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 185, in checkpoint
    val_metric = self.validate()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 171, in validate
    val_log = self._pipeline.validate()
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 178, in validate
    return a2c_validate(self._net, self._abstractor, self._val_batcher)
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 26, in a2c_validate
    for art_batch, abs_batch in loader:
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 451, in __iter__
    return _DataLoaderIter(self)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 239, in __init__
    w.start()
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
nohup: ignoring input
Warning: METEOR is not configured
loading checkpoint ckpt-2.530862-102000...
loading checkpoint ckpt-2.498729-24000...
loading checkpoint ckpt-2.530862-102000...
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 148, in train
    os.makedirs(join(abs_dir, 'ckpt'))
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/data/rl_abs_other/fast_abs_rl/full_rl_model/abstractor/ckpt'
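This second traceback is just fallout from rerunning after the crash: the first run had already created full_rl_model/abstractor/ckpt, and train_full_rl.py calls os.makedirs unconditionally. Deleting the half-finished model directory before restarting gets past it; if you are editing the script anyway, Python 3's exist_ok flag also makes the call idempotent (a one-line sketch, not the repo's code):

    import os

    abs_dir = 'full_rl_model/abstractor'  # path taken from the traceback above
    # exist_ok=True turns the call into a no-op when the directory already exists
    os.makedirs(os.path.join(abs_dir, 'ckpt'), exist_ok=True)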

As for the Cannot-allocate-memory error: after some googling, it turned out to be caused by starting too many worker processes, so in train_full_rl.py I reduced the DataLoader worker count (num_workers, judging from the traceback) from 4 to 1. With that change the rerun went through.
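For reference, the change boils down to lowering the num_workers argument wherever train_full_rl.py builds its DataLoaders. Each worker is forked from the (large) training process, and the kernel may refuse the fork with ENOMEM when it cannot account for another copy of that address space. A sketch with a toy dataset, since the repo's actual dataset and collate function differ:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # toy dataset standing in for the repo's validation set
    val_dataset = TensorDataset(torch.randn(100, 8))

    val_loader = DataLoader(
        val_dataset,
        batch_size=32,
        shuffle=False,
        num_workers=1,  # was 4; fewer forked workers -> lower peak memory
    )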

Problem 3

Another run of train_full_rl.py crashed right after the first training step with a CUDA error:

nohup: ignoring input
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Warning: METEOR is not configured
loading checkpoint ckpt-2.530862-102000...
loading checkpoint ckpt-2.498729-24000...
loading checkpoint ckpt-2.530862-102000...
start training with the following hyper-parameters:
{'net': 'rnn-ext_abs_rl', 'net_args': {'abstractor': {'net': 'base_abstractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'n_hidden': 256, 'bidirectional': True, 'n_layer': 1}, 'traing_params': {'optimizer': ['adam', {'lr': 0.001}], 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}, 'extractor': {'net': 'ml_rnn_extractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'conv_hidden': 100, 'lstm_hidden': 256, 'lstm_layer': 1, 'bidirectional': True}, 'traing_params': {'optimizer': ['adam', {'lr': 0.001}], 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}}, 'train_params': {'optimizer': ('adam', {'lr': 0.0001}), 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5, 'gamma': 0.95, 'reward': 'rouge-l', 'stop_coeff': 1.0, 'stop_reward': 'rouge-1'}}
Start training
train step: 1, reward: 0.1918
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 182, in train
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 211, in train
    log_dict = self._pipeline.train_step()
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 173, in train_step
    self._stop_reward_fn, self._stop_coeff
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 64, in a2c_train_step
    summaries = abstractor(ext_sents)
  File "/data/rl_abs_other/fast_abs_rl/decoding.py", line 94, in __call__
    decs, attns = self._net.batch_decode(*dec_args)
  File "/data/rl_abs_other/fast_abs_rl/model/copy_summ.py", line 63, in batch_decode
    attention, init_dec_states = self.encode(article, art_lens)
  File "/data/rl_abs_other/fast_abs_rl/model/summ.py", line 81, in encode
    init_enc_states, self._embedding
  File "/data/rl_abs_other/fast_abs_rl/model/rnn.py", line 41, in lstm_encoder
    lstm_out = reorder_sequence(lstm_out, reorder_ind, lstm.batch_first)
  File "/data/rl_abs_other/fast_abs_rl/model/util.py", line 58, in reorder_sequence
    sorted_ = sequence_emb.index_select(index=order, dim=batch_dim)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58

This looked like the GPU running out of memory. But the previous run had gone through fine, so it seemed unlikely that memory would suddenly be insufficient. I rebooted the cloud GPU instance and reran the job; the problem disappeared and the program produced correct results.
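Rebooting works because it kills whatever stale process was still holding the card's memory after the earlier crash; such processes show up in nvidia-smi, and killing their PIDs is usually enough without a full reboot. From a fresh Python process you can also sanity-check what this process itself holds (a sketch, assuming PyTorch >= 0.4 with CUDA available):

    import torch

    if torch.cuda.is_available():
        # bytes held by *this* process only; if these are near zero while
        # nvidia-smi shows the card full, a stale process is to blame
        print('allocated:', torch.cuda.memory_allocated())
        print('cached:   ', torch.cuda.memory_cached())
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver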

Original post: https://www.cnblogs.com/www-caiyin-com/p/10217093.html
