Asked this question on Stack Overflow; reposting here:

Have an xlarge instance in AWS running 9 Tomcats with heaps ranging from 256m to 4G. On Ubuntu 10.04 the box would occasionally hang for a few hours with a huge run queue (30-40) but nothing on CPU, and then recover. Suspected GC, but it has been exonerated both with and without the CMS collector.

After upgrading to 10.10, the box goes to 100% iowait within a few hours of startup, again with no processes on CPU. Here is the output from top:
top - 18:33:44 up 3:11, 2 users, load average: 26.99, 26.80, 25.82
Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 0.0%id,100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15373736k total, 15174780k used, 198956k free, 51288k buffers
Swap: 0k total, 0k used, 0k free, 6208956k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5910 tomcat6 20 0 746m 361m 9872 S 0 2.4 2:01.32 java
10147 tomcat6 20 0 919m 173m 9.8m S 0 1.2 0:22.60 java
12328 ubuntu 20 0 19276 1320 968 R 0 0.0 0:01.41 top
1 root 20 0 23864 2012 1300 S 0 0.0 0:00.38 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
...nothing useful in the GC logs (on the bigger instance, using MarkSweep, a major GC happens every ~5 minutes and takes ~4s; incremental collections complete in 0.1-0.2s, and there is plenty of free memory in all generations).
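For context, GC logging like that is produced with the standard HotSpot 6 flags of that era; the heap sizes and log path below are illustrative assumptions, not taken from the actual setup:

```shell
# Hypothetical JAVA_OPTS for one of the Tomcats. The GC-logging flags are
# the standard HotSpot 6 ones; heap sizes and the log path are assumptions.
JAVA_OPTS="-Xms256m -Xmx4g \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/tomcat6/gc.log"
# Add -XX:+UseConcMarkSweepGC to reproduce the CMS variant that was tested.
```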
Here is the dstat output:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
5 1 51 43 0 0| 63k 512k| 0 0 | 0 0 | 435 401
0 0 0 100 0 0| 0 0 | 52B 834B| 0 0 | 185 315
0 0 0 100 0 0| 0 0 |4997B 14k| 0 0 | 247 360
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 146 318
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 149 314
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 145 318
0 0 0 100 0 0| 0 0 |4997B 14k| 0 0 | 227 345
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 158 325
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 160 306
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 148 319
0 0 0 100 0 0| 0 0 |4619B 14k| 0 0 | 224 353

When the wait kicked in, the box was at the tail end of downloading/parsing a large number of files from S3 and writing them to local disk (instance store). A thread dump (via jconsole; kill -3 does nothing while the box is hung) shows a single thread blocked on a disk write.
I'm lost. Which stone do I turn over next? What on earth is going on here?
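One thing worth noting about the symptoms above: tasks in uninterruptible sleep (state D, typically blocked on disk I/O) count toward the load average even though they use no CPU, which is exactly the high-run-queue/idle-CPU pattern here. A quick sketch for listing them, assuming a procps-style ps:

```shell
# Print the header plus every task in uninterruptible sleep (state D),
# along with the kernel function each one is blocked in (WCHAN).
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On a box stuck at 100%wa one would expect kjournald, flush, and the java writers to show up here, parked in I/O-related wchans.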
Update:

This seems to be related to the hangs: "task blocked for more than 120 seconds" messages appear on both 10.04 and 10.10. From kernel.log on 10.04:
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909553] INFO: task kjournald:91 blocked for more than 120 seconds.
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909571] kjournald D ffff8803be10c424 0 91 2 0x00000000
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909575] ffff8803be147d50 0000000000000246 0000000000000000 ffff8803be147cd0
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909579] 0000000000000000 ffff8803be147d18 ffff8803be1448b8 ffff8803be147fd8
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909581] ffff8803be144500 ffff8803be144500 ffff8803be144500 ffff8803be147fd8
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909584] Call Trace:
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909595] [<ffffffff811ff29d>] journal_commit_transaction+0x18d/0xf20
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909601] [<ffffffff81059d50>] ? autoremove_wake_function+0x0/0x40
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909606] [<ffffffff8104c1be>] ? try_to_del_timer_sync+0x6e/0xd0
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909610] [<ffffffff812040da>] kjournald+0xfa/0x290
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909613] [<ffffffff81059d50>] ? autoremove_wake_function+0x0/0x40
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909615] [<ffffffff81203fe0>] ? kjournald+0x0/0x290
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909617] [<ffffffff8105986e>] kthread+0x8e/0xa0
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909622] [<ffffffff8100a70a>] child_rip+0xa/0x20
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909625] [<ffffffff810597e0>] ? kthread+0x0/0xa0
Apr 19 02:47:11 ip-10-110-67-175 kernel: [51985.909627] [<ffffffff8100a700>] ? child_rip+0x0/0x20

From kernel.log on 10.10:
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462810] INFO: task kjournald:716 blocked for more than 120 seconds.
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462842] kjournald D ffff880005bfb980 0 716 2 0x00000000
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462849] ffff8803aee4ba20 0000000000000246 ffff880300000000 0000000000015980
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462855] ffff8803aee4bfd8 0000000000015980 ffff8803aee4bfd8 ffff8803aef1c4a0
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462861] 0000000000015980 0000000000015980 ffff8803aee4bfd8 0000000000015980
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462867] Call Trace:
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462880] [<ffffffff815a20f3>] io_schedule+0x73/0xc0
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462887] [<ffffffff812a2f1c>] get_request_wait+0xcc/0x1a0
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462893] [<ffffffff8107f080>] ? autoremove_wake_function+0x0/0x40
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462897] [<ffffffff812a3083>] __make_request+0x93/0x4b0
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462903] [<ffffffff81102cc5>] ? mempool_alloc_slab+0x15/0x20
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462907] [<ffffffff812a1c63>] generic_make_request+0x1b3/0x540
Apr 18 05:58:07 ip-10-70-147-162 kernel: [899447.462911] [<ffffffff81102cc5>] ? mempool_alloc_slab+0x15/0x20

This always seems to start with kjournald; other processes (flush, java, and so on) then show similar messages.
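The 120-second figure in those messages is the kernel's hung-task watchdog threshold, which is tunable via sysctl; a sketch for inspecting it (note that raising it only silences the warning, it does not fix the underlying I/O stall):

```shell
# Read the hung-task watchdog timeout, if this kernel exposes it.
if [ -r /proc/sys/kernel/hung_task_timeout_secs ]; then
    cat /proc/sys/kernel/hung_task_timeout_secs   # 120 on these kernels
else
    echo "hung-task watchdog not exposed on this kernel"
fi
# To change it (as root) -- this only quiets the message, the stall remains;
# per the log text above, writing 0 disables the warning entirely:
# echo 300 > /proc/sys/kernel/hung_task_timeout_secs
```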
Posted on 2011-04-22 13:26:10

EC2 is having an outage right now, and the behavior you're experiencing is almost identical to what I'm seeing, so what's hitting you may well be the outage rather than your actual setup. Try moving to another region, if that's an option for you.
Posted on 2011-04-24 17:46:47

See the answer here. In short: downgrading to Ubuntu 8.04 LTS solved the problem.
Posted on 2011-04-24 17:40:01

It turned out to be related to an incompatibility between the Ubuntu 10.* kernels and certain CPU models used in EC2 large/xlarge instances: https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/708920.

Our AWS contact also pointed at the following possibly related issue: https://forums.aws.amazon.com/thread.jspa?threadID=59753

Downgrading to Ubuntu 8.04 LTS resolves the problem for now.
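Since the bug is tied to specific kernel/CPU combinations, a quick sketch for checking what a given instance is actually running, to compare against the models listed in the Launchpad report:

```shell
# Report the running kernel version and the CPU model the hypervisor
# exposes to the guest; compare both against the Launchpad bug report.
uname -r
grep -m1 'model name' /proc/cpuinfo || echo "no 'model name' in /proc/cpuinfo"
```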
https://serverfault.com/questions/261133