www.centos.org Forum Index CentOS 6 - General Support [RESOLVED] sadc hangs causing machine to partly seize up.
|---|
| Poster | Thread |
|---|
|
Re: sadc hangs causing machine to partly seize up. | #2 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
My guess would be a hardware error - perhaps a failed disk though I'd expect errors in dmesg.
|
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/3/15 19:11
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #3 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
Thanks for that. I am concerned it might be a disk problem, though the machine has a hardware RAID controller with two disks mirrored by it. I rebooted, and orphan inodes were deleted at boot time. I'm not sure how serious that is; I suspect the system was out of file descriptors, and all sorts of oddities might have happened as a result. Here's the current getinfo stuff:
http://driftwoodcomputer.blogspot.com/
There are some weird messages from fdisk -l, but exactly the same messages appear on the other Compaq ProLiant that we have, and it's working fine, though it is running the 32-bit OS as it only has 2GB RAM. I will monitor dmesg on an ongoing basis. If a disk had failed completely I would expect to have seen errors from the RAID controller on the console at boot time, and it looked the same as before. |
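For anyone wanting to test the file descriptor theory rather than guess: on Linux, `/proc/sys/fs/file-nr` holds three numbers (allocated file handles, allocated but unused handles, and the system-wide maximum). Below is a minimal sketch, not something from the original poster. The helper name `fd_usage_percent` is made up for illustration, and it reads from standard input so it can be tried on sample data.

```shell
#!/bin/sh
# fd_usage_percent: hypothetical helper, not part of any package.
# Reads one line in /proc/sys/fs/file-nr format from stdin:
#   <allocated> <unused> <maximum>
# and prints the share of the system-wide limit in use, as a whole percent.
fd_usage_percent() {
    awk 'NF >= 3 && $3 > 0 { printf "%d\n", ($1 - $2) * 100 / $3 }'
}

# On a live system:
#   fd_usage_percent < /proc/sys/fs/file-nr
```

A value creeping towards 100 before a hang would support the theory; a low value would suggest the stuck processes are waiting on something else.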
||
Posted on: 2012/3/16 12:52
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #4 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
I see you have an HP/Compaq Smart Array controller in there. You can get an hpacucli rpm that allows you to query the RAID controller status from within Linux, something like
|
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/3/16 13:22
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #5 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
Thanks for that! Here's the o/p. It just says they are OK. The machine hasn't started accumulating sadc processes again, though it has semi-crashed twice now. The first time I couldn't get in, so I rebooted and set up the telnet/stunnel thing, which worked the second time, which is the one I posted about. Bit of a puzzle.
[root@web02 ~]# hpacucli ctrl all show config detail
Smart Array 6i in Slot 0
   Bus Interface: PCI
   Slot: 0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Chassis Slot:
   Hardware Revision: Rev B
   Firmware Version: 2.36
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 sec
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 100% Read / 0% Write
   Total Cache Size: 64 MB
   Battery Pack Count: 0
   SATA NCQ Supported: False
   Array: A
      Interface Type: Parallel SCSI
      Unused Space: 0 MB
      Status: OK
      Logical Drive: 1
         Size: 33.9 GB
         Fault Tolerance: RAID 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 8711
         Stripe Size: 128 KB
         Status: OK
         Array Accelerator: Enabled
         Unique Identifier: 600508B1001FFFFFA00C003469C30001
         Disk Name: /dev/cciss/c0d0
         Mount Points: /boot 500 MB
         Logical Drive Label: A00C003469C2
      physicaldrive 1:0
         SCSI Bus: 1
         SCSI ID: 0
         Status: OK
         Drive Type: Data Drive
         Interface Type: Parallel SCSI
         Transfer Mode: Ultra 320 Wide
         Size: 36.3 GB
         Transfer Speed: 320 MB/Sec
         Rotational Speed: 15000
         Firmware Revision: C901
         Serial Number: A0F9P49056Y7
         Model: IBM-ESXSMAS3367NC FN
      physicaldrive 1:1
         SCSI Bus: 1
         SCSI ID: 1
         Status: OK
         Drive Type: Data Drive
         Interface Type: Parallel SCSI
         Transfer Mode: Ultra 320 Wide
         Size: 36.3 GB
         Transfer Speed: 320 MB/Sec
         Rotational Speed: 15000
         Firmware Revision: B85E
         Serial Number: 3HX0RV5F000073444T25
         Model: IBM-ESXSST336753LC FN |
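Reading every status field by eye gets tedious if the check is repeated. As a rough sketch only, the filter below prints any status field in `hpacucli ctrl all show config detail` output whose value is not `OK`. The function name `list_bad_status` is hypothetical, the input is assumed to have one "Key: value" field per line, and skipping the "RAID 6 (ADG) Status" line rests on the assumption that it reports a feature setting rather than health.

```shell
#!/bin/sh
# list_bad_status: hypothetical helper, not part of hpacucli.
# Reads controller detail output on stdin and prints every line that has a
# "Status:" field with a value other than OK. Prints nothing when all is well.
list_bad_status() {
    awk '/Status:/ && !/RAID 6 \(ADG\) Status:/ {
        value = $0
        sub(/^.*Status:[ \t]*/, "", value)   # keep only the text after "Status:"
        sub(/[ \t]+$/, "", value)            # drop trailing whitespace
        if (value != "OK") print
    }'
}

# On a machine with the tool installed:
#   hpacucli ctrl all show config detail | list_bad_status
```

Empty output matches what was seen here, where everything reported OK. Note that a clean report from the controller does not rule out a controller or firmware fault.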
||
Posted on: 2012/3/16 15:08
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #6 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
It's happened again. This time I can't get in via telnet either, but I had managed to set up a cron job to email the o/p from dmesg -c and hpacucli, and keep an eye on sadc processes.
hpacucli says: Another instance of hpacucli is running! Stop it first.
I have an email showing sadc processes accumulating. I have two emails from dmesg. The second is similar to the dmesg material on my original post. Here is the first dmesg stuff:
INFO: task jbd2/dm-0-8:341 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8 D 0000000000000001 0 341 2 0x00000000
 ffff88011413dc20 0000000000000046 0000000000000000 ffffffffa00041fc
 ffff88011413db90 ffffffff81012b59 ffff88011413dbd0 ffffffff8109b6a9
 ffff880114761b38 ffff88011413dfd8 000000000000f4e8 ffff880114761b38
Call Trace:
 [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff81012b59>] ? read_tsc+0x9/0x20
 [<ffffffff8109b6a9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814ed293>] io_schedule+0x73/0xc0
 [<ffffffff811a9410>] sync_buffer+0x40/0x50
 [<ffffffff814edc4f>] __wait_on_bit+0x5f/0x90
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814edcf8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811a93c6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa009d0e6>] jbd2_journal_commit_transaction+0xa76/0x14b0 [jbd2]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a2958>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a28a0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task flush-253:0:571 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:0 D 0000000000000001 0 571 2 0x00000000
 ffff8801170d55b0 0000000000000046 0000000000000000 0000000000000000
 ffff8801142e0320 0000000000000246 ffff8801170d55c0 ffffffff81267cd9
 ffff8801147606b8 ffff8801170d5fd8 000000000000f4e8 ffff8801147606b8
Call Trace:
 [<ffffffff81267cd9>] ? cfq_set_request+0x329/0x560
 [<ffffffffa009c09d>] do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff8125c6a1>] ? blkiocg_update_io_add_stats+0x61/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811108ce>] ? find_get_page+0x1e/0xa0
 [<ffffffffa009c471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00e8b78>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00ea9ba>] ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
 [<ffffffffa00e9f69>] ? ext4_mb_use_preallocated+0x219/0x230 [ext4]
 [<ffffffffa00ea372>] ? ext4_mb_initialize_context+0x82/0x1d0 [ext4]
 [<ffffffffa00f1c09>] ext4_mb_new_blocks+0x2a9/0x560 [ext4]
 [<ffffffffa00e4d80>] ? ext4_ext_find_extent+0x130/0x320 [ext4]
 [<ffffffffa00e7fc3>] ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffffa00c5335>] ext4_get_blocks+0xf5/0x2a0 [ext4]
 [<ffffffff81127155>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa00c62fc>] mpage_da_map_blocks+0xac/0x450 [ext4]
 [<ffffffffa009b3c5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00c6f37>] ext4_da_writepages+0x2f7/0x660 [ext4]
 [<ffffffff81126301>] do_writepages+0x21/0x40
 [<ffffffff811a041d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a085e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a09bb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a0d5b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff8107caa2>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a1049>] wb_do_writeback+0x199/0x240
 [<ffffffff811a1153>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81134d76>] bdi_start_fn+0x86/0x100
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task mysqld:1879 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld D 0000000000000002 0 1879 1760 0x00000080
 ffff8801177839a8 0000000000000082 0000000000000000 ffff8800dd8f1980
 ffff8801177839b8 ffffffff81467b00 0000000000000000 0000000000000000
 ffff8801139a3ab8 ffff880117783fd8 000000000000f4e8 ffff8801139a3ab8
Call Trace:
 [<ffffffff81467b00>] ? ip_queue_xmit+0x190/0x420
 [<ffffffffa009c09d>] do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa009c471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00e8b78>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00c4253>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa00c42cc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffffa00c45c0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff8119fdfb>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff81190372>] file_update_time+0xf2/0x170
 [<ffffffff81112ce0>] __generic_file_aio_write+0x220/0x480
 [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
 [<ffffffff81112faf>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00bdde1>] ext4_file_write+0x61/0x1e0 [ext4]
 [<ffffffff8117628a>] do_sync_write+0xfa/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task mysqld:20157 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld D 0000000000000003 0 20157 1760 0x00000080
 ffff8800bdb99988 0000000000000082 0000000000000000 ffff880028215fe8
 ffff880028216018 ffff880028215fe8 ffff88011426e0f8 0000000000000286
 ffff8801143870b8 ffff8800bdb99fd8 000000000000f4e8 ffff8801143870b8
Call Trace:
 [<ffffffff81110ac0>] ? sync_page+0x0/0x50
 [<ffffffff814ed293>] io_schedule+0x73/0xc0
 [<ffffffff81110afd>] sync_page+0x3d/0x50
 [<ffffffff814edafa>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff81110a97>] __lock_page+0x67/0x70
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811108ce>] ? find_get_page+0x1e/0xa0
 [<ffffffff81111a6c>] find_lock_page+0x4c/0x80
 [<ffffffff81111aea>] grab_cache_page_write_begin+0x4a/0xc0
 [<ffffffffa00c85d4>] ext4_da_write_begin+0xb4/0x200 [ext4]
 [<ffffffffa009a9f6>] ? jbd2_journal_stop+0x1e6/0x2b0 [jbd2]
 [<ffffffff811113be>] generic_file_buffered_write+0x10e/0x2a0
 [<ffffffffa00c45cf>] ? ext4_dirty_inode+0x4f/0x60 [ext4]
 [<ffffffff81112d10>] __generic_file_aio_write+0x250/0x480
 [<ffffffff81010b2e>] ? copy_user_generic+0xe/0x20
 [<ffffffff81112faf>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00bdde1>] ext4_file_write+0x61/0x1e0 [ext4]
 [<ffffffff8117628a>] do_sync_write+0xfa/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task jbd2/dm-0-8:341 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8 D 0000000000000001 0 341 2 0x00000000
 ffff88011413dc20 0000000000000046 0000000000000000 ffffffffa00041fc
 ffff88011413db90 ffffffff81012b59 ffff88011413dbd0 ffffffff8109b6a9
 ffff880114761b38 ffff88011413dfd8 000000000000f4e8 ffff880114761b38
Call Trace:
 [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff81012b59>] ? read_tsc+0x9/0x20
 [<ffffffff8109b6a9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814ed293>] io_schedule+0x73/0xc0
 [<ffffffff811a9410>] sync_buffer+0x40/0x50
 [<ffffffff814edc4f>] __wait_on_bit+0x5f/0x90
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814edcf8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811a93c6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa009d0e6>] jbd2_journal_commit_transaction+0xa76/0x14b0 [jbd2]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a2958>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a28a0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task flush-253:0:571 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:0 D 0000000000000001 0 571 2 0x00000000
 ffff8801170d55b0 0000000000000046 0000000000000000 0000000000000000
 ffff8801142e0320 0000000000000246 ffff8801170d55c0 ffffffff81267cd9
 ffff8801147606b8 ffff8801170d5fd8 000000000000f4e8 ffff8801147606b8
Call Trace:
 [<ffffffff81267cd9>] ? cfq_set_request+0x329/0x560
 [<ffffffffa009c09d>] do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff8125c6a1>] ? blkiocg_update_io_add_stats+0x61/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811108ce>] ? find_get_page+0x1e/0xa0
 [<ffffffffa009c471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00e8b78>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00ea9ba>] ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
 [<ffffffffa00e9f69>] ? ext4_mb_use_preallocated+0x219/0x230 [ext4]
 [<ffffffffa00ea372>] ? ext4_mb_initialize_context+0x82/0x1d0 [ext4]
 [<ffffffffa00f1c09>] ext4_mb_new_blocks+0x2a9/0x560 [ext4]
 [<ffffffffa00e4d80>] ? ext4_ext_find_extent+0x130/0x320 [ext4]
 [<ffffffffa00e7fc3>] ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffffa00c5335>] ext4_get_blocks+0xf5/0x2a0 [ext4]
 [<ffffffff81127155>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa00c62fc>] mpage_da_map_blocks+0xac/0x450 [ext4]
 [<ffffffffa009b3c5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00c6f37>] ext4_da_writepages+0x2f7/0x660 [ext4]
 [<ffffffff81126301>] do_writepages+0x21/0x40
 [<ffffffff811a041d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a085e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a09bb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a0d5b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff8107caa2>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a1049>] wb_do_writeback+0x199/0x240
 [<ffffffff811a1153>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81134d76>] bdi_start_fn+0x86/0x100
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task mysqld:1879 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld D 0000000000000002 0 1879 1760 0x00000080
 ffff8801177839a8 0000000000000082 0000000000000000 ffff8800dd8f1980
 ffff8801177839b8 ffffffff81467b00 0000000000000000 0000000000000000
 ffff8801139a3ab8 ffff880117783fd8 000000000000f4e8 ffff8801139a3ab8
Call Trace:
 [<ffffffff81467b00>] ? ip_queue_xmit+0x190/0x420
 [<ffffffffa009c09d>] do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa009c471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00e8b78>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00c4253>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa00c42cc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffffa00c45c0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff8119fdfb>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff81190372>] file_update_time+0xf2/0x170
 [<ffffffff81112ce0>] __generic_file_aio_write+0x220/0x480
 [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
 [<ffffffff81112faf>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00bdde1>] ext4_file_write+0x61/0x1e0 [ext4]
 [<ffffffff8117628a>] do_sync_write+0xfa/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task mysqld:20157 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mysqld D 0000000000000003 0 20157 1760 0x00000080
 ffff8800bdb99988 0000000000000082 0000000000000000 ffff880028215fe8
 ffff880028216018 ffff880028215fe8 ffff88011426e0f8 0000000000000286
 ffff8801143870b8 ffff8800bdb99fd8 000000000000f4e8 ffff8801143870b8
Call Trace:
 [<ffffffff81110ac0>] ? sync_page+0x0/0x50
 [<ffffffff814ed293>] io_schedule+0x73/0xc0
 [<ffffffff81110afd>] sync_page+0x3d/0x50
 [<ffffffff814edafa>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff81110a97>] __lock_page+0x67/0x70
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811108ce>] ? find_get_page+0x1e/0xa0
 [<ffffffff81111a6c>] find_lock_page+0x4c/0x80
 [<ffffffff81111aea>] grab_cache_page_write_begin+0x4a/0xc0
 [<ffffffffa00c85d4>] ext4_da_write_begin+0xb4/0x200 [ext4]
 [<ffffffffa009a9f6>] ? jbd2_journal_stop+0x1e6/0x2b0 [jbd2]
 [<ffffffff811113be>] generic_file_buffered_write+0x10e/0x2a0
 [<ffffffffa00c45cf>] ? ext4_dirty_inode+0x4f/0x60 [ext4]
 [<ffffffff81112d10>] __generic_file_aio_write+0x250/0x480
 [<ffffffff81010b2e>] ? copy_user_generic+0xe/0x20
 [<ffffffff81112faf>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa00bdde1>] ext4_file_write+0x61/0x1e0 [ext4]
 [<ffffffff8117628a>] do_sync_write+0xfa/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task jbd2/dm-0-8:341 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8 D 0000000000000001 0 341 2 0x00000000
 ffff88011413dc20 0000000000000046 0000000000000000 ffffffffa00041fc
 ffff88011413db90 ffffffff81012b59 ffff88011413dbd0 ffffffff8109b6a9
 ffff880114761b38 ffff88011413dfd8 000000000000f4e8 ffff880114761b38
Call Trace:
 [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff81012b59>] ? read_tsc+0x9/0x20
 [<ffffffff8109b6a9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814ed293>] io_schedule+0x73/0xc0
 [<ffffffff811a9410>] sync_buffer+0x40/0x50
 [<ffffffff814edc4f>] __wait_on_bit+0x5f/0x90
 [<ffffffff811a93d0>] ? sync_buffer+0x0/0x50
 [<ffffffff814edcf8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811a93c6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa009d0e6>] jbd2_journal_commit_transaction+0xa76/0x14b0 [jbd2]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a2958>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00a28a0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task flush-253:0:571 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:0 D 0000000000000001 0 571 2 0x00000000
 ffff8801170d55b0 0000000000000046 0000000000000000 0000000000000000
 ffff8801142e0320 0000000000000246 ffff8801170d55c0 ffffffff81267cd9
 ffff8801147606b8 ffff8801170d5fd8 000000000000f4e8 ffff8801147606b8
Call Trace:
 [<ffffffff81267cd9>] ? cfq_set_request+0x329/0x560
 [<ffffffffa009c09d>] do_get_write_access+0x29d/0x520 [jbd2]
 [<ffffffff8125c6a1>] ? blkiocg_update_io_add_stats+0x61/0x90
 [<ffffffff81090ad0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811108ce>] ? find_get_page+0x1e/0xa0
 [<ffffffffa009c471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00e8b78>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00ea9ba>] ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
 [<ffffffffa00e9f69>] ? ext4_mb_use_preallocated+0x219/0x230 [ext4]
 [<ffffffffa00ea372>] ? ext4_mb_initialize_context+0x82/0x1d0 [ext4]
 [<ffffffffa00f1c09>] ext4_mb_new_blocks+0x2a9/0x560 [ext4]
 [<ffffffffa00e4d80>] ? ext4_ext_find_extent+0x130/0x320 [ext4]
 [<ffffffffa00e7fc3>] ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffffa00c5335>] ext4_get_blocks+0xf5/0x2a0 [ext4]
 [<ffffffff81127155>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa00c62fc>] mpage_da_map_blocks+0xac/0x450 [ext4]
 [<ffffffffa009b3c5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00c6f37>] ext4_da_writepages+0x2f7/0x660 [ext4]
 [<ffffffff81126301>] do_writepages+0x21/0x40
 [<ffffffff811a041d>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a085e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a09bb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a0d5b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff8107caa2>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a1049>] wb_do_writeback+0x199/0x240
 [<ffffffff811a1153>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090957>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81134d76>] bdi_start_fn+0x86/0x100
 [<ffffffff81134cf0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090726>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81090690>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
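With this much repeated trace output it helps to boil the log down to which tasks were reported as blocked, and how often. The filter below is only a sketch: `hung_task_summary` is a made-up name, it assumes GNU or BSD `grep` (for `-o`), and it assumes task names contain no spaces.

```shell
#!/bin/sh
# hung_task_summary: hypothetical helper.
# Reads kernel log text on stdin and prints "<count> <task>:<pid>" for every
# task named in a "blocked for more than N seconds" report, most frequent first.
hung_task_summary() {
    grep -o 'INFO: task [^ ]* blocked for more than [0-9]* seconds' |
        awk '{ count[$3]++ } END { for (t in count) print count[t], t }' |
        sort -k1,1nr -k2,2
}

# On a live system:
#   dmesg | hung_task_summary
```

In the traces above, the blocked tasks are the ext4 journal thread, the writeback flusher and two mysqld threads, all waiting in journal or page I/O paths, which points at storage rather than at any one application.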
||
Posted on: 2012/3/20 11:34
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #7 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
It only ran for a few hours before another seizure. When rebooting I noticed a message flashing briefly on the screen. I have tracked it down using 'fgrep collision *' in /var/log:
dmesg:pci 0000:00:1d.0: BAR 4: address space collision on of device [0x2000-0x201f]
dmesg:pci 0000:00:1d.1: BAR 4: address space collision on of device [0x2020-0x203f]
dmesg.old:pci 0000:00:1d.0: BAR 4: address space collision on of device [0x2000-0x201f]
dmesg.old:pci 0000:00:1d.1: BAR 4: address space collision on of device [0x2020-0x203f]
messages:Mar 20 11:40:49 lingrayweb02 kernel: pci 0000:00:1d.0: BAR 4: address space collision on of device [0x2000-0x201f]
messages:Mar 20 11:40:49 lingrayweb02 kernel: pci 0000:00:1d.1: BAR 4: address space collision on of device [0x2020-0x203f]
messages:Mar 20 12:53:05 lingrayweb02 kernel: pci 0000:00:1d.0: BAR 4: address space collision on of device [0x2000-0x201f]
messages:Mar 20 12:53:05 lingrayweb02 kernel: pci 0000:00:1d.1: BAR 4: address space collision on of device [0x2020-0x203f] |
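The grep output repeats the same two devices across several log files. A small sketch for reducing it to the distinct PCI addresses follows; `collision_devices` is a hypothetical name and the input is assumed to hold one log message per line.

```shell
#!/bin/sh
# collision_devices: hypothetical helper.
# Reads log lines on stdin and prints each distinct PCI address that reported
# an "address space collision", one per line.
collision_devices() {
    grep 'address space collision' |
        sed -n 's/.*pci \([0-9a-fA-F:.]*\): BAR.*/\1/p' |
        sort -u
}

# From /var/log, mirroring the fgrep above:
#   fgrep collision * | collision_devices
```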
||
Posted on: 2012/3/20 14:03
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #8 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
Well, I tried turning off the sadc cron job, but it has still seized up again, so the sadc processes must have been a symptom rather than cause. Does anyone recognise the address space collision message?
|
||
Posted on: 2012/3/26 14:51
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #9 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
I don't recognise it but the pci 0000:00:1d.0 part of the message will correspond with a device listed by running `lspci -nn` and then you can see if it's a device that's likely to be involved in the hang or not.
From the oopses in your dmesg output, though, this looks suspiciously like a hang accessing your disk subsystem. I did notice that you have a Smart Array 6i in the machine, and the output from your last hpacucli run shows firmware 2.36 installed. I checked one of mine that I recently upgraded to the then-latest firmware, and it has 2.84. I also seem to remember reading several HP mails telling me I needed to upgrade beyond a certain level urgently (I can no longer remember what that level was). |
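To match the address in the kernel message against `lspci -nn` output, note that lspci normally leaves off the leading `0000:` domain. A minimal sketch, assuming a hypothetical helper name `pci_lookup` and a single PCI domain:

```shell
#!/bin/sh
# pci_lookup: hypothetical helper.
#   $1    = PCI address as printed by the kernel, e.g. 0000:00:1d.0
#   stdin = output of `lspci -nn`
# Prints the lspci line for that device, if any.
pci_lookup() {
    addr=${1#0000:}                 # lspci omits the domain when it is 0000
    awk -v a="$addr" '$1 == a'      # exact match on the first column
}

# On a live system:
#   lspci -nn | pci_lookup 0000:00:1d.0
# lspci can also select the slot itself:
#   lspci -nn -s 00:1d.0
```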
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/3/26 14:59
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #10 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
[root@lingrayweb02 ~]# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation E7520 Memory Controller Hub [8086:3590] (rev 0c)
00:02.0 PCI bridge [0604]: Intel Corporation E7525/E7520/E7320 PCI Express Port A [8086:3595] (rev 0c)
00:04.0 PCI bridge [0604]: Intel Corporation E7525/E7520 PCI Express Port B [8086:3597] (rev 0c)
00:06.0 PCI bridge [0604]: Intel Corporation E7520 PCI Express Port C [8086:3599] (rev 0c)
00:1c.0 PCI bridge [0604]: Intel Corporation 6300ESB 64-bit PCI-X Bridge [8086:25ae] (rev 02)
00:1d.0 USB controller [0c03]: Intel Corporation 6300ESB USB Universal Host Controller [8086:25a9] (rev 02)
00:1d.1 USB controller [0c03]: Intel Corporation 6300ESB USB Universal Host Controller [8086:25aa] (rev 02)
00:1d.4 System peripheral [0880]: Intel Corporation 6300ESB Watchdog Timer [8086:25ab] (rev 02)
00:1d.5 PIC [0800]: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller [8086:25ac] (rev 02)
00:1d.7 USB controller [0c03]: Intel Corporation 6300ESB USB2 Enhanced Host Controller [8086:25ad] (rev 02)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 0a)
00:1f.0 ISA bridge [0601]: Intel Corporation 6300ESB LPC Interface Controller [8086:25a1] (rev 02)
00:1f.1 IDE interface [0101]: Intel Corporation 6300ESB PATA Storage Controller [8086:25a2] (rev 02)
01:03.0 VGA compatible controller [0300]: ATI Technologies Inc Rage XL [1002:4752] (rev 27)
01:04.0 System peripheral [0880]: Compaq Computer Corporation Integrated Lights Out Controller [0e11:b203] (rev 01)
01:04.2 System peripheral [0880]: Compaq Computer Corporation Integrated Lights Out Processor [0e11:b204] (rev 01)
02:01.0 RAID bus controller [0104]: Compaq Computer Corporation Smart Array 64xx [0e11:0046] (rev 01)
02:02.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet [14e4:1648] (rev 10)
02:02.1 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet [14e4:1648] (rev 10)
06:00.0 PCI bridge [0604]: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A [8086:0329] (rev 09)
06:00.2 PCI bridge [0604]: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B [8086:032a] (rev 09)
Looks like the USB controller is causing the messages, but it is unlikely to cause the hangs, as it is not normally in use. |
||
Posted on: 2012/4/3 15:53
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #11 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
Upgrade your RAID controller firmware.
|
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/4/3 16:18
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #12 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
The problem I have now is that it ran for over a fortnight without trouble before the last weird state set in. So if I upgrade the firmware, how do I prove that has fixed the problem? This machine isn't operational yet; I need to give it a go-ahead before going live on it. The problem is too intermittent.
Consequently, I have been trying to find a way to trigger the problem, so I can show it doesn't happen after the upgrade. I have not yet succeeded. I wrote a small program which writes random data to the disk and re-reads it, and set several instances going. The disks are taking a hammering, but the problem hasn't recurred. I would be interested in any suggestions as to how else I might get it to kick off. |
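The poster's program is not shown, so the following is only a guess at the general shape of such a test: write random data to a file, force it out with `sync`, read it back and compare checksums. The name `write_verify` and its arguments are invented for this sketch. One caveat is that the re-read may well be served from the page cache rather than the disk, unless caches are dropped first (as root, `echo 3 > /proc/sys/vm/drop_caches`).

```shell
#!/bin/sh
# write_verify: hypothetical stress helper, not the poster's program.
#   $1 = directory on the filesystem under test
#   $2 = size of each file in MiB
#   $3 = number of passes
# Returns non-zero on the first checksum mismatch.
write_verify() {
    dir=$1; mib=$2; passes=$3
    i=0
    while [ "$i" -lt "$passes" ]; do
        f="$dir/stress.$$.$i"
        # Checksum the random stream while it is being written to the file.
        want=$(dd if=/dev/urandom bs=1048576 count="$mib" 2>/dev/null | tee "$f" | cksum)
        sync                        # push dirty pages out to the device
        got=$(cksum < "$f")         # read the file back and checksum it
        rm -f "$f"
        if [ "$want" != "$got" ]; then
            echo "pass $i: checksum mismatch" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "$passes passes verified"
}

# Several instances in parallel, for example:
#   write_verify /var/tmp 256 100 &
#   write_verify /var/tmp 256 100 &
```

Since the hangs here show up as blocked journal commits, a workload with many small synchronous writes might be a closer match than large sequential ones, but that is speculation.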
||
Posted on: 2012/4/17 14:28
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #13 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
Well, your current firmware is 2.36 and my Smart Array 6i has 2.84. I had the same random hang problem on the server that is running this, and have not had it since the upgrade (about 6 months ago). I'm about 99% sure that you need to update. The newer firmware also has a new function that will tell you if you need to flash the firmware on any of the drives too, so you might want to download the updates for those as well if they're backlevel.
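If there are several machines to check, the comparison can be scripted. This is a sketch under assumptions: both function names are made up, the controller line is taken to be the first `Firmware Version:` field in the hpacucli output (drives report `Firmware Revision:` instead), and `sort -V` from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# controller_firmware: hypothetical helper.
# Prints the value of the first "Firmware Version:" field found on stdin.
controller_firmware() {
    sed -n 's/.*Firmware Version:[[:space:]]*\([0-9.]*\).*/\1/p' | head -n 1
}

# version_lt A B: succeeds when dotted version A sorts strictly before B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# On a live system, using 2.84 as the level known to be good in this thread:
#   fw=$(hpacucli ctrl all show config detail | controller_firmware)
#   version_lt "$fw" 2.84 && echo "controller firmware $fw is older than 2.84"
```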
|
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/4/17 18:38
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #14 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
Well, I've tracked down and installed upgrades to ROM and SmartArray firmware. It's not recommending any disk firmware upgrades during bootup, and I didn't see the address space collision message either. I'll hope for the best.
Smart Array 6i in Slot 0
   Bus Interface: PCI
   Slot: 0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Chassis Slot:
   Hardware Revision: Rev B
   Firmware Version: 2.84
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 sec
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 100% Read / 0% Write
   Total Cache Size: 64 MB
   Battery Pack Count: 0
   SATA NCQ Supported: False
   Array: A
      Interface Type: Parallel SCSI
      Unused Space: 0 MB
      Status: OK
      Logical Drive: 1
         Size: 33.9 GB
         Fault Tolerance: RAID 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 8711
         Stripe Size: 128 KB
         Status: OK
         Array Accelerator: Enabled
         Unique Identifier: 600508B1001FFFFFA00C003469C30001
         Disk Name: /dev/cciss/c0d0
         Mount Points: /boot 500 MB
         Logical Drive Label: A00C003469C2
      physicaldrive 1:0
         SCSI Bus: 1
         SCSI ID: 0
         Status: OK
         Drive Type: Data Drive
         Interface Type: Parallel SCSI
         Transfer Mode: Ultra 320 Wide
         Size: 36.3 GB
         Transfer Speed: 320 MB/Sec
         Rotational Speed: 15000
         Firmware Revision: C901
         Serial Number: A0F9P49056Y7
         Model: IBM-ESXSMAS3367NC FN
      physicaldrive 1:1
         SCSI Bus: 1
         SCSI ID: 1
         Status: OK
         Drive Type: Data Drive
         Interface Type: Parallel SCSI
         Transfer Mode: Ultra 320 Wide
         Size: 36.3 GB
         Transfer Speed: 320 MB/Sec
         Rotational Speed: 15000
         Firmware Revision: B85E
         Serial Number: 3HX0RV5F000073444T25
         Model: IBM-ESXSST336753LC FN |
||
Posted on: 2012/4/19 14:52
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #15 |
|
|---|---|---|---|
|
Moderator
Joined: 2009/9/24
From Brighton, UK
Posts: 6351
|
Quote:
So will I, but I'm fairly confident that it'll fix the issue. Come back in a month and let us know. |
||
|
_________________
Linux/VoIP Systems Administrator |
|||
Posted on: 2012/4/20 0:12
|
|||
|
Re: sadc hangs causing machine to partly seize up. | #16 |
|
|---|---|---|---|
|
Regular Board Member
Joined: 2012/1/26
From
Posts: 117
|
The machine has finally gone live. It's been stable since the firmware upgrade. Thanks very much to TrevorH for his help.
|
||
Posted on: 2012/7/3 10:34
|
|||
|
Re: [RESOLVED] sadc hangs causing machine to partly seize up. | #17 |
|
|---|---|---|---|
|
Moderator
Joined: 2007/10/22
From ~/Earth/UK/England/Suffolk
Posts: 9137
|
That is good to read. Thank you for reporting back.
For posterity and on your behalf, this thread is now marked [RESOLVED]. |
||
Posted on: 2012/7/3 17:14
|
|||