首页手记「幕客技术」一块磁盘导致的后端服务崩溃

「幕客技术」一块磁盘导致的后端服务崩溃

标签：

Linux 职场生活测试

前两天DBA和另外一位硬件工程师，在更换硬盘的时候发现的问题，还好处理及时，没有导致更大的影响面。

什么问题呢？

这次问题就是因为服务器raid出现坏道，导致数据库写入数据出现问题，mysql不断的回写磁盘，最终，mysql的服务时段时续。

一、数据库错误现象如下

1、mysql的error日志

171208 19:16:07 InnoDB: Rollback of non-prepared transactions completed

171208 19:16:18 InnoDB: Warning: purge reached the head of the history list,

InnoDB: but its length is still reported as 583274! Make a detailed bug

InnoDB: report, and submit it to http://bugs.mysql.com

InnoDB: Error: tried to read 16384 bytes at offset 0 413483008.

InnoDB: Was only able to read -1.

171208 19:18:58 InnoDB: Operating system error number 5 in a file operation.

InnoDB: Error number 5 means 'Input/output error'.

InnoDB: Some operating system error numbers are described at

InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html

InnoDB: File operation call: 'read'.

InnoDB: Cannot continue operation.

2、mysql的连接中断，连接不上，现象如下：

(1)监控系统报警

Mysql of 3301 port is down

(2)终端操作情况，偶尔发现连接不上

终端时常连接不上，这样客户端连接也会出现问题

3、这个时候查看操作系统日志

tail -f /var/log/message 日志，错误情况如下：

Dec 8 19:48:40 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:49:51 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:49:51 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:51:01 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:51:01 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:51:48 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:51:48 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:53:00 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:53:00 kernel: end_request: I/O error, dev sdb, sector 72629874

二、原因分析

背景：磁盘作的raid5,曾经掉过盘，工程师修复时再直接插拔一块磁盘，操作系统出现的I/O错误频发，最后出现的上述的问题。

1、查看dell磁盘状态命令

omreport storage controller controller=0

发现物理磁盘都正常，但是raid5的磁盘状态不正常。如下：

ID : 1

Status : Critical

Name : Virtual Disk 1

State : Ready

Encrypted : No

Layout : RAID-5

Size : 2,233.50 GB (2398202363904 bytes)

Device Name : /dev/sdb

Bus Protocol : SAS

Media : HDD

Read Policy : Adaptive Read Ahead

Write Policy : Force Write Back

Cache Policy : Not Applicable

Stripe Element Size : 64 KB

Disk Cache Policy : Enabled

三、修复操作

1、修复命令

什么原因呢？怀疑是在磁盘更换过程中，或者掉盘过程，整个raid产生了坏道。

所以通过如下方式，进行修复：

2、重启系统

系统的错误日志消失

3、通过omconfig修复

omconfig storage vdisk action=clearvdbadblocks controller=0 vdisk=1

虽然我们硬件服务器，作了raid,raid卡电池，但硬件也有着自身的一些问题。

这些问题，需要每一个工程师去发现和了解。

欢迎继续关注幕客技术

点击查看更多内容

12人点赞

共同学习，写下你的评论

评论加载中...

作者其他优质文章

正在加载中

Jeson

Python工程师

手记

篇

粉丝

2万

获赞与收藏

3371

关注作者，订阅最新文章

阅读免费教程

今天注册有机会得

100积分直接送

付费专栏免费学

大额优惠券免费领

立即参与放弃机会

热搜

最近搜索清空

「幕客技术」一块磁盘导致的后端服务崩溃

阅读免费教程