Over the last couple of weeks I’ve had services on my server go down twice with what appears to be a file system error. The second time was last night, and it has now gone down a third time. I’ve got a ticket in to the data center to run fsck.
While I was SSHed into the box I ran service httpd restart and got this:
rm: cannot remove `/var/run/httpd.pid': Read-only file system      [  OK  ]
rm: cannot remove `/var/lock/subsys/httpd': Read-only file system
rm: cannot remove `/var/run/httpd.pid': Read-only file system
Starting httpd: (30)Read-only file system: httpd: could not open error log file /etc/httpd/logs/error_log.
Unable to open logs
Does anyone have any insight into what might be happening?
The kind of error you’re seeing there could definitely happen if a disk is having problems.
If an error on the disk is detected, it could go into read-only mode.
Ideally, you’d want the folks at the data center to run fsck on it before you continue using it; otherwise you risk data corruption.
If the disk supports it, you’ll want to use SMART (there’s a Webmin module for it) to check the health of your disk. badblocks may also be useful in this circumstance (or if SMART is unavailable on your disk).
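As a minimal sketch (assuming smartmontools is installed and the disk is /dev/sda — both assumptions, so adjust for your box), the checks look like this:

```shell
# Overall health verdict -- prints PASSED or FAILED:
#   smartctl -H /dev/sda
# Full attribute dump; non-zero Reallocated_Sector_Ct or
# Current_Pending_Sector values are early warning signs:
#   smartctl -a /dev/sda
# Non-destructive read-only surface scan (slow -- can take hours):
#   badblocks -sv /dev/sda

# Small helper: reads `smartctl -H` output on stdin and succeeds only
# when the overall self-assessment is PASSED.
smart_ok() { grep -q 'test result: PASSED'; }

# usage: smartctl -H /dev/sda | smart_ok || echo "disk may be failing"
```

The helper just makes the verdict scriptable, e.g. for a cron job that emails you when the self-assessment stops saying PASSED.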
Two weeks ago when it first happened, the DC got it back up with fsck. But I found out today that last night they only rebooted it. This afternoon they ran fsck twice and the second run came up clean.
The first two times were at the tail end of the backup and the third was probably a result of not running fsck last night. Is fsck a reliable way to know if the HD is ok?
Your reply must have come in while I was typing mine. I’ll check to see which method I can employ to check out the disk.
No, fsck only tells you that as of this moment in time, the filesystem is okay.
It doesn’t check the hardware itself.
Joe is definitely right: you want to use something like the SMART tools to check the integrity of the hard drive itself, and not just the filesystem on top of it.
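To make that concrete, here is a sketch of a hardware-level check (again assuming smartmontools and /dev/sda as a stand-in device name):

```shell
# Kick off a short self-test -- it runs inside the drive's firmware and
# usually finishes in a couple of minutes:
#   smartctl -t short /dev/sda
# Then read the results back:
#   smartctl -l selftest /dev/sda

# Small helper: reads the self-test log on stdin and succeeds once no
# entry reports a test still in progress.
selftest_done() { ! grep -q 'in progress'; }

# usage: smartctl -l selftest /dev/sda | selftest_done && echo "test finished"
```

An extended test (`smartctl -t long`) reads the whole surface and is the closer analogue to what fsck does for the filesystem, but at the hardware layer.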
My HDs appear to support SMART; however, the self-test only seems to get 10% of the way through:
Self-test execution status:      Self-test routine in progress...
                                 90% of test remaining.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
 1   Short offline       Aborted by host                90%        5569             -
 2   Short offline       Aborted by host                90%        5568             -
 3   Short offline       Aborted by host                90%        5568             -
 4   Extended offline    Aborted by host                90%        5568             -
 5   Short offline       Aborted by host                90%        5559             -
 6   Short offline       Self-test routine in progress  90%        5569             -
Is this a setup issue or a problem?
I wish there were nothing further to report on this issue, but unfortunately that’s not the case.
I’m still being plagued with periodic file system errors bringing down the services. The data center has concluded the HD is OK. I’m not 100% convinced of their conclusion, but they have offered to replace the drive if I say the word. However, I don’t want to go through the restore process if they’re right and it’s not a bad HD.
I’m looking for clues as to what could be the cause if it’s not the hardware. The issue has been happening every 7–15 days, at various times of day. The message log shows this each time: “Write protecting the kernel read-only data: 392k”.
Could this be caused by a bad script in one of my virtual servers, some file-size issue, or some other software problem? What can I check? Any pointers would be appreciated.