Wednesday, August 08, 2007

Smartmontools and fixing Unreadable Disk Sectors

Smartmontools was showing some problems on the disk.
At least two bad LBAs:
# smartctl -l selftest /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 20% 1596 44724966
# 2 Extended offline Completed: read failure 40% 1519 12622427

# smartctl -A /dev/hda | egrep 'Reallocated|Pending|Uncorrectable'
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 2
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 2
198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1
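
The failing LBAs can be pulled straight out of the self-test log rather than copied by hand. This is a minimal sketch, assuming the log layout shown above, where the LBA is the last field of each "read failure" line. The function name failed_lbas is made up for illustration.

```shell
# Print the unique LBA_of_first_error values from 'smartctl -l selftest'
# output read on stdin. Assumes the LBA is the last field of each
# "read failure" line, as in the log above.
failed_lbas() {
    awk '/read failure/ { print $NF }' | sort -n -u
}

# Example using the two log lines shown above instead of a live disk.
printf '%s\n' \
    '# 1 Extended offline Completed: read failure 20% 1596 44724966' \
    '# 2 Extended offline Completed: read failure 40% 1519 12622427' | failed_lbas
```

On a live system the input would come from 'smartctl -l selftest /dev/hda | failed_lbas'.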

I found a document:
"SHOWS HOW TO IDENTIFY THE FILE ASSOCIATED
WITH AN UNREADABLE DISK SECTOR, AND HOW TO
FORCE THAT SECTOR TO REALLOCATE."

and followed the procedure.

Note that this version of smartctl gives the LBA values in decimal. The document seems to refer
to an older version of smartctl that gives the LBA as a hexadecimal number.
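
If an LBA does turn up in hexadecimal, the shell's printf can convert it. This is a small sketch; the helper name hex_lba_to_dec is made up, and 0xC09A5B is simply the first bad LBA above (12622427) written in hex.

```shell
# Convert a hexadecimal LBA (with or without a leading 0x) to decimal.
hex_lba_to_dec() {
    printf '%d\n' "0x${1#0x}"
}

hex_lba_to_dec 0xC09A5B   # prints 12622427
```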

Let's look at the partition table to see which partition these LBAs fall in.
# fdisk -lu /dev/hda

Disk /dev/hda: 255 heads, 63 sectors, 3738 cylinders
Units = sectors of 1 * 512 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 63 160649 80293+ 83 Linux
/dev/hda2 160650 1204874 522112+ 82 Linux swap
/dev/hda3 1204875 53737424 26266275 83 Linux
/dev/hda4 53737425 60050969 3156772+ f Win95 Ext'd (LBA)
/dev/hda5 53737488 55841939 1052226 83 Linux
/dev/hda6 55842003 57946454 1052226 83 Linux
/dev/hda7 57946518 60050969 1052226 83 Linux

Ok, so both LBAs are in '/dev/hda3'.
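
The lookup against the Start and End columns can be sketched as follows. The input is a hypothetical simplified format of "device start end" per line, not raw fdisk output, and the extended partition (/dev/hda4) is left out because it would match alongside the logical partitions inside it. The name find_partition is made up.

```shell
# Print the device whose [start, end] sector range contains the given LBA.
# Input on stdin: one "device start end" line per partition (simplified,
# hypothetical format; this does not parse 'fdisk -lu' output directly).
find_partition() {
    awk -v lba="$1" '$2 + 0 <= lba + 0 && lba + 0 <= $3 + 0 { print $1 }'
}

printf '%s\n' \
    '/dev/hda1 63 160649' \
    '/dev/hda2 160650 1204874' \
    '/dev/hda3 1204875 53737424' \
    '/dev/hda5 53737488 55841939' | find_partition 12622427
```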
What's mounted there? As the partitions are mounted by label, we need to use:
# grep `e2label /dev/hda3` /etc/fstab
LABEL=/var /var ext3 defaults 1 2

Ok, so the problem is in '/var'.
# tune2fs -l /dev/hda3 | grep Block
Block count: 6566568
Block size: 4096

Let's do the maths. Subtract the partition start from the LBA and multiply by (512/4096):
(12622427 - 1204875) * (512/4096) = 1427194
(44724966 - 1204875) * (512/4096) = 5440011.375, so block 5440011
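
The same arithmetic as a small shell sketch, assuming 512-byte sectors as reported by fdisk above. The function name lba_to_fs_block is made up. Integer division rounds down, which gives the block that contains the sector.

```shell
# Filesystem block number that contains an absolute disk LBA.
# Arguments: LBA, partition start sector, filesystem block size in bytes.
# Assumes 512-byte sectors.
lba_to_fs_block() {
    echo $(( ($1 - $2) / ($3 / 512) ))
}

lba_to_fs_block 12622427 1204875 4096   # prints 1427194
lba_to_fs_block 44724966 1204875 4096   # prints 5440011
```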

Ok, now let's use 'debugfs':
# debugfs
debugfs 1.27 (8-Mar-2002)
debugfs: open /dev/hda3
debugfs: icheck 1427194
Block Inode number
1427194 526482
debugfs: ncheck 526482
Inode Pathname
526482 /log/ntp/peers.20070717
debugfs: icheck 5440011
icheck: Can't read next inode while doing inode scan
debugfs: quit

So that means LBA 12622427 is in file "/var/log/ntp/peers.20070717".
And it looks like LBA 44724966 is in currently unused space on the disk.

As this file is not critical, I will just overwrite the affected block
to force the bad sector to be reallocated:
# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=1427194
1+0 records in
1+0 records out
# sync
# smartctl -A /dev/hda | egrep 'Reallocated|Pending|Uncorrectable'
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 1
198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1

Ok, that seems to have made that error go away for the time being.

Then while googling I found this Perl script that helps to automate
the handling of bad blocks on Linux:

"smartfixdisk - assistant that helps to repair bad LBAs detected by Smartmontools"

developed by the "IT-Support-Group" (ISG.EE),
which is a service organisation of the
"Department of Information Technology and Electrical Engineering" (D-ITET)
of the "Swiss Federal Institute of Technology", Zurich.

I wanted to use it on an old Red Hat 9 server.
The script immediately fell over on this line:
open(DISKEND,"</sys/block/$diskname/size") or die "$!";

Not too surprising, as '/sys' does not exist on my old server!
It is a feature of newer (2.6) kernels.
On a CentOS 5 box I tried this:
# cat /proc/ide/hda/capacity
78165360
# cat /sys/block/hda/size
78165360
# cat /proc/ide/hda/geometry
physical 16383/16/63
logical 65535/16/63

So DISKEND is the number of sectors on the hard drive.
The '/proc' version was available on the old Red Hat 9 server,
so I changed the line of Perl code like this:
open(DISKEND,"</proc/ide/$diskname/capacity") or die "$!";

..and it was happy.
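
The same fallback can be written so that it works on both old and new kernels. This is a shell sketch of the idea, not a patch to the Perl script. The function name disk_sectors and its optional second argument (a root prefix, only there so the function can be tried against a scratch directory) are made up.

```shell
# Print the size of a disk in sectors, trying sysfs first and falling
# back to the old /proc/ide interface.
# Arguments: disk name (e.g. hda), optional root prefix (hypothetical,
# for trying the function out against a scratch directory).
disk_sectors() {
    disk=$1
    root=${2:-}
    for f in "$root/sys/block/$disk/size" "$root/proc/ide/$disk/capacity"; do
        if [ -r "$f" ]; then
            cat "$f"
            return 0
        fi
    done
    echo "cannot determine size of $disk" >&2
    return 1
}

disk_sectors hda 2>/dev/null || echo "no size found for hda"
```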
To figure out what the script is doing, it's useful to add a few
'print' commands to the script, or just uncomment the ones that
are already in place. Here's what the script told me on this server:
# ./smartfixdisk.pl --noaction /dev/hda
Block size = 4096, factor = 0.125
Searching for inode... this may take a while...

LBA 12622427
Partition and partition type: /dev/hda3 Linux_Ext2
Status: used
Comment: EXT2/3: File found at inode 526482: /log/ntp/peers.20070717

LBA 44724966
Partition and partition type: /dev/hda3 Linux_Ext2
Status: free
Comment: block not used in filesystem
dd if=/dev/zero of=/dev/hda seek=5590620 bs=4096 count=1 conv=sync

Looks good.
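
The seek value in the suggested dd command is the LBA scaled by the same factor of 0.125, this time counted from the start of the disk rather than from the start of the partition. The sketch below (helper names are made up, 512-byte sectors assumed) reproduces it and also prints which sectors a given 4096-byte block covers.

```shell
# Block number, counted from the start of the whole disk, that contains an LBA.
# Arguments: LBA, block size in bytes. Assumes 512-byte sectors.
lba_to_disk_block() {
    echo $(( $1 / ($2 / 512) ))
}

# First and last sector covered by a block.
# Arguments: block number, sector where block 0 starts, block size in bytes.
block_sector_range() {
    echo "$(( $2 + $1 * ($3 / 512) )) $(( $2 + ($1 + 1) * ($3 / 512) - 1 ))"
}

lba_to_disk_block 44724966 4096          # prints 5590620, the seek value above
block_sector_range 5590620 0 4096        # prints 44724960 44724967
block_sector_range 5440011 1204875 4096  # prints 44724963 44724970
```

Since /dev/hda3 starts at sector 1204875, which is not a multiple of 8, the two ranges do not line up. If I read the numbers right, the block written through /dev/hda also covers sectors 44724960 to 44724962, which belong to the preceding filesystem block (5440010).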
Ok, so let's finish off:
# dd if=/dev/zero of=/dev/hda seek=5590620 bs=4096 count=1 conv=sync
1+0 records in
1+0 records out
# sync
# smartctl -A /dev/hda | egrep 'Reallocated|Pending|Uncorrectable'
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1

# smartctl -t long /dev/hda

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 17 minutes for test to complete.
Test will complete after Wed Aug 8 16:38:15 2007

Use smartctl -X to abort test.
# smartctl -l selftest /dev/hda
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1599 -
# 2 Extended offline Completed: read failure 20% 1596 44724966
# 3 Extended offline Completed: read failure 40% 1519 12622427

# smartctl -A /dev/hda | egrep 'Reallocated|Pending|Uncorrectable'
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 253 252 000 Old_age Offline - 0

Ok, that looks to have cleared the errors for the time being.
But I'm going to keep a careful eye on that disk, using smartd and logwatch.
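
For a quick manual check alongside smartd, the raw value of a single attribute can be picked out of the 'smartctl -A' output. This is a minimal sketch, assuming the column layout shown above, where the raw value is the 10th field. The function name smart_raw is made up.

```shell
# Print the raw value (10th column) of one attribute from 'smartctl -A'
# output read on stdin. Argument: attribute name as shown in column 2.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example using a line from the output above instead of a live disk.
echo '197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0' |
    smart_raw Current_Pending_Sector
```

On a live system the input would come from 'smartctl -A /dev/hda | smart_raw Current_Pending_Sector'; anything other than 0 means there are sectors still waiting to be reallocated.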
