12:01 Saturday, August 08 2014

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

I woke up yesterday morning to find an email from smartd with the subject "Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors". I sighed, and girded my loins for a painful ordeal. The message itself simply means that there's 1 bad sector on /dev/sda which can only be reallocated by writing to it. The catch is whether there's any actual data on that sector that I care about (and will end up losing). There's no shortage of helpful information online detailing how to determine which file(s) reside on bad sectors; however, it all assumes that you're using one of the ext filesystem variants or sometimes even ReiserFS (people still use that? for fun??). Alas, I use XFS on all of the filesystems that I care about, and the process for determining which file resides on a sector of an XFS filesystem is poorly documented. I spent some time last night in the #xfs room on Freenode, asking the experts for guidance.

The first step in the process is determining the address of the bad sector. smartctl provides that information easily enough if you kick off a short test of the disk in question (smartctl -t short /dev/sda). The test should fail as soon as it attempts to read the bad sector, and that is what happened for me. You can view the result by running 'smartctl -a /dev/sda' and reviewing the newest entries in the 'SMART Self-test log' section of the output. In my case, I saw:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 3  Short offline       Completed: read failure       90%     14501         1465501307
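
Reading the LBA off the log by eye is fine for one disk, but it can also be scripted. What follows is a minimal sketch, assuming the self-test log keeps the column layout shown above, with the LBA as the last field of a line whose status contains "read failure". The function name first_error_lba is made up for this example and is not part of smartmontools. It parses the sample line from the log above instead of calling smartctl, so it is harmless to run anywhere.

```shell
# Print the LBA of the first failed self-test found on stdin.
# Assumption: the log layout shown above, where the LBA is the last field.
first_error_lba() {
    awk '/read failure/ { print $NF; exit }'
}

# Sample line copied from the log above. On a real system the input would
# come from something like: smartctl -l selftest /dev/sda | first_error_lba
sample='# 3  Short offline       Completed: read failure       90%     14501         1465501307'

lba=$(printf '%s\n' "$sample" | first_error_lba)
echo "$lba"
```

If the log contains several failed tests, this only reports the first matching line, which is the newest entry in smartctl's ordering.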

Thus, the address of the problematic sector of the disk was 1465501307. The next step in the process was to determine the offset from the beginning of the partition where the bad sector resided. To do that, I needed to get the partition table for the disk, in sector units (rather than bytes, etc.). There are a few tools which will provide this information, but the most user-friendly (in my opinion) is cfdisk:

$ cfdisk -P s /dev/sda
Partition Table for /dev/sda

               First       Last
 # Type       Sector      Sector   Offset    Length   Filesystem Type (ID) Flag
-- ------- ----------- ----------- ------ ----------- -------------------- ----
   Pri/Log           0        2047*     0#       2048*Free Space           None
 1 Primary        2048*    1437695*     0     1435648*Linux (83)           Boot
 2 Primary     1437696* 1953523711*     0  1952086016*Linux (83)           None

The above output makes it obvious that the bad sector is buried somewhere inside of the second partition (/dev/sda2), so the offset is simply the difference between the bad sector address and the first sector of the partition (1465501307 - 1437696 = 1464063611).
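
The same lookup can be written down as a small script. This is only a sketch: the table variable is hand-copied from the cfdisk output above (partition name, first sector, last sector), and locate_lba is a name invented for the example.

```shell
# Partition rows hand-copied from the cfdisk output above: name first last
table='sda1 2048 1437695
sda2 1437696 1953523711'

# Print "<partition> <offset>" for the partition that contains LBA $1.
locate_lba() {
    printf '%s\n' "$table" | while read -r name first last; do
        if [ "$1" -ge "$first" ] && [ "$1" -le "$last" ]; then
            echo "$name $(( $1 - $first ))"
        fi
    done
}

result=$(locate_lba 1465501307)
echo "$result"    # prints: sda2 1464063611
```

An LBA that falls in the free space before the first partition prints nothing, which would mean no filesystem is affected.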

This is where the help from #xfs was supposed to be key, but alas not all of their advice worked, and I ended up googling for alternative solutions. I did need to determine either the file system block or the inode of the bad sector. The tool, xfs_db (part of the xfsprogs package), makes this conversion trivial. The big catch is that you can't run xfs_db on a writable filesystem. In my case, the partition in question was / so I absolutely had to reboot to get the partition into a non-writable state. Once I did that, I ran xfs_db with the command 'xfs_db /dev/sda2' and then issued the following command:

convert daddr 1464063611 fsb

In theory, the next step was to run another xfs_db command (blockuse -n) against the file system block address returned from the previous command, but here's where it all failed to work. blockuse kept insisting that I run a 'blockget' command first. However, even after running blockget, blockuse continued to insist that I hadn't run blockget yet. And additional attempts to run blockget claimed that it had already been run. I tried to ping #xfs again, but no one seemed to be around, so I went to google for alternate solutions. I came across an ancient thread from 2008 where someone used the unix 'find' command with the -inum (inode number) option to determine the file at a particular location. So I gave that a try. I first needed to run the convert command again to get the actual inode:

convert daddr 1464063611 inode

Once I had an inode number, I mounted /dev/sda2 read only:

mount -o ro -t xfs /dev/sda2 /mnt/root

And then proceeded to run the find command against the inode that convert provided above:

find /mnt/root -inum 75425801

Except it returned nothing. However, that was actually a good thing. My / partition (/dev/sda2) is about 1TB in size, yet only 10% is currently in use. So it's completely reasonable that the bad sector/block/inode has no data on it (yet). The remaining step was to write to that spot on the disk to (hopefully) force the disk firmware to mark that sector as bad, and reallocate it elsewhere. Before I did that, I wanted to further verify that I was dealing with the right block on the disk, and that it was truly empty. This required using a simple formula:

b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by 'cfdisk -P s /dev/sda'
(int) = denotes the integer part of the result
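
The formula translates almost directly into shell arithmetic. The sketch below uses a made-up helper name (fs_block) and small invented numbers purely for illustration. Shell arithmetic is integer-only, so the division truncates on its own and stands in for the (int) step.

```shell
# fs_block L S B -> file system block number that holds disk sector L.
#   L = LBA of the sector, S = first sector of the partition,
#   B = file system block size in bytes (sectors are assumed to be 512 bytes).
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Invented values: partition starts at sector 2048, 4096-byte blocks,
# and the sector of interest is 17 sectors into the partition.
fs_block 2065 2048 4096    # prints: 2
```

Seventeen sectors are 8704 bytes, which lands a little way into the third 4096-byte block, that is, block number 2 when counting from zero.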

I had all of the required values needed except for B. To get B, I ran xfs_info against the mounted partition:

$ xfs_info /
meta-data=/dev/sda2              isize=256    agcount=4, agsize=61002688 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=244010752, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=119145, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The key above is the bsize value, which is 4096. Completing the formula:

b = int((1465501307 - 1437696)*512/4096)
b = int(1464063611*512/4096)
b = 183007951
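
As a sanity check, the result can be mapped back to disk sectors to confirm that block 183007951 really does contain LBA 1465501307. This is a sketch using only the numbers above; the variable names are arbitrary.

```shell
B=4096          # file system block size in bytes (bsize from xfs_info)
S=1437696       # first sector of /dev/sda2
L=1465501307    # LBA of the bad sector
b=183007951     # block number computed above

per_block=$(( $B / 512 ))                    # 512-byte sectors per block
first_lba=$(( $S + $b * $per_block ))        # first disk sector of block b
last_lba=$(( $first_lba + $per_block - 1 ))  # last disk sector of block b
index=$(( ($L - $S) % $per_block ))          # bad sector's position in the block

echo "block $b covers LBA $first_lba-$last_lba; bad sector is at index $index"
# prints: block 183007951 covers LBA 1465501304-1465501311; bad sector is at index 3
```

The fractional part dropped by (int), 0.375, is exactly this index: 3 sectors out of 8 into the block.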

I'm now ready to use the final tool in the process, dd. First I wanted to verify that I was unable to read the block in question:

dd if=/dev/sda2 of=/dev/null bs=4096 count=1 skip=183007951

And it failed with an IO error (also in dmesg, I saw a bunch of scary additional errors). I then confirmed that I could successfully read from the block immediately before & immediately after 183007951:

dd if=/dev/sda2 of=/tmp/183007950 bs=4096 count=1 skip=183007950
dd if=/dev/sda2 of=/tmp/183007952 bs=4096 count=1 skip=183007952

Both commands completed successfully, and created two 4096 byte files in /tmp. I took a quick look at the contents of each with less, and thankfully, both were nothing but zeroes (which less presented as a long string of '@' symbols). I was now reasonably confident that I had the correct block on the disk, and that it was almost certainly empty. The last step was to write over the bad block with the content from one of the other blocks. This accomplished two goals: it wrote to the block so that the disk firmware would reallocate it, and it ensured that the block now held an xfs formatted piece of the filesystem, rather than a hole:

dd if=/tmp/183007950 of=/dev/sda2 bs=4096 count=1 seek=183007951

SUCCESS! Also, when I re-ran 'smartctl -a /dev/sda', the value of Current_Pending_Sector was now zero (while it was 1 previously). I also went ahead and re-ran another short test on the disk, and it completed successfully, with no errors.
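
The "nothing but zeroes" check on the neighbouring blocks was done by eye in less, but it could also be scripted. Here is a minimal sketch, with a made-up helper name (is_all_zero) and a scratch file standing in for /tmp/183007950, so that it does not touch any real disk.

```shell
# Succeeds (exit status 0) if file $1 contains nothing but NUL bytes.
# It deletes every NUL with tr and counts whatever bytes are left.
# Note: an empty file also passes, so check the file size separately.
is_all_zero() {
    [ "$(tr -d '\000' < "$1" | wc -c | tr -d ' ')" = "0" ]
}

# Demo: one 4096-byte block of zeroes, created the same way dd was used above.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=4096 count=1 2>/dev/null

if is_all_zero "$scratch"; then
    echo "all zeroes"
else
    echo "contains data"
fi
rm -f "$scratch"
```

A block that fails this check holds data of some kind, and overwriting it with a neighbour's contents would destroy that data.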