Linux: How to replace a failed drive in a software RAID array

September 2, 2012

This tutorial explains how to replace a failed member of a Linux software RAID-1 array.

You can monitor the status of your software RAID arrays by reading /proc/mdstat:

cat /proc/mdstat

This is the kind of output you'll get when the secondary drive is either dead or no longer in the array:

Personalities : [raid1]
md0 : active raid1 sda1[0]
104320 blocks [2/1] [U_]

md1 : active raid1 sda3[0]
1052160 blocks [2/1] [U_]

md2 : active raid1 sda5[0]
478841280 blocks [2/1] [U_]

unused devices: <none>

(Note the [U_]; this means the secondary member is no longer active. On a healthy array you would see [UU] instead.)
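For a more detailed view of a single array, including the state of each member, you can also query mdadm directly (assuming /dev/md0 is one of your arrays):

mdadm --detail /dev/md0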

BEFORE YOU CONTINUE: If your drive is experiencing bad blocks and some of its partitions are still untouched by those bad sectors and remain active, you need to manually fail them before removing the defective drive. If the drive is totally dead and no partitions remain active, skip steps 1 and 2 and go straight to adding the new drive (step 3).

1. Fail the remaining active partitions (if needed):

mdadm --manage /dev/md? --fail /dev/sd??
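For example, assuming the failed drive is /dev/sdb and its first partition belongs to /dev/md0:

mdadm --manage /dev/md0 --fail /dev/sdb1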

2. Remove the remaining partitions from the array (if needed):

mdadm --manage /dev/md? --remove /dev/sd??
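Continuing the same example:

mdadm --manage /dev/md0 --remove /dev/sdb1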

3. Remove the defective drive and add the new one.

4. Mirror the partition table from the healthy drive to the new one:

sfdisk -d /dev/sd? | sfdisk /dev/sd?
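Here the first device is the source and the second is the destination. Assuming /dev/sda is the healthy drive and /dev/sdb is the replacement:

sfdisk -d /dev/sda | sfdisk /dev/sdb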

5. Resync all the RAID devices:

mdadm --manage /dev/md? --add /dev/sd??

(Repeat the last step for every RAID device; a full example follows.)
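For instance, with the partition layout from the mdstat output above and assuming the replacement drive is /dev/sdb:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb3
mdadm --manage /dev/md2 --add /dev/sdb5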

You can follow the resynchronization progress by reading /proc/mdstat:

cat /proc/mdstat
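To refresh the view automatically, you can wrap it in watch (updating every second here):

watch -n1 cat /proc/mdstat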