Shouldn't the system detect and reallocate bad sectors in flash memory? - maemo.org

malfunctioning	2014-10-22 , 07:16
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#1

The flash memory on the N900, as in any other computer, has a limited lifetime. I had to completely reflash and reformat my N900 after having some problems, and I found out earlier that one of the ext3 partitions I created on the eMMC had errors.

Basically, I copied a large (4 GB) file, and whenever I tried to read it (or to md5sum it) I got a read error (md5sum just fails silently in this case by default).
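
A minimal sketch of the check described above: copy a file, then compare the checksum of the copy with that of the source, testing md5sum's exit status rather than relying on its output alone. The function name copy_and_verify is hypothetical, and the sketch assumes only that cp, sync and md5sum are available.

```shell
# copy_and_verify SRC DST  (hypothetical helper, not part of Maemo)
# Copies SRC to DST, then compares the MD5 checksums of both files.
# Prints OK and returns 0 on a match; returns 1 on any copy, read
# or checksum error.
copy_and_verify() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    sync    # flush pending writes to the storage device
    # Test the exit status explicitly, so a failed read is not missed.
    src_sum=$(md5sum < "$src") || return 1
    dst_sum=$(md5sum < "$dst") || return 1
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

One caveat: a read-back done right after the copy may be served from the page cache instead of the flash, so the check is more meaningful after a remount or reboot, or after dropping the caches as root with echo 3 > /proc/sys/vm/drop_caches.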

This is how I think I corrected the issue:
1. Reboot into Backup Menu.
2. Mount all partitions in storage mode (read/write).
3. Connect via USB to Linux laptop.
4. Run this command on the bad partition (which happened to be /dev/sdb5):
sudo mkfs.ext3 -c -b 1024 /dev/sdb5

This detected a few bad blocks and recorded them in the filesystem's bad-block list, so that the filesystem does not allocate those blocks.

Anybody else have any experience with this? If it works, at least it's reassuring to know that the problem can be remedied.


malfunctioning	2014-10-22 , 11:01
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#2

This didn't work. After rebooting the N900 into Maemo I copied the same file over, and I have the same issue. It's possible the Maemo kernel doesn't make use of the bad-block information.

So, next I'm going to format again, and then I will copy the file from my Linux machine.

Incidentally, you can check if you have any bad blocks in a given partition from the N900 itself, using badblocks.

1. Unmount <my_partition>
2. As root, run:
badblocks -b 1024 -sv /dev/<my_partition>

-b 1024 sets the block size to 1024 bytes (to match the block size of the ext3 filesystem), -s shows the progress of the scan, and -v makes the output more verbose.

By default, the test is read-only. You can also run it in non-destructive read-write mode by adding the -n flag, but that test takes much longer than the read-only one.

By the way, my bad partition has 27 bad blocks out of over 7 million. Not bad, as long as the kernel can avoid them.
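
To put a figure like that in proportion, here is a small sketch that summarizes a bad-block list. It assumes the list was saved with the -o option of badblocks (one block number per line); the function name summarize_badblocks and the file name in the usage note are hypothetical.

```shell
# summarize_badblocks LIST TOTAL  (hypothetical helper)
# LIST:  file with one bad block number per line, as written by
#        badblocks -o LIST ...
# TOTAL: total number of blocks in the partition
summarize_badblocks() {
    awk -v total="$2" '
        NF { n++ }    # count non-empty lines
        END {
            printf "%d bad blocks of %d (%.5f%%)\n", n, total, 100 * n / total
        }
    ' "$1"
}
```

For example, with a 27-line list in bad.txt and 7000000 as a round stand-in for "over 7 million" blocks, summarize_badblocks bad.txt 7000000 prints "27 bad blocks of 7000000 (0.00039%)".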


malfunctioning	2014-10-22 , 11:16
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#3

I just ran the test on an equally sized ext3 partition adjacent to it (earlier in the eMMC), and this one gives me 212 bad blocks!

The funny thing is that I have no problem copying that same 3 GB file to this partition, and the checksum comes out fine. However, it looks like the block errors in this partition only start at 55% into the scan, and this file occupies around 40% of the partition.
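
A rough way to see where in the partition the errors cluster, sketched under the assumption that the bad block numbers were saved one per line (for instance with badblocks -o). The function name badblock_deciles is hypothetical; it counts the bad blocks that fall into each tenth of the partition.

```shell
# badblock_deciles LIST TOTAL  (hypothetical helper)
# LIST:  file with one bad block number per line
# TOTAL: total number of blocks in the partition
# Prints ten lines, one per 10% slice of the partition.
badblock_deciles() {
    awk -v total="$2" '
        NF {
            d = int($1 * 10 / total)    # slice index, 0..9
            if (d > 9) d = 9
            count[d]++
        }
        END {
            for (d = 0; d < 10; d++)
                printf "%3d-%3d%%: %d\n", d * 10, d * 10 + 10, count[d]
        }
    ' "$1"
}
```

If the errors really start at about 55% of the scan, the first five lines of the output would show 0 and the counts would appear from the 50-60% line onwards.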

Now I'm going to copy a duplicate of the file to see if I get errors. I'm expecting that to be the case.

EDIT:
Still copying the file after 2 hours. I am almost sure the reason is that bad sectors are being avoided, so it's taking longer this time. This is a good thing.

My theory as to why I still have problems copying the file to the other partition: there are some block errors which are not detected. I need to run the non-destructive read-write test, or the destructive read-write test, on that partition. That will gather more data and update the bad block list, which should allow the file to be copied.

Another idea: Whenever you get those random "Application MicroB had to close because of an internal error", etc, I wager bad blocks of memory are often the cause.

Last edited by malfunctioning; 2014-10-22 at 13:17.


malfunctioning	2014-10-22 , 13:45
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#4

I didn't test the correct partition.

Results will follow.

One interesting thing: The partition which has the higher number of errors localized after the 55% point spans the area of the eMMC where the swap partition was before. More precisely, the swap partition was towards the end of this new partition, which is where I'm seeing all those errors. I don't think that's a coincidence.

Last edited by malfunctioning; 2014-10-22 at 13:59.


	javispedro	2014-10-22 , 17:18
	Posts: 2,355 \| Thanked: 5,249 times \| Joined on Jan 2009 @ Barcelona	#5

The eMMC (NOT the 256MiB NAND chip) should actually perform its own wear-leveling and error correction.

If it's giving you bad blocks then I assume that either
a) the firmware is buggy (not impossible), or
b) it ran out of spare blocks, which is a very bad thing: assuming that the firmware is not buggy, the blocks have been "uniformly used". So if a large number of blocks are failing now, the vast majority will also fail "soon".


malfunctioning	2014-10-22 , 17:31
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#6

Originally Posted by javispedro

The eMMC (NOT the 256MiB NAND chip) should actually perform its own wear-leveling and error correction.

If it's giving you bad blocks then I assume that either
a) the firmware is buggy (not impossible), or
b) it ran out of spare blocks, which is a very bad thing: assuming that the firmware is not buggy, the blocks have been "uniformly used". So if a large number of blocks are failing now, the vast majority will also fail "soon".

Thank you, I understand your points.

I don't know if this is relevant, but I have a custom partition setup, and I partitioned and formatted from a Linux machine and not from the N900 itself. I see a potential issue here, because I'm certain the mkfs.ext3 on the laptop is not the same version as the one the N900 has onboard, due to age for one.

Also (very significant, I think): I ran fsck -af from the N900 and it didn't report any bad blocks, but the Linux system picked up bad blocks easily. I don't know what that means, except that it makes me lose my trust in the fsck command on the N900.

I don't know much about wear leveling, but, does that mean that the storage driver dynamically changes the list of known bad blocks and reallocates as it encounters bad blocks? That's what I think you mean.

Last edited by malfunctioning; 2014-10-22 at 17:33.


malfunctioning	2014-10-22 , 17:35
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#7

Update: I still encountered errors in the second partition.

Now, I'm running mkfs.ext3 with a flag of -cc, which means that badblocks will run with the destructive read-write test in order to determine additional bad blocks.

If this doesn't work, I'll start from step 1, but I will format onboard the N900 and test again.

EDIT:
I ran mkfs.ext3 with -cc from the Linux machine. Then, I performed a test by copying data to the N900 from the Linux machine, and one of the big files is still bad. I guess it's just a question of time before the flash memory on the N900 dies.

BTW, somebody elsewhere mentioned that to increase flash memory life:
- It's a good idea not to use ext3 but ext2.
- The volume should be mounted as noatime and async.

The second piece of advice makes sense, but I wasn't aware of ext3 causing more writes to flash.
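
A small sketch for checking whether a volume is already mounted with noatime, by reading the options column of /proc/mounts. The function name has_noatime is hypothetical, and the mount point in the usage note is only an example. async is normally the mount default, so it rarely needs to be set explicitly.

```shell
# has_noatime MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Returns 0 if MOUNTPOINT is mounted with the noatime option, else 1.
# MOUNTS_FILE defaults to /proc/mounts (fields: device, mount point,
# filesystem type, comma-separated options, ...).
has_noatime() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }    # the last matching entry wins
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "noatime") exit 0
            exit 1
        }
    ' "${2:-/proc/mounts}"
}
```

Usage, as root, with /home standing in for whichever mount point you care about: has_noatime /home || mount -o remount,noatime /home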

Last edited by malfunctioning; 2014-10-22 at 18:53.


malfunctioning	2014-10-23 , 15:17
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#8

Finally, I ran mkfs.ext3 -cc -b 1024 on the partition, from the N900 itself. This command performs a destructive read-write test on the partition using four different bit patterns. This should detect and tag bad blocks, so that the filesystem it creates avoids them.

Unfortunately, after copying 2 large files, the second one is still corrupt. Not only that, but fsck reports no problems, and failures are not transparent to the user. Even more, I can mount this file (an Easy Debian image) and it apparently works, even if its bit structure lacks integrity.

This is extremely troubling, as I only found out this file is corrupt by running md5sum on it. Nobody can be expected to md5sum every single file in their linux filesystem. This leads me to believe that any number of files in the system might be corrupt. And this is not acceptable at all. It is a huge problem from many different perspectives. I will start a new thread about this.

I love my N900, but my confidence in it (at least in the OS configuration I use, which is just PR1.3.1) has been severely damaged.

**EDIT**
I will have to temper my negativity here, but I will leave the above portion of the post untouched, as a testament to my impatience and/or ignorance.

I believe I made a breakthrough in understanding what is going on, and I will add additional information to the thread at a later time. I'm still testing and gathering data. But things don't seem to be as bad as I thought.

Last edited by malfunctioning; 2014-10-23 at 22:25.


	javispedro	2014-10-23 , 16:59
	Posts: 2,355 \| Thanked: 5,249 times \| Joined on Jan 2009 @ Barcelona	#9

Originally Posted by malfunctioning

I ran mkfs.ext3 with -cc from the Linux machine. Then, I performed a test by copying data to the N900 from the Linux machine, and one of the big files is still bad. I guess it's just a question of time before the flash memory on the N900 dies.

Yes, especially if you're getting _new_ bad blocks every so often.

Things you should check:
- Stop testing from your PC using USB, because it may be a bad USB cable.
- When you ran badblocks on the N900, did you get new ones every so often? Did the bad blocks "move"? Did they look random? If so, the problem may be caused by a bad contact around the eMMC chip.
Please note that every time you run badblocks you're causing additional wear on the eMMC.
- Have you tried reflashing _both_ the N900 and the eMMC?

Unless it turns out to be a software-side problem, if the number of bad blocks keeps increasing your N900 is as good as dead. Maybe the Neo900 guys can use it for spares... or try to find someone willing to replace the eMMC chip.

Originally Posted by malfunctioning

BTW, somebody elsewhere mentioned that to increase flash memory life:
- It's a good idea not to use ext3 but ext2.
- The volume should be mounted as noatime and async.

The second piece of advice makes sense, but I wasn't aware of ext3 causing more writes to flash.

ext3 causes more writes because of the journal, although in the default configuration ('ordered', which journals only metadata) the actual difference is rather small.

Originally Posted by malfunctioning

Not only that, but fsck reports no problems, and failures are not transparent to the user. Even more, I can mount this file (an Easy Debian image) and it apparently works, even if its bit structure lacks integrity.

This is extremely troubling, as I only found out this file is corrupt by running md5sum on it. Nobody can be expected to md5sum every single file in their linux filesystem.

This is not new at all. Fsck only detects inconsistencies in metadata. It does not detect metadata corruption, much less corruption in the actual data. This is the same in most current desktop filesystems -- ext*, FAT, NTFS (Windows), HFS+ (OS X), etc.

If you are really concerned about filesystems that guarantee data integrity you need to go towards cluster filesystems or at least something more advanced such as ZFS or btrfs.
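
Short of switching filesystems, a checksum manifest is one way to catch silent data corruption on ext3. This is a minimal sketch, not a substitute for a checksumming filesystem; the function names make_manifest and check_manifest are hypothetical, and it assumes find and an md5sum that supports -c.

```shell
# make_manifest DIR MANIFEST  (hypothetical helper)
# Records an MD5 checksum for every regular file under DIR.
# Keep MANIFEST outside DIR, otherwise it would try to list itself.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# check_manifest DIR MANIFEST  (hypothetical helper)
# Re-reads every listed file and compares it with the manifest.
# Returns non-zero if any file is missing, unreadable or changed.
check_manifest() {
    ( cd "$1" && md5sum -c - ) < "$2" > /dev/null 2>&1
}
```

The manifest has to be regenerated after intentional changes to the files, and like any read-back check it only proves what was read at that moment.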


malfunctioning	2014-10-23 , 22:21
Posts: 330 \| Thanked: 556 times \| Joined on Oct 2012	#10

Originally Posted by javispedro

Yes, especially if you're getting _new_ bad blocks every so often.

Things you should check:
- Stop testing from your PC using USB, because it may be a bad USB cable.
- When you ran badblocks on the N900, did you get new ones every so often? Did the bad blocks "move"? Did they look random? If so, the problem may be caused by a bad contact around the eMMC chip.
Please note that every time you run badblocks you're causing additional wear on the eMMC.
- Have you tried reflashing _both_ the N900 and the eMMC?

Thank you, this is great advice. Specifically, I think it wasn't a good idea to run mkfs/fsck/badblocks from the Linux computer on the attached N900 eMMC volume. I think this type of operation should be performed with the tools onboard the N900.

As far as the bad blocks, I honestly think this is not the problem. I don't think I'm getting new bad blocks due to flash wear. I actually am convinced now that the problem has to do with creating a new partition scheme on the eMMC via GParted in the external machine.

I just made a breakthrough: I have taken the number of bad blocks to 0 in one of the partitions. I have also been able to copy 2 instances of the large 3 GB file, and I have successfully run md5sum on both of them. The filesystem seems to be fine now. I will explain later, as I'm still testing and gathering data.

Originally Posted by javispedro

Unless it turns out to be a software-side problem, if the number of bad blocks keeps increasing your N900 is as good as dead. Maybe the Neo900 guys can use it for spares... or try to find someone willing to replace the eMMC chip.

ext3 causes more writes because of the journal, although in the default configuration ('ordered', which journals only metadata) the actual difference is rather small.

Thank you, I am not too concerned about ext3 anymore. Also, I think that the health of my eMMC is no worse than the average N900 (this is just my hypothesis, based on the reasons I explained above, having to do with custom partitions created externally).

Originally Posted by javispedro

This is not new at all. Fsck only detects inconsistencies in metadata. It does not detect metadata corruption, much less corruption in the actual data. This is the same in most current desktop filesystems -- ext*, FAT, NTFS (Windows), HFS+ (OS X), etc.

If you are really concerned about filesystems that guarantee data integrity you need to go towards cluster filesystems or at least something more advanced such as ZFS or btrfs.

Thank you, that's good to know. I think my problem was bad blocks which hadn't been properly written to the list that ext3 maintains internally in order to avoid writing to them.

Interesting information about cluster filesystems, I will have to look into it!