Data Loss Prevention & shred

Data destruction is a critical part of any DLP (Data Loss Prevention) strategy. I was told that wear leveling made the shred command useless. So let's check.

Setup

We'll check the data destruction process using a simple setup: a Samsung 850 EVO SSD (128GB) data disk (not the disk the OS is installed on) that is full of zeroes, formatted in ext4, and that contains a single file with clearly identifiable data. We will then try to delete this file and see how the data (on the disk, as a block device) is erased.

Format (zeroing)

Backup the disk (just in case), and fill it with zeroes
The process will stop (with error) when the disk is full
So now we no longer have any partition in this fully-zeroed disk
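For reference, a minimal sketch of these two steps, assuming the test disk is /dev/sde (the device used later in this post) and an arbitrary backup image path:

    $ sudo dd if=/dev/sde of=./sde-backup.img bs=4M status=progress   # backup the raw device, just in case
    $ sudo dd if=/dev/zero of=/dev/sde bs=4M status=progress          # fill with zeroes; dd stops with "No space left on device" once full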

Create a partition

We create the ext4 partition
And we now have a partitioned disk with all content blocks filled with zeroes
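One possible way to do this, assuming /dev/sde and a /mnt mount point (the exact tools used in the original run are not shown):

    $ sudo parted --script /dev/sde mklabel msdos mkpart primary ext4 0% 100%
    $ sudo mkfs.ext4 /dev/sde1    # create the ext4 file system on the new partition
    $ sudo mount /dev/sde1 /mnt   # mount it so we can create files on it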

A file with AAAs

We then create a file containing 1024*32 = 32k "A"s so it will be easy to spot in the disk's data blocks
We'll call the file "BBBBBBB" so we don't mix the file name ("B"s) and its content ("A"s)
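A sketch of one way to create such a file (the /mnt path is an assumption):

    $ head -c 32768 /dev/zero | tr '\0' 'A' > /mnt/BBBBBBB   # 32k bytes, all "A"s
    $ sync                                                   # flush the write cache so the data lands on disk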

Regular file modification

Now that we have a file on disk, let's check what its data (content) life cycle is. We'll check how the disk is storing the file content, and how that storage evolves with the file usage.

How block devices work

Any storage medium is seen by the OS as a "block device", that is, a very long stripe of bytes. Bytes are grouped into sectors (commonly 512 bytes per sector) and sectors into blocks (usually 8 sectors per block, so 4k, though this can be customized to, say, 64k).

So when a file must be stored, its content is split into blocks (here, 4k) and saved on the disk. Each block is subdivided into sectors, and the sectors of a given block are stored contiguously. Blocks themselves, however, may be spread around the disk if needed, leading to the so-called "fragmentation".

So for our setup file, we should get a single contiguous block range (no fragmentation) of 32k bytes (because the file has 32k characters), that is 8 blocks of 4k each, each split into 8 sectors of 512 bytes, for a total of 64 sectors of 512 bytes.

Locate the file content

We check where the file is stored on the block device
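This can be done with hdparm, the tool whose values are used below:

    $ sudo hdparm --fibmap /mnt/BBBBBBB   # prints the file's extents as absolute LBAs (here: 274432 to 274495)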

As we can see, the file is stored from sector 274432 to sector 274495, meaning it is 64 sectors long with 512 bytes per sector. This matches the expected 32k of "A"s in the file content.

If we check the block device near that place, from sector 274430, we see the file content data is indeed here

dd reads "blocks" of raw data from the block device. With the bs=512 option, we tell dd to use "blocks" of 512 bytes (matching the sector size), so we can skip (ignore) the first 274430 blocks and see the first sectors of the file's data.
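Concretely, something like this (piping to hexdump for readability is my addition):

    $ sudo dd if=/dev/sde bs=512 skip=274430 count=4 2>/dev/null | hexdump -C
    # sectors 274430 and 274431 are still zeroes, then the "A"s of the file start at 274432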

We could also have done a direct dd if=/dev/sde and ignored the first 274432 * 512 = 140509184 characters. The result would have been the same, but it's needlessly tedious.

Using the precise values from hdparm, we can use dd to exactly read the file's data content
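Something like:

    $ sudo dd if=/dev/sde bs=512 skip=274432 count=64 2>/dev/null   # exactly the 64 sectors (32k) of the file: all "A"s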

Lifecycle with nano

Let's use nano to change file content to emulate data life cycle
We put a "B" and a "Z" and the begining and end of the file content, then save
The data change is reflected on the cat command, and we see the "B" in the file content
But if we dd the device on the content blocks, we don't see the change!
Looking at the sectors before the file content, we still see "zeroes" …
…and same in the sectors after the file content ones
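A sketch of that sequence (paths and locations assumed as before):

    $ nano /mnt/BBBBBBB   # put a "B" at the beginning and a "Z" at the end, then save
    $ head -c 8 /mnt/BBBBBBB                                                   # the file shows the change: BAAAAAAA
    $ sudo dd if=/dev/sde bs=512 skip=274432 count=64 2>/dev/null | head -c 8  # the old location still shows AAAAAAAA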

Content location has changed

If we check where the file content is stored, we'll see that the allocated blocks changed

LBA means Logical Block Address.
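We can re-run hdparm and keep the new address around for the next steps (the value below is hypothetical; use the begin_LBA reported on your own run):

    $ sudo hdparm --fibmap /mnt/BBBBBBB   # the begin_LBA is no longer 274432
    $ NEW_LBA=281600                      # hypothetical value: the begin_LBA reported above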

That means the file content has been stored in a different place than before. This explains why we still see the old data at the previous location: it was not erased by the OS; instead, the OS stored the new file content in other sectors, somewhere else on the block device.

Although I don't have the direct link to the documentation (lazy, you know!), this is actually expected behavior: when saving changes made in nano, the text editor does not overwrite the existing file directly; instead, it makes a new one to save the changes to and removes the old one, leading the OS to reallocate the content's location on the block device.
The blocks allocated for the original file's content are now unallocated (but not erased), and some previously unallocated blocks are now allocated to store the new file's content.

There might exist file systems or OSes that overwrite any block when unallocating it, to drastically reduce the data-leak risk. But in that case, such a FS or OS would have much worse I/O performance.

This is very common behavior for software in general, which does not guarantee that the original file is overwritten in place and often saves the changes to a different location. Thus, this could also be the OS doing some sort of "software-based wear leveling" (see above), but I actually doubt it.

But also, caches

So, if we check the new content location, we should see the data, right?

Wrong. Checking the new location shows only "zeroes" here too, so where are the data?
Let's unmount the disk to have any write-cache flushed, in case there are some
Then we remount it
Now if we read the blocks at the new place, we see our new data, starting with a "B"!
And at the end of the new location, we see the "Z"
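The whole check, as a sketch (using the NEW_LBA variable captured above):

    $ sudo dd if=/dev/sde bs=512 skip=$NEW_LBA count=64 2>/dev/null | head -c 8   # still zeroes: not flushed yet
    $ sudo umount /mnt                    # unmounting forces the kernel to flush its write cache
    $ sudo mount /dev/sde1 /mnt
    $ sudo dd if=/dev/sde bs=512 skip=$NEW_LBA count=64 2>/dev/null | head -c 8   # now starts with "B"
    $ sudo dd if=/dev/sde bs=512 skip=$((NEW_LBA + 63)) count=1 2>/dev/null | tail -c 8   # the last sector ends with the "Z"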

This is actually not a surprising result, because the OS (kernel) keeps write caches for block devices. This is the same reason why you should eject USB removable storage properly (by unmounting it first) to avoid data loss: otherwise, data "written" to the removable device might actually sit in the cache, not yet flushed to the physical drive, leading to data loss if the drive is physically removed without first telling the kernel (OS) to flush its cache.

So what we have learnt here is that editing software may leak data by reallocating the file content to new blocks/sectors, meaning the actual file content can appear several times on the drive, in unallocated space.

Short files

We have done the tests, so far, with a file whose content exactly matches the block size (4k) or a multiple of it. But what if we have a larger file (say, 4k) and we shorten it (to 1k)?

We make a file with a long text

Now, what happens if we reduce the text inside the file? The text will not fill the block, but will the old text remain (in that block) or will it be erased?

And then, we delete part of its content and see the changes on the block device
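The original test used hexedit interactively. Here is a scriptable approximation, assuming truncate(1) behaves comparably (an assumption, since a different code path could handle the block tail differently); the file name and LBA value are hypothetical:

    $ printf 'a reasonably long line of text that fills part of the block\n' > /mnt/shorttest.txt
    $ sync && sudo hdparm --fibmap /mnt/shorttest.txt   # note the begin_LBA
    $ LBA=281664                                        # hypothetical: the begin_LBA reported above
    $ truncate -s 16 /mnt/shorttest.txt                 # shorten the file in place, no reallocation
    $ sudo umount /mnt && sudo mount /dev/sde1 /mnt     # flush caches
    $ sudo dd if=/dev/sde bs=512 skip=$LBA count=1 2>/dev/null | hexdump -C   # is the tail zeroed, or does the old text remain?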

It seems that the rest of the text is erased (filled with zeroes). But I would not rely on that, since I'm unsure whether it's a kernel-level action or whether it depends on the software used (here, hexedit, because nano would have reallocated the blocks on the device, making the test useless).

What might also happen (depending on software? kernel? OS? file system? I don't know) is given below as an example.

We make a file in which we have a password, at the last line
We then edit the file to remove this password

This is a forged example: in reality, and as seen above, using nano would allocate a new range of blocks, and so the password would remain somewhere on the disk, in the unallocated blocks

When investigating the disk, we might see that the file size is truncated, but data remains in the block

Again, this example is forged, so I cannot really tell whether the rest of the block is always reset to zeroes by the kernel/OS, or whether it is up to the software to do so.
In any case, I would not rely on this for DLP, and I would assume that the blocks allocated for a file might have trailing data in them, and could contain sensitive information that was supposed to be removed from the files.

Shredding the data

Now, let's remove our previous file (with rm), and create a new one with new content. Then, we'll try to see if we can completely destroy its data, making the actual data content unrecoverable.

We create a new CCCCCCC file containing YCCC…CCCZ characters, at LBA 274432
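A sketch of the file creation (total size 1 + 32766 + 1 = 32k bytes):

    $ { printf 'Y'; head -c 32766 /dev/zero | tr '\0' 'C'; printf 'Z'; } > /mnt/CCCCCCC
    $ sync && sudo hdparm --fibmap /mnt/CCCCCCC   # begin_LBA: 274432 again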

We can see that this new file content is located at the same block address as the previous file, which we deleted. That means the OS has allocated these blocks again to store new data content.

Once unmounted and remounted, the device blocks show the file content at the expected place
We then run the shred command on the file and, as expected (because of the cache), the data still appears there
After unmount/remount, we see that the data have been properly erased
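The whole sequence, sketched:

    $ sudo dd if=/dev/sde bs=512 skip=274432 count=8 2>/dev/null | head -c 8   # after remount: YCCCCCCC as expected
    $ sudo shred -z /mnt/CCCCCCC   # 3 random passes by default, then -z adds a final pass of zeroes
    $ sudo dd if=/dev/sde bs=512 skip=274432 count=8 2>/dev/null | head -c 8   # cache: the old data still shows
    $ sudo umount /mnt && sudo mount /dev/sde1 /mnt
    $ sudo dd if=/dev/sde bs=512 skip=274432 count=64 2>/dev/null | hexdump -C # all zeroes now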

Using shred, the file content seems properly destroyed, as long as you have unmounted the device to avoid any cache-latency issue.

"Wear Leveling"?

Last thing: what about that "wear leveling" we mentioned at the start of this post? Does it mean the content of a shredded file could remain somewhere on the disk? Let's check.

If we suppose that "wear leveling" stores data at a different physical place on the disk, can we spot it? We could suppose that the Logical Block Addressing is not directly mapped to the physical blocks (PB) on the device. For instance, the block at LBA 1 could be stored in the block at PB 42. But if so, where is LBA 42 stored? It cannot be at PB 42 (already mapped and occupied), so it must be somewhere else, like PB 33. But then, where is LBA 33 stored? And so on…

By the pigeonhole principle, this means wear leveling can only "swap" blocks within the mapping, never add or remove blocks from it. If a swap is done, then by reading the entire disk we should cross paths with both swapped blocks, and we should see the data remapped by wear leveling.

Lost? Here is an example: we have file content stored at LBA 42, and we overwrite it with random data using shred. Suppose wear leveling stores the random data somewhere else on the physical drive (meaning the original data is still readable from the physical blocks); then, if we read the entire drive, we should see the original data somewhere. If we don't, then wear leveling is not happening.

Let's read the entire drive, looking for AAAA or CCCC (one way to do this is sketched after this list)
We see the very original content from the AAAA file (because of nano's reallocation) and the file names in the file table
Interestingly, we also see the .swp (temporary swap) file of nano and the shredded CCCC file name
But nothing else!
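One way to run that scan, assuming GNU strings (-t d prints the decimal byte offset of each match):

    $ sudo strings -t d /dev/sde | grep -E 'AAAA|CCCC'   # lists every place the patterns still appear, with offsets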

That means either wear leveling does not exist on this drive, or it's the OS/kernel's responsibility.

Maybe I'm mistaken here and a very advanced, skilled attacker could extract data from the physical blocks, but I highly doubt it (except maybe by using physical redundancy and error correction). Still, for the vast majority of data, this means shred properly destroyed the data content, though the file names might remain in copies of the file tables.

What about mechanical drives?

The "wear leveling" of the introduction was aimed for SSD (to enhance their life duration) but is it applicable to mechanical drives?

Let's start by making the same type of file on a mechanical drive
We edit its content
And, as we can see, the LBA changed anyway
And the previous location still contains the old file's content
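The same sequence, sketched for the mechanical drive (the /mnt/hdd mount point, /dev/sdf device, and LBA value are all hypothetical):

    $ head -c 32768 /dev/zero | tr '\0' 'A' > /mnt/hdd/BBBBBBB && sync
    $ sudo hdparm --fibmap /mnt/hdd/BBBBBBB   # note the begin_LBA
    $ nano /mnt/hdd/BBBBBBB                   # edit the content, save
    $ sudo hdparm --fibmap /mnt/hdd/BBBBBBB   # the begin_LBA changed here too
    $ OLD_LBA=1050624                         # hypothetical: the first begin_LBA reported above
    $ sudo dd if=/dev/sdf bs=512 skip=$OLD_LBA count=64 2>/dev/null | head -c 8   # old location still holds the previous content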

So no matter the drive type, it seems the block allocation can change and rotate, regardless of the actual drive technology.

Still, I also have some drives on which I delete and add files on a regular basis (system drives with a regular system life). And, over time, the compressed size of a raw copy of the drive image does not change, suggesting that the allocated blocks tend to remain the same.
Hence, it looks like the block allocations may "rotate" and change, but I would not rely on this behavior in general.

Conclusion

The main points we have seen so far are:
- editing software often reallocates a file's content to new blocks, leaving the old content readable in the unallocated space;
- kernel write caches mean the raw device may not reflect recent changes until the file system is unmounted (or flushed);
- the blocks of a shortened file may retain trailing data;
- shred properly destroys a file's content (on this drive, at least), but file names can persist in copies of the file table.

So the best option for DLP is to zero the full device using dd if=/dev/zero of=/dev/sd* or, for more critical data, shred -zn 5 /dev/sd* to fill the disk with random data. The number of iterations (5 here) may depend on legal recommendations (as far as I remember, 3 is often enough for a NIST purge).

Bonuses

Crypto Erasing

Is it OK for DLP to just delete the encryption key if the device is encrypted?

I would say: no, because how can you be sure you have erased all copies of that key? We have seen that the file table is duplicated (the BBBBBBB and CCCCCCC file names persist in the middle of the disk, in the file table backups), so the encryption key could remain somewhere (not to mention that a malicious user might have exfiltrated the key too).

Partitions

Can I shred/zero a partition and not the entire disk?

I would say "yes", because the partition should have a single block range and so should not spread around the physical device, but I'm not 100% sure. So I would still advice to shred/zero the entire disk in case you want to protect your data from leaks.

TL;DR

What's the easiest purge process to do?

sudo shred -zun 10 ./the-file-to-purge.txt if the file was never edited/opened.
sudo dd if=/dev/zero of=/dev/sd* bs=120M (bs is optional, just for speed) or sudo shred -zn 5 /dev/sde for paranoiacs.