Data Loss Prevention & shred

Data destruction is a critical part of a DLP strategy. I was told that wear leveling made the shred command useless, so let's check.

Setup

We'll check the data destruction process using a simple setup: a Samsung 850 EVO SSD (128GB) used as a data disk (not the disk the OS is installed on), filled with zeroes, formatted as ext4, and containing a single file with clearly identifiable data. We will then try to delete this file and see how its data (on the disk as a block device) is erased.

Format (zeroing)

Back up the disk (just in case), then fill it with zeroes
The process will stop (with an error) when the disk is full
So now we no longer have any partition on this fully-zeroed disk
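For reference, a minimal sketch of this step, assuming the data disk is /dev/sde (as later in this post) and the backup goes to an image file:

    # back up the raw device first (destination path is just an example)
    sudo dd if=/dev/sde of=/backup/sde.img bs=4M status=progress
    # fill the device with zeroes; dd exits with "No space left on device"
    sudo dd if=/dev/zero of=/dev/sde bs=4M status=progress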

Create a partition

We create the ext4 partition
And we now have a partitioned disk with all content blocks filled with zeroes
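The exact commands are not shown here, but one way to get there, assuming /dev/sde and a /mnt mount point:

    # single partition spanning the disk, formatted as ext4, then mounted
    sudo parted /dev/sde --script mklabel gpt mkpart primary ext4 0% 100%
    sudo mkfs.ext4 /dev/sde1
    sudo mount /dev/sde1 /mnt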

A file with AAAs

We then create a file full of 1024*32 = 32k "A"s so it will be easy to spot in the disk's data blocks
We'll call the file "BBBBB" so we don't mix the file name ("B"s) and its content ("A"s)
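One possible way to create it (the /mnt mount point is an assumption):

    # 1024*32 = 32768 "A"s in a file named BBBBB
    head -c $((1024*32)) /dev/zero | tr '\0' 'A' > /mnt/BBBBB
    sync   # push the content out of the write cache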

Regular file modification

Now that we have a file on disk, let's check its content's life cycle: how the disk stores the file content, and how that storage evolves as the file is used.

How block devices work

Any storage medium is seen by the OS as a "block device", that is, a very long stripe of bytes. Bytes are grouped into sectors (512 bytes per sector) and sectors into blocks (usually 8 sectors per block, so 4k, though that can be customized to, say, 64k).

So when a file must be stored, its content is split into blocks (here, 4k) and saved on the disk. Each block is subdivided into sectors, and the sectors of a block are stored contiguously. But the blocks themselves may be spread around the disk if needed, leading to the so-called "fragmentation".

So for our setup file, we should get a single range of blocks (no fragmentation), 32k bytes long (because the file has 32k characters): 8 blocks of 4k, each split into 8 sectors of 512 bytes, for a total of 64 sectors of 512 bytes.

Locate the file content

We check where the file is stored on the block device
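The post relies on hdparm for this; the lookup looks like:

    # print the file's extents as LBA ranges on the underlying device
    sudo hdparm --fibmap /mnt/BBBBB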

As we can see, the file is stored from sector 274432 to sector 274495, i.e. 64 sectors of 512 bytes each. This matches the expected 32k of "A"s in the file content.

If we check the block device near that place, from sector 274430, we see the file content data is indeed there

dd reads raw data from the block device in "blocks". With the bs=512 option, we tell dd to use "blocks" of 512 bytes (matching the sector size), so we can skip (ignore) the first 274430 of them and see the first sectors of the file's data.
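Concretely, a sketch of that read:

    # skip the first 274430 sectors, then dump the next few:
    # two sectors of zeroes followed by the first "A"s of the file
    sudo dd if=/dev/sde bs=512 skip=274430 count=4 2>/dev/null | hexdump -C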

We could also have done a direct dd if=/dev/sde and ignored the first 274432 * 512 = 140509184 characters. The result would have been the same, but it's uselessly tedious.

Using the precise values from hdparm, we can use dd to exactly read the file's data content
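Something like:

    # 64 sectors starting at LBA 274432: exactly the file's 32k of "A"s
    sudo dd if=/dev/sde bs=512 skip=274432 count=64 2>/dev/null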

Lifecycle with nano

Let's use nano to change file content to emulate data life cycle
We put a "B" and a "Z" at the beginning and end of the file content, then save
The data change is reflected on the cat command, and we see the "B" in the file content
But if we dd the device at the content blocks, we don't see the change!
Looking at the sectors before the file content, we still see "zeroes"…
…and the same in the sectors after the file content
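These checks are simple re-reads of the original extent and its neighbors:

    # the original extent still shows the unchanged "A"s...
    sudo dd if=/dev/sde bs=512 skip=274432 count=64 2>/dev/null | head -c 32
    # ...and the sectors just before and after it still show zeroes
    sudo dd if=/dev/sde bs=512 skip=274430 count=2 2>/dev/null | hexdump -C
    sudo dd if=/dev/sde bs=512 skip=274496 count=2 2>/dev/null | hexdump -C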

Content location has changed

If we check where the file content is stored, we'll see that the allocated blocks changed

LBA means Logical Block Address.

That means the file content has been stored in a different place than before. This explains why we still see the old data at the previous location: it was not erased by the OS; instead, the OS stored the new file content in other sectors, somewhere else on the block device.

Although I don't have the direct link to the documentation (lazy, you know!), this is actually expected behavior: when saving changes, nano does not overwrite the existing file directly; instead, it writes the changes to a new file and removes the old one, leading the OS to reallocate the content location on the block device.
The blocks allocated for the original file's content are now unallocated (but not erased), and some previously unallocated blocks are now allocated to store the new file's content.

There might be file systems or OSes that overwrite every block when unallocating it, to drastically reduce the data-leak risk. But such FSes or OSes would have far weaker I/O performance.

This is a very common behavior in software in general, which does not guarantee to overwrite the original file and often saves changes to a different location. It could also be the OS doing some sort of "software-based wear leveling" (see above), but I actually doubt it.

But also, caches

So, if we check the new content location, we should see the data, right?

Wrong. Checking the new location shows only "zeroes" here too, so where is the data?
Let's unmount the disk to flush any write cache, in case there is one
Then we remount it
Now if we read the blocks at the new place, we see our new data, starting with a "B"!
And at the end of the new location, we see the "Z"
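As a sketch (the new extent's start sector is not reproduced here; take it from hdparm --fibmap):

    # unmount to flush the write cache, then remount
    sudo umount /mnt && sudo mount /dev/sde1 /mnt
    # re-read the new extent; replace NEW_LBA with the hdparm value
    sudo dd if=/dev/sde bs=512 skip=NEW_LBA count=64 2>/dev/null | hexdump -C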

This is actually not a surprising result, because the OS (kernel) keeps write caches for block devices. It is also the reason why you should unmount USB removable storage properly before unplugging it: data "written" to the device might actually sit in the cache, not yet flushed to the physical drive, leading to data loss if the drive is removed without first telling the kernel (OS) to flush its cache.

So what we have learnt here is that editing software may leak data by reallocating the file content to new blocks/sectors, meaning the actual file content can appear several times on the drive, in unallocated space.

Short files

We have done the tests, so far, with a file whose content exactly matches the block size (4k) or one of its multiples. But what if we have a longer file (say, 4k) and we shorten it (to 1k)?

We make a file with a long text

Now, what happens if we reduce the text inside the file? The text will not fill the block, but will the old text remain (in that block) or will it be erased?

And then, we delete part of its content and see the changes on the block device
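The post did this interactively with hexedit; as a non-interactive substitute, truncate(1) also shrinks a file in place without reallocating its blocks:

    # shrink the file in place (file name is an example)
    truncate -s 1K /mnt/longtext.txt
    sync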

It seems that the rest of the text is erased (with zeroes). But I would not rely on that, since I'm unsure whether it's a kernel-level action or whether it depends on the software used (here, hexedit, because nano would have reallocated the blocks on the device, making the test useless).

What might also happen (depending on the software? the kernel? the OS? the file system? I don't know) is given below as an example.

We make a file whose last line contains a password
We then edit the file to remove this password
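For illustration, the setup could be (file name and content are made up):

    # a file whose last line holds a password
    printf 'line one\nline two\npassword=SuperSecret\n' > /mnt/secret.txt
    sync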

This is a forged example: in reality, and as seen above, using nano would allocate a new range of blocks, so the password would remain somewhere on the disk, in the unallocated blocks

When investigating the disk, we might see that the file size is truncated, but the data remains in the block

Again, this example is forged, so I cannot really tell whether the rest of the block is always reset to zeroes by the kernel/OS, or whether it is up to the software to do so.
In any case, I would not rely on this for DLP: I would assume that the blocks allocated for a file might contain trailing data, including sensitive information that was supposed to be removed from the files.

Shredding the data

Now, let's remove our previous file (with rm) and create a new one with new content. Then, we'll try to completely destroy its data, making the actual content unrecoverable.

We create a new CCCCCCC file containing YCCC...CCCZ characters, at LBA 274432

We can see that this new file content is located at the same block address as the previous file, which we deleted. That means the OS has allocated these blocks again to store new data.

Once unmounted and remounted, the device blocks show the file content at the expected place
We then run the shred command on the file; as expected (because of the cache), the old data still appears on the device
After unmount/remount, we see that the data has been properly erased
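The whole sequence, as a sketch:

    # overwrite the file content in place (3 random passes by default)
    sudo shred /mnt/CCCCCCC
    # right after this, the raw device still shows the old content (cache)
    sudo dd if=/dev/sde bs=512 skip=274432 count=8 2>/dev/null | hexdump -C
    # unmount to flush, remount, and the extent no longer shows the content
    sudo umount /mnt && sudo mount /dev/sde1 /mnt
    sudo dd if=/dev/sde bs=512 skip=274432 count=8 2>/dev/null | hexdump -C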

Using shred, the file content seems properly destroyed, as long as you unmount the device to avoid any cache-latency issue.

"Wear Leveling"?

Last thing: what about that "wear leveling" we mentioned at the start of this post? Does it mean the content of a shredded file could remain somewhere on the disk? Let's check.

If we suppose that "wear leveling" stores data at a different physical place on the disk, can we spot it? We can assume the Logical Block Addressing is not directly mapped to the physical blocks (PB) on the device: for instance, the block at LBA 1 could be stored in the block at PB 42. But if so, where is LBA 42 stored? It cannot be at PB 42 (already mapped and occupied), so it must be somewhere else, like PB 33. But then, where is LBA 33 stored? And so on…

By the pigeonhole principle, this means wear leveling can only "swap" blocks within the mapping; it can never add or remove blocks from it (note this assumes there are exactly as many physical blocks as logical ones; real SSDs usually over-provision spare blocks, which weakens this argument). If a swap happens, then by reading the entire disk we should cross the path of both swapped blocks, and we should see the data remapped by wear leveling.

Lost? Here is an example: we have file content stored at LBA 42, and we overwrite it with random data using shred. Suppose wear leveling stores the random data somewhere else on the physical drive (meaning the original data is still readable from the physical blocks); then, if we read the entire drive, we should see the original data somewhere. If we don't see it, then wear leveling is not happening.

Let's read the entire drive, looking for AAAA or CCCC
We see the original "AAAA" content (left behind by nano's reallocation) and the file names in the file table
Interestingly, we also see nano's .swp (temporary swap) file and the shredded CCCC file name
But nothing else!
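The scan itself can be as simple as:

    # stream the whole device and search for leftover "A" or "C" runs
    sudo dd if=/dev/sde bs=4M | strings | grep -E 'AAAA|CCCC'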

That means either wear leveling does not exist on this drive, or it is the OS/kernel's responsibility.

Maybe I'm mistaken here, and a very advanced, skilled attacker could extract data from the physical blocks, but I highly doubt it (except maybe by using the physical redundancy and error correction). Still, for the vast majority of data, this means shred properly destroyed the content, though the file names might remain in copies of the file tables.

What about mechanical drives?

The "wear leveling" of the introduction was aimed for SSD (to enhance their life duration) but is it applicable to mechanical drives?

Let's start by making the same type of file on a mechanical drive
We edit its content
And, as we can see, the LBA changed anyway
And the previous location still contains the old file's content

So no matter the drive type, it seems the block allocation changes and rotates on any current drive technology.

Still, I also have some drives on which I delete and add files on a regular basis (system drives living a regular system life). And, over time, the compressed size of a raw copy of the drive image does not change, suggesting that the allocated blocks tend to remain the same.
Hence, it looks like the block allocations may "rotate" and change, but I would not rely on this behavior in general.

Conclusion

The main points we have seen so far are:

- Editors like nano save changes to newly allocated blocks, so the old content remains readable in the unallocated space
- The kernel's write cache means the raw device may not reflect recent changes until the file system is unmounted
- shred destroys a file's content in place, but the file name can persist in the duplicated file tables
- A shortened file may leave trailing data in its last allocated block
- This behavior is the same for SSDs and mechanical drives

So the best option for DLP is to zero the full device using dd if=/dev/zero of=/dev/sd* or, for more critical data, shred -zn 5 /dev/sd* to fill the disk with random data. The number of iterations (5 here) may depend on legal recommendations (as far as I remember, 3 is often enough for a NIST purge).

Bonuses

Crypto Erasing

Is it OK for DLP to just delete the encryption key if the device is encrypted?

I would say: no, because how can you be sure you have erased all copies of that key? We have seen that the file table is duplicated (the BBBBBBB and CCCCCCC file names persist in the middle of the disk, in the file table backups), so the encryption key could remain somewhere (not to mention that a malicious user might have exfiltrated the key too).

Partitions

Can I shred/zero a partition and not the entire disk?

I would say "yes", because the partition should have a single block range and so should not spread around the physical device, but I'm not 100% sure. So I would still advice to shred/zero the entire disk in case you want to protect your data from leaks.

TL;DR

What's the easiest purge process to do?

sudo shred -zun 10 ./the-file-to-purge.txt if the file was never edited/opened.
sudo dd if=/dev/zero of=/dev/sd* bs=120M (bs is optional, just for speed) or sudo shred -zn 5 /dev/sde for the paranoid.
