Data Lakes: Cleaning Up Data’s Junk Drawer

We all have that place where we end up stashing the things we think we’ll need or want someday. Some of us throw the stuff in a junk drawer in the kitchen. Others squirrel it away in the attic or in a closet in the spare bedroom.

On occasion, we do venture into these storage spaces and unearth an item that proves extremely useful in solving a problem. But in most instances, the things we deemed essential in the moment are left in a jam-packed drawer or the dark corner of a closet, forgotten and worthless, yet taking up valuable space that could be put to better use.

This is precisely the situation many organizations face today with their data.

A Junk Drawer Full of Data

The amount of data businesses produce continues to grow at a dizzying speed. Many organizations migrate their data into a Data Lake, drawn by its inherent scalability and flexibility. What goes in the lake stays in the lake. On the surface, this appears to be a smart business move, since data is their most valuable asset.

But dive beneath the surface and you’ll discover that using a Data Lake as a repository without considering how the data will be used makes it no better than a junk drawer. Sure, the lake may store a vast amount of data, but all of the raw data in the world is of little worth if there isn’t a process in place for unlocking its value. Even worse, like an unopened letter buried at the bottom of the drawer, the lake may hold private information that you don’t want others to see.

Many businesses have Data Lakes that are little more than virtual junk drawers: reservoirs that house data from disparate sources across the enterprise. The problem is that most of this data is never accessed. In fact, it’s not uncommon for most users to find that only a small percentage of the data sets are truly valuable to them. The rest is submerged in the lake, an uncataloged, useless jumble of data sets taking up costly space without providing the ROI businesses expect. Users don’t know how to find data sets in the lake. Even when they can, it’s difficult and time-consuming to determine which ones are the best, or whether they should have access to them at all.

