For any business with a data strategy in place, the next step on the roadmap to data transformation is to capture all the structured and unstructured data flowing into the organization. To do so, organizations must create a data lake to store data from IoT devices, social media, mobile apps, and other disparate sources in a usable way.
A data lake, per AWS, is a centralized repository that allows an organization to store data as is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
Data lakes differ from data warehouses in how they organize data. A data warehouse is like a library: as data comes in, it gets carefully filed according to a structured system that has been defined in advance, making it easy and quick to find exactly what you’re looking for given a specific request. A data lake has no predefined schema, which means data can be stored without needing to know what questions may require answers in the future. It works more like an e-bookstore: you can search generally, pull up all relevant results from various media types, and make decisions based on machine learning recommendations and other people’s insights.
Many organizations are evolving their data storage to incorporate data lakes. However, maintaining any online information storage comes with security risks that must be identified and mitigated.
Over the past few decades, improvements in compute power and storage capacity, coupled with much more affordable storage prices, have made it possible to store massive amounts of data in one place. Not long ago, storing a database of every citizen’s Social Security number would have been impractical. Now it costs pennies on the dollar to store that data as a table in a data lake.
As much opportunity as large-scale data storage provides organizations, it also creates risk. When vulnerabilities occur in repositories, their infrastructure, or any dependencies, the level of impact depends on the type and scale of the information that was compromised. Since data lakes hold vast amounts of data in a single location, the impact of a breach is often spectacular in scale.
Common tactics hackers use to exploit enterprise data are Initial Access, Defense Evasion, and Credential Access. Kurt Alaybeyoglu, Senior Director of Cybersecurity and Compliance at Strive Consulting, says organizations often make a mistake by focusing too strongly on preventing Initial Access—a cybercriminal getting into the org’s network. Data lakes interact with so many sources that it doesn’t take network access to be able to cause damage.
“The two primary security risks in a data lake,” Kurt says, “are exfiltration of and impact to sensitive data.” As the name suggests, data exfiltration is the unauthorized transfer of data. Attackers can either steal specific pieces of data or, more often, simply take a copy of an entire lake, akin to a burglar carrying away a safe so they can open it and rifle through its contents at their leisure. Data impact ranges from encrypting the data in the lake to wiping it, corrupting it, or destroying the means of access to the platform.
Both can be, and have proven to be, catastrophic for an organization’s survival.
Facing such dire consequences in the event of a cyberattack, why do businesses choose to use data lakes? Conventional wisdom says not to keep all your eggs in one basket—compartmentalizing data to avoid total compromise is surely more secure. But for many, according to Kurt, the rewards of data lakes outweigh the risks.
“Being able to access massive data at your fingertips with simple queries is what allows modern apps to exist,” he explains. “Take Uber as an example. Uber, as a technology, completely disrupted the taxi service model. It got rid of the need for dispatchers because at its heart was software that acted as one, pairing users and drivers faster than most humans can. Their software functions because Uber created a data lake that contains information like riders, drivers, maps, payment information, etc. that allow all of these disparate aspects to function seamlessly.”
While separating this data into different repositories may be more secure, the application would take significantly longer to function, from running all the queries to processing payments to calculating ride times, which would completely undermine the app’s usefulness. Not to mention, the added complexity would make securing the data just as difficult, if not more so.
“As security professionals, we have to try to mitigate those risks as best as possible,” he says. “At the end of the day, data security is a business function. Our job is to say ‘yes, we can do that, but here are the risks.’ Leaders must decide what they’re willing to pay to mitigate, what they’ll pay to transfer, and what risks they’re willing to accept.”
What makes data lakes so risky is that the valuable commodity, data, by necessity must be accessible, whether that’s to a platform, an end user, or someplace else. The data must be available in order to be useful. So, an organization’s top three focus points to protect that data are as follows:
More people with unfettered access to the data lake means more potential entry points for a hacker to attempt to exploit. To secure the data lake, be thoughtful about who can access it and when. Validate those users’ identities using strong passwords and multi-factor authentication (MFA). If the data lake contains particularly sensitive information, consider hardware-backed authenticators such as FIDO2 security keys.
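As a rough illustration of being deliberate about who can reach which data, the sketch below denies access by default, requires MFA for everything, and requires a hardware key for the most sensitive datasets. All names here (`User`, `DatasetPolicy`, `DATASET_POLICY`, `can_access`, the roles, and the dataset names) are hypothetical and chosen for illustration; a real data lake would enforce this through its platform's own identity and access management.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset
    mfa_verified: bool = False
    hardware_key_verified: bool = False  # e.g. a FIDO2 key was used at sign-in


@dataclass(frozen=True)
class DatasetPolicy:
    allowed_roles: frozenset
    requires_hardware_key: bool = False


# Hypothetical policy table: dataset name -> who may read it and how strongly
# they must have authenticated.
DATASET_POLICY = {
    "ride_history": DatasetPolicy(frozenset({"analyst", "data_engineer"})),
    "payment_info": DatasetPolicy(
        frozenset({"payments_admin"}), requires_hardware_key=True
    ),
}


def can_access(user: User, dataset: str) -> bool:
    """Return True only if every check passes; anything unknown is denied."""
    policy = DATASET_POLICY.get(dataset)
    if policy is None:
        return False  # default deny for datasets with no explicit policy
    if not user.mfa_verified:
        return False  # MFA is required for every dataset
    if policy.requires_hardware_key and not user.hardware_key_verified:
        return False  # sensitive data needs the stronger factor
    return bool(user.roles & policy.allowed_roles)


analyst = User("dana", frozenset({"analyst"}), mfa_verified=True)
admin_no_key = User("eli", frozenset({"payments_admin"}), mfa_verified=True)
admin_with_key = User(
    "fay", frozenset({"payments_admin"}), mfa_verified=True,
    hardware_key_verified=True,
)

print(can_access(analyst, "ride_history"))
print(can_access(admin_no_key, "payment_info"))
print(can_access(admin_with_key, "payment_info"))
```

The design choice worth noting is the order of the checks: the function starts from "no" and only returns "yes" once the dataset is known, the user has passed MFA, and the role matches.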
Because data lakes and supporting platforms aren’t tied to a single device, hackers no longer need to achieve initial access to get ahold of the data. For most applications that interact with data lakes, a successful breach may only take a SQL or command injection that forces the system to respond with data it’s not supposed to—no device compromise needed. Because of that risk, proactively looking for the holes in a data lake’s security is paramount. Use a combination of application threat modeling, vulnerability scans, and application penetration testing to identify weak points, then remediate them quickly.
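To show how an injection can make a system return data it is not supposed to, here is a small sketch using Python's built-in `sqlite3` module. The `riders` table, its columns, and the two lookup functions are made up for this example and do not describe any real application. The unsafe function pastes user input into the SQL text; the safe one passes it as a bound parameter.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, hypothetical riders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE riders (id INTEGER PRIMARY KEY, name TEXT, card_last4 TEXT)"
    )
    conn.executemany(
        "INSERT INTO riders (name, card_last4) VALUES (?, ?)",
        [("alice", "1111"), ("bob", "2222"), ("carol", "3333")],
    )
    return conn


def find_rider_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the input becomes part of the SQL statement itself.
    query = f"SELECT name, card_last4 FROM riders WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_rider_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the input is only ever treated as a value to compare.
    query = "SELECT name, card_last4 FROM riders WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = build_demo_db()
payload = "nobody' OR '1'='1"  # classic injection string

leaked = find_rider_unsafe(conn, payload)
blocked = find_rider_safe(conn, payload)

print(len(leaked))   # every row comes back from the unsafe query
print(len(blocked))  # the safe query finds no rider with that literal name
```

No device was compromised here: the attacker only needed a text field that reaches the query. This is the kind of weak point that threat modeling and application penetration testing are meant to surface.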
“Data lakes are examples of what modern storage/compute allows us to do,” Kurt says. “We haven’t put the same level of effort and value into collecting audit logs to be able to make detection and analytics earlier in the cyberattack chain possible.” The answer? Staffing and training. Proactive threat detection depends on people who know what to investigate. “How do I collect audit logs from the platform? What logs should I collect? How do I determine when someone has accessed the data versus what’s just noise? That investigative mindset and skillset is in high demand and low supply,” says Kurt.
His suggestion to overcome the talent gap: Companies that rely on data lakes should build detection skillsets from within. It’s easier to pay to train a person who is well-versed in the inner workings of an organization’s platform to build data security into it than it is to bring in a security generalist to work within an org’s data lake.
The advantage of training an internal employee is that they have the full view of the data product roadmap, which means they can start developing future updates on the platform that build security in from the ground up. That’s security by design, the brass ring of risk management in a data lake.
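To make the earlier audit-log questions a little more concrete, the sketch below shows one very simple way to separate possible exfiltration from noise: total up how much each identity read and flag anyone far above their usual volume. The `AuditEvent` fields, the `flag_bulk_readers` function, the baseline numbers, and the 10x multiplier are all assumptions for illustration; real platforms emit their own log formats, and real detection needs far richer signals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    principal: str   # user or service identity taken from the audit log
    action: str      # e.g. "read" or "write"
    bytes_read: int


def flag_bulk_readers(events, baseline_bytes, multiplier=10.0):
    """Return {principal: total_bytes} for anyone reading far above baseline.

    Principals with no baseline get a limit of zero, so any read by an
    unknown identity is flagged.
    """
    totals = defaultdict(int)
    for event in events:
        if event.action == "read":
            totals[event.principal] += event.bytes_read

    flagged = {}
    for principal, total in totals.items():
        limit = baseline_bytes.get(principal, 0) * multiplier
        if total > limit:
            flagged[principal] = total
    return flagged


# Hypothetical typical read volume per identity for the same time window.
baseline = {"analyst": 1_000_000, "etl-service": 50_000_000}

events = [
    AuditEvent("analyst", "read", 2_000_000),
    AuditEvent("analyst", "write", 5_000_000),       # writes are ignored here
    AuditEvent("etl-service", "read", 500_000_000),
    AuditEvent("etl-service", "read", 400_000_000),  # pushes total past 10x
    AuditEvent("unknown-key", "read", 1_000),        # no baseline on record
]

flagged = flag_bulk_readers(events, baseline)
print(flagged)
```

Someone who knows the platform well is in the best position to decide what a sensible baseline and threshold look like, which is part of the case for building this skillset internally.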
Where exploitable data exists, opportunists will try to access it. Data lakes provide organizations an incomparable ability to un-silo work, answer new questions by drawing information from diverse sources, and innovate technology that creates the next apex experience. For that reason, businesses must up-level their investment in data security in concert with their investment in data storage and usability. On the data roadmap, that’s the ultimate step toward data transformation.