A holistic approach to securing data in a cloud-based data lake

Data-driven decision-making is a key driver for enterprises in their digital transformation journey. Businesses are now switching to scalable, unified data storage repositories like enterprise data lakes, built on cloud storage options such as Amazon Simple Storage Service (S3), Google Cloud Storage, Azure Data Lake Storage (ADLS), and Azure Blob Storage. But while the cloud offers unmatched speed, flexibility, and cost savings, security remains a major concern. This blog delves into the key pillars of cloud security and outlines how a holistic approach can help enterprises protect the confidentiality, integrity, and availability of their data.

Data access

Role-based access control, authentication, and authorization are vital security components of a healthy data lake. We recommend developing fine-grained controls and defining appropriate roles for key tasks – such as moving data to cloud storage, deleting data, and accessing metadata.

While building a data lake for a Fortune Global 500 insurance brokerage and risk management company on AWS, we created different storage buckets for the raw data, processed data, and consumption layers. We leveraged the cloud’s Identity and Access Management (IAM) services to restrict access to each of these layers. No individual users had direct access to the raw data bucket – only the service account and ETL tools could copy data to this layer. We also created roles like Power Admin, Data Analyst, and Data Admin, and gave each of them different permissions to read and write data. Further, to restrict access to the underlying tables via Hive and Presto, we configured Ranger policies. Ranger offers easy management capabilities and enables granular, role-based access control at both the table and column levels.
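As an illustrative sketch of this pattern, the S3 bucket policy below denies all access to a raw-data bucket except for a designated ETL service role. The bucket name, account ID, and role name are placeholders, not the actual values used in the engagement.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllExceptEtlServiceRole",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-raw-data-bucket",
        "arn:aws:s3:::example-raw-data-bucket/*"
      ],
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/etl-service-role"
        }
      }
    }
  ]
}
```

Because the statement is an explicit deny with a condition on the caller’s principal ARN, it overrides any allow granted elsewhere, so individual users cannot reach the raw layer even if their IAM policies are overly broad.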

Data transfer

It is critical to secure data as it moves through the network, across devices, and between services. Often, this can easily be configured for each storage service through built-in features. We recommend using Transport Layer Security (TLS), the successor to Secure Sockets Layer (SSL), with associated certificates. This allows you to securely upload and download data to the cloud through encrypted endpoints, accessible both via the internet and from within a Virtual Private Cloud (VPC).
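One common way to enforce this on S3 is a bucket policy that rejects any request made over plain HTTP, using the `aws:SecureTransport` condition key. A minimal sketch (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-data-lake-bucket",
        "arn:aws:s3:::example-data-lake-bucket/*"
      ],
      "Condition": {
        "Bool": { "aws:SecureTransport": "false" }
      }
    }
  ]
}
```

With this policy attached, clients that do not use TLS-encrypted endpoints receive an access-denied error, so encryption in transit becomes mandatory rather than optional.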

While implementing an AWS data lake for an IoT solutions provider, we created TLS/SSL-enabled VPC endpoints to transfer all data to cloud storage. This ensured that data never traversed the public internet, thereby bolstering security. In AWS, we used TLS/SSL for communication between the on-premises and cloud networks, between data ingested into the raw layer and intermediate data processing, and between BI tools and the consumption layer, ensuring end-to-end data security.

Data storage

As a best practice, encryption-at-rest should always be enabled in the cloud. This includes encryption for storage services as well as the persistent disk volumes used by compute instances. To implement encryption-at-rest effectively, we recommend letting your cloud provider manage the encryption keys, which eliminates the risk of accidental key deletion or loss.

For the insurance brokerage and risk management customer mentioned earlier, we used cloud-managed keys to encrypt data in S3 and EBS, which enabled easy periodic key rotation. To further strengthen the security of data residing in the raw layer, we used custom PGP encryption keys for third-party auditors (TPAs). Each TPA was provided a specific encryption key, which allowed them to send the necessary files in encrypted form. These files were then decrypted for processing in the data lake using the corresponding PGP keys, ensuring fully secure transfers.
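Cloud-managed encryption of this kind can be made the default for a bucket. As a sketch, the S3 default-encryption configuration below applies SSE-KMS to every new object; the key ARN is a placeholder:

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
      },
      "BucketKeyEnabled": true
    }
  ]
}
```

Setting encryption at the bucket level means writers do not need to remember to pass encryption headers on each upload, and enabling the bucket key reduces KMS request costs for high-volume ingestion.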

Data availability

The cloud is designed to provide high resilience and availability, which means objects are redundantly stored on multiple devices across different facilities. However, this availability applies within a single region – data is not automatically replicated across regions.
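Cross-region redundancy therefore has to be configured explicitly. On S3, one way to do this is a replication configuration that copies objects to a bucket in another region (both buckets must have versioning enabled). A sketch with placeholder names:

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "CrossRegionDR",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::example-dr-bucket-us-west-2",
        "StorageClass": "STANDARD_IA"
      }
    }
  ]
}
```

Replicating into a cheaper storage class, as shown here, is a common way to keep disaster-recovery copies affordable.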

To create a robust disaster recovery environment for a leading management e-publication, we enabled data replication to a region different from the source storage. This ensured data availability even in the event of a regional failure. We also leveraged automated lifecycle management policies for cloud storage, which enabled automated movement of data from one storage tier to another. To meet the customer’s compliance requirements, we specified a retention period of 7 years, after which the raw archive data was automatically moved to cold storage. This helped reduce overall storage costs and enabled users to seamlessly retrieve data as and when necessary.
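Tiering rules like this are expressed as a lifecycle configuration on the bucket. As an illustrative sketch, the rule below transitions objects under a hypothetical `raw/` prefix to a cold-storage class after roughly 7 years (2,555 days):

```json
{
  "Rules": [
    {
      "ID": "ArchiveRawDataAfterRetention",
      "Filter": { "Prefix": "raw/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 2555, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Once attached, the storage service applies the transition automatically, so no scheduled jobs or manual housekeeping are needed to enforce the retention policy.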

You can also ensure high availability and strengthen protection against data loss through versioning, which lets you preserve, retrieve, and restore different versions of a stored object, enabling smooth recovery from human error and application failures.
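Because versioning retains every overwrite and delete, it can silently inflate storage costs; pairing it with a lifecycle rule that expires old versions keeps this in check. A sketch (the 90-day window is an illustrative choice, not a recommendation from the engagements above):

```json
{
  "Rules": [
    {
      "ID": "ExpireOldObjectVersions",
      "Filter": {},
      "Status": "Enabled",
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}
```

This preserves a rolling recovery window for accidental deletes and overwrites while automatically pruning versions that are no longer useful.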

In conclusion

Securing data in the cloud is a critical business need. Enterprises cannot afford to overlook the myriad security risks that arise when housing their data in a cloud-based data lake. With extensive experience in provisioning cloud-based data lakes for large-scale enterprises, Impetus Technologies can help secure your data so that you can focus on your business goals with complete peace of mind.

Author
Mustufa Batterywala
Senior DevOps Architect