Looking for a better way to store, prepare and analyse your high-volume big data for future discovery and insights?
Cloud-based data lakes are specifically designed to run large-scale analytics workloads cost-effectively, and do away with the management and costs of infrastructure. In this article, we break down what a data lake is, how it differs from similar tools like the data warehouse, and how services like Azure Data Lake can benefit your business.
What is a data lake?
A data lake in its most basic form is a central data storage repository or system that holds large amounts of data in its original format until it is needed for future operational or exploratory analysis - think of it as a place to bring your disparate sources of data together into one place. The types of raw data that are stored in a data lake can include:
- Audio, images and video
- Communications (blogs, emails, tweets)
- Operational data (inventory, sales, tickets)
- Machine-generated data (log files, IoT sensor readings)
The data lake as a concept is closely linked to Hadoop-based object storage frameworks: both ingest massive amounts of structured and unstructured data, manage data processing and storage for big data applications, and support advanced analytics using modelling techniques like machine learning and predictive analytics - so you can produce actionable intelligence that helps you make better business decisions.
Why use a data lake?
Data lake architecture differs from other solutions like data warehouses in that data is not governed or structured on its way into the data lake - it's done on its way out. This makes it extremely efficient for storing and processing massive amounts of data, and especially suitable for storing big data as a persistent staging layer.
So, if you don't have to prepare or refine data before storing it there, why use a data lake beyond storage?
- Upfront costs associated with data ingestion and transformation are significantly reduced, because you don't have to refine data before storing it in an enterprise data lake.
- Data lake solutions allow for data analysis and discovery at your own pace - you can determine whether or not your big data stored there is useful for insights now or in the future, and don't have to worry about analysing and modelling before storing it.
- You can use a data lake as a central store for the eventual processing (cleansing, aggregation, integration, transformation) of your big data.
- A data lake is useful for offloading historical and legacy data from other stores like your data warehouse, where the cost of processing and storage is higher.
Ultimately, with a data lake, data is available to everyone. You don't need to understand how the data is related when it's ingested; data lakes expect the end users who consume it to define those links down the track. With cloud-based data lakes offering all of the above benefits in addition to the scale and analysis capabilities of the wider platform, businesses can further lower their costs and bypass the barriers that previously prevented them from setting one up.
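The "structure on the way out" idea above - often called schema-on-read - can be sketched in a few lines of Python. Raw records land in the lake exactly as they arrive, and a schema is applied only when a consumer reads them. The file contents and field names here are hypothetical, purely for illustration:

```python
import json

# Raw events are ingested exactly as they arrive - no schema enforced on write.
raw_lake = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2018-06-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": 19.0}',   # a missing field is fine
    '{"user": "amy", "action": "login"}',       # a different shape entirely
]

def read_temperatures(lake):
    """Schema-on-read: the consumer decides which fields matter, at query time."""
    for line in lake:
        record = json.loads(line)
        if "temp_c" in record:                  # structure applied on the way out
            yield record["device"], record["temp_c"]

print(list(read_temperatures(raw_lake)))
```

Note that the third record never had to fit the temperature schema to be stored; a different consumer could read the same lake with a different schema entirely.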
What is Azure Data Lake?
Azure Data Lake (ADL) is an on-demand data repository for big data analytics workloads. As a public cloud service, it provides organisations with data storage and analytics solutions that scale instantly, like other Azure tools.
ADL is popular with data analysts and developers for storing data of various sizes (CSV, flat and log files) and shapes (structured, semi-structured or raw data), ingested at any speed from a wide range of supported sources. There is no limit on file sizes or on the amount of data you can store in a provisioned Azure Data Lake. Most importantly, it allows users to gain deeper, actionable insights from complex, high-volume data sets without having to clean or define the data first.
Essentially, it's an easy way for your business to manage big data cost-effectively. The service encompasses two powerful resources - Azure Data Lake Store and Azure Data Lake Analytics - which merge affordable storage capacity and powerful analytics into one useful tool.
Azure Data Lake Store
Azure Data Lake Store (ADLS) is the storage part of Azure Data Lake. It's a fully Hadoop Distributed File System (HDFS)-compatible store used to hold a large range of structured, unstructured and raw data types, including media files, relational data and streaming data.
ADLS is one of the best-optimised data lake tools available for storing big data for future analytics workloads and offers several benefits for businesses. For starters, it's scalable like other Azure cloud services and offers unlimited storage of big data; it doesn't impose limits on volume (individual files can range from kilobytes to petabytes), and it allows you to store data for as long as you need. Azure Active Directory (AAD) is also natively integrated with ADLS, so you can secure your data and folder hierarchies using the accounts and groups already set up in AAD.
One of Azure Data Lake Store's other big draws is its low cost of storage. Stored data doesn't need to be moved or transformed before you perform analysis, and the total cost of ownership (TCO) is further lowered by the hierarchical namespace over stored data, which improves the performance of any future analytics jobs you run - and because those jobs require less compute power, there's less to bill.
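To see why a hierarchical namespace trims compute, consider a minimal sketch (the paths below are a hypothetical layout, not a prescribed one): when data is organised into real directories, a job scoped to one partition only touches that directory's files instead of listing and filtering every object in a flat store.

```python
# Lake paths organised hierarchically by type and date (hypothetical layout).
paths = [
    "logs/2018/06/01/app.log",
    "logs/2018/06/02/app.log",
    "logs/2018/07/01/app.log",
    "images/2018/06/01/cat.png",
]

def files_for(prefix, paths):
    """With a hierarchical namespace, a query for one month can be pruned to
    a single directory subtree rather than scanning the entire lake."""
    return [p for p in paths if p.startswith(prefix)]

print(files_for("logs/2018/06/", paths))  # only June's log files are touched
```

In a real lake the pruning happens inside the storage and query engines, but the effect is the same: less data listed and read per job, and therefore less compute billed.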
Azure Data Lake Analytics
Azure Data Lake Analytics (ADLA) is the compute part of the service, used to move, process and transform the big data located in Azure Data Lake Store. Using linked analytics engines like Azure HDInsight, Hadoop and Spark, you can apply batch and interactive queries, move refined data to Azure Data Warehouse (ADW) for reporting, and run real-time analytics and machine learning on your data to produce better, actionable insights.
ADLA, like the rest of Azure Data Lake, is cloud-based, which means the traditional process of deploying and managing infrastructure disappears. Because it's highly scalable, you can set how much processing power you need, when you need it, making it extremely cost-effective. Ultimately, ADLA allows businesses to instead spend more time transforming data, writing queries and gaining insights.
How do cloud-based data lakes benefit my business?
Cloud-based data lake services like Azure Data Lake are powerful yet simple tools that make collecting, storing, managing and analysing high volumes of big data easier and more efficient, and they provide several benefits the average business can leverage.
1. An easy way to store big data for future analysis: Data lakes used to be time-consuming projects, but Azure Data Lake and its huge list of fully managed supporting analytics and storage services (Azure Data Factory, Azure Machine Learning, etc) remove that former barrier. It gives your business a place to store high volumes of big data, sourced from both on-premises and cloud-based sources, without having to define or transform it first. Without a time limit on how long you can store data in Azure Data Lake, you can come back to it for exploration and analysis and produce your desired insights at your own pace.
2. Consolidate your big data into one place: ADLS brings together all of your big data from disparate sources across cloud and on-premises environments into one central place. You can monitor and manage all stored data more easily, without having to go back and forth between more than one silo. If you’re looking to reduce the number of places you store your data or tools you use for data analytics, it’s an ideal solution for data consolidation.
3. Cost-effective data lake solution: Running your big data workloads on Azure Data Lake Analytics is charged on a per-job basis whenever your data is processed, or you can use an on-demand cluster. Without any hardware or traditional licensing or support agreements, you essentially pay only for what you need and what you use.
4. It's secure and compliant: Azure Data Lake is backed by enterprise-grade security and makes it safer to manage data overall - staff don't have to manually store or migrate your big data, so risk is reduced. Because it's in the cloud, it also makes compliance, governance and logging much easier. Finally, it is integrated with Azure Active Directory (AAD), which means you can provide seamless authentication across all your stored data.
5. Remote access: Cloud-based options like Azure Data Lake naturally make big data more easily available remotely, enabling collaborative analysis and improving overall information accessibility.
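The per-job billing model in point 3 above means cost scales with what a job actually consumes. A toy calculation makes the idea concrete; the rate and unit figures below are invented for illustration only and are not real Azure prices:

```python
def job_cost(compute_units, hours, rate_per_unit_hour):
    """Per-job billing sketch: pay for the compute units a job consumes,
    for as long as it runs - and nothing while the lake sits idle."""
    return compute_units * hours * rate_per_unit_hour

# Hypothetical: a job using 10 compute units for half an hour
# at a made-up rate of $2 per unit-hour.
print(job_cost(10, 0.5, 2.0))
```

Contrast this with an always-on cluster, where the same bill accrues whether or not any job is running.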
What’s new in Azure Data Lake Storage Gen 2?
Microsoft officially opened the preview program for the second-generation version of its Azure Data Lake Storage service in June 2018, and with it comes an exciting list of new capabilities.
- A Hadoop-compatible file system endpoint, integrated into Azure Blob Storage
- A full hierarchical namespace for files and folders structuring
- New Web browser-based user interface with drag and drop visualisation
- Native integration with Azure Active Directory (AAD) and Power BI
- Over 70 supported data source connectors
Gen2 consolidates the core capabilities of the first version of ADL, like Azure Active Directory integration and a Hadoop-compatible file system, and integrates them into Azure Blob Storage. Given that it’s constantly being improved with new features, ADL is one of the best cloud-based data lake options available at this time.
Why you should use a cloud-based data lake
Because of their scalability and cost-effective nature, cloud-based data lakes like Azure Data Lake are increasingly being used to handle big data. Big data is generally high in volume and takes a long time to process and analyse for meaningful insights, so a scalable solution that stores massive amounts of raw, unstructured information without requiring transformation first - while natively integrating with powerful data analysis tools - is becoming an essential toolset for the average business that wants to produce better, more actionable insights.
Understanding where Azure Data Lake fits into your overall analytics process is important, as it can be easy to treat it as a data dumping ground. Evaluate how it can provide value as a persistent staging layer for your big data, eventually delivering transformed data for consumption by the right business intelligence solutions like Power BI.