Azure Data Factory: How to centralise and transform your big data

Big data is more available to us than ever before, but moving and transforming our raw data into something meaningful is still proving to be a complex task for many organisations.

According to Forrester, 74% of businesses want to be “data-driven,” but only 29% are actually successful at centralising their data, refining it for analytics and producing real insights.

The reality is that unprepared data, scattered across multiple data stores, cannot on its own provide the kind of actionable insights we need to make better operational decisions.

The missing link for many companies is a powerful solution that leverages the scale of cloud computing, brings together all of our diverse data stores and prepares that data for use in other data services and analytics tools - and this is the job Azure Data Factory fulfils.

What is Azure Data Factory?

Azure Data Factory is a tool that automates, monitors and orchestrates the movement and transformation of your data across cloud and on-premises environments. As a managed Microsoft cloud service, it allows you to build, manage and schedule business-critical data pipelines at scale and across environments to create better data-driven workflows.

Data Factory is often overlooked amidst the powerful data analysis capabilities of related Azure services such as Azure Data Lake, but it plays a key role for any business looking to properly leverage the power of cloud computing: data integration.

When would I use Azure Data Factory?

Most businesses store their big data - raw, semi-structured and structured - across multiple systems, whether relational, non-relational, on-premises or in the cloud. These islands of data cannot, on their own, help your analysts or decision-makers achieve greater insights - not without a way to integrate them all into one place and refine them.

Imagine a company that has undergone digital transformation with Azure and wants a better way to gauge customer satisfaction and usage so it can grow its business. It has all of this key customer data in the cloud, in addition to high volumes of raw data on its customers' demographics scattered across its on-premises stores. It wants to:

  • Extract all data from cloud and on-premises sources and consolidate it in one central place
  • Load the data into where it needs to be for future analysis and processing
  • Transform the combined data to refine it further and produce trusted insights
  • Publish the transformed data into a cloud data warehouse like Azure SQL Data Warehouse to build a report
  • Automate, monitor and manage all of this data movement on specific schedules

Integrating all of this data and performing these complex tasks on such a high volume of big data would take significant time with traditional tools - and because the company wants to compare its data on a recurring cadence (daily, weekly, monthly and so on), it isn't a one-time job either.

With Azure Data Factory, you can build these kinds of complex hybrid extract-transform-load (ETL) and extract-load-transform (ELT) pipelines end-to-end in Azure.
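
As a taste of what "end-to-end in Azure" means in practice, here is a minimal sketch using the azure-mgmt-datafactory Python SDK (assuming a recent SDK version with azure-identity). The subscription ID, resource group, factory name and pipeline name are all placeholders, and the pipeline itself is assumed to already exist - authoring one is sketched in the next section.

```python
# Minimal sketch: create a Data Factory, kick off a pipeline run and poll its
# status. All names and IDs below are hypothetical placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

RG, DF = "my-rg", "my-data-factory"  # placeholder resource group / factory

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Provision the factory itself (idempotent if it already exists).
client.factories.create_or_update(RG, DF, Factory(location="westeurope"))

# Trigger an on-demand run of an already-deployed pipeline (see next section).
run = client.pipelines.create_run(RG, DF, "DailyCustomerCopy", parameters={})

# Poll until the run finishes: Queued -> InProgress -> Succeeded/Failed.
while True:
    status = client.pipeline_runs.get(RG, DF, run.run_id).status
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
print(f"Pipeline run finished with status: {status}")
```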


How Azure Data Factory works

Using the infrastructure and services of the Azure cloud data platform, Data Factory lets you create data-driven workflows (called data pipelines) that ingest data from across your environments, use activities to prepare that data for further use, and deliver trusted data that can be turned into meaningful, valuable information.

Data Factory breaks the entire pipeline process down into four key steps - all four come together in the code sketch that follows the list:

1. Connect and collect: To use your data, you must first collect it. Data Factory brings together all the required sources of your structured, unstructured or semi-structured data - Software as a Service (SaaS) applications, file shares, FTP servers, SQL databases and web services - and moves it as needed to a centralised location such as Azure Data Lake Store or Azure Blob Storage for further processing. When your data is sourced from on-premises stores, the Data Management Gateway (known in Data Factory v2 as the self-hosted integration runtime) acts as a secure channel for moving your raw data into the cloud.

2. Transform and enrich: Once your data is in the cloud, you can transform it using linked compute services like Data Lake Analytics, Azure HDInsight Hadoop and Spark, and Azure Machine Learning, producing refined data on a controlled schedule that supplies your production environments with reliable data.

3. Publish: With your data transformed and business-ready, you can keep it in the cloud for consumption by business intelligence and visualisation tools like Power BI, or move it to cloud-based or on-premises stores like Azure SQL Data Warehouse, Azure SQL Database or SQL Server - wherever the data is best consumed by the end user.

4. Schedule and monitor: Azure Data Factory v2 introduced several new monitoring and visualisation capabilities for the pipeline process. After you deploy your automated data integration pipeline, you can monitor its health to ensure jobs are running and data is flowing properly, using built-in, enterprise-grade tools like Azure Monitor, Log Analytics and PowerShell, and you can schedule pipeline runs hourly, daily, weekly or monthly.
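
Here is how those four steps might look when authored in code - a hedged sketch against the azure-mgmt-datafactory Python SDK, in which the storage connection string, container paths and every resource name are placeholders (exact model names and signatures vary a little between SDK versions):

```python
# Sketch of the four steps in code: connect/collect (linked service + datasets),
# a copy activity standing in for transform/publish, and a daily schedule
# trigger. All names, paths and the connection string are placeholders.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureBlobStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineReference, PipelineResource,
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
    TriggerResource,
)

RG, DF = "my-rg", "my-data-factory"
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# 1. Connect and collect: register the blob store and the source/sink datasets.
client.linked_services.create_or_update(RG, DF, "BlobStore", LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="<storage-connection-string>")))

ls_ref = LinkedServiceReference(type="LinkedServiceReference",
                                reference_name="BlobStore")
client.datasets.create_or_update(RG, DF, "RawCustomers", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref,
                                folder_path="raw/customers")))
client.datasets.create_or_update(RG, DF, "CuratedCustomers", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref,
                                folder_path="curated/customers")))

# 2 & 3. Transform and publish: a plain copy here; a real pipeline might run an
# HDInsight Spark or Data Lake Analytics activity before writing to the sink.
copy = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="RawCustomers")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="CuratedCustomers")],
    source=BlobSource(),
    sink=BlobSink())
client.pipelines.create_or_update(RG, DF, "DailyCustomerCopy",
                                  PipelineResource(activities=[copy]))

# 4. Schedule and monitor: run the pipeline once a day from the start time.
trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime(2024, 1, 1, tzinfo=timezone.utc)),
    pipelines=[TriggerPipelineReference(pipeline_reference=PipelineReference(
        type="PipelineReference", reference_name="DailyCustomerCopy"))])
client.triggers.create_or_update(RG, DF, "DailyTrigger",
                                 TriggerResource(properties=trigger))
client.triggers.begin_start(RG, DF, "DailyTrigger").result()  # .start() on older SDKs
```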

In summary, rather than being just another service for centralising and storing data, Data Factory's main purpose is to give businesses a better way to orchestrate the movement of raw data between stores and to build an effective information production system that refines that data for further, reliable use - so you can discover better insights.

How can Azure Data Factory help my business?

Azure Data Factory provides a number of benefits for businesses moving and transforming big data in the cloud, and with Version 2 receiving continuous updates, its rich feature set and capabilities are only getting better.

Lift and shift your SSIS ETL packages to the cloud

SQL Server Integration Services (SSIS) has been the main ETL tool used with SQL Server for over a decade. Without an equivalent service in Azure, many businesses have been unable to migrate their existing solutions to the cloud, and the complex workarounds - setting up an IaaS SQL Server box to run SSIS, or manually redeveloping the ETL code in Azure Data Factory - were out of scope for many organisations.

Azure Data Factory Version 2 introduced the ability to lift and shift SSIS workloads to the cloud and run them as a managed service: Data Factory provisions virtual machines (VMs) in Azure - the Azure-SSIS integration runtime - and your existing packages run on them. This allows businesses with extensive on-premises data warehouses to:

  • Move into the Azure PaaS environment and take advantage of scalable infrastructure that manages your resources for you
  • Expand the data transform capabilities for your SSIS workloads, now in the cloud
  • Increase productivity and lower your total cost of ownership (TCO)

In short, this makes Data Factory an essential tool for larger enterprise customers with massive SSIS estates who don't want to start from square one with their stored data, and for companies with end-of-life hardware or hybrid environments that want to consolidate big data efficiently.
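
If you would rather script that provisioning than click through the portal, the sketch below shows roughly what it looks like with the same Python SDK; the node size, node count, region and runtime name are illustrative placeholders, not recommendations.

```python
# Hedged sketch: provision and start an Azure-SSIS integration runtime, the
# managed VMs that execute lifted-and-shifted SSIS packages.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeComputeProperties, IntegrationRuntimeResource,
    IntegrationRuntimeSsisProperties, ManagedIntegrationRuntime,
)

RG, DF = "my-rg", "my-data-factory"
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ssis_ir = IntegrationRuntimeResource(properties=ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="WestEurope",           # region the VMs run in
        node_size="Standard_D4_v3",      # VM size per node (placeholder)
        number_of_nodes=2,               # scale out across nodes
        max_parallel_executions_per_node=4),
    ssis_properties=IntegrationRuntimeSsisProperties(edition="Standard")))

client.integration_runtimes.create_or_update(RG, DF, "SsisRuntime", ssis_ir)

# Starting the runtime spins up the VMs; this can take a while to complete.
client.integration_runtimes.begin_start(RG, DF, "SsisRuntime").result()
```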


Visualise your data pipeline workflows

[Image: Azure Data Factory big data insights]

Azure Data Factory Version 2 introduced powerful new visual tools that make authoring and monitoring easier and more interactive for end users. You can now create, configure, test, deploy and monitor all of your automated data integration pipelines without writing any code, increasing efficiency and productivity and opening the service up to less technical users.

For example, you can now visualise your Data Factory metrics to see patterns over days, months and years in a simplified graphical interface, and monitor real-time progress of your data consolidation activities.

Data Factory's visual UI and drag-and-drop design tools are similar to those of other services like Azure Machine Learning, and they make the Azure cloud platform even more compelling for big data enterprises seeking intuitive analytics tools.

Extensive connectors and language support

With over 70 supported data source connectors, Data Factory enables your business to move data seamlessly between your various data stores - whether it's Azure Blob Storage, Azure SQL Database, or third-party stores like MySQL, Oracle or Amazon S3.

Data Factory also allows your IT team to use their existing skills and write their own code in multiple languages - including ARM templates, Python and .NET - to build your data pipelines. Depending on your needs, you can compose the many supported processing services into managed data pipelines, or insert custom code as a processing step in any pipeline.
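
As an illustration of that last point, here is a hedged sketch of inserting a custom step with the Python SDK's CustomActivity, which runs an arbitrary command on an Azure Batch pool registered as a linked service; the "BatchCompute" linked service, script name and pipeline name are all hypothetical.

```python
# Hedged sketch: insert custom code as a processing step in a pipeline.
# CustomActivity executes a command on an Azure Batch pool that has been
# registered as a linked service ("BatchCompute" here is a placeholder).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CustomActivity, LinkedServiceReference, PipelineResource,
)

RG, DF = "my-rg", "my-data-factory"
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

score_step = CustomActivity(
    name="ScoreCustomers",
    command="python score_customers.py",  # hypothetical script on the pool
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BatchCompute"))

client.pipelines.create_or_update(RG, DF, "ScoringPipeline",
                                  PipelineResource(activities=[score_step]))
```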


Consolidate, centralise and modernise your data integration

The reality is that traditional data services and custom-built migration components are expensive, take a long time to build and maintain, and require real expertise to successfully integrate the kind of high-volume, high-velocity, high-variety big data we deal with today - and that's before taking into account the need to analyse and transform it.

As a single cloud service, Data Factory provides valuable data integration across all of your data sources - Azure, on-premises, or even other public clouds like Amazon Web Services (AWS) - allowing your business to consolidate, manage and modernise its data integration experience in one common place.

Azure Data Factory: Next steps

With the scale of Azure and its supporting data analysis services behind it, Azure Data Factory enables your business to achieve serverless, automated data workflows with zero infrastructure management and elastic capacity that grows with your business.