<aside>
Introduction
With millions of data points generated daily within various enterprises, a robust solution enabling businesses to make data-driven decisions at an affordable price is essential.
Architecture Overview
Data is extracted from GitHub using Synapse pipelines, creating a copy of the compressed file in the first layer of raw data. A webhook activity then decompresses this data and stores it in the second layer of the raw layer.
The Silver layer hosts all dimensionalized data, partitioned and normalized for efficient processing. We create a snowflake schema that introduces hierarchies to customer and product data. This normalization step is crucial as it reduces file size by approximately 40%.
The Gold Layer contains all business-ready queries. Data in this layer is aggregated to minimize the processing power required by Analysis Services during query execution.
This solution updates and creates new partitions only when changes occur between the raw layer and the silver layer. Raw layer data is deleted once the pipeline completes execution.
This data is made available by the SQLBI team and you can find the data here.
This data is available in a compressed form of 7z that you can choose to decompress manually but you can automate the decompression using webhooks and PowerShell.
</aside>
<aside>
Project Requirements
This project will focus on reducing data volume as it moves from the raw layer to the silver layer, making it accessible for downstream users. Data analysts will access this silver layer, where the data is properly organized into dimensions.
A serverless solution is used to deliver this solution on Synapse analytics
</aside>
<aside>
Project Sections
<aside>
Prerequisites
The following configurations are assumed to be in place:
<aside>
To execute these activities effectively, we will develop several key objects within our Synapse workspace. This includes creating a database to serve as a reference point for accessing external tables and managing data structures efficiently, defining external file formats to ensure proper parsing and interpretation of externally stored data, and establishing external data sources to seamlessly connect with Azure Data Lake Storage Gen2 (ADLS Gen2). Additionally, we will develop stored procedures to automate and streamline complex data processing tasks and create views to provide simplified and consistent access to data tailored to specific business requirements. This structured approach ensures optimal performance, scalability, and efficiency within our Synapse Analytics environment.
Database and Collation
The database was named Contoso, utilizing UTF8 encoding for compatibility and efficient handling of diverse character sets (click here to know more). Two schemas were defined within the database: Bronze, designed to manage all transformations of the raw data, and Contoso, designated as the reporting schema to support data consumption and reporting needs. This structure ensures a clear separation of raw data processing and reporting workflows, enabling better organization and streamlined operations.
External File format and data source
The external file format was configured as Parquet, partitioned by year, month, and day. This setup minimizes the number of deleted and processed partitions in the raw layer, improving efficiency in data management. Additionally, an external location was defined to point to each container, enabling Synapse to seamlessly access the data stored in Azure Data Lake Storage Gen2. This approach enhances performance and ensures better organization of the raw data layer.
</aside>
<aside>
The process began with transferring data from the GitHub repository to the raw layer in our Data Lake. To achieve this, I set up the required data source and sink, creating an integrated dataset to connect to the GitHub repository. The data was copied as binary files into the raw layer to preserve its integrity.
Given that Azure Synapse Pipelines and Data Factory do not support the compression format of 7z files, I implemented a solution using webhook activities and PowerShell scripts to handle decompression. I will provide additional details on this implementation in a separate write-up.
A linked service was configured to enable seamless connections between the source and sink datasets, supporting various data sources, including Azure Data Lake Storage Gen2 (ADLS Gen2), S3 buckets, and HTTPS endpoints.
Once the data was decompressed through the webhook process, it was prepared for dimensionalization. The silver layer was enhanced by reducing data size, introducing hierarchies, and normalizing the model as illustrated in the provided diagram. This approach ensures a streamlined data flow from raw ingestion to transformation, supporting efficient data management and advanced analytics.
The raw layer initially followed a predefined logical schema; however, this was transformed into the silver layer schema. This transition significantly optimized the data structure by reducing redundancy in the Sales and Exchange tables, leading to a 45% reduction in overall data size, thereby improving storage efficiency and processing performance.
</aside>