Mastering Mixed Data Ingestion: A Hybrid Approach for Engineering Quality Software in Medallion DWH
Navigating Mixed Data Sources for Your Medallion DWH
Building a robust Medallion Architecture data warehouse often involves integrating data from diverse sources, ranging from structured relational databases to less predictable user-managed files. A recent GitHub Community discussion highlights a common challenge faced by beginners: how to effectively ingest data from both MySQL databases and complex, multi-tab Excel files into a PostgreSQL-based Medallion DWH, especially when those Excel files reside locally.
The user, akhil-as-tnei, sought architectural guidance after exploring Airbyte for ingestion and running into difficulties with local Excel files, specifically pointing the connector at local file paths and selecting individual tabs programmatically. This dilemma underscores a critical decision point in data engineering: choosing the right tool for the right job to ensure engineering quality software and avoid 'spaghetti' pipelines.
The Hybrid Ingestion Solution: A Standard for Engineering Quality Software
The consensus among experts points towards a hybrid ingestion strategy as the most robust and maintainable approach. While Airbyte is an excellent choice for database-to-database replication, it strains against the nuances of local, multi-tab Excel files. Here’s the breakdown:
- For MySQL Databases: Stick with Airbyte. Airbyte excels at handling schema changes, incremental loads, and efficient syncing from structured relational databases like MySQL into your PostgreSQL Bronze layer. It's built for this kind of consistent, structured data movement (a sketch of what this side looks like in code follows this list).
- For Complex Excel Files: Embrace Scripting with Python/Polars (or Pandas). Excel files are inherently 'messy' and often require 'pre-bronze' logic—filtering specific tabs, handling merged cells, or performing light transformations—that low-code tools struggle with. Python, leveraging libraries like Polars or Pandas, provides the flexibility and control needed for this.
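For the MySQL side the work is mostly configuration rather than code, but PyAirbyte (the `airbyte` Python package, which wraps the same connector catalog) makes its shape easy to show. The following is a minimal, hypothetical sketch, not part of the original discussion: the hosts, credentials, and stream names are placeholders, and the exact `source-mysql` config keys depend on your connector version.

from airbyte import get_source
from airbyte.caches import PostgresCache

# Hypothetical source config -- keys follow the source-mysql spec,
# which can vary by connector version
source = get_source(
    "source-mysql",
    config={
        "host": "mysql.internal",        # placeholder
        "port": 3306,
        "database": "appdb",             # placeholder
        "username": "replication_user",  # placeholder
        "password": "********",
        "replication_method": {"method": "STANDARD"},
    },
    install_if_missing=True,
)
source.check()                                  # verify connectivity
source.select_streams(["customers", "orders"])  # placeholder tables

# Land the raw tables in the Postgres Bronze schema
bronze = PostgresCache(
    host="localhost",
    port=5432,
    username="user",
    password="password",
    database="your_db",
    schema_name="bronze",
)
source.read(cache=bronze)

In production you would more likely run the Airbyte platform itself and schedule the connection there; the point is that the database side stays declarative.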
Why Python/Polars for Excel?
Directly scripting Excel ingestion offers several advantages, particularly when files are co-located with your DWH on the same VM:
- Precision Control: Programmatically select specific tabs, apply custom filtering, and handle data anomalies before ingestion (see the cleaning sketch after this list).
- Performance: Reading files directly from disk on the same VM is extremely fast, bypassing potential network or containerization overhead. Polars, in particular, is lauded for its speed and memory efficiency, making it ideal for the Bronze layer.
- Metadata Enrichment: Easily add crucial metadata (e.g., source sheet name, file path) during ingestion, which is a standard practice in Medallion Architectures for traceability.
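As a concrete illustration of that pre-bronze control, here is a minimal, hypothetical cleaning helper (the `clean_sheet` name and the specific rules are illustrative, not from the discussion) that could run on each sheet before the metadata step:

import polars as pl

def clean_sheet(df: pl.DataFrame) -> pl.DataFrame:
    """Illustrative pre-bronze cleanup for one Excel tab."""
    # Drop the fully empty rows that Excel exports often pad in
    df = df.filter(~pl.all_horizontal(pl.all().is_null()))
    # Normalize headers, e.g. "Order Date " -> "order_date"
    df = df.rename({c: c.strip().lower().replace(" ", "_") for c in df.columns})
    # Trim stray whitespace in string columns
    df = df.with_columns(pl.col(pl.String).str.strip_chars())
    return df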
Here’s an example Python script using Polars to ingest specific tabs from a local Excel file into PostgreSQL:
import polars as pl
from sqlalchemy import create_engine

FILE_PATH = "your_data_file.xlsx"
DB_URL = "postgresql://user:password@localhost:5432/your_db"
TARGET_SCHEMA = "bronze"

# Define which tabs you actually want
VALID_TABS = ["Sales_2023", "Marketing_Data", "Inventory"]

def ingest_excel_to_bronze():
    engine = create_engine(DB_URL)

    # sheet_id=0 tells Polars to load every sheet, returning a dict
    # of DataFrames keyed by sheet name
    sheets = pl.read_excel(FILE_PATH, sheet_id=0)

    for sheet_name, df in sheets.items():
        if sheet_name not in VALID_TABS:
            continue
        print(f"🚀 Processing sheet: {sheet_name}")

        # Add metadata columns (standard Medallion practice for traceability)
        df = df.with_columns([
            pl.lit(sheet_name).alias("_source_sheet"),
            pl.lit(FILE_PATH).alias("_source_file"),
        ])

        # Write to the PostgreSQL Bronze layer; table names are
        # normalized to lowercase
        df.write_database(
            table_name=f"{TARGET_SCHEMA}.raw_{sheet_name.lower()}",
            connection=engine,
            if_table_exists="replace",
        )

if __name__ == "__main__":
    ingest_excel_to_bronze()
    print("✅ Ingestion complete!")
While one reply noted that Airbyte's local file connector and data mapping *could* handle some Excel scenarios, the community's stronger recommendation for complex, multi-tab Excel files leans towards dedicated scripting. Scripting offers greater control and resilience against the inherent volatility of spreadsheet data.
Architectural Takeaway
The key takeaway for any developer working on data pipelines, especially those aiming for high engineering quality software, is to design ingestion strategies around the fundamental nature of each source. Relational databases benefit from managed connectors, while volatile, semi-structured files like Excel demand explicit parsing and validation steps. This mixed ingestion strategy isn't a compromise; it's a pragmatic reflection of source characteristics, leading to clearer boundaries, reduced operational overhead, and a more scalable, maintainable data warehouse.