Master's Thesis · Communication Engineering

Transforming EDA Data into Actionable Insights for Chip-Package-Board Co-Design

Building the data infrastructure that makes AI possible
Hamza Ghaffar · FH Kärnten, Villach · Infineon Technologies Austria AG · 2026
A 17-stage data pipeline that takes raw, fragmented output from 8 EDA Codesign tools and transforms it into clean, structured, AI-ready data products through a Medallion architecture with a Nectar AI serving layer.
17
Pipeline Stages
8
EDA Tool Domains
4
Medallion Layers
6
Pipeline Phases
Executive Summary
⚠️

The Problem

The 8 EDA Codesign tools generate vast amounts of data across simulation, testing, and design cycles. Storage locations unknown. Formats undocumented. No central catalog. Data is fragmented, uncataloged, and completely inaccessible to AI systems.

🔧

The Solution

A 17-stage data pipeline processing raw tool output through Bronze, Silver, Gold, and Nectar layers. Data Mesh architecture gives each tool domain ownership of its data while maintaining quality through federated governance.

🎯

The Outcome

Clean, schema-governed, AI-ready data products. Automated quality gates. Documented lineage. Versioned data. From hours of manual data prep to structured, queryable datasets that ML systems consume directly.

Chapter 01

The Data Fragmentation Problem

Before any AI or analytics work can begin, we need to understand what data exists and where it lives.
Infineon's chip-package-board co-design process uses 8 specialized software tools to design and test semiconductor products. Each tool produces output data during design work: simulation results, test measurements, configuration files. This data exists somewhere in Infineon's infrastructure (network drives, file servers, internal cloud storage), but nobody has a complete map. Nobody knows all the formats. Nobody knows if the data is complete or accurate. And AI systems cannot use any of it in its current state.

Unknown Data Locations

No knowledge of where data is stored across systems. No centralized catalog exists. Data scattered in network shares, databases, and files. Investigation required before any pipeline work can begin.

📋

Unidentified Data Types

Unknown number of data formats per tool. XML, JSON, CSV, binary, text logs all mixed together. No schema profiling done. Data types undocumented.

🏝️

Siloed Ownership

Each of the 8 tools stores data differently. No cross-tool data integration. No shared identifiers. Engineers cannot combine data from different tools for analysis.

🚫

No Quality Framework

No automated validation. No data quality metrics. No checks for completeness, accuracy, or consistency. Unknown percentage of usable data.

🔬

AI Systems Blocked

Machine learning models cannot consume fragmented, unstructured, unvalidated data. Engineers spend time searching for data instead of using it. Valuable insights remain hidden.

📊

No Governance

No data lineage tracking. No schema versioning. No access patterns documented. No data products with SLAs. No discoverability for downstream consumers.

Scientific Gap

Current EDA data management research focuses on individual tool optimization or narrow ML applications. No existing work addresses an end-to-end data pipeline architecture that spans the full CPB co-design tool ecosystem. This thesis fills that gap by designing a 17-stage pipeline with four data layers, validated through a concrete use case that can scale to additional tools over time.

The Goal

Build a data pipeline that transforms raw tool output into clean, structured, AI-ready data. This is not about building AI models. This is about preparing the data so AI systems can consume it effectively. Think of it as building the infrastructure that makes data usable.

Before vs. After: How Data Handling Changes

Before Pipeline
Unknown Locations

Engineers search manually through network drives, file servers, and databases to find tool output.

📋
Mixed Formats

XML, JSON, CSV, binary files all scattered. No schema documentation. No format catalog.

🔍
Manual Quality Checks

No automated validation. Engineers eyeball data for completeness. Bad records pass undetected.

🏝️
Siloed and Disconnected

Each tool stores data independently. No cross-tool identifiers. No way to combine datasets.

🚫
AI Inaccessible

ML models cannot consume fragmented, unstructured data. Hours wasted on manual preparation.

After Pipeline
📂
Cataloged and Documented

Every data source mapped, documented, and cataloged. Storage paths standardized and versioned.

📦
Unified Parquet Format

All data normalized into columnar Parquet with Delta Lake. Schema-registered and typed.

Automated Quality Gates

Stage 07 validates every record against schemas. Invalid records rejected with documented reasons.

🔗
Mesh-Connected Domains

8 domains with consistent primary keys, federated governance, and cross-domain queryability.

🔬
AI-Ready Products

Nectar layer delivers feature-engineered, scaled, and split datasets ready for ML consumption.

The Transformation in Three Stages

Before: Fragmented
📋

Data Discovery

Nobody knows where tool data lives or what format it uses

💻

Data Access

Manual search through network shares, databases, and file servers

🔍

Quality

No validation, no schema checks, no consistency across tools

📦

Integration

Each tool is its own silo with no shared identifiers

During: Pipeline Processing
🗺️

Bronze: Ingest

Connect to all 6 source types, extract raw data, preserve original structure

🧹

Silver: Validate

Schema checks, dedup, normalize units, detect anomalies, assign keys

⚙️

Gold: Transform

Domain-specific business logic, aggregations, derived fields

🍯

Nectar: Serve

Feature engineering, encoding, scaling, train/val/test splits

After: AI-Ready
📊

Data Products

Schema-governed, quality-verified, discoverable datasets with SLAs

ML-Consumable

Feature vectors in Parquet, JSONL for LLMs, data cards for reproducibility

📈

Cross-Domain Insights

Combined datasets from all 8 tools enable previously impossible analysis

🔄

Continuous

Pipeline runs independently per domain. New data flows automatically.

Before

Fragmented & Manual
Unknown data landscape

During

17-Stage Pipeline
Medallion processing

After

AI-Ready Products
Governed & discoverable


Chapter 02

The Architecture

Medallion layers for progressive data refinement. Data Mesh for distributed domain ownership. Data as a Product for governance.

Medallion Architecture: Four Layers of Refinement

Data enters raw and exits AI-ready. Each layer increases quality and reduces noise. Data flows in one direction: Bronze to Silver to Gold to Nectar.

🥉

Bronze

Raw ingestion and first cleanup. Data lands as-is from sources with basic deduplication and irrelevant field removal. No transformation or normalization.
🥈

Silver

Quality checkpoint. Consistency, schema validation, accuracy, normalization, anomaly detection, key assignment, and partitioning.
🥇

Gold

Business value. Domain-specific transformations, derived fields, and aggregations. Summarized counts, averages, distributions, and percentiles.
🍯

Nectar

AI-ready. Feature engineering, encoding, scaling, train/val/test split. Also produces LLM-consumable JSONL, automation-ready Parquet, and data cards.
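The LLM-consumable JSONL output named for the Nectar layer can be sketched with the standard library alone. The record fields and the `to_jsonl` helper below are illustrative assumptions, not the actual Nectar schema:

```python
import io
import json

def to_jsonl(records, fh):
    """Serialize one record per line (JSON Lines), the LLM-consumable
    format the Nectar layer produces. Fields here are placeholders."""
    for rec in records:
        fh.write(json.dumps(rec, sort_keys=True) + "\n")

# Illustrative records; real Nectar fields come out of Stage 16.
records = [
    {"id": "bdg-00142", "domain": "bdg", "split": "train"},
    {"id": "sigrity-00231", "domain": "sigrity", "split": "val"},
]

buf = io.StringIO()
to_jsonl(records, buf)
print(buf.getvalue(), end="")
```

One record per line keeps the file streamable, so downstream LLM tooling can consume arbitrarily large exports without loading them whole.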

Data Mesh: 4 Core Principles

Data Mesh distributes data ownership across the tool domains while maintaining consistency through federated governance. Each of the 8 EDA tools becomes one data domain.

1

Domain-Oriented Ownership

Each of the 8 EDA Codesign tools is a separate data domain. The team that produces the data also owns and manages it. No central data team collects everything.

2

Data as a Product

Every dataset is schema-registered, quality-verified, documented, versioned, and discoverable. Outputs have SLAs, ownership, and quality guarantees. Not raw data dumps.

3

Self-Serve Data Infrastructure

Shared tooling available to all 8 domains. Each team uses the pipeline independently, accessing their own partition without dependency on a central team.

4

Federated Computational Governance

Global standards enforced uniformly (naming, key formats, quality gates), but each domain team implements them for their specific tools. Centralized rules, distributed execution.

Data Partition Path Pattern

Storage Convention

All data follows the pattern: /{layer}/{domain}/{run_id}/*.parquet

/bronze/bdg/run_2026_04_08/data.parquet
/silver/sigrity/run_2026_04_08/stage_07/validated.parquet
/gold/ansysem/run_2026_04_08/transformed.parquet
/nectar/bdg/train/features.parquet
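The storage convention can be captured in a small path builder. The function name and the validation sets are illustrative, not part of the thesis pipeline, and the sketch omits the per-stage subfolders visible in the Silver example:

```python
from pathlib import PurePosixPath

def partition_path(layer: str, domain: str, run_id: str, filename: str) -> str:
    """Build a storage path following /{layer}/{domain}/{run_id}/*.parquet.
    Layer and domain names are checked against the fixed sets used here."""
    layers = {"bronze", "silver", "gold", "nectar"}
    domains = {"bdg", "dfe", "sigrity", "ansysem",
               "cdcore", "cdilplugins", "iop", "pkgImpl"}
    if layer not in layers:
        raise ValueError(f"unknown layer: {layer}")
    if domain not in domains:
        raise ValueError(f"unknown domain: {domain}")
    return str(PurePosixPath("/") / layer / domain / run_id / filename)

print(partition_path("bronze", "bdg", "run_2026_04_08", "data.parquet"))
# /bronze/bdg/run_2026_04_08/data.parquet
```

Centralizing path construction in one function means a typo in a layer or domain name fails loudly instead of silently creating a new partition.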

Technology Stack

🐍

Python 3.9+

Primary language for all pipeline stages. Pandas, PyArrow, Delta Lake, Great Expectations, SQLAlchemy, lxml, fsspec.

📦

Parquet + Delta Lake

Columnar analytics-optimized storage with ACID transactions, time travel, and schema enforcement. UTF-8 text, ISO 8601 timestamps.

🔗

Data Connectors

SQLAlchemy for MySQL. lxml/xmltodict for XML. fsspec for unified file access (local, network, cloud). DVC Python API for versioning.
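A minimal stand-in for the connector layer, using only the standard library. The real pipeline names SQLAlchemy for MySQL and lxml for XML, so this format dispatcher is illustrative only:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def read_records(fmt: str, payload: str):
    """Dispatch a raw payload to a parser by source format.
    A stdlib sketch of the real SQLAlchemy/lxml/fsspec connectors."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "xml":
        root = ET.fromstring(payload)
        # One dict per child element, mapping tag name to text content.
        return [{child.tag: child.text for child in rec} for rec in root]
    raise ValueError(f"unsupported format: {fmt}")

rows = read_records("xml", "<runs><run><id>bdg-00142</id></run></runs>")
print(rows)  # [{'id': 'bdg-00142'}]
```

Whatever the parser, Bronze keeps the extracted records as close to the source structure as possible; normalization waits for Silver.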

Convergence: 8 Tools into One Pipeline

bdg
dfe
sigrity
ansysem
cdcore
cdilplugins
iop
pkgImpl
17-Stage Medallion Pipeline
Bronze Ingest → Silver Validate → Gold Transform → Nectar Serve → AI-Ready Data Products

The Data Quality Flywheel

Each cycle through the pipeline improves downstream quality. Better input schemas lead to tighter validation, which produces cleaner data, which enables better feature engineering.

[Diagram: Data Pipeline Quality Flywheel, a continuous improvement cycle] Ingest (Bronze acquires raw data from 6 sources) → Validate (Silver checks quality and rejects bad records) → Serve (Nectar delivers AI-ready features) → back to Ingest.

Each layer feeds quality back to the layer before it. Rejected records inform better schemas. Better schemas catch more errors. The flywheel accelerates with every pipeline run.


Chapter 03

The 17-Stage Pipeline

Six phases, seventeen stages. Each stage has a specific job. No stage can be skipped because later stages depend on earlier ones.

Phase 1

Discovery & Planning
00Tool Data Mapping
01Data Sources
02Big Data 5Vs
03Version Control

Phase 2

Bronze Layer
04Data Acquisition
05Data Cleaning

Phase 3

Validation
06Consistency
07Quality Gate
08Accuracy
09Normalization
10Anomaly Detection

Phase 4

Silver Staging
11Data Staging
12Primary Key / Index
13Data Partitioning

Phase 5

Gold Layer
14ETL Transformation
15Data Aggregation

Phase 6

Nectar AI Serving
16Feature Engineering & AI Serving

Complete Stage Registry

Stage Name Layer What It Does
00 Tool Data Mapping Planning Map each tool's inputs, outputs, storage locations, and file formats
01 Data Sources Planning Catalog all data sources and identify access methods (MySQL, XML, network shares, HCIP, DVC)
02 Big Data 5Vs Planning Measure Volume, Velocity, Variety, Veracity, Value per domain
03 Version Control Planning Set up DVC, HCIP storage, credential management, and versioning infrastructure
04 Data Acquisition Bronze Connect to sources, extract raw data as-is, preserve original format
05 Data Cleaning Bronze Remove duplicates, drop empty fields, handle nulls, strip irrelevant metadata
06 Data Consistency Silver Verify same identifiers mean same entities across all 8 domains
07 Data Quality Silver Schema validation quality gate. Reject invalid records. Never pass unstructured data downstream.
08 Data Accuracy Silver Verify values within valid ranges, timestamps current, precision appropriate
09 Data Normalization Silver Standardize units (mm, seconds), naming conventions, encoding across all domains
10 Anomaly Detection Silver Flag statistical outliers, volume spikes, pattern breaks for investigation
11 Data Staging Silver Checkpoint validated data in queryable, versioned, protected staging area
12 Primary Key / Index Silver Assign unique identifiers per record using {domain}-{id} pattern (e.g. bdg-00142)
13 Data Partitioning Silver Divide data into domain partitions, enable partition pruning by tool
14 ETL Gold Extract from Silver, apply domain-specific business logic, load to Gold
15 Data Aggregation Gold Summarize with counts, averages, distributions, and percentiles
16 Nectar AI Serving Nectar Feature engineering, categorical encoding, numeric scaling, train/val/test split (70/15/15)
Quality Gate Rule

Schema validation is mandatory at Stage 07. Records that fail validation are rejected and stored separately with rejection reasons. They are never propagated downstream. The rule is simple: never give unstructured data to AI.
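The gate's accept/reject contract can be sketched in plain Python. The pipeline's actual validation uses Great Expectations against registered schemas, so the `{field: type}` schema and the function below are simplified assumptions:

```python
def quality_gate(records, schema):
    """Stage 07 sketch: validate each record against a {field: type} schema.
    Valid records continue downstream; failures are stored separately with
    a documented rejection reason, never silently dropped or passed on."""
    valid, rejected = [], []
    for rec in records:
        missing = [f for f in schema if f not in rec]
        wrong = [f for f, t in schema.items()
                 if f in rec and not isinstance(rec[f], t)]
        if missing or wrong:
            rejected.append({"record": rec,
                             "reason": f"missing={missing}, wrong_type={wrong}"})
        else:
            valid.append(rec)
    return valid, rejected

# Illustrative schema and records, not real tool output.
schema = {"id": str, "net_length_mm": float}
records = [{"id": "bdg-00142", "net_length_mm": 12.5},
           {"id": "bdg-00143"}]  # missing measurement -> rejected
valid, rejected = quality_gate(records, schema)
print(len(valid), len(rejected))  # 1 1
```

Keeping the rejects with reasons, rather than discarding them, is what feeds the quality flywheel: rejection patterns drive the next schema revision.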

Pipeline Implementation Progress

17 Stages
Planning: 4 · Bronze: 2 · Silver: 8 · Gold: 2 · Nectar: 1

Chapter 04

The 8 Data Domains

Each EDA Codesign tool is a separate partition in the Data Mesh. Domains own their data independently.
bdg-

Board Design

Board-level design and layout. First domain fully automated through AutoBDG / ProDiGI.

dfe-

Design for Excellence

Design optimization and manufacturability checks across the design flow.

sigrity-

Signal Integrity

Signal integrity analysis and simulation for high-speed interconnects.

ansysem-

Electromagnetic Analysis

Electromagnetic field analysis for package and board-level structures.

cdcore-

Core Design

Core codesign functionality shared across the CPB co-design flow.

cdilplugins-

Plugin Interface

Plugin integration layer connecting tools within the codesign ecosystem.

iop-

I/O Planning

Input/Output planning and optimization for chip-package interfaces.

pkgImpl-

Package Implementation

Package design and implementation for semiconductor packaging.

Infrastructure Context

These 8 tools live inside codesigndeploy, a subflow within Infineon's CAMINO production design flow. They exist as symbolic links pointing to actual tool implementations. The data outputs these tools generate have historically had unknown storage locations and formats, which is why Stage 00 (Tool Data Mapping) must complete before any pipeline work begins.

Data Source Types

The pipeline connects to six source types: MySQL / Relational Databases, XML Files, Network Shared Folders, Internal File Server, HCIP (Infineon Internal Cloud), and DVC (Data Version Control). The exact source-to-tool mapping is confirmed during the Phase 1 Discovery stage.

Primary Key Convention

Domain                      Key Prefix      Example Record ID
Board Design                bdg-            bdg-00142
Design for Excellence       dfe-            dfe-00089
Signal Integrity            sigrity-        sigrity-00231
Electromagnetic Analysis    ansysem-        ansysem-00056
Core Design                 cdcore-         cdcore-00178
Plugin Interface            cdilplugins-    cdilplugins-00034
I/O Planning                iop-            iop-00092
Package Implementation      pkgImpl-        pkgImpl-00115
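The convention reduces to a one-line formatter. The five-digit zero padding is inferred from the examples in the table above and is an assumption, as is the function name:

```python
def make_record_id(domain: str, seq: int, width: int = 5) -> str:
    """Format a mesh-wide unique record ID following the
    {domain}-{id} convention, e.g. bdg-00142."""
    if seq < 0 or len(str(seq)) > width:
        raise ValueError(f"sequence {seq} does not fit width {width}")
    return f"{domain}-{seq:0{width}d}"

print(make_record_id("bdg", 142))      # bdg-00142
print(make_record_id("sigrity", 231))  # sigrity-00231
```

Because the domain prefix is embedded in every key, records stay globally unique even when datasets from all 8 domains are joined.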

Chapter 05

Research Methodology

Four phases of work. Technical discovery combined with qualitative research and iterative stakeholder validation.
🔍

Phase 1: Discovery

Investigate the data landscape. Map storage locations, catalog formats, profile schemas, assess data volumes, interview R&D engineers.

🔧

Phase 2: Pipeline Build

Implement the 17-stage pipeline. Bronze ingestion, Silver validation, quality gates, normalization, staging, partitioning.

🏗️

Phase 3: Platform

Data serving layer for AI consumption. APIs, batch access, feature store. Data catalog, lineage, schema registry, access control.

📏

Phase 4: Validation

Measure pipeline performance. Data quality scores, ingestion throughput, processing latency, schema stability, coverage percentage.

Investigation Methods

🖥️

Technical Discovery

Directory structure analysis (recursive file scanning), schema profiling (automated tooling), data lineage tracing (source to storage), volume assessment (storage audits).

🗣️

Qualitative Research

Semi-structured interviews with R&D engineers, surveys of current practices, direct observation of design workflows (shadowing), stakeholder workshops.

📚

Literature & Industry Analysis

Industry reports on EDA data management, existing design datasets, literature review of current tools, competitive analysis of similar solutions.

🤝

Stakeholder Engagement

Weekly meetings with Manuel Lexer (Team Lead), technical reviews with Salem Mohamed (Technical Contact), presentations to engineering teams, continuous feedback loops.

Validation Metrics

Category Metric What It Measures
Pipeline Performance Ingestion Rate Records per second processed through the pipeline
Pipeline Performance Pipeline Latency Time from data creation to availability in Silver/Gold/Nectar
Data Quality Quality Score Percentage of data passing all validation gates (Stages 06-08)
Data Quality Schema Stability Backward compatibility maintained across schema versions
AI Readiness Coverage Percentage of EDA tools integrated into the pipeline
AI Readiness Data Freshness Age of data available to downstream AI systems
Business Impact Time to Access Before/after comparison: hours of manual data prep to minutes
Business Impact Developer Productivity Time saved on data preparation by AI/ML teams

⚠ Current Phase: Concept Study

This thesis is in the Concept Study phase (February to June 2026). The data landscape is being discovered for the first time. The pipeline architecture is designed, the Medallion layers are defined, and the Data Mesh framework is in place. The next step is implementing the concept with a concrete use case, starting with one domain (ansysem) and expanding the pattern to all 8 tools. Stages 00 through 03 (Discovery and Planning) must complete before any pipeline code runs.


Chapter 06

Thesis Project Timeline

From discovery to validation. Six phases, four months of concept study, one pipeline architecture delivering dual value.

Project Milestones

Phase 1
Discovery & Planning

🔍 Stages 00-03: Data Landscape Mapping

  • Stage 00: Catalog each tool's inputs, outputs, storage locations, file formats
  • Stage 01: Identify all data sources and access methods (MySQL, XML, HCIP, DVC)
  • Stage 02: Measure Volume, Velocity, Variety, Veracity, Value per domain
  • Stage 03: Set up DVC, credential management, version control infrastructure
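Stage 02's Volume and Variety measurements can be sketched from a simple file listing; `profile_listing` and its inputs are hypothetical. Velocity, Veracity, and Value need run history and validation results, so they are out of scope for this sketch:

```python
from collections import Counter

def profile_listing(listing):
    """Stage 02 sketch: derive Volume (total bytes) and Variety (format
    mix) from a (path, size_in_bytes) listing produced by a storage scan."""
    total_bytes = sum(size for _, size in listing)
    formats = Counter(path.rsplit(".", 1)[-1].lower() for path, _ in listing)
    return {"volume_bytes": total_bytes, "variety": dict(formats)}

# Illustrative listing; a real scan would walk the discovered shares.
listing = [("results/run1.xml", 2048),
           ("results/run1.csv", 512),
           ("logs/run1.log", 128)]
print(profile_listing(listing))
```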
Phase 2
Bronze Layer

📦 Stages 04-05: Raw Ingestion & First Cleanup

  • Stage 04: Connect to 6 source types, extract raw data preserving original format
  • Stage 05: Remove duplicates, drop empty fields, strip irrelevant metadata
  • Data lands as-is from sources with basic deduplication
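Stage 05's deduplication step can be sketched as content hashing. Hashing each record's canonical JSON form is one reasonable choice, not necessarily the thesis's; field-level cleanup (nulls, empty fields) is a separate step in the real stage:

```python
import hashlib
import json

def dedup_by_content(records):
    """Bronze cleanup sketch: drop exact duplicates by hashing each
    record's canonical (sorted-key) JSON serialization."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [{"id": 1, "v": 3.3}, {"id": 1, "v": 3.3}, {"id": 2, "v": 5.0}]
print(len(dedup_by_content(records)))  # 2
```

Sorting keys before hashing makes the digest stable regardless of the field order the source tool emitted.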
Phase 3
Silver Layer

🛡️ Stages 06-13: Quality Checkpoint

  • Stage 06: Cross-domain consistency verification
  • Stage 07: Schema validation quality gate (reject invalid records)
  • Stage 08: Value range and accuracy verification
  • Stage 09: Normalize units, naming conventions, encoding
  • Stage 10: Statistical outlier and anomaly detection
  • Stages 11-13: Staging, primary key assignment, partitioning
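Stage 09's unit normalization might look like the following sketch. The conversion tables and canonical units (mm, seconds, per the stage registry) are illustrative, since the actual unit inventory only emerges from discovery:

```python
# Conversion factors into the canonical units (mm, seconds).
# The unit symbols handled here are illustrative assumptions.
TO_MM = {"um": 1e-3, "mm": 1.0, "cm": 10.0, "m": 1000.0}
TO_S = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

def normalize(value: float, unit: str) -> tuple[float, str]:
    """Stage 09 sketch: convert a measured value to canonical units
    so values from different tools are directly comparable."""
    if unit in TO_MM:
        return value * TO_MM[unit], "mm"
    if unit in TO_S:
        return value * TO_S[unit], "s"
    raise ValueError(f"unknown unit: {unit}")

print(normalize(2500, "um"))  # trace length in mm
print(normalize(15, "ms"))    # simulation time in seconds
```

Rejecting unknown units, rather than passing values through, keeps a silently mislabeled column from contaminating Gold aggregates.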
● CONCEPT STUDY
NOW
Feb - Jun 2026

Step 2: Concept Study Phase

  • Architecture design complete: Medallion + Data Mesh + Nectar
  • 17-stage pipeline framework defined with schema contracts
  • Data discovery in progress across all 8 EDA tool domains
  • First use case (ansysem) selected for concrete validation
  • Platform evaluation against each pipeline stage ongoing
  • Research methodology: mixed-methods with 4 phases
Phase 4
Gold Layer

⚙️ Stages 14-15: Business Value

  • Stage 14: ETL with domain-specific business logic transformations
  • Stage 15: Aggregations: counts, averages, distributions, percentiles
  • Derived fields and cross-domain derived metrics
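The Gold-layer aggregations map directly onto the standard library's `statistics` module; the summarized fields below are illustrative, not the thesis's actual Gold schema:

```python
import statistics

def aggregate(values):
    """Stage 15 sketch: summarize a numeric column with the counts,
    averages, and percentiles the Gold layer reports."""
    q = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "p50": q[49],   # median
        "p95": q[94],   # tail percentile
    }

summary = aggregate([12.0, 14.0, 15.0, 15.0, 16.0, 18.0, 21.0, 25.0])
print(summary["count"], summary["mean"])  # 8 17.0
```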
Phase 5
Nectar Layer

🍯 Stage 16: AI Serving

  • Feature engineering and categorical encoding
  • Numeric scaling and train/val/test split (70/15/15)
  • LLM-consumable JSONL, automation-ready Parquet, data cards
  • AI-ready data products with full lineage documentation
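The 70/15/15 split can be sketched as a seeded shuffle followed by positional slicing. A production split would likely also group records by design or run to avoid leakage across splits, which this sketch deliberately ignores:

```python
import random

def split_70_15_15(records, seed=42):
    """Stage 16 sketch: deterministic shuffle, then a 70/15/15
    train/validation/test split by position."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = int(n * 0.70)
    n_val = int(n * 0.15)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_70_15_15(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed makes the split reproducible, which is what lets the accompanying data cards state exactly which records ended up in each partition.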
Final
Validation

Measured Results & Thesis Delivery

  • Data quality scores: percentage passing validation gates
  • Pipeline performance: ingestion rate, latency, throughput
  • Coverage: percentage of EDA tools integrated
  • Before/after comparison: manual hours vs. automated minutes
  • 5 university deliverables + 5 company deliverables

Chapter 07

Deliverables & Scope

5 academic deliverables for the university. 5 practical deliverables for Infineon. Both come from the same work.
🎓 University Deliverables
1
End-to-End EDA Data Pipeline Architecture

Novel 17-stage pipeline with four data layers, filling the gap in EDA data management research.

2
Data-as-a-Product Framework for EDA

First application of Data Mesh and Data-as-a-Product concepts to the semiconductor EDA context.

3
Internal Platform Evaluation

Fact-based assessment of existing platforms mapped against each pipeline stage.

4
Mixed-Methods Research Methodology

Technical discovery combined with qualitative research and iterative validation, documented for reuse.

5
Measured Validation

Quantitative metrics: data quality scores, throughput, latency, schema stability. Before/after comparisons.

🏢 Infineon Deliverables
1
Data Discovery & Cataloging

Documented inventory of where EDA tools store data, what formats exist, and how outputs are accessed.

2
Standardized Pipeline with Quality Gates

Working pipeline: ingestion, cleaning, validation, quality checks. Proven on one use case, ready to scale.

3
Governed Data Products

Schema-governed, partitioned data products with quality standards. Template for additional tools.

4
Platform Integration

Pipeline plugs into existing infrastructure. Custom code only fills genuine gaps.

5
Documentation & Team Handoff

Architecture docs, deployment procedures, operational guides. Team maintains it independently after thesis ends.

Scope Boundary

✅ In Scope
✓ Data engineering, pipeline architecture, implementation
✓ Data discovery, profiling, and cataloging
✓ Data quality framework with automated validation
✓ Data Mesh and Data-as-a-Product architecture
✓ Nectar layer design for AI-ready data serving
✓ Data governance: lineage, catalogs, access patterns
✗ Out of Scope
✗ AI/ML model development
✗ Dashboard or analytics platform creation
✗ Modifying existing internal platforms
✗ Full-scale production rollout across all teams
✗ Training or deploying ML models on served data
✗ Building visualizations or BI tools
The Bottom Line

The university gets a novel architecture, an applied framework, a platform evaluation, and measured results from a real use case. The company gets a working pipeline, discovered data, quality gates, and a standardized design that scales beyond the thesis. Both outcomes come from the same work, and both parties can point to concrete results when the project wraps up.

Thesis Information

Student: Hamza Ghaffar · Program: MSc Communication Engineering, FH Kärnten, Villach · Company: Infineon Technologies Austria AG · Role: Software Automation Intern (Full-Time)
Supervisor 1: Manuel Lexer (Team Lead) · Supervisor 2: Salem Mohamed (Technical Contact) · Phase: Step 2 - Concept Study (Feb-Jun 2026)