8 EDA Codesign tools generate vast amounts of data across simulation, testing, and design cycles. Storage locations unknown. Formats undocumented. No central catalog. Data is fragmented, uncataloged, and effectively inaccessible to AI systems.
A 17-stage data pipeline processing raw tool output through Bronze, Silver, Gold, and Nectar layers. Data Mesh architecture gives each tool domain ownership of its data while maintaining quality through federated governance.
Clean, schema-governed, AI-ready data products. Automated quality gates. Documented lineage. Versioned data. From hours of manual data prep to structured, queryable datasets that ML systems consume directly.
No knowledge of where data is stored across systems. No centralized catalog exists. Data scattered in network shares, databases, and files. Investigation required before any pipeline work can begin.
Unknown number of data formats per tool. XML, JSON, CSV, binary, text logs all mixed together. No schema profiling done. Data types undocumented.
Each of the 8 tools stores data differently. No cross-tool data integration. No shared identifiers. Engineers cannot combine data from different tools for analysis.
No automated validation. No data quality metrics. No checks for completeness, accuracy, or consistency. Unknown percentage of usable data.
Machine learning models cannot consume fragmented, unstructured, unvalidated data. Engineers spend time searching for data instead of using it. Valuable insights remain hidden.
No data lineage tracking. No schema versioning. No access patterns documented. No data products with SLAs. No discoverability for downstream consumers.
Current EDA data management research focuses on individual tool optimization or narrow ML applications. No existing work addresses an end-to-end data pipeline architecture that spans the full CPB co-design tool ecosystem. This thesis fills that gap by designing a 17-stage pipeline with four data layers, validated through a concrete use case that can scale to additional tools over time.
Build a data pipeline that transforms raw tool output into clean, structured, AI-ready data. This is not about building AI models. This is about preparing the data so AI systems can consume it effectively. Think of it as building the infrastructure that makes data usable.
Engineers search manually through network drives, file servers, and databases to find tool output.
XML, JSON, CSV, binary files all scattered. No schema documentation. No format catalog.
No automated validation. Engineers eyeball data for completeness. Bad records pass undetected.
Each tool stores data independently. No cross-tool identifiers. No way to combine datasets.
ML models cannot consume fragmented, unstructured data. Hours wasted on manual preparation.
Every data source mapped, documented, and cataloged. Storage paths standardized and versioned.
All data normalized into columnar Parquet with Delta Lake. Schema-registered and typed.
Stage 07 validates every record against schemas. Invalid records rejected with documented reasons.
8 domains with consistent primary keys, federated governance, and cross-domain queryability.
Nectar layer delivers feature-engineered, scaled, and split datasets ready for ML consumption.
Data Discovery: Nobody knows where tool data lives or what format it uses
Data Access: Manual search through network shares, databases, and file servers
Quality: No validation, no schema checks, no consistency across tools
Integration: Each tool is its own silo with no shared identifiers
Bronze (Ingest): Connect to all 6 source types, extract raw data, preserve original structure
Silver (Validate): Schema checks, dedup, normalize units, detect anomalies, assign keys
Gold (Transform): Domain-specific business logic, aggregations, derived fields
Nectar (Serve): Feature engineering, encoding, scaling, train/val/test splits
Data Products: Schema-governed, quality-verified, discoverable datasets with SLAs
ML-Consumable: Feature vectors in Parquet, JSONL for LLMs, data cards for reproducibility
Cross-Domain Insights: Combined datasets from all 8 tools enable previously impossible analysis
Continuous: Pipeline runs independently per domain. New data flows automatically.
Fragmented & Manual
Unknown data landscape
17-Stage Pipeline
Medallion processing
AI-Ready Products
Governed & discoverable
Data enters raw and exits AI-ready. Each layer increases quality and reduces noise. Data flows one direction: Bronze to Silver to Gold to Nectar.
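The one-directional flow can be sketched as a chain of layer functions. This is a toy illustration only: the function names, the record shape, and the trivial transformations are invented for the sketch and are not the pipeline's actual API.

```python
# Toy sketch of the one-directional Medallion flow.
# All names and record shapes here are hypothetical placeholders.

def bronze(raw_records):
    """Ingest: keep records as-is, tagging their layer."""
    return [{"raw": r, "layer": "bronze"} for r in raw_records]

def silver(bronze_records):
    """Validate: keep only records whose payload is non-empty."""
    return [dict(r, layer="silver") for r in bronze_records if r["raw"]]

def gold(silver_records):
    """Transform: apply domain logic (here, a trivial uppercase)."""
    return [dict(r, value=r["raw"].upper(), layer="gold") for r in silver_records]

def nectar(gold_records):
    """Serve: expose only the ML-ready fields."""
    return [{"feature": r["value"]} for r in gold_records]

# Data flows one direction only: Bronze -> Silver -> Gold -> Nectar.
result = nectar(gold(silver(bronze(["a", "", "b"]))))
```

Each layer consumes only the layer before it, so quality decisions never have to be revisited downstream.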
Data Mesh distributes data ownership across the tool domains while maintaining consistency through federated governance. Each of the 8 EDA tools becomes one data domain.
Each of the 8 EDA Codesign tools is a separate data domain. The team that produces the data also owns and manages it. No central data team collects everything.
Every dataset is schema-registered, quality-verified, documented, versioned, and discoverable. Outputs have SLAs, ownership, and quality guarantees. Not raw data dumps.
Shared tooling available to all 8 domains. Each team uses the pipeline independently, accessing their own partition without dependency on a central team.
Global standards enforced uniformly (naming, key formats, quality gates), but each domain team implements them for their specific tools. Centralized rules, distributed execution.
All data follows the pattern /{layer}/{domain}/{run_id}/*.parquet, with optional stage or split subdirectories (as in the Silver and Nectar examples below):
/bronze/bdg/run_2026_04_08/data.parquet
/silver/sigrity/run_2026_04_08/stage_07/validated.parquet
/gold/ansysem/run_2026_04_08/transformed.parquet
/nectar/bdg/train/features.parquet
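The base pattern can be built and sanity-checked with a few lines. The helper below is an illustrative sketch, not part of the pipeline, and covers only the base pattern (no optional stage subdirectories):

```python
from pathlib import PurePosixPath

LAYERS = ("bronze", "silver", "gold", "nectar")

def dataset_path(layer: str, domain: str, run_id: str, filename: str) -> str:
    """Build a path following /{layer}/{domain}/{run_id}/*.parquet."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    if not filename.endswith(".parquet"):
        raise ValueError("pipeline data files are Parquet")
    return str(PurePosixPath("/") / layer / domain / run_id / filename)

print(dataset_path("bronze", "bdg", "run_2026_04_08", "data.parquet"))
# -> /bronze/bdg/run_2026_04_08/data.parquet
```

Centralizing path construction in one helper keeps every domain's layout identical, which is what makes partition pruning by tool possible later.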
Primary language for all pipeline stages. Pandas, PyArrow, Delta Lake, Great Expectations, SQLAlchemy, lxml, fsspec.
Columnar analytics-optimized storage with ACID transactions, time travel, and schema enforcement. UTF-8 text, ISO 8601 timestamps.
SQLAlchemy for MySQL. lxml/xmltodict for XML. fsspec for unified file access (local, network, cloud). DVC Python API for versioning.
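To show the shape of XML ingestion, here is a minimal sketch that flattens a tool report into one flat record. It uses the standard-library parser as a stand-in for lxml/xmltodict, and the sample report (tool name, metric names, values) is entirely invented:

```python
# Flatten a hypothetical XML tool report into a flat record.
# Stdlib parser used as a stand-in for lxml/xmltodict.
import xml.etree.ElementTree as ET

SAMPLE = """
<report tool="sigrity">
  <run id="run_2026_04_08"/>
  <metric name="impedance" value="50.1"/>
  <metric name="crosstalk" value="0.03"/>
</report>
"""

def flatten_report(xml_text: str) -> dict:
    """Turn one <report> element into a flat dict of typed fields."""
    root = ET.fromstring(xml_text)
    record = {"tool": root.get("tool"), "run_id": root.find("run").get("id")}
    for m in root.findall("metric"):
        record[m.get("name")] = float(m.get("value"))
    return record

print(flatten_report(SAMPLE))
```

The same flatten-to-record shape applies regardless of source: SQLAlchemy rows, XML elements, and files opened via fsspec all land as typed records before Bronze storage.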
Each cycle through the pipeline improves downstream quality. Better input schemas lead to tighter validation, which produces cleaner data, which enables better feature engineering.
Each layer feeds quality back to the layer before it. Rejected records inform better schemas. Better schemas catch more errors. The flywheel accelerates with every pipeline run.
| Stage | Name | Layer | What It Does |
|---|---|---|---|
| 00 | Tool Data Mapping | Planning | Map each tool's inputs, outputs, storage locations, and file formats |
| 01 | Data Sources | Planning | Catalog all data sources and identify access methods (MySQL, XML, network shares, HCIP, DVC) |
| 02 | Big Data 5Vs | Planning | Measure Volume, Velocity, Variety, Veracity, Value per domain |
| 03 | Version Control | Planning | Set up DVC, HCIP storage, credential management, and versioning infrastructure |
| 04 | Data Acquisition | Bronze | Connect to sources, extract raw data as-is, preserve original format |
| 05 | Data Cleaning | Bronze | Remove duplicates, drop empty fields, handle nulls, strip irrelevant metadata |
| 06 | Data Consistency | Silver | Verify same identifiers mean same entities across all 8 domains |
| 07 | Data Quality | Silver | Schema validation quality gate. Reject invalid records. Never pass unstructured data downstream. |
| 08 | Data Accuracy | Silver | Verify values within valid ranges, timestamps current, precision appropriate |
| 09 | Data Normalization | Silver | Standardize units (mm, seconds), naming conventions, encoding across all domains |
| 10 | Anomaly Detection | Silver | Flag statistical outliers, volume spikes, pattern breaks for investigation |
| 11 | Data Staging | Silver | Checkpoint validated data in queryable, versioned, protected staging area |
| 12 | Primary Key / Index | Silver | Assign unique identifiers per record using {domain}-{id} pattern (e.g. bdg-001) |
| 13 | Data Partitioning | Silver | Divide data into domain partitions, enable partition pruning by tool |
| 14 | ETL | Gold | Extract from Silver, apply domain-specific business logic, load to Gold |
| 15 | Data Aggregation | Gold | Summarize with counts, averages, distributions, and percentiles |
| 16 | Nectar AI Serving | Nectar | Feature engineering, categorical encoding, numeric scaling, train/val/test split (70/15/15) |
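The 70/15/15 split in Stage 16 can be sketched with a deterministic shuffle. The helper below is illustrative (the real serving stage also does feature engineering, encoding, and scaling); the seed and record shape are assumptions:

```python
import random

def train_val_test_split(records, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Shuffle deterministically, then cut at the 70% and 85% marks."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # -> 70 15 15
```

Fixing the seed makes the split reproducible across pipeline runs, which matters for the data cards mentioned under ML-consumable outputs.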
Schema validation is mandatory at Stage 07. Records that fail validation are rejected and stored separately with rejection reasons. They are never propagated downstream. The rule is simple: never give unstructured data to AI.
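A minimal sketch of the Stage 07 gate, assuming a simple dict-based schema (the actual pipeline validates with Great Expectations, per the tooling list): valid records pass, invalid ones are captured with a rejection reason and never flow downstream.

```python
# Hypothetical schema: field name -> required Python type.
SCHEMA = {"record_id": str, "tool": str, "value": float}

def quality_gate(records):
    """Split records into (accepted, rejected-with-reason)."""
    accepted, rejected = [], []
    for r in records:
        missing = [f for f in SCHEMA if f not in r]
        if missing:
            rejected.append({"record": r, "reason": f"missing fields: {missing}"})
            continue
        bad = [f for f, t in SCHEMA.items() if not isinstance(r[f], t)]
        if bad:
            rejected.append({"record": r, "reason": f"wrong types: {bad}"})
            continue
        accepted.append(r)
    return accepted, rejected

ok, bad = quality_gate([
    {"record_id": "bdg-001", "tool": "bdg", "value": 1.5},
    {"record_id": "bdg-002", "tool": "bdg"},  # missing "value" -> rejected
])
```

Storing the rejects with their reasons, rather than discarding them, is what feeds the schema-improvement flywheel described earlier.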
Board-level design and layout. First domain fully automated through AutoBDG / ProDiGI.
Design optimization and manufacturability checks across the design flow.
Signal integrity analysis and simulation for high-speed interconnects.
Electromagnetic field analysis for package and board-level structures.
Core codesign functionality shared across the CPB co-design flow.
Plugin integration layer connecting tools within the codesign ecosystem.
Input/Output planning and optimization for chip-package interfaces.
Package design and implementation for semiconductor packaging.
These 8 tools live inside codesigndeploy, a subflow within Infineon's CAMINO production design flow. They exist as symbolic links pointing to the actual tool implementations. The storage locations and formats of the data these tools generate have historically been unknown, which is why Stage 00 (Tool Data Mapping) must complete before any pipeline work begins.
The pipeline connects to six source types: MySQL / Relational Databases, XML Files, Network Shared Folders, Internal File Server, HCIP (Infineon Internal Cloud), and DVC (Data Version Control). The exact source-to-tool mapping is confirmed during the Phase 1 Discovery stage.
| Domain | Key Prefix | Example Record ID |
|---|---|---|
| Board Design | bdg- | bdg-00142 |
| Design for Excellence | dfe- | dfe-00089 |
| Signal Integrity | sigrity- | sigrity-00231 |
| Electromagnetic Analysis | ansysem- | ansysem-00056 |
| Core Design | cdcore- | cdcore-00178 |
| Plugin Interface | cdilplugins- | cdilplugins-00034 |
| I/O Planning | iop- | iop-00092 |
| Package Implementation | pkgImpl- | pkgImpl-00115 |
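The {domain}-{id} keys from the table above can be minted with a per-domain counter. This helper is illustrative: the zero-padding width is inferred from the example IDs, and the real Stage 12 key assignment may differ.

```python
from itertools import count

# Key prefixes from the domain table; one counter per domain mints sequential IDs.
PREFIXES = ("bdg", "dfe", "sigrity", "ansysem",
            "cdcore", "cdilplugins", "iop", "pkgImpl")
_counters = {p: count(1) for p in PREFIXES}

def mint_key(domain: str, width: int = 5) -> str:
    """Return the next {domain}-{id} primary key, e.g. bdg-00001."""
    if domain not in _counters:
        raise KeyError(f"unknown domain: {domain}")
    return f"{domain}-{next(_counters[domain]):0{width}d}"

print(mint_key("bdg"), mint_key("bdg"), mint_key("sigrity"))
# -> bdg-00001 bdg-00002 sigrity-00001
```

Because the prefix encodes the domain, a record ID alone is enough to route a record back to its owning team under the Data Mesh model.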
Investigate the data landscape. Map storage locations, catalog formats, profile schemas, assess data volumes, interview R&D engineers.
Implement the 17-stage pipeline. Bronze ingestion, Silver validation, quality gates, normalization, staging, partitioning.
Data serving layer for AI consumption. APIs, batch access, feature store. Data catalog, lineage, schema registry, access control.
Measure pipeline performance. Data quality scores, ingestion throughput, processing latency, schema stability, coverage percentage.
Directory structure analysis (recursive file scanning), schema profiling (automated tooling), data lineage tracing (source to storage), volume assessment (storage audits).
Semi-structured interviews with R&D engineers, surveys of current practices, direct observation of design workflows (shadowing), stakeholder workshops.
Industry reports on EDA data management, existing design datasets, literature review of current tools, competitive analysis of similar solutions.
Weekly meetings with Manuel Lexer (Team Lead), technical reviews with Salem Mohamed (Technical Contact), presentations to engineering teams, continuous feedback loops.
| Category | Metric | What It Measures |
|---|---|---|
| Pipeline Performance | Ingestion Rate | Records per second processed through the pipeline |
| Pipeline Performance | Pipeline Latency | Time from data creation to availability in Silver/Gold/Nectar |
| Data Quality | Quality Score | Percentage of data passing all validation gates (Stages 06-08) |
| Data Quality | Schema Stability | Backward compatibility maintained across schema versions |
| AI Readiness | Coverage | Percentage of EDA tools integrated into the pipeline |
| AI Readiness | Data Freshness | Age of data available to downstream AI systems |
| Business Impact | Time to Access | Before/after comparison: hours of manual data prep to minutes |
| Business Impact | Developer Productivity | Time saved on data preparation by AI/ML teams |
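The Quality Score metric follows directly from the Stage 06-08 gate outputs. A sketch with invented counts (not measured results):

```python
def quality_score(passed: int, total: int) -> float:
    """Percentage of records passing all validation gates (Stages 06-08)."""
    if total == 0:
        return 0.0  # no data ingested yet
    return round(100.0 * passed / total, 2)

# Illustrative numbers only, not measurements.
print(quality_score(9421, 10000))  # -> 94.21
```

Tracking this per domain and per run turns the metric table above into a dashboard: a drop in one domain's score points directly at a schema or source change in that tool.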
This thesis is in the Concept Study phase (February to June 2026). The data landscape is being discovered for the first time. The pipeline architecture is designed, the Medallion layers are defined, and the Data Mesh framework is in place. The next step is implementing the concept with a concrete use case, starting with one domain (ansysem) and expanding the pattern to all 8 tools. Stages 00 through 03 (Discovery and Planning) must complete before any pipeline code runs.
Novel 17-stage pipeline with four data layers, filling the gap in EDA data management research.
First application of Data Mesh and Data-as-a-Product concepts to the semiconductor EDA context.
Fact-based assessment of existing platforms mapped against each pipeline stage.
Technical discovery combined with qualitative research and iterative validation, documented for reuse.
Quantitative metrics: data quality scores, throughput, latency, schema stability. Before/after comparisons.
Documented inventory of where EDA tools store data, what formats exist, and how outputs are accessed.
Working pipeline: ingestion, cleaning, validation, quality checks. Proven on one use case, ready to scale.
Schema-governed, partitioned data products with quality standards. Template for additional tools.
Pipeline plugs into existing infrastructure. Custom code only fills genuine gaps.
Architecture docs, deployment procedures, operational guides. Team maintains it independently after thesis ends.
The university gets a novel architecture, an applied framework, a platform evaluation, and measured results from a real use case. The company gets a working pipeline, discovered data, quality gates, and a standardized design that scales beyond the thesis. Both outcomes come from the same work, and both parties can point to concrete results when the project wraps up.
Student: Hamza Ghaffar ·
Program: MSc Communication Engineering, FH Kärnten, Villach ·
Company: Infineon Technologies Austria AG ·
Role: Software Automation Intern (Full-Time)
Supervisor 1: Manuel Lexer (Team Lead) ·
Supervisor 2: Salem Mohamed (Technical Contact) ·
Phase: Step 2 - Concept Study (Feb-Jun 2026)