8 EDA Codesign tools generate vast amounts of data across simulation, testing, and design cycles. Storage locations unknown. Formats undocumented. No central catalog. Data is fragmented, uncataloged, and effectively inaccessible to AI systems.
A 17-stage data pipeline processing raw tool output through Bronze, Silver, Gold, and Nectar layers. Data Mesh architecture gives each tool domain ownership of its data while maintaining quality through federated governance.
Clean, schema-governed, AI-ready data products. Automated quality gates. Documented lineage. Versioned data. From hours of manual data prep to structured, queryable datasets that ML systems consume directly.
No knowledge of where data is stored across systems. No centralized catalog exists. Data scattered in network shares, databases, and files. Investigation required before any pipeline work can begin.
Unknown number of data formats per tool. XML, JSON, CSV, binary, text logs all mixed together. No schema profiling done. Data types undocumented.
Each of the 8 tools stores data differently. No cross-tool data integration. No shared identifiers. Engineers cannot combine data from different tools for analysis.
No automated validation. No data quality metrics. No checks for completeness, accuracy, or consistency. Unknown percentage of usable data.
Machine learning models cannot consume fragmented, unstructured, unvalidated data. Engineers spend time searching for data instead of using it. Valuable insights remain hidden.
No data lineage tracking. No schema versioning. No access patterns documented. No data products with SLAs. No discoverability for downstream consumers.
Current EDA data management research focuses on individual tool optimization or narrow ML applications. No existing work addresses an end-to-end data pipeline architecture that spans the full CPB co-design tool ecosystem. This thesis fills that gap by designing a 17-stage pipeline with four data layers, validated through a concrete use case that can scale to additional tools over time.
Build a data pipeline that transforms raw tool output into clean, structured, AI-ready data. This is not about building AI models. This is about preparing the data so AI systems can consume it effectively. Think of it as building the infrastructure that makes data usable.
Engineers search manually through network drives, file servers, and databases to find tool output.
XML, JSON, CSV, binary files all scattered. No schema documentation. No format catalog.
No automated validation. Engineers eyeball data for completeness. Bad records pass undetected.
Each tool stores data independently. No cross-tool identifiers. No way to combine datasets.
ML models cannot consume fragmented, unstructured data. Hours wasted on manual preparation.
Every data source mapped, documented, and cataloged. Storage paths standardized and versioned.
All data normalized into columnar Parquet with Delta Lake. Schema-registered and typed.
Stage 07 validates every record against schemas. Invalid records rejected with documented reasons.
8 domains with consistent primary keys, federated governance, and cross-domain queryability.
Nectar layer delivers feature-engineered, scaled, and split datasets ready for ML consumption.
Data Discovery: Nobody knows where tool data lives or what format it uses
Data Access: Manual search through network shares, databases, and file servers
Quality: No validation, no schema checks, no consistency across tools
Integration: Each tool is its own silo with no shared identifiers
Bronze (Ingest): Connect to all 6 source types, extract raw data, preserve original structure
Silver (Validate): Schema checks, dedup, normalize units, detect anomalies, assign keys
Gold (Transform): Domain-specific business logic, aggregations, derived fields
Nectar (Serve): Feature engineering, encoding, scaling, train/val/test splits
Data Products: Schema-governed, quality-verified, discoverable datasets with SLAs
ML-Consumable: Feature vectors in Parquet, JSONL for LLMs, data cards for reproducibility
Cross-Domain Insights: Combined datasets from all 8 tools enable previously impossible analysis
Continuous: Pipeline runs independently per domain. New data flows automatically.
Fragmented & Manual
Unknown data landscape
17-Stage Pipeline
Medallion processing
AI-Ready Products
Governed & discoverable
Data enters raw and exits AI-ready. Each layer increases quality and reduces noise. Data flows one direction: Bronze to Silver to Gold to Nectar.
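The one-directional flow can be sketched as a chain of layer functions. This is a toy illustration only: the function names, the record shape, and the trivial transformations are invented for the sketch and are not the pipeline's actual API.

```python
# Toy sketch of the one-directional Medallion flow.
# All names and record shapes here are hypothetical placeholders.

def bronze(raw_records):
    """Ingest: keep records as-is, tagging their layer."""
    return [{"raw": r, "layer": "bronze"} for r in raw_records]

def silver(bronze_records):
    """Validate: keep only records whose payload is non-empty."""
    return [dict(r, layer="silver") for r in bronze_records if r["raw"]]

def gold(silver_records):
    """Transform: apply domain logic (here, a trivial uppercase)."""
    return [dict(r, value=r["raw"].upper(), layer="gold") for r in silver_records]

def nectar(gold_records):
    """Serve: expose only the ML-ready fields."""
    return [{"feature": r["value"]} for r in gold_records]

# Data flows one direction only: Bronze -> Silver -> Gold -> Nectar.
result = nectar(gold(silver(bronze(["a", "", "b"]))))
```

Each layer consumes only the layer before it, so quality decisions never have to be revisited downstream.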
Data Mesh distributes data ownership across the tool domains while maintaining consistency through federated governance. Each of the 8 EDA tools becomes one data domain.
Each of the 8 EDA Codesign tools is a separate data domain. The team that produces the data also owns and manages it. No central data team collects everything.
Every dataset is schema-registered, quality-verified, documented, versioned, and discoverable. Outputs have SLAs, ownership, and quality guarantees. Not raw data dumps.
Shared tooling available to all 8 domains. Each team uses the pipeline independently, accessing their own partition without dependency on a central team.
Global standards enforced uniformly (naming, key formats, quality gates), but each domain team implements them for their specific tools. Centralized rules, distributed execution.
All data follows the pattern /{layer}/{domain}/{run_id}/*.parquet, with optional stage or split subdirectories (as in the Silver and Nectar examples below):
/bronze/bdg/run_2026_04_08/data.parquet
/silver/sigrity/run_2026_04_08/stage_07/validated.parquet
/gold/ansysem/run_2026_04_08/transformed.parquet
/nectar/bdg/train/features.parquet
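The base pattern can be built and sanity-checked with a few lines. The helper below is an illustrative sketch, not part of the pipeline, and covers only the base pattern (no optional stage subdirectories):

```python
from pathlib import PurePosixPath

LAYERS = ("bronze", "silver", "gold", "nectar")

def dataset_path(layer: str, domain: str, run_id: str, filename: str) -> str:
    """Build a path following /{layer}/{domain}/{run_id}/*.parquet."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    if not filename.endswith(".parquet"):
        raise ValueError("pipeline data files are Parquet")
    return str(PurePosixPath("/") / layer / domain / run_id / filename)

print(dataset_path("bronze", "bdg", "run_2026_04_08", "data.parquet"))
# -> /bronze/bdg/run_2026_04_08/data.parquet
```

Centralizing path construction in one helper keeps every domain's layout identical, which is what makes partition pruning by tool possible later.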
Primary language for all pipeline stages. Pandas, PyArrow, Delta Lake, Great Expectations, SQLAlchemy, lxml, fsspec.
Columnar analytics-optimized storage with ACID transactions, time travel, and schema enforcement. UTF-8 text, ISO 8601 timestamps.
SQLAlchemy for MySQL. lxml/xmltodict for XML. fsspec for unified file access (local, network, cloud). DVC Python API for versioning.
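To show the shape of XML ingestion, here is a minimal sketch that flattens a tool report into one flat record. It uses the standard-library parser as a stand-in for lxml/xmltodict, and the sample report (tool name, metric names, values) is entirely invented:

```python
# Flatten a hypothetical XML tool report into a flat record.
# Stdlib parser used as a stand-in for lxml/xmltodict.
import xml.etree.ElementTree as ET

SAMPLE = """
<report tool="sigrity">
  <run id="run_2026_04_08"/>
  <metric name="impedance" value="50.1"/>
  <metric name="crosstalk" value="0.03"/>
</report>
"""

def flatten_report(xml_text: str) -> dict:
    """Turn one <report> element into a flat dict of typed fields."""
    root = ET.fromstring(xml_text)
    record = {"tool": root.get("tool"), "run_id": root.find("run").get("id")}
    for m in root.findall("metric"):
        record[m.get("name")] = float(m.get("value"))
    return record

print(flatten_report(SAMPLE))
```

The same flatten-to-record shape applies regardless of source: SQLAlchemy rows, XML elements, and files opened via fsspec all land as typed records before Bronze storage.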
Each cycle through the pipeline improves downstream quality. Better input schemas lead to tighter validation, which produces cleaner data, which enables better feature engineering.
Each layer feeds quality back to the layer before it. Rejected records inform better schemas. Better schemas catch more errors. The flywheel accelerates with every pipeline run.
| Stage | Name | Layer | What It Does |
|---|---|---|---|
| 00 | Tool Data Mapping | Planning | Map each tool's inputs, outputs, storage locations, and file formats |
| 01 | Data Sources | Planning | Catalog all data sources and identify access methods (MySQL, XML, network shares, HCIP, DVC) |
| 02 | Big Data 5Vs | Planning | Measure Volume, Velocity, Variety, Veracity, Value per domain |
| 03 | Version Control | Planning | Set up DVC, HCIP storage, credential management, and versioning infrastructure |
| 04 | Data Acquisition | Bronze | Connect to sources, extract raw data as-is, preserve original format |
| 05 | Data Cleaning | Bronze | Remove duplicates, drop empty fields, handle nulls, strip irrelevant metadata |
| 06 | Data Consistency | Silver | Verify same identifiers mean same entities across all 8 domains |
| 07 | Data Quality | Silver | Schema validation quality gate. Reject invalid records. Never pass unstructured data downstream. |
| 08 | Data Accuracy | Silver | Verify values within valid ranges, timestamps current, precision appropriate |
| 09 | Data Normalization | Silver | Standardize units (mm, seconds), naming conventions, encoding across all domains |
| 10 | Anomaly Detection | Silver | Flag statistical outliers, volume spikes, pattern breaks for investigation |
| 11 | Data Staging | Silver | Checkpoint validated data in queryable, versioned, protected staging area |
| 12 | Primary Key / Index | Silver | Assign unique identifiers per record using {domain}-{id} pattern (e.g. bdg-001) |
| 13 | Data Partitioning | Silver | Divide data into domain partitions, enable partition pruning by tool |
| 14 | ETL | Gold | Extract from Silver, apply domain-specific business logic, load to Gold |
| 15 | Data Aggregation | Gold | Summarize with counts, averages, distributions, and percentiles |
| 16 | Nectar AI Serving | Nectar | Feature engineering, categorical encoding, numeric scaling, train/val/test split (70/15/15) |
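The 70/15/15 split in Stage 16 can be sketched with a deterministic shuffle. The helper below is illustrative (the real serving stage also does feature engineering, encoding, and scaling); the seed and record shape are assumptions:

```python
import random

def train_val_test_split(records, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Shuffle deterministically, then cut at the 70% and 85% marks."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # -> 70 15 15
```

Fixing the seed makes the split reproducible across pipeline runs, which matters for the data cards mentioned under ML-consumable outputs.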
Schema validation is mandatory at Stage 07. Records that fail validation are rejected and stored separately with rejection reasons. They are never propagated downstream. The rule is simple: never give unstructured data to AI.
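A minimal sketch of the Stage 07 gate, assuming a simple dict-based schema (the actual pipeline validates with Great Expectations, per the tooling list): valid records pass, invalid ones are captured with a rejection reason and never flow downstream.

```python
# Hypothetical schema: field name -> required Python type.
SCHEMA = {"record_id": str, "tool": str, "value": float}

def quality_gate(records):
    """Split records into (accepted, rejected-with-reason)."""
    accepted, rejected = [], []
    for r in records:
        missing = [f for f in SCHEMA if f not in r]
        if missing:
            rejected.append({"record": r, "reason": f"missing fields: {missing}"})
            continue
        bad = [f for f, t in SCHEMA.items() if not isinstance(r[f], t)]
        if bad:
            rejected.append({"record": r, "reason": f"wrong types: {bad}"})
            continue
        accepted.append(r)
    return accepted, rejected

ok, bad = quality_gate([
    {"record_id": "bdg-001", "tool": "bdg", "value": 1.5},
    {"record_id": "bdg-002", "tool": "bdg"},  # missing "value" -> rejected
])
```

Storing the rejects with their reasons, rather than discarding them, is what feeds the schema-improvement flywheel described earlier.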
Board-level design and layout. First domain fully automated through AutoBDG / ProDiGI.
Design optimization and manufacturability checks across the design flow.
Signal integrity analysis and simulation for high-speed interconnects.
Electromagnetic field analysis for package and board-level structures.
Core codesign functionality shared across the CPB co-design flow.
Plugin integration layer connecting tools within the codesign ecosystem.
Input/Output planning and optimization for chip-package interfaces.
Package design and implementation for semiconductor packaging.
These 8 tools live inside codesigndeploy, a subflow within Infineon's CAMINO production design flow. They exist as symbolic links pointing to the actual tool implementations. The storage locations and formats of the data these tools generate have historically been unknown, which is why Stage 00 (Tool Data Mapping) must complete before any pipeline work begins.
The pipeline connects to six source types: MySQL / Relational Databases, XML Files, Network Shared Folders, Internal File Server, HCIP (Infineon Internal Cloud), and DVC (Data Version Control). The exact source-to-tool mapping is confirmed during the Phase 1 Discovery stage.
| Domain | Key Prefix | Example Record ID |
|---|---|---|
| Board Design | bdg- | bdg-00142 |
| Design for Excellence | dfe- | dfe-00089 |
| Signal Integrity | sigrity- | sigrity-00231 |
| Electromagnetic Analysis | ansysem- | ansysem-00056 |
| Core Design | cdcore- | cdcore-00178 |
| Plugin Interface | cdilplugins- | cdilplugins-00034 |
| I/O Planning | iop- | iop-00092 |
| Package Implementation | pkgImpl- | pkgImpl-00115 |
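The {domain}-{id} keys from the table above can be minted with a per-domain counter. This helper is illustrative: the zero-padding width is inferred from the example IDs, and the real Stage 12 key assignment may differ.

```python
from itertools import count

# Key prefixes from the domain table; one counter per domain mints sequential IDs.
PREFIXES = ("bdg", "dfe", "sigrity", "ansysem",
            "cdcore", "cdilplugins", "iop", "pkgImpl")
_counters = {p: count(1) for p in PREFIXES}

def mint_key(domain: str, width: int = 5) -> str:
    """Return the next {domain}-{id} primary key, e.g. bdg-00001."""
    if domain not in _counters:
        raise KeyError(f"unknown domain: {domain}")
    return f"{domain}-{next(_counters[domain]):0{width}d}"

print(mint_key("bdg"), mint_key("bdg"), mint_key("sigrity"))
# -> bdg-00001 bdg-00002 sigrity-00001
```

Because the prefix encodes the domain, a record ID alone is enough to route a record back to its owning team under the Data Mesh model.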
Investigate the data landscape. Map storage locations, catalog formats, profile schemas, assess data volumes, interview R&D engineers.
Implement the 17-stage pipeline. Bronze ingestion, Silver validation, quality gates, normalization, staging, partitioning.
Data serving layer for AI consumption. APIs, batch access, feature store. Data catalog, lineage, schema registry, access control.
Measure pipeline performance. Data quality scores, ingestion throughput, processing latency, schema stability, coverage percentage.
Directory structure analysis (recursive file scanning), schema profiling (automated tooling), data lineage tracing (source to storage), volume assessment (storage audits).
Semi-structured interviews with R&D engineers, surveys of current practices, direct observation of design workflows (shadowing), stakeholder workshops.
Industry reports on EDA data management, existing design datasets, literature review of current tools, competitive analysis of similar solutions.
Weekly meetings with Manuel Lexer (Team Lead), technical reviews with Salem Mohamed (Technical Contact), presentations to engineering teams, continuous feedback loops.
| Category | Metric | What It Measures |
|---|---|---|
| Pipeline Performance | Ingestion Rate | Records per second processed through the pipeline |
| Pipeline Performance | Pipeline Latency | Time from data creation to availability in Silver/Gold/Nectar |
| Data Quality | Quality Score | Percentage of data passing all validation gates (Stages 06-08) |
| Data Quality | Schema Stability | Backward compatibility maintained across schema versions |
| AI Readiness | Coverage | Percentage of EDA tools integrated into the pipeline |
| AI Readiness | Data Freshness | Age of data available to downstream AI systems |
| Business Impact | Time to Access | Before/after comparison: hours of manual data prep to minutes |
| Business Impact | Developer Productivity | Time saved on data preparation by AI/ML teams |
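The Quality Score metric follows directly from the Stage 06-08 gate outputs. A sketch with invented counts (not measured results):

```python
def quality_score(passed: int, total: int) -> float:
    """Percentage of records passing all validation gates (Stages 06-08)."""
    if total == 0:
        return 0.0  # no data ingested yet
    return round(100.0 * passed / total, 2)

# Illustrative numbers only, not measurements.
print(quality_score(9421, 10000))  # -> 94.21
```

Tracking this per domain and per run turns the metric table above into a dashboard: a drop in one domain's score points directly at a schema or source change in that tool.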
This thesis is in the Concept Study phase (February to June 2026). The data landscape is being discovered for the first time. The pipeline architecture is designed, the Medallion layers are defined, and the Data Mesh framework is in place. The next step is implementing the concept with a concrete use case, starting with one domain (ansysem) and expanding the pattern to all 8 tools. Stages 00 through 03 (Discovery and Planning) must complete before any pipeline code runs.
Novel 17-stage pipeline with four data layers, filling the gap in EDA data management research.
First application of Data Mesh and Data-as-a-Product concepts to the semiconductor EDA context.
Fact-based assessment of existing platforms mapped against each pipeline stage.
Technical discovery combined with qualitative research and iterative validation, documented for reuse.
Quantitative metrics: data quality scores, throughput, latency, schema stability. Before/after comparisons.
Documented inventory of where EDA tools store data, what formats exist, and how outputs are accessed.
Working pipeline: ingestion, cleaning, validation, quality checks. Proven on one use case, ready to scale.
Schema-governed, partitioned data products with quality standards. Template for additional tools.
Pipeline plugs into existing infrastructure. Custom code only fills genuine gaps.
Architecture docs, deployment procedures, operational guides. Team maintains it independently after thesis ends.
The university gets a novel architecture, an applied framework, a platform evaluation, and measured results from a real use case. The company gets a working pipeline, discovered data, quality gates, and a standardized design that scales beyond the thesis. Both outcomes come from the same work, and both parties can point to concrete results when the project wraps up.
Student: Hamza Ghaffar ·
Program: MSc Communication Engineering, FH Kärnten, Villach ·
Company: Infineon Technologies Austria AG ·
Role: Software Automation Intern (Full-Time)
Supervisor 1: Manuel Lexer (Team Lead) ·
Supervisor 2: Salem Mohamed (Technical Contact) ·
Phase: Step 2 - Concept Study (Feb-Jun 2026)