About Me

Hi there! 👋 I'm Thoai.

I work in the cloud and data platform space, mostly around Kubernetes, Kafka, Spark, and the tools that make modern data systems run smoothly.

I enjoy digging into how data actually flows through systems, how storage works under the hood (pages, blocks, execution…), and how to turn a bunch of scattered services into a clean, maintainable pipeline.

Recently, I've been focusing more on Data Engineering, including:

  • designing reliable and scalable data platforms
  • technology stacks such as Iceberg, Lakehouse, Trino, and Spark
  • data modeling techniques
  • building ETL/ELT pipelines
  • and so on...

I created this blog/docs site to capture what I learn, what I experiment with, and the mistakes I run into along the way. Hopefully it helps someone else, or at least helps future me.

If you're into data, distributed systems, or just want to debug a burning pipeline together, feel free to reach out.

I also keep longer writeups and experiments in my blog/docs notes.

My Experience

March 2026 - Present
  • Designed and implemented an LLM-powered data dictionary and metadata governance workflow on Databricks with Unity Catalog, covering 10K+ datasets.
  • Generated table and column descriptions, business grain, ownership and governance tags, key field suggestions, and PII classifications.
  • Added human-in-the-loop validation to improve catalog completeness and support downstream data quality, masking, access control, and discovery use cases.
September 2025 - February 2026
  • Designed and integrated a unified feature store based on Feast into a legacy data platform, supporting batch and real-time feature engineering and API-based online serving with GitOps-style governance and versioning.
  • Operated feature serving at scale with 14M MAUs, around 2M streaming events/day, and sub-200 ms feature retrieval latency.
  • Built batch and streaming ETL / feature engineering pipelines using Spark and Airflow.
  • Developed internal frameworks and libraries to automate streaming pipeline deployment and management, allowing ML Engineers and Analysts to focus on business logic instead of infrastructure complexity.
  • Owned and improved the Risk data and ML platform with Spark, Airflow, HDFS, and related systems, ensuring stable daily operation of 50-60 Spark applications, each processing 500M-1B records per run.
April 2023 - August 2025

Contributed as a core member of the Data Platform team at a cloud service provider, delivering a platform-as-a-service for real-time ingestion, distributed processing, governed access, and self-service analytics to enterprise clients.

  • Engineered a full-fledged LakeHouse platform with comprehensive data governance for Spark and Trino.
  • Integrated OAuth2-based identity propagation, fine-grained access control, and dynamic data masking through Apache Ranger.
  • Added automated lineage tracking through OpenMetadata and standardized encryption at rest with S3 SSE-C.
  • Developed high-throughput CDC pipelines (100 GB/day, 5K TPS) using Kafka Connect and Debezium, migrating 500+ PostgreSQL tables to ClickHouse, Iceberg, and S3.
  • Built a self-service Spark environment on JupyterHub with a custom Profile Manager, secure session provisioning, LakeHouse integration, and dynamic environment configuration.
  • Enhanced Spark orchestration by creating custom Airflow plugins integrated with Spark Operator for modular job submission, runtime tracking, and real-time log streaming.
  • Built unified monitoring dashboards using Prometheus and Grafana to track pipeline SLAs and detect anomalies across Spark, Kafka, and Airflow.
October 2022 - March 2023
  • Researched Kafka architecture and deployment feasibility, then designed Kafka-as-a-Service solutions on both VMs and Kubernetes.
  • Deployed Kafka on Kubernetes using Strimzi and implemented end-to-end monitoring with JMX, Telegraf, Prometheus, Grafana, and alerting through Telegram.
  • Built Kong plugins and integrated the API gateway into microservices on Kubernetes.

My Education

Open Source Contribution

Feast

  • Optimized MySQL Online Store write performance by implementing batch insert and transaction grouping, significantly reducing write latency. #5699
  • Introduced HDFS Registry backend, allowing teams to manage Feast feature definitions on Hadoop-compatible file systems. #5655
  • Added HDFS Staging support for Spark Offline Store, enabling distributed materialization and more efficient large-scale feature computation. #5635