Writing
Field notes on data engineering, big data, and the systems behind them.
Databricks Managed File Events for Auto Loader and File Arrival Trigger (AWS S3 + Provided SQS)
An end-to-end, copy-pasteable runbook for setting up Databricks file events on AWS using your own SQS queue (Provided queue mode), direct S3 → SQS notifications, Unity Catalog external locations, file arrival triggers, and Auto Loader.
databricks, auto-loader, unity-catalog, aws, sqs
Read Delta Tables with Snowflake via Unity Catalog
How Snowflake can read Delta tables governed by Unity Catalog using Iceberg Uniform reads — via credential vending, external volumes, and object storage catalog integration.
iceberg, delta-lake, snowflake, unity-catalog
Data Objects in Databricks
In Databricks, metadata for data objects (tables, views, etc.) is registered in a metastore. Previously, Databricks used the Hive metastore by default to register schemas, tables, and views. However, it's highly recommended to upgrade…
delta-lake
Deep Dive Into Delta Lake
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs…
spark, delta-lake, databricks
Spark working internals, and why should you care?
Most big data developers and data engineers start learning Spark by writing Spark SQL code to perform ETL on DataFrames (I know I did). I also wrote a post about Spark SQL programming. However,…
spark
Spark SQL Programming Primer
TL;DR: Spark SQL is a major component of Spark programming. This post introduces programming in Spark SQL through the Spark DataFrame API. It's important to be aware of Spark SQL built-in functions to be a…
spark
Debug long running Spark job
Your Spark job has been running for a long time. What should you do? Generally, long-running Spark jobs can be due to various factors. We like to call them the 5S: Spill, Skew, Shuffle,…
spark
Deploy Debezium and Kafka on AKS using Strimzi Operator
This tutorial follows the official Debezium documentation, Deploying Debezium on Kubernetes, but is modified for Azure Kubernetes Service and Azure Container Registry. You need an Azure account, Azure subscription, and Azure resource group before…
kafka, kubernetes, debezium, change-data-capture, docker