A hands-on research and engineering project exploring how data pipeline architecture directly drives cloud spend — documenting real optimizations with measured, reproducible results.
I'm Raj Bandaru, a Data Engineer from India building a deep understanding of how architecture decisions drive cloud costs. This project documents real optimizations I've designed, implemented, and measured across AWS data infrastructure.
My focus is on the systems layer — pipelines, storage layout, compute configuration, and query engine behavior — where the biggest cost levers live and where engineering judgment matters most.
Every optimization documented here is reproducible. The numbers are real. The approach generalizes across AWS, GCP, and Azure environments.
Identifying where pipeline architecture, storage design, and compute configuration generate unnecessary spend — then fixing it at the source.
Glue, Athena, S3, EMR, Lambda, Cost Explorer. Python for ETL logic. SQL for query analysis and optimization across Athena, Redshift, and Snowflake.
Actively seeking full-time Data Engineer positions across the USA. Strong background in AWS data infrastructure and pipeline optimization.
From India, working and building in the United States. Bringing engineering rigor, cost-first thinking, and a systems perspective to every problem.
Most organizations know their cloud bill is increasing. Few understand which specific system behaviors are responsible. Cost monitoring surfaces the numbers — it doesn't explain the architecture decisions generating them.
Spend increases month over month with no clear connection to individual pipelines, jobs, or teams.
Pipelines reprocess entire datasets when only a fraction of records have changed. Compute is billed for the full run regardless.
Millions of small files. No partitioning. Wrong storage classes for access frequency. Athena scanning full datasets for filtered queries.
Clusters sized for worst-case load, idle most of the time. On-demand pricing where Spot or reserved capacity would fit the workload.
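The full-reprocessing problem described above has a common alternative: keep a watermark from the last successful run and process only records changed since then. The following is a minimal sketch of that idea, not code from this project. It assumes each record carries an `updated_at` timestamp; the function name `select_changed` and the field name are hypothetical.

```python
from datetime import datetime, timezone


def select_changed(records, watermark):
    """Return (changed_records, new_watermark).

    `records` is an iterable of dicts with a hypothetical `updated_at`
    datetime field. `watermark` is the high-water mark saved by the
    previous run.
    """
    changed = [r for r in records if r["updated_at"] > watermark]
    # Advance the watermark only to the newest timestamp actually seen,
    # rather than to wall-clock time, so late-arriving rows are not skipped.
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark


if __name__ == "__main__":
    # Illustrative rows only, not data from this project.
    last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
    rows = [
        {"id": 1, "updated_at": datetime(2023, 12, 31, tzinfo=timezone.utc)},
        {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    ]
    changed, last_run = select_changed(rows, last_run)
    print(len(changed), "record(s) to process; next watermark:", last_run)
```

A second call with the returned watermark selects nothing until new changes arrive, which is what keeps compute proportional to the change volume instead of the dataset size.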
Analyzing ingestion, transformation, and loading workflows. Eliminating redundant computation, implementing incremental patterns, right-sizing compute.
Redesigning data layout for query efficiency. Partition structures, file format migration, compaction strategies, and lifecycle policy enforcement.
Reducing the volume of data scanned per query through structural changes. Applies across Athena, BigQuery, and Synapse query engines.
Analyzing cluster configurations and worker allocations. Identifying over-provisioned resources and shifting workloads to appropriate pricing models.
Connecting spend to specific pipelines, teams, and workloads using Cost Explorer, CloudWatch, and tagging strategies.
Identifying duplicated jobs, redundant data movement, and repeated transformations across multi-stage pipelines to reduce end-to-end compute cost.
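The effect of reducing data scanned per query can be reasoned about with simple arithmetic: on a scan-priced engine, cost follows the bytes read, and a filter on the partition key limits reads to the matching partitions. The sketch below is illustrative only. The function names are hypothetical, the default `price_per_tb` of 5.0 USD is an assumption to check against current pricing for your engine and region, and 1 TB is treated as 10**12 bytes for simplicity.

```python
def scanned_bytes(partitions, wanted=None):
    """Estimate bytes read by a query.

    `partitions` maps a partition value (for example a date string) to the
    size of that partition in bytes. With no filter on the partition key,
    everything is read; with a filter, only matching partitions are read.
    """
    if wanted is None:
        return sum(partitions.values())
    return sum(size for key, size in partitions.items() if key in wanted)


def scan_cost_usd(num_bytes, price_per_tb=5.0):
    """Convert bytes scanned to cost. price_per_tb is an assumed rate."""
    return num_bytes / 1e12 * price_per_tb


if __name__ == "__main__":
    # Hypothetical layout: one partition per day, sizes made up for the example.
    layout = {"2024-01-01": 2 * 10**12, "2024-01-02": 10**12}
    full = scanned_bytes(layout)
    pruned = scanned_bytes(layout, wanted={"2024-01-02"})
    print("full scan cost:", scan_cost_usd(full))
    print("pruned scan cost:", scan_cost_usd(pruned))
```

Columnar formats and compression reduce the bytes per partition further, but that effect depends on the data and is not modeled here.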
I don't start with tools or dashboards. I start by understanding what the system is actually doing — where work is repeated, where data moves unnecessarily, where compute runs without purpose.
Identify over-provisioned resources, idle clusters, and jobs consuming more capacity than their workload requires.
Find full reprocessing patterns where incremental processing is viable. Identify repeated transformations across workflows that produce the same output.
Audit storage classes, file layout, retention policies, and data formats relative to how data is actually queried and accessed.
Review architectural decisions — partitioning, data movement, orchestration logic — that compound cost across the full pipeline lifecycle.
Apply structural changes with before-and-after cost measurement. Validate that performance is maintained as spend decreases.
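The before-and-after measurement in the final step can be expressed as a small check: compute the percentage saved and confirm that runtime stayed within a tolerance. This is a sketch under stated assumptions, not the project's own tooling. The name `compare_runs` and the 10% `max_slowdown` default are hypothetical choices.

```python
def compare_runs(before_cost, after_cost, before_runtime_s, after_runtime_s,
                 max_slowdown=0.10):
    """Summarize an optimization as savings plus a performance check.

    max_slowdown is a hypothetical tolerance: the optimized run may be up to
    this fraction slower and still count as "performance maintained".
    """
    if before_cost <= 0 or before_runtime_s <= 0:
        raise ValueError("baseline cost and runtime must be positive")
    savings_pct = (before_cost - after_cost) / before_cost * 100
    slowdown = (after_runtime_s - before_runtime_s) / before_runtime_s
    return {
        "savings_pct": round(savings_pct, 1),
        "performance_maintained": slowdown <= max_slowdown,
    }


if __name__ == "__main__":
    # Made-up inputs for illustration; substitute measured values.
    print(compare_runs(100.0, 60.0, before_runtime_s=300, after_runtime_s=310))
```

Measuring both numbers on the same workload and input data is what makes the comparison meaningful; a cost drop paired with a large slowdown is a trade-off rather than a clear win.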
If you're building data infrastructure, optimizing pipelines, or working on cloud-scale engineering problems — I'd love to connect. Always happy to talk data engineering, cloud architecture, or anything in this space.