COST
Data Engineering · Cloud Cost Optimization · Portfolio

Your cloud bill is a
systems problem,
not a pricing problem.

A hands-on research and engineering project exploring how data pipeline architecture directly drives cloud spend — documenting real optimizations with measured, reproducible results.

Raj Bandaru
🇮🇳 · Data Engineer · AWS · Python · SQL · Based in USA
Pipeline Efficiency · Storage Optimization · Compute Right-Sizing · Query Cost Reduction · Data Movement Analysis · Lifecycle Management · Incremental Processing · Warehouse Trade-offs
About

Engineering-first.
Measured always.

I'm Raj Bandaru, a Data Engineer from India building a deep understanding of how architecture decisions drive cloud costs. This project documents real optimizations I've designed, implemented, and measured across AWS data infrastructure.

My focus is on the systems layer — pipelines, storage layout, compute configuration, and query engine behavior — where the biggest cost levers live and where engineering judgment matters most.

Every optimization documented here is reproducible. The numbers are real. The approach generalizes across AWS, GCP, and Azure environments.

FOCUS

Cloud cost engineering

Identifying where pipeline architecture, storage design, and compute configuration generate unnecessary spend — then fixing it at the source.

STACK

AWS · Python · SQL

Glue, Athena, S3, EMR, Lambda, Cost Explorer. Python for ETL logic. SQL for query analysis and optimization across Athena, Redshift, and Snowflake.

STATUS

Open to Data Engineer roles

Actively seeking full-time Data Engineer positions across the USA. Strong background in AWS data infrastructure and pipeline optimization.

ORIGIN

🇮🇳 Hyderabad → USA

From India, working and building in the United States. Bringing engineering rigor, cost-first thinking, and a systems perspective to every problem.

Projects

Real optimizations,
measured results.

Project 01  ·  Query Cost
Reducing Athena Query Cost by 98% Through Data Layout Optimization
An analytics workload querying data in Amazon Athena was generating high per-query costs due to inefficient storage design. Every query performed a full dataset scan regardless of the filter applied.
Before    S3 (raw CSV, unpartitioned) → Athena
After     S3 (raw CSV) → Glue ETL → S3 (Parquet, partitioned) → Athena
Data stored in CSV — row-based, uncompressed
No partitioning strategy; queries forced full table scans
High scan volume regardless of filter selectivity
Converted CSV → Parquet (columnar, Snappy compressed)
Implemented date-based partition structure (year/month/day)
Updated query filters to leverage partition pruning
214 MB  ·  Data scanned before
3.66 MB  ·  Data scanned after
98%  ·  Scan reduction
~$19K  ·  Est. annual savings at 50K queries/day
Athena cost is driven by data scanned, not query complexity. Storage design directly impacts query cost: a columnar format lets the engine read only the columns a query references, and partition pruning skips partitions that cannot match the filter, so those bytes are never scanned or billed. The approach generalizes across Athena, BigQuery, and Synapse query engines.
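The savings estimate above can be reproduced with simple arithmetic. The sketch below is illustrative, not necessarily the exact method behind the figure: it assumes an on-demand Athena rate of $5 per TB scanned (an assumption; rates vary by region and over time), decimal units, and the stated 50K queries per day. The name `annual_scan_cost` and its parameters are hypothetical.

```python
# Assumed on-demand Athena price; check current pricing for your region.
PRICE_PER_TB_USD = 5.00
BYTES_PER_MB = 1_000_000
BYTES_PER_TB = 1_000_000_000_000


def annual_scan_cost(mb_per_query: float, queries_per_day: int) -> float:
    """Estimated yearly Athena spend for a fixed per-query scan size."""
    bytes_per_year = mb_per_query * BYTES_PER_MB * queries_per_day * 365
    return bytes_per_year / BYTES_PER_TB * PRICE_PER_TB_USD


before = annual_scan_cost(214, 50_000)    # CSV, unpartitioned
after = annual_scan_cost(3.66, 50_000)    # Parquet, partitioned
savings = before - after
reduction = 1 - 3.66 / 214

print(f"before ${before:,.0f}  after ${after:,.0f}  savings ${savings:,.0f}")
print(f"scan reduction {reduction:.1%}")
```

Under these assumptions the difference lands close to the ~$19K quoted above. Athena also bills a minimum of 10 MB per query, which this sketch ignores; applying it would raise the after-cost and trim the savings somewhat.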
Amazon Athena · AWS Glue · S3 · Parquet · Snappy · Partition Pruning
Project 02  ·  Compute Efficiency
Reducing AWS Glue ETL Waste with Incremental Processing & Right-Sized Compute
A Glue ETL pipeline was executing full job runs on every trigger — processing ~1 MB of data with 10 G.1X workers. Compute capacity was provisioned for peak, while actual data processed per run remained minimal.
Before    S3 (raw) → Glue (10× G.1X, full reload) → S3 ❌
After     S3 (raw) → Glue (2–3× G.1X, incremental) → S3 ✓
10 × G.1X workers provisioned for a workload processing ~1 MB per run
Fixed cluster size — no workload-aware scaling or adjustment
Full pipeline execution on every trigger — no incremental logic or change tracking
Inefficient cost-to-data ratio across repeated runs (~55s–1m 9s each)
Right-sized compute from 10 × G.1X to 2–3 × G.1X based on observed workload
Introduced incremental processing — only new/changed data processed per run
Added watermark logic to track last processed state and skip unchanged records
Shifted from fixed batch to event-driven execution aligned to actual data delta
10×  ·  Workers before (G.1X)
2–3×  ·  Workers after (right-sized)
~65%  ·  Compute usage reduction
~45%  ·  Faster execution time
The system shifted from compute-heavy batch processing to lightweight, event-driven execution aligned with actual data changes. The fix was architectural — not infrastructural. Approach generalizes across Glue, Spark, and Databricks workloads.
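The watermark idea described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the record shape, the `updated_at` field, and the function name `select_new_records` are assumptions, and the watermark would normally be persisted between runs (for example in a small state file or table).

```python
from datetime import datetime, timezone


def select_new_records(records, watermark):
    """Return records modified after the watermark, plus the advanced watermark."""
    fresh = [r for r in records if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark


# Hypothetical sample data; a real job would read this from the source listing.
records = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
watermark = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)

fresh, watermark = select_new_records(records, watermark)
print([r["id"] for r in fresh], watermark.isoformat())

# A second run with no new data processes nothing and keeps the watermark.
again, watermark = select_new_records(records, watermark)
print(again)
```

AWS Glue also offers job bookmarks for a similar purpose; a hand-rolled watermark like this one trades that convenience for explicit control over what counts as "changed".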
AWS Glue · S3 · Incremental ETL · DPU Optimization · Watermark Logic · Right-Sizing
The Problem Space

Cloud costs rise.
Root causes stay hidden.

Most organizations know their cloud bill is increasing. Few understand which specific system behaviors are responsible. Cost monitoring surfaces the numbers — it doesn't explain the architecture decisions generating them.

01

No attribution to specific workloads

Spend increases month over month with no clear connection to individual pipelines, jobs, or teams.

02

Full reprocessing where incremental is sufficient

Pipelines reprocess entire datasets when only a fraction of records have changed. Compute runs regardless.

03

Storage designed for writes, not reads

Millions of small files. No partitioning. Wrong storage classes for access frequency. Athena scanning full datasets for filtered queries.

04

Compute provisioned for peak, running at average

Clusters sized for worst-case load, idle most of the time. On-demand pricing where spot or reserved is appropriate.
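For the storage problem (03) above, one common way to let a query engine skip data is a Hive-style `key=value` partition layout. The sketch below lists the day-level prefixes a date-range filter would touch; data outside those prefixes does not need to be read. The `year=/month=/day=` naming, the bucket path, and the function name are assumptions for illustration; the projects on this page only state a year/month/day structure.

```python
from datetime import date, timedelta


def partition_prefixes(base: str, start: date, end: date) -> list[str]:
    """List the day-level partition prefixes covering start..end inclusive."""
    days = (end - start).days + 1
    return [
        f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
        for d in (start + timedelta(days=i) for i in range(days))
    ]


# Hypothetical bucket and table path, for illustration only.
prefixes = partition_prefixes(
    "s3://example-bucket/events", date(2024, 2, 28), date(2024, 3, 1)
)
for p in prefixes:
    print(p)
```

A three-day filter touches three prefixes no matter how many years of data sit under the table root, which is why scan volume tracks filter selectivity once partitions exist.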

Before vs After

System-level changes,
measured outcomes.

Setting            Current State               After Optimization
compute.workers    G.2X × 20 nodes             G.1X × 6 nodes
query.scanned      1.4 TB per query            22 GB per query
storage.files      2.1M objects (avg 4 KB)     compacted, 128–512 MB
cluster.pricing    on-demand, always-on        spot + auto-terminate
data.format        CSV, uncompressed           Parquet, Snappy
partitioning       none                        year / month / day
processing.mode    full reload, daily          incremental, event-driven
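The `storage.files` change, from many tiny objects to files in the 128–512 MB range, amounts to grouping small files into larger output files. The sketch below plans such groups greedily; it is one simple approach under stated assumptions, not a description of a specific compaction job. The function name, the 256 MB target, and the input sizes are hypothetical.

```python
def plan_compaction(file_sizes_kb, target_kb=256 * 1024):
    """Group consecutive files into batches whose total size stays within target_kb."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_kb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_kb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 100,000 small files of 4 KB each.
sizes = [4] * 100_000
batches = plan_compaction(sizes)
print(len(sizes), "files ->", len(batches), "output file(s)")
```

Each batch would then be rewritten as a single object. A file larger than the target simply ends up alone in its own batch.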
Areas of Focus

What I work on
and understand deeply.

01 ——

Pipeline Efficiency

Analyzing ingestion, transformation, and loading workflows. Eliminating redundant computation, implementing incremental patterns, right-sizing compute.

02 ——

Storage Architecture

Redesigning data layout for query efficiency. Partition structures, file format migration, compaction strategies, and lifecycle policy enforcement.

03 ——

Query Cost Reduction

Reducing the volume of data scanned per query through structural changes. Applies across Athena, BigQuery, and Synapse query engines.

04 ——

Compute Optimization

Analyzing cluster configurations and worker allocations. Identifying over-provisioned resources and shifting workloads to appropriate pricing models.

05 ——

Cost Attribution

Connecting spend to specific pipelines, teams, and workloads using Cost Explorer, CloudWatch, and tagging strategies.

06 ——

Workflow Consolidation

Identifying duplicated jobs, redundant data movement, and repeated transformations across multi-stage pipelines to reduce end-to-end compute cost.
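The cost attribution area (05) above comes down to grouping spend by an ownership tag. The sketch below shows that aggregation on in-memory data. The line-item shape, the `pipeline` tag key, and the function name are assumptions for illustration and do not mirror any specific billing export schema.

```python
from collections import defaultdict


def spend_by_tag(line_items, tag_key="pipeline"):
    """Sum cost per tag value; untagged items are grouped under 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    # Largest spenders first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical line items, shaped loosely like a billing export.
line_items = [
    {"service": "AWS Glue", "cost_usd": 120.0, "tags": {"pipeline": "orders_etl"}},
    {"service": "Amazon Athena", "cost_usd": 45.5, "tags": {"pipeline": "reporting"}},
    {"service": "Amazon S3", "cost_usd": 30.0, "tags": {}},
    {"service": "AWS Glue", "cost_usd": 80.0, "tags": {"pipeline": "orders_etl"}},
]

for owner, cost in spend_by_tag(line_items).items():
    print(f"{owner:12s} ${cost:8.2f}")
```

The size of the `untagged` bucket is itself a useful signal: it measures how much spend cannot yet be tied to a pipeline or team.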

How I Think

System behavior
drives cost.

I don't start with tools or dashboards. I start by understanding what the system is actually doing — where work is repeated, where data moves unnecessarily, where compute runs without purpose.

Step 01

Locate where compute is overused

Identify over-provisioned resources, idle clusters, and jobs consuming more capacity than their workload requires.

Step 02

Determine if data is processed more than once

Find full reprocessing patterns where incremental is viable. Identify repeated transformations across workflows producing the same output.

Step 03

Evaluate storage against access patterns

Audit storage classes, file layout, retention policies, and data formats relative to how data is actually queried and accessed.

Step 04

Assess pipeline structure end-to-end

Review architectural decisions — partitioning, data movement, orchestration logic — that compound cost across the full pipeline lifecycle.

Step 05

Implement and measure

Apply structural changes with before-and-after cost measurement. Validate that performance is maintained as spend decreases.
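Step 05 can be made concrete with a small before-and-after check. The sketch below accepts a change only if cost drops and runtime does not regress beyond a tolerance. The `RunMetrics` type, the `evaluate_change` function, and the sample numbers are hypothetical and are not figures from the projects above.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    cost_usd: float
    runtime_s: float


def evaluate_change(before: RunMetrics, after: RunMetrics, max_slowdown=0.0):
    """Return (cost_reduction, runtime_change, accepted) for one optimization."""
    cost_reduction = 1 - after.cost_usd / before.cost_usd
    runtime_change = after.runtime_s / before.runtime_s - 1
    accepted = cost_reduction > 0 and runtime_change <= max_slowdown
    return cost_reduction, runtime_change, accepted


# Hypothetical measurements for illustration.
before = RunMetrics(cost_usd=10.0, runtime_s=60.0)
after = RunMetrics(cost_usd=4.0, runtime_s=45.0)

cost_reduction, runtime_change, accepted = evaluate_change(before, after)
print(f"cost -{cost_reduction:.0%}, runtime {runtime_change:+.0%}, accepted={accepted}")
```

Setting `max_slowdown` above zero allows a deliberate trade, for example tolerating a 10% slower run in exchange for a large cost drop.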

Technical Skills

Technical depth across
modern data infrastructure.

#    Capability                                  Tools & Technologies
01   Incremental data processing patterns        Glue · Airflow · Lambda
02   Partition design and storage layout         S3 · Hive · Iceberg
03   Query engine cost optimization              Athena · BigQuery · Synapse
04   Warehouse vs query engine trade-offs        Redshift · Snowflake · Databricks
05   Compute right-sizing and spot strategies    EMR · Dataproc · HDInsight
06   Multi-stage pipeline restructuring          Medallion · Delta · dbt
07   Cost anomaly detection and attribution      Cost Explorer · CloudWatch
08   ETL pipeline development                    Python · PySpark · SQL
Connect

Open to Data Engineer opportunities across the USA.

If you're building data infrastructure, optimizing pipelines, or working on cloud-scale engineering problems — I'd love to connect. Always happy to talk data engineering, cloud architecture, or anything in this space.

Name Raj Bandaru
Email contact@cloudspendops.com
Focus Data Engineering · Cloud Infrastructure · AWS
Status Open to opportunities · USA