DS-551: Data Engineering@Scale

🎓 • Boston University

Course Overview

Advanced course in large-scale data management, processing, and analytics. Students build components of a hypothetical “Epidemic Engine” application while learning MapReduce-style systems (Hadoop, Spark), NoSQL and relational database management, data publishing with Kafka and Flink, and visualization tools. Emphasizes real-world scalable applications and enterprise-scale data ecosystems.

Teaching Impact

Spring 2024 – Present: Authored and launched the course for Spring 2024. Taught four times through Spring 2026. Maintained primary instructor role throughout all offerings.

Advanced elective for data science undergraduates and master’s students interested in understanding how data flows through systems to enable data science and ML applications.

Learning Objectives

By the end of this course, students will be able to:

Large-Scale Data Processing: Design and implement MapReduce workflows using Hadoop and Spark
Database Systems: Choose and work with NoSQL and relational databases for different data use cases
Stream Processing: Build real-time data pipelines using Kafka and Flink
System Architecture: Understand the complexity of enterprise-scale data ecosystems
Applied Integration: Connect data engineering components to support BI, alerting, and ML predictions

Course Innovation

Project-based learning: Students build components of the “Epidemic Engine” throughout the semester
Real-world context: Hypothetical but realistic application provides a coherent framework for learning
Tool differentiation: Emphasis on understanding concepts over specific tool implementations
Documentation literacy: Students learn to adapt to changing technical landscapes through reading software documentation
End-to-end perspective: Covers full data lifecycle from collection to visualization and prediction

Technologies Used

Apache Hadoop: Distributed storage and processing framework
Apache Spark: Fast cluster computing for large-scale data processing
Apache Kafka: Distributed streaming platform
Apache Flink: Stream processing framework
NoSQL Databases: Document, graph, and key-value stores
Power BI: Business intelligence and visualization
Python: Primary programming language for data processing
SQL/DDL: Database design and querying