DS-551: Data Engineering@Scale

🎓 • Boston University

Course Overview

An advanced course in large-scale data management, processing, and analytics. Students build components of a hypothetical “Epidemic Engine” application while learning MapReduce-style processing systems (Hadoop, Spark), NoSQL and RDBMS management, data publishing with Flink and Kafka, and visualization tools. The course emphasizes real-world scalable applications and enterprise-scale data ecosystems.

Teaching Impact

Spring 2024 – Present: Authored and launched the course for Spring 2024. Taught four times through Spring 2026, serving as primary instructor for every offering.

Advanced elective for data science undergraduates and master’s students interested in understanding how data flows through systems to enable data science and ML applications.

Learning Objectives

By the end of this course, students will be able to:

  1. Large-Scale Data Processing: Design and implement MapReduce workflows using Hadoop and Spark
  2. Database Systems: Navigate NoSQL and relational database systems for different data use cases
  3. Stream Processing: Build real-time data pipelines using Kafka and Flink
  4. System Architecture: Understand enterprise-scale data ecosystem complexity
  5. Applied Integration: Connect data engineering components to support BI, alerting, and ML predictions

Course Innovation

  • Project-based learning: Students build components of the “Epidemic Engine” throughout the semester
  • Real-world context: A hypothetical but realistic application provides a coherent framework for learning
  • Tool differentiation: Emphasis on understanding concepts over specific tool implementations
  • Documentation literacy: Students learn to adapt to changing technical landscapes by reading software documentation
  • End-to-end perspective: Covers full data lifecycle from collection to visualization and prediction

Technologies Used

  • Apache Hadoop: Distributed storage and processing framework
  • Apache Spark: Fast cluster computing for large-scale data processing
  • Apache Kafka: Distributed streaming platform
  • Apache Flink: Stream processing framework
  • NoSQL Databases: Document, graph, and key-value stores
  • Power BI: Business intelligence and visualization
  • Python: Primary programming language for data processing
  • SQL/DDL: Database design and querying