Data Acquisition Management System

Course Overview

Data Acquisition Management System / Data Engineering is a key skill for AI engineering.

Course Objective

After completing the course, students should be able to:

Acquire data from various sources and ingest it into a data store (data lake, EDW, data mart, Delta Lake)
Work in a cloud ecosystem and be able to complete a certification with one cloud provider (AWS, Google Cloud, Azure)
Create ETL and ELT pipelines using various data processing tools (SQL, dbt, Dataform (part of BigQuery), Spark) for different applications, including data visualization and reporting (understand BI concepts)
Demonstrate understanding of data management issues, data quality, and data governance
Orchestrate and create pipelines (Airflow), and hand a pipeline over to the operations team, covering data management aspects such as incident management
Apply data governance: create a data catalogue, trace data lineage, identify personal information in data, and use standard data models and Open APIs
Gather requirements from business stakeholders and follow them through the design chain: the enterprise architect produces the high-level design, the solution architect produces the mid-level design, and the data engineer builds the solution

Prerequisites

Python Programming: NumPy, Pandas, Matplotlib, REST APIs, Web Scraping
Linux and Shell Scripting
Git and GitHub
Basic Data Science
Docker

Some prerequisites are revisited during the course.

Chapter Breakdown

Chapter 1: Introduction to Data Science, Engineering and Management [4 Hr]

Introduction to Data Science, Data Engineering and Data Management
DIKW Pyramid and its issues
Big Data and Big Data Ecosystem
Data Lifecycle
Data Management Principles and Challenges
Data Management Strategy and Frameworks
Data Engineering in Data Science (or ML) Lifecycle

Chapter 2: Data Handling [12 Hr]

Data Acquisition and Ingestion
Data Formats
Web Scraping: Scrapy and BeautifulSoup
Data Quality
Data Wrangling and Cleaning
Data Handling Ethics & Governance
Data Processing
Hadoop and MapReduce
Apache Spark: RDDs, DataFrames, SQL, MLlib, Streaming, GraphX
Data Streams
Apache Kafka: Topics, Partitions, Producers, Consumers, Kafka Connect
Apache Flume
YARN and Zookeeper
Cloud Services provided by AWS, GCP, Azure
AWS: EC2, AMI, EBS, S3, RDS, Athena, Redshift, Lambda, CloudWatch, Glue (for ETL jobs) and EMR
GCP: Cloud Storage, Data Fusion, BigQuery, Dataproc and Dataflow
Azure: Data Factory, SQL Database, Blob Storage, HDInsight, Databricks

Chapter 3: Data Modelling, Design and Storage [15 Hr]

Data Storage
Distributed Storage: GFS & HDFS
Database Schema and Notations
Relational Database Management System (RDBMS)
- Relationship and Entity Relationship Diagram
- ORM and UML Notations
Document Database: MongoDB
Columnar Database: Cassandra, HBase
Key-Value Pair DB: Redis
Graph Database: Neo4j
Multi-support, multi-paradigm database:
Search DB: Elasticsearch
Time-series DB
Cloud Data Warehouse
Data Lake and Data Mesh
Snowflake: Star Schema and Snowflake Schema
ETL and ELT pipeline

Chapter 4: Data Architecture and Orchestration

Importance of Data Architecture
Data Engineering Architectures, Pipelines and Best Practices
Lambda and Kappa Architectures
Orchestration: Apache Airflow
Streaming Pipelines
Model Deployment
Model Monitoring

Chapter 5: Data Management Concepts

Data Governance
Data Security
Data Integration and Interoperability
Context Management
Metadata Management
Data Management Maturity
Organizational Change Management

Chapter 6 (Optional): System Design

Data System Design
System Design Components
Scaling Data Systems
Distributed System Design
System Design Patterns for distributed systems
Case Studies

Assessment

Project-based assignments (40%), a midterm implementation milestone (20%), and a capstone research project with written report and defense (40%). Every module ends with something built and running.

References

DAMA-DMBOK: Data Management Body of Knowledge, 2nd Edition by DAMA International
Fundamentals of Data Engineering by Joe Reis and Matt Housley
Data Pipelines Pocket Reference by James Densmore
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak and Reuven Lax
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann