KU-AIAC557 Data Acquisition Management System
Kathmandu University
Data Engineering, Data Management, Orchestration, System Design
Data Acquisition Management System
Course Overview
Data Acquisition Management System / Data Engineering is a key skill for AI engineering.
Course Objective
After completion of the course, students should be able to:
- Acquire data from various sources and ingest it into a data store (data lake, EDW, data mart, Delta Lake)
- Work in a cloud ecosystem and complete a certification from one cloud provider (AWS, Google Cloud, Azure)
- Create ETL and ELT pipelines using various data processing tools (SQL, dbt, Dataform (part of BigQuery), Spark) for different applications, including data visualization and reporting (understand BI concepts)
- Demonstrate understanding of data management issues, data quality, data governance
- Build and orchestrate pipelines (Airflow) and hand them over to the operations team, taking care of data management aspects such as incident management
- Apply data governance: create data catalogues, trace data lineage, identify personal information in data, and use standard data models and open APIs
- Gather requirements from business stakeholders and follow the design chain: the enterprise architect produces the high-level design, the solution architect produces the mid-level design, and the data engineer builds the solution
Prerequisites
- Python Programming: NumPy, Pandas, Matplotlib, REST APIs, Web Scraping
- Linux and Shell Scripting
- Git and GitHub
- Basic Data Science
- Docker
Certain prerequisites are reviewed during the course.
Chapter Breakdown
Chapter 1: Introduction to Data Science, Engineering and Management [4 Hr]
- Introduction to Data Science, Data Engineering and Data Management
- DIKW Pyramid and its issues
- Big Data and Big Data Ecosystem
- Data Lifecycle
- Data Management Principles and Challenges
- Data Management Strategy and Frameworks
- Data Engineering in Data Science (or ML) Lifecycle
Chapter 2: Data Handling [12 Hr]
- Data Acquisition and Ingestion
- Data Formats
- Web Scraping: Scrapy and BeautifulSoup
- Data Quality
- Data Wrangling and Cleaning
- Data Handling Ethics & Governance
- Data Processing
- Hadoop and MapReduce
- Apache Spark: RDDs, DataFrames, SQL, MLlib, Streaming, GraphX
- Data Streams
- Apache Kafka: Topics, Partitions, Producers, Consumers, Kafka Connect
- Apache Flume
- YARN and Zookeeper
- Cloud Services provided by AWS, GCP, Azure
- AWS: EC2, AMI, EBS, S3, RDS, Athena, Redshift, Lambda, CloudWatch, Glue for ETL jobs, and EMR
- GCP: Cloud Storage, Cloud Data Fusion, BigQuery, Dataproc and Dataflow
- Azure: Data Factory, SQL Database, Blob Storage, HDInsight, Databricks
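To make the MapReduce topic in this chapter concrete, the following is a minimal single-machine sketch of the three phases (map, shuffle, reduce) applied to a word count. It is not Hadoop code: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are illustrative assumptions, and a real cluster would distribute each phase across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Chain the three phases over a collection of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["data lake data mart", "data warehouse"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle run in parallel, which is the property that distributed frameworks exploit.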
Chapter 3: Data Modelling, Design and Storage [15 Hr]
- Data Storage
- Distributed Storage: GFS & HDFS
- Database Schema and Notations
- Relational Database Management System (RDBMS)
- Relationship and Entity Relationship Diagram
- ORM and UML Notations
- Document Database: MongoDB
- Columnar Database: Cassandra, HBase
- Key-Value Pair DB: Redis
- Graph Database: Neo4j
- Multi-support, multi-paradigm database:
- Search DB: Elasticsearch
- Time-series DB
- Cloud Data Warehouse
- Data Lake and Data Mesh
- Snowflake: Star Schema and Snowflake Schema
- ETL and ELT pipeline
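As an illustration of how the star schema and ETL topics in this chapter fit together, here is a small sketch that extracts rows from CSV text, transforms them, and loads them into one dimension table and one fact table in an in-memory SQLite database. The sample data, the table names `dim_product` and `fact_sales`, and all function names are assumptions made for this example; a course project would typically target a cloud data warehouse instead.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """order_id,product,category,quantity,unit_price
1,Laptop,Electronics,2,900.0
2,Mouse,Electronics,5,20.0
3,Desk,Furniture,1,150.0
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and derive the sales amount per order."""
    return [
        {
            "order_id": int(r["order_id"]),
            "product": r["product"].strip(),
            "category": r["category"].strip(),
            "quantity": int(r["quantity"]),
            "amount": int(r["quantity"]) * float(r["unit_price"]),
        }
        for r in rows
    ]


def load(conn, rows):
    """Load: write one dimension table and one fact table (a tiny star schema)."""
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE,
            category TEXT
        );
        CREATE TABLE fact_sales (
            order_id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES dim_product(product_id),
            quantity INTEGER,
            amount REAL
        );
        """
    )
    for r in rows:
        conn.execute(
            "INSERT OR IGNORE INTO dim_product (name, category) VALUES (?, ?)",
            (r["product"], r["category"]),
        )
        (product_id,) = conn.execute(
            "SELECT product_id FROM dim_product WHERE name = ?", (r["product"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
            (r["order_id"], product_id, r["quantity"], r["amount"]),
        )
    conn.commit()


def revenue_by_category(conn):
    """Report: join the fact table to its dimension and aggregate."""
    query = """
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d USING (product_id)
        GROUP BY d.category
        ORDER BY d.category
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(connection, transform(extract(RAW_CSV)))
    print(revenue_by_category(connection))
```

In an ELT variant of the same sketch, the raw rows would be loaded first and the type casting and amount calculation would be expressed as SQL inside the database.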
Chapter 4: Data Architecture and Orchestration
- Importance of Data Architecture
- Data Engineering Architectures, Pipelines and Best Practices
- Lambda and Kappa Architectures
- Orchestration: Apache Airflow
- Streaming Pipelines
- Model Deployment
- Model Monitoring
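The core idea behind the orchestration topic in this chapter is running tasks in dependency order. The sketch below shows that idea with Python's standard-library `graphlib` module (Python 3.9+). It is deliberately not Airflow code: the `run_pipeline` helper and the three toy tasks are assumptions for illustration, and a real orchestrator adds scheduling, retries, logging, and alerting on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its upstream tasks.

    tasks:        maps a task name to a zero-argument callable
    dependencies: maps a task name to the set of tasks it depends on
    Returns the order in which the tasks were executed.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


# Toy tasks that pass data through a shared dictionary (hypothetical example).
results = {}
tasks = {
    "extract": lambda: results.update(raw=[3, 1, 2]),
    "transform": lambda: results.update(clean=sorted(results["raw"])),
    "load": lambda: results.update(loaded=len(results["clean"])),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

if __name__ == "__main__":
    print(run_pipeline(tasks, dependencies))
    print(results)
```

If the dependency graph contains a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which mirrors the requirement that a pipeline be a directed acyclic graph.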
Chapter 5: Data Management Concepts
- Data Governance
- Data Security
- Data Integration and Interoperability
- Context Management
- Meta-data Management
- Data Management Maturity
- Organizational Change Management
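One governance task named in the course objectives is identifying personal information in data. The following is a minimal sketch of a column-level scan using regular expressions. The two patterns, the sample records, and the `find_pii_columns` helper are assumptions for illustration only: the patterns are crude heuristics that will produce false positives and false negatives, and they are not a substitute for a proper data classification tool.

```python
import re

# Rough, illustrative patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_pii_columns(records):
    """Return {column: set of PII kinds} for columns where any value matches."""
    flagged = {}
    for record in records:
        for column, value in record.items():
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    flagged.setdefault(column, set()).add(kind)
    return flagged


# Hypothetical sample records.
records = [
    {"id": 1, "contact": "ram@example.com", "note": "call +977 9800000000"},
    {"id": 2, "contact": "sita@example.org", "note": "no contact details"},
]

if __name__ == "__main__":
    print(find_pii_columns(records))
```

The flagged columns could then be recorded in a data catalogue as metadata, so that access controls and masking rules can be attached to them.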
Chapter 6 (Optional): System Design
- Data System Design
- System Design Components
- Scaling Data Systems
- Distributed System Design
- System Design Patterns for distributed systems
- Case Studies
Assessment
Project-based assignments (40%), a midterm implementation milestone (20%), and a capstone research project with written report and defense (40%). Every module ends with something built and running.
References
- DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition) by DAMA International
- Fundamentals of Data Engineering by Joe Reis, Matt Housley
- Data Pipelines Pocket Reference by James Densmore
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak, Reuven Lax
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann