Graduate

KU-AIAC557 Data Acquisition Management System

Kathmandu University

Data Engineering, Data Management, Orchestration, System Design

Data Acquisition Management System

Course Overview

Data Acquisition Management System / Data Engineering is a key skill for AI engineering.


Course Objective

After completion of the course, students should be able to:

  • Acquire data from various sources and ingest it into a data store (data lake, EDW, data mart, Delta Lake)
  • Work in a cloud ecosystem and be able to complete a certification with one cloud provider (AWS, Google Cloud, Azure)
  • Create ETL and ELT pipelines using various data processing tools (SQL, dbt, Dataform (part of BigQuery), Spark) for different applications, including data visualization and reporting (understanding BI concepts)
  • Demonstrate understanding of data management issues, data quality, and data governance
  • Orchestrate pipelines (Airflow) and hand them over to the operations team, covering data management aspects such as incident management
  • Apply data governance: create data catalogues, trace data lineage, identify personal information in data, and use standard data models and open APIs
  • Gather requirements from business stakeholders and follow them through the design chain: the enterprise architect produces the high-level design, the solution architect produces the mid-level design, and the data engineer builds the solution

Prerequisites

  • Python Programming: NumPy, pandas, Matplotlib, REST APIs, Web Scraping
  • Linux and Shell Scripting
  • Git and GitHub
  • Basic Data Science
  • Docker

Some prerequisites are revisited during the course.

Chapter Breakdown

Chapter 1: Introduction to Data Science, Engineering and Management [4 Hr]

  • Introduction to Data Science, Data Engineering and Data Management
  • DIKW Pyramid and its issues
  • Big Data and Big Data Ecosystem
  • Data Lifecycle
  • Data Management Principles and Challenges
  • Data Management Strategy and Frameworks
  • Data Engineering in Data Science (or ML) Lifecycle

Chapter 2: Data Handling [12 Hr]

  • Data Acquisition and Ingestion
  • Data Formats
  • Web Scraping: Scrapy and Beautiful Soup
  • Data Quality
  • Data Wrangling and Cleaning
  • Data Handling Ethics & Governance
  • Data Processing
  • Hadoop and MapReduce
  • Apache Spark: RDDs, DataFrames, SQL, MLlib, Streaming, GraphX
  • Data Streams
  • Apache Kafka: Topics, Partitions, Producers, Consumers, Kafka Connect
  • Apache Flume
  • YARN and Zookeeper
  • Cloud Services provided by AWS, GCP, Azure
  • AWS EC2, AMI, EBS, S3, RDS, Athena, Redshift, Lambda, CloudWatch, Glue for ETL jobs and EMR
  • GCP Cloud Storage, Cloud Data Fusion, BigQuery, Dataproc and Dataflow
  • Azure Data Factory, SQL Database, Blob Storage, HDInsight, Databricks

Chapter 3: Data Modelling, Design and Storage [15 Hr]

  • Data Storage
  • Distributed Storage: GFS & HDFS
  • Database Schema and Notations
  • Relational Database Management System (RDBMS)
    • Relationship and Entity Relationship Diagram
    • ORM and UML Notations
  • Document Database: MongoDB
  • Columnar Database: Cassandra, HBase
  • Key-Value Pair DB: Redis
  • Graph Database: Neo4j
  • Multi-model, multi-paradigm databases
  • Search DB: Elasticsearch
  • Time-series DB
  • Cloud Data Warehouse
  • Data Lake and Data Mesh
  • Snowflake: Star Schema and Snowflake Schema
  • ETL and ELT pipeline

Chapter 4: Data Architecture and Orchestration

  • Importance of Data Architecture
  • Data Engineering Architectures, Pipelines and Best Practices
  • Lambda and Kappa Architectures
  • Orchestration: Apache Airflow
  • Streaming Pipelines
  • Model Deployment
  • Model Monitoring

Chapter 5: Data Management Concepts

  • Data Governance
  • Data Security
  • Data Integration and Interoperability
  • Context Management
  • Metadata Management
  • Data Management Maturity
  • Organizational Change Management

Chapter 6 (Optional): System Design

  • Data System Design
  • System Design Components
  • Scaling Data Systems
  • Distributed System Design
  • System Design Patterns for distributed systems
  • Case Studies

Assessment

Project-based assignments (40%), a midterm implementation milestone (20%), and a capstone research project with written report and defense (40%). Every module ends with something built and running.


References

  • DAMA-DMBOK: Data Management Body of Knowledge, 2nd Edition, by DAMA International
  • Fundamentals of Data Engineering by Joe Reis and Matt Housley
  • Data Pipelines Pocket Reference by James Densmore
  • Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak and Reuven Lax
  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann