KU-AIAC557-DataAcquisitionManagementSystem

KU-AIAC557 Data Acquisition Management System

Kathmandu University, Department of Computer Science and Engineering

Subject: Data Acquisition Management System

Course Code: AIAC 557

Level: MTech in AI, Year 1, Semester II

Credit Hours: 3

Type: Elective [Theory + Practical]

After completion of the course, students should be able to:

acquire data from various sources and ingest it into a data store (data lake, EDW, data mart, delta lake)
work in a cloud ecosystem and be able to complete a certification with one cloud provider (AWS, Google, Azure)
create ETL and ELT pipelines using various data processing tools (SQL, dbt, Dataform (part of BigQuery), Spark) for different applications, including data visualization and reporting (understand BI concepts)
demonstrate understanding of data management issues, data quality, and data governance
perform the basics of ML operations: deploying a model in production (batch/) and monitoring the performance of the model (data drift / model drift)
- orchestrate pipelines (Airflow), create pipelines, and hand them over to the operations team, taking care of data management aspects such as incident management
- apply data governance: create a data catalogue, trace data lineage, identify personal information in data, and use standard data models and Open APIs
gather requirements from business stakeholders and follow them through design: the enterprise architect builds the high-level design, the solution architect produces the mid-level design, and the data engineer builds the solution the architect designed

In-Semester Evaluation: 60 marks
End-Semester Evaluation: 40 marks

Data Acquisition and Ingestion
- Data Formats
- Web Scraping: Scrapy and BeautifulSoup
Data Quality
Data Wrangling and Cleaning
Data Handling Ethics
Data Governance
Data Processing
- Hadoop and MapReduce
- Apache Spark: RDDs, DataFrames, SQL, MLlib, Streaming, GraphX
Data Streams
- Apache Kafka: Topics, Partitions, Producers, Consumers, Kafka Connect
- Apache Flume
YARN and ZooKeeper
Cloud Services provided by AWS, GCP, Azure
- AWS: EC2, AMI, EBS, S3, RDS, Athena, Redshift, Lambda, CloudWatch, Glue for ETL jobs, and EMR
- GCP: Cloud Storage, Cloud Data Fusion, BigQuery, Dataproc, and Dataflow
- Azure: Data Factory, SQL Database, Blob Storage, HDInsight, Databricks

Data Storage
Distributed Storage: GFS & HDFS
Database Schema and Notations
Relational Database Management System (RDBMS)
- Relationship and Entity Relationship Diagram
Object-oriented DB and UML Notations
Document Database: MongoDB
Columnar Database: Cassandra, HBase
Key-Value Pair DB: Redis
Graph Database: Neo4j
Multi-support, multi-paradigm database:
- Search DB: Elasticsearch
- Time-series DB
Cloud Data Warehouse
Data Lake and Data Mesh
- Snowflake: Star Schema and Snowflake Schema
ETL and ELT Pipelines
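
As an illustration of the ETL topic above, the following is a minimal sketch of an extract-transform-load flow using only the Python standard library. It is not course material: the sample data, the `sales` table, and the function names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, chosen only to show the three stages. An ELT variant would instead load the raw rows first and do the cleaning with SQL inside the data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or stream.
RAW_CSV = """id,name,amount
1, Alice ,10.5
2,Bob,
3,carol,7
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and title-case names, cast types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip().title(), float(amount)))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn


if __name__ == "__main__":
    conn = run_etl()
    print(conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall())
```

Keeping each stage as a separate function makes it straightforward to swap the in-memory SQLite target for a real warehouse, or to schedule the stages as separate tasks in an orchestrator.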

DAMA-DMBOK2: Data Management Body of Knowledge
Fundamentals of Data Engineering by Joe Reis and Matt Housley
Data Pipelines Pocket Reference by James Densmore
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak, and Reuven Lax
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann
