KU-AIAC557-DataAcquisitionManagementSystem

Data Storage and Data Modeling

Data needs to be stored or recorded for future reference and processing. In 1986, approximately 1% of the world’s capacity to store information was in digital format; this grew to 3% by 1993, to 25% by 2000, and to 97% by 2007. Data is growing in volume, variety, and velocity faster than ever before. According to Forbes (2019), 90% of the world’s data was generated in the previous two years, with about 2.5 quintillion bytes of data created each day.

A modern digital computer represents data (numbers, text, images, audio, etc.) as strings of binary digits, using different encodings such as ASCII and UTF-8 for characters, JPEG for images, MPEG-4 for video, and so on. A cosmic ray striking computer memory at just the right time can flip a bit, turning a 0 into a 1 or vice versa. Such an error is sometimes easy to spot because the discrepancy is an exact power of 2: in the 2003 Belgian election, a recount showed that one candidate had received exactly 4096 (2^12) extra votes, an error traced to a single flipped bit. By adding redundant bits to each encoded unit, a computer can detect errors in coded data and, using mathematical algorithms (error-correcting codes), correct them upon detection.
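
To make the idea concrete, here is a minimal sketch of a Hamming(7,4) error-correcting code in Python: 4 data bits are protected by 3 parity bits, so any single flipped bit can be located and corrected (the example values are arbitrary).

    # Hamming(7,4): encode 4 data bits into 7 code bits so that any single
    # bit flip can be detected and corrected.
    def hamming74_encode(d):
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4              # parity over positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4              # parity over positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4              # parity over positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 * 1 + s2 * 2 + s3 * 4   # 1-based position of the flipped bit
        if syndrome:
            c[syndrome - 1] ^= 1              # correct the flipped bit
        return [c[2], c[4], c[5], c[6]], syndrome

    codeword = hamming74_encode([1, 0, 1, 1])
    codeword[4] ^= 1                          # simulate a cosmic-ray bit flip
    data, error_at = hamming74_decode(codeword)
    print(data, "- error corrected at position", error_at)   # [1, 0, 1, 1] at 5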

A von Neumann machine stores both operating instructions (programs) and data in memory. There is also a hierarchy of memory (storage): memory, or primary storage, typically refers to Random-Access Memory (RAM), while storage, or secondary memory, refers to long-term storage devices. A RAM device has a set of address lines, and for each combination of bits applied to these lines a set of memory cells is activated (64 cells for a 64-bit microprocessor). A 64-bit OS can theoretically address 2^64 bytes, though no hardware yet exposes this entire range.
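
A quick back-of-the-envelope check of that addressing limit (purely illustrative):

    # An n-bit address can select 2^n bytes; 2^64 bytes is roughly 18.4
    # exabytes (16 EiB), far beyond what current hardware implements.
    for bits in (16, 32, 64):
        print(f"{bits}-bit addresses -> 2^{bits} = {2 ** bits:,} bytes")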

Figure: Computer memory hierarchy

Data on hard disks and solid-state drives can be read or written randomly. Even though an OS can address data at the individual-byte level, the actual data is read and written as fixed-length (block-size) sequences of bytes or bits called blocks (physical records). Data structured this way is said to be blocked; while blocking (packing) or deblocking (extracting), the blocked data is held in a data buffer. Block storage is normally abstracted by a file system or database management system (DBMS) for use by applications and end users.
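
A minimal sketch of block-oriented reading in Python, assuming a 4096-byte block size and a hypothetical file name; it mirrors how whole blocks are pulled into a buffer even when an application only needs a few bytes.

    BLOCK_SIZE = 4096                         # a common file-system block size

    def read_blocks(path, block_size=BLOCK_SIZE):
        """Yield the file's contents one fixed-length block at a time."""
        with open(path, "rb") as f:
            while True:
                buffer = f.read(block_size)   # deblocking into a data buffer
                if not buffer:                # end of file
                    break
                yield buffer

    # Example usage (the file name is hypothetical):
    # total_bytes = sum(len(block) for block in read_blocks("sensor.dat"))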

Data is separated into pieces, and each piece is given a name so the data can be easily isolated and identified. Each named group of data is called a file, and the structure and logical rules used to manage these groups of data and their filenames is called a file system (fs). A file system is the method and set of data structures that the OS uses to control how data is stored and retrieved. Most file systems (NTFS, ext*) are based on a block-device abstraction, which leads to space inefficiency due to internal fragmentation (multiple partially empty blocks). Some newer file systems, such as Btrfs and FreeBSD UFS2, attempt to solve this issue through block sub-allocation and tail merging, while ZFS supports variable block sizes. A file system manages free space, filenames, directories, and the metadata of files and directories (including access permissions and other file attributes), and maintains file-system integrity.
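
Internal fragmentation can be observed directly on a Unix-like system by comparing a file's logical size with the space actually allocated to it in whole blocks; a rough sketch (the path is hypothetical):

    import os

    def fragmentation(path):
        """Return (logical size, allocated bytes, wasted slack) for a file."""
        logical = os.path.getsize(path)             # bytes the file logically contains
        allocated = os.stat(path).st_blocks * 512   # st_blocks counts 512-byte units
        return logical, allocated, allocated - logical

    # Example usage:
    # size, on_disk, slack = fragmentation("/var/log/syslog")
    # print(size, "bytes used,", on_disk, "bytes allocated,", slack, "bytes wasted")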

Disadvantages of file-based data management systems, where data is stored in permanent files

Different File Systems

Cloud Storage

Data Modeling

Data modeling is an abstraction that organizes elements of real-world data and describes how they relate to each other. We are used to creating data models to manage our data in spreadsheets, but spreadsheets cannot handle operations at a large scale, such as a large number of sheets with millions of rows. Hence, we use a database to handle most such data for different business processes and user applications. A database is a structured repository or collection of (inter-related) data that is stored and retrieved under a pre-specified format and constraints for use in applications. Data can be stored in, updated in, or deleted from a database. Data organization is critical for any application or business use case: how data is organized determines how it can later be used, and if data is well organized, complicated queries can be simple. Hence, planning data organization and data modeling should start early and be improved iteratively with new requirements and new data.
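
As a small sketch of a database with a pre-specified format and constraints, the snippet below uses Python's built-in sqlite3 module (the table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")        # throwaway in-memory database
    conn.execute("""CREATE TABLE sensor_reading (
                        id     INTEGER PRIMARY KEY,
                        sensor TEXT NOT NULL,  -- constraint: value must be present
                        value  REAL NOT NULL
                    )""")
    conn.execute("INSERT INTO sensor_reading (sensor, value) VALUES (?, ?)",
                 ("temp-01", 21.5))                                        # store
    conn.execute("UPDATE sensor_reading SET value = 22.0 WHERE sensor = 'temp-01'")
    conn.execute("DELETE FROM sensor_reading WHERE value > 100")           # delete
    print(conn.execute("SELECT sensor, value FROM sensor_reading").fetchall())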

Evolution of Database

Data Modeling Process

Relational Database

Object-Oriented Databases

An object-oriented database contains data in the form of objects and classes. Objects are treated as real-world entities, and types (classes) are the categories or collections of objects. Object-oriented databases combine relational-model features with object-oriented principles: OOP properties such as objects, classes, inheritance, polymorphism, and encapsulation, along with database properties such as atomicity, consistency, isolation, durability, concurrency, and query processing.
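
The kind of structure an object-oriented database persists directly is sketched below with ordinary Python classes (the class names are illustrative); the objects are stored as objects rather than flattened into tables:

    class Vehicle:                            # base class
        def __init__(self, vin):
            self._vin = vin                   # encapsulated attribute

        def describe(self):                   # overridden below (polymorphism)
            return f"Vehicle {self._vin}"

    class Truck(Vehicle):                     # inheritance
        def __init__(self, vin, payload_kg):
            super().__init__(vin)
            self.payload_kg = payload_kg

        def describe(self):
            return f"Truck {self._vin}, payload {self.payload_kg} kg"

    fleet = [Vehicle("V-001"), Truck("T-042", 12000)]
    print([v.describe() for v in fleet])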

When Not to Use a Relational Database

These issues led to the creation of NoSQL databases (aka Not Only SQL, or non-relational databases). A NoSQL database has a simple design, simpler horizontal scaling, and finer control over availability. The data structures used differ from those in relational databases, which makes some operations faster. There are various types of NoSQL databases.

Document Database

Data is stored as documents grouped into collections. Each document in a collection can have a different structure.
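
A minimal sketch of a document collection using Python dictionaries (the field names are made up); documents in the same collection need not share the same fields:

    users = [
        {"_id": 1, "name": "Asha", "email": "asha@example.com"},
        {"_id": 2, "name": "Bikram",
         "address": {"city": "Kathmandu"}},   # different fields, nested value
    ]
    # Queries match on whatever fields a document happens to have:
    in_kathmandu = [u for u in users
                    if u.get("address", {}).get("city") == "Kathmandu"]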

Key-Value Store Database

Data is stored as a collection of key-value pairs; values are written and looked up only by their key.
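
A minimal sketch of a key-value store modeled as a Python dictionary (the key names are made up):

    store = {}
    store["session:42"] = {"user": "asha", "expires": 1700000000}   # put
    store["cart:42"] = ["sku-1001", "sku-2002"]
    print(store.get("session:42"))                                  # get by key
    store.pop("cart:42", None)                                      # delete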

Columnar or Wide-Column Database

Instead of tables, column families are the containers for rows. There is no need to know all columns up front, and each row can have a different number of columns. This model enables analysis of large datasets.
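
A minimal sketch of a wide-column layout, with one column family modeled as a dictionary of rows (the family, row keys, and column names are illustrative):

    readings = {   # column family: each row key maps to its own set of columns
        "device-1#2024-01-01": {"temp": 21.4, "humidity": 40},
        "device-2#2024-01-01": {"temp": 19.8, "pressure": 1012, "battery": 87},
    }
    # A new column appears simply by writing it for one row:
    readings["device-1#2024-01-01"]["vibration"] = 0.02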

Graph Database

For data that is highly connected and best represented as a graph: entities (nodes) with properties (information about the entity) are connected via relationships (edges).
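
A minimal sketch of a property graph with nodes, properties, and typed relationships (all names are made up):

    nodes = {
        "alice": {"label": "Person", "age": 30},
        "acme":  {"label": "Company", "city": "Pokhara"},
    }
    edges = [
        ("alice", "WORKS_AT", "acme", {"since": 2021}),   # (from, type, to, properties)
    ]
    # Traversal: who works at acme?
    employees = [src for (src, rel, dst, _) in edges
                 if rel == "WORKS_AT" and dst == "acme"]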

Multi-paradigm Database

Data Warehouse and Data Lake

When Not to Use a NoSQL Database

How to Pick the Right Database for Your Application?

If your business is not experiencing rapid growth or sudden changes, there is no need for more servers, and your data is structured and consistent, then there is no reason to design the system to support a wide variety of data and high traffic.

While building a highly scalable system, choosing the right database is often the single most important decision. For an exponentially growing business, a bad choice of database can lead to extended downtime, latency issues, and even data loss that impacts customers. When should you consider looking for a different database than the one you are already using? Mostly when you have the following issues:

  1. When P95 latency is extremely high (P95 is the threshold that 95% of transaction durations fall below, i.e. 5% exceed it; see the sketch after this list)
  2. Memory overflow is common
  3. Even basic requests require disk I/O, slowing things down
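
A minimal sketch of computing P95 latency from a sample of request durations (the values are made up):

    import statistics

    latencies_ms = [12, 15, 14, 18, 22, 250, 16, 19, 13, 17,
                    21, 400, 14, 15, 16, 18, 20, 23, 19, 17]
    p95 = statistics.quantiles(latencies_ms, n=100)[94]   # 95th percentile
    print(f"P95 latency: {p95:.1f} ms")                   # 5% of requests exceed this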

Before considering a switch, make sure you cannot solve the problem with the same database just by changing its configuration, such as memory size, compaction strategy, or garbage-collection behavior, and pushing it to its limits. Read the manuals about the limits of the database. Migrating a database is risky, costly, and takes a long time (months to years).

Key points to consider

When all other components of a system are fast, querying and searching for data usually becomes the bottleneck; NoSQL databases aim to prevent data access from being that bottleneck.

ACID Compliance

SQL interview problems

  1. How to find duplicates in a table (see the sketch below)
  2. How to delete duplicates from a table
  3. Difference between UNION and UNION ALL
  4. Difference between RANK, ROW_NUMBER, and DENSE_RANK
  5. Find records in a table which are not present in another table
  6. Find the second-highest salary employees in each department
  7. Find employees with a salary greater than their manager's salary
  8. Difference between INNER JOIN and LEFT JOIN
  9. Update a table and swap gender values
  10. Number of records in the output with different kinds of joins
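
A minimal sketch for problem 1, using sqlite3 and an illustrative employee table; the same GROUP BY / HAVING pattern also drives the delete in problem 2:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
    conn.executemany("INSERT INTO employee (name, dept) VALUES (?, ?)",
                     [("Asha", "HR"), ("Bikram", "IT"), ("Asha", "HR")])

    duplicates = conn.execute("""
        SELECT name, dept, COUNT(*) AS cnt
        FROM employee
        GROUP BY name, dept
        HAVING COUNT(*) > 1
    """).fetchall()
    print(duplicates)                         # [('Asha', 'HR', 2)]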

solutions: https://youtu.be/Iv9qBz-cyVA

References

https://www.sqliteonline.com
https://sqlitebrowser.org/dl/