With a growing internet population and advances in technology (IoT, social media, digitalization), the volume of data is growing exponentially every day.
“As of April 2022, the internet reaches 63% of the world’s population, representing roughly 5 billion. Of this total, 4.65 billion — over 93% — were social media users. According to Statista, the total amount of data predicted to be created, captured, copied and consumed globally in 2022 is 97 zettabytes, a number projected to grow to 181 zettabytes by 2025.” - Data Never Sleeps
We generate this data through events (the Internet of Events, the Internet of Everything).
Data is an asset from which enterprises can derive ongoing value through strategic planning and coordination. Data and information are vital to the day-to-day operations of organizations, which is why data has been called the “New Gold”, the “New Currency”, the “Life Blood”, the “New Oil”, and so on. Failure to manage data is failure to manage capital: it results in waste and lost opportunity. Data can give enterprises insights about their existing and potential customers, their products and services, and their internal processes, enabling better decisions that drive future value.
Just as goodwill is valued, it is important to value data, and it's likely that we'll see a record of data valuation in the P&L statement. But the valuation of data is not standardized. It is contextual (data that is valuable to one organisation might not be relevant to another) and temporal (data that is valuable now might not be useful tomorrow). Data has unique properties that make it challenging to manage and value.
The goal of data science and engineering is to convert data into information and information into knowledge or actionable insights, which gradually changes into wisdom.
The DIKW pyramid, also known as the DIKW hierarchy, shows the structural and functional relationship between the stages of data: Data, Information, Knowledge, and Wisdom. The same relationship is also depicted in other types of diagrams. Data such as facts, signals, and symbols, once organized into meaningful and informative structures through integration, aggregation, and other processing, is termed Information. Knowledge is relevant, contextual information that can be put into action through subjective interpretation or other cognitive processes. Applied knowledge becomes experience and wisdom for the organization; wisdom is non-material and is more about human judgment and decisions on how to use knowledge for the greater good. The more we process raw data, the more its volume is compressed and the more value we get from it, but some information may be lost along the way.
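As a rough illustration of the lower steps of the pyramid, the sketch below aggregates a few raw events into per-customer totals (data to information) and then applies a simple rule to pick customers worth acting on (information to knowledge). The event records, the function names, and the 100-unit threshold are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical raw data: individual purchase events as (customer, amount).
raw_events = [
    ("alice", 40.0),
    ("bob", 15.0),
    ("alice", 75.0),
    ("carol", 20.0),
    ("bob", 10.0),
]


def aggregate_spend(events):
    """Data -> information: total spend per customer."""
    totals = defaultdict(float)
    for customer, amount in events:
        totals[customer] += amount
    return dict(totals)


def high_value_customers(totals, threshold=100.0):
    """Information -> knowledge: customers whose total meets an assumed threshold."""
    return sorted(name for name, total in totals.items() if total >= threshold)


information = aggregate_spend(raw_events)
knowledge = high_value_customers(information)
```

Five events shrink to three totals and then to a single name. Each step is more directly usable, but the individual purchase amounts can no longer be recovered from the totals, which is the information loss described above.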
A lot of work needs to happen to convert data into information and then into knowledge. Data is acquired, cleaned, explored, processed/transformed, moved, integrated, validated, stored, and visualized through multiple pipelines before it reaches the end user for data-driven decision making. The process of extracting knowledge first appeared in database mining, where the term “knowledge discovery in databases” (KDD) was coined. The KDD process is commonly defined with the stages: Selection -> Pre-processing -> Transformation -> Data mining -> Interpretation/evaluation. There are many variations on this theme and many attempts to standardize the process. One such open standard process model is the CRoss-Industry Standard Process for Data Mining (CRISP-DM), conceived in 1996 through the effort of a consortium and currently adopted by IBM and other organizations. Another standard, developed by SAS, is SEMMA, which stands for Sample, Explore, Modify, Model, Assess. CRISP-DM, however, incorporates all the stages of SEMMA and KDD and adds business understanding and deployment stages.
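The KDD stages above can be sketched as a chain of small functions, each feeding the next. This is only a toy pipeline under stated assumptions: the sample records, the field names, the age bands, and the use of a per-band mean as the "mining" step are all made up for illustration and stand in for whatever real selection logic and models a project would use.

```python
def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [{"age": r.get("age"), "spend": r.get("spend")} for r in records]


def preprocess(records):
    """Pre-processing: drop records with missing values."""
    return [r for r in records if r["age"] is not None and r["spend"] is not None]


def transform(records):
    """Transformation: derive a coarse age-band feature (assumed bands)."""
    return [
        {"band": "under_30" if r["age"] < 30 else "30_plus", "spend": r["spend"]}
        for r in records
    ]


def mine(records):
    """Data mining: mean spend per band, a stand-in for a real model."""
    sums, counts = {}, {}
    for r in records:
        sums[r["band"]] = sums.get(r["band"], 0.0) + r["spend"]
        counts[r["band"]] = counts.get(r["band"], 0) + 1
    return {band: sums[band] / counts[band] for band in sums}


def interpret(pattern):
    """Interpretation/evaluation: report the band with the highest mean spend."""
    return max(pattern, key=pattern.get)


def run_kdd(records):
    """Run the stages in order and return the mined pattern and the finding."""
    result = records
    for stage in (select, preprocess, transform, mine):
        result = stage(result)
    return result, interpret(result)


# Hypothetical input records; one has a missing age and is dropped.
sample = [
    {"name": "a", "age": 25, "spend": 30.0},
    {"name": "b", "age": 41, "spend": 80.0},
    {"name": "c", "age": None, "spend": 55.0},
    {"name": "d", "age": 35, "spend": 60.0},
    {"name": "e", "age": 22, "spend": 50.0},
]

pattern, finding = run_kdd(sample)
```

Keeping each stage as its own function mirrors how the stages are usually drawn. CRISP-DM would wrap the same chain with a business-understanding step before it and a deployment step after it.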
These stages and the many data tasks within them require distinct skills, so separate data-related roles and responsibilities have emerged, chiefly Data Scientist, Data Engineer, and Data Analyst.
In 2012, Data Scientist was called the sexiest job of the 21st century; by 2020, however, the focus had shifted more towards Data Engineers. Some companies even expect data scientists to be full-stack, with data engineering and data ops knowledge including containerization and infrastructure tools. Expectations of data engineers have also grown considerably. Data engineers are now expected to deliver within hours a solution that could have taken days a decade ago. At the same time, a data engineer today needs to build far less from scratch and spends less time setting up and managing systems. Data science and engineering is a team sport: no one person is expected to have all the knowledge, skills, and specializations required for the wide-ranging tasks that fall within the scope of data engineering.
Strategic use case areas