Chapter 1: Data Analytics Process. Summary of Chapter 1 of Data Science Concepts and Techniques with Applications.
Introduction to Data. Data is crucial across various sectors, including research and business. The volume and complexity of data have increased significantly due to smartphones, IoT, and social media. Challenges include the 'curse of dimensionality' (analysis becomes harder as the number of features grows) and the complexity of processing large datasets.
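The summary names the 'curse of dimensionality' without showing it. One commonly cited symptom is that distances between random points become nearly indistinguishable as the number of dimensions grows. The sketch below is only an illustration of that effect, not something taken from the chapter; the function name relative_contrast, the uniform random data, and the point counts are all assumptions made for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one query point
    to n_points random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast should shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the nearest and farthest points are almost equally far away, which is one reason distance-based methods tend to struggle on very wide datasets.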
What is Analytics? Data analytics is the process of examining raw data to uncover meaningful knowledge. Steps include data gathering, cleaning, applying statistical techniques, and using machine learning algorithms. Analytics transforms raw data into actionable insights. The development of data analytics has enabled advancements like Artificial Neural Networks, which aim to mimic the human brain to perform tasks at a human level. New challenges have emerged, such as optimizing computing resources, enhancing efficiency, and addressing data issues like inaccuracy, large volumes, and anomalies.
Big Data vs Small Data. Big Data is characterized by four properties. Volume: Vast amounts of data must be handled. Velocity: Data is generated at an extremely high rate. Veracity: Data quality is a concern, including biases and anomalies. Variety: Data comes in a wide range of types (text, pictures). Small Data is easier to process with traditional tools and systems, whereas Big Data requires distributed systems like Hadoop, as well as cloud computing.
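The notes mention Hadoop but do not describe how distributed processing is organized. Hadoop is known for the MapReduce programming model, and the sketch below imitates its map, shuffle, and reduce phases in a single Python process. It does not use any Hadoop API; the function names and the sample text chunks are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks):
    # Each chunk stands in for a block of data held on a different machine.
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data needs big systems", "small data fits one system"]
    print(word_count(chunks))
```

In a real cluster the map calls would run in parallel on the machines that hold each chunk, which is what lets the approach scale to volumes a single system cannot store.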
Comparison Between Small and Big Data.
Small Data | Big Data
Small volumes of data, mostly organizational data | Large volumes including text, images, and videos
Stored on single systems | Data is distributed
Stand-alone storage devices, e.g., local servers | Data is connected, e.g., clouds, IoT
Homogeneous data, i.e., data has the same format | Heterogeneous data, i.e., data may be in different formats and shapes
Mostly structured data | May include structured, unstructured, and semi-structured data
Simple computing resources are sufficient to process the data | May require more computing resources, including cloud computing and grid computing
Processing such data is less challenging | Processing such data is more challenging when it comes to accuracy and efficiency of processing
Role of Data Analytics. Data analytics uses knowledge from statistics, AI, machine learning, etc. to uncover patterns in data. It includes data storage, cleansing, mining, and visualization to inform decision-making. It is widely used in industries like banking, healthcare, and security.
Types of Data Analytics. Descriptive Analytics: 'What happened?' Diagnostic Analytics: 'Why did it happen?' Predictive Analytics: 'What might happen?' Prescriptive Analytics: 'What should be done?'
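The four questions above can be made concrete with a toy example. The sketch below answers each one for a tiny, invented dataset of advertising spend and sales. The figures, function names, candidate spends, and budget are all hypothetical, and each function is a deliberately simple stand-in: a correlation only hints at a cause, and real diagnostic or prescriptive work is usually far richer.

```python
import math
from statistics import mean

# Hypothetical monthly figures, invented for illustration only.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]


def descriptive(values):
    """What happened? Summarize the past with an average."""
    return mean(values)


def diagnostic(xs, ys):
    """Why did it happen? Pearson correlation between a driver and an outcome."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def predictive(xs, ys, new_x):
    """What might happen? Forecast with a least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept + slope * new_x


def prescriptive(xs, ys, candidates, budget):
    """What should be done? Pick the affordable spend with the best forecast profit."""
    affordable = [c for c in candidates if c <= budget]
    return max(affordable, key=lambda c: predictive(xs, ys, c) - c)


if __name__ == "__main__":
    print("average sales:", descriptive(sales))
    print("spend/sales correlation:", diagnostic(ad_spend, sales))
    print("forecast at spend 6:", predictive(ad_spend, sales, 6.0))
    print("recommended spend:", prescriptive(ad_spend, sales, [4.0, 6.0, 8.0], budget=7.0))
```

Each step builds on the one before it: the summary describes the past, the correlation suggests a driver, the fitted line projects forward, and the final function turns the projection into a choice under a constraint.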
Challenges in Data Analytics. Managing large volumes of data. Processing real-time data. Presenting complex data in a simple format. Ensuring data quality and integrating data from multiple sources. Organizational challenges: lack of skills, budget constraints, and management pressure.
Top Tools in Data Analytics. Hadoop: Framework for distributed storage and processing of large data sets. Spark: Fast data processing engine for big data analytics. Tableau: Data visualization tool for translating data into intuitive graphs. Python and R: Programming languages widely used for data manipulation and analysis.
Business Intelligence. Business Intelligence (BI) helps organizations make data-driven decisions. Effective BI needs to meet four major criteria. Accuracy: Accurate input data is essential to prevent errors and support effective decisions, requiring cleansing to address issues like missing values. Valuable Insights: Insights must align with business needs to be relevant and effective. Timeliness: Insights should be timely, considering both data availability and prompt delivery. Actionable: Insights should be practical and consider constraints like budget and sales limits.
BI Process. The BI process consists of steps that repeat in a loop. Data Gathering: Collect and cleanse data to prepare it for BI processing. Analysis: Process data to extract insights. Action: Take action based on analyzed information. Measurement: Measure the results of actions against desired outcomes. Feedback: Use measurements to improve the BI process.
Data Analytics vs Data Analysis. Data analytics is a broad process for decision-making, while data analysis is a subset focused on extracting insights from data. Data analytics involves general collection and analysis, while data analysis includes cleaning and transforming data for deeper insights. Data analytics uses tools like Python and TensorFlow, whereas data analysis uses tools like RapidMiner and KNIME. Data analysis focuses on examining and transforming data, while data analytics manages the entire data process, including collection, organization, and storage.
Data Analytics vs Data Visualization. Data Analytics: The process of analyzing raw data for actionable insights. Data Visualization: Graphical representation of data to communicate findings. Visualization is a crucial part of analytics to make complex data easier to understand.
Data Analyst vs Data Scientist. Data Analyst: Focuses on interpreting existing data using BI tools. Data Scientist: Uses advanced techniques like machine learning to predict future outcomes. Data Scientists require more programming skills than Data Analysts.
Data Analytics vs Business Intelligence. Business intelligence uses historical data for decision-making; data analytics finds relationships for deeper insights. Business intelligence supports growth, while data analytics addresses specific business needs. Business intelligence looks at the past; data analytics predicts future outcomes. Business intelligence uses dashboards and reporting; data analytics uses techniques like data mining and big data analytics.
Data Analysis vs Data Mining. Data mining finds patterns; data analysis extracts insights. Data mining needs math, stats, and machine learning; data analysis also requires subject knowledge. Data miners identify patterns; data analysts clean and transform data for insights.
ETL: Extraction, Transformation, Loading. ETL is a process used to extract data from various sources, transform it into a usable format, and load it into a database. Extraction: Gathering data from multiple sources. Transformation: Cleaning and organizing the data. Loading: Storing the data in a target system, often for analysis.
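The three ETL stages can be sketched with nothing beyond Python's standard library. The example below is a minimal illustration under stated assumptions: the CSV text, the column names, the sales table, and the rule of dropping rows with a missing amount are all hypothetical, and an in-memory SQLite database stands in for whatever target system a real pipeline would use.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs, or databases.
RAW_CSV = """name,city,amount
 Alice ,lahore,120.50
Bob,KARACHI,
carol,Lahore,80
"""


def extract(text):
    """Extraction: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim whitespace, normalize case, drop rows missing an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip records with a missing value
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        cleaned.append((name, city, float(amount)))
    return cleaned


def load(rows, conn):
    """Loading: store the cleaned rows in a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text):
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl(RAW_CSV)
    for row in conn.execute("SELECT name, city, amount FROM sales ORDER BY name"):
        print(row)
```

Keeping the three stages as separate functions mirrors the definition above and makes it easy to swap one stage, for example loading into a different database, without touching the others.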
Data Science. Data science is a multidisciplinary field that transforms data into valuable insights using math, statistics, machine learning, and AI. Businesses of all sizes use data science for decision-making, driving demand for related skills and tools. Efforts are underway to make data science tools more affordable and accessible.
Life Cycle of a Data Science Project. Phase 1 – Discovery: Identify available data sources, requirements, desired outputs, feasibility, and infrastructure. Phase 2 – Data Preparation: Explore and preprocess the data, including ETL processes. Phase 3 – Model Plan: Define the model to extract knowledge and identify patterns in the data. Phase 4 – Model Development: Build the model using training and testing data with techniques like classification or clustering. Phase 5 – Operationalize: Implement the project by delivering code, installation, documentation, and demonstrations. Phase 6 – Communicate Results: Evaluate the project based on metrics like customer satisfaction and goal achievement.
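Phase 4 mentions building a model with training and testing data. The sketch below shows one simple way that idea could look: hold out part of a labeled dataset, fit a nearest-centroid classifier on the rest, and measure accuracy on the held-out part. The chapter summary does not prescribe this classifier; the data points, labels, split rule, and function names are all hypothetical choices made to keep the example small.

```python
import math

# Hypothetical labeled points: (feature_1, feature_2, label). Invented for illustration.
DATA = [
    (1.0, 1.2, "low"), (8.0, 8.5, "high"), (1.5, 0.8, "low"), (9.0, 7.5, "high"),
    (0.5, 1.0, "low"), (8.5, 9.0, "high"), (1.2, 1.6, "low"), (7.5, 8.0, "high"),
]


def split(data, test_every=3):
    """Hold out every test_every-th record for testing; train on the rest."""
    test = [row for i, row in enumerate(data) if i % test_every == 0]
    train = [row for i, row in enumerate(data) if i % test_every != 0]
    return train, test


def fit_centroids(train):
    """Model development: compute the mean point of each class."""
    sums = {}
    for x, y, label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, x, y):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda label: math.dist((x, y), centroids[label]))


def accuracy(centroids, test):
    """Fraction of held-out records whose predicted label matches the true one."""
    correct = sum(predict(centroids, x, y) == label for x, y, label in test)
    return correct / len(test)


if __name__ == "__main__":
    train, test = split(DATA)
    model = fit_centroids(train)
    print("centroids:", model)
    print("test accuracy:", accuracy(model, test))
```

Evaluating only on records the model never saw during fitting is the point of the split: it gives a more honest estimate of how the model would behave on new data.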