Data Preprocessing and Exploratory Data Analysis


Scene 1 (0s)

Data Preprocessing and Exploratory Data Analysis.

Scene 2 (16s)

Aim. Perform data preprocessing and exploratory data analysis (EDA). Use a real-world dataset with more than 1000 records. Apply Python libraries: Pandas, NumPy, Matplotlib, and Seaborn.

Scene 3 (28s)

Dataset Information.
Dataset: Netflix Movies and TV Shows
Source: Kaggle
Records: ~8,800
Attributes: 12
Format: CSV

Scene 4 (39s)

Dataset Attributes.
show_id – Unique ID
type – Movie or TV Show
title – Content title
director – Director name
country – Country of origin
release_year – Year of release

Scene 5 (49s)

Tools and Technologies.
Python – Programming language
Pandas – Data manipulation
NumPy – Numerical operations
Matplotlib – Visualization
Seaborn – Advanced graphs
Jupyter Notebook – Execution environment

Scene 6 (59s)

What is Data Preprocessing?
The process of cleaning and preparing raw data.
Removes noise, missing values, and inconsistencies.
Improves data quality before analysis.

Scene 7 (1m 10s)

Common Data Problems.
Missing values
Duplicate records
Incorrect data types
Outliers
Noisy data
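Each of these problems can be detected with a line or two of pandas. A minimal sketch on a small made-up DataFrame (the column values here are illustrative, not taken from the Netflix dataset):

```python
import pandas as pd

# Toy data containing the problems listed above
df = pd.DataFrame({
    "title": ["A", "B", "B", "C"],              # duplicate record ("B")
    "country": ["US", None, None, "IN"],        # missing values
    "year": ["2001", "1999", "1999", "2020"],   # wrong dtype (strings, not ints)
})

print(df.isnull().sum())      # count missing values per column
print(df.duplicated().sum())  # count fully duplicated rows
print(df.dtypes)              # spot incorrect data types

# Outliers in a numeric column can be flagged with the IQR rule
s = pd.Series([10, 12, 11, 13, 95])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)               # flags the value 95
```

The same calls scale unchanged to the full ~8,800-row dataset; they are exactly what Steps 3–6 of the implementation below run against the real CSV.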

Scene 8 (1m 18s)

Steps in Data Preprocessing.
Raw Dataset → Data Cleaning → Handling Missing Values → Removing Duplicates → Data Transformation → Exploratory Data Analysis

Scene 9 (1m 28s)

Exploratory Data Analysis (EDA).
Understand dataset structure
Identify patterns and trends
Detect anomalies
Validate assumptions
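A first pass at these goals usually needs only three pandas calls. A sketch on synthetic data (the column names mirror the Netflix dataset, but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2019, 2020, 2015, 2021],
})

print(df.shape)                   # structure: (rows, columns)
print(df.describe())              # numeric summary: mean, quartiles, extremes
print(df["type"].value_counts())  # category frequencies reveal patterns
```

`describe()` surfaces anomalies (an impossible minimum year, for example), while `value_counts()` is the quickest check of whether category proportions match your assumptions.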

Scene 10 (1m 36s)

Common EDA Techniques.
Histogram – Distribution of data
Bar Chart – Category comparison
Heatmap – Correlation visualization
Boxplot – Outlier detection

Scene 11 (1m 46s)

System Architecture.
Data Collection from Kaggle → Load dataset using Pandas → Perform Data Cleaning → Transform data → Perform Exploratory Data Analysis → Generate insights

Scene 12 (1m 56s)

Python Implementation.
Import libraries: pandas, numpy, matplotlib, seaborn
Load dataset using read_csv()
Check dataset using info()
Handle missing values
Perform visualization

Scene 13 (2m 6s)

Implementation Using Python.

# Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from scipy import stats

# Setting visualization style
sns.set(style="whitegrid")

# Step 2: Load Dataset
# Note: Ensure 'netflix_titles.csv' is in your working directory
try:
    df = pd.read_csv("netflix_titles.csv")
    print("Dataset Loaded Successfully.\n")
except FileNotFoundError:
    print("Error: The file 'netflix_titles.csv' was not found. Please check the path.")
    # Create an empty dataframe so the later checks do not fail if the file is missing
    df = pd.DataFrame()

# Display the first 5 records
if not df.empty:
    print("First 5 Records:")
    print(df.head())
    print("\n" + "="*50 + "\n")

Scene 14 (2m 39s)

# Step 3: Dataset Information
print("Dataset Information:")
df.info()
print("\n" + "="*50 + "\n")

# Check for Missing Values
print("Missing Values Count:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")

# Step 4: Handling Missing Values
# Replace missing values in 'director', 'cast', and 'country' with meaningful labels
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Not Available")
df['country'] = df['country'].fillna("Unknown")

# For 'rating' and 'date_added' we might drop rows or fill with the mode.
# Here we fill 'rating' with its mode because it is categorical
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

# Drop the remaining rows with a missing 'date_added' (usually very few)
df.dropna(subset=['date_added'], inplace=True)

print("Missing Values After Handling:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")

Scene 15 (3m 15s)

# Step 5: Remove Duplicate Records
before_dedup = df.shape[0]
df.drop_duplicates(inplace=True)
after_dedup = df.shape[0]
print(f"Duplicate Removal: {before_dedup - after_dedup} duplicates removed.")
print("\n" + "="*50 + "\n")

# Step 6: Data Type Conversion
# Converting 'date_added' to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
print("Data Type of 'date_added' converted to datetime.")
print(df.dtypes['date_added'])
print("\n" + "="*50 + "\n")

# ==========================================
# Step 7: Exploratory Data Analysis (EDA)
# ==========================================
if not df.empty:
    # Graph 1: Movies vs TV Shows
    plt.figure(figsize=(6, 4))
    sns.countplot(x='type', data=df, palette='Set2')
    plt.title("Distribution of Movies and TV Shows")
    plt.xlabel("Content Type")
    plt.ylabel("Count")
    plt.show()

Scene 16 (3m 49s)

# Graph 2: Content Release Trend
plt.figure(figsize=(10, 5))
sns.histplot(df['release_year'], bins=20, kde=True, color='skyblue')
plt.title("Release Year Distribution")
plt.xlabel("Release Year")
plt.ylabel("Frequency")
plt.show()

# Graph 3: Top 10 Countries Producing Content
# Excluding 'Unknown' countries for better visualization
filtered_countries = df[df['country'] != 'Unknown']
top_country = filtered_countries['country'].value_counts().head(10)
plt.figure(figsize=(10, 5))
top_country.plot(kind='bar', color='teal')
plt.title("Top 10 Content Producing Countries")
plt.xlabel("Country")
plt.ylabel("Number of Titles")
plt.show()

# Graph 4: Correlation Heatmap
# Selecting only the numeric columns
numeric_df = df.select_dtypes(include=['int64', 'int32'])
if not numeric_df.empty:
    plt.figure(figsize=(6, 4))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
    plt.title("Correlation Heatmap")
    plt.show()

Scene 17 (4m 31s)

else:
    print("No numeric columns found for heatmap.")

# Graph 5: Outlier Detection Using Boxplot
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['release_year'], color='orange')
plt.title("Boxplot for Release Year (Outlier Detection)")
plt.show()

# ==========================================
# Variations for the Assignment
# ==========================================
print("\n" + "="*50)
print("IMPLEMENTING VARIATIONS")
print("="*50 + "\n")

# Variation 1: Missing Value Handling (Mean/Median/Mode)
print(" Variation 1: Missing Value Handling ")
# Note: The Netflix dataset has very few numeric columns suitable for mean imputation,
# and 'release_year' is usually complete. We demonstrate mode replacement on the
# categorical 'rating' column (as in Step 4) and show the syntax for mean/median
# imputation below.

Scene 18 (5m 5s)

# Example: Mean Imputation (hypothetical numeric column)
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Example: Median Imputation (robust to outliers)
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())

# Mode Replacement for 'rating'
print("Mode value for rating:", df['rating'].mode()[0])
# Already implemented in Step 4, but here is the syntax again:
# df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
print("Missing value handling techniques (Mean, Median, Mode) reviewed.\n")

# Variation 2: Outlier Detection (Z-Score & IQR)
print(" Variation 2: Outlier Detection ")

# Method A: Z-Score (typically used for Gaussian distributions; we use 'release_year')
z_scores = np.abs(stats.zscore(df['release_year']))
threshold = 3
outliers_z = np.where(z_scores > threshold)
print(f"Number of outliers detected using Z-Score: {len(outliers_z[0])}")

# Method B: IQR (Interquartile Range)
Q1 = df['release_year'].quantile(0.25)

Scene 19 (5m 39s)

Q3 = df['release_year'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter the data, keeping only the non-outliers
df_no_outliers = df[(df['release_year'] >= lower_bound) & (df['release_year'] <= upper_bound)]
print(f"Original Dataset Size: {df.shape[0]}")
print(f"Dataset Size after IQR Outlier Removal: {df_no_outliers.shape[0]}")
print("\n")

# Variation 3: Categorical Encoding
print(" Variation 3: Categorical Encoding ")
# Using LabelEncoder to convert 'type' (Movie/TV Show) to numbers (0/1)
le = LabelEncoder()
df['type_encoded'] = le.fit_transform(df['type'])

# Viewing the result
print("Encoding 'type' column:")
print(df[['type', 'type_encoded']].head())
print("0 = Movie, 1 = TV Show (depending on alphabetical order)\n")

Scene 20 (6m 6s)

# Variation 4: Feature Engineering
print(" Variation 4: Feature Engineering ")
# Create a new column 'content_age' as the current year minus the release year
current_year = 2024
df['content_age'] = current_year - df['release_year']
print("New Feature 'content_age' created:")
print(df[['title', 'release_year', 'content_age']].head())
print("\n")

Scene 21 (6m 20s)

Graph: Movies vs TV Shows Distribution. (Bar chart "Distribution of Movies and TV Shows"; x-axis: Type (Movie, TV Show); y-axis: Count.)

Scene 22 (6m 29s)

Graph: Release Year Distribution. (Histogram "Release Year Distribution"; x-axis: Release Year (2000–2020); y-axis: Frequency.)

Scene 23 (6m 36s)

Graph: Top Countries Producing Netflix Content. (Bar chart "Top 10 Content Producing Countries"; x-axis: Country; y-axis: Number of Titles.)

Scene 24 (6m 44s)

Graph: Correlation Heatmap. (Heatmap "Correlation Heatmap" over the numeric columns of the dataset.)

Scene 25 (6m 59s)

Graph: Outlier Detection using Boxplot. (Boxplot "Outlier Detection Using Boxplot" for release_year, roughly 2000–2020.)

Scene 26 (7m 7s)

Results and Observations.
Movies outnumber TV Shows on the platform.
The USA produces the most Netflix content.
Content production increased sharply after 2015.
Several attributes contained missing values before cleaning.

Scene 27 (7m 18s)

Advantages of Data Preprocessing.
Improves data quality
Removes noise and errors
Improves machine learning performance
Helps detect anomalies

Scene 28 (7m 28s)

Conclusion.
Data preprocessing prepares data for analysis.
EDA helps extract meaningful insights.
Visualization helps reveal patterns in data.