Data Preprocessing and Exploratory Data Analysis.
Aim. Perform data preprocessing and exploratory data analysis (EDA) on a real-world dataset with more than 1,000 records, using the Python libraries Pandas, NumPy, Matplotlib, and Seaborn.
Dataset Information.
Dataset: Netflix Movies and TV Shows
Source: Kaggle
Records: ~8,800
Attributes: 12
Format: CSV
Dataset Attributes.
show_id – Unique ID
type – Movie or TV Show
title – Content title
director – Director name
country – Country of origin
release_year – Year of release
Tools and Technologies.
Python – Programming language
Pandas – Data manipulation
NumPy – Numerical operations
Matplotlib – Visualization
Seaborn – Advanced graphs
Jupyter Notebook – Execution environment
What is Data Preprocessing?
Data preprocessing is the process of cleaning and preparing raw data. It removes noise, missing values, and inconsistencies, improving data quality before analysis.
Common Data Problems.
Missing values
Duplicate records
Incorrect data types
Outliers
Noisy data
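Each of these problems can be detected with a few Pandas calls. The sketch below runs them on a small hand-made DataFrame (the column names and values are invented for illustration, not taken from the Netflix dataset).

```python
import pandas as pd

# Toy data illustrating each problem (invented values)
toy = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "director": ["X", None, None, "Y"],                # missing values
    "release_year": ["2001", "2015", "2015", "1850"],  # wrong dtype + outlier
})

missing_per_column = toy.isnull().sum()   # count missing values per column
duplicate_count = toy.duplicated().sum()  # count fully duplicated rows
print(toy.dtypes)                         # incorrect dtypes show up here

# Fix the dtype, then run a crude IQR outlier check on the numeric column
toy["release_year"] = toy["release_year"].astype(int)
q1, q3 = toy["release_year"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["release_year"] < q1 - 1.5 * iqr) |
               (toy["release_year"] > q3 + 1.5 * iqr)]
print(missing_per_column["director"], duplicate_count, len(outliers))
```

The same calls (`isnull().sum()`, `duplicated().sum()`, `dtypes`) are applied to the real dataset in the implementation that follows.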
Steps in Data Preprocessing.
Raw Dataset → Data Cleaning → Handling Missing Values → Removing Duplicates → Data Transformation → Exploratory Data Analysis
Exploratory Data Analysis (EDA).
Understand dataset structure
Identify patterns and trends
Detect anomalies
Validate assumptions
Common EDA Techniques.
Histogram – Distribution of data
Bar Chart – Category comparison
Heatmap – Correlation visualization
Boxplot – Outlier detection
System Architecture.
Data Collection from Kaggle → Load dataset using Pandas → Data Cleaning → Data Transformation → Exploratory Data Analysis → Generate insights
Python Implementation.
Import libraries: pandas, numpy, matplotlib, seaborn
Load dataset using read_csv()
Inspect dataset using info()
Handle missing values
Perform visualization
Implementation Using Python.

# Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from scipy import stats

# Set visualization style
sns.set(style="whitegrid")

# Step 2: Load Dataset
# Note: ensure 'netflix_titles.csv' is in your working directory
try:
    df = pd.read_csv("netflix_titles.csv")
    print("Dataset Loaded Successfully.\n")
except FileNotFoundError:
    print("Error: The file 'netflix_titles.csv' was not found. Please check the path.")
    df = pd.DataFrame()  # empty fallback so the checks below do not raise NameError

# Display first 5 records
if not df.empty:
    print("First 5 Records:")
    print(df.head())
    print("\n" + "=" * 50 + "\n")
# Step 3: Dataset Information
print("Dataset Information:")
print(df.info())
print("\n" + "=" * 50 + "\n")

# Check for missing values
print("Missing Values Count:")
print(df.isnull().sum())
print("\n" + "=" * 50 + "\n")

# Step 4: Handling Missing Values
# Replace missing values in 'director', 'cast', 'country' with meaningful labels
# (assignment instead of inplace=True avoids chained-assignment warnings)
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Not Available")
df['country'] = df['country'].fillna("Unknown")

# 'rating' is categorical, so fill it with the mode
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

# Drop the (usually very few) remaining rows with a missing 'date_added'
df.dropna(subset=['date_added'], inplace=True)

print("Missing Values After Handling:")
print(df.isnull().sum())
print("\n" + "=" * 50 + "\n")
# Step 5: Remove Duplicate Records
before_dedup = df.shape[0]
df.drop_duplicates(inplace=True)
after_dedup = df.shape[0]
print(f"Duplicate Removal: {before_dedup - after_dedup} duplicates removed.")
print("\n" + "=" * 50 + "\n")

# Step 6: Data Type Conversion
# Convert 'date_added' to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
print("Data type of 'date_added' converted to datetime.")
print(df.dtypes['date_added'])
print("\n" + "=" * 50 + "\n")

# ==========================================
# Step 7: Exploratory Data Analysis (EDA)
# ==========================================
if not df.empty:
    # Graph 1: Movies vs TV Shows
    plt.figure(figsize=(6, 4))
    sns.countplot(x='type', data=df, palette='Set2')
    plt.title("Distribution of Movies and TV Shows")
    plt.xlabel("Content Type")
    plt.ylabel("Count")
    plt.show()
# Graph 2: Content Release Trend
plt.figure(figsize=(10, 5))
sns.histplot(df['release_year'], bins=20, kde=True, color='skyblue')
plt.title("Release Year Distribution")
plt.xlabel("Release Year")
plt.ylabel("Frequency")
plt.show()

# Graph 3: Top 10 Countries Producing Content
# Exclude 'Unknown' countries for a cleaner chart
filtered_countries = df[df['country'] != 'Unknown']
top_country = filtered_countries['country'].value_counts().head(10)
plt.figure(figsize=(10, 5))
top_country.plot(kind='bar', color='teal')
plt.title("Top 10 Content Producing Countries")
plt.xlabel("Country")
plt.ylabel("Number of Titles")
plt.show()

# Graph 4: Correlation Heatmap
# Select only numeric columns
numeric_df = df.select_dtypes(include=['int64', 'int32'])
if not numeric_df.empty:
    plt.figure(figsize=(6, 4))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
    plt.title("Correlation Heatmap")
    plt.show()
else:
    print("No numeric columns found for heatmap.")

# Graph 5: Outlier Detection Using Boxplot
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['release_year'], color='orange')
plt.title("Boxplot for Release Year (Outlier Detection)")
plt.show()

# ==========================================
# Variations for the Assignment
# ==========================================
print("\n" + "=" * 50)
print("IMPLEMENTING VARIATIONS")
print("=" * 50 + "\n")

# Variation 1: Missing Value Handling (Mean/Median/Mode)
print(" Variation 1: Missing Value Handling ")
# Note: the Netflix dataset has very few numeric columns suitable for mean
# imputation; 'release_year' is usually complete. We demonstrate mode
# replacement on 'rating' (categorical), as done in Step 4, and show the
# syntax for mean/median imputation below.
# Example: mean imputation (hypothetical numeric column)
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())

# Example: median imputation (robust to outliers)
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())

# Mode replacement for 'rating'
print("Mode value for rating:", df['rating'].mode()[0])
# Already implemented in Step 4; the syntax again:
# df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
print("Missing value handling techniques (Mean, Median, Mode) reviewed.\n")

# Variation 2: Outlier Detection (Z-Score & IQR)
print(" Variation 2: Outlier Detection ")

# Method A: Z-Score (typically used for Gaussian distributions)
z_scores = np.abs(stats.zscore(df['release_year']))
threshold = 3
outliers_z = np.where(z_scores > threshold)
print(f"Number of outliers detected using Z-Score: {len(outliers_z[0])}")

# Method B: IQR (Interquartile Range)
Q1 = df['release_year'].quantile(0.25)
Q3 = df['release_year'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only non-outliers
df_no_outliers = df[(df['release_year'] >= lower_bound) &
                    (df['release_year'] <= upper_bound)]
print(f"Original Dataset Size: {df.shape[0]}")
print(f"Dataset Size after IQR Outlier Removal: {df_no_outliers.shape[0]}")
print("\n")

# Variation 3: Categorical Encoding
print(" Variation 3: Categorical Encoding ")
# Use LabelEncoder to convert 'type' (Movie/TV Show) to numbers (0/1)
le = LabelEncoder()
df['type_encoded'] = le.fit_transform(df['type'])
print("Encoding 'type' column:")
print(df[['type', 'type_encoded']].head())
print("0 = Movie, 1 = TV Show (depending on alphabetical order)\n")
# Variation 4: Feature Engineering
print(" Variation 4: Feature Engineering ")
# Create 'content_age' as the current year minus the release year
current_year = 2024
df['content_age'] = current_year - df['release_year']
print("New feature 'content_age' created:")
print(df[['title', 'release_year', 'content_age']].head())
print("\n")
Graph: Movies vs TV Shows Distribution – count of titles by content type.
Graph: Release Year Distribution – frequency of titles by release year.
Graph: Top 10 Content Producing Countries – number of titles by country.
Graph: Correlation Heatmap – correlations among the numeric columns.
Graph: Outlier Detection Using Boxplot – boxplot of release_year.
Results and Observations.
Movies outnumber TV Shows.
The USA produces the most Netflix content.
Content production increased sharply after 2015.
Several attributes contain missing values.
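These observations can be reproduced with simple aggregations. The sketch below runs the relevant calls on a tiny hand-made stand-in for the cleaned Netflix frame (the values are invented for illustration); the same expressions work on the real df.

```python
import pandas as pd

# Tiny stand-in for the cleaned Netflix DataFrame (invented values)
df = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "country": ["United States", "United States", "India",
                "United States", "India"],
    "release_year": [2016, 2018, 2019, 2014, 2020],
})

type_counts = df["type"].value_counts()          # Movies vs TV Shows
top_country = df["country"].value_counts().idxmax()
after_2015 = (df["release_year"] > 2015).mean()  # share of titles after 2015

print(type_counts.to_dict(), top_country, round(after_2015, 2))
```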
Advantages of Data Preprocessing.
Improves data quality
Removes noise and errors
Improves machine learning performance
Helps detect anomalies
Conclusion. Data preprocessing prepares data for analysis. EDA helps extract meaningful insights, and visualization helps reveal patterns in the data.