
Predictive Imputation

Machine Learning
AI · Python · scikit-learn · pandas

Summary

A robust data engineering tool developed to handle missing data through custom K-means clustering and automated classification, optimizing predictive accuracy across varied datasets using Python.

Details

This project addressed a critical challenge in data science: maintaining dataset integrity when faced with incomplete information. I engineered a two-stage pipeline—Missing Value Estimation and Optimal Classification—to automate the cleaning and predictive analysis of high-dimensional datasets. The software architecture was designed to perform sophisticated data preprocessing and model selection using a combination of Pandas, NumPy, and Scikit-learn.

Advanced Missing Value Estimation

Data imputation was approached through two distinct methodologies to compare the impact of local vs. global mean substitution.

  • Basic Mean Imputation: Implemented a global strategy using df.fillna(df.mean()) to replace null values with the column-wide average.
  • K-Means Cluster Imputation: Developed a more localized approach by implementing a custom K-means algorithm to group similar data points.
    • Centroid Optimization: Built functions to initialize (pickKCenters) and iteratively update (pickNewCenters) centroids based on Euclidean distance.
    • Group-Based Filling: For rows with missing data, the system identifies the closest cluster and replaces null values with the mean specific to that cluster, preserving local data variance.
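The cluster-based filling described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the snake_case helpers mirror the pickKCenters and pickNewCenters functions named above, and the empty-cluster and convergence details are simplified assumptions.

```python
import numpy as np

def pick_k_centers(data, k, rng):
    # Initialize centroids by sampling k distinct complete rows.
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx].copy()

def pick_new_centers(data, labels, k):
    # Recompute each centroid as the mean of the rows assigned to it.
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])

def assign(data, centers):
    # Label each row with its nearest centroid by Euclidean distance.
    dists = np.sqrt(np.power(data[:, None, :] - centers[None, :, :], 2).sum(axis=2))
    return dists.argmin(axis=1)

def kmeans_impute(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)
    complete = X[~mask.any(axis=1)]          # fit clusters on fully observed rows
    centers = pick_k_centers(complete, k, rng)
    for _ in range(iters):
        labels = assign(complete, centers)
        centers = pick_new_centers(complete, labels, k)
    # For incomplete rows: locate the closest cluster using observed columns
    # only, then fill each NaN with that cluster's mean for the column.
    filled = X.copy()
    for i in np.where(mask.any(axis=1))[0]:
        obs = ~mask[i]
        d = np.sqrt(np.power(centers[:, obs] - X[i, obs], 2).sum(axis=1))
        filled[i, ~obs] = centers[d.argmin(), ~obs]
    return filled

# Two tight clusters; the NaN is filled with its nearest cluster's column mean.
X = np.array([[1.0, 2.0], [1.1, 2.1], [9.0, 10.0], [9.1, np.nan]])
filled = kmeans_impute(X, k=2)
```

Because the fill value comes from the nearest cluster rather than the whole column, local variance is preserved: the missing entry in the last row is filled from the high-valued cluster, not the global mean.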

Feature Engineering and Preprocessing

Before classification, raw data undergoes a rigorous cleaning and transformation phase:

  • Outlier Handling: Implemented logic to detect and replace extreme noise values (e.g., 1.0e+99) with NaN to prevent skewing the statistical mean.
  • Data Sanitization: Utilized df.dropna() for validation sets and ensured consistent data types across the dataframe to support mathematical operations.
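A minimal sketch of this cleaning phase, using a toy DataFrame of my own invention: the sentinel 1.0e+99 is demoted to NaN before any statistics are computed, after which mean imputation and row dropping proceed as described.

```python
import numpy as np
import pandas as pd

# Toy frame with one extreme sentinel value and one genuine gap.
df = pd.DataFrame({"a": [1.0, 2.0, 1.0e99], "b": [4.0, np.nan, 6.0]})

# Outlier handling: treat the sentinel as missing so it cannot skew the mean.
df = df.replace(1.0e99, np.nan)

# Basic mean imputation: fill remaining gaps with the column-wide average.
imputed = df.fillna(df.mean())

# Validation sets instead drop incomplete rows entirely.
validation = df.dropna()
```

Order matters here: replacing the sentinel first means df.mean() is computed over clean values only, so the imputed entries are not dragged toward 1.0e+99.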

Automated Classification and Model Selection

The final stage of the pipeline automates the selection of the most accurate machine learning model for a given dataset.

  • Model Testing: The system concurrently trains and evaluates two distinct approaches:
    • Random Forest Classifier: A non-linear ensemble method utilizing multiple decision trees (RandomForestClassifier).
    • Linear Regression: A standard linear approach to gauge baseline classification performance.
  • Optimization Logic: Using cross_validation.train_test_split, the tool measures the accuracy scores of both models. The testClassifiers function programmatically selects the "Optimal Classifier" (optClassifier) based on which model achieved the highest accuracy for that specific data distribution.

Outcomes

  • Streamlined Output: Developed a writePredictions utility that deploys the selected model and converts its results to lists for efficient file output.
  • Vector Math from Scratch: By implementing distance-based clustering manually (np.sqrt(np.power(...).sum())), the project demonstrates a working command of the vector math behind custom machine learning implementations, beyond standard library calls.
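For reference, the distance pattern cited above expands to an ordinary Euclidean norm; a one-function sketch (the euclidean name is mine, not the project's):

```python
import numpy as np

def euclidean(p, q):
    # sqrt of the sum of squared component differences,
    # matching the np.sqrt(np.power(...).sum()) pattern.
    return np.sqrt(np.power(np.asarray(p) - np.asarray(q), 2).sum())
```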
