This project applies statistical learning techniques and neural networks to classify breast cancer malignancy from cell nuclei measurements. By comparing traditional classification methods with deep learning approaches, the goal was to improve diagnostic accuracy and reduce human error in cytological inspection.
Challenge: Feature selection and model architecture optimization.
Solution:
Languages/Technologies: Python, TensorFlow, Scikit-Learn
Libraries: Pandas, Seaborn, Matplotlib
Modeling: Neural Networks, Random Forest, KNN, SVM
Other Tools: Grid Search, Feature Engineering
Dataset: Breast Cancer Wisconsin Diagnostic dataset from Kaggle (569 samples).
Features: Ten cell nuclei measurements for each sample (radius, texture, perimeter, area, etc.), each recorded as a mean, a standard error, and a "worst" (largest) value.
Feature Removal: Five features with correlations below 40% were removed, resulting in 33 features for the final model.
Data Preprocessing: Removed low-correlation features, applied MinMax scaling, and created squared and log-transformed versions of the data.
Models Compared: KNN, SVM (linear and RBF kernels), Random Forest, Decision Trees, and Neural Networks, among others.
Neural Network Architecture: 4 hidden layers, 33-24-36 neurons, ReLU activations, final layer with sigmoid activation.
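The preprocessing and model comparison steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the copy of the Wisconsin Diagnostic data bundled with scikit-learn (`load_breast_cancer`) rather than the Kaggle CSV, reads "correlation below 40%" as an absolute Pearson correlation with the diagnosis label below 0.40, and uses `log1p` for the log transform so that zero-valued measurements are handled. The names `CORR_THRESHOLD`, `variants`, and `cv_accuracy`, and all model settings, are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# scikit-learn's bundled copy of the dataset: 569 samples, 30 numeric features.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # in this copy: 0 = malignant, 1 = benign

# Split first so that feature selection only sees the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed reading of "correlation below 40%": |Pearson r| with the label < 0.40.
CORR_THRESHOLD = 0.40
label_corr = X_train.corrwith(y_train).abs()
kept = label_corr[label_corr >= CORR_THRESHOLD].index.tolist()

# Raw, squared, and log-transformed versions of the retained features.
# log1p is an assumption; it avoids log(0) for measurements that are zero.
variants = {
    "raw": lambda df: df,
    "squared": np.square,
    "log1p": np.log1p,
}

# Illustrative default settings, not the tuned values from the project.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy; MinMax scaling is fit inside each fold.
cv_accuracy = {}
for variant_name, transform in variants.items():
    X_variant = transform(X_train[kept])
    for model_name, model in models.items():
        pipeline = make_pipeline(MinMaxScaler(), model)
        scores = cross_val_score(pipeline, X_variant, y_train, cv=5)
        cv_accuracy[(variant_name, model_name)] = scores.mean()

for (variant_name, model_name), acc in sorted(
    cv_accuracy.items(), key=lambda item: -item[1]
):
    print(f"{variant_name:8s} {model_name:14s} {acc:.3f}")
```

Placing `MinMaxScaler` inside the pipeline keeps the scaling statistics from leaking across cross-validation folds, and computing the correlations on the training split alone does the same for feature selection. The held-out `X_test` and `y_test` are left untouched here so they can serve for a final check of whichever model is chosen.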
The deep neural network model achieved high accuracy, but classical methods like Random Forests and KNN performed comparably. Concave points and area-related features were the most significant factors in diagnosing malignancy. Further hyperparameter tuning could offer marginal improvements, but the current model architecture appears to be close to optimal for this dataset.
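One way to check which measurements drive the prediction is to rank features by Random Forest importance. The sketch below is an illustration under assumptions, not the analysis used in the project: it again relies on scikit-learn's bundled copy of the dataset, fits on all 569 samples, and uses impurity-based importances with hypothetical settings (`n_estimators=300`, `random_state=0`).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Bundled copy of the Wisconsin Diagnostic data (569 samples, 30 features).
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Illustrative settings; the fixed seed makes the ranking reproducible.
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances.head(10).round(3))
```

Impurity-based importances tend to split credit among strongly correlated features such as radius, perimeter, and area, so the exact ordering should be read loosely. `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.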
The deep neural network outperformed the traditional models only narrowly, offering a balanced trade-off between complexity and diagnostic accuracy. Future work could involve exploring more complex feature engineering, outlier detection, and advanced hyperparameter tuning.