Machine Learning Projects

Machine Learning Model Training

Project Background

I have completed courses such as Introduction to Machine Learning and Fundamentals of Data Mining, both of which focused on key machine learning techniques like classification and clustering. Through these courses, I gained hands-on experience with a variety of algorithms, including Naive Bayes, Decision Trees, Artificial Neural Networks (ANNs), K-Nearest Neighbors (KNN), and FP-Growth.

These courses were heavily project-oriented. I worked extensively with real-world datasets to train and evaluate various machine learning models. I analyzed model performance, applied techniques such as bagging and boosting to improve accuracy, and drew meaningful conclusions from the results. Below are some of the projects I have completed.

01

Introduction

Classification and clustering are two fundamental techniques in machine learning used for analyzing and understanding data.

Classification

Classification is a supervised learning approach that involves training models on labeled data to predict predefined categories or classes for new, unseen data. Common examples include email spam detection, sentiment analysis, and disease diagnosis.

Clustering

Clustering, on the other hand, is an unsupervised learning technique used to group similar data points based on their features, without prior knowledge of class labels. It helps discover hidden patterns or structures within data, such as customer segmentation or document grouping.

In this project, both classification and clustering methods were applied to real-world datasets to explore their effectiveness in solving practical data-driven problems.
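The distinction between the two paradigms can be sketched on a small synthetic dataset (toy data for illustration only, not the project's actual datasets):

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy data: 2-D points drawn from three groups.
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# Classification (supervised): the model learns from the labels y.
clf = DecisionTreeClassifier(random_state=42).fit(X, y)
pred_labels = clf.predict(X)

# Clustering (unsupervised): the model sees only the features X.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
cluster_ids = km.labels_

print(pred_labels[:5], cluster_ids[:5])
```

Note that the cluster IDs are arbitrary group numbers: unlike the classifier's predictions, they carry no meaning beyond "these points belong together."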

02

Research

Before developing the models, it was essential to find large, high-quality datasets suitable for both classification and clustering tasks. After conducting some research, I selected the following datasets:

Classification: Census Income Dataset from the UCI Machine Learning Repository — used to predict whether an individual's income exceeds $50K per year based on demographic attributes.

Clustering: Airline Passenger Satisfaction Dataset from Kaggle — used to group passengers based on their travel experiences and satisfaction levels.

These datasets provided diverse features and sufficient data volume, making them well-suited for experimenting with different machine learning models and techniques.
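As a sketch of how the classification target is derived, a few rows in the Census Income ("Adult") schema can be loaded and binarized with pandas (the sample values and the reduced column set below are illustrative; the real file has ~48k rows and 14 attributes):

```python
import io
import pandas as pd

# A handful of made-up rows in the spirit of the Census Income schema.
sample = io.StringIO(
    "age,workclass,education,hours-per-week,income\n"
    "39,State-gov,Bachelors,40,<=50K\n"
    "52,Self-emp,HS-grad,45,>50K\n"
    "28,Private,Masters,50,>50K\n"
)
df = pd.read_csv(sample)

# The prediction target: does income exceed the $50K threshold?
df["over_50k"] = (df["income"] == ">50K").astype(int)
print(df[["age", "income", "over_50k"]])
```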

03

Training the Models

For the classification task, I experimented with various machine learning models on the Census Income dataset. The models included Artificial Neural Networks (ANNs) with one and two hidden layers, Decision Trees based on the Gini index, Information Gain, and Gain Ratio, and Support Vector Machines (SVMs). After evaluating their initial performance, I applied ensemble learning techniques such as bagging and boosting to improve accuracy and robustness. Through this process, I compared the effectiveness of the different algorithms and optimization methods, aiming for the best possible predictive performance on the dataset.
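
A minimal sketch of the bagging and boosting setup, using scikit-learn with Gini- and entropy-based trees as the base learners (stand-in synthetic data and parameter choices, not the project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in data with a binary target, like the income-threshold task.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Bagging: many Gini trees on bootstrap samples, combined by majority vote.
bag = BaggingClassifier(
    DecisionTreeClassifier(criterion="gini"),
    n_estimators=50, random_state=42,
)
# Boosting: shallow entropy ("information gain") trees trained
# sequentially, each reweighting the examples the last one missed.
boost = AdaBoostClassifier(
    DecisionTreeClassifier(criterion="entropy", max_depth=1),
    n_estimators=50, random_state=42,
)

for name, model in [("bagging", bag), ("boosting", boost)]:
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```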

For the clustering task, I worked with the Airline Passenger Satisfaction dataset. Before applying clustering algorithms, I performed data preprocessing to convert categorical variables into a binary format suitable for analysis. I then experimented with several clustering and association techniques, including AGNES (Agglomerative Nesting), Apriori, FP-Growth, and K-Means. These methods were used to uncover hidden patterns and group passengers based on their satisfaction levels and travel behaviors. By comparing the results of different algorithms, I gained insights into how data preprocessing and algorithm selection influence clustering quality and interpretability.
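
The preprocessing and clustering steps above can be sketched as follows; the column names and values are hypothetical stand-ins for the airline dataset's schema, and AGNES corresponds to scikit-learn's bottom-up AgglomerativeClustering:

```python
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical mini-sample in the spirit of the airline data.
df = pd.DataFrame({
    "travel_class":  ["Eco", "Business", "Eco", "Business", "Eco", "Eco"],
    "satisfied":     ["yes", "yes", "no", "yes", "no", "no"],
    "delay_minutes": [5, 0, 45, 10, 60, 30],
})

# One-hot encode the categorical columns into binary indicators.
X = pd.get_dummies(df, columns=["travel_class", "satisfied"]).astype(float)

# K-Means partitions the points around centroids; agglomerative
# clustering (AGNES) merges the closest groups bottom-up.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
agnes = AgglomerativeClustering(n_clusters=2).fit(X)
print(km.labels_, agnes.labels_)
```

In practice the numeric columns would also be scaled before clustering, since a raw feature like delay minutes can otherwise dominate the distance computation.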

04

Software

For this project, I used Jupyter Notebook as the development environment and Python as the main programming language. Several Python libraries were utilized throughout the process: scikit-learn (sklearn) for implementing machine learning algorithms, pandas for data preprocessing and manipulation, and matplotlib and seaborn for data visualization and performance analysis. These tools provided an efficient and flexible workflow for model development, evaluation, and result presentation.

05

Conclusion

Overall, Decision Trees based on Information Gain and the Gini index achieved the highest classification accuracy when combined with bagging, while boosting was more effective for the ANN models. Frequent pattern analysis with Apriori and FP-Growth yielded similar results, with ECLAT's performance depending on dataset characteristics. For clustering, K-Means and DBSCAN produced closely grouped clusters, whereas AGNES generated more distinct groupings.