Abstract:
Spam is any unsolicited communication sent in bulk. Spam messages are often harmless promotional messages, but some are fraudulent or malicious scams. This project proposes a spam message detection system based on Natural Language Processing methods. Data preprocessing consists of data cleaning, tokenization, and stop-word removal applied to the text dataset. Six machine learning algorithms (K-Nearest Neighbors, Random Forest, Logistic Regression, Naive Bayes, Support Vector Classifier, and Decision Tree) are trained on a dataset of 11,572 English sentences. Exploratory Data Analysis is also applied to the dataset to examine the totals of spam and ham texts. After all the algorithms are tested using Voting and Term Frequency-Inverse Document Frequency, the proposed system achieves its highest accuracy, 96%, with the Multinomial Naive Bayes model. A simple web-based application is developed with the Python pickle library and the Streamlit Python library to test the whole system, and it works properly.
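
The preprocessing and classification steps named in the abstract can be illustrated with a minimal sketch. It is not the project's implementation: it uses only the Python standard library, a tiny hypothetical stop-word list, four made-up training messages, and raw term counts in place of Term Frequency-Inverse Document Frequency weights. The function names `preprocess`, `train`, and `predict` are assumptions chosen for illustration.

```python
import math
import pickle
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your",
              "at", "for", "of", "and", "in", "on"}


def preprocess(text):
    """Data cleaning, tokenization, and stop-word removal."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace only
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def train(texts, labels):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    term_counts = {label: Counter() for label in set(labels)}
    for text, label in zip(texts, labels):
        term_counts[label].update(preprocess(text))
    vocab = set().union(*term_counts.values())
    return {"class_counts": Counter(labels), "term_counts": term_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    tokens = [t for t in preprocess(text) if t in model["vocab"]]
    total_docs = sum(model["class_counts"].values())
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["term_counts"][label]
        denom = sum(counts.values()) + len(model["vocab"])
        score = math.log(n_docs / total_docs)
        score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data, for illustration only.
texts = ["Win a free prize now", "Free cash offer, claim now",
         "Are we meeting at noon?", "See you at lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

model = train(texts, labels)

# Serialize and restore the trained model, as a web front end might do at startup.
restored = pickle.loads(pickle.dumps(model))
print(predict(restored, "Claim your free prize"))
print(predict(restored, "Lunch meeting tomorrow"))
```

In this sketch the model is a plain dictionary of counts, so it can be pickled and reloaded without any custom class being importable on the loading side.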