
The full paper on this project may be read at ResearchGate or Academia.edu.

The data visualization scripts and the dataset were taken from Nick Brooks’ Kaggle report on Women’s Clothing E-Commerce Reviews.

Abstract

Understanding customer sentiment is of paramount importance in marketing strategies today. Not only does it give companies an insight into how customers perceive their products and/or services, it also gives them an idea of how to improve their offers. This paper attempts to understand the correlation of different variables in customer reviews of a women's clothing e-commerce platform, and to classify each review according to whether it recommends the reviewed product and whether it carries positive, negative, or neutral sentiment. To achieve these goals, we employed univariate and multivariate analyses on the dataset features (excluding review titles and review texts), and we implemented a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units for recommendation and sentiment classification. Results show that a recommendation is a strong indicator of a positive sentiment score, and vice versa. On the other hand, ratings in product reviews are fuzzy indicators of sentiment scores. We also found that the bidirectional LSTM reached an F1-score of 0.88 for recommendation classification and 0.93 for sentiment classification.
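
As a rough illustration of the classifier described above, below is a minimal sketch of a bidirectional LSTM for the binary recommendation task, written with Keras; the vocabulary size, embedding dimensionality, and unit count are illustrative assumptions rather than the hyperparameters used in the paper.

# Minimal sketch of a bidirectional LSTM text classifier (Keras).
# Vocabulary size, embedding dimensionality, and LSTM units are
# illustrative assumptions, not the paper's hyperparameters.
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
EMBED_DIM = 128      # assumed embedding dimensionality

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # recommended vs. not recommended
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) would then be called on the padded, integer-encoded review texts.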

Citation

To cite the repository/software, kindly use the following BibTeX entry:

@misc{abien_fred_agarap_2018_1188376,
  author       = {Abien Fred Agarap},
  title        = {AFAgarap/ecommerce-reviews-analysis: v0.1.0-alpha},
  month        = mar,
  year         = 2018,
  doi          = {10.5281/zenodo.1188376},
  url          = {https://doi.org/10.5281/zenodo.1188376}
}

Usage

If Jupyter Notebook or JupyterLab is not installed, run the following commands:

# installing jupyter notebook
pip3 install jupyter
# installing jupyter lab
pip3 install jupyterlab

Note that only one of the packages above is needed; installing both is not a requirement.
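
Once installed, the chosen interface can be launched from the repository directory:

# launching jupyter notebook
jupyter notebook
# launching jupyter lab
jupyter lab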

The notebook for the data visualizations and text classification in this paper is available here.

Results

All experiments in this study were conducted on a laptop computer with an Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 960M 4GB DDR5 GPU. The review texts and labels (recommendation indicator) in the dataset were partitioned as follows: 60% for the training set, 20% for the validation set, and 20% for the test set. The sentiment label for each review text was tagged using NLTK’s sentiment analyzer.
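
For reference, the partitioning and sentiment tagging can be reproduced roughly as follows. This is a minimal sketch assuming NLTK's VADER analyzer and scikit-learn's train_test_split; the file and column names follow the Kaggle dataset, and the thresholds are the conventional VADER cutoffs, which may differ from those used in the paper.

# Minimal sketch: 60/20/20 partitioning and VADER-based sentiment tagging.
# File/column names follow the Kaggle dataset; thresholds are the
# conventional VADER cutoffs and may differ from those used in the paper.
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split

nltk.download("vader_lexicon")
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv").dropna(subset=["Review Text"])

# Tag each review as positive, negative, or neutral using VADER's compound score.
analyzer = SentimentIntensityAnalyzer()
def tag_sentiment(text):
    score = analyzer.polarity_scores(text)["compound"]
    return "positive" if score >= 0.05 else "negative" if score <= -0.05 else "neutral"

df["sentiment"] = df["Review Text"].apply(tag_sentiment)

# 60% training, 20% validation, 20% testing.
train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)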

Task                          | Test Accuracy | Test Loss
Recommendation classification | ~88.2678%     | ~0.572342
Sentiment classification      | ~92.8414%     | ~0.453205

Table 1. Test Accuracy and Test Loss using Bidirectional RNN with LSTM.

Class           | Precision | Recall | F1-Score | Support
Not recommended | 0.70      | 0.65   | 0.68     | 847
Recommended     | 0.92      | 0.94   | 0.93     | 3679
Average/Total   | 0.88      | 0.88   | 0.88     | 4526

Table 2. Statistical Report on Recommendation Classification using Bidirectional LSTM.

Class         | Precision | Recall | F1-Score | Support
Negative      | 0.47      | 0.50   | 0.49     | 289
Neutral       | 0.31      | 0.18   | 0.23     | 22
Positive      | 0.96      | 0.96   | 0.96     | 4215
Average/Total | 0.93      | 0.93   | 0.93     | 4526

Table 3. Statistical Report on Sentiment Classification using Bidirectional LSTM.

The dataset had an imbalanced class distribution for the recommendation indicator and, through the NLTK sentiment analyzer, for the review sentiment as well. Noticeably, the model had relatively better predictive performance on the classes with the highest frequency, i.e., recommended and positive sentiment.

License

Copyright 2018 Abien Fred Agarap

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


The full paper may be read at arXiv.org, Academia.edu, and ResearchGate.

Abstract

Effective and efficient mitigation of malware is a long-standing endeavor in the information security community. The development of an anti-malware system that can counteract previously-unknown malware is a prolific activity that may benefit several sectors. We envision an intelligent anti-malware system that utilizes the power of deep learning (DL) models. Using such models would enable the detection of newly-released malware through mathematical generalization, that is, finding the relationship between a given malware x and its corresponding malware family y, f : x → y. To accomplish this feat, we used the Malimg dataset[12], which consists of malware images processed from malware binaries, and then trained the following DL models to classify each malware family: CNN-SVM[16], GRU-SVM[3], and MLP-SVM. Empirical evidence has shown that the GRU-SVM stands out among the DL models with a predictive accuracy of ≈84.92%. This stands to reason, as the said model had the most sophisticated architecture design among the models presented. The exploration of an even more optimal DL-SVM model is the next stage towards the engineering of an intelligent anti-malware system.
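
For context, the images in the Malimg dataset are obtained by reading a malware binary as a sequence of 8-bit unsigned integers and reshaping it into a two-dimensional grayscale image. A minimal sketch of that general idea is shown below; the width choice and file names are illustrative assumptions.

# Minimal sketch of turning a binary file into a grayscale image,
# following the general Malimg idea (each byte becomes one pixel).
# The width and the file names are illustrative assumptions.
import numpy as np
from PIL import Image

def binary_to_image(path, width=256):
    data = np.fromfile(path, dtype=np.uint8)              # raw bytes as 8-bit integers
    height = len(data) // width
    data = data[: height * width].reshape(height, width)  # drop the ragged tail
    return Image.fromarray(data, mode="L")                # 8-bit grayscale image

# e.g., binary_to_image("sample_malware.bin").save("sample_malware.png")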

Usage

First, clone the project.

git clone https://github.com/AFAgarap/malware-classification.git

Run setup.sh to ensure that the prerequisite libraries are installed in the environment.

$ sudo chmod +x setup.sh
$ ./setup.sh

Run main.py with the following parameters.

usage: main.py [-h] -m MODEL -d DATASET -n NUM_EPOCHS -c PENALTY_PARAMETER -k
               CHECKPOINT_PATH -l LOG_PATH -r RESULT_PATH

Deep Learning Using Support Vector Machine for Malware Classification

optional arguments:
  -h, --help            show this help message and exit

Arguments:
  -m MODEL, --model MODEL
                        [1] CNN-SVM, [2] GRU-SVM, [3] MLP-SVM
  -d DATASET, --dataset DATASET
                        the dataset to be used
  -n NUM_EPOCHS, --num_epochs NUM_EPOCHS
                        number of epochs
  -c PENALTY_PARAMETER, --penalty_parameter PENALTY_PARAMETER
                        the SVM C penalty parameter
  -k CHECKPOINT_PATH, --checkpoint_path CHECKPOINT_PATH
                        path where to save the trained model
  -l LOG_PATH, --log_path LOG_PATH
                        path where to save the TensorBoard logs
  -r RESULT_PATH, --result_path RESULT_PATH
                        path where to save actual and predicted labels array

For instance, use the CNN-SVM model.

$ cd malware-classification
$ python3 main.py --model 1 --dataset ./dataset/malimg.npz --num_epochs 100 --penalty_parameter 10 --checkpoint_path ./checkpoint/ --log_path ./logs/ --result_path ./results/
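
The TensorBoard logs written to --log_path can then be inspected with, e.g.:

$ tensorboard --logdir ./logs/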

To run a trained model, run classifier.py with the following parameters.

usage: classifier.py [-h] -m MODEL -t MODEL_PATH -d DATASET

Deep Learning Using Support Vector Machine for Malware Classification

optional arguments:
  -h, --help            show this help message and exit

Arguments:
  -m MODEL, --model MODEL
                        [1] CNN-SVM, [2] GRU-SVM, [3] MLP-SVM
  -t MODEL_PATH, --model_path MODEL_PATH
                        path where the trained model was saved
  -d DATASET, --dataset DATASET
                        the dataset to be classified

For instance, use a trained CNN-SVM model.

$ python3 classifier.py --model 1 --model_path ./trained-cnn-svm/ --dataset ./dataset/malimg.npz
Loaded trained model from trained-cnn-svm/CNN-SVM-2400
Predictions : [ 1. -1. -1. ..., -1. -1.  1.]
Accuracies : [ 0.99609375  0.94140625  0.94921875  0.984375    0.95703125  0.9296875
  0.9296875   0.9609375   0.9296875   0.94921875  0.953125    0.92578125
  0.89453125  0.8203125   0.8125      0.75390625  0.8203125   0.84375
  0.8515625   0.94140625  0.7421875   0.94140625  0.984375    0.9921875   1.
  0.99609375  0.9765625   0.9609375   0.81640625  0.98828125  0.7890625
  0.8828125   0.94921875  0.96875     1.          1.        ]
Average accuracy : 0.9203559027777778

Results

The experiments were conducted on a laptop computer with an Intel Core(TM) i5-6300HQ CPU @ 2.30GHz x 4, 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 960M 4GB DDR5 GPU. Table 1 shows the hyperparameters used in the study.

Table 1. Hyperparameters used in the DL-SVM models.

Hyperparameters      | CNN-SVM | GRU-SVM                   | MLP-SVM
Batch Size           | 256     | 256                       | 256
Cell Size            | N/A     | [256, 256, 256, 256, 256] | [512, 256, 128]
No. of Hidden Layers | 2       | 5                         | 3
Dropout Rate         | 0.85    | 0.85                      | None
Epochs               | 100     | 100                       | 100
Learning Rate        | 1e-3    | 1e-3                      | 1e-3
SVM C                | 10      | 10                        | 0.5


Figure 1. Training accuracy of the DL-SVM models on malware classification using the Malimg dataset (plotted using matplotlib).

Figure 1 summarizes the training accuracy of the DL-SVM models for 100 epochs (equivalent to 2500 steps, since 6400 × 100 ÷ 256 = 2500). First, the CNN-SVM model accomplished its training in 3 minutes and 41 seconds with an average training accuracy of 80.96875%. Meanwhile, the GRU-SVM model accomplished its training in 11 minutes and 32 seconds with an average training accuracy of 90.9375%. Lastly, the MLP-SVM model accomplished its training in 12 seconds with an average training accuracy of 99.5768229%.
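
The “SVM” in each model name refers to the output layer: rather than softmax with cross-entropy, the final layer is trained as a linear SVM with the C penalty parameter listed in Table 1. The sketch below illustrates one common form of this loss, the squared hinge (L2-SVM); whether the paper uses the L1 or L2 hinge is an assumption here, and the variable names are placeholders.

# Minimal sketch of a squared hinge (L2-SVM) loss for a linear output layer.
# Whether the paper uses the L1 or L2 hinge is an assumption; `logits`
# stands for W·x + b computed on the penultimate layer's output.
import tensorflow as tf

def l2_svm_loss(labels_pm1, logits, weights, penalty_c=10.0):
    # labels_pm1: one-hot labels mapped to {-1, +1}, shape [batch, num_classes]
    # logits: raw SVM scores, shape [batch, num_classes]
    regularization = 0.5 * tf.reduce_sum(tf.square(weights))
    hinge = tf.reduce_sum(tf.square(tf.nn.relu(1.0 - labels_pm1 * logits)), axis=1)
    return regularization + penalty_c * tf.reduce_mean(hinge)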

Table 2. Summary of experiment results on the DL-SVM models.

Variables   | CNN-SVM     | GRU-SVM    | MLP-SVM
Accuracy    | 77.2265625% | 84.921875% | 80.46875%
Data points | 256000      | 256000     | 256000
Epochs      | 100         | 100        | 100
F1          | 0.79        | 0.85       | 0.81
Precision   | 0.84        | 0.85       | 0.83
Recall      | 0.77        | 0.85       | 0.80

Table 2 summarizes the experiment results on the DL-SVM models on malware classification using the Malimg dataset.

Citation

To cite the paper, kindly use the following BibTeX entry:

@article{agarap2017towards,
  title={Towards Building an Intelligent Anti-Malware System: A Deep Learning Approach using Support Vector Machine (SVM) for Malware Classification},
  author={Agarap, Abien Fred and Pepito, Francis John Hill},
  journal={arXiv preprint arXiv:1801.00318},
  year={2017}
}

To cite the repository/software, kindly use the following BibTeX entry:

@misc{abien_fred_agarap_2017_1134207,
  author       = {Abien Fred Agarap},
  title        = {AFAgarap/malware-classification v0.1-alpha},
  month        = dec,
  year         = 2017,
  doi          = {10.5281/zenodo.1134207},
  url          = {https://doi.org/10.5281/zenodo.1134207}
}

License

Copyright 2017 Abien Fred Agarap

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Co-authored with Francis John Hill Pepito.