Leukemia Gene Expression Classification

Leukemia describes a group of blood disorders characterized by the dysfunctional production and development of leukocytes (white blood cells).

In this notebook, two different models are built to solve a multiclass classification problem on a leukemia gene expression dataset from the Curated Microarray Database (CuMiDa).

A common challenge with gene expression arrays is that they violate the usual assumption that the number of samples exceeds the number of predictors; here the opposite holds (p ≫ n). Therefore, a Naive Bayes classifier and a Support Vector Machine will be used as model algorithms, since both tend to perform well on such data. The dimensionality of the original dataset will be reduced by decomposing it into principal components and keeping only those that explain 90% of the variance; this reduced version of the data will be fed to the Support Vector Classifier.
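
As a side note, scikit-learn can express this recipe compactly: PCA accepts a fractional n_components, keeping just enough components to reach that share of the variance. A minimal sketch of the idea (not the notebook's exact code, which selects the component count manually further below):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# PCA(n_components=0.90) keeps the smallest number of components
# whose cumulative explained variance reaches 90%
pca_svc = make_pipeline(PCA(n_components=0.90), SVC())
# pca_svc.fit(X_train, y_train); pca_svc.score(X_test, y_test)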

Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score

from sklearn.decomposition import PCA

from sklearn.metrics import accuracy_score, confusion_matrix

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Importing and Preprocessing the Data

data = pd.read_csv('./Data/Leukemia_GSE9476.csv')
print(data.shape)
data.head()
(64, 22285)
   samples              type  1007_s_at   1053_at     117_at    121_at  ...  AFFX-TrpnX-3_at  AFFX-TrpnX-5_at  AFFX-TrpnX-M_at
0        1  Bone_Marrow_CD34   7.745245  7.811210   6.477916  8.841506  ...         3.656693         4.160622         4.139249
1       12  Bone_Marrow_CD34   8.087252  7.240673   8.584648  8.983571  ...         3.792535         4.204414         4.122700
2       13  Bone_Marrow_CD34   7.792056  7.549368  11.053504  8.909703  ...         3.635715         4.174199         4.067152
3       14  Bone_Marrow_CD34   7.767265  7.094460  11.816433  8.994654  ...         3.704587         4.149938         3.910150
4       15  Bone_Marrow_CD34   8.010117  7.405281   6.656049  9.050682  ...         3.629853         4.127513         4.004316

5 rows × 22285 columns

data['type'].value_counts()
AML                 26
Bone_Marrow         10
PB                  10
PBSC_CD34           10
Bone_Marrow_CD34     8
Name: type, dtype: int64
# Replacing labels with numeric values
data["type"] = data["type"].replace({"AML":0, "Bone_Marrow":1, "PB":2, "PBSC_CD34":3, "Bone_Marrow_CD34":4})
# Storing label names in a list
labels = ["AML", "Bone_Marrow", "PB", "PBSC_CD34", "Bone_Marrow_CD34"]
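
An equivalent encoding can be produced without writing the mapping by hand, e.g. with scikit-learn's LabelEncoder (shown only as an alternative; note that it assigns codes alphabetically, so the numeric values would differ from the mapping above):

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Integer codes in alphabetical order; label names are kept in encoder.classes_
codes = encoder.fit_transform(data["type"])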

Train-Test Split

X = data.drop(["type"], axis=1)
y = data["type"]
  
# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=200, test_size=0.30, shuffle=True)
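
With only 64 samples and imbalanced classes, a random split can leave a rare class underrepresented in the test set. Passing stratify=y preserves the class proportions in both splits; a possible variation (not what this notebook uses):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=200, test_size=0.30, shuffle=True, stratify=y)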

Principal Component Analysis

pca = PCA()
pca.fit_transform(X_train)
array([[ 7.42136611e+01, -2.18219666e+01, -1.70596309e+01, ...,
         1.69833896e+00,  1.65937536e+00,  1.27173212e-13],
       [ 2.20025774e+01,  4.37665430e+01, -1.27529875e+01, ...,
         8.35724135e-02, -9.58648421e-01,  1.27173212e-13],
       [-8.31658114e+00, -2.42812693e+01, -2.09776274e+01, ...,
        -3.89558842e-01, -8.52854616e-03,  1.27173212e-13],
       ...,
       [ 2.36180585e+01,  3.74960783e+01, -2.36891534e+01, ...,
        -4.22934933e+00,  1.24303529e+01,  1.27173212e-13],
       [-1.60907152e+01, -1.68696788e+01,  1.83121977e+01, ...,
         2.19669798e-01, -1.57023579e-01,  1.27173212e-13],
       [ 4.35651139e+01,  1.13885969e+01,  5.10196868e+01, ...,
        -7.98296444e-02,  2.49958631e-01,  1.27173212e-13]])
total = sum(pca.explained_variance_)
k = 0
current_variance = 0

while current_variance/total < 0.90:
    current_variance += pca.explained_variance_[k]
    k = k + 1

print(k, " features explain around 90% of the variance.", sep='')

# Refit PCA keeping only the first k components
pca = PCA(n_components=k)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Cumulative explained variance (%) of the kept components
var_exp = pca.explained_variance_ratio_.cumsum()
var_exp = var_exp*100
plt.bar(range(k), var_exp);
19 features explain around 90% of the variance.

Figure 1. Cumulative % Explained Variance Sums
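
The while loop above can also be written as a one-liner over the cumulative explained-variance ratio (a sketch, equivalent under the assumption that it runs on the initial full-rank fit, before pca is re-created with n_components=k):

import numpy as np

# Index of the first component at which cumulative explained variance reaches 90%
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.90)) + 1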

pca_2 = PCA(n_components=2).fit(X_train)
X_train_reduced = pca_2.transform(X_train)
fig = plt.figure(1, figsize = (10, 6))
plt.scatter(X_train_reduced[:, 0],  X_train_reduced[:, 1], c = y_train, cmap = plt.cm.Paired, linewidths=10)
<matplotlib.collections.PathCollection at 0x1ebb669ba00>

Figure 2. Plot of the 2D Reduced Version

Clusters can be seen clearly when the data is reduced to 2D.

Modeling

Baseline

First, as a baseline, we can check the accuracy of simply predicting the dominant class, which is acute myeloid leukemia (AML) in this case, for every test observation.

base_acc = len(y_test[y_test == 0])/len(y_test)*100
print("Baseline Accuracy ", round(base_acc, 3), "%", sep = '')
Baseline Accuracy 45.0%
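
scikit-learn also packages this majority-class baseline as DummyClassifier, which keeps the baseline bookkeeping consistent when comparing several models (a sketch, assuming the split above):

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print("Baseline Accuracy ", round(dummy.score(X_test, y_test)*100, 3), "%", sep='')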

Naive Bayes Classifier

# Model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Prediction
nb_pred = nb_model.predict(X_test)

# Accuracy
nb_acc = accuracy_score(y_test, nb_pred)*100
print('Naive Bayes Accuracy: ', round(nb_acc, 3), "%", sep="")

# Confusion Matrix
cm_nb = confusion_matrix(y_test, nb_pred)

# Plot
ax = plt.subplot()
sns.heatmap(cm_nb, annot=True, ax=ax, fmt='g', cmap='Greens') 

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.xaxis.set_ticklabels(labels, rotation=90) 
ax.yaxis.set_ticklabels(labels, rotation=360);
Naive Bayes Accuracy: 95.0%

Figure 3. Confusion Matrix of Naive Bayes

Support Vector Machines

# Parameter grid
svm_param_grid = {  'C': [0.1, 1, 10, 100],
                    "decision_function_shape" : ["ovo", "ovr"],
                    'gamma': [1, 0.1, 0.01, 0.001, 0.00001, 10],
                    "kernel": ["linear", "rbf", "poly"]} 

# Creating the SVM grid search classifier
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=3)

# Training the classifier
svm_grid.fit(X_train_pca, y_train)

print("Best Parameters:\n", svm_grid.best_params_)

# Selecting the best SVC
best_svc = svm_grid.best_estimator_

# Predictions
svm_pred = best_svc.predict(X_test_pca)

# Accuracy
svc_acc = accuracy_score(y_test, svm_pred)*100
print('SVM Accuracy: ', round(svc_acc, 3), "%", sep="")

cm_svm = confusion_matrix(y_test, svm_pred)

ax = plt.subplot()
sns.heatmap(cm_svm, annot=True, ax = ax, fmt='g', cmap='Greens') 

# Labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.xaxis.set_ticklabels(labels, rotation=90) 
ax.yaxis.set_ticklabels(labels, rotation=360);
Best Parameters:
 {'C': 10, 'decision_function_shape': 'ovo', 'gamma': 1e-05, 'kernel': 'rbf'}
SVM Accuracy: 100.0%

Figure 4. Confusion Matrix of Support Vector Classifier

As can be seen from the confusion matrices, both models perform well. If a larger part of the dataset (e.g., 75%) were used as training data, Naive Bayes would also reach perfect classification.
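
Given the small test set, a cross-validated estimate over all 64 samples is a more stable way to compare the two models; a sketch using the cross_val_score import from the Libraries section (the PCA step is wrapped in a Pipeline so it is refit within each fold, avoiding leakage):

from sklearn.pipeline import make_pipeline

nb_scores = cross_val_score(GaussianNB(), X, y, cv=5)
svc_scores = cross_val_score(
    make_pipeline(PCA(n_components=19), SVC(**svm_grid.best_params_)), X, y, cv=5)
print("NB CV Accuracy:  ", round(nb_scores.mean()*100, 3), "%", sep="")
print("SVC CV Accuracy: ", round(svc_scores.mean()*100, 3), "%", sep="")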

Note that the models might not perform this well on different datasets, or might require further configuration, since the dataset used in this notebook is exceptionally clean.

Inspired by Gene Expression Classification
GitHub Repository

This post is licensed under CC BY 4.0 by the author.