Colorectal Cancer Drug Activity Prediction Server

About Model


Dataset

For the development of the QSAR models, we extracted drug data from the GDSC website and merged it with descriptors generated by the PaDEL software [1]. This produced 12 files, one per cell line, each containing the descriptors and the corresponding logIC50 (µM) values.
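The merge step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the column names (`drug_name`, `cell_line`, `logIC50`), the drug names, and all values are hypothetical placeholders, and the helper `build_cell_line_table` is introduced only for this example.

```python
import pandas as pd

# Hypothetical GDSC-style export: one row per (drug, cell line) pair.
gdsc = pd.DataFrame({
    "drug_name": ["DrugA", "DrugB", "DrugC", "DrugA"],
    "cell_line": ["HT-29", "HT-29", "HT-29", "SW620"],
    "logIC50": [1.2, -0.4, 2.7, 0.9],
})

# Hypothetical PaDEL-style output: one row per drug, one column per descriptor.
padel = pd.DataFrame({
    "drug_name": ["DrugA", "DrugB", "DrugD"],
    "nC": [21, 17, 30],
    "JGI10": [0.012, 0.008, 0.015],
})


def build_cell_line_table(gdsc_df, padel_df, cell_line):
    """Return descriptors plus logIC50 for the drugs tested on one cell line."""
    subset = gdsc_df.loc[gdsc_df["cell_line"] == cell_line, ["drug_name", "logIC50"]]
    # Inner join keeps only drugs that have both an activity value and descriptors.
    return padel_df.merge(subset, on="drug_name", how="inner")


ht29 = build_cell_line_table(gdsc, padel, "HT-29")
sw620 = build_cell_line_table(gdsc, padel, "SW620")
print(ht29)
```

Calling the helper once per cell line would yield one table per cell line, mirroring the 12 files described above.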

The dataset for each of the 12 cell lines is provided below and can be downloaded by clicking the respective cell line link.

Cell Line Name Download Link
COLO-678 Click Here
HT-115 Click Here
SW1463 Click Here
COLO-205 Click Here
GP5d Click Here
HT-29 Click Here
KM12 Click Here
SW1417 Click Here
MDST8 Click Here
SK-CO-1 Click Here
CCK-81 Click Here
SW620 Click Here

Descriptors

Descriptors were first reduced using the WEKA tool [2,3]; F-stepping was then applied using Sequential Feature Selection from the Python mlxtend library [4].
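A forward selection step of this kind can be sketched as below. Two assumptions are made: that "F-stepping" refers to forward stepwise selection, and that scikit-learn's `SequentialFeatureSelector` is an acceptable stand-in for the mlxtend class named above (it is used here only to keep the example dependency-light). The data are synthetic, and the estimator, scoring, and number of features are illustrative choices, not the settings used for the server.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic descriptor matrix: 120 "molecules", 8 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
# Only columns 1 and 5 drive the response; the rest are noise.
y = 3.0 * X[:, 1] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=120)

# Forward selection: start from no features and add, one at a time,
# the feature that most improves the cross-validated R2.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X, y)

# Indices of the columns kept by the selector.
selected = np.flatnonzero(selector.get_support())
print(selected)
```

On real descriptor data the number of features to keep would normally be tuned rather than fixed in advance.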

After F-stepping, 429 descriptors remained across the 12 cell lines. These comprised 1D, 2D, and 3D descriptors as well as binary fingerprints. The descriptors encode information about the chemistry of the molecules. We performed descriptor analyses to understand the role of these descriptors in drug activity.

Some descriptors commonly found across the 12 cell lines are listed below:

Descriptor Java Class / Description
KRFP314 KlekotaRothFingerprinter
KRFPC314 KlekotaRothFingerprintCount
FP3 Fingerprinter
APC2D9_O_I AtomPairs2DFingerprintCount
GraphFP252 GraphOnlyFingerprinter
JGI10 Mean topological charge index of order 10
KRFP3683 KlekotaRothFingerprintCount
KRFP803 KlekotaRothFingerprintCount
nC Number of carbon atoms

Algorithm

We developed QSAR models for the 12 cell lines using AI/ML algorithms. We compared the following algorithms to select the best QSAR model:

1. Support Vector Machine (SVM)
2. Multilayer Perceptron (MLP)
3. Random Forest Regressor (RFR)
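A comparison of the three algorithms listed above might look like the following sketch. It assumes scikit-learn implementations (`SVR`, `MLPRegressor`, `RandomForestRegressor`), 5-fold cross-validation, and synthetic data; the hyperparameters are arbitrary illustrative values and are not those used to build the server's models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a descriptor matrix and logIC50 values.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 10))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# SVM and MLP are scale-sensitive, so they are wrapped with a scaler.
models = {
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean cross-validated R2 per algorithm.
scores = {
    name: float(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
    for name, model in models.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean R2 = {score:.3f}")
print("best:", best)
```

Which algorithm comes out ahead depends on the data and the tuning, so the ranking on this synthetic set says nothing about the ranking on the cell line datasets.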

The performance measures used for analyzing the models were Pearson's correlation coefficient (R), coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE).
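For reference, these five measures can be computed from observed and predicted values as in the sketch below. The function name `regression_metrics` and the sample values are hypothetical; the formulas are the standard definitions, with R² taken as 1 minus the ratio of residual to total sum of squares.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R, R2, MSE, MAE and RMSE for paired observed/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(mse))

    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # Pearson correlation between observed and predicted values.
    r = float(np.corrcoef(y_true, y_pred)[0, 1])

    return {"R": r, "R2": r2, "MSE": mse, "MAE": mae, "RMSE": rmse}


# Hypothetical observed and predicted logIC50 values.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```

Note that R² defined this way is not simply the square of Pearson's R unless the predictions come from a least-squares fit evaluated on its own training data.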

Using SVM, the coefficient of determination was at least 0.6 for all 12 cell lines. Of the three algorithms above, SVM therefore showed the best performance across the 12 cell lines. The final table with the performance measures can be downloaded from here.

ABOUT SVM:

Support Vector Machine is a machine learning algorithm that can be used for classification and regression tasks. For classification, it finds the optimal hyperplane that separates the data into different classes, using a kernel function to map the data implicitly into a higher-dimensional feature space. For regression, as in the QSAR models here, the same kernel approach is used to fit a function to continuous values such as logIC50. SVM is particularly useful when the data are not linearly separable, and it tends to be less prone to overfitting than many other algorithms.
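The effect of the kernel can be illustrated with a small regression sketch. It assumes scikit-learn's `SVR` and a synthetic one-dimensional sine-shaped target; the values of `C`, `gamma`, and `epsilon` are arbitrary illustrative choices, and the scores are training-set R² values meant only to contrast the two kernels.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic non-linear relationship between one feature and the target.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=80)

# A linear kernel can only fit a straight line through the data.
linear_svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space, which lets the model follow the curved relationship.
rbf_svr = SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.1).fit(X, y)

# Training-set R2 for each kernel.
r2_linear = linear_svr.score(X, y)
r2_rbf = rbf_svr.score(X, y)
print(f"linear kernel R2: {r2_linear:.3f}")
print(f"RBF kernel R2:    {r2_rbf:.3f}")
```

In practice the kernel and its parameters would be chosen by cross-validation rather than by training-set fit.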

References

1. Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem 2011; 32:1466–1474.
2. Frank E, Hall MA, Witten IH. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". 2016.
3. Hall M, Frank E, Holmes G, et al. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009; 11:10–18.
4. Raschka S. MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack. J Open Source Softw 2018; 3(24).