ColoRecPred-Colorectal Cancer Drug Activity Prediction Server

About Model

Dataset

To develop the QSAR models, we extracted drug data from the GDSC website and merged it with descriptors generated by the PaDEL software [1]. This yielded 12 files, one per cell line, each containing the descriptors and the corresponding logIC50 (µM) values.

The dataset for each of the 12 cell lines is available below; click the respective cell line link to download it.

Cell Line Name	Download Link
COLO-678	Click Here
HT-115	Click Here
SW1463	Click Here
COLO-205	Click Here
GP5d	Click Here
HT-29	Click Here
KM12	Click Here
SW1417	Click Here
MDST8	Click Here
SK-CO-1	Click Here
CCK-81	Click Here
SW620	Click Here
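The merge step described above can be sketched as a join on drug name between an activity table and a descriptor table. This is only an illustration: the column names (`drug_name`, `logIC50`, `Name`), the drug names, and all values are hypothetical and are not taken from the GDSC or PaDEL files.

```python
import csv
import io

# Hypothetical activity table for one cell line (all values are made up).
activity_csv = """drug_name,logIC50
DrugA,1.25
DrugB,-0.40
DrugC,2.10
"""

# Hypothetical descriptor table; only two descriptor columns are shown.
descriptor_csv = """Name,nC,JGI10
DrugA,21,0.012
DrugB,17,0.008
DrugD,30,0.015
"""


def merge_on_drug(activity_text, descriptor_text):
    """Inner-join the two tables on drug name; one output row per shared drug."""
    activity = {
        row["drug_name"]: float(row["logIC50"])
        for row in csv.DictReader(io.StringIO(activity_text))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(descriptor_text)):
        name = row.pop("Name")
        if name in activity:  # keep only drugs present in both tables
            features = {key: float(value) for key, value in row.items()}
            merged.append({"drug": name, **features, "logIC50": activity[name]})
    return merged


rows = merge_on_drug(activity_csv, descriptor_csv)
for r in rows:
    print(r)  # only DrugA and DrugB appear in both tables
```

Drugs that lack either an activity value or a descriptor row are dropped by the inner join, so each merged file contains only complete rows.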

Descriptors

Descriptors were first reduced using the WEKA tool [2,3], and F-stepping was then applied using Sequential Feature Selection from the mlxtend library [4] in Python.

After F-stepping, 429 descriptors remained across the 12 cell lines. These comprised 1D, 2D, and 3D descriptors as well as binary fingerprints. The descriptors encode information about the chemistry of the molecules. We performed descriptor analyses to understand the role of these descriptors in drug activity.
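To illustrate the idea behind sequential feature selection, here is a minimal pure-Python sketch of greedy forward selection. It assumes that "F-stepping" refers to forward selection, and it scores each candidate subset by the training R² of an ordinary least-squares fit. Both are assumptions made for illustration; mlxtend's `SequentialFeatureSelector` wraps a user-chosen estimator and scoring scheme, and the actual settings used for the server are not stated here. The toy data and function names are hypothetical.

```python
def r2_of_least_squares(X, y, cols):
    """Training R^2 of an ordinary least-squares fit on the chosen columns."""
    n = len(y)
    rows = [[1.0] + [X[i][c] for c in cols] for i in range(n)]  # intercept first
    p = len(cols) + 1
    # Augmented normal equations [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system).
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[piv] = a[piv], a[k]
        for r in range(p):
            if r != k:
                f = a[r][k] / a[k][k]
                a[r] = [u - f * v for u, v in zip(a[r], a[k])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def forward_select(X, y, k):
    """Greedily add the column that gives the best score until k are chosen."""
    chosen, remaining = [], list(range(len(X[0])))
    while len(chosen) < k:
        best = max(remaining, key=lambda c: r2_of_least_squares(X, y, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy, standardised descriptor matrix (8 molecules x 4 hypothetical descriptors).
X = [
    [-1, -1, -1, 1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
    [-1, 1, 1, 1],
    [1, -1, -1, 1],
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [1, 1, 1, 1],
]
# By construction the toy activity depends only on columns 0 and 2.
y = [3 * row[0] + row[2] for row in X]

selected = forward_select(X, y, 2)
print(selected)  # [0, 2]
```

In practice the subset score would normally come from cross-validation rather than the training fit, since training R² can only increase as features are added.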

Some commonly found descriptors across the 12 cell lines are listed below:

Descriptor	Java Class / Description
KRFP314	KlekotaRothFingerprinter
KRFPC314	KlekotaRothFingerprintCount
FP3	Fingerprinter
APC2D9_O_I	AtomPairs2DFingerprintCount
GraphFP252	GraphOnlyFingerprinter
JGI10	Mean topological charge index of order 10
KRFP3683	KlekotaRothFingerprintCount
KRFP803	KlekotaRothFingerprintCount
nC	Number of carbon atoms

Algorithm

We developed QSAR models for the 12 cell lines using machine learning algorithms. The following algorithms were compared to select the best QSAR model for each cell line:

1. Support Vector Machine (SVM)
2. Multilayer Perceptron (MLP)
3. Random Forest Regressor (RFR)

The performance measures used to evaluate the models were Pearson's correlation coefficient (R), coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE).
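The five measures can be computed from observed and predicted values as in the sketch below. It assumes the common definition R² = 1 - SS_res/SS_tot; the exact definitions used for the server's performance table are not stated here, and the example values are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return Pearson R, R^2, MSE, MAE and RMSE for paired observations."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error
    rmse = math.sqrt(mse)                     # root mean squared error

    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                # coefficient of determination

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    pearson_r = cov / math.sqrt(ss_tot * ss_pred)

    return {"R": pearson_r, "R2": r2, "MSE": mse, "MAE": mae, "RMSE": rmse}


# Made-up observed and predicted logIC50 values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

MSE, MAE, and RMSE are in the units of the target (or its square), so lower is better, whereas R and R² are unitless and higher is better.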

With SVM, the coefficient of determination (R²) was at least 0.6 for all 12 cell lines. Of the three algorithms above, SVM therefore showed the best performance across the 12 cell lines. The final table with the performance measures can be downloaded from here.

About SVM

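Support Vector Machine (SVM) is a machine learning algorithm that can be used for both classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the widest margin. For regression, as needed when predicting logIC50 values, it fits a function that keeps the training points within a tolerance margin of the prediction wherever possible. A kernel function implicitly maps the data into a higher-dimensional feature space, which makes SVM particularly useful when the relationship in the data is not linear. Because the margin acts as a form of regularization, SVM tends to be less prone to overfitting than unregularized models.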
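The two ingredients that make SVM work for regression, a kernel function and a tolerance margin, can be sketched in a few lines. The RBF kernel, the values of `gamma` and `epsilon`, and the support vectors and coefficients below are all illustrative assumptions; the kernel and hyperparameters of the server's models are not stated here, and this snippet does not train a model.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma is an illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Kernel expansion f(x) = sum_i coef_i * K(sv_i, x) + bias."""
    weighted = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return weighted + bias


# Hypothetical, hand-picked values standing in for a trained model.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [1.0, -0.5]
bias = 0.2

prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias)
print(prediction)
print(epsilon_insensitive_loss(1.0, 1.05))  # inside the margin, so no penalty
print(epsilon_insensitive_loss(1.0, 1.5))   # outside the margin, penalised
```

The prediction depends on the input only through kernel evaluations against the support vectors, which is how the model captures non-linear relationships without computing the high-dimensional mapping explicitly.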

References

1. Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem 2011; 32:1466–1474.
2. Frank E, Hall MA, Witten IH. The WEKA Workbench. Online Appendix for ‘Data Mining: Practical Machine Learning Tools and Techniques’. 2016.
3. Hall M, Frank E, Holmes G, et al. The WEKA data mining software. ACM SIGKDD Explorations Newsletter 2009; 11:10–18.
4. Raschka S. MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack. J Open Source Softw 2018; 3(24).
