Critique
Hi Ke Dai,
I am glad to see that you are working on this topic. I have used this kind of diagnosis model before, but applied it in a different field: online advertising. At that time, we built a similar model to predict whether a visitor to a website would click a specific advertisement shown there. We listed several features that we thought might affect the probability of a click, and trained the model on click data collected from the real world. So I understand very well that the problem you describe on this page is a serious one that affects both the accuracy and the efficiency of the model, and I fully agree with your reasoning.
I have several questions for you.
1. For the dermatology test, I noticed a big gap between the two results when C=0.004. Can you explain why? Is it because you do not have enough test instances?
2. I think you should actually test on other diseases as well. I noticed that the heart disease test has only 270 instances, which might not be enough to support the claimed accuracy of your model. Try to gather more instances to make the evidence solid.
3. How did you decide on those values of C? I noticed that you used 0.004 right after 0.005 in the dermatology test but 0.001 right after 0.005 in the heart disease test. Could you add some explanation of that part?
4. Can you add an evaluation section that explicitly shows whether your hypothesis holds?
Hi Dandan,
Thanks for your kind and detailed feedback. Let me explain a little about C. C is the penalty coefficient on error, which represents how tolerant you are of errors: the bigger the value of C, the less tolerant you are. That is to say, a value of C that is too large fits the training data more tightly but may lead to overfitting, while a value that is too small leads to underfitting and poor predictive accuracy. So in practice you have to find an appropriate value of C by trial.
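To make that trial concrete, here is a minimal sketch of such a sweep, assuming an L1-regularized linear SVM in scikit-learn; the breast cancer data set is only a stand-in for the disease data sets used on this page, and the specific C values are illustrative:

# Sweep C for an L1-penalized linear SVM and watch how the number of
# surviving (non-zero-weight) features and the test accuracy change.
import numpy as np
from sklearn.datasets import load_breast_cancer  # placeholder data set
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)   # scale using training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for C in [0.001, 0.004, 0.005, 0.01, 0.1, 1.0]:
    clf = LinearSVC(penalty='l1', dual=False, C=C, max_iter=10000)
    clf.fit(X_train, y_train)
    n_kept = int(np.sum(np.abs(clf.coef_) > 1e-8))  # selected features
    print(f"C={C}: {n_kept} features kept, "
          f"test accuracy={clf.score(X_test, y_test):.3f}")

With a very small C the L1 penalty zeroes out most of the weights, which is exactly the feature-removal effect discussed in my first answer below.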
1. When C is set to 0.004, the tolerance of error increases and some important and necessary features are removed, so the model trained on the reduced training set cannot fit the test set well.
2. From my perspective, the size of a data set is not the most important factor affecting the predictive accuracy of a model; the distribution of samples across the different classes matters more. An imbalanced data set will misdirect feature selection and render the model meaningless. Take the thyroid data set, for example: when C is set to 0.005, only the feature sex is selected, yet the model trained on the reduced data set still achieves very high predictive accuracy, which means the model claims to predict whether a patient suffers from thyroid disease from his or her sex alone. That is ridiculous, right? Do you know the iris data set? It contains only 150 instances, but the samples of its 3 classes are uniformly distributed, and the model trained on it attained a predictive accuracy of 100% in my experiment (see the sketch after these answers).
3. As mentioned above, no one knows in advance what the appropriate value of C is for a given data set. The only method is trial and error, as in the sweep sketched above.
4. Yes, I have added a conclusion section to this page.
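On the iris point in my second answer, a quick sketch (again assuming scikit-learn, which bundles a copy of the iris data) shows that class balance matters even with only 150 instances; the exact score depends on the split, so treat the printed number as illustrative rather than a reproduction of my experiment:

# Iris: only 150 instances, but its 3 classes are perfectly balanced,
# and a plain linear SVM separates them almost perfectly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
print(np.bincount(y))  # [50 50 50] -- uniform class distribution

scores = cross_val_score(LinearSVC(dual=False, max_iter=10000), X, y, cv=5)
print(f"mean 5-fold accuracy: {scores.mean():.3f}")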
Sincerely,
Ke Dai