Course talk:CPSC522/The Automation of Disease Diagnosis


Good work! I think you should explain the pseudocode you used and provide a link to your codebase, so that people can build on your work.

TanujKrAasawat (talk) 02:53, 21 April 2016

Hi Tanuj, thanks for your feedback. I just used sklearn to implement my idea. It provides various machine learning modules and is very easy to use. Here is the link: http://scikit-learn.org/stable/.
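For anyone who wants a starting point, here is a minimal sketch of the kind of scikit-learn call involved, using the built-in iris data as a stand-in for a disease data set (the page's actual experiments use UCI disease data, which is not loaded here; the C value is arbitrary):

    # Minimal sketch: an L1-regularized linear SVM in scikit-learn.
    # Iris stands in for a disease data set; C=0.01 is just an illustration.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # penalty='l1' drives many coefficients to exactly zero; dual=False is required with l1
    clf = LinearSVC(C=0.01, penalty='l1', dual=False)
    clf.fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))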

KeDai (talk) 11:15, 23 April 2016
 

Hi Ke Dai,
Good work! Here are a few comments/suggestions:

  • I like how you presented the background material in detail; it makes it much easier for the reader to understand your work without needing to look elsewhere.
  • Do you think having only 270 instances for heart disease would undermine the accuracy of your results? Maybe you could use another data set with more instances in the future to verify that the same conclusion holds.


Keep up the good work,
Best Regards,
Adnan

AdnanReza (talk) 17:34, 22 April 2016

Hi Adnan,

Thanks for your kind and constructive suggestion. The heart data set is indeed a little small, but I have not found a better disease data set so far. I will try to find more disease data sets to conduct feature selection on in the future.

Sincerely,

Ke Dai

KeDai (talk) 11:09, 23 April 2016
 

Hi Ke Dai,

I am glad to see you working on this topic. I have used this kind of diagnosis model before, but applied to a different field: online advertising. We built a similar model to predict whether a visitor to a website would click a specific advertisement shown there. We listed several features that we thought might influence the click probability and trained the model on click data gathered from the real world. So I know well that the problem you mention on this page is a serious one that affects both the accuracy and the efficiency of the model, and I fully agree with your thinking.


I have several questions for you.

1. For the dermatology test, I found there is a big gap between the two results when C=0.004. Can you explain why? Is it because you do not have enough instances for testing?

2. I think you should use other diseases as test data. I noticed that for the heart disease test you have only 270 instances, which might not be enough to support the accuracy of your model. Try to gather more instances to make the evidence solid.

3. How did you decide on those values of C? I noticed that you used 0.004 right after 0.005 in the dermatology test but 0.001 right after 0.005 in the heart test. Could you add some explanation of that part?

4. Can you add an evaluation section that explicitly shows whether your hypothesis is true?

DandanWang (talk) 04:03, 21 April 2016

Hi Dandan,

Thanks for your kind and detailed feedback. Let me explain a little bit about C. C is the penalty coefficient on the error term, which represents how tolerant you are of error. The bigger the value of C, the less tolerant you are of error. That is to say, too large a value of C fits the training data more closely but may lead to overfitting, while too small a value of C leads to poor predictive accuracy. So you have to find an appropriate value of C by trial and error in practice.
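To make this concrete, here is a small sketch of how C trades sparsity against fit (iris again as a stand-in data set; these C values are just illustrations, not the ones from the page):

    # Sketch: shrinking C strengthens the L1 penalty, zeroing out more features.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    for C in [1.0, 0.1, 0.01, 0.005]:
        clf = LinearSVC(C=C, penalty='l1', dual=False).fit(X, y)
        # A feature counts as "kept" if any class gives it a nonzero weight
        kept = int(np.count_nonzero(np.any(clf.coef_ != 0, axis=0)))
        print('C=%.3f  features kept: %d  training accuracy: %.2f'
              % (C, kept, clf.score(X, y)))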

1. When C is set to 0.004, the tolerance of error increases and some important, necessary features are removed, so the learning model trained on the reduced training set cannot fit the test set well.

2. From my perspective, the size of a data set is not the most important factor affecting the predictive accuracy of a model; the distribution of samples across the different classes matters more. An imbalanced data set will misdirect feature selection and make the model meaningless. Take the thyroid data set for example: when C is set to 0.005, only the feature sex is selected, yet the model trained on the new data set still achieves very high predictive accuracy, which would mean the model can predict whether a patient suffers from thyroid disease by his or her sex alone. That is ridiculous, right? Do you know the iris data set? It contains only 150 instances, but the samples of its 3 classes are uniformly distributed, and the model trained on it attained a predictive accuracy of 100% in my experiment. (A quick sketch of this kind of class-distribution check appears after point 4 below.)

3. As mentioned above, no one knows in advance what the appropriate value of C is for a given data set. The only method is trial and error.

4. Yes. I have added a conclusion section to this page.
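As promised in point 2, here is a quick sketch of checking a data set's class distribution before trusting its accuracy numbers (iris as a stand-in; Counter comes from the Python standard library):

    # Sketch: inspect class balance, then get a cross-validated accuracy.
    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    print(Counter(y))  # iris: 50 samples per class, i.e. perfectly balanced

    # 5-fold cross-validated accuracy on the balanced data
    scores = cross_val_score(LinearSVC(C=1.0, penalty='l1', dual=False), X, y, cv=5)
    print('mean cross-validated accuracy: %.2f' % scores.mean())

A heavily skewed Counter output would be a warning sign that raw accuracy can be inflated by always predicting the majority class.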

Sincerely,

Ke Dai

KeDai (talk) 12:09, 22 April 2016
 

Nice page!

1. In your experiment results, does "original dataset" mean no regularization?

2. Is accuracy a good metric for disease diagnosis? Should false negatives be weighted more heavily?

YanZhao (talk) 02:13, 21 April 2016

Hi Yan Zhao,

Thanks for your critique.

1. "Original dataset" means there is no feature reduction, but I still use L1-regularization to train the model.

2. For this page, my objective is to reduce feature dimensionality, making the learning model simpler and its predictions more explainable to doctors, without significant loss of predictive accuracy. So accuracy is one metric; the other is feature dimensionality. If the model trained on the reduced data sets still has good predictive accuracy after feature selection, I can assume that the selected features are important for the learning objective and that the removed features are less relevant or irrelevant.
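For readers who want to reproduce this kind of two-metric comparison, here is a sketch under the same assumptions as before (iris as a stand-in data set; the C values are illustrative). SelectFromModel keeps only the features to which an L1-penalized model assigns non-negligible weight, and a fresh model is then scored on the reduced features:

    # Sketch: L1-based feature selection, then retrain and report both metrics.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Fit an L1-penalized model and keep only the features it weights above threshold
    selector = SelectFromModel(LinearSVC(C=0.1, penalty='l1', dual=False)).fit(X_train, y_train)
    X_train_new = selector.transform(X_train)
    X_test_new = selector.transform(X_test)

    # Metric 1: feature dimensionality before and after selection
    print('features: %d -> %d' % (X_train.shape[1], X_train_new.shape[1]))

    # Metric 2: predictive accuracy of a model retrained on the reduced features
    clf = LinearSVC(C=1.0, penalty='l1', dual=False).fit(X_train_new, y_train)
    print('test accuracy on reduced features: %.2f' % clf.score(X_test_new, y_test))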

Sincerely,

Ke Dai

KeDai (talk) 02:47, 21 April 2016