Feedback

Hey Ritika,

  • I see Deadpoolz2 sent you a message. Did you ever respond? ;)
  • Seeing how you go about understanding OkCupid is actually quite interesting haha. (Though this seems more like a blog post than a wiki page.)
  • I get that your features are the answers to four questions, but how are you defining the distance measure between people?
  • How are you defining error? I don't see why clustering would require an error function.
  • Why is there a need for a test set if the model doesn't have a method for validation? Am I missing something?
  • Why did you choose to only use four questions? It would be more interesting to use more, no?
  • From your results, it actually seems like k-means is incapable of extracting clusters, since all it does is basically output one giant cluster.
TianQiChen (talk) 04:42, 21 April 2016

Hi Ricky,

  • I did not respond to Deadpoolz2 :P
  • I guess I kept my page a little informal to keep things interesting and the readers engaged. Don't know if that worked well with everyone, but that was my thinking behind it.
  • So the questions are the attributes, and the answers to these questions are the values between which I am calculating the distance. Since the attributes are categorical, my Euclidean distance (in k-means) works like this: if two people have the same answer, the distance is 1; if they have different answers, it is 0.
  • I am not defining the error myself; it is part of how OkCupid finds your true percentage. The error refers to finding the true percentage, not to my experimental error.
  • So initially I am clustering the data and then checking whether the rest of the data actually falls into the clusters. I also tried with just the training data, and it gave me the same clusters (so that's positive).
  • In my first set of experiments, I used just four of the most popular attributes (which everyone had answered, i.e. no missing data). I have since extended it to incorporate more attributes (as I did in my presentation and now on my wiki page).
  • With finer-grained data and more attributes with higher variance, we can get better clusters. In the experiment with four attributes, almost everyone answered those questions the same way.
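
To make the distance question concrete, here is a minimal sketch of a per-question comparison on categorical answers. It is not the code used in the experiments. The function names and the sample answers are hypothetical. The match indicator follows the rule stated above (1 for the same answer, 0 for different answers); since that behaves like a similarity, the sketch also derives a Hamming-style distance by counting mismatches, which is an assumption about how the indicator would be turned into a distance.

```python
def match_indicator(answer_a, answer_b):
    """Per-question indicator as described in the thread:
    1 if both people gave the same answer, 0 otherwise."""
    return 1 if answer_a == answer_b else 0


def mismatch_distance(person_a, person_b):
    """Hamming-style distance (an assumption, not the thread's exact rule):
    the number of questions the two people answered differently."""
    if len(person_a) != len(person_b):
        raise ValueError("both people must answer the same set of questions")
    return sum(1 - match_indicator(a, b) for a, b in zip(person_a, person_b))


# Hypothetical answers to four questions, for illustration only.
alice = ["yes", "no", "sometimes", "yes"]
bob = ["yes", "yes", "sometimes", "no"]

print(mismatch_distance(alice, bob))    # questions 2 and 4 differ
print(mismatch_distance(alice, alice))  # identical answers
```

With only four questions that almost everyone answers the same way, most pairs end up at distance 0 under this kind of measure, which is consistent with the single large cluster discussed above.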

Ritika

RitikaJain (talk) 06:55, 23 April 2016