Feedback

Hey Ritika,

I see Deadpoolz2 sent you a message. Did you ever respond? ;)
Seeing how you go about understanding OkCupid is actually quite interesting haha. (Though this seems more like a blog post than a wiki page.)
I get that your features are the answers to four questions, but how are you defining the distance measure between people?
How are you defining error? I don't see why clustering would require an error function.
Why is there a need for a test set if the model doesn't have a method for validation? Am I missing something?
Why did you choose to only use four questions? It would be more interesting to use more, no?
From your results, it actually seems like k-means is incapable of extracting clusters, since all it does is basically output one giant cluster.

Hi Ricky,

I did not respond to Deadpoolz2 :P
I guess I kept my page a little informal to keep things interesting and the readers engaged. Don't know if that worked well with everyone, but that was my thinking behind it.
So the questions are the attributes, and the answers to these questions are the values between which I am calculating the distance. Since the attributes are categorical, my Euclidean distance (in k-means) works like this: if they have the same answer, the distance is 1; if they have different answers, it is 0.
I am not defining the error; this is how OkCupid finds your true percentage. The error relates to finding the true percentage, not to my experimental error.
So initially I am clustering the data and then checking whether the rest of the data actually falls into the clusters. I tried with just the training set as well, and it gave me the same clusters (so that's positive).
In my first set of experiments, I used just four of the most popular attributes (ones that everyone had answered, i.e. no missing data). I have since extended it to incorporate more attributes (as I did in my presentation and on my wiki page now).
With better granularity in the data and more attributes with higher variance, we can get better clusters, since in the experiment with 4 attributes those questions were answered almost the same by everyone.

Ritika
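
For reference, the per-question distance discussed above can be sketched roughly as follows. This is a minimal sketch, not Ritika's actual code: it assumes the usual convention for categorical data, where matching answers contribute 0 and differing answers contribute 1 (the reply above states it the other way round, which reads more like a similarity score). The function name mismatch_distance and the sample answers are made up for illustration.

```python
def mismatch_distance(answers_a, answers_b):
    """Count the questions on which two users gave different answers.

    Assumption: each question contributes 0 if the answers match and 1 if
    they differ (a Hamming-style distance over categorical attributes).
    """
    if len(answers_a) != len(answers_b):
        raise ValueError("both users must answer the same set of questions")
    return sum(a != b for a, b in zip(answers_a, answers_b))


# Hypothetical answers to four questions (invented for illustration only).
user_1 = ["yes", "no", "sometimes", "yes"]
user_2 = ["yes", "yes", "sometimes", "no"]
user_3 = ["yes", "no", "sometimes", "yes"]

d12 = mismatch_distance(user_1, user_2)  # differs on questions 2 and 4
d13 = mismatch_distance(user_1, user_3)  # identical answers
```

Under this convention, identical answer lists are at distance 0 and every differing question adds 1, so users who answer almost everything the same way sit at near-zero distance from one another.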