UBC Wiki - User contributions [en]

Course:CPSC522/Variable Elimination

2017-09-24T22:23:47Z

RitikaJain: /* End of civilization? or Not? */

[[Category:CPSC522]]
== Variable Elimination ==
Variable elimination (VE) is an algorithm for performing inference in Bayesian networks by manipulating conditional probabilities in the form of factors.

Principal Author: Ritika Jain

Collaborators: Tanuj Kr Aasawat, Prithu Banerjee

== Abstract ==
The most common type of query is of the form P(Y|E=e), where Y is the set of variables being queried and E is the set of variables we have observed.
In principle, this can be computed from the full joint probability distribution, which is obtained from the conditional probability tables in the network, using inference by enumeration as shown below.

;Given:
: a prior joint probability distribution (JPD) on a set of variables X.
: specific values 'e' for the evidence variables E (a subset of X).
;We need to compute:
: the posterior joint distribution of the query variables Y (a subset of X) given the evidence 'e'.
;Method:
:Step 1: condition to get distribution P(X|e).
:Step 2: marginalize to get distribution P(Y|e).
However, this method is extremely inefficient and does not scale well, because the full joint table grows exponentially with the number of variables. We can do better by taking advantage of independence between variables: the joint probability distribution can be represented as a product of smaller conditional distributions, and terms can be simplified when the variables involved are independent or conditionally independent.
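The two-step enumeration method above can be sketched in a few lines of Python. This is only an illustration: the function name <code>enumerate_query</code>, the variables <code>Rain</code> and <code>Wet</code>, and all the numbers are made up and do not come from any network on this page.

```python
def enumerate_query(joint, variables, query, evidence):
    """Return P(query | evidence) from a full joint table.

    joint maps a tuple of values (ordered as in `variables`) to a probability.
    """
    qi = variables.index(query)
    # Step 1: condition, i.e. keep only the rows consistent with the evidence.
    consistent = {
        row: p for row, p in joint.items()
        if all(row[variables.index(v)] == val for v, val in evidence.items())
    }
    # Step 2: marginalize, i.e. add up the rows that agree on the query variable.
    unnormalized = {}
    for row, p in consistent.items():
        unnormalized[row[qi]] = unnormalized.get(row[qi], 0.0) + p
    # Normalize so that the result sums to one.
    total = sum(unnormalized.values())
    return {value: p / total for value, p in unnormalized.items()}


# Hypothetical joint distribution over two binary variables.
variables = ["Rain", "Wet"]
joint = {
    (True, True): 0.18, (True, False): 0.02,
    (False, True): 0.08, (False, False): 0.72,
}
posterior = enumerate_query(joint, variables, "Rain", {"Wet": True})
```

The table passed in has one row per joint assignment, so its size doubles with every additional binary variable, which is the scaling problem described above.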
== Background knowledge==
To understand variable elimination, we will first introduce the concept of [[Course:CPSC522/Variable_Elimination#Factors|factors]] and understand the [[Course:CPSC522/Variable_Elimination#Assigning_a_variable|operations]] that can be performed on them. This will be followed by the variable elimination algorithm, explained with an example.
===Factors===
A factor is a function from a tuple of random variables to the real numbers R.
We write a factor on variables X1,… ,Xj as f(X1,… ,Xj).
A factor denotes one or more (possibly partial) distributions over the given tuple of variables, e.g., P(X1,X2) is a factor f(X1,X2). We shall look at three basic operations on factors: assigning a variable, summing out a variable and multiplying factors.
====Assigning a variable====
We can assign values to some or all variables of an existing factor to create a new factor.
[[File:Factor1.png|right|Assigning a variable for factors]]
Entries whose values are inconsistent with the assignment are not carried over to the new factor.

====Summing out a variable====
We can marginalize out (or sum out) a variable. Marginalizing out a variable X from a factor f(X1,...,Xn) yields a new factor defined on {X1,...,Xn} \ {X}.
[[File:Factor2.png|Summing out a variable]]

====Multiplying factors====
Two factors can be multiplied on the basis of a common variable to get a new factor. The product of the factors f1(A,B) and f2(B,C), where B is the variable in common, is the factor (f1 × f2)(A,B,C) defined by: (f1 × f2)(A,B,C) = f1(A,B) f2(B,C). The domain of f1 × f2 is A ∪ B ∪ C. This is shown in the figure below.
[[File:Factor3.png|600px|Multiplying factors]]
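The three factor operations can be sketched in Python as follows. The representation (a tuple of variable names plus a dictionary from value tuples to numbers), the class and function names, and the numbers in <code>f1</code> and <code>f2</code> are all assumptions made for illustration; they are not taken from the figures above.

```python
class Factor:
    """A factor: an ordered tuple of variable names and a table of values."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)  # maps a tuple of values to a real number


def assign(f, var, value):
    """Fix var = value and drop var from the factor's scope."""
    i = f.variables.index(var)
    table = {k[:i] + k[i + 1:]: p for k, p in f.table.items() if k[i] == value}
    return Factor(f.variables[:i] + f.variables[i + 1:], table)


def sum_out(f, var):
    """Marginalize var by adding up the entries that agree on all other variables."""
    i = f.variables.index(var)
    table = {}
    for k, p in f.table.items():
        key = k[:i] + k[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(f.variables[:i] + f.variables[i + 1:], table)


def multiply(f1, f2):
    """Pointwise product over the union of the two scopes."""
    out_vars = f1.variables + tuple(v for v in f2.variables if v not in f1.variables)
    table = {}
    for k1, p1 in f1.table.items():
        a1 = dict(zip(f1.variables, k1))
        for k2, p2 in f2.table.items():
            a2 = dict(zip(f2.variables, k2))
            # Only combine rows that agree on the shared variables.
            if all(a1[v] == a2[v] for v in a2 if v in a1):
                merged = {**a1, **a2}
                table[tuple(merged[v] for v in out_vars)] = p1 * p2
    return Factor(out_vars, table)


# Hypothetical factors f1(A,B) and f2(B,C) over binary variables.
f1 = Factor(("A", "B"), {(True, True): 0.1, (True, False): 0.9,
                         (False, True): 0.2, (False, False): 0.8})
f2 = Factor(("B", "C"), {(True, True): 0.3, (True, False): 0.7,
                         (False, True): 0.6, (False, False): 0.4})
f3 = multiply(f1, f2)               # factor on (A, B, C)
f1_a_true = assign(f1, "A", True)   # factor on (B,)
f1_no_a = sum_out(f1, "A")          # factor on (B,)
```

A dictionary-based table keeps the sketch short; an array-based representation would be the more usual choice when factors become large.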

== Introduction ==
We can express the joint probability as a factor over the query variable, the observed variables and the remaining variables: f(Y, E1,...,Ej, Z1,...,Zk), where the E's are the observed variables and the Z's are the other variables not involved in the query.
We can compute P(Y, E1=e1,...,Ej=ej) by assigning E1=e1,...,Ej=ej and marginalizing out the variables Z1,...,Zk one at a time. This is represented as:
[[File:F4.png|center|500px|Joint probability distribution as factors]]
The order in which the variables are marginalized is called the elimination order. Finding the [[Course:CPSC522/Variable_Elimination#Complexity|optimal elimination order]] is an NP-hard problem.

We know the joint probability distribution of a Bayesian network as:
[[File:Fig5.png|center|550px|Joint probability distribution]]
We can express the joint factor as a product of factors, one for each conditional probability.
[[File:Fig6.png|550px|center]]
Inference in Bayesian networks thus reduces to computing the above sum of products. This can be computed efficiently by
*partitioning the factors into those that contain a particular 'Zk' and those that do not.
*summing out 'Zk' over only the factors that contain 'Zk'. We explain this with the following example.
[[File:Figex.png|center|400px|Example]]
In the example above, we initially have four factors f1(C,D), f2(A,B,D), f3(E,A) and f4(D). We have one unobserved variable A that we need to sum over. We partition the factors into those that contain A, namely f2(A,B,D) and f3(E,A), and those that do not contain A, namely f1(C,D) and f4(D). We sum over only the factors that contain A. Using the [[Course:CPSC522/Variable_Elimination#Multiplying_factors| multiplication of factors]] operation described earlier, we compute a new factor f5(A,B,D,E). [[Course:CPSC522/Variable_Elimination#Summing_out_a_variable|Summing out the variable]] A, we get a new factor f6(B,D,E).
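
Written as an equation, and assuming the four factors are simply multiplied together before A is summed out, this example amounts to:

<math>\sum_A f_1(C,D)\, f_2(A,B,D)\, f_3(E,A)\, f_4(D) = f_1(C,D)\, f_4(D) \sum_A f_2(A,B,D)\, f_3(E,A) = f_1(C,D)\, f_4(D)\, f_6(B,D,E),</math>

where <math>f_5(A,B,D,E) = f_2(A,B,D)\, f_3(E,A)</math> and <math>f_6(B,D,E) = \sum_A f_5(A,B,D,E)</math>. The factors that do not mention A are constants with respect to the sum and can be moved outside it.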
===General case===
Decomposition of sum of products can be seen as a general case as follows:
[[File:Generalcase.png|500px|center|General_case]]

==Algorithm==
To compute the conditional probability P(Y=yi | E=e), where E are the observed variables and Z are the variables not involved in the query, the variable elimination algorithm[http://artint.info/html/ArtInt_148.html] proceeds as follows:

1. Construct a factor for each conditional probability.

2. For each factor, assign the observed variables E to their observed values.

3. Given an elimination ordering, decompose the sum of products.

4. Sum out all variables Zi not involved in the query.

5. Multiply the remaining factors.

6. Normalize by dividing the resulting factor f(Y) by Σ<sub>y</sub> f(y), the sum over all values y of Y.
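
The six steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the implementation from the cited textbook: factors are plain <code>(variables, table)</code> pairs, the helper names are invented here, and the three-node chain A → B → C with its numbers is a hypothetical example used only to exercise the code.

```python
def restrict(factor, var, value):
    """Step 2: assign an observed value, dropping var from the scope."""
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    return (variables[:i] + variables[i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value})


def multiply(f1, f2):
    """Pointwise product over the union of both scopes."""
    (v1, t1), (v2, t2) = f1, f2
    out_vars = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for k1, p1 in t1.items():
        a1 = dict(zip(v1, k1))
        for k2, p2 in t2.items():
            a2 = dict(zip(v2, k2))
            if all(a1.get(v, a2[v]) == a2[v] for v in v2):
                merged = {**a1, **a2}
                table[tuple(merged[v] for v in out_vars)] = p1 * p2
    return (out_vars, table)


def sum_out(factor, var):
    """Step 4: marginalize var out of a factor."""
    variables, table = factor
    i = variables.index(var)
    out = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return (variables[:i] + variables[i + 1:], out)


def variable_elimination(factors, query, evidence, order):
    """Return P(query | evidence) given one factor per conditional probability (step 1)."""
    for var, value in evidence.items():                      # step 2
        factors = [restrict(f, var, value) for f in factors]
    for z in order:                                          # steps 3 and 4
        with_z = [f for f in factors if z in f[0]]
        if not with_z:
            continue
        product = with_z[0]
        for f in with_z[1:]:
            product = multiply(product, f)
        factors = [f for f in factors if z not in f[0]] + [sum_out(product, z)]
    result = factors[0]                                      # step 5
    for f in factors[1:]:
        result = multiply(result, f)
    assert result[0] == (query,)
    total = sum(result[1].values())                          # step 6
    return {k[0]: p / total for k, p in result[1].items()}


# Hypothetical chain A -> B -> C with made-up conditional probability tables.
T, F = True, False
f_a = (("A",), {(T,): 0.3, (F,): 0.7})
f_b = (("B", "A"), {(T, T): 0.8, (F, T): 0.2, (T, F): 0.1, (F, F): 0.9})
f_c = (("C", "B"), {(T, T): 0.9, (F, T): 0.1, (T, F): 0.2, (F, F): 0.8})
posterior = variable_elimination([f_a, f_b, f_c], "A", {"C": T}, order=["B"])
```

Here the only variable to eliminate is B; in a larger network the <code>order</code> list plays the role of the elimination ordering of step 3.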

===Example===
Consider the following Bayesian network and the query P(G | H=h1). We consider a given elimination order A,C,E,I,B,D,F.

[[File:Mainex.png|left|150px]]

Essentially, we need to compute <math>\sum_{A,B,C,D,E,F,I} P(A,B,C,D,E,F,G,H,I)</math>, summing over all the variables not involved in the query: A, B, C, D, E, F and I. Using the factorization of the Bayesian network, in which each node is conditioned only on its parents, we can rewrite the above summation as:

<math>P(G,H)=\sum_{A,B,C,D,E,F,I} P(A) P(B | A) P(C) P(D | B,C) P(E | C) P(F | D) P(G | F,E) P(H | G) P(I | G). </math>

<big>'''Step 1: Construct a factor for each conditional probability.'''</big>

Writing the probabilities as factors we get,

<math>P(G,H)=\sum_{A,B,C,D,E,F,I} f_0(A) f_1(B,A) f_2(C) f_3(D,B,C) f_4(E,C) f_5(F,D) f_6(G,F,E) f_7(H,G) f_8(I,G). </math>

<big>'''Step 2: Assign to observed variables their observed values.'''</big>

We start by observing H=h1 as given in the query which changes the factor f7 to f9. Therefore we get,

<math>P(G,H=h_1)=\sum_{A,B,C,D,E,F,I} f_0(A) f_1(B,A) f_2(C) f_3(D,B,C) f_4(E,C) f_5(F,D) f_6(G,F,E) \color{red} {f_9(G)} \color{black}{f_8(I,G).} </math>

<big>'''Step 3: Decompose sum of products.'''</big>

According to the elimination order provided to us, we need to perform product and sum out A first, then C, then E and so on. All factors involving A will be considered in the summation of A as shown below:

<math>P(G,H=h_1)= f_9(G) \sum_F \sum_D f_5(F,D) \sum_B \sum_I f_8(I,G) \sum_E f_6(G,F,E) \sum_C f_2(C) f_3(D,B,C) f_4(E,C) \color{blue}{\sum_A f_0(A) f_1(B,A)}. </math>

<big>'''Step 4: Sum out non query variables (one at a time).'''</big>

Multiplying f0(A) and f1(B,A), we get f10(B,A). Summing out A, we get f11(B). This factor does not depend on C, E or I (the next variables in the elimination order), so it is pushed outside those sums and placed under the summation over B.

<math>P(G,H=h_1)= f_9(G) \sum_F \sum_Df_5(F,D) \sum_B\color{red}{f_{11}(B)} \color{black}{\sum_If_8(I,G) \sum_E f_6(G,F,E) \sum_C f_2(C) f_3(D,B,C)f_4(E,C). }</math>

Multiplying f2(C), f3(D,B,C) and f4(E,C) and summing out C, we get the factor f12(D,B,E). This factor is placed under the summation over E, as E comes before B and D in the elimination order.

<math>P(G,H=h_1)= f_9(G) \sum_F \sum_Df_5(F,D) \sum_Bf_{11}(B) \sum_If_8(I,G) \sum_E\color{blue}{f_{12}(D,B,E)} \color{black}{f_6(G,F,E).} </math>

Similarly, multiplying f12(D,B,E) and f6(G,F,E) and summing out E, we get the factor f13(B,D,F,G):

<math>P(G,H=h_1)= f_9(G) \sum_F \sum_Df_5(F,D) \sum_Bf_{11}(B)\color{red}{f_{13}(B,D,F,G)} \color{black} { \sum_If_8(I,G).}</math>

Continuing along these lines, we finally end up with:
<math>P(G,H=h_1)= f_9(G)f_{14}(G)f_{17}(G).</math>
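
The intermediate factors in this expression can be spelled out as follows. This is a sketch that assumes each new factor is numbered in the order in which I, B, D and F are summed out; the names <math>f_{15}</math> and <math>f_{16}</math> are inferred from that numbering.

<math>f_{14}(G) = \sum_I f_8(I,G)</math>

<math>f_{15}(D,F,G) = \sum_B f_{11}(B)\, f_{13}(B,D,F,G)</math>

<math>f_{16}(F,G) = \sum_D f_5(F,D)\, f_{15}(D,F,G)</math>

<math>f_{17}(G) = \sum_F f_{16}(F,G)</math>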

<big>'''Step 5: Multiply remaining factors.'''</big>

Multiplying all the remaining factors in G, we get,

<math>P(G,H=h_1)= f_{18}(G)</math>

<big>'''Step 6: Normalize'''</big>

<math>P(G=g | H=h_1) = \frac{P(G=g, H=h_1)}{\sum_{g' \in dom(G)} P(G=g',H=h_1)} = \frac{f_{18}(g)}{\sum_{g' \in dom(G)} f_{18}(g')}</math>

===Further optimizations===
In the previous example we did not take advantage of the conditional independence in the Bayesian network.

#Conditional independence: Before running variable elimination, we can reduce the number of nodes to consider by pruning every variable Z that is conditionally independent of the query Y given the evidence E: Z <math> \perp\!\!\!\perp</math> Y | E. '''Example:''' In the same Bayesian network, for the query P(G=g | C=c1, F=f1, H=h1), we can prune A, B and D because both paths from these nodes to G are blocked:
#*F is an observed node.
#*C is an observed common parent.
#Unobserved leaf nodes: We can also prune unobserved leaf nodes. Because they are neither observed nor ancestors of the query nodes, they have no effect on the posterior probability of the query nodes. In the above example, we can therefore prune variable I. We then only need to run variable elimination on the reduced subnetwork shown below.[[File:Subnet.png|150px|center|Subnetwork]]This makes the entire process more efficient and reduces the running time.

===Complexity===
A factor over n binary variables needs to store <math>2^n</math> numbers. The initial factors are usually small because variables have only a few parents in the Bayesian network, but after multiplication the factors tend to get large. The complexity of variable elimination is exponential in the maximum number of variables in any factor created during its execution. This quantity is closely related to the treewidth of the graph (along a particular elimination order). Finding the optimal elimination order (i.e., the one that gives the minimum treewidth) is NP-hard. However, heuristics such as eliminating the least connected variables first work well in practice.
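
To see how the elimination order affects factor size, the following Python sketch tracks only the scopes (sets of variables) of the factors, not their numbers. The function name <code>max_factor_size</code> is made up for this illustration; the scopes are those of the worked example after observing H, and the second ordering is an arbitrary alternative chosen for comparison.

```python
def max_factor_size(scopes, order):
    """Largest number of variables in any product factor built while eliminating `order`."""
    scopes = [set(s) for s in scopes]
    largest = 0
    for z in order:
        touching = [s for s in scopes if z in s]
        if not touching:
            continue
        # Multiplying all factors that mention z gives a factor over their union.
        merged = set().union(*touching)
        largest = max(largest, len(merged))
        # Summing out z leaves a factor over the union minus z.
        scopes = [s for s in scopes if z not in s] + [merged - {z}]
    return largest


# Factor scopes of the worked example once H has been assigned (f7 becomes f9(G)).
example_scopes = [
    {"A"}, {"A", "B"}, {"C"}, {"B", "C", "D"}, {"C", "E"},
    {"D", "F"}, {"E", "F", "G"}, {"G"}, {"G", "I"},
]
given_order = ["A", "C", "E", "I", "B", "D", "F"]
other_order = ["I", "A", "B", "C", "D", "E", "F"]

size_given = max_factor_size(example_scopes, given_order)
size_other = max_factor_size(example_scopes, other_order)
```

With the order used in the example, the largest intermediate product has five variables (the product of f12(D,B,E) and f6(G,F,E)), while the alternative order never builds a factor over more than three variables.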

==AISpace Demo==
===End of civilization? or Not?===
We look at a demo example from AISpace[http://www.aispace.org/exercises/exercise6-c-1.shtml] which uses variable elimination to answer some interesting questions- Is it the end of civilization?

The scenario is:
Bill has noticed that his morning newspaper delivery has been sporadic. There are several relevant variables relating to whether or not the paper is delivered. Delivery is dependent on the paper having been successfully printed the previous night. Possible explanations for a paper not having been printed are a malfunction at the printing press, or the end of civilization as we know it.

The probabilities assigned are:
The prior probability of a printer malfunction is 0.05. Bill has been noticing some ominous signs of the apocalypse and so expects the end of civilization with a relatively high probability of 0.001. If the end of civilization is here, then the paper will certainly not be printed. If there is a printing malfunction and no end of civilization, there is a probability of 0.05 that the paper will be printed (this is non-zero because the malfunction might be fixed in time). If there is no malfunction and no end of civilization, there is a probability of 0.99 that the paper will be printed. If the paper is not printed, it will not be delivered. If it is printed, there is a probability of 0.9 that it will be delivered. The fact that this probability is not 1 suggests that there are other possible causes for the paper not being delivered that we should eventually add to our belief network (e.g. the paperboy being sick).
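
As a quick check of these numbers, the sketch below computes P(end of civilization | paper not delivered) by brute-force enumeration of the joint distribution rather than by variable elimination. The choice of this particular query and all function and constant names are assumptions made for illustration; only the probabilities come from the scenario above.

```python
from itertools import product

P_MALFUNCTION = 0.05
P_END = 0.001


def p_printed(malfunction, end):
    """P(printed = true | malfunction, end of civilization)."""
    if end:
        return 0.0
    return 0.05 if malfunction else 0.99


def p_delivered(printed):
    """P(delivered = true | printed)."""
    return 0.9 if printed else 0.0


def joint(malfunction, end, printed, delivered):
    """Joint probability of one full assignment, as a product of the conditionals."""
    p = P_MALFUNCTION if malfunction else 1 - P_MALFUNCTION
    p *= P_END if end else 1 - P_END
    pp = p_printed(malfunction, end)
    p *= pp if printed else 1 - pp
    pd = p_delivered(printed)
    p *= pd if delivered else 1 - pd
    return p


def posterior_end_given_not_delivered():
    """Sum the joint over the unobserved variables, then normalize."""
    numerator = denominator = 0.0
    for malfunction, end, printed in product([True, False], repeat=3):
        p = joint(malfunction, end, printed, False)
        denominator += p
        if end:
            numerator += p
    return numerator / denominator


p_end = posterior_end_given_not_delivered()
```

Under these numbers the posterior stays below one percent: a missing paper is far more likely to be explained by a malfunction or a delivery problem than by the end of civilization.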
A short demo[https://www.youtube.com/watch?v=qA3RJbaYqaY] that I have created can be viewed here:
{{#widget:YouTube|id=qA3RJbaYqaY|height=315|width=420}}

==Applications==
Variable elimination[https://www.cs.cmu.edu/~epxing/Class/10708.../scribe_note_lecture4.pdf] is a general technique for constraint processing. When combined with other techniques, it can be extremely useful for solving many problems that arise in domains such as resource allocation, combinatorial auctions, bioinformatics and probabilistic reasoning, which can be naturally modelled as constraint satisfaction and optimization problems.
Finding the posterior belief has multiple applications in statistical analysis and learning. Given a causal trail, calculating the conditional probability of effects given causes is called prediction. In this case, the query node is a descendant of the evidence. Diagnosis is calculating the conditional probability of causes given effects, which is useful for finding the probability of a disease given the symptoms. In this case, the query node is an ancestor of the evidence node in the trail. When learning under partial observation, the posterior belief of the unobserved variables given the observed ones is calculated.
Variable elimination works equally well for both Bayesian networks and Markov networks.

== Annotated Bibliography ==
1. [http://artint.info/html/ArtInt_149.html David Poole & Alan Mackworth, "Artificial Intelligence: Foundations of Computational Agents",] 
2. [http://www.aispace.org/exercises/exercise6-c-1.shtml AISpace] 
3. [https://www.youtube.com/watch?v=qA3RJbaYqaY Youtube demo] 
4. [https://www.cs.cmu.edu/~epxing/Class/10708.../scribe_note_lecture4.pdf CMU lecture notes] 

==To add==
-Link to Bayesian network

Course:CPSC522/Analyzing online dating trends with Weka

2017-09-24T22:19:36Z

RitikaJain:

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, a population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then for each cluster to try to build an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this [http://www.wired.com/2014/01/how-to-hack-okcupid/ Math geek]<ref name="main" />

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives me the option to find matches according to the preferred age, orientation, and location of who I want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of finding a higher match percentage. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculation. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answer make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So A placed 250 importance points on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answer make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 points. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid '''multiplies''' the two satisfaction scores and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had both answered just one common question and each liked the other's answer, they would be matched 100% even though they know practically nothing about each other. To avoid this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users). It takes '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).
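
The calculation in this example can be reproduced with a short Python sketch. It follows the description on this page only; the function names, the list-of-pairs representation and the clamping of the published value at zero are assumptions for illustration and are not taken from OkCupid's own code.

```python
from math import sqrt

# Point values from the importance table above.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}


def satisfaction(view):
    """Fraction of importance points earned.

    view is a list of (importance, satisfied) pairs, one per common question,
    from one user's point of view. Assumes at least one question has non-zero weight.
    """
    possible = sum(IMPORTANCE_POINTS[importance] for importance, _ in view)
    earned = sum(IMPORTANCE_POINTS[importance] for importance, ok in view if ok)
    return earned / possible


def match(a_view, b_view):
    """Return (calculated match, lowest possible match) for two users."""
    calculated = sqrt(satisfaction(a_view) * satisfaction(b_view))
    margin = 1 / len(a_view)  # one over the number of common questions
    return calculated, max(0.0, calculated - margin)  # clamping at 0 is an assumption


# A's view: B's answer satisfied A on the first question but not on the second.
a_view = [("Very important", True), ("A little important", False)]
# B's view: A's answer did not satisfy B on the first question but did on the second.
b_view = [("A little important", False), ("Somewhat important", True)]

calculated, published = match(a_view, b_view)
```

Rounded to whole percentages, <code>calculated</code> and <code>published</code> come out at 95% and 45%, in line with the figures worked out by hand above.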

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. I can also only see other users' answers to questions that I have answered myself (to overcome this issue, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at cleanly extracting questions and answers from other users' profiles, I came across a GitHub project which had created an '''OkCupid scraping bot''' written in Python, based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]

The attributes are the questions that users have answered and the rows correspond to answers by each user to those questions.

====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster are the answers of users and the attributes I am clustering on (i.e. trying to model their behavioural attributes) are the questions answered by a majority of the population.
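
To make the idea concrete, here is a minimal k-means sketch in plain Python. It is not Weka's implementation: it works on numeric vectors with squared Euclidean distance, seeds the centroids with the first k points (an assumption made to keep the example deterministic), and uses made-up two-dimensional points. The OkCupid answers are nominal, so they would need a numeric encoding or a mode-based variant before an approach like this applies.

```python
def kmeans(points, k, iterations=100):
    """Cluster numeric points into k groups; returns (centroids, assignment)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical data: two well-separated groups of points.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
```

Each centroid plays the role of the representative profile of its cluster, which is how the cluster centroids reported later on this page can be read.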

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density based clustering and K-means clustering. K-means clustering gave me better outcomes.
On Weka, the process can be shown as following:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062), i.e. those questions which have been answered by all the users (having 0% missing data).
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]] I select the four attributes from the left pane and remove the rest of the attributes, and then I click on the cluster tab to perform k-means clustering on the selected data.
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In these set of experiments, I consider 31 most popular attributes which have missing data < 34%. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal(such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q_20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
Cross-validation gives k = 7 clusters. The cluster visualization is shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (the human psyche is bent towards liking options which it itself agrees with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100 - 25 = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers of a cluster would get a user a match rate higher than 90% for that cluster.

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender systems problem and developing or improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T07:06:25Z

RitikaJain: /* Training */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to try to build an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this [http://www.wired.com/2014/01/how-to-hack-okcupid/ Math geek]<ref name="main" />

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This validates an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in descending order of the number of answers they have already received.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5%, i.e. 100% - 100%/67 (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:
* Your answer
* How you'd like someone else to answer
* How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with the other person is determined by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, user A and user B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid '''multiplies''' the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

If A and B had answered just one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
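
The worked example above can be reproduced with a minimal Python sketch. It follows the description on this page only (importance points, satisfaction ratios, square root of the product, and a 1/n margin of error); the function names, the dictionary layout and the short question keys <code>messy</code> and <code>cheated</code> are assumptions made for illustration, not OkCupid's actual code.

```python
import math

# Importance points as listed in the table on this page.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}

def satisfaction(own_prefs, other_answers):
    """Fraction of importance points the other person's answers earn.

    own_prefs maps question -> (set of acceptable answers, importance level).
    other_answers maps question -> the other person's answer.
    Only questions that the other person has answered are counted.
    """
    earned = possible = 0
    for question, (acceptable, importance) in own_prefs.items():
        if question not in other_answers:
            continue
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0

def match_percentage(prefs_a, answers_a, prefs_b, answers_b):
    """Return (calculated match, lowest possible match) in percent."""
    sat_a = satisfaction(prefs_a, answers_b)  # how happy B's answers make A
    sat_b = satisfaction(prefs_b, answers_a)  # how happy A's answers make B
    calculated = 100 * math.sqrt(sat_a * sat_b)
    common = len(set(answers_a) & set(answers_b))
    margin = 100 / common if common else 100.0
    return calculated, max(0.0, calculated - margin)

# Users A and B from the example above.
prefs_a = {"messy": ({"Average", "Very organized"}, "very important"),
           "cheated": ({"No"}, "a little important")}
answers_a = {"messy": "Very organized", "cheated": "No"}
prefs_b = {"messy": ({"Average"}, "a little important"),
           "cheated": ({"No"}, "somewhat important")}
answers_b = {"messy": "Average", "cheated": "Yes"}

calculated, lowest = match_percentage(prefs_a, answers_a, prefs_b, answers_b)
print(round(calculated), round(lowest))  # 95 45
```

With only two questions in common the margin of error is 50 percentage points, which is why the published figure drops from about 95% to about 45%.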

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I created an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project containing an '''OkCupid scraping bot''' written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily received a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
In the question data set, each instance is a question and the attributes are the answer options of that question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set is shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]

The attributes are the questions that users have answered and the rows correspond to answers by each user to those questions.

====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
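
To make the two alternating steps of K-means concrete, here is a minimal Python sketch of the standard (Lloyd's) iteration on small numeric points. It is an illustration only and not the Weka implementation used in this project: the function name <code>kmeans</code>, the toy points and the starting centroids are all assumptions, and the survey answers used here are categorical rather than numeric.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on tuples of numbers, starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean
        # (squared Euclidean distance).
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment

# Toy data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)   # [(1.0, 1.5), (8.5, 8.0)]
print(assignment)  # [0, 0, 1, 1]
```

Each returned centroid plays the role of the "representative" of its cluster, which is the property this project relies on.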

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments with only the four most popular attributes (q34113, q85419, q416235, q20062), i.e. the questions that have been answered by all users (0% missing data).
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]] I select the four attributes from the left pane and remove the rest of the attributes, and then I click on the Cluster tab to perform k-means clustering on the selected data.
* Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
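
The two preparation steps above (keeping only attributes with little missing data, then an 80-20% split) were done through the Weka GUI. As a rough illustration of the same idea outside Weka, here is a minimal Python sketch; the helper names, the use of <code>None</code> for an unanswered question and the toy rows are assumptions, not part of the actual data set or of Weka.

```python
import random

def missing_fraction(rows, attribute):
    """Share of rows in which the attribute is unanswered (None or absent)."""
    return sum(1 for row in rows if row.get(attribute) is None) / len(rows)

def select_attributes(rows, attributes, max_missing=0.0):
    """Keep the attributes whose missing share does not exceed max_missing."""
    return [a for a in attributes if missing_fraction(rows, a) <= max_missing]

def percentage_split(rows, train_share=0.8, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# Toy rows: q34113 is always answered, q35 is missing for 2 of 10 users.
rows = [{"q34113": "No problem", "q35": "Love" if i % 5 else None}
        for i in range(10)]

print(select_attributes(rows, ["q34113", "q35"]))                    # ['q34113']
print(select_attributes(rows, ["q34113", "q35"], max_missing=0.34))  # ['q34113', 'q35']
train, test = percentage_split(rows)
print(len(train), len(test))                                         # 8 2
```

The threshold of 0.0 corresponds to the four fully answered questions used here, and 0.34 to the looser cut-off used for the 31-attribute experiment.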

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal(such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q_20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
Cross-validation gives k = 7 clusters. The cluster visualization is shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (the combined share of clusters 0, 1 and 2, whose centroids all carry this answer) has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (the human psyche is bent towards liking options which it itself agrees with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100 - 25 = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers of a cluster would get a user a match rate higher than 90% for that cluster.
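
The arithmetic in this section can be checked with a short Python sketch. The helper names <code>lowest_match</code> and <code>questions_needed</code> are hypothetical, and the 100/n margin is the one described on this page rather than a verified description of OkCupid's production system.

```python
import math

def lowest_match(calculated_match, questions_in_common):
    """Calculated match minus the 100/n margin of error, floored at 0."""
    return max(0.0, calculated_match - 100.0 / questions_in_common)

def questions_needed(target, calculated_match=100.0):
    """Smallest n for which the lowest possible match reaches the target."""
    if calculated_match <= target:
        raise ValueError("target cannot be reached with this calculated match")
    return math.ceil(100.0 / (calculated_match - target))

print(lowest_match(100, 4))             # 75.0
print(round(lowest_match(100, 31), 2))  # 96.77
print(questions_needed(90))             # 10
```

Note that the bound only depends on the number of questions in common, so it says how high the published match can be, not how compatible two users actually are.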

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender systems problem and developing or improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T07:01:36Z

RitikaJain: /* Crawling data from OkCupid's website */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to try to build an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this [http://www.wired.com/2014/01/how-to-hack-okcupid/ Math geek]<ref name="main" />

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see who is viewing my profile. I have the option to browse profiles invisibly, so that people don't know I am looking at their profiles, and I can hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by users themselves, but some initial questions were made by the staff. Questions cover a multitude of topics, such as Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. The importance values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answer make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answer make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid '''multiplies''' the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It holds that '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, that would be 95% - (1/2)*100% = 45%.
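
The worked example above can be reproduced with a short script. The following is a minimal Python sketch and not OkCupid's actual code: the function names (<code>satisfaction</code>, <code>match_percentage</code>), the question keys and the data layout are assumptions made purely for illustration. It uses the point values from the table above, the square root of the product of the two satisfactions, and a margin of error of 1/(number of common questions).

```python
import math

# Importance levels and their point values, as in the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(prefs, other_answers, questions):
    """Fraction of one user's importance points earned by the other's answers.

    prefs: {question: (set of acceptable answers, importance level)}
    other_answers: {question: the other user's answer}
    questions: the questions answered by both users
    """
    earned = possible = 0
    for question in questions:
        acceptable, importance = prefs[question]
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[question] in acceptable:
            earned += points
    # Assumption: if every common question is "irrelevant", report 0.
    return earned / possible if possible else 0.0


def match_percentage(prefs_a, answers_a, prefs_b, answers_b):
    """Return (calculated match, lowest possible match) as fractions in [0, 1]."""
    common = set(prefs_a) & set(answers_a) & set(prefs_b) & set(answers_b)
    if not common:
        return 0.0, 0.0
    happy_a = satisfaction(prefs_a, answers_b, common)  # B's answers, judged by A
    happy_b = satisfaction(prefs_b, answers_a, common)  # A's answers, judged by B
    calculated = math.sqrt(happy_a * happy_b)
    margin = 1.0 / len(common)
    return calculated, max(0.0, calculated - margin)


# Users A and B from the example above (the keys "messy" and "cheated" are made up).
prefs_a = {
    "messy": ({"Average", "Very organized"}, "very important"),
    "cheated": ({"No"}, "a little important"),
}
answers_a = {"messy": "Very organized", "cheated": "No"}
prefs_b = {
    "messy": ({"Average"}, "a little important"),
    "cheated": ({"No"}, "somewhat important"),
}
answers_b = {"messy": "Average", "cheated": "Yes"}

calculated, lowest = match_percentage(prefs_a, answers_a, prefs_b, answers_b)
print(f"calculated match: {calculated:.1%}, lowest possible: {lowest:.1%}")
```

Run on users A and B, this gives a calculated match of roughly 95% and a lowest possible match of roughly 45%, in line with the numbers worked out by hand.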

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an '''OkCupid scraping bot''' written in Python on top of Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set is shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]

The attributes are the questions that users have answered and the rows correspond to answers by each user to those questions.

====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a profile that represents each segment well, K-means serves my purpose.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
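
To make the idea concrete for categorical answers, here is a minimal Python sketch. It is an illustration only and not Weka's implementation: it assumes a distance that counts the questions on which two rows disagree, and centroids that take the most common answer for each question. The function names and the toy rows are made up for this example.

```python
import random
from collections import Counter


def mismatch_distance(row, centroid):
    """Number of attributes on which the two rows give different answers."""
    return sum(1 for a, b in zip(row, centroid) if a != b)


def mode_centroid(rows):
    """Most frequent answer for each attribute (column) among the rows."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def cluster_categorical(rows, k, max_iterations=100, seed=0):
    """K-means-style loop; returns (centroids, cluster index of each row).

    rows must be tuples of strings, with at least k distinct rows.
    """
    rng = random.Random(seed)
    # Start from k distinct rows so that no two initial centroids coincide.
    centroids = rng.sample(sorted(set(rows)), k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: recompute each centroid as the per-attribute mode.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mode_centroid(members)
    return centroids, assignment


# Toy data: each row is one user's answers to three questions.
toy_rows = [("Love", "Normal", "Dogs")] * 3 + [("Sex", "Weird", "Cats")] * 3
toy_centroids, toy_assignment = cluster_categorical(toy_rows, k=2)
print(toy_centroids, toy_assignment)
```

Each centroid is itself a full set of answers, which is why a centroid can be read as the representative profile of its cluster.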

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

*Number of iterations: 2
*Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
Cross-validation gives a k of 7. The cluster visualization is shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]
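
The attribute selection for this experiment (keeping only questions with less than 34% missing data) can be sketched in Python as follows. This is a hedged illustration rather than the preprocessing I actually ran in Weka: the row layout (one dictionary per user, with <code>None</code> for an unanswered question), the function names and the toy rows are all assumptions.

```python
def missing_fractions(rows):
    """Map each attribute to the fraction of rows that have no answer for it."""
    if not rows:
        return {}
    attributes = sorted({name for row in rows for name in row})
    total = len(rows)
    return {
        name: sum(1 for row in rows if row.get(name) is None) / total
        for name in attributes
    }


def select_attributes(rows, max_missing=0.34):
    """Attributes whose missing fraction is below max_missing, most answered first."""
    fractions = missing_fractions(rows)
    kept = [name for name, fraction in fractions.items() if fraction < max_missing]
    return sorted(kept, key=lambda name: fractions[name])


# Toy rows: None marks a question the user did not answer.
toy_users = [
    {"q35": "Love", "q41": None, "q997": "Dogs"},
    {"q35": "Sex", "q41": None, "q997": "Cats"},
    {"q35": "Love", "q41": "Not important", "q997": None},
    {"q35": "Love", "q41": None, "q997": "Dogs"},
]
print(missing_fractions(toy_users))
print(select_attributes(toy_users))
```

In the toy rows, q41 is unanswered by three of the four users and is dropped, while q35 and q997 are kept.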

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering on the four most popular attributes is heavily skewed towards the first cluster, cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the response to 'Do you like watching foreign movies with subtitles?' is almost unanimously 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is firstly important to answer the most popular questions, and secondly crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match, which is the figure OkCupid publishes, would be 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions based on the mean cluster answers for each cluster would get a user a match rate higher than 90% for that cluster.
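
The arithmetic behind these match rates can be checked with a few lines of Python. This is a small sketch under the assumption used throughout this page, namely that the published match is the calculated match minus a margin of error of 100%/n for n questions in common; the function names are my own. Exact fractions are used to avoid floating-point rounding at the 90% boundary.

```python
from fractions import Fraction


def lowest_possible_match(calculated_percent, common_questions):
    """Calculated match minus the margin of error (100% / n), in percent."""
    margin = Fraction(100, common_questions)
    return max(Fraction(0), Fraction(calculated_percent) - margin)


def questions_needed(target_percent, calculated_percent=100):
    """Smallest number of common questions whose lowest match reaches the target."""
    if target_percent >= calculated_percent:
        raise ValueError("target must be below the calculated match")
    n = 1
    while lowest_possible_match(calculated_percent, n) < target_percent:
        n += 1
    return n


for n in (4, 10, 31, 67):
    print(n, f"{float(lowest_possible_match(100, n)):.2f}%")
print("questions needed for 90%:", questions_needed(90))
```

With a calculated match of 100%, this reproduces the 75% for 4 questions, 90% for 10 questions and about 96.77% for 31 questions, as well as the 98.5% ceiling for the 67 questions answered in my own profile.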

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and developing or improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Thread:Course talk:CPSC522/Analyzing online dating trends with Weka/Critique/reply

2016-04-23T06:59:48Z

RitikaJain: Reply to Critique

Hi Tanuj,
I'm glad you liked my page. Thank you so much for your feedback. I have tried elaborating upon the figures to make it easy for the reader to understand. 
Thank you again for your inputs.
Ritika

Thread:Course talk:CPSC522/Analyzing online dating trends with Weka/Comments/reply

2016-04-23T06:57:56Z

RitikaJain: Reply to Comments

Hi Yan Zhao,
Thanks so much for your feedback. I have tried to elaborate upon the figures for users not familiar with Weka. And I've also worked on my results. 
Thanks again,
Ritika

Thread:Course talk:CPSC522/Analyzing online dating trends with Weka/Feedback/reply

2016-04-23T06:55:10Z

RitikaJain: Reply to Feedback

Hi Ricky,
* I did not respond to Deadpoolz2 :P 
* I guess I kept my page a little informal to keep things interesting and the readers engaged. Don't know if that worked well with everyone, but that was my thinking behind it. 
* So the questions are the attributes and the answers to these questions are my values between which I am calculating the distance. Since the attributes are categorical, my Euclidean distance (in K-means) works like this: if they have the same answer, the distance is 0; it is 1 if they have different answers.
* I am not defining the error, this is how OkCupid finds your true percentage. The error is with regards to finding the true percentage, not with my experimental error.
* So initially I am clustering the data and then checking if the rest of data is actually falling in the clusters. I tried with just training as well; they gave me the same clusters (so that's positive)
* In my first set of experiments, I used just four of the most popular attributes (which everyone had answered, i.e. no missing data). I have extended it to incorporate more attributes (as I did in my presentation and my wikipage now).
* With better granularity on the data, i.e. more attributes with higher variance, we can get better clusters, since in the experiment with 4 attributes those questions were answered almost the same way by everyone.
Ritika

Thread:Course talk:CPSC522/Improving the accuracy of Affect Prediction in an Intelligent Tutoring System/Suggestions for Improving the accuracy of Affect Prediction in an Intelligent Tutoring System/reply (2)

2016-04-23T01:24:07Z

RitikaJain: Reply to [[Thread:Course talk:CPSC522/Improving the accuracy of Affect Prediction in an Intelligent Tutoring System/Suggestions for Improving the accuracy of Affect Prediction in an Intelligent Tutoring System/reply|Suggestions for Improving the accura...

Thanks for your clarifications Abed. I am still a little uncomfortable with how the user can be bored and curious at the same time and I don't see why they should overlap, but I guess those are just model parameters which you chose to use this way, which is perfectly fine.
As for the links, I am able to go from within your wiki page to the references and vice versa, but I am not able to go to the actual page on the web. For instance, where you have cited the papers, I cannot access those papers from your wiki page. I think the web URL is missing in the references. You could refer to the reference section of my page, 'Analysing online dating trends using Weka', to see how to link to pages outside of the wiki.

Rest all looks good!
Ritika

Thread:Course talk:CPSC522/Analyzing online dating trends with Weka/Suggestions/reply (2)

2016-04-23T01:18:24Z

RitikaJain: Reply to Suggestions

Hi Samprity,
I have made the suggested changes in my page. Thanks so much for your feedback! Let me know if there is anything else that needs modification. 
Thanks,
Ritika

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:15:53Z

RitikaJain: /* Hypothesis */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this [http://www.wired.com/2014/01/how-to-hack-okcupid/ Math geek]<ref name="main" />

===Almost-fake OkCupid account===
To understand how OkCupid works, my first step was to create an almost-fake account for myself. I say "fake" because I am not actively looking to date online, and "almost" because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things, more than just match percentage and last login time, and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see who is viewing my profile. I have the option to browse profiles invisibly, so that people don't know I am looking at their profiles, and I can hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by users themselves, but some initial questions were made by the staff. Questions cover a multitude of topics, such as Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. The importance values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on the first question and 10 on the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid '''multiplies''' the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

Without a correction, two users who had answered just one question in common, each giving the answer the other wanted, would be matched at 100% even though they know practically nothing about each other. To avoid this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, it always publishes the lowest percentage the match could be. In this example, that would be 95% - (1/2)*100% = 45%.
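
The worked example can be reproduced with a short script. This is only a sketch of the calculation as described on this page, not OkCupid's actual implementation: the function names, the tuple layout, and the clamping of the published value at 0% are assumptions made for illustration.

```python
from math import sqrt

# Importance levels and point values from the table on this page.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(questions):
    """Fraction of importance points earned by the other person.

    Each item is (importance, other_answer, accepted_answers), seen from
    the point of view of the person who is being satisfied.
    """
    possible = sum(IMPORTANCE_POINTS[imp] for imp, _, _ in questions)
    earned = sum(
        IMPORTANCE_POINTS[imp]
        for imp, answer, accepted in questions
        if answer in accepted
    )
    # Assumption: if every question is irrelevant, report zero satisfaction.
    return earned / possible if possible else 0.0


def match_percentage(a_view, b_view):
    """Return (calculated match, lowest possible match), both in percent."""
    n_common = len(a_view)  # number of questions answered by both users
    calculated = 100 * sqrt(satisfaction(a_view) * satisfaction(b_view))
    margin = 100 / n_common  # reasonable margin of error, in percent
    # Assumption: the published value is not allowed to drop below 0%.
    return calculated, max(0.0, calculated - margin)


# How B's answers look to A (importance to A, B's answer, answers A accepts).
a_view = [
    ("very important", "Average", {"Average", "Very organized"}),
    ("a little important", "Yes", {"No"}),
]
# How A's answers look to B (importance to B, A's answer, answers B accepts).
b_view = [
    ("a little important", "Very organized", {"Average"}),
    ("somewhat important", "No", {"No"}),
]

calculated, lowest = match_percentage(a_view, b_view)
print(f"calculated: {calculated:.1f}%, lowest: {lowest:.1f}%")
```

For users A and B this gives a calculated match of about 95% and a lowest possible match of about 45%, in line with the figures above.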

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. I could also only see other users' answers to questions that I had answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an '''OkCupid scraping bot''' written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has the questions as its instances, and its attributes are the answer options of each question. Part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose.
The instances to be clustered are the users (through their answers), and the attributes I cluster on (which model their behavioural attributes) are the questions answered by a majority of the population.
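
Because the answers are categorical rather than numeric, the "mean" of a cluster has to be replaced by something else. The sketch below illustrates one common choice, assumed here purely for illustration: the centroid is the most frequent answer per question, and distance is the number of questions on which two profiles differ. It is not Weka's code, and the function names and toy answers are hypothetical.

```python
import random
from collections import Counter


def mismatch_distance(a, b):
    """Number of attributes on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_centroid(members):
    """Most common answer per attribute across the cluster members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def cluster_nominal(instances, k, max_iter=100, seed=0):
    """Lloyd-style loop: assign to the nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(instances, k)  # start from k random instances
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: mismatch_distance(inst, centroids[c]))
            for inst in instances
        ]
        if new_assignment == assignment:
            break  # converged: no instance changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [i for i, a in zip(instances, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mode_centroid(members)
    return centroids, assignment


# Toy answers to three questions (made up for illustration).
answers = [
    ("No problem", "Red", "Yes"),
    ("No problem", "Red", "Yes"),
    ("No problem", "Red", "No"),
    ("Never-get a job", "Rosé", "No"),
    ("Never-get a job", "Rosé", "No"),
    ("Never-get a job", "White", "No"),
]
centroids, assignment = cluster_nominal(answers, k=2)
print(centroids)
print(assignment)
```

Each centroid is itself a complete set of answers, which is why it can be read as the representative profile of its cluster.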

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to their attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062).
The preprocessing of the data is shown below:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
* Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka, where I choose the attributes, is shown below:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
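
Weka performs the percentage split internally. As a rough illustration of what an 80-20% split means, the following sketch shuffles the instances and cuts the list at 80%. The function name and the fixed seed are assumptions for this example and do not reflect Weka's internals.

```python
import random


def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder instances: 80 go to training, 20 to testing.
train, test = percentage_split(list(range(100)))
print(len(train), len(test))
```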

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch, food stamps, etc.)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal, such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red (such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine.
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe...

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? Yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
The clustering visualization is shown below. Cross-validation gives us k = 7 clusters, as shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]
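
The attribute filter described above (keep attributes with less than 34% missing data) can be sketched as follows. This is an illustrative reimplementation, not the preprocessing actually done in Weka. Representing a missing answer as <code>None</code>, the function names, and the toy rows are all assumptions.

```python
def missing_fraction(rows, attribute):
    """Share of rows in which the attribute is unanswered (None or absent)."""
    return sum(row.get(attribute) is None for row in rows) / len(rows)


def select_attributes(rows, attributes, max_missing=0.34):
    """Keep attributes whose missing share is strictly below max_missing."""
    return [a for a in attributes if missing_fraction(rows, a) < max_missing]


# Toy user rows (made up for illustration); None marks an unanswered question.
rows = [
    {"q35": "Love", "q41": None, "q48": "Weird"},
    {"q35": "Love", "q41": None, "q48": None},
    {"q35": "Sex", "q41": "Not important", "q48": "Normal"},
    {"q35": None, "q41": None, "q48": "Weird"},
]
print(select_attributes(rows, ["q35", "q41", "q48"]))
```

In this toy data q35 and q48 are each missing in 25% of the rows and are kept, while q41 is missing in 75% of the rows and is dropped.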

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.

The match percentage would be 100% if you answered exactly this and it is also what they wanted in their answers, which is likely to be the same as what they answered themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% ≈ 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers would get a user a match rate higher than 90% for that cluster.
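
The arithmetic in this section follows one rule: the published match is the calculated match minus 100%/n, where n is the number of questions in common. The sketch below applies that rule to the three cases discussed (4, 10 and 31 questions). The function names are hypothetical, and a calculated match of 100% is assumed throughout, as in the text.

```python
import math


def lowest_match(calculated, n_common):
    """Published match: calculated match minus the margin of error (percent)."""
    return calculated - 100 / n_common


def questions_needed(target, calculated=100.0):
    """Smallest n for which calculated - 100/n reaches the target.

    Assumes calculated > target; otherwise the target cannot be reached.
    """
    return math.ceil(100 / (calculated - target))


for n in (4, 10, 31):
    print(n, round(lowest_match(100.0, n), 2))
print(questions_needed(90.0))
```

With a perfect calculated match, 4 common questions give 75%, 10 give 90%, and 31 give about 96.8%, so 10 is the smallest number of questions that reaches the 90% target.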

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] about developing the problem as a recommendation-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in a country-based comparison of online-dating trends, to analyze how culture affects online dating (concerning factors like religion, marriage, food, etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:15:40Z

RitikaJain: /* Hypothesis */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this [http://www.wired.com/2014/01/how-to-hack-okcupid/ Math geek]<ref name="main" />

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid lets you filter matches by the preferred age, orientation, and location of the people you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out that I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.
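
As a worked check of the 98.5% figure, assume a perfect calculated match of 100% and the margin of error of one over the number of questions answered in common, as described in the matching algorithm section. With <math>n = 67</math> questions:

<math>100\% - \frac{100\%}{n} = 100\% - \frac{100\%}{67} \approx 100\% - 1.5\% = 98.5\%</math>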

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfies the other person's preferences, OkCupid uses these point values to weight the calculation. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us see how this works with the help of an example.
We consider two users, A and B, and compute the match percentage between them.
Suppose both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on the first question and 10 on the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid '''multiplies''' the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

Without a correction, two users who had answered just one question in common, each giving the answer the other wanted, would be matched at 100% even though they know practically nothing about each other. To avoid this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, it always publishes the lowest percentage the match could be. In this example, that would be 95% - (1/2)*100% = 45%.

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. I could also only see other users' answers to questions that I had answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an '''OkCupid scraping bot''' written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has the questions as its instances, and its attributes are the answer options of each question. Part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose.
The instances to be clustered are the users (through their answers), and the attributes I cluster on (which model their behavioural attributes) are the questions answered by a majority of the population.

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to their attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062).
The preprocessing of the data is shown below:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
* Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka, where I choose the attributes, is shown below:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch, food stamps, etc.)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal, such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red (such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine.
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe...

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? Yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
The clustering visualization is shown below. Cross-validation gives us k = 7 clusters, as shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which holds 79% of the instances. Its centroid answers 'Never-get a job' to q34113, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the centroids of clusters 0, 1 and 2, which together hold 97% of the instances, answer 'Do you like watching foreign movies with subtitles?' with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers the other person wanted, which is likely because people tend to prefer the options they chose themselves. Since my model here uses only the 4 most popular attributes, assume that you answer only these 4 questions. Your true match percentage is then calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match would be 100% - 25% = '''75%'''.

To increase the minimum match rate to 90% (as stated in the hypothesis), I would need to select at least the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then at least 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers would give a user a match rate higher than 90% for that cluster.

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (in factors such as religion, marriage and food).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:15:05Z

RitikaJain: /* Examples of matches */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things—more than just match percentage and last login time—and show some great matches that might not otherwise have shown up. I however sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the options to browse invisibly at profiles, so they don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of a high match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 100% - 100%/67 = 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, makes this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. OkCupid looks at how well each user's answers satisfied the other's preferences, using these point values as weights. Your match percentage with the other person is found by answering the following two questions:

a) How happy did the other person's answers make you?

b) How happy did your answers make the other person?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|somewhat important
|}

'''How much did B's answer make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answer make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid '''multiplies''' the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

If A and B had answered only one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To rule this out, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It holds that '''True Match = Calculated Match +/- Reasonable Margin of Error'''. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage the match can be. In this example, that would be 95% - (1/2)*100% = 45%.
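
The worked example can be reproduced with a short Python sketch. It follows the description on this page rather than OkCupid's actual implementation, and the function names and dictionary layout are hypothetical choices made for illustration.

```python
from math import sqrt

# Importance levels and their point values, as listed in the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(rater, other):
    """Fraction of the rater's importance points earned by the other's answers.

    `rater` maps question -> (set of accepted answers, importance level);
    `other` maps question -> the other user's own answer.
    Only questions answered by both users are counted.
    """
    earned = possible = 0
    for question, (accepted, importance) in rater.items():
        if question not in other:
            continue
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other[question] in accepted:
            earned += points
    return earned / possible if possible else 0.0


def match_percentage(sat_a, sat_b, common_questions):
    """Return (calculated match, lowest possible match) as fractions."""
    calculated = sqrt(sat_a * sat_b)      # geometric mean of both satisfactions
    margin = 1.0 / common_questions       # reasonable margin of error
    return calculated, max(0.0, calculated - margin)


# The two-question example from the text (hypothetical data layout).
a_prefs = {"messy": ({"Average", "Very organized"}, "very important"),
           "cheated": ({"No"}, "a little important")}
b_prefs = {"messy": ({"Average"}, "a little important"),
           "cheated": ({"No"}, "somewhat important")}
a_answers = {"messy": "Very organized", "cheated": "No"}
b_answers = {"messy": "Average", "cheated": "Yes"}

sat_a = satisfaction(a_prefs, b_answers)  # how happy B's answers make A: 250/251
sat_b = satisfaction(b_prefs, a_answers)  # how happy A's answers make B: 10/11
calculated, lowest = match_percentage(sat_a, sat_b, common_questions=2)
print(f"calculated: {calculated:.1%}, lowest possible: {lowest:.1%}")
```

Because the geometric mean is used, a low satisfaction on either side pulls the match down more than an arithmetic mean would.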

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. I can also only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts to reliably extract questions and answers from other users' profiles, I came across a GitHub project containing an '''OkCupid scraping bot''' written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster are the answers of users and the attributes I am clustering on (i.e. trying to model their behavioural attributes) are the questions answered by a majority of the population.
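
To make the assign-and-update loop concrete, here is a minimal Python sketch for nominal (categorical) answers. It is an illustration only, not Weka's implementation: it assumes the distance between a profile and a centroid is the number of attributes on which they differ, and that a centroid is the most common answer per attribute. The function names and toy data are hypothetical.

```python
import random
from collections import Counter


def mismatch_distance(profile, centroid):
    """Count the attributes on which a profile and a centroid disagree."""
    return sum(1 for a, c in zip(profile, centroid) if a != c)


def modal_centroid(profiles):
    """Build a centroid from the most common answer for each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*profiles))


def assign(profiles, centroids):
    """Assignment step: index of the nearest centroid for every profile."""
    return [
        min(range(len(centroids)), key=lambda j: mismatch_distance(p, centroids[j]))
        for p in profiles
    ]


def cluster_profiles(profiles, k, iterations=10, seed=0):
    """Cluster answer tuples into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct profiles chosen at random.
    centroids = rng.sample(sorted(set(profiles)), k)
    for _ in range(iterations):
        assignments = assign(profiles, centroids)
        # Update step: recompute each centroid from its current members.
        updated = []
        for j in range(k):
            members = [p for p, a in zip(profiles, assignments) if a == j]
            updated.append(modal_centroid(members) if members else centroids[j])
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids, assign(profiles, centroids)


# Toy data: (answer to a yes/no question, preferred wine).
profiles = [("Yes", "Red"), ("Yes", "Red"), ("Yes", "White"),
            ("No", "White"), ("No", "White"), ("No", "Red")]
centroids, assignments = cluster_profiles(profiles, k=2)
print(centroids, assignments)
```

Each centroid is itself a tuple of answers, which is why it can be read as the representative profile of its cluster.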

====Training====
Since there are no actual labels we can use, clustering these instances(user profiles) according to attributes(questions) is actually an unsupervised learning task.
I experimented with density based clustering and K-means clustering. K-means clustering gave me better outcomes.
On Weka, the process can be shown as following:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
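
The 80-20% split can be sketched in a few lines of Python. This only illustrates the idea. Weka performs the split internally and may order or randomize instances differently, and the function name and seed here are hypothetical.

```python
import random


def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle the instances and split them into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder instance ids stand in for user profiles.
train, test = percentage_split(range(100))
print(len(train), len(test))
```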

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
The clustering visualization is shown below. Cross-validation gives k = 7.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which holds 79% of the instances. Its centroid answers 'Never-get a job' to q34113, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the centroids of clusters 0, 1 and 2, which together hold 97% of the instances, answer 'Do you like watching foreign movies with subtitles?' with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers the other person wanted, which is likely because people tend to prefer the options they chose themselves. Since my model here uses only the 4 most popular attributes, assume that you answer only these 4 questions. Your true match percentage is then calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match would be 100% - 25% = '''75%'''.

To increase the minimum match rate to 90% (as stated in the hypothesis), I would need to select at least the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then at least 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers would give a user a match rate higher than 90% for that cluster.
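
The arithmetic behind these figures can be checked with a small Python sketch. It assumes, as in the text, a calculated match of 100% and a margin of error of 100% divided by the number of questions in common. The function names are hypothetical.

```python
from math import ceil


def lowest_match(common_questions, calculated=100.0):
    """Lowest possible true match (in percent) for a given calculated match."""
    return calculated - 100.0 / common_questions


def questions_needed(target, calculated=100.0):
    """Smallest number of common questions whose lowest match reaches `target`."""
    if target >= calculated:
        raise ValueError("target must be below the calculated match")
    return ceil(100.0 / (calculated - target))


for n in (4, 10, 31):
    print(n, round(lowest_match(n), 2))
print(questions_needed(90))
```

With 4 questions the bound is 75%, with 10 it reaches the 90% target, and with 31 it is about 96.8%.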

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (in factors such as religion, marriage and food).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:14:23Z

RitikaJain: /* Crawling data from OkCupid's website */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things—more than just match percentage and last login time—and show some great matches that might not otherwise have shown up. I however sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This agrees with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who view my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. A snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going to it directly or by finding it in another user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions on my profile, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to, and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, makes this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfy the other's preferences. Your match percentage with another person is computed by answering the following two questions:

:a) How happy did the other person's answers make you?

:b) How happy did your answers make the other person?


===Examples of matches===
Let us understand how this works with the help of an example.
We consider two users, A and B, and compute the match percentage between them.
Suppose both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root (i.e. the geometric mean): sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only one question in common and satisfied each other on it, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 95% - (1/2)*100% = 45%.
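
The worked example above can be reproduced with a few lines of Python. This is only a sketch of the calculation as described on this page, not OkCupid's actual code: the function names, the dictionary layout, and the choice to clamp the lowest match at 0% are my own assumptions.

```python
import math

# Importance points from the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(preferences, other_answers, shared):
    """Share of importance points that the other user's answers earn.

    preferences:   {question: (set of acceptable answers, importance level)}
    other_answers: {question: answer given by the other user}
    shared:        questions answered by both users
    """
    earned = possible = 0
    for question in shared:
        acceptable, importance = preferences[question]
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0


def match_percentages(prefs_a, answers_a, prefs_b, answers_b):
    """Return (calculated match, lowest possible match), both in percent."""
    shared = set(prefs_a) & set(prefs_b) & set(answers_a) & set(answers_b)
    if not shared:
        return 0.0, 0.0
    sat_a = satisfaction(prefs_a, answers_b, shared)  # how happy B makes A
    sat_b = satisfaction(prefs_b, answers_a, shared)  # how happy A makes B
    calculated = 100.0 * math.sqrt(sat_a * sat_b)     # geometric mean
    margin = 100.0 / len(shared)                      # reasonable margin of error
    # Clamping at 0 is an assumption; the page only gives the subtraction.
    return calculated, max(0.0, calculated - margin)


# Users A and B from the example above (question keys are made up).
prefs_a = {"messy": ({"Average", "Very organized"}, "very important"),
           "cheated": ({"No"}, "a little important")}
answers_a = {"messy": "Very organized", "cheated": "No"}
prefs_b = {"messy": ({"Average"}, "a little important"),
           "cheated": ({"No"}, "somewhat important")}
answers_b = {"messy": "Average", "cheated": "Yes"}

calculated, lowest = match_percentages(prefs_a, answers_a, prefs_b, answers_b)
print(f"calculated: {calculated:.1f}%, lowest possible: {lowest:.1f}%")
```

With the two questions above this gives a calculated match of about 95% and a lowest possible match of about 45%, in line with the numbers worked out by hand.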

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. In addition, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an '''OkCupid scraping bot''' written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a profile that represents each segment well, K-means serves my purpose.
The instances (the cases I need to cluster) are the users' answers, and the attributes I am clustering on (which model their behavioural attributes) are the questions answered by a majority of the population.
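
To make the assign-and-update loop concrete for categorical answers, here is a minimal Python sketch. It is an illustration under my own assumptions, not Weka's implementation: I take the per-question most frequent answer as the cluster centre and count differing answers as the distance, and the answer strings and starting centres are made up.

```python
from collections import Counter


def mismatches(a, b):
    """Number of questions on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_centroid(rows):
    """Most frequent answer per question among the given rows."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def assign(rows, centroids):
    """Index of the closest centroid for every row."""
    return [min(range(len(centroids)), key=lambda k: mismatches(row, centroids[k]))
            for row in rows]


def cluster_answers(rows, initial_centroids, max_iterations=20):
    """Alternate assignment and centre updates until the centres stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign(rows, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [row for row, label in zip(rows, labels) if label == k]
            # An empty cluster keeps its previous centre.
            updated.append(mode_centroid(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return assign(rows, centroids), centroids


# Made-up answers of five users to three questions.
rows = [
    ("Never", "Rose", "Subtitle"),
    ("Never", "Rose", "Subtitle"),
    ("Never", "Red", "Subtitle"),
    ("No problem", "Red", "Yes"),
    ("No problem", "Red", "Yes"),
]
labels, centroids = cluster_answers(rows, initial_centroids=[rows[0], rows[3]])
print(labels)
print(centroids)
```

Each centre that comes out of this loop is itself a complete set of answers, which is what makes it usable as the representative profile of its cluster.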

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

*Number of iterations: 2
*Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
The clustering visualization is shown below. Cross-validation gives k = 7 clusters.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because in clusters 0, 1 and 2, which together hold 97% of the instances, the centroid answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be what they answered themselves (people tend to like options they agree with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers for each cluster would give a user a match rate higher than 90% for that cluster.
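
The arithmetic behind these match rates can be checked with a short Python sketch. The function names are my own, and the formula is simply the margin-of-error rule described on this page (lowest match = calculated match minus 100% divided by the number of questions in common), assuming a calculated match of 100%.

```python
import math


def lowest_match(calculated_match, shared_questions):
    """Lowest published match (percent) for a given calculated match."""
    margin = 100.0 / shared_questions  # reasonable margin of error
    return max(0.0, calculated_match - margin)


def questions_needed(target, calculated_match=100.0):
    """Smallest number of shared questions whose lowest match reaches target."""
    gap = calculated_match - target
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return math.ceil(100.0 / gap)


for n in (4, 10, 31):
    print(n, "questions in common ->", round(lowest_match(100.0, n), 2), "%")
print("questions needed for a 90% match:", questions_needed(90.0))
```

With 4 questions the bound is 75%, with 10 it reaches 90%, and with 31 it is roughly 96.8%, so the number of shared questions alone decides how high the published match can go when the calculated match is perfect.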

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into formulating the problem as a recommender systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in a country-based comparison of online dating trends, to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:13:12Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner on which they cannot compromise, and then, for each cluster, to construct an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset, using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This agrees with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who view my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. A snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going to it directly or by finding it in another user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions on my profile, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to, and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, makes this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfy the other's preferences. Your match percentage with another person is computed by answering the following two questions:

:a) How happy did the other person's answers make you?

:b) How happy did your answers make the other person?


===Examples of matches===
Let us understand how this works with the help of an example.
We consider two users, A and B, and compute the match percentage between them.
Suppose both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root (i.e. the geometric mean): sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only one question in common and satisfied each other on it, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 95% - (1/2)*100% = 45%.

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. In addition, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose.
The instances to be clustered are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster 0
|79%
|-
!Cluster 1
|11%
|-
!Cluster 2
|7%
|-
!Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes that have less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? Yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
The cluster visualization is shown below. Cross-validation gives k = 7.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population (79%) feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour: in clusters 0, 1 and 2, which together hold 97% of the instances, the centroid answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions, and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these are also the answers they want from a partner, which is likely because people tend to prefer the options they chose themselves. Since I have chosen only the 4 most popular attributes for my model here, assume that you answer only these 4 questions. Your true match percentage is then calculated as: '''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match is 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to each cluster's centroid answers would give a user a match rate higher than 90% for that cluster.

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food, etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:12:42Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and for each cluster we can find a representative profile that broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner that they cannot compromise on, and then, for each cluster, to construct an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. Profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by the users themselves, but some initial questions were made by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of getting a higher match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 98.5%, i.e. 100% - 100%/67 (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These point values are used as weights when looking at how well each person's answers satisfy the other's preferences. Your match percentage with the other person is figured by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose both A and B have answered the following two questions.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

If A and B had both answered just one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To rule this out, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both), and takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 95% - (1/2)*100% = 45%.
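The worked example above can be reproduced with a short Python sketch. This is only an illustration of the calculation as described on this page, not OkCupid's actual code: the function names, the data layout and the convention of returning 0 when no importance points are at stake are all assumptions made here.

```python
import math

# Importance levels and their point values, as listed in the table above.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}


def satisfaction(preferences, other_answers):
    """Fraction of importance points that other_answers earn.

    preferences: one (acceptable_answers, importance) pair per question.
    other_answers: the other user's answer to each question, same order.
    """
    possible = 0
    earned = 0
    for (acceptable, importance), answer in zip(preferences, other_answers):
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if answer in acceptable:
            earned += points
    # Assumed convention: with no points at stake, report zero satisfaction.
    return earned / possible if possible else 0.0


def match_percentage(sat_a, sat_b, questions_in_common):
    """Return (calculated match, lowest possible match) as fractions."""
    calculated = math.sqrt(sat_a * sat_b)
    margin = 1.0 / questions_in_common
    return calculated, max(0.0, calculated - margin)


# The two questions from the example: messiness first, then cheating.
a_prefs = [({"Average", "Very organized"}, "Very important"),
           ({"No"}, "A little important")]
b_prefs = [({"Average"}, "A little important"),
           ({"No"}, "Somewhat important")]
a_answers = ["Very organized", "No"]
b_answers = ["Average", "Yes"]

sat_a = satisfaction(a_prefs, b_answers)  # how happy B's answers make A
sat_b = satisfaction(b_prefs, a_answers)  # how happy A's answers make B
calculated, lowest = match_percentage(sat_a, sat_b, questions_in_common=2)
print(f"{sat_a:.3f} {sat_b:.3f} {calculated:.3f} {lowest:.3f}")
```

Taking the square root of the product (a geometric mean) means that a low satisfaction on either side pulls the match down, which fits the "vice versa" nature of the matching described earlier.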

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into two main issues. First, OkCupid maintains a dynamically changing HTML structure, so I could not extract elements statically by referencing them through the HTML hierarchy. Second, I can only see another user's answers to questions that I have answered myself; to work around this, I built an automatic form-filling bot, which worked but had a few limitations. After several failed attempts to reliably extract questions and answers from other users' profiles, I came across a GitHub project that provides an OkCupid scraping bot written in Python on top of Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would have taken around 10 computers running the bots in parallel. Because of time and resource constraints, I tried running them on different systems with different fake profiles and then decided against it. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose.
The instances to be clustered are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
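Because the answers are categorical rather than numeric, the "mean" of a cluster needs some adaptation. The sketch below shows one common adaptation, assumed here purely for illustration and not taken from Weka's source: the centroid is the most frequent answer per question, and the distance is the number of questions on which two answer tuples disagree. The function names and the toy rows are made up.

```python
import random
from collections import Counter


def mismatch_distance(row, centroid):
    """Number of attributes on which row and centroid disagree."""
    return sum(1 for a, b in zip(row, centroid) if a != b)


def mode_centroid(rows):
    """Most frequent value of each attribute across rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def cluster_nominal(rows, k, initial=None, max_iterations=20, seed=0):
    """Alternate between assigning rows and recomputing centroids."""
    if initial is not None:
        centroids = list(initial)
    else:
        centroids = random.Random(seed).sample(rows, k)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the per-attribute mode.
        new_centroids = []
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mode_centroid(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Made-up answer tuples (one tuple per user, one entry per question).
rows = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
centroids, assignments = cluster_nominal(rows, k=2, initial=[rows[0], rows[3]])
print(centroids, assignments)
```

Each returned centroid can be read as the representative profile of its segment, which is the role the cluster centroids play in the rest of this page.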

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster 0
|79%
|-
!Cluster 1
|11%
|-
!Cluster 2
|7%
|-
!Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes that have less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? Yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
The cluster visualization is shown below. Cross-validation gives k = 7.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]
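The attribute filter described at the start of this subsection (keep attributes with less than 34% missing data) could be sketched as follows. In this project the selection was done in Weka; the Python below is only an illustrative equivalent, and the function names, the dictionary layout and the sample rows are assumptions.

```python
def missing_fraction(rows, attribute):
    """Share of rows in which the attribute is absent or None."""
    missing = sum(1 for row in rows if row.get(attribute) is None)
    return missing / len(rows)


def select_attributes(rows, attributes, max_missing=0.34):
    """Keep attributes whose missing share is strictly below max_missing."""
    return [a for a in attributes if missing_fraction(rows, a) < max_missing]


# Made-up rows for illustration only; None marks an unanswered question.
rows = [
    {"q35": "Love", "q41": "Not important", "q79": None},
    {"q35": "Love", "q41": None, "q79": None},
    {"q35": "Sex", "q41": "Not important", "q79": "Never"},
    {"q35": "Love", "q41": "Somewhat important", "q79": None},
]
kept = select_attributes(rows, ["q35", "q41", "q79"])
print(kept)
```

Filtering on missing data before clustering keeps the centroids from being dominated by the "missing" value, which still appears for some attributes in the comparison table below.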

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population (79%) feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour: in clusters 0, 1 and 2, which together hold 97% of the instances, the centroid answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions, and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these are also the answers they want from a partner, which is likely because people tend to prefer the options they chose themselves. Since I have chosen only the 4 most popular attributes for my model here, assume that you answer only these 4 questions. Your true match percentage is then calculated as follows:
'''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match is 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to each cluster's centroid answers would give a user a match rate higher than 90% for that cluster.
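The margin-of-error arithmetic used in this section can be checked with a few lines of Python. This is a minimal sketch under the assumption stated above that the calculated match is 100% and the margin is 100%/n for n questions in common; both helper names are made up for illustration.

```python
import math


def lowest_true_match(calculated_percent, questions_in_common):
    """Calculated match minus the 100/n margin of error, floored at 0."""
    margin = 100.0 / questions_in_common
    return max(0.0, calculated_percent - margin)


def questions_needed(target_percent, calculated_percent=100.0):
    """Smallest n for which calculated - 100/n reaches the target."""
    gap = calculated_percent - target_percent
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return math.ceil(100.0 / gap)


# The three attribute counts discussed in this section.
for n in (4, 10, 31):
    print(n, round(lowest_true_match(100.0, n), 2))
print(questions_needed(90.0))
```

Note that the bound depends only on the number of questions in common, so it describes the best case for a profile that matches the cluster centroid exactly, not a measured match rate.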

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in a country-based comparison of online-dating trends, to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:12:03Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2,620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, a population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and a representative profile can be found that broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the "Order By" dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's "Special Blend" considers many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of a higher match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I give acceptable answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

: a) Your answer
: b) How you'd like someone else to answer
: c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used to weight how well each person's answers satisfy the other's preferences. Your match percentage with the other person is found by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So A placed 250 importance points on the first question and 1 point on the second question. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfaction scores and takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
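
The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch of the calculation as described on this page, not OkCupid's actual code; the function names, the dictionary layout and the short question keys (<code>messy</code>, <code>cheated</code>) are assumptions made for illustration.

```python
from math import sqrt

# Importance weights from the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(judge, other, common):
    """Fraction of the judge's importance points earned by the other user's answers."""
    earned = possible = 0
    for question in common:
        _, acceptable, importance = judge[question]
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other[question][0] in acceptable:
            earned += points
    return earned / possible if possible else 0.0


def match_percentage(user_a, user_b):
    """Return (calculated match, lowest published match), both in percent."""
    common = user_a.keys() & user_b.keys()
    if not common:
        return 0.0, 0.0
    sat_a = satisfaction(user_a, user_b, common)  # how happy B's answers make A
    sat_b = satisfaction(user_b, user_a, common)  # how happy A's answers make B
    calculated = 100 * sqrt(sat_a * sat_b)
    margin = 100 / len(common)  # reasonable margin of error
    return calculated, max(calculated - margin, 0.0)


# Each entry: question -> (own answer, acceptable answers, importance level).
user_a = {
    "messy": ("Very organized", {"Average", "Very organized"}, "very important"),
    "cheated": ("No", {"No"}, "a little important"),
}
user_b = {
    "messy": ("Average", {"Average"}, "a little important"),
    "cheated": ("Yes", {"No"}, "somewhat important"),
}

calculated, lowest = match_percentage(user_a, user_b)
print(round(calculated), round(lowest))  # prints: 95 45
```

Only questions answered by both users are counted, which is why the margin of error uses the size of the common set.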

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
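
To make the assign-and-update loop concrete for categorical answers like the ones used here, the following is a minimal sketch and not Weka's implementation. It assumes that the distance between two users is the number of questions on which their answers differ and that a cluster's representative is the most common answer per question; the function names and the toy rows are hypothetical.

```python
from collections import Counter


def mismatch_distance(row, centroid):
    """Number of attributes on which two answer vectors differ."""
    return sum(a != b for a, b in zip(row, centroid))


def update_centroid(rows):
    """Most common answer per attribute among the rows of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def cluster_nominal(rows, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: mismatch_distance(row, centroids[k]))
            for row in rows
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [row for row, c in zip(rows, assignment) if c == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = update_centroid(members)
    return assignment, centroids


# Toy data (not taken from the real dataset): three answers per user.
rows = [
    ("Never-get a job", "White", "Can't answer without a subtitle"),
    ("Never-get a job", "White", "Can't answer without a subtitle"),
    ("Never-get a job", "Red", "Can't answer without a subtitle"),
    ("No problem", "Red", "Yes"),
    ("No problem", "Red", "Yes"),
    ("No problem", "White", "Yes"),
]
assignment, centroids = cluster_nominal(rows, [rows[0], rows[3]])
print(assignment)  # prints: [0, 0, 0, 1, 1, 1]
```

The initial centroids are passed in explicitly so that the run is deterministic; in practice they would typically be chosen at random.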

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

* Number of iterations: 2
* Within-cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
'''q41:''' How important is religion/God in your life?
'''q46:''' Would you prefer good things happened or interesting?
'''q48:''' Which would you rather be? Normal or weird?
'''q49:''' Which word describes you better? Carefree or intense?
'''q70:''' Do you think homosexuality is a sin? yes or no?
'''q77:''' How frequently do you drink alcohol?
'''q79:''' What's your relationship with marijuana?
'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
'''q403:''' Do you enjoy discussing politics?
'''q501:''' Have you smoked cigarette in the last six months?
'''q553:''' Do spelling mistakes annoy you?
'''q997:''' Are you a cat person or a dog person?
'''q1440:''' Is jealousy healthy in a relationship?
'''q1597:''' Would you consider sleeping with someone on the first date?
'''q4018:''' Are you happy with your life?
'''q9688:''' Could you date someone who does drugs?
'''q16053:''' How willing are you to meet someone from OkCupid?
'''q34113:''' How do you feel about government-subsidized food programs?
'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
'''q85419:''' Which type of wine would you like to drink outside of a meal(such as for leisure)?
'''q179268:''' Are you either vegetarian or vegan?
'''q358077:''' Could you date someone who was really messy?
'''q358084:''' Do you enjoy intense intellectual conversations?
'''q416235:''' Do you like watching foreign movies with subtitles?
'''d_gender:''' Man, woman, transgender, transfeminine, agender
'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
'''q20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
Cross-validation gives k = 7 clusters. The cluster visualization is shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these are also the answers the other person wants, which is likely because people tend to want the answers they gave themselves. Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
'''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest true match would be 100% - 25% = '''75%'''.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers of a cluster would give a user a match rate higher than 90% for that cluster.
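
The relationship between the number of common questions and the lowest published match can be checked with a short Python sketch. It assumes, as in the argument above, that the margin of error is 100%/n for n questions in common and that the calculated match is 100%; the function names are hypothetical.

```python
def lowest_published_match(calculated_match, n_common):
    """Calculated match minus the margin of error 100/n, floored at 0 (in percent)."""
    return max(calculated_match - 100 / n_common, 0.0)


def min_common_questions(target, calculated_match=100.0):
    """Smallest number of common questions whose lowest match reaches the target."""
    if target >= calculated_match:
        raise ValueError("target must be below the calculated match")
    n = 1
    while lowest_published_match(calculated_match, n) < target:
        n += 1
    return n


print(lowest_published_match(100, 4))  # prints: 75.0
print(min_common_questions(90))        # prints: 10
print(round(lowest_published_match(100, 31), 2))
```

With a calculated match below 100%, more common questions are needed to reach the same target, which <code>min_common_questions</code> accounts for through its second argument.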

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in a country-based comparison of online-dating trends, to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:11:25Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2,620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, a population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and a representative profile can be found that broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the "Order By" dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's "Special Blend" considers many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of a higher match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I give acceptable answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

: a) Your answer
: b) How you'd like someone else to answer
: c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used to weight how well each person's answers satisfy the other's preferences. Your match percentage with the other person is found by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answer make A happy?'''
A marked the first question as very important and the second as only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted; B's answer to the second question ("Yes") was not what A wanted and earned nothing. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answer make B happy?'''
B placed 1 importance point on the first question and 10 on the second. A's answer to the first question ("Very organized") was not the one B wanted, but A's answer to the second ("No") was, so A earned 10 of the 11 possible points. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfaction scores and takes the square root (the geometric mean): sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only a single question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It holds that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, they always publish the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
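
The calculation above can be sketched in a few lines of Python. This is only an illustration of the procedure as described on this page, not OkCupid's actual code; the function names and the dictionary layout (question -> (own answer, acceptable answers, importance)) are assumptions made for the example.

```python
import math

# Point values from the importance table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}

def satisfaction(rater, other):
    """Fraction of the rater's importance points earned by the other's answers."""
    earned = possible = 0
    for question in rater.keys() & other.keys():
        _, acceptable, importance = rater[question]
        other_answer = other[question][0]
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answer in acceptable:
            earned += points
    return earned / possible if possible else 0.0

def calculated_match(a, b):
    """Geometric mean of the two satisfaction scores."""
    return math.sqrt(satisfaction(a, b) * satisfaction(b, a))

def published_match(a, b):
    """Calculated match minus the 1/n margin of error (n = common questions)."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    return max(0.0, calculated_match(a, b) - 1.0 / len(common))

# Users A and B from the worked example (hypothetical question keys).
user_a = {
    "messy": ("Very organized", {"Average", "Very organized"}, "very important"),
    "cheated": ("No", {"No"}, "a little important"),
}
user_b = {
    "messy": ("Average", {"Average"}, "a little important"),
    "cheated": ("Yes", {"No"}, "somewhat important"),
}

print(round(100 * calculated_match(user_a, user_b)))  # about 95
print(round(100 * published_match(user_a, user_b)))   # about 45
```

Only questions answered by both users enter the calculation, which is why the margin of error shrinks as the number of common questions grows.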

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I created an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project which had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
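
To make the assign-and-update loop concrete, here is a minimal sketch on categorical answers. It is not Weka's implementation: as an assumption for this example, the distance is the number of differing answers and each centroid is the per-question most common answer (a k-modes style variant of K-means). The toy rows and the function names are hypothetical.

```python
from collections import Counter

def mismatches(a, b):
    """Number of questions on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_answers(rows, k, max_iter=100):
    # Deterministic initialisation: the first k distinct rows are the centroids.
    centroids = []
    for row in rows:
        if row not in centroids:
            centroids.append(row)
        if len(centroids) == k:
            break
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each row joins its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: mismatches(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged: no row changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the per-question most common answer.
        for c in range(len(centroids)):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
    return centroids, assignment

# Hypothetical toy data: (food programs, wine, subtitles) answers of six users.
rows = [
    ("Never", "Rose", "Yes"),
    ("No problem", "Red", "No"),
    ("Never", "Rose", "No"),
    ("No problem", "Red", "Yes"),
    ("Never", "Rose", "Yes"),
    ("No problem", "Red", "No"),
]
centroids, assignment = cluster_answers(rows, k=2)
print(centroids)
print(assignment)
```

Each resulting centroid plays the role of the representative profile for its cluster.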

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
'''q41:''' How important is religion/God in your life?
'''q46:''' Would you prefer good things happened or interesting?
'''q48:''' Which would you rather be? Normal or weird?
'''q49:''' Which word describes you better? Carefree or intense?
'''q70:''' Do you think homosexuality is a sin? yes or no?
'''q77:''' How frequently do you drink alcohol?
'''q79:''' What's your relationship with marijuana?
'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
'''q403:''' Do you enjoy discussing politics?
'''q501:''' Have you smoked cigarette in the last six months?
'''q553:''' Do spelling mistakes annoy you?
'''q997:''' Are you a cat person or a dog person?
'''q1440:''' Is jealousy healthy in a relationship?
'''q1597:''' Would you consider sleeping with someone on the first date?
'''q4018:''' Are you happy with your life?
'''q9688:''' Could you date someone who does drugs?
'''q16053:''' How willing are you to meet someone from OkCupid?
'''q34113:''' How do you feel about government-subsidized food programs?
'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
'''q85419:''' Which type of wine would you like to drink outside of a meal(such as for leisure)?
'''q179268:''' Are you either vegetarian or vegan?
'''q358077:''' Could you date someone who was really messy?
'''q358084:''' Do you enjoy intense intellectual conversations?
'''q416235:''' Do you like watching foreign movies with subtitles?
'''d_gender:''' Man, woman, transgender, transfeminine, agender
'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
'''q20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?

The clustering visualization is shown below. Cross-validation gives k = 7. 
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which is to say that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the question "Do you like watching foreign movies with subtitles?" is almost unanimously answered with "Can't answer without a subtitle".
Broadly, it seems safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.
To get higher match rates: firstly, it is important to answer the most popular questions and secondly, it is crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as "Never-get a job", "Rosé (such as White Zinfindel)", "Can't answer without a subtitle" and "The best? Maybe..." respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and these are also the answers they want from a partner, which is likely, since people tend to prefer the options they themselves agree with. Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:

'''True Match = Calculated Match +/- Reasonable Margin of Error.'''

Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.

Therefore the true match would actually be 100% - 25% = '''75%'''.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%. 
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions based on the mean cluster answers for each cluster would get a user a match rate higher than 90% for that cluster.
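
The arithmetic above follows directly from the 1/n margin of error and can be checked with a short sketch. The function names are hypothetical and percentages are passed as plain numbers; exact fractions are used for the question count to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction

def published_lower_bound(calculated_percent, common_questions):
    """Lowest possible match: calculated match minus the 100%/n margin of error."""
    return max(0.0, calculated_percent - 100.0 / common_questions)

def questions_needed(target_percent, calculated_percent=100):
    """Smallest n for which calculated_percent - 100/n reaches target_percent."""
    gap = Fraction(calculated_percent) - Fraction(target_percent)
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return math.ceil(Fraction(100) / gap)

print(published_lower_bound(100, 4))             # 4 common questions
print(round(published_lower_bound(100, 31), 2))  # 31 common questions
print(questions_needed(90))                      # questions needed for a 90% floor
```

The bound only holds if every common question is answered the way the other person wants, i.e. if the calculated match really is 100%.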

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:10:54Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that broadly speaking, population can be segmented into clusters based on their behavioural attributes (which in this project are accessed using OkCupid questions and answers) and we can find a representative profile which broadly matches that cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To be able to understand how OkCupid works, the first step was to create an almost-fake account for myself. The reason I say fake is that I'm not actively looking to date online. The reason I say almost is that I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the "Order By" dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's "Special Blend" looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. OkCupid then looks at how well each person's answers satisfy the other's preferences, weighting every question by these point values. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Suppose both A and B have answered the following two questions:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answer make A happy?'''
A marked the first question as very important and the second as only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted; B's answer to the second question ("Yes") was not what A wanted and earned nothing. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answer make B happy?'''
B placed 1 importance point on the first question and 10 on the second. A's answer to the first question ("Very organized") was not the one B wanted, but A's answer to the second ("No") was, so A earned 10 of the 11 possible points. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfaction scores and takes the square root (the geometric mean): sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only a single question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It holds that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, they always publish the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I created an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project which had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
'''q41:''' How important is religion/God in your life?
'''q46:''' Would you prefer good things happened or interesting?
'''q48:''' Which would you rather be? Normal or weird?
'''q49:''' Which word describes you better? Carefree or intense?
'''q70:''' Do you think homosexuality is a sin? yes or no?
'''q77:''' How frequently do you drink alcohol?
'''q79:''' What's your relationship with marijuana?
'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
'''q403:''' Do you enjoy discussing politics?
'''q501:''' Have you smoked cigarette in the last six months?
'''q553:''' Do spelling mistakes annoy you?
'''q997:''' Are you a cat person or a dog person?
'''q1440:''' Is jealousy healthy in a relationship?
'''q1597:''' Would you consider sleeping with someone on the first date?
'''q4018:''' Are you happy with your life?
'''q9688:''' Could you date someone who does drugs?
'''q16053:''' How willing are you to meet someone from OkCupid?
'''q34113:''' How do you feel about government-subsidized food programs?
'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
'''q85419:''' Which type of wine would you like to drink outside of a meal(such as for leisure)?
'''q179268:''' Are you either vegetarian or vegan?
'''q358077:''' Could you date someone who was really messy?
'''q358084:''' Do you enjoy intense intellectual conversations?
'''q416235:''' Do you like watching foreign movies with subtitles?
'''d_gender:''' Man, woman, transgender, transfeminine, agender
'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
'''q_20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
The clustering visualization is shown below. Cross validation gives us k of size 7 as shown below. 
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
The clustering is heavily skewed towards the first cluster, cluster 0 (79% of instances), whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) appears to have a sense of humour: for these clusters, the answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs, since q34113 is the attribute on which the cluster centroids differ most.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers they wanted from a partner, which is likely because people tend to like the options they themselves agree with. Since this model uses only the 4 most popular attributes, assume that you answer only these 4 questions; your calculated match percentage would then be 100%. Your true match percentage, however, is calculated as follows:
'''True Match = Calculated Match +/- Reasonable Margin of Error.'''

Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100% - 25% = '''75%'''.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers of a cluster would give a user a match rate higher than 90% for that cluster.

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:10:18Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to build an ideal profile that would achieve a match rate greater than 90%. I work with OkCupid's dataset and use Weka to train, cluster and visualize it.
The inspiration comes from this math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, is about all it takes: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting OkTrends article called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. A snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. They cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in another user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher my chances of finding a high match percentage. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The match percentage algorithm is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

: a) Your answer
: b) How you'd like someone else to answer
: c) How important the question is to you

2. To calculate the match, numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when checking how well each person's answers satisfy the other's preferences. Your match percentage with another person is found by answering the following two questions:

: a) How happy did the other person's answers make you?
: b) How happy did your answers make the other person?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are:
: 1. Very messy
: 2. Average
: 3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
: 1. Yes
: 2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How happy did B's answers make A?'''
A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How happy did A's answers make B?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

Without a correction, two users who had answered just one question in common and satisfied each other on it would be matched 100% even though they know practically nothing about each other. To avoid this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users), and takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, that would be 95% - (1/2)*100% = 45%.
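
The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch of the calculation as described on this page, not OkCupid's actual implementation; the function names (<code>satisfaction</code>, <code>match_percentages</code>) and the dictionary layout are my own assumptions for illustration.

```python
from math import sqrt

# Importance weights taken from the table in the matching-algorithm section.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}


def satisfaction(preferences, other_answers):
    """Fraction of importance points the other person earned.

    preferences:   {question: (set of acceptable answers, importance level)}
    other_answers: {question: the other person's answer}
    Only questions answered by the other person are counted.
    """
    possible = 0
    earned = 0
    for question, (acceptable, importance) in preferences.items():
        if question not in other_answers:
            continue
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0


def match_percentages(sat_a, sat_b, n_common):
    """Return (calculated match, lowest possible match), both in percent."""
    calculated = sqrt(sat_a * sat_b) * 100
    margin = 100 / n_common  # reasonable margin of error
    return calculated, calculated - margin


# Users A and B from the example (hypothetical question keys).
a_prefs = {
    "messy": ({"Average", "Very organized"}, "Very important"),
    "cheated": ({"No"}, "A little important"),
}
a_answers = {"messy": "Very organized", "cheated": "No"}
b_prefs = {
    "messy": ({"Average"}, "A little important"),
    "cheated": ({"No"}, "Somewhat important"),
}
b_answers = {"messy": "Average", "cheated": "Yes"}

a_happy = satisfaction(a_prefs, b_answers)  # 250 of 251 points
b_happy = satisfaction(b_prefs, a_answers)  # 10 of 11 points
calculated, published = match_percentages(a_happy, b_happy, n_common=2)
print(f"A: {a_happy:.1%}  B: {b_happy:.1%}")
print(f"calculated: {calculated:.0f}%  published: {published:.0f}%")
```

The geometric mean punishes one-sided matches: if either satisfaction is low, the product (and hence the match) is low as well.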

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written with the Python-based Scrapy framework. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
In the question data set, the instances are the questions and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and to find a representative that describes each segment well, K-means serves my purpose.
The instances, i.e. the cases that I need to cluster, are the users with their answers, and the attributes I am clustering on (i.e. the proxies for their behavioural attributes) are the questions answered by a majority of the population.
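
To make the assignment and update steps concrete, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. It is not Weka's implementation: it works on small numeric toy points with hand-picked starting centroids, both of which are assumptions for illustration, and it does not cover how nominal survey answers would be encoded.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two visibly separate groups (hypothetical values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, len(cluster))
```

Each final centroid plays the role of the "representative" of its segment, which is how the cluster centroid tables further down are meant to be read.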

====Training====
Since there are no actual labels to use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering; K-means gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments using only the four most popular attributes (q34113, q85419, q416235, q20062).
The preprocessing of the data is shown below:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
* Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to validate its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown below:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? Yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
Cross-validation gives k = 7; the resulting cluster visualization is shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
The clustering is heavily skewed towards the first cluster, cluster 0 (79% of instances), whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) appears to have a sense of humour: for these clusters, the answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs, since q34113 is the attribute on which the cluster centroids differ most.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers they wanted from a partner, which is likely because people tend to like the options they themselves agree with. Since this model uses only the 4 most popular attributes, assume that you answer only these 4 questions; your calculated match percentage would then be 100%. Your true match percentage, however, is calculated as follows:
'''True Match = Calculated Match +/- Reasonable Margin of Error.'''

Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100% - 25% = '''75%'''.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. The '''true match rate''' is then 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers of a cluster would give a user a match rate higher than 90% for that cluster.
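
The arithmetic in this section can be checked with a short Python sketch. It assumes, as the text does, a calculated match of 100% and a margin of error of 100%/n for n questions in common; the function name <code>lowest_match</code> is hypothetical.

```python
def lowest_match(calculated_percent, questions_in_common):
    """Lowest possible match: calculated match minus the 100%/n margin of error."""
    margin = 100.0 / questions_in_common
    return calculated_percent - margin


# A perfect calculated match with 4, 10 and 31 questions in common.
for n in (4, 10, 31):
    print(n, round(lowest_match(100.0, n), 2))
```

With 4 questions the bound is 75%, with 10 it reaches the 90% target of the hypothesis, and with 31 it is roughly 96.8%. Because the margin shrinks as 1/n, each additional question helps less than the previous one.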

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:09:40Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to build an ideal profile that would achieve a match rate greater than 90%. I work with OkCupid's dataset and use Weka to train, cluster and visualize it.
The inspiration comes from this math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that other users don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but has the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: OkCupid assigns numeric values to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These point values are used as weights when looking at how well each person's answers satisfy the other's preferences. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say, for instance, that both A and B have answered the following two questions. The first is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that the first question was very important and that the second was only a little important, so we place 250 importance points on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted; B's answer to the second question ('Yes') was not what A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answers make B happy?'''
B placed 1 importance point on the first question and 10 on the second. Of those 11 possible points, A earned 10: A's answer to the first question ('Very organized') was not what B wanted, but A's answer to the second ('No') was. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

Now consider the case where A and B have answered only one question in common and each gave the answer the other wanted: they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error and, to give users the most confidence in the match process, always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 95% - (1/2)*100% = 45%.
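
The worked example above can be reproduced with a short Python sketch. It is only an illustration of the arithmetic described on this page, not OkCupid's actual code: the function names, the dictionary layout and the clamping of the result at 0 are assumptions made for the example.

```python
from math import sqrt

# Point values for the four importance levels, taken from the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(prefs, other_answers, questions):
    """Fraction of one user's importance points earned by the other's answers.

    prefs maps question -> (set of acceptable answers, importance level);
    other_answers maps question -> the other user's answer.
    """
    earned = possible = 0
    for q in questions:
        acceptable, importance = prefs[q]
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[q] in acceptable:
            earned += points
    return earned / possible if possible else 0.0


def match(prefs_a, answers_a, prefs_b, answers_b):
    """Return (calculated match, lowest possible match) as fractions."""
    # Only questions answered by both users are counted.
    common = set(prefs_a) & set(answers_a) & set(prefs_b) & set(answers_b)
    if not common:
        return 0.0, 0.0
    sat_a = satisfaction(prefs_a, answers_b, common)
    sat_b = satisfaction(prefs_b, answers_a, common)
    calculated = sqrt(sat_a * sat_b)
    margin = 1 / len(common)
    # Clamping at 0 is an assumption of this sketch.
    return calculated, max(0.0, calculated - margin)


# Users A and B from the example above.
prefs_a = {
    "messy": ({"Average", "Very organized"}, "very important"),
    "cheated": ({"No"}, "a little important"),
}
answers_a = {"messy": "Very organized", "cheated": "No"}
prefs_b = {
    "messy": ({"Average"}, "a little important"),
    "cheated": ({"No"}, "somewhat important"),
}
answers_b = {"messy": "Average", "cheated": "Yes"}

calculated, lowest = match(prefs_a, answers_a, prefs_b, answers_b)
print(f"calculated: {calculated:.1%}, published lower bound: {lowest:.1%}")
```

Because the 'very important' weight (250) dwarfs the other two, a single mismatch on a very important question dominates the satisfaction score, which is why B's wrong answer on the 1-point question barely affects A's 99.6%.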

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project which had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and to find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users with their answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
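
K-means as described above averages numeric vectors, whereas the attributes here are categorical answers. The following is a minimal sketch of the same assign-and-update loop adapted to categorical data, assuming that the centroid of a cluster is the most common answer per question and that the distance between a user and a centroid is the number of questions on which they differ (a variant often called k-modes). This is an illustrative assumption, not a description of how Weka computes its clusters; the function names and the toy data are hypothetical.

```python
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_centroid(rows):
    """Most common answer per attribute across the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_centroids, max_iter=100):
    """Cluster categorical rows; returns (centroids, assignments)."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each row goes to the nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: mismatches(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the per-attribute mode of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [r for r, a in zip(rows, assignments) if a == c]
            new_centroids.append(mode_centroid(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer change
        centroids = new_centroids
    return centroids, assignments


# Toy data: three made-up answers per user (not taken from the real dataset).
rows = [
    ("No problem", "Red", "Yes"),
    ("No problem", "Red", "No"),
    ("No problem", "White", "Yes"),
    ("Never", "Rosé", "No"),
    ("Never", "Rosé", "Yes"),
    ("Never", "White", "No"),
]
centroids, assignments = k_modes(rows, initial_centroids=[rows[0], rows[3]])
print(centroids)
print(assignments)
```

Each resulting centroid is itself a valid answer sheet, which is what makes it usable as the "representative profile" of its cluster.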

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I decided to choose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch, food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red (such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each of which has less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked a cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life if your lover asked you to squeal like a dolphin, would you?
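
The attribute selection above (keep only attributes with less than 34% missing data) can be sketched in Python as follows. In this project the filtering was done in Weka's preprocessing pane; the function name, the use of <code>None</code> for a missing answer and the toy rows below are assumptions made for illustration.

```python
def select_attributes(rows, max_missing=0.34):
    """Return attributes whose fraction of missing answers is below max_missing.

    rows is a list of dicts mapping attribute name -> answer, where None
    (or an absent key) means the user did not answer. The result is ordered
    from least to most missing data.
    """
    # Collect attribute names in first-seen order.
    attributes = list(dict.fromkeys(a for row in rows for a in row))
    missing_fraction = {
        a: sum(row.get(a) is None for row in rows) / len(rows) for a in attributes
    }
    kept = [a for a in attributes if missing_fraction[a] < max_missing]
    return sorted(kept, key=lambda a: missing_fraction[a])


# Toy rows with made-up answers (not taken from the real dataset).
rows = [
    {"q35": "Love", "q41": None, "q46": "Good"},
    {"q35": "Sex", "q41": None, "q46": None},
    {"q35": "Love", "q41": "Not important", "q46": "Interesting"},
    {"q35": None, "q41": None, "q46": "Good"},
]
selected = select_attributes(rows)
print(selected)
```

Here q41 is dropped because three of four users left it unanswered, while q35 and q46 (25% missing each) are kept.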
Cross-validation suggests k = 7 clusters; the resulting cluster visualization is shown below. 
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, whose centroid answers 'Never-get a job' to the question on government-subsidized food programs; that is, a large segment of the population feels that you should get a job instead of relying on such programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) has a sense of humour. I say this because the centroid answer to 'Do you like watching foreign movies with subtitles?' in these clusters is 'Can't answer without a subtitle'.
Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs, since this is the question on which the cluster centroids differ most.
To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the '''4 most popular''' questions, '''q34113''', '''q85419''', '''q416235''' and '''q20062''', as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to coincide with what they answered themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, will be calculated as follows:
'''True Match = Calculated Match +/- Reasonable Margin of Error.'''
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the true match would actually be 100% - 25% = '''75%'''.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%. 
In my next set of experiments, I chose '''31 attributes''', which gives me '''7 clusters'''. We calculate the '''true match rate''' as 100% - (100/31)% = 100% - 3.23% = '''96.77%''', which is greater than the 90% proposed in the hypothesis. Therefore I have clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the mean cluster answers for a cluster would get a user a match rate higher than 90% for that cluster.
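
The arithmetic in this section can be checked with a small Python sketch. It assumes, as the text does, a calculated match of 100% and a margin of error of 100%/n for n questions in common; the function names and the clamping at 0 are hypothetical choices for the example.

```python
def lowest_match(calculated_pct, n_common):
    """Published lower bound: calculated match minus a margin of 100/n percent."""
    margin = 100.0 / n_common
    return max(0.0, calculated_pct - margin)  # clamping at 0 is an assumption


def min_questions(target_pct, calculated_pct=100.0):
    """Smallest number of common questions whose lower bound reaches target_pct."""
    if target_pct >= calculated_pct:
        raise ValueError("target must be below the calculated match")
    n = 1
    while lowest_match(calculated_pct, n) < target_pct:
        n += 1
    return n


print(lowest_match(100.0, 4))             # 4 most popular attributes
print(round(lowest_match(100.0, 31), 2))  # 31 most popular attributes
print(min_questions(90.0))                # questions needed to reach 90%
```

Note that the lower bound depends only on how many questions the two users share, so under these assumptions any perfectly matching pair needs at least 10 questions in common to be shown a 90% match.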

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-23T01:07:25Z

RitikaJain: /* Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say "fake" because I'm not actively looking to date online, and "almost" because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that other users don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but has the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: OkCupid assigns numeric values to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These point values are used as weights when looking at how well each person's answers satisfy the other's preferences. Your match percentage with another person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say, for instance, that both A and B have answered the following two questions. The first is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that the first question was very important and that the second was only a little important, so we place 250 importance points on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted; B's answer to the second question ('Yes') was not what A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answers make B happy?'''
B placed 1 importance point on the first question and 10 on the second. Of those 11 possible points, A earned 10: A's answer to the first question ('Very organized') was not what B wanted, but A's answer to the second ('No') was. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

Now consider the case where A and B have answered only one question in common and each gave the answer the other wanted: they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error and, to give users the most confidence in the match process, always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 95% - (1/2)*100% = 45%.

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project which had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose: each cluster's centroid acts as that representative.
The instances to be clustered are the users (each described by their answers), and the attributes I cluster on, which stand in for behavioural attributes, are the questions answered by a majority of the population.
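
As a rough illustration of the assign-then-update loop behind K-means, here is a minimal Python sketch. It is not Weka's implementation: because the OkCupid answers are categorical, it assumes a simple mismatch-count distance and uses the most common answer per attribute as the cluster "mean". The function names and the toy answers are hypothetical.

```python
from collections import Counter

def mismatch_distance(a, b):
    """Number of attributes on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_centroid(members):
    """Most common answer per attribute (a categorical stand-in for the mean)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))

def kmeans_nominal(rows, initial_centroids, max_iterations=10):
    """Alternate assignment and update steps until the centroids stop changing."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each row joins its nearest centroid (ties go to the lower index).
        assignment = [
            min(range(len(centroids)), key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: recompute each centroid from its current members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [row for row, a in zip(rows, assignment) if a == c]
            new_centroids.append(mode_centroid(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Hypothetical toy answers to two questions (not taken from the real data set).
toy_rows = [
    ("No problem", "Red"),
    ("No problem", "Red"),
    ("Never-get a job", "White"),
    ("Never-get a job", "White"),
    ("Never-get a job", "Red"),
]
centroids, assignment = kmeans_nominal(toy_rows, [toy_rows[0], toy_rows[2]])
```

Each returned centroid is itself a complete set of answers, which is why it can be read as the representative profile of its cluster.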

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to their attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering; K-means gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each with less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?

The clustering visualization is shown below. Cross validation gives us k = 7, as shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which holds 79% of the instances and whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2 together) has a sense of humour: the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
At the same time, q34113 is the question on which the cluster centroids differ most, so broadly the online dating community appears divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster does. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and these were also the answers they wanted, which is likely because people tend to prefer the options they chose themselves. Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the true match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%.
In my next set of experiments, I chose 31 attributes, which gives me 7 clusters. The true match rate is then 100% - (100/31)% ≈ 100% - 3.23% = 96.77%, which is greater than the 90% proposed in the hypothesis. I have therefore clustered the dating population into 7 clusters based on the 31 most popular attributes, and I claim that answering the questions according to the centroid answers of a cluster would get a user a match rate higher than 90% for that cluster.
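
The margin-of-error arithmetic used in this section can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (a calculated match of 100% and a margin of 100% divided by the number of questions in common); the function name is hypothetical and not part of OkCupid or Weka.

```python
def lowest_possible_match(calculated_match_pct, questions_in_common):
    """Calculated match minus the margin of error, 100% / (questions in common)."""
    margin_of_error_pct = 100.0 / questions_in_common
    return calculated_match_pct - margin_of_error_pct

# Attribute counts discussed in the text: 4, 10 and 31.
for n in (4, 10, 31):
    print(n, round(lowest_possible_match(100.0, n), 2))
```

The more attributes two users have in common, the smaller the margin, which is why moving from 4 to 31 attributes lifts the guaranteed match rate above the 90% target.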

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] about framing the problem as a recommender-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in a country-based comparison of online-dating trends to analyze how culture affects online dating (with respect to factors such as religion, marriage and food).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-22T23:27:02Z

RitikaJain: /* Clustering with 31 most popular attributes */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed by clustering, an unsupervised learning task. The rationale is that the population can broadly be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to build an ideal profile that would achieve a match rate greater than 90%. I work with OkCupid's dataset and use Weka to train, cluster and visualize it.
The inspiration comes from this math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This agrees with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles, and I can hide and unhide 'real' stalkers. A snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. They cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going to it directly, or by finding it in some other user's profile.
A snapshot of answering questions from another user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of reaching a higher match percentage. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5%, i.e. 100% - 100%/67 (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and finding my way around the basic functionality of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm: it calculates match percentages by comparing your answers with the answers the other person would like to see, and vice versa.

1. For each question, three values are collected from a user:
:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important):

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how well each person's answers satisfy the other's preferences, OkCupid uses these point values to weight the calculation. Your match percentage with another person is obtained by answering the following two questions:
:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us see how this works with the help of an example.
We consider two users, A and B, and compute the match percentage between them.
Say that both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
# Very messy
# Average
# Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
# Yes
# No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To rule this out, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both), and takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
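
The worked example can be reproduced with a short Python sketch. It follows the calculation as described on this page rather than OkCupid's actual code: the data layout, the function names and the decision to clamp the published value at 0% are my own assumptions.

```python
import math

# Importance levels and point values from the table in the matching-algorithm section.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}

def satisfaction(own_questions, other_answers):
    """Fraction of importance points the other person earned.

    own_questions maps question -> (acceptable answers, importance level);
    other_answers maps question -> the other person's answer.
    """
    possible = earned = 0
    for question, (acceptable, importance) in own_questions.items():
        if question not in other_answers:
            continue
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other_answers[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0

def match_percentages(a_questions, a_answers, b_questions, b_answers):
    """Return (calculated match, lowest published match), both in percent."""
    a_happy = satisfaction(a_questions, b_answers)
    b_happy = satisfaction(b_questions, a_answers)
    calculated = 100.0 * math.sqrt(a_happy * b_happy)
    common = len(set(a_answers) & set(b_answers))
    margin = 100.0 / common if common else 100.0
    # Clamping at 0 is an assumption; the page does not say what happens below 0%.
    return calculated, max(0.0, calculated - margin)

# Users A and B from the example above.
a_questions = {
    "messy": ({"Average", "Very organized"}, "very important"),
    "cheated": ({"No"}, "a little important"),
}
a_answers = {"messy": "Very organized", "cheated": "No"}
b_questions = {
    "messy": ({"Average"}, "a little important"),
    "cheated": ({"No"}, "somewhat important"),
}
b_answers = {"messy": "Average", "cheated": "Yes"}

calculated, published = match_percentages(a_questions, a_answers, b_questions, b_answers)
print(round(calculated), round(published))  # prints "95 45"
```

Because 'very important' is worth 250 points against 10 and 1 for the lower levels, a single mismatch on a very important question dominates the satisfaction score.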

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a crawler to scrape it from the OkCupid website. I ran into two main issues. First, OkCupid maintains a dynamically changing HTML structure, so I could not extract elements statically by referencing them through the HTML hierarchy. Second, I can only see other users' answers to questions that I have answered myself; to overcome this, I created an automatic form-filling bot, which worked but with a few limitations. After several failed attempts to reliably extract questions and answers from other users' profiles, I came across a GitHub project containing an OkCupid scraping bot written in Python on top of Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the required amount of data would have taken around 10 computers running the bots. I tried this, but because of time and resource constraints I decided against running the bots on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply: they graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose: each cluster's centroid acts as that representative.
The instances to be clustered are the users (each described by their answers), and the attributes I cluster on, which stand in for behavioural attributes, are the questions answered by a majority of the population.

====Training====
Since there are no labels we can use, clustering these instances (user profiles) according to their attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering; K-means gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
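
For readers who want to reproduce the 80-20% split outside the Weka GUI, the idea can be sketched in Python as follows. This is an illustration only and makes no claim about how Weka performs its percentage split internally; the function name and the seed are hypothetical.

```python
import random

def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle a copy of the instances and cut it into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 100 placeholder instance ids stand in for user profiles.
train, test = percentage_split(range(100))
```

The clusterer is then built on the training part, and the test part is used to check how the held-out instances distribute over the learned clusters.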

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each with less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?
*'''q16053:''' How willing are you to meet someone from OkCupid?
*'''q34113:''' How do you feel about government-subsidized food programs?
*'''q64664:''' Do you think it is okay to open old graves to get more knowledge of ancient cultures and their history?
*'''q85419:''' Which type of wine would you like to drink outside of a meal (such as for leisure)?
*'''q179268:''' Are you either vegetarian or vegan?
*'''q358077:''' Could you date someone who was really messy?
*'''q358084:''' Do you enjoy intense intellectual conversations?
*'''q416235:''' Do you like watching foreign movies with subtitles?
*'''d_gender:''' Man, woman, transgender, transfeminine, agender
*'''d_religion_type:''' Atheism, Agnosticism, Christianity, Judaism, Catholicism, Buddhism, Hinduism, Islam
*'''d_smokes:''' No, yes, sometimes, trying to quit, when drinking
*'''d_drinks:''' Socially, rarely, often, not at all, very often, desperately
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?

The clustering visualization is shown below. Cross validation gives us k = 7, as shown below.
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

Comparison for some attributes on clusters 4, 5, 6:
{| class="wikitable"
|-
! Question
!Cluster 4 ans
! Cluster 5 ans
!Cluster 6 ans
|-
| Regardless of future plans, what is more important to you right now, love or sex?
| Love
| Love
| Love
|-
| How important is religion/God in your life?
| Not important
| Not important
| Not important
|-
| Would you prefer good things happened to you or interesting things?
| Good
| Good
| Interesting
|-
| Which would you rather be, normal or weird?
| Normal
| Weird
| Weird
|-
| Which describes you better, intense or carefree?
| Carefree
| Intense
| Carefree
|-
| What's your relationship with marijuana?
| Never
| Missing
| Occasionally
|-
| What do you prefer, cats or dogs or both?
| Dogs
| Dogs
| Both
|-
| Would you consider having an open relationship?
| No
| No
| Yes
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which holds 79% of the instances and whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2 together) has a sense of humour: the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
At the same time, q34113 is the question on which the cluster centroids differ most, so broadly the online dating community appears divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster does. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and these were also the answers they wanted, which is likely because people tend to prefer the options they chose themselves. Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
True Match = Calculated Match +/- Reasonable Margin of Error.
Reasonable margin of error=100%/4 (since you have 4 questions in common)=25%.
 Therefore the true match would actually be 100-25=75%.
For increasing the match rate to 90%(as stated in the hypothesis), I would need to select 10 most popular attributes. Experiments for these are currently under progress.

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-22T22:32:08Z

RitikaJain:

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are to form clusters of the dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to try to build an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. Profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. A snapshot of the people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5%, i.e. 100% - 100%/67 (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is figured by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and try to find the match percentage between them.
Say that both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both), and takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
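
The calculation in this example can be sketched in Python as follows. This is only a minimal illustration of the rules described in this section, not OkCupid's actual code; the function names, the dictionary layout and the question keys (<code>messy</code>, <code>cheated</code>) are assumptions made for the example.

```python
from math import sqrt

# Importance weights taken from the table in the matching-algorithm section.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(judge, other):
    """Fraction of the judge's importance points earned by the other's answers."""
    earned = 0
    possible = 0
    for question, prefs in judge.items():
        if question not in other:
            continue  # only questions answered by both users count
        points = IMPORTANCE_POINTS[prefs["importance"]]
        possible += points
        if other[question]["answer"] in prefs["accepts"]:
            earned += points
    return earned / possible if possible else 0.0


def match_percentage(a, b):
    """Return (calculated match, lowest possible match) as fractions in [0, 1]."""
    common = [q for q in a if q in b]
    if not common:
        return 0.0, 0.0
    calculated = sqrt(satisfaction(a, b) * satisfaction(b, a))
    margin = 1 / len(common)  # reasonable margin of error
    return calculated, max(0.0, calculated - margin)


# Hypothetical encoding of users A and B from the example.
user_a = {
    "messy": {"answer": "Very organized",
              "accepts": {"Average", "Very organized"},
              "importance": "very important"},
    "cheated": {"answer": "No", "accepts": {"No"},
                "importance": "a little important"},
}
user_b = {
    "messy": {"answer": "Average", "accepts": {"Average"},
              "importance": "a little important"},
    "cheated": {"answer": "Yes", "accepts": {"No"},
                "importance": "somewhat important"},
}

calculated, lowest = match_percentage(user_a, user_b)
print(f"calculated match: {calculated:.1%}, published match: {lowest:.1%}")
```

With the two users above, the sketch gives roughly the 95% calculated match and 45% published match worked out by hand.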

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at cleanly extracting questions and answers from other users' profiles, I came across a GitHub project that had built an OkCupid scraping bot in Python on top of Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
In the question data set, the instances are the questions and the attributes are the answer options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (which model their behavioural attributes) are the questions answered by a majority of the population.
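
As a rough illustration of the idea, a K-means-style loop for categorical answers can be sketched in Python. The sketch assumes that the distance between two users is the number of questions they answered differently and that each cluster's representative is the most common answer per question; Weka's own implementation may differ in its details, and the function names and toy answers below are hypothetical.

```python
from collections import Counter


def mismatch_distance(row, centroid):
    """Number of questions on which two answer vectors differ."""
    return sum(1 for x, y in zip(row, centroid) if x != y)


def column_modes(rows):
    """Most common answer per question; plays the role of the cluster 'mean'."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def cluster_answers(rows, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each user goes to the nearest representative profile.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: recompute each representative from its members.
        for c in range(len(centroids)):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = column_modes(members)
    return centroids, assignment


# Toy data: (answer to a food-program question, answer to a wine question).
toy_rows = [
    ("Never", "Rose"),
    ("Never", "Rose"),
    ("Never", "Red"),
    ("No problem", "Red"),
    ("No problem", "Red"),
]
toy_centroids, toy_assignment = cluster_answers(
    toy_rows, [toy_rows[0], toy_rows[3]]
)
print(toy_centroids, toy_assignment)
```

The initial centroids are passed in explicitly here so that the run is reproducible; a real run would normally pick them at random.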

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of the data is shown below:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means pane in Weka where I choose the attributes is shown below:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster 0
| 79%
|-
!Cluster 1
| 11%
|-
!Cluster 2
| 7%
|-
!Cluster 3
| 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

====Clustering with 31 most popular attributes====
In this set of experiments, I consider the 31 most popular attributes, each with less than 34% missing data. The attributes are as follows:
*'''q35:''' Regardless of future plans, what's more interesting to you right now, love or sex?
*'''q41:''' How important is religion/God in your life?
*'''q46:''' Would you prefer good things happened or interesting?
*'''q48:''' Which would you rather be? Normal or weird?
*'''q49:''' Which word describes you better? Carefree or intense?
*'''q70:''' Do you think homosexuality is a sin? yes or no?
*'''q77:''' How frequently do you drink alcohol?
*'''q79:''' What's your relationship with marijuana?
*'''q123:''' Would you strongly prefer to go out with someone of your own skin color/ racial background?
*'''q325:''' Would you consider having an open relationship (i.e. one where you can see other people)?
*'''q403:''' Do you enjoy discussing politics?
*'''q501:''' Have you smoked cigarette in the last six months?
*'''q553:''' Do spelling mistakes annoy you?
*'''q997:''' Are you a cat person or a dog person?
*'''q1440:''' Is jealousy healthy in a relationship?
*'''q1597:''' Would you consider sleeping with someone on the first date?
*'''q4018:''' Are you happy with your life?
*'''q9688:''' Could you date someone who does drugs?


The clustering visualization is shown below. Cross-validation gives k = 7 clusters. 
[[File:Clusters10.png|thumb|center|700px|Cluster visualization for 31 attributes (7 clusters)]]

Clustered instances:
{| class="wikitable"
|-
!Cluster 0
| 4%
|-
!Cluster 1
| 2%
|-
!Cluster 2
| 0%
|-
!Cluster 3
| 28%
|-
!Cluster 4
| 21%
|-
!Cluster 5
| 31%
|-
!Cluster 6
| 12%
|}

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it is safe to say that the online dating community is highly divided on how it feels about government-subsidized food programs.
To get higher match rates, it is first important to answer the most popular questions and, second, crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (people tend to like the options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model here, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
* True Match = Calculated Match +/- Reasonable Margin of Error.
* Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
* Therefore the true match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error would then be 100%/10 = 10%. Experiments for these are currently in progress.
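
The margin-of-error arithmetic used in this section can be checked with a few lines of Python. This is a small sketch under the assumption stated on this page (margin of error = 100% divided by the number of questions in common); the function names are hypothetical, and whole-number percentages are assumed for the target.

```python
def lowest_published_match(calculated_percent, questions_in_common):
    """Calculated match minus the margin of error, floored at 0%."""
    margin = 100 / questions_in_common
    return max(0.0, calculated_percent - margin)


def questions_needed(target_percent):
    """Fewest common questions for which a perfect (100%) calculated match
    still publishes at least target_percent."""
    if not 0 <= target_percent < 100:
        raise ValueError("target_percent must be in [0, 100)")
    shortfall = 100 - target_percent
    return -(-100 // shortfall)  # ceiling division


print(lowest_published_match(100, 4))   # 4 questions in common
print(questions_needed(90))             # questions needed for a 90% floor
```

Under these assumptions, 4 common questions cap the published match at 75%, and 10 are needed before a perfect match can be published as 90%.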

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

File:Clusters10.png

2016-04-22T22:24:17Z

RitikaJain: User created page with UploadWizard

=={{int:filedesc}}==
{{Information
|description={{en|1=Cluster visualization for 31 attributes (10 clusters)}}
|date=2016-04-22 15:23:24
|source={{own}}
|author=[[User:RitikaJain|RitikaJain]]
|permission=
|other_versions=
}}

=={{int:license-header}}==
{{self|cc-by-sa-3.0}}

[[Category:Uploaded with UploadWizard]]

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-22T22:05:53Z

RitikaJain: /* Cluster Visualization */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are to form clusters of the dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to try to build an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. Profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. A snapshot of the people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5%, i.e. 100% - 100%/67 (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is figured by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and try to find the match percentage between them.
Say that both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both), and takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. In addition, I can only see other users' answers to questions that I have answered myself (to overcome this issue, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project containing an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 

Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.

The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set is shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how the K-means algorithm works, as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a profile that represents each segment well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users (described by their answers), and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
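
To make the idea concrete for answers that are categories rather than numbers, here is a minimal sketch of a K-means-style loop in Python. It is an illustration under stated assumptions, not the Weka implementation used in this project: the distance is simply the number of attributes on which two users differ, each centroid is the most common answer per attribute among its members, and the first k distinct rows are used as initial centroids. The function names and the toy data are hypothetical.

```python
from collections import Counter


def mismatch_distance(a, b):
    """Number of attributes on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_nominal(rows, k, max_iter=100):
    """Alternate between assigning rows to the nearest centroid and
    recomputing each centroid as the per-attribute mode of its members."""
    # Assumption for this sketch: seed with the first k distinct rows.
    centroids = []
    for row in rows:
        if row not in centroids:
            centroids.append(row)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct rows")

    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignments == assignments:
            break  # no row changed cluster, so the loop has converged
        assignments = new_assignments
        for c in range(k):
            members = [r for r, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    Counter(column).most_common(1)[0][0]
                    for column in zip(*members)
                )
    return centroids, assignments


# Toy data: each tuple is one user's answers to three questions.
rows = [
    ("No problem", "Red", "Yes"),
    ("Never", "White", "No"),
    ("No problem", "Red", "No"),
    ("Never", "White", "No"),
    ("Never", "Red", "No"),
]
centroids, assignments = cluster_nominal(rows, k=2)
print(centroids)
print(assignments)
```

Each centroid can then be read as the representative profile of its cluster, which is how the cluster centroids are interpreted on this page.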

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test the model. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

* Number of iterations: 2
* Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]


===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2 combined) seems to have a sense of humour. I say this because in all three of these clusters the representative response to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.

Broadly, it seems safe to say that the online dating community is highly divided on how they feel about government-subsidized food programs.

To get higher match rates, it is firstly important to answer the most popular questions, and secondly crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.

The calculated match percentage would be 100% if you answered exactly this and these are also the answers the other person wants from a match, which is likely since people tend to prefer the answers they gave themselves. Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
* True Match = Calculated Match +/- Reasonable Margin of Error.
* Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
* Therefore the true match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error would then be 100%/10 = 10%. Experiments for these are currently in progress.
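
The relationship between the number of common questions and the published match can be checked with a short Python sketch. The function names are hypothetical and the code assumes, as in the text, that the margin of error is 100% divided by the number of questions in common; exact fractions are used to avoid floating-point rounding at the 90% boundary.

```python
from fractions import Fraction


def published_match(calculated_pct, common_questions):
    """Lowest possible match: calculated match minus the margin of error."""
    margin = Fraction(100, common_questions)
    return max(Fraction(0), Fraction(calculated_pct) - margin)


def questions_needed(target_pct, calculated_pct=100):
    """Smallest number of common questions whose published match
    reaches target_pct, given a fixed calculated match."""
    if calculated_pct <= target_pct:
        raise ValueError("target cannot be reached with this calculated match")
    n = 1
    while published_match(calculated_pct, n) < target_pct:
        n += 1
    return n


print(published_match(100, 4))  # 4 common questions, perfect answers
print(questions_needed(90))     # common questions needed for a 90% match
```

With a perfect calculated match, 4 common questions give a published match of 75%, and 10 common questions are the minimum needed to reach 90%.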

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]<ref name="paper1"/> to look into the possibility of framing the problem as a recommendation systems problem and developing or improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/MarchApril2016

2016-04-21T14:31:12Z

RitikaJain: /* Thursday April 21 - Room ICCS 104 */

==March/April Assignment==
Your third and final assignment is to create a hypothesis related to the course and test it. Your result should describe the hypothesis and whether it works. The reader should be able to understand the background, what the hypothesis is, whether the hypothesis is true, and the evidence you used to come to this conclusion. Your hypothesis can be theoretical or practical.

==Extra Rules==
* You need to follow the rules on the main page and you should follow the guidelines there.
* Each page should have a principal author. You do not need co-authors but can have co-authors; co-authorship is encouraged. If others help you with your page, you should help them too.
* You need to add your page to the table of contents in a position that makes sense. Feel free to edit and change the structure of the table of contents to give it a coherent structure.
* You will need to give a presentation of approximately 6 minutes + 2 minutes for questions; do not go over! If you would like to give a presentation during term time please contact David. For those who do not want to present during class time, we will have a presentation session during exam time.
* If you want to choose a topic that is related to other previous or current courses you need to negotiate with the instructor(s) to make sure you are not counting the same work multiple times.
* You should refer to wiki pages and to other research papers as appropriate.
===Key Dates===
* March 23 - create a page that includes an explicit statement of your hypothesis, and how you intend to test your hypothesis.
* April 17 - First Draft ready for critiquing
* April 20 - Critiques due
* April 22 - Final pages ready for marking
* April 27 - Marking Completed

===Marking Scheme===
Here are some questions to take into account when marking. This is subject to change. Feel free to add questions, and edit the questions if they do not make sense.

* The topic is relevant for the course.
* The writing is clear and the English is good.
* The page is written at an appropriate level for CPSC 522 students (where the students have diverse backgrounds).
* The formalism (definitions, mathematics) was well chosen to make the page easier to understand.
* The abstract is a concise and clear summary.
* There were appropriate (original) examples that helped make the topic clear.
* There was appropriate use of (pseudo-) code.
* It had a good coverage of representations, semantics, inference and learning (as appropriate for the topic).
* It is correct.
* It was neither too short nor too long for the topic.
* It was an appropriate unit for a page (it shouldn't be split into different topics or merged with another page).
* It links to appropriate other pages in the wiki.
* The references and links to external pages are well chosen.
* I would recommend this page to someone who wanted to find out about the topic.
* This page should be highlighted as an exemplary page for others to emulate.

If I was grading it out of 20, I would give it:

Justification for the mark that will help the student in the future:

===Presentation Schedule===

==Tuesday April 19 - Room ICCS 144==
cancelled

==Thursday April 21 - Room ICCS 104==
* Samprity Kashyap
* Ke Dai
* Arthur Sun
* Prithu Banerjee
* Ritika Jain
* Bahare Fatemi
* Adnan Reza
* Abed Rahman
* Junyuan Zheng
* Mehrdad Ghomi
* Ricky Chen
* Yu Yan
* Jordon Johnson
* Yan Zhao
* Dandan Wang
* Jiahong Chen

Course talk:CPSC522/Titanic: Machine Learning from Disaster

2016-04-20T20:02:24Z

RitikaJain: Talk page autocreated when first thread was posted

Thread:Course talk:CPSC522/Titanic: Machine Learning from Disaster/Suggestions for Titanic: Machine Learning from Disaster

2016-04-20T20:02:24Z

RitikaJain: New thread: Suggestions for Titanic: Machine Learning from Disaster

Hi Junyuan,
I enjoyed going through your page. I think you have very clearly explained the hypothesis, problem description and how you proceed towards solving it. Some suggestions and queries I had are outlined below:
1. It would be great if you could proofread the page once because there are typos and grammatical errors, some of which make it hard to understand what you are trying to convey.
2. You might want to explain the attributes: for instance for the attribute Embarked, what do S, C and Q stand for? What does Parch stand for?
Under your section, Using probability method to fill missing age value, you might want to explain what PMM (predictive mean matching) stands for and maybe explain in a line or two how it works.
3. I think it was really clever how you used title and relevant attributes to guess the age of the person. In rare title, do you mean titles like Dr.? 
4. You could give links to refer to RMSE and <math>R_2</math> because a layman user might not know these standards of comparison.
5. If my understanding of ensemble methods is correct, they create and combine multiple models to sort of average out the errors. I am not sure how multiple models are created and combined in your ensemble NN. And why do you choose to ignore the attribute FamilyID in your second set of implementations? You could probably explain this more explicitly in your page.
6. So if I understand correctly, in conclusion, because of insufficient unique data you are unable to train your neural network properly, so you cannot use it to give a good prediction of your age attribute's missing values. You could consider adding a Conclusion, Discussion and Future work section to give the readers a better idea of what you have established and what could be further checked. 
Thanks for a great page!
Ritika

Course talk:CPSC522/Improving the accuracy of Affect Prediction in an Intelligent Tutoring System

2016-04-20T06:38:23Z

RitikaJain: Talk page autocreated when first thread was posted

Thread:Course talk:CPSC522/Improving the accuracy of Affect Prediction in an Intelligent Tutoring System/Suggestions for Improving the accuracy of Affect Prediction in an Intelligent Tutoring System

2016-04-20T06:38:23Z

RitikaJain: New thread: Suggestions for Improving the accuracy of Affect Prediction in an Intelligent Tutoring System

Hi Abed,
I liked the concept of your project and I also think you have structured your page very well. It has a very neat and clean layout. Some suggestions and queries that I had are outlined below:
1. I think there is some typing error in your hypothesis. You were probably meaning to say that PCA and wrapper features have previously been used but now you try to increase the accuracy of the model by using Logistic with L1 and ensemble methods?
2. I am unable to access any of your references. You might want to double check your links.
3. You could probably add a line or two on Intelligent Tutoring Systems (ITSs) and make this abbreviation explicit.
4. When you mention that you try to avoid the use of sensors to make the study as unobtrusive as possible so that the user is not distracted, don’t you think that asking the user to respond to regular self-report prompts by clicking on a radio button is more distracting and might even annoy the test takers?
5. How was the data collected from the undergraduate students? Was it like a questionnaire or was it like a radio button they had to click while gazing at a screen?
6. It's not very clear to me why the emotions boredom and curiosity are treated separately. How can a user be bored and curious at the same time?
7. I don’t know if it's possible, but you might want to show some snapshots of what your dataset looks like. How did you attribute numerical values to features like boredom and curiosity? Was it the number of people who clicked on boredom in some time interval? Also, have you considered levels of boredom, for instance ‘a little bored, very bored, dying to get out bored’, maybe as future work?
8. Why does the voting ensemble method have the highest accuracy for boredom but Stacked RBF for curiosity?
Also, maybe you could provide us with a little more intuition on your statistical analysis, or some background on the terms in one or two lines. Even though you have provided links to read about them, there are too many, and without that background the whole section is a little hard to understand. 
Thanks for introducing me to the research work going on in the field of ITSs.
Regards,
Ritika

Thread:Course talk:CPSC522/Analyzing online dating trends with Weka/Suggestions/reply

2016-04-20T05:27:52Z

RitikaJain: Reply to Suggestions

Thanks a lot for your suggestions Samprity. I'm glad you enjoyed going through my page. 
I shall make the suggested changes and get back to you.
Thanks,
Ritika

Course talk:CPSC522/Automatic Classification of Morphological Heart Arrhythmia

2016-04-19T00:51:38Z

RitikaJain: Talk page autocreated when first thread was posted

Thread:Course talk:CPSC522/Automatic Classification of Morphological Heart Arrhythmia/Suggestions for Automatic Classification of Morphological Heart Arrhythmia

2016-04-19T00:51:38Z

RitikaJain: New thread: Suggestions for Automatic Classification of Morphological Heart Arrhythmia

Hi Mehrdad,
This is going to be a rather long message. I found your project really exciting and I have lots of questions. 
Some initial cosmetic suggestions that I would like to get across are as follows:
1. In a line or two, right at the beginning you might want to mention that heart arrhythmia is an irregular heart beat. Also which type of arrhythmia are you trying to classify or model? Bradyarrhythmias or tachyarrhythmias? 
2. You might want to add the definition of QRS complex (maybe in brackets where you first mention it). I noticed you did include it in your figure but it might not be very obvious. 
3. Right above your Gaussian section, there is no link to click for further information on Hermite basis functions. Same for the Gaussian function, the KNN algorithm and neural networks. 

Questions that came up in my mind as I was going through your page are outlined below: 
1. How did you chop the continuous signals into units containing only one peak? Also, does this mean you extract more than one peak from each patient’s 30 minute ECG signal?
2. Is it possible for you to show how you remodelled the 203 intervals to 25 intervals and what was your resulting weighted vector for the combination of Hermite basis functions? 
3. Using k value as 1 makes me a little uncomfortable. Isn’t it true that larger values of k reduce the effect of noise on classification? Do you think, using k=3 or 5 would improve your results?
4. Are the N training vectors that you are using continuous variables? Or are they discrete y values at each of the 25 intervals? Would I be right in assuming the dimension of each training vector to be [1x25]?
I really found your project very interesting and I think you’ve done a very good job explaining your problem description, and the hypothesis is also very clear. You could probably add some more detail on how you model your problem (for instance you could write your x, k_i and y vectors explicitly). I am really looking forward to your neural network implementation and how it would compare to this 1-KNN model, as it seems almost everywhere neural nets are taking over. 
Regards,
Ritika

Thread:Course talk:CPSC522/Regularization for Neural Networks/Suggestions for Regularization for Neural Networks/reply (2)

2016-04-19T00:46:05Z

RitikaJain: Reply to Suggestions for Regularization for Neural Networks

Oh lol okay. I had it in mind that they were due tomorrow and I was getting all worked up. :P
Good luck with the writing!

Thread:Course talk:CPSC522/Regularization for Neural Networks/Suggestions for Regularization for Neural Networks

2016-04-18T23:46:12Z

RitikaJain:

Hi Ricky,
Till now whatever I've read on your page gives me a good background required on the methods that you'll be comparing against L2-regularization. But your actual description of the problem, experiments and analysis of the results seem incomplete. I'm guessing you'll be putting in more material on the page soon. 
Just as a side note, under your L2-regularization, you might want to look into the math formula. It is not rendered properly. 
Thanks,
Ritika
PS: Our critiques are due tomorrow (i.e. the 19th), right?

Course talk:CPSC522/Regularization for Neural Networks

2016-04-18T23:45:50Z

RitikaJain: Talk page autocreated when first thread was posted

Thread:Course talk:CPSC522/Regularization for Neural Networks/Suggestions for Regularization for Neural Networks

2016-04-18T23:45:50Z

RitikaJain: New thread: Suggestions for Regularization for Neural Networks

Hi Ricky,
Till now whatever I've read on your page gives me a good background required on the methods that you'll be comparing against L2-regularization. But your actual description of the problem, experiments and analysis of the results seem incomplete. I'm guessing you'll be putting in more material on the page soon.
Just as a side note, under your L2-regularization, you might want to look into the math formula. It is not rendered properly.
Thanks,
Ritika
PS: Our critiques are due tomorrow (i.e. the 19th), right?

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:32:53Z

RitikaJain: /* Almost-fake OkCupid account */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then for each cluster try to make an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. The reason I say fake is that I'm not actively looking to date online. The reason I say almost is that I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I however sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

It turns out I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This agrees with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by users themselves, but some initial questions were made by the staff. Questions cover a multitude of topics such as Ethics, Sex, Religion, Lifestyle, Dating etc. By default, the questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but has the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculations. Your match percentage with the other person is worked out by answering the following two questions:

a) How much did the other person’s answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|somewhat important
|}

'''How much did B's answer make A happy?'''
A indicated that B’s answer to the first question was very important and that B’s answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second question. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B’s answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answer make B happy?'''
Well, B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 points. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

A further correction handles the case where A and B have answered only one question in common and each answered it the way the other wanted: they would be matched 100% even though they know practically nothing about each other. To avoid this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage the match can be. In this example, that is 95% - (1/2)*100% = 45%.

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster are the answers of users and the attributes I am clustering on (i.e. trying to model their behavioural attributes) are the questions answered by a majority of the population.
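Because the answers are categorical rather than numeric, clustering them needs a centroid and a distance that work on categories. The following is a minimal sketch of that idea, assuming a mismatch count as the distance and the most common answer per question as the centroid. It is not Weka's implementation, and the function names and the toy answer rows are hypothetical.

```python
from collections import Counter


def mismatches(a, b):
    """Number of questions on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_answers(rows, k, max_iter=20):
    """Cluster categorical rows; centroids hold the most common answers."""
    # Deterministic initialisation: the first k distinct rows.
    centroids = []
    for row in rows:
        if row not in centroids:
            centroids.append(row)
        if len(centroids) == k:
            break

    assignment = None
    for _ in range(max_iter):
        # Assignment step: each row goes to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: mismatches(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged: no row changed cluster
        assignment = new_assignment

        # Update step: per-question mode of the members of each cluster.
        for c in range(len(centroids)):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    Counter(column).most_common(1)[0][0]
                    for column in zip(*members)
                )
    return centroids, assignment


# Toy data (made up): one tuple of answers per user.
rows = [
    ("Never-get a job", "Rosé"),
    ("Never-get a job", "Rosé"),
    ("No problem", "Red"),
    ("Never-get a job", "Rosé"),
    ("No problem", "Red"),
]
centroids, assignment = cluster_answers(rows, k=2)
print(centroids)
print(assignment)
```

Each centroid is itself a row of answers, which is what makes it usable as a representative profile for its cluster.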

====Training====
Since there are no actual labels we can use, clustering these instances(user profiles) according to attributes(questions) is actually an unsupervised learning task.
I experimented with density based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20 split, i.e. 80% of the data is used to train the model and 20% is used to test the model. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs.
To get higher match rates: firstly, it is important to answer the most popular questions and secondly, it is crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they want in their answers, which is likely to be the same as what they answered themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the lowest true match, which is the value published, would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%. Experiments for these are currently in progress.
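The relation between the number of shared questions and the lowest published match can be checked with a short Python sketch. It only encodes the margin-of-error rule as described on this page; the function names are hypothetical.

```python
def lowest_match_percent(calculated_percent, shared_questions):
    """Calculated match minus the margin of error 100%/n."""
    return calculated_percent - 100 / shared_questions


def questions_needed(target_percent, calculated_percent=100):
    """Smallest number of shared questions whose lowest match hits the target."""
    gap = calculated_percent - target_percent
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return -(-100 // gap)  # integer ceiling of 100 / gap


print(lowest_match_percent(100, 4))   # four shared questions
print(questions_needed(90))           # shared questions needed for 90%
```

With a perfect calculated match, 4 shared questions give a lowest match of 75%, and 10 shared questions are needed to reach 90%.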

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc).
Essentially, we have a huge dataset of the online dating world and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:28:13Z

RitikaJain: /* Almost-fake OkCupid account */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that broadly speaking, population can be segmented into clusters based on their behavioural attributes (which in this project are accessed using OkCupid questions and answers) and we can find a representative profile which broadly matches that cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.<ref name="snapshots"/>]]

I can also see people who are viewing my profile. I also have the options to browse invisibly at profiles, so they don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers. The snapshot of people viewing my profile is shown below<ref name="snapshots"/>:
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by users themselves, but some initial questions were made by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but has the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other's preferences, they use these values to weight the calculation. Your match percentage with the other person is computed by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 points by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfaction scores and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had both answered just one common question and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, they always publish the lowest possible percentage your match can be. In this example, that would be 95% - (1/2)*100% = 45%.

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster are the answers of users and the attributes I am clustering on (i.e. trying to model their behavioural attributes) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances(user profiles) according to attributes(questions) is actually an unsupervised learning task.
I experimented with density based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20 split, i.e. 80% of the data is used to train the model and 20% is used to test the model. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2) has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle'.
Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs.
To get higher match rates: firstly, it is important to answer the most popular questions and secondly, it is crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they want in their answers, which is likely to be the same as what they answered themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the lowest true match, which is the value published, would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100% - 100%/10 = 90%. Experiments for these are currently in progress.

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc).
Essentially, we have a huge dataset of the online dating world and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:26:47Z

RitikaJain: /* Almost-fake OkCupid account */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that broadly speaking, population can be segmented into clusters based on their behavioural attributes (which in this project are accessed using OkCupid questions and answers) and we can find a representative profile which broadly matches that cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like<ref name="snapshots"/>:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches<ref name="snapshots"/>]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that other users don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. They cover a multitude of topics, such as ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.
The snapshot for answering questions by going to some other user's profile is shown below<ref name="snapshots"/>:
[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to, and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.<ref name="snapshots"/>]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculation. Your match percentage with another person is determined by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfaction scores and then takes the square root: sqrt(91% * 99.6%) = ~95%.

Without a correction, two users who had answered just one question in common, each to the other's liking, would be matched at 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).
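The calculation for users A and B can be reproduced with a short Python sketch. This is only an illustration of the arithmetic described above, not OkCupid's actual code: the function names, the dictionary layout and the short question keys ("messy", "cheated") are hypothetical, and clamping the published value at 0% is an assumption.

```python
import math

# Importance levels and their point values, as in the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(judge, other):
    """Fraction of the judge's importance points that the other user earned.

    judge maps question -> (set of acceptable answers, importance level);
    other maps question -> the other user's own answer.
    Only questions answered by both users are counted.
    """
    earned = possible = 0
    for question, (acceptable, importance) in judge.items():
        if question not in other:
            continue
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0


def match_percentage(sat_a, sat_b, n_common):
    """Geometric mean of the two satisfactions, and that value lowered by 1/n."""
    calculated = math.sqrt(sat_a * sat_b)
    margin = 1.0 / n_common
    # Clamping at zero is an assumption made for this sketch.
    return calculated, max(0.0, calculated - margin)


# The two example questions: what each user wants, and what each answered.
a_wants = {
    "messy": ({"Average", "Very organized"}, "very important"),
    "cheated": ({"No"}, "a little important"),
}
b_wants = {
    "messy": ({"Average"}, "a little important"),
    "cheated": ({"No"}, "somewhat important"),
}
a_answers = {"messy": "Very organized", "cheated": "No"}
b_answers = {"messy": "Average", "cheated": "Yes"}

sat_a = satisfaction(a_wants, b_answers)  # 250/251
sat_b = satisfaction(b_wants, a_answers)  # 10/11
calculated, published = match_percentage(sat_a, sat_b, n_common=2)
print(f"{sat_a:.1%} {sat_b:.1%} {calculated:.1%} {published:.1%}")
```

Run as written, this prints 99.6% 90.9% 95.2% 45.2%, which agrees with the rounded figures in the example.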

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. In addition, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
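Because the answers are categories rather than numbers, the "nearest mean" idea needs some adaptation. The following Python sketch shows one common way to do this, under stated assumptions: distance is the number of questions on which two answer vectors differ, and a centroid is the most common answer per question. It is not the Weka implementation used in this project, and the toy rows and function names are hypothetical.

```python
from collections import Counter


def mismatches(row, centroid):
    """Number of attributes on which a row differs from a centroid."""
    return sum(1 for a, b in zip(row, centroid) if a != b)


def assign(rows, centroids):
    """Index of the nearest centroid for each row (ties go to the lowest index)."""
    return [min(range(len(centroids)), key=lambda k: mismatches(row, centroids[k]))
            for row in rows]


def update(rows, labels, centroids):
    """New centroid = per-attribute most common answer among a cluster's rows."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [row for row, label in zip(rows, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
            continue
        new_centroids.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
    return new_centroids


def cluster(rows, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop changing."""
    for _ in range(max_iter):
        labels = assign(rows, centroids)
        new_centroids = update(rows, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy answers to (q34113, q416235); not taken from the dataset.
rows = [
    ("Never-get a job", "Yes"),
    ("Never-get a job", "No"),
    ("Never-get a job", "Yes"),
    ("No problem", "No"),
    ("No problem", "No"),
]
labels, centroids = cluster(rows, centroids=[rows[0], rows[3]])
print(labels, centroids)
```

Each resulting centroid is itself a full set of answers, which is what makes it usable as the representative profile of its cluster.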

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

*Number of iterations: 2
*Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population appears to have a sense of humour: in clusters 0, 1 and 2, which together hold 97% of the instances, the centroid's response to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, q34113 is the attribute on which the cluster centroids differ most (three different answers across four clusters), so the online dating community is divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important, first, to answer the most popular questions and, second, to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers they wanted from a partner, which are likely to be the same as the ones they gave themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model, assume that you answer only these 4 questions. Your true match percentage is then calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the published match would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error then drops to 100%/10 = 10%. Experiments for these are currently in progress.
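The relationship between the number of common questions and the published match can be checked with a small Python sketch. It assumes only the 1/n margin of error described on this page; the function names are hypothetical, and exact fractions are used so that rounding in floating point does not push the question count up by one.

```python
import math
from fractions import Fraction


def published_match(calculated_pct, n_common):
    """Lowest possible match in percent: calculated match minus 100%/n."""
    return max(Fraction(0), Fraction(calculated_pct) - Fraction(100, n_common))


def questions_needed(target_pct, calculated_pct=100):
    """Smallest number of common questions whose margin still allows the target."""
    gap = Fraction(calculated_pct) - Fraction(target_pct)
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return math.ceil(Fraction(100) / gap)


print(float(published_match(100, 4)))   # 4 common questions
print(questions_needed(90))             # questions needed for a 90% match
print(round(float(published_match(100, 67)), 1))  # 67 answered questions
```

With a perfect calculated match, this gives 75.0 for 4 questions, 10 questions for a 90% target, and 98.5 for the 67 questions answered in my profile.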

===Discussion and Future Work===
There are various directions in which this project could be extended. I am currently in talks with Emil O. W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I would also be interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:22:58Z

RitikaJain: /* Annotated Bibliography */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are accessed through OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to build an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things, more than just match percentage and last login time, and shows some great matches that might not otherwise have shown up. However, I sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. As a woman, and one in Vancouver at that, that's about all I need to do. Profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that other users don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by users themselves, although some initial questions were written by the staff. They cover a multitude of topics, such as ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, I have answered 67 questions in my profile, so the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to, and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculation. Your match percentage with another person is determined by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that the first question was very important and the second only a little important, so 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfaction scores and then takes the square root: sqrt(91% * 99.6%) = ~95%.

Without a correction, two users who had answered just one question in common, each to the other's liking, would be matched at 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. In addition, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative profile for each segment, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

*Number of iterations: 2
*Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, whose centroid answers 'Never-get a job' to q34113. In other words, a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population appears to have a sense of humour: in clusters 0, 1 and 2, which together hold 97% of the instances, the centroid's response to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.
Broadly, q34113 is the attribute on which the cluster centroids differ most (three different answers across four clusters), so the online dating community is divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important, first, to answer the most popular questions and, second, to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these were also the answers they wanted from a partner, which are likely to be the same as the ones they gave themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model, assume that you answer only these 4 questions. Your true match percentage is then calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the published match would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error then drops to 100%/10 = 10%. Experiments for these are currently in progress.

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="snapshots">[http://www.okcupid.com/ Snapshots taken from my profile on okcupid.com]</ref>
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:20:37Z

RitikaJain: /* Discussion and Future Work */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to construct an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of reaching a higher match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 98.5% (that is, 100% - 100%/67, assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is determined by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, user A and user B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted (A's answer to the first question, 'Very organized', was not the 'Average' that B was looking for). So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).
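
The calculation in this example can be sketched in Python. This is a minimal sketch based only on the description on this page, not OkCupid's actual implementation; the names <code>IMPORTANCE_POINTS</code>, <code>satisfaction</code> and <code>lowest_match</code>, the short question keys and the dictionary layout of the answers are all assumptions made for illustration.

```python
from math import sqrt

# Point values taken from the importance table on this page.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}

def satisfaction(preferences, answers):
    """Fraction of importance points that `answers` earn under `preferences`.

    preferences maps question -> (set of acceptable answers, importance level).
    answers maps question -> the other person's answer.
    """
    possible = 0
    earned = 0
    for question, (acceptable, importance) in preferences.items():
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if answers[question] in acceptable:
            earned += points
    return earned / possible if possible else 0.0

def lowest_match(sat_a, sat_b, common_questions):
    """Square root of the product of satisfactions, minus the margin of error."""
    calculated = sqrt(sat_a * sat_b)
    margin = 1 / common_questions
    return max(0.0, calculated - margin)

# The two-question example between users A and B (hypothetical keys).
a_prefs = {"messy": ({"Average", "Very organized"}, "Very important"),
           "cheated": ({"No"}, "A little important")}
b_prefs = {"messy": ({"Average"}, "A little important"),
           "cheated": ({"No"}, "Somewhat important")}
a_answers = {"messy": "Very organized", "cheated": "No"}
b_answers = {"messy": "Average", "cheated": "Yes"}

sat_a = satisfaction(a_prefs, b_answers)  # how happy B's answers make A
sat_b = satisfaction(b_prefs, a_answers)  # how happy A's answers make B
print(round(sat_a * 100, 1), round(sat_b * 100, 1))
print(round(lowest_match(sat_a, sat_b, 2) * 100))
```

Keeping the satisfaction step separate from the margin-of-error step makes it easy to see that the large drop from 95% to 45% comes entirely from having only two questions in common.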

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy, and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and to find a representative profile for each segment, K-means serves my purpose. 
The instances (i.e. the cases that I need to cluster) are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
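
To make the assign-then-update loop of K-means concrete, here is a minimal Python sketch. It is not Weka's implementation and it is not run on the OkCupid data: it uses a handful of made-up 2-D numeric points and hand-picked starting centroids purely for illustration, whereas the answers clustered in this project are categorical. The function name <code>kmeans</code> and the sample points are assumptions.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then recompute means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Toy data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The returned centroids play the role of the "representative" of each segment described above.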

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of the data is shown below:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka, where I choose the attributes, is shown below:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
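
The 80-20% split described in the steps above can be illustrated with a short Python sketch. This only shows the idea of a percentage split; it is not how Weka performs the split internally, and the function name <code>percentage_split</code>, the seed value and the ten dummy instances are assumptions made for illustration.

```python
import random

def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle the instances reproducibly and split them into train and held-out parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten dummy instance ids stand in for user profiles.
train, held_out = percentage_split(range(10))
print(len(train), len(held_out))
```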

====Cluster Visualization====

*Number of iterations: 2
*Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered Instances
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster (cluster 0), which suggests that a large segment of the population feels that one should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population appears to have a sense of humour: the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered with 'Can't answer without a subtitle', which is the centroid value for clusters 0, 1 and 2 (79% + 11% + 7% of the instances).

Broadly, it seems safe to say that the online dating community is divided on how it feels about government-subsidized food programs, since this is the attribute on which the cluster centroids differ the most.

To get higher match rates, it is important first to answer the most popular questions and second to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.

The calculated match percentage would be 100% if you answered exactly this and these were also the answers the other person wanted, which is likely to be the case since people tend to like the options they themselves agree with. Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error
:Reasonable Margin of Error = 100%/4 = 25% (since you have 4 questions in common)
Therefore the lowest possible true match would be 100% - 25% = 75%.

To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error would then drop to 100%/10 = 10%. Experiments for these are currently in progress.
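
The relationship between the number of questions in common and the lowest possible match can be checked with a few lines of Python. This is a minimal sketch that assumes a perfect calculated match of 100% and the margin of error of 100%/n described on this page; the function names <code>lowest_possible_match</code> and <code>questions_needed</code> are hypothetical.

```python
def lowest_possible_match(common_questions, calculated_pct=100):
    """Lowest match (in percent) once the margin of error is subtracted."""
    margin_pct = 100 / common_questions
    return max(0.0, calculated_pct - margin_pct)

def questions_needed(target_pct):
    """Smallest number of common questions that allows a lowest match of target_pct.

    Assumes a perfect calculated match and an integer target below 100.
    Uses integer ceiling division to avoid floating-point rounding issues.
    """
    return -(-100 // (100 - target_pct))

print(lowest_possible_match(4))   # the 4-attribute model used here
print(questions_needed(90))       # the target stated in the hypothesis
```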

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper<ref name="paper1"/> [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation-systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:20:20Z

RitikaJain: /* Discussion and Future Work */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. The initial idea is to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then, for each cluster, to construct an ideal profile that would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. As a woman in Vancouver, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that others don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in descending order of the number of answers they already have.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of reaching a higher match percentage. For instance, in my profile I have answered 67 questions, so the highest match percentage possible is 98.5% (that is, 100% - 100%/67, assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is determined by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, user A and user B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted (A's answer to the first question, 'Very organized', was not the 'Average' that B was looking for). So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered only one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy, and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would have taken around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how the K-means algorithm works, as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative for each segment, K-means serves my purpose: each cluster centroid acts as that representative.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
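
The assign-then-update loop that K-means performs can be sketched in a few lines of Python. This is only an illustration on made-up numeric points, not the Weka implementation used in this project (the OkCupid attributes are categorical, and Weka handles those differently). The function names and the explicit <code>initial_centroids</code> argument are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and mean update until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # New centroid = mean of the members; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for c, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(toy_points, initial_centroids=[(0, 0), (10, 10)])
```

Each returned centroid is the mean of its cluster, which is the sense in which it serves as the cluster's prototype.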

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
| Clustered Instances
Cluster0: 79%
Cluster1: 11%
Cluster2: 7%
Cluster3: 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the centroid answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle' in clusters 0, 1 and 2, which together contain 97% of the instances.
Broadly, it seems safe to say that the online dating community is highly divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they have answered themselves (the human psyche is bent towards liking options it agrees with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
True Match = Calculated Match +/- Reasonable Margin of Error.
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100%/10 = 10%. Experiments for these are currently in progress.
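
The margin-of-error arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and clamping the result at 0% is my own assumption rather than documented OkCupid behaviour.

```python
import math


def margin_of_error(shared_questions):
    """Margin of error in percentage points: 100% / number of shared questions."""
    if shared_questions < 1:
        raise ValueError("need at least one shared question")
    return 100.0 / shared_questions


def lowest_match(calculated_match, shared_questions):
    """Lowest possible true match; clamped at 0 (an assumption)."""
    return max(0.0, calculated_match - margin_of_error(shared_questions))


def questions_needed(target_match, calculated_match=100.0):
    """Smallest number of shared questions whose lowest match reaches the target."""
    gap = calculated_match - target_match
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    return math.ceil(100.0 / gap)


four_question_match = lowest_match(100.0, 4)   # the 4-attribute model used here
needed_for_90 = questions_needed(90.0)         # attributes needed for a 90% match
```

With a perfect calculated match, the number of shared questions alone bounds the published percentage, which is why 4 attributes cap the match at 75% and 10 are needed to reach 90%.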

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]<ref name="paper1"/> to look into the possibility of framing the problem as a recommender systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:19:51Z

RitikaJain: /* OkCupid's question and answers */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are accessed using OkCupid questions and answers), and we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then for each cluster to try to make an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first.<ref name="paper1"/> This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of finding a higher match percentage. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5%, since the margin of error is 100%/67, or about 1.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculation. Your match percentage with another person is determined by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose that both A and B have answered the following two questions:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that the first question was very important and that the second was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

A calculated match can be misleading when two people share very few questions: if A and B had answered just one common question and satisfied each other on it, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both), and treats True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).
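
The worked example can be reproduced with a small Python sketch. The dictionary layout, the question keys and the function names are hypothetical choices for illustration. The point values come from the importance table above. Returning 0 when no points are at stake, and clamping the lowest match at 0%, are my own assumptions.

```python
import math

# Importance levels and their point values, as listed in the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}


def satisfaction(rater, other):
    """Fraction of the rater's importance points earned by the other's answers.

    Each profile maps question -> (own answer, acceptable answers, importance).
    """
    possible = earned = 0
    for question, (_, acceptable, importance) in rater.items():
        if question not in other:
            continue  # only questions answered by both users count
        points = IMPORTANCE_POINTS[importance]
        possible += points
        if other[question][0] in acceptable:
            earned += points
    return earned / possible if possible else 0.0  # assumption for the 0-point case


def match_percentages(a, b):
    """Return (calculated match, lowest published match), both in percent."""
    shared = len(a.keys() & b.keys())
    if shared == 0:
        return 0.0, 0.0
    calculated = 100 * math.sqrt(satisfaction(a, b) * satisfaction(b, a))
    return calculated, max(0.0, calculated - 100 / shared)


user_a = {
    "messy": ("Very organized", {"Average", "Very organized"}, "very important"),
    "cheated": ("No", {"No"}, "a little important"),
}
user_b = {
    "messy": ("Average", {"Average"}, "a little important"),
    "cheated": ("Yes", {"No"}, "somewhat important"),
}

calculated, lowest = match_percentages(user_a, user_b)
print(f"calculated: {calculated:.1f}%, lowest: {lowest:.1f}%")
```

Taking the square root of the product (a geometric mean) means that a low satisfaction on either side pulls the match down more than a plain average would.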

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts to reliably extract questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has the questions as its instances, and the attributes are the answer options of each question. Part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set is shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how the K-means algorithm works, as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative for each segment, K-means serves my purpose: each cluster centroid acts as that representative.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
| Clustered Instances
Cluster0: 79%
Cluster1: 11%
Cluster2: 7%
Cluster3: 3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the centroid answer to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle' in clusters 0, 1 and 2, which together contain 97% of the instances.
Broadly, it seems safe to say that the online dating community is highly divided on how it feels about government-subsidized food programs.
To get higher match rates, it is important firstly to answer the most popular questions and secondly to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would, in expectation, match you with someone from the same cluster with a minimum match percentage of 75%.
The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they have answered themselves (the human psyche is bent towards liking options it agrees with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
True Match = Calculated Match +/- Reasonable Margin of Error.
Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
Therefore the lowest possible true match would be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since 100%/10 = 10%. Experiments for these are currently in progress.

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommender systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I am also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:18:32Z

RitikaJain: /* Annotated Bibliography */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are accessed using OkCupid questions and answers), and we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are to form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on, and then for each cluster to try to make an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. Questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first. This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of finding a higher match percentage. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5%, since the margin of error is 100%/67, or about 1.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other person's preferences, OkCupid uses these values to weight the calculation. Your match percentage with another person is determined by answering the following two questions:

a) How much did the other person's answers make you happy?

b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose that both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfaction scores and takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It considers that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).
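
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the steps as described on this page, not OkCupid's actual code: the names <code>satisfaction</code>, <code>match</code>, <code>user_a</code> and <code>user_b</code>, the dictionary layout, and the handling of the edge cases (no common questions, all questions irrelevant, flooring at 0) are assumptions made for the example.

```python
from math import sqrt

# Point values for the four importance levels, as in the table above.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}


def satisfaction(me, other, common):
    """Share of my importance points that the other person's answers earned."""
    earned = possible = 0
    for question in common:
        points = IMPORTANCE_POINTS[me[question]["importance"]]
        possible += points
        if other[question]["answer"] in me[question]["accepts"]:
            earned += points
    # Assumption: if every common question is irrelevant to me, score 0.
    return earned / possible if possible else 0.0


def match(a, b):
    """Return (calculated match, lowest possible match), both between 0 and 1."""
    common = set(a) & set(b)
    if not common:
        return 0.0, 0.0
    calculated = sqrt(satisfaction(a, b, common) * satisfaction(b, a, common))
    margin = 1 / len(common)  # reasonable margin of error
    # Assumption: the published value is floored at 0.
    return calculated, max(0.0, calculated - margin)


# Users A and B from the worked example (question keys are hypothetical labels).
user_a = {
    "messy": {"answer": "Very organized",
              "accepts": {"Average", "Very organized"},
              "importance": "Very important"},
    "cheated": {"answer": "No", "accepts": {"No"},
                "importance": "A little important"},
}
user_b = {
    "messy": {"answer": "Average", "accepts": {"Average"},
              "importance": "A little important"},
    "cheated": {"answer": "Yes", "accepts": {"No"},
                "importance": "Somewhat important"},
}

calculated, lowest = match(user_a, user_b)
print(f"calculated: {calculated:.1%}, lowest possible: {lowest:.1%}")
```

Run on users A and B, the sketch gives a calculated match of roughly 95% and a lowest possible match of roughly 45%, in line with the worked example.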

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. Also, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative for each segment, K-means serves my purpose well.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.
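
Because the answers are nominal values rather than numbers, a mean is not directly defined for them. One common adaptation, sketched below, uses the most frequent answer per question as the cluster centroid and counts differing answers as the distance. This is a minimal illustration under stated assumptions and does not reproduce Weka's implementation: the function names, the toy <code>rows</code> and the fixed starting centroids are all hypothetical.

```python
from collections import Counter


def mismatches(row, centroid):
    """Distance between two answer tuples: number of questions answered differently."""
    return sum(a != b for a, b in zip(row, centroid))


def mode_centroid(rows):
    """Most common answer per question across the rows of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def cluster(rows, centroids, max_iter=20):
    """Alternate assignment and centroid update until the centroids stop changing."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each row joins the cluster with the nearest centroid.
        groups = [[] for _ in centroids]
        for row in rows:
            nearest = min(range(len(centroids)),
                          key=lambda i: mismatches(row, centroids[i]))
            groups[nearest].append(row)
        # Update step: keep the old centroid if a cluster ends up empty.
        updated = [mode_centroid(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups


# Hypothetical toy data: one tuple of answers per user, one position per question.
rows = [
    ("Never-get a job", "rose", "Can't answer without a subtitle"),
    ("Never-get a job", "rose", "Can't answer without a subtitle"),
    ("Never-get a job", "red", "Can't answer without a subtitle"),
    ("No problem", "red", "Yes"),
    ("No problem", "red", "Yes"),
    ("No problem", "white", "Yes"),
]

# Assumption: fixed starting centroids, chosen here only to keep the run repeatable.
centroids, groups = cluster(rows, [rows[0], rows[3]])
for centroid, members in zip(centroids, groups):
    print(len(members), centroid)
```

Each resulting centroid can be read as the representative profile of its cluster, which is the role the cluster centroids play in the rest of this page.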

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test the model. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to validate its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Clustered instances
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered as 'Can't answer without a subtitle' (the centroid answer of clusters 0, 1 and 2, which together hold 79% + 11% + 7% = 97% of the instances).

Broadly, I think it is safe to say that the online dating community is highly divided on how it feels about government-subsidized food programs.

To get higher match rates, it is firstly important to answer the most popular questions and secondly crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.

The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (the human psyche is bent towards liking options it itself agrees with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the true match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error would then be 100%/10 = 10%. Experiments for these are currently in progress.
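
The relation between the number of common questions and the lowest published match can be checked with a small sketch. It assumes the 1/n margin of error described earlier on this page; the function names are hypothetical, and percentages are passed as integers or strings rather than floats so that the arithmetic stays exact.

```python
import math
from fractions import Fraction


def lowest_published_match(calculated_percent, common_questions):
    """Calculated match minus the 100%/n margin of error (assumed floored at 0)."""
    lowest = Fraction(calculated_percent) - Fraction(100, common_questions)
    return max(Fraction(0), lowest)


def questions_needed(target_percent, calculated_percent=100):
    """Smallest number of common questions whose lowest match reaches the target."""
    gap = Fraction(calculated_percent) - Fraction(target_percent)
    if gap <= 0:
        raise ValueError("target must be below the calculated match")
    # Solve calculated - 100/n >= target for n, i.e. n >= 100 / gap.
    return math.ceil(Fraction(100) / gap)


print(lowest_published_match(100, 4))  # 4 common questions
print(questions_needed(90))            # target from the hypothesis
print(questions_needed("98.5"))        # highest match quoted for my own profile
```

With a calculated match of 100%, 4 common questions give a lowest match of 75%, a 90% target needs 10 common questions, and 98.5% needs 67, which agrees with the 67 questions answered in my own profile.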

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:18:14Z

RitikaJain: /* Discussion and Future Work */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are accessed using OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on; and then, for each cluster, try to make an ideal profile which would achieve a greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.

The inspiration comes from this math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legit user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. Profile likes, profile visits and messages stream into my OkCupid inbox. This validates an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png| thumb | center|700px| I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see the people who are viewing my profile. I also have the option to browse profiles invisibly, so that other users don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
Most questions on the site are written by the users themselves, although some initial questions were written by the staff. The questions cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first. This minimizes the number of questions a new user has to answer before having many in common with other users, but it comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going to it directly, or by finding it in another user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of finding a higher match percentage. For instance, I have answered 67 questions in my profile, and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The match-percentage algorithm detailed below makes this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match: numeric values are assigned to the four levels of importance (irrelevant, a little important, somewhat important and very important).

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These point values are used as weights when checking how well each person's answers satisfy the other's preferences. Your match percentage with another person is determined by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, A and B, and find the match percentage between them.
Suppose that both A and B have answered the following two questions.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question how A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question how B wanted. So A's answers made B 10/11 = 91% happy.

To get the match percentage for A and B, OkCupid multiplies the two satisfaction scores and takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It considers that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues. OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy. Also, I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative for each segment, K-means serves my purpose well.
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the behavioural attributes I am trying to model) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test the model. After experimenting with different numbers of clusters, I chose 4, as each cluster then had enough elements to validate its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Clustered instances
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which means that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the question 'Do you like watching foreign movies with subtitles?' is almost unanimously answered as 'Can't answer without a subtitle' (the centroid answer of clusters 0, 1 and 2, which together hold 79% + 11% + 7% = 97% of the instances).

Broadly, I think it is safe to say that the online dating community is highly divided on how it feels about government-subsidized food programs.

To get higher match rates, it is firstly important to answer the most popular questions and secondly crucial to answer them the way the majority in your desired cluster answers them. For instance, if you would like to date someone from cluster 0, you should answer the 4 most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. This would in expectation match you with someone from the same cluster with a minimum match percentage of 75%.

The match percentage would be 100% if you answered exactly this and this is also what they wanted in their answers, which is likely to be the same as what they answered themselves (the human psyche is bent towards liking options it itself agrees with). Since I have chosen only the 4 most popular attributes for my model, and assuming that you answer only these 4 questions, your calculated match percentage would be 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the true match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select the 10 most popular attributes, since the margin of error would then be 100%/10 = 10%. Experiments for these are currently in progress.

===Discussion and Future Work===
There are various possible directions in which this project could be extended. I am currently in talks with Emil O.W. Kirkegaard and his co-authors of the paper [https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users] to look into the possibility of framing the problem as a recommendation systems problem and improving upon OkCupid's current algorithms for recommending matches to users.
I was also interested in doing a country-based comparison of online-dating trends to analyze how culture affects online dating (concerning factors like religion, marriage, food etc.).
Essentially, we have a huge dataset of the online dating world, and there are immense opportunities for collaboration with sociologists to find or validate interesting hypotheses.

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
*<ref name="paper2">V. K. Singh; R. Piryani ; A. Uddin ; P. Waila [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6526500&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F6520973%2F6526372%2F06526500.pdf%3Farnumber%3D6526500 Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification], Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on Date of Conference: 22-23 March 2013 Page(s): 712 - 717 Print ISBN: 978-1-4673-5089-1 INSPEC Accession Number: 13567093 Conference Location : Kottayam DOI: 10.1109/iMac4s.2013.6526500 Publisher: IEEE</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T16:03:07Z

RitikaJain: /* Evaluation and Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OKCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, the population can be segmented into clusters based on behavioural attributes (which in this project are accessed using OkCupid questions and answers), and that we can find a representative profile which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in a partner they cannot compromise on; then, for each cluster, try to make an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train on, cluster and visualize it.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see the people who are viewing my profile. I have the option to browse profiles invisibly, so that their owners don't know I am looking at them. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by the users themselves, although some initial questions were made by the staff. Questions cover a multitude of topics, such as Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most answered first. This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the higher the match percentage I can reach. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below, which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

:a) Your answer
:b) How you'd like someone else to answer
:c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used as weights when looking at how well each person's answers satisfied the other's preferences. Your match percentage with another person is worked out by answering the following two questions:

:a) How much did the other person's answers make you happy?
:b) How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users, user A and user B, and try to find the match percentage between them.
Say, for instance, that the following two questions have been answered by both A and B.

'''How messy are you?''' The answer options are:
:1. Very messy
:2. Average
:3. Very organized
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are:
:1. Yes
:2. No

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important to her, and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second question. Of those 251 possible points, B earned 250 by answering the first question how A wanted (B's answer to the second question, 'Yes', was not what A wanted, so it earned nothing). So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 points: A's answer to the second question, 'No', was what B wanted, but A's answer to the first, 'Very organized', was not. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and then takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had both answered just one common question and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To rule this out, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes the view that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 45% (95% - (1/2)*100%).
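
The calculation above can be sketched in a few lines of Python. This is only a minimal illustration of the arithmetic described on this page, not OkCupid's actual code: the function names, the tuple layout and the handling of the case where every question is marked irrelevant (treated as 0 satisfaction) are assumptions made for the example.

```python
from math import sqrt

# Importance levels and their point values, as in the table above.
IMPORTANCE_POINTS = {
    "irrelevant": 0,
    "a little important": 1,
    "somewhat important": 10,
    "very important": 250,
}

def satisfaction(questions):
    """Fraction of importance points earned by the other person.

    `questions` is a list of (importance, other_answer, accepted_answers)
    tuples, seen from the point of view of the person being satisfied.
    """
    possible = sum(IMPORTANCE_POINTS[imp] for imp, _, _ in questions)
    earned = sum(IMPORTANCE_POINTS[imp]
                 for imp, answer, accepted in questions
                 if answer in accepted)
    # Assumption: if nothing matters to this person, report 0 satisfaction.
    return earned / possible if possible else 0.0

def match_percentages(a_view, b_view):
    """Return (calculated match, published match), both in percent."""
    common = len(a_view)  # number of questions answered by both users
    calculated = sqrt(satisfaction(a_view) * satisfaction(b_view))
    margin = 1 / common   # reasonable margin of error
    published = max(0.0, calculated - margin)
    return calculated * 100, published * 100

# The worked example: what A thinks of B's answers, and vice versa.
a_view = [("very important", "Average", {"Average", "Very organized"}),
          ("a little important", "Yes", {"No"})]
b_view = [("a little important", "Very organized", {"Average"}),
          ("somewhat important", "No", {"No"})]

calculated, published = match_percentages(a_view, b_view)
print(round(calculated), round(published))
```

For the two users above this gives roughly 95% for the calculated match and 45% for the published match, in line with the hand calculation.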

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily got a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (i.e. the ones I use to model behavioural attributes) are the questions answered by a majority of the population.
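
Because the answers are categorical rather than numeric, one way to picture the clustering is the sketch below. It is a k-modes-style variant of the k-means loop, in which the distance between two users is the number of questions they answered differently and each centroid is the most common answer per question. These choices, the function names and the toy data are assumptions for illustration; they are not taken from Weka and may differ from what Weka does internally.

```python
from collections import Counter

def mismatch_distance(a, b):
    """Number of attributes on which two answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_profile(rows):
    """Most common answer per attribute (ties go to the answer seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def cluster_answers(rows, initial_centroids, max_iter=10):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid, ties go to the lowest index.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: mismatch_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mode profile of its members.
        for c in range(len(centroids)):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = mode_profile(members)
    return centroids, assignment

# Hypothetical toy data: (food-programs answer, wine answer) per user.
rows = [("No problem", "Red"), ("No problem", "Red"),
        ("Never", "Rose"), ("Never", "Rose"), ("Never", "Red")]
centroids, assignment = cluster_answers(rows, [rows[0], rows[2]])
print(centroids, assignment)
```

Each resulting centroid can be read as the representative profile of its cluster, which is how the centroid table further down this page is interpreted.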

====Training====
Since there are no actual labels we can use, clustering these instances (user profiles) according to attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
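
The 80-20% split mentioned above can be pictured with the following minimal sketch. It only illustrates the idea of a percentage split; the function name, the shuffling and the seed are assumptions, and the way Weka itself orders and splits the instances may differ.

```python
import random

def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle a copy of the instances and cut it into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage with ten placeholder instances.
train, test = percentage_split(range(10))
print(len(train), len(test))
```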

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|-
!Cluster
!Share of clustered instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Results===
We see that the clustering is heavily skewed towards the first cluster, i.e. cluster 0, which suggests that a large segment of the population feels that you should get a job instead of relying on government-subsidized food programs. Another interesting observation is that 97% of the online-dating population has a sense of humour. I say this because the centroids of clusters 0, 1 and 2 (79% + 11% + 7% = 97% of the instances) answer 'Do you like watching foreign movies with subtitles?' with 'Can't answer without a subtitle'.
Broadly, the clusters differ most in how they feel about government-subsidized food programs: it is the only question on which three different answers appear among the four centroids.
To get higher match rates, it is firstly important to answer the most popular questions and secondly crucial to answer them the way the majority in your desired cluster does. For instance, if you would like to date someone from cluster 0, you should answer the four most popular questions (q34113, q85419, q416235 and q20062) as 'Never-get a job', 'Rosé (such as White Zinfindel)', 'Can't answer without a subtitle' and 'The best? Maybe...' respectively. In expectation, this would match you with someone from the same cluster with a minimum match percentage of 75%.
The calculated match percentage would be 100% if you answered exactly this and these are also the answers they want from a partner, which are likely to be the same as the answers they gave themselves (people tend to like options they themselves agree with). Since I have chosen only the 4 most popular attributes for my model, assume that you answer only these 4 questions, so that your calculated match percentage is 100%. Your true match percentage, however, is calculated as follows:
:True Match = Calculated Match +/- Reasonable Margin of Error.
:Reasonable margin of error = 100%/4 (since you have 4 questions in common) = 25%.
:Therefore the published match would actually be 100% - 25% = 75%.
To increase the match rate to 90% (as stated in the hypothesis), I would need to select at least the 10 most popular attributes, since the margin of error would then be 100%/10 = 10%. Experiments for these are currently in progress.
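
The relationship between the number of common questions and the best achievable published match can be checked with a short sketch. It assumes a perfect calculated match of 100% and the margin of error of 100%/n described earlier on this page; the function names are made up for this example, and whole-number percentages are assumed so that the arithmetic stays exact.

```python
def published_match(calculated_percent, common_questions):
    """Lowest possible match: calculated match minus the margin of error."""
    margin = 100 / common_questions
    return max(0.0, calculated_percent - margin)

def questions_needed(target_percent):
    """Fewest common questions for which a perfect (100%) calculated match
    is published at target_percent or higher. Expects a whole number."""
    if not 0 <= target_percent < 100:
        raise ValueError("target_percent must be in [0, 100)")
    allowed_margin = 100 - target_percent
    return -(-100 // allowed_margin)  # integer ceiling of 100 / allowed_margin

print(published_match(100, 4))            # four common questions
print(questions_needed(90))               # questions needed for a 90% match
print(round(published_match(100, 67), 1)) # my profile with 67 questions
```

With 4 questions the published match is 75%, 10 questions are needed to reach 90%, and 67 questions give 98.5%, which agrees with the figures quoted on this page.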

===Discussion and Future Work===

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
*<ref name="paper2">V. K. Singh; R. Piryani ; A. Uddin ; P. Waila [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6526500&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F6520973%2F6526372%2F06526500.pdf%3Farnumber%3D6526500 Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification], Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on Date of Conference: 22-23 March 2013 Page(s): 712 - 717 Print ISBN: 978-1-4673-5089-1 INSPEC Accession Number: 13567093 Conference Location : Kottayam DOI: 10.1109/iMac4s.2013.6526500 Publisher: IEEE</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T14:57:43Z

RitikaJain: /* Evaluation and Results */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) has been analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, a population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that for each cluster we can find a representative profile which broadly matches it.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present the analysis on online dating trends (some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in the partner they cannot compromise on; and then for each cluster try to make an ideal profile which would achieve greater than 90% match rate). I will be working with OkCupid's dataset and using Weka to train, cluster and visualize OkCupid's dataset.
Inspiration from this Math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I go on to see how to find matches for myself. OkCupid gives you the option to find matches according to preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things—more than just match percentage and last login time—and show some great matches that might not otherwise have shown up. I however sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do. I get profile likes, profile visits and messages streaming into my OkCupid inbox. This is consistent with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see people who are viewing my profile. I also have the options to browse invisibly at profiles, so they don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are made by the users themselves, although some initial questions were made by the staff. Questions cover a multitude of topics, such as Ethics, Sex, Religion, Lifestyle and Dating. By default, questions are presented to users in order of how many answers they already have, most answered first. This minimizes the number of questions that a new user has to answer before having many in common with other users, but comes at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more the number of questions I answer, the more are my chances of finding a higher match percentage. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages is detailed below which will make things clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Assigning numeric values to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. Looking at how each of your answers satisfied the other’s preferences, they use these values to give correct weights to the calculations. Your match percentage with other person is figured by answering the following two questions:

a) How much did other person’s answer make you happy?

b)How much did your answers make the other person happy?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''

A indicated that B's answer to the first question was very important to her, and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second question. Of those 251 possible points, B earned 250 by answering the first question how A wanted (B's answer to the second question, 'Yes', was not what A wanted, so it earned nothing). So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />

'''How much did A's answers make B happy?'''

B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 points: A's answer to the second question, 'No', was what B wanted, but A's answer to the first, 'Very organized', was not. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, they just multiply your satisfactions, and then take the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had both answered just one common question and each had given the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To rule this out, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both). It takes the view that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, OkCupid always publishes the lowest possible percentage your match can be. In this example, with two questions in common, that would be 45% (95% - (1/2)*100%).

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a web crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this, I tried to create an automatic form-filling bot, which worked but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project that had created an OkCupid scraping bot written in Python using Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here]. 
Trying to extract the amount of data required would take around 10 computers to run the bots on their systems. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances and the attributes are the options of each question. A part of the question data set has been shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how K-means algorithm works as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster are the answers of users and the attributes I am clustering on (i.e. trying to model their behavioural attributes) are the questions answered by a majority of the population.

====Training====
Since there are no actual labels we can use, clustering these instances(user profiles) according to attributes(questions) is actually an unsupervised learning task.
I experimented with density based clustering and K-means clustering. K-means clustering gave me better outcomes.
On Weka, the process can be shown as following:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use 80-20% split, i.e 80% of the data is used to train the model and 20% is used to test the model. Playing around with different number of clusters, I decided to choose 4 as each cluster had enough elements to validate its existence. 
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]

===Evaluation and Results===
We see that the clustering is heavily skewed towards the first cluster, cluster 0, whose centroid answers 'Never-get a job' to the question about government-subsidized food programs. In other words, a large segment of the population feels that you should get a job instead of relying on such programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2 combined) appears to have a sense of humour: for these clusters, the centroid response to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.

===Discussion and Future Work===

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
*<ref name="paper2">V. K. Singh; R. Piryani ; A. Uddin ; P. Waila [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6526500&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F6520973%2F6526372%2F06526500.pdf%3Farnumber%3D6526500 Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification], Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on Date of Conference: 22-23 March 2013 Page(s): 712 - 717 Print ISBN: 978-1-4673-5089-1 INSPEC Accession Number: 13567093 Conference Location : Kottayam DOI: 10.1109/iMac4s.2013.6526500 Publisher: IEEE</ref>
}}

== To Add ==

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T14:57:32Z

RitikaJain: /* Cluster Visualization */

Course:CPSC522/Analyzing online dating trends with Weka

2016-04-18T14:57:00Z

RitikaJain: /* Cluster Visualization */

== Analyzing online dating trends with Weka==

Author: Ritika Jain 

== Abstract ==
A very large OkCupid dataset (with 2620 attributes and over 68,000 instances) is analyzed to form clusters as an unsupervised learning task. The rationale behind the clustering is that, broadly speaking, a population can be segmented into clusters based on behavioural attributes (which in this project are assessed using OkCupid questions and answers), and that a representative profile can be found which broadly matches each cluster.


===Builds on===
*[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]
*[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]
*[https://en.wikipedia.org/wiki/K-means_clustering K means clustering]

===Related Pages===
*[http://www.ted.com/talks/amy_webb_how_i_hacked_online_dating Amy Webb: How I hacked online dating]
*[http://www.datascienceweekly.org/data-scientist-interviews/how-machine-learning-can-transform-online-dating-kang-zhao-interview How machine learning can transform online dating]
*[https://osf.io/hy86q/ The effects of humor and laughter on perceived intelligence and dating success]

== Content ==
===Hypothesis===
The goal of this page is to present an analysis of online dating trends. Some initial ideas are: form clusters of dating profiles of men and women based on the questions they answer and the values in a partner that they cannot compromise on; and then, for each cluster, try to construct an ideal profile which would achieve a match rate greater than 90%. I will be working with OkCupid's dataset and using Weka to train, cluster and visualize it.
The inspiration comes from this math geek<ref name="main" />: http://www.wired.com/2014/01/how-to-hack-okcupid/

===Almost-fake OkCupid account===
To understand how OkCupid works, the first step was to create an almost-fake account for myself. I say fake because I'm not actively looking to date online. I say almost because I still want to be taken as a legitimate user by the online dating community.
So here is what my profile looks like:

[[File:My profile.png|thumb|center|700px|This is the screenshot of my OkCupid profile.]]

Next, I look at how to find matches for myself. OkCupid gives you the option to find matches according to the preferred age, orientation, and location of who you want to meet. I make these changes in my "I'm Looking For" settings, and this immediately adjusts the matches I see around the site. I can also use the “Order By” dropdown to choose how my matches are ranked and presented to me. Sorting by OkCupid's “Special Blend” looks at many different things (more than just match percentage and last login time) and shows some great matches that might not otherwise have shown up. I, however, sort by match percentage because that is the parameter I am interested in for this project.
[[File:Browse matches.png|thumb|center|700px|Browse matches]]

Interestingly enough, I don't need to do much more than this. Being a woman, and in Vancouver at that, that's about all I need to do: profile likes, profile visits and messages stream into my OkCupid inbox. This agrees with an interesting article by OkTrends called [https://www.okcupid.com/deep-end/a-womans-advantage 'A Woman's advantage'].
[[File:Likes.png|thumb|center|700px|I get profile likes, but I cannot see the people who like me unless I upgrade to A-list.]]

I can also see who is viewing my profile. I also have the option to browse profiles invisibly, so that people don't know I am looking at their profiles. I can also hide and unhide 'real' stalkers.
[[File:Stalkers.png|thumb|center|700px|Stalkers: People viewing my profile]]

====OkCupid's question and answers====
The questions on the site are written by the users themselves, although some initial questions were written by the staff. They cover a multitude of topics, including ethics, sex, religion, lifestyle and dating. By default, questions are presented to users in order of how many answers they already have, most-answered first. This minimizes the number of questions that a new user has to answer before having many in common with other users, but at the cost of starkly decreasing the diversity of questions answered. Still, because users often answer hundreds if not thousands of questions, a great amount of data is available. It is also possible to answer any question by going directly to it, or by finding it in some other user's profile.

[[File:Qans1.png|thumb|center|700px|Answering questions by going to other user's profile]]

The more questions I answer, the better my chances of finding a higher match percentage. For instance, in my profile I have answered 67 questions and the highest match percentage possible is 98.5% (assuming someone answers all my questions exactly how I would like them to and I don't give wrong answers to the questions important to them). The algorithm for match percentages, detailed below, will make this clearer.

[[File:Qans2.png|thumb|center|700px|Answering a question directly. A question asks you: your response, the other person's acceptable responses and the question's importance to you.]]

====OkCupid's matching algorithm====
After setting up my profile and understanding my way around the basic functionalities of the site, I go on to explore how OkCupid suggests matches to me. OkCupid has a very interesting matching algorithm. They calculate match percentages by matching your answers to questions with the other person's expected answers and vice versa.

1. For each question, three values are collected from a user:

a) Your answer
b) How you'd like someone else to answer
c) How important the question is to you

2. Calculating the match:
Numeric values are assigned to the four levels of importance: irrelevant, a little important, somewhat important and very important.

{| class="wikitable"
!Level of Importance
!Point value
|-
|Irrelevant
|0
|-
|A little important
|1
|-
|Somewhat important
|10
|-
|Very important
|250
|}

3. These values are used to weight how well each person's answers satisfy the other's preferences. Your match percentage with another person is computed by answering the following two questions:

a) How happy did the other person's answers make you?

b) How happy did your answers make the other person?


===Examples of matches===
Let us understand how this works with the help of an example.
We shall consider two users: user A and user B and try to find match percentage between them.
Say for instance the question that has been answered by both A and B is:
'''How messy are you?''' The answer options are: 
1. Very messy 
2. Average 
3. Very organized 
{| class="wikitable"
|-
|A's answer
|Very organized
|-
|How A wants someone else to answer
|Average or very organized
|-
|The question's importance to A
|Very important
|-
|B's answer
|Average
|-
|How B wants someone else to answer
|Average
|-
|The question's importance to B
|A little important
|}

'''Have you ever cheated in a relationship?''' The answer options are: 
1. Yes 
2. No 

{| class="wikitable"
|-
|A's answer
|No
|-
|How A wants someone else to answer
|No
|-
|The question's importance to A
|A little important
|-
|B's answer
|Yes
|-
|How B wants someone else to answer
|No
|-
|The question's importance to B
|Somewhat important
|}

'''How much did B's answers make A happy?'''
A indicated that B's answer to the first question was very important and that B's answer to the second question was only a little important. So 250 importance points are placed on the first question and 1 point on the second. Of those 251 possible points, B earned 250 by answering the first question the way A wanted. So B's answers were 250/251 = 99.6% satisfactory for A.<ref name="support" />
'''How much did A's answers make B happy?'''
B placed 1 importance point on A's answer to the first question and 10 on A's answer to the second. Of those 11 possible points, A earned 10 by answering the second question the way B wanted. So A's answers made B 10/11 = 91% happy.

To get a match percentage for A and B, OkCupid multiplies the two satisfactions and takes the square root: sqrt(91% * 99.6%) = ~95%.

However, if A and B had answered just one question in common and each gave the answer the other wanted, they would be matched 100% even though they know practically nothing about each other. To guard against this, OkCupid uses a reasonable margin of error, calculated as 1/(number of questions answered by both users), and assumes that True Match = Calculated Match +/- Reasonable Margin of Error. To give users the most confidence in the match process, it always publishes the lowest possible percentage the match can be. In this example, that would be 45% (95% - (1/2)*100%).
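The worked example above can be reproduced with a short Python sketch. This is only an illustration of the calculation as described on this page, not OkCupid's actual code; the function names and the dictionary layout are hypothetical, and the sketch assumes both users answered the same set of questions.

```python
import math

# Importance weights taken from the table of point values above.
IMPORTANCE_POINTS = {
    "Irrelevant": 0,
    "A little important": 1,
    "Somewhat important": 10,
    "Very important": 250,
}

def satisfaction(preferences, other_answers):
    """Fraction of importance points the other user's answers earn.

    preferences: {question: (set of acceptable answers, importance label)}
    other_answers: {question: answer given by the other user}
    """
    possible = sum(IMPORTANCE_POINTS[imp] for _, imp in preferences.values())
    earned = sum(
        IMPORTANCE_POINTS[imp]
        for question, (acceptable, imp) in preferences.items()
        if other_answers[question] in acceptable
    )
    return earned / possible if possible else 0.0

def lowest_match(prefs_a, answers_a, prefs_b, answers_b):
    """Geometric mean of both satisfactions minus the 1/n margin of error."""
    sat_a = satisfaction(prefs_a, answers_b)  # how happy B's answers make A
    sat_b = satisfaction(prefs_b, answers_a)  # how happy A's answers make B
    match = math.sqrt(sat_a * sat_b)
    margin = 1 / len(prefs_a)                 # n = questions answered by both
    return match, max(0.0, match - margin)

# Data from the two example questions (hypothetical short keys).
answers_a = {"messy": "Very organized", "cheated": "No"}
answers_b = {"messy": "Average", "cheated": "Yes"}
prefs_a = {
    "messy": ({"Average", "Very organized"}, "Very important"),
    "cheated": ({"No"}, "A little important"),
}
prefs_b = {
    "messy": ({"Average"}, "A little important"),
    "cheated": ({"No"}, "Somewhat important"),
}

match, published = lowest_match(prefs_a, answers_a, prefs_b, answers_b)
print(f"calculated match: {match:.1%}, published match: {published:.1%}")
```

With the unrounded fractions 250/251 and 10/11 the calculated match is about 95.2% and the published value about 45.2%, which agrees with the rounded figures in the text.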

===What is Weka?===
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.<ref name="weka" />

Weka can be downloaded from [http://www.cs.waikato.ac.nz/ml/weka/downloading.html here].

===Methodology===

====Crawling data from OkCupid's website====
Since OkCupid's data is not publicly accessible, I first tried to write a website crawler to scrape data from the OkCupid website. I ran into several issues: OkCupid maintains a dynamically changing HTML structure, which did not allow me to extract elements statically by referencing them through the HTML hierarchy; and I can only see other users' answers to questions that I have answered myself (to overcome this issue, I tried to create an automatic form-filling bot, which worked fine but with a few limitations). After several failed attempts at reliably extracting questions and answers from other users' profiles, I came across a GitHub project which had created an OkCupid scraping bot written in Python and based on Scrapy. The project can be accessed [https://github.com/Deleetdk/OKCubot2 here].
Extracting the amount of data required would take around 10 computers running the bots. Because of time and resource constraints, I tried and then decided against running them on different systems with different fake profiles. Instead, I contacted the authors and luckily managed to get a reply from them. They graciously sent me the passwords for the encrypted user data set as well as the question data set.
The question data set has all the questions as its instances, and the attributes are the options of each question. A part of the question data set is shown in the screenshot below:
[[File:Question data.png|thumb|center|700px|A very small part of the question data]]
A very small part of the user data set has been shown in the screenshot below:
[[File:Userdata.png|thumb|center|700px|A very small part of the user data.]]


====K means clustering====
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.<ref name="wiki"/>
It is easy to visualize how the K-means algorithm works, as shown in this image taken from [https://en.wikipedia.org/wiki/K-means_clustering Wikipedia].
[[File:K-means algorithm.png|thumb|center|600px|Pictorial representation of how K-means algorithm works]]
Since the objective of my project is to cluster the population into segments and find a representative of that segment which represents that cluster well, K-means serves my purpose. 
The instances, i.e. the cases that I need to cluster, are the users' answers, and the attributes I am clustering on (which model their behavioural attributes) are the questions answered by a majority of the population.
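To make the definition above concrete, here is a minimal pure-Python sketch of the standard K-means iteration (Lloyd's algorithm) on numeric points. It is an illustration only and is not Weka's implementation; the function names, the random initialisation and the toy data are assumptions made for this example. It also returns the within-cluster sum of squared errors, the quantity Weka reports after clustering.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignment, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial means
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    wcss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, wcss

# Toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_labels, toy_wcss = k_means(toy_points, k=2)
print(toy_labels, round(toy_wcss, 3))
```

The OkCupid answers are categorical rather than numeric, so they would need some encoding (or a mode-based centroid) before a procedure like this applies; the sketch only shows the assign-and-update loop.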

====Training====
Since there are no labels to use, clustering these instances (user profiles) according to their attributes (questions) is an unsupervised learning task.
I experimented with density-based clustering and K-means clustering. K-means clustering gave me better outcomes.
In Weka, the process is as follows:
* Preprocessing the data: removing unwanted attributes. I ran the experiments choosing only the four most popular attributes (q34113, q85419, q416235, q20062). 
The preprocessing of data is shown as follows:
[[File:Preprocessing.png|thumb|center|700px|Preprocessing the data]]
*Setting the K-means clustering parameters. I use an 80-20% split, i.e. 80% of the data is used to train the model and 20% is used to test it. After trying different numbers of clusters, I chose 4, as each cluster then had enough elements to justify its existence.
The K-means attribute selection pane in Weka where I choose the attributes is shown as follows:
[[File:Kmeans.png|thumb|center|700px|Setting K-means clustering parameters]]
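The two preparation steps above (keeping only selected attributes and using an 80-20% split) were done through the Weka interface. For readers who prefer code, the same idea can be sketched in Python. This is a hypothetical illustration, not what Weka does internally: the helper names, the shuffling seed and the placeholder answers are all assumptions.

```python
import random

SELECTED = ["q34113", "q85419", "q416235", "q20062"]

def keep_attributes(rows, attributes):
    """Drop every attribute except the selected ones."""
    return [{a: row[a] for a in attributes} for row in rows]

def percentage_split(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up user rows with placeholder answers and one unwanted attribute.
raw_rows = [
    {"q34113": "No problem", "q85419": "I don't drink wine.",
     "q416235": "Yes", "q20062": "No way", "unused": i}
    for i in range(10)
]
rows = keep_attributes(raw_rows, SELECTED)
train_rows, test_rows = percentage_split(rows)
print(len(train_rows), len(test_rows))
```

The training part would then be passed to the clusterer and the held-out part used to check how instances are distributed over the learned clusters.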

====Cluster Visualization====

Number of iterations: 2
Within cluster sum of squared errors: 75.0

The most popular attributes are found to be: q34113, q85419, q416235 and q20062. 
*'''q34113:''' How do you feel about government-subsidized food programs (free lunch,food stamps etc)?
'''Options are:''' No problem, It's okay if it is not abused, Okay for short amounts of time, Never-get a job 
*'''q85419:''' Which type of wine would you prefer to drink outside of a meal such as for leisure?
'''Options are:''' White (such as Chardonnay Riesling), Red(such as Merlot Cabernet Shiraz), Rosé (such as White Zinfindel), I don't drink wine. 
*'''q416235:''' Do you like watching foreign movies with subtitles?
'''Options are:''' Yes, No, Can't answer without a subtitle.
*'''q20062:''' While in the middle of the best lovemaking of your life, if your lover asked you to squeal like a dolphin, would you?
'''Options are:''' Absolutely, No way, The best? Maybe... 

Cluster centroids:
{| class="wikitable"
|-
!Attribute
!Cluster 0
!Cluster 1
!Cluster 2
!Cluster 3
|-
|q34113
|Never-get a job
|It's okay, if it is not abused
|No problem
|It's okay, if it is not abused
|-
|q85419
|Rosé (such as White Zinfindel)
|Rosé (such as White Zinfindel)
|Red(such as Merlot Cabernet Shiraz)
|Rosé (such as White Zinfindel)
|-
|q416235
|Can't answer without a subtitle
|Can't answer without a subtitle
|Can't answer without a subtitle
|Yes
|-
|q20062
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|The best? Maybe...
|}

{| class="wikitable"
|+ Clustered instances
|-
!Cluster
!Share of instances
|-
|Cluster 0
|79%
|-
|Cluster 1
|11%
|-
|Cluster 2
|7%
|-
|Cluster 3
|3%
|}

The clusters can be visualized below.
[[File:Cluster visualization.png|center|700px|thumb|Cluster visualization]]
We see that the clustering is heavily skewed towards the first cluster, cluster 0, whose centroid answers 'Never-get a job' to the question about government-subsidized food programs. In other words, a large segment of the population feels that you should get a job instead of relying on such programs. Another interesting observation is that 97% of the online-dating population (clusters 0, 1 and 2 combined) appears to have a sense of humour: for these clusters, the centroid response to 'Do you like watching foreign movies with subtitles?' is 'Can't answer without a subtitle'.

===Evaluation and Results===

===Discussion and Future Work===

== Annotated Bibliography ==
{{Reflist|refs=
*<ref name="main">[http://www.wired.com/2014/01/how-to-hack-okcupid/ How a math genius hacked OkCupid]</ref>
*<ref name="support">[https://www.okcupid.com/help/match-percentages Match percentages]</ref>
*<ref name="weka">[http://www.cs.waikato.ac.nz/ml/weka/ Weka]</ref>
*<ref name="wiki">[https://en.wikipedia.org/wiki/K-means_clustering k-means clustering Wikipedia]</ref>
*<ref name="paper1">[https://osf.io/p9ixw/ The OKCupid dataset: A very large public dataset of dating site users]</ref>
*<ref name="paper2">V. K. Singh; R. Piryani ; A. Uddin ; P. Waila [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6526500&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F6520973%2F6526372%2F06526500.pdf%3Farnumber%3D6526500 Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification], Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on Date of Conference: 22-23 March 2013 Page(s): 712 - 717 Print ISBN: 978-1-4673-5089-1 INSPEC Accession Number: 13567093 Conference Location : Kottayam DOI: 10.1109/iMac4s.2013.6526500 Publisher: IEEE</ref>
}}

== To Add ==