Course:COGS200/Group-27

Introduction

Summary

In an effort to reduce stigma towards those on the autism spectrum in terms of their social, mental, and employment capabilities, our group proposes to design a product which would assist in emotion recognition during daily conversation. This device would utilize Affectiva's visual emotion recognition software and MIT's audio emotion recognition software, implemented as an app for a smartwatch. The entire device would consist of just a camera and a smartwatch, and would quietly output observed emotions based on collected audio and visual data, allowing for minimal disruption during day-to-day use. To evaluate the performance of the device, we would first test it with individuals both on and off the autism spectrum. These individuals would report back to us on its accuracy in correctly identifying emotions, along with any suggestions for improving ease of use. We will also track download data for the application and compare the frequency of downloads in high- and low-integration societies. We view our product as an aid for assisting in the integration of those on the autism spectrum, and as an opportunity to inspire governments of low-integration societies to collect more data on those on the spectrum so that they can implement effective social, medical, and economic policies for helping those with mental health problems.

Incentive

In 2009, Keiran Rump et al. conducted a study exploring the development of emotion recognition in individuals with autism. Individuals across all age groups, both with and without autism, were tested on brief visual displays of facial expressions of varying degrees. Recognition performance among those on the autism spectrum was found to be similar across all age groups. However, among the control group of individuals not on the spectrum, performance was best in the adult group. This discrepancy hints at “underlying cognitive processes that may be affecting the development of emotion recognition in [those] with autism”[1].

According to Australia’s Department of Social Services[2], those with autism can find difficulty in:

* recognizing facial expressions and the emotions behind them
* copying and using facial expressions, and
* understanding and interpreting emotions, often resulting in a seeming lack of empathy towards others.

To help children learn about and respond to emotions, the Department suggests building learning experiences into simple daily tasks. Daily interactions can improve the ability of a person with autism to express and acknowledge emotions, both their own and those of others.

Even in countries where those with mental illnesses are considered to be well integrated into society (through access to medical services, equal job opportunities, and active government involvement in the protection of human rights and the combatting of stigma towards those with mental health problems)[3], the development of recognition skills is primarily done through Special Education. This development is approached as an isolated problem, without opportunities to learn through day-to-day interactions.

Yet this is, sadly, a best-case scenario. In countries like Romania, Portugal, Indonesia and Greece, people with mental health problems are hardly integrated into society at all. A combination of factors inhibits any sense of belonging for those on the autism spectrum: a lack of government policy for medical, social, and employment services; the prevalence of care being delivered in long-stay hospitals and institutions (as opposed to within the community); and a general lack of data on mental health. Social withdrawal, paired with difficulty in verbal and non-verbal communication[4], is linked with the common diagnosis of depression among those on the spectrum. Depression heightens symptoms of self-harm and obsessionality for those with autism or Asperger’s, and can lead to suicide. While depression is entirely treatable, promoting seclusion and divisiveness among those with mental health problems is nothing but counterproductive to the cause.

Proposal

However, there has never been a better time to be living on the spectrum in terms of technological and educational advances for mental health and awareness. Affectiva[5] has been developing an emotion AI with the ability to classify human emotion using only facial data, and MIT has constructed an app[6] which uses both audio and physiological cues to deduce emotions during conversation.

For our project, we propose to harmonize these technologies in a way that would further integrate those with autism and Asperger’s into society by assisting in emotion recognition. We plan on constructing a wearable device which would use real-time audio and visual data from the person the wearer is conversing with to deduce that person’s emotion and relay this information to the wearer. The device would consist of a camera necklace and a smartwatch, loaded with our emotion-recognition application. This way, the device would be inconspicuous and easily adaptable to already accessible technology, allowing for real-time education without disrupting day-to-day interactions. Our project would utilize three aspects of the Cognitive Systems discipline: Psychology, for its implications for mental health education and awareness; Linguistics, for its use of conversational and phonetic cues for emotion deduction; and Computer Science, for the harmonization of the technology and the development of the application itself.

If our product were distributed, we envision it assisting those on the autism spectrum in two ways. First, for those in higher-integration societies such as ours here in Canada, we view our project as a way to further support real-time emotion recognition learning and to reduce any remaining stigma towards the capabilities of those with mental illnesses. Second, for societies where these individuals are hardly integrated at all, we hope this product will be a catalyst for integration into day-to-day life, improving the lives of those with autism and Asperger’s globally.


Methods

Resources

We will design a wearable technology which is inconspicuous in daily life. We believe the best choice is an app implemented on a smartwatch paired with a camera necklace, as this will be easy for people to use every day; watches and necklaces also pass as ordinary accessories. Another important reason is that smartwatches already exist and a camera necklace is practical to build. Because the smartwatch already exists, the technology will be accessible to more people than if an entirely new device had to be purchased on its own. MIT has built an application which can detect emotions during conversation.[7] We plan to design an app which can be used during daily conversations, video calls, and conference calls. Therefore, our design will be based on both non-verbal and verbal cues.

The interface of the application would be as minimalistic as possible to allow for simple usage. Once the application has been activated, the watch face will display both an emoji and a title for the emotion currently being observed. The user will still be able to use the watch normally while the application is running; they will simply receive a notification every time the classification of the observed emotion changes. The application can also be modified to fit the user's needs. For instance, the user can change the alert sound of the notification, either to be specific to every emotion the application is able to detect or, more generally, to differentiate between traditionally 'positive' and 'negative' emotions. This is paired with the ability to send multiple notifications, depending on the urgency or severity of the emotion change. If the application is unable to determine the observed emotion within our degree of accuracy (as discussed in later sections), tips to improve observations will be displayed to the user. For instance, if the visual data reaching the application is heavily skewed, it may be suggested that the user move closer to their counterpart or increase the lighting in their environment.
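
As a rough illustration of how these display and notification preferences might be represented inside the application, the sketch below uses hypothetical names (NotificationPrefs, alert_sound, display) and placeholder sound files; it is not tied to any particular smartwatch SDK.

```python
from dataclasses import dataclass, field

# Hypothetical emoji/title mapping for a few of the emotions the app can display.
EMOJI = {"joy": ":)", "sadness": ":(", "anger": ">:(", "unknown": "?"}

POSITIVE = {"joy", "surprise"}   # assumed 'positive' grouping for this sketch
# every other pre-set emotion is treated as 'negative' here

@dataclass
class NotificationPrefs:
    per_emotion_sounds: dict = field(default_factory=dict)  # e.g. {"joy": "chime.wav"}
    general_mode: bool = True     # True: only distinguish positive vs. negative sounds
    urgent_repeat: int = 1        # how many notifications a severe change triggers

def alert_sound(prefs: NotificationPrefs, emotion: str) -> str:
    """Pick the alert sound according to the user's preferences."""
    if not prefs.general_mode and emotion in prefs.per_emotion_sounds:
        return prefs.per_emotion_sounds[emotion]
    return "positive.wav" if emotion in POSITIVE else "negative.wav"

def display(prefs: NotificationPrefs, emotion: str, tip: str = "") -> None:
    """Show the emoji and title; fall back to a tip when the emotion is uncertain."""
    if emotion == "unknown" and tip:
        print(f"{EMOJI['unknown']} unknown - {tip}")
    else:
        print(f"{EMOJI.get(emotion, '?')} {emotion} (sound: {alert_sound(prefs, emotion)})")

display(NotificationPrefs(), "joy")
display(NotificationPrefs(), "unknown", tip="Move closer to the speaker")
```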

For non-verbal cues, we will be using the optical sensors that Affectiva uses to classify facial expressions of emotion without bias. Affectiva uses machine learning to compare the facial expression in photos or videos against its database of over five million faces and emotions. It can also detect emotions in video by tracking cues such as eye movements and lip tightening. Using optical sensors to scan people’s faces allows the camera in our necklace or smartwatch to capture emotions accurately in real time. The optical sensor will create a depth map of the individual's face, which is then transmitted to the smartwatch. We will cross-reference this depth map against a set of pre-set emotions: Anger, Contempt, Disgust, Fear, Joy, Sadness and Surprise. We chose these emotions based on Baumeister’s 1995 study, which indicates that these emotions are pre-set and can generalize the full range of human emotion.[8] Moreover, we recognize that other emotions are simply different intensities of these pre-set emotions. For instance, "ecstatic" is a high intensity of "joy" and might combine factors of "surprise", but we would classify "ecstatic" under "joy". These emotions are also easily classified as either 'positive' or 'negative' reactions, if the user wishes to keep the outputted information more general.
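
To make the grouping of emotions concrete, here is a minimal sketch of how finer-grained labels could be collapsed into the seven pre-set emotions and then into the general 'positive'/'negative' categories. The synonym table (COLLAPSE) is our own illustrative assumption, not part of Affectiva's software.

```python
# The seven pre-set emotions used for cross-referencing the depth map.
PRESET_EMOTIONS = {"anger", "contempt", "disgust", "fear", "joy", "sadness", "surprise"}

# Illustrative mapping of finer-grained labels onto the pre-set categories.
COLLAPSE = {"ecstatic": "joy", "irritated": "anger", "terrified": "fear"}

VALENCE = {"joy": "positive", "surprise": "positive",
           "anger": "negative", "contempt": "negative", "disgust": "negative",
           "fear": "negative", "sadness": "negative"}

def normalize(label: str) -> str:
    """Map any detected label onto one of the seven pre-set emotions."""
    label = label.lower()
    return label if label in PRESET_EMOTIONS else COLLAPSE.get(label, "unknown")

def valence(label: str) -> str:
    """Return the general 'positive'/'negative' grouping for a detected label."""
    return VALENCE.get(normalize(label), "unknown")

print(normalize("ecstatic"))  # joy
print(valence("ecstatic"))    # positive
```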

However, emotions are also expressed through verbal cues. We include emotion detection via speech because emotion changes an individual’s tone of voice. Three speech properties (voice quality, utterance timing and utterance pitch contour) can indicate people’s emotions during speech.[9] We will integrate MIT’s wearable app as the component of our device which recognizes emotions using audio data. Garun reports that the neural network analyzes the speaker’s emotion from their speech every five seconds. The app analyzes emotions based on tone of voice; for example, if long pauses and a monotone occur in an individual’s speech, that individual will be recognized as sad. This app will therefore be used in conjunction with our visual emotion detection. For our design, it will cross-reference differences in pitch, loudness, speech rate and pauses.
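
The sketch below illustrates the kind of verbal features we have in mind, computed over five-second windows from a mono waveform: loudness, the fraction of pauses, and a crude pitch proxy. This is our own simplification for illustration, not MIT's actual algorithm.

```python
import numpy as np

def speech_features(wave: np.ndarray, sr: int = 16000, frame_ms: int = 25):
    """Compute simple per-window speech features: loudness, pause ratio,
    and a crude pitch proxy based on zero-crossing rate."""
    frame = int(sr * frame_ms / 1000)
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)

    rms = np.sqrt((frames ** 2).mean(axis=1))          # loudness per frame
    silence = rms < 0.1 * rms.max()                    # frames treated as pauses
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)

    return {
        "loudness": float(rms.mean()),
        "pause_ratio": float(silence.mean()),          # fraction of silent frames
        "pitch_proxy": float(zcr[~silence].mean()) if (~silence).any() else 0.0,
    }

# Analyze a conversation in five-second windows, as MIT's app reportedly does.
def windows(wave, sr=16000, seconds=5):
    step = sr * seconds
    for start in range(0, len(wave) - step + 1, step):
        yield speech_features(wave[start:start + step], sr)

# Example with ten seconds of synthetic audio.
audio = np.random.randn(16000 * 10) * 0.05
for w in windows(audio):
    print(w)
```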

Affectiva has high accuracy. Affectiva's developers have determined the accuracy of the different emotions using a Receiver Operating Characteristic (ROC) curve, and indicate that almost all of the observed emotions are correctly identified with an accuracy of over 90 percent. To improve accuracy, they collect data from 75 different countries. This is imperative for successful emotion recognition, since different countries, and therefore different cultures, display emotions visually in very different ways; for example, parents and peers in America encourage emotional expression, whereas Japanese culture tends to suppress emotional display depending on whom the conversation is being held with.[10]

Furthermore, Affectiva has collected data under many different conditions, such as different levels of brightness, spontaneous facial expressions, hair colours, and faces with or without glasses.[11] In comparison, MIT's application has lower accuracy, at around 68 percent. This is because the technology used to detect emotion from an individual's tone still needs further development. To improve accuracy, MIT is collecting data on more kinds of emotions while people are in specific emotional states, such as boredom.[12]

Development

Visual:

Once the depth map of an individual's face has been received, we will use the facial expression as input to predict an emotion. Ekman's research shows that every emotion has its own unique features.[13] For example, joy results in a smile and cheek raising, while anger results in brow furrowing, lid tightening, eye widening, and lip sucking.

All of the input to the visual component of our product will produce the emotion that best matches what has been detected. Because our software and Affectiva's use machine learning, we will also be able to produce a confidence value based on how closely the facial input matches the data we pre-set.
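
A minimal sketch of this matching step is shown below. It assumes the visual pipeline has already reduced the depth map to a set of detected facial actions (such as 'smile' or 'brow_furrow'); the prototype table loosely follows the Ekman-style examples above and is purely illustrative, not Affectiva's actual model.

```python
# Illustrative prototype facial actions for a few of the pre-set emotions.
PROTOTYPES = {
    "joy":   {"smile", "cheek_raise"},
    "anger": {"brow_furrow", "lid_tighten", "lip_press"},
    "fear":  {"eye_widen", "brow_raise", "lip_stretch"},
}

def classify_face(detected_actions: set) -> tuple:
    """Return (best_emotion, confidence) by overlap with each prototype set."""
    best, best_score = "unknown", 0.0
    for emotion, proto in PROTOTYPES.items():
        overlap = len(detected_actions & proto) / len(proto)
        if overlap > best_score:
            best, best_score = emotion, overlap
    return best, best_score

print(classify_face({"smile", "cheek_raise"}))       # ('joy', 1.0)
print(classify_face({"brow_furrow", "eye_widen"}))   # low-confidence match
```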

Vocal:

As with our visual component, the vocal component will take an individual's changes in tone of voice and compare them with our database, which will also produce a confidence value. High-pitched, fast-rate speakers are generally recognized as having negative emotions, whereas slow-rate speakers are often recognized as having neutral emotions.[14] People also reduce or increase their loudness and use of pauses under different emotions; for example, people who are sad tend to speak quietly and use more pauses, while surprised people tend to speak loudly.[15]

As one could infer, there could be quite a discrepancy between observed emotions if only audio data is used. This could be due to cultural norms, conflicting environmental inputs, or simply different personalities between individuals. This supports our decision to utilize software which uses machine learning to determine emotions audibly; this way, our device will be able to learn and improve over time in partnership with the more accurate visual form of classification. As well, until emotion recognition using audio inputs alone reaches the same degree of accuracy as that of visual inputs, we will weight the output of our visual data slightly more heavily.
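
To illustrate how the extracted vocal features could be matched against per-emotion reference profiles to produce an emotion and a confidence value, here is a sketch. The reference values are invented for illustration and are not taken from MIT's model.

```python
import math

# Illustrative reference profiles: (pitch_proxy, loudness, pause_ratio) per emotion.
VOICE_PROFILES = {
    "sadness":  (0.10, 0.02, 0.40),   # low pitch, quiet, many pauses
    "surprise": (0.35, 0.12, 0.10),   # higher pitch, loud, few pauses
    "neutral":  (0.20, 0.05, 0.20),
}

def classify_voice(features) -> tuple:
    """Return (emotion, confidence) via distance to the nearest reference profile."""
    best, best_dist = "unknown", float("inf")
    for emotion, ref in VOICE_PROFILES.items():
        dist = math.dist(features, ref)
        if dist < best_dist:
            best, best_dist = emotion, dist
    confidence = 1.0 / (1.0 + best_dist)   # closer profile -> higher confidence
    return best, confidence

print(classify_voice((0.11, 0.03, 0.38)))  # should come out close to 'sadness'
```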

Combination and operation:

Next, our app will compare the values given by the visual and vocal components of our product. If the combined value is over the approval threshold, with a sufficient degree of confidence, the app will produce the best-detected emotion, displaying an emoji related to that emotion on the screen of the smartwatch. However, if the value differs by more than 10% from the standard value of an emotion, the value will be compared with the other emotions.

If the value cannot be matched to any emotion, the app will output "unknown" on the screen. The device will also suggest possible improvements to the wearer's environment to help improve the accuracy of emotion detection. As a project group, we all agree that more value should be placed on outputting the correct emotion than on simply making sure an emotion is outputted. If the device is uncertain, we would prefer it to output improvements that could be made during the conversation to assist in recognition, rather than guessing an emotion and producing a false result. The device is meant to be one of learning for the wearer, and that must remain the core purpose of the product. If it frequently guessed at detected emotions and output incorrect classifications, this would be extremely counterproductive to learning for those on the autism spectrum. In turn, it would also work against our main goals of reducing stigma towards, providing more opportunities for, and increasing integration levels for those on the spectrum.
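
Putting these pieces together, the sketch below shows one way the combination step could behave: the visual confidence is weighted slightly more heavily, a minimum combined confidence is required before anything is displayed, disagreements between the two streams are flagged, and otherwise the app falls back to "unknown" plus a tip. The weights and threshold are placeholders we would tune during focus-group testing.

```python
VISUAL_WEIGHT, VOCAL_WEIGHT = 0.6, 0.4   # visual weighted slightly more heavily
CONFIDENCE_THRESHOLD = 0.9               # only display confident classifications

def combine(visual, vocal, tip="Try moving closer to the speaker or improving lighting"):
    """visual/vocal are (emotion, confidence) pairs from the two components."""
    v_emotion, v_conf = visual
    a_emotion, a_conf = vocal

    if v_emotion != a_emotion:
        # A disagreement may indicate sarcasm or another misleading expression.
        return "unknown", f"Conflicting cues ({v_emotion} vs. {a_emotion}). {tip}"

    combined = VISUAL_WEIGHT * v_conf + VOCAL_WEIGHT * a_conf
    if combined < CONFIDENCE_THRESHOLD:
        return "unknown", tip
    return v_emotion, None

print(combine(("joy", 0.95), ("joy", 0.90)))       # ('joy', None)
print(combine(("joy", 0.95), ("sadness", 0.80)))   # flagged discrepancy
```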

We chose visual and audio data because it is easier to extract emotion information from them than from other forms of detection, such as changes in body temperature or pulse rate. Both body temperature and pulse rate change as individuals grow, and they vary with each individual’s physical condition.[16] These variations would introduce error when analyzing human emotions. Moreover, as one of our aims is for the device to be inconspicuous, vocal and visual data are the only two ways to retrieve information from an individual without attaching any device to them. Our visual component, the camera, will project over 30,000 infrared dots, which are not visible to the human eye.

Another reason we decided to use two streams of emotion detection is to be able to detect sarcasm and other misleading expressions. One could easily smile in a sarcastic way, so that the visual input detects a positive emotion while the audio detects a negative one. If either of these methods were used individually, it could produce an incorrect classification. However, when our application compares the visual result with the vocal result, it will flag a discrepancy that could imply a contradiction between the two emotions.

Mobile Application

Our mobile app will capture the information from the camera and from the smartwatch and analyze it to infer the person's emotions. Specifically, our algorithm focuses on both data streams to extract even the smallest variations due to breathing or changes in the individual's face. We then analyze these responses further by feeding this information as features into a machine learning algorithm that recognizes the individual's emotions. Hence, our device can automatically recognize the person's emotion and output an emoji and a title corresponding to the particular emotion it has detected. We envision our machine learning algorithm improving with time and usage, increasing its accuracy.
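
As an illustration of the machine-learning step described here, the sketch below trains a simple classifier on combined visual and audio feature vectors. The use of scikit-learn, the feature layout and the toy data are our own assumptions for the sketch, not the product's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: each row is a combined visual + audio feature vector,
# e.g. [smile_strength, brow_furrow, pitch_proxy, loudness, pause_ratio].
X = np.array([
    [0.9, 0.1, 0.30, 0.10, 0.05],   # joyful example
    [0.1, 0.8, 0.15, 0.08, 0.10],   # angry example
    [0.1, 0.1, 0.08, 0.02, 0.45],   # sad example
])
y = np.array(["joy", "anger", "sadness"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# At run time the app would feed the latest feature vector into the model.
sample = np.array([[0.85, 0.05, 0.28, 0.11, 0.06]])
print(model.predict(sample)[0], model.predict_proba(sample).max())
```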

The camera on the necklace will transmit its data wirelessly to the smartwatch via Bluetooth. We are using Bluetooth so as not to lose any piece of data, as every piece is crucial to the decision-making. Bluetooth will also continue to operate in dead zones (areas without reliable cellular service). Our smartwatch will be equipped with a microphone, which will also transmit its data to our app.

The user of our device will need to wear the necklace and the smartwatch; our app will recognize when someone is interacting with the user. The app is therefore always running in the background, which also allows us to gather more data for the machine learning algorithm to improve itself. However, once the user wants to begin detecting emotions, they will need to open the app and tap the DETECT button, which will start the analysis using the camera and then output its findings. In summary, the application will run in the background to improve the machine learning it implements, but will only make this information available to the user upon activation.
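
The intended behaviour (always collecting data in the background, only surfacing results after the user taps DETECT) could look roughly like the sketch below; the class and method names are hypothetical.

```python
class EmotionApp:
    """Hypothetical sketch of background collection vs. user-visible detection."""

    def __init__(self):
        self.detect_active = False
        self.training_buffer = []        # data gathered to improve the model

    def on_frame(self, features, emotion):
        # Always store data so the machine-learning model can keep improving.
        self.training_buffer.append((features, emotion))
        # Only surface the result when the user has pressed DETECT.
        if self.detect_active:
            print(f"Detected: {emotion}")

    def on_detect_button(self):
        self.detect_active = not self.detect_active

app = EmotionApp()
app.on_frame({"smile": 0.9}, "joy")     # background only, nothing shown
app.on_detect_button()                  # user taps DETECT
app.on_frame({"smile": 0.8}, "joy")     # now the result is displayed
```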

Open Source

We are releasing our application as open source so that the community can work on improving its accuracy. We are not planning on making money from this product, as our vision is to better the lives of all humankind, regardless of their abilities. We are not concerned about privacy issues because we are not recording anyone's data in our own database; all data will be stored, encrypted, on the individual's smartwatch. Open-sourcing will also allow developers the flexibility to customize the app based on the severity of the user's condition. Moreover, it will bring our production cost down, letting us focus more on making the product available to more users. That being said, it will still need to follow open-source standards. We will, however, provide paid support for any organization that would like to adopt our product and application.

Discussion

Our group proposed the device in the form of a necklace and a smartwatch to ensure that it is inconspicuous, not distracting during conversations, and easily integrated into day-to-day life. We envision the device being simple to wear and more accessible (compared to high-cost private therapies) for individuals on the autism spectrum. The key reason we chose to use a smartwatch loaded with an already developed application is to build on, utilize and combine existing technologies, making the design more technically feasible. We would also like to encourage others to innovate new ideas or expand on existing technologies to help individuals with other mental challenges.

Product Evaluation

We will evaluate our product's performance both before and after it is released.

Before - Focus Group

We will organize several focus groups with different levels of autism or Asperger's to test our product. According to the Autism Society of America, there are three types of Autism Spectrum Disorder: 1) Autistic Disorder, 2) Asperger's Syndrome, and 3) Pervasive Developmental Disorder - Not Otherwise Specified[17]. Within each Autism Spectrum Disorder, the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) diagnostic criteria categorize severity into three "functional levels"[18]:

  • Level 1: Requiring Support
  • Level 2: Requiring Substantial Support
  • Level 3: Requiring Very Substantial Support

There will be 9 focus groups in total, and each focus group will consist of 50 participants with similar positions on the spectrum. We chose to conduct this study with such a high number of participants, and across all the different levels of autism, to receive input from every kind of individual who may find this device useful. By conducting these tests with people from so many different levels on the spectrum, we will be able to see which types of users found the device most useful, and for what specific reasons. This will also make clearer which areas of the application need to be targeted for improvement, as we envision our product being universally useful and accessible. Each individual will receive our product for six months. During these six months, they will use the product in real-life communication and then provide feedback to us every month on any difficulties encountered, personal opinions and areas for improvement.

A control group of 100 randomly selected participants, aged between 15 and 60 and not on the autism spectrum, will also receive our product and be asked to use it for three months. They will provide feedback once every two weeks on how accurate the emotion recognition device is. This way, we will get feedback from people who have a stronger understanding of conversational and emotional norms than those on the autism spectrum may have. In this stage, major discrepancies will be identified and our product will be fine-tuned. This process will be repeated, if necessary, until our product's outputs reach at least 90 percent accuracy.

After – Application Download Statistics

After the product is released, the application download data will be analyzed. We will evaluate the total number of downloads and identify countries where the application is being downloaded much more than others. From this data, we hope to determine the relationship between the number of downloads and each country's level of integration and attitudes towards people with autism and Asperger's, and to fix any bugs reported by our users.

Download data will also inform how we develop the device to be as globally accessible as possible. If we notice that, globally, there are considerably fewer downloads within countries or areas with a shared culture, we could infer that our application may be lacking in accuracy for that area. From this data, we would know to focus current improvements on emotion recognition accuracy for societies with the previously identified cultural norms. However, a lack of downloads across a similar culture could also reflect that culture's unwillingness to move towards integration or towards reducing stigma for those on the spectrum. This problem is explored further later in the wiki.

As the audio is analyzed based on the speakers' tone of voice, all languages will be examined by the software. We will identify the countries with the most downloads, which will show a need for the application in specific areas. We recognize that different languages have different speaking tones, timing, ways of pausing and other verbal cues; we will therefore prioritize our research on those specific areas, focusing on their languages and verbal cues, to develop versions of the application that are regionally, culturally, and linguistically specific.

Predicted difficulties

We predict the following major difficulties for our product in real-life situations.

Incorrect classification of emotions

Although we will make every effort to provide the highest degree of precision, we foresee a certain level of inaccuracy during real-life communication, that is, incorrect classification of emotions. We hope to minimize this inaccuracy with the control group's feedback prior to the product's release.

After our product is released, if there is a discrepancy between the audio and visual inputs and the error exceeds our pre-set acceptable margin of error, the smartwatch will output a message showing that the device failed to recognize the emotion and provide recommendations, such as "please get closer to the speaker", to facilitate the device's emotion recognition. Our product design will prioritize correct responses over uncertain responses. In other words, we will only generate results with at least 90 percent accuracy, or else output an error message.

Privacy / Philosophical Dilemmas

This product would require individuals to wear the device throughout their daily life. As the product consists of a camera and a microphone, various privacy issues arise. We will minimize these concerns by encrypting the device so that none of the visual and audio records are accessible to anyone, even the users. The camera will only turn on when the microphone receives a certain clarity or volume of voice input, ensuring the device only works in circumstances where an individual or group of people is communicating with the user. There is also a lock function allowing users to hibernate the product manually to prevent any unwanted observation.
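
One simple way to realize the microphone-gated camera and the manual lock is sketched below; the threshold value and the names are illustrative only.

```python
VOICE_LEVEL_THRESHOLD = 0.05   # illustrative minimum RMS level to count as speech

class PrivacyGate:
    """Camera is only enabled while speech is detected and the user has not locked it."""

    def __init__(self):
        self.locked = False

    def toggle_lock(self):
        self.locked = not self.locked   # manual hibernate to stop observation

    def camera_enabled(self, mic_rms: float) -> bool:
        return (not self.locked) and mic_rms >= VOICE_LEVEL_THRESHOLD

gate = PrivacyGate()
print(gate.camera_enabled(0.10))   # True: someone is speaking
print(gate.camera_enabled(0.01))   # False: too quiet, camera stays off
gate.toggle_lock()
print(gate.camera_enabled(0.10))   # False: user locked the device
```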

This product also reflects current discussions about the global increase in recording and surveillance technologies. If a product such as this were to be distributed, it would have to be done in a world where the pursuit of knowledge is favoured over the false illusions of privacy that some still seem to hold in our modern age. This may seem like a cynical outlook, but upon inspection, a rational observer could see that globalization and increased access to information come with the trade-off of reduced privacy. Therefore, we infer that if our world were to reach a point where this device would be accepted, the majority of its populace would have reached these conclusions as well.

Reluctance of Use

For some countries and societies, there may not be any desire among the populace to increase integration in any way. Some cultures may view those with mental health issues as lesser citizens, unworthy of extra assistance and care. As a result, our device will only be as useful as a populace makes it. For societies that are currently looking to improve their levels of integration and opportunity for those on the spectrum, our device could be used as a catalyst for development. It could help reduce stigma towards these individuals and inspire policymakers and innovators to gather more data and develop further technologies to help people with mental health challenges.

While some societies may have no desire to adopt our device, by making it open source we hope to inspire the individuals within those cultures who do care to raise their voices and take action to help those among their populace who are unable to do so themselves.

Conclusion

At the start of this project, our group members had a spectrum of knowledge of, and experience with, those with mental health issues. Some of us had gone to school or worked with people with autism, while others had very little knowledge of autism at all. In working through the development of this product, we became more aware of the struggles that those on the autism spectrum go through in daily life. As well, in having to justify every decision we made in the design and development of the device, we realized as a group that some characteristics of the device needed to be forsaken; for instance, the initial prototype was restricted to English to account for the vast amount of knowledge that would be required to detect emotions across all cultures, social norms, and dialects. Because of this, we had to work as a group to make sure that all of our visions and priorities for the product were completely aligned.

One thing we learned in researching the device was that the main reason countries are categorized as ‘low integration’ is their government’s lack of data on mental health issues[19] and, because of this, their inability to implement appropriate social and medical services for these individuals. As a result, our device must not be viewed as a definitive answer for solving emotion recognition education for those with autism and Asperger’s, but as an aid. If governments are unwilling to assist in integration, our product will never be used to its full potential. However, with proper assistance, this device could be expanded and adapted to motivate technological advances for any mental illness that requires assistance in emotion recognition and communication.

We hope future research on our product will increase its prediction accuracy. The first phase of our product would only work for English-speaking countries; therefore, we hope future iterations will cover the majority of countries regardless of language. This requires intensive research into the linguistic components of our product to remove the limitation of language and generalize it to all languages. As well, many sociological aspects would have to be pursued further to make the device more globally accessible.


References

  1. Rump, K. M., Giovannelli, J. L., Minshew, N. J. and Strauss, M. S. (2009), The Development of Emotion Recognition in Individuals With Autism. Child Development, 80: 1434–1447. doi:10.1111/j.1467-8624.2009.01343.x
  2. Emotional development in children with autism spectrum disorder. (n.d.). Retrieved November 20, 2017, from http://raisingchildren.net.au/articles/autism_spectrum_disorder_emotional_development.html
  3. Koehring, M. (2017, April 07). How countries are failing to integrate people with mental illness into society. Retrieved November 20, 2017, from https://www.huffingtonpost.com/entry/how-countries-are-failing-to-integrate-people-with_us_58e7d2d6e4b0acd784ca57ca
  4. Stewart, M. E., Barnard, L., Pearson, J., Hasan, R., & O’Brien, G. (2006). Presentation of depression in autism and Asperger syndrome. Autism, 10(1), 103-116. doi:10.1177/1362361306062013
  5. https://www.affectiva.com
  6. Garun, N. (2017, February 01). MIT built a wearable app to detect emotion in conversation. Retrieved November 20, 2017, from https://www.theverge.com/2017/2/1/14476372/mit-research-wearable-app-detect-emotion-speech
  7. Garun, N. (2017, February 01). MIT built a wearable app to detect emotion in conversation. Retrieved November 20, 2017, from https://www.theverge.com/2017/2/1/14476372/mit-research-wearable-app-detect-emotion-speech
  8. Baumeister, R. F., & Leary, M. R. (1995). The need to belong: Desire for interpersonal attachments as a fundamental human motivation. Psychological Bulletin, 117(3), 497-529. http://dx.doi.org/10.1037/0033-2909.117.3.497
  9. Murray, I. R., & Arnott, J. L. (1993). Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. The Journal of the Acoustical Society of America, 93(2), 1097-1108. doi:10.1121/1.405558
  10. Emotions and culture. (2017, November 20). Retrieved November 28, 2017, from https://en.wikipedia.org/wiki/Emotions_and_culture#Culture_and_emotional_experiences
  11. Determining Accuracy. (n.d.). Retrieved November 28, 2017, from https://developer.affectiva.com/determining-accuracy/
  12. Conner-Simons, A., & Gordon, R. (2017, February 01). Wearable AI system can detect a conversation's tone. MIT CSAIL. Retrieved November 28, 2017, from http://news.mit.edu/2017/wearable-ai-can-detect-tone-conversation-0201
  13. Ekman, P., & Rosenberg, E. L. (2005). What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS). Oxford: Oxford University Press.
  14. Apple, W., Streeter, L. A., & Krauss, R. M. (1979). Effects of pitch and speech rate on personal attributions. Journal of Personality and Social Psychology, 37(5), 715-727. http://dx.doi.org/10.1037/0022-3514.37.5.715
  15. Schröder, M. (2001). Emotional Speech Synthesis: A Review.
  16. Iliff, A., & Lee, V. (1952). Pulse Rate, Respiratory Rate, and Body Temperature of Children between Two Months and Eighteen Years of Age. Child Development, 23(4), 237-245. doi:10.2307/1126031
  17. What is Autism? (n.d.). Retrieved November 27, 2017, from https://www.asws.org/WhatisAutism.aspx
  18. Rudy, L. J. (n.d.). What Are the 3 Levels of Autism? Retrieved November 27, 2017, from https://www.verywell.com/what-are-the-three-levels-of-autism-260233
  19. Koehring, M. (2017, April 07). How countries are failing to integrate people with mental illness into society. Retrieved November 27, 2017, from https://www.huffingtonpost.com/entry/how-countries-are-failing-to-integrate-people-with_us_58e7d2d6e4b0acd784ca57ca