Course:CPSC522/Multi-Attribute Utility

From UBC Wiki

Multi-Attribute Utility

Multi-attribute utility offers a mathematical model for evaluating and choosing between options when there are multiple reasons to consider when making the choice.

Page Author

Celeste/Clair Ross

Abstract

This page briefly goes over utility functions in machine learning and extends that to explain multi-attribute utility functions. It provides examples to help illustrate real-world scenarios of where and why this concept can be applied. It explains how multi-attribute utility functions work and explains the evaluation and optimization process. It introduces off-policy estimators, two of which, the direct method and inverse propensity scoring, are used in Doubly Robust Policy Evaluation and Learning[1] to come up with the doubly-robust approach. These three estimators are then used in Interactive Multi-Objective Off-Policy Optimization[2] to show that the interactive multi-objective off-policy optimization algorithm, or IMO3, provides better performance for off-policy estimators, particularly for estimators that are better than others, like the doubly robust evaluator.

Builds on

This builds on Utility (see Wikipedia until the course page is completed), a mathematical model for making decisions by applying a value to each choice. This page goes explores when a choice is motivated by multiple, often competing, reasons.

Related pages

There are currently no other pages that build on this page. A full list of backlinks (links that reference this page) has been provided in this section.

Backlinks

Introduction

Choices are made every day by people, organizations and any number of others agents. Most of these choices are not random (or stochastic), instead, there are informed reasons. To mathematically recreate this mimetic behaviour we can use a utility function. If there are multiple reasons that must be considered when choosing an option we call each reason an attribute and use a multi-attribute utility function.

Utility functions are often used in economic contexts but they also have a growing use for reinforcement learning in the field of machine learning.

Definition and application

To understand how multi-attribute utility functions work, we start by considering how an agent with objectives may go about achieving those objectives. To do so, the agent acts iteratively, starting each iteration by choosing some context based on the likelihood of each context. Then the agent chooses an action to take based on the context and is rewarded with a value for each objective it has. After this, the probability of what contexts are possible can change, so we start a new iteration.

As the agent acts it also learns how to act with a policy. When there are multiple objectives, the policy dictates which ones the agent prioritizes if objectives are conflicting. Someone who designs these policies, aptly called a designer, may change the priorities of an agent by changing the rewards for the agent's actions towards its objectives, resulting in the agent learning a new policy.

More technically, to find a utility function , we start by defining every possible context as the context space and all actions it can take as an action space . The start of the iteration begins by sampling a context probability distribution at the start of each iteration to get . Next, the agent determines which action to take, , based on the likelihood of taking each action in given the context by using a stochastic policy, . Taking action gives the agent a reward, , towards each of its objectives based on some reward distribution, . That is, for objectives the agent receives a reward vector .

The policy space for the stochastic policy is a set of vectors of length where each term in a vector corresponds with an objective and is a positive value where the sum of all terms in a vector add up to . This is equivalent to saying that, when we have attributes the stochastic policy samples from a -simplex with vertices, with the simplex denoted as . Written out in set notation the policy space is .

The expected value of a policy is a guess at the best policy to use by taking the likelihood of each policy being used and using that to weigh the associated reward values. This can be written as , which is a -dimensional vector whose -th entry, , is the expected value of objective for policy .

Multi-armed bandits and simple regret

Usually, there are not an unlimited amount of resources available to make decisions so we must make a choice with limited time to do so. If a multi-objective policy has a limited amount of time or rounds, , then it is a multi-armed bandit problem, also called a -armed bandit problem.

When a choice is made prematurely there is often some level of regret if it was not the right choice in retrospect. It is because of this that the difference between the utility of a given policy and the optimal policy is called regret. When we have a fixed amount of rounds, , to get a policy as close to optimal as we can, , we can calculate the difference in the utility value against the utility value of the optimal policy, . This is the simple regret[3], , which can be expressed as the equation .

Examples

Examples help motivate the topic of multi-attribute utility functions as they have a wide diversity of applications. Some are controversial like its use in hiring new employees. Other applications are emerging[4] with the potential for mass adoption, such as controlling traffic lights and the flow of traffic.

Hiring decisions

Companies may choose to give benefits to their "best employee". To calculate this some companies have opted to use machine learning and multi-objective utility theory to do so.[5] Factors that influence if an employee is considered better than another include attendance, work output, discipline, and reporting.

This could result in machine learning reinforcing gendered biases in the workplaces due to women doing more non-promotable work than men, like office chores.[6] As the public has become increasingly aware of this trend it has gained notoriety.[7] The majority of US citizens are against this process and believe that a machine learning hiring model would miss the individuality among applicants. At the same time, the majority of US citizens believe this practice would ensure everyone is treated equally, which is not what studies have shown.[8]

Traffic patterns

A traffic light network, or a Traffic Signal Control (TSC), may control the lights of many intersections and needs to manage competing objectives to make sure that traffic in all directions does not become congested during different times of the day and year.[9] Reinforcement learning may be used to optimize these control systems through A/B testing,[2] by gathering information on how a proposed solution affects traffic patterns. This process is iterative, if a designer is not satisfied with the results then a new solution must be created and tested.

An example of topology graph of traffic network.[10]

There are three major issues[2] with this.

  1. This can take weeks or months to gather enough data to account for seasonal traffic trends.
  2. The number of possible policies is large and often hard to predict how trading off priorities will affect traffic.
  3. For a large-scale project such as a municipal project the time requirements and unexpected results make it hard to manage an experiment's treatment and control group.

Utility evaluation and optimization

To measure the quality of a policy it helps to assume there is some optimal policy that the designer would like that can be measured against. However, if that policy is unknown then it has to be estimated with an off-policy learning estimator.

To do that, we need some scalarization vector, that, when applied as a linear program to the estimator , results in the maximum value possible. This can be expressed as the following equation.

.

Figure 1: Bias (upper) and root mean squared error (lower) of the three estimators for classification error.[1]
Figure 2: Classification error of direct loss minimization (upper) and filter tree (lower). Note that the representations used by DLM and the trees differ radically, conflating any comparison between the approaches. However, the Offset and Filter Tree[11] approaches share a similar representation, so differences in performance are purely a matter of superior optimization.[1]

While it might not be possible to find the exact policy, the goal is to find a near-optimal policy.

Optimization estimators

There are three estimators that are the focal point for the research that is being shown here. The direct method (DM), inverse propensity scoring (IPS), and the doubly robust (DR) estimator. The last of these is proposed in Dudìk et al. (2013),[1] the first of the two papers for this assignment.

Direct method

The direct method (DM) estimator[12] uses an offline-learned reward model, , to find the expected reward vector , using the equation:

The direct method is prone to bias when the policy isn't known.

Inverse propensity scoring

Inverse propensity scoring (IPS) is less prone to bias than the direct method. IPS uses logged records to correct between the existing logs and the new policies. A propensity score is the probability that a policy takes an action given a context. That is, for a policy , action , and context the propensity is .

IPS can to tuned to be less biased with a hyper-parameter, , but at the cost of becoming inconsistent with a high variance in values. The IPS estimator is represented by the following equation.

Doubly robust

The doubly robust (DR) evaluator[1] is put forward in the first of the two researched papers and builds off the DM and IPS estimators. DR is shown to have a lower variance than IPS if either one of two conditions is met:

  1. The propensity scores are correct
  2. The reward model is unbiased

Only one of these conditions needs to be met and if the reward model is unbiased, the error can be bounded like in the IPS estimator. The equation for the doubly robust estimator is the following equation.

The IMO3 algorithm

With IMO3 the goal is to gather a group of good policies and then, after a fixed amount of iterations, present them to the policy designer all at once. IMO3 is a general non-adaptive algorithm which means that the estimator used will not change but the scalarization vector, , can change over time after incorporating the designer's feedback.

IMO3 identifies near-optimal policies with high probability, but this is dependent on the amount of feedback there is from the policy designer and the training data for off-policy estimation.

How it works

Algorithm 1: IMO3 pseudocode

Input: Logging policy , logged data , budget , and pre-selection budget .

 1: 
 2: for  do
 3:     
 4:     
 5:     
 6: 
 7: for  = 1,..., do
 8:     
 9:     
 10:  
 11: 
Figure 1: Simple regret of different algorithms for a fixed logged data size and varying interaction budget. Each experiment is averaged over 10 logged data, 10 randomly selected and 5 runs under each combination of logged data and .[2]
Figure 2: Simple regret of different algorithms for a fixed interaction budget and varying logged data size. Each experiment is averaged over 10 logged data, 10 randomly selected and 5 runs under each combination of logged data and .[2]
Figure 3: Simple regret of different algorithms by fixing logged data size and varying budget. Each experiment is averaged over 10 logged data, 10 randomly selected and 5 runs under each combination of logged data and .[13]
Figure 4: Simple regret of different algorithms by fixing budget and varying logged data size. Each experiment is averaged over 10 logged data, 10 randomly selected and 5 runs under each combination of logged data and .[13]

Briefly, the algorithm consists of considering a fixed time spent gathering policies, finding the best candidates to present to the designer, and using the best policy of the group after incorporating the designer's feedback before starting again.

More formally, the first challenge is that the possible policies are a continuous and infinite set, . To overcome this, that policy space is discretized to a new set that is split into diverse policies.

To do this discretization of the policies involves learning what the scalarized vector should be, or rather, learning what the maximum likelihood estimate (MLE) is, which poses an efficiency challenge. To address this G-optimality is used. This method tries to minimize the maximum entry in a diagonal of the hat matrix which minimizes the maximum variance of the predicted values.

With our discretized policy space and policies we spend a specified number of rounds, , querying the designer. For an objective value , the designer uses a logistical response model to respond with a probability that a policy is acceptable.

In round , we draw a policy and present its values to the designer. The response from the designer is where is the Bernoulli distribution with mean .

Finally, we find the MLE from the collected observations and use it to identify the optimal policy among them.

Using estimators with IMO3

The way the second paper builds on the first is that IMO3 uses the doubly robust estimator as a baseline for performance. Since the DR estimator is slightly better than DM and ISP, the experiment results from Figure 3 and Figure 4 show that using a better off-policy estimator with IMO3 provides better performance.[13]

Future work

IMO3 adopted a simple non-adaptive algorithm here but in the future, there is room to generalize (and analyze) more complex utility functions, and other types of feedback and response models.

Annotated Bibliography

  1. 1.0 1.1 1.2 1.3 1.4 1.5 Dudìk, Miroslav; Langford, John; Li, Lihong (2013). "Doubly Robust Policy Evaluation and Learning". ICML'11: Proceedings of the 28th International Conference on International Conference on Machine Learning: 1097–1104.
  2. 2.0 2.1 2.2 2.3 2.4 2.5 Wang, Nan; Wang, Hongning; Karimzadehgan, Maryam; Kveton, Branislav; Boutilier, Craig (2022). "IMO^3: Interactive Multi-Objective Off-Policy Optimization". International Joint Conference on Artificial Intelligence Organization: 3523–3529.
  3. Keeney, Ralph L. (2014). Decisions with Multiple Objectives. Cambridge University Press. ISBN 9781139174084.
  4. Akhtar, Mahmuda; Moridpour, Sara (January 2021). "A Review of Traffic Congestion Prediction Using Artificial Intelligence". Journal of Advanced Transportation. 2021.
  5. Umar, Rusydi; Sahara, Dewi (December 2022). "Best Employee Decision Using Multi Attribute Utility Theory Method". Information Technology Articles. 6: 945–951. |first= missing |last= (help)
  6. Walker, Lisa (March 2020). "Eliminating Gender Bias In Nonpromotable Work Tasks". Forbes. Retrieved October 2023. Check date values in: |access-date= (help)
  7. Rainie, Lee; Anderson, Monica; McClain, Colleen; Vogels, Emily A.; Gelles-Watnick, Risa (April 2023). "Americans' views on use of AI in hiring". Pew Research Center. Retrieved October 2023. Check date values in: |access-date= (help)
  8. Raghavan, Manish; Barocas, Solon; Kleinberg, Jon; Levy, Karen (January 2020). "Mitigating bias in algorithmic hiring: evaluating claims and practices". FAT* '20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency: 469–481.
  9. Bouktif, Salah; Cheniki, Abderraouf; Ouni, Ali; El-Sayed, Hesham (May 2023). "Deep reinforcement learning for traffic signal control with consistent state and reward design approach". Knowledge-Based Systems. 267.
  10. Yao, Baozhen; Ma, Ankun; Feng, Rui; Xiaopeng Shen, Xiaopeng; Zhang, Mingheng; Yansheng Yao, Yansheng (January 2022). "A Deep Learning Framework About Traffic Flow Forecasting for Urban Traffic Emission Monitoring System". Frontiers in Public Health. 9.
  11. Beygelzimer, Alina; Langford, John (June 2009). "The offset tree for learning with partial labels". KDD '09 Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 129–138.
  12. Lambert, Diane; Pregibon, Daryl (August 2007). "More bang for their bucks: assessing new features for online advertisers". ADKDD '07: Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising: 7–15.
  13. 13.0 13.1 13.2 Wang, Nan; Wang, Hongning; Karimzadehgan, Maryam; Kveton, Branislav; Boutilier, Craig (January 2022). "IMO3: Interactive Multi-Objective Off-Policy Optimization". ICML'11: Proceedings of the 28th International Conference on International Conference on Machine Learning – via arXiv.