trnka t1_j5k77wb wrote
Reply to comment by evys_garden in [D] Simple Questions Thread by AutoModerator
The distinction from application-level evaluation is a bit vague in that text. I'll use a medical example I'm more familiar with: predicting a diagnosis from text input.
Application-level evaluation: If the output is a diagnosis code and explanation, I might measure how often doctors accept the recommended diagnosis and read the explanation without requesting more information from the patient. I'd probably want a medical quality evaluation as well, to penalize cases where the model biases doctors toward accepting wrong diagnoses.
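To make that concrete, here's a minimal sketch of what those application-level metrics could look like. The per-encounter log format is entirely made up for illustration; none of these fields come from a real system:

```python
# Hypothetical per-encounter logs for an application-level evaluation.
from dataclasses import dataclass

@dataclass
class Encounter:
    accepted: bool          # doctor accepted the recommended diagnosis
    read_explanation: bool  # doctor opened and read the explanation
    ordered_more_info: bool # doctor requested more patient information
    correct: bool           # recommendation matched the final diagnosis

def application_metrics(encounters: list[Encounter]) -> dict[str, float]:
    n = len(encounters)
    accepted = [e for e in encounters if e.accepted]
    return {
        # How often the recommendation is accepted as-is.
        "acceptance_rate": len(accepted) / n,
        # Accepted after reading only the explanation, no extra workup.
        "accept_on_explanation_rate": sum(
            e.read_explanation and not e.ordered_more_info for e in accepted
        ) / n,
        # Quality guard: of the accepted recommendations, how many were right?
        # High acceptance with low accuracy here suggests the model is
        # biasing doctors rather than helping them.
        "accepted_accuracy": (
            sum(e.correct for e in accepted) / len(accepted) if accepted else 0.0
        ),
    }
```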
Non-expert evaluation: With the same model, I might compare 2-3 different models plus a random baseline. I'd ask people like myself, with some exposure to medicine but no medical degree, which explanation is best for a particular case, then check whether the real models beat the random baseline.
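For those head-to-head votes, a simple significance check against the random baseline might look like the sketch below. The binomial test is scipy's; the vote counts are invented:

```python
# Non-expert preference evaluation: raters pick the better explanation
# between a candidate model and a random baseline.
from scipy.stats import binomtest

def preference_vs_random(wins_for_model: int, total_votes: int, alpha: float = 0.05):
    """Test whether raters prefer the model's explanations more often than
    the 50/50 split we'd expect if they couldn't tell it from random."""
    result = binomtest(wins_for_model, total_votes, p=0.5, alternative="greater")
    return {
        "win_rate": wins_for_model / total_votes,
        "p_value": result.pvalue,
        "beats_random": result.pvalue < alpha,
    }

# e.g. 72 of 100 head-to-head votes favored the model's explanation
print(preference_vs_random(72, 100))
```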
That said, I'm not used to seeing non-experts used as evaluators, though it makes some sense early on, when explanations are still poor enough that non-experts can judge them.
I'm more used to seeing the distinction between real and artificial evaluation. I included that in my example above: "real" is when we ask users to accomplish some task that relies on the explanation and we measure task success. "Artificial" is just asking for an opinion about the explanation; evaluators won't be as critical as they would be in a task-based evaluation.
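For the "real" side, one simple analysis is comparing task success rates between a group that gets explanations and a control group that doesn't. This is just a sketch with made-up counts, using scipy's Fisher exact test:

```python
# Task-based ("real") evaluation: did the group that saw explanations
# succeed at the task more often than the control group?
from scipy.stats import fisher_exact

# rows: [with explanations, control]; columns: [task successes, failures]
table = [[41, 19], [29, 31]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
success_with = table[0][0] / sum(table[0])
success_without = table[1][0] / sum(table[1])
print(f"task success with explanations: {success_with:.2f}")
print(f"task success without:           {success_without:.2f}")
print(f"one-sided p-value:              {p_value:.3f}")
```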
Hope this helps! I'm not an expert in explainability, though I've done some work with it in production in healthcare tech.