MedAI #131: Analyzing and Exposing Vulnerabilities in Language Models | Yibo Wang
Introduction
Welcome to the 131st Stanford MedAI Group exchange session. This week, we hosted Yibo Wang from the University of Illinois Chicago, who presented his research on analyzing and addressing vulnerabilities in language models. Yibo, a Ph.D. student in the Computer Science department advised by Professor Philip S. Yu, specializes in natural language processing (NLP) and trustworthy large language models (LLMs), with a particular focus on code generation with LLMs.
Presentation Overview
Yibo's presentation covered two primary papers highlighting different perspectives:
- Robustness: This section introduced a novel method called "DA3," which generates adversarial examples that successfully attack language models while remaining hard to detect.
- Fairness: The second part of the talk was about measuring and mitigating gender affiliations in LLMs, emphasizing the social implications of language bias.
Paper 1: Robustness
Yibo began by discussing adversarial attacks in NLP, explaining how subtly altered input text can mislead language models. Existing adversarial attack methods fall into character-level, word-level, and sentence-level modifications. He illustrated their effectiveness with examples in which the modified inputs successfully deceived a victim model but were detectable to varying degrees.
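As a rough illustration of these categories only (not the attack methods from the talk), the toy sketch below applies a character-level typo-style edit and a dictionary-based word substitution; real attacks search for perturbations that actually flip the victim model's prediction.

```python
# Toy sketch of character- and word-level perturbations (illustrative only).
import random

def char_level_perturb(text: str) -> str:
    """Swap two adjacent characters in one random word (a typo-style edit)."""
    words = text.split()
    idx = random.randrange(len(words))
    w = words[idx]
    if len(w) > 3:
        i = random.randrange(len(w) - 1)
        words[idx] = w[:i] + w[i + 1] + w[i] + w[i + 2:]
    return " ".join(words)

def word_level_perturb(text: str, synonyms: dict[str, str]) -> str:
    """Replace words with hand-picked synonyms (a stand-in for embedding-based substitution)."""
    return " ".join(synonyms.get(w.lower(), w) for w in text.split())

print(char_level_perturb("the movie was absolutely wonderful"))
print(word_level_perturb("the movie was absolutely wonderful",
                         {"wonderful": "marvelous"}))
```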
To visualize how detectable these adversarial examples are, he showed the distribution of maximum softmax probabilities (MSP) for original versus adversarial data, which revealed a clear separation between the two and therefore makes adversarial inputs easier to detect.
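A minimal sketch of MSP-based detection is shown below, assuming we already have victim-model logits for clean and adversarial inputs; the threshold is an assumed value, whereas in practice it would be fit on held-out clean data.

```python
# MSP-based detection sketch: low maximum softmax probability flags a suspicious input.
import torch
import torch.nn.functional as F

def max_softmax_probability(logits: torch.Tensor) -> torch.Tensor:
    """MSP = highest class probability per example."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

clean_logits = torch.tensor([[4.0, 0.5], [3.5, 0.2]])  # confident predictions
adv_logits = torch.tensor([[1.1, 0.9], [0.8, 1.0]])    # near the decision boundary
threshold = 0.75                                       # assumed; tuned on clean data in practice

print(max_softmax_probability(clean_logits))               # high MSP -> looks clean
print(max_softmax_probability(adv_logits) < threshold)     # True -> flagged as adversarial
```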
Yibo then introduced DA3, a new approach for generating adversarial examples that are both successful and hard to detect, through a two-phase process: a fine-tuning phase using selected models, followed by an inference phase that fills in masked tokens.
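The talk did not walk through DA3's training objective in detail. Purely to illustrate the inference-time idea of masked-token generation, the sketch below queries an off-the-shelf masked language model via the Hugging Face fill-mask pipeline; the model name and prompt are placeholders, and this is not the paper's procedure.

```python
# Conceptual sketch of masked-token generation with an off-the-shelf masked LM.
# This only shows how in-distribution replacements can be proposed for a masked
# position; it is NOT the DA3 training or candidate-selection method.
from transformers import pipeline

fill = pipeline("fill-mask", model="distilroberta-base")
candidates = fill("The movie was absolutely <mask>.", top_k=5)
for c in candidates:
    # Each candidate is a fluent, high-probability token, which keeps such
    # edits close to the original data distribution.
    print(c["token_str"], round(c["score"], 3))
```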
He measured attack effectiveness with two metrics: the standard Attack Success Rate (ASR) and the newly proposed Non-detectable Attack Success Rate (NASR), which integrates attack success with detectability. The experiments showed that DA3 outperformed other methods across various datasets, achieving higher NASR scores and underscoring the importance of accounting for detectability in adversarial attacks.
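The sketch below computes ASR and a detectability-aware variant. The NASR formula here is my assumption of "success rate combined with detectability" (an attack counts only if it both flips the label and evades an MSP-style detector), not necessarily the paper's exact definition.

```python
# Hedged sketch of ASR and a detectability-aware success rate.
def attack_success_rate(flipped: list[bool]) -> float:
    """Fraction of attacked examples whose prediction the attack flips."""
    return sum(flipped) / len(flipped)

def non_detectable_asr(flipped: list[bool], detected: list[bool]) -> float:
    """Count an attack as successful only if it flips the label AND evades the detector."""
    wins = [f and not d for f, d in zip(flipped, detected)]
    return sum(wins) / len(flipped)

flipped  = [True, True, False, True]    # did the adversarial example fool the victim model?
detected = [False, True, False, False]  # did the MSP detector flag it?
print(attack_success_rate(flipped))           # 0.75
print(non_detectable_asr(flipped, detected))  # 0.5
```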
Paper 2: Fairness
The second paper addressed the importance of assessing gender bias in LLMs. Yibo outlined how gender bias can manifest in AI outputs, potentially perpetuating stereotypes. To analyze this bias, the study developed strategies that used prompts without explicit gender references and measured the gender affiliations in the generated outputs.
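A minimal sketch of this idea follows, assuming the goal is to feed a prompt with no explicit gender cue and tally gendered words in the model's continuation; the word lists, prompt, and continuation are illustrative, not the paper's benchmark.

```python
# Token-level tally of explicitly gendered words in a model continuation (illustrative).
GENDERED = {
    "male": {"he", "him", "his", "man", "boy"},
    "female": {"she", "her", "hers", "woman", "girl"},
}

def gender_affiliation(generated_text: str) -> dict[str, int]:
    """Count gendered words appearing in the generated text."""
    tokens = [t.strip(".,") for t in generated_text.lower().split()]
    return {g: sum(t in words for t in tokens) for g, words in GENDERED.items()}

# The prompt contains no gender reference; the continuation stands in for LLM output.
prompt = "The nurse prepared the medication because"
continuation = "she wanted the patient to recover quickly."
print(gender_affiliation(continuation))   # {'male': 0, 'female': 1}
```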
The study used both token-level and logit-level metrics to evaluate explicit and implicit biases. The findings revealed that even gender-neutral prompts can elicit gendered outputs, showing that the degree of bias in a model's generations depends on how it is prompted.
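The sketch below shows one way a logit-level probe could look, assuming implicit bias is read off the model's next-token probabilities for gendered pronouns rather than from the generated text itself; the model name and prompt are placeholders, not the paper's setup.

```python
# Rough logit-level probe: compare next-token probabilities of gendered pronouns.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The engineer fixed the server because"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(logits, dim=-1)

he_id = tok(" he", add_special_tokens=False)["input_ids"][0]
she_id = tok(" she", add_special_tokens=False)["input_ids"][0]
# A large gap between these two probabilities is one signal of implicit bias.
print(float(probs[he_id]), float(probs[she_id]))
```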
Yibo then examined several mitigation strategies, including hyperparameter tuning, instruction-based guiding, and a debias-tuning technique. The results showed that debias tuning was the most effective at reducing bias, while instruction-based guiding also yielded notable improvements.
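Of these, instruction-based guiding is the simplest to illustrate: a debiasing instruction is prepended to the prompt before generation. The instruction wording below is an assumption meant only to show the mechanism, not the paper's exact templates.

```python
# Toy sketch of instruction-based guiding: prepend a debiasing instruction to the prompt.
DEBIAS_INSTRUCTION = (
    "Continue the text without assuming anyone's gender; "
    "use gender-neutral language such as 'they' where a pronoun is needed.\n\n"
)

def guided_prompt(user_prompt: str) -> str:
    """Wrap the original prompt with the debiasing instruction."""
    return DEBIAS_INSTRUCTION + user_prompt

print(guided_prompt("The nurse prepared the medication because"))
```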
Conclusion
In summary, Yibo Wang's research highlights vital areas in the analysis of language models, underscoring the need to understand both their vulnerabilities and their biases. His work proposes new methodologies for generating hard-to-detect adversarial examples and for measuring and mitigating gender bias in LLM outputs.
Questions and Discussion
After the presentation, several audience members engaged in discussions about the methodologies and findings, highlighting the importance of ongoing research in this domain.
A round of applause concluded the session as attendees expressed gratitude for Yibo's enlightening presentation on these critical issues in the landscape of machine learning.
Keywords
- Adversarial attacks
- Language models
- Robustness
- Gender bias
- Non-detectable adversarial examples
- Fairness in AI
- Mitigation strategies
FAQ
Q1: What are adversarial attacks in language models?
A1: Adversarial attacks involve subtly altering input text to mislead or confuse the model, potentially leading to incorrect outputs.
Q2: What is the significance of the NASR metric?
A2: The Non-detectable Attack Success Rate (NASR) combines the attack success rate with detectability, emphasizing that adversarial methods should be both effective and hard to detect.
Q3: How does Yibo propose mitigating gender bias in language models?
A3: He explores various strategies, including hyperparameter tuning, instruction guiding, and debiasing tuning, to reduce gender bias in the outputs generated by LLMs.
Q4: What are the implications of gender affiliations in language models?
A4: When language models generate biased outputs, they can reinforce stereotypes and impact societal perceptions, making it crucial to assess and mitigate these biases.
Q5: Did the findings show any clear patterns in larger models regarding gender bias?
A5: No clear patterns emerged, as sometimes larger models produced more biased sentences, illustrating that size alone does not guarantee fairness.