Lingyu Gao

AI Research Engineer

Duolingo

About Me

Hello! I’m Lingyu, an AI research engineer at Duolingo. I hold a Ph.D. in Computer Science, specializing in natural language processing.

I was very fortunate to have Prof. Kevin Gimpel as my advisor. During my Ph.D. studies, I focused on text classification and generation, aiming to identify and enhance the capabilities of the generative components of pretrained language models.

Expertise

Natural Language Processing
Deep Learning & Machine Learning
Data Analysis & Visualization

Education

Toyota Technological Institute at Chicago
Ph.D. in Computer Science, M.S. in Computer Science (earned within the Ph.D. program)
Tsinghua University
M.E. in Electrical Engineering, B.E. in Electrical Engineering and Automation

Skills

Programming & Libraries

Python, PyTorch, TensorFlow, pandas, and more

Documentation Tools

MS Office, LaTeX, Version Control (Git)

Languages

Mandarin, English

Work Experience & Internships

AI Research Engineer

Duolingo

January 2025 – Present Pittsburgh, PA, USA

AI Engineer

Educational Testing Service

July 2024 – December 2024 Princeton, NJ, USA

Key Skills: PyTorch, Python, AWS

Fine-tuned LLMs with domain-specific data.
Collaborated with test developers, engineers, and scientists for high-quality test item creation.
Conducted automatic scoring with multi-agent systems.

Research Intern

Google LLC

May 2023 – August 2023 Mountain View, CA, USA

Target: Selecting Better In-Context Learning Demonstrations for Text Classification

Key Skills: TensorFlow, pandas, Python, NumPy, LaTeX

Models: Flan-PaLM 2, off-the-shelf retriever (fine-tuned on mT5)

Completed over 100 pages of documentation and 4,000+ lines of code. Prepared a paper for submission.
Achieved a +2.6% gain in macro F1 score over an already strong baseline that matches or exceeds current benchmarks.
Proposed constraints for demonstration selection that are potentially adaptable to other applications, including ranking.

Research Intern

TikTok Inc.

May 2022 – August 2022 Remote

Target: Generating Questions of Different Styles Controlled with Keywords

Key Skills: PyTorch, PyTorch Lightning, Python, NumPy

Models: T5, mT5, ByT5

Authored over 3,600 lines of code.
Demonstrated that a T5 model enhanced with additional tokens, such as emojis, outperforms other models at jointly generating keywords and topics, and surpasses spaCy on keyword extraction.
Generated questions controlled with keywords, topics, and a specified length. Found that using distinct models for different question styles yields better results.

Intern

Educational Testing Service

June 2021 – August 2021 Remote

Target: Generating and Ranking Inquisitive Questions Controlled with Question Types

Key Skills: PyTorch, Fairseq, pandas, Python, NumPy, LaTeX

Models: RoBERTa, BART

Code is publicly available on GitHub (5,000+ lines). Our paper was accepted for presentation at *SEM 2022.
Produced diverse questions tailored to specific question types.
Leveraged a pairwise ranker to select generated questions that matched the quality of human-crafted queries in terms of syntax, semantics, relevancy, and inquisitiveness, as validated by human assessment.

Publications

Lingyu Gao, Aditi Chaudhary, Krishna Srinivasan, Kazuma Hashimoto, Karthik Raman, Michael Bendersky (2023). Ambiguity-Aware In-context Learning with Large Language Models. arXiv preprint.


Xiaomeng Ma, Lingyu Gao (2023). Evaluating Transformer Models and Human Behaviors on Chinese Character Naming. TACL.


Xiaomeng Ma, Lingyu Gao, Qihui Xu (2023). ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind. CoNLL.


Lingyu Gao, Debanjan Ghosh, Kevin Gimpel (2023). The Benefits of Label-Description Training for Zero-Shot Text Classification. EMNLP.


Xiaomeng Ma, Lingyu Gao (2022). How do we get there? Evaluating transformer neural networks as cognitive models for English past tense inflection. AACL-IJCNLP.



Awards & Honors

2021: ETS Pre-Doctoral Fellowship
2014: Mitsubishi Heavy Industries Scholarship, one of 25 recipients selected from approximately 180 candidates
2013: NARI-RELAYS Scholarship, ranking in approximately the top 15%
2011: 1st Grade Academic Excellence Scholarship, placing 6th out of 120 candidates
2010: 2nd Grade Freshman Scholarship, ranking 2nd in the entire province

Teaching & Services

Teaching:

2019: Teaching Assistant, Introduction to Machine Learning

Reviewer Services:

NAACL-HLT 2021
BEA 2022 & 2023
EMNLP 2022 & 2023
ACL 2023
TALLIP 2023 & 2024
ARR 2024
Secondary Reviewer: EMNLP 2019 & RepL4NLP 2020

Other Services & Activities:

2023 – Present: Volunteer, Circle Cat (a non-profit organization)
2020 – 2021: Student Member, TTIC Diversity, Equity, and Inclusion (DEI) Committee
2020: Student Member, TTIC Ph.D. Admissions Committee
2011: Teaching Volunteer, Mabian Yi Autonomous County, Sichuan, China
2011: Member, Student Association for Science and Technology, EE Department at Tsinghua University