IndQA: OpenAI’s First Cultural Benchmark Begins with Indian Languages

OpenAI has launched IndQA, a multilingual, culture-sensitive benchmark designed to evaluate how well AI models understand and reason about questions grounded in Indian languages and cultural contexts. Released on November 4, 2025, it is OpenAI’s first major region-specific benchmark, focusing on linguistic diversity, cultural nuance, and contextual understanding in India, the company’s second-largest user market for ChatGPT.

What Is IndQA?

IndQA stands for Indian Question-Answering benchmark. It currently features 2,278 questions covering 11 Indian languages:

  • Hindi, Hinglish, Gujarati, Punjabi, Kannada, Odia, Marathi, Malayalam, Tamil, Bengali, and Telugu

The benchmark spans 10 cultural domains:

  • Law and Ethics
  • Architecture and Design
  • Food and Cuisine
  • Everyday Life
  • Religion and Spirituality
  • Sports and Recreation
  • Literature and Linguistics
  • Media and Entertainment
  • Arts and Culture
  • History

It was developed with the input of 261 domain experts, including scholars, journalists, linguists, artists, and subject specialists.

How Does IndQA Work?

  • The evaluation process is built around a rubric-based grading system, where each AI-generated response is scored against predefined criteria crafted by experts for each question.
  • Each criterion is assigned weighted points based on its relevance and importance.
  • A model-based grader checks responses against these criteria, and the final score is calculated accordingly.
  • During creation, all questions were tested against OpenAI’s most powerful models at the time, including GPT-4o, GPT-4.5, GPT-5, and OpenAI o3, to ensure adversarial robustness.
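
The weighted rubric scoring described above can be illustrated with a minimal Python sketch. It assumes the final score is the points earned on satisfied criteria divided by the total points available, which is one plausible reading of the process rather than OpenAI's published implementation. The `Criterion` class, the `rubric_score` function, and the sample criteria are hypothetical, and the model-based grader is stood in for by a plain list of true/false verdicts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One expert-written rubric item (hypothetical structure)."""
    description: str
    points: float  # weight reflecting the criterion's importance


def rubric_score(criteria: Sequence[Criterion], met: Sequence[bool]) -> float:
    """Return the share of available points earned, between 0.0 and 1.0.

    `met[i]` is the grader's verdict on `criteria[i]`. In IndQA the verdicts
    come from a model-based grader; here they are supplied by hand.
    """
    if len(criteria) != len(met):
        raise ValueError("each criterion needs exactly one verdict")
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return earned / total


# Illustrative rubric for a single question (invented for this example).
example_criteria = [
    Criterion("Identifies the correct regional tradition", 5),
    Criterion("Explains its historical origin", 3),
    Criterion("Answers in the language of the question", 2),
]
example_verdicts = [True, False, True]

print(f"{rubric_score(example_criteria, example_verdicts):.0%}")
```

In this sketch a response that meets the first and third criteria earns 7 of 10 points. A benchmark-level figure could then be obtained by averaging such per-question scores, though the article does not specify how the aggregation is done.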

Benchmark Performance: AI Models Compared

Initial benchmarking results on IndQA showed significant variance among leading models:

  • GPT-5 (Thinking High): 34.9% (Highest overall)
  • Gemini 2.5 Pro Thinking: 34.3%
  • Gemini 2.5 Flash Thinking: 29.7%
  • Grok 4: 28.5%
  • OpenAI o3 High: 28.1%
  • GPT-4o: 20.3%
  • GPT-4 Turbo: 12.1%

Language-wise observations

  • Highest performance was seen in Hindi and Hinglish, where GPT-5 scored around 45% and 44% respectively.
  • Lowest performance was observed in Bengali and Telugu, pointing to gaps in how current AI models handle these languages.
  • OpenAI clarified that IndQA is not a cross-language leaderboard, since the questions differ across languages. Instead, it serves as a within-model benchmark to measure progress over time.