AI summary
The authors created QCalEval, a benchmark that tests how well vision-language models (VLMs) understand quantum computing calibration plots, which researchers use to interpret experimental data. They collected 243 plot samples from superconducting-qubit and neutral-atom experiments and tested different VLMs on several types of questions. The best general-purpose model reached a mean score of 72.3 without prior examples, but many open models got worse when asked to learn from multiple images at once, while the strongest closed models improved. The authors also showed that fine-tuning a 9-billion-parameter model improves zero-shot performance but does not fully solve the challenge of learning from many images together. Additionally, they released a new open model, NVIDIA Ising Calibration 1, that scores slightly higher than the other tested models in the zero-shot setting.
quantum computing, calibration plots, vision-language models, zero-shot learning, in-context learning, superconducting qubits, neutral atoms, fine-tuning, multimodal learning
Authors
Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore
Abstract
Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose model reaches a mean zero-shot score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches a zero-shot average score of 74.7.
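To make the two evaluation settings concrete, the sketch below shows how a zero-shot and an in-context run differ at the prompt level. It is illustrative only: the JSONL field names (image, question, reference), the model.generate interface, and the exact-match grader are assumptions for the sake of the example, not QCalEval's released schema or scoring method.

```python
import json

def score_answer(answer: str, reference: str) -> float:
    """Toy grader: case-insensitive exact match. QCalEval's actual
    per-question-type scoring is not specified here; this is a stand-in."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def build_messages(sample: dict, icl_examples: tuple = ()) -> list:
    """Assemble a multimodal chat prompt.

    Zero-shot: only the query plot and its question. In-context learning:
    solved example plots and their answers are prepended, so the model
    must reason over several calibration plots at once.
    """
    messages = []
    for ex in icl_examples:
        messages.append({"role": "user", "content": [
            {"type": "image", "path": ex["image"]},
            {"type": "text", "text": ex["question"]},
        ]})
        messages.append({"role": "assistant", "content": ex["reference"]})
    messages.append({"role": "user", "content": [
        {"type": "image", "path": sample["image"]},
        {"type": "text", "text": sample["question"]},
    ]})
    return messages

def evaluate(model, samples_path: str, icl_examples: tuple = ()) -> float:
    """Mean score (0-100) over a JSONL benchmark file for one setting.

    `model` is any object exposing generate(messages) -> str; the JSONL
    fields "image", "question", and "reference" are assumed names.
    """
    scores = []
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            answer = model.generate(build_messages(sample, icl_examples))
            scores.append(score_answer(answer, sample["reference"]))
    return 100.0 * sum(scores) / len(scores)
```

Under this framing, the in-context setting simply prepends solved examples as extra image turns, which is why multi-image handling becomes the bottleneck the abstract describes.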