We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See imagen.research.google for an overview of the results.
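The architectural claim above, that a frozen, text-only pretrained encoder such as T5 suffices to condition high-fidelity image diffusion, can be illustrated with a minimal sketch. The snippet below is not the authors' code; it assumes the Hugging Face transformers library and uses "t5-large" as an illustrative stand-in for the larger T5-XXL encoder Imagen uses, showing only how frozen text embeddings would be produced before being passed to a diffusion U-Net.

```python
# Minimal sketch (not the authors' code): extracting frozen T5 text embeddings
# of the kind Imagen conditions its diffusion models on. "t5-large" is an
# illustrative choice; the paper uses the larger T5-XXL encoder.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()  # frozen: never fine-tuned

prompt = "A brain riding a rocketship heading towards the moon."
tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=128, truncation=True)

with torch.no_grad():  # the text encoder stays frozen during diffusion training
    text_embeddings = encoder(input_ids=tokens.input_ids,
                              attention_mask=tokens.attention_mask).last_hidden_state

# text_embeddings (shape [1, 128, d_model]) would then serve as cross-attention
# conditioning for a cascaded diffusion U-Net (not shown here).
print(text_embeddings.shape)
```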
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
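For readers unfamiliar with the technique, instruction prompt tuning keeps the LLM frozen and learns only a small set of soft prompt embeddings, used together with hand-crafted instructions and exemplars. PaLM and Flan-PaLM are not publicly available, so the sketch below uses GPT-2 from the Hugging Face transformers library purely as a stand-in frozen model; it is a generic soft prompt tuning sketch under those assumptions, not the Med-PaLM implementation.

```python
# Hedged sketch of soft prompt tuning: only the prompt embeddings are trained,
# while every weight of the (stand-in) frozen language model is left untouched.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():          # freeze the entire LLM
    p.requires_grad = False

n_prompt, d_model = 20, lm.config.n_embd
soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)  # only trainable weights

def loss_with_prompt(text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_emb = lm.get_input_embeddings()(ids)                       # [1, T, d_model]
    emb = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)    # prepend soft prompt
    labels = torch.cat([torch.full((1, n_prompt), -100), ids], dim=1)  # ignore prompt positions
    return lm(inputs_embeds=emb, labels=labels).loss

optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
loss = loss_with_prompt("Question: ... Answer: ...")  # a clinician-written exemplar would go here
loss.backward()
optimizer.step()
```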
We congratulate the KDIGO (Kidney Disease: Improving Global Outcomes) work group on their comprehensive work in a broad subject area and agree with many of the recommendations in their clinical practice guideline on the evaluation and management of chronic kidney disease. We concur with the KDIGO definitions and classification of kidney disease and welcome the addition of albuminuria categories at all levels of glomerular filtration rate (GFR), the terminology of G categories rather than stages to describe level of GFR, the division of former stage 3 into new G categories 3a and 3b, and the addition of the underlying diagnosis. We agree with the use of the heat map to illustrate the relative contributions of low GFR and albuminuria to cardiovascular and renal risk, though we thought that the highest risk category was too broad, including as it does people at disparate levels of risk. We add an albuminuria category A4 for nephrotic-range proteinuria and D and T categories for patients on dialysis or with a functioning renal transplant. We recommend a target blood pressure of 140/90 mm Hg regardless of diabetes or proteinuria, and against the combination of angiotensin receptor blockers with angiotensin-converting enzyme inhibitors. We recommend against routine protein restriction. We concur on individualization of hemoglobin A1c targets. We do not agree with routine restriction of sodium intake to <2 g/d, instead suggesting reduction of sodium intake in those with high intake (>3.3 g/d). We suggest screening for anemia only when GFR is <30 mL/min/1.73 m². We recognize the absence of evidence on appropriate phosphate targets and methods of achieving them and do not agree with suggestions in this area. In drug dosing, we agree with the recommendation of using absolute clearance (ie, milliliters per minute), calculated from the patient's estimated GFR (which is normalized to 1.73 m²) and the patient's actual anthropometric body surface area. We agree with referral to a nephrologist when GFR is <30 mL/min/1.73 m² (and for many other scenarios), but suggest a urine albumin-creatinine ratio >60 mg/mmol or proteinuria with protein excretion >1 g/d as the referral threshold for proteinuria.
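The drug-dosing recommendation amounts to a simple unit conversion: the laboratory-reported eGFR is normalized to a body surface area of 1.73 m², so the patient's absolute clearance in mL/min is obtained by multiplying by their actual BSA and dividing by 1.73. The small worked sketch below uses the Mosteller BSA formula as one common choice; the commentary does not mandate a particular BSA equation.

```python
# Illustrative sketch of de-normalising eGFR (mL/min/1.73 m^2) to absolute
# clearance (mL/min) using the patient's actual body surface area.
from math import sqrt

def mosteller_bsa(height_cm: float, weight_kg: float) -> float:
    """Body surface area in m^2 (Mosteller formula)."""
    return sqrt(height_cm * weight_kg / 3600)

def absolute_clearance(egfr_normalized: float, bsa_m2: float) -> float:
    """Convert eGFR in mL/min/1.73 m^2 to absolute clearance in mL/min."""
    return egfr_normalized * bsa_m2 / 1.73

bsa = mosteller_bsa(height_cm=180, weight_kg=90)           # ~2.12 m^2
print(absolute_clearance(egfr_normalized=45, bsa_m2=bsa))  # ~55 mL/min
```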