“…As for model development, to generate a reference standard for image labelling, 18 studies used expert consensus [27–33, 35–38, 49, 53–55, 71, 77, 83], two relied on the opinion of a single expert reader [76, 85], 16 used pre-existing radiological reports or other imaging modalities [34, 41, 43, 45, 46, 52, 60, 61, 67, 75, 78–82, 87], one study defined their reference standard as surgical confirmation (indicated for surgery) [86], 11 studies used mixed methods (any combination of the aforementioned) [40, 47, 48, 50, 51, 62, 63, 65, 69, 70, 73] and two studies did not report how their reference standard was generated [74, 88]. As for model testing, to generate a reference standard for image labelling, 26 studies used expert consensus [26–28, 30–33, 38, 39, 44, 51, 54–57, 61, 64, 66, 68, 71–73, 77, 80, 83, 84], two relied on the opinion of a single expert reader [58, 85], 11 used pre-exist...…”