We investigated automatic approaches for clustering data that describe occupations associated with a hazardous airborne exposure (beryllium). The regulatory compliance data from the Occupational Safety and Health Administration include records containing short free-text job descriptions and associated numerical exposure levels. Researchers in the public health domain need to map job descriptions to the Standard Occupational Classification (SOC) nomenclature in order to estimate occupational health risks. The previous manual process was time-consuming and never advanced as far as linkage to SOC. We therefore compared alternative automatic approaches for clustering job descriptions. The clustering results are an essential first step towards discovering the corresponding SOC terms. Our study indicated that the Tolerance Rough Set with Jaccard similarity was the better combination overall. The utility of the algorithm was further verified by applying logistic regression and validating that the predictive power of the automatically generated classifications, in terms of the association of "job" with the probability of exposure to beryllium above a certain threshold, closely approached that of the manually assembled classification of the same 12,148 records.
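As a rough illustration of the Jaccard similarity mentioned above, the following minimal sketch compares short job descriptions by their token sets. It is not the study's implementation: the whitespace tokenization, the function name `jaccard`, and the example descriptions are assumptions made for illustration, and the Tolerance Rough Set step is not shown.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        # Convention assumed here: two empty descriptions count as identical.
        return 1.0
    # Size of the shared vocabulary divided by size of the combined vocabulary.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical job descriptions, not taken from the OSHA records.
descriptions = [
    "furnace operator",
    "melting furnace operator",
    "dental technician",
]

# Print the similarity of every unordered pair of descriptions.
for i in range(len(descriptions)):
    for j in range(i + 1, len(descriptions)):
        score = jaccard(descriptions[i], descriptions[j])
        print(f"{descriptions[i]!r} vs {descriptions[j]!r}: {score:.2f}")
```

Descriptions that share most of their words score close to 1 and those with no words in common score 0, which is the kind of pairwise signal a clustering step could then group on.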