Obscenity (the use of rude words or offensive expressions) has spread from informal verbal conversations to digital media, becoming increasingly common in user-generated comments found in Web forums, newspaper user boards, social networks, blogs, and media-sharing sites. The basic obscenity-blocking mechanism is based on verbatim comparisons against a blacklist of banned vocabulary; however, creative users circumvent these filters by obfuscating obscenity with symbol substitutions or bogus segmentations that still visually preserve the original semantics, such as writing shit as $h¡;t or s.h.i.t or, even worse, mixing them as $.h….¡.t. The number of potential obfuscated variants is combinatorial, rendering the verbatim filter impractical. Here we describe a method intended to counter this practice, inspired by sequence alignment algorithms used in genomics and coupled with a tailor-made edit penalty function. The method requires only the vocabulary of plain obscenities to be set up; no further training is needed. Its complexity when screening for a single obscenity is linear, both in runtime and memory, in the length of the user-generated text. We validated the method in three different experiments. The first one involves a new dataset that is also introduced in this article; it consists of a set of manually annotated real-life comments in Spanish, gathered from the news user boards of an online newspaper, containing this type of obfuscation. The second one uses a publicly available dataset of comments in Portuguese from a sports Web site. In these experiments, at the obscenity level, we observed recall rates greater than 90%, whereas precision rates varied between 75% and 95%, depending on the sequence length (shorter lengths yielded a higher number of false alarms). At the comment level, we report recall of 86%, precision of 91%, and specificity of 98%. The last experiment revealed that the method is more effective at matching this type of obfuscation than the classical Levenshtein edit distance. We conclude by discussing the prospects of the method to help enforce moderation rules on obscene expressions or to serve as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques.
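The alignment idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the look-alike table, the cost values, and the function names are all assumptions made for illustration, since the abstract does not specify the actual penalty function. The sketch aligns one plain obscenity against any substring of a comment, charging little for look-alike symbols and separator characters and a full unit for everything else.

```python
# Hypothetical look-alike table; the paper's real penalty function is not given here.
LOOKALIKES = {"s": "$5", "i": "¡!1|", "a": "@4", "o": "0", "e": "3", "t": "7+"}


def sub_cost(p, c):
    """Cost of aligning pattern letter p with text character c (assumed values)."""
    if p == c:
        return 0.0
    if c in LOOKALIKES.get(p, ""):
        return 0.1  # visually similar symbol: cheap substitution
    return 1.0


def gap_cost(c):
    """Cost of an extra text character inside the word (assumed values)."""
    return 1.0 if c.isalnum() else 0.1  # separators such as '.' are cheap


def best_match_cost(pattern, text):
    """Minimum cost of aligning the whole lowercase pattern to any substring of text."""
    m = len(pattern)
    # col[i] holds the cost of aligning pattern[:i] with a substring ending
    # at the current text position; only one column is kept in memory.
    col = [float(i) for i in range(m + 1)]
    best = col[m]
    for c in text.lower():
        prev_diag = col[0]
        col[0] = 0.0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + sub_cost(pattern[i - 1], c),  # align letter with c
                old + gap_cost(c),                        # c is an extra character
                col[i - 1] + 1.0,                         # pattern letter is missing
            )
            prev_diag = old
        best = min(best, col[m])
    return best


for sample in ["s.h.i.t", "$h¡;t", "$.h….¡.t", "this is a shirt"]:
    print(sample, round(best_match_cost("shit", sample), 2))
```

For a fixed word, the loop visits each text character once, so the running time grows linearly with the comment length. A comment would be flagged when the cost falls below a threshold chosen by the moderator; that threshold, like the costs above, is an assumption. Because spaces count as cheap separators in this sketch, word sequences that happen to spell the pattern can raise false alarms.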
Background. The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder. Methodology/Principal Findings. Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities of much higher orders of magnitude, as shown with gene expression microarray datasets in which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables. Conclusions. Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
Background. After several waves of spread of the COVID-19 pandemic, countries around the world are struggling to recover their economies by slowly lifting the mobility restrictions and social distancing measures applied during the crisis. Meanwhile, recent studies provide compelling evidence on how contact distancing, the use of face masks, and handwashing habits can reduce the risk of SARS-CoV-2 transmission. In this context, we investigated the effect that these personal protection habits can have in preventing new waves of contagion. Methods. We extended an agent-based COVID-19 epidemic model in a simulated community to incorporate the mechanisms of these personal care habits and measure their influence on person-to-person transmission. A full factorial experiment design was performed to illustrate the extent to which the interplay between these personal habits is effective in mitigating the spread of disease. A global sensitivity analysis was performed on the parameters that control these habits to further validate the results. Results. We found that observing physical distance is the dominant habit in reducing disease transmission, although adopting either or both of the other two habits is necessary to some extent to suppress a new outbreak entirely. When physical distance is not observed, adherence to the use of masks or handwashing produces a significant decrease in infections and mortality, but the epidemic still unfolds. We also found that in all scenarios, the combined effect of adhering to the three habits is more powerful than adopting them separately. Conclusions. Our findings suggest that broad adherence of the population to voluntary self-care habits would help contain the unfolding of new outbreaks. The purpose of our model is illustrative, and it helps reaffirm the importance of urging citizens to adopt the combination of personal care habits as a primary collective protection measure to prevent communities from returning to confinement while immunisation is carried out in the late stages of the pandemic.
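As a rough illustration of the full factorial design mentioned in the abstract above, the following sketch enumerates every combination of the three habits. It assumes two levels per habit (adopted or not), which the abstract does not state, and the names `HABITS` and `full_factorial` are hypothetical; the agent-based simulation itself is not reproduced here.

```python
from itertools import product

# Assumed two-level factors; the study's actual factor levels are not stated here.
HABITS = ("distancing", "masks", "handwashing")


def full_factorial(factors, levels=(False, True)):
    """Return every combination of levels across the factors, one dict per scenario."""
    return [
        dict(zip(factors, combo))
        for combo in product(levels, repeat=len(factors))
    ]


scenarios = full_factorial(HABITS)
for scenario in scenarios:
    adopted = [habit for habit, on in scenario.items() if on] or ["none"]
    print(", ".join(adopted))
```

Each resulting scenario would then be passed to the epidemic simulation, typically with several random replicates per scenario, so that single and combined habit effects can be compared.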
Non-Pharmaceutical Interventions (NPIs) are currently the only mechanism governments can use to mitigate the impact of the COVID-19 epidemic. Like the actual spread of the disease, the dynamics of the containment patterns emerging from the application of NPIs are complex and depend on interactions between people within a specific region as well as on other stochastic factors associated with demographic, geographic, political and economic conditions. Agent-based models simulate microscopic rules of simultaneous spatial interactions between multiple agents within a population, in an attempt to reproduce the complex dynamics of the effect of the containment measures. In this way, it is possible to design individual behaviours along with NPI scenarios and measure how the simulation dynamics are affected, thereby yielding rapid insights for a broad assessment of the potential of composite interventions at different stages of the epidemic. In this paper we describe a model and a tool to experiment with this kind of analysis applied to a conceptual city, considering a number of widely applied NPIs such as social distancing, case isolation, home quarantine, total lockdown, sentinel testing, mask wearing and a distinctive "zonal" enforcement measure, which requires these interventions to be applied gradually to separate enclosed districts (zones). We find that the model is able to capture emerging dynamics associated with these NPIs; in addition, the zonal containment strategy improves the mitigation impact across all scenarios in which it is combined with individual NPIs. The model and tool are open to extensions to account for omitted or newer factors affecting the planning and design of NPIs intended to counter the late stages or forthcoming waves of the COVID-19 crisis.