The field of genomics has seen substantial advancements through the application of artificial intelligence (AI), with machine learning revealing the potential to interpret genomic sequences without necessitating an exhaustive experimental analysis of all the intricate and interconnected molecular processes involved in DNA functioning. However, precise decoding of genomic sequences demands the comprehension of rich contextual information spread over thousands of nucleotides. Presently, only a few architectures exist that can process such extensive inputs, and they require exceptional computational resources. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 thousands base pairs. We offer pre-trained versions of GENA-LM and demonstrate their capacity for fine-tuning to address complex biological questions with modest computational requirements. We also illustrate diverse applications of GENA-LM for various downstream genomic tasks, showcasing its performance in either matching or exceeding that of prior models, whether task-specific or universal. All models are publicly accessible on GitHub https://github.com/AIRI-Institute/GENA_LM and as pre-trained models with gena-lm- prefix on HuggingFace https://huggingface.co/AIRI-Institute .
Online advertising is one of the most widespread ways to reach and increase a target audience for those selling products. Usually having a form of a banner, advertising engages users into visiting a corresponding webpage. Professional generation of banners requires creative and writing skills and a basic understanding of target products. The great variety of goods presented in the online market enforce professionals to spend more and more time creating new advertisements different from existing ones. In this paper, we propose a neural network-based approach for the automatic generation of online advertising using texts from given webpages as sources. The important part of the approach is training on open data available online, which allows avoiding costly procedures of manual labeling. Collected open data consist of multiple subdomains with high data heterogeneity. The subdomains belong to different topics and vary in used vocabularies, phrases, styles that lead to reduced quality in adverts generation. We try to solve the problem of identifying existed subdomains and proposing a new ensemble approach based on exploiting multiple instances of a seq2seq model. Our experimental study on a dataset in the Russian language shows that our approach can significantly improve the quality of adverts generation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.