Accelerating Substructure Similarity Search for Formula Retrieval

Zhong, Wei; Rohatgi, Shaurya; Wu, Jian; Giles, C. Lee; Zanibbi, Richard

doi:10.1007/978-3-030-45439-5_47

Cited by 21 publications

(12 citation statements)

References 21 publications

“…The type must be specified by the user as there may be ambiguities, for example, matrix multiplication can also be the acronym for Artificial Intelligence. The searcher runs different query types on different indexes and uses a dynamic pruning algorithm [11] to generate structure-aware results efficiently.…”

Section: Searcher

mentioning

confidence: 99%

“…Recent tasks have shown that the top effective formula retrieval systems all take advantage of indexing tokens from structured tree representations [7,9]. Currently, Approach Zero indexes prefix leaf-root paths from formula OPT representations, where each unique path corresponds to an inverted list, similar to regular search engines [11]. More specifically, a LaTeX markup is converted to OPT and then the paths from the leaf to the internal nodes are extracted, for example, x + y = 1 will break down into five prefix paths: x/+/=, x/+, y/+/=, y/+ and 1/= (single token paths will not be generated, and we use "/" to visually separate individual path tokens).…”

Section: Indexer

mentioning

confidence: 99%
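
The path extraction described in the Indexer statement above can be illustrated with a minimal sketch. This is not Approach Zero's actual implementation: the `Node` class and the `leaf_root_prefix_paths` and `build_inverted_lists` functions are hypothetical names introduced here, and the operator tree (OPT) is built by hand rather than parsed from LaTeX. The sketch only assumes what the statement says: paths run from each leaf toward the root, every prefix of at least two tokens is kept, and each unique path keys one inverted list.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical operator-tree (OPT) node: a token plus ordered children."""
    token: str
    children: List["Node"] = field(default_factory=list)


def leaf_root_prefix_paths(root: Node) -> List[str]:
    """Return every leaf-to-ancestor path with at least two tokens."""
    paths: List[str] = []

    def walk(node: Node, ancestors: List[str]) -> None:
        # `ancestors` holds tokens from the parent up to the root.
        if not node.children:
            full = [node.token] + ancestors
            # Skip single-token paths: prefixes start at length 2.
            for length in range(2, len(full) + 1):
                paths.append("/".join(full[:length]))
        for child in node.children:
            walk(child, [node.token] + ancestors)

    walk(root, [])
    return paths


def build_inverted_lists(formulas: Dict[int, Node]) -> Dict[str, List[int]]:
    """Map each unique path to the sorted IDs of formulas containing it."""
    index = defaultdict(set)
    for formula_id, tree in formulas.items():
        for path in leaf_root_prefix_paths(tree):
            index[path].add(formula_id)
    return {path: sorted(ids) for path, ids in index.items()}


# Hand-built OPT for x + y = 1: "=" is the root, "+" and "1" are its children.
example = Node("=", [Node("+", [Node("x"), Node("y")]), Node("1")])
print(leaf_root_prefix_paths(example))
```

For the hand-built tree this yields the same five paths listed in the statement (x/+, x/+/=, y/+, y/+/=, 1/=), although in a different order. A real indexer would also need to handle commutative operators, wildcards, and token normalization, none of which are modeled here.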

“…And over the years, evaluations have shown that search effectiveness has steadily improved. Search engines such as Approach Zero [11,12] have achieved sub-second single-thread query latencies while being able to handle substructure matching in a semantics-aware manner. However, advancements in MIR have not been broadly disseminated in the general IR community, mostly due to the extra effort required to parse math markups in documents and the need for special tooling to handle structured content.…”

mentioning

confidence: 99%

PyA0: A Python Toolkit for Accessible Math-Aware Search

Zhong, Lin. 2021

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Self Cite

Mathematical Information Retrieval (MIR) has been actively studied in recent years and many fruitful results have emerged. Among those, the Approach Zero system is one of the few math-aware search engines that is able to perform substructure matching efficiently. Furthermore, it has been deployed in ARQMath 2020, the most recent community-wide MIR evaluation, as a strong baseline due to its empirical effectiveness and ability to handle structured math content. However, in order to implement a retrieval model that handles structured queries efficiently, Approach Zero is written in C from the ground up, requiring special pipelines for processing math content and queries. Thus, the system is not conveniently accessible and reusable to the community as a research tool. In this paper, we present PyA0, an easy-to-use Python toolkit built on Approach Zero that improves its accessibility to researchers. We introduce the toolkit interface and report evaluation results on popular MIR datasets to demonstrate the effectiveness and efficiency of our toolkit. We have made PyA0 source code publicly accessible at https://github.com/approach0/pya0, which includes a link to a notebook demo.

CCS Concepts: Information systems → Mathematics retrieval.

“…However, the traditional full-text retrieval model for one-dimensional is not effective when facing the special two-dimensional pattern retrieval of mathematical expressions. At present, research studies on mathematical expression retrieval and ranking have been carried out with some progress, and methods and prototype systems [1][2][3][4][5][6] with mathematical retrieval functions have been proposed.…”

Section: Introduction

mentioning

confidence: 99%

A Multimodal Retrieval and Ranking Method for Scientific Documents Based on HFS and XLNet

Yan, Shi, et al. 2022

Scientific Programming

To address the shortcomings of traditional full-text retrieval models in dealing with mathematical expressions, which are special objects different from ordinary text, a multimodal retrieval and ranking method for scientific documents based on hesitant fuzzy sets (HFS) and XLNet is proposed. This method integrates multimodal information, such as mathematical expression images and context text, as keywords to realize the retrieval of scientific documents. In the image modality, the images of mathematical expressions are recognized, and hesitant fuzzy set theory is introduced to calculate the hesitant fuzzy similarity between mathematical query expressions and the mathematical expressions in candidate scientific documents. Meanwhile, in the text modality, XLNet is used to generate word vectors of the mathematical expression context to obtain the similarity between the query text and the mathematical expression context of the candidate scientific documents. Finally, the multimodal evaluation is integrated, and a hesitant fuzzy set is constructed at the document level to obtain the final scores of the scientific documents and the corresponding ranked output. The experimental results show that the recall and precision of this method are 0.774 and 0.663 on the NTCIR dataset, respectively, and the average normalized discounted cumulative gain (NDCG) value of the top-10 ranking results is 0.880 on the Chinese scientific document (CSD) dataset.

“…But it is very difficult to retrieve scientific documents with mathematical expressions because mathematical expressions are characterized by a complex two-dimensional structure. To date, research on mathematical expression retrieval has achieved abundant results [1][2][3][4][5][6][7].…”

Section: Introduction

mentioning

confidence: 99%