Classifying XML tags through "reading contexts"

Tannier, Xavier; Girardot, Jean-Jacques; Mathieu, Mihaela

doi:10.1145/1096601.1096638

Cited by 6 publications

(10 citation statements)

References 1 publication


“…From a more syntactical point of view, Tannier et al. [38] associate each (XML) element in a document with one of three different categories: hard elements - elements that are commonly used to structure the document content in different blocks and usually interrupt the linearity of a text, such as paragraphs and sections; soft elements - elements that identify significant text fragments and are transparent while reading the text, such as emphasis and links; and jump elements - elements that are logically detached from the surrounding text, and that give access to related information, such as footnotes and comments.…”

Section: Existing Models Describing Document Components (mentioning)

confidence: 99%
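The three categories in the statement above can be illustrated with a small sketch. This is not the classification algorithm of Tannier et al., which the statement does not describe; it only shows what the categories mean for reading order once each tag already has a category. The `TagCategory` enum, the `reading_blocks` function, the hand-written `EXAMPLE_CATEGORIES` mapping, and the choice to treat unknown tags as soft are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class TagCategory(Enum):
    HARD = "hard"  # interrupts the linearity of the text (paragraphs, sections)
    SOFT = "soft"  # transparent while reading (emphasis, links)
    JUMP = "jump"  # detached from the surrounding text (footnotes, comments)


# Hypothetical hand-written mapping; a real system would have to infer it.
EXAMPLE_CATEGORIES = {
    "section": TagCategory.HARD,
    "para": TagCategory.HARD,
    "emphasis": TagCategory.SOFT,
    "link": TagCategory.SOFT,
    "footnote": TagCategory.JUMP,
}


def reading_blocks(root, categories):
    """Split a document into linear text blocks plus detached jump texts."""
    blocks, detached, buffer = [], [], []

    def flush():
        # Close the current block, normalising whitespace.
        text = " ".join("".join(buffer).split())
        if text:
            blocks.append(text)
        buffer.clear()

    def walk(elem):
        # Assumption: tags missing from the mapping are treated as soft.
        category = categories.get(elem.tag, TagCategory.SOFT)
        if category is TagCategory.JUMP:
            # Jump content is kept out of the main reading flow.
            detached.append(" ".join("".join(elem.itertext()).split()))
        else:
            if category is TagCategory.HARD:
                flush()  # a hard element starts a new block
            buffer.append(elem.text or "")
            for child in elem:
                walk(child)
            if category is TagCategory.HARD:
                flush()  # and ends it
        # Text after the closing tag belongs to the parent's flow.
        buffer.append(elem.tail or "")

    walk(root)
    flush()
    return blocks, detached


sample = ET.fromstring(
    "<section><para>Read <emphasis>this</emphasis> text"
    "<footnote>A note.</footnote> now.</para><para>Second.</para></section>"
)
blocks, detached = reading_blocks(sample, EXAMPLE_CATEGORIES)
print(blocks)
print(detached)
```

In this sketch the soft `emphasis` element leaves the sentence intact, the `footnote` text is set aside, and each `para` yields its own block.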

“…We performed a preliminary test (fully described in [15]) on a dataset consisting of 117 scientific papers encoded in DocBook and published between 2008 and 2011 in the Balisage Series Conferences (footnote 38). The documents vary a lot in their internal structure and size: from 3 Kbytes to 160 Kbytes, with an average size of about 60 Kbytes.…”

Section: Retrieving Structures From Xml Sources (mentioning)

confidence: 99%

“…The overall results of this test were encouraging, since the… [Footnote 37: The algorithm (fully introduced in [15]) is neither an intelligent nor an adaptive algorithm, but rather a prescriptive one that uses the logical characterisations of DoCO components as a basis to identify them in documents through an iterative process. Footnote 38: Balisage Conference Series: http://www.balisage.net; all the data gathered during the test are available at http://www.essepuntato.it/2013/doco/test.]…”

Section: Retrieving Structures From Xml Sources (mentioning)

confidence: 99%


The Document Components Ontology (DoCO)

Constantin, Peroni, Pettifer, et al. 2016

Abstract. The availability in machine-readable form of descriptions of the structure of documents, as well as of the document discourse (e.g. the scientific discourse within scholarly articles), is crucial for facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements to describe both structural and rhetorical document components in RDF. In addition to giving a formal description of the ontology, this paper showcases its utility in practice in a variety of our own applications and other activities of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.


“…Some literature has recently come out about the characterization and identification of structural patterns of text documents. For instance, Tannier, Girardot, and Mathieu (), starting from previous works by Lini, Lombardini, Paoli, Colazzo, and Sartiani () and Colazzo et al. (), describe an algorithm to assign each XML element in a document to one of three different categories: hard tag, soft tag, and jump tag.…”

Section: Structural Patterns (mentioning)

confidence: 99%

“…Tannier et al. () also introduce algorithms to assign XML elements to these categories by means of natural language processing (NLP) tools. This classification is rather interesting, in that it provides a justification for the identification of the classes, but it is a little coarse for our purposes, ignoring empty elements and failing to distinguish higher level and lower level hard tags (i.e., those containing other tags but not text from those that never contain text).…”

Section: Structural Patterns (mentioning)

confidence: 99%

Dealing with structural patterns of XML documents

Iorio, Peroni, Poggi, et al. 2014

Asso for Info Science & Tech

Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights into the expected characteristics of a markup language, as well as any regularity that may span vocabularies and languages, and that are more fundamental and frequent than plain content models. In this paper we explore the idea of structural patterns in XML vocabularies, by examining the characteristics of elements as they are used, rather than as they are defined. We introduce from the ground up a formal theory of 8 plus 3 structural patterns for XML elements, and verify their identifiability in a number of different XML vocabularies. The results allowed the creation of visualization and content extraction tools that are completely independent of the schema and without any previous knowledge of the semantics and organization of the XML vocabulary of the documents.


SIRIUS XML IR System at INEX 2006: Approximate Matching of Structure and Textual Content

Popovici, Ménier, Marteau

Lecture Notes in Computer Science

In this paper we report on the retrieval approach taken by the VALORIA laboratory of the University of South-Brittany while participating in the INEX 2006 ad-hoc track with the SIRIUS XML IR system. SIRIUS retrieves relevant XML elements by approximately matching both the content and the structure of the XML documents. A weighted edit distance on XML paths is used to approximately match the document structure, while the IDF of the searched terms is used to rank the textual content of the retrieved elements. We briefly describe the approach and the extensions made to the SIRIUS XML IR system to address each of the four subtasks of the INEX 2006 ad-hoc track. Finally, we present and analyze the SIRIUS retrieval evaluation results. SIRIUS runs were ranked in 1st position out of 77 submitted runs for the Best In Context task and obtained several top-ten results for both the Focused and All In Context tasks.
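The idea of a weighted edit distance on XML paths can be sketched as follows. This is a generic weighted Levenshtein distance over path steps, not the SIRIUS implementation: the abstract gives neither the cost scheme nor the path representation, so the function name `path_edit_distance`, the slash-separated path strings, and the three cost values are assumptions chosen for illustration.

```python
def path_edit_distance(query, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.5):
    """Weighted edit distance between two slash-separated XML paths.

    Each path step (tag name) is treated as one symbol. The costs are
    illustrative assumptions, not values taken from SIRIUS.
    """
    q = query.strip("/").split("/")
    t = target.strip("/").split("/")

    # dp[i][j] = cheapest way to turn the first i steps of q into the first j of t
    dp = [[0.0] * (len(t) + 1) for _ in range(len(q) + 1)]
    for i in range(1, len(q) + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, len(t) + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost

    for i in range(1, len(q) + 1):
        for j in range(1, len(t) + 1):
            match_cost = 0.0 if q[i - 1] == t[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j] + del_cost,        # drop a step from the query path
                dp[i][j - 1] + ins_cost,        # add a step found only in the target
                dp[i - 1][j - 1] + match_cost,  # keep or substitute a step
            )
    return dp[-1][-1]


# A query path that omits an intermediate element still matches closely.
print(path_edit_distance("/article/p", "/article/sec/p"))
```

A lower distance means the element's path is structurally closer to the path requested in the query, so such a score could be combined with a content score when ranking elements.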
