“…As defined in [102], each terminal element $E_i$ is identified by a set of meaningful features: $E_i^{mod}$ is the modality (e.g., speech, facial expression, gesture) used to create $E_i$; $E_i^{repr}$ indicates how $E_i$ is represented by the modality; $E_i^{time}$ is the time interval (given by the start and end times) over which $E_i$ was created; $E_i^{role}$ is the syntactic role that $E_i$ plays in the multimodal sentence, according to the Penn Treebank tag set [103] (e.g., noun, verb, adjective, adverb, pronoun, preposition); and $E_i^{concept}$ gives the semantic meaning of the element with respect to the conceptual structure of the context [104]. Given two elements $E_i$ and $E_j$, where $E_i$ has a close-by relationship with $E_j$ [7], $E_i^{coop}$ is set to the same value as that of $E_j$ and specifies the type of cooperation [7] between $E_i$ and $E_j$.…”
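The feature set described in the excerpt can be pictured as a small record type. The following is only a minimal sketch under stated assumptions, not the representation used in [102]: the class name `TerminalElement`, the field names, the helper `share_cooperation`, the use of seconds for time, and the sample value "complementarity" are all hypothetical choices made for illustration. How the close-by relationship itself is detected is not covered by the excerpt, so the sketch takes it as already established.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TerminalElement:
    """One terminal element of a multimodal sentence (illustrative field names)."""
    modality: str                # mod: e.g. "speech", "gesture"
    representation: str          # repr: how the modality represents the element
    time: Tuple[float, float]    # time: (start, end); seconds are an assumption
    role: str                    # role: Penn Treebank tag, e.g. "DT", "NN"
    concept: str                 # concept: semantic meaning in context
    coop: Optional[str] = None   # coop: cooperation type, unset until linked

    def duration(self) -> float:
        """Length of the interval over which the element was created."""
        start, end = self.time
        return end - start


def share_cooperation(e_i: TerminalElement, e_j: TerminalElement) -> None:
    """Give e_i the same cooperation value as e_j.

    Assumes the caller has already decided that e_i is close-by to e_j.
    """
    e_i.coop = e_j.coop


# Hypothetical example: the spoken word "this" accompanied by a pointing gesture.
speech = TerminalElement("speech", "this", (1.0, 1.5), "DT", "object-reference")
gesture = TerminalElement("gesture", "pointing", (1.25, 2.0), "NN", "lamp",
                          coop="complementarity")
share_cooperation(speech, gesture)
```

After the call, `speech.coop` holds the same value as `gesture.coop`, mirroring the rule that the cooperation feature of one element is copied from the element it is close to.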