“…While local solutions often produce optimal results in optimization problems, structured prediction is a valid alternative for solving NLP tasks that require complex output, such as syntactic parsing (Roth and Yih, 2004), coreference resolution (Yu and Joachims, 2009; Fernandes et al., 2014), and clustering (Finley and Joachims, 2005; Haponchyk et al., 2018). Nonetheless, relatively few works extend structured prediction theory to deep learning (Durrett and Klein, 2015; Weiss et al., 2015; Kiperwasser and Goldberg, 2016; Peng et al., 2018; Milidiú and Rocha, 2018; Wang et al., 2019). For clustering in particular, designing a differentiable loss function that captures the global characteristics of a good clustering is hard; for this reason, when dealing with coreference resolution, a closely related task, Lee et al. (2017) use simple losses, which already perform well but do not strictly take the cluster structure into account.…”