Narrative works (e.g., novels and movies) consist of various utterances (e.g., scenes and episodes) with multi-layered structures. However, the existing studies aimed to embed only stories in a narrative work. By covering other granularity levels, we can easily compare narrative utterances that are coarser (e.g., movie series) or finer (e.g., scenes) than a narrative work. We apply the multi-layered structures on learning hierarchical representations of the narrative utterances. To represent coarser utterances, we consider adjacency and appearance of finer utterances in the coarser ones. For the movies, we suppose a four-layered structure (character roles ∈ characters ∈ scenes ∈ movies) and propose three learning methods bridging the layers: Char2Vec, Scene2Vec, and Hierarchical Story2Vec. Char2Vec represents a character by using dynamic changes in the character’s roles. To find the character roles, we use substructures of character networks (i.e., dynamic social networks of characters). A scene describes an event. Interactions between characters in the scene are designed to describe the event. Scene2Vec learns representations of a scene from interactions between characters in the scene. A story is a series of events. Meanings of the story are affected by order of the events as well as their content. Hierarchical Story2Vec uses sequential order of scenes to represent stories. The proposed model has been evaluated by estimating the similarity between narrative utterances in real movies.