Video annotation supplements a video with additional information about its context, nature, content, quality, and other aspects. These annotations are the basis for building multimedia applications for purposes ranging from entertainment to security. Manual annotation relies on the intelligence and labor of people and is an alternative in cases where automatic methods cannot be applied. However, manual video annotation can be costly because the workload grows with the amount of content to be annotated. Crowdsourcing appears as a viable solution in this context because it outsources the tasks to a multitude of workers, who perform specific parts of the work in a distributed way. However, as the complexity of the required media annotations increases, it becomes necessary to employ workers who are skilled or who are willing to perform larger, more complicated, and more time-consuming tasks. This makes crowdsourcing challenging to use, as experts demand higher pay and recruiting tends to be difficult. To overcome this problem, strategies have emerged that decompose the main problem into a set of simpler subtasks suitable for crowdsourcing. These smaller tasks are organized in a workflow so that the execution process can be formalized and controlled. In this sense, this thesis presents a new framework that allows the use of crowdsourcing to create applications that require complex video annotation tasks. The framework covers the whole process, from the definition of the problem and the decomposition of the tasks to the construction, execution, and management of the workflow. This framework, called CrowdWaterfall, builds on the strengths of current proposals and incorporates new concepts, techniques, and resources to overcome some of their limitations.