Research on surveillance and threat detection is growing rapidly. Surveillance tasks often involve several actors with multiple interactions, which makes modeling a complex activity challenging. This work proposes an architecture composed of low, middle, and high levels. The low level recognizes the characteristics and positions of objects and the times of occurrences using a camera and Unmanned Aerial Vehicle (UAV) sensors. The middle level structures the information from the low level using Deterministic Finite Automata (DFA). An expert system in the high-level module performs inference over the organized information, giving the system simple reasoning modules that assist the operator's decisions. The architecture is embedded in a UAV to reduce the number of cameras needed and to reach areas that are difficult to access. The experiments showed that the proposed system effectively updated the grammatical structure given a sequence of information computed by the vision modules.
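
To make the middle level's role concrete, the following is a minimal sketch of how a DFA might structure a stream of low-level detections before a higher-level module reasons over it. The state names, event symbols, and transition table are hypothetical illustrations and are not taken from this work; the actual automata may differ.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical states and event symbols, chosen only for illustration.
START = "no_actor"
ACCEPTING = {"suspicious"}

# Transition table: (current state, event symbol) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("no_actor", "person_detected"): "actor_present",
    ("actor_present", "near_vehicle"): "actor_at_vehicle",
    ("actor_present", "person_lost"): "no_actor",
    ("actor_at_vehicle", "long_stay"): "suspicious",
    ("actor_at_vehicle", "person_lost"): "no_actor",
}


def run_dfa(events: Iterable[str]) -> List[str]:
    """Return the sequence of states visited for a stream of low-level events."""
    state = START
    trace = [state]
    for event in events:
        # Pairs missing from the table are treated as self-loops, so the
        # transition function stays total and deterministic.
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace


def is_flagged(events: Iterable[str]) -> bool:
    """True if the event stream ends in an accepting (flagged) state."""
    return run_dfa(events)[-1] in ACCEPTING
```

In this sketch, the final state or the full trace would be what a rule-based high-level module receives as organized input, for example `is_flagged(["person_detected", "near_vehicle", "long_stay"])`.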