Classical models of collective behavior often take a “bird’s-eye perspective,” assuming that individuals have access to social information that is not directly available to them (e.g., the behavior of individuals outside of their field of view). Despite the explanatory success of those models, it is now thought that a better understanding of collective behavior must incorporate the perception of the individual, i.e., how internal and external information is acquired and processed. In particular, vision has emerged as a central channel for gathering external information and for shaping the collective organization of the group. Here, we show that a vision-based model is sufficient to generate organized collective behavior in the absence of spatial representation and collision. Our work suggests a different approach for the development of purely vision-based autonomous swarm robotic systems and formulates a mathematical framework for exploring perception-based interactions and how they differ from physical ones.