The study of crowd movement and behavioral patterns typically relies on spatio-temporal localization data of pedestrians. While monocular cameras can serve this purpose, industrial binocular cameras based on multi-view geometry offer higher spatial accuracy. Such cameras synchronize time through dedicated circuitry and are calibrated for external parameters once their relative positions are fixed. In contrast, assembling a short-baseline binocular camera from two different cameras or smartphones placed in close proximity offers flexibility and real-time adaptability, but raises challenges in camera time synchronization, external parameter calibration, and pedestrian feature matching. A method is introduced herein that addresses these challenges jointly. Images are abstracted into spatio-temporal point sets based on human head coordinates and frame numbers. Through point set registration, time synchronization and pedestrian matching are achieved concurrently, followed by calibration of the short-baseline camera's external parameters. Numerical results from synthetic and real-world scenarios demonstrate the proposed model's capability to address these fundamental challenges. Relying solely on crowd image data, without external hardware, software, or manual calibration, the method achieves sub-millisecond time synchronization, an average pedestrian matching accuracy of 92%, and external parameters whose precision matches that of calibration-board methods. Ultimately, this research enables self-calibration, automatic time synchronization, and pedestrian matching for short-baseline camera assemblies observing crowds.
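To make the registration idea concrete, the following minimal Python sketch illustrates how a time offset between two unsynchronized cameras might be estimated jointly with pedestrian correspondences by registering spatio-temporal point sets of head detections. This is not the paper's implementation: the detection format, the scale factor used to blend temporal and spatial distances, and the coarse-to-fine grid search are all illustrative assumptions.

```python
"""Illustrative sketch: joint time-offset estimation and pedestrian matching
via spatio-temporal point set registration (assumptions, not the paper's code).

Each detection is assumed to be a (t, x, y) triple: a timestamp in seconds and
the pixel coordinates of a detected head. The short baseline is assumed to make
corresponding heads appear at roughly similar image positions, so nearest-
neighbour matching in a combined (time, x, y) space is meaningful.
"""
import numpy as np
from scipy.spatial import cKDTree


def registration_cost(pts_a, pts_b, dt, time_scale=100.0):
    """Mean nearest-neighbour distance after shifting camera B's clock by dt.

    time_scale converts seconds into a pixel-comparable unit so temporal and
    spatial distances can live in one 3-D space (a tuning assumption).
    Returns the mean distance and, for each point of A, its nearest B index.
    """
    shifted = pts_b.copy()
    shifted[:, 0] += dt
    a = np.column_stack([pts_a[:, 0] * time_scale, pts_a[:, 1:]])
    b = np.column_stack([shifted[:, 0] * time_scale, shifted[:, 1:]])
    dists, idx = cKDTree(b).query(a)
    return dists.mean(), idx


def synchronise_and_match(pts_a, pts_b, search=(-2.0, 2.0),
                          coarse=0.05, fine=0.001):
    """Coarse-to-fine 1-D search over the unknown time offset dt (seconds)."""
    # Coarse pass over the whole search window.
    grid = np.arange(search[0], search[1], coarse)
    costs = [registration_cost(pts_a, pts_b, dt)[0] for dt in grid]
    best = grid[int(np.argmin(costs))]
    # Fine pass around the coarse optimum, at roughly millisecond resolution.
    grid = np.arange(best - coarse, best + coarse, fine)
    costs = [registration_cost(pts_a, pts_b, dt)[0] for dt in grid]
    best = grid[int(np.argmin(costs))]
    _, matches = registration_cost(pts_a, pts_b, best)
    return best, matches  # estimated offset and per-detection correspondences
```

In the setting described above, the correspondences recovered at the best offset would then feed a standard external parameter estimation step (for example, fundamental matrix estimation from the matched head coordinates), which the sketch omits.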