We are living in a world of big sensor data. Due to the widespread deployment of visual sensors (e.g., surveillance cameras) and social sensors (e.g., Twitter feeds), many real-world events are implicitly captured in real time by such heterogeneous "sensors". Combining these two complementary sensor streams can significantly improve event detection and aid in understanding evolving situations. However, the differing characteristics of social and sensor data make such information fusion for event detection a challenging problem. To tackle this problem, we propose an innovative multi-layer tweeting cameras framework that integrates physical sensors and social sensors to detect various concepts of real-world events. In this framework, visual concept detectors are applied to camera video frames, and the detected concepts can be construed as "camera tweets" posted regularly. These tweets are represented in a unified probabilistic spatio-temporal (PST) data structure, which is then aggregated into a concept-based image (Cmage) as the common representation for visualization. To facilitate event analysis, we define a set of operators and analytic functions that users can apply to the PST data to discover occurrences of events and to analyze evolving situations. We further leverage geo-located social media data by mining current topics discussed on Twitter to obtain the high-level semantic meaning of events detected in images. We quantitatively evaluate our framework on a large-scale dataset containing images from 150 real-time traffic CCTV cameras in New York, university food-court camera feeds, and Twitter data; the results demonstrate the feasibility and effectiveness of the proposed framework. Combining camera tweets and social tweets is shown to be promising for detecting real-world events.
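To make the pipeline concrete, the following minimal sketch illustrates one possible realization of a "camera tweet" as a probabilistic spatio-temporal (PST) record and its aggregation over a time window into a per-concept Cmage grid. The class `CameraTweet`, the function `aggregate_cmage`, the grid resolution, and the max-pooling aggregation are illustrative assumptions for exposition, not the paper's actual data structures or API.

```python
# Illustrative sketch (assumed, not the paper's implementation): each "camera tweet"
# is a PST record -- a concept detected in a video frame, with a detector confidence,
# the camera's geo-location, and a timestamp. Tweets falling in a time window are
# aggregated over a spatial grid into a Cmage: one matrix of detection probabilities
# per concept, here using max-pooling per grid cell.
from collections import defaultdict
from dataclasses import dataclass

import numpy as np


@dataclass
class CameraTweet:
    concept: str        # e.g. "crowd", "traffic_jam"
    probability: float  # detector confidence in [0, 1]
    lat: float          # camera latitude
    lon: float          # camera longitude
    timestamp: float    # UNIX time of the frame


def aggregate_cmage(tweets, t_start, t_end, bbox, grid=(32, 32)):
    """Aggregate PST records within [t_start, t_end] into per-concept Cmage grids."""
    lat_min, lat_max, lon_min, lon_max = bbox
    rows, cols = grid
    cmages = defaultdict(lambda: np.zeros(grid))
    for tw in tweets:
        if not (t_start <= tw.timestamp <= t_end):
            continue
        # Map the camera's location to a grid cell within the bounding box.
        r = min(int((tw.lat - lat_min) / (lat_max - lat_min) * rows), rows - 1)
        c = min(int((tw.lon - lon_min) / (lon_max - lon_min) * cols), cols - 1)
        # Keep the strongest detection per cell over the window (max-pooling).
        cmages[tw.concept][r, c] = max(cmages[tw.concept][r, c], tw.probability)
    return dict(cmages)
```

Under this sketch, analytic operators over the PST data (e.g., thresholding or differencing two Cmages from consecutive windows) would reduce to element-wise operations on these per-concept matrices.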