Modern distributed computing infrastructures need to process vast quantities of data streams produced by a growing number of participants in multiple formats. With the Internet of Multimedia Things (IoMT) becoming a reality, new approaches are needed to process real-time multimodal event data streams. Existing approaches to event processing give limited consideration to the challenges of multimodal events, including the need for complex content extraction and increased computational and memory costs. This paper explores event processing as a basis for processing real-time IoMT data and introduces the Multimodal Event Processing (MEP) paradigm, which provides a formal basis for natively combining neural multimodal content analysis (i.e., computer vision, linguistics, and audition) with symbolic event processing rules to support real-time queries over multimodal data streams. Queries are expressed in the Multimodal Event Processing Language, which supports single, primitive multimodal, and complex multimodal event patterns. The content of multimodal streams is represented using Multimodal Event Knowledge Graphs, which capture their semantic, spatial, and temporal content. The approach is implemented and evaluated within an MEP Engine using single and multimodal queries, achieving near real-time performance with a throughput of ~30 fps and sub-second latency of 0.075-0.30 seconds for video streams with a 30 fps input rate. Support for higher input stream rates (45 fps) is achieved through content-aware load shedding techniques, which yield a ~127X latency improvement with only a minor decrease in accuracy.
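To illustrate the general idea of matching a complex multimodal event pattern over annotated streams, the following is a minimal Python sketch. It is not the paper's MEP Language syntax or MEP Engine implementation: the pattern (a "person" seen in video and a "siren" heard in audio within one time window), the stream names, the confidence threshold used as a simplified stand-in for content-aware load shedding, and the mocked pre-extracted annotations are all illustrative assumptions.

```python
from dataclasses import dataclass
from collections import deque
from typing import Deque, List, Tuple

# One extracted annotation from a single modality (e.g., an object detected by a
# vision model or a keyword spotted by an audio model). In the paper's terms such
# annotations would populate a Multimodal Event Knowledge Graph; here they are
# plain records for illustration, with neural content extraction mocked.
@dataclass
class PrimitiveEvent:
    stream_id: str      # hypothetical stream name, e.g., "camera-1" or "mic-1"
    modality: str       # "vision" | "audio" | "text"
    label: str          # e.g., "person", "siren"
    confidence: float   # detector confidence in [0, 1]
    timestamp: float    # seconds since stream start


class ComplexPatternMatcher:
    """Matches a hypothetical complex multimodal pattern: a 'person' detected in
    video AND a 'siren' detected in audio within the same sliding time window."""

    def __init__(self, window_seconds: float = 2.0, min_confidence: float = 0.5):
        self.window_seconds = window_seconds
        self.min_confidence = min_confidence
        self.buffer: Deque[PrimitiveEvent] = deque()

    def push(self, event: PrimitiveEvent) -> List[Tuple[PrimitiveEvent, PrimitiveEvent]]:
        # Simplified content-aware shedding: drop low-confidence annotations before
        # they enter the window, bounding memory use and matching cost.
        if event.confidence < self.min_confidence:
            return []
        self.buffer.append(event)
        # Evict annotations that have fallen out of the sliding time window.
        horizon = event.timestamp - self.window_seconds
        while self.buffer and self.buffer[0].timestamp < horizon:
            self.buffer.popleft()
        # Join across modalities: every (vision 'person', audio 'siren') pair
        # co-resident in the window constitutes a complex multimodal event match.
        persons = [e for e in self.buffer if e.modality == "vision" and e.label == "person"]
        sirens = [e for e in self.buffer if e.modality == "audio" and e.label == "siren"]
        return [(p, s) for p in persons for s in sirens]


if __name__ == "__main__":
    matcher = ComplexPatternMatcher()
    stream = [
        PrimitiveEvent("camera-1", "vision", "person", 0.91, 10.0),
        PrimitiveEvent("mic-1", "audio", "siren", 0.40, 10.4),   # shed: low confidence
        PrimitiveEvent("mic-1", "audio", "siren", 0.88, 11.2),   # matches the person at t=10.0
    ]
    for ev in stream:
        for person, siren in matcher.push(ev):
            print(f"complex event: person@{person.timestamp}s + siren@{siren.timestamp}s")
```

In an actual MEP setting, the annotations would be produced online by neural content-extraction models and the pattern would be declared in the Multimodal Event Processing Language rather than hard-coded; the sketch only shows the windowed, cross-modal join that such a complex pattern implies.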