Abstract

Infant health monitoring in current community infant-care guidance relies mainly on manual observation, which is inefficient and subjective, leaves data fragmented, and delays the identification of and warning about subtle abnormal conditions. To address these pain points, this work constructs an efficient, automated infant health monitoring and early-warning mechanism to raise the level of community infant-care services. A multimodal learning framework is adopted. First, a pre-trained VGG19 deep convolutional neural network serves as the core visual feature extractor for the real-time monitoring video stream, focusing on the infant's facial expressions, limb-movement patterns, and skin-color changes. Second, the infant's audio data and ambient temperature and humidity sensor data are collected and processed synchronously. Third, a feature fusion strategy combines the high-level visual features extracted by VGG19, the temporal acoustic features extracted from the audio, and the environmental sensor readings. Finally, a classifier trained on the fused multimodal features classifies the infant's health status in real time and triggers graded warnings. On a real community nursery-center dataset containing 500 hours of annotated video and audio, the system achieves an all-day health-status recognition accuracy of 94% and an all-day recall of 89.5% for the "urgent attention required" status, while advancing the average warning time for potentially dangerous conditions by 15 minutes. The multimodal learning method, built on the strong visual feature extraction capability of VGG19, effectively overcomes the limitations of any single modality and significantly improves the automation, precision, and real-time warning capability of infant health monitoring in community settings.
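
As a minimal illustration of the fusion architecture summarized above, the sketch below assumes a PyTorch/torchvision implementation; the feature dimensions, the GRU audio encoder, the sensor input size, and the number of health-status classes are illustrative assumptions rather than the values used in this work.

```python
# Sketch of the multimodal pipeline: VGG19 visual features + temporal audio
# features + environmental sensor readings, fused and classified jointly.
# All layer sizes and class counts are placeholder assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights


class MultimodalHealthClassifier(nn.Module):
    def __init__(self, audio_dim=128, sensor_dim=2, num_classes=4):
        super().__init__()
        # Pre-trained VGG19 backbone as the visual feature extractor;
        # the final classification layer is dropped, keeping 4096-d features.
        backbone = vgg19(weights=VGG19_Weights.IMAGENET1K_V1)
        backbone.classifier = nn.Sequential(*list(backbone.classifier.children())[:-1])
        self.visual = backbone
        # Simple temporal encoder for frame-level acoustic features (e.g. MFCCs).
        self.audio = nn.GRU(input_size=audio_dim, hidden_size=64, batch_first=True)
        # Late fusion: concatenate visual, audio, and sensor features,
        # then classify the infant's health status.
        self.head = nn.Sequential(
            nn.Linear(4096 + 64 + sensor_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frames, audio_feats, sensors):
        v = self.visual(frames)             # (B, 4096) high-level visual features
        _, h = self.audio(audio_feats)      # (1, B, 64) final GRU hidden state
        fused = torch.cat([v, h.squeeze(0), sensors], dim=1)
        return self.head(fused)             # logits over health-status classes
```

In this sketch the graded warnings described in the abstract would be derived downstream from the predicted class probabilities (e.g. thresholding the "urgent attention required" score), a design choice not specified here and left as an assumption.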