How to Monitor Cats Without Losing Your Sanity

Technologies: ONVIF, OBS, Flask, WebSockets

The Context

This whole thing started because I just wanted a decent way to watch my cats.

I had four IP cameras scattered around the house — one in the living room, one in the bedroom, one watching the balcony, and a fourth that I moved around depending on where the chaos was. All of them claimed ONVIF support (whatever that means anymore), and each came with its own barely usable app. None of the apps played well with each other, and switching between them just to check if someone had knocked over a planter started to feel ridiculous.

I wasn’t trying to build NASA. I just wanted a clean 2x2 view of all four cameras at once, maybe highlight one when needed, and ideally get motion alerts that weren’t triggered by a passing moth. At first, I figured this would be a quick weekend project. Grab the RTSP streams from the cameras, plug them into FFmpeg, tile the feeds, throw it on a monitor. But yeah — that plan went downhill fast.


The Problem

Our multi-camera surveillance infrastructure was failing across every dimension:

  • ONVIF Implementation Hell: Each camera manufacturer implements ONVIF differently. Some support full imaging services, others provide basic PTZ only. Camera A works perfectly with preset management, Camera B freezes when you try to access imaging settings, and Camera C randomly disconnects from ONVIF services after 30 minutes. There's no standardization, no graceful degradation, just vendor-specific chaos.
  • Manual Stream Management: Getting RTSP feeds into OBS requires manual configuration for each camera. Want to change which camera is highlighted? Click through OBS scene items, manually resize transforms, and hope you don't accidentally break the entire layout. Add a new camera? Reconfigure everything from scratch.
  • Primitive Motion Detection: Traditional pixel-based motion detection is surveillance theater. It triggers on everything except what you care about. Cloud shadows moving across the floor? Motion detected. Cat actually knocking things over? Missed because it happened too slowly to trigger the threshold.
  • Zero Automation Integration: Camera control, streaming management, and motion detection exist as completely separate systems. When motion is detected on Camera 2, there's no automatic way to highlight that feed in OBS. No shared state, no event coordination, no intelligent automation between components.
  • Reliability Nightmare: System crashes required manual intervention. A single camera going offline could break the entire detection pipeline. OBS WebSocket disconnections meant losing all automated scene control. No health monitoring, no automatic recovery, no operational visibility.

The fundamental issue: home surveillance systems are designed as collections of independent devices rather than integrated platforms. When your security system requires a systems administrator, it's not actually making you more secure.


The Solution

Modular Flask Architecture: Service-Oriented Camera Control

Rather than building another monolithic surveillance application, create a service-oriented architecture where each component has a single, well-defined responsibility:

from flask import Flask

# OBSService, CameraService, DetectionService, Config and the route
# factories come from the project's own modules
def create_app():
    app = Flask(__name__)

    # Initialize core services with dependency injection
    obs_service = OBSService(Config.OBS_URL, Config.OBS_PASSWORD)
    camera_service = CameraService(Config.CAMERA_CONFIGS)
    detection_service = DetectionService(Config.MOTION_DETECTION_LOG)

    # Register route blueprints with injected dependencies
    app.register_blueprint(init_ptz_routes(camera_service))
    app.register_blueprint(init_obs_routes(obs_service))
    app.register_blueprint(init_detection_routes(detection_service))

    return app

Each service encapsulates its domain logic with clean REST APIs for coordination. The Flask application becomes an orchestration layer rather than a feature dumping ground. Camera failures don't cascade to OBS control. OBS disconnections don't break motion detection. Services can be developed, tested, and deployed independently.
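For a sense of what those blueprint factories look like, here's a minimal sketch of init_ptz_routes. The route shape and the goto_preset helper (which mirrors the fallback method sketched in the next section) are illustrative assumptions, not the project's exact API:

from flask import Blueprint, jsonify

def init_ptz_routes(camera_service):
    """Build a PTZ blueprint around an injected CameraService."""
    bp = Blueprint("ptz", __name__, url_prefix="/ptz")

    @bp.route("/<nickname>/preset/<int:preset>", methods=["POST"])
    def goto_preset(nickname, preset):
        camera = camera_service.cameras.get(nickname)
        if camera is None or not camera.is_online:
            return jsonify({"error": f"camera '{nickname}' unavailable"}), 404
        camera.goto_preset(preset)  # hypothetical camera method for illustration
        return jsonify({"status": "ok", "camera": nickname, "preset": preset})

    return bp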

Dual-Layer ONVIF Management: Graceful Degradation by Design

IP cameras are unreliable, and their ONVIF implementations are inconsistent. Handle this reality with a dual-initialization strategy:

import logging

from onvif import ONVIFCamera  # onvif-zeep package

class ONVIFCameraInstance:
    def setup_move(self, IP, PORT, USER, PASS):
        # Primary: use the optimized PTZ implementation
        # (FixedPtzCam is the project's wrapper for quirky PTZ firmware)
        try:
            self._ptz_cam = FixedPtzCam(ip_address=IP, port=PORT, user=USER, pword=PASS)
            self.is_online = True
        except Exception as e:
            logging.warning(f"FixedPtzCam failed for {IP}: {e}")
            self._ptz_cam = None
            self.is_online = False

        # Backup: fall back to basic ONVIF for presets
        try:
            self._onvif_camera = ONVIFCamera(IP, PORT, USER, PASS)
            self._onvif_ptz = self._onvif_camera.create_ptz_service()
            profiles = self._onvif_camera.create_media_service().GetProfiles()
            self._onvif_media_profile = profiles[0] if profiles else None
        except Exception as e:
            logging.warning(f"ONVIF backup failed for {IP}: {e}")
            self._onvif_ptz = None
            self._onvif_media_profile = None

The FixedPtzCam class handles cameras that don't support full imaging services—common with budget IP cameras that technically support ONVIF but fail on advanced features. When primary PTZ control fails, the system automatically falls back to basic ONVIF commands. Cameras with limited capabilities can still provide basic functionality rather than failing completely.
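As a concrete example of that fallback path, a preset command can try the primary implementation first and only drop to raw ONVIF if it fails. This is a sketch of a method on ONVIFCameraInstance, assuming FixedPtzCam exposes a goto_preset method; only the ONVIF GotoPreset request itself is standard:

def goto_preset(self, preset_token):
    """Try the primary PTZ implementation first, then the ONVIF backup."""
    if self._ptz_cam is not None:
        try:
            self._ptz_cam.goto_preset(preset_token)  # assumed FixedPtzCam method
            return True
        except Exception as e:
            logging.warning(f"Primary PTZ failed, trying ONVIF backup: {e}")

    if self._onvif_ptz is not None and self._onvif_media_profile is not None:
        # Standard ONVIF PTZ GotoPreset request (onvif-zeep call style)
        self._onvif_ptz.GotoPreset({
            "ProfileToken": self._onvif_media_profile.token,
            "PresetToken": str(preset_token),
        })
        return True

    return False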

Real-Time OBS WebSocket Automation: Programmatic Scene Control

OBS Studio's WebSocket API enables sophisticated automation that most people never explore. Instead of manually switching between camera feeds, implement intelligent layout management:

class OBSWebSocketClient:
    def update_obs_layout(self, scene_name="Mosaic", active_source=None):
        if active_source:
            # Highlight mode: featured camera with side thumbnails
            layout = self._calculate_highlight_layout(scene_name, active_source)
        else:
            # Grid mode: equal-sized 2x2 layout with detection strip space
            layout = self._calculate_grid_layout(scene_name)

        # Apply transforms atomically to prevent flickering
        for source_name, transform in layout.items():
            self._apply_scene_transform(source_name, transform)

The system can dynamically switch between grid layout (all cameras visible) and highlight mode (one camera prominent) based on motion detection events or manual control. When motion is detected on the balcony camera, automatically highlight that feed while keeping others visible as thumbnails. The layout calculations account for detection strip overlay space and maintain consistent aspect ratios.
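The grid calculation itself is mostly arithmetic. Here's a rough sketch: the canvas size, strip height, and the _scene_sources helper are assumptions for illustration, and the transform keys follow obs-websocket v5 naming:

def _calculate_grid_layout(self, scene_name, canvas_w=1920, canvas_h=1080, strip_h=160):
    """2x2 grid that reserves a bottom strip for the detection film strip."""
    grid_h = canvas_h - strip_h              # vertical space left for camera tiles
    tile_w, tile_h = canvas_w // 2, grid_h // 2

    layout = {}
    # _scene_sources is assumed to return the camera sources in the scene
    for index, source_name in enumerate(self._scene_sources(scene_name)):
        row, col = divmod(index, 2)
        layout[source_name] = {
            "positionX": col * tile_w,       # obs-websocket v5 transform fields
            "positionY": row * tile_h,
            "boundsWidth": tile_w,
            "boundsHeight": tile_h,
        }
    return layout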

YOLOv8-Powered Smart Detection: Object Recognition vs Motion Noise

Traditional motion detection triggers on everything—shadows, leaves, car headlights, changes in lighting. Using YOLOv8 for object detection means specifically looking for cats (or people, or specific objects) rather than generic pixel changes:

import os

import cv2
from ultralytics import YOLO

# Model path via environment variable so upgrades are a config change
# (variable name and default weights here are illustrative)
YOLO_MODEL = YOLO(os.environ.get("YOLO_MODEL_PATH", "yolov8n.pt"))

def detect_objects(frame, cat_class_id=15, conf=0.5, upscale_factor=1.5):
    """Cat detection with confidence thresholding and upscaling (COCO class 15 = cat)"""
    # First pass at native resolution
    results = YOLO_MODEL.predict(frame, conf=conf, classes=[cat_class_id])
    detections = []

    for r in results:
        for box in r.boxes:
            x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
            detections.append((int(x1), int(y1), int(x2), int(y2)))

    # Second pass with upscaling for small/distant objects
    if not detections and upscale_factor > 1.0:
        upscaled = cv2.resize(frame, None, fx=upscale_factor, fy=upscale_factor)
        results = YOLO_MODEL.predict(upscaled, conf=conf, classes=[cat_class_id])
        for r in results:
            for box in r.boxes:
                # Scale coordinates back to original frame size
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy() / upscale_factor
                detections.append((int(x1), int(y1), int(x2), int(y2)))

    return detections

The two-pass approach with upscaling dramatically improves detection accuracy for small objects (cats far from cameras). When detections occur, the system logs events to JSONL format with timestamps and automatically saves cropped detection frames for review.
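The JSONL logging side is only a few lines. A minimal sketch, with the field names as assumptions:

import json
import time
from pathlib import Path

def log_detection(log_path, camera, boxes):
    """Append one detection event per line (JSONL), easy to tail or replay."""
    event = {
        "ts": time.time(),
        "camera": camera,
        "boxes": boxes,          # list of (x1, y1, x2, y2) tuples
        "count": len(boxes),
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(event) + "\n")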

Intelligent Detection Film Strip: Visual Event Timeline

Rather than storing hundreds of individual detection images, create a dynamic film strip that provides visual context for recent events:

from datetime import datetime

import cv2

class FilmStrip:
    def maybe_save(self, frame, boxes):
        """Save largest detection in time-bucketed format"""
        if not boxes:
            return False

        # Save crop of largest detected object, with a small pixel margin
        h, w = frame.shape[:2]
        x1, y1, x2, y2 = max(boxes, key=lambda b: (b[2]-b[0])*(b[3]-b[1]))
        crop = frame[max(0, y1-20):min(h, y2+20), max(0, x1-20):min(w, x2+20)]

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        crop_path = self.cfg.crop_dir / f"filmstrip_{timestamp}.jpg"
        cv2.imwrite(str(crop_path), crop)

        # Re-render the composite strip; _purge_old_images enforces the
        # 30-second time buckets described below
        self._render_strip()
        self._purge_old_images()
        return True

The film strip automatically manages storage by keeping only the most recent detection in each 30-second time bucket, maintaining a visual timeline of recent activity while preventing storage bloat.
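The pruning step might look something like this, assuming filenames carry the timestamp format used above. _purge_old_images is the project's helper; this body is a sketch:

from datetime import datetime

def _purge_old_images(self, bucket_seconds=30):
    """Keep only the newest crop in each 30-second bucket; delete the rest."""
    newest_per_bucket = {}
    for path in sorted(self.cfg.crop_dir.glob("filmstrip_*.jpg")):
        ts = datetime.strptime(path.stem.removeprefix("filmstrip_"), "%Y%m%d_%H%M%S")
        bucket = int(ts.timestamp()) // bucket_seconds
        # A later file in the same bucket replaces the earlier one
        if bucket in newest_per_bucket:
            newest_per_bucket[bucket].unlink()
        newest_per_bucket[bucket] = path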

Microservice Architecture: MJPEG Streaming with Async Detection

The detection system runs as a separate microservice that processes OBS Virtual Camera output and streams the result as MJPEG with optional detection overlays:

def _ffmpeg_reader():
    """Background thread processing OBS Virtual Camera feed"""
    # streaming_active, ffmpeg_process, DETECTION_AVAILABLE and
    # detection_service are module-level state
    frame_data = b""
    while streaming_active and ffmpeg_process:
        chunk = ffmpeg_process.stdout.read(4096)
        if not chunk:
            break  # FFmpeg exited or the pipe closed
        frame_data += chunk

        # Extract complete JPEG frames (SOI 0xffd8 ... EOI 0xffd9)
        while True:
            start = frame_data.find(b'\xff\xd8')
            if start == -1:
                break
            end = frame_data.find(b'\xff\xd9', start)
            if end == -1:
                break

            frame = frame_data[start:end + 2]
            frame_data = frame_data[end + 2:]

            # Queue frame for async detection processing
            if DETECTION_AVAILABLE:
                detection_service.enqueue(frame)

This architecture separates concerns completely: the main Flask app handles camera control and OBS automation, while the microservice focuses purely on video processing and intelligent detection. FFmpeg handles the heavy lifting of video format conversion, while the detection service processes frames asynchronously without blocking the stream.
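Launching the FFmpeg side is a single subprocess. On Linux, OBS Virtual Camera typically shows up as a v4l2 device; the device path and quality flag in this sketch are assumptions, not the project's exact config:

import subprocess

# Read the OBS Virtual Camera and re-encode it as an MJPEG stream on stdout,
# which _ffmpeg_reader above then splits into individual JPEG frames
ffmpeg_process = subprocess.Popen(
    [
        "ffmpeg",
        "-f", "v4l2", "-i", "/dev/video0",   # virtual camera device (assumed path)
        "-f", "mjpeg", "-q:v", "5",          # MJPEG output, moderate quality
        "pipe:1",                            # write frames to stdout
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
)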

Production Health Monitoring: Observable Surveillance

Build operational visibility directly into the system with comprehensive health monitoring:

import time

from flask import jsonify

@app.route("/health")
def health_check():
    health_data = {
        "status": "healthy",
        "service": "camera-control",
        "timestamp": time.time(),
        "cameras": {
            # Per-camera online flags from the dual-layer ONVIF setup
            nickname: camera.is_online
            for nickname, camera in camera_service.cameras.items()
        },
        "obs_connected": obs_service.connected,
        "detection_available": detection_service.running
    }
    return jsonify(health_data)

Each microservice exposes health endpoints for monitoring. The system tracks camera connectivity, OBS WebSocket status, detection service health, and processing statistics. Docker health checks automatically restart failed services.


The Impact

  • Unified Camera Management: All cameras accessible through a single REST API with consistent behavior regardless of manufacturer quirks. PTZ control, preset management, and status monitoring work reliably across different camera brands and ONVIF implementation variations.
  • Intelligent Motion Detection: False positive rate dropped from ~80% to less than 5% by using object detection instead of frame differencing. The system now specifically tracks cats moving through the house rather than triggering on shadows, lighting changes, or car headlights through windows.
  • Automated Scene Control: When motion is detected on any camera, OBS automatically highlights that feed while keeping others visible as thumbnails. Scene transitions happen in under 200ms. No more manual scene switching or missing events because you were watching the wrong camera.
  • Production-Grade Reliability: The entire system runs in Docker containers with health checks and automatic restart capabilities. Individual component failures don't cascade. Camera disconnections are handled gracefully. OBS WebSocket failures don't break camera control.
  • Scalable Architecture: The microservice design means adding new capabilities doesn't require touching existing code. Want facial recognition? Create a new detection service. Need home automation integration? Add a microservice that consumes motion events via the API.

What makes this system particularly satisfying is how it handles the mundane reliability challenges that kill most smart home projects. When Camera 3 goes offline, the system continues working with the remaining cameras. When OBS loses connection, camera control still functions independently. When detection processing falls behind, frames are dropped intelligently rather than causing memory leaks.

The modular architecture means we can continuously improve individual components without system-wide disruption. Upgrade to a newer YOLO model? Just change the environment variable. Add support for a new camera brand? Implement a new ONVIF adapter. Need better motion sensitivity? Tune the detection parameters without touching the streaming pipeline.

Sometimes the best smart home solutions aren't about buying the most expensive equipment—they're about making cheaper hardware work together intelligently. When you can monitor your cats' midnight adventures with millisecond precision and automatic highlight reels, you know you've built something that actually enhances daily life rather than just being technically impressive.

Though we do occasionally get alerts that the cats are plotting something suspicious in the kitchen. That's exactly what we wanted.