Wallspace Captions: Real-Time Visual Systems for Deaf Audiences



Technical Papers

📄 Wallspace Captions - full technical paper

Related work:

📄 Sign Language Conductor - gesture-driven visual control system

📄 Accessibility Tooling - transcription, OCR, and workflow systems

0. Introduction

Access to spoken word and music in live environments remains fundamentally limited for deaf audiences. Traditional solutions - such as human interpreters or basic captioning - are often constrained by availability, latency, lack of synchronisation with music, cost, and poor integration into visual experiences.

In particular:

·         Live music is largely inaccessible beyond low-frequency vibration

·         Lyrics are rarely available in real time, especially in festival or club settings

·         Existing captioning tools are not designed for performance environments (large screens, dynamic visuals, multi-source audio)

Within this system, Wallspace Captions operates as a core visual layer, treating captions not as an accessibility afterthought - as is often the case in the author’s lived experience as a profoundly deaf person - but as a primary, performance-driven visual element.

The system enables:

·         Real-time speech transcription

·         Automatic song recognition and synchronised lyric display

·         Visually integrated captions designed for large-scale projection and LED environments

Unlike conventional captioning systems, this approach is:

·         Engine-agnostic (multiple caption sources: Whisper, browser speech, music ID + lyrics)

·         Latency-aware and adjustable in real time

·         Designed for integration with VJ pipelines and visual systems

There are currently no widely adopted systems that combine real-time audio analysis, caption/lyric generation, synchronised visual rendering, and live performance integration within a single pipeline.

This project explores that full stack, with the goal of making live audiovisual experiences fully perceivable and engaging for deaf audiences, not just understandable. This work positions captions not as an auxiliary layer, but as a primary medium for visual expression and control.

This system has been published as a Scope node for public use, testing, and further development.

 https://app.daydream.live/nodes/AEYESTUDIOS/wallspace-captions

Audio Input → Caption Engine → Text Stream → Event System → Visual Behaviour → Scope Rendering → Visual Output

Figure 1 – Wallspace Captions System Flow

1. Authorship & Contributions

Wallspace Captions was co-developed by this document’s author, Matthew Israelsohn, in collaboration with Jack Morgan of Wallspace.Studio; both contributors worked on system design, implementation, and iteration of the caption-driven visual pipeline.

The sign language conducting system, audio reactivity research, various technology comparison papers and accessibility tools were developed independently by the author.

2. Features

·         4 text input methods: Scope prompt field, manual text, OSC (UDP), WebSocket

·         Pre + Post pipelines: Pre bakes text into frames so the AI stylises it; Post overlays clean captions after AI generation

·         Advanced caption placement: XY coordinate positioning (percentage-based), preset positions (top/center/bottom), text alignment

·         Full styling control: Font size, colour (RGB), opacity, text outline with colour/width, background box with colour/opacity/padding/corner radius

·         Caption Event System: Parses text into structured events (WORD, SENTENCE_START/END, QUESTION, EXCLAMATION, PAUSE, EMPHASIS, SPEAKER_CHANGE) that drive visual behaviours

·         Event-reactive effects: Per-word flash, punctuation colour reactions, pause fade, emphasis highlighting

·         Prompt forwarding: Transcription text forwarded as prompts with style prefix, template formatting, and rate limiting
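The prompt-forwarding feature above can be sketched as a small throttled forwarder. This is a minimal illustration, not the node's actual implementation: the class name, the template format, and the default interval are all assumptions.

```python
import time

class PromptForwarder:
    """Illustrative sketch of prompt forwarding: transcription text is
    combined with a style prefix and forwarded, rate-limited so the
    downstream AI generator is not flooded. Names are assumptions."""

    def __init__(self, style_prefix="cinematic neon typography", min_interval=2.0):
        self.style_prefix = style_prefix
        self.min_interval = min_interval   # minimum seconds between forwarded prompts
        self._last_sent = float("-inf")
        self.sent = []                     # stand-in for the real prompt sink

    def forward(self, text, now=None):
        now = time.monotonic() if now is None else now
        if now - self._last_sent < self.min_interval:
            return False                   # rate-limited: a fuller version might queue instead
        self.sent.append(f"{self.style_prefix}, {text.strip()}")
        self._last_sent = now
        return True

fwd = PromptForwarder(min_interval=2.0)
fwd.forward("hello world", now=0.0)   # forwarded
fwd.forward("too soon", now=1.0)      # dropped by the rate limit
fwd.forward("next line", now=2.5)     # forwarded
```

A real deployment would replace the `sent` list with whatever sink carries prompts into the generation pipeline.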

System Context: Multi-Input Visual Pipeline

This project sits within a broader system exploring multiple input modalities for controlling visual systems in real time.

Three parallel input streams are being developed:

Audio (speech/music) / Gesture / Audio Features → Structured Control Signals (text events, OSC data, audio features) → Scope → Resolume → Visual Output

Figure 2 - Multi-Input Visual System Context

This diagram shows how multiple input modalities - audio (speech/music), gesture (sign language conducting), and audio feature analysis - are processed in parallel and unified into a shared visual pipeline. All inputs are converted into structured control signals (text events, audio features, OSC data) which are combined within Scope and rendered through Resolume as a cohesive real-time visual output.

1.       Audio → transcription / lyrics → caption events

2.       Gesture → motion tracking → OSC control signals

3.       Audio features → spectral / temporal analysis → modulation signals

These are unified through a shared routing and rendering pipeline (OSC → Scope → Resolume), enabling multiple forms of expression to drive visuals simultaneously.
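One way to picture the unification step is a normaliser that maps each of the three input streams onto OSC-style (address, value) messages before routing. This is a sketch under assumptions: the address namespace and signal field names are illustrative, not Scope's real interface.

```python
def to_osc_messages(signal):
    """Hypothetical normalisation: map each input modality onto
    OSC-style (address, value) pairs so one router can drive the
    shared visual pipeline. Addresses are illustrative only."""
    kind = signal["kind"]
    if kind == "caption_event":          # audio -> transcription -> events
        return [(f"/caption/{signal['event'].lower()}", signal.get("word", ""))]
    if kind == "gesture":                # motion tracking -> control values
        return [(f"/gesture/{signal['name']}", signal["value"])]
    if kind == "audio_feature":          # spectral / temporal analysis
        return [(f"/audio/{signal['name']}", signal["value"])]
    raise ValueError(f"unknown signal kind: {kind}")

msgs = to_osc_messages({"kind": "gesture", "name": "hand_height", "value": 0.72})
# msgs == [("/gesture/hand_height", 0.72)]
```

Because every modality arrives in the same shape, the router downstream does not need to know whether a value came from speech, a gesture, or a spectral feature.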

Caption Event System (Core Innovation)

Wallspace Captions does not treat text as static output.

Incoming text is parsed into structured semantic events which can drive visual behaviour in real time.

Event types include:

·         WORD

·         SENTENCE_START / SENTENCE_END

·         QUESTION / EXCLAMATION

·         PAUSE

·         EMPHASIS

·         SPEAKER_CHANGE

This enables a second processing layer:

(Text → Event → Visual Modulation)

Figure 3 - Caption Event System: Secondary Processing Layer
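The event types listed above can be produced by a lightweight text parser. The sketch below shows one plausible set of heuristics; the actual rules in Wallspace Captions may differ, and PAUSE events would in practice come from timing gaps rather than from the text itself.

```python
import re

def parse_caption_events(text, speaker=None, prev_speaker=None):
    """Sketch of the Text -> Event layer: split a caption segment into
    structured semantic events. Heuristics are illustrative only;
    PAUSE detection (timing-based) is omitted here."""
    events = []
    if speaker is not None and speaker != prev_speaker:
        events.append(("SPEAKER_CHANGE", speaker))
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        events.append(("SENTENCE_START", sentence))
        for word in re.findall(r"[\w']+", sentence):
            # fully capitalised words are treated as emphasis
            kind = "EMPHASIS" if word.isupper() and len(word) > 1 else "WORD"
            events.append((kind, word))
        if sentence.endswith("?"):
            events.append(("QUESTION", sentence))
        elif sentence.endswith("!"):
            events.append(("EXCLAMATION", sentence))
        events.append(("SENTENCE_END", sentence))
    return events

events = parse_caption_events("Are you READY? Let's go!")
# yields SENTENCE_START, WORD, WORD, EMPHASIS, QUESTION, SENTENCE_END, ...
```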

Examples:

·         Per-word flashing synced to speech rhythm

·         Colour changes triggered by punctuation

·         Fade/hold behaviour during pauses

·         Emphasis-driven scaling or highlighting

This transforms captions from passive information into active visual control signals: language itself becomes a real-time control input for expressive, performance-ready visual systems.
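The effects listed above amount to an event-to-parameter mapping. A minimal sketch, assuming illustrative parameter names (the real rendering controls live in the Scope node):

```python
# Sketch of the Event -> Visual Modulation step: each event type nudges
# a small set of rendering parameters. Names and values are illustrative.
DEFAULTS = {"flash": 0.0, "hue_shift": 0.0, "scale": 1.0, "opacity": 1.0}

def apply_event(state, event_kind):
    state = dict(state)                  # work on a copy
    if event_kind == "WORD":
        state["flash"] = 1.0             # per-word flash, decays each frame
    elif event_kind == "QUESTION":
        state["hue_shift"] = 0.6         # punctuation-triggered colour change
    elif event_kind == "EXCLAMATION":
        state["scale"] = 1.3
    elif event_kind == "EMPHASIS":
        state["scale"] = 1.5             # emphasis-driven scaling
    elif event_kind == "PAUSE":
        state["opacity"] = 0.4           # fade/hold during pauses
    return state

state = DEFAULTS
for kind in ("WORD", "EMPHASIS", "PAUSE"):
    state = apply_event(state, kind)
# state now carries flash, scale, and opacity modulation for the renderer
```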

Use Cases

·         Accessibility: Live captions for deaf/hard-of-hearing audiences at live events (A.EYE.ECHO integration)

·         VJ performance: Spoken word → AI-generated reactive visuals in real time

·         Live events: Audience speech drives projected visuals

·         Art installations: Text-reactive generative art

1. System Overview

Wallspace Captions is a real-time captioning and lyric visualisation system designed for large-format visual environments (e.g. LED walls, projection mapping, CRT installations). It converts live or captured audio into synchronised text and renders it visually as part of a VJ/visual performance pipeline.

Audio Input → Caption Engine → Text Processing → Visual Rendering (Scope)

Figure 4 - Wallspace Captions System Pipeline

This enables spoken word and music to be transformed into visual language in real time.

2. Data Flow Architecture

Audio Source (Mic / System / Stream)

→ Caption Engine (Whisper / Browser Speech / Shazam + LRCLIB)

→ Text Stream (Timestamped Segments)

→ Caption Logic Layer (Timing / Filtering / Formatting)

→ Scope Node (Rendering + Layout)

→ Visual Output (CRT / Projection / Screen)

Figure 5 - Captioning System Data Flow

3. Components

Audio Input Layer

Supports microphone, system audio capture, and web streams. Role: Provides real-time audio feed for transcription or identification.

Caption Engines

Whisper: Speech-to-text (local or remote). Browser Speech: Low-latency fallback transcription. Shazam + LRCLIB: Music identification and synced lyric retrieval. Role: Converts audio into structured text (speech or lyrics).

Caption Logic Layer

Handles timing alignment, latency compensation, and formatting. Includes dynamic sync offset and trim controls. Role: Ensures captions are synchronised and readable in real time.
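The caption logic layer's core job, selecting which caption is live at a given playhead position after applying the sync offset, can be sketched as follows. Parameter names (`sync_offset`, `max_chars`) are assumptions standing in for the node's real controls.

```python
def active_caption(segments, playhead, sync_offset=0.0, max_chars=64):
    """Sketch of the caption logic layer: given timestamped segments
    [(start, end, text), ...], a playhead time in seconds, and a live
    sync offset, return the caption to display, trimmed for readability."""
    t = playhead + sync_offset            # positive offset compensates engine latency
    for start, end, text in segments:
        if start <= t < end:
            text = text.strip()
            return text if len(text) <= max_chars else text[: max_chars - 1] + "…"
    return ""                             # no caption active at this time

segs = [(0.0, 2.5, "hello world"), (2.5, 5.0, "second caption")]
active_caption(segs, playhead=1.0)                   # "hello world"
active_caption(segs, playhead=2.0, sync_offset=1.0)  # "second caption"
```

The same shape accommodates the dynamic trim controls: shortening `max_chars` live is a readability adjustment, shifting `sync_offset` is a latency adjustment.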

Scope (Rendering Engine)

Node-based visual system for rendering captions. Supports layout control, styling, and integration with other visual pipelines. Role: Final visual output layer.

4. Protocols & Data Handling

Audio Stream: Real-time input from system/mic.

Internal Data: Timestamped text segments (structured data).

SSE/WebSocket: Used for live updates within Scope.

Optional OSC/MIDI: Enables integration with external visual systems.
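The internal timestamped segments might look like the structure below when serialised for SSE/WebSocket delivery into Scope. The field names are an assumption for illustration, not the node's actual schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CaptionSegment:
    """Illustrative shape of the internal timestamped text segments;
    field names are assumptions, not the node's actual schema."""
    start: float   # seconds
    end: float     # seconds
    text: str
    source: str    # e.g. "whisper" | "browser" | "lyrics"

def to_wire(segment):
    # JSON payload as it might be pushed over SSE/WebSocket to Scope
    return json.dumps(asdict(segment))

seg = CaptionSegment(start=12.4, end=14.1, text="hello world", source="whisper")
payload = to_wire(seg)
# payload is a JSON string like {"start": 12.4, "end": 14.1, ...}
```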

5. Example Data Flow

Speech mode:

Audio Input → Whisper → Timestamped Text → Caption Logic → Rendered Caption

Figure 6 - Speech Mode: Real-Time Transcription Pipeline

In speech mode, live or recorded audio is processed through a transcription engine (e.g. Whisper), producing timestamped text segments. These are converted into structured caption events and passed into the rendering system, where they are displayed as synchronised visual captions in real time. 
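The conversion from transcription output to caption events can be sketched as below. Whisper's result includes a list of segments with `start`, `end`, and `text` fields; the event dictionary shape and the pause threshold are assumptions for illustration.

```python
def whisper_segments_to_captions(result, pause_threshold=0.8):
    """Sketch: convert a Whisper-style transcription result
    ({"segments": [{"start", "end", "text"}, ...]}) into timed caption
    events, inserting PAUSE events where the gap between segments
    exceeds a threshold. The threshold value is an assumption."""
    events = []
    prev_end = None
    for seg in result["segments"]:
        if prev_end is not None and seg["start"] - prev_end > pause_threshold:
            events.append({"type": "PAUSE", "at": prev_end})
        events.append({"type": "CAPTION", "start": seg["start"],
                       "end": seg["end"], "text": seg["text"].strip()})
        prev_end = seg["end"]
    return events

demo = {"segments": [
    {"start": 0.0, "end": 1.8, "text": " welcome to the show"},
    {"start": 3.2, "end": 4.5, "text": " let's begin"},
]}
events = whisper_segments_to_captions(demo)
# yields CAPTION, PAUSE (1.4 s gap), CAPTION
```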

Music mode:

Audio Input → Shazam → Track ID → LRCLIB → Synced Lyrics → Rendered Output

Figure 7 - Music Mode: Audio Identification to Synchronised Lyric Rendering

Incoming audio is analysed via fingerprinting (e.g. Shazam) to identify the track in real time. Once identified, synchronised lyric data is retrieved and aligned with playback timing, generating a continuous stream of timed text events. These events are rendered dynamically, enabling lyrics to function as both captions and visual performance elements.
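Synced lyrics from services such as LRCLIB are typically delivered in the standard LRC format, where each line carries a `[mm:ss.xx]` timestamp. Turning that into a stream of timed text events is a small parsing step; the sketch below handles the common single-timestamp case only.

```python
import re

LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")

def parse_lrc(lrc_text):
    """Sketch: parse synced lyrics in the standard LRC format
    ("[mm:ss.xx] line") into (seconds, text) pairs sorted by time.
    Lines with multiple timestamps are not handled in this sketch."""
    timed = []
    for raw in lrc_text.splitlines():
        m = LRC_LINE.match(raw.strip())
        if m:
            minutes, seconds, text = m.groups()
            timed.append((int(minutes) * 60 + float(seconds), text.strip()))
    return sorted(timed)

lyrics = "[00:12.30] first line\n[00:15.10] second line"
events = parse_lrc(lyrics)
# [(12.3, "first line"), (15.1, "second line")]
```

Each pair then feeds the same caption logic layer as speech-mode segments, so lyrics and live transcription share one rendering path.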

6. Current Mid-Cohort Status

✔ Multiple caption engines integrated

✔ Real-time transcription and lyric sync working

✔ Dynamic sync offset and trim controls implemented

✔ Rendering in Scope working

⚠ Visual styling and advanced layout in progress

⚠ External control integration (OSC/MIDI) exploration

7. Next Steps

1. Refine caption visual design for large-scale displays

2. Improve latency handling and sync accuracy

3. Add OSC/MIDI hooks for external control

4. Integrate with full visual performance pipeline

5. Record demo video

8. Relation to Other Work

This article represents the integration layer of three ongoing strands of work:

–         A real-time caption and lyric system (Wallspace Captions)

–         A gesture-based conducting system using Flowfal

–         A research-driven audio reactivity engine for TouchDesigner

Together, these form a unified approach to translating sound and movement into visual experience for deaf audiences.

See linked articles for deeper technical breakdowns of each subsystem.

9.  Accessibility Tools

Accessibility Tooling (Supporting Layer)

During development, a number of supporting accessibility tools were created to address limitations in real-time communication platforms for deaf users.

These included system audio transcription pipelines, video OCR, and transcription processing tools.

These tools informed the design of the caption-driven visual system and are documented separately, linked from the top of this article.