Overview. Precise motion timing (PMT) is crucial for swift motion analysis, where a millisecond difference may determine athletic outcomes. Existing Human Pose Estimation (HPE) relies heavily on RGB cameras constrained by frame rates (30-60Hz). To address the lack of high-frequency annotated datasets, we propose FlashCap, the first flashing LED-based MoCap system for PMT. Leveraging event cameras and blinking LEDs, we collect the FlashMotion dataset featuring 1000Hz ground truth. Furthermore, we propose a strong baseline, ResPose, to estimate high-temporal-resolution poses by learning residual signals based on events and static RGB anchors.


Abstract

Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.

Introduction

Precise motion timing (PMT) plays a critical role in understanding human motion. Current public motion datasets peak at 120Hz, which is inadequate for developing millisecond-accurate algorithms. Existing MoCap systems face fundamental limitations: conventional RGB cameras lack temporal resolution, while high-speed cameras (>1000 Hz) are hindered by massive bandwidth, lighting constraints, and prohibitive costs.

To break this bottleneck without relying on complex commercial rigs, we introduce FlashCap, an automated, studio-free approach leveraging event cameras. Event cameras capture asynchronous intensity changes, offering extremely high temporal resolution with minimal bandwidth. By outfitting subjects with precisely timed flashing LEDs, our system directly extracts native 1000Hz pose labels from event streams.

Our contributions are:

  1. FlashCap: The first flashing LED-based MoCap system to capture human motion at high temporal resolution with low bandwidth and monetary cost.
  2. FlashMotion Dataset: Contains 1000Hz ground-truth labels, improving the state-of-the-art temporal resolution by nearly an order of magnitude.
  3. ResPose: A simple yet effective baseline for high-temporal-resolution HPE and PMT tasks that uses high-frequency events as residual signals.

Method: FlashCap & ResPose

1. The FlashCap System & Annotation Pipeline

The FlashCap MoCap outfit contains 17 LEDs and 17 IMUs. Each LED is configured to blink at its own high frequency, which makes it easy for the event camera to detect and distinguish. The multi-modal capture device incorporates an RGB camera (20 FPS), an event camera, and a LiDAR.

To extract 1000Hz ground-truth labels from the raw data, we developed a robust Data Annotation Pipeline:

  • Event Cluster Identification: We use DBSCAN to cluster asynchronous event streams into potential LED locations.
  • Cluster Frequency Identification: We compute polarity changes to identify the specific on-time and off-time signatures of each LED.
  • Noise Removal & Matching: Outliers are filtered, and a bipartite matching algorithm accurately pairs identified clusters with the physical LEDs on the actor’s body based on period distance.
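
The last two steps of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes clustering has already produced, for each cluster, a list of OFF-to-ON transition timestamps, and it takes "period distance" to mean the absolute difference between a cluster's estimated blink period and an LED's configured period. The LED names, periods, timestamps, and function names are all hypothetical. The matching is done by exhaustive search, which is only practical for a handful of LEDs; a Hungarian-style solver would be needed for all 17.

```python
from itertools import permutations
from statistics import median


def estimate_period_us(on_timestamps_us):
    """Median gap between consecutive OFF->ON transitions of one cluster."""
    gaps = [b - a for a, b in zip(on_timestamps_us, on_timestamps_us[1:])]
    return median(gaps)


def match_clusters_to_leds(cluster_periods, led_periods):
    """Minimum-cost one-to-one assignment of clusters to LEDs.

    Cost is the absolute period difference (an assumed reading of
    "period distance"). Exhaustive search: fine for a toy example only.
    Requires len(cluster_periods) <= len(led_periods).
    """
    cluster_ids = list(cluster_periods)
    best_cost, best = float("inf"), None
    for perm in permutations(led_periods, len(cluster_ids)):
        cost = sum(
            abs(cluster_periods[c] - led_periods[led])
            for c, led in zip(cluster_ids, perm)
        )
        if cost < best_cost:
            best_cost, best = cost, dict(zip(cluster_ids, perm))
    return best


# Hypothetical configured blink periods in microseconds (not from the paper).
LED_PERIODS_US = {"left_wrist": 500, "right_wrist": 800, "head": 1000}

# Hypothetical rising-edge timestamps per cluster, with a little jitter.
clusters = {
    "c0": [0, 805, 1598, 2400],
    "c1": [10, 1012, 2009, 3010],
    "c2": [3, 501, 1004, 1502],
}

cluster_periods = {cid: estimate_period_us(ts) for cid, ts in clusters.items()}
assignment = match_clusters_to_leds(cluster_periods, LED_PERIODS_US)
print(cluster_periods)
print(assignment)
```

Using the median gap rather than the mean keeps a single missed or spurious transition from skewing the period estimate, which matters because the matching cost depends on it directly.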

2. ResPose Baseline

To bridge the gap between low-frame-rate standard inputs and high-frequency motion dynamics, we developed ResPose. It leverages the structural stability of RGB priors and the high temporal resolution of events.

  • SNN-CNN Hybrid Encoder: It performs dynamic cropping centered at an RGB anchor and explicitly extracts spatiotemporal event patches using Leaky Integrate-and-Fire (LIF) neurons.
  • Multimodal Residual Transformer: This module fuses the 2D RGB anchors with the event features, enforcing kinematic constraints through skeleton-aware self-attention to yield the final high-frequency pose.
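
Two ideas in the components above lend themselves to a small sketch: the Leaky Integrate-and-Fire neuron and the residual formulation. The code below is a toy illustration under stated assumptions, not the ResPose architecture. The LIF neuron uses a common discrete-time form (decay, integrate, fire, hard reset to zero); the decay `beta`, the threshold, and the reset rule are assumptions, since the text does not give them. The residual step assumes the high-rate pose is the RGB anchor plus a predicted per-step offset, which is one plausible reading of "residual poses".

```python
def lif_spikes(inputs, beta=0.9, threshold=1.0):
    """Discrete-time LIF neuron: leak, integrate, fire, hard reset to zero."""
    v = 0.0  # membrane potential
    spikes = []
    for x in inputs:
        v = beta * v + x          # leak the old potential, add new input
        if v >= threshold:
            spikes.append(1)      # fire
            v = 0.0               # hard reset (assumed reset rule)
        else:
            spikes.append(0)
    return spikes


def apply_residuals(anchor_xy, residuals_xy):
    """High-rate 2D positions as anchor + per-step residual offset."""
    ax, ay = anchor_xy
    return [(ax + dx, ay + dy) for dx, dy in residuals_xy]


# Hypothetical input current and residuals, for illustration only.
spikes = lif_spikes([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
poses = apply_residuals((100.0, 200.0), [(0.0, 0.0), (1.5, -0.5), (3.0, -1.0)])
print(spikes)  # [0, 0, 1, 0, 0, 1]
print(poses)   # [(100.0, 200.0), (101.5, 199.5), (103.0, 199.0)]
```

Predicting offsets from an anchor, rather than absolute coordinates, means the network only has to model the small motion that occurs between RGB frames.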

The FlashMotion Dataset

Based on FlashCap, we introduce FlashMotion, the first publicly available multi-modal human motion dataset with millisecond-accuracy pose labels. It includes:

  • Scale: 240 sequences covering 11 major action categories and 7.15M labeled frames.
  • Modalities: Synchronized RGB, LiDAR point clouds, IMU, and Event streams.
  • Annotations: 1000Hz 2D labels and 60Hz 3D SMPL parameters.

Conventional high-speed cameras (e.g., 200Hz) still exhibit 2-6ms timing errors. In contrast, our event-based 1000Hz ground truth resolves micro-dynamics during rapid motions at millisecond granularity.

Tasks and Benchmarks

We propose two novel benchmarks to evaluate the merit of the dataset and our method.

Precise Motion Timing (PMT)

This task measures the exact millisecond at which a specific joint crosses a predefined line, which is essential for athletic speed calculations. Existing low-frame-rate RGB methods (such as ViTPose) fail at this task, with errors of roughly 50ms. Even prior event-fusion methods struggle. In contrast, ResPose achieves single-digit millisecond accuracy (e.g., 4.8ms for Punching).
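
To make the effect of sampling rate on this task concrete, the following sketch estimates a line-crossing time from a sampled 1D joint trajectory by linear interpolation between the two samples that straddle the line. It is an illustrative toy, not the benchmark's evaluation code: the trajectory (`position_px`), the line position, and the function names are hypothetical. The joint stays still until 60 ms and then moves at 2 px/ms, so it truly crosses x = 40 px at 80 ms.

```python
def crossing_time_ms(times_ms, xs, line_x):
    """First crossing of line_x, linearly interpolated between samples."""
    samples = list(zip(times_ms, xs))
    for (t0, x0), (t1, x1) in zip(samples, samples[1:]):
        straddles = (x0 - line_x) * (x1 - line_x) <= 0
        if straddles and x0 != x1:
            return t0 + (line_x - x0) * (t1 - t0) / (x1 - x0)
    return None  # the line was never crossed


def position_px(t_ms):
    """Hypothetical joint trajectory: still until 60 ms, then 2 px/ms."""
    return max(0.0, 2.0 * (t_ms - 60))


def sample(rate_hz, duration_ms=150):
    """Sample the trajectory at rate_hz (rate must divide 1000 evenly)."""
    step_ms = 1000 // rate_hz
    times = list(range(0, duration_ms + 1, step_ms))
    return times, [position_px(t) for t in times]


LINE_X = 40.0  # true crossing time is 80 ms

t_low = crossing_time_ms(*sample(20), LINE_X)     # 20 FPS, like the RGB camera
t_high = crossing_time_ms(*sample(1000), LINE_X)  # 1000Hz labels
print(t_low, t_high)  # 75.0 80.0
```

In this toy case the 20 FPS estimate is off by 5 ms because the motion starts between two frames and linear interpolation cannot recover that onset. The 1000Hz samples place the crossing exactly.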

High Temporal Resolution HPE

We linearly upsample baseline outputs to 1000Hz to compare them against our high-frequency ground truth. ResPose significantly outperforms all state-of-the-art baselines, achieving the lowest Mean Per Joint Position Error (MPJPE of 5.66) and producing smooth, high-fidelity trajectories that capture true micro-movements.
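
The evaluation protocol described above can be sketched as follows: low-rate predictions are linearly interpolated onto 1000Hz timestamps and compared with the ground truth using MPJPE, the mean Euclidean distance over all frames and joints. This is a minimal sketch under assumptions, not the benchmark code. The single-joint trajectory is hypothetical, poses are 2D tuples in pixels, and target timestamps are assumed sorted and within the range of the input timestamps.

```python
import math


def upsample_linear(times_ms, poses, target_times_ms):
    """Linearly interpolate per-joint 2D positions onto target timestamps."""
    out = []
    i = 0
    for t in target_times_ms:
        # Advance to the input interval [times_ms[i], times_ms[i + 1]] holding t.
        while i + 2 < len(times_ms) and times_ms[i + 1] <= t:
            i += 1
        t0, t1 = times_ms[i], times_ms[i + 1]
        w = (t - t0) / (t1 - t0)
        out.append([
            (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
            for (x0, y0), (x1, y1) in zip(poses[i], poses[i + 1])
        ])
    return out


def mpjpe(pred, gt):
    """Mean Euclidean distance over all frames and joints."""
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(dists) / len(dists)


# Hypothetical single-joint example: 20 FPS predictions at 0 ms and 50 ms.
pred_times = [0, 50]
pred_poses = [[(0.0, 0.0)], [(50.0, 0.0)]]

# Hypothetical 1000Hz ground truth: still until 25 ms, then 2 px/ms.
gt_times = list(range(0, 51))
gt_poses = [[(max(0.0, 2.0 * (t - 25)), 0.0)] for t in gt_times]

upsampled = upsample_linear(pred_times, pred_poses, gt_times)
error = mpjpe(upsampled, gt_poses)
print(len(upsampled), round(error, 2))
```

Both endpoints of the prediction are exactly right here, yet the error is far from zero: linear upsampling assumes constant velocity between frames, so any motion onset inside the interval shows up as error against the 1000Hz labels.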

Conclusion

We present FlashCap, a novel system enabling millisecond-accurate motion capture. Using this system, we introduce FlashMotion, the first millisecond-accuracy human motion dataset, and expose the fundamental limitations of standard low-frequency HPE methods. To address these limitations, we propose ResPose, a simple yet effective hybrid spiking-Transformer baseline that establishes a strong foundation for future high-temporal-resolution motion understanding.

Citation

```bibtex
@inproceedings{wu2026flashcap,
  title={FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision},
  author={Wu, Zekai and Fan, Shuqi and Liu, Mengyin and Luo, Yuhua and Lin, Xincheng and Yan, Ming and Wu, Junhao and Lin, Xiuhong and Ma, Yuexin and Wen, Chenglu and Xu, Lan and Shen, Siqi and Wang, Cheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```