Karan Deo Burnwal

Project Motivation and Overview

The project addresses the critical need for privacy-preserving occupancy monitoring in sensitive environments where traditional camera or microphone surveillance is unsuitable. By utilizing structural vibration sensing via low-cost geophones, the system detects and classifies human and animal presence based on their unique gait and footstep signatures.

This method provides robust operation in challenging, non-visual conditions such as darkness, visual occlusion, and adverse weather. The core objective is to create a lightweight, multi-class detection system capable of distinguishing between human, animal, and noise signals efficiently on embedded hardware like microcontrollers or single-board computers.

Data Preparation and Pre-processing

1. Raw Signal Acquisition and Filtering

The process begins with the raw audio signal captured by a geophone sensor at a sampling rate of 8 kHz.

The raw audio is first downsampled to 500 Hz.
To extract relevant signal energy, the data is processed across three distinct frequency bands: 10–60 Hz, 60–120 Hz, and 120–180 Hz.
The Hilbert Envelope is extracted for each of these three bandpass filtered signals.
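The filtering steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `band_envelopes`, the fourth-order zero-phase Butterworth design, and the use of polyphase resampling are all assumptions, since the source states only the sampling rates and the three bands.

```python
import numpy as np
from scipy.signal import butter, hilbert, resample_poly, sosfiltfilt

FS_RAW = 8000                                   # acquisition rate from the text (Hz)
FS_LOW = 500                                    # downsampled rate from the text (Hz)
BANDS_HZ = [(10, 60), (60, 120), (120, 180)]    # bands from the text


def band_envelopes(raw, filter_order=4):
    """Return (envelopes, fs): one Hilbert envelope per band at 500 Hz.

    filter_order and the Butterworth design are assumptions, not from the source.
    """
    # 8000 / 500 = 16, so decimate by 16 with a polyphase anti-aliasing filter.
    x = resample_poly(np.asarray(raw, dtype=float), up=1, down=FS_RAW // FS_LOW)
    envelopes = []
    for low, high in BANDS_HZ:
        sos = butter(filter_order, [low, high], btype="bandpass",
                     fs=FS_LOW, output="sos")
        banded = sosfiltfilt(sos, x)                 # zero-phase band-pass
        envelopes.append(np.abs(hilbert(banded)))    # magnitude of the analytic signal
    return np.stack(envelopes), FS_LOW
```

All three bands lie below the 250 Hz Nyquist limit of the downsampled signal, so no band is lost by filtering after decimation.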

2. Footstep Event Extraction and Validation

The three Hilbert Envelopes are averaged (fused) to create a single detection signal. Footstep events are identified using Peak Detection with Parabolic Interpolation. To ensure only high-quality signals are retained, a validation step is crucial:

The Continuous Wavelet Transform (CWT) is computed using a Morlet Wavelet.
The CWT Ridge Magnitude is extracted.
A detected peak is validated as a true footstep event only if its corresponding CWT Ridge Magnitude is above the 70th percentile, filtering out low-energy noise events.
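A hedged sketch of the detection and validation steps above, using NumPy only. Several details are assumptions because the source does not specify them: the peak height threshold (the mean of the fused envelope), the signal the CWT is computed on (here, the downsampled waveform), the frequency grid, the Morlet parameter `w0`, the reading of "ridge" as the maximum magnitude across scales at each sample, and the percentile being taken over the whole ridge trace. All function names are hypothetical.

```python
import numpy as np


def parabolic_peaks(signal, min_height):
    """Local maxima above min_height, refined by a 3-point parabola fit."""
    y = np.asarray(signal, dtype=float)
    mid = y[1:-1]
    idx = np.where((mid > y[:-2]) & (mid >= y[2:]) & (mid > min_height))[0] + 1
    a, b, c = y[idx - 1], y[idx], y[idx + 1]
    # Vertex of the parabola through the three points; the denominator is
    # strictly negative at the maxima selected above, so it is never zero.
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)
    return idx, idx + offset          # integer index and sub-sample position


def morlet_ridge_magnitude(x, fs, freqs_hz, w0=6.0):
    """Max |CWT| across the given centre frequencies at each sample.

    Assumes len(x) exceeds the longest wavelet (about 8 * w0 / (2*pi*f_min) s).
    """
    x = np.asarray(x, dtype=float)
    ridge = np.zeros(len(x))
    for f in freqs_hz:
        s = w0 / (2.0 * np.pi * f)                  # scale (s) for centre frequency f
        n = int(4.0 * s * fs)
        t = np.arange(-n, n + 1) / fs
        wavelet = np.exp(1j * w0 * t / s) * np.exp(-0.5 * (t / s) ** 2)
        wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))   # unit energy (a convention)
        # Correlating with conj(wavelet) equals convolving with its reversal.
        coef = np.convolve(x, np.conj(wavelet)[::-1], mode="same")
        ridge = np.maximum(ridge, np.abs(coef))
    return ridge


def validated_footsteps(envelopes, waveform, fs, percentile=70.0):
    """Return validated footstep times in seconds."""
    fused = np.mean(envelopes, axis=0)                       # average the band envelopes
    idx, refined = parabolic_peaks(fused, min_height=fused.mean())
    ridge = morlet_ridge_magnitude(waveform, fs, np.arange(10, 181, 10))
    keep = ridge[idx] > np.percentile(ridge, percentile)     # 70th-percentile gate
    return refined[keep] / fs
```

Parabolic interpolation gives a sub-sample peak position, which matters at 500 Hz, where one sample spans 2 ms.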

Synthetic Data Generation Pipeline

To address data scarcity and, critically, to train the model on the challenging scenario of simultaneous multiple footsteps (multi-occupancy), a synthetic data pipeline was developed.

Selection and Slicing: Random users and their processed footstep events are selected. Random slices are extracted from the full audio files.
Superposition: Realistic signal mixtures are generated by summing (superposing) the time-series signals of the overlapping segments. A distance gain is applied to each segment to simulate varied physical distances from the sensor.
Noise Injection: Gaussian noise is generated based on a calculation of the signal power and then added to the mixed sample.

This process yields synthetic mixed samples that expose the model to overlapping, multi-occupancy footsteps during training.
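The superposition and noise-injection steps can be illustrated with a short sketch. The source says only that the noise is derived from the signal power, so expressing it as a target signal-to-noise ratio in dB (`snr_db`) is an assumption, as are the function name `mix_footsteps` and the example gain values.

```python
import numpy as np


def mix_footsteps(segments, distance_gains, snr_db, rng):
    """Sum gain-scaled segments, then add white Gaussian noise at snr_db.

    segments: array of shape (n_sources, n_samples), already time-aligned.
    distance_gains: one scalar per source (hypothetical attenuation model).
    Returns (noisy_mix, clean_mix).
    """
    segments = np.asarray(segments, dtype=float)
    gains = np.asarray(distance_gains, dtype=float)[:, None]
    clean = np.sum(gains * segments, axis=0)             # superposition
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise, clean


# Usage: two sources, the second one farther from the sensor.
rng = np.random.default_rng(0)
t = np.arange(20000) / 8000.0
sources = np.stack([np.sin(2 * np.pi * 30 * t), np.sin(2 * np.pi * 45 * t)])
noisy, clean = mix_footsteps(sources, distance_gains=[1.0, 0.5], snr_db=20.0, rng=rng)
```

Scaling the noise from the power of the mixed signal keeps the noise level consistent regardless of how many sources are overlaid.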

Model Input Representation

The final models operate directly on the raw 1D vibration waveform, specifically a 2.5-second window sampled at 8 kHz (20,000 samples). This choice is intentional: it preserves the fine-grained temporal dynamics of the heel-strike, toe-off, and surface resonance, and avoids the feature loss that can occur when the signal is converted to a time-frequency image.
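The window arithmetic above can be checked with a small sketch. The hop size between consecutive windows is not given in the source; the default here (non-overlapping windows) and the function name `to_windows` are assumptions.

```python
import numpy as np

FS = 8000                               # sampling rate from the text (Hz)
WINDOW_S = 2.5                          # window length from the text (s)
WINDOW_SAMPLES = int(FS * WINDOW_S)     # 2.5 s * 8000 Hz = 20,000 samples


def to_windows(waveform, hop_samples=WINDOW_SAMPLES):
    """Slice a 1D waveform into fixed-length windows; a trailing partial window is dropped."""
    x = np.asarray(waveform, dtype=np.float32)
    if len(x) < WINDOW_SAMPLES:
        return np.empty((0, WINDOW_SAMPLES), dtype=np.float32)
    n_windows = 1 + (len(x) - WINDOW_SAMPLES) // hop_samples
    starts = [i * hop_samples for i in range(n_windows)]
    return np.stack([x[s:s + WINDOW_SAMPLES] for s in starts])
```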

Deep Learning Modeling

Baseline Model: Footstep1DNet

The initial model was a lightweight, fast-inference 1D-CNN designed for microcontrollers. It consisted of four convolutional blocks with progressively decreasing kernel sizes ($25 \rightarrow 15 \rightarrow 9 \rightarrow 5$). While fast, this shallow architecture had a fundamental limitation: a high False Negative (FN) rate, particularly for users with subtle or light-footed gaits, which severely limited its usability in real-world deployment.
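To give a sense of scale for the kernel sizes above, the following sketch computes the receptive field of the four stacked convolutions. The source does not state strides or pooling, so stride-1 convolutions are assumed, and the `pool=4` case is purely hypothetical; `receptive_field` is an illustrative helper, not part of the project.

```python
def receptive_field(kernel_sizes, pool=1):
    """Receptive field, in input samples, of stacked stride-1 convolutions,
    each followed by non-overlapping pooling of width `pool` (1 = no pooling)."""
    rf, jump = 1, 1
    for k in kernel_sizes:
        rf += (k - 1) * jump        # the convolution widens the field
        rf += (pool - 1) * jump     # pooling widens it further
        jump *= pool                # and increases the step between outputs
    return rf


KERNELS = (25, 15, 9, 5)            # kernel sizes from the text
FS = 8000                           # input sampling rate (Hz)

no_pool = receptive_field(KERNELS)              # 51 samples, about 6.4 ms
with_pool = receptive_field(KERNELS, pool=4)    # 720 samples, 90 ms
```

Under either assumption the span is short relative to the 20,000-sample input window, although the source does not attribute the false negatives to receptive field specifically.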

Proposed Model: FootstepResNet

The proposed architecture is a deeper, physics-aware network utilizing Residual Blocks to overcome the baseline’s generalization failures.

Physics-Aware Stem: The network begins with a large Conv1D kernel of size 129. On the 8 kHz input, this kernel spans approximately 16 ms of signal, enough to model the initial high-energy impact, foot-surface coupling, and resonant decay structure of the footstep event.
Residual Depth: It employs three stacked Residual Blocks with skip connections. This design supports stable, deep feature learning, mitigates vanishing gradients, and allows the model to learn complex, long-range gait patterns necessary to distinguish subtle intra-user signatures.
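The stem span and the skip connection can be illustrated in plain NumPy. This is a conceptual sketch only: the layer layout inside the block (convolution, ReLU, convolution, then the skip addition), the omission of normalisation layers, and the equal input and output channel counts are assumptions, and the function names are hypothetical.

```python
import numpy as np

FS = 8000
STEM_KERNEL = 129
stem_span_ms = 1000.0 * STEM_KERNEL / FS     # 129 / 8000 s = 16.125 ms


def conv1d_same(x, kernels):
    """x: (c_in, n); kernels: (c_out, c_in, k) with odd k <= n.

    Zero-padded cross-correlation, as in typical deep-learning 'conv' layers.
    """
    c_out, c_in, _ = kernels.shape
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            out[o] += np.correlate(x[i], kernels[o, i], mode="same")
    return out


def residual_block(x, k1, k2):
    """y = relu(x + F(x)), with F = conv -> relu -> conv."""
    h = np.maximum(conv1d_same(x, k1), 0.0)
    h = conv1d_same(h, k2)
    return np.maximum(x + h, 0.0)            # the skip connection adds the input back
```

Because the input is added back before the final activation, a block whose weights are all zero simply passes its (rectified) input through, which is what gives gradients a direct path through a deep stack.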

Performance and Error Reduction

Compared with the baseline, FootstepResNet substantially reduced both False Negatives and False Positives:

Metric	Baseline (Footstep1DNet)	Proposed (FootstepResNet)	Improvement
False Negatives (FN)	723	183	~4x Reduction
False Positives (FP)	331	170	~2x Reduction
Accuracy	Varied across users	96–99% (across users)	Significantly enhanced

Crucially, the new model recovered the detection of “problematic” users who were consistently missed by the baseline: for these users, the False Negative count dropped from 331 to just 2.
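The reduction factors reported in the table follow directly from the error counts, as this short check shows. The helper name `reduction_factor` is illustrative; the counts are taken from the table.

```python
def reduction_factor(before, after):
    """How many times smaller the error count became."""
    return before / after


fn_factor = reduction_factor(723, 183)        # about 3.95, reported as ~4x
fp_factor = reduction_factor(331, 170)        # about 1.95, reported as ~2x
fn_drop_pct = 100.0 * (723 - 183) / 723       # about 74.7 % fewer False Negatives
fp_drop_pct = 100.0 * (331 - 170) / 331       # about 48.6 % fewer False Positives
```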

Limitations and Future Work

Despite these results, several key challenges remain for real-world deployment:

Closed-Set Recognition: The current classifier is trained on a fixed set of identities and forces any “unknown” intruder into an existing class, which can lead to silent failures. Future work focuses on Open-Set Recognition to enable detection of previously unseen individuals.
Edge Optimization: To make deployment practical, the FootstepResNet will be optimized for edge hardware using Post-Training Quantization (INT8), pruning, and TFLite runtime optimization for microcontrollers and single-board computers.
Multi-Source Separation: The system currently performs multi-label detection but lacks true signal disentanglement. Future work aims to move from detection to explicit source separation to enable accurate counting and tracking in highly dense, overlapping footstep scenarios.
Robustness Testing: Generalization will be tested across diverse surfaces (Wood, Concrete, Tile, Soil/outdoor terrains) to address domain shift challenges.
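Two of the items above lend themselves to a small illustration. The sketch below shows one simple open-set baseline (rejecting a prediction when the top softmax probability is below a threshold) and the arithmetic behind symmetric per-tensor INT8 quantization. Neither is claimed to be the method the project will adopt: the threshold value, the class names, and both function names are hypothetical, and a real deployment would rely on the TFLite converter rather than hand-written quantization.

```python
import numpy as np


def predict_open_set(logits, class_names, min_confidence=0.9):
    """Return the top class, or "unknown" if the model is not confident enough."""
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())
    probs /= probs.sum()                       # numerically stable softmax
    best = int(np.argmax(probs))
    return class_names[best] if probs[best] >= min_confidence else "unknown"


def quantize_int8(weights):
    """Symmetric per-tensor INT8: weights are approximated by scale * q."""
    w = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


# Usage with hypothetical identities.
names = ["user_a", "user_b", "user_c"]
confident = predict_open_set([5.0, 0.0, 0.0], names)   # "user_a"
uncertain = predict_open_set([1.0, 0.9, 0.8], names)   # "unknown"
```

A closed-set classifier would return one of the three names in both cases; the threshold is what allows the second input to be flagged instead of silently misassigned.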