Building a Personal Field Recording Pipeline on Unraid

homelab · unraid · audio · field-recording · docker · python · project

After a morning in the field with a Zoom F3 recorder, I’d come home with a folder of WAV files, a phone camera roll of GPS-tagged photos, and a vague memory of what I’d heard. Processing it all — matching audio to location, running BirdNET, transcribing spoken notes, checking the weather — took long enough that I often just… didn’t.

fram-harness is the fix. Drop a session folder on the NAS, click a button, and get a report.

Work in progress: this project is actively being built. This post covers the design and motivation; a follow-up will cover the build and results.


The Problem

Every field recording session produces:

  • WAV files from the Zoom F3, named by time-of-day (HHMMSS_NNNN.WAV)
  • Phone photos with GPS EXIF data, taken at the same locations
  • Spoken field notes — mumbled into the recorder mid-session
  • A date buried in the folder name (YYMMDD-Location-Detail-Place)

Cross-referencing any of this by hand is tedious. The F3 has no GPS. BirdNET won’t run itself. The weather at the time is a detail I always forget to note.

The gap between “interesting recording session” and “useful documented session” was large enough that most sessions stayed undocumented. That’s the problem worth solving.


The Design

A lightweight web app on the NAS lets me browse session folders, select which analysis stages to run, and submit. The submission fires a Ductile webhook, which hands the session to a pipeline container. The pipeline runs the selected stages in order and writes a report back to the session folder.

Web App (Flask, Unraid)

    ├─ Browse /mnt/field_Recording/F3/Orig/
    ├─ Session form: stage checkboxes + file picker
    └─ On submit:
            ├─ Write pipeline_config.yaml → session folder
            └─ POST to Ductile webhook
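The submit step is deliberately small. A minimal sketch, assuming a hypothetical Ductile webhook URL and an illustrative config schema (the function names and payload fields are mine, not the project's):

```python
import json
import urllib.request
from pathlib import Path


def write_config(session_dir: str, stages: list[str]) -> Path:
    """Write pipeline_config.yaml into the session folder and return its path."""
    config = {"session": Path(session_dir).name, "stages": stages}
    # A flat mapping serialised as JSON is also valid YAML, which keeps this
    # sketch dependency-free; the real app would use PyYAML.
    path = Path(session_dir) / "pipeline_config.yaml"
    path.write_text(json.dumps(config, indent=2))
    return path


def notify_ductile(webhook_url: str, session_dir: str) -> None:
    """POST the session path to the (hypothetical) Ductile webhook endpoint."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"session_dir": session_dir}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # fire-and-forget; the queue does the rest
```

Everything stateful lives in the session folder itself, so the web app stays thin: it writes one file and sends one request.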

Ductile → pipeline-runner container

    ├─ EXIF scan      [always]  photos → GPS + timestamp map
    ├─ Transcribe     [opt]     faster-whisper CUDA → field notes
    ├─ Refine         [opt]     Ollama → cleaned text
    ├─ BirdNET        [opt]     CUDA inference → species detections
    ├─ Weather        [opt]     open-meteo → conditions at time/place
    ├─ Spectrogram    [opt]     melspec-to-video → MP4 per file
    └─ Report         [always]  merge → session_report.md + session.json

The output is a session_report.md in the session folder — readable immediately, linkable from Obsidian, archivable.
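The report stage is a straight merge of per-stage results. A sketch, with the result keys purely illustrative (the real stage outputs are richer than this):

```python
import json
from pathlib import Path


def write_report(session_dir: str, results: dict[str, dict]) -> None:
    """Merge per-stage results into session_report.md + session.json."""
    out = Path(session_dir)
    # Machine-readable copy for later tooling
    out.joinpath("session.json").write_text(json.dumps(results, indent=2))
    # Human-readable copy, one section per stage
    lines = [f"# Session report: {out.name}", ""]
    for stage, data in results.items():
        lines.append(f"## {stage}")
        for key, value in data.items():
            lines.append(f"- **{key}**: {value}")
        lines.append("")
    out.joinpath("session_report.md").write_text("\n".join(lines))
```

Writing both files in one pass means the Markdown and JSON can never drift apart, and stages that were skipped simply don't appear.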


The Hardware Constraint That Shapes Everything

The NAS has an RTX 2070 (8GB VRAM). Three pipeline stages want the GPU: faster-whisper, BirdNET, and the spectrogram generator. Running them concurrently would cause VRAM contention and unpredictable failures.

The solution is to route GPU jobs through Ductile’s queue — one at a time, serialised. CPU stages (EXIF, weather, report) run independently. This turns a potential mess into a clean dependency graph without needing custom orchestration logic.
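In the real system Ductile owns the queueing, but the shape of the fix is just a mutex around the GPU stages. A process-local stand-in sketch (stage names from the pipeline above; the lock stands in for Ductile's single-slot GPU queue):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

GPU_LOCK = threading.Lock()  # stand-in for Ductile's one-at-a-time GPU queue


def run_stage(name: str, needs_gpu: bool, log: list[str]) -> None:
    if needs_gpu:
        with GPU_LOCK:               # GPU stages serialise on the lock
            log.append(f"gpu:{name}")
    else:
        log.append(f"cpu:{name}")    # CPU stages run whenever a worker is free


def run_session(log: list[str]) -> None:
    stages = [
        ("transcribe", True), ("birdnet", True), ("spectrogram", True),
        ("exif", False), ("weather", False),
    ]
    # Exiting the context manager waits for all stages to finish
    with ThreadPoolExecutor(max_workers=5) as pool:
        for name, gpu in stages:
            pool.submit(run_stage, name, gpu, log)
```

The point is that the constraint lives in one place: a stage declares whether it needs the GPU, and the queue does the rest.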


GPS Without GPS

The F3 has no GPS. But my phone does — and I take reference photos throughout a session. The pipeline solves this by:

  1. Scanning all photos for EXIF GPS coordinates and timestamps
  2. Building a timestamp → GPS map
  3. For each WAV file, deriving an absolute datetime from the F3 filename (HHMMSS) + folder date (YYMMDD)
  4. Finding the nearest photo timestamp and inheriting its GPS coordinates

It’s not survey-grade accuracy, but it’s accurate enough to place a recording within metres, which is plenty for field-note purposes.
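The steps above fit in a couple of functions. A sketch using the naming conventions from the post (the function names are illustrative; the real code lives in f3_parser.py and photo_align.py):

```python
from bisect import bisect_left
from datetime import datetime


def wav_datetime(folder_name: str, wav_name: str) -> datetime:
    """Folder 'YYMMDD-Location-Detail-Place' + file 'HHMMSS_NNNN.WAV' → datetime."""
    date = folder_name.split("-")[0]   # e.g. '250314'
    time = wav_name.split("_")[0]      # e.g. '073045'
    return datetime.strptime(date + time, "%y%m%d%H%M%S")


def inherit_gps(t: datetime, photo_fixes: dict[datetime, tuple[float, float]]):
    """Return the GPS fix of the photo whose timestamp is closest to t."""
    times = sorted(photo_fixes)
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be nearest
    candidates = times[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda p: abs((p - t).total_seconds()))
    return photo_fixes[nearest]
```

A refinement worth considering later is interpolating between the two bracketing photos rather than snapping to the nearest one, but for walking-pace sessions nearest-photo is already within metres.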


The Container Stack

    Container            GPU   Purpose
    ollama               yes   LLM inference for field note cleanup
    birdnet-annotations  yes   Bird species detection
    faster-whisper       yes   Transcribe spoken field notes
    pipeline-runner      no    Orchestration, EXIF, weather, report
    fram-web             no    Web UI

All containers run on the NAS. No cloud. No subscription. The model weights live on local storage.


Reuse from framai

fram-harness shares utility code with framai, an earlier project in the same space:

  • EXIF extraction and GPS clustering (utils/exif.py)
  • Weather lookups via open-meteo with caching (weather.py)
  • Audio header/footer extraction for transcription (utils/audio.py)
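The weather lookup only needs coordinates and a date. A sketch of building the open-meteo historical request (endpoint and parameter names are from the public archive API; the caching wrapper in weather.py is omitted):

```python
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def weather_request_url(lat: float, lon: float, date: str) -> str:
    """Build the open-meteo historical-weather request for one day.

    `date` is ISO 'YYYY-MM-DD'; the hourly values in the response can then be
    matched against the session's recording timestamps.
    """
    params = {
        "latitude": round(lat, 4),     # ~11 m precision is plenty here
        "longitude": round(lon, 4),
        "start_date": date,
        "end_date": date,
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"
```

Because the URL is a pure function of (place, day), it doubles as the cache key, which is how the caching in weather.py stays trivial.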

This is the advantage of keeping personal projects in a coherent ecosystem — code written for one context finds a second life in another.


What’s Next

The pipeline design is complete. What remains is building it:

  • pipeline/f3_parser.py — F3 filename → absolute datetime
  • pipeline/photo_align.py — GPS inheritance from nearest photo
  • pipeline/stages/ — each analysis stage
  • webapp/ — Flask session browser and form
  • Docker images for faster-whisper and the pipeline runner
  • Ductile webhook configuration

The first real test will be the next field session. The report writes itself, or it doesn’t — no middle ground.


On Automation and Field Work

There’s an argument that automating the analysis removes something from the practice — that the act of reviewing recordings by hand is part of learning to listen. I’ve thought about this.

My counter: I wasn’t reviewing them at all. The gap between session and analysis was producing nothing. An automated first pass that gets data into Obsidian is infinitely better than a manual process I don’t do. The recordings get heard either way; this just makes the threshold for hearing them lower.

The interesting moments still require human ears. fram-harness just makes sure I get to those moments.

And I do still review manually — opening sessions in Audacity, listening through, pulling up the spectrogram view to hunt for signatures I don’t recognise. That part I actually enjoy. The pipeline handles the cataloguing so the time I do spend with the recordings is the interesting part, not the bookkeeping.