# Quill

**Voice → text relay for [Spellbook](#where-quill-fits).** Quill captures the
microphone, decides when you've finished talking, transcribes the utterance
locally, and streams the text over Server-Sent Events. It is the "ears that
write things down" — no cloud, everything on-device.

```
HydraVoice  ──▶   Quill   ──▶   Spellbook
iPad voice        mic → VAD →     LLM agent +
to the box        ASR → SSE text  actions + TTS
```

---

## Instructions for Claude Code

1. Create a repo named `quill` and add the files defined in [Files](#files):
   `go.mod`, `main.go`, `.gitignore`. Use this document as `README.md`.
2. Commit and publish with the GitHub CLI:
   ```bash
   git init
   git add .
   git commit -m "Quill: local voice→text SSE relay for Spellbook"
   gh repo create quill --public --source=. --remote=origin --push
   ```
3. Do **not** commit models or built binaries — `.gitignore` already excludes
   them. Models are downloaded by the user per [Prerequisites](#prerequisites).
4. Add a `LICENSE` (MIT) if you like; the code is stdlib-only, so there are no
   third-party Go dependencies to account for.

---

## What it does

A small Go program that:

1. Runs sherpa-onnx's prebuilt **mic + VAD + ASR** binary as a subprocess.
   - **VAD** (Silero) decides when an utterance starts and stops.
   - **ASR** (Parakeet on GPU, or Moonshine on CPU) turns each utterance into text.
2. Catches each finished transcript line.
3. Streams those lines to any subscriber over **SSE** at `GET /transcript`,
   bound to `127.0.0.1` so nothing off-box can reach it.

The output is per-utterance, not character-by-character: you speak, you pause,
one line of text appears. That rhythm is what the downstream agent wants.

## Design note (why it's so thin)

Quill does **not** implement audio capture, VAD, or speech recognition itself.
sherpa-onnx already ships a prebuilt binary that does all three — think of it as
the "ffmpeg of on-device speech." Quill is just glue: run that binary, fan its
output out over a stream. The Go code is stdlib-only and stays constant even
when you change the underlying model or hardware.
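
The fan-out half of that glue comes down to a non-blocking send per subscriber, so one slow reader never stalls the transcript pump. A minimal sketch of the idea in isolation follows; `fanOut` is a hypothetical standalone helper that mirrors what `hub.broadcast` does, without the mutex or subscriber map.

```go
package main

import "fmt"

// fanOut offers line to every subscriber without blocking.
// It returns how many subscribers accepted it; full channels are skipped.
func fanOut(subs []chan string, line string) int {
	delivered := 0
	for _, ch := range subs {
		select {
		case ch <- line:
			delivered++
		default: // subscriber is backed up: drop the line rather than wait
		}
	}
	return delivered
}

func main() {
	fast := make(chan string, 2) // room for two pending lines
	slow := make(chan string, 1) // room for one
	subs := []chan string{fast, slow}

	fmt.Println(fanOut(subs, "first"))  // prints 2: both have room
	fmt.Println(fanOut(subs, "second")) // prints 1: slow is full and is skipped
}
```

The trade-off is that a subscriber that falls behind loses utterances instead of receiving them late. For a live voice relay that is usually the preferable failure mode.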

## Where Quill fits

| Component   | Role                                             | Tech                         |
|-------------|--------------------------------------------------|------------------------------|
| HydraVoice  | Carries iPad voice to the box (appears as a mic) | (existing, external)         |
| **Quill**   | Mic → VAD → ASR → SSE text stream                | Go stdlib + sherpa-onnx      |
| Spellbook   | Reads the stream, runs the LLM, acts, speaks back| Go + Ollama + TTS (separate) |

## Deployment targets

The Go code is identical on both targets; only the sherpa build, provider, and
model flags differ.

| Target              | sherpa build | Provider | Model         |
|---------------------|--------------|----------|---------------|
| Windows + NVIDIA    | CUDA         | `cuda`   | Parakeet TDT  |
| Raspberry Pi / ARM  | ARM (CPU)    | `cpu`    | Moonshine     |

Parakeet is GPU-bound and extremely fast on CUDA. Moonshine is CPU-first and
purpose-built for edge latency — it runs comfortably on a Pi. Swapping between
them is a flag change in `runSTT()`, never a code rewrite (see
[Swapping the model](#swapping-the-model)).

## Prerequisites

1. **sherpa-onnx prebuilt tools** on `PATH` (Quill runs
   `sherpa-onnx-vad-microphone-offline-asr`):
   - Windows + NVIDIA: the **CUDA** build (also needs CUDA toolkit + cuDNN).
   - Pi / ARM: the **aarch64** build.
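2. **Model files** under `models/`, laid out the way `runSTT()` looks for them:
   - `models/silero_vad.onnx` (both targets)
   - Parakeet TDT export in `models/parakeet/`: `encoder.onnx`, `decoder.onnx`,
     `joiner.onnx`, `tokens.txt`
   - or Moonshine in `models/moonshine/`: `preprocess.onnx`, `encode.onnx`,
     `uncached_decode.onnx`, `cached_decode.onnx`, `tokens.txt`
   - Get these from the k2-fsa / sherpa-onnx model releases. If the downloaded
     file names differ, rename them or adjust the paths in `runSTT()`.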
3. **Go 1.22+**.

## Build & run

```bash
go build -o quill        # quill.exe on Windows
./quill
```

Then watch the stream live in another shell:

```bash
curl -N http://127.0.0.1:8137/transcript
```

Talk into the iPad → text events appear in the curl window.

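> **First-run tweak:** the `transcript()` function ships as a passthrough that
> only trims whitespace, so every non-empty stdout line from the sherpa binary
> is streamed as-is. Run the binary once, look at how it prints results, and
> adjust that one function to extract just the recognized text.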
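
As an illustration of what that tweak might look like, the sketch below assumes result lines shaped like `<index>: <text>` (for example `3: turn on the lights`) and treats everything else as noise. That shape is an assumption made for the example, not a documented sherpa-onnx format, so check it against what your build actually prints.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// transcript extracts the text from a line shaped like "<index>: <text>".
// ASSUMPTION: this line shape is hypothetical; match it to your sherpa output.
// Lines that do not fit the shape yield "" and are therefore not broadcast.
func transcript(line string) string {
	prefix, text, ok := strings.Cut(line, ":")
	if !ok {
		return "" // no separator at all
	}
	if _, err := strconv.Atoi(strings.TrimSpace(prefix)); err != nil {
		return "" // prefix is not a number, so this is not a result line
	}
	return strings.TrimSpace(text)
}

func main() {
	samples := []string{" 3: turn on the lights", "Started! Please speak", ""}
	for _, line := range samples {
		fmt.Printf("%q -> %q\n", line, transcript(line))
	}
}
```

Only the first colon is treated as the separator, so text that itself contains a colon (a time such as `12:30`, say) is kept intact.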

## Swapping the model

In `runSTT()`, replace the Parakeet/CUDA flags with Moonshine/CPU:

```go
cmd := exec.Command("sherpa-onnx-vad-microphone-offline-asr",
    "--silero-vad-model="+modelDir+"/silero_vad.onnx",
    "--tokens="+modelDir+"/moonshine/tokens.txt",
    "--moonshine-preprocessor="+modelDir+"/moonshine/preprocess.onnx",
    "--moonshine-encoder="+modelDir+"/moonshine/encode.onnx",
    "--moonshine-uncached-decoder="+modelDir+"/moonshine/uncached_decode.onnx",
    "--moonshine-cached-decoder="+modelDir+"/moonshine/cached_decode.onnx",
    "--provider=cpu",
)
```

Match the exact flag names to sherpa's Moonshine example for your version.
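
If you switch between targets often, one option is to build the argument list from a model name instead of editing `runSTT()` each time. The sketch below is one way to do that, not part of the shipped `main.go`. `sherpaArgs`, `envOr`, and the `QUILL_MODEL` / `QUILL_PROVIDER` variable names are hypothetical; the flag names and file paths are the ones this README already uses.

```go
package main

import (
	"fmt"
	"os"
)

// sherpaArgs builds the flag list for one model family.
// The selector values "parakeet" and "moonshine" are hypothetical names
// chosen for this sketch; the flags themselves come from this README.
func sherpaArgs(model, provider, modelDir string) ([]string, error) {
	args := []string{"--silero-vad-model=" + modelDir + "/silero_vad.onnx"}
	switch model {
	case "parakeet":
		dir := modelDir + "/parakeet"
		args = append(args,
			"--tokens="+dir+"/tokens.txt",
			"--encoder="+dir+"/encoder.onnx",
			"--decoder="+dir+"/decoder.onnx",
			"--joiner="+dir+"/joiner.onnx",
			"--model-type=nemo_transducer",
		)
	case "moonshine":
		dir := modelDir + "/moonshine"
		args = append(args,
			"--tokens="+dir+"/tokens.txt",
			"--moonshine-preprocessor="+dir+"/preprocess.onnx",
			"--moonshine-encoder="+dir+"/encode.onnx",
			"--moonshine-uncached-decoder="+dir+"/uncached_decode.onnx",
			"--moonshine-cached-decoder="+dir+"/cached_decode.onnx",
		)
	default:
		return nil, fmt.Errorf("unknown model %q", model)
	}
	return append(args, "--provider="+provider), nil
}

// envOr returns the environment variable's value, or fallback if it is unset.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	args, err := sherpaArgs(envOr("QUILL_MODEL", "parakeet"), envOr("QUILL_PROVIDER", "cuda"), "models")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, a := range args {
		fmt.Println(a)
	}
}
```

In `runSTT()` the result would be passed as `exec.Command("sherpa-onnx-vad-microphone-offline-asr", args...)`, which keeps the model choice out of the compiled binary.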

## Roadmap

- [ ] Spellbook consumes `GET /transcript`, runs Ollama with tool calling.
- [ ] Actions: shell/scripts, app & window control, HTTP, files, smart home.
- [ ] TTS reply (Piper, or Kyutai TTS for low-latency streaming).
- [ ] Optional: config via env/flags so model + provider need no recompile.

---

## Files

### `go.mod`

```
module quill

go 1.22
```

### `main.go`

```go
// Quill — the voice→text relay for Spellbook.
//
// Runs sherpa-onnx's prebuilt mic+VAD+ASR binary (Parakeet on CUDA), catches
// each finished utterance, and streams it to subscribers over Server-Sent
// Events at GET /transcript. HydraVoice carries voice in; Quill turns it into a
// text stream; Spellbook reads that stream and decides to answer or act.
//
// Stdlib only — nothing in go.mod.
//
//   go build -o quill.exe && quill.exe
//   # in another shell, watch it live:
//   curl -N http://127.0.0.1:8137/transcript

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/exec"
	"strings"
	"sync"
)

const (
	modelDir = "models"
	addr     = "127.0.0.1:8137" // localhost only
)

// hub fans each transcript line out to every connected subscriber.
type hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func newHub() *hub { return &hub{subs: map[chan string]struct{}{}} }

func (h *hub) add() chan string {
	ch := make(chan string, 16)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

func (h *hub) remove(ch chan string) {
	h.mu.Lock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
	h.mu.Unlock()
}

func (h *hub) broadcast(line string) {
	h.mu.Lock()
	for ch := range h.subs {
		select {
		case ch <- line:
		default: // drop for a backed-up subscriber rather than block the mic
		}
	}
	h.mu.Unlock()
}

func (h *hub) handleSSE(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
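	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("Connection", "keep-alive")
	// Send the headers now so clients see the stream open before the first
	// utterance instead of waiting on a silent connection.
	flusher.Flush()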

	ch := h.add()
	defer h.remove(ch)

	for {
		select {
		case <-r.Context().Done():
			return
		case line, ok := <-ch:
			if !ok {
				return
			}
			fmt.Fprintf(w, "data: %s\n\n", line)
			flusher.Flush()
		}
	}
}

// runSTT launches the sherpa mic tool and pumps its transcripts into the hub.
func runSTT(h *hub) error {
	cmd := exec.Command("sherpa-onnx-vad-microphone-offline-asr",
		"--silero-vad-model="+modelDir+"/silero_vad.onnx",
		"--tokens="+modelDir+"/parakeet/tokens.txt",
		"--encoder="+modelDir+"/parakeet/encoder.onnx",
		"--decoder="+modelDir+"/parakeet/decoder.onnx",
		"--joiner="+modelDir+"/parakeet/joiner.onnx",
		"--model-type=nemo_transducer",
		"--provider=cuda",
	)
	cmd.Stderr = os.Stderr // sherpa's own logs
	out, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start sherpa tool (on PATH?): %w", err)
	}
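	go func() {
		sc := bufio.NewScanner(out)
		for sc.Scan() {
			if text := transcript(sc.Text()); text != "" {
				log.Printf("> %s", text)
				h.broadcast(text)
			}
		}
		// Surface read and exit errors; otherwise a dead sherpa process
		// leaves Quill serving a silent stream with no explanation.
		if err := sc.Err(); err != nil {
			log.Printf("reading sherpa output: %v", err)
		}
		if err := cmd.Wait(); err != nil {
			log.Printf("sherpa tool exited: %v", err)
		}
	}()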
	return nil
}

// Pull the recognized text out of a stdout line. Run the binary once, see how it
// prints results, and tweak this one function to match.
func transcript(line string) string {
	return strings.TrimSpace(line)
}

func main() {
	h := newHub()
	if err := runSTT(h); err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/transcript", h.handleSSE)
	log.Printf("Quill: mic live, streaming transcripts at http://%s/transcript", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}
```

### `.gitignore`

```
# built binaries
quill
quill.exe
*.exe

# downloaded models (large; fetched per README)
models/
```