# Pitfalls Research
**Domain:** Kotlin Multiplatform + Compose Multiplatform (iOS-primary), Ktor/Exposed/Postgres, OIDC, LWW delta sync
**Researched:** 2026-04-23
**Confidence:** HIGH for KMP/Ktor/Exposed gotchas; MEDIUM for Haze + Navigation-CMP specifics (behavior shifts across minor versions)
## Critical Pitfalls
### Pitfall 1: Kotlin/Native iOS GC thrashing and `objcDisposeOnMain` hangs
**What goes wrong:** On-device (especially iPhone XR/11) the app consumes 300-700 MB steadily and freezes for 1-2 s under ViewModel churn. Flamegraphs show GC threads at >100% CPU.
**Why:** The K/N memory manager dispatches Obj-C release to the main thread by default, serializing teardown behind UI frames. Compose/Koin graphs produce many bridged Obj-C references per navigation.
**Warning signs:** Frame hitches on tab switches; main-thread time in `objc_release` / `Kotlin_ObjCExport_releaseReservedObjectTail`; Instruments shows growing K/N heap.
**How to avoid:** Set `kotlin.native.binary.objcDisposeOnMain=false` and `kotlin.native.binary.gc=cms` in `gradle.properties` from day 1. Release Kotlin refs in `onDispose`; don't hold them in long-lived Swift closures.
**Phase:** UI chrome.
---
### Pitfall 2: Legacy `freeze()` / strict-mm ceremony in copy-pasted snippets
**What goes wrong:** Code from 2021-2022 tutorials adds `freeze()`, `@SharedImmutable`, `AtomicReference` from `kotlin.native.concurrent`, or `ensureNeverFrozen()`. It compiles on Kotlin 2.x but adds dead code and masks real bugs.
**Why:** The new memory manager removed the freeze paradigm entirely; `freeze()` is a no-op and deprecated.
**Warning signs:** Any of the above symbols appearing in snippets you're about to paste.
**How to avoid:** Reject pre-1.7.20 KMP code. Use `kotlinx.atomicfu` if you truly need atomics; StateFlow is already thread-safe.
**Phase:** Data.
---
### Pitfall 3: `ComposeUIViewController` state loss on iOS re-entry
**What goes wrong:** Backgrounding then returning resets scroll positions, selected tabs, half-filled forms. Koin-scoped ViewModels re-create.
**Why:** If the `UIViewController` is instantiated inside a SwiftUI `body`, each re-render builds a fresh composition. Compose state is owned by the controller's composition root.
**Warning signs:** State survives Android rotation but dies on iOS foreground-return; ViewModel `init` fires on backgrounded return.
**How to avoid:** Build the `UIViewController` **once** — store in `@StateObject` or a top-level property, not in a SwiftUI `body`. Use `rememberSaveable` for any UI state that must survive process death. Never nest multiple `ComposeUIViewController` wrappers.
**Phase:** UI chrome.
---
### Pitfall 4: SQLDelight iOS — missing migration files, in-memory vs file driver divergence
**What goes wrong:** JVM tests pass with in-memory driver; the iOS app crashes on launch with `no such column` after a schema change.
**Why:** `NativeSqliteDriver` persists a real file. Editing `.sq` without a numbered `.sqm` migration and a bumped schema `version` means SQLDelight only *verifies* the schema on open — on a device with an existing install, that check fails.
**Warning signs:** Works on fresh simulator install; breaks on physical device with prior install; Android OK, iOS fails.
**How to avoid:** Every schema change gets a numbered `.sqm` migration (`1.sqm`, `2.sqm`, and so on). Enable `verifyMigrations = true` and `verifyDefinitions = true`. Add a dev-only "wipe DB" debug button during early development. Reinstall on device before any QA.
**Phase:** Data.
---
### Pitfall 5: Exposed `transaction {}` inside suspend functions → pool exhaustion
**What goes wrong:** Plain `transaction { ... }` in Ktor handlers. Under modest concurrency (~20 requests) the pool exhausts, p99 cliffs, and `IllegalStateException: Transaction is not currently active` appears.
**Why:** `transaction {}` is blocking and binds the transaction to the calling thread. In a coroutine it blocks event-loop threads; if the code suspends mid-transaction, resume lands on a different thread and loses the JDBC connection binding.
**Warning signs:** Connection pool always fully leased at low RPS; latency cliffs; "transaction not active" in logs.
**How to avoid:** Use `newSuspendedTransaction(Dispatchers.IO) { ... }` in suspend contexts. Pass the `Database` instance explicitly. No HTTP calls inside transactions. A HikariCP pool size of 8-10 is plenty for 5-10 users.
**Phase:** Data.
---
### Pitfall 6: Exposed DAO + JSONB footguns
**What goes wrong:** `IntEntity` + `jsonb<T>()` produces double-serialized JSON in Postgres (`"{\"key\":\"v\"}"`) or `SerializationException` on read.
**Why:** DAO integration with JSONB is thin; it's easy to store a pre-stringified value. DAO lazy-loads hide *when* the column is read, so failures manifest far from the cause.
**Warning signs:** Escaped JSON in `psql` output; serialization errors deep in read paths.
**How to avoid:** Use DSL only (already locked in PROJECT.md). For JSONB, define `jsonb("extras", Json.Default, MealExtras.serializer())` once; never stringify upstream. Round-trip integration test per JSONB column.
**Phase:** Data.
---
### Pitfall 7: Ktor JWT — audience, issuer, clock skew, JWKS cache
**What goes wrong:** 401s in production only, after a while, or after Authentik restart. Messages: "Token can't be used before...", "Claim 'aud' doesn't contain required audience", or silent 401s post key-rotation.
**Why:** Four defaults converge:
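1. `ktor-server-auth-jwt` validates `iss` and `aud` only through explicit `.withIssuer()` / `.withAudience()` calls on the verifier; nothing is inferred from the provider.
2. Default clock leeway is **zero**: 2 s of device drift rejects fresh tokens.
3. JWKS caching hides key rotation: the widely copied `cached(10, 24, TimeUnit.HOURS)` setup can keep serving a stale key set for hours.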
4. Authentik's `aud` can be array or string depending on provider config.
**Warning signs:** 401 only in prod; 401 only on some devices; works briefly then fails; 401 after Authentik restart.
**How to avoid:** Configure `.withIssuer(issuer).withAudience(clientId).acceptLeeway(30)`. JWKS provider with `.cached(10, 15, MINUTES).rateLimited(10, 1, MINUTES)`. In Authentik, emit `aud` as a single client_id string. Integration test: wrong `aud` → 401.
**Phase:** Auth.
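
To make the leeway and `aud` points concrete, here is a minimal stdlib-only sketch of the two checks. The function names are hypothetical and this is not how Ktor or java-jwt implement validation internally; it only illustrates why a zero leeway rejects a token from a device whose clock runs 2 s ahead, and why a verifier should tolerate `aud` arriving as either a string or an array.

```kotlin
// Hypothetical helpers for illustration only; not part of Ktor or java-jwt.

/** True if nowSec lies inside [notBeforeSec - leewaySec, expiresAtSec + leewaySec). Epoch seconds. */
fun isWithinValidity(nowSec: Long, notBeforeSec: Long, expiresAtSec: Long, leewaySec: Long): Boolean =
    nowSec >= notBeforeSec - leewaySec && nowSec < expiresAtSec + leewaySec

/** Accepts an `aud` claim that is either a single string or a collection of strings. */
fun audienceContains(aud: Any?, expected: String): Boolean = when (aud) {
    is String -> aud == expected
    is Collection<*> -> aud.any { it == expected }
    else -> false
}
```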
---
### Pitfall 8: OIDC redirect URI mismatch + missing PKCE
**What goes wrong:** "redirect_uri does not match" or consent loop on one platform; or login succeeds without PKCE and is interceptable.
**Why:** Native apps are *public* clients — no shippable secret, so Authentik requires PKCE. Redirect URIs must match byte-for-byte (trailing slash, case). iOS uses a custom URL scheme or Universal Link; Android uses an intent-filter. Debug and release builds can differ.
**Warning signs:** Works on Android, fails on iOS (or vice versa); Authentik logs show `invalid_grant`; no `code_challenge` in auth request; fails on release build only.
**How to avoid:** Authentik provider = "Public" + PKCE S256. Register both `recipe://callback` and `recipe://callback/`. AppAuth (Android) + ASWebAuthenticationSession (iOS) with `usePKCE = true`. Keep the redirect URI in one constant in `shared/commonMain`.
**Phase:** Auth.
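
For checking that the auth request really carries a valid S256 challenge, a small JVM-only sketch of the PKCE derivation can help in a server-side or Android test. It relies on `java.security` and `java.util.Base64`, so it would not compile in `commonMain` as written. The helper names are hypothetical, and the exact-match function only illustrates the byte-for-byte comparison described above.

```kotlin
import java.security.MessageDigest
import java.util.Base64

/** PKCE S256: base64url(SHA-256(ASCII(code_verifier))) without padding (RFC 7636). */
fun pkceChallengeS256(codeVerifier: String): String {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest(codeVerifier.toByteArray(Charsets.US_ASCII))
    return Base64.getUrlEncoder().withoutPadding().encodeToString(digest)
}

/** Redirect URIs are compared exactly: a trailing slash or a case change is a mismatch. */
fun redirectUriAllowed(registered: Set<String>, actual: String): Boolean = actual in registered
```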
---
### Pitfall 9: LWW trusting client clocks
**What goes wrong:** User A's phone clock is 90 s fast; A's edit beats B's real-time-later edit in LWW. B's change silently disappears.
**Why:** Client-assigned timestamps trust unverifiable clocks. Even NTP-synced devices drift; simulators can be minutes off.
**Warning signs:** "My edit vanished"; stable prior state reappears; most common with both household members editing the same meal.
**How to avoid:** Server assigns `updated_at` on every write (already in PROJECT.md — enforce it). Client sends only content + prior `updated_at` for optimistic concurrency. Server sets `updated_at = now()` in the transaction and returns it. Make timestamps strictly monotonic per row (e.g. `GREATEST(now(), old.updated_at + interval '1 microsecond')`) to avoid tie collisions.
**Phase:** Sync.
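
The server-side rule can be sketched in plain Kotlin, assuming timestamps are epoch microseconds. `MealRow`, `WriteResult`, and `applyWrite` are hypothetical names used only for illustration; the real write would happen inside the database transaction. `nextUpdatedAt` mirrors the `GREATEST(now(), old.updated_at + 1 microsecond)` expression, and the client contributes only content plus the `updated_at` it last saw.

```kotlin
/** Strictly monotonic per-row timestamp in epoch microseconds. */
fun nextUpdatedAt(nowMicros: Long, previousMicros: Long?): Long =
    if (previousMicros == null) nowMicros else maxOf(nowMicros, previousMicros + 1)

data class MealRow(val id: String, val content: String, val updatedAtMicros: Long)

sealed interface WriteResult {
    data class Accepted(val row: MealRow) : WriteResult
    data class Conflict(val current: MealRow) : WriteResult
}

/**
 * Optimistic-concurrency write: the client never supplies the new timestamp.
 * A stale priorUpdatedAtMicros yields a Conflict carrying the server's current row.
 */
fun applyWrite(
    current: MealRow?,
    id: String,
    content: String,
    priorUpdatedAtMicros: Long?,
    serverNowMicros: Long,
): WriteResult {
    if (current != null && current.updatedAtMicros != priorUpdatedAtMicros) {
        return WriteResult.Conflict(current)
    }
    val stamped = nextUpdatedAt(serverNowMicros, current?.updatedAtMicros)
    return WriteResult.Accepted(MealRow(id, content, stamped))
}
```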
---
### Pitfall 10: Soft-delete + recreate race
**What goes wrong:** Delete a meal entry, immediately re-add "the same" one. Depending on pull ordering, the new row is hidden by the tombstone, or the old row is resurrected with old fields.
**Why:** If `(plan_date, slot)` is treated as identity, tombstone/recreate races are inevitable on concurrent 2-user editing.
**Warning signs:** Undeleted items; deleted meals reappear on partner's device; duplicates in pantry.
**How to avoid:** Identity is always a fresh UUID per row, never `(date, slot)`. Tombstones carry their own `updated_at`. Pull returns tombstones and live rows; client applies in `updated_at` order. Per-client push outbox replays in local sequence order — never parallel. Integration test: two clients alternating delete/recreate, assert convergence.
**Phase:** Sync.
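
A minimal in-memory sketch of the client-side apply step, under the assumptions above: every row has its own UUID-like `id`, tombstones carry an `updated_at`, and pulled rows are applied in `updated_at` order. `PulledRow`, `applyPull`, and `visibleRows` are hypothetical names. Because the recreated entry has a new id, the tombstone for the old id can never hide it, and a late replay of the old live row cannot resurrect it.

```kotlin
data class PulledRow(
    val id: String,
    val updatedAtMicros: Long,
    val deleted: Boolean,   // true marks a tombstone
    val title: String?,
)

/** Applies pulled rows in (updated_at, id) order; a row only wins if it is newer than the local copy. */
fun applyPull(local: Map<String, PulledRow>, pulled: List<PulledRow>): Map<String, PulledRow> {
    val result = local.toMutableMap()
    for (row in pulled.sortedWith(compareBy({ it.updatedAtMicros }, { it.id }))) {
        val existing = result[row.id]
        if (existing == null || row.updatedAtMicros > existing.updatedAtMicros) {
            result[row.id] = row
        }
    }
    return result
}

/** Rows the UI should show: everything that is not a tombstone. */
fun visibleRows(rows: Map<String, PulledRow>): List<PulledRow> =
    rows.values.filter { !it.deleted }.sortedBy { it.id }
```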
---
### Pitfall 11: Pull-cursor edge cases — missed updates, same-timestamp ties
**What goes wrong:** Partner edits at 14:00:05; client's last pull cursor is `14:00:04.999`. If cursor semantics or timestamp precision are wrong, the change is skipped forever.
**Why:** Cursor semantics are subtle. Second-precision timestamps, `>=` instead of `>`, and ties among rows sharing a `updated_at` all cause skipped or replayed rows. Debounced push interleaved with pull can reorder writes.
**Warning signs:** Sporadic stale data that vanishes after pull-to-refresh; only reproduces near DB restarts or bulk imports; duplicates after manual refresh.
**How to avoid:** `updated_at` is `timestamptz` with microsecond precision and strictly monotonic. Cursor is `(updated_at, id)` lexicographic: `WHERE (updated_at, id) > (:since_ts, :since_id) ORDER BY updated_at, id LIMIT N`. Pause pull while a push is in flight. Never split the write and its timestamp notification across transactions.
**Phase:** Sync.
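
The `(updated_at, id)` cursor can be illustrated with an in-memory stand-in for the SQL query. This is a sketch with hypothetical names (`SyncCursor`, `ChangedRow`, `pageAfter`, `pullAll`); ids are compared as plain strings here, whereas the real ordering is whatever Postgres applies to the id column, which is fine as long as the server is the only side that sorts. The point is that rows sharing one timestamp are neither skipped nor replayed across page boundaries.

```kotlin
data class SyncCursor(val updatedAtMicros: Long, val id: String) : Comparable<SyncCursor> {
    // Lexicographic comparison: timestamp first, id as tie-breaker.
    override fun compareTo(other: SyncCursor): Int =
        compareValuesBy(this, other, { it.updatedAtMicros }, { it.id })
}

data class ChangedRow(val id: String, val updatedAtMicros: Long)

/** Stand-in for: WHERE (updated_at, id) > (:since_ts, :since_id) ORDER BY updated_at, id LIMIT :limit */
fun pageAfter(rows: List<ChangedRow>, since: SyncCursor?, limit: Int): List<ChangedRow> =
    rows.sortedWith(compareBy({ it.updatedAtMicros }, { it.id }))
        .filter { since == null || SyncCursor(it.updatedAtMicros, it.id) > since }
        .take(limit)

/** Pulls page by page until the server returns an empty page. */
fun pullAll(rows: List<ChangedRow>, pageSize: Int): List<ChangedRow> {
    val out = mutableListOf<ChangedRow>()
    var cursor: SyncCursor? = null
    while (true) {
        val page = pageAfter(rows, cursor, pageSize)
        if (page.isEmpty()) return out
        out += page
        cursor = page.last().let { SyncCursor(it.updatedAtMicros, it.id) }
    }
}
```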
---
### Pitfall 12: Haze on scroll + nested children tank older iPhones
**What goes wrong:** LazyColumn scrolling under a blurred top bar stutters badly on iPhone XR/11, dropping to ~30 fps. Nesting `hazeChild` inside a list item sitting in a `hazeSource` Scaffold makes it worse.
**Why:** iOS Haze uses Skiko `GraphicsLayer` for offscreen capture + re-blur each frame. Progressive blur adds ~25% cost. Older A-series chips without hardware-accelerated RenderEffect equivalents jank under this load.
**Warning signs:** Smooth on simulator/M-series, choppy on iPhone 11; FPS 40-50; Skiko render thread pegged in Instruments.
**How to avoid:** One `hazeSource` per screen, never nested. Limit blur to chrome (tab bar, nav bar, sheet headers), not scrolling content. Avoid progressive blur on iOS pre-iPhone 13. Test on the oldest target device in real hardware. Feature-flag the effect with a solid-translucent fallback.
**Phase:** UI chrome.
---
### Pitfall 13: Navigation-CMP tabs — `when`-switch kills per-tab back stack
**What goes wrong:** Tabs implemented as `when (tab) { 0 -> RecipesScreen()... }`. Tapping into a detail, switching tabs, and returning loses the detail. System back exits the app instead of unwinding the tab.
**Why:** A `when` switch destroys the non-current tab's Compose tree. Jetpack Navigation's multi-back-stack requires either each tab as a destination in a parent NavHost, or per-tab nested `NavHost` instances, with `popUpTo(saveState) + restoreState + launchSingleTop`.
**Warning signs:** Deep-links don't restore; back from a nested screen jumps tabs; ViewModels re-created on tab switches.
**How to avoid:** One top-level `NavHost`; `navigation(route = "recipesGraph", ...)` block per tab. Bottom bar navigates: `popUpTo(graph.findStartDestination().id) { saveState = true }; launchSingleTop = true; restoreState = true`. Scope `koinViewModel()` to the destination's `NavBackStackEntry`, not the parent graph. Wasm deep-links are deferred per PROJECT.md.
**Phase:** UI chrome.
---
### Pitfall 14: Polish locale — plurals and timestamp zones
**What goes wrong:** "added 2 godzina temu" (wrong plural form; the correct form is "2 godziny temu"). Shopping items near midnight show on the wrong day across devices.
**Why:** Polish has four CLDR plural forms (one / few / many / other). Naive `if (n == 1)` handles at most two. Serializing `LocalDateTime` over the wire (instead of UTC `Instant`) produces zone/DST bugs.
**Warning signs:** Grammatically wrong Polish copy; yesterday's items shown as today's.
**How to avoid:** Use Compose Resources `<plurals>` with all four forms; call `pluralStringResource` with the count as the quantity. Wire format: `Instant` UTC ISO-8601 only; display: `.toLocalDateTime(TimeZone.currentSystemDefault())`. Unit test plurals with count 0/1/2/5/22.
**Phase:** UI chrome (i18n foundation).
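
To show why `if (n == 1)` is not enough, here is a stdlib-only sketch of the CLDR category selection for Polish whole numbers. The names are hypothetical; in the app the selection is done by Compose Resources, and this sketch is only useful as a reference in unit tests. For integers only `one`, `few`, and `many` occur; `other` covers fractional values and is left out here.

```kotlin
enum class PluralCategory { ONE, FEW, MANY }

/** CLDR plural category for a non-negative whole number in Polish. */
fun polishPluralCategory(count: Int): PluralCategory {
    require(count >= 0) { "count must be non-negative" }
    val mod10 = count % 10
    val mod100 = count % 100
    return when {
        count == 1 -> PluralCategory.ONE
        mod10 in 2..4 && mod100 !in 12..14 -> PluralCategory.FEW
        else -> PluralCategory.MANY
    }
}

/** Example label: "N hours ago" in Polish. */
fun hoursAgoLabel(count: Int): String {
    val noun = when (polishPluralCategory(count)) {
        PluralCategory.ONE -> "godzinę"
        PluralCategory.FEW -> "godziny"
        PluralCategory.MANY -> "godzin"
    }
    return "$count $noun temu"
}
```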
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Ad-hoc `psql` DDL, skipping Flyway | Fast schema iteration | Dev/prod drift; can't rebuild from scratch | Pre-first-deploy only; squash into `V1__init.sql` before real data |
| Hardcoded OIDC issuer/client_id in `shared/commonMain` | Avoids build-config plumbing | Can't run against staging Authentik; Authentik change forces rebuild | v1 single-environment only |
| Plain `transaction {}` in admin endpoints | Simpler mental model | Mixing blocking + suspend patterns leaks; eventually every endpoint wants suspend | Admin-only, single-user endpoints |
| Free-form `meal_entry.extras` JSONB without schema | Evolve without migrations | No DB validation; orphan fields accumulate; hard to query | Until extras shape stabilizes; then promote hot fields to columns |
| No indices until queries are slow | Faster early dev | p99 cliffs during sync; adding indices under load is risky | Until first data import; then index every `(household_id, updated_at)` |
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Authentik OIDC | Confidential client type with secret shipped in binary | Public client + PKCE S256; never ship `client_secret` |
| Authentik OIDC | Leaving default signing alg; Ktor JWT expects RS256 | Configure RS256 explicitly; verify `kid` resolves via JWKS |
| Haze + Scaffold | `hazeSource` on Scaffold root + `hazeChild` on a sheet both capturing | `hazeSource` on scrollable content only; chrome uses `hazeChild` |
| App Store / TestFlight | ATS exception to reach homelab self-signed cert | Real cert via Let's Encrypt + Caddy/Traefik; never ship ATS exceptions |
| Postgres JSONB | `WHERE extras->>'k' = 'v'` with no GIN index | `CREATE INDEX ... USING GIN (extras jsonb_path_ops)` once access patterns emerge |
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Pull sync without pagination | First-sync-after-seed hangs seconds | Cursor-paginate `LIMIT 200 ORDER BY updated_at, id` | >500 rows in any scoped table |
| Coil full-res images in recipe grid | Memory spikes, laggy scroll | Explicit thumbnail `Size`; memory+disk cache | >30 images on screen |
| Compose recomposition of entire calendar per edit | Calendar flashes on slot change; scroll resets | Stable IDs per slot; hoist per-slot state; `derivedStateOf` for totals | Any calendar with >7 days visible |
| Haze over full scrolling region | Jank on iPhone XR/11 | Blur chrome only, not content; fallback for old devices | A12/A13 devices (iPhone XR/11) on 60 Hz panels |
## Security Mistakes
| Mistake | Risk | Prevention |
|---|---|---|
| Missing `WHERE household_id = :caller_household` on reads | Cross-household data leak | All scoped reads go through a `HouseholdScope` helper; review rule: no raw `selectAll()` on scoped tables |
| Trusting client-supplied `household_id` in request body | Tenancy bypass via crafted POST | Derive `household_id` from JWT `sub` → `memberships`; ignore body's value |
| Logging the `Authorization` header in Ktor `CallLogging` | Tokens leak to log files → account compromise | Custom log filter redacting `Authorization`; never `log.info(token)` |
| Storing OIDC refresh token in plain prefs | Local/backup exposure | `multiplatform-settings` with Keychain (iOS) / EncryptedSharedPreferences (Android) backends |
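
The first two rows of the table boil down to one rule: the household comes from the authenticated subject, never from the request. A minimal in-memory sketch, with hypothetical names (`Membership`, `ShoppingItem`, `scopedItems`) that stand in for the real tables and for whatever the project's `HouseholdScope` helper ends up looking like:

```kotlin
data class Membership(val userSub: String, val householdId: String)
data class ShoppingItem(val householdId: String, val name: String)

/**
 * Returns only rows of the caller's household. The household id is resolved from the
 * JWT subject via memberships; claimedHouseholdId (from the request body) is ignored on purpose.
 */
fun scopedItems(
    all: List<ShoppingItem>,
    memberships: List<Membership>,
    jwtSub: String,
    claimedHouseholdId: String?,
): List<ShoppingItem> {
    val householdId = memberships.firstOrNull { it.userSub == jwtSub }?.householdId
        ?: return emptyList() // unknown subject: no data at all
    return all.filter { it.householdId == householdId }
}
```

A cross-household test then only needs to call this with household B's subject and household A's id in the body, and assert that no household A rows come back.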
## "Looks Done But Isn't" Checklist
- [ ] **Auth:** Login works — verify token refresh runs before expiry (set Authentik access-token lifetime to 5 min in dev; watch for silent 401s)
- [ ] **Sync:** Pull works — verify tombstones propagate (delete on A, confirm gone on B after pull, not just after push)
- [ ] **Sync:** Offline writes survive app kill + relaunch + reconnect — not just a warm resume
- [ ] **Household isolation:** Log in as household B; hit every endpoint; assert zero household A rows returned
- [ ] **SQLDelight migrations:** Install prior release, launch once, upgrade in place; confirm no crash, no data loss
- [ ] **Polish plurals:** Open every screen with counts 0, 1, 2, 5, 22; verify grammar
- [ ] **Haze performance:** Test on oldest supported device (iPhone XR/11) scrolling a full screen; not just simulator
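
For the token-refresh item, the refresh-ahead decision can be sketched as a pure function that is easy to unit test. The name and the 60 s default margin are assumptions for illustration, not values taken from the project.

```kotlin
/** Hypothetical helper: true once the access token is within marginSec of expiry (epoch seconds). */
fun shouldRefreshToken(nowSec: Long, expiresAtSec: Long, marginSec: Long = 60): Boolean =
    nowSec >= expiresAtSec - marginSec
```

With a 5 min access-token lifetime in dev, a token issued at second 0 would be refreshed from second 240 onward under this margin.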
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---|---|---|
| K/N GC thrash; `objcDisposeOnMain` | UI chrome (infra) | Gradle property set; Instruments shows no GC-main domination |
| Legacy `freeze()` ceremony | Data | Code search for `freeze(`, `@SharedImmutable` returns empty |
| UIViewController re-creation | UI chrome | State survives background/foreground cycle |
| SQLDelight missing migration | Data | Prior-build → new-build upgrade test on real device |
| Blocking Exposed transaction in suspend | Data | No `transaction {` in suspend paths; 50-concurrent-request load test with pool size 10 |
| DAO + JSONB | Data | No `exposed.dao.*` imports; per-JSONB-column round-trip test |
| JWT aud/iss/leeway/JWKS | Auth | Wrong-aud → 401; 30 s skew → 200; JWKS refreshes within 15 min |
| OIDC redirect URI / PKCE | Auth | Flow passes on iOS *and* Android; Authentik logs show `code_challenge` per request |
| LWW client-clock trust | Sync | All writes set `updated_at` server-side; clients never send it |
| Soft-delete recreate race | Sync | Two-client alternating delete/recreate converges |
| Pull-cursor edge cases | Sync | Cursor is `(updated_at, id)` lexicographic; same-timestamp test |
| Haze scroll jank | UI chrome | iPhone 11 real-device FPS >55 on recipe grid scroll |
| Nested NavHost / multi-back-stack | UI chrome | Tab switch preserves deep state; system back unwinds within tab |
| Polish plurals / timestamps | UI chrome | Plural unit tests pass; wire format is UTC-only |
| Household tenancy bypass | Auth + Sync | Cross-household read test asserts empty result sets |
## Sources
- [Kotlin/Native memory management](https://kotlinlang.org/docs/native-memory-manager.html) (HIGH)
- [Compose Multiplatform for iOS Stable, 2025](https://www.kmpship.app/blog/compose-multiplatform-ios-stable-2025) (MEDIUM)
- [Haze 1.0 release notes — Chris Banes](https://chrisbanes.me/posts/haze-1.0/) (HIGH)
- [Haze Platforms documentation](https://chrisbanes.github.io/haze/latest/platforms/) (HIGH)
- [Navigation in Compose Multiplatform — JetBrains](https://kotlinlang.org/docs/multiplatform/compose-navigation.html) (HIGH)
- [Bottom Nav + Nested Navigation guide](https://saurabhjadhavblogs.com/jetpack-compose-bottom-navigation-nested-navigation-solved) (MEDIUM)
- [Exposed — Working with Transactions](https://www.jetbrains.com/help/exposed/transactions.html) (HIGH)
- [Exposed — JSON/JSONB types](https://www.jetbrains.com/help/exposed/json-and-jsonb-types.html) (HIGH)
- [Exposed — Breaking Changes](https://www.jetbrains.com/help/exposed/breaking-changes.html) (HIGH)
- Community-known K/N + KMP gotchas synthesized from training + surrounding sources (MEDIUM)
---
*Pitfalls research for: Kotlin Multiplatform recipe/meal-planning app with self-hosted Ktor + Postgres + Authentik backend*
*Researched: 2026-04-23*