docs: pitfalls research
.planning/research/PITFALLS.md
# Pitfalls Research

**Domain:** Kotlin Multiplatform + Compose Multiplatform (iOS-primary), Ktor/Exposed/Postgres, OIDC, LWW delta sync
**Researched:** 2026-04-23
**Confidence:** HIGH for KMP/Ktor/Exposed gotchas; MEDIUM for Haze + Navigation-CMP specifics (behavior shifts across minor versions)

## Critical Pitfalls

### Pitfall 1: Kotlin/Native iOS GC thrashing and `objcDisposeOnMain` hangs

**What goes wrong:** On-device (especially iPhone XR/11) the app consumes 300–700 MB steadily and freezes for 1–2 s under ViewModel churn. Flamegraphs show GC threads at >100% CPU.

**Why:** The K/N memory manager dispatches Obj-C release to the main thread by default, serializing teardown behind UI frames. Compose/Koin graphs produce many bridged Obj-C references per navigation.

**Warning signs:** Frame hitches on tab switches; main-thread time in `objc_release` / `Kotlin_ObjCExport_releaseReservedObjectTail`; Instruments shows growing K/N heap.

**How to avoid:** Set `kotlin.native.binary.objcDisposeOnMain=false` and `kotlin.native.binary.gc=cms` in `gradle.properties` from day 1. Release Kotlin refs in `onDispose`; don't hold them in long-lived Swift closures.

**Phase:** UI chrome.

---

### Pitfall 2: Legacy `freeze()` / strict-mm ceremony in copy-pasted snippets

**What goes wrong:** Code from 2021–2022 tutorials adds `freeze()`, `@SharedImmutable`, `AtomicReference` from `kotlin.native.concurrent`, or `ensureNeverFrozen()`. Compiles on Kotlin 2.x but adds dead code and masks real bugs.

**Why:** The new memory manager removed the freeze paradigm entirely; `freeze()` is a no-op and deprecated.

**Warning signs:** Any of the above symbols appearing in snippets you're about to paste.

**How to avoid:** Reject pre-1.7.20 KMP code. Use `kotlinx.atomicfu` if you truly need atomics; StateFlow is already thread-safe.

**Phase:** Data.

---

### Pitfall 3: `ComposeUIViewController` state loss on iOS re-entry

**What goes wrong:** Backgrounding then returning resets scroll positions, selected tabs, half-filled forms. Koin-scoped ViewModels re-create.

**Why:** If the `UIViewController` is instantiated inside a SwiftUI `body`, each re-render builds a fresh composition. Compose state is owned by the controller's composition root.

**Warning signs:** State survives Android rotation but dies on iOS foreground-return; ViewModel `init` fires again on return from background.

**How to avoid:** Build the `UIViewController` **once**: hold it in an object owned by `@StateObject` or in a top-level property, and never construct it in a SwiftUI `body`. Use `rememberSaveable` for any UI state that must survive process death. Never nest multiple `ComposeUIViewController` wrappers.

**Phase:** UI chrome.

---

### Pitfall 4: SQLDelight iOS — missing migration files, in-memory vs file driver divergence

**What goes wrong:** JVM tests pass with in-memory driver; the iOS app crashes on launch with `no such column` after a schema change.

**Why:** `NativeSqliteDriver` persists a real file. If a `.sq` file is edited without a numbered `.sqm` migration and a bumped schema `version`, the driver finds nothing to migrate on open, so a device with an existing install keeps its old tables and the first query against the new column fails.

**Warning signs:** Works on fresh simulator install; breaks on physical device with prior install; Android OK, iOS fails.

**How to avoid:** Every schema change gets a numbered migration file (`1.sqm`, `2.sqm`, and so on). Enable `verifyMigrations = true` and `verifyDefinitions = true`. Add a dev-only "wipe DB" debug button during early development. Reinstall on device before any QA.

**Phase:** Data.

---

### Pitfall 5: Exposed `transaction {}` inside suspend functions → pool exhaustion

**What goes wrong:** Ktor handlers call plain `transaction { ... }`. Under modest concurrency (~20 requests) the pool exhausts, p99 cliffs, and `IllegalStateException: Transaction is not currently active` appears.

**Why:** `transaction {}` is blocking and binds the transaction to the calling thread. In a coroutine it blocks event-loop threads; if the code suspends mid-transaction, resume lands on a different thread and loses the JDBC connection binding.

**Warning signs:** Connection pool always fully leased at low RPS; latency cliffs; "transaction not active" in logs.

**How to avoid:** Use `newSuspendedTransaction(Dispatchers.IO) { ... }` in suspend contexts. Pass the `Database` instance explicitly. No HTTP calls inside transactions. HikariCP pool size 8–10 is plenty for 5–10 users.

**Phase:** Data.

---

### Pitfall 6: Exposed DAO + JSONB footguns

**What goes wrong:** `IntEntity` + `jsonb<T>()` produces double-serialized JSON in Postgres (`"{\"key\":\"v\"}"`) or `SerializationException` on read.

**Why:** DAO integration with JSONB is thin; it's easy to store a pre-stringified value. DAO lazy-loads hide *when* the column is read, so failures manifest far from the cause.

**Warning signs:** Escaped JSON in `psql` output; serialization errors deep in read paths.

**How to avoid:** Use DSL only (already locked in PROJECT.md). For JSONB, define `jsonb("extras", Json.Default, MealExtras.serializer())` once; never stringify upstream. Round-trip integration test per JSONB column.

**Phase:** Data.

---

### Pitfall 7: Ktor JWT — audience, issuer, clock skew, JWKS cache

**What goes wrong:** 401s in production only, after a while, or after Authentik restart. Messages: "Token can't be used before...", "Claim 'aud' doesn't contain required audience", or silent 401s post key-rotation.

**Why:** Four defaults converge:

1. `ktor-server-auth-jwt` validates `aud` / `iss` only when `.withAudience()` / `.withIssuer()` are set explicitly.
2. Default clock leeway is **zero**: 2 s of clock drift rejects fresh tokens.
3. JWKS cache defaults to `(10, 24h)`, so key rotation stays invisible for hours.
4. Authentik's `aud` can be array or string depending on provider config.

**Warning signs:** 401 only in prod; 401 only on some devices; works briefly then fails; 401 after Authentik restart.

**How to avoid:** Configure `.withIssuer(issuer).withAudience(clientId).acceptLeeway(30)`. JWKS provider with `.cached(10, 15, MINUTES).rateLimited(10, 1, MINUTES)`. In Authentik, emit `aud` as a single client_id string. Integration test: wrong `aud` → 401.

**Phase:** Auth.
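
The leeway point reduces to one comparison. The sketch below is not the verifier's actual code; `isUsableYet` is a hypothetical helper that mirrors the "not before" check under the assumption that the issuer's clock runs slightly ahead of the verifying host.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical helper (not a Ktor or java-jwt API): a token is usable once
// the verifier's clock, extended by the allowed leeway, has reached `nbf`.
fun isUsableYet(notBefore: Instant, verifierNow: Instant, leeway: Duration): Boolean =
    !verifierNow.plus(leeway).isBefore(notBefore)
```

With `notBefore` two seconds ahead of `verifierNow`, a zero leeway returns `false` (the "Token can't be used before..." case), while a 30 s leeway returns `true`.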

---

### Pitfall 8: OIDC redirect URI mismatch + missing PKCE

**What goes wrong:** "redirect_uri does not match" or consent loop on one platform; or login succeeds without PKCE and is interceptable.

**Why:** Native apps are *public* clients: no shippable secret, so Authentik requires PKCE. Redirect URIs must match byte-for-byte (trailing slash, case). iOS uses a custom URL scheme or Universal Link; Android uses an intent-filter. Debug and release builds can differ.

**Warning signs:** Works on Android, fails on iOS (or vice versa); Authentik logs show `invalid_grant`; no `code_challenge` in auth request; fails on release build only.

**How to avoid:** Authentik provider = "Public" + PKCE S256. Register both `recipe://callback` and `recipe://callback/`. AppAuth (Android) + ASWebAuthenticationSession (iOS) with `usePKCE = true`. Keep the redirect URI in one constant in `shared/commonMain`.

**Phase:** Auth.
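
For reference, this is roughly what the provider verifies. The sketch is JVM-only (`java.security`, `java.util.Base64`), so it would not compile in `commonMain` as written, and client libraries normally generate the challenge themselves. Both function names are hypothetical.

```kotlin
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Base64

// S256 code challenge per RFC 7636:
// BASE64URL(SHA-256(ASCII(code_verifier))), without padding.
fun codeChallengeS256(codeVerifier: String): String {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest(codeVerifier.toByteArray(StandardCharsets.US_ASCII))
    return Base64.getUrlEncoder().withoutPadding().encodeToString(digest)
}

// Redirect URIs are compared as exact strings: no case folding and no
// trailing-slash normalisation, which is why both variants get registered.
fun redirectMatches(registered: Set<String>, presented: String): Boolean =
    presented in registered
```

A provider that has only `recipe://callback` registered rejects `recipe://callback/`, which is the mismatch described above.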

---

### Pitfall 9: LWW trusting client clocks

**What goes wrong:** User A's phone clock is 90 s fast; A's edit beats B's real-time-later edit in LWW. B's change silently disappears.

**Why:** Client-assigned timestamps trust unverifiable clocks. Even NTP-synced devices drift; simulators can be minutes off.

**Warning signs:** "My edit vanished"; stable prior state reappears; most common with both household members editing the same meal.

**How to avoid:** Server assigns `updated_at` on every write (already in PROJECT.md; enforce it). Client sends only content + prior `updated_at` for optimistic concurrency. Server sets `updated_at = now()` in the transaction and returns it. Make timestamps strictly monotonic per row (e.g. `GREATEST(now(), old.updated_at + interval '1 microsecond')`) to avoid tie collisions.

**Phase:** Sync.
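
The monotonic rule can be mirrored in Kotlin for unit tests. This is a minimal sketch, assuming the server keeps microsecond precision to match `timestamptz`; `nextUpdatedAt` is a hypothetical name, and the real assignment would still happen in SQL inside the write transaction.

```kotlin
import java.time.Instant
import java.time.temporal.ChronoUnit

// Kotlin mirror of GREATEST(now(), old.updated_at + interval '1 microsecond').
// `previous` is the row's stored updated_at, or null for a new row.
fun nextUpdatedAt(serverNow: Instant, previous: Instant?): Instant {
    // Drop sub-microsecond digits so the value round-trips through Postgres.
    val now = serverNow.truncatedTo(ChronoUnit.MICROS)
    if (previous == null) return now
    val floor = previous.plus(1, ChronoUnit.MICROS)
    return if (now.isAfter(floor)) now else floor
}
```

Two writes landing in the same microsecond, or a server clock that steps backwards, still produce a strictly increasing `updated_at` for the row.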

---

### Pitfall 10: Soft-delete + recreate race

**What goes wrong:** Delete a meal entry, immediately re-add "the same" one. Depending on pull ordering, the new row is hidden by the tombstone, or the old row is resurrected with old fields.

**Why:** If `(plan_date, slot)` is treated as identity, tombstone/recreate races are inevitable on concurrent 2-user editing.

**Warning signs:** Undeleted items; deleted meals reappear on partner's device; duplicates in pantry.

**How to avoid:** Identity is always a fresh UUID per row, never `(date, slot)`. Tombstones carry their own `updated_at`. Pull returns tombstones and live rows; client applies in `updated_at` order. Per-client push outbox replays in local sequence order, never in parallel. Integration test: two clients alternating delete/recreate, assert convergence.

**Phase:** Sync.
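
A minimal in-memory sketch of the client-side apply step, assuming a tombstone is the same row with a `deleted` flag set. `SyncRow`, `applyPull`, and `visible` are hypothetical names, not part of the project.

```kotlin
import java.time.Instant
import java.util.UUID

// Hypothetical sync row; a tombstone is a row with deleted = true.
data class SyncRow(
    val id: UUID,
    val title: String,
    val updatedAt: Instant,
    val deleted: Boolean,
)

// Applies a pulled page in (updatedAt, id) order. A pulled row replaces the
// local copy only when it is strictly newer, so stale replays are ignored.
fun applyPull(local: MutableMap<UUID, SyncRow>, pulled: List<SyncRow>) {
    val ordered = pulled.sortedWith(compareBy<SyncRow>({ it.updatedAt }, { it.id }))
    for (row in ordered) {
        val current = local[row.id]
        if (current == null || row.updatedAt.isAfter(current.updatedAt)) {
            local[row.id] = row
        }
    }
}

// Rows the UI should show: everything that is not a tombstone.
fun visible(local: Map<UUID, SyncRow>): List<SyncRow> =
    local.values.filterNot { it.deleted }
```

Because the recreated entry has its own UUID, the tombstone for the old row cannot hide it, and a late replay of the old live row cannot resurrect it.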

---

### Pitfall 11: Pull-cursor edge cases — missed updates, same-timestamp ties

**What goes wrong:** Partner edits at 14:00:05; client's last pull cursor is `14:00:04.999`. If cursor semantics or timestamp precision are wrong, the change is skipped forever.

**Why:** Cursor semantics are subtle. Second-precision timestamps, `>=` instead of `>`, and ties among rows sharing an `updated_at` all cause skipped or replayed rows. Debounced push interleaved with pull can reorder writes.

**Warning signs:** Sporadic stale data that vanishes after pull-to-refresh; only reproduces near DB restarts or bulk imports; duplicates after manual refresh.

**How to avoid:** `updated_at` is `timestamptz` with microsecond precision and strictly monotonic. Cursor is `(updated_at, id)` lexicographic: `WHERE (updated_at, id) > (:since_ts, :since_id) ORDER BY updated_at, id LIMIT N`. Pause pull while a push is in flight. Never split the write and its timestamp notification across transactions.

**Phase:** Sync.
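
The row-value comparison can be modelled in memory to test the tie case. This is a sketch under stated assumptions: `Cursor`, `Row`, and `pullPage` are hypothetical, and the `id` tie-break uses Java's `UUID` ordering only for self-consistency. In production the ordering must be the database's own, since Postgres may order UUIDs differently from the JVM.

```kotlin
import java.time.Instant
import java.util.UUID

// Composite cursor ordered lexicographically: updatedAt first, then id.
data class Cursor(val updatedAt: Instant, val id: UUID) : Comparable<Cursor> {
    override fun compareTo(other: Cursor): Int =
        compareValuesBy(this, other, { it.updatedAt }, { it.id })
}

data class Row(val id: UUID, val updatedAt: Instant)

// In-memory equivalent of:
//   WHERE (updated_at, id) > (:since_ts, :since_id)
//   ORDER BY updated_at, id LIMIT N
fun pullPage(rows: List<Row>, since: Cursor?, limit: Int): List<Row> =
    rows.map { Cursor(it.updatedAt, it.id) to it }
        .filter { (cursor, _) -> since == null || cursor > since }
        .sortedBy { it.first }
        .take(limit)
        .map { it.second }
```

With three rows sharing one `updated_at` and a page size of two, the second page starts strictly after the last `(updated_at, id)` pair of the first, so no row is skipped or returned twice.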

---

### Pitfall 12: Haze on scroll + nested children tank older iPhones

**What goes wrong:** LazyColumn scrolling under a blurred top bar stutters badly on iPhone XR/11, dropping to ~30 fps. Nesting `hazeChild` inside a list item sitting in a `hazeSource` Scaffold makes it worse.

**Why:** iOS Haze uses Skiko `GraphicsLayer` for offscreen capture + re-blur each frame. Progressive blur adds ~25% cost. Older A-series chips without hardware-accelerated RenderEffect equivalents jank under this load.

**Warning signs:** Smooth on simulator/M-series, choppy on iPhone 11; FPS 40–50; Skiko render thread pegged in Instruments.

**How to avoid:** One `hazeSource` per screen, never nested. Limit blur to chrome (tab bar, nav bar, sheet headers), not scrolling content. Avoid progressive blur on iOS pre-iPhone 13. Test on the oldest target device, on real hardware. Feature-flag the effect with a solid-translucent fallback.

**Phase:** UI chrome.

---

### Pitfall 13: Navigation-CMP tabs — `when`-switch kills per-tab back stack

**What goes wrong:** Tabs implemented as `when (tab) { 0 -> RecipesScreen()... }`. Tapping into a detail, switching tabs, and returning loses the detail. System back exits the app instead of unwinding the tab.

**Why:** A `when` switch destroys the non-current tab's Compose tree. Jetpack Navigation's multi-back-stack requires either each tab as a destination in a parent NavHost, or per-tab nested `NavHost` instances, with `popUpTo(saveState) + restoreState + launchSingleTop`.

**Warning signs:** Deep-links don't restore; back from a nested screen jumps tabs; ViewModels re-created on tab switches.

**How to avoid:** One top-level `NavHost`; `navigation(route = "recipesGraph", ...)` block per tab. Bottom bar navigates: `popUpTo(graph.findStartDestination().id) { saveState = true }; launchSingleTop = true; restoreState = true`. Scope `koinViewModel()` to the destination's `NavBackStackEntry`, not the parent graph. Wasm deep-links are deferred per PROJECT.md.

**Phase:** UI chrome.

---

### Pitfall 14: Polish locale — plurals and timestamp zones

**What goes wrong:** "added 2 godzina temu" (wrong plural form). Shopping items near midnight show on the wrong day across devices.

**Why:** Polish has four CLDR plural forms (one / few / many / other). Naive `if (n == 1)` handles at most two. Serializing `LocalDateTime` over the wire (instead of UTC `Instant`) produces zone/DST bugs.

**Warning signs:** Grammatically wrong Polish copy; yesterday's items shown as today's.

**How to avoid:** Use Compose Resources `<plurals>` with all four forms; call `pluralStringResource(count)`. Wire format: `Instant` UTC ISO-8601 only; display: `.toLocalDateTime(TimeZone.currentSystemDefault())`. Unit test plurals with count 0/1/2/5/22.

**Phase:** UI chrome (i18n foundation).
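
The 0/1/2/5/22 test matrix follows from the CLDR cardinal rules for Polish. The sketch below restates those rules for non-negative integers to show why each count is in the matrix. It is illustrative only: in the app the resource system picks the form, and `polishPluralCategory` and `hoursLabel` are hypothetical names.

```kotlin
enum class PluralCategory { ONE, FEW, MANY, OTHER }

// CLDR cardinal rules for Polish, restricted to non-negative integers.
// OTHER applies only to fractional numbers, so an Int count never yields it.
fun polishPluralCategory(n: Int): PluralCategory {
    require(n >= 0) { "count must be non-negative" }
    val mod10 = n % 10
    val mod100 = n % 100
    return when {
        n == 1 -> PluralCategory.ONE
        mod10 in 2..4 && mod100 !in 12..14 -> PluralCategory.FEW
        else -> PluralCategory.MANY
    }
}

// Nominative forms of "hour": 1 godzina, 2 godziny, 5 godzin.
fun hoursLabel(n: Int): String = when (polishPluralCategory(n)) {
    PluralCategory.ONE -> "$n godzina"
    PluralCategory.FEW -> "$n godziny"
    else -> "$n godzin"
}
```

Counts 0 and 5 map to `MANY`, 1 to `ONE`, and 2 and 22 to `FEW`; 12 is a useful extra case because it ends in 2 yet maps to `MANY`.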

---

## Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Ad-hoc `psql` DDL, skipping Flyway | Fast schema iteration | Dev/prod drift; can't rebuild from scratch | Pre-first-deploy only; squash into `V1__init.sql` before real data |
| Hardcoded OIDC issuer/client_id in `shared/commonMain` | Avoids build-config plumbing | Can't run against staging Authentik; Authentik change forces rebuild | v1 single-environment only |
| Plain `transaction {}` in admin endpoints | Simpler mental model | Mixing blocking + suspend patterns leaks; eventually every endpoint wants suspend | Admin-only, single-user endpoints |
| Free-form `meal_entry.extras` JSONB without schema | Evolve without migrations | No DB validation; orphan fields accumulate; hard to query | Until extras shape stabilizes; then promote hot fields to columns |
| No indices until queries are slow | Faster early dev | p99 cliffs during sync; adding indices under load is risky | Until first data import; then index every `(household_id, updated_at)` |

## Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Authentik OIDC | Confidential client type with secret shipped in binary | Public client + PKCE S256; never ship `client_secret` |
| Authentik OIDC | Leaving default signing alg; Ktor JWT expects RS256 | Configure RS256 explicitly; verify `kid` resolves via JWKS |
| Haze + Scaffold | `hazeSource` on Scaffold root + `hazeChild` on a sheet both capturing | `hazeSource` on scrollable content only; chrome uses `hazeChild` |
| App Store / TestFlight | ATS exception to reach homelab self-signed cert | Real cert via Let's Encrypt + Caddy/Traefik; never ship ATS exceptions |
| Postgres JSONB | `WHERE extras->>'k' = 'v'` with no GIN index | `CREATE INDEX ... USING GIN (extras jsonb_path_ops)` once access patterns emerge |

## Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Pull sync without pagination | First-sync-after-seed hangs seconds | Cursor-paginate `LIMIT 200 ORDER BY updated_at, id` | >500 rows in any scoped table |
| Coil full-res images in recipe grid | Memory spikes, laggy scroll | Explicit thumbnail `Size`; memory+disk cache | >30 images on screen |
| Compose recomposition of entire calendar per edit | Calendar flashes on slot change; scroll resets | Stable IDs per slot; hoist per-slot state; `derivedStateOf` for totals | Any calendar with >7 days visible |
| Haze over full scrolling region | Jank on iPhone XR/11 | Blur chrome only, not content; fallback for old devices | A12/A13-class silicon (iPhone XR/11) on 60 Hz panels |

## Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Missing `WHERE household_id = :caller_household` on reads | Cross-household data leak | All scoped reads go through a `HouseholdScope` helper; review rule: no raw `selectAll()` on scoped tables |
| Trusting client-supplied `household_id` in request body | Tenancy bypass via crafted POST | Derive `household_id` from JWT `sub` → `memberships`; ignore body's value |
| Logging the `Authorization` header in Ktor `CallLogging` | Tokens leak to log files → account compromise | Custom log filter redacting `Authorization`; never `log.info(token)` |
| Storing OIDC refresh token in plain prefs | Local/backup exposure | `multiplatform-settings` with Keychain (iOS) / EncryptedSharedPreferences (Android) backends |
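
Two of the preventions above can be sketched without any framework. All names here (`CreateMealRequest`, `Memberships`, `MealRow`, `resolveHousehold`, `toRow`, `redactForLog`) are hypothetical. Real code would read `sub` from the verified JWT principal and plug the redaction into whatever log formatter the server uses.

```kotlin
// Hypothetical request body: the client may send householdId, but it is never trusted.
data class CreateMealRequest(val householdId: String?, val title: String)

data class MealRow(val householdId: String, val title: String)

// Hypothetical membership lookup keyed by the JWT `sub` claim.
class Memberships(private val householdBySubject: Map<String, String>) {
    fun householdFor(subject: String): String? = householdBySubject[subject]
}

// Tenancy comes from the verified token only.
fun resolveHousehold(jwtSubject: String, memberships: Memberships): String =
    memberships.householdFor(jwtSubject)
        ?: throw SecurityException("subject has no household membership")

// The body's householdId is deliberately ignored when building the row.
fun toRow(request: CreateMealRequest, jwtSubject: String, memberships: Memberships): MealRow =
    MealRow(resolveHousehold(jwtSubject, memberships), request.title)

// Replaces the Authorization header value before headers reach a log line.
fun redactForLog(headers: Map<String, String>): Map<String, String> =
    headers.mapValues { (name, value) ->
        if (name.equals("Authorization", ignoreCase = true)) "[REDACTED]" else value
    }
```

A crafted POST carrying another household's id still lands in the caller's own household, and a subject without a membership is rejected outright.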

## "Looks Done But Isn't" Checklist

- [ ] **Auth:** Login works; verify token refresh runs before expiry (set Authentik access-token lifetime to 5 min in dev; watch for silent 401s)
- [ ] **Sync:** Pull works; verify tombstones propagate (delete on A, confirm gone on B after pull, not just after push)
- [ ] **Sync:** Offline writes survive app kill + relaunch + reconnect, not just a warm resume
- [ ] **Household isolation:** Log in as household B; hit every endpoint; assert zero household A rows returned
- [ ] **SQLDelight migrations:** Install prior release, launch once, upgrade in place; confirm no crash, no data loss
- [ ] **Polish plurals:** Open every screen with counts 0, 1, 2, 5, 22; verify grammar
- [ ] **Haze performance:** Test on oldest supported device (iPhone XR/11) scrolling a full screen; not just simulator

## Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| K/N GC thrash; `objcDisposeOnMain` | UI chrome (infra) | Gradle property set; Instruments shows no GC-main domination |
| Legacy `freeze()` ceremony | Data | Code search for `freeze(`, `@SharedImmutable` returns empty |
| UIViewController re-creation | UI chrome | State survives background/foreground cycle |
| SQLDelight missing migration | Data | Prior-build → new-build upgrade test on real device |
| Blocking Exposed transaction in suspend | Data | No `transaction {` in suspend paths; 50-concurrent-request load test with pool size 10 |
| DAO + JSONB | Data | No `exposed.dao.*` imports; per-JSONB-column round-trip test |
| JWT aud/iss/leeway/JWKS | Auth | Wrong-aud → 401; 30 s skew → 200; JWKS refreshes within 15 min |
| OIDC redirect URI / PKCE | Auth | Flow passes on iOS *and* Android; Authentik logs show `code_challenge` per request |
| LWW client-clock trust | Sync | All writes set `updated_at` server-side; clients never send it |
| Soft-delete recreate race | Sync | Two-client alternating delete/recreate converges |
| Pull-cursor edge cases | Sync | Cursor is `(updated_at, id)` lexicographic; same-timestamp test |
| Haze scroll jank | UI chrome | iPhone 11 real-device FPS >55 on recipe grid scroll |
| Nested NavHost / multi-back-stack | UI chrome | Tab switch preserves deep state; system back unwinds within tab |
| Polish plurals / timestamps | UI chrome | Plural unit tests pass; wire format is UTC-only |
| Household tenancy bypass | Auth + Sync | Cross-household read test asserts empty result sets |

## Sources

- [Kotlin/Native memory management](https://kotlinlang.org/docs/native-memory-manager.html) (HIGH)
- [Compose Multiplatform for iOS Stable, 2025](https://www.kmpship.app/blog/compose-multiplatform-ios-stable-2025) (MEDIUM)
- [Haze 1.0 release notes — Chris Banes](https://chrisbanes.me/posts/haze-1.0/) (HIGH)
- [Haze Platforms documentation](https://chrisbanes.github.io/haze/latest/platforms/) (HIGH)
- [Navigation in Compose Multiplatform — JetBrains](https://kotlinlang.org/docs/multiplatform/compose-navigation.html) (HIGH)
- [Bottom Nav + Nested Navigation guide](https://saurabhjadhavblogs.com/jetpack-compose-bottom-navigation-nested-navigation-solved) (MEDIUM)
- [Exposed — Working with Transactions](https://www.jetbrains.com/help/exposed/transactions.html) (HIGH)
- [Exposed — JSON/JSONB types](https://www.jetbrains.com/help/exposed/json-and-jsonb-types.html) (HIGH)
- [Exposed — Breaking Changes](https://www.jetbrains.com/help/exposed/breaking-changes.html) (HIGH)
- Community-known K/N + KMP gotchas synthesized from training + surrounding sources (MEDIUM)

---

*Pitfalls research for: Kotlin Multiplatform recipe/meal-planning app with self-hosted Ktor + Postgres + Authentik backend*
*Researched: 2026-04-23*