[{"body":"","link":"https://kane.mx/tags/agentcore/","section":"tags","tags":null,"title":"AgentCore"},{"body":"","link":"https://kane.mx/categories/ai/ml/","section":"categories","tags":null,"title":"AI/ML"},{"body":"","link":"https://kane.mx/tags/amazon-verified-permissions/","section":"tags","tags":null,"title":"Amazon Verified Permissions"},{"body":"","link":"https://kane.mx/tags/aws/","section":"tags","tags":null,"title":"AWS"},{"body":"","link":"https://kane.mx/tags/bedrock-agents/","section":"tags","tags":null,"title":"Bedrock Agents"},{"body":"","link":"https://kane.mx/categories/","section":"categories","tags":null,"title":"Categories"},{"body":"","link":"https://kane.mx/tags/cedar/","section":"tags","tags":null,"title":"Cedar"},{"body":"","link":"https://kane.mx/categories/cloud-computing/","section":"categories","tags":null,"title":"Cloud-Computing"},{"body":"","link":"https://kane.mx/tags/multi-tenant/","section":"tags","tags":null,"title":"Multi-Tenant"},{"body":"TL;DR (30-Second Read) With Amazon Bedrock AgentCore now generally available — including AgentCore Identity for agent authentication and AgentCore Policy, which enforces Cedar rules by intercepting every tool call before execution — the security design for multi-tenant SaaS on Bedrock Agents has reached an inflection point. This blueprint addresses the hardest problem in agentic SaaS: a Large Language Model (LLM) that, through prompt injection or hallucination, crosses tenant boundaries or escalates privileges. The answer is a zero-trust, low-latency, two-layer authorization architecture with real-time quota enforcement, built on Amazon Verified Permissions (AVP) and the Cedar policy engine.\nWho this is for: Cloud architects and senior security/backend engineers building or running multi-tenant SaaS on Amazon Bedrock Agents. If you read only one section: jump to Two-Layer Authorization Architecture, the most directly actionable part of the design. 1. 
Threat Model When you build enterprise multi-tenant AI agent SaaS, the agent's dynamic decision-making and autonomous reasoning introduce security threats that traditional web applications never faced. This blueprint defends against three core failure modes:\nImpersonation and tenant injection. An attacker uses prompt injection to tamper with the agent's runtime context (for example, session attributes), tricking a tool Lambda into performing a cross-tenant data operation. Permission creep. Across multi-step reasoning, an agent — driven by hallucination or a malicious prompt — accumulates excessive privilege between iterations, or chains individually harmless tools to reach a high-risk operation. Data isolation failure. Inconsistent resource identification or bolt-on authorization logic causes the Policy Enforcement Point (PEP) to skip the tenant context during a check, leaking data across tenants. 2. Trust Boundaries and Trust Anchors To eliminate attacks that exploit mutability, the system must establish a single trusted source — a trust anchor — for both the principal and resource identity, and treat them as intrinsic identity rather than runtime-mutable context.\nIdentity Dimension Trusted Source Implementation and Tamper-Resistance Principal Tenant ID Session Metadata Store (immutable) 1. Abandon sessionAttributes, which are vulnerable to mutation. 2. Use an external Session Metadata Store (such as DynamoDB) keyed by session_id, binding tenant_id, cognito_sub, and created_at. 3. The orchestrator writes this record after JWT verification; the tool and PEP have read-only access, and the agent has no write permission. Resource Tenant ID Resource Prefix Convention (intrinsic) 1. Enforce a tenant-prefix convention on resource identifiers (such as tenant-{id}/resources/{resource-id}). 2. The tenant ID is part of the resource identity, not a bolt-on field. 3. 
Even if the agent omits tenant context in its parameters, the PEP can extract it by parsing the resource ID, preventing escalation. flowchart TD Client[Client JWT] --\u0026gt;|1. Request| Orch[Orchestrator] Orch --\u0026gt;|2. Verify and Write| Meta[(Session Metadata Store)] Orch --\u0026gt;|3. Invoke| Bedrock[Bedrock Agents] Bedrock --\u0026gt;|4. Tool Call| PreCheck[Pre-Call Check Lambda] Meta -.-\u0026gt;|5. Read immutable tenant_id| PreCheck PreCheck --\u0026gt;|6. IsAuthorized| AVP[AVP Cedar PDP] PreCheck --\u0026gt;|7. Forward if Allowed| Tool[Tool Lambda PEP] Important: Never trust a tenant_id produced by the LLM or passed through sessionAttributes. Every security check must compare against the Session Metadata Store (the principal trust anchor) and the prefix convention (the resource trust anchor).\n3. Cedar Policy Patterns Using Amazon Verified Permissions (AVP) as the Policy Decision Point (PDP), declarative Cedar policies express three tenant-control models.\nPattern A: Owner-Isolated (Strong Tenant Isolation) The bottom-line hard boundary: a principal may only act on resources belonging to the same tenant.\n// Pattern A: Owner-Isolated (strong tenant isolation) permit ( principal, action in [ Action::\u0026#34;GetDocument\u0026#34;, Action::\u0026#34;UpdateDocument\u0026#34;, Action::\u0026#34;DeleteDocument\u0026#34; ], resource ) when { principal.tenant_id == resource.tenant_id }; Pattern B: Role-Tiered (Fine-Grained Roles Within a Tenant) Within the isolation boundary, grant different permissions to Admin, Member, and Guest roles.\n// Pattern B: Role-Tiered (fine-grained roles within a tenant) // 1. 
Admin, Member, and Guest in the tenant may read documents permit ( principal, action in [Action::\u0026#34;ReadDocument\u0026#34;], resource ) when { principal.tenant_id == resource.tenant_id \u0026amp;\u0026amp; [\u0026#34;Admin\u0026#34;, \u0026#34;Member\u0026#34;, \u0026#34;Guest\u0026#34;].contains(principal.role) }; // 2. Only Admin and Member may write or modify documents permit ( principal, action in [Action::\u0026#34;WriteDocument\u0026#34;, Action::\u0026#34;CreateDocument\u0026#34;], resource ) when { principal.tenant_id == resource.tenant_id \u0026amp;\u0026amp; [\u0026#34;Admin\u0026#34;, \u0026#34;Member\u0026#34;].contains(principal.role) }; // 3. Only Admin may perform high-risk configuration operations permit ( principal, action in [Action::\u0026#34;DeleteTenantSpace\u0026#34;, Action::\u0026#34;ConfigureIntegrations\u0026#34;], resource ) when { principal.tenant_id == resource.tenant_id \u0026amp;\u0026amp; principal.role == \u0026#34;Admin\u0026#34; }; Pattern C: Quota-Bounded (Plan-Aware Real-Time Quota Enforcement) Dynamically block over-quota requests. 
The tenant's quota state is not stored in Cedar; the PEP fetches it in real time and passes it to the PDP through context.\n// Pattern C: Quota-Bounded (plan-aware real-time quota enforcement) // Enterprise tenants may call expensive tools directly permit ( principal, action in [Action::\u0026#34;InvokePremiumTool\u0026#34;], resource ) when { principal.tenant_id == resource.tenant_id \u0026amp;\u0026amp; principal.tier == \u0026#34;Enterprise\u0026#34; }; // Standard tenants must pass a context-supplied quota check permit ( principal, action in [Action::\u0026#34;InvokePremiumTool\u0026#34;], resource ) when { principal.tenant_id == resource.tenant_id \u0026amp;\u0026amp; principal.tier == \u0026#34;Standard\u0026#34; \u0026amp;\u0026amp; context.monthly_api_calls \u0026lt; context.api_call_limit }; Tip: For concurrent quota updates, use a DynamoDB conditional update for atomicity, and make a strongly consistent, uncached read of the counter before evaluation so that a stale value cannot let a request through after the quota is exhausted.\n4. Two-Layer Authorization Architecture To balance defense and runtime cost, the system uses a two-layer design: a pre-call check (first-line entrance filter) and a post-call PEP (second-line in-Lambda safety net). This mirrors how AgentCore Policy itself intercepts tool calls before execution — the pre-call layer is the place to do it cheaply, while the post-call PEP remains the defense-in-depth backstop.\n+--------------------------------------------------------------------------+ | Orchestrator (Cognito JWT) | +--------------------------------------------------------------------------+ | v +--------------------------------------------------------------------------+ | 1. 
Pre-Call Check (Action Group Entrance) | | - Intercept structured params from Bedrock Agents (OpenAPI schema) | | - Map to an (Action, Resource) tuple | | - Read principal.tenant_id from the Session Metadata Store | | - Extract resource.tenant_id from the resource prefix | | - AVP Evaluate -\u0026gt; DENY: return a unified 403 JSON to the runtime | +--------------------------------------------------------------------------+ | ALLOW v +--------------------------------------------------------------------------+ | 2. Post-Call PEP (Tool Lambda PEP) - Safety Net | | - Guards against pre-call misses or bypass | | - Re-evaluates by calling the AVP PDP again | | - DENY: block execution, return structured 403, trip circuit + audit | +--------------------------------------------------------------------------+ A. Layer 1: Pre-Call Check (Entrance Interception) Responsibility: intercept after a Bedrock Agents action group fires but before the real business Lambda (the tool) executes. Mechanism: Configure a single OpenAPI schema on the action group so Bedrock Agents calls carry structured JSON parameters. The pre-check Lambda reads those structured JSON parameters directly, extracting the resource identifier (such as an S3 key or DynamoDB key). It does not rely on extracting identifiers from free-form LLM output. Extract the tenant ID via the prefix convention and compare it against the Session Metadata Store. Unified DENY response. On DENY, the pre-call check returns exactly the same structured 403 payload as the post-call PEP (see 4.B), which the Bedrock Agents runtime interprets as tool output — keeping the API response contract uniform. Prefix parsing and Cedar resource construction (implementation detail). 
After parsing the structured parameters, the PEP must split the resource ID (for example tenant-corp-99/doc-a1b2c3) into a (tenant_id, resource_id) tuple, construct the matching Cedar resource entity Document::\u0026quot;tenant-corp-99:doc-a1b2c3\u0026quot;, and explicitly set the attribute resource.tenant_id = \u0026quot;tenant-corp-99\u0026quot;. This mapping from a naming convention to Cedar policy evaluation is the crux — the Cedar engine does not parse prefixes on its own. Benefits: Avoids the cold-start and execution cost of tool Lambdas triggered by malicious or redundant calls. Cuts off illegal calls early, saving substantial LLM tokens. Produces clearer, more direct audit signals. B. Layer 2: Post-Call PEP (Safety Net) Responsibility: the PEP logic inside the tool Lambda, the defense-in-depth backstop.\nUnified 403 DENY response shape. Whether triggered at the pre-call check or the post-call PEP, a DENY does not crash the invocation. It returns a single structured 403 payload to Bedrock Agents:\n{ \u0026#34;status\u0026#34;: \u0026#34;error\u0026#34;, \u0026#34;code\u0026#34;: \u0026#34;AccessDenied\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;Security policy violation: operation not permitted for this tenant context.\u0026#34; } This lets the Bedrock Agents runtime recognize an \u0026quot;unauthorized\u0026quot; outcome and present the permission limit gracefully to the user instead of crashing the system.\nC. Iteration Limits and Circuit Breaker Max iterations. To stop Bedrock Agents from entering a \u0026quot;hallucination retry loop\u0026quot; when blocked (repeatedly swapping resource IDs to bypass policy), the orchestrator enforces class-aware defaults: Write or sensitive workflows: a small default (such as 5). This conservative default comes from injection-and-retry patterns common in production; raise it to match the maximum depth of legitimate business workflows in your deployment. 
Read-heavy or research workflows: allow 10–15 iterations to preserve complex reasoning chains. PEP-level circuit breaker. Trigger: if a session triggers 3 consecutive DENY results during tool calls, trip the breaker. State marking: on trip, mark the session compromised=true in the Session Metadata Store. Cheapest enforcement path: read and check the compromised flag as the first step of the pre-call check Lambda. If true, block immediately and return a hard-fail (SESSION_REVOKED) — no need to make an expensive AVP (Cedar PDP) call. Audit evidence chain: on trip, write a circuit_breaker_tripped event to the audit trail and record the N denials that caused it as evidence. 5. Structured Audit Trail Schema To satisfy independent audit requirements such as SOC 2 and ISO 27001, every authorization decision (pre-call and post-call), plus circuit-breaker and over-quota events, must be emitted as standard JSON audit logs.\nAudit Log JSON Schema — Extension Fields event_type: event type (AgentAuthorizationEvaluation / circuit_breaker_tripped). deny_reason: reason for denial (policy_denied / quota_exceeded / quota_store_unreachable / circuit_breaker_active). determining_policies: the specific Cedar policy IDs that determined the decision. This maps to the AVP IsAuthorized response field determiningPolicies. Note an important AVP semantic: on an implicit DENY (no matching policy), determiningPolicies is an empty array — so absence of a policy ID is itself a meaningful audit signal, not a gap. execution_status: an execution status code that records the final decision outcome. Allowed Values for execution_status To trace the final outcome of a request's lifecycle in the audit trail, execution_status must be one of the following enumerated values:\nAllowed Value Meaning When It Fires PROCESSED Normal evaluation Request was permitted (ALLOW) by AVP, or denied (DENY) by a normal policy decision. DENY PEP interception Blocked by the PEP inside the tool. 
SESSION_REVOKED Session revoked / circuit tripped PEP-level circuit breaker tripped, or the session compromised flag is active; subsequent calls are rejected directly in this state. SYSTEM_FALLBACK_DENY System-failure fail-closed block AVP / Cedar PDP is down, or the Session Metadata Store is unavailable, triggering a fail-closed block. Example Audit Log Events Example 1: Quota Exceeded Deny { \u0026#34;timestamp\u0026#34;: \u0026#34;2026-06-06T02:30:15Z\u0026#34;, \u0026#34;event_type\u0026#34;: \u0026#34;AgentAuthorizationEvaluation\u0026#34;, \u0026#34;tenant_id\u0026#34;: \u0026#34;tenant-corp-99\u0026#34;, \u0026#34;session_id\u0026#34;: \u0026#34;bedrock-session-47609bf2\u0026#34;, \u0026#34;principal\u0026#34;: \u0026#34;User::tenant-corp-99:user-alex\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;Action::InvokePremiumTool\u0026#34;, \u0026#34;resource\u0026#34;: \u0026#34;Document::tenant-corp-99:doc-a1b2c3\u0026#34;, \u0026#34;decision\u0026#34;: \u0026#34;DENY\u0026#34;, \u0026#34;deny_reason\u0026#34;: \u0026#34;quota_exceeded\u0026#34;, \u0026#34;quota_metric\u0026#34;: \u0026#34;monthly_api_calls\u0026#34;, \u0026#34;determining_policies\u0026#34;: [\u0026#34;policy-quota-limit-standard\u0026#34;], \u0026#34;execution_status\u0026#34;: \u0026#34;PROCESSED\u0026#34; } Example 2: Circuit Breaker Tripped { \u0026#34;timestamp\u0026#34;: \u0026#34;2026-06-06T02:31:05Z\u0026#34;, \u0026#34;event_type\u0026#34;: \u0026#34;circuit_breaker_tripped\u0026#34;, \u0026#34;tenant_id\u0026#34;: \u0026#34;tenant-corp-99\u0026#34;, \u0026#34;session_id\u0026#34;: \u0026#34;bedrock-session-47609bf2\u0026#34;, \u0026#34;principal\u0026#34;: \u0026#34;User::tenant-corp-99:user-alex\u0026#34;, \u0026#34;circuit_breaker_deny_history\u0026#34;: [ { \u0026#34;timestamp\u0026#34;: \u0026#34;2026-06-06T02:30:45Z\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;Action::UpdateDocument\u0026#34;, 
\u0026#34;resource\u0026#34;: \u0026#34;Document::tenant-corp-99:doc-888\u0026#34; }, { \u0026#34;timestamp\u0026#34;: \u0026#34;2026-06-06T02:30:52Z\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;Action::UpdateDocument\u0026#34;, \u0026#34;resource\u0026#34;: \u0026#34;Document::tenant-corp-99:doc-889\u0026#34; }, { \u0026#34;timestamp\u0026#34;: \u0026#34;2026-06-06T02:31:01Z\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;Action::UpdateDocument\u0026#34;, \u0026#34;resource\u0026#34;: \u0026#34;Document::tenant-corp-99:doc-890\u0026#34; } ], \u0026#34;execution_status\u0026#34;: \u0026#34;SESSION_REVOKED\u0026#34; } 6. Failure Modes and Observability Matrix Failure Scenario Potential Impact Protection and Degradation Strategy Audit Log and Alerting Metric AVP / Cedar PDP down or evaluation fails Authorization decision blocked Fail-closed (hard block): the PEP catches the exception and defaults to DENY, refusing all access. Raise a CedarPDPUnreachable critical alarm; write an audit log with decision source SYSTEM_FALLBACK_DENY. Quota Store (DynamoDB) hot partition / throttling Cannot confirm real-time quota state Tier-aware fail-safe: 1. Free / Standard tier: allow read-only / standard tools that do not depend on quota; block only quota-dependent premium tools (return QuotaStatusUnknown). 2. Enterprise tier: under a strong SLA commitment, force fail-closed, rejecting all premium tool calls with QuotaStatusUnknown. 1. Raise a QuotaStoreThrottle alarm, plus a new QuotaStoreThrottleAffectingFreeTier metric (track free-tier degradation error rate to monitor churn risk). 2. Mark deny_reason as quota_store_unreachable (distinct from quota_exceeded). Cognito JWT expired or malformed Unverified identity Fail-closed (unauthorized block): the orchestrator rejects at the entrance, making no downstream calls. Raise an AuthTokenValidationFailed metric; block and emit an HTTP 401 audit log at the outermost gateway. 
Session Metadata Store (DynamoDB) down Lost tenant binding Fail-closed (hard block): unable to recover the principal's true tenant identity, the pre-check blocks all downstream calls. Raise a SessionMetadataUnreachable critical alarm; record decision SYSTEM_FALLBACK_DENY in the audit log. 7. Conclusion Multi-tenant isolation for AI agents is not the same problem as multi-tenant isolation for a CRUD API. The agent reasons, retries, and composes tools at runtime, so the security boundary has to be explicit, external, and fail-closed — never inferred from anything the model produced. The three anchors of this blueprint hold that line:\nIdentity comes from a trust anchor, not the model. Principal tenant ID lives in an immutable Session Metadata Store; resource tenant ID is intrinsic to the resource name. The LLM never gets a vote. Authorization is two layers, decided by Cedar. A cheap pre-call check stops most bad calls before any tool runs; a post-call PEP is the defense-in-depth backstop. Both call the same AVP PDP and return the same structured 403. Every decision is auditable and every failure is fail-closed. determiningPolicies, execution_status, and the circuit-breaker evidence chain give auditors a complete record, while PDP or store outages degrade to DENY, never to open access. Now that AgentCore Identity and AgentCore Policy are generally available, much of this can lean on managed building blocks — AgentCore Policy already intercepts tool calls with Cedar, and AgentCore Identity handles inbound JWT authorization (see how an API Gateway façade closes the OAuth gaps for AgentCore Gateway + Cognito). The patterns here remain the design contract you enforce on top of them. 
If you are weighing where the isolation boundary should sit in the first place, the serverless multi-tenant OpenHands on AWS post works the same problem from the infrastructure side.\nResources Amazon Bedrock AgentCore documentation AgentCore Identity AgentCore Policy (Cedar-based tool-call control) Amazon Verified Permissions — IsAuthorized API Cedar policy language ","link":"https://kane.mx/posts/2026/multi-tenant-agent-security-blueprint/","section":"posts","tags":["AWS","Bedrock Agents","AgentCore","Cedar","Amazon Verified Permissions","Multi-Tenant","Zero Trust","SaaS Security"],"title":"Multi-Tenant Bedrock Agents Security with Cedar"},{"body":"","link":"https://kane.mx/posts/","section":"posts","tags":null,"title":"Posts"},{"body":"","link":"https://kane.mx/tags/saas-security/","section":"tags","tags":null,"title":"SaaS Security"},{"body":"","link":"https://kane.mx/categories/security/","section":"categories","tags":null,"title":"Security"},{"body":"","link":"https://kane.mx/tags/","section":"tags","tags":null,"title":"Tags"},{"body":"","link":"https://kane.mx/","section":"","tags":null,"title":"The road"},{"body":"","link":"https://kane.mx/tags/zero-trust/","section":"tags","tags":null,"title":"Zero Trust"},{"body":"","link":"https://kane.mx/series/effective-cloud-computing/","section":"series","tags":null,"title":"Effective-Cloud-Computing"},{"body":"","link":"https://kane.mx/tags/finops/","section":"tags","tags":null,"title":"FinOps"},{"body":"","link":"https://kane.mx/tags/opensearch/","section":"tags","tags":null,"title":"OpenSearch"},{"body":"","link":"https://kane.mx/tags/s3-vectors/","section":"tags","tags":null,"title":"S3 Vectors"},{"body":"Choosing a vector store on AWS for generative AI (GenAI) workloads used to be a one-line decision: pick Amazon OpenSearch Service or its serverless variant (AOSS) and move on. That changed when Amazon S3 Vectors went GA in 2025. 
By storing vector data directly in S3 and pricing it on a fully consumption-based model, S3 Vectors has reset the cost-performance frontier for vector search.\nThis post is not a rehash of the official documentation. It distills selection and tuning experience across more than 30 production GenAI projects shipped over the past year. You will get a decision tree, the cost-crossover math between the two services, and the specific migration pitfalls that bite hardest in practice — including the cosine-distance range mismatch and the metadata structure constraints.\n1. TL;DR Four rules of thumb:\nLarge scale, write-heavy, sub-second latency tolerated, pure vector retrieval — choose S3 Vectors. Per the AWS official example, 250K vectors × 40 indices × 1M queries/month costs roughly $11.38/month. Scale up to 10M vectors per index across 40 indices (400M total) with 10M queries/month and the bill is still around $1,217.29/month. The same workload on AOSS starts at $175.20/month for a 1-OCU dev configuration and $350.40/month for a 2-OCU HA production baseline (1 OCU indexing + 1 OCU search, each spread across two AZs as 0.5 + 0.5). Hybrid search (BM25 keyword + vector), complex filtering, GeoIP, or k-NN plugins required — stay on OpenSearch / AOSS. S3 Vectors is a pure vector database with no tokenization, prefix matching, or full-text fusion. Graph-and-vector hybrid retrieval (GraphRAG, LightRAG) — only OpenSearch works today. Frameworks like LightRAG embed deeply nested topology in vector metadata, while S3 Vectors only accepts flat key/value pairs capped at 2 KB of filterable metadata. Hot-path retrieval with strict latency SLOs (p95 \u0026lt; 200 ms) — S3 Vectors is the cost minimum at most data sizes, but its managed ANN index over S3 produces p95 latencies in the 100–300 ms range under typical workloads. For request paths where 100 ms matters, AOSS or an OpenSearch Domain remains the safer choice. 
The most-overlooked migration trap: when moving from OpenSearch to S3 Vectors, the cosine ordering is opposite. OpenSearch's k-NN query returns a relevance _score in [0, 1] where higher = more similar (under the Lucene engine, score = (2 − d) / 2 = (1 + cos_sim) / 2). S3 Vectors returns a raw cosine distance d = 1 − cos_sim in [0, 2] where lower = more similar. Any upstream min_score filter that was tuned against OpenSearch will silently invert recall unless you renormalize.\n2. Selection Decision Tree The decision tree below captures the five priority-ordered checks we run in our architecture reviews:\nflowchart TD Q1[Q1: Need BM25 or hybrid keyword plus vector retrieval?] Q1 --\u0026gt;|Yes| AOSS[Choose OpenSearch or AOSS] Q1 --\u0026gt;|No| Q2[Q2: Need graph traversal GraphRAG?] Q2 --\u0026gt;|Yes| AOSS Q2 --\u0026gt;|No| Q3[Q3: p95 latency must be under 200 ms?] Q3 --\u0026gt;|Yes| AOSS Q3 --\u0026gt;|No| Q4[Q4: Index size and query frequency?] Q4 --\u0026gt;|Small under 10 GB or under 100K queries per month| S3V[Choose S3 Vectors] Q4 --\u0026gt;|Very large with sustained high QPS| AOSS Q4 --\u0026gt;|Middle 10 GB to 1 TB mixed| Q5[Q5: Multi-account or multi-region fan-out needed?] Q5 --\u0026gt;|Yes| S3V Q5 --\u0026gt;|No| CO[Compute exact crossover threshold] 3. 
Feature Matrix Before drilling into cost math, the comparison table below clarifies the boundaries between the three options:\nDimension S3 Vectors OpenSearch Serverless (AOSS) OpenSearch Domain Entry-level price (minimum config) First-month minimum ~ $0.60 (10 GB storage; no PUT, no queries) Dev (1 OCU) ~ $175.20/month; HA prod (2 OCU minimum, 1 indexing + 1 search) ~ $350.40/month t3.small.search ~ $30.00/month + EBS Billing model Pay-per-use (storage + writes + API calls + data processing) Capacity-based (per-OCU-hour, auto-scaled) Resource-based (instance-hour + disk) Max vectors per index 2 billion (10K indices per bucket; 20T per bucket) Bound by OCU memory (~50M/OCU in practice) Bound by instance type Vector dimensions supported 1–4096 1–16,000 (limited by mapping) Same as AOSS Top-K per query 100 10,000 10,000 Hybrid search Not supported (pure vector) Supported via Neural Search plugin Supported Custom analyzers / plugins Not supported Limited (no third-party installs) Supported Per-vector metadata limit 40 KB total; 2 KB / 50 keys filterable No hard cap (governed by mapping) No hard cap Write throttling 1,000 req/s; max 500 vectors/batch; 2,500 vectors/s aggregate Bound by OCU capacity Bound by instance type IAM model IAM-native (s3vectors: actions) IAM policy + data-access policy (two layers) Resource policy + fine-grained access control (FGAC) Best fit Long-tail archives, AI agent long-term memory, write-heavy Real-time e-commerce search, hybrid retrieval, multi-tenant SaaS Migrating existing search clusters; HNSW tuning required 4. Cost Crossover Math The breakeven threshold between S3 Vectors and AOSS depends on three variables: index size, QPS, and AOSS OCU sizing rules.\nS3 Vectors is fully consumption-priced; AOSS is capacity-priced by the hour. All numbers below are list pricing for US East (N. 
Virginia) as of May 2026; verify against the S3 Vectors pricing page and the OpenSearch pricing page before building a long-term budget.\n4.1 S3 Vectors price components Storage: $0.06 / GB-month (logical GB across all vector indices in a bucket). Writes (PUT): $0.20 per logical GB uploaded (logical GB includes vector data, key, and metadata after AWS-side optimization — not raw float32 bytes). Query API: $2.50 / million calls. Data processing: tiered per index. Within a single index, the first 100K vectors are billed at $0.004 / TB scanned; vectors beyond 100K within that same index are billed at $0.002 / TB scanned. The tiering is per-index, not bucket-wide. 4.2 AWS official examples vs AOSS Both examples use 1,536-dimensional vectors with ~6 KB of metadata each, distributed across 40 indices.\nExample A — small/medium workload Data: 10M vectors total (250K per index × 40 indices, ~59 GB stored). Queries: 1M / month. S3 Vectors itemization: Storage: $3.54 PUT: $1.97 API + data processing: $5.87 (API $2.50, data processing $3.37) Total: $11.38/month AOSS equivalent: the 1-OCU dev configuration starts at $175.20/month. Conclusion: S3 Vectors is ~15× cheaper. Example B — medium/large workload Data: 400M vectors total (10M per index × 40 indices, ~2,353 GB stored). Queries: 10M / month. S3 Vectors itemization: Storage: $141.21 PUT: $78.46 API + data processing: $997.62 (API $25.00, data processing $972.62) Total: $1,217.29/month AOSS equivalent: the 2-OCU HA minimum costs $350.40/month, but a 400M-vector workload will not fit in 2 OCUs in practice (~50M vectors per OCU is a reasonable rule of thumb under default HNSW parameters). Sized for the workload, AOSS lands in the 8–16 OCU range, i.e. roughly $1,400–$2,800/month before storage. Conclusion: S3 Vectors at $1,217/month is roughly 3.5× the cost of the 2-OCU HA minimum on paper, but the 2-OCU minimum cannot house 400M vectors. 
Once AOSS is sized realistically (8–16 OCU at $1,400–$2,800/month), S3 Vectors comes out roughly even or cheaper, depending on QPS pattern. Key insight: S3 Vectors data-processing cost scales linearly with both per-query scan size and QPS. The crossover with AOSS is not at a single index size — it is the point where AOSS OCU count, sized to your real working set, costs less than the per-query scan bill on S3 Vectors. For p95-bound, hot-path workloads this point arrives quickly; for cold-path or batch workloads it never does.\n5. Production Selection Notes The cases below summarize production selection outcomes (project names anonymized).\n5.1 Player feedback (VOC) analytics pipeline — winner: S3 Vectors Workload: multilingual player feedback and ticket clustering for offline analysis. Scale: ~10M historical feedback vectors (1,024-dim, ~59 GB stored), ~50K new vectors/day. Query rate: one clustering batch per hour, ~720–2,000 queries/month. Actual cost: ~$5.30/month (storage $3.54 + writes $1.72 + queries $0.04). Versus the $350.40/month AOSS 2-OCU HA baseline, that's a ~66× saving. Implementation notes: CDK manages index idempotency via AwsCustomResource. The writer uses ThreadPoolExecutor(4) to call s3vectors.put_vectors in parallel with batches of 25 (well below the 500/batch and 2,500 vectors/sec aggregate caps in the feature matrix, leaving headroom for retries without hitting the throttle). Lessons learned: Metadata size guardrails. The embedded text payload itself can be long, but filterable metadata is capped at 2 KB. The ingestion layer truncates long fields into summaries to avoid import failures. Partial-failure handling. When a 25-vector batch contains one malformed vector, VectorizeResult returns PARTIAL. The writer parses and retries the failed items rather than rolling back the whole batch. 5.2 AI marketing copy generation — winner: S3 Vectors Workload: historical reference corpus (long-term memory) for an AI marketing copy agent. 
Scale: 200K reference snippets (~2 GB). Query rate: ~43K queries/day (~0.5 QPS, ~1.3M/month). Actual cost: ~$7.10/month (storage $0.12 + PUT $0.40 + API \u0026amp; data processing $6.58) — about 1/25 of the $175.20/month AOSS dev minimum, or 1/49 of the $350.40/month HA production baseline. Implementation notes: a Python VectorStorageManager exposes a unified backend interface and switches between s3_vectors and opensearch via environment variable, decoupling the storage layer to keep migrations smooth. 5.3 Retail product multimodal retrieval — winner: S3 Vectors Workload: dual-tower (image + text) retrieval over an e-commerce catalog. Scale: 4M vectors (2M SKUs × 2 vectors per SKU — image + text — at 40–80 GB). Query rate: ~43K searches/day (~1.3M/month). Dual-tower means each user search triggers two vector queries, totaling ~2.6M vector queries/month. Estimated cost (actual bill depends on vector size, QPS, and metadata; verify with the AWS Pricing Calculator): Standard parameters (1,024-dim, 4-shard index): ~$36.54/month. Worst case (2,048-dim, 1-shard index): ~$192.97/month. Even at the worst-case parameters, the cost is still only ~55% of the $350.40/month AOSS 2-OCU HA baseline. Lessons learned: Nova MME embeddingPurpose consistency. With Amazon Nova multimodal embedding (MME) models, writes must specify embeddingPurpose = \u0026quot;GENERIC_INDEX\u0026quot; and queries must specify \u0026quot;GENERIC_RETRIEVAL\u0026quot;. Mixing the two silently degrades retrieval quality with no error. Lightweight existence checks. To enforce idempotent ingestion, use s3vectors.get_vectors(returnData=False, returnMetadata=False) to test for key presence without paying for vector payload egress. 5.4 Manufacturing multimodal knowledge base — winner: OpenSearch Domain Workload: RAG over engineering drawings, SOPs, and process specs for a manufacturing line. Scale: 5M vectors, ~15K queries/day. Metadata includes nested part hierarchies, SOP versions, and release timestamps. 
Why OpenSearch was kept: Hybrid retrieval (BM25 + vector) is required. Engineering drawings and process specs contain specialized part codes and version strings (for example, SOP-CN-2026-V3). Pure vector recall scores poorly on such tokens; OpenSearch tokenization plus weighted RRF fusion is necessary. Context expansion design. The application uses ContextExtendMethod.NEIGHBOR to look up neighbor chunks via parent_doc_id and adjacent chunk_id, which requires nested boolean filters that S3 Vectors cannot express. 5.5 After-sales GraphRAG — winner: OpenSearch Domain Workload: GraphRAG over an after-sales fault graph for diagnostic recommendation. Scale: 3M vectors with deep graph topology stored in metadata. Why OpenSearch was kept: Nested metadata for graph storage. LightRAG stores complex graph topology inside vector metadata (for example, \u0026lt;SEP\u0026gt;-delimited file_path arrays and related-node references). S3 Vectors only allows flat KV metadata under 2 KB. Prefix wildcard search. GraphRAG depends on doc_id* prefix matches to walk all entities under a parent document. S3 Vectors has no wildcard or fuzzy filter support. 5.6 Online content moderation — winner: OpenSearch Domain Workload: synchronous similarity check on chat messages against a violation feature library. Scale: 10M feature vectors, peak ~300 QPS. Why OpenSearch was kept: Strict SLA. As an inline component of synchronous chat moderation, the workload requires p95 \u0026lt; 30 ms. S3 Vectors p95 sits at 100–300 ms, which cannot block in real time. Composite boolean filters. Moderation requires hard tenant isolation by app-id and time-window filtering by created-time. OpenSearch is heavily optimized for boolean filtering and caching. 6. 
Migration Pitfall Guide If you are migrating from OpenSearch / AOSS to S3 Vectors, the following points must be covered in code:\n6.1 Cosine distance renormalization S3 Vectors returns distanceMetric=cosine as a raw cosine distance d = 1 − cos_sim, naturally in [0, 2] (since cos_sim ∈ [−1, 1]); lower is more similar.\nOpenSearch's k-NN query (Lucene engine) returns a relevance _score in [0, 1] via the documented formula score = (2 − d) / 2 = (1 + cos_sim) / 2; higher is more similar.\nThe two values are not on the same scale and are ordered in opposite directions. Any legacy min_score = 0.7 filter from OpenSearch will not just shift its threshold — it will reverse pass/fail semantics.\n1# Migration helper: convert S3 Vectors cosine distance to OpenSearch-style score 2def to_opensearch_score(s3v_distance: float) -\u0026gt; float: 3 # cos_sim = 1 - d → score = (1 + cos_sim) / 2 = (2 - d) / 2 4 return (2.0 - s3v_distance) / 2.0 Applying the legacy min_score = 0.7 filter to raw S3 Vectors distances would silently flip recall — re-derive the threshold against the converted score, or rewrite the filter against the distance with the inequality reversed (distance ≤ 0.6 for the equivalent cutoff).\n6.2 Split metadata between S3 Vectors and S3 buckets In OpenSearch, dropping the entire raw text into the _source metadata for downstream rendering is a common pattern.\nIn S3 Vectors, the design must enforce a vectors-in-S3-Vectors, text-in-S3 split:\nFilterable attributes must be flat (no nesting) and stay strictly under 2 KB. The s3vectors.put_vectors metadata payload should carry only file_key plus required filter tags. After retrieval, asynchronously fetch the source JSON from S3 by file_key. 6.3 The Nova MME embeddingPurpose trap For multimodal retrieval, writes to Nova must set embeddingPurpose = \u0026quot;GENERIC_INDEX\u0026quot;; queries must set \u0026quot;GENERIC_RETRIEVAL\u0026quot;. 
Mixing the two skews the underlying angle calculation and tanks recall — without throwing any error.\n6.4 Idempotent deployment and retention For infrastructure-as-code:\nCDK retention policy. S3 Vectors has no built-in snapshotting. When defining CfnVectorBucket in CDK, set RemovalPolicy to RETAIN. An accidental delete loses millions of vectors with no recovery path. Idempotency. Inside AwsCustomResource, ignore ConflictException so repeated deploys do not red-line the pipeline. 7. Open Questions S3 Vectors is promising, but a few gaps remain before it can fully replace OpenSearch:\nCosine distance range is not officially specified. The [0, 2] range follows directly from the 1 − cos_sim formula and matches OpenSearch's documented k-NN distance, but the S3 Vectors API reference does not currently spell it out as a contract. Validate in a dev environment before migrating production traffic. No native cross-region replication (CRR) for vector indices. Standard S3 CRR can replicate the underlying objects in a vector bucket, but the vector index metadata and query endpoint do not automatically follow — the secondary region must rebuild the index programmatically. Globally available architectures must therefore dual-write at the application layer via a message queue. Sparse independent benchmarks at scale. AWS claims sub-second latency at tens-of-billions scale, but independent community benchmarks past ~1B vectors remain rare. 8. Conclusion For pure-vector workloads where p95 latency can sit at 200–300 ms and QPS is moderate or bursty, S3 Vectors is the new default. The cost story is decisive at small and medium scale (15–66× cheaper across our $5–$36/month workloads) and stays competitive at 400M vectors once AOSS is sized realistically. 
Reserve OpenSearch / AOSS for the cases this post catalogues: hybrid retrieval, graph traversal, sub-100 ms SLOs, or nested metadata filters.\nThe two migration traps that caught us hardest were the cosine score / distance direction reversal and the 40 KB / 2 KB metadata split. Get those right before flipping the application's min_score threshold.\nResources Amazon S3 Vectors documentation S3 Vectors pricing OpenSearch Service pricing OpenSearch k-NN spaces — cosine formula ","link":"https://kane.mx/posts/2026/s3-vectors-vs-opensearch/","section":"posts","tags":["AWS","S3 Vectors","OpenSearch","Vector Database","FinOps"],"title":"S3 Vectors vs OpenSearch: Decision Tree from 30+ Projects"},{"body":"","link":"https://kane.mx/series/","section":"series","tags":null,"title":"Series"},{"body":"","link":"https://kane.mx/tags/vector-database/","section":"tags","tags":null,"title":"Vector Database"},{"body":"","link":"https://kane.mx/tags/agentcore-gateway/","section":"tags","tags":null,"title":"AgentCore Gateway"},{"body":"","link":"https://kane.mx/tags/amazon-cognito/","section":"tags","tags":null,"title":"Amazon Cognito"},{"body":"","link":"https://kane.mx/tags/api-gateway/","section":"tags","tags":null,"title":"API Gateway"},{"body":"","link":"https://kane.mx/categories/authentication--authorization/","section":"categories","tags":null,"title":"Authentication \u0026 Authorization"},{"body":"","link":"https://kane.mx/tags/bedrock/","section":"tags","tags":null,"title":"Bedrock"},{"body":"","link":"https://kane.mx/tags/claude-code/","section":"tags","tags":null,"title":"Claude Code"},{"body":"","link":"https://kane.mx/categories/cloud-infrastructure/","section":"categories","tags":null,"title":"Cloud Infrastructure"},{"body":"","link":"https://kane.mx/tags/mcp/","section":"tags","tags":null,"title":"MCP"},{"body":"Introduction Amazon Bedrock AgentCore Gateway is the most pragmatic way to host a Model Context Protocol server on AWS today. 
Declare your tools as OpenAPI or as Lambda targets, get a managed multi-target MCP endpoint, and inherit AWS-native authentication via a customJwtAuthorizer. For machine-to-machine traffic that pattern is excellent.\nThe moment you ask an interactive MCP client — Claude Code, Cursor, the MCP Inspector — to talk to that same gateway with a per-user OAuth flow, the seams show. AgentCore Gateway expects a JWT and trusts whatever issuer you wired into its authorizer. Pair it with Amazon Cognito and the wiring works for the server side. It does not work for the client side, because Cognito is an OIDC identity provider, not an MCP-compliant authorization server. The two are not the same thing.\nThe MCP authorization spec is built on a specific stack of IETF RFCs — RFC 9728 (Protected Resource Metadata), RFC 8414 (Authorization Server Metadata), RFC 7591 (Dynamic Client Registration), and RFC 7636 (PKCE). I covered the full stack in my MCP authorization deep-dive and showed how Keycloak fills it cleanly with one workaround. Cognito does not fill it. Without those RFCs, Claude Code never gets past metadata discovery and reports Failed to connect.\nThis post walks through an architecture I shipped recently: a small API Gateway + Lambda façade that adds the missing RFC surfaces in front of an AgentCore Gateway backed by Cognito. The result is a claude mcp add \u0026lt;https-url\u0026gt; that just works, while keeping Cognito as the identity provider and AgentCore Gateway as the multi-target MCP runtime.\nWhy AgentCore Gateway If you treat AgentCore Gateway purely as an MCP transport, the value is real and worth restating before we critique the auth story.\nMulti-target multiplexing. One MCP endpoint fronts many backends — OpenAPI HTTP services, Lambda functions, even Smithy models — without your client ever knowing they're separate. Two backend shapes that cover most needs. 
External HTTP backends (with their own JWT authorizer) plug in via OpenAPI; in-account Lambda backends invoked via the gateway's IAM role need no second auth layer at all. Tool-level interceptors. A gateway-attached Lambda can run on every tools/list and tools/call to gate which tools a given JWT scope sees. The interceptor is your scope-to-tool mapping in code. In-place updates of the JWT authorizer's allowedClients. Adding or removing a tenant's M2M client does not rotate the gateway URL. That is the kind of operational property you only appreciate after living without it. Managed scaling, observability, and session isolation. No fleet to run, no transport code to maintain. For a sense of how an OAuth2 client interacts with this surface from the client side, see my earlier walkthroughs on invoking AgentCore-hosted MCP servers and using the MCP SDK's OAuthClientProvider.\nWhere Cognito Falls Short for MCP AgentCore Gateway's customJwtAuthorizer validates inbound JWTs against a discoveryUrl and an allowedClients list. Point that at Cognito and M2M traffic with client_credentials works immediately. 
Interactive flows are where the spec asymmetry bites.\nThe MCP authorization spec asks the resource to advertise its authorization server, and asks the authorization server to support a small handful of behaviours that Cognito does not.\nRFC What MCP requires What Cognito does RFC 9728 — Protected Resource Metadata /.well-known/oauth-protected-resource returns the resource identifier and links to its authorization_servers Not served by Cognito; AgentCore Gateway emits the header but points at the OIDC issuer RFC 8414 — Authorization Server Metadata /.well-known/oauth-authorization-server under the issuer URL Cognito only serves OIDC discovery at /.well-known/openid-configuration under the cognito-idp.\u0026lt;region\u0026gt;.amazonaws.com/\u0026lt;pool-id\u0026gt; path RFC 7591 — Dynamic Client Registration Public registration_endpoint so clients with no preconfigured client_id can self-register No public DCR endpoint; admin-only CreateUserPoolClient API RFC 6749 §3.1.2.3 — exact redirect_uri match Native MCP clients open OAuth callback listeners on a random ephemeral port Hosted UI requires the redirect_uri to match a pre-registered callbackUrls entry exactly — you cannot enumerate every port Each gap is independently a hard stop for Claude Code. Together they explain a behaviour I watched many times during the spike: paste the gateway URL into claude mcp add, see Claude Code attempt RFC 9728 metadata discovery, hit an issuer that does not serve RFC 8414, and surface a generic \u0026quot;Failed to connect.\u0026quot;\nThe third gap — DCR — is structural. MCP's design assumes a federated ecosystem where clients are not pre-provisioned by every server they want to talk to. SEP-991 softens that with Client ID Metadata Documents, but Claude Code's stable build still expects a registration endpoint to be advertised, and Cognito does not have one. 
The fourth gap is the most operationally annoying: Cognito's exact-match callback URL rule (per RFC 6749 §3.1.2.3) is correct from a security perspective but incompatible with native-app loopback redirects (RFC 8252 §7.3) on random ports.\nThis is the same shape of friction I documented for Keycloak — see Implementing MCP OAuth 2.1 with Keycloak on AWS — except Keycloak only needed the RFC 8707 audience workaround and was otherwise compliant out of the box. Cognito needs four gaps closed, not one.\nThe Façade Pattern Closing those four gaps does not require replacing Cognito or AgentCore. It requires putting a thin, stateless adapter in front of both, which:\nServes RFC 9728 / RFC 8414 metadata that points back at itself for authorization_endpoint, token_endpoint, and registration_endpoint — so the client never tries to GET RFC 8414 from a Cognito URL where it does not exist. Implements RFC 7591 by returning the same pre-provisioned Cognito user-pool app client to every caller. Conceptually a \u0026quot;fake\u0026quot; DCR, behaviourally indistinguishable from the real thing for a confidential client whose secret is held by the façade. Acts as the single registered Cognito callback URL and 302-redirects the authorization code to whatever loopback port the client picked. State is round-tripped through Cognito as an HMAC-signed opaque blob so the proxy stays stateless. Proxies everything else straight to AgentCore Gateway, including the /mcp endpoint that carries the actual JSON-RPC traffic. That is one Lambda behind one HTTP API in front of one AgentCore Gateway. The full architecture:\nflowchart LR Client[Claude Code MCP Client] --\u0026gt;|\u0026#34;1. GET /.well-known/oauth-protected-resource\u0026#34;| Facade Client --\u0026gt;|\u0026#34;2. GET /.well-known/oauth-authorization-server\u0026#34;| Facade Client --\u0026gt;|\u0026#34;3. POST /register (RFC 7591)\u0026#34;| Facade Client --\u0026gt;|\u0026#34;4. 
GET /oauth/authorize\u0026#34;| Facade Facade --\u0026gt;|\u0026#34;5. 302 to Cognito Hosted UI\u0026#34;| Cognito Cognito --\u0026gt;|\u0026#34;6. user signs in via Feishu / Cognito\u0026#34;| Cognito Cognito --\u0026gt;|\u0026#34;7. callback to facade\u0026#34;| Facade Facade --\u0026gt;|\u0026#34;8. 302 to localhost ephemeral port\u0026#34;| Client Client --\u0026gt;|\u0026#34;9. POST /oauth/token\u0026#34;| Facade Facade --\u0026gt;|\u0026#34;10. forward to Cognito\u0026#34;| Cognito Client --\u0026gt;|\u0026#34;11. POST /mcp with Bearer JWT\u0026#34;| Facade Facade --\u0026gt;|\u0026#34;12. proxy\u0026#34;| AgentCore[AgentCore Gateway] AgentCore --\u0026gt;|\u0026#34;13. JWT validated\u0026#34;| AgentCore AgentCore --\u0026gt;|\u0026#34;14. invoke target\u0026#34;| Backend[OpenAPI / Lambda Targets] The façade does four small jobs. The agent runtime does the heavy lifting. Cognito remains the source of truth for identity. Nothing in the AgentCore Gateway configuration changes — its customJwtAuthorizer still trusts the Cognito issuer, the same as for an M2M client.\nImplementation: Four Routes That Matter The façade lives in src/lambdas/oauth2-facade/handler.ts. It has more routes than I'll show here, but four of them carry the architectural weight. I'll walk each.\nRoute 1: RFC 9728 Protected Resource Metadata When AgentCore Gateway returns a 401, it emits a WWW-Authenticate: Bearer resource_metadata=\u0026quot;…\u0026quot; header per RFC 9728. The MCP client follows that URL to learn which authorization server protects the resource. The façade serves the metadata itself and points back at itself as the authorization server:\n1if (path === \u0026#34;/.well-known/oauth-protected-resource\u0026#34;) { 2 return json(200, { 3 resource: `${base}/mcp`, 4 authorization_servers: [base], 5 scopes_supported: RESOURCE_SCOPES, 6 bearer_methods_supported: [\u0026#34;header\u0026#34;], 7 }); 8} The crucial line is authorization_servers: [base]. 
Pointing at Cognito's OIDC issuer here would push the client straight back into the RFC 8414 gap. Pointing at the façade keeps discovery on a path the façade controls.\nRoute 2: RFC 8414 Authorization Server Metadata The same idea, one step further. The façade advertises itself as the authorization_endpoint and token_endpoint, while passing through userinfo, revocation, and jwks_uri to Cognito because those endpoints do not need any redirect proxying:\n1if (path === \u0026#34;/.well-known/oauth-authorization-server\u0026#34;) { 2 return json(200, { 3 issuer: ISSUER, 4 authorization_endpoint: `${base}/oauth/authorize`, 5 token_endpoint: `${base}/oauth/token`, 6 userinfo_endpoint: USERINFO, 7 revocation_endpoint: REVOCATION, 8 jwks_uri: JWKS, 9 registration_endpoint: `${base}/register`, 10 response_types_supported: [\u0026#34;code\u0026#34;], 11 grant_types_supported: [\u0026#34;authorization_code\u0026#34;, \u0026#34;refresh_token\u0026#34;], 12 code_challenge_methods_supported: [\u0026#34;S256\u0026#34;], 13 token_endpoint_auth_methods_supported: [ 14 \u0026#34;client_secret_basic\u0026#34;, 15 \u0026#34;client_secret_post\u0026#34;, 16 ], 17 scopes_supported: [\u0026#34;openid\u0026#34;, \u0026#34;email\u0026#34;, \u0026#34;profile\u0026#34;, ...RESOURCE_SCOPES], 18 }); 19} issuer stays as the real Cognito issuer because the JWT's iss claim will carry that value. AgentCore Gateway's customJwtAuthorizer validates iss against the discovery URL it was configured with. If the façade lied about the issuer here, the JWT would still pass through Cognito unchanged and AgentCore would reject it. The façade rewrites paths, not claims.\ncode_challenge_methods_supported is [\u0026quot;S256\u0026quot;] only — the proxy refuses plain PKCE explicitly, because plain transmits the verifier unhashed and negates the protection PKCE was meant to provide.\nRoute 3: RFC 7591 Dynamic Client Registration Cognito has no public DCR endpoint. 
Building one would mean exposing privileged Cognito admin APIs. The pragmatic alternative is to acknowledge that, in this deployment shape, every MCP client ends up using the same Cognito user-flow app client — its identity is the human user, not the calling application. So the façade implements a \u0026quot;fake DCR\u0026quot; that returns the same pre-provisioned client to every caller:\n1if (path === \u0026#34;/register\u0026#34; \u0026amp;\u0026amp; method === \u0026#34;POST\u0026#34;) { 2 let req: { redirect_uris?: string[] } = {}; 3 try { req = JSON.parse(event.body ?? \u0026#34;{}\u0026#34;); } catch {} 4 return json(201, { 5 client_id: USER_CLIENT_ID, 6 client_secret: USER_CLIENT_SECRET, 7 client_id_issued_at: Math.floor(Date.now() / 1000), 8 client_secret_expires_at: 0, 9 redirect_uris: req.redirect_uris ?? [\u0026#34;http://localhost:8080/callback\u0026#34;], 10 grant_types: [\u0026#34;authorization_code\u0026#34;, \u0026#34;refresh_token\u0026#34;], 11 response_types: [\u0026#34;code\u0026#34;], 12 token_endpoint_auth_method: \u0026#34;client_secret_post\u0026#34;, 13 scope: [\u0026#34;openid\u0026#34;, \u0026#34;email\u0026#34;, ...RESOURCE_SCOPES].join(\u0026#34; \u0026#34;), 14 }); 15} This satisfies Claude Code's expectation that a registration_endpoint exists and returns a client_id it can drive an authorization-code flow with. The trade-off — every MCP client ends up sharing one Cognito app client — is acceptable because the user identity flows through the JWT, not through client_id. If you need per-tenant isolation, partition by Cognito group / scope rather than by app client. SEP-991's URL-based client identity, if and when Claude Code adopts it, removes this trade-off entirely; until then, the fake DCR is the cleanest shim.\nRoute 4: The Authorization-Code Redirect Proxy This is where the work gets interesting. 
Cognito requires callbackUrls to match exactly per RFC 6749 §3.1.2.3; native apps per RFC 8252 §7.3 use loopback URIs with arbitrary ports. Both rules are correct. They are not jointly satisfiable without an indirection.\nThe proxy collapses to three handlers. GET /oauth/authorize rewrites the redirect to point at the façade and signs the original state + redirect_uri into an HMAC blob:\n1// Defense against open-redirector abuse: only loopback URIs round-trip 2if (!isAllowedClientRedirect(clientRedirect)) { 3 return json(400, { error: \u0026#34;invalid_request\u0026#34;, 4 error_description: \u0026#34;redirect_uri must be a loopback URL\u0026#34; }); 5} 6if (!inParams.get(\u0026#34;code_challenge\u0026#34;)) { 7 return json(400, { error: \u0026#34;invalid_request\u0026#34;, 8 error_description: \u0026#34;code_challenge is required (PKCE)\u0026#34; }); 9} 10 11const facadeState = signState( 12 { cs: clientState, r: clientRedirect }, 13 HMAC_KEY, 14 Date.now(), 15); 16 17const out = new URLSearchParams(); 18for (const [k, v] of inParams.entries()) { 19 if (k === \u0026#34;redirect_uri\u0026#34; || k === \u0026#34;state\u0026#34;) continue; 20 out.append(k, v); 21} 22out.set(\u0026#34;redirect_uri\u0026#34;, `${base}/oauth/callback`); 23out.set(\u0026#34;state\u0026#34;, facadeState); 24 25return redirect(`${AUTHORIZE}?${out.toString()}`); GET /oauth/callback verifies the HMAC, extracts the original loopback URL, and 302s the auth code there:\n1const decoded = verifyState(facadeState, HMAC_KEY, Date.now()); 2if (!decoded) return json(400, { error: \u0026#34;invalid_state\u0026#34;, … }); 3 4// Defense in depth: re-check loopback even on a verified state 5if (!isAllowedClientRedirect(decoded.r)) return json(400, …); 6 7const out = new URLSearchParams(); 8if (code) out.set(\u0026#34;code\u0026#34;, code); 9out.set(\u0026#34;state\u0026#34;, decoded.cs); 10return redirect(`${decoded.r}?${out.toString()}`); POST /oauth/token swaps the client's loopback redirect_uri for 
the façade's, so Cognito's redirect_uri replay check matches the single registered callback URL:\n1if (inForm.has(\u0026#34;redirect_uri\u0026#34;) || inForm.get(\u0026#34;grant_type\u0026#34;) === \u0026#34;authorization_code\u0026#34;) { 2 inForm.set(\u0026#34;redirect_uri\u0026#34;, `${base}/oauth/callback`); 3} Two security properties of this proxy are worth dwelling on, because an OAuth redirect proxy is a textbook open-redirector if you build it carelessly:\nThe HMAC is the only authority over the original redirect_uri. Without it, anyone could forge a state that redirects the auth code to an attacker URL. The façade signs {cs, r, ts} with HMAC-SHA-256 over a base64url payload, with a 10-minute TTL and a 60-second future-skew tolerance. The key is held in an SST Secret, never in code. Loopback-only enforcement, twice. isAllowedClientRedirect() checks that the URL is http://localhost, 127.0.0.1, or [::1], once at /oauth/authorize (before the state is signed) and a second time at /oauth/callback (after verifying the HMAC). If the HMAC key ever leaks, a forged state still cannot redirect codes anywhere except a loopback address — and PKCE makes the code useless without the verifier. The state.ts module is fifty lines and has no external dependencies beyond node:crypto. It is stateless: no DynamoDB, no TTL bookkeeping. A typical signed state is 150 to 200 bytes, well under Cognito's 1024-character state limit.\nWhat the AgentCore Gateway Side Looks Like (with SST) The façade is half the architecture. The other half is the AgentCore Gateway you would have built anyway, with one nuance: the user-flow Cognito app client must be in the gateway's allowedClients list alongside the M2M clients.\nI use SST v4 for the entire stack. 
SST is a thin layer over Pulumi that gives you first-class TypeScript primitives for AWS — sst.aws.Function, sst.aws.ApiGatewayV2 — and preserves access to every native Pulumi resource (aws.bedrock.AgentcoreGateway, aws.cognito.UserPool) when SST has not yet shipped a wrapper. That mix matters here because AgentCore Gateway is new enough that there is no sst.aws.AgentcoreGateway yet, but it sits naturally next to the SST-native façade Lambda and HTTP API.\nThe whole infrastructure is six TypeScript files under infra/. Cognito + IdP federation + per-target M2M clients live in one; the AgentCore Gateway and its targets in another; the façade Lambda + HTTP API in a third. SST's sst.config.ts lazy-imports them in dependency order:\n1// sst.config.ts 2async run() { 3 $transform(aws.lambda.Function, (args) =\u0026gt; { 4 if (!args.runtime || (typeof args.runtime === \u0026#34;string\u0026#34; \u0026amp;\u0026amp; args.runtime.startsWith(\u0026#34;nodejs\u0026#34;))) { 5 args.runtime = \u0026#34;nodejs24.x\u0026#34;; 6 } 7 }); 8 9 await import(\u0026#34;./infra/cognito\u0026#34;); // user pool, Feishu IdP, M2M clients 10 await import(\u0026#34;./infra/facade-api\u0026#34;); // HTTP API + user-flow Cognito client 11 await import(\u0026#34;./infra/gateway\u0026#34;); // AgentCore Gateway + targets 12 await import(\u0026#34;./infra/facade\u0026#34;); // façade Lambda + routes 13} The AgentCore Gateway itself is a native Pulumi resource. 
The Cognito user-flow client and M2M clients flow into allowedClients as Pulumi Outputs — SST resolves them at deploy time, no manual ARN copying:\n1// infra/gateway.ts 2const gateway = new aws.bedrock.AgentcoreGateway(\u0026#34;McpGateway\u0026#34;, { 3 protocolType: \u0026#34;MCP\u0026#34;, 4 authorizerType: \u0026#34;CUSTOM_JWT\u0026#34;, 5 authorizerConfiguration: { 6 customJwtAuthorizer: { 7 discoveryUrl: cognitoIssuer.apply( 8 (i) =\u0026gt; `${i}/.well-known/openid-configuration`, 9 ), 10 allowedClients: [ 11 userAppClient.id, // Claude Code (interactive) 12 backendM2mClient.id, // M2M for the in-account backend 13 // … other M2M clients 14 ], 15 }, 16 }, 17}); allowedClients is in-place updatable on AgentCore Gateway, so adding a new tenant's M2M client is a one-line edit followed by pnpm -C infra deploy --stage prod — no gateway URL rotation, no client reconfiguration. The user-flow client is generateSecret: true; the façade Lambda — not the browser — holds the secret and forwards it through client_secret_basic on the token endpoint.\nOn the façade side, SST's first-class primitives keep the wiring tight. 
The HTTP API and Lambda are both sst.aws.* resources, and the Lambda's environment block carries Outputs straight from the gateway and Cognito modules without intermediate string conversion:\n1// infra/facade.ts 2const facadeApi = new sst.aws.ApiGatewayV2(\u0026#34;FacadeApi\u0026#34;, { 3 cors: { 4 allowOrigins: [\u0026#34;*\u0026#34;], 5 allowMethods: [\u0026#34;*\u0026#34;], 6 allowHeaders: [ 7 \u0026#34;Authorization\u0026#34;, \u0026#34;Content-Type\u0026#34;, \u0026#34;Accept\u0026#34;, 8 \u0026#34;MCP-Protocol-Version\u0026#34;, // required by browser-based MCP clients 9 ], 10 }, 11}); 12 13const oauth2FacadeFn = new sst.aws.Function(\u0026#34;Oauth2Facade\u0026#34;, { 14 handler: \u0026#34;src/lambdas/oauth2-facade/handler.handler\u0026#34;, 15 timeout: \u0026#34;30 seconds\u0026#34;, 16 memory: \u0026#34;256 MB\u0026#34;, 17 environment: { 18 UPSTREAM_GATEWAY_URL: gatewayUrl, 19 COGNITO_ISSUER: cognitoIssuer, 20 COGNITO_AUTHORIZE_ENDPOINT: cognitoAuthorizeEndpoint, 21 COGNITO_TOKEN_ENDPOINT: cognitoTokenEndpoint, 22 USER_CLIENT_ID: userAppClient.id, 23 USER_CLIENT_SECRET: userAppClient.clientSecret, 24 OAUTH_STATE_HMAC_KEY: oauthStateHmacKey.value, 25 // … 26 }, 27}); 28 29facadeApi.route(\u0026#34;ANY /{proxy+}\u0026#34;, oauth2FacadeFn.arn); 30facadeApi.route(\u0026#34;ANY /\u0026#34;, oauth2FacadeFn.arn); The user-flow Cognito client has exactly one allowed callback URL — the façade's /oauth/callback, set via SST's $interpolate against the HTTP API's URL output:\n1// infra/facade-api.ts 2callbackUrls: [$interpolate`${facadeApi.url}/oauth/callback`], 3logoutUrls: [$interpolate`${facadeApi.url}/oauth/logout`], Every other localhost callback the client opens is reached by the façade's 302, never by Cognito directly.\nThe HMAC key for opaque-state signing is an sst.Secret, set once per stage with pnpm -C infra exec sst secret set OauthStateHmacKey \u0026quot;$(openssl rand -base64 32)\u0026quot; --stage \u0026lt;stage\u0026gt;. 
The empty default lets fresh stages deploy clean before the operator seeds it; the Lambda returns 503 on OAuth flows if the key is missing, which is loud enough to catch in a smoke test.\nThe full stack — including PR-preview stages auto-managed via pr-\u0026lt;N\u0026gt; and a prod stage with protect: true — is what makes this pattern operationally cheap. Adding a new MCP backend is two file edits (a target declaration in infra/gateway.ts, an OpenAPI schema or Lambda handler) and a deploy. The façade is untouched.\nCognito's Federation Story Carries Through One reason to keep Cognito in this design rather than swap it for Keycloak is that Cognito's native federation surface keeps working, untouched by the façade. In the deployment I built, the user pool federates to Feishu via OIDC, but the same pattern applies to any external IdP via Cognito's federated identities — Google, Apple, SAML, your enterprise OIDC.\n1const feishuIdp = new aws.cognito.IdentityProvider(\u0026#34;FeishuIdp\u0026#34;, { 2 userPoolId: cognitoUserPoolId, 3 providerName: \u0026#34;Feishu\u0026#34;, 4 providerType: \u0026#34;OIDC\u0026#34;, 5 providerDetails: { 6 client_id: feishuAppId, 7 client_secret: feishuAppSecret, 8 authorize_scopes: \u0026#34;openid email profile\u0026#34;, 9 oidc_issuer: \u0026#34;https://passport.feishu.cn\u0026#34;, 10 authorize_url: \u0026#34;https://passport.feishu.cn/suite/passport/oauth/authorize\u0026#34;, 11 token_url: \u0026#34;https://passport.feishu.cn/suite/passport/oauth/token\u0026#34;, 12 attributes_url: \u0026#34;https://passport.feishu.cn/suite/passport/oauth/userinfo\u0026#34;, 13 jwks_uri: \u0026#34;https://passport.feishu.cn/suite/passport/oauth/userinfo\u0026#34;, 14 }, 15 … 16}); The MCP client never sees this. It hits the façade's RFC 8414 metadata, follows the authorization_endpoint to the façade, gets 302'd to Cognito's Hosted UI, and Cognito handles the IdP picker. 
Federation is invisible to the MCP layer, exactly as it should be.\nComparing the Approaches Property Keycloak (full IdP) AgentCore + Cognito + Façade RFC 9728 protected resource metadata Native (configurable) Façade serves it RFC 8414 authorization server metadata Native Façade serves it RFC 7591 DCR Native Façade fakes it (same client to everyone) RFC 8707 audience binding Workaround via audience mapper Inherits Cognito's client_id-as-audience model PKCE S256 Native Enforced by façade Native loopback redirect Native Façade redirect proxy with HMAC state Federation to enterprise IdPs Native (broad) Cognito's native federation Operational footprint ECS Fargate + Aurora Serverless Two Lambdas + HTTP API + Cognito Cost shape Per-second container compute Per-invocation Lambda Keycloak gets you a fully MCP-compliant authorization server with a single RFC 8707 workaround. The trade-off is operating Keycloak — a stateful Java application with a Postgres backend.\nThe façade pattern is a different trade-off: you accept a small amount of TypeScript code in exchange for keeping Cognito and AgentCore Gateway. If you are already on Cognito, already on Bedrock, and your alternative would be re-platforming user identity, the façade is the lower-risk path. The Lambda is ~500 lines and the code surface is small enough to audit in a sitting. There is no database, no long-lived state, and the only secret is the HMAC key for opaque-state signing.\nOperational Notes A few items that took longer than expected to get right:\nThe HMAC key is per-stage and stable. Set it once with sst secret set OauthStateHmacKey \u0026quot;$(openssl rand -base64 32)\u0026quot; --stage \u0026lt;stage\u0026gt;. Rotating it invalidates all in-flight OAuth flows transparently — annoying, not catastrophic. There is no reason to auto-rotate it on every CI deploy. WWW-Authenticate rewriting on the proxy path. 
When the façade proxies the /mcp endpoint and AgentCore Gateway returns a 401, the upstream WWW-Authenticate header points at the gateway URL's RFC 9728 metadata, not the façade's. The façade rewrites it before returning — otherwise discovery walks the client straight back into the AgentCore-fronted version of the metadata, which advertises Cognito as the authorization server and re-opens the original problem. Token endpoint authentication forwarding. Claude Code sends client_secret_basic (Authorization header). The proxy forwards that header verbatim to Cognito. Without forwarding, Cognito sees no client credentials and returns invalid_client (HTTP 400) — a confusing failure mode the first time you hit it. MCP-Protocol-Version in the CORS allowlist. The façade's HTTP API needs MCP-Protocol-Version in allowHeaders, otherwise browser-based MCP clients (the Inspector) get blocked by preflight before any of the auth flow runs. Conclusion AgentCore Gateway is a well-designed multi-target MCP runtime with a clean JWT authorization model. Cognito is a well-designed OIDC identity provider. Neither was built with the full MCP authorization spec in mind — that spec sits on RFC 9728 / 8414 / 7591 / 7636 in a way that overlaps but does not match either component's native surface.\nA 500-line API Gateway + Lambda façade closes the four gaps: it serves the metadata documents the spec wants, fakes DCR by returning the pre-provisioned Cognito client, and proxies the authorization-code flow through a single registered callback URL with HMAC-signed opaque state for native-app loopback redirects. AgentCore Gateway and Cognito remain unmodified.\nWhere Keycloak gets you a full identity stack at the cost of operating one, this pattern lets you stay on managed Cognito and AgentCore Gateway and still hand a claude mcp add \u0026lt;https-url\u0026gt; to a teammate. 
For deployments that already live in this corner of AWS, the façade is the minimum viable adapter.\nThe whole pattern is small enough that the snippets above plus the architecture diagram are the entire load-bearing surface — the rest is SST scaffolding, an interceptor Lambda for tool-level scope gating, and the usual CI plumbing.\nResources MCP and OAuth Specifications Model Context Protocol Authorization Specification — the full MCP authorization spec, including the RFC stack RFC 6749 — OAuth 2.0 Authorization Framework — including the redirect_uri exact-match rule RFC 7591 — OAuth 2.0 Dynamic Client Registration RFC 7636 — Proof Key for Code Exchange (PKCE) RFC 8252 — OAuth 2.0 for Native Apps — loopback redirect URI rules RFC 8414 — OAuth 2.0 Authorization Server Metadata RFC 8707 — Resource Indicators for OAuth 2.0 RFC 9728 — OAuth 2.0 Protected Resource Metadata AWS Documentation Amazon Bedrock AgentCore Gateway — managed multi-target MCP runtime Amazon Cognito User Pools — federated identity provider Cognito Federated Identities — connecting to external IdPs Related Articles Technical Deconstruction of MCP Authorization: A Deep Dive into OAuth 2.1 and IETF RFC Specifications — the underlying RFC stack and IdaaS compatibility matrix Implementing MCP OAuth 2.1 with Keycloak on AWS — the comparison case: a full IdP with one workaround MCP OAuth Evolution: SEP-991 Simplifies Client Registration — where DCR is going How invoking remote MCP servers hosted on AWS AgentCore — earlier work on the AgentCore client side Leveraging MCP Client's OAuthClientProvider for Seamless AWS AgentCore Authentication — using the MCP SDK's native OAuth client ","link":"https://kane.mx/posts/2026/agentcore-gateway-cognito-mcp-oauth/","section":"posts","tags":["MCP","Model Context Protocol","AWS","Bedrock","AgentCore","AgentCore Gateway","Amazon Cognito","API Gateway","SST","OAuth 2.1","RFC 9728","RFC 8414","RFC 7591","PKCE","Claude Code"],"title":"MCP OAuth on AgentCore Gateway + 
Cognito via APIGW Façade"},{"body":"","link":"https://kane.mx/tags/model-context-protocol/","section":"tags","tags":null,"title":"Model Context Protocol"},{"body":"","link":"https://kane.mx/series/model-context-protocol/","section":"series","tags":null,"title":"Model-Context-Protocol"},{"body":"","link":"https://kane.mx/tags/oauth-2.1/","section":"tags","tags":null,"title":"OAuth 2.1"},{"body":"","link":"https://kane.mx/tags/pkce/","section":"tags","tags":null,"title":"PKCE"},{"body":"","link":"https://kane.mx/tags/rfc-7591/","section":"tags","tags":null,"title":"RFC 7591"},{"body":"","link":"https://kane.mx/tags/rfc-8414/","section":"tags","tags":null,"title":"RFC 8414"},{"body":"","link":"https://kane.mx/tags/rfc-9728/","section":"tags","tags":null,"title":"RFC 9728"},{"body":"","link":"https://kane.mx/tags/sst/","section":"tags","tags":null,"title":"SST"},{"body":"","link":"https://kane.mx/categories/ai-development/","section":"categories","tags":null,"title":"AI Development"},{"body":"","link":"https://kane.mx/tags/amazon-bedrock/","section":"tags","tags":null,"title":"Amazon Bedrock"},{"body":"","link":"https://kane.mx/tags/anthropic/","section":"tags","tags":null,"title":"Anthropic"},{"body":"","link":"https://kane.mx/categories/architecture/","section":"categories","tags":null,"title":"Architecture"},{"body":"","link":"https://kane.mx/categories/aws/","section":"categories","tags":null,"title":"AWS"},{"body":"","link":"https://kane.mx/tags/aws-marketplace/","section":"tags","tags":null,"title":"AWS Marketplace"},{"body":"","link":"https://kane.mx/tags/claude/","section":"tags","tags":null,"title":"Claude"},{"body":"","link":"https://kane.mx/tags/claude-platform-on-aws/","section":"tags","tags":null,"title":"Claude Platform on AWS"},{"body":"\u0026quot;Use Bedrock\u0026quot; was a one-line answer six months ago. As of May 11, 2026, it's not. 
Anthropic and AWS shipped Claude Platform on AWS to general availability — Anthropic's native developer platform, accessed through your AWS account, billed through AWS Marketplace, and operated by Anthropic outside the AWS security boundary.\nThere are now three ways to run Claude when AWS is anywhere in the picture:\nAmazon Bedrock — AWS operates inference; AWS is the data processor. Claude Platform on AWS — Anthropic operates inference; Anthropic and AWS are independent data processors. Auth, billing, and audit go through AWS-native plumbing. Claude Platform direct (claude.com) — Anthropic everything. AWS is not in the loop. The choice has real architectural implications: feature availability, who holds your data, what your CloudTrail trail actually records, whether HIPAA workloads can run there, and whether your AWS EDP commit can absorb the spend. Pick wrong and you'll either re-platform when a beta you need only exists on one path, or re-architect when an auditor asks why model inference is leaving the AWS security boundary.\nThis post is the decision tree.\nThe three-way landscape flowchart TB awsHeader[Your AWS Account]:::header awsHeader --- app[Application] app --\u0026gt;|InvokeModel| brk[Amazon Bedrock] app --\u0026gt;|Messages API| cpa[Claude Platform on AWS] app --\u0026gt;|Messages API| direct[Claude Platform direct] brk --\u0026gt;|inference inside AWS| brkInf[AWS infrastructure] cpa --\u0026gt;|inference at Anthropic| anth[Anthropic infrastructure] direct --\u0026gt;|inference at Anthropic| anth awsHeader -.- iam[IAM SigV4] awsHeader -.- ct[CloudTrail] awsHeader -.- bill[AWS Bill] iam -.-\u0026gt; brk iam -.-\u0026gt; cpa ct -.-\u0026gt; brk ct -.-\u0026gt; cpa bill -.-\u0026gt; brk bill -.-\u0026gt; cpa classDef aws fill:#fff7e6,stroke:#d97706,color:#92400e classDef anthropic fill:#eef2ff,stroke:#4f46e5,color:#312e81 classDef neutral fill:#f3f4f6,stroke:#6b7280,color:#374151 classDef header fill:#fef3c7,stroke:#d97706,color:#92400e,font-weight:bold 
class brk,brkInf aws class cpa,direct,anth anthropic class app,iam,ct,bill neutral The key visual: in path 1, AWS is both the entry point and the inference operator. In path 2, AWS is the entry point but Anthropic is the inference operator. In path 3, AWS is not in the picture at all.\nTwo facts shape every downstream decision:\nWho is the data processor? On Bedrock, AWS is. On Claude Platform on AWS, the user guide is explicit: \u0026quot;Both Anthropic and AWS act as independent data processors\u0026quot; and \u0026quot;Data may not reside in AWS. Inference may route to Anthropic's primary cloud.\u0026quot; Whose feature roadmap do you ride? Bedrock's feature surface is set by AWS's integration cadence. Claude Platform on AWS commits to \u0026quot;same-day model and feature access\u0026quot; — every new Anthropic API feature, including betas, ships there the day it ships on the first-party API. If you internalize those two distinctions, the rest of this post is just consequences.\nFeature parity matrix (May 13, 2026) The most cited reason to leave Bedrock has been feature lag. 
Here's the actual surface, today:\nCapability Bedrock Claude Platform on AWS Direct (claude.com) Messages API (Opus 4.7, Sonnet 4.6, Haiku 4.5) Via Converse / InvokeModel Native /v1/messages Native /v1/messages Prompt caching (5min + 1h TTL) Yes Yes (full) Yes (full) Streaming (SSE) Yes Yes Yes Batch processing Yes (Batch Inference) Yes (/v1/messages/batches) Yes Files API No Yes (beta) Yes (beta) Code execution (managed sandbox) No Yes (beta) Yes (beta) Web search / web fetch No (use Bedrock Agents) Yes Yes MCP connector No Yes (beta) Yes (beta) Agent Skills No Yes (beta, GA on AWS for tagging) Yes (beta) Claude Managed Agents No Yes (beta, no outcomes/webhooks/multi-agent) Yes (beta, full) Extended thinking Yes Yes Yes Guardrails (content filtering) Yes (Bedrock Guardrails) No (use app-layer) No Knowledge Bases (managed RAG) Yes No No Computer use Yes Yes Yes HIPAA-eligible Yes (under AWS BAA) No Limited Cross-region inference profiles Yes No (use inference_geo per request) N/A OAuth No (SigV4/IAM) No Yes OpenAI-compatible endpoints No No Yes A few non-obvious entries are worth calling out.\nWorkspace tagging is GA on Claude Platform on AWS but beta on the first-party API. That's the one place where the AWS-fronted variant is actually ahead of direct — because tags need to flow into IAM and AWS Cost and Usage Reports, AWS-side billing tooling pushed it past beta first.\nHIPAA is an explicit \u0026quot;no\u0026quot; on Claude Platform on AWS. From the AWS user guide: \u0026quot;Anthropic's HIPAA-ready program is not available on Claude Platform on AWS. Customers with HIPAA requirements should evaluate Claude in Amazon Bedrock instead.\u0026quot; This is not an oversight to wait out — it's a structural consequence of Anthropic operating the inference stack outside AWS's security and compliance perimeter. If PHI touches your prompts, the path is Bedrock.\nManaged Agents has functional gaps on Claude Platform on AWS. 
Outcome tracking, multi-agent sessions, and webhook delivery are not available there, although all three are available on direct. Long-running autonomous sessions also require re-authentication every 6 hours, which forces SigV4 credential refresh logic into any agent that runs longer than that.\nCross-region inference profiles are Bedrock-only. Claude Platform on AWS uses inference_geo per request, with two values today: us (1.1× pricing multiplier, US data centers only) and global (default, standard pricing). EU residency on the platform side is not supported at launch — for EU-resident workloads, the answer remains Bedrock.\nBilling and the AWS commit question The single biggest reason a finance team will care: where does the dollar land on your AWS bill?\nAmazon Bedrock. Token usage is a Bedrock service line. It draws down AWS Enterprise Discount Program (EDP) commits the same way EC2 or S3 spend does — directly, no Marketplace intermediary. Private Pricing Agreements with AWS apply.\nClaude Platform on AWS. Billing flows through AWS Marketplace as a SaaS subscription — Anthropic is the seller of record, AWS is the merchant. Token rates match the first-party Anthropic API (no AWS markup or discount), and the consumption shows up on your AWS bill as a Marketplace line item. The Marketplace EDP rule has two numbers worth knowing: 100% of eligible Marketplace spend retires your EDP commitment dollar-for-dollar (up from 50% historically), but the total fraction of your commitment that can be filled by Marketplace is typically capped at 25% (some contracts negotiate higher). A separate 2025 AWS policy change adds a \u0026quot;hosted on AWS\u0026quot; eligibility requirement for some Marketplace SaaS — Claude Platform on AWS is the kind of service AWS structured to qualify, but confirm with your account team before assuming, and check your specific EDP terms. Don't take a blog post's word for your contract.\nClaude Platform direct. Billed by Anthropic, no AWS involvement, paid out-of-band. 
Doesn't help AWS commit at all.\nThe architectural takeaway: if your AWS spend is committed and Claude is going to be a meaningful line item, paths 1 and 2 both keep it inside the commit. Path 3 doesn't. For teams who started on direct API and grew into AWS-shop status, this is the most concrete reason path 2 exists at all — it converts a separate Anthropic invoice into a line on your existing AWS bill, with no model switch.\nWhat CloudTrail actually records This is where the abstraction stops being equivalent across paths.\nBedrock logs bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream, and the Converse equivalents. Every inference call is a CloudTrail event, classified by AWS as a Data event for high-volume routes.\nClaude Platform on AWS uses a separate IAM service prefix: aws-external-anthropic. The action namespace is documented in the IAM actions reference and looks like this:\nRoute IAM action CloudTrail type POST /v1/messages aws-external-anthropic:CreateInference Data POST /v1/messages/count_tokens aws-external-anthropic:CountTokens Data POST /v1/messages/batches aws-external-anthropic:CreateBatchInference Data POST /v1/files aws-external-anthropic:CreateFile Data POST /v1/skills aws-external-anthropic:CreateSkill Data POST /v1/organizations/workspaces aws-external-anthropic:CreateWorkspace Management Workspace ARN format is arn:aws:aws-external-anthropic:{region}:{account-id}:workspace/{workspace-id}, which means workspace-scoped IAM policies, AWS Organizations SCPs, and permission boundaries all work. There is no per-workspace user list — workspace membership is purely an IAM evaluation. Adding someone to a workspace means attaching a policy that grants aws-external-anthropic:* against the workspace ARN; removing them means revoking the policy.\nTwo surprises in the IAM model that have already tripped up early adopters:\nGet* wildcards include reading model output and memory contents. 
GetFile authorizes both metadata and the /content endpoint — a read-only role can download bytes. GetMemoryStore reads memory contents. If you intend a true read-only boundary that excludes data exfiltration, you can't use aws-external-anthropic:Get* directly; you must enumerate.\nConsole federation issues 12-hour Claude Console sessions independent of the caller's IAM session. Granting aws-external-anthropic:AssumeConsole lets a principal open the Claude Console — and that console session lasts up to 12 hours regardless of how short the caller's IAM session is. A short-lived federated role can mint a long-lived console session. Worse, actions inside the Claude Console after federation don't appear in CloudTrail; for audit, you have to use the Claude Console's own audit logs. Treat AssumeConsole as a high-privilege grant.\nClaude Platform direct has no IAM, no CloudTrail. Audit goes through Anthropic's own admin tooling. For a regulated environment that requires AWS-native audit, this is enough by itself to disqualify path 3.\nNetwork path and PrivateLink Bedrock supports VPC endpoints for com.amazonaws.{region}.bedrock-runtime. Traffic from your VPC to inference never leaves the AWS network.\nClaude Platform on AWS supports AWS PrivateLink, but the user guide is careful: \u0026quot;PrivateLink covers only the path from your application to AWS; requests still leave AWS to reach Anthropic's infrastructure for inference.\u0026quot; So PrivateLink here is a partial guarantee — it secures the on-ramp, not the destination. If your security review needs end-to-end network isolation inside AWS, PrivateLink on path 2 doesn't give you that. 
It hides the source IP and removes the public internet hop to AWS, but the leg from AWS to Anthropic's inference stack still happens over Anthropic's network.\nClaude Platform direct is public internet only, full stop.\nFor workloads with hard \u0026quot;data must not traverse the public internet\u0026quot; requirements, only Bedrock with VPC endpoints clears the bar.\nMigrating from Bedrock to Claude Platform on AWS The migration cost is mostly client-side and mostly minor. The Anthropic SDKs ship a dedicated AWS backend. Here's the change:\n# Before: Bedrock via boto3\nimport boto3\nimport json\n\nclient = boto3.client(\u0026#34;bedrock-runtime\u0026#34;, region_name=\u0026#34;us-west-2\u0026#34;)\nresponse = client.invoke_model(\n    modelId=\u0026#34;us.anthropic.claude-sonnet-4-6-v1:0\u0026#34;,\n    body=json.dumps({\n        \u0026#34;anthropic_version\u0026#34;: \u0026#34;bedrock-2023-05-31\u0026#34;,\n        \u0026#34;max_tokens\u0026#34;: 1024,\n        \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hello!\u0026#34;}],\n    }),\n)\nprint(json.loads(response[\u0026#34;body\u0026#34;].read()))\n\n# After: Claude Platform on AWS via the Anthropic SDK\nfrom anthropic import AnthropicAWS\n\nclient = AnthropicAWS(aws_region=\u0026#34;us-west-2\u0026#34;)\nmessage = client.messages.create(\n    model=\u0026#34;claude-sonnet-4-6\u0026#34;,\n    max_tokens=1024,\n    inference_geo=\u0026#34;us\u0026#34;,\n    messages=[{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hello!\u0026#34;}],\n)\nprint(message)\nThree things to notice.\nModel IDs lose the Bedrock prefix. us.anthropic.claude-sonnet-4-6-v1:0 becomes claude-sonnet-4-6. Native Anthropic naming applies.\nNo more JSON body wrapping. The Anthropic SDK takes typed parameters; you don't manually serialize bedrock-2023-05-31 envelopes anymore. 
If you've built a wrapper layer around Bedrock to give yourself an Anthropic-style API, you can delete it.\nSigV4 is automatic when you pass aws_region to AnthropicAWS. Or you can use an API key — set CallWithBearerToken on the principal. For long-running agents, API keys are easier than refreshing SigV4 credentials every 6 hours.\nThe endpoint format is https://aws-external-anthropic.{region}.api.aws/v1/messages. If you're calling the API directly with curl, you sign with the aws-external-anthropic service in the SigV4 scope:\ncurl \u0026#34;https://aws-external-anthropic.us-west-2.api.aws/v1/messages\u0026#34; \\\n  --aws-sigv4 \u0026#34;aws:amz:us-west-2:aws-external-anthropic\u0026#34; \\\n  --user \u0026#34;$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY\u0026#34; \\\n  -H \u0026#34;x-amz-security-token: $AWS_SESSION_TOKEN\u0026#34; \\\n  -H \u0026#34;content-type: application/json\u0026#34; \\\n  -H \u0026#34;anthropic-version: 2023-06-01\u0026#34; \\\n  -H \u0026#34;anthropic-workspace-id: $ANTHROPIC_AWS_WORKSPACE_ID\u0026#34; \\\n  -d \u0026#39;{\u0026#34;model\u0026#34;: \u0026#34;claude-sonnet-4-6\u0026#34;, \u0026#34;max_tokens\u0026#34;: 1024,\n    \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hello!\u0026#34;}]}\u0026#39;\nThe anthropic-workspace-id header is required and points to the workspace in your account (wrkspc_… format). A default workspace is provisioned at sign-up; multi-workspace setups are managed through the aws-external-anthropic:CreateWorkspace action and SCPs.\nWhat you don't get to migrate cleanly:\nBedrock Guardrails configurations — no equivalent on Claude Platform on AWS. You either rebuild content filtering at the application layer, leave it on Bedrock, or accept the gap. Bedrock Knowledge Bases retrievals — no equivalent. Migrate to a separate retrieval layer (OpenSearch, pgvector, etc.) and pass context into messages directly. HIPAA workloads — see above. 
Don't migrate them. Bedrock-side Claude Code tooling, including the Agent Toolkit for AWS integration — that path stays Bedrock-native; the Toolkit isn't wired to Claude Platform on AWS at GA. For everything else, the migration is a config change and a model ID rename.\nThe decision tree Six questions, in order. Stop at the first one that pins you.\n1. Do you have hard data residency or HIPAA requirements? Yes → Bedrock. Claude Platform on AWS is not under AWS BAAs and operates outside the AWS security boundary. EU residency requires Bedrock. This question is a hard gate; don't optimize past it.\n2. Are you using Bedrock Guardrails or Knowledge Bases as a managed dependency? Yes → Bedrock, unless you're prepared to rebuild those layers. The migration cost is real. If you're new and haven't committed to either, ignore this question.\n3. Do you need a feature only on Anthropic's roadmap (Skills, MCP connector, code execution, Files API, Managed Agents)? Yes → Claude Platform on AWS. Bedrock historically lags Anthropic features by months; the new platform commits to same-day parity. If you're building anything agentic with Skills or MCP, path 2 is the only AWS-native option that exists today.\n4. Is your spend committed to AWS through an EDP, and would consolidating Anthropic billing into AWS change procurement materially? Yes, and you're not pinned by 1 or 2 → Claude Platform on AWS. Path 3 doesn't draw down EDP. Path 1 does directly. Path 2 does through Marketplace at terms you should confirm with AWS, but it's qualitatively the same outcome as path 1 for most companies.\n5. Are you using only Claude (no Llama, no Mistral, no Titan)? Yes → Claude Platform on AWS. Bedrock's value proposition is the model shelf plus AWS-native governance. If you only ever invoke anthropic.* model IDs, you're paying for shelf space you don't use.\n6. 
Are you a small team, no AWS contract, and do you want OpenAI-compatible endpoints, OAuth, or the most aggressive Anthropic feature access? Yes → Claude Platform direct (claude.com). Path 3 still has the broadest feature surface (no inference_geo restrictions, full Managed Agents, OAuth, OpenAI-compat endpoints). Skip the AWS detour entirely.\nflowchart TB start([Start]) --\u0026gt; q1{Hard data residency\u0026lt;br/\u0026gt;or HIPAA?} q1 --\u0026gt;|Yes| bed1[Bedrock] q1 --\u0026gt;|No| q2{Bedrock Guardrails\u0026lt;br/\u0026gt;or KB dependency?} q2 --\u0026gt;|Yes| bed2[Bedrock] q2 --\u0026gt;|No| q3{Need Skills, MCP,\u0026lt;br/\u0026gt;code exec, Files,\u0026lt;br/\u0026gt;Managed Agents?} q3 --\u0026gt;|Yes| cpa1[Claude Platform on AWS] q3 --\u0026gt;|No| q4{AWS EDP commit\u0026lt;br/\u0026gt;and AWS-shop?} q4 --\u0026gt;|Yes| cpa2[Claude Platform on AWS] q4 --\u0026gt;|No| q5{Only using\u0026lt;br/\u0026gt;Claude models?} q5 --\u0026gt;|Yes| cpa3[Claude Platform on AWS] q5 --\u0026gt;|No| q6{No AWS contract,\u0026lt;br/\u0026gt;want OAuth or\u0026lt;br/\u0026gt;OpenAI-compat?} q6 --\u0026gt;|Yes| direct[Claude Platform direct] q6 --\u0026gt;|No| bed3[Bedrock] classDef bedrock fill:#fff7e6,stroke:#d97706,color:#92400e classDef platform fill:#eef2ff,stroke:#4f46e5,color:#312e81 classDef directc fill:#ecfdf5,stroke:#059669,color:#064e3b class bed1,bed2,bed3 bedrock class cpa1,cpa2,cpa3 platform class direct directc The mixed case worth naming: production on Bedrock, R\u0026amp;D on Claude Platform on AWS. Your auditable production workloads stay inside the AWS security boundary; your prototyping team gets day-one access to whatever Anthropic ships next. The two paths share IAM and CloudTrail, so the operational overhead is small. 
Just write the data boundary into the architecture diagram before anyone gets clever and starts pointing prod traffic at the wrong endpoint.\nWhat's still unclear at GA A few questions don't have public answers yet, and a launch-day commit to path 2 needs to factor them in.\nSupport escalation across two vendors. When inference fails, is it an AWS ticket or an Anthropic ticket? AWS handles the auth and billing layer; Anthropic handles inference. Public SLAs for Claude Platform on AWS aren't documented in the same form Bedrock SLAs are. If your incident process depends on a single throat to choke, this matters.\nBeta feature pricing. Token rates match the direct API. But Skills, Managed Agents, and code execution are beta — beta pricing has historically been a moving target on the direct API. Whether Bedrock-side private pricing offers automatically translate to Marketplace terms is a per-account question.\nBedrock-to-Platform regional parity. Bedrock has more granular regional coverage; Claude Platform on AWS launched in 17 regions. If your application is pinned to a specific region for latency or compliance, check the What's New announcement before assuming parity.\nFor early adopters, treat path 2 as a strategic option that's production-ready for most workloads but worth piloting before bulk-cutting over.\nWhy this matters beyond the choice itself The interesting thing about Claude Platform on AWS isn't the product — it's the shape of the deal. AWS allowed an external supplier to operate a platform inside its own catalog as a first-class option, ceding the data-processor seat in the process. That's not how the previous generation of cloud-AI partnerships worked. It's a signal worth noting if you're making multi-year commitments: the lock-in pattern that selected your model when you selected your cloud is starting to soften. 
The same model can now run in your AWS account on two structurally different bases, and an Anthropic engineer making a feature decision no longer has to wait for an AWS integration team to catch up.\nFor an architect, the practical consequence is that \u0026quot;model layer\u0026quot; and \u0026quot;cloud layer\u0026quot; are now genuinely separable axes in the diagram. Pick each on its own merits.\n","link":"https://kane.mx/posts/2026/claude-platform-on-aws-vs-bedrock/","section":"posts","tags":["AWS","Anthropic","Claude","Amazon Bedrock","Claude Platform on AWS","IAM","CloudTrail","AWS Marketplace","PrivateLink"],"title":"Claude Platform on AWS vs. Bedrock: A Decision Tree"},{"body":"","link":"https://kane.mx/tags/cloudtrail/","section":"tags","tags":null,"title":"CloudTrail"},{"body":"","link":"https://kane.mx/tags/iam/","section":"tags","tags":null,"title":"IAM"},{"body":"","link":"https://kane.mx/tags/privatelink/","section":"tags","tags":null,"title":"PrivateLink"},{"body":"","link":"https://kane.mx/tags/agent-toolkit/","section":"tags","tags":null,"title":"Agent Toolkit"},{"body":"If you've been using Claude Code for AWS development, you've probably seen the pattern: you paste a CloudFormation snippet into your session, Claude suggests something plausible, you deploy it, and the stack events stream lights up with CREATE_FAILED on a property the model couldn't have known about — because its training data stopped months ago.\nThe usual workaround has been hand-rolling context into CLAUDE.md: copying service endpoint quirks, IAM condition key syntax, and PrivateLink DNS formats that the model gets wrong. It works, but it's manual, fragile, and grows without bound.\nAWS shipped the Agent Toolkit for AWS (GA) on May 6, 2026 — the official, AWS-supported path forward from the community-grade AWS Labs MCP servers, skills, and plugins. Three plugins (aws-core, aws-agents, aws-data-analytics) bundle 30+ curated skills and the AWS MCP Server, now also GA. 
Plugins ship for Claude Code and Codex out of the box; Kiro and other agents connect through direct MCP server configuration.\nThis is the fix. But installing it without thinking will burn tokens or open your IAM scope wider than you want. This post is the integration guide for developers already running Claude Code on Bedrock.\nWhy Claude Code Fails on AWS The failure modes are consistent:\nOutdated knowledge. CloudFormation resource schemas evolve. New condition keys appear. Deprecations happen. A model trained before late 2025 won't know about recent additions like AWS::Bedrock::AgentCore or new IAM session-tag conditions. It will confidently write something that parses but doesn't deploy.\nMulti-service wiring drift. A real workload involves IAM, VPC, security groups, CloudWatch, and the actual service — five or six resources that must reference each other in exactly the right way. Claude gets the first two right and starts fabricating ARN formats by resource three.\nNo environment awareness. The model doesn't know your account ID, your VPC endpoint DNS suffixes, or which AZs your subnets are in. Every context injection you forget is a hallucinated placeholder.\nThe usual answer — more CLAUDE.md context — is a patch, not a solution. 
What you actually need is live AWS knowledge wired into the tool-call loop.\nWhat's in the Toolkit (and How It Differs from AWS Labs) Before the toolkit, the ecosystem looked like this:\nflowchart TD subgraph \u0026#34;Before: AWS Labs MCP era\u0026#34; CC1[Claude Code] --\u0026gt;|tool calls| LABS[awslabs MCP servers] LABS --\u0026gt; CFN[CloudFormation MCP] LABS --\u0026gt; CDK[CDK MCP] LABS --\u0026gt; DOCS[Docs MCP] CC1 --\u0026gt;|manual context| MD[CLAUDE.md] end AWS Labs MCP servers were community-grade: useful, but inconsistently maintained, with no end-to-end skill evaluation and no official IAM scoping guidance.\nThe Agent Toolkit formalizes three layers:\nflowchart LR CC[Claude Code] --\u0026gt;|MCP protocol| MCP[AWS MCP Server managed] CC --\u0026gt;|skill discovery| SKILLS[Curated Skills] CC --\u0026gt;|installer| PLUGINS[Plugins aws-core aws-agents aws-data-analytics] MCP --\u0026gt; APIS[300+ AWS APIs] MCP --\u0026gt; KMCP[Knowledge MCP docs search] MCP --\u0026gt; SCRIPT[Sandboxed script exec] SKILLS --\u0026gt; CFN1[CloudFormation patterns] SKILLS --\u0026gt; SVL[Serverless EDA] SKILLS --\u0026gt; AGT[AgentCore] Skills are curated packages of instructions and reference materials — validated CloudFormation patterns, Well-Architected serverless heuristics, CDK idioms. They don't make API calls. They constrain the model toward patterns that actually work, and they're loaded on demand so unused skills don't burn context.\nAWS MCP Server is a managed remote endpoint (https://aws-mcp.\u0026lt;region\u0026gt;.api.aws/mcp) reached through the mcp-proxy-for-aws stdio shim. It exposes three things to the agent:\nFull AWS API coverage across 300+ services through one authenticated endpoint Sandboxed Python execution for multi-step operations Real-time documentation search via the Knowledge MCP (https://knowledge-mcp.global.api.aws), which needs no AWS credentials Plugins are the delivery mechanism. 
Three of them at GA:\nPlugin Coverage aws-core Service selection, CDK/CloudFormation, serverless, containers, storage, observability, billing, SDK usage, deployment. Start here. aws-agents Building AI agents on AWS with Amazon Bedrock and AgentCore. aws-data-analytics Data lake, analytics, ETL with S3 Tables, Glue, Athena. The concrete improvement over AWS Labs:\nIAM context keys that distinguish agent actions from human actions — you can write a policy that allows write-actions for the human role but read-only when reached through the MCP server, even if the underlying role permits writes. CloudWatch metrics and CloudTrail audit logging on every MCP request — agent activity is observable, not invisible. Skills with end-to-end evaluation — not just \u0026quot;someone wrote a markdown file.\u0026quot; AWS Labs continues to accept contributions; over time, the best of it transitions into the toolkit.\nInstalling in Claude Code Three commands. Run them inside a Claude Code session:\n# Add the marketplace\n/plugin marketplace add aws/agent-toolkit-for-aws\n\n# Install the core plugin (start here)\n/plugin install aws-core@agent-toolkit-for-aws\n\n# Optional: agents and analytics plugins\n/plugin install aws-agents@agent-toolkit-for-aws\n/plugin install aws-data-analytics@agent-toolkit-for-aws\nThe plugin ships an .mcp.json that registers the AWS MCP Server through uvx mcp-proxy-for-aws@latest. So you need uv installed locally:\n# macOS / Linux\ncurl -LsSf https://astral.sh/uv/install.sh | sh\nFor documentation search and skill discovery, no AWS credentials are needed. The Knowledge MCP is unauthenticated. 
For API calls and sandboxed script execution, you do need credentials configured locally — see my earlier post on credential_process for a pattern that doesn't leave plaintext keys in ~/.aws/credentials.\nCodex and Other Agents The same marketplace works in Codex:\n1codex plugin marketplace add aws/agent-toolkit-for-aws For agents without plugin support (Kiro and others), configure the AWS MCP Server directly in the agent's MCP config and install skills from the toolkit repo with npx skills add aws/agent-toolkit-for-aws/skills.\nScoping IAM Without Footguns The biggest mistake I've watched developers make: pointing the AWS MCP Server at credentials with AdministratorAccess. The toolkit's most important IAM feature exists precisely so you don't have to.\nThe agent-vs-human IAM distinction. Every request the AWS MCP Server forwards on your behalf carries the aws:ViaAWSMCPService context key, which is true when an MCP server makes the call and false when the principal calls AWS directly. You can write IAM policies that gate behavior on that key — read-only on the agent path, full access on the human path — without splitting roles or vending separate credentials.\nA minimal policy that denies destructive actions when reached through the MCP server, while leaving the underlying role's permissions untouched for direct human use:\n1{ 2 \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, 3 \u0026#34;Statement\u0026#34;: [ 4 { 5 \u0026#34;Sid\u0026#34;: \u0026#34;DenyDestructiveActionsViaMCP\u0026#34;, 6 \u0026#34;Effect\u0026#34;: \u0026#34;Deny\u0026#34;, 7 \u0026#34;Action\u0026#34;: [ 8 \u0026#34;cloudformation:DeleteStack\u0026#34;, 9 \u0026#34;iam:Delete*\u0026#34;, 10 \u0026#34;iam:Put*\u0026#34;, 11 \u0026#34;s3:DeleteBucket\u0026#34;, 12 \u0026#34;dynamodb:DeleteTable\u0026#34; 13 ], 14 \u0026#34;Resource\u0026#34;: \u0026#34;*\u0026#34;, 15 \u0026#34;Condition\u0026#34;: { 16 \u0026#34;Bool\u0026#34;: { 17 \u0026#34;aws:ViaAWSMCPService\u0026#34;: 
\u0026#34;true\u0026#34; 18 } 19 } 20 } 21 ] 22} This is the canonical pattern from the AWS IAM docs: the same condition key drives every \u0026quot;agent vs. human\u0026quot; boundary you might want to enforce.\nIf you want the agent to deploy (not just validate), bound the blast radius by narrowing the resource ARN to a stack-name prefix and constraining the resource types the change set can touch:\n1{ 2 \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, 3 \u0026#34;Statement\u0026#34;: [ 4 { 5 \u0026#34;Sid\u0026#34;: \u0026#34;AgentDeployBoundedStackNames\u0026#34;, 6 \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, 7 \u0026#34;Action\u0026#34;: [ 8 \u0026#34;cloudformation:CreateChangeSet\u0026#34;, 9 \u0026#34;cloudformation:ExecuteChangeSet\u0026#34;, 10 \u0026#34;cloudformation:DescribeChangeSet\u0026#34; 11 ], 12 \u0026#34;Resource\u0026#34;: \u0026#34;arn:aws:cloudformation:*:*:stack/claude-code-*/*\u0026#34;, 13 \u0026#34;Condition\u0026#34;: { 14 \u0026#34;ForAllValues:StringLike\u0026#34;: { 15 \u0026#34;cloudformation:ResourceTypes\u0026#34;: [ 16 \u0026#34;AWS::Lambda::*\u0026#34;, 17 \u0026#34;AWS::IAM::Role\u0026#34;, 18 \u0026#34;AWS::Logs::*\u0026#34;, 19 \u0026#34;AWS::Events::*\u0026#34;, 20 \u0026#34;AWS::DynamoDB::Table\u0026#34; 21 ] 22 } 23 } 24 } 25 ] 26} Two caveats, both important.\nForAllValues:StringLike (StringLike rather than StringEquals, because the list relies on * wildcards and StringEquals would compare them literally) evaluates as true when the multi-valued key is absent from the request. So if the caller doesn't pass ResourceTypes, this policy imposes no resource-type restriction at all. 
To actually enforce the constraint, either require the caller to always pass ResourceTypes (workflow convention or service control policy), or pair the existing condition with a Null block so the Allow matches only when the key is present:\n1\u0026#34;Condition\u0026#34;: { 2 \u0026#34;Null\u0026#34;: { 3 \u0026#34;cloudformation:ResourceTypes\u0026#34;: \u0026#34;false\u0026#34; 4 }, 5 \u0026#34;ForAllValues:StringLike\u0026#34;: { 6 \u0026#34;cloudformation:ResourceTypes\u0026#34;: [ \u0026#34;AWS::Lambda::*\u0026#34;, \u0026#34;...\u0026#34; ] 7 } 8} And if you want this Allow to apply only to agent-initiated changes, add an aws:ViaAWSMCPService: true condition, the same context key as the Deny policy above, used here to scope the allow to MCP traffic instead of denying it.\nUse short-lived credentials. Don't bake long-lived keys into ~/.aws/credentials. Assume a role and vend session tokens through credential_process; the same pattern works for the MCP server because it just reads from the local credential chain.\nBefore/After: Building an EventBridge Pipe Illustrative composite: wiring a DynamoDB stream into SQS via EventBridge Pipes, with the IAM role to make it work. The failure modes below are real ones I've watched models produce; the token figures are rough orders of magnitude, not measurements.\nWithout the toolkit. Claude tends to produce templates with one or more of:\nDependsOn that creates a circular reference DynamoDB table missing StreamSpecification (StreamViewType: NEW_AND_OLD_IMAGES not set), so Pipes has no stream to source from A Pipes role trust policy that targets the wrong service principal (events.amazonaws.com instead of pipes.amazonaws.com) Trust policy missing the aws:SourceArn / aws:SourceAccount confused-deputy conditions Two or three rounds of correction. Order of 8K input tokens before the template deploys cleanly.\nWith aws-core installed. 
The relevant skills activate — aws-serverless (which has an EventBridge Pipes reference under orchestration.md and a DynamoDB Streams → Pipes → Lambda pattern in architecture.md), aws-messaging-and-streaming (Pipes service-principal and confused-deputy guidance), and aws-iam (trust policy correctness). The Knowledge MCP can confirm the current AWS::Pipes::Pipe schema if needed. Claude produces a deployable template on the first pass with the correct pipes.amazonaws.com principal and proper aws:SourceArn scoping.\nOrder of 3K input tokens, one round.\nThe gain compounds across a session. A complex multi-service stack (API Gateway → Lambda → DynamoDB → EventBridge → SQS, with the right IAM for each hop) used to be a multi-hour back-and-forth. With skills active, it's one generation plus review.\nA caveat worth measuring on your own workload: the Knowledge MCP isn't free in tokens. Documentation search returns prose, and prose is expensive. For trivial single-resource changes, the search overhead can exceed what it saves. Benchmark before turning it on for every session — the AWS MCP Server emits CloudWatch metrics that make this measurable.\nLayering on Top of Your Existing CLAUDE.md If you've been hand-rolling AWS context into CLAUDE.md, don't throw it out. The toolkit and your custom context are complementary — but split them across two files so you don't leak account topology into the public repo.\nCLAUDE.md (committed) holds conventions that anyone working on the project needs: naming patterns, design rules, what the toolkit handles, project-specific overrides.\nCLAUDE.local.md (gitignored) holds environment-specific values that shouldn't be in source control: account IDs, VPC IDs, subnet IDs, security group IDs, on-call contacts. Add CLAUDE.local.md and .local.* to .gitignore if they aren't already.\nA clean layering pattern:\n1# CLAUDE.md (committed) 2 3## What the Agent Toolkit Handles 4The aws-core plugin is installed. 
It provides: 5- CloudFormation resource schemas (do not override) 6- Serverless and EDA patterns 7- IAM best practices 8- Live documentation search via the Knowledge MCP 9 10## Project Conventions 11- Naming convention: {team}-{service}-{env}-{resource} 12- Lambda timeout: 30s max (SLA constraint) 13- DynamoDB: always use on-demand billing (cost policy) 14- Primary region: ap-northeast-1 15- For account ID, VPC, and subnets, see CLAUDE.local.md 1# CLAUDE.local.md (gitignored — never commit) 2 3## Environment Values 4- AWS Account: 123456789012 (production) 5- VPC ID: vpc-0a1b2c3d4e5f (do not create new VPCs) 6- Private subnets: subnet-aaa, subnet-bbb, subnet-ccc 7- Security group for outbound: sg-0f1a2b3c Skills load automatically. The two CLAUDE.md files fill in what the toolkit can't know. The split keeps your repo shareable while still giving the agent everything it needs in your local checkout.\nThe toolkit also ships recommended rules files you can drop into a project to tell agents how to use AWS most effectively — for example, querying the MCP server before fabricating a service capability, or discovering an applicable skill before writing code from scratch. Worth borrowing into your own CLAUDE.md even if you don't adopt them wholesale.\nA Builder's Perspective: What the Toolkit Got Right I've been running a similar system for several months — an open-source aws-skills plugin set with aws-cdk, aws-cost-ops, serverless-eda, aws-agentic-ai, and a shared aws-common dependency. The architecture maps closely to what AWS shipped:\nSkills are separate from execution. Injecting expert context into the model is different from making API calls. Keep them distinct so you can use one without the other. One plugin per domain, not a monolith. A CDK-heavy project doesn't need the serverless-EDA patterns in context. Loading only what's relevant matters at scale. The MCP server is the live signal, not the only signal. 
Static skills stay valuable even when documentation search is available. Patterns and idioms don't change as fast as service features. What the official toolkit has that the community version didn't: the IAM context keys are the real differentiator (these need AWS-side support; you can't bolt them on), the MCP server is production-grade and officially supported, and the Knowledge MCP is backed by the actual documentation index, not a scrape.\nIf you're starting fresh, install aws-core and don't look back. If you've already invested in a custom plugin set like mine, the migration path is to keep your project-specific skills (they capture conventions the toolkit can't know) and let aws-core replace the generic ones.\nWhat's Still Missing Honest gaps as of May 2026:\nNo CDK construct-level validation. Skills cover CloudFormation resource schemas. CDK L2/L3 constructs that wrap those resources get less coverage. If you're generating CDK code, still run cdk synth and review the output.\nAccount topology isn't auto-discovered. The MCP server can run describe calls when asked, but it doesn't automatically prime context with your VPC, subnet, and security group IDs at session start. You still need to put those in the gitignored CLAUDE.local.md or feed them in early.\nKnowledge MCP token cost. Documentation search isn't always cheaper than the model guessing. For simple operations, it adds tokens. Measure before enabling unconditionally.\nPlugin support beyond Claude Code and Codex. Plugins are agent-specific. For Kiro and other agents, you configure the MCP server directly and install skills, but you don't get the bundled experience.\nWatch For The toolkit will move quickly. 
Things worth tracking:\nMore plugins beyond the initial three (security, networking, migration are obvious gaps) CDK construct-level skills, not just CloudFormation Cross-toolkit composition with AgentCore so agents can call other agents inside the MCP boundary Summary The Agent Toolkit for AWS is a meaningful upgrade for Claude Code + AWS workflows. Validated skills give better first-pass accuracy on CloudFormation. The Knowledge MCP keeps the model current with breaking schema changes. The IAM context keys make it safe to give agents real credentials.\nInstall aws-core. Scope the IAM with the agent-vs-human distinction. Layer it on top of your existing CLAUDE.md rather than replacing it. Benchmark the Knowledge MCP token cost for your specific workload before turning it on for every session.\nIf you want to see how this fits with a custom plugin set, my aws-skills repo is open source — it predates the official toolkit, the architecture transfers, and the companion post walks through the same workflow with the community-grade version.\nRelated posts: Build on AWS Faster with Claude Code and AWS Skills · Secure AWS Credentials with credential_process · Claude Code Cost Per Project on AWS\n","link":"https://kane.mx/posts/2026/agent-toolkit-for-aws-claude-code/","section":"posts","tags":["AWS","Claude Code","MCP","Agent Toolkit","Amazon Bedrock","CloudFormation","IAM","Serverless"],"title":"Agent Toolkit for AWS: What It Changes for Claude Code"},{"body":"","link":"https://kane.mx/tags/cloudformation/","section":"tags","tags":null,"title":"CloudFormation"},{"body":"","link":"https://kane.mx/categories/developer-tools/","section":"categories","tags":null,"title":"Developer Tools"},{"body":"","link":"https://kane.mx/tags/serverless/","section":"tags","tags":null,"title":"Serverless"},{"body":"","link":"https://kane.mx/tags/autonomous-development/","section":"tags","tags":null,"title":"Autonomous 
Development"},{"body":"","link":"https://kane.mx/tags/aws-spot/","section":"tags","tags":null,"title":"AWS Spot"},{"body":"","link":"https://kane.mx/tags/ci/","section":"tags","tags":null,"title":"CI"},{"body":"","link":"https://kane.mx/tags/ci/cd/","section":"tags","tags":null,"title":"CI/CD"},{"body":"","link":"https://kane.mx/tags/cost-optimization/","section":"tags","tags":null,"title":"Cost Optimization"},{"body":"","link":"https://kane.mx/categories/devops/","section":"categories","tags":null,"title":"DevOps"},{"body":"","link":"https://kane.mx/tags/github-actions/","section":"tags","tags":null,"title":"GitHub Actions"},{"body":"","link":"https://kane.mx/tags/graviton/","section":"tags","tags":null,"title":"Graviton"},{"body":"Autonomous AI dev teams move the bottleneck. When a dispatcher fans out work to dev and review agents every 5 minutes, the constraint is no longer human attention — it is the CI/CD pipeline that gates every PR. Each agent push triggers builds, tests, E2E verification, and bot reviews. With even a small team of agents iterating in parallel, GitHub Actions minutes become the dominant operational cost.\nHosted runners are convenient, but they are billed per-minute and capped at 2 cores on the standard tier. For a team of agents that pushes dozens of times an hour and runs container builds + Playwright suites, the math stops working. 
The lever is self-hosted runners on AWS spot EC2 — bigger instances, multi-architecture, scaling to zero when idle.\nWhy CI/CD Becomes the Bottleneck for AI Agents In the autonomous dev team architecture, three categories of agent activity push the CI/CD pipeline:\nActivity Frequency Cost characteristic Dev agent commits Multiple per task; one per fix iteration Each commit re-runs lint, type, unit, build Review agent verification Every PR; every revision Triggers preview deploy, E2E tests, bot review Hook-enforced re-checks Before commit, before push, before completion Hits CI status APIs, may trigger reruns A single feature implementation typically iterates 3–8 times through the dev → review → fix loop. Each iteration is a full pipeline run. Multiply by parallel issues, multiply again by container image builds and Playwright suites, and the monthly hosted-runner bill scales linearly with agent throughput.\nThe other constraint is horsepower. Modern projects target both arm64 (Graviton, Apple Silicon, AWS Lambda arm64) and x86_64 (legacy services, GPU container base images, vendor SDKs). GitHub does offer standard arm64 hosted runners (ubuntu-24.04-arm, ubuntu-22.04-arm) on all plans now, but in private repos they're capped at 2 vCPU — the same ceiling as the x64 standard tier. For container builds, full Playwright suites, and any compile step that scales with cores, 2 vCPU is the bottleneck. 
Larger hosted runners exist on Team / Enterprise plans, but their per-minute pricing is a multiple of the standard rate.\nThe Self-Hosted Spot Approach The mechanics are straightforward:\nflowchart LR A[GitHub Actions Job] --\u0026gt;|workflow_job event| B[Webhook Lambda] B --\u0026gt;|enqueue| C[SQS Queue] C --\u0026gt; D[Scale-Up Lambda] D --\u0026gt;|RunInstances Spot| E[EC2 Runner] E --\u0026gt;|register| F[GitHub App] E --\u0026gt;|run job| G[Job Complete] G --\u0026gt;|idle 15min| H[Scale-Down Lambda] H --\u0026gt;|TerminateInstances| I[Idle, near-zero cost] The control plane is a set of small Lambda functions that translate GitHub webhook events into EC2 spot launches. The data plane is whatever EC2 instance type fits the job. When no jobs are queued the data plane scales to zero — the only running cost between bursts is the small Lambda + SQS + S3 control plane.\nSeveral open-source projects implement this pattern. The reference implementation used here is the terraform-aws-github-runner module, originally from Philips Labs and now community-maintained. A feature branch of the module ships an opinionated multi-architecture deployment under deployments/shared-runners/ — that deployment is the example this post walks through, not the subject. Copy the patterns, adapt the rest.\nMulti-Architecture Fleet Behind One Webhook The example deployment routes a single GitHub App webhook to two distinct fleets, dispatched by GitHub label:\n1multi_runner_config = { 2 \u0026#34;linux-arm64\u0026#34; = { 3 matcherConfig = { 4 labelMatchers = [[\u0026#34;self-hosted\u0026#34;, \u0026#34;linux\u0026#34;, \u0026#34;arm64\u0026#34;]] 5 exactMatch = true 6 } 7 runner_config = merge(local.common_runner_config, { 8 runner_architecture = \u0026#34;arm64\u0026#34; 9 instance_types = [\u0026#34;c8g.2xlarge\u0026#34;] 10 runners_maximum_count = 10 11 # owners = self-account: the AMI is built by Packer in this account. 
12 # Use a public AMI ID + Canonical\u0026#39;s owner ID if you skip the AMI build. 13 ami = { 14 filter = { name = [\u0026#34;github-runner-ubuntu-noble-arm64-*\u0026#34;] } 15 owners = [data.aws_caller_identity.current.account_id] 16 } 17 }) 18 } 19 20 \u0026#34;linux-amd64\u0026#34; = { 21 matcherConfig = { 22 labelMatchers = [[\u0026#34;self-hosted\u0026#34;, \u0026#34;linux\u0026#34;, \u0026#34;x64\u0026#34;]] 23 exactMatch = true 24 } 25 runner_config = merge(local.common_runner_config, { 26 runner_architecture = \u0026#34;x64\u0026#34; 27 instance_types = [\u0026#34;c7a.4xlarge\u0026#34;, \u0026#34;c7i.4xlarge\u0026#34;, \u0026#34;m7a.4xlarge\u0026#34;] 28 runners_maximum_count = 5 29 }) 30 } 31} Three properties are worth highlighting:\nExact-match label routing. Setting exactMatch = true on each matcher prevents a job tagged [self-hosted, linux, x64] from accidentally landing on the arm64 fleet. Without it, partial matches let a job leak across architectures and either fail the build or burn an instance for nothing.\nMulti-pool spot for amd64. The amd64 fleet lists three instance families (c7a AMD, c7i Intel, m7a AMD higher-memory). When AWS spot capacity tightens on any single pool, the allocator picks another. The arm64 fleet uses one type because Graviton4 c8g.2xlarge capacity has been consistently available in us-east-1 across the AZs we use; multi-pool fallback would still be a safer default for fleets at higher scale or in capacity-constrained regions.\nPer-fleet caps. runners_maximum_count is enforced inside the scale-up Lambda — every invocation queries the current count and clamps new launches to min(requested, max - current). With scale_up_reserved_concurrent_executions = 1, only one Lambda instance executes at a time, which avoids races on the count check. 
The tradeoff: at high burst (say, 20 agent pushes within a minute), SQS messages drain sequentially rather than in parallel — fine for most teams, but raise the reserved concurrency if your agents tend to spike all at once.\nA typical workflow picks its fleet:\n1jobs: 2 arm64-build: 3 runs-on: [self-hosted, linux, arm64] 4 # ... 5 amd64-build: 6 runs-on: [self-hosted, linux, x64] 7 # ... Or makes the choice configurable per repo:\n1runs-on: ${{ vars.RUNNER_LABEL \u0026amp;\u0026amp; fromJSON(vars.RUNNER_LABEL) || \u0026#39;ubuntu-latest\u0026#39; }} Set the repo variable RUNNER_LABEL to a JSON array string, not a plain label — for example [\u0026quot;self-hosted\u0026quot;,\u0026quot;linux\u0026quot;,\u0026quot;arm64\u0026quot;]. The fromJSON parses it back into a labels list.\nSpot Allocation: Why price-capacity-optimized Beats lowest-price The Terraform module's default is lowest-price. The example deployment overrides this:\n1common_runner_config = { 2 instance_target_capacity_type = \u0026#34;spot\u0026#34; 3 instance_allocation_strategy = \u0026#34;price-capacity-optimized\u0026#34; 4 # ... 5} lowest-price picks the cheapest pool at the moment of launch. That works until a popular pool tightens and the allocator keeps picking it because it remains nominally the cheapest, while interruptions spike. 
price-capacity-optimized is AWS's recommended strategy: it weights price and available capacity, biasing toward pools where launch is most likely to succeed and least likely to be reclaimed in the near term.\nFor an autonomous AI dev team, this is the difference between \u0026quot;the review agent re-runs E2E tests three times because its runner kept getting interrupted\u0026quot; and \u0026quot;the job runs once on a stable pool and finishes.\u0026quot; Interruption recovery is not free — the scale-up Lambda has to handle the queued retry, the dev agent may need to fetch new logs, and wall-clock time stretches.\nWhen a spot interruption does hit a running job, GitHub marks the job as failed (the runner stops responding before completing the job). Configure job_retry in the module (the example deployment uses max_attempts: 1 with a 180-second delay) to automatically requeue the workflow_job event; the agent then sees a normal failure-and-retry rather than a stuck pipeline. For higher availability, enable_on_demand_failover_for_errors = [\u0026quot;InsufficientInstanceCapacity\u0026quot;] (a per-fleet field on the multi-runner submodule) falls back to on-demand when spot can't be scheduled at all.\nHardened AMIs: Security and Determinism at Build Time A self-hosted runner is a long-running process that executes arbitrary code from your repos. Treat its AMI like a production server image, not a CI cache. Pre-baking also eliminates first-boot drift and shaves seconds off cold-start. The example deployment ships two Packer directories — images/ubuntu-noble-arm64/ and images/ubuntu-noble/ — that build from Canonical's Ubuntu 24.04 Pro base and bake in the entire runner toolchain. (Pro is the same OS as standard 24.04 LTS; the Pro base AMI just carries the entitlement metadata for ESM and livepatch, which earns its keep when you're running the same image for months between rebuilds.)\nThe security and determinism baseline:\nLatest OS patches at build time. 
Packer runs apt-get -y upgrade (with force-confdef + force-confold to avoid interactive prompts) before snapshotting, so the AMI ships with the kernel/glibc/openssl patches available at build time. Schedule periodic rebuilds — fresh AMIs are how you ship the next month's CVE fixes to the fleet.\nEncrypted EBS root. The launch template specifies encrypted = true on the gp3 root volume. Default-encrypt the account, but enforce it at the launch template too — defense in depth costs nothing here.\nIMDSv2 enforced on both the AMI and the builder. Setting imds_support = \u0026quot;v2.0\u0026quot; on the Packer source makes the resulting AMI register as IMDSv2-only, blocking SSRF-style metadata exfiltration from any process on the runner. The builder instance also needs metadata_options { http_tokens = \u0026quot;required\u0026quot; }, because AWS accounts with httpTokensEnforced reject any IMDSv1 launch — including the launch Packer itself does to build the AMI. Easy to miss, easy to debug from the launch error.\nNo first-boot userdata, no runner-binary download. Combined with enable_runner_binaries_syncer = false and enable_userdata = false, every spot launch is a fast, deterministic boot — no first-boot apt churn, no runner binary fetched from S3 at launch time. Whatever lives on the AMI is what runs.\nMinimal IAM on the runner instance role. The runner only needs SSM (for debugging, optional), CloudWatch Logs, and any per-job permissions your workflows actually require. 
Don't reuse a build-server role with broad write access — every workflow that runs is implicitly trusted with that role's permissions.\nThe toolchain baked in:\nTool Why Node.js 24 Frontend builds, Lambda runtime parity Bun Fast TS/JS bundling for AI dev workflows Playwright Chromium E2E tests run by review agents Docker CE Container image builds AWS CLI v2 Deploy steps, integration tests CloudWatch Agent Metrics + logs without extra setup Shared-Runner Isolation: What's at Stake A shared self-hosted runner that serves multiple repos is a cross-repo trust boundary. GitHub's security hardening guide is direct about this: self-hosted runners are not recommended for public repositories (any PR can run code on them), and runners shared at the org level are explicitly flagged because \u0026quot;a security compromise of these environments can result in a wide impact.\u0026quot;\nTwo design choices keep the blast radius small:\nLock the fleet to specific repos. The example deployment uses repository_white_list to scope the GitHub App to a known set of repos, and configures runners at the repo level rather than the org level. This is the upstream module's recommended starting point and matches GitHub's own guidance.\nPrefer ephemeral runners for any repo with untrusted contributors. The example uses persistent runners with a 15-minute idle window — acceptable for a private fleet of trusted repos, where the cost of relaunch latency outweighs the marginal isolation benefit. For repos with external PRs or higher-risk workloads, flip enable_ephemeral_runners = true (and enable_jit_config = true); each runner then handles exactly one job and is destroyed. You pay relaunch latency on every job, but a compromised job can't observe the next one. 
GitHub also notes that even with JIT runners, re-using underlying hardware can leak data between runs — combine ephemeral runners with a fresh-instance-per-job lifecycle (the default when EC2 termination follows job completion) so no two jobs ever share a host.\nRun two fleets when trust levels mix. If your project mixes trusted internal repos with untrusted external contributions, the safest move is two separate fleets — persistent and cheap for the trusted side, ephemeral and isolated for the rest. The same multi-runner module handles both behind one webhook with different label matchers.\nPer-Project Cost Attribution Without Per-Project Fleets A shared runner fleet handles jobs from many repos in its lifetime. Tagging EC2 instances with Project=foo would attribute the wrong project — a single runner instance might handle a job from repo-a followed by a job from repo-b before going idle.\nThe right answer is CloudWatch Logs Insights against the webhook Lambda log, not CloudWatch Metrics. The upstream Lambdas don't emit repository as a metric dimension (only as EMF metadata, which Metrics Explorer cannot aggregate). The webhook Lambda does log full repo + job + action + conclusion on every workflow_job event:\n1fields github.repository as repo, github.action as action 2| filter ispresent(github.repository) 3| stats count() as events by repo, action 4| sort events desc For wall-clock cost attribution, derive duration from the timestamp delta between the queued and completed events for the same workflowJobId. In Logs Insights, @timestamp behaves as a millisecond-resolution numeric value for arithmetic — (max(@timestamp) - min(@timestamp)) returns milliseconds, dividing by 1000 yields seconds. 
(Wrap with toMillis(@timestamp) if you want the numeric conversion explicit.)\n1fields @timestamp, github.repository as repo, github.workflowJobId as job_id 2| filter ispresent(job_id) and ispresent(repo) 3| stats min(@timestamp) as first_at, 4 max(@timestamp) as last_at, 5 (max(@timestamp) - min(@timestamp)) / 1000 as wall_sec, 6 count() as events 7 by repo, job_id 8| filter events \u0026gt;= 2 9| stats count() as jobs, 10 sum(wall_sec) as total_wall_sec, 11 avg(wall_sec) as avg_wall_sec 12 by repo 13| sort total_wall_sec desc wall_sec is GitHub's view, which includes time spent waiting for a runner and the log-delivery jitter between events — treat it as an approximation, not exact compute time. For pure compute time, key the same query off action = in_progress events instead. (Verify the field path against your own webhook log — the JSON shape can shift across module versions.)\nIn our fleet a high skipped ratio on a repo (60%+ of completed jobs) has reliably traced back to a paths filter that's too broad — every push fires every job, GitHub-side conditional skip dismisses most of them, but each one still costs a webhook → SQS → Lambda round-trip. Worth fixing at the repo, not the fleet.\nWhat This Costs in Practice Sampled from us-east-1 in May 2026; spot prices move with demand and AZ:\nResource Approximate cost EC2 Spot c8g.2xlarge (arm64, 8 vCPU) $0.11–0.16/hr EC2 Spot c7a.4xlarge (amd64, 16 vCPU) $0.34–0.46/hr Lambda + SQS + S3 control plane ~$0.50–1.00/month NAT Gateway (if not already shared) ~$32/month + $0.045/GB data Idle compute scales to zero — between bursts the only running cost is the control plane (and NAT, if your VPC needs a dedicated one). 
Runners stay alive for a 15-minute idle window after the last job (configurable via idle_config), so a burst of agent activity doesn't pay relaunch cost for the next job.\nThe cost comparison is unit-economics, not apples-to-apples hardware:\nA standard GitHub-hosted Linux x64 minute is $0.006/min, billed against private-repo minutes. A c8g.2xlarge spot at $0.13/hr is ~$0.0022/min — about 2.7× cheaper per minute. The c8g.2xlarge has 8 vCPU vs the standard hosted runner's 2 vCPU, so a CPU-bound job (compile, container build, Playwright suite) typically finishes ~2× faster wall-clock. Combined effect: roughly 5× lower cost-per-job for compute-heavy work. I/O-bound jobs see closer to the raw 2.7× rate ratio. For an autonomous dev team running mostly compile + test + Playwright cycles, 5× is a defensible expectation; the high end of the range needs a measured benchmark from your own fleet. Hand-Off Back to the Autonomous Dev Team Stitching this back to the original constraint: an autonomous dev team is rate-limited by what its CI/CD pipeline costs and how fast it runs. With self-hosted spot runners:\nFaster iteration — c8g.2xlarge finishes lint/type/test cycles in roughly half the wall-clock time of a 2-vCPU hosted runner, so the dev → review → fix loop closes quicker. Native arm64 — no emulation when projects target Graviton or Lambda arm64, cutting build time again. Better unit economics — cost is still a function of job count and runtime, but the per-minute rate drops sharply and the fleet cap holds an upper bound on burst spend. Near-zero idle — between bursts the data plane scales to zero; only the small control plane (and NAT, if any) keeps running. The CI pipeline stops being the cost ceiling, and the agents can iterate freely. 
The remaining bottleneck shifts back to where it belongs — what the agents are actually building.\nResources terraform-aws-github-runner (upstream) — the community Terraform module feat/multi-runners branch — example multi-architecture deployment under deployments/shared-runners/ Autonomous Dev Team (previous post) — the agent architecture this CI infrastructure supports AWS Spot allocation strategies — price-capacity-optimized rationale GitHub Actions self-hosted runner docs — registration, security model, label semantics GitHub Actions security hardening — shared runner risks, ephemeral runners, JIT config ","link":"https://kane.mx/posts/2026/self-hosted-github-runners-aws-spot/","section":"posts","tags":["GitHub Actions","Self-Hosted Runners","AWS Spot","Graviton","Terraform","CI/CD","Cost Optimization","Autonomous Development"],"title":"Self-Hosted GitHub Runners on AWS Spot for AI Dev Teams"},{"body":"","link":"https://kane.mx/tags/self-hosted-runners/","section":"tags","tags":null,"title":"Self-Hosted Runners"},{"body":"","link":"https://kane.mx/tags/terraform/","section":"tags","tags":null,"title":"Terraform"},{"body":"","link":"https://kane.mx/categories/ai-engineering/","section":"categories","tags":null,"title":"AI Engineering"},{"body":"","link":"https://kane.mx/tags/cost-allocation/","section":"tags","tags":null,"title":"Cost Allocation"},{"body":"","link":"https://kane.mx/tags/cost-management/","section":"tags","tags":null,"title":"Cost Management"},{"body":"","link":"https://kane.mx/tags/session-tags/","section":"tags","tags":null,"title":"Session Tags"},{"body":"If you run claude against Amazon Bedrock across a dozen repos, your bill arrives as one opaque number. Until recently, the workaround was clunky — create an application inference profile per project, swap them by hand, hope you remembered which one was active. 
In April 2026, AWS shipped native per-principal cost attribution for Bedrock: every InvokeModel call now carries the caller's IAM principal and session tags straight into Cost Explorer and CUR 2.0. That is enough to turn a ~60-line shell recipe into a per-repo billing system.\nThis post walks through the pattern I run locally — a cc shell wrapper that detects the current git project, a credential_process script that assumes a role with Project=\u0026lt;repo\u0026gt; as a session tag, and the IAM plumbing behind it. It builds on the credential_process foundation from my previous post on secure AWS credentials.\nWhat AWS Shipped in April 2026 The granular cost attribution feature for Amazon Bedrock attributes every InvokeModel* call to the IAM principal that made it, with no additional resources to create. Three mechanisms feed into the final cost line item:\nPrincipal tags — static tags attached to an IAM user or role. Applied to every request that principal makes. Session tags — dynamic tags passed through sts:AssumeRole --tags (or the equivalent OIDC/SAML assertion). Each assumed-role session can carry a different set. CUR 2.0 — the line_item_iam_principal column plus tag columns expose the attribution to queries and to Cost Explorer's \u0026quot;Group by tag\u0026quot; views. Session tags are what matter for per-repo billing. One human, many projects, one source credential — and a different Project tag on every assumed-role session. No dedicated inference profiles, no ARN rotation, no application code changes. The feature is free in commercial regions; expect a 24–48 hour lag between activating a tag and seeing it in Cost Explorer.\nArchitecture Overview The system is five moving parts: a shell function that detects the project, an AWS CLI profile wired to a credential_process script, the script that calls STS with session tags, an IAM role configured to accept those tags, and the Cost Explorer tag activation that makes the Project tag billable. 
Every piece is inert without the others — which is a feature, because each piece is tiny.\nflowchart LR A[cd my-repo] --\u0026gt; B[cc] B --\u0026gt;|git rev-parse| C[CC_PROJECT=my-repo] C --\u0026gt; D[claude with AWS_PROFILE=cc-tracked] D --\u0026gt; E[AWS SDK] E --\u0026gt;|credential_process| F[cc-creds script] F --\u0026gt;|sts:AssumeRole --tags Project=my-repo| G[IAM Role] G --\u0026gt;|temporary creds| E E --\u0026gt;|InvokeModel| H[Amazon Bedrock] H --\u0026gt;|cost event + principal + tags| I[CUR 2.0 / Cost Explorer] I --\u0026gt;|Group by tag Project| J[Per-repo bill] The rest of this post walks each arrow in that diagram, from left to right.\nThe cc Shell Wrapper The wrapper does four things and nothing more: detect the current git project, set two environment variables, pass through optional role/profile overrides, and launch Claude Code against Bedrock. Drop it into ~/.bashrc, ~/.zshrc, or whatever your shell loads at startup.\ncc() {\n  local project\n  project=$(basename \u0026#34;$(git rev-parse --show-toplevel 2\u0026gt;/dev/null || pwd)\u0026#34;)\n  CC_PROJECT=\u0026#34;$project\u0026#34; \\\n  CC_USER=\u0026#34;${USER:-unknown}\u0026#34; \\\n  CC_SOURCE_PROFILE=\u0026#34;${CC_SOURCE_PROFILE:-default}\u0026#34; \\\n  CC_ROLE_ARN=\u0026#34;${CC_ROLE_ARN:-}\u0026#34; \\\n  AWS_PROFILE=cc-tracked AWS_REGION=us-west-2 \\\n  CLAUDE_CODE_USE_BEDROCK=1 \\\n  ANTHROPIC_MODEL=\u0026#39;global.anthropic.claude-opus-4-7\u0026#39; \\\n  claude \u0026#34;$@\u0026#34;\n}\nThe git rev-parse --show-toplevel 2\u0026gt;/dev/null || pwd pattern falls back to the current directory when you are not inside a git repo, so cc still works for ad-hoc experiments — they just bill against the directory name instead of a repo name. 
CC_USER is also tagged so you get per-user attribution for free on shared boxes.\nSet CC_ROLE_ARN once in the same shell rc file (it is the role that cc-creds will assume) and leave CC_SOURCE_PROFILE pointing at whatever base credential profile you already use.\nThe cc-creds Credential Process Script This is the heart of the setup. AWS SDKs will execute any program named in a profile's credential_process field and expect a one-line JSON envelope back (Version: 1, access key, secret, session token, expiration). The SDK then uses those credentials until Expiration, at which point it re-runs the script.\nSave this as cc-creds somewhere on your PATH and chmod +x it:\n#!/usr/bin/env bash\n# credential_process for Claude Code with per-project session tags.\nset -euo pipefail\n\nPROJECT=\u0026#34;${CC_PROJECT:-unknown}\u0026#34;\nUSER_TAG=\u0026#34;${CC_USER:-${USER:-unknown}}\u0026#34;\nSOURCE_PROFILE=\u0026#34;${CC_SOURCE_PROFILE:-default}\u0026#34;\nROLE_ARN=\u0026#34;${CC_ROLE_ARN:-}\u0026#34;\nDURATION=\u0026#34;${CC_SESSION_DURATION:-3600}\u0026#34;\n\nif [[ -z \u0026#34;$ROLE_ARN\u0026#34; ]]; then\n  echo \u0026#39;{\u0026#34;error\u0026#34;:\u0026#34;CC_ROLE_ARN not set\u0026#34;}\u0026#39; \u0026gt;\u0026amp;2\n  exit 1\nfi\n\nSESSION_NAME=\u0026#34;cc-${PROJECT}-$(date +%s)\u0026#34;\nSESSION_NAME=$(echo \u0026#34;$SESSION_NAME\u0026#34; | tr -c \u0026#39;a-zA-Z0-9_=,.@-\u0026#39; \u0026#39;-\u0026#39; | cut -c1-64)\n\nRESULT=$(AWS_PROFILE=\u0026#34;$SOURCE_PROFILE\u0026#34; aws sts assume-role \\\n  --role-arn \u0026#34;$ROLE_ARN\u0026#34; \\\n  --role-session-name \u0026#34;$SESSION_NAME\u0026#34; \\\n  --duration-seconds \u0026#34;$DURATION\u0026#34; \\\n  --tags \u0026#34;[{\\\u0026#34;Key\\\u0026#34;:\\\u0026#34;Project\\\u0026#34;,\\\u0026#34;Value\\\u0026#34;:\\\u0026#34;${PROJECT}\\\u0026#34;},{\\\u0026#34;Key\\\u0026#34;:\\\u0026#34;User\\\u0026#34;,\\\u0026#34;Value\\\u0026#34;:\\\u0026#34;${USER_TAG}\\\u0026#34;}]\u0026#34; \\\n  --output json 2\u0026gt;\u0026amp;1) || {\n  echo \u0026#34;sts assume-role failed: $RESULT\u0026#34; \u0026gt;\u0026amp;2\n  exit 1\n}\n\nAK=$(echo \u0026#34;$RESULT\u0026#34; | python3 -c \u0026#34;import json,sys; print(json.load(sys.stdin)[\u0026#39;Credentials\u0026#39;][\u0026#39;AccessKeyId\u0026#39;])\u0026#34;)\nSK=$(echo \u0026#34;$RESULT\u0026#34; | python3 -c \u0026#34;import json,sys; print(json.load(sys.stdin)[\u0026#39;Credentials\u0026#39;][\u0026#39;SecretAccessKey\u0026#39;])\u0026#34;)\nST=$(echo \u0026#34;$RESULT\u0026#34; | python3 -c \u0026#34;import json,sys; print(json.load(sys.stdin)[\u0026#39;Credentials\u0026#39;][\u0026#39;SessionToken\u0026#39;])\u0026#34;)\nEX=$(echo \u0026#34;$RESULT\u0026#34; | python3 -c \u0026#34;import json,sys; print(json.load(sys.stdin)[\u0026#39;Credentials\u0026#39;][\u0026#39;Expiration\u0026#39;])\u0026#34;)\n\ncat \u0026lt;\u0026lt;EOF\n{\u0026#34;Version\u0026#34;:1,\u0026#34;AccessKeyId\u0026#34;:\u0026#34;$AK\u0026#34;,\u0026#34;SecretAccessKey\u0026#34;:\u0026#34;$SK\u0026#34;,\u0026#34;SessionToken\u0026#34;:\u0026#34;$ST\u0026#34;,\u0026#34;Expiration\u0026#34;:\u0026#34;$EX\u0026#34;}\nEOF\nA few things worth pointing out:\nSession name sanitization. STS only allows [a-zA-Z0-9_=,.@-] in session names and caps them at 64 characters. Repo names with slashes or colons get cleaned up by the tr -c line. --duration-seconds. Default is 3600; the ceiling is the role's maximum session duration setting (up to 43200, or 12h), and under role chaining, where the source credentials are themselves an assumed role, STS caps the session at one hour regardless. The SDK caches credentials until Expiration, so longer durations mean fewer STS calls — but also slower reflection of IAM policy changes. Pluggable source profile. CC_SOURCE_PROFILE is deliberately unopinionated. It can point at an SSO profile, a static credential profile, or an encrypted credential_process pattern like the one from my previous post. Any profile whose credentials can call sts:AssumeRole will do. Failure mode. The script writes errors to stderr and exits non-zero. 
AWS SDKs treat that as a credential acquisition failure and surface it up to claude. AWS Side: IAM Role and Policies Three pieces need to exist in AWS before cc-creds can assume a role with session tags.\nTrust Policy Attach this trust policy to the role you want cc to assume. The critical detail is that sts:TagSession is a separate action from sts:AssumeRole — omit it and the --tags flag fails with an AccessDenied that looks, misleadingly, like a trust-relationship problem.\n{\n  \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;,\n  \u0026#34;Statement\u0026#34;: [{\n    \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;,\n    \u0026#34;Principal\u0026#34;: {\u0026#34;AWS\u0026#34;: \u0026#34;arn:aws:iam::\u0026lt;aws-account-id\u0026gt;:user/you\u0026#34;},\n    \u0026#34;Action\u0026#34;: [\u0026#34;sts:AssumeRole\u0026#34;, \u0026#34;sts:TagSession\u0026#34;]\n  }]\n}\nReplace \u0026lt;aws-account-id\u0026gt; with your own account ID and user/you with the IAM user or role whose credentials back CC_SOURCE_PROFILE.\nPermissions Policy The role only needs to invoke Bedrock. A minimal policy:\n{\n  \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;,\n  \u0026#34;Statement\u0026#34;: [{\n    \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;,\n    \u0026#34;Action\u0026#34;: [\n      \u0026#34;bedrock:InvokeModel\u0026#34;,\n      \u0026#34;bedrock:InvokeModelWithResponseStream\u0026#34;,\n      \u0026#34;bedrock:GetInferenceProfile\u0026#34;,\n      \u0026#34;bedrock:ListInferenceProfiles\u0026#34;\n    ],\n    \u0026#34;Resource\u0026#34;: \u0026#34;*\u0026#34;\n  }]\n}\nThis policy already covers streaming invocation. See the Bedrock IAM reference for the full action list if you need tighter resource scoping for cross-region inference, or guardrails.\nOptional: Enforce the Project Tag For team setups, add a Deny statement that refuses any Bedrock call without a Project session tag. 
This turns the attribution system into a contract — if the tag is missing, the call simply fails.\n{\n  \u0026#34;Effect\u0026#34;: \u0026#34;Deny\u0026#34;,\n  \u0026#34;Action\u0026#34;: \u0026#34;bedrock:InvokeModel*\u0026#34;,\n  \u0026#34;Resource\u0026#34;: \u0026#34;*\u0026#34;,\n  \u0026#34;Condition\u0026#34;: {\u0026#34;Null\u0026#34;: {\u0026#34;aws:PrincipalTag/Project\u0026#34;: \u0026#34;true\u0026#34;}}\n}\nActivate the Cost Allocation Tag Last step: go to Billing → Cost allocation tags → User-defined cost allocation tags and activate Project (and User, if you are tagging that too). Activating cost allocation tags is what moves the tag from \u0026quot;exists in CloudTrail\u0026quot; to \u0026quot;exists as a column in Cost Explorer and CUR 2.0\u0026quot;. Expect 24–48 hours before the tag appears in the UI.\nWiring It Together: ~/.aws/config Add one profile stanza. That's it.\n[profile cc-tracked]\nregion = us-west-2\ncredential_process = cc-creds\nNothing else — no role_arn, no source_profile, no sso_*. All of that information flows in through environment variables set by the cc wrapper. Keeping the profile stanza this thin means you can reuse the same cc-tracked profile across every repo and every role you assume; only the env vars change.\nSeeing Your Bill Per Project Open Cost Explorer and:\nFilter service to Amazon Bedrock. Group by Tag → Project. Pick a daily or monthly granularity. Bars are now labeled with your repo names. If the chart is empty on day one, that is the 24–48 hour tag-activation lag — come back tomorrow.\nFor programmatic analysis, CUR 2.0 ships the raw data as Parquet or CSV in S3. The two columns you care about are line_item_iam_principal (the assumed-role session ARN, which includes the session name cc-\u0026lt;project\u0026gt;-\u0026lt;timestamp\u0026gt;) and resource_tags_user_project (the tag value). 
A single SQL query grouped by the tag column is enough to build your own dashboard.\nWhy This Beats Application Inference Profiles Before April 2026, the \u0026quot;official\u0026quot; way to get per-project Bedrock attribution was to create a dedicated application inference profile for each project and bake its ARN into the application's environment. It worked, but it scaled poorly.\nDimension Application inference profiles Session tag attribution (this post) Per-project setup Create profile ARN; bake it into env Zero — project auto-detected from git Model changes New ARN per model × per project Nothing — tags decouple from models Works with existing clients Must reconfigure for profile ARN Drop-in — any InvokeModel call attributed Attribution dimensions One (the profile itself) Any tag: Project, User, Team, Env, … Where cost surfaces Profile-as-resource in Cost Explorer line_item_iam_principal + tag columns Application inference profiles still have their place — they're useful for provisioned throughput, per-application guardrails, and controlled model routing. But for plain cost attribution, the session-tag approach is lighter, more flexible, and doesn't require you to manage extra resources.\nGotchas sts:TagSession is a separate action. Omit it from the trust policy and the assume-role call fails with a generic AccessDenied that does not mention tagging, so it is easy to misread as a trust-relationship problem. This is the single most common mistake. Tag value character set. STS session tag values allow [A-Za-z0-9 _.:/=+\\-@]. Sanitize any repo name with unusual punctuation before passing it through. Activation lag. Tags take 24–48 hours to appear in Cost Explorer after you activate them in the Billing console. Do not assume the setup is broken because Day 1 looks empty. Only InvokeModel* carries attribution. Bedrock management APIs (listing models, describing agents) do not show up tagged in CUR 2.0. That is fine for cost attribution — management calls are free. STS default limit: 500 AssumeRole/sec/account. 
With 1-hour credential caching per session, hitting this is essentially impossible at a single developer's scale. Request a limit increase if you are fanning out to thousands of sessions. Session tags are immutable within a session. The SDK caches credentials until Expiration, so changing CC_PROJECT mid-session will not re-tag existing calls. This is almost always what you want — each cc invocation starts fresh. Conclusion The previous post in this pair was about keeping credentials safe; this post is about keeping them accountable. Together they form a complete personal Bedrock setup in about 100 lines of shell: encrypted source credentials that are safe to back up, tagged temporary credentials that make every Bedrock call attributable, and an AWS-side configuration that surfaces those tags as cost allocation dimensions. The whole thing costs nothing extra on your AWS bill and requires no application changes.\nWhatever LLM workloads you run on Bedrock — Claude Code, Bedrock Agents, or anything else — the same two-layer pattern applies: hygienic source credentials underneath, attributable session credentials on top. The April 2026 attribution feature is the missing piece that makes the top layer free.\nPer-user attribution comes free from the same setup — cc-creds already sets a User tag. 
Activate User as a cost allocation tag too and you get a second dimension to slice by.\n","link":"https://kane.mx/posts/2026/claude-code-cost-per-project/","section":"posts","tags":["AWS","Amazon Bedrock","Claude Code","Cost Management","IAM","Session Tags","Cost Allocation","FinOps"],"title":"Track Claude Code Cost Per Project with Bedrock Tagging"},{"body":"","link":"https://kane.mx/tags/agent-skills/","section":"tags","tags":null,"title":"Agent Skills"},{"body":"","link":"https://kane.mx/categories/ai--ml/","section":"categories","tags":null,"title":"AI \u0026 ML"},{"body":"","link":"https://kane.mx/tags/asr/","section":"tags","tags":null,"title":"ASR"},{"body":"","link":"https://kane.mx/tags/cam++/","section":"tags","tags":null,"title":"CAM++"},{"body":"","link":"https://kane.mx/tags/funasr/","section":"tags","tags":null,"title":"FunASR"},{"body":"","link":"https://kane.mx/tags/llm-post-processing/","section":"tags","tags":null,"title":"LLM Post-Processing"},{"body":"","link":"https://kane.mx/tags/openclaw/","section":"tags","tags":null,"title":"OpenClaw"},{"body":"","link":"https://kane.mx/tags/podcast-transcription/","section":"tags","tags":null,"title":"Podcast Transcription"},{"body":"","link":"https://kane.mx/tags/seaco-paraformer/","section":"tags","tags":null,"title":"SeACo-Paraformer"},{"body":"","link":"https://kane.mx/tags/speaker-diarization/","section":"tags","tags":null,"title":"Speaker Diarization"},{"body":"","link":"https://kane.mx/tags/speech-to-text/","section":"tags","tags":null,"title":"Speech-to-Text"},{"body":"Two recordings sat on my disk waiting to be turned into searchable text. A 4-hour 13-minute discussion from a TGO founders' group — eight speakers, Chinese, Zoom audio. 
A 1-hour 8-minute podcast episode (屠龙之术 Vol.94 × 知本论) where two hosts spent the whole hour dissecting OpenClaw — its positioning, the AI-agent narrative, and investors' reactions.\nBoth were full of specific claims, speaker-attributed opinions, and Chinese brand terms I wanted to grep later. Neither would transcribe cleanly with the tools I tried first.\nCommercial ASR APIs failed on at least one of three axes: per-minute pricing that stings for multi-hour hobby-scale work, poor recall for emerging Chinese brand names like \u0026quot;龙虾\u0026quot; (the nickname for OpenClaw) or \u0026quot;屠龙之术\u0026quot;, and speaker diarization that quietly collapses an eight-person meeting into five voices. Off-the-shelf open-source pipelines were closer — but long-audio diarization was painfully slow and there was no single packaged workflow I could hand to a coding agent.\nSo I wrote one.\nWhy off-the-shelf doesn't fit long, multi-speaker audio Three failure modes kept recurring:\nLength and pricing. Major hosted APIs cap single-file duration or bill per minute. A single 4-hour meeting runs real money at typical rates — fine for a business, friction for iterating on a hobbyist pipeline.\nChinese proper-noun recall. Out-of-the-box ASR consistently mangles emerging brand names. In the hotword-biasing experiment that ships with the skill, \u0026quot;龙虾\u0026quot; (OpenClaw's community nickname) went from 28 recognized mentions to 42 with biasing — a 50% uplift. Person names like \u0026quot;高琦\u0026quot; went from zero to seven. Stock ASR simply could not hear words it had never been trained on.\nDiarization fragility on long audio. FunASR's speaker clustering runs spectral decomposition on the full N×N Laplacian, where N is the number of voice-activity segments. For a 4-hour recording N is 6000+, and the stock pipeline takes over ten hours to cluster on an L40S GPU. 
The fix is a one-line swap, covered in the engineering deep-dive below.\nI wanted a workflow I could run locally, resume when it crashed, iterate on, and hand to any coding agent as a skill.\nThe skill: audio-transcriber-funasr zxkane/audio-transcriber-funasr is an agent skill that wraps FunASR (Alibaba DAMO's open-source speech toolkit) into a one-command pipeline with speaker diarization, hotword biasing, and LLM cleanup.\nInstall it into any coding agent that supports the skills.sh standard — Claude Code, Cursor, Codex, Cline:\nnpx skills add zxkane/audio-transcriber-funasr\nWhat the pipeline looks like end-to-end:\nflowchart TD A[Audio m4a or mp3] --\u0026gt; B[ffmpeg 16kHz mono] B --\u0026gt; C[Phase 1 FunASR] C --\u0026gt; C1[FSMN-VAD voice activity] C --\u0026gt; C2[ASR model lang-dependent] C --\u0026gt; C3[Hotword biasing zh only] C --\u0026gt; C4[Punctuation restoration] C --\u0026gt; C5[CAM plus plus diarization] C5 --\u0026gt; D[Phase 2 merge and map speakers] D --\u0026gt; E[Phase 3 LLM cleanup] E --\u0026gt; F[transcript.md] Five things come bundled:\nFour language presets — SeACo-Paraformer for Chinese with hotwords (CER 1.95%), Paraformer-en for English, SenseVoiceSmall for auto-detect across zh/en/ja/ko/yue, and Whisper-large-v3-turbo for 99 languages. Speaker diarization via CAM++ on every preset that emits per-sentence timestamps (zh, zh-basic, en). Hotword biasing on SeACo-Paraformer — inject participant names, project names, and domain terms to improve ASR recall on words the base model has never seen. LLM post-cleanup to remove fillers, fix homophone errors, polish grammar, and verify speaker labels. Supports Bedrock, Anthropic, and any OpenAI-compatible endpoint (vLLM, Ollama, commercial APIs) — pick your backend with environment variables. A clustering patch that swaps scipy.linalg.eigh (O(N³)) for scipy.sparse.linalg.eigsh (O(N²·k)) — cutting a 4-hour recording's clustering step from 10+ hours to ~10 seconds. 
The setup script installs it automatically. Resume-from-checkpoint is on by default — a crashed run picks up at the last completed phase instead of re-transcribing 4 hours of audio.\nCase study 1: a 1-hour podcast episode The recording: Vol.94 of 屠龙之术 (\u0026quot;the art of slaying dragons\u0026quot; — host 庄明浩, an investor in Chinese tech), crossed over with 知本论 (host 孙冰洁). Released 2026-04-16, 1h 8m 53s, two speakers, Chinese.\nThe shownotes gave me everything needed to prepare supporting files:\nA speaker context JSON describing who each host is and what they do A hotwords file with the proper nouns that mattered: OpenClaw, 龙虾, Manus, 屠龙之术, 知本论, PayPal, DeepSeek, and both host names One command to run the whole pipeline:\nSCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/funasr-transcribe/scripts\n\npython3 $SCRIPTS/transcribe_funasr.py episode.flac \\\n  --lang zh --num-speakers 2 \\\n  --speakers \u0026#34;庄明浩,孙冰洁\u0026#34; \\\n  --hotwords hotwords.txt \\\n  --speaker-context speaker-context.json \\\n  --title \u0026#34;Vol.94 再不聊聊openclaw可能就不需要聊了\u0026#34;\nA lightly-trimmed excerpt from the resulting transcript, showing the cold open:\n[00:00:07] 孙冰洁: 大语言模型很像一个发动机，它甚至已经到了F1的引擎。\n可是民众没有办法直接拿引擎来用。Manus就是非常简单版的脚手架，跟车一样。\n\n[00:00:17] 庄明浩: 这个到底是技术落地的信号，还是一场赚快钱的狂欢？\n\n[00:00:21] 庄明浩: 第一波赚钱的肯定是卖课的。\n在美国也一样，所有公司的最优解就是应蹭尽蹭。\nFirst voice ≠ host. The raw FunASR output labels speakers by first-appearance order — in this episode the guest 孙冰洁 speaks first, so SPEAKER_00 is actually the guest. The skill handles this automatically in two layers: Phase 2 scans the first five minutes for explicit self-introductions (\u0026quot;我是X\u0026quot;, \u0026quot;I'm X\u0026quot;) and swaps labels if needed; Phase 3 asks the LLM to verify roles against the speaker-context.json before cleanup begins. For two-speaker podcasts this is a binary CORRECT/SWAP call; for 3+ speakers it becomes a full JSON reassignment.\nHotwords matter, but only in Chinese. Every term in hotwords.txt was Chinese. 
English loanwords like \u0026quot;PayPal\u0026quot; or \u0026quot;Manus\u0026quot; I deliberately left out — SeACo-Paraformer's hotword biasing operates on Chinese token embeddings and can actually regress recall on English terms. More on that in the deep-dive.\nEnd-to-end timing on an L40S GPU: under ten minutes, dominated by the LLM cleanup call for the single 68-minute chunk. The FunASR model load, VAD, ASR, punctuation, and clustering together took well under a minute.\nCase study 2: a 4-hour, 8-speaker meeting The harder recording: a 4h 13m TGO founders' discussion with eight speakers on a Zoom call, mostly Chinese with occasional English technical terms.\nThe first minute of the transcript, raw from the pipeline:\n[00:00:00] 用户1：做得快吗？\n[00:00:07] 用户2：我没写。\n[00:00:10] 用户3：这个靠传统的 SIP 电话过来。\n[00:00:48] 用户4：这个。\n[00:00:53] 用户3：好像不知道有没有，\n先把它录下来，能拿到那个音频，然后再处理。\nThree realities of long multi-speaker audio surfaced immediately:\nDiarization merges acoustically similar voices. CAM++ detected seven distinct speaker IDs, not eight. Two participants with similar vocal profiles collapsed into one cluster. This is normal on conference-call audio where bandwidth compression flattens formants. Remedies built into the skill: pass --num-speakers 8 to hint the expected count; provide --speaker-context with per-person keywords and let the LLM split merged IDs using those clues (the skill's own test suite reports ~73% success at this). For fully accurate labels, a short post-run pass by a human who knows the participants is still the fastest path.\nThe O(N³) tax on long audio is real. For this recording, N — the number of voice-activity segments — was over six thousand. Without the clustering patch, scipy.linalg.eigh on the full Laplacian takes more than ten hours on an L40S GPU. With the patch, the same step completes in roughly ten seconds. 
The clustering patch is the single most important optimization in the skill for recordings longer than about an hour.\nLLM cleanup becomes the wall-clock bottleneck. On the L40S the FunASR phase (model load + VAD + ASR + punctuation + patched clustering) ran in under five minutes. The LLM cleanup pass — 17 chunks through a frontier model — took roughly 35 minutes. Total end-to-end: about 38 minutes for a 4h 13m recording — roughly 7× faster than real time.\nThe skill checkpoints after every phase, so a network blip or rate-limit hit during chunk 14 of 17 doesn't cost you the previous 30 minutes.\nThree engineering details that make it work Hotword biasing is Chinese-only — and can hurt English SeACo-Paraformer supports hotword biasing: you pass a list of terms you expect to appear in the audio, and the model biases its decoding toward those tokens. Empirically, on a 4h 14m Chinese meeting with 27 hotwords:\nTerm Without hotwords With hotwords Change 龙虾 28 42 +50% 高琦 (person) 0 7 0 → 7 鲲鹏 (org) 6 7 +1 搬瓦工 (brand) 0 1 0 → 1 Rebase (English) 5 0 regression Tailwind (English) 3 1 regression Two things to take away. First, for Chinese proper nouns the uplift is significant — brand names and domain terms that the base model has never seen become recognizable. Second, English loanwords can regress when included in hotwords, because SeACo's biasing operates on the Chinese token vocabulary and fights with the model's existing English handling.\nPractical rule: put Chinese terms in hotwords.txt, leave English terms out, and fix any remaining English errors in the Phase 3 LLM cleanup pass. 
If English term accuracy matters more than hotword uplift, use --lang zh-basic (the plain Paraformer variant without hotword support).\nThe clustering patch: ten hours to ten seconds This is the highest-impact patch in the pipeline, and it reduces to a one-line swap.\nFunASR's SpectralCluster.get_spec_embs() computes speaker embeddings by running eigendecomposition on the graph Laplacian:\n# stock FunASR — computes ALL N eigenvalues\nfrom scipy.linalg import eigh\neigenvalues, eigenvectors = eigh(L)\nscipy.linalg.eigh is a dense symmetric-eigendecomposition routine. It returns all N eigenpairs, running in O(N³). For a 4-hour recording where N ≈ 6000, that's roughly 2×10¹¹ operations — measured on an L40S GPU, over ten hours of wall-clock time.\nBut k-way spectral clustering only needs the eigenvectors that belong to the k smallest eigenvalues. The patch in scripts/patch_clustering.py replaces the call with a sparse iterative solver:\n# patched — computes only the k smallest eigenvalues\nfrom scipy.sparse.linalg import eigsh\neigenvalues, eigenvectors = eigsh(L_sparse, k=num_speakers, which=\u0026#39;SM\u0026#39;)\nscipy.sparse.linalg.eigsh uses Lanczos iteration and stops after finding the k eigenvalues you actually want. Complexity drops to O(N²·k). On the same 4-hour recording: ~10 seconds.\nA second optimization: the p_pruning() step (which sparsifies the affinity matrix) was a Python loop in the stock implementation; the patch rewrites it with numpy broadcasting. The combined effect is large enough that without the patch, 4-hour diarization is effectively impossible on a single GPU; with it, clustering is a footnote in the total runtime.\nGeneral takeaway for numerical pipelines: when a step feels \u0026quot;slow, but I guess that's just how it is,\u0026quot; check what the inner routine is actually computing. Here it was computing thousands of eigenvalues that nobody ever read.\nLLM cleanup is a core phase, not an optional polish Raw ASR output is readable but tiring. 
Fillers (嗯, 啊, um, uh), self-corrections, homophone errors that only the surrounding context disambiguates, and sentence boundaries broken by aggressive VAD all accumulate. Across a 4-hour transcript, the readability tax is real.\nThe skill's LLM cleanup phase does four things per chunk:\nRemove fillers without losing semantic content. Fix homophone errors using local context — Chinese ASR confuses 尔 / 儿 / 而 constantly, and only context disambiguates. Polish grammar while preserving wording and tone. The system prompt explicitly forbids rewriting. Verify speaker labels (when --speaker-context is provided) by checking whether the first chunk's attributions are consistent with the provided roles. The backend is configurable: set AWS_REGION for Bedrock, ANTHROPIC_API_KEY for Anthropic, or OPENAI_API_KEY + OPENAI_BASE_URL for any OpenAI-compatible endpoint (including self-hosted vLLM and Ollama). A local Qwen-72B via vLLM gets the same cleanup quality as a hosted frontier model at no marginal cost per chunk.\nPerformance: GPU vs CPU Benchmarked on the 4h 14m, 9-speaker Chinese recording that ships in the skill's own tests:\nPhase GPU (L40S 46GB) CPU Model load 14s ~30s Transcription 169s 30–60 min Clustering (patched) ~10s 2–5 min LLM cleanup (17 chunks) ~35 min ~35 min (network-bound) Total ~38 min 70–100 min The important framing: LLM cleanup dominates total wall-clock time on GPU. Moving to CPU is less painful than the transcription row suggests, because the LLM call is network-bound either way. A 4-hour meeting on a laptop-class CPU runs overnight, not over a weekend.\nFirst-run model download is ~944 MB for SeACo-Paraformer; cached after that. The tool earns its setup cost only at multi-hour or recurring volume — for one-off short clips, a hosted API is faster end-to-end.\nWhen not to use this Real-time captioning or live streaming — this is a batch pipeline. For live use cases, Whisper-streaming variants or a hosted live-transcription service are better fits. 
Pure English short clips — Amazon Transcribe or OpenAI Whisper is a simpler path. You don't need hotword biasing or a 944 MB local model. Auto-detect or Whisper preset plus speaker labels — --lang auto (SenseVoiceSmall) and --lang whisper don't emit per-sentence timestamps, so CAM++ diarization is disabled for those presets. Use zh, zh-basic, or en when you need speaker attribution. You have no patience for a first-run model download — the setup script pulls ~944 MB of FunASR models on first use. Subsequent runs use the local cache. Resources The skill repository: zxkane/audio-transcriber-funasr. It ships with the clustering patch, setup script, and both the skills.sh manifest and a Claude Code plugin manifest, so the same code works across coding agents.\nNatural follow-ups I'm working on next: (1) a lightweight RAG index over a growing local transcript corpus so \u0026quot;what did X say about Y three months ago\u0026quot; becomes a single query, and (2) per-episode summaries gated by the same LLM cleanup backend.\nRelated posts:\nFrom Solo AI Engineer to Autonomous Dev Team — the OpenClaw multi-agent workflow this blog's recent posts have been building toward. The AI Digital Engineer Pattern — single-agent predecessor to the autonomous team. 
","link":"https://kane.mx/posts/2026/funasr-podcast-transcription-openclaw/","section":"posts","tags":["FunASR","Speaker Diarization","Podcast Transcription","Agent Skills","CAM++","SeACo-Paraformer","OpenClaw","ASR","LLM Post-Processing","Speech-to-Text"],"title":"Transcribing Long Podcasts and Meetings with FunASR"},{"body":"","link":"https://kane.mx/tags/ai-digital-engineer/","section":"tags","tags":null,"title":"AI Digital Engineer"},{"body":"","link":"https://kane.mx/categories/automation/","section":"categories","tags":null,"title":"Automation"},{"body":"","link":"https://kane.mx/tags/devops-automation/","section":"tags","tags":null,"title":"DevOps Automation"},{"body":"In a previous post, the AI Digital Engineer pattern was introduced, featuring a single Claude Code agent guided by Skills and enforced by Hooks to execute a complete engineering workflow. This approach demonstrated effectiveness in delivering production-ready code with guaranteed quality gates.\nHowever, a fundamental limitation emerged: one agent performing all tasks.\nThis limitation highlighted the need for a multi-agent system, where a team of AI agents, each with a distinct role, could collaborate autonomously to transform GitHub issues into merged pull requests without human intervention.\nThe Limitation of a Solo AI Engineer The single-agent approach detailed in the previous post is suitable for interactive development workflows. A human creates an issue, initiates Claude Code execution, monitors progress, and triggers reviews. While the Skills + Hooks architecture ensures quality, human oversight is required for orchestration.\nThis approach presents several bottlenecks:\nManual dispatch — Each task requires manual assignment to the AI agent. Self-review — The same agent responsible for code generation also performs the review, potentially reducing objectivity. Sequential processing — Tasks are processed one at a time, dependent on human checkpoints. 
No crash recovery — Progress is lost if a session terminates unexpectedly. The objective is to establish a system where agents emulate a conventional engineering team structure: a tech lead assigns work, a developer implements features, and a reviewer independently verifies quality — all operating autonomously.\nIntroducing the Autonomous Dev Team The Autonomous Dev Team establishes a fully automated development pipeline that automates the transformation of GitHub issues into merged pull requests, eliminating human intervention. It is powered by OpenClaw as the orchestration layer and supports multiple AI coding CLIs: Claude Code, Codex CLI, and Kiro CLI.\nThe architecture emulates a standard engineering team with three distinct roles:\n1┌──────────────────────────────────────────────────────────────────────┐ 2│ Autonomous Dev Team │ 3├──────────────────────┬───────────────────┬───────────────────────────┤ 4│ Dispatcher │ Dev Agent │ Review Agent │ 5│ (OpenClaw) │ (Claude/Codex/ │ (Claude/Codex/ │ 6│ │ Kiro) │ Kiro) │ 7├──────────────────────┼───────────────────┼───────────────────────────┤ 8│ • Scans issues │ • Reads issue │ • Finds linked PR │ 9│ every 5 minutes │ requirements │ • Checks merge conflicts │ 10│ • Dispatches agents │ • Creates worktree│ • Runs review checklist │ 11│ • Manages labels │ • Implements TDD │ • Verifies E2E tests │ 12│ • Handles crashes │ • Creates PR │ • Approve or reject │ 13│ • Enforces │ • Marks checkboxes│ • Auto-merge on pass │ 14│ concurrency │ • Resumes sessions│ • Posts structured │ 15│ limits │ │ findings │ 16└──────────────────────┴───────────────────┴───────────────────────────┘ How It Works: The Label-Based State Machine The entire workflow is driven by GitHub issue labels. 
Each label represents a state, and agents transition between states based on their outcomes:\nflowchart LR A[Issue Created] --\u0026gt;|autonomous label| B[Dispatcher Picks Up] B --\u0026gt;|adds in-progress| C[Dev Agent Working] C --\u0026gt;|success| D[pending-review] C --\u0026gt;|crash| D D --\u0026gt;|dispatcher cycle| E[Review Agent Working] E --\u0026gt;|PASS| F[approved + merged] E --\u0026gt;|FAIL| G[pending-dev] G --\u0026gt;|dispatcher cycle| C There is no central database or message queue. GitHub labels serve as the single source of truth. The dispatcher polls every 5 minutes, reads the labels, and executes predefined actions:\nCurrent Labels Action autonomous (no state label) Dispatch Dev Agent (new session) autonomous + pending-review Dispatch Review Agent autonomous + pending-dev Dispatch Dev Agent (resume session) autonomous + approved Done — PR merged The Three Agents in Detail Agent 1: The Dispatcher (OpenClaw) The dispatcher functions as the team's tech lead. It operates on a cron schedule (every 5 minutes), scans for actionable issues, and spawns the appropriate agent:\n# Dispatcher workflow (simplified) # 1. Check concurrency — don\u0026#39;t exceed MAX_CONCURRENT (default: 5) # 2. Find issues needing work # 3. Spawn agent via nohup background process # 4. Detect stale processes and recover # Example: dispatching a dev agent dispatch-local.sh dev-new $ISSUE_NUM # Example: dispatching a review agent dispatch-local.sh review $ISSUE_NUM # Example: resuming after review rejection dispatch-local.sh dev-resume $ISSUE_NUM $SESSION_ID Key capabilities include:\nConcurrency control — Tracks active processes via PID files, adhering to the MAX_CONCURRENT limit. Crash recovery — Detects terminated processes and transitions stale in-progress issues to pending-review. Session tracking — Extracts session IDs from comments for resumable development. 
Self-correction — Manages edge cases such as unintended re-dispatching of already-approved issues. Agent 2: The Dev Agent (Developer) The Dev Agent serves as the implementer. Upon dispatch, it performs the following:\nReads the issue — Parses requirements and acceptance criteria. Creates an isolated worktree — Utilizes git worktree add to prevent cross-contamination between parallel tasks. Follows TDD — Writes tests prior to implementation, guided by the autonomous-dev skill. Marks progress — Checks off requirement checkboxes in the issue as each is completed. Creates a PR — Includes Closes #\u0026lt;issue-number\u0026gt; to link the issue. Reports results — Posts a structured session report with a session ID for resumability. The Dev Agent supports two modes:\nNew — Executes fresh implementation from scratch. Resume — Continues a previous session, processing review feedback and resolving issues without re-implementing completed work. Agent 3: The Review Agent (Reviewer) The Review Agent acts as the quality gate. It operates independently from the Dev Agent (utilizing a different model and session) and adheres to a strict checklist:\nFind the PR — Locates the pull request linked to the issue via Closes #N references. Check merge conflicts — If conflicts are present, it performs a rebase onto the main branch and force-pushes. Run review checklist — Executes 10 items covering process compliance, code quality, testing, and infrastructure safety. Trigger external reviews — Posts /q review to invoke Amazon Q Developer for static analysis. Run E2E tests — Uses Chrome DevTools MCP to verify on the preview deployment. Execute the Findings→Decision Gate — This critical step prevents inconsistent verdicts. The Findings→Decision Gate is a mandatory self-check: the review agent enumerates all findings, classifies each as BLOCKING or NON-BLOCKING, and only approves if there are zero blocking findings. 
This mechanism prevents the common failure mode where an agent provides a positive assessment despite listing problems.\nReal-World Workflow: A Complete Example This section details a real feature implementation, illustrating the collaborative workflow of the three agents. This example originates from a production Next.js application where the autonomous dev team manages feature implementation and bug fixes.\nPhase 1: Issue Created and Dispatched A feature request is created with the autonomous label — to add swipe gestures for day navigation on mobile. The dispatcher picks it up within 5 minutes:\nThe dispatcher (my-claw) picks up the issue and spawns the dev agent (kane-coding-agent). The dev agent reports its session for resumability. The review agent (kane-test-agent) then takes over.\nThe dispatcher adds the in-progress label, spawns the dev agent, and initiates monitoring. Upon dev agent completion (exit code 0), the dispatcher transitions the issue to pending-review and spawns the review agent.\nPhase 2: Review Agent Rejects with Structured Findings The review agent executes its full checklist and identifies 5 blocking issues:\nThe review agent posts structured findings: missing design document, missing test cases, no unit tests, pending CI, and unchecked PR checklist. Each finding includes a specific action item.\nThis scenario demonstrates the value of the multi-agent architecture. A single agent performing self-review may overlook deficiencies in documentation or process adherence. An independent review agent identifies process violations that an implementing agent might bypass due to task-specific incentives.\nPhase 3: Dev Agent Self-Corrects The review rejection triggers the dispatcher to transition the issue back to pending-dev. 
The dev agent resumes its previous session and addresses all 5 findings:\nThe dev agent addresses all review findings: creates design document, adds test case document with 10 scenarios, writes 15 unit tests, extracts a pure function for testability, and rebases on latest main.\nThe dev agent independently identified and addressed required fixes. It processed the review comment, comprehended the requirements, and systematically addressed each point, notably by extracting detectSwipe as a pure function for enhanced testability.\nPhase 4: Review Passes The review agent re-executes its checklist. This time, all items pass:\nReview PASSED with 0 blocking findings. The review agent verified design docs, 15 unit tests, all CI checks passing, E2E tests on preview deployment, and all 7 acceptance criteria.\nThe review agent approves the PR. Since this issue includes the no-auto-close label, it notifies the maintainer for manual merging instead of automatic merging.\nA Bug Fix in Under an Hour The team also manages bug resolutions with the same pipeline. Presented here is a CJK character encoding bug — from issue creation to merged PR in under one hour:\nA bug report: CJK characters in plan slugs cause the detail page to fail loading. The issue includes screenshots showing the broken behavior.\nThe dev agent's root cause analysis: slugify() preserved CJK via Unicode regex, causing URL encoding mismatches through CloudFront → API Gateway → Lambda. Fixed with NFKD normalization and 22 unit tests.\nReview passed with 0 blocking findings. The review agent verified all 6 acceptance criteria, 22 unit tests, E2E tests with ASCII-only URLs, and Amazon Q's positive review. 
Issue auto-closed.\nThe dev agent did not merely address the symptom; it conducted a thorough root cause analysis, identified the URL encoding mismatch within the CloudFront → API Gateway → Lambda chain, and implemented a comprehensive fix with 22 unit tests covering CJK, diacritics, emoji, and mixed scripts.\nCI/CD as the Final Quality Gate Autonomous agents can write code, but shipping reliable software demands more than passing an AI review. The Autonomous Dev Team integrates traditional software engineering practices — CI/CD pipelines, automated testing, and bot-assisted static analysis — as non-negotiable quality gates that agents cannot bypass.\nWhy CI/CD Matters for AI-Generated Code AI coding agents are prone to subtle failures that only surface during build or runtime: type errors caught by the compiler, import paths that work locally but break in CI, or dependency mismatches between environments. Without a CI pipeline enforcing these checks, an agent could generate code that \u0026quot;looks correct\u0026quot; in its session but fails in production.\nThe template enforces a strict progression:\nflowchart LR A[Agent Writes Code] --\u0026gt; B[Pre-push Hooks] B --\u0026gt; C[PR Created] C --\u0026gt; D[GitHub Actions CI] D --\u0026gt;|pass| E[Bot Reviewers] E --\u0026gt; F[E2E Tests] F --\u0026gt;|all pass| G[Review Agent Approves] D --\u0026gt;|fail| H[Agent Fixes and Retries] E --\u0026gt;|findings| H F --\u0026gt;|fail| H Hook-Enforced Quality Checkpoints Claude Code hooks act as local gatekeepers that block agent actions until quality conditions are met:\nHook Trigger Enforcement check-unit-tests Before git commit Warns if unit tests haven't been run for code changes check-pr-review Before git push Blocks push until PR review skill has been executed verify-completion Before task completion Blocks completion unless CI passes, E2E tests run, and all review threads resolved block-push-to-main Before git push Prevents direct pushes to main branch 
block-commit-outside-worktree Before git commit Ensures all work happens in isolated git worktrees The verify-completion hook deserves special attention. It queries the GitHub API in real-time to check CI status, counts unresolved review threads via GraphQL, and verifies that E2E tests have been executed. The agent cannot claim the task is complete until every checkpoint is green.\nGitHub Actions and Bot Reviewers When a PR is pushed, GitHub Actions runs the project's standard CI pipeline — linting, type checking, unit tests, build verification, and preview deployment. These are the same checks that would gate a human developer's PR.\nThe review agent also integrates with bot reviewers like Amazon Q Developer for static analysis. It triggers a review via /q review, waits for findings, and either addresses them or documents design decisions before resolving each thread. This creates a layered defense:\nLocal hooks — Catch issues before code leaves the agent's session CI pipeline — Validates build integrity, test coverage, and deployment Bot reviewers — Static analysis from independent tools (Amazon Q, Codex) Review agent — Holistic assessment with E2E verification on the preview environment E2E Tests on Preview Deployments The review agent doesn't stop at code-level checks. Using Chrome DevTools MCP, it navigates to the preview deployment URL, executes happy-path test cases, verifies authentication flows, checks for console errors, and captures screenshots as evidence. 
These screenshots are uploaded and linked in the PR as a verification report.\nThis approach mirrors what a senior engineer would do during a thorough review: deploy the branch, click through the feature, and confirm it works end-to-end — except the agent does it automatically on every review cycle.\nThe system supports multiple AI coding CLIs through an abstraction layer:\n# scripts/lib-agent.sh — Agent CLI abstraction run_agent() { case \u0026#34;$AGENT_CMD\u0026#34; in claude) claude --session-id \u0026#34;$SESSION_ID\u0026#34; \\ --model \u0026#34;$MODEL\u0026#34; \\ -p \u0026#34;$PROMPT\u0026#34; \\ --allowedTools \u0026#34;$TOOLS\u0026#34; ;; codex) codex -p \u0026#34;$PROMPT\u0026#34; \\ --model \u0026#34;$MODEL\u0026#34; \\ --approval-mode full-auto ;; kiro) kiro -p \u0026#34;$PROMPT\u0026#34; \\ --model \u0026#34;$MODEL\u0026#34; \\ --non-interactive ;; esac } Feature Claude Code Codex CLI Kiro CLI Dev Agent Full support Basic Basic Review Agent Full support Basic Basic Session Resume Native (--session-id) Falls back to new Falls back to new Model Selection Configurable Configurable Configurable Claude Code offers the most comprehensive integration, particularly regarding session resumability, which facilitates the review→fix→re-review cycle without re-implementing completed work.\nAuthentication: GitHub Apps for Audit Clarity Each agent can operate as a distinct GitHub App bot, providing clear audit trails:\nkane-coding-agent[bot] → Dev Agent actions (commits, PR creation) kane-test-agent[bot] → Review Agent actions (reviews, approvals) my-claw[bot] → Dispatcher actions (label changes, comments) This makes it straightforward to trace agent activities. 
The accompanying screenshots illustrate distinct bot avatars and identities, providing the same visibility as a human team.\nThe system also supports a token-based mode where all agents share one identity, which is simpler for initial setup.\nGetting Started 1. Use the Template The autonomous-dev-team repository is a GitHub template. Click the \u0026quot;Use this template\u0026quot; button to create your own copy:\nThe autonomous-dev-team template repository on GitHub — ready to use with Claude Code, Codex CLI, or Kiro CLI.\n2. Configure Your Project # Copy the config template cp scripts/autonomous.conf.example scripts/autonomous.conf # Edit with your settings ${EDITOR:-vi} scripts/autonomous.conf # Project identification PROJECT_ID=\u0026#34;my-project\u0026#34; REPO=\u0026#34;myorg/my-project\u0026#34; PROJECT_DIR=\u0026#34;/path/to/my-project\u0026#34; # Agent CLI selection (claude, codex, or kiro) AGENT_CMD=\u0026#34;claude\u0026#34; # Authentication mode (token or app) GH_AUTH_MODE=\u0026#34;token\u0026#34; # Concurrency MAX_CONCURRENT=5 3. Set Up OpenClaw Dispatcher Install OpenClaw and configure the dispatcher cron:\n# Run dispatcher every 5 minutes */5 * * * * cd /path/to/autonomous-dev-team \u0026amp;\u0026amp; openclaw run 4. Create an Issue with the autonomous Label Create a GitHub issue and add the autonomous label to start the automated pipeline. 
Within 5 minutes, the dispatcher will spawn a dev agent, which will implement the feature, create a PR, and hand it off for review.\nFrom Solo to Team: What Changed Aspect Solo AI Engineer Autonomous Dev Team Orchestration Human initiates each task Dispatcher auto-assigns from issues Review Self-review (same agent) Independent review agent Concurrency One task at a time Up to 5 parallel tasks Crash Recovery Lost progress Auto-retry with session resume Audit Trail Single conversation Separate bot identities per role Human Involvement Initiate, monitor, approve Create issue, optionally final merge Agent CLI Claude Code only Claude Code, Codex CLI, Kiro CLI Quality Gates Manual CI check Hook-enforced CI, bot review, E2E verification The Skills + Hooks architecture from the previous post continues to power each individual agent. The key innovation lies in the orchestration layer — comprising the dispatcher for work assignment, the label-based state machine for progress tracking, and the separation of implementation from review.\nConclusion The evolution from a solo AI engineer to an autonomous dev team parallels the growth observed in human engineering organizations. The initial phase involves a single capable developer (the AI Digital Engineer), which then scales to a team with specialized roles and clear handoff protocols.\nThe Autonomous Dev Team template provides:\nZero human intervention — Issues automatically progress to merged PRs. Independent review — Development and review agents operate with separate sessions and models. Crash resilience — A label-based state machine with automatic retry functionality. Multi-CLI support — Compatibility with Claude Code, Codex CLI, and Kiro CLI via a pluggable abstraction. Traditional quality gates — CI/CD pipelines, bot reviewers, and E2E tests that agents cannot bypass. Clear audit trails — GitHub App bots provide distinct per-agent identities. The code is open source and available as a GitHub template. 
Implement it in a small project, observe the agents' collaboration, and subsequently scale its application.\nResources Autonomous Dev Team Template — GitHub template with full pipeline setup OpenClaw — The orchestration engine that powers the dispatcher AI Digital Engineer (Previous Post) — The single-agent pattern that each agent follows internally Claude Code Hooks Documentation — Quality gate enforcement Claude Code Skills Guide — Workflow orchestration ","link":"https://kane.mx/posts/2026/autonomous-dev-team-openclaw/","section":"posts","tags":["Claude Code","AI Digital Engineer","GitHub Actions","DevOps Automation","OpenClaw","Multi-Agent","Autonomous Development"],"title":"From Solo AI Engineer to Autonomous Dev Team"},{"body":"","link":"https://kane.mx/tags/multi-agent/","section":"tags","tags":null,"title":"Multi-Agent"},{"body":"","link":"https://kane.mx/tags/ai-agent/","section":"tags","tags":null,"title":"AI Agent"},{"body":"","link":"https://kane.mx/tags/aws-cdk/","section":"tags","tags":null,"title":"AWS CDK"},{"body":"","link":"https://kane.mx/series/build-serverless-application/","section":"series","tags":null,"title":"Build-Serverless-Application"},{"body":"","link":"https://kane.mx/tags/cloud-map/","section":"tags","tags":null,"title":"Cloud Map"},{"body":"","link":"https://kane.mx/tags/dynamodb/","section":"tags","tags":null,"title":"DynamoDB"},{"body":"","link":"https://kane.mx/tags/ecs-fargate/","section":"tags","tags":null,"title":"ECS Fargate"},{"body":"","link":"https://kane.mx/tags/efs/","section":"tags","tags":null,"title":"EFS"},{"body":"","link":"https://kane.mx/tags/eventbridge/","section":"tags","tags":null,"title":"EventBridge"},{"body":"","link":"https://kane.mx/tags/openhands/","section":"tags","tags":null,"title":"OpenHands"},{"body":"","link":"https://kane.mx/tags/self-hosted-ai/","section":"tags","tags":null,"title":"Self-Hosted AI"},{"body":"In a previous post, I introduced an AWS CDK project for deploying OpenHands on EC2, 
featuring Cognito authentication and Aurora PostgreSQL. While this architecture successfully facilitated initial deployment, operating a shared AI coding platform for a team revealed three fundamental limitations:\nShared Resources: All sandbox containers executed on a single EC2 instance, leading to contention for CPU and memory. Persistent Cost: The EC2 instance incurred approximately $375/month, regardless of platform utilization. Lack of Tenant Isolation: Sandbox containers possessed network access to each other. The v1.0.0 release of openhands-infra addresses these limitations by fully replacing EC2 with ECS Fargate. This revised architecture provides per-conversation isolation through dedicated Fargate tasks, eliminates idle costs by stopping tasks when inactive, and prevents cross-tenant communication via granular security groups.\nThis post details the architectural evolution and the AWS services that enable these capabilities.\nArchitecture Evolution The initial architecture utilized a single EC2 Graviton instance running Docker Compose. 
The updated architecture distributes workloads across managed Fargate services, leveraging Cloud Map for private DNS-based service discovery.\nBefore: EC2-Based (v0.1.0) flowchart LR User([User]) --\u0026gt; CF[CloudFront] subgraph us-east-1 CF WAF[WAF] LE[Lambda Edge] Cognito[Cognito] end subgraph Main Region ALB[ALB] EC2[EC2 Graviton] Aurora[(Aurora)] S3[(S3)] EFS[(EFS)] end CF --\u0026gt; WAF CF --\u0026gt; LE LE --\u0026gt; Cognito CF --\u0026gt; ALB ALB --\u0026gt; EC2 EC2 --\u0026gt; Aurora EC2 --\u0026gt; S3 EC2 --\u0026gt; EFS style EC2 fill:#fca5a5,stroke:#dc2626,stroke-width:2px,color:#7f1d1d After: Fully Serverless (v1.0.0) flowchart LR User([User]) --\u0026gt; CF[CloudFront] subgraph us-east-1 CF WAF[WAF] LE[Lambda Edge] Cognito[Cognito] end subgraph Main Region ALB[ALB] App[App Fargate] Proxy[OpenResty Fargate] Orch[Sandbox Orchestrator] DDB[(DynamoDB)] Sandbox1[Sandbox 1] Sandbox2[Sandbox 2] SandboxN[Sandbox N] Aurora[(Aurora)] S3[(S3)] EFS[(EFS)] Bedrock[Bedrock] end CF --\u0026gt; WAF CF --\u0026gt; LE LE --\u0026gt; Cognito CF --\u0026gt; ALB ALB --\u0026gt; App ALB --\u0026gt; Proxy App --\u0026gt; Orch Orch --\u0026gt; DDB Orch --\u0026gt; Sandbox1 Orch --\u0026gt; Sandbox2 Orch --\u0026gt; SandboxN Sandbox1 --\u0026gt; EFS Sandbox2 --\u0026gt; EFS SandboxN --\u0026gt; EFS App --\u0026gt; Aurora App --\u0026gt; S3 App --\u0026gt; Bedrock style App fill:#bfdbfe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f style Proxy fill:#bfdbfe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f style Orch fill:#ccfbf1,stroke:#14b8a6,stroke-width:2px,color:#134e4a style DDB fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f style Sandbox1 fill:#bfdbfe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f style Sandbox2 fill:#bfdbfe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f style SandboxN fill:#bfdbfe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f The comprehensive EC2 layer, encompassing launch templates, Auto Scaling Groups, user-data scripts (~200 lines of 
bash), Watchtower for auto-updates, and Docker Compose orchestration, has been entirely replaced. The new architecture integrates three Fargate services (App, OpenResty, Sandbox Orchestrator), a DynamoDB registry for tracking sandbox lifecycles, and Cloud Map for private DNS service discovery.\nAspect EC2 (v0.1.0) Fargate (v1.0.0) CDK Stacks 6 10 Compute EC2 ASG (m7g.xlarge) ECS Fargate (ARM64) Sandbox Isolation Shared EFS mount Per-conversation EFS access points Network Isolation Self-referencing security group Per-role security groups Service Discovery Docker Compose on EC2 Cloud Map private DNS Idle Cost ~$375/month (always-on) Tasks stop when idle Updates Watchtower auto-pull ECS deployment circuit breaker Per-Conversation Fargate Isolation A key architectural shift involves the management of sandbox containers. In the EC2 architecture, OpenHands directly spawned Docker containers on the host, resulting in shared EFS mounts, network resources, and compute capacity among all containers.\nThe Fargate architecture introduces a sandbox orchestrator responsible for managing the complete lifecycle of per-conversation ECS Fargate tasks, as illustrated in the following sequence:\nsequenceDiagram participant User participant App as App Service participant Orch as Sandbox Orchestrator participant DDB as DynamoDB Registry participant ECS as ECS Fargate participant EFS as EFS Access Point User-\u0026gt;\u0026gt;App: Start conversation App-\u0026gt;\u0026gt;Orch: POST /start Orch-\u0026gt;\u0026gt;EFS: Create access point\u0026lt;br/\u0026gt;/sandbox-workspace/{conv_id} Orch-\u0026gt;\u0026gt;ECS: RunTask (sandbox task) ECS--\u0026gt;\u0026gt;Orch: Task ARN Orch-\u0026gt;\u0026gt;DDB: Register {conv_id, taskArn, status: RUNNING} Orch--\u0026gt;\u0026gt;App: Sandbox endpoint App--\u0026gt;\u0026gt;User: Conversation ready Note over ECS: Idle timeout (configurable) ECS--\u0026gt;\u0026gt;Orch: Task stopped Orch-\u0026gt;\u0026gt;DDB: Update status: PAUSED 
User-\u0026gt;\u0026gt;App: Resume conversation App-\u0026gt;\u0026gt;Orch: POST /resume Orch-\u0026gt;\u0026gt;ECS: RunTask (reuse access point) Orch-\u0026gt;\u0026gt;DDB: Update status: RUNNING Orch--\u0026gt;\u0026gt;App: New sandbox endpoint Filesystem Isolation with EFS Access Points Each conversation is provisioned with a dedicated EFS access point, rooted at /sandbox-workspace/\u0026lt;conversation_id\u0026gt; and configured with POSIX uid/gid 1000. This access point establishes a physical root boundary, preventing containers from accessing parent directories or other conversation workspaces.\nUpon conversation termination or sandbox task replacement, the associated access point is purged. When a conversation resumes, a new Fargate task is launched using the same access point, ensuring the persistence of all workspace files.\nService Discovery with Cloud Map The App service and Sandbox Orchestrator leverage AWS Cloud Map for private DNS-based service discovery. The orchestrator registers as orchestrator.openhands.local:8081, while the App service connects to app.openhands.local:3000. This mechanism replaces the Docker Compose networking previously utilized for inter-service communication on a single EC2 host.\nMulti-Tenant Security The transition to Fargate facilitates a layered security model, an enhancement not achievable with shared-host Docker containers.\nNetwork Isolation The EC2 architecture permitted inter-sandbox communication over any TCP port via a self-referencing security group rule. The Fargate architecture eliminates this vulnerability:\nflowchart TD subgraph Security Group Boundaries ALB_SG[ALB SG] App_SG[App Service SG] Orch_SG[Orchestrator SG] Sandbox_SG[Sandbox SG] end ALB_SG --\u0026gt;|Internet 443| App_SG App_SG --\u0026gt; Orch_SG App_SG --\u0026gt; Sandbox_SG Sandbox_SG -.-x Sandbox_SG Sandbox_SG -.-x Orch_SG Each role (ALB, App, Orchestrator, Sandbox) is assigned a distinct security group. 
Sandbox tasks are restricted to receiving connections solely from the App service, thereby precluding communication with other sandboxes, direct access to the orchestrator API, or circumvention of CloudFront/Lambda@Edge authentication.\nPer-User Configuration with KMS Encryption A dedicated User Configuration API, implemented via Lambda and HTTP API Gateway, enables users to customize their OpenHands experience:\nMCP Server Configurations: Users can integrate custom MCP servers for third-party tool integration. Third-Party Integrations: GitHub and Slack tokens are supported with automatic MCP injection. LLM Model Selection: Users can select from available Bedrock models (e.g., Claude Opus 4.5, Sonnet 4.5, Haiku 4.5). All user secrets (API keys, tokens) are protected using KMS envelope encryption, ensuring that sensitive values are never logged or exposed via API responses. User configurations are merged with the global config.toml at runtime, allowing administrators to define baseline defaults while empowering users to personalize their experience.\nOrphan Task Detection Within a distributed system, race conditions can result in orphaned Fargate tasks operating without corresponding DynamoDB records. An EventBridge rule monitors ECS Task State Change events, triggering a Lambda function that performs the following actions:\nEnumerates all RUNNING sandbox tasks. Compares the enumerated tasks against DynamoDB records. Applies a 5-minute grace period for tasks still in the startup phase. Stops any identified orphan tasks. Publishes OrphanSandboxesStopped CloudWatch metrics for monitoring purposes. This event-driven cleanup mechanism prevents cost accumulation from abandoned resources.\nZero Idle Cost Architecture The primary advantage of the Fargate migration is the paradigm shift from an always-on to a pay-per-use cost model.\nCost Comparison In the EC2 architecture, the instance must be sized to accommodate all concurrent sandboxes on a single host. 
Each OpenHands sandbox requires approximately 4 vCPU and 8 GB of memory. Supporting 10-20 concurrent conversations demands a large instance such as m7g.4xlarge (16 vCPU, 64 GB) or m7g.8xlarge (32 vCPU, 128 GB) — and that instance runs 24/7 regardless of actual usage.\nThe Fargate architecture decouples the App service from sandbox compute. CloudWatch metrics from a staging deployment confirmed that the App service is a control plane only — handling API routing, session management, and S3/DB reads — with average CPU utilization under 1% and memory usage around 580 MB. This allows the App to start with just 1 vCPU / 2 GB (~$29/month on ARM64 Fargate), with target tracking auto scaling (1-3 tasks, 60% CPU threshold) handling traffic spikes. Sandboxes launch as independent Fargate tasks billed per-second, scaling from zero to dozens of concurrent conversations with no pre-provisioning.\nComponent EC2 (v0.1.0) Fargate (v1.0.0) Compute (App + Sandbox) EC2 m7g.4xlarge: ~$450/mo Fargate App (1 vCPU): ~$29/mo Sandbox compute Included (always on) Per-second billing (scale to zero) Database Aurora: ~$43-80/mo Aurora: ~$43-80/mo Networking CloudFront + ALB: ~$110/mo CloudFront + ALB: ~$110/mo VPC Endpoints ~$50/mo ~$50/mo Other EBS + S3 + NAT: ~$60-70/mo S3 + NAT: ~$50-60/mo Total base ~$710-760/mo ~$280-330/mo Per additional sandbox Requires larger EC2 instance ~$0.18/hr per active task The cost advantage is substantial. The EC2 model requires over-provisioning for peak capacity — paying for a large instance even during off-hours when no one is coding. The Fargate model pays only for active conversations. 
A team of 10 engineers who each use OpenHands for 4 hours per day would consume approximately $144/month in sandbox compute (10 users x 4 hrs x 20 workdays = 800 task-hours, at $0.18/hr), bringing the total to ~$424-474/month, compared to the EC2 model's fixed ~$710-760/month for an instance large enough to support them.\nIdle Timeout Mechanism Each sandbox task is configured with an adjustable idle timeout (10 minutes for staging, 30 minutes for production). When no activity is detected:\nAn idle monitor Lambda identifies inactive tasks. Tasks are gracefully terminated. DynamoDB records are updated to PAUSED status. The conversation is archived, but all workspace files persist on EFS. Upon user return, the App service invokes the orchestrator's /resume endpoint, which launches a new Fargate task utilizing the existing EFS access point. This ensures conversation continuity, leveraging Aurora for metadata, S3 for event storage, and EFS for workspace data.\nWarm Pool for Rapid Startup The cold-start latency for a Fargate task typically ranges from 30 to 60 seconds. To enhance user experience, the orchestrator maintains a configurable warm pool (default: 2 pre-warmed tasks). A warm pool synchronization process verifies the availability of pre-started tasks every 15 seconds. 
When a new conversation is initiated, a pre-warmed task is immediately claimed, circumventing cold-start delays.\nThis approach represents a cost-performance trade-off: warm pool tasks consume Fargate compute resources while awaiting assignment, but they mitigate the cold-start latency that would otherwise disrupt user interaction.\nAWS Services Composition The v1.0.0 architecture integrates over 10 AWS services, each fulfilling a specific role:\nAWS Service Role ECS Fargate Compute for App, OpenResty, and sandbox tasks Cloud Map Private DNS service discovery for inter-service communication DynamoDB Sandbox lifecycle registry (conversation → task mapping) EFS Persistent workspace storage with per-conversation access points Aurora Serverless v2 Conversation metadata with RDS Proxy connection pooling EventBridge Event-driven sandbox cleanup based on task state changes Lambda Idle monitoring, orphan detection, user config API, DB bootstrap KMS Envelope encryption for user secrets CloudFront Global edge distribution with Lambda@Edge for authentication Cognito User authentication using OAuth 2.0 / Managed Login v2 WAF Rate limiting and request filtering Bedrock LLM access (Claude models) via IAM roles VPC Endpoints Private connectivity to AWS services (eliminates public internet access) All AWS API calls originating from Fargate tasks are routed through VPC Endpoints, ensuring that sandboxes communicate with AWS services without traversing the public internet. 
RDS Proxy manages connection pooling for Aurora, preventing connection exhaustion when multiple Fargate tasks access the database concurrently.\nThe infrastructure is defined across 10 CDK stacks, with an explicit deployment order:\nAuth → Network → Monitoring → Security → Database →\n  UserConfig → Cluster → Sandbox → Compute → Edge\nDeployment Prerequisites\nVPC with private subnets (minimum 2 AZs)\nNAT Gateway\nRoute 53 hosted zone\nNode.js 20+\nQuick Start\ngit clone https://github.com/zxkane/openhands-infra.git\ncd openhands-infra\nnpm install\n\n# Bootstrap CDK (both regions required)\nnpx cdk bootstrap --region us-west-2\nnpx cdk bootstrap --region us-east-1\n\n# Deploy all 10 stacks\nnpx cdk deploy --all \\\n  --context vpcId=vpc-xxxxx \\\n  --context hostedZoneId=Z0xxxxx \\\n  --context domainName=example.com \\\n  --context subDomain=openhands\nFollowing deployment, access https://openhands.example.com and authenticate using Cognito credentials. Each user gets isolated conversations with dedicated, per-conversation sandboxes that spin up on demand and terminate when idle.\nConclusion\nThe architectural evolution from EC2 to fully serverless ECS Fargate transforms openhands-infra from a single-user deployment tool into a production-ready, multi-tenant platform. Per-conversation Fargate isolation establishes robust security boundaries, event-driven cleanup prevents unnecessary costs, and the pay-per-second compute model aligns infrastructure expenditure with actual usage.\nThe openhands-infra repository is open source. 
For details on the original EC2-based architecture, refer to the previous post.\nResources openhands-infra Repository (v1.0.0) OpenHands Documentation AWS CDK Documentation ","link":"https://kane.mx/posts/2026/serverless-multi-tenant-openhands-on-aws/","section":"posts","tags":["AWS CDK","OpenHands","ECS Fargate","Serverless","Multi-Tenant","AI Agent","DynamoDB","EFS","Cloud Map","EventBridge","Self-Hosted AI"],"title":"Serverless Multi-Tenant OpenHands on AWS with Fargate"},{"body":"What if an AI could operate like a senior software engineer - not just writing code, but following the complete engineering process from design through deployment? This post introduces the AI Digital Engineer pattern: a system that transforms Claude Code from an interactive assistant into an autonomous engineer capable of delivering production-ready software.\nThe Problem with Traditional AI Coding Assistants Most AI coding tools operate in a simple request-response pattern: you ask for code, they generate it. This approach has fundamental limitations:\nNo process discipline: The AI writes code without tests, reviews, or verification Fragile workflows: Complex multi-step tasks get lost in context windows Unreliable execution: LLM outputs are probabilistic, not guaranteed Expensive scaling: Every verification step requires another LLM call No audit trail: How do you prove the AI followed your engineering standards? 
What we need is a system that combines LLM intelligence for orchestration with deterministic guarantees for execution.\nIntroducing the AI Digital Engineer An AI Digital Engineer is an autonomous system that operates like a human software engineer - following the complete engineering process:\nHuman Engineer AI Digital Engineer Reviews design requirements Reads design canvas via Pencil MCP Writes tests before code (TDD) Enforced by PreToolUse hooks Implements features Claude Code implementation Runs code review Spawns PR Review agents Responds to review feedback Addresses Amazon Q/Codex findings Verifies CI passes Stop hook blocks until green Performs E2E testing Chrome DevTools MCP integration Cannot merge without approvals Hook system enforces all gates The key difference from traditional AI assistants: the AI Digital Engineer cannot skip steps. Quality gates are enforced by deterministic systems, not LLM memory.\nThe Hybrid Architecture: Intelligence + Reliability The AI Digital Engineer architecture separates concerns between what needs intelligence (orchestration) and what needs reliability (execution):\n1┌────────────────────────────────────────────────────────────────────┐ 2│ AI Digital Engineer Architecture │ 3├────────────────────────────────┬───────────────────────────────────┤ 4│ Intelligent Orchestration │ Deterministic Execution │ 5│ (Claude Code Skills) │ (Hooks + GitHub Actions + Agents)│ 6├────────────────────────────────┼───────────────────────────────────┤ 7│ ✓ Complex workflow navigation │ ✓ 100% execution guarantee │ 8│ ✓ Exception handling \u0026amp; recovery│ ✓ Zero LLM call cost │ 9│ ✓ Start/resume from any step │ ✓ No context window limits │ 10│ ✓ Dynamic branching decisions │ ✓ Millisecond response time │ 11│ ✓ Natural language understanding│ ✓ Auditable execution logs │ 12├────────────────────────────────┼───────────────────────────────────┤ 13│ Best for: Reasoning \u0026amp; judgment │ Best for: Quality gates \u0026amp; triggers│ 
14└────────────────────────────────┴───────────────────────────────────┘ Why This Separation Matters Traditional agent architectures rely on LLMs for everything - including remembering to run tests, checking CI status, and enforcing review requirements. This approach fails because:\nTraditional AI Agent AI Digital Engineer Every step calls LLM LLM only for orchestration decisions Relies on LLM to remember steps Hooks enforce execution automatically Context overflow causes skipped steps Persistent state enables resume Expensive and unpredictable Predictable cost model Difficult to audit Complete execution logs The AI Digital Engineer uses Skills for the brain (what to do, how to handle exceptions) and Hooks for the muscles (guaranteed execution of quality gates).\nThe Three Pillars Pillar 1: Claude Code Skills - The Intelligent Orchestrator Skills are markdown files that guide Claude through complex workflows. The github-workflow skill defines a 12-step development process:\n1--- 2description: GitHub development workflow for end-to-end delivery 3--- 4 5You are following a structured development workflow. Current phase: $PHASE 6 7## Workflow Steps 81. Design Canvas - Create UI/architecture mockups (Pencil MCP) 92. Branch Creation - Use feat/ or fix/ prefix 103. Test Plan - Document test cases before implementation 114. Implementation - Write code following TDD 125. Unit Tests - Verify all tests pass 136. Code Simplification - Run simplifier agent 147. PR Creation - Commit with standardized template 158. PR Review - Run review toolkit agents 169. CI Verification - Wait for GitHub Actions 1710. Bot Review Handling - Address Amazon Q/Codex findings 1811. E2E Testing - Verify on preview environment 1912. 
Completion - All gates passed, ready for merge 20 21## Exception Handling 22- If CI fails: Analyze logs, fix issues, re-push 23- If bot review finds issues: Address each comment thread 24- If E2E fails: Debug, fix, mark e2e-tests complete 25 26## Resume Capability 27Current state is tracked in .claude/state/ 28You can resume from any step based on completed states. Key capability: Skills enable Claude to handle exceptions intelligently. When CI fails, the skill guides Claude to analyze logs and fix issues - something deterministic scripts cannot do.\nPillar 2: Claude Code Hooks - The Enforcement Layer Hooks are shell scripts that execute at specific points in Claude's workflow. They guarantee that quality gates are enforced:\n1{ 2 \u0026#34;hooks\u0026#34;: { 3 \u0026#34;PreToolUse\u0026#34;: [ 4 { 5 \u0026#34;matcher\u0026#34;: \u0026#34;Bash\u0026#34;, 6 \u0026#34;hooks\u0026#34;: [{ 7 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 8 \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-design-canvas.sh\u0026#34; 9 }] 10 }, 11 { 12 \u0026#34;matcher\u0026#34;: \u0026#34;Write|Edit\u0026#34;, 13 \u0026#34;hooks\u0026#34;: [{ 14 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 15 \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-test-plan.sh\u0026#34; 16 }] 17 } 18 ], 19 \u0026#34;Stop\u0026#34;: [{ 20 \u0026#34;hooks\u0026#34;: [{ 21 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 22 \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/verify-completion.sh\u0026#34; 23 }] 24 }] 25 } 26} How hooks enforce the workflow:\nHook Trigger Enforcement check-design-canvas.sh Before git commit Blocks commit without design doc check-test-plan.sh Before file write/edit Blocks code changes without test plan check-unit-tests.sh Before git commit Blocks commit with failing tests check-code-simplifier.sh Before git commit Blocks commit without simplification review check-pr-review.sh Before git push Blocks push without PR review 
verify-completion.sh On task stop Blocks completion without CI + E2E + resolved comments Critical insight: These hooks execute in milliseconds with zero LLM cost. They don't ask Claude to remember to check - they physically prevent violations.\nPillar 3: GitHub Actions - External Verification GitHub Actions provide verification that happens outside Claude's context:\n1name: CI 2on: [push, pull_request] 3 4jobs: 5 build-and-test: 6 runs-on: ubuntu-latest 7 steps: 8 - uses: actions/checkout@v4 9 - name: Install dependencies 10 run: npm ci 11 - name: Run tests 12 run: npm test 13 - name: Build 14 run: npm run build 15 16 security-review: 17 runs-on: ubuntu-latest 18 steps: 19 - name: Amazon Q Security Review 20 uses: aws/amazon-q-developer-action@v1 21 - name: CodeQL Analysis 22 uses: github/codeql-action/analyze@v3 The verify-completion.sh hook queries GitHub's API to ensure:\nCI workflow has passed All review comments are resolved E2E tests are marked complete 1# From verify-completion.sh 2CI_STATUS=$(gh run list --branch \u0026#34;$BRANCH\u0026#34; --limit 1 --json conclusion -q \u0026#39;.[0].conclusion\u0026#39;) 3if [ \u0026#34;$CI_STATUS\u0026#34; != \u0026#34;success\u0026#34; ]; then 4 echo \u0026#34;❌ Cannot complete: CI has not passed\u0026#34; 5 exit 1 6fi 7 8# Check unresolved review threads 9UNRESOLVED=$(gh api graphql -f query=\u0026#39;...\u0026#39; --jq \u0026#39;.data.repository.pullRequest.reviewThreads.nodes | map(select(.isResolved == false)) | length\u0026#39;) 10if [ \u0026#34;$UNRESOLVED\u0026#34; -gt 0 ]; then 11 echo \u0026#34;❌ Cannot complete: $UNRESOLVED unresolved review comments\u0026#34; 12 exit 1 13fi Real-World Example: Multi-tenant User Configuration System To demonstrate the AI Digital Engineer in action, let's walk through a real complex feature implementation: a multi-tenant user configuration system for an OpenHands deployment platform.\nThe Feature Requirements The task was to build a complete user configuration management 
system allowing each tenant to:\nConfigure custom MCP servers (stdio and HTTP types) Manage third-party integrations (GitHub, Slack) with auto-MCP injection Store encrypted secrets using KMS envelope encryption Merge user configs with global platform configuration Technical scope:\n6,300+ lines of code across 36 files AWS Lambda + API Gateway + KMS + S3 architecture TypeScript CDK infrastructure + Python Lambda handlers Comprehensive unit tests and E2E test cases How the AI Digital Engineer Delivered It Phase 1: Design \u0026amp; Test Plan\nThe workflow began with design documentation and test case definition. The check-design-canvas.sh hook blocked any implementation until architecture was documented. The check-test-plan.sh hook ensured test cases were written before code.\nPhase 2: Initial Implementation\nClaude implemented the full feature with:\nUserConfigStack: Lambda + HTTP API Gateway for /api/v1/user-config/* endpoints UserConfigLoader: S3-based config loader integrated with Cognito authentication KMS envelope encryption for user secrets Python Lambda with uv lock file for reproducible dependencies Phase 3: CI Failures \u0026amp; Recovery\nThis is where the intelligent orchestration proved essential. The CI pipeline failed with CDK token parsing errors:\n1Error: The URL constructor cannot parse CDK tokens at synthesis time The Skill guided Claude to analyze the error and apply the fix - using Fn.split and Fn.select intrinsic functions instead of JavaScript URL parsing. A deterministic script couldn't diagnose this; it required LLM reasoning.\nPhase 4: Bot Review Integration\nAmazon Q Security Review flagged several issues:\nPlaintext KMS keys not cleared from memory after decryption Missing explicit deny policy on KMS key for sensitive operations Potential path traversal vulnerabilities in user ID handling The workflow's verify-completion.sh hook blocked task completion until all review threads were resolved. 
Claude addressed each finding with targeted commits:\n1# Commit: fix(security): address reviewer bot findings 2- Clear plaintext KMS keys from memory after use 3- Add explicit deny policy to KMS key for PutKeyPolicy, CreateGrant, ScheduleKeyDeletion 4- Add input validation to prevent path traversal attacks (CWE-22) Phase 5: E2E Testing Discovery\nDuring manual E2E testing on the staging environment, a critical multi-tenancy bug was discovered: User A's secrets were visible to User B. The root cause? OpenHands stored secrets at the S3 bucket root, not in user-scoped paths.\nThis is exactly the scenario where Skills excel - handling unexpected exceptions. Claude:\nDocumented the bug in test cases (TC-019, TC-020) Designed user-scoped storage paths (users/{user_id}/secrets.json) Implemented S3SecretsStore and S3SettingsStore with proper isolation Added startup verification to ensure patches were applied correctly Phase 6: Architecture Refinement\nA reviewer suggested replacing API Gateway with ALB Lambda target groups for:\nArchitecture consistency (single entry point) Cost optimization (no API Gateway fees) Lower latency (one less hop) Claude refactored the entire routing layer, updating Lambda handlers to support ALB event format and modifying CloudFront distribution configuration.\nThe Delivery Timeline Phase Commits What Happened Initial Implementation 1 Full feature with tests CI Fixes 2 CDK token parsing, test requirements Security Review 3 Memory clearing, KMS policy, input validation E2E Bug Discovery 2 Multi-tenancy isolation bug found and fixed Architecture Refactor 3 API Gateway → ALB migration Final Polish 13 Bedrock compatibility, MCP deduplication, snapshot updates Total: 24 commits over 2 days, resulting in production-ready code.\nKey Insights Skills handled the unexpected: CI failures, security vulnerabilities, and multi-tenancy bugs all required reasoning and judgment - not scripted responses.\nHooks guaranteed quality gates: Every commit passed 
through code simplification. Every push triggered PR review. Task completion was blocked until CI passed and review comments were resolved.\nThe hybrid architecture worked: LLM costs were controlled (orchestration only), while execution was 100% reliable (hooks enforced every gate).\nIterative refinement was automatic: The workflow naturally drove 24 iterations of improvement, each triggered by external feedback (CI, bot reviews, E2E testing).\nImplementing the AI Digital Engineer Step 1: Clone the Workflow Template 1git clone https://github.com/zxkane/claude-code-workflow.git 2cp -r claude-code-workflow/.claude your-project/.claude Step 2: Configure Hooks The template includes pre-configured hooks in .claude/settings.json:\n1{ 2 \u0026#34;permissions\u0026#34;: { 3 \u0026#34;allow\u0026#34;: [\u0026#34;Bash(.claude/hooks/*)\u0026#34;, \u0026#34;mcp__*\u0026#34;] 4 }, 5 \u0026#34;hooks\u0026#34;: { 6 \u0026#34;PreToolUse\u0026#34;: [ 7 { 8 \u0026#34;matcher\u0026#34;: \u0026#34;Bash\u0026#34;, 9 \u0026#34;hooks\u0026#34;: [ 10 {\u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-design-canvas.sh\u0026#34;}, 11 {\u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-code-simplifier.sh\u0026#34;}, 12 {\u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-pr-review.sh\u0026#34;}, 13 {\u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-unit-tests.sh\u0026#34;} 14 ] 15 }, 16 { 17 \u0026#34;matcher\u0026#34;: \u0026#34;Write|Edit\u0026#34;, 18 \u0026#34;hooks\u0026#34;: [ 19 {\u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/check-test-plan.sh\u0026#34;} 20 ] 21 } 22 ], 23 \u0026#34;Stop\u0026#34;: [ 24 { 25 \u0026#34;hooks\u0026#34;: [ 26 {\u0026#34;type\u0026#34;: 
\u0026#34;command\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;.claude/hooks/verify-completion.sh\u0026#34;, \u0026#34;timeout\u0026#34;: 10} 27 ] 28 } 29 ] 30 } 31} Step 3: Set Up State Management The state manager tracks workflow progress persistently:\n1# Mark a step complete 2.claude/hooks/state-manager.sh mark design-canvas 3 4# Check if a step was completed (within 30-minute window) 5.claude/hooks/state-manager.sh check test-plan 6 7# List all completed states 8.claude/hooks/state-manager.sh list 9 10# Clear state for re-run 11.claude/hooks/state-manager.sh clear-all State is stored in .claude/state/ as JSON files with metadata:\n1{ 2 \u0026#34;action\u0026#34;: \u0026#34;design-canvas\u0026#34;, 3 \u0026#34;timestamp\u0026#34;: \u0026#34;2026-01-31T10:30:00Z\u0026#34;, 4 \u0026#34;commit\u0026#34;: \u0026#34;abc123\u0026#34;, 5 \u0026#34;branch\u0026#34;: \u0026#34;feat/user-config\u0026#34;, 6 \u0026#34;files\u0026#34;: [\u0026#34;docs/design/user-config.pen\u0026#34;] 7} Step 4: Configure GitHub Actions Add CI workflow that the completion hook will verify:\n1# .github/workflows/ci.yml 2name: CI 3on: [push, pull_request] 4 5jobs: 6 test: 7 runs-on: ubuntu-latest 8 steps: 9 - uses: actions/checkout@v4 10 - run: npm ci 11 - run: npm test 12 - run: npm run build 13 14 review: 15 runs-on: ubuntu-latest 16 permissions: 17 contents: read 18 pull-requests: write 19 steps: 20 - uses: actions/checkout@v4 21 - name: Amazon Q Code Review 22 uses: aws/amazon-q-developer-action@v1 23 with: 24 command: review Step 5: Start Development Simply tell Claude what you want to build:\n1Design and implement a user authentication system with JWT tokens The github-workflow skill activates automatically and guides Claude through:\nCreating a design canvas Writing test cases Implementing the feature Running reviews and CI Completing E2E verification If you need to resume after a break:\n1Continue working on the authentication feature Claude reads the state files and resumes 
from the appropriate step.\nAdvanced Patterns Pattern 1: Spawning Sub-Agents for Parallel Work Hooks can spawn specialized agents for specific tasks:\n1{ 2 \u0026#34;PostToolUse\u0026#34;: [ 3 { 4 \u0026#34;matcher\u0026#34;: \u0026#34;Bash(git push)\u0026#34;, 5 \u0026#34;hooks\u0026#34;: [{ 6 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 7 \u0026#34;command\u0026#34;: \u0026#34;claude -p \u0026#39;Run PR review toolkit on current branch\u0026#39; --background\u0026#34; 8 }] 9 } 10 ] 11} This allows:\nPR review agents to run asynchronously Security scan agents to analyze code in parallel Test coverage agents to report independently Pattern 2: Conditional Workflow Branches Skills can define conditional paths based on context:\n1## Workflow Branches 2 3If this is a bug fix (branch starts with fix/): 4- Skip design canvas requirement 5- Focus on regression test 6- Expedited review process 7 8If this is a security fix: 9- Require security team review 10- Run additional security scans 11- Notify security channel Pattern 3: External Tool Integration via MCP The workflow integrates with external tools through MCP:\nPencil MCP: Design canvas creation and validation GitHub MCP: PR management and review queries Chrome DevTools MCP: E2E testing automation AWS MCP: Infrastructure documentation queries Cost and Performance Analysis Comparing traditional agent approaches with the AI Digital Engineer:\nMetric Traditional Agent AI Digital Engineer LLM calls per PR 50-100+ (every check) 10-20 (decisions only) Cost per feature $5-15 $1-3 Verification reliability ~80% (LLM may forget) 100% (hooks enforce) Context overflow risk High (long workflows) None (state persisted) Audit trail Conversation only Hooks + Git + CI logs The hybrid architecture reduces costs by 70-80% while improving reliability from probabilistic to deterministic.\nConclusion The AI Digital Engineer pattern transforms Claude Code from a coding assistant into an autonomous software engineer. 
The key insight is separation of concerns:\nSkills provide intelligent orchestration - handling complex workflows, exceptions, and decisions that require reasoning Hooks provide guaranteed execution - enforcing quality gates without relying on LLM memory or expensive API calls GitHub Actions provide external verification - ensuring standards are met outside the AI's context This hybrid architecture delivers:\nProduction-ready code that passes bot reviews and security scans Predictable costs by minimizing LLM calls for deterministic operations Complete auditability through persistent state and execution logs Resilient workflows that can resume from any point after interruption The claude-code-workflow template provides everything you need to implement this pattern. Clone it, configure it for your project, and start delivering software with an AI Digital Engineer.\nResources Claude Code Workflow Template - Complete implementation with skills, hooks, and state management Claude Code Hooks Documentation - Official hooks reference Claude Code Skills Guide - How to create custom skills Model Context Protocol - MCP integration for external tools Have you implemented AI-assisted development workflows? 
Share your experiences and patterns in the comments below!\n","link":"https://kane.mx/posts/2026/ai-digital-engineer-claude-code/","section":"posts","tags":["Claude Code","AI Digital Engineer","GitHub Actions","DevOps Automation","Software Engineering","CI/CD","Test-Driven Development"],"title":"AI Digital Engineer: End-to-End Delivery with Claude Code"},{"body":"","link":"https://kane.mx/tags/software-engineering/","section":"tags","tags":null,"title":"Software Engineering"},{"body":"","link":"https://kane.mx/tags/test-driven-development/","section":"tags","tags":null,"title":"Test-Driven Development"},{"body":"","link":"https://kane.mx/tags/aurora-postgresql/","section":"tags","tags":null,"title":"Aurora PostgreSQL"},{"body":"","link":"https://kane.mx/tags/cloudfront/","section":"tags","tags":null,"title":"CloudFront"},{"body":"","link":"https://kane.mx/tags/cognito/","section":"tags","tags":null,"title":"Cognito"},{"body":"OpenHands is an open-source AI-driven development platform that enables AI agents to write code, fix bugs, and execute complex development tasks autonomously. The default setup works well for local development, but what if you want to run it for a team or make it accessible from anywhere?\nThis post introduces an AWS CDK project that extends OpenHands beyond single-user local usage. It adds multi-user authentication, persistent storage, and automatic recovery capabilities—transforming OpenHands into a shared service that your team can access from anywhere.\nThe Rise of Cloud-Based AI Coding Agents The AI coding landscape has evolved rapidly. 
While IDE-integrated tools like GitHub Copilot focus on code completion, a new category of autonomous AI coding agents has emerged—platforms that can independently plan, write, test, and deploy code with minimal human intervention.\nCommercial cloud platforms now offer managed AI coding experiences:\nDevin (Cognition AI): The first widely-known autonomous AI software engineer, handling complete tasks from planning to deployment Claude Code: Anthropic's agentic coding tool that runs in your terminal, with Claude.ai offering Artifacts for interactive code generation OpenAI Codex: OpenAI's cloud-based coding agent running in sandboxed environments with GitHub integration v0 (Vercel): Cloud-based platform for generating full-stack applications with one-click deployment These platforms provide convenience but come with trade-offs: vendor lock-in, data privacy concerns, limited customization, and recurring subscription costs.\nWhy self-host OpenHands?\nConsideration Cloud Platforms Self-Hosted OpenHands Data privacy Data processed by third parties Your data stays in your AWS account Customization Limited to vendor features Full control over configuration Cost model Per-seat subscription Infrastructure cost (scales with usage) LLM choice Vendor-selected models Any model (Bedrock, OpenAI, local) Integration Vendor-provided integrations Custom integrations via API For organizations with strict compliance requirements, existing AWS infrastructure, or teams that prefer open-source solutions, self-hosting OpenHands provides the autonomous coding capabilities without the constraints of commercial platforms.\nWhy AWS CDK? AWS CDK (Cloud Development Kit) is an Infrastructure as Code (IaC) tool that lets you define cloud infrastructure using familiar programming languages like TypeScript, Python, or Java. 
Instead of manually clicking through the AWS Console or writing verbose CloudFormation YAML, you write code that CDK compiles into CloudFormation templates.\nKey benefits for this project:\nReproducible deployments: Deploy the same infrastructure to dev, staging, and production accounts with a single codebase Multi-account management: Use CDK's cross-account deployment capabilities to manage OpenHands instances across different AWS accounts Version controlled infrastructure: Track infrastructure changes in Git, review in pull requests, and roll back if needed Type safety: TypeScript catches configuration errors at compile time rather than deployment time Reusable constructs: The six stacks in this project can be customized via context parameters for different environments For teams running OpenHands across multiple AWS accounts (e.g., separate accounts for each team or environment), CDK makes it straightforward to maintain consistency while allowing per-account customization.\nOpenHands OSS vs This Project Before diving into the architecture, let's understand what the default OpenHands provides and what this project adds.\nDefault OpenHands (Local Setup) Out of the box, OpenHands runs as a single Docker container:\nAspect Default Behavior Database SQLite (single file) Storage Local filesystem Users Single user (localhost only) Authentication None LLM Multiple providers (API keys required) Persistence Lost when container removed Scaling Single container This works well for individual developers experimenting locally, but has limitations for team usage or always-on deployments.\nWhat This Project Adds The openhands-infra project addresses these limitations:\nAspect This Project Database Aurora Serverless v2 PostgreSQL Storage S3 (events) + EFS (workspaces) Users Multi-user with Cognito Authentication OAuth 2.0 via Lambda@Edge LLM AWS Bedrock (no API keys needed) Persistence Self-healing (survives instance replacement) Scaling Auto Scaling Group Architecture Overview 
The infrastructure uses six CDK stacks across two AWS regions:\nflowchart LR User([User]) --\u0026gt; CF subgraph us-east-1 CF[CloudFront] WAF[WAF] LE[Lambda Edge] Cognito[Cognito] end subgraph Main Region ALB[ALB] EC2[EC2 Graviton] Aurora[(Aurora)] S3[(S3)] EFS[(EFS)] Bedrock[Bedrock] end CF --\u0026gt; WAF CF --\u0026gt; LE LE --\u0026gt; Cognito CF --\u0026gt; ALB ALB --\u0026gt; EC2 EC2 --\u0026gt; Aurora EC2 --\u0026gt; S3 EC2 --\u0026gt; EFS EC2 --\u0026gt; Bedrock Edge Layer (us-east-1): CloudFront for global access, Lambda@Edge for authentication, Cognito for user management, WAF for protection.\nMain Region: EC2 with Docker Compose running OpenHands, Aurora for conversation metadata, S3 for events, EFS for workspace files, Bedrock for LLM access.\nKey Differences Explained 1. Multi-User Authentication OSS: No authentication—anyone with access to the URL can use it.\nThis Project: Cognito User Pool with OAuth 2.0 flow. Lambda@Edge validates JWT tokens at the CloudFront edge before requests reach the backend.\nsequenceDiagram User-\u0026gt;\u0026gt;CloudFront: Request CloudFront-\u0026gt;\u0026gt;LambdaEdge: Check auth alt No valid token LambdaEdge-\u0026gt;\u0026gt;User: Redirect to Cognito User-\u0026gt;\u0026gt;Cognito: Login Cognito-\u0026gt;\u0026gt;User: JWT token end LambdaEdge-\u0026gt;\u0026gt;ALB: Forward with user ID ALB-\u0026gt;\u0026gt;OpenHands: Process request Each user's conversations are isolated—stored in per-user S3 paths and labeled containers.\n2. Persistent Storage OSS: SQLite file and local storage. Data is lost when the container is removed.\nThis Project: Three-tier persistent storage:\nData Type Storage What Happens on Instance Replacement Conversation metadata Aurora PostgreSQL Preserved Conversation events S3 (versioned) Preserved Workspace files EFS Preserved Instance state EBS Cleared When an EC2 instance is replaced (due to scaling, updates, or failures), users can resume their conversations. 
The new instance mounts the same EFS workspace and reconnects to Aurora.\n3. LLM Integration\nOSS: Supports many LLM providers (OpenAI, Anthropic, Google, local models, etc.) but requires users to configure and manage their own API keys.\nThis Project: Uses AWS Bedrock with IAM role-based access. No API keys are needed: the EC2 instance's IAM role grants access to Claude models. This simplifies credential management and enables usage tracking via AWS billing.\n4. Container Discovery\nOSS: Single container with direct port access.\nThis Project: OpenHands creates sandbox containers dynamically for each conversation. An OpenResty proxy discovers these containers via the Docker API and routes requests using wildcard subdomains:\nhttps://5000-abc123.runtime.openhands.example.com/\n  ↓\nOpenResty queries Docker API → finds container → proxies to container IP:5000\nThis allows multiple concurrent conversations with isolated runtime environments.\nDeployment Prerequisites\nVPC with private subnets (2+ AZs)\nNAT Gateway\nRoute 53 hosted zone\nNode.js 20+\nQuick Start\ngit clone https://github.com/zxkane/openhands-infra.git\ncd openhands-infra\nnpm install\n\n# Bootstrap CDK\nnpx cdk bootstrap --region us-west-2\nnpx cdk bootstrap --region us-east-1\n\n# Deploy\nnpx cdk deploy --all \\\n  --context vpcId=vpc-xxxxx \\\n  --context hostedZoneId=Z0xxxxx \\\n  --context domainName=example.com \\\n  --context subDomain=openhands\nAfter deployment, access https://openhands.example.com and log in with Cognito credentials.\nCost Considerations\nRunning this infrastructure costs approximately $375-420/month for the base setup:\nComponent Monthly Cost\nEC2 m7g.xlarge (Graviton) ~$112\nAurora Serverless v2 ~$43-80\nCloudFront + ALB ~$110\nVPC Endpoints ~$50\nOther (EBS, S3, NAT, etc.) 
~$60-70 Bedrock usage is additional and varies based on Claude model and token consumption.\nThe cost is higher than running locally, but provides:\nAlways-on availability Multi-user support Automatic backups Self-healing on failures When to Use This Project Good fit for:\nTeams wanting shared access to OpenHands Organizations preferring AWS-managed services Deployments requiring authentication and audit trails Scenarios needing persistent conversations across sessions May be overkill for:\nIndividual developers working locally Quick experimentation or demos Cost-sensitive use cases with occasional usage Limitations and Trade-offs WebSocket requirement: CloudFront VPC Origin doesn't support WebSocket, so the ALB is internet-facing with origin verification headers Single region compute: The EC2 instances run in one region (though CloudFront provides global edge access) Admin-managed users: Cognito is configured without self-signup; an admin must create user accounts Resources openhands-infra Repository OpenHands Documentation AWS CDK Documentation ","link":"https://kane.mx/posts/2026/deploying-openhands-on-aws-with-cdk/","section":"posts","tags":["AWS CDK","OpenHands","AI Agent","Infrastructure as Code","Serverless","CloudFront","Cognito","Aurora PostgreSQL","Devin Alternative","Self-Hosted AI"],"title":"Deploying OpenHands AI Platform on AWS with CDK"},{"body":"","link":"https://kane.mx/tags/devin-alternative/","section":"tags","tags":null,"title":"Devin Alternative"},{"body":"","link":"https://kane.mx/tags/infrastructure-as-code/","section":"tags","tags":null,"title":"Infrastructure as Code"},{"body":"","link":"https://kane.mx/tags/aws-cli/","section":"tags","tags":null,"title":"AWS CLI"},{"body":"","link":"https://kane.mx/tags/best-practices/","section":"tags","tags":null,"title":"Best 
Practices"},{"body":"","link":"https://kane.mx/categories/blogging/","section":"categories","tags":null,"title":"Blogging"},{"body":"","link":"https://kane.mx/tags/credentials/","section":"tags","tags":null,"title":"Credentials"},{"body":"Managing AWS credentials securely is a fundamental challenge for developers. Storing plain text access keys in ~/.aws/credentials creates significant security risks, especially when backing up dotfiles to version control systems. This post introduces credential_process, an AWS CLI feature that lets you source credentials from external processes, enabling encrypted credential storage while maintaining seamless AWS access.\nThe Problem with Plain Text Credentials The traditional approach stores AWS credentials in ~/.aws/credentials:\n[default] aws_access_key_id = AKIAIOSFODNN7EXAMPLE aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY This approach has several drawbacks:\nSecurity Risk: Plain text credentials can be exposed if the file is accidentally committed or shared Backup Challenges: Cannot safely back up dotfiles to cloud storage or version control Credential Rotation: Manual updates required across multiple machines Audit Difficulty: No visibility into when and how credentials are accessed credential_process: A Better Approach The credential_process configuration option, introduced in AWS CLI v1.14.0 (November 2017) and botocore 1.8.0, allows the CLI and SDKs to retrieve credentials by executing an external command. 
This enables sophisticated credential management patterns including encryption, hardware security modules, and custom authentication flows.\nHow It Works Instead of storing credentials directly, you specify a command in your ~/.aws/config file:\n1[profile secure-profile] 2region = us-west-2 3credential_process = /path/to/your/credential-script.sh profile-name When the AWS CLI or SDK needs credentials for this profile, it executes the specified command and expects a JSON response on stdout.\nRequired JSON Output Format The external process must output valid JSON with the following structure:\n1{ 2 \u0026#34;Version\u0026#34;: 1, 3 \u0026#34;AccessKeyId\u0026#34;: \u0026#34;AKIAIOSFODNN7EXAMPLE\u0026#34;, 4 \u0026#34;SecretAccessKey\u0026#34;: \u0026#34;wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\u0026#34;, 5 \u0026#34;SessionToken\u0026#34;: \u0026#34;optional-session-token\u0026#34;, 6 \u0026#34;Expiration\u0026#34;: \u0026#34;2025-01-01T12:00:00Z\u0026#34; 7} Field Required Description Version Yes Must be 1 AccessKeyId Yes AWS access key ID SecretAccessKey Yes AWS secret access key SessionToken No Required for temporary credentials Expiration No ISO8601 timestamp; triggers auto-refresh Practical Implementation: Encrypted Credentials Here's a practical implementation that stores credentials in an encrypted file, enabling safe backup of your dotfiles while keeping credentials secure.\nStep 1: Encrypt Your Credentials First, create a credentials file in the standard format and encrypt it:\n1# Create a strong encryption key 2mkdir -p ~/.secrets 3openssl rand -base64 32 \u0026gt; ~/.secrets/aws-creds.key 4chmod 600 ~/.secrets/aws-creds.key 5 6# Encrypt your credentials file 7openssl enc -aes-256-cbc -pbkdf2 \\ 8 -in ~/.aws/credentials \\ 9 -out ~/.aws/static_credentials.enc \\ 10 -pass file:~/.secrets/aws-creds.key 11 12# Remove the plain text file 13rm ~/.aws/credentials Step 2: Create the Credential Retrieval Script Create a script at ~/.aws/get_creds.sh:\n1#!/bin/bash 
2ENC_FILE=\u0026#34;$HOME/.aws/static_credentials.enc\u0026#34; 3KEY_FILE=\u0026#34;$HOME/.secrets/aws-creds.key\u0026#34; 4PROFILE=\u0026#34;${1:-default}\u0026#34; 5 6if [ ! -f \u0026#34;$KEY_FILE\u0026#34; ]; then 7 echo \u0026#34;Key file not found: $KEY_FILE\u0026#34; \u0026gt;\u0026amp;2 8 exit 1 9fi 10 11CONTENT=$(openssl enc -aes-256-cbc -d -pbkdf2 \\ 12 -in \u0026#34;$ENC_FILE\u0026#34; \\ 13 -pass file:\u0026#34;$KEY_FILE\u0026#34; 2\u0026gt;/dev/null) 14 15if [ $? -ne 0 ]; then 16 echo \u0026#34;Failed to decrypt credentials\u0026#34; \u0026gt;\u0026amp;2 17 exit 1 18fi 19 20AK=$(echo \u0026#34;$CONTENT\u0026#34; | sed -n \u0026#34;/^\\[$PROFILE\\]/,/^\\[/p\u0026#34; | \\ 21 grep aws_access_key_id | cut -d= -f2 | tr -d \u0026#39; \u0026#39;) 22SK=$(echo \u0026#34;$CONTENT\u0026#34; | sed -n \u0026#34;/^\\[$PROFILE\\]/,/^\\[/p\u0026#34; | \\ 23 grep aws_secret_access_key | cut -d= -f2 | tr -d \u0026#39; \u0026#39;) 24 25if [ -z \u0026#34;$AK\u0026#34; ] || [ -z \u0026#34;$SK\u0026#34; ]; then 26 echo \u0026#34;Profile \u0026#39;$PROFILE\u0026#39; not found\u0026#34; \u0026gt;\u0026amp;2 27 exit 1 28fi 29 30cat \u0026lt;\u0026lt;EOF 31{\u0026#34;Version\u0026#34;:1,\u0026#34;AccessKeyId\u0026#34;:\u0026#34;$AK\u0026#34;,\u0026#34;SecretAccessKey\u0026#34;:\u0026#34;$SK\u0026#34;} 32EOF Make it executable:\n1chmod +x ~/.aws/get_creds.sh Step 3: Configure AWS CLI Profiles Update your ~/.aws/config:\n1[profile dev] 2region = us-west-2 3cli_pager = 4credential_process = sh -c \u0026#39;$HOME/.aws/get_creds.sh dev\u0026#39; 5 6[profile prod] 7region = us-east-1 8cli_pager = 9credential_process = sh -c \u0026#39;$HOME/.aws/get_creds.sh prod\u0026#39; Step 4: Safe Dotfiles Backup Now you can safely backup your AWS configuration:\n1# Files safe to backup (no plain text credentials) 2~/.aws/config # Profile configurations 3~/.aws/static_credentials.enc # Encrypted credentials 4~/.aws/get_creds.sh # Retrieval script 5 6# File to keep separate and secure 
7~/.secrets/aws-creds.key # Encryption key - NEVER backup to public repos Supporting Temporary Credentials (STS) For profiles that use temporary credentials from STS, create an extended script:\n1#!/bin/bash 2PROFILE=\u0026#34;${1:-default}\u0026#34; 3CRED_FILE=\u0026#34;$HOME/.aws/credential_$PROFILE\u0026#34; 4FALLBACK_FILE=\u0026#34;$HOME/.aws/credentials\u0026#34; 5 6if [ -f \u0026#34;$CRED_FILE\u0026#34; ]; then 7 FILE=\u0026#34;$CRED_FILE\u0026#34; 8else 9 FILE=\u0026#34;$FALLBACK_FILE\u0026#34; 10fi 11 12if [ ! -f \u0026#34;$FILE\u0026#34; ]; then 13 echo \u0026#34;Credential file not found\u0026#34; \u0026gt;\u0026amp;2 14 exit 1 15fi 16 17AK=$(sed -n \u0026#34;/^\\[$PROFILE\\]/,/^\\[/p\u0026#34; \u0026#34;$FILE\u0026#34; | \\ 18 grep aws_access_key_id | cut -d= -f2 | tr -d \u0026#39; \u0026#39;) 19SK=$(sed -n \u0026#34;/^\\[$PROFILE\\]/,/^\\[/p\u0026#34; \u0026#34;$FILE\u0026#34; | \\ 20 grep aws_secret_access_key | cut -d= -f2 | tr -d \u0026#39; \u0026#39;) 21ST=$(sed -n \u0026#34;/^\\[$PROFILE\\]/,/^\\[/p\u0026#34; \u0026#34;$FILE\u0026#34; | \\ 22 grep aws_session_token | cut -d= -f2 | tr -d \u0026#39; \u0026#39;) 23 24if [ -z \u0026#34;$AK\u0026#34; ] || [ -z \u0026#34;$SK\u0026#34; ]; then 25 echo \u0026#34;Profile \u0026#39;$PROFILE\u0026#39; not found\u0026#34; \u0026gt;\u0026amp;2 26 exit 1 27fi 28 29if [ -n \u0026#34;$ST\u0026#34; ]; then 30 cat \u0026lt;\u0026lt;EOF 31{\u0026#34;Version\u0026#34;:1,\u0026#34;AccessKeyId\u0026#34;:\u0026#34;$AK\u0026#34;,\u0026#34;SecretAccessKey\u0026#34;:\u0026#34;$SK\u0026#34;,\u0026#34;SessionToken\u0026#34;:\u0026#34;$ST\u0026#34;} 32EOF 33else 34 cat \u0026lt;\u0026lt;EOF 35{\u0026#34;Version\u0026#34;:1,\u0026#34;AccessKeyId\u0026#34;:\u0026#34;$AK\u0026#34;,\u0026#34;SecretAccessKey\u0026#34;:\u0026#34;$SK\u0026#34;} 36EOF 37fi SDK and Tool Compatibility The credential_process feature is supported across the AWS ecosystem:\nTool/SDK Minimum Version Notes AWS CLI v1 1.14.0 November 2017 AWS CLI v2 All 
versions Built-in support botocore 1.8.0 Python SDK foundation boto3 1.5.0 Python SDK AWS SDK for Go v1.15.0 Go SDK AWS SDK for Java 2.x Java SDK v2 AWS SDK for JavaScript v3 Node.js SDK v3 Security Best Practices When implementing credential_process, follow these guidelines:\nProtect the encryption key: Store it separately from encrypted credentials Use strong encryption: AES-256 with PBKDF2 key derivation Set proper permissions: chmod 600 for sensitive files Never log secrets: Avoid writing credentials to stderr Handle errors gracefully: Return non-zero exit codes for failures Consider hardware security: For high-security environments, integrate with HSMs or TPMs Advanced Use Cases Integration with Password Managers 1#!/bin/bash 2# Retrieve credentials from 1Password 3PROFILE=\u0026#34;${1:-default}\u0026#34; 4op item get \u0026#34;AWS-$PROFILE\u0026#34; --format json | \\ 5 jq \u0026#39;{Version:1, AccessKeyId:.fields[0].value, SecretAccessKey:.fields[1].value}\u0026#39; Integration with HashiCorp Vault For organizations using HashiCorp Vault for secrets management:\n1#!/bin/bash 2# Retrieve credentials from HashiCorp Vault 3VAULT_PATH=\u0026#34;${1:-secret/data/aws/credentials}\u0026#34; 4 5SECRET=$(vault kv get -format=json \u0026#34;$VAULT_PATH\u0026#34; 2\u0026gt;/dev/null) 6 7if [ $? -ne 0 ]; then 8 echo \u0026#34;Failed to retrieve secret from Vault\u0026#34; \u0026gt;\u0026amp;2 9 exit 1 10fi 11 12echo \u0026#34;$SECRET\u0026#34; | jq \u0026#39;{ 13 Version: 1, 14 AccessKeyId: .data.data.access_key_id, 15 SecretAccessKey: .data.data.secret_access_key 16}\u0026#39; Alternative: AWS IAM Identity Center (SSO) For environments using AWS IAM Identity Center (formerly AWS SSO), note that SSO provides a native alternative to credential_process rather than being combined with it. 
SSO handles temporary credential generation automatically:\n[profile sso-profile] sso_session = my-sso sso_account_id = 123456789012 sso_role_name = DeveloperAccess [sso-session my-sso] sso_start_url = https://my-org.awsapps.com/start sso_region = us-east-1 sso_registration_scopes = sso:account:access When to use each approach:\nApproach Use Case credential_process Long-term IAM credentials, custom auth systems, secrets managers AWS SSO Federated identity, temporary credentials, enterprise SSO integration Both (rare) Legacy systems requiring IAM credentials alongside SSO migration Conclusion The credential_process feature provides a flexible and secure approach to AWS credential management. By storing credentials in encrypted form and retrieving them through external processes, you can:\nSafely back up your AWS configuration to version control Implement custom authentication flows Integrate with enterprise security tools Maintain credential hygiene across multiple machines For more AWS CLI tips and tricks, check out my post on Awesome AWS CLI.\nResources AWS CLI External Credential Sourcing Documentation AWS Shared Credentials File Configuration OpenSSL Encryption Documentation ","link":"https://kane.mx/posts/2026/aws-credential-process/","section":"posts","tags":["AWS","AWS CLI","Security","Credentials","Best Practices"],"title":"Secure AWS Credentials with credential_process"},{"body":"","link":"https://kane.mx/tags/security/","section":"tags","tags":null,"title":"Security"},{"body":"","link":"https://kane.mx/tags/cdk/","section":"tags","tags":null,"title":"CDK"},{"body":"","link":"https://kane.mx/tags/cloudflare/","section":"tags","tags":null,"title":"Cloudflare"},{"body":"","link":"https://kane.mx/tags/iam-identity-center/","section":"tags","tags":null,"title":"IAM Identity Center"},{"body":"","link":"https://kane.mx/tags/oidc/","section":"tags","tags":null,"title":"OIDC"},{"body":"AWS IAM Identity Center (formerly AWS SSO) provides centralized access 
management for AWS accounts and applications. While it natively supports SAML 2.0 for external identity providers, many organizations prefer OIDC-based authentication through providers like Amazon Cognito. This post demonstrates how to use Cloudflare Access as a SAML bridge between Amazon Cognito and AWS IAM Identity Center with automatic just-in-time (JIT) user provisioning.\nThe Challenge AWS IAM Identity Center only accepts SAML 2.0 for external identity providers. However, you might want to use an OIDC provider like Amazon Cognito for several reasons:\nUnified Identity: Use Cognito as a central user directory across your applications Social Login Federation: Federate Cognito with Google, Facebook, or enterprise OIDC providers Flexible Authentication: Leverage Cognito's MFA, adaptive authentication, and custom flows Zero Trust Integration: Combine with Cloudflare Access policies for enhanced security The solution is to use Cloudflare Access as a SAML Identity Provider that bridges OIDC tokens from Cognito to SAML assertions for AWS IAM Identity Center, while a Pre-Token Lambda automatically provisions users in AWS Identity Store.\nArchitecture Overview flowchart LR subgraph User[\u0026#34;User\u0026#34;] CLI[\u0026#34;AWS CLI\u0026#34;] Console[\u0026#34;AWS Console\u0026#34;] end subgraph IDC[\u0026#34;IAM Identity Center\u0026#34;] Portal[\u0026#34;Access Portal\u0026#34;] PS[\u0026#34;Permission Sets\u0026#34;] end subgraph CF[\u0026#34;Cloudflare Access\u0026#34;] SAML[\u0026#34;SAML IdP\u0026#34;] ZT[\u0026#34;Zero Trust\u0026lt;br/\u0026gt;Policies\u0026#34;] end subgraph Auth[\u0026#34;Amazon Cognito\u0026#34;] UP[\u0026#34;User Pool\u0026#34;] PTL[\u0026#34;Pre-Token\u0026lt;br/\u0026gt;Lambda\u0026#34;] ExtIdP[\u0026#34;External OIDC\u0026lt;br/\u0026gt;(Optional)\u0026#34;] end subgraph Store[\u0026#34;Identity Store\u0026#34;] IS[\u0026#34;AWS Identity\u0026lt;br/\u0026gt;Store\u0026#34;] end CLI --\u0026gt; Portal Console --\u0026gt; Portal Portal 
--\u0026gt;|SAML Request| SAML SAML --\u0026gt; ZT ZT --\u0026gt;|OIDC| UP UP --\u0026gt; ExtIdP UP --\u0026gt; PTL PTL --\u0026gt;|JIT Sync| IS SAML --\u0026gt;|SAML Response| Portal Portal --\u0026gt; PS Authentication Flow User accesses AWS access portal or initiates aws sso login AWS IAM Identity Center redirects to Cloudflare Access (SAML IdP) Cloudflare Access authenticates user via Amazon Cognito (OIDC) Cognito's Pre-Token Lambda creates user in Identity Store if not exists (JIT provisioning) Cognito returns tokens with identity_store_user_id claim Cloudflare sends SAML assertion back to AWS IAM Identity Center User gains access to AWS resources based on permission sets Solution Components The solution consists of three main components:\n1. Amazon Cognito User Pool Cognito serves as the OIDC provider and user directory:\nUser Authentication: Native authentication or federation with external OIDC providers OAuth 2.0/OIDC: Provides standard OIDC endpoints for Cloudflare Access Pre-Token Lambda Trigger: Invokes JIT provisioning before token generation 2. Cloudflare Access (SAML Bridge) Cloudflare Access acts as a SAML Identity Provider:\nSAML IdP: Generates SAML assertions for AWS IAM Identity Center OIDC Authentication: Authenticates users via Cognito's OIDC endpoints Zero Trust Policies: Apply access controls before allowing AWS access 3. 
Pre-Token Generation Lambda A Lambda function triggered by Cognito before token generation:\nJIT Provisioning: Creates users in AWS Identity Store on first login Idempotent: Handles race conditions with conflict detection Claims Injection: Adds identity_store_user_id to tokens Implementation The complete solution is available as an AWS CDK project on GitHub.\nProject Structure 1cloudflare-access-for-aws-idc/ 2├── bin/ 3│ └── app.ts # CDK app entry point 4├── lib/ 5│ └── cognito-cloudflare-stack.ts # Main CDK stack 6├── lambda/ 7│ └── pre-token-generation/ 8│ └── index.ts # JIT user sync Lambda 9└── test/ 10 ├── cognito-cloudflare-stack.test.ts 11 └── lambda/ 12 └── pre-token-generation.test.ts CDK Stack Highlights The CDK stack creates the following resources:\n1export interface CognitoCloudflareStackProps extends cdk.StackProps { 2 identityStoreId: string; // AWS Identity Store ID 3 cloudflareTeamName: string; // Cloudflare Access team name 4 externalOidcProvider?: { // Optional external OIDC federation 5 providerName: string; 6 clientId: string; 7 clientSecret: string; 8 issuerUrl: string; 9 }; 10} Key components created:\nCognito User Pool with secure password policies and email-based sign-in Pre-Token Generation Lambda with Identity Store permissions User Pool Client configured for Cloudflare Access callback CloudWatch Dashboard and alarms for operational monitoring Secrets Manager secret for client credentials Pre-Token Generation Lambda The Lambda function synchronizes users to AWS Identity Store using the Cognito Pre Token Generation V2 trigger:\n1async function syncUserToIdentityStore( 2 email: string, 3 givenName: string, 4 familyName: string, 5 displayName: string 6): Promise\u0026lt;string\u0026gt; { 7 // Check if user exists in Identity Store 8 let identityStoreUserId = await findUserInIdentityStore(email); 9 10 // Create user if not exists 11 if (!identityStoreUserId) { 12 try { 13 identityStoreUserId = await createUserInIdentityStore( 14 email, 
givenName, familyName, displayName 15 ); 16 } catch (error) { 17 // Handle race condition: another request created the user 18 if (error instanceof ConflictException) { 19 identityStoreUserId = await findUserInIdentityStore(email); 20 } else { 21 throw error; 22 } 23 } 24 } 25 26 return identityStoreUserId; 27} The Lambda adds the Identity Store user ID to both ID and access tokens:\n1event.response.claimsAndScopeOverrideDetails = { 2 idTokenGeneration: { 3 claimsToAddOrOverride: { 4 identity_store_user_id: identityStoreUserId, 5 }, 6 }, 7 accessTokenGeneration: { 8 claimsToAddOrOverride: { 9 identity_store_user_id: identityStoreUserId, 10 }, 11 }, 12}; Smart Name Derivation When user attributes lack name information, the Lambda derives names from the email address:\n1// If both names are empty, derive from email prefix 2if (!givenName \u0026amp;\u0026amp; !familyName) { 3 const emailPrefix = email.split(\u0026#39;@\u0026#39;)[0]; 4 const nameParts = emailPrefix.split(/[._-]/); 5 6 if (nameParts.length \u0026gt;= 2) { 7 // john.doe@example.com -\u0026gt; Given: John, Family: Doe 8 givenName = capitalize(nameParts[0]); 9 familyName = nameParts.slice(1).map(capitalize).join(\u0026#39; \u0026#39;); 10 } else { 11 // johndoe@example.com -\u0026gt; Given: Johndoe, Family: User 12 givenName = capitalize(emailPrefix); 13 familyName = \u0026#39;User\u0026#39;; 14 } 15} Deployment Prerequisites AWS Account with IAM Identity Center enabled Cloudflare account with Zero Trust (Access) Node.js 18+ and AWS CDK CLI installed Step 1: Find Your Identity Store ID 1aws sso-admin list-instances \\ 2 --query \u0026#39;Instances[0].IdentityStoreId\u0026#39; \\ 3 --output text Step 2: Deploy the CDK Stack 1# Clone the repository 2git clone https://github.com/zxkane/cloudflare-access-for-aws-idc.git 3cd cloudflare-access-for-aws-idc 4 5# Install dependencies 6npm install 7 8# Deploy 9npx cdk deploy \\ 10 --context identityStoreId=\u0026lt;your-identity-store-id\u0026gt; \\ 11 --context 
cloudflareTeamName=\u0026lt;your-team-name\u0026gt; Stack Outputs Note these outputs for Cloudflare configuration:\nOutput Description CognitoIssuerUrl OIDC Issuer URL AuthorizationEndpoint OAuth authorization endpoint TokenEndpoint OAuth token endpoint JwksUri JSON Web Key Set URI UserPoolClientId Cognito client ID ClientSecretArn ARN of client secret in Secrets Manager Step 3: Configure Cloudflare Access Add OIDC Identity Provider In Cloudflare Zero Trust dashboard:\nGo to Settings \u0026gt; Authentication \u0026gt; Login methods Click Add new \u0026gt; OpenID Connect Configure: Name: Amazon Cognito App ID: Use UserPoolClientId from stack outputs Client Secret: Retrieve from Secrets Manager using ClientSecretArn Auth URL: Use AuthorizationEndpoint from stack outputs Token URL: Use TokenEndpoint from stack outputs Certificate URL: Use JwksUri from stack outputs Scopes: openid email profile Save the configuration Create SAML Application for AWS Go to Access \u0026gt; Applications \u0026gt; Add an application Select SaaS \u0026gt; AWS Configure SAML settings: Entity ID: urn:amazon:webservices ACS URL: Get from AWS IAM Identity Center external IdP settings Name ID Format: Email Assign access policies Download the SAML metadata Step 4: Configure AWS IAM Identity Center Open AWS IAM Identity Center console Go to Settings → Identity source → Actions → Change identity source Select External identity provider Upload Cloudflare's SAML metadata or configure manually: IdP sign-in URL: From Cloudflare application settings IdP issuer URL: From Cloudflare application settings IdP certificate: Download from Cloudflare Complete the configuration Step 5: Create Permission Sets and Assignments After users authenticate for the first time, they are automatically provisioned in Identity Store. 
You can then:\nCreate Permission Sets with appropriate IAM policies Assign users to AWS accounts with permission sets Using AWS CLI with SSO Once configured, users can authenticate via the CLI:\n1# Configure SSO profile 2aws configure sso 3# Enter: SSO start URL, SSO Region, account, role 4 5# Login 6aws sso login --profile my-sso-profile 7 8# Use AWS CLI 9aws s3 ls --profile my-sso-profile The login flow opens a browser, redirects through Cloudflare Access and Cognito for authentication, and returns credentials to the CLI.\nOptional: External OIDC Federation To federate Cognito with an external OIDC provider (Google, Okta, Auth0, etc.):\n1npx cdk deploy \\ 2 --context identityStoreId=\u0026lt;your-identity-store-id\u0026gt; \\ 3 --context cloudflareTeamName=\u0026lt;your-team-name\u0026gt; \\ 4 --context externalOidcProviderName=Google \\ 5 --context externalOidcClientId=\u0026lt;google-client-id\u0026gt; \\ 6 --context externalOidcClientSecret=\u0026lt;google-client-secret\u0026gt; \\ 7 --context externalOidcIssuerUrl=https://accounts.google.com This enables \u0026quot;Login with Google\u0026quot; (or any OIDC provider) for AWS Console and CLI access through the entire chain: External IdP → Cognito → Cloudflare → IAM Identity Center.\nMonitoring and Troubleshooting CloudWatch Dashboard The stack creates a dashboard with:\nLambda invocation and error metrics User sync success/failure rates Conflict and skip metrics Duration approaching timeout alerts Alarms Alarm Threshold Description Lambda Errors \u0026gt; 0 in 5 min Any Lambda execution errors Lambda Throttles \u0026gt; 0 in 5 min Lambda throttling events Sync Failures \u0026gt; 5 in 5 min Identity Store sync failures Duration p99 \u0026gt; 25s Approaching 30s timeout Common Issues \u0026quot;Looks like this code isn't right\u0026quot;\nThis usually means the user's UserName in Identity Store doesn't match the SAML NameID (email). 
Delete the manually created user in Identity Center and let the Lambda recreate them on the next login.\n\u0026quot;User not found in Identity Store\u0026quot;\nVerify that the Pre-Token Lambda has the identitystore:CreateUser permission Check CloudWatch Logs for sync errors Verify the Identity Store ID is correct Lambda Times Out\nThe Lambda has a 30-second timeout. If Identity Store API calls are slow:\nCheck AWS service health Review CloudWatch logs for specific errors Security Considerations Secrets Management: Client secrets stored in AWS Secrets Manager Least Privilege IAM: Lambda only has GetUserId and CreateUser permissions No Self-Signup: Cognito User Pool configured for admin-only user creation Secure OAuth: Only authorization code flow enabled (no implicit grant) Zero Trust: Cloudflare Access policies provide an additional security layer X-Ray Tracing: Full observability for debugging and auditing Conclusion This solution enables OIDC-based authentication for AWS IAM Identity Center by using Cloudflare Access as a SAML bridge. 
Key benefits include:\nFlexible Identity: Use Amazon Cognito as your identity provider with optional external OIDC federation Zero Trust Security: Apply Cloudflare Access policies before AWS access Automatic Provisioning: JIT user creation eliminates manual user management Serverless Architecture: No infrastructure to manage beyond the CDK stack The complete source code is available on GitHub.\nResources AWS IAM Identity Center Documentation Amazon Cognito Documentation Cloudflare Access Documentation AWS CDK Documentation ","link":"https://kane.mx/posts/2025/external-identity-source-aws-sso/","section":"posts","tags":["AWS","IAM Identity Center","SSO","Cognito","OIDC","SAML","CDK","Serverless","Cloudflare"],"title":"OIDC External Identity Source for AWS IAM Identity Center"},{"body":"","link":"https://kane.mx/tags/saml/","section":"tags","tags":null,"title":"SAML"},{"body":"","link":"https://kane.mx/tags/sso/","section":"tags","tags":null,"title":"SSO"},{"body":"","link":"https://kane.mx/categories/ai-coding-assistants/","section":"categories","tags":null,"title":"AI Coding Assistants"},{"body":"","link":"https://kane.mx/tags/client-id-metadata/","section":"tags","tags":null,"title":"Client ID Metadata"},{"body":"When working with Claude Code on complex tasks, you often switch to other work while waiting for completion. The challenge? Knowing exactly when Claude finishes so you can review the results promptly. This post shows you how to configure desktop notifications that alert you the moment Claude Code completes a task.\nThe Problem Claude Code can run lengthy operations - refactoring codebases, writing tests, or analyzing large files. 
During these operations, you might:\nSwitch to another VSCode window Check emails or documentation Work on a different task entirely Without notifications, you're left constantly checking back, wasting time and breaking focus.\nThe Solution: OSC Escape Sequences Operating System Command (OSC) escape sequences allow terminal applications to communicate with their host environment. Modern terminals like VSCode's integrated terminal, iTerm2, and Windows Terminal support OSC sequences for desktop notifications.\nThe magic sequence:\n1# OSC 777 format (VSCode, rxvt-unicode) 2printf \u0026#39;\\033]777;notify;Title;Message\\007\u0026#39; 3 4# OSC 9 format (iTerm2, Windows Terminal) 5printf \u0026#39;\\033]9;Message\\007\u0026#39; When sent to a terminal that supports it, these sequences trigger native desktop notifications - even when the terminal window isn't focused.\nVSCode Remote SSH: The Key Use Case This solution shines brightest when you're working on a remote EC2 instance via VSCode Remote SSH - a common setup for cloud-based development where your compute resources live in AWS but your IDE runs locally.\nThe Challenge with Remote Development When Claude Code runs on a remote server:\nNotifications generated on the EC2 instance need to reach your local desktop Standard notification systems (like notify-send on Linux) only work locally The remote server has no direct access to your desktop notification system How OSC Sequences Bridge the Gap Here's the magic: VSCode's integrated terminal forwards OSC escape sequences from the remote host to your local machine through the SSH connection. 
The data flow looks like this:\n1Remote EC2 Local Machine 2┌─────────────────┐ ┌─────────────────┐ 3│ Claude Code │ │ │ 4│ ↓ │ SSH Tunnel │ │ 5│ notify_osc.sh │ ─────────────→ │ VSCode Terminal │ 6│ ↓ │ │ ↓ │ 7│ printf \u0026#39;\\033]\u0026#39; │ │ OSC Parser │ 8│ │ │ ↓ │ 9└─────────────────┘ │ Desktop Notify │ 10 └─────────────────┘ Required Extension: Terminal Notification To convert OSC sequences into native desktop notifications, install the Terminal Notification extension by wenbopan:\nFeatures:\nRecognizes OSC 777 (\\033]777;notify;Title;Message\\007) and OSC 9 (\\033]9;Message\\007) sequences Generates native notifications on macOS, Windows, and Linux Click-to-focus: Click a notification to jump directly to the originating terminal tab Remote SSH support: Works seamlessly with VSCode Remote SSH tmux compatible: Automatically unwraps sequences forwarded through tmux Requirements:\nVSCode 1.93 or later Shell Integration enabled (default for most shells) Why This Matters for Cloud Development If you're running Claude Code on an EC2 instance (common for accessing more compute power or keeping development environments isolated), this setup means:\nNo local Claude Code installation required - Everything runs on the remote server Native notifications on your laptop - Even though the work happens in AWS Works across network boundaries - SSH handles the transport layer No additional infrastructure - No webhook servers, no polling, just escape sequences This is particularly valuable when:\nRunning Claude Code on a powerful EC2 instance for faster processing Working with large codebases that benefit from cloud compute Maintaining development environments on remote servers Using multiple remote instances for different projects Implementation Step 1: Create the Notification Script Create ~/.claude/hooks/notify_osc.sh:\n1#!/bin/bash 2# Send notifications via OSC escape sequences to active terminals 3 4TITLE=\u0026#34;${1:-Claude Code}\u0026#34; 
5MESSAGE=\u0026#34;${2:-Task completed}\u0026#34; 6 7LOG_DIR=\u0026#34;$HOME/.claude/hooks\u0026#34; 8LOG_FILE=\u0026#34;$LOG_DIR/notification.log\u0026#34; 9mkdir -p \u0026#34;$LOG_DIR\u0026#34; 10 11# Read hook input JSON from stdin 12if [ -t 0 ]; then 13 HOOK_INPUT=\u0026#34;\u0026#34; 14else 15 HOOK_INPUT=$(cat) 16fi 17 18# Extract project and task information 19PROJECT_NAME=\u0026#34;\u0026#34; 20TASK_SUMMARY=\u0026#34;\u0026#34; 21 22if [ -n \u0026#34;$HOOK_INPUT\u0026#34; ] \u0026amp;\u0026amp; command -v jq \u0026gt;/dev/null 2\u0026gt;\u0026amp;1; then 23 # Extract project name from cwd 24 CWD=$(echo \u0026#34;$HOOK_INPUT\u0026#34; | jq -r \u0026#39;.cwd // empty\u0026#39; 2\u0026gt;/dev/null) 25 if [ -n \u0026#34;$CWD\u0026#34; ]; then 26 PROJECT_NAME=$(basename \u0026#34;$CWD\u0026#34;) 27 fi 28 29 # Extract transcript path 30 TRANSCRIPT_PATH=$(echo \u0026#34;$HOOK_INPUT\u0026#34; | jq -r \u0026#39;.transcript_path // empty\u0026#39; 2\u0026gt;/dev/null) 31 32 # Try to get task description from session file 33 if [ -n \u0026#34;$TRANSCRIPT_PATH\u0026#34; ] \u0026amp;\u0026amp; [ -f \u0026#34;$TRANSCRIPT_PATH\u0026#34; ]; then 34 # Method 1: Find first queue-operation enqueue 35 TASK_SUMMARY=$(cat \u0026#34;$TRANSCRIPT_PATH\u0026#34; 2\u0026gt;/dev/null | \\ 36 jq -r \u0026#39;select(.type == \u0026#34;queue-operation\u0026#34; and .operation == \u0026#34;enqueue\u0026#34;) | 37 .content[].text // empty\u0026#39; 2\u0026gt;/dev/null | \\ 38 while IFS= read -r line; do 39 # Skip system messages 40 if [[ ! 
\u0026#34;$line\u0026#34; =~ ^\\\u0026lt;(ide_opened_file|system-reminder|command-) ]]; then 41 echo \u0026#34;$line\u0026#34; 42 break 43 fi 44 done | head -c 100) 45 46 # Method 2: Fallback to first user message 47 if [ -z \u0026#34;$TASK_SUMMARY\u0026#34; ]; then 48 TASK_SUMMARY=$(cat \u0026#34;$TRANSCRIPT_PATH\u0026#34; 2\u0026gt;/dev/null | \\ 49 jq -r \u0026#39;select(.type == \u0026#34;user\u0026#34;) | 50 select(.isMeta == null or .isMeta == false) | 51 if .message.content | type == \u0026#34;array\u0026#34; 52 then .message.content[].text // empty 53 else .message.content end\u0026#39; 2\u0026gt;/dev/null | \\ 54 while IFS= read -r line; do 55 if [ -n \u0026#34;$line\u0026#34; ] \u0026amp;\u0026amp; [ \u0026#34;$line\u0026#34; != \u0026#34;null\u0026#34; ] \u0026amp;\u0026amp; \\ 56 [[ ! \u0026#34;$line\u0026#34; =~ ^\\\u0026lt;(ide_opened_file|system-reminder|command-) ]]; then 57 echo \u0026#34;$line\u0026#34; 58 break 59 fi 60 done | head -c 100) 61 fi 62 fi 63fi 64 65# Build enhanced message 66ENHANCED_MESSAGE=\u0026#34;$MESSAGE\u0026#34; 67if [ -n \u0026#34;$PROJECT_NAME\u0026#34; ]; then 68 ENHANCED_MESSAGE=\u0026#34;[$PROJECT_NAME] $ENHANCED_MESSAGE\u0026#34; 69fi 70if [ -n \u0026#34;$TASK_SUMMARY\u0026#34; ]; then 71 ENHANCED_MESSAGE=\u0026#34;$ENHANCED_MESSAGE - Task: $TASK_SUMMARY\u0026#34; 72fi 73 74# Log notification 75{ 76 echo \u0026#34;[$(date \u0026#39;+%Y-%m-%d %H:%M:%S\u0026#39;)] Notification sent:\u0026#34; 77 echo \u0026#34; Project: ${PROJECT_NAME:-N/A}\u0026#34; 78 echo \u0026#34; Message: $MESSAGE\u0026#34; 79 echo \u0026#34; Task: ${TASK_SUMMARY:-N/A}\u0026#34; 80} \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; 81 82# Send to all writable pts devices 83for pts in /dev/pts/*; do 84 if [ \u0026#34;$pts\u0026#34; = \u0026#34;/dev/pts/ptmx\u0026#34; ]; then 85 continue 86 fi 87 88 if [ -w \u0026#34;$pts\u0026#34; ] 2\u0026gt;/dev/null; then 89 { 90 printf \u0026#39;\\033]777;notify;%s;%s\\007\u0026#39; \u0026#34;$TITLE\u0026#34; 
\u0026#34;$ENHANCED_MESSAGE\u0026#34; 91 printf \u0026#39;\\033]9;%s: %s\\007\u0026#39; \u0026#34;$TITLE\u0026#34; \u0026#34;$ENHANCED_MESSAGE\u0026#34; 92 printf \u0026#39;\\a\u0026#39; 93 } \u0026gt; \u0026#34;$pts\u0026#34; 2\u0026gt;/dev/null 94 fi 95done 96 97exit 0 Make it executable:\n1chmod +x ~/.claude/hooks/notify_osc.sh Step 2: Configure Claude Code Hooks Add to ~/.claude/settings.json:\n1{ 2 \u0026#34;hooks\u0026#34;: { 3 \u0026#34;Stop\u0026#34;: [ 4 { 5 \u0026#34;hooks\u0026#34;: [ 6 { 7 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 8 \u0026#34;command\u0026#34;: \u0026#34;~/.claude/hooks/notify_osc.sh \u0026#39;Claude Code\u0026#39; \u0026#39;Task completed, please review results\u0026#39;\u0026#34;, 9 \u0026#34;timeout\u0026#34;: 10 10 } 11 ] 12 } 13 ], 14 \u0026#34;Notification\u0026#34;: [ 15 { 16 \u0026#34;matcher\u0026#34;: \u0026#34;idle_prompt\u0026#34;, 17 \u0026#34;hooks\u0026#34;: [ 18 { 19 \u0026#34;type\u0026#34;: \u0026#34;command\u0026#34;, 20 \u0026#34;command\u0026#34;: \u0026#34;~/.claude/hooks/notify_osc.sh \u0026#39;Claude Waiting\u0026#39; \u0026#39;Claude has been idle for over 60 seconds\u0026#39;\u0026#34;, 21 \u0026#34;timeout\u0026#34;: 10 22 } 23 ] 24 } 25 ] 26 } 27} Step 3: Test It 1# Manual test 2~/.claude/hooks/notify_osc.sh \u0026#34;Test\u0026#34; \u0026#34;Hello from Claude Code\u0026#34; 3 4# Check logs 5tail -f ~/.claude/hooks/notification.log The Multi-Window Challenge If you're like me, you probably have multiple VSCode Remote SSH windows open to the same server, working on different projects. The basic implementation above sends notifications to all terminal devices, which means:\nProject A completes → notification appears in Project B's window Confusing and potentially distracting Understanding the Problem Claude Code hooks run as detached processes without a controlling terminal. 
When we examine the process tree:\n1PID 794379 (bash): TTY_NR: 0 2PID 794302 (zsh): TTY_NR: 0 3PID 779087 (claude): TTY_NR: 0 All processes show TTY_NR: 0 - no controlling terminal. This makes it impossible to determine which VSCode window spawned the hook through standard process inspection.\nThe Solution: UUID-Based Terminal Mapping Each VSCode instance has a unique identifier in the VSCODE_IPC_HOOK_CLI environment variable:\n1VSCODE_IPC_HOOK_CLI=/run/user/1000/vscode-ipc-785147ca-2a10-4fce-becc-b5f600ca1dec.sock We can extract this UUID and maintain a mapping to terminal devices.\nManual Registration Script Create ~/.claude/hooks/register_current_terminal.sh:\n1#!/bin/bash 2# Register current terminal for VSCode instance 3# Run this directly in your VSCode terminal 4 5LOG_DIR=\u0026#34;$HOME/.claude/hooks\u0026#34; 6MAPPING_FILE=\u0026#34;$LOG_DIR/terminal_mapping.txt\u0026#34; 7mkdir -p \u0026#34;$LOG_DIR\u0026#34; 8 9# Get current TTY 10CURRENT_TTY=$(tty) 11if [ \u0026#34;$CURRENT_TTY\u0026#34; = \u0026#34;not a tty\u0026#34; ]; then 12 echo \u0026#34;Error: Not running in a terminal\u0026#34; 13 exit 1 14fi 15 16# Extract VSCode UUID 17VSCODE_UUID=\u0026#34;\u0026#34; 18if [ -n \u0026#34;$VSCODE_IPC_HOOK_CLI\u0026#34; ]; then 19 VSCODE_UUID=$(echo \u0026#34;$VSCODE_IPC_HOOK_CLI\u0026#34; | \\ 20 grep -oP \u0026#39;vscode-ipc-\\K[0-9a-f-]+(?=\\.sock)\u0026#39;) 21fi 22 23if [ -z \u0026#34;$VSCODE_UUID\u0026#34; ]; then 24 echo \u0026#34;Error: Could not extract VSCode UUID\u0026#34; 25 exit 1 26fi 27 28# Update mapping file 29if [ -f \u0026#34;$MAPPING_FILE\u0026#34; ]; then 30 grep -v \u0026#34;^$VSCODE_UUID:\u0026#34; \u0026#34;$MAPPING_FILE\u0026#34; \u0026gt; \u0026#34;$MAPPING_FILE.tmp\u0026#34; || true 31 mv \u0026#34;$MAPPING_FILE.tmp\u0026#34; \u0026#34;$MAPPING_FILE\u0026#34; 32fi 33 34echo \u0026#34;$VSCODE_UUID:$CURRENT_TTY\u0026#34; \u0026gt;\u0026gt; \u0026#34;$MAPPING_FILE\u0026#34; 35 36echo \u0026#34;✓ Registered VSCode UUID: 
$VSCODE_UUID\u0026#34; 37echo \u0026#34;✓ Terminal device: $CURRENT_TTY\u0026#34; Enhanced Notification Script Update notify_osc.sh to use the mapping:\n1# Near the beginning, after extracting HOOK_INPUT 2MAPPING_FILE=\u0026#34;$LOG_DIR/terminal_mapping.txt\u0026#34; 3TARGET_TTY=\u0026#34;\u0026#34; 4 5# Extract VSCode UUID from environment 6VSCODE_UUID=\u0026#34;\u0026#34; 7if [ -n \u0026#34;$VSCODE_IPC_HOOK_CLI\u0026#34; ]; then 8 VSCODE_UUID=$(echo \u0026#34;$VSCODE_IPC_HOOK_CLI\u0026#34; | \\ 9 grep -oP \u0026#39;vscode-ipc-\\K[0-9a-f-]+(?=\\.sock)\u0026#39;) 10 11 if [ -n \u0026#34;$VSCODE_UUID\u0026#34; ] \u0026amp;\u0026amp; [ -f \u0026#34;$MAPPING_FILE\u0026#34; ]; then 12 TARGET_TTY=$(grep \u0026#34;^$VSCODE_UUID:\u0026#34; \u0026#34;$MAPPING_FILE\u0026#34; | cut -d: -f2) 13 fi 14fi 15 16# Send notification 17if [ -n \u0026#34;$TARGET_TTY\u0026#34; ] \u0026amp;\u0026amp; [ -w \u0026#34;$TARGET_TTY\u0026#34; ]; then 18 # Send to specific terminal only 19 { 20 printf \u0026#39;\\033]777;notify;%s;%s\\007\u0026#39; \u0026#34;$TITLE\u0026#34; \u0026#34;$ENHANCED_MESSAGE\u0026#34; 21 printf \u0026#39;\\033]9;%s: %s\\007\u0026#39; \u0026#34;$TITLE\u0026#34; \u0026#34;$ENHANCED_MESSAGE\u0026#34; 22 printf \u0026#39;\\a\u0026#39; 23 } \u0026gt; \u0026#34;$TARGET_TTY\u0026#34; 2\u0026gt;/dev/null 24else 25 # Fallback: broadcast to all terminals 26 for pts in /dev/pts/*; do 27 # ... existing broadcast logic 28 done 29fi Usage In each VSCode terminal window, run once:\n1~/.claude/hooks/register_current_terminal.sh Verify the mapping:\n1cat ~/.claude/hooks/terminal_mapping.txt 2# Output: 3# 785147ca-2a10-4fce-becc-b5f600ca1dec:/dev/pts/2 4# 167d6e75-c42e-487d-9e9f-946e8396dd4f:/dev/pts/6 Now notifications will only appear in the correct VSCode window.\nExtracting Task Descriptions The notification is more useful when it includes what task was being performed. 
Claude Code stores session data in JSONL files at ~/.claude/projects/.\nSession File Structure\n{\u0026#34;type\u0026#34;:\u0026#34;queue-operation\u0026#34;,\u0026#34;operation\u0026#34;:\u0026#34;enqueue\u0026#34;,\u0026#34;content\u0026#34;:[\n  {\u0026#34;type\u0026#34;:\u0026#34;text\u0026#34;,\u0026#34;text\u0026#34;:\u0026#34;\u0026lt;ide_opened_file\u0026gt;...\u0026lt;/ide_opened_file\u0026gt;\u0026#34;},\n  {\u0026#34;type\u0026#34;:\u0026#34;text\u0026#34;,\u0026#34;text\u0026#34;:\u0026#34;Actual user task here\u0026#34;}\n]}\nKey Insights\nContent is an array: The first element is often a system message (\u0026lt;ide_opened_file\u0026gt;); the actual task is in subsequent elements\nMultiple message types: Some sessions use queue-operation, others use direct user messages\nFiltering required: Skip system tags like \u0026lt;ide_opened_file\u0026gt;, \u0026lt;system-reminder\u0026gt;, and \u0026lt;command-\nThe jq query for the user-message fallback, which handles both string and array content:\njq -r \u0026#39;select(.type == \u0026#34;user\u0026#34;) |\n  select(.isMeta == null or .isMeta == false) |\n  if .message.content | type == \u0026#34;array\u0026#34;\n  then .message.content[].text // empty\n  else .message.content end\u0026#39;\nTroubleshooting\nNo Notifications Appearing\nCheck terminal support:\nprintf \u0026#39;\\033]777;notify;Test;Message\\007\u0026#39;\nVerify hook execution:\ntail -f ~/.claude/hooks/notification.log\nCheck pts devices:\nls -la /dev/pts/\nNotifications in Wrong Window\nRun the registration script in your current terminal:\n~/.claude/hooks/register_current_terminal.sh\nTask Description Shows \u0026quot;null\u0026quot;\nThis usually means:\nSession file doesn't exist yet (new session)\njq is not installed\nSession file format changed\nInstall jq:\n# Ubuntu/Debian\nsudo apt-get install jq\n\n# macOS\nbrew install jq\nTrade-offs\nThe Stop hook fires every time Claude pauses, not just on task completion. 
This means you might get notifications during:\nMulti-step tasks (between steps) When Claude asks clarifying questions Tool execution pauses For most users, occasional extra notifications are preferable to the alternative - adding LLM-based completion detection that adds 30+ seconds of latency to every notification.\nConclusion With this setup, you can confidently switch away from Claude Code knowing you'll be alerted the moment it needs your attention. The combination of OSC escape sequences and Claude Code hooks creates a seamless notification experience that works across VSCode Remote SSH sessions.\nThe multi-window solution using UUID-based terminal mapping ensures notifications reach the right window, and task description extraction provides context about what just completed.\nAll scripts are available in my dotfiles repository, and I hope this saves you as much context-switching overhead as it has for me.\nResources Claude Code Hooks Documentation Terminal Notification Extension - VSCode extension for OSC-based notifications OSC Escape Sequences Reference VSCode Terminal Documentation ","link":"https://kane.mx/posts/2025/claude-code-notification-hooks/","section":"posts","tags":["Claude Code","VSCode","VSCode Remote SSH","Productivity","Shell Scripting","OSC Escape Sequences"],"title":"Desktop Notifications for Claude Code: Never Miss a Completed Task"},{"body":"","link":"https://kane.mx/categories/development-tools/","section":"categories","tags":null,"title":"Development Tools"},{"body":"","link":"https://kane.mx/tags/dynamic-client-registration/","section":"tags","tags":null,"title":"Dynamic Client Registration"},{"body":"The Problem with Dynamic Client Registration In my previous deep-dive into MCP authorization, I analyzed how the protocol builds on OAuth 2.1 with mandatory PKCE, Resource Indicators (RFC 8707), and the \u0026quot;Discovery Trifecta\u0026quot; of RFC 7591, 8414, and 9728. 
Dynamic Client Registration (DCR) was positioned as the key enabler for MCP's federated ecosystem.\nHowever, DCR has significant practical limitations:\nChallenge | Impact\nRequires AS support for public registration API | Many identity providers don't offer this\nForces OAuth proxy infrastructure | Added complexity when AS lacks DCR\nManual IT involvement | End users need admin help for each registration\nThe MCP ecosystem faces a unique challenge: unbounded clients connecting to unbounded servers with no prior relationship. DCR, while standardized, often requires workarounds in practice.\nSEP-991: URL-Based Client Identity\nOn MCP's first anniversary, the team announced a simplified approach: OAuth Client ID Metadata Documents (SEP-991). This mechanism is now officially part of the 2025-11-25 stable specification.\nCore Concept\nInstead of registering with the Authorization Server, the client hosts its own identity document at an HTTPS URL. The client_id itself becomes the URL pointing to this metadata.\nclient_id = \u0026#34;https://my-mcp-client.com/.well-known/oauth-client.json\u0026#34;\nMetadata Document Structure\nThe client hosts a JSON document containing its OAuth metadata:\n{\n  \u0026#34;client_id\u0026#34;: \u0026#34;https://my-mcp-client.com/.well-known/oauth-client.json\u0026#34;,\n  \u0026#34;client_name\u0026#34;: \u0026#34;My MCP Client\u0026#34;,\n  \u0026#34;redirect_uris\u0026#34;: [\n    \u0026#34;https://my-mcp-client.com/callback\u0026#34;,\n    \u0026#34;http://localhost:8080/callback\u0026#34;\n  ],\n  \u0026#34;token_endpoint_auth_method\u0026#34;: \u0026#34;none\u0026#34;,\n  \u0026#34;grant_types\u0026#34;: [\u0026#34;authorization_code\u0026#34;],\n  \u0026#34;response_types\u0026#34;: [\u0026#34;code\u0026#34;]\n}\nKey fields:\nclient_id: Must exactly match the document's URL\nclient_name: Displayed to users during authorization\nredirect_uris: Allowed callback URLs\ntoken_endpoint_auth_method: none for public clients, private_key_jwt for confidential clients\nNew Client 
Registration Priority The specification defines a clear priority order:\nPriority Method When to Use 1 Pre-registered credentials Known client-server relationships 2 Client ID Metadata Documents Server supports client_id_metadata_document_supported 3 Dynamic Client Registration Fallback if AS supports RFC 7591 4 Manual user entry Last resort SEP-991 now takes precedence over DCR when supported.\nAuthorization Flow Comparison Traditional DCR Flow sequenceDiagram participant Client participant AS as Authorization Server Note over Client,AS: Registration Phase Client-\u0026gt;\u0026gt;AS: POST /register {redirect_uris, client_name...} AS-\u0026gt;\u0026gt;AS: Validate \u0026amp; store client AS--\u0026gt;\u0026gt;Client: {client_id: \u0026#34;generated-id-123\u0026#34;, client_secret...} Note over Client,AS: Authorization Phase Client-\u0026gt;\u0026gt;AS: GET /authorize?client_id=generated-id-123\u0026amp;... SEP-991 Flow sequenceDiagram participant Client participant AS as Authorization Server participant Meta as Client Metadata URL Note over Client,AS: No Registration Phase! Client-\u0026gt;\u0026gt;AS: GET /authorize?client_id=https://client.com/meta.json\u0026amp;... AS-\u0026gt;\u0026gt;Meta: GET https://client.com/meta.json Meta--\u0026gt;\u0026gt;AS: {client_id, client_name, redirect_uris...} AS-\u0026gt;\u0026gt;AS: Validate: client_id matches URL, redirect_uri allowed AS--\u0026gt;\u0026gt;Client: Continue authorization flow Key difference: The client never registers. 
The AS fetches and validates the metadata on-demand.\nServer Discovery Support\nAuthorization Servers declare SEP-991 support in their metadata (RFC 8414):\n{\n  \u0026#34;issuer\u0026#34;: \u0026#34;https://auth.example.com\u0026#34;,\n  \u0026#34;authorization_endpoint\u0026#34;: \u0026#34;https://auth.example.com/authorize\u0026#34;,\n  \u0026#34;token_endpoint\u0026#34;: \u0026#34;https://auth.example.com/token\u0026#34;,\n  \u0026#34;client_id_metadata_document_supported\u0026#34;: true\n}\nClients check this field before using URL-based client IDs.\nSpecification Status\nVersion | SEP-991 Status\n2025-03-26 | Not included\n2025-06-18 | Not included\n2025-11-25 (current) | Included\nThe feature is now officially part of the stable MCP specification.\nSDK Implementation Status\nNot all official MCP SDKs have implemented SEP-991 yet. Here's the current support matrix:\nSDK | Language | SEP-991 Support | Notes\ntypescript-sdk | TypeScript | ✅ Implemented | Full CIMD support with capability detection\npython-sdk | Python | ✅ Implemented | Full CIMD support with graceful fallback\nrust-sdk | Rust | ❌ Not yet | Standard OAuth 2.1 + DCR only\ngo-sdk | Go | ❌ Not yet | RFC 8414 metadata only\nkotlin-sdk | Kotlin | ❓ Unknown | OAuth support not documented\ncsharp-sdk | C# | ❌ No OAuth | Protocol implementation only\nTypeScript SDK Example\n// The SDK automatically detects server support\nconst supportsUrlBasedClientId =\n  metadata?.client_id_metadata_document_supported === true;\n\n// When supported, uses URL as client_id\nif (supportsUrlBasedClientId \u0026amp;\u0026amp; clientMetadataUrl) {\n  clientInformation = { client_id: clientMetadataUrl };\n}\nPython SDK Example\n# Validates metadata URL format\n# \u0026#34;client_metadata_url must be a valid HTTPS URL with a non-root pathname\u0026#34;\n\n# Creates client info from metadata URL when supported\nclient_information = create_client_info_from_metadata_url(\n    self.context.client_metadata_url,\n    redirect_uris=self.context.client_metadata.redirect_uris,\n)\nBoth SDKs implement the 
priority order: check client_id_metadata_document_supported → use CIMD if available → fall back to DCR.\nImplementation Impact For MCP Clients Requirements:\nHost metadata document at HTTPS URL with path component Ensure client_id in document matches the URL exactly Include all required fields: client_id, client_name, redirect_uris Benefits:\nNo registration API calls needed Self-managed identity Works with any AS that supports SEP-991 For Authorization Servers Requirements:\nImplement metadata document fetching and validation Verify client_id matches fetched URL Respect HTTP cache headers for metadata Declare support via client_id_metadata_document_supported Trust model:\nHTTPS domain ownership proves client identity Servers can restrict to trusted domains or allow any HTTPS client Relationship to Existing Infrastructure For those who followed my guide on implementing MCP OAuth with Keycloak, SEP-991 represents a significant simplification. Instead of configuring DCR endpoints and client registration flows, implementations can now:\nHost a static JSON file on the client's domain Configure the Authorization Server to fetch and validate client metadata URLs Eliminate the need for client pre-registration or DCR infrastructure This aligns with MCP's goal of zero-configuration federation.\nSummary SEP-991 shifts the client registration paradigm:\nAspect DCR (Legacy) Client ID Metadata (New) Who registers Authorization Server Client self-hosts client_id format Server-generated string HTTPS URL Coordination needed Yes (API call) No Identity verification Registration-time Fetch-time (HTTPS domain) The change transforms \u0026quot;server registers client\u0026quot; into \u0026quot;client proves identity\u0026quot;—a fundamental simplification for MCP's open ecosystem.\nResources Specifications SEP-991 Discussion: Original proposal on GitHub MCP Authorization Spec (2025-11-25): Current stable specification with Client ID Metadata Documents Related Articles Technical 
Deconstruction of MCP Authorization: Deep-dive into MCP's OAuth 2.1 foundation Implementing MCP OAuth 2.1 with Keycloak on AWS: Practical deployment guide ","link":"https://kane.mx/posts/2025/mcp-oauth-sep-991-simplified-registration/","section":"posts","tags":["MCP","Model Context Protocol","OAuth 2.1","SEP-991","Dynamic Client Registration","Client ID Metadata"],"title":"MCP OAuth Evolution: SEP-991 Simplifies Client Registration"},{"body":"","link":"https://kane.mx/tags/osc-escape-sequences/","section":"tags","tags":null,"title":"OSC Escape Sequences"},{"body":"","link":"https://kane.mx/tags/productivity/","section":"tags","tags":null,"title":"Productivity"},{"body":"","link":"https://kane.mx/tags/sep-991/","section":"tags","tags":null,"title":"SEP-991"},{"body":"","link":"https://kane.mx/tags/shell-scripting/","section":"tags","tags":null,"title":"Shell Scripting"},{"body":"","link":"https://kane.mx/tags/vscode/","section":"tags","tags":null,"title":"VSCode"},{"body":"","link":"https://kane.mx/tags/vscode-remote-ssh/","section":"tags","tags":null,"title":"VSCode Remote SSH"},{"body":"","link":"https://kane.mx/tags/identity-provider/","section":"tags","tags":null,"title":"Identity Provider"},{"body":"Introduction The Model Context Protocol (MCP) ecosystem mandates OAuth 2.1-compliant authorization servers to facilitate secure, federated access to AI model services. MCP clients, such as Claude Code, Cursor, and VS Code extensions, rely on modern OAuth specifications including Dynamic Client Registration (RFC 7591), PKCE (RFC 7636), and crucially, Resource Indicators (RFC 8707) for audience-restricted tokens.\nHowever, most Identity-as-a-Service (IDaaS) providers, including the open-source Keycloak platform, currently lack full RFC 8707 support. Keycloak, while robust in OAuth 2.0 capabilities, employs a proprietary audience parameter in contrast to the standardized resource parameter defined in RFC 8707. 
For a comprehensive analysis of this compatibility landscape, refer to my previous post: Technical Deconstruction of MCP Authorization: A Deep Dive into OAuth 2.1 and IETF RFC Specifications.\nThis article provides a detailed guide on configuring Keycloak as an MCP-compatible authorization server through strategic use of protocol mappers and realm configuration. The implemented solution encompasses:\nRFC 8707 Workaround: Custom audience protocol mappers to inject correct aud claims into JWT tokens. Dynamic Client Registration: Automated client onboarding via realm default scopes. Zero-Configuration MCP Support: Automatic audience restriction without manual client configuration. Infrastructure Automation: Terraform deployment on AWS utilizing ECS Fargate and Aurora PostgreSQL. Upon completion of this guide, you will possess a clear understanding of how to configure Keycloak for seamless MCP client support, enabling dynamic client registration with automated audience restriction.\nArchitecture Overview This deployment leverages AWS managed services to establish a scalable Keycloak infrastructure tailored for MCP OAuth workflows.\nCore Components Compute Layer (ECS Fargate)\nKeycloak operates as containerized workloads on AWS Fargate, offering managed compute capacity:\nCustom Docker Image: Built from the official Keycloak 26.4.4 release, pre-configured with JDBC_PING for clustering. Multi-AZ Deployment: Tasks are strategically distributed across multiple Availability Zones for resilience. Health Monitoring: Integrated with AWS CloudWatch Container Insights for robust performance and health visibility. Database Layer (Aurora PostgreSQL)\nAmazon Aurora provides a highly available, scalable PostgreSQL-compatible database backend:\nDatabase Engine: PostgreSQL 16 (Keycloak 26.4.4 requires PostgreSQL 13+ minimum, 16.8 recommended). Scalability: Aurora Serverless v2, featuring configurable capacity and auto-scaling. 
High Availability: Multi-AZ deployment with automatic failover.\nSecurity: Data encryption at rest and automated backups.\nLoad Balancing (Application Load Balancer)\nThe Application Load Balancer (ALB) handles TLS termination and traffic distribution:\nHTTPS/TLS: Certificates are managed by AWS Certificate Manager (ACM).\nHealth Checks: Continuously monitors Keycloak health endpoints to ensure service availability.\nSession Affinity: Supports sticky sessions for stateful client connections.\nNetworking Infrastructure\nA Virtual Private Cloud (VPC) provides a logically isolated network environment:\nSubnets: Public and private subnets distributed across multiple Availability Zones.\nNAT Gateways: Enable outbound internet access for resources within private subnets.\nVPC Endpoints: Provide private connectivity to select AWS services.\nSecurity Groups: Enforce granular network access controls.\nDeployment Workflow\nThe infrastructure deployment follows a phased approach:\nflowchart TD\n  A[Create VPC \u0026amp; Networking] --\u0026gt; B[Deploy Aurora RDS]\n  B --\u0026gt; C[Create ECS Cluster]\n  C --\u0026gt; D[Build \u0026amp; Push Container Image]\n  D --\u0026gt; E[Start ECS Tasks]\n  E --\u0026gt; F[Configure MCP OAuth Realm]\n  F --\u0026gt; G[Verify Dynamic Client Registration]\nThis ordering ensures that the foundational infrastructure is provisioned before the MCP-specific Keycloak configuration is applied.\nUnderstanding Keycloak's RFC 8707 Gap\nTo see why this deployment needs custom configuration, consider the incompatibility between Keycloak's audience implementation and the MCP specification's requirements.\nThe RFC 8707 Standard\nRFC 8707 (Resource Indicators for OAuth 2.0) specifies a standardized mechanism for audience restriction within OAuth access tokens. 
This specification introduces a resource parameter, which clients include in both authorization and token requests:\nPOST /token HTTP/1.1\nHost: auth.example.com\nContent-Type: application/x-www-form-urlencoded\n\ngrant_type=authorization_code\n\u0026amp;code=ABC123\n\u0026amp;redirect_uri=https://client.example.com/callback\n\u0026amp;resource=https://api.example.com ← Target audience\n\u0026amp;client_id=CLIENT_ID\nThe Authorization Server (AS) uses this resource parameter to populate the JWT's aud (audience) claim, restricting the token to the specified Resource Server (RS).\nKeycloak's Proprietary Approach\nKeycloak's audience functionality was implemented before RFC 8707 was published in February 2020. As detailed in the MCP authorization compatibility matrix, Keycloak employs a proprietary audience parameter that predates the standardized approach.\nThe Problem: MCP clients (e.g., Claude Code, Cursor, VS Code extensions) adhere to RFC 8707 and send the resource parameter. Keycloak, however, ignores this parameter, resulting in JWT tokens that either lack the mandatory aud claim or contain incorrect audience values.\nThe Consequence: MCP servers validate the aud claim to mitigate token replay attacks, addressing the \u0026quot;Confused Deputy\u0026quot; problem. Without proper audience restriction, tokens may be rejected or misused across different resource servers.\nThe Workaround Architecture\nThe proposed solution uses Keycloak's Protocol Mappers to automatically inject the correct aud claim, working around the absence of native RFC 8707 support. 
This architecture integrates three key components:\nflowchart LR\n  A[Dynamic Client Registration] --\u0026gt; B{Realm Default Scopes}\n  B --\u0026gt; C[Auto-assign mcp:run scope]\n  C --\u0026gt; D[mcp:run has Audience Mapper]\n  D --\u0026gt; E[Token Request]\n  E --\u0026gt; F{Mapper Active?}\n  F --\u0026gt;|Yes| G[Inject aud claim]\n  G --\u0026gt; H[JWT with correct audience]\n  H --\u0026gt; I[MCP Server validates aud]\n  I --\u0026gt; J[Access Granted]\n  style D fill:#f59e0b\n  style G fill:#10b981\n  style I fill:#3b82f6\nComponent 1: Audience Protocol Mapper\nA hardcoded claim mapper, associated with the mcp:run client scope, injects the MCP server's URL into the aud claim:\nresource \u0026#34;keycloak_openid_hardcoded_claim_protocol_mapper\u0026#34; \u0026#34;mcp_run_audience_mapper\u0026#34; {\n  realm_id = keycloak_realm.mcp.id\n  client_scope_id = keycloak_openid_client_scope.mcp_run.id\n  name = \u0026#34;mcp-audience\u0026#34;\n\n  claim_name = \u0026#34;aud\u0026#34;\n  claim_value = var.resource_server_uri # e.g., \u0026#34;https://mcp-server.example.com/mcp\u0026#34;\n  claim_value_type = \u0026#34;String\u0026#34;\n\n  add_to_id_token = false\n  add_to_access_token = true # ← Critical: Only in access tokens\n  add_to_userinfo = false\n}\nComponent 2: Realm Default Scopes\nBy configuring mcp:run as a realm-wide default scope, all clients, including those registered via Dynamic Client Registration, automatically inherit this audience mapper:\nresource \u0026#34;keycloak_realm_default_client_scopes\u0026#34; \u0026#34;mcp_realm_defaults\u0026#34; {\n  realm_id = keycloak_realm.mcp.id\n\n  default_scopes = [\n    \u0026#34;profile\u0026#34;,\n    \u0026#34;email\u0026#34;,\n    \u0026#34;mcp:run\u0026#34;, # ← Critical: Auto-assigned to DCR clients\n    \u0026#34;roles\u0026#34;,\n    \u0026#34;web-origins\u0026#34;,\n    \u0026#34;acr\u0026#34;,\n    \u0026#34;basic\u0026#34;,\n  ]\n}\nComponent 3: DCR Allowed Scopes Configuration\nClient Registration Policies are configured to permit mcp:run within the 
allowed scopes for dynamically registered clients. This step is performed using the Keycloak Admin REST API due to current Terraform provider limitations:\n# Extract Client Registration Policy component ID\nCOMPONENT_ID=$(curl -s \u0026#34;${KEYCLOAK_URL}/admin/realms/mcp/components\u0026#34; \\\n  -H \u0026#34;Authorization: Bearer ${ADMIN_TOKEN}\u0026#34; | \\\n  jq -r \u0026#39;.[] | select(.name==\u0026#34;Allowed Client Scopes\u0026#34;) | .id\u0026#39;)\n\n# Update allowed scopes to include mcp:run\ncurl -X PUT \u0026#34;${KEYCLOAK_URL}/admin/realms/mcp/components/${COMPONENT_ID}\u0026#34; \\\n  -H \u0026#34;Authorization: Bearer ${ADMIN_TOKEN}\u0026#34; \\\n  -H \u0026#34;Content-Type: application/json\u0026#34; \\\n  -d \u0026#39;{\n    \u0026#34;config\u0026#34;: {\n      \u0026#34;allow-default-scopes\u0026#34;: [\u0026#34;true\u0026#34;],\n      \u0026#34;allowed-client-scopes\u0026#34;: [\u0026#34;openid\u0026#34;, \u0026#34;profile\u0026#34;, \u0026#34;email\u0026#34;, \u0026#34;mcp:run\u0026#34;]\n    }\n  }\u0026#39;\nComplete Flow of Operations\nWhen an MCP client (e.g., Claude Code) attempts to access a protected MCP server:\nDiscovery: The client retrieves the MCP server's metadata (RFC 9728) to identify the required Authorization Server.\nRegistration: The client dynamically registers with Keycloak via a POST request to /clients-registrations/openid-connect.\nAutomatic Scope Inheritance: Keycloak automatically assigns the mcp:run scope (due to realm default configuration) to the newly registered client.\nAuthorization Flow: The client initiates the OAuth Authorization Code flow, incorporating PKCE.\nToken Issuance: Keycloak generates a JWT access token, and the audience mapper injects the aud: \u0026quot;https://mcp-server.example.com/mcp\u0026quot; claim.\nValidation: The MCP server validates the aud claim against its own identifier and grants access to the MCP resources.\n
Result: The MCP client achieves full functionality without requiring any manual configuration within the Keycloak administrative console. This pattern of realm default scopes combined with an audience mapper establishes fully automated MCP compatibility.\nMCP OAuth 2.1 Configuration Deep Dive This section details the Terraform configurations that transform a standard Keycloak deployment into an MCP-compliant authorization server.\nRFC Compliance Matrix The implementation ensures OAuth 2.1 compatibility through selective RFC adoption:\nRFC Specification Implementation Status Notes RFC 7591 Dynamic Client Registration ✅ Complete Anonymous DCR enabled for zero-configuration clients RFC 7636 PKCE (Proof Key for Code Exchange) ✅ Complete S256 challenge method mandatory for all clients RFC 8414 Authorization Server Metadata ✅ Complete OIDC discovery at /.well-known/openid-configuration RFC 8707 Resource Indicators ✅ Complete Via audience mapper workaround (native support in development) RFC 9728 Protected Resource Metadata ⚠️ MCP Server-dependent Implemented by MCP servers, not the AS Realm Configuration The MCP realm (mcp-realm.tf) is meticulously configured to establish security policies and token lifespans, optimized for AI model access patterns:\n1resource \u0026#34;keycloak_realm\u0026#34; \u0026#34;mcp\u0026#34; { 2 realm = \u0026#34;mcp\u0026#34; 3 enabled = true 4 5 display_name = \u0026#34;MCP Authorization Server\u0026#34; 6 display_name_html = \u0026#34;\u0026lt;b\u0026gt;Model Context Protocol\u0026lt;/b\u0026gt;\u0026#34; 7 8 # Token lifespans - optimized for MCP sessions 9 access_token_lifespan = \u0026#34;1h\u0026#34; # Longer for AI workflows 10 sso_session_idle_timeout = \u0026#34;30m\u0026#34; 11 sso_session_max_lifespan = \u0026#34;10h\u0026#34; 12 offline_session_idle_timeout = \u0026#34;720h\u0026#34; # 30 days 13 14 # Security policies 15 ssl_required = \u0026#34;external\u0026#34; # Require HTTPS for external connections 16 17 password_policy = 
\u0026#34;length(12) and upperCase(1) and lowerCase(1) and digits(1) and specialChars(1)\u0026#34; 18 19 security_defenses { 20 headers { 21 x_frame_options = \u0026#34;DENY\u0026#34; 22 content_security_policy = \u0026#34;frame-src \u0026#39;self\u0026#39;; frame-ancestors \u0026#39;self\u0026#39;; object-src \u0026#39;none\u0026#39;;\u0026#34; 23 content_security_policy_report_only = \u0026#34;\u0026#34; 24 x_content_type_options = \u0026#34;nosniff\u0026#34; 25 x_robots_tag = \u0026#34;none\u0026#34; 26 x_xss_protection = \u0026#34;1; mode=block\u0026#34; 27 strict_transport_security = \u0026#34;max-age=31536000; includeSubDomains\u0026#34; 28 } 29 30 brute_force_detection { 31 permanent_lockout = false 32 max_login_failures = 5 33 wait_increment_seconds = 60 34 quick_login_check_milli_seconds = 1000 35 minimum_quick_login_wait_seconds = 60 36 max_failure_wait_seconds = 900 37 failure_reset_time_seconds = 900 38 } 39 } 40} Client Scopes and Audience Mapper The mcp:run client scope (mcp-scopes.tf) forms the core of the workaround, intelligently combining scope definition with the critical audience mapper:\n1# Define the mcp:run client scope 2resource \u0026#34;keycloak_openid_client_scope\u0026#34; \u0026#34;mcp_run\u0026#34; { 3 realm_id = keycloak_realm.mcp.id 4 name = \u0026#34;mcp:run\u0026#34; 5 description = \u0026#34;Scope for MCP model execution with audience restriction\u0026#34; 6 consent_screen_text = \u0026#34;Access MCP model servers\u0026#34; 7 include_in_token_scope = true 8} 9 10# Attach the audience mapper to mcp:run scope 11resource \u0026#34;keycloak_openid_hardcoded_claim_protocol_mapper\u0026#34; \u0026#34;mcp_run_audience_mapper\u0026#34; { 12 realm_id = keycloak_realm.mcp.id 13 client_scope_id = keycloak_openid_client_scope.mcp_run.id 14 name = \u0026#34;mcp-audience\u0026#34; 15 16 claim_name = \u0026#34;aud\u0026#34; 17 claim_value = var.resource_server_uri 18 claim_value_type = \u0026#34;String\u0026#34; 19 20 add_to_id_token = false 21 
add_to_access_token = true 22 add_to_userinfo = false 23} 24 25# Make mcp:run a default scope for all clients 26resource \u0026#34;keycloak_realm_default_client_scopes\u0026#34; \u0026#34;mcp_realm_defaults\u0026#34; { 27 realm_id = keycloak_realm.mcp.id 28 29 default_scopes = [ 30 \u0026#34;profile\u0026#34;, 31 \u0026#34;email\u0026#34;, 32 keycloak_openid_client_scope.mcp_run.name, # ← Critical 33 \u0026#34;roles\u0026#34;, 34 \u0026#34;web-origins\u0026#34;, 35 \u0026#34;acr\u0026#34;, 36 \u0026#34;basic\u0026#34;, 37 ] 38} Key Design Decision: The mapper specifically configures add_to_access_token = true and add_to_id_token = false. This intentional design ensures the aud claim is present in the access token (for resource server validation) but excluded from the ID token (consumed by the client for user information).\nTwo-Phase Deployment Pattern The Keycloak Terraform Provider currently exhibits a limitation: it cannot directly manage Client Registration Policies, which govern DCR behavior. This necessitates a hybrid deployment approach:\nPhase 1: Terraform Resources (terraform apply)\nThis phase declaratively provisions the infrastructure:\nRealm with defined security policies Client scopes with embedded protocol mappers Realm default scopes Optional example clients Phase 2: REST API Configuration (Bash scripts)\nThis phase configures imperative settings using the Keycloak Admin REST API:\nfix-allowed-scopes.sh: Modifies the Client Registration Policy to include mcp:run in the allowed scopes list. disable-trusted-hosts.sh: Removes the Trusted Hosts policy to accommodate custom redirect URI schemes (e.g., cursor://, vscode://, claude://). enable-dcr.sh: Verifies Dynamic Client Registration functionality and confirms proper scope inheritance. 
The integrated deploy.sh orchestrator automates the execution of both phases:\n1#!/bin/bash 2set -e 3 4echo \u0026#34;Phase 1: Terraform deployment...\u0026#34; 5terraform init 6terraform apply -auto-approve 7 8echo \u0026#34;Phase 2: REST API configuration...\u0026#34; 9./fix-allowed-scopes.sh 10./disable-trusted-hosts.sh 11 12echo \u0026#34;Verification: Testing DCR...\u0026#34; 13./enable-dcr.sh 14 15echo \u0026#34;Deployment complete! MCP OAuth 2.1 realm ready.\u0026#34; Trusted Hosts Policy Removal MCP clients frequently employ non-standard redirect URI schemes that Keycloak's default policies typically reject. The solution involves completely removing the Trusted Hosts policy component:\n1# Find the Trusted Hosts policy component 2TRUSTED_HOSTS_ID=$(curl -s \u0026#34;${KEYCLOAK_URL}/admin/realms/mcp/components\u0026#34; \\ 3 -H \u0026#34;Authorization: Bearer ${ADMIN_TOKEN}\u0026#34; | \\ 4 jq -r \u0026#39;.[] | select(.name==\u0026#34;Trusted Hosts\u0026#34;) | .id\u0026#39;) 5 6# Delete it entirely 7curl -X DELETE \u0026#34;${KEYCLOAK_URL}/admin/realms/mcp/components/${TRUSTED_HOSTS_ID}\u0026#34; \\ 8 -H \u0026#34;Authorization: Bearer ${ADMIN_TOKEN}\u0026#34; Security Consideration: This action allows all redirect URI schemes, including http://localhost:* for development purposes. For production deployments safeguarding sensitive data, it is recommended to implement a custom policy that explicitly whitelists only approved schemes such as https://, cursor://, vscode://, and claude://.\nInfrastructure Components Beyond the OAuth configuration, several AWS infrastructure components provide essential support for the deployment.\nJDBC_PING Clustering for ECS Fargate Keycloak's native clustering mechanism, JGroups, typically relies on UDP multicast for node discovery. However, AWS VPCs do not support multicast, and ECS Fargate instances lack static IP addresses. 
The adopted solution is JDBC_PING, which utilizes the PostgreSQL database as a robust coordination mechanism.\nHow JDBC_PING Functions:\nEach Keycloak container registers its IP address and port within the JGROUPSPING table in PostgreSQL. Containers periodically query this table to discover active cluster members. Session data is replicated across the discovered cluster members. Upon container termination, its corresponding entry is gracefully removed from the table. Configuration (cache-ispn-jdbc-ping.xml):\n1\u0026lt;config xmlns=\u0026#34;urn:org:jgroups\u0026#34; 2 xmlns:xsi=\u0026#34;http://www.w3.org/2001/XMLSchema-instance\u0026#34; 3 xsi:schemaLocation=\u0026#34;urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.2.xsd\u0026#34;\u0026gt; 4 \u0026lt;TCP bind_addr=\u0026#34;${jgroups.bind.address,jgroups.tcp.address:SITE_LOCAL}\u0026#34; 5 bind_port=\u0026#34;${jgroups.bind.port,jgroups.tcp.port:7800}\u0026#34; 6 recv_buf_size=\u0026#34;5m\u0026#34; 7 send_buf_size=\u0026#34;1m\u0026#34; 8 max_bundle_size=\u0026#34;64k\u0026#34;/\u0026gt; 9 10 \u0026lt;JDBC_PING 11 connection_driver=\u0026#34;org.postgresql.Driver\u0026#34; 12 connection_url=\u0026#34;${env.KC_DB_URL}\u0026#34; 13 connection_username=\u0026#34;${env.KC_DB_USERNAME}\u0026#34; 14 connection_password=\u0026#34;${env.KC_DB_PASSWORD}\u0026#34; 15 initialize_sql=\u0026#34;CREATE TABLE IF NOT EXISTS JGROUPSPING ( 16 own_addr VARCHAR(200) NOT NULL, 17 cluster_name VARCHAR(200) NOT NULL, 18 ping_data BYTEA, 19 constraint PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name) 20 );\u0026#34; 21 info_writer_sleep_time=\u0026#34;500\u0026#34; 22 remove_all_data_on_view_change=\u0026#34;true\u0026#34; 23 stack.combine=\u0026#34;REPLACE\u0026#34; 24 stack.position=\u0026#34;MPING\u0026#34;/\u0026gt; 25 26 \u0026lt;MERGE3 min_interval=\u0026#34;10000\u0026#34; max_interval=\u0026#34;30000\u0026#34;/\u0026gt; 27 \u0026lt;FD_SOCK/\u0026gt; 28 \u0026lt;FD_ALL timeout=\u0026#34;60000\u0026#34; 
interval=\u0026#34;15000\u0026#34;/\u0026gt; 29 \u0026lt;VERIFY_SUSPECT timeout=\u0026#34;5000\u0026#34;/\u0026gt; 30 \u0026lt;pbcast.NAKACK2 use_mcast_xmit=\u0026#34;false\u0026#34; xmit_interval=\u0026#34;1000\u0026#34;/\u0026gt; 31 \u0026lt;UNICAST3 xmit_interval=\u0026#34;500\u0026#34;/\u0026gt; 32 \u0026lt;pbcast.STABLE desired_avg_gossip=\u0026#34;50000\u0026#34; max_bytes=\u0026#34;8m\u0026#34;/\u0026gt; 33 \u0026lt;pbcast.GMS print_local_addr=\u0026#34;true\u0026#34; join_timeout=\u0026#34;2000\u0026#34;/\u0026gt; 34 \u0026lt;UFC max_credits=\u0026#34;2m\u0026#34; min_threshold=\u0026#34;0.4\u0026#34;/\u0026gt; 35 \u0026lt;MFC max_credits=\u0026#34;2m\u0026#34; min_threshold=\u0026#34;0.4\u0026#34;/\u0026gt; 36 \u0026lt;FRAG2 frag_size=\u0026#34;60k\u0026#34;/\u0026gt; 37\u0026lt;/config\u0026gt; Keycloak Container Configuration:\nThe Dockerfile integrates this configuration during the build process:\n1FROM quay.io/keycloak/keycloak:26.4.4 AS builder 2 3ENV KC_HEALTH_ENABLED=true 4ENV KC_METRICS_ENABLED=true 5ENV KC_HTTP_RELATIVE_PATH=/auth 6ENV KC_DB=postgres 7 8# Copy JDBC_PING configuration 9COPY ./cache-ispn-jdbc-ping.xml /opt/keycloak/conf/cache-ispn-jdbc-ping.xml 10 11# Build optimized image 12RUN /opt/keycloak/bin/kc.sh build --cache-config-file=cache-ispn-jdbc-ping.xml 13 14FROM quay.io/keycloak/keycloak:26.4.4 15COPY --from=builder /opt/keycloak /opt/keycloak 16 17# JGroups TCP port for cluster traffic (members are discovered via JDBC_PING). 18# Dockerfile comments must sit on their own line; a trailing # is parsed as an argument. 19EXPOSE 7800 20ENTRYPOINT [\u0026#34;/opt/keycloak/bin/kc.sh\u0026#34;] ECS Task Definition:\nThe ECS task definition exposes port 7800 to facilitate cluster communication:\n1{ 2 \u0026#34;containerDefinitions\u0026#34;: [{ 3 \u0026#34;name\u0026#34;: \u0026#34;keycloak\u0026#34;, 4 \u0026#34;image\u0026#34;: \u0026#34;${ecr_repository_url}:${image_tag}\u0026#34;, 5 \u0026#34;portMappings\u0026#34;: [ 6 { 7 \u0026#34;containerPort\u0026#34;: 8080, 8 \u0026#34;protocol\u0026#34;: \u0026#34;tcp\u0026#34; 9 }, 10 { 11 \u0026#34;containerPort\u0026#34;: 7800, 12 
\u0026#34;protocol\u0026#34;: \u0026#34;tcp\u0026#34; 13 } 14 ], 15 \u0026#34;environment\u0026#34;: [ 16 {\u0026#34;name\u0026#34;: \u0026#34;KC_DB\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;postgres\u0026#34;}, 17 {\u0026#34;name\u0026#34;: \u0026#34;KC_DB_URL\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;jdbc:postgresql://...\u0026#34;}, 18 {\u0026#34;name\u0026#34;: \u0026#34;KC_PROXY_HEADERS\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;xforwarded\u0026#34;}, 19 {\u0026#34;name\u0026#34;: \u0026#34;KC_CACHE_CONFIG_FILE\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;cache-ispn-jdbc-ping.xml\u0026#34;} 20 ], 21 \u0026#34;secrets\u0026#34;: [ 22 {\u0026#34;name\u0026#34;: \u0026#34;KC_DB_PASSWORD\u0026#34;, \u0026#34;valueFrom\u0026#34;: \u0026#34;arn:aws:ssm:...\u0026#34;}, 23 {\u0026#34;name\u0026#34;: \u0026#34;KEYCLOAK_ADMIN_PASSWORD\u0026#34;, \u0026#34;valueFrom\u0026#34;: \u0026#34;arn:aws:ssm:...\u0026#34;} 24 ], 25 \u0026#34;healthCheck\u0026#34;: { 26 \u0026#34;command\u0026#34;: [\u0026#34;CMD-SHELL\u0026#34;, \u0026#34;curl -f http://localhost:8080/auth/health || exit 1\u0026#34;], 27 \u0026#34;interval\u0026#34;: 30, 28 \u0026#34;timeout\u0026#34;: 5, 29 \u0026#34;retries\u0026#34;: 3 30 } 31 }] 32} Result: ECS can dynamically scale tasks up or down. 
New containers automatically join the cluster, and terminated containers are gracefully removed, ensuring session data persistence across container restarts.\nAurora Serverless v2 Configuration Aurora Serverless v2 offers a PostgreSQL-compatible database with sub-second scaling and granular pay-per-second billing:\n1module \u0026#34;aurora_postgresql\u0026#34; { 2 source = \u0026#34;terraform-aws-modules/rds-aurora/aws\u0026#34; 3 version = \u0026#34;~\u0026gt; 8.0\u0026#34; 4 5 name = \u0026#34;keycloak-db\u0026#34; 6 engine = \u0026#34;aurora-postgresql\u0026#34; 7 engine_version = \u0026#34;16.8\u0026#34; 8 instance_class = \u0026#34;db.serverless\u0026#34; 9 instances = { 10 one = {} 11 two = {} # Multi-AZ for high availability 12 } 13 14 serverlessv2_scaling_configuration = { 15 min_capacity = 0.5 # 1 GB RAM - minimal idle cost 16 max_capacity = 2 # 4 GB RAM - handles production traffic 17 } 18 19 vpc_id = module.vpc.vpc_id 20 db_subnet_group_name = aws_db_subnet_group.aurora.name 21 security_group_rules = { 22 keycloak_ingress = { 23 source_security_group_id = aws_security_group.keycloak_ecs.id 24 } 25 } 26 27 storage_encrypted = true 28 apply_immediately = true 29 30 backup_retention_period = 7 31 preferred_backup_window = \u0026#34;03:00-04:00\u0026#34; 32 33 database_name = \u0026#34;keycloak\u0026#34; 34 master_username = \u0026#34;keycloak\u0026#34; 35} Scaling Behavior: Aurora Serverless v2 actively monitors database load (CPU, connections, memory) and adjusts ACU capacity in sub-second increments. 
Typical Keycloak workload scaling ranges are:\nIdle/Development: 0.5 ACU (approximately $0.12/hour) Normal Production: 1-1.5 ACU (approximately $0.24/hour) High Load (authentication storms): 1.5-2 ACU (approximately $0.36/hour) ECS Service Configuration The ECS service is responsible for managing task placement and continuous health monitoring:\n1resource \u0026#34;aws_ecs_service\u0026#34; \u0026#34;keycloak\u0026#34; { 2 name = \u0026#34;keycloak\u0026#34; 3 cluster = aws_ecs_cluster.main.id 4 task_definition = aws_ecs_task_definition.keycloak.arn 5 desired_count = var.desired_count # 2 for HA 6 7 launch_type = \u0026#34;FARGATE\u0026#34; 8 platform_version = \u0026#34;LATEST\u0026#34; 9 10 deployment_maximum_percent = 200 # Allow 2x capacity during updates 11 deployment_minimum_healthy_percent = 100 # Always maintain full capacity 12 health_check_grace_period_seconds = 600 # Allow time for Keycloak startup 13 14 network_configuration { 15 subnets = module.vpc.private_subnets 16 security_groups = [aws_security_group.keycloak_ecs.id] 17 } 18 19 load_balancer { 20 target_group_arn = aws_lb_target_group.keycloak.arn 21 container_name = \u0026#34;keycloak\u0026#34; 22 container_port = 8080 23 } 24 25 depends_on = [aws_lb_listener.https] 26} Deployment Strategy: This configuration ensures zero-downtime updates through the following sequence:\nNew tasks are initiated, temporarily increasing capacity to 200%. New tasks successfully pass health checks (Keycloak startup is allotted 600 seconds). Traffic is progressively diverted to the new tasks. Old tasks are gracefully drained and terminated. The system returns to its stable state of 100% capacity (2 tasks). 
Deployment Walkthrough This section provides a comprehensive, step-by-step guide for deploying the Keycloak infrastructure.\nPrerequisites Local Tools:\nTerraform \u0026gt;= 1.0 AWS CLI v2, configured with appropriate credentials Docker (for building container images) jq (for JSON parsing in scripts) make (optional, for simplified command execution) AWS Permissions:\nCreation of VPC, Subnet, Security Group, and NAT Gateway resources. Provisioning of RDS Aurora cluster and instances. Management of ECS cluster, task definitions, and services. Creation of ECR repositories and pushing container images. Creation of IAM roles for ECS task execution. Read/write access to SSM Parameter Store. Creation of ACM certificates (or access to an existing certificate ARN). Step 1: Clone and Initialize Infrastructure 1# Clone the repository 2git clone https://github.com/your-org/terraform-keycloak-aws.git 3cd terraform-keycloak-aws 4 5# Create a new environment 6cp -r environments/template environments/production 7cd environments/production 8 9# Configure terraform.tfvars 10cat \u0026gt; terraform.tfvars \u0026lt;\u0026lt;EOF 11aws_region = \u0026#34;us-east-1\u0026#34; 12environment = \u0026#34;production\u0026#34; 13vpc_cidr = \u0026#34;10.0.0.0/16\u0026#34; 14availability_zones = [\u0026#34;us-east-1a\u0026#34;, \u0026#34;us-east-1b\u0026#34;] 15 16# Start with 0 to avoid costs during initial setup 17desired_count = 0 18 19# Use existing ACM certificate or create new one 20certificate_arn = \u0026#34;arn:aws:acm:us-east-1:ACCOUNT:certificate/CERT_ID\u0026#34; 21domain_name = \u0026#34;auth.example.com\u0026#34; 22 23# Database configuration 24db_instance_class = \u0026#34;db.serverless\u0026#34; 25db_allocated_storage = 20 26db_engine_version = \u0026#34;16.8\u0026#34; 27 28# Aurora Serverless v2 scaling 29aurora_serverless_min_capacity = 0.5 30aurora_serverless_max_capacity = 2 31EOF 32 33# Initialize and create infrastructure (no running tasks yet) 34make all 35# Or manually: 
36# terraform init 37# terraform plan 38# terraform apply Result: The VPC, subnets, NAT gateways, Aurora RDS, ECS cluster, and ALB are successfully provisioned. No ECS tasks are yet operational.\nStep 2: Build and Push Container Image 1cd ../../build/keycloak 2 3# Configure environment 4export AWS_REGION=us-east-1 5export ENV_NAME=production 6 7# Build and push (uses Makefile automation) 8make all 9 10# Or manually: 11# aws ecr get-login-password --region us-east-1 | \\ 12# docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com 13# docker build -t keycloak-mcp:latest . 14# docker tag keycloak-mcp:latest ECR_URL:latest 15# docker push ECR_URL:latest Result: A custom Keycloak container image, configured with JDBC_PING clustering, is built and pushed to Amazon ECR.\nStep 3: Scale Up ECS Service 1cd ../../environments/production 2 3# Update terraform.tfvars 4sed -i \u0026#39;s/desired_count = 0/desired_count = 2/\u0026#39; terraform.tfvars 5 6# Apply changes 7make update 8# Or: terraform apply Result: Two Keycloak containers are launched within private subnets, establish a cluster via JDBC_PING, and register with the ALB. 
Keycloak becomes accessible at https://auth.example.com/auth.\nStep 4: Create Admin User 1# Get admin password from SSM Parameter Store 2ADMIN_PASSWORD=$(aws ssm get-parameter \\ 3 --name \u0026#34;/keycloak/production/admin_password\u0026#34; \\ 4 --with-decryption \\ 5 --query Parameter.Value \\ 6 --output text) 7 8echo \u0026#34;Admin URL: https://auth.example.com/auth/admin\u0026#34; 9echo \u0026#34;Username: admin\u0026#34; 10echo \u0026#34;Password: ${ADMIN_PASSWORD}\u0026#34; Log in to the Keycloak admin console to verify the deployment's integrity.\nStep 5: Configure MCP OAuth Realm 1cd mcp-oauth 2 3# Auto-generate configuration from parent deployment 4./init-from-parent.sh --mcp-server-url \u0026#34;https://mcp-server.example.com/mcp\u0026#34; 5 6# Review generated terraform.tfvars 7cat terraform.tfvars 8 9# Deploy MCP OAuth realm (Terraform + REST API) 10make deploy 11 12# Verify Dynamic Client Registration 13./enable-dcr.sh Result: The MCP realm is successfully created and configured with:\nAn mcp:run client scope, incorporating the audience mapper. Properly configured realm default scopes. Enabled and verified Dynamic Client Registration. Removal of the Trusted Hosts policy. Step 6: Test with MCP Client Configure an MCP client (e.g., Claude Code, Cursor, VS Code) to establish a connection:\nMCP Server Configuration Example:\n1{ 2 \u0026#34;servers\u0026#34;: { 3 \u0026#34;my-mcp-server\u0026#34;: { 4 \u0026#34;url\u0026#34;: \u0026#34;https://mcp-server.example.com/mcp\u0026#34;, 5 \u0026#34;auth\u0026#34;: { 6 \u0026#34;type\u0026#34;: \u0026#34;oauth\u0026#34;, 7 \u0026#34;authorizationUrl\u0026#34;: \u0026#34;https://auth.example.com/auth/realms/mcp/protocol/openid-connect/auth\u0026#34;, 8 \u0026#34;tokenUrl\u0026#34;: \u0026#34;https://auth.example.com/auth/realms/mcp/protocol/openid-connect/token\u0026#34; 9 } 10 } 11 } 12} Expected Flow:\nThe MCP client discovers Authorization Server metadata from Keycloak's OIDC discovery endpoint. 
The client dynamically registers via DCR, obtaining a client_id. The client initiates the Authorization Code flow with PKCE. The user authenticates through Keycloak. The client receives a JWT access token containing aud: \u0026quot;https://mcp-server.example.com/mcp\u0026quot;. The MCP server validates the token's aud claim and grants access to MCP resources. Deployment Considerations High Availability Configuration The deployment is engineered for high availability, incorporating redundancy across multiple layers:\nMulti-AZ Distribution:\nECS Tasks: Distributed across two or more Availability Zones using ECS placement strategies. Aurora: Multi-AZ cluster with automated failover, achieving a Recovery Time Objective (RTO) typically under 60 seconds. ALB: Employs cross-zone load balancing by default. NAT Gateways: One per AZ (total of two) for independent outbound connectivity. Failure Scenarios:\nSingle ECS Task Failure: The ALB reroutes traffic to healthy tasks, and ECS automatically initiates a replacement. Availability Zone Failure: Aurora promotes a replica, and ECS tasks in other AZs continue to serve traffic. Database Primary Failure: Aurora automatically fails over to a replica in a different AZ. Zero-Downtime Deployments:\nThe deployment_minimum_healthy_percent = 100 configuration ensures continuous full capacity during service updates:\n1Initial state: [Task1] [Task2] (100% capacity) 2Update triggered: [Task1] [Task2] [Task3] [Task4] (200% capacity) 3Health checks: [Task1] [Task2] [Task3✓] [Task4✓] 4Drain old: [Task3✓] [Task4✓] 5Final state: [Task3] [Task4] (100% capacity) Monitoring and Observability CloudWatch Logs:\nAll container output is streamed to CloudWatch Logs with configurable retention policies:\n1resource \u0026#34;aws_cloudwatch_log_group\u0026#34; \u0026#34;keycloak\u0026#34; { 2 name = \u0026#34;/ecs/keycloak\u0026#34; 3 retention_in_days = 30 4} Key Logs to Monitor:\nAuthentication failures: Search for WARN.*org.keycloak.events. 
Database errors: Search for ERROR.*Hibernate. Clustering issues: Search for WARN.*JGroups or JDBC_PING. Container Insights:\nECS Container Insights should be enabled for comprehensive cluster-level metrics:\n1resource \u0026#34;aws_ecs_cluster\u0026#34; \u0026#34;main\u0026#34; { 2 name = \u0026#34;keycloak-cluster\u0026#34; 3 4 setting { 5 name = \u0026#34;containerInsights\u0026#34; 6 value = \u0026#34;enabled\u0026#34; 7 } 8} This provides metrics for:\nCPU and memory utilization per task. Network ingress/egress. Task startup and health check durations. Keycloak Built-in Endpoints:\nHealth: https://auth.example.com/auth/health → Returns {\u0026quot;status\u0026quot;: \u0026quot;UP\u0026quot;}. Metrics (Prometheus): https://auth.example.com/auth/metrics → Offers detailed application metrics. Server Info: Accessible via Admin Console → Server Info → provides version, clustering status, and memory usage. Recommended Alarms:\n1resource \u0026#34;aws_cloudwatch_metric_alarm\u0026#34; \u0026#34;ecs_cpu_high\u0026#34; { 2 alarm_name = \u0026#34;keycloak-cpu-high\u0026#34; 3 comparison_operator = \u0026#34;GreaterThanThreshold\u0026#34; 4 evaluation_periods = \u0026#34;2\u0026#34; 5 metric_name = \u0026#34;CPUUtilization\u0026#34; 6 namespace = \u0026#34;AWS/ECS\u0026#34; 7 period = \u0026#34;300\u0026#34; 8 statistic = \u0026#34;Average\u0026#34; 9 threshold = \u0026#34;80\u0026#34; 10 alarm_description = \u0026#34;ECS CPU utilization is too high\u0026#34; 11 alarm_actions = [aws_sns_topic.alerts.arn] 12 13 dimensions = { 14 ClusterName = aws_ecs_cluster.main.name 15 ServiceName = aws_ecs_service.keycloak.name 16 } 17} 18 19resource \u0026#34;aws_cloudwatch_metric_alarm\u0026#34; \u0026#34;aurora_cpu_high\u0026#34; { 20 alarm_name = \u0026#34;keycloak-db-cpu-high\u0026#34; 21 comparison_operator = \u0026#34;GreaterThanThreshold\u0026#34; 22 evaluation_periods = \u0026#34;2\u0026#34; 23 metric_name = \u0026#34;CPUUtilization\u0026#34; 24 namespace = 
\u0026#34;AWS/RDS\u0026#34; 25 period = \u0026#34;300\u0026#34; 26 statistic = \u0026#34;Average\u0026#34; 27 threshold = \u0026#34;80\u0026#34; 28 alarm_description = \u0026#34;Aurora CPU utilization is too high\u0026#34; 29 alarm_actions = [aws_sns_topic.alerts.arn] 30 31 dimensions = { 32 DBClusterIdentifier = module.aurora_postgresql.cluster_id 33 } 34} Security Best Practices Encryption Everywhere:\nALB: Enforces TLS 1.2+ with ACM certificates. RDS: AES-256 encryption at rest, managed by KMS. ECR: Encrypted container images. Parameter Store: SecureString encryption for sensitive credentials. In-transit: All external communication is secured via HTTPS. Secrets Management:\nSensitive values are securely stored in AWS Systems Manager Parameter Store:\n1# Store admin password 2aws ssm put-parameter \\ 3 --name \u0026#34;/keycloak/production/admin_password\u0026#34; \\ 4 --value \u0026#34;$(openssl rand -base64 32)\u0026#34; \\ 5 --type SecureString 6 7# Store database password 8aws ssm put-parameter \\ 9 --name \u0026#34;/keycloak/production/db_password\u0026#34; \\ 10 --value \u0026#34;$(openssl rand -base64 32)\u0026#34; \\ 11 --type SecureString ECS tasks retrieve secrets at runtime through IAM role permissions, eliminating hardcoded credentials in task definitions or source code.\nNetwork Isolation:\nECS Tasks: Confined to private subnets (no direct internet access). RDS: Located in private subnets, accessible only from the ECS security group. ALB: Deployed in public subnets (internet-facing). NAT Gateways: Reside in public subnets, providing outbound-only internet access for private subnets. 
Security Group Rules:\n1# ALB security group - allow HTTPS from anywhere 2resource \u0026#34;aws_security_group\u0026#34; \u0026#34;alb\u0026#34; { 3 name_prefix = \u0026#34;keycloak-alb-\u0026#34; 4 vpc_id = module.vpc.vpc_id 5 6 ingress { 7 from_port = 443 8 to_port = 443 9 protocol = \u0026#34;tcp\u0026#34; 10 cidr_blocks = [\u0026#34;0.0.0.0/0\u0026#34;] 11 } 12 13 egress { 14 from_port = 0 15 to_port = 0 16 protocol = \u0026#34;-1\u0026#34; 17 cidr_blocks = [\u0026#34;0.0.0.0/0\u0026#34;] 18 } 19} 20 21# ECS security group - allow traffic only from ALB 22resource \u0026#34;aws_security_group\u0026#34; \u0026#34;keycloak_ecs\u0026#34; { 23 name_prefix = \u0026#34;keycloak-ecs-\u0026#34; 24 vpc_id = module.vpc.vpc_id 25 26 ingress { 27 from_port = 8080 28 to_port = 8080 29 protocol = \u0026#34;tcp\u0026#34; 30 security_groups = [aws_security_group.alb.id] 31 } 32 33 # Allow clustering between ECS tasks 34 ingress { 35 from_port = 7800 36 to_port = 7800 37 protocol = \u0026#34;tcp\u0026#34; 38 self = true 39 } 40 41 egress { 42 from_port = 0 43 to_port = 0 44 protocol = \u0026#34;-1\u0026#34; 45 cidr_blocks = [\u0026#34;0.0.0.0/0\u0026#34;] 46 } 47} 48 49# RDS security group - allow traffic only from ECS 50resource \u0026#34;aws_security_group\u0026#34; \u0026#34;aurora\u0026#34; { 51 name_prefix = \u0026#34;keycloak-aurora-\u0026#34; 52 vpc_id = module.vpc.vpc_id 53 54 ingress { 55 from_port = 5432 56 to_port = 5432 57 protocol = \u0026#34;tcp\u0026#34; 58 security_groups = [aws_security_group.keycloak_ecs.id] 59 } 60} Troubleshooting Common Issues Issue 1: DCR Clients Missing mcp:run Scope\nSymptoms: Dynamically registered clients receive tokens without the aud claim, or with aud: [] (an empty array).\nRoot Cause: The mcp:run scope is not configured as a realm default scope, or the Client Registration Policy does not permit it.\nSolution:\n1# Verify realm default scopes 2curl -s \u0026#34;${KEYCLOAK_URL}/admin/realms/mcp\u0026#34; \\ 3 -H 
\u0026#34;Authorization: Bearer ${ADMIN_TOKEN}\u0026#34; | \\ 4 jq \u0026#39;.defaultDefaultClientScopes\u0026#39; 5 6# Expected output should include \u0026#34;mcp:run\u0026#34;. 7# If missing, recreate the realm or manually add via the admin console: 8# Realm Settings → Client Scopes → Default Client Scopes → Add \u0026#34;mcp:run\u0026#34;. 9 10# Verify Client Registration Policy allows mcp:run 11cd environments/production/mcp-oauth 12./fix-allowed-scopes.sh Issue 2: Trusted Hosts Policy Blocking Custom Redirect URIs\nSymptoms: Dynamic Client Registration succeeds, but authorization requests fail with an \u0026quot;Invalid redirect URI\u0026quot; error for schemes like cursor:// or vscode://.\nRoot Cause: Keycloak's Trusted Hosts policy defaults to rejecting non-HTTPS schemes.\nSolution:\n1cd environments/production/mcp-oauth 2./disable-trusted-hosts.sh Verify within the admin console: Client Registration → Policies → (confirm \u0026quot;Trusted Hosts\u0026quot; policy is absent).\nIssue 3: Clustering Failures (\u0026quot;Split Brain\u0026quot;)\nSymptoms: Users experience inconsistent authentication states, or sessions unexpectedly expire. Logs display repeated VIEW_CHANGE messages or WARN.*JDBC_PING errors.\nRoot Cause: JDBC_PING communication issues, typically due to:\nDatabase connectivity problems. Security group rules blocking port 7800. Multiple tasks simultaneously attempting to write to the JGROUPSPING table. Solution:\n1# Check JGROUPSPING table 2psql -h AURORA_ENDPOINT -U keycloak -d keycloak -c \u0026#34;SELECT * FROM JGROUPSPING;\u0026#34; 3 4# The table should display one row per running ECS task. 5# If empty or stale, investigate: 6# 1. Ensure the security group permits port 7800 between ECS tasks. 7# 2. Verify the `KC_DB_URL` environment variable is correct. 8# 3. Confirm database credentials are valid. 
9 10# Restart the ECS service to force a cluster re-join 11aws ecs update-service \\ 12 --cluster keycloak-cluster \\ 13 --service keycloak \\ 14 --force-new-deployment Issue 4: ALB Health Checks Failing\nSymptoms: ECS tasks start and pass their initial health checks, but then repeatedly fail and restart.\nRoot Cause: The health check path /auth/health may not respond promptly during Keycloak startup (which can take 60-120 seconds), or the health check interval is too aggressive.\nSolution:\nIncrease the health check grace period:\n1resource \u0026#34;aws_ecs_service\u0026#34; \u0026#34;keycloak\u0026#34; { 2 # ... 3 health_check_grace_period_seconds = 600 # 10 minutes 4} Alternatively, use a more reliable health check path:\n1resource \u0026#34;aws_lb_target_group\u0026#34; \u0026#34;keycloak\u0026#34; { 2 # ... 3 health_check { 4 enabled = true 5 path = \u0026#34;/auth/realms/master\u0026#34; # More reliable than /auth/health 6 port = \u0026#34;traffic-port\u0026#34; 7 protocol = \u0026#34;HTTP\u0026#34; 8 timeout = 5 9 interval = 30 10 healthy_threshold = 2 11 unhealthy_threshold = 3 12 matcher = \u0026#34;200\u0026#34; 13 } 14} Conclusion This guide showed how to configure Keycloak as an MCP-compatible OAuth 2.1 authorization server. By using Keycloak's protocol mappers and realm configuration, the solution works around the platform's lack of native RFC 8707 support while giving MCP clients a zero-configuration experience.\nKey Implementation Takeaways:\nAudience Mapper Workaround: A hardcoded-claim protocol mapper injects the required aud claim into JWT access tokens, compensating for Keycloak's lack of RFC 8707 support. Realm Default Scopes: Configuring mcp:run as a realm default scope ensures that all dynamically registered clients automatically inherit the audience mapper. 
Two-Phase Configuration: A hybrid approach, combining Terraform resources with REST API configuration, effectively navigates the current limitations of the Keycloak Terraform provider concerning Client Registration Policies. Automated Infrastructure: Terraform modules facilitate repeatable AWS deployments, incorporating ECS Fargate, Aurora PostgreSQL, and essential networking components. When to Choose Keycloak for MCP:\nSelf-hosted Requirements: When on-premises or private cloud deployment is mandated. Keycloak-Specific Features: When advanced features like user federation, identity brokering, or custom authentication flows are necessary. Cost Considerations: For scenarios prioritizing open-source licensing with infrastructure-only costs. Customization Needs: When full control over authentication flows and user management is essential. Alternative Solutions for Consideration:\nFor organizations preferring managed identity solutions, IDaaS providers offering native RFC 8707 support should be evaluated:\nAmazon Cognito: Provides native RFC 8707 support, though it requires custom implementation for the DCR endpoint. Ping Identity: Offers comprehensive RFC 8707 and RFC 7591 compliance (as detailed in the MCP authorization compatibility analysis). For a deeper technical understanding of the OAuth 2.1 specifications and RFC requirements underpinning MCP authorization, refer to my comprehensive analysis: Technical Deconstruction of MCP Authorization: A Deep Dive into OAuth 2.1 and IETF RFC Specifications.\nThe complete Terraform configuration, Dockerfile, and deployment automation scripts are available in the terraform-keycloak-aws repository.\nResources \u0026amp; References Official Documentation Keycloak Documentation: Official Keycloak server administration and configuration guide. AWS ECS Best Practices: AWS guidance on container orchestration with ECS Fargate. Aurora Serverless v2 Documentation: Detailed information on Aurora Serverless scaling and configuration. 
OAuth and MCP Specifications RFC 7591 - OAuth 2.0 Dynamic Client Registration: Dynamic client registration protocol specification. RFC 7636 - Proof Key for Code Exchange (PKCE): PKCE specification for authorization code flow protection. RFC 8707 - Resource Indicators for OAuth 2.0: Audience restriction via resource parameter. Model Context Protocol Specification: Official MCP specification and requirements. Related Articles Technical Deconstruction of MCP Authorization: A Deep Dive into OAuth 2.1 and IETF RFC Specifications: Comprehensive analysis of OAuth 2.1 RFCs and MCP requirements. Building an MCP Agentic Chatbot on AWS: Patterns for MCP server implementation on AWS. Using MCP Client OAuthClientProvider with AWS Agentcore: Practical OAuth client implementation with MCP. GitHub Repository terraform-keycloak-aws: Complete source code for this deployment, including Terraform modules, Dockerfile, and deployment automation scripts. ","link":"https://kane.mx/posts/2025/deploy-keycloak-aws-mcp-oauth/","section":"posts","tags":["Keycloak","MCP","Model Context Protocol","OAuth 2.1","RFC 8707","Dynamic Client Registration","PKCE","AWS","Terraform","Identity Provider"],"title":"Implementing MCP OAuth 2.1 with Keycloak on AWS"},{"body":"","link":"https://kane.mx/tags/keycloak/","section":"tags","tags":null,"title":"Keycloak"},{"body":"","link":"https://kane.mx/tags/rfc-8707/","section":"tags","tags":null,"title":"RFC 8707"},{"body":"","link":"https://kane.mx/tags/edge-computing/","section":"tags","tags":null,"title":"Edge Computing"},{"body":"","link":"https://kane.mx/tags/esp32/","section":"tags","tags":null,"title":"ESP32"},{"body":"","link":"https://kane.mx/categories/iot/","section":"categories","tags":null,"title":"IoT"},{"body":"","link":"https://kane.mx/tags/iot/","section":"tags","tags":null,"title":"IoT"},{"body":"","link":"https://kane.mx/tags/voice-assistant/","section":"tags","tags":null,"title":"Voice 
Assistant"},{"body":"","link":"https://kane.mx/tags/websocket/","section":"tags","tags":null,"title":"WebSocket"},{"body":"","link":"https://kane.mx/tags/xiaozhi/","section":"tags","tags":null,"title":"Xiaozhi"},{"body":"The Xiaozhi hardware is an impressive ESP32-based AI voice assistant capable of offline wake-up, multi-language support, and cloud connectivity. But what if you want your Xiaozhi device to access multiple AI tools, APIs, and services without managing complex integrations on the hardware side? This is where Amazon Bedrock AgentCore Gateway shines as a unified aggregation layer for Model Context Protocol (MCP) servers.\nIn this guide, I'll walk you through building a distributed MCP architecture that connects Xiaozhi hardware to multiple cloud services through a single WebSocket connection, leveraging AgentCore Gateway to aggregate tools ranging from simple calculators to complex RESTful APIs like real-time football data.\nThe Challenge: Connecting Edge Devices to Multiple AI Tools Xiaozhi hardware excels at voice interaction and local control, but extending its capabilities to access dozens of cloud services presents several challenges:\nConnection Management: Each MCP server requires its own connection, protocol handling, and authentication Resource Constraints: ESP32 devices have limited memory and processing power for managing multiple connections API Key Security: Storing numerous API keys on edge devices poses security risks Scalability: Adding new tools requires firmware updates and device reconfiguration The solution? 
Use a gateway pattern to aggregate all MCP servers into a single endpoint that your Xiaozhi device can access through one WebSocket connection.\nArchitecture Overview Our architecture consists of five key components working together:\ngraph TD A[Xiaozhi ESP32 Hardware] --\u0026gt;|WebSocket| B[MCP Pipe Bridge] B --\u0026gt;|stdio| C[MCP Proxy for AWS] C --\u0026gt;|HTTPS + SigV4| D[AgentCore Gateway] D --\u0026gt;|Proxies| E1[Calculator MCP Server] D --\u0026gt;|Proxies| E2[Football API OpenAPI] D --\u0026gt;|Proxies| E3[Other MCP Tools] style A fill:#E37222,stroke:#d66820,stroke-width:3px,color:#fff style B fill:#4a5568,stroke:#3d4555,stroke-width:2px,color:#fff style C fill:#4a5568,stroke:#3d4555,stroke-width:2px,color:#fff style D fill:#232F3E,stroke:#00A4A6,stroke-width:3px,color:#fff style E1 fill:#3d4555,stroke:#545B64,stroke-width:2px,color:#fff style E2 fill:#3d4555,stroke:#545B64,stroke-width:2px,color:#fff style E3 fill:#3d4555,stroke:#545B64,stroke-width:2px,color:#fff Component Breakdown:\nXiaozhi Hardware: ESP32-based voice assistant that connects via WebSocket to the MCP endpoint MCP Pipe Bridge: A Python-based bidirectional bridge that translates between WebSocket and stdio protocols, managing multiple MCP server processes MCP Proxy for AWS: A specialized proxy that translates MCP stdio protocol to AWS HTTP/SSE with SigV4 authentication Amazon Bedrock AgentCore Gateway: Managed service that aggregates multiple MCP targets into a unified interface with IAM-based authentication Gateway Targets: Individual MCP servers (local tools) and OpenAPI endpoints (RESTful APIs) Implementation Guide Part 1: Setting Up the MCP Pipe Bridge The MCP Pipe Bridge is the crucial component that connects your Xiaozhi hardware to the cloud-based gateway. 
It manages WebSocket connections and spawns MCP server processes.\nConfiguration File Create mcp_config.json to define your MCP servers:\n1{ 2 \u0026#34;mcpServers\u0026#34;: { 3 \u0026#34;aws-proxy-gateway\u0026#34;: { 4 \u0026#34;type\u0026#34;: \u0026#34;stdio\u0026#34;, 5 \u0026#34;command\u0026#34;: \u0026#34;uvx\u0026#34;, 6 \u0026#34;args\u0026#34;: [ 7 \u0026#34;mcp-proxy-for-aws@latest\u0026#34;, 8 \u0026#34;https://YOUR-GATEWAY-ID.gateway.bedrock-agentcore.us-west-2.amazonaws.com/mcp\u0026#34;, 9 \u0026#34;--region\u0026#34;, 10 \u0026#34;us-west-2\u0026#34;, 11 \u0026#34;--log-level\u0026#34;, 12 \u0026#34;DEBUG\u0026#34;, 13 \u0026#34;--service\u0026#34;, 14 \u0026#34;bedrock-agentcore\u0026#34; 15 ] 16 } 17 } 18} The Bridge Implementation The MCP Pipe (mcp_pipe.py) handles bidirectional communication between WebSocket and stdio:\n1import asyncio 2import websockets 3import json 4import subprocess 5from typing import Dict, List 6 7class MCPPipe: 8 def __init__(self, websocket_url: str, config: dict): 9 self.websocket_url = websocket_url 10 self.config = config 11 self.processes: Dict[str, subprocess.Popen] = {} 12 13 async def start_server(self, name: str, server_config: dict): 14 \u0026#34;\u0026#34;\u0026#34;Launch an MCP server process\u0026#34;\u0026#34;\u0026#34; 15 process = subprocess.Popen( 16 [server_config[\u0026#39;command\u0026#39;]] + server_config.get(\u0026#39;args\u0026#39;, []), 17 stdin=subprocess.PIPE, 18 stdout=subprocess.PIPE, 19 stderr=subprocess.PIPE, 20 text=True 21 ) 22 self.processes[name] = process 23 return process 24 25 async def bridge_messages(self, websocket, process): 26 \u0026#34;\u0026#34;\u0026#34;Bidirectional message forwarding\u0026#34;\u0026#34;\u0026#34; 27 async def ws_to_stdio(): 28 async for message in websocket: 29 # Forward WebSocket messages to process stdin 30 process.stdin.write(message + \u0026#39;\\n\u0026#39;) 31 process.stdin.flush() 32 33 async def stdio_to_ws(): 34 while True: 35 # Read from process 
stdout and send to WebSocket 36 line = await asyncio.to_thread(process.stdout.readline) 37 if line: 38 await websocket.send(line.strip()) 39 await asyncio.sleep(0.01) 40 41 await asyncio.gather(ws_to_stdio(), stdio_to_ws()) Key Features:\nAuto-reconnection: Exponential backoff (1s → 600s max) for resilience Multi-server management: Spawns and monitors multiple child processes Bidirectional streaming: Real-time message forwarding in both directions Docker Deployment For production deployment, use Docker with systemd auto-start:\n1FROM python:3.13-slim 2 3# Install uv and uvx system-wide from the official image (the slim base has no curl) 4COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ 5 6# Copy application files 7WORKDIR /app 8COPY requirements.txt mcp_pipe.py mcp_config.json ./ 9RUN pip install --no-cache-dir -r requirements.txt 10 11# Run as non-root user 12RUN useradd -m -u 1000 mcpuser 13USER mcpuser 14 15CMD [\u0026#34;python\u0026#34;, \u0026#34;mcp_pipe.py\u0026#34;] Docker Compose (docker-compose.yml):\n1services: 2 mcp-pipe: 3 build: . 4 restart: always 5 network_mode: \u0026#34;host\u0026#34; # Access EC2 instance metadata 6 env_file: .env 7 environment: 8 - AWS_REGION=us-west-2 9 - MCP_ENDPOINT=wss://api.xiaozhi.me/mcp/?token=${XIAOZHI_TOKEN} 10 volumes: 11 - ./mcp_config.json:/app/mcp_config.json:ro Part 2: Deploying Amazon Bedrock AgentCore Gateway Amazon Bedrock AgentCore Gateway provides the aggregation layer that combines multiple MCP targets into a single endpoint.\nGateway Creation The gateway can be created through AWS Console or CLI. 
The key configuration includes:\n1# Create gateway 2aws bedrock-agentcore-control create-gateway \\ 3 --name xiaozhi-gateway \\ 4 --description \u0026#34;MCP aggregation gateway for Xiaozhi hardware\u0026#34; \\ 5 --region us-west-2 6 7# Note the gateway identifier returned (example format) 8# Example output: YOUR-GATEWAY-ID IAM Permissions The gateway requires specific IAM permissions to access credential providers and secrets:\n1{ 2 \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, 3 \u0026#34;Statement\u0026#34;: [ 4 { 5 \u0026#34;Sid\u0026#34;: \u0026#34;GetWorkloadAccessToken\u0026#34;, 6 \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, 7 \u0026#34;Action\u0026#34;: [\u0026#34;bedrock-agentcore:GetWorkloadAccessToken\u0026#34;], 8 \u0026#34;Resource\u0026#34;: \u0026#34;*\u0026#34; 9 }, 10 { 11 \u0026#34;Sid\u0026#34;: \u0026#34;GetResourceApiKey\u0026#34;, 12 \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, 13 \u0026#34;Action\u0026#34;: [\u0026#34;bedrock-agentcore:GetResourceApiKey\u0026#34;], 14 \u0026#34;Resource\u0026#34;: \u0026#34;*\u0026#34; 15 }, 16 { 17 \u0026#34;Sid\u0026#34;: \u0026#34;GetCredentials\u0026#34;, 18 \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, 19 \u0026#34;Action\u0026#34;: [\u0026#34;secretsmanager:GetSecretValue\u0026#34;], 20 \u0026#34;Resource\u0026#34;: [\u0026#34;arn:aws:secretsmanager:*:*:secret:bedrock-agentcore-identity!*\u0026#34;] 21 } 22 ] 23} Connecting via MCP Proxy for AWS The mcp-proxy-for-aws package handles authentication and protocol translation:\n1# Install and run the proxy 2uvx mcp-proxy-for-aws@latest \\ 3 https://YOUR-GATEWAY-ID.gateway.bedrock-agentcore.us-west-2.amazonaws.com/mcp \\ 4 --region us-west-2 \\ 5 --service bedrock-agentcore \\ 6 --log-level DEBUG The proxy automatically:\nUses EC2 instance profile or local AWS credentials Signs requests with AWS Signature Version 4 (SigV4) Translates MCP stdio protocol to HTTP/SSE Handles streaming responses from the gateway Part 3: 
Adding Gateway Targets Now let's add actual functionality by configuring gateway targets.\nExample 1: Local Calculator MCP Server A simple calculator tool demonstrates local MCP server integration:\n1from fastmcp import FastMCP 2import math 3import ast 4import operator 5 6mcp = FastMCP(\u0026#34;Calculator\u0026#34;) 7 8# Safe operators for mathematical expressions 9SAFE_OPERATORS = { 10 ast.Add: operator.add, 11 ast.Sub: operator.sub, 12 ast.Mult: operator.mul, 13 ast.Div: operator.truediv, 14 ast.Pow: operator.pow, 15 ast.USub: operator.neg, 16} 17 18def safe_eval_math(expression: str) -\u0026gt; float: 19 \u0026#34;\u0026#34;\u0026#34; 20 Safely evaluate mathematical expressions using AST parsing. 21 Only allows basic arithmetic operations and whitelisted math functions. 22 \u0026#34;\u0026#34;\u0026#34; 23 tree = ast.parse(expression, mode=\u0026#39;eval\u0026#39;) 24 25 safe_funcs = { 26 \u0026#39;sqrt\u0026#39;: math.sqrt, 27 \u0026#39;sin\u0026#39;: math.sin, 28 \u0026#39;cos\u0026#39;: math.cos, 29 \u0026#39;tan\u0026#39;: math.tan, 30 \u0026#39;log\u0026#39;: math.log, 31 \u0026#39;abs\u0026#39;: abs, 32 } 33 34 def eval_node(node): 35 if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)): 36 return node.value 37 elif isinstance(node, ast.BinOp): 38 left = eval_node(node.left) 39 right = eval_node(node.right) 40 return SAFE_OPERATORS[type(node.op)](left, right) 41 elif isinstance(node, ast.UnaryOp): 42 operand = eval_node(node.operand) 43 return SAFE_OPERATORS[type(node.op)](operand) 44 elif isinstance(node, ast.Call): 45 if isinstance(node.func, ast.Name): 46 func_name = node.func.id 47 if func_name in safe_funcs: 48 args = [eval_node(arg) for arg in node.args] 49 return safe_funcs[func_name](*args) 50 raise ValueError(\u0026#34;Function not allowed\u0026#34;) 51 else: 52 raise ValueError(\u0026#34;Unsupported operation\u0026#34;) 53 54 return eval_node(tree.body) 55 56@mcp.tool() 57def calculator(expression: str) -\u0026gt; dict: 58 \u0026#34;\u0026#34;\u0026#34; 59 Safely evaluate 
mathematical expressions. 60 Supports: +, -, *, /, **, and math functions (sqrt, sin, cos, tan, log, abs) 61 62 Examples: 63 - \u0026#34;25 * 17\u0026#34; 64 - \u0026#34;sqrt(144)\u0026#34; 65 - \u0026#34;2 ** 10\u0026#34; 66 \u0026#34;\u0026#34;\u0026#34; 67 try: 68 result = safe_eval_math(expression) 69 return {\u0026#34;success\u0026#34;: True, \u0026#34;result\u0026#34;: result} 70 except Exception as e: 71 return {\u0026#34;success\u0026#34;: False, \u0026#34;error\u0026#34;: str(e)} 72 73if __name__ == \u0026#34;__main__\u0026#34;: 74 mcp.run(transport=\u0026#34;stdio\u0026#34;) Security Note: This implementation uses AST (Abstract Syntax Tree) parsing to safely evaluate mathematical expressions without the security risks of arbitrary code execution. It only permits whitelisted operations and functions, preventing code injection attacks.\nUsage Flow:\nUser asks Xiaozhi: \u0026quot;What is 25 times 17?\u0026quot; Request flows: Xiaozhi → WebSocket → MCP Pipe → Calculator Calculator safely evaluates: safe_eval_math(\u0026quot;25 * 17\u0026quot;) → 425 Response returns through the chain Xiaozhi responds: \u0026quot;The result is 425\u0026quot; Example 2: Football API via OpenAPI Target For external APIs, use OpenAPI targets with credential providers.\nStep 1: Create Credential Provider\n1# Create provider for API key storage 2aws bedrock-agentcore-control create-api-key-credential-provider \\ 3 --name FootballAPICredentialProvider \\ 4 --description \u0026#34;RapidAPI Football API Key\u0026#34; 5 6# Store the API key 7aws bedrock-agentcore-control update-api-key-credential-provider \\ 8 --name FootballAPICredentialProvider \\ 9 --api-key \u0026#34;YOUR_RAPIDAPI_KEY\u0026#34; Step 2: Define OpenAPI Schema\nCreate football-api-openapi.yaml with essential league IDs embedded in descriptions to reduce API calls:\n1openapi: 3.0.3 2info: 3 title: Football API 4 version: 1.0.0 5 description: | 6 Access live football data including fixtures, standings, and statistics. 
7 8 **Common League IDs** (use directly to avoid extra API calls): 9 - Premier League: 39 10 - La Liga: 140 11 - Bundesliga: 78 12 - Serie A: 135 13 - Champions League: 2 14 - Europa League: 3 15 16servers: 17 - url: https://api-football-v1.p.rapidapi.com/v3 18 19paths: 20 /standings: 21 get: 22 operationId: getStandings 23 summary: Get league standings 24 parameters: 25 - name: league 26 in: query 27 required: true 28 schema: 29 type: integer 30 description: League ID (e.g., 39 for Premier League) 31 - name: season 32 in: query 33 required: true 34 schema: 35 type: integer 36 description: Season year (e.g., 2025) 37 responses: 38 \u0026#39;200\u0026#39;: 39 description: Standings data 40 41 /fixtures: 42 get: 43 operationId: getFixtures 44 summary: Get match fixtures 45 parameters: 46 - name: league 47 in: query 48 schema: 49 type: integer 50 - name: season 51 in: query 52 required: true 53 schema: 54 type: integer 55 - name: date 56 in: query 57 schema: 58 type: string 59 description: Date in YYYY-MM-DD format 60 responses: 61 \u0026#39;200\u0026#39;: 62 description: Fixtures data Step 3: Configure Gateway Target\nThe gateway target configuration links the OpenAPI schema with credential injection:\n1{ 2 \u0026#34;gatewayIdentifier\u0026#34;: \u0026#34;YOUR-GATEWAY-ID\u0026#34;, 3 \u0026#34;name\u0026#34;: \u0026#34;FootballAPITarget\u0026#34;, 4 \u0026#34;targetConfiguration\u0026#34;: { 5 \u0026#34;mcp\u0026#34;: { 6 \u0026#34;openApiSchema\u0026#34;: { 7 \u0026#34;s3\u0026#34;: { 8 \u0026#34;uri\u0026#34;: \u0026#34;s3://your-bucket/football-api-openapi.yaml\u0026#34; 9 } 10 } 11 } 12 }, 13 \u0026#34;credentialProviderConfigurations\u0026#34;: [{ 14 \u0026#34;credentialProviderType\u0026#34;: \u0026#34;API_KEY\u0026#34;, 15 \u0026#34;credentialProvider\u0026#34;: { 16 \u0026#34;apiKeyCredentialProvider\u0026#34;: { 17 \u0026#34;providerArn\u0026#34;: 
\u0026#34;arn:aws:bedrock-agentcore:us-west-2:YOUR-ACCOUNT-ID:token-vault/default/apikeycredentialprovider/FootballAPICredentialProvider\u0026#34;, 18 \u0026#34;credentialLocation\u0026#34;: \u0026#34;HEADER\u0026#34;, 19 \u0026#34;credentialParameterName\u0026#34;: \u0026#34;x-rapidapi-key\u0026#34; 20 } 21 } 22 }] 23} How It Works:\nLLM recognizes \u0026quot;Premier League\u0026quot; and league ID 39 from schema description Generates request: getStandings({ league: 39, season: 2025 }) Gateway retrieves API key from credential provider Injects headers: x-rapidapi-key and x-rapidapi-host Proxies to: https://api-football-v1.p.rapidapi.com/v3/standings?league=39\u0026amp;season=2025 Returns response through the chain to Xiaozhi Because common league IDs are embedded directly in the schema description, the LLM can skip a separate league-lookup request, which roughly halves the API calls for these queries (one call instead of two).\nRequest Flow Walkthrough Let's trace a complete request from voice query to spoken response:\nsequenceDiagram participant U as User participant X as Xiaozhi Hardware participant W as WebSocket participant P as MCP Pipe participant M as MCP Proxy participant G as AgentCore Gateway participant T as Target (API/Tool) U-\u0026gt;\u0026gt;X: \u0026#34;Show Premier League standings\u0026#34; X-\u0026gt;\u0026gt;W: JSON-RPC Request W-\u0026gt;\u0026gt;P: WebSocket Message P-\u0026gt;\u0026gt;M: stdio Message M-\u0026gt;\u0026gt;G: HTTPS + SigV4 Auth G-\u0026gt;\u0026gt;T: Proxied Request + API Key T--\u0026gt;\u0026gt;G: API Response G--\u0026gt;\u0026gt;M: JSON Response M--\u0026gt;\u0026gt;P: stdio Response P--\u0026gt;\u0026gt;W: WebSocket Message W--\u0026gt;\u0026gt;X: JSON-RPC Response X--\u0026gt;\u0026gt;U: Speaks standings Step-by-Step:\nUser speaks: \u0026quot;Show me Premier League standings\u0026quot; Xiaozhi processes: Converts speech to text, sends to LLM LLM determines tool: Recognizes need for getStandings with league ID 39 Request propagates: Xiaozhi → WebSocket → MCP Pipe → MCP Proxy → Gateway Gateway routes: Identifies 
Football API target, retrieves API key API call: Gateway proxies request to RapidAPI with authentication Response flows back: API → Gateway → MCP Proxy → MCP Pipe → WebSocket → Xiaozhi Xiaozhi speaks: \u0026quot;Here are the Premier League standings: Manchester City is first with 65 points...\u0026quot; Use Cases and Benefits This distributed MCP architecture enables powerful use cases:\nUse Cases Voice-Controlled Calculations: \u0026quot;What's the square root of 12,345?\u0026quot; Real-Time Sports Data: \u0026quot;When is Manchester United's next match?\u0026quot; Smart Home Integration: Control devices through natural language Personal Productivity: \u0026quot;Schedule a meeting for tomorrow at 3 PM\u0026quot; Information Retrieval: \u0026quot;What's the weather forecast for Tokyo?\u0026quot; Key Benefits 1. Unified Interface\nSingle WebSocket connection for all tools No complex client-side integration logic Consistent error handling and retry mechanisms 2. Scalability\nAdd new tools without firmware updates Independent scaling of gateway and targets Parallel request processing 3. Security\nAPI keys stored in AWS Secrets Manager IAM-based authentication and authorization No sensitive data on edge devices Safe execution environments for tools 4. Resilience\nAutomatic reconnection with exponential backoff Target-level health monitoring Graceful degradation if individual tools fail 5. 
Cost Optimization\nSchema optimization reduces API calls by 50% Shared credential providers across gateways Pay-per-use pricing for gateway requests Troubleshooting Tips When issues arise, check these common areas:\nConnection Issues: Verify WebSocket connectivity, authentication tokens, and network firewall rules Gateway Errors: Ensure IAM permissions are correctly configured on the gateway service role Tool Failures: Validate input schemas, check API rate limits, and review credential provider settings Logs: Use docker compose logs -f mcp-pipe to monitor the bridge process in real-time Conclusion By combining Xiaozhi hardware, MCP Pipe bridge, and Amazon Bedrock AgentCore Gateway, you've built a robust distributed AI architecture that brings cloud-scale capabilities to edge devices. This pattern demonstrates how Model Context Protocol can unify diverse tools and APIs into a single, manageable interface.\nThe architecture is extensible—you can add more MCP servers, integrate additional APIs through OpenAPI targets, or even connect multiple Xiaozhi devices to the same gateway. 
The gateway pattern provides a clean separation of concerns: hardware focuses on voice interaction, the bridge handles connectivity, and the gateway manages tool aggregation and authentication.\nAs MCP adoption grows, this architecture positions you to leverage new tools and services as they become available, all without touching your edge device firmware.\nResources Xiaozhi ESP32 Hardware Repository - Open-source AI voice assistant hardware based on ESP32 MCP Calculator Sample with Pipe \u0026amp; Docker - Example implementation of MCP Pipe bridge and Docker deployment for Xiaozhi AgentCore Gateway Football API Target - CDK implementation for deploying Football API as an OpenAPI target to AgentCore Gateway MCP Proxy for AWS - Official AWS proxy for connecting MCP clients to Amazon Bedrock AgentCore Gateway with SigV4 authentication Amazon Bedrock AgentCore Gateway Quick Start - Official AWS documentation for getting started with Amazon Bedrock AgentCore Gateway Model Context Protocol Specification - Complete MCP protocol specification and documentation ","link":"https://kane.mx/posts/2025/xiaozhi-agentcore-gateway-mcp/","section":"posts","tags":["AWS","Bedrock","AgentCore","MCP","Model Context Protocol","Xiaozhi","ESP32","Voice Assistant","IoT","Edge Computing","WebSocket"],"title":"Xiaozhi ESP32 MCP Gateway with Amazon Bedrock AgentCore"},{"body":"","link":"https://kane.mx/tags/ai-agents/","section":"tags","tags":null,"title":"AI Agents"},{"body":"","link":"https://kane.mx/tags/aws-bedrock/","section":"tags","tags":null,"title":"AWS Bedrock"},{"body":"You can stop hoping your Large Language Model (LLM) follows complex instructions. Context Engineering is the strategic practice of curating what enters the model's limited attention budget. 
To build reliable AI agents, you must master these four deterministic context patterns.\nThis post explores the key insights from my presentation \u0026quot;Claude Code Skill(s): Mastering AI-Powered Development\u0026quot;, focusing on practical patterns that transform unpredictable AI interactions into deterministic, production-ready workflows.\nWhat is Context Engineering? Context Engineering is the art and science of strategically managing what information an LLM processes within its limited attention window. Unlike traditional prompt engineering, which focuses on crafting individual requests, context engineering treats the entire development environment as a programmable context system.\nThe challenge: Modern LLMs like Claude offer massive context windows (up to 1 million tokens), but without proper engineering, this capacity becomes a liability rather than an asset. Information overload, context pollution, and inconsistent behavior plague naive implementations.\nThe solution: Treat context as a managed resource with four key patterns.\n1. Memory is a Hierarchical Filesystem Claude Code manages context through structured, version-controllable files rather than ephemeral prompts or database entries. This Hierarchical Memory system uses a clear precedence:\n1~/.claude/CLAUDE.md (User) # Global instructions across all projects 2./CLAUDE.md (Project) # Project-specific guidelines 3./CLAUDE.local.md (Local) # Machine-specific overrides (gitignored) Why Filesystem-Based Memory Matters Version Control Integration: Your team's AI expertise becomes git-committable knowledge. When a senior engineer discovers an optimal workflow pattern, it's captured in CLAUDE.md and distributed to the entire team through version control.\nConsistent Context Delivery: Every interaction with Claude Code starts with the same foundational knowledge. 
No more repeating \u0026quot;use conventional commits\u0026quot; or \u0026quot;follow our coding standards\u0026quot; in every session.\nLayered Context Precedence: The hierarchical structure allows for inheritance and overrides:\nUser-level instructions define personal preferences Project-level instructions enforce team standards Local overrides handle machine-specific configurations Practical Example 1# ./CLAUDE.md (Project) 2 3## AWS CDK Best Practices 4 5- **Resource Naming**: Do NOT explicitly specify resource names 6 when optional in CDK constructs. Let CDK generate unique names 7 to enable parallel stacks and environments. 8 9- **Lambda Functions**: Use `@aws-cdk/aws-lambda-nodejs` for 10 TypeScript/JavaScript. These constructs handle bundling, 11 dependencies, and transpilation automatically. This single file eliminates hundreds of repetitive prompts across your team's development lifecycle.\n2. Hooks Enforce Context Injection Hooks are shell commands triggered at specific lifecycle events in Claude Code, providing Deterministic Control over AI behavior. 
The most powerful pattern is the Memory Enforcer, which leverages the UserPromptSubmit event.\nThe Power of UserPromptSubmit Hooks This event executes before any AI processing begins, allowing you to:\nInject domain-specific knowledge Pre-load necessary context Trigger autonomous skills Enforce organizational policies 1# ~/.claude/hooks/user-prompt-submit.sh 2 3# Automatically inject AWS best practices for infrastructure queries 4if echo \u0026#34;$PROMPT\u0026#34; | grep -qi \u0026#34;aws\\|lambda\\|cdk\u0026#34;; then 5 echo \u0026#34;Loading AWS architecture context...\u0026#34; 6 cat ~/.claude/aws-patterns.md 7fi 8 9# Trigger security review for authentication-related changes 10if echo \u0026#34;$PROMPT\u0026#34; | grep -qi \u0026#34;auth\\|login\\|credential\u0026#34;; then 11 echo \u0026#34;/security-review\u0026#34; 12fi From Hope to Guarantee \u0026quot;Deterministic Control: Hooks transform 'hope AI follows instructions' into 'guaranteed execution' through code enforcement.\u0026quot;\nTraditional prompt engineering: \u0026quot;Please remember to check security implications...\u0026quot;\nHook-based enforcement:\n1# Security hook always runs on code changes 2if [ -n \u0026#34;$(git diff --name-only | grep -E \u0026#39;\\.(ts|js|py)$\u0026#39;)\u0026#34; ]; then 3 /run-security-scan 4fi The hook guarantees execution. The AI doesn't \u0026quot;forget\u0026quot; or \u0026quot;overlook\u0026quot;—the system enforces the behavior.\n3. Skills Are Autonomous, Context-Driven Extensions Agent Skills are modular knowledge bases that Claude autonomously invokes based on context. 
Unlike manual tool invocations, Skills are discovered and activated by Claude itself when the context matches their description.\nThe Anatomy of an Effective Skill Skills are defined as SKILL.md files with three critical components:\nClear Description: What the skill does (for autonomous discovery) Activation Context: When Claude should use it Domain Knowledge: The expertise it provides 1--- 2name: aws-cdk-patterns 3description: AWS CDK architecture patterns and best practices 4trigger: Use when designing or implementing AWS infrastructure with CDK 5--- 6 7# AWS CDK Architecture Patterns 8 9## Serverless API Pattern 10When building REST APIs, prefer this stack composition: 11- API Gateway (HTTP API for cost efficiency) 12- Lambda (Node.js runtime with arm64 for performance) 13- DynamoDB (single-table design when appropriate) 14 15[Implementation details...] Autonomous Activation The key difference from traditional tools:\nManual Tool Invocation (Traditional):\n1User: \u0026#34;Use the AWS tool to create a Lambda function\u0026#34; Autonomous Skill Activation (Context-Driven):\n1User: \u0026#34;Create a serverless API for user management\u0026#34; 2Claude: [Sees \u0026#34;serverless API\u0026#34; context] 3 [Discovers aws-cdk-patterns skill] 4 [Autonomously loads and applies patterns] 5 [Implements with best practices baked in] The skill's description enables proper discovery. 
Claude reasons: \u0026quot;This query involves serverless APIs → aws-cdk-patterns skill matches → load and apply.\u0026quot;\nReducing Repetitive Prompting Before Skills:\n1User: \u0026#34;Create Lambda with proper IAM roles\u0026#34; 2User: \u0026#34;Add DynamoDB with encryption\u0026#34; 3User: \u0026#34;Configure API Gateway with CORS\u0026#34; 4User: \u0026#34;Set up CloudWatch logging\u0026#34; With Skills:\n1User: \u0026#34;Create a serverless API\u0026#34; 2Claude: [aws-cdk-patterns skill automatically applies all patterns] The skill encapsulates the complete pattern, eliminating prompt repetition.\n4. Configure for 1M Token Scale Effectively using Claude's 1 million token context window requires explicit configuration and understanding of the infrastructure.\nAmazon Bedrock Integration When integrating Claude via Amazon Bedrock, enable extended context explicitly:\n1# Enable 1M token context window 2/model sonnet[1m] 3 4# Verify configuration 5aws bedrock get-foundation-model \\ 6 --model-identifier anthropic.claude-3-5-sonnet-20241022-v2:0[1m] The [1m] suffix is critical—without it, you're limited to smaller context windows.\nClaude Agent SDK Features For programmatic agent development, the Claude Agent SDK provides essential context management:\n1import { Agent } from \u0026#39;@anthropic-ai/agent-sdk\u0026#39;; 2 3const agent = new Agent({ 4 model: \u0026#39;claude-3-5-sonnet-20241022\u0026#39;, 5 // Automatic context compaction when approaching limits 6 autoCompact: true, 7 // Control token allocation for reasoning 8 thinkingBudget: 16000, 9 // Long-running session support 10 maxTurns: 100 11}); Token Budget Optimization Thinking Budget: Control how many tokens Claude allocates to internal reasoning:\n1// Development/debugging: verbose reasoning 2thinkingBudget: 32000 3 4// Production: efficient processing 5thinkingBudget: 8000 Context Compaction: The SDK automatically summarizes older context when approaching limits, maintaining relevance while 
preserving essential information.\nLong Session Strategies For extended development sessions:\nPeriodic Checkpointing: Save key decisions and context Strategic Summarization: Compress completed work Context Pruning: Remove no-longer-relevant information Hierarchical Memory: Offload persistent knowledge to CLAUDE.md files 1# Hook for automatic session checkpointing 2# ~/.claude/hooks/every-10-turns.sh 3echo \u0026#34;Creating context checkpoint...\u0026#34; 4claude-snapshot save \u0026#34;checkpoint-$(date +%s)\u0026#34; The Paradigm Shift: Programming Claude's Perception Context engineering represents a fundamental shift from request-response prompting to environment programming. You're not asking Claude to do things differently—you're changing what Claude perceives as reality.\nTraditional Approach: Persuasion 1\u0026#34;Please follow our coding standards...\u0026#34; 2\u0026#34;Remember to add tests...\u0026#34; 3\u0026#34;Don\u0026#39;t forget error handling...\u0026#34; Context Engineering: Perception Programming 1CLAUDE.md defines coding standards as reality 2Hooks guarantee test execution 3Skills embed error handling patterns Claude doesn't need to \u0026quot;remember\u0026quot; or \u0026quot;try hard\u0026quot;—the engineered context makes correct behavior the natural, obvious choice.\nImplementation Checklist Ready to implement these patterns? 
Here's your action plan:\nImmediate Actions Create project CLAUDE.md with team standards Set up user-level ~/.claude/CLAUDE.md for personal workflows Configure .gitignore to exclude CLAUDE.local.md Hooks Setup Create ~/.claude/hooks/user-prompt-submit.sh Implement context injection for common domains Add security enforcement hooks Skills Development Identify repetitive instruction patterns Convert top 3 patterns to Agent Skills Write clear skill descriptions for autonomous activation Scale Configuration Enable [1m] context for larger context window Configure thinking budget for your use case Implement session management strategy Conclusion Context engineering transforms AI development from an art of persuasion into a science of deterministic systems design. By treating context as managed infrastructure—through hierarchical memory, enforced hooks, autonomous skills, and optimized configuration—you build reliable AI agents that perform consistently.\nThe four patterns work synergistically:\nMemory provides foundational knowledge Hooks enforce critical behaviors Skills deliver domain expertise autonomously Configuration enables scale and efficiency Stop hoping your LLM follows instructions. 
Start engineering the context that makes correct behavior inevitable.\nWatch the Full Presentation For a deeper dive into these concepts with live demonstrations, watch my presentation:\nResources Claude Code Documentation Claude Agent SDK on GitHub Amazon Bedrock Claude Models Context Engineering Best Practices Presentation Slides: Claude Code Skill(s) ","link":"https://kane.mx/posts/2025/context-engineering-secrets-claude-code/","section":"posts","tags":["Claude Code","AI Agents","Context Engineering","LLM","Developer Tools","AWS Bedrock","Anthropic"],"title":"Beyond Prompts: 4 Context Engineering Secrets for Claude Code"},{"body":"","link":"https://kane.mx/tags/context-engineering/","section":"tags","tags":null,"title":"Context Engineering"},{"body":"","link":"https://kane.mx/tags/developer-tools/","section":"tags","tags":null,"title":"Developer Tools"},{"body":"","link":"https://kane.mx/tags/llm/","section":"tags","tags":null,"title":"LLM"},{"body":"","link":"https://kane.mx/categories/software-engineering/","section":"categories","tags":null,"title":"Software Engineering"},{"body":"","link":"https://kane.mx/tags/ai-security/","section":"tags","tags":null,"title":"AI Security"},{"body":"","link":"https://kane.mx/tags/federated-authentication/","section":"tags","tags":null,"title":"Federated Authentication"},{"body":"","link":"https://kane.mx/tags/jwt/","section":"tags","tags":null,"title":"JWT"},{"body":"","link":"https://kane.mx/tags/oauth-2.0/","section":"tags","tags":null,"title":"OAuth 2.0"},{"body":"","link":"https://kane.mx/categories/protocol-design/","section":"categories","tags":null,"title":"Protocol Design"},{"body":"","link":"https://kane.mx/tags/resource-indicators/","section":"tags","tags":null,"title":"Resource Indicators"},{"body":"","link":"https://kane.mx/tags/rfc-7636/","section":"tags","tags":null,"title":"RFC 7636"},{"body":"","link":"https://kane.mx/tags/rfc-9700/","section":"tags","tags":null,"title":"RFC 9700"},{"body":"Executive Summary 
This article provides a deep-dive technical analysis of the Model Context Protocol (MCP) authorization flow. The central insight is that MCP's authorization model is not a generic application of OAuth 2.0 but a sophisticated implementation of the emerging OAuth 2.1 standard.\nThe MCP protocol deliberately rejects the flexible but less secure patterns of the original 2012 OAuth framework (RFC 6749). Instead, it adopts a modern, secure-by-default, and dynamic protocol stack built on three pillars:\nMandatory Security: Following modern security best practices (RFC 9700), MCP mandates the Authorization Code flow with PKCE (RFC 7636) for all client types to prevent code interception attacks. Token Specificity: It uses Resource Indicators (RFC 8707) to issue precisely audience-restricted and verifiable JWT Access Tokens (RFC 7519), mitigating the \u0026quot;Confused Deputy\u0026quot; problem. Dynamic Federation: It combines Dynamic Client Registration (RFC 7591), Authorization Server Metadata (RFC 8414), and, most critically, Protected Resource Metadata (RFC 9728) to enable a fully decentralized and programmatic ecosystem. MCP's mandatory requirement for RFC 9728 is the key that unlocks its dynamic authorization flow, allowing clients to discover authorization requirements directly from the resource server (the \u0026quot;model\u0026quot;). 
By aligning with OAuth 2.1 principles, MCP establishes itself as a reference implementation for next-generation, secure, and federated API authorization.\nTable 1: Key RFC Specifications for the MCP Authorization Flow RFC Number Title Publication Date Status Function in the MCP/OAuth 2.1 Stack RFC 6749 The OAuth 2.0 Authorization Framework October 2012 Standards Track Core framework (historical baseline); defines roles and the Authorization Code flow RFC 6750 The OAuth 2.0 Bearer Token Usage October 2012 Standards Track Defines the Authorization: Bearer header for token usage RFC 7636 Proof Key for Code Exchange (PKCE) September 2015 Standards Track Protects the Authorization Code flow from interception attacks; mandatory in 2.1 RFC 7519 JSON Web Token (JWT) May 2015 Standards Track Defines the self-contained, verifiable access token format, enabling stateless validation RFC 8707 Resource Indicators for OAuth 2.0 February 2020 Standards Track Allows clients to specify the token's target audience (resource), solving token ambiguity RFC 7591 OAuth 2.0 Dynamic Client Registration July 2015 Standards Track Allows clients to register automatically via an API, enabling dynamic scaling RFC 8414 OAuth 2.0 Authorization Server Metadata June 2018 Standards Track Allows clients to dynamically discover AS endpoints and capabilities (e.g., PKCE support) RFC 9728 OAuth 2.0 Protected Resource Metadata April 2025 Standards Track Key MCP Requirement. Allows RS (Resource Servers) to publish their authorization needs (like AS and scope) RFC 9700 OAuth 2.0 Security Best Current Practice January 2025 Best Current Practice Codifies security lessons; formally deprecates insecure grant types (like Implicit) draft-ietf-oauth-v2-1 The OAuth 2.1 Authorization Framework In Progress IETF Draft Consolidates all the above best practices; set to replace RFC 6749 and 6750 Part 1: Deconstructing the OAuth 2.0 Baseline and Its Gaps This section examines the original 2012 OAuth 2.0 specifications. 
While foundational, they established a \u0026quot;framework\u0026quot; that left critical security and interoperability gaps—gaps that the MCP technology stack is specifically designed to close.\n1.1 RFC 6749: The OAuth 2.0 Authorization Framework Published in October 2012, RFC 6749 defined the core OAuth 2.0 model.\nKey Concepts:\nRoles: Defines the four primary roles: Resource Owner (user), Client (application), Authorization Server (AS), and Resource Server (RS). Delegated Access: Establishes a model where a third-party client can obtain limited access to an HTTP service on behalf of a user, without handling the user's credentials. Grant Types: Describes several flows for obtaining access tokens: Authorization Code Flow: The primary flow for web servers. It involves redirecting the user to the AS for approval and exchanging a temporary code for an access token. The user's credentials are never exposed to the client. Implicit Flow: A simplified flow for client-side applications (SPAs), where the access token is returned directly to the user-agent. Resource Owner Password Credentials (ROPC) Flow: Allows the client to collect the user's username and password directly. Client Credentials Flow: Used for machine-to-machine (M2M) authorization where the client acts on its own behalf. 1.2 RFC 6750: The OAuth 2.0 Bearer Token Usage Published alongside RFC 6749, this specification defines the \u0026quot;Bearer\u0026quot; token.\nKey Concepts:\nBearer Token: An access token where simple possession is sufficient to gain authorization. The client is not required to prove possession of a cryptographic key. Token Presentation: Specifies that the primary method for presenting the token is the HTTP Authorization header: Authorization: Bearer \u0026lt;token\u0026gt;. Security Considerations:\nBecause possession grants access, bearer tokens must be protected from leakage during transit (requiring TLS/HTTPS) and at rest. 
The specification warns against passing bearer tokens in URLs, a practice explicitly forbidden by the later OAuth 2.1 draft.\n1.3 Identifying the Gaps: The \u0026quot;Framework\u0026quot; vs. \u0026quot;Protocol\u0026quot; Problem The primary weakness of the original OAuth 2.0 specifications was not a single flaw but their excessive flexibility. As a \u0026quot;framework,\u0026quot; it left too many security-critical decisions to implementers, resulting in a patchwork of extensions and a high number of insecure deployments.\nThe MCP technology stack systematically addresses six key gaps left by the 2012 baseline:\nInsecure Flows: The Implicit flow, designed for SPAs, is highly vulnerable to token leakage and is now deprecated by the security BCP (RFC 9700) and removed from OAuth 2.1. Flow Vulnerabilities: The Authorization Code flow itself was vulnerable to \u0026quot;authorization code interception\u0026quot; attacks when used by public clients like native apps. Lack of a Standard Token Format: RFC 6749 did not define a token format, leading to the prevalence of \u0026quot;opaque\u0026quot; tokens. These tokens forced the RS to make a \u0026quot;Token Introspection\u0026quot; call back to the AS for every request, creating performance bottlenecks and tight coupling. Lack of a Standard Token Audience: Without a defined audience, a token from one AS could be accepted by any RS that trusted it. This led to the \u0026quot;Confused Deputy\u0026quot; attack, where a client could trick an RS into accepting a token intended for a different resource. Lack of Discovery Mechanisms: RFC 6749 assumed clients were manually configured with AS endpoint addresses. This \u0026quot;out-of-band\u0026quot; configuration is not viable in a dynamic, federated ecosystem like MCP. Risks of ROPC: The ROPC flow required clients to handle plaintext user passwords, violating the core principle of delegated authorization and creating a significant security risk. 
The MCP specification directly solves each of these problems by adopting the subsequent RFCs that matured the OAuth standard.\nPart 2: Building a Modern Security Foundation This section details the RFCs that form the non-negotiable security baseline for modern OAuth implementations, including MCP and the broader OAuth 2.1 standard.\n2.1 RFC 7636: Mitigating Interception Attacks with PKCE Published in September 2015, RFC 7636 (Proof Key for Code Exchange) was originally designed to protect public clients (like native apps) using the Authorization Code flow.\nProblem Solved: PKCE mitigates the \u0026quot;authorization code interception\u0026quot; attack, where a malicious app intercepts the code being redirected back to a legitimate client.\nMechanism:\nPKCE introduces a dynamic, hash-based challenge-response mechanism:\ncode_verifier: The client generates a high-entropy random string, the code_verifier. code_challenge: The client hashes the code_verifier using SHA-256 and Base64URL-encodes it to create the code_challenge. Authorization Request: The client sends the code_challenge and code_challenge_method=\u0026quot;S256\u0026quot; to the /authorization endpoint. The AS stores these values. Token Exchange: When exchanging the authorization code at the /token endpoint, the client must provide the original, un-hashed code_verifier. Server Verification: The AS hashes the received code_verifier and compares it to the stored code_challenge. It only issues an access token if they match. Analysis:\nPKCE's role evolved significantly over time. Initially a patch for public clients, its core security value—dynamically proving that the client redeeming the code is the same one that initiated the flow—was recognized as beneficial for all client types.\nAs a result, the emerging OAuth 2.1 standard mandates PKCE for all clients using the Authorization Code flow. 
By adopting the 2.1 stack, MCP inherits this \u0026quot;PKCE-by-default\u0026quot; security posture, transforming PKCE from a patch into a foundational security layer.\n2.2 RFC 9700: Codifying Security Best Practices (BCP) Published in January 2025, RFC 9700 (OAuth 2.0 Security Best Current Practice) is a critical document that consolidates a decade of security lessons learned since the original framework. It serves as the theoretical foundation for OAuth 2.1.\nKey Mandates:\nFormally Deprecates Insecure Flows: Explicitly recommends against using the Implicit Flow and the Resource Owner Password Credentials (ROPC) Flow due to their security risks. Elevates Security Controls: Upgrades many optional recommendations to \u0026quot;MUST\u0026quot; or \u0026quot;SHOULD\u0026quot; requirements, including: Using the state parameter to prevent CSRF. Requiring PKCE for all clients. Requiring exact string matching for redirect_uris. Recommending sender-constrained tokens (e.g., mTLS, DPoP) to prevent token replay. Analysis:\nRFC 9700 is the IETF's official acknowledgment of OAuth's architectural evolution. It provides an authoritative guide for secure implementation, moving developers away from the complex and often insecure \u0026quot;patchwork\u0026quot; of early OAuth 2.0 extensions. By aligning with OAuth 2.1, MCP's designers are building on the IETF's latest security consensus, not the outdated 2012 framework.\n2.3 RFC 7519: Defining a Standard Token Structure with JWT Published in May 2015, RFC 7519 (JSON Web Token) standardized a compact, URL-safe format for representing claims between two parties.\nKey Concepts:\nStructure: A JWT consists of three Base64URL-encoded parts separated by dots: Header.Payload.Signature. Header: Declares the token type (JWT) and signing algorithm (alg). Payload: Contains \u0026quot;claims\u0026quot; about the subject, such as issuer and expiration time. 
Signature: A cryptographic signature that verifies the token's integrity and authenticity. Registered Claims: Defines standard claims, including: iss (Issuer): The AS that issued the token. sub (Subject): The user or entity the token represents. aud (Audience): The intended recipient (the RS). exp (Expiration Time): The token's expiration timestamp. iat (Issued At): The token's issuance timestamp. Analysis:\nIn a federated system like MCP, JWTs are the architectural linchpin for high-performance, stateless authorization.\nIn the era of opaque tokens, a Resource Server (RS) had to make a \u0026quot;Token Introspection\u0026quot; (RFC 7662) API call back to the Authorization Server (AS) to validate every request. This created tight coupling and a severe performance bottleneck.\nIn contrast, a JWT is self-contained. The RS can validate it locally and statelessly by:\nChecking the exp claim for expiration. Checking the iss claim to verify the issuer. Checking the aud claim to ensure the token was issued for itself. Using the AS's public key (discoverable via RFC 8414) to validate the signature. For a protocol like MCP, which protects high-throughput AI model endpoints, this stateless validation is not a convenience—it is an architectural necessity. Adopting JWTs is a prerequisite for building the scalable, decoupled system MCP aims to be.\nPart 3: The Core of MCP: A Dynamic, Federated Authorization Ecosystem This section analyzes the RFCs that transform OAuth from a manually configured system into the dynamic, programmatic, and federated protocol that MCP requires.\n3.1 RFC 8707: Solving Token Ambiguity with Resource Indicators Published in February 2020, RFC 8707 (Resource Indicators for OAuth 2.0) was created to solve the \u0026quot;token ambiguity\u0026quot; problem.\nProblem Solved:\nIn classic OAuth 2.0, the AS had no standard way of knowing which API the client intended to use the token for. 
This created the \u0026quot;Confused Deputy\u0026quot; attack: a client could request a token for a low-privilege API (e.g., \u0026quot;Photos API\u0026quot;) and then replay it at a high-privilege API (e.g., \u0026quot;Documents API\u0026quot;). If both APIs trusted the same AS, the token would be accepted, violating the principle of least privilege.\nMechanism:\nThe specification defines a new request parameter: resource. The client includes one or more resource parameters in its authorization and token requests, explicitly stating the URI of the target Resource Server. The AS must use this resource parameter to \u0026quot;audience-restrict\u0026quot; the token. In a JWT, this means placing the resource value into the aud (Audience) claim. Analysis:\nRFC 8707 provides the technical mechanism to enforce fine-grained authorization in a federated system. While RFC 7519 defines the aud claim, RFC 8707 defines the protocol for populating it based on client intent.\nThis audience restriction is the fundamental solution to the \u0026quot;Confused Deputy\u0026quot; problem. Now, the \u0026quot;Documents API\u0026quot; will inspect the aud claim and reject any token intended for the \u0026quot;Photos API,\u0026quot; even if the signature and expiration are valid. For MCP, where a client may interact with many different AI models, this is non-negotiable. A token issued for \u0026quot;Model A\u0026quot; must never be usable at \u0026quot;Model B.\u0026quot;\n3.2 The \u0026quot;Discovery Trifecta\u0026quot;: Enabling a Fully Dynamic Ecosystem This section explains how three RFCs combine to create the automated, zero-configuration flow that MCP depends on. 
The central question is: How can a client, starting with only an MCP model's address, automatically perform the entire OAuth flow without any manual configuration?\nThe \u0026quot;Discovery Trifecta\u0026quot; (RFC 7591, 8414, and 9728) transforms OAuth from a system of manually configured parts into a dynamic, discoverable protocol.\n3.2.1 Pillar 1: RFC 7591 - Dynamic Client Registration (DCR) Published in July 2015, RFC 7591 (OAuth 2.0 Dynamic Client Registration Protocol) solves a critical scaling problem.\nProblem Solved: How does a client obtain a client_id? In classic OAuth, this was a manual process (e.g., using a developer portal), which cannot scale in a federated ecosystem.\nMechanism:\nThe specification defines a \u0026quot;Client Registration Endpoint\u0026quot; on the AS. A client POSTs its metadata (e.g., redirect_uris, app name) as a JSON payload to this endpoint. The AS validates the request and responds with a new client_id and other registration details. Role in MCP: DCR enables scale. New AI applications (\u0026quot;MCP clients\u0026quot;) can be spun up and programmatically register with \u0026quot;MCP Authorization Servers\u0026quot; without human intervention, which is essential for a large-scale, multi-tenant ecosystem.\n3.2.2 Pillar 2: RFC 8414 - Authorization Server Metadata Published in June 2018, RFC 8414 (OAuth 2.0 Authorization Server Metadata) standardized the successful \u0026quot;discovery\u0026quot; concept from OpenID Connect.\nProblem Solved: How does a client discover an AS's endpoints (e.g., /authorization, /token) and capabilities (e.g., supported PKCE methods)?\nMechanism:\nThe specification defines a \u0026quot;well-known\u0026quot; URI: /.well-known/oauth-authorization-server, appended to the AS's issuer identifier. A client performs an HTTP GET to this endpoint (e.g., https://as.example.com/.well-known/oauth-authorization-server). 
The AS returns a JSON document containing its configuration details: authorization_endpoint, token_endpoint, issuer, scopes_supported, pkce_code_challenge_methods_supported, etc. Role in MCP: The MCP specification mandates that clients must support this discovery mechanism. This decouples the client from the AS implementation, as the client only needs a single issuer URL to discover all protocol endpoints and features dynamically.\n3.2.3 Pillar 3: RFC 9728 - Protected Resource Metadata Published in April 2025 after an 8.5-year journey, RFC 9728 (OAuth 2.0 Protected Resource Metadata) is the newest and most critical piece of the MCP stack.\nProblem Solved: This RFC provides the \u0026quot;missing link\u0026quot; in the OAuth ecosystem. We could dynamically register clients (RFC 7591) and discover AS endpoints (RFC 8414), but how does a client, starting only with a Resource Server's address, know which AS to talk to and what authorization parameters to request?\nMechanism:\nThe specification defines a metadata format for the Resource Server (RS). It specifies a \u0026quot;well-known\u0026quot; URI on the RS: /.well-known/oauth-protected-resource. A client GETs this endpoint to retrieve a JSON document describing the RS's authorization requirements, including authorization_servers (the issuer identifiers of the ASes it trusts), scopes_supported, and its own resource identifier (for RFC 8707). Role in MCP: This is the core enabling standard for the MCP flow. The specification explicitly states: \u0026quot;MCP clients MUST use OAuth 2.0 Protected Resource Metadata (RFC 9728) for authorization server discovery.\u0026quot;\nAnalysis of the Complete MCP Dynamic Flow With the \u0026quot;Discovery Trifecta,\u0026quot; the entire authorization process becomes automated:\nAn MCP client is given the URI of an MCP server (RS), e.g., https://model.example.com. The client queries https://model.example.com/.well-known/oauth-protected-resource (per RFC 9728). 
The RS responds with its requirements: \u0026quot;My issuer is https://auth.mcp.com, and you must request scope=mcp:run and resource=https://model.example.com.\u0026quot; The client now knows the AS's issuer and queries https://auth.mcp.com/.well-known/oauth-authorization-server (per RFC 8414). The AS responds with its endpoints (/authorize, /token) and capabilities (e.g., PKCE S256 support). The client, now fully informed, initiates the Authorization Code flow with PKCE (RFC 7636), redirecting the user with the code_challenge, scope, and resource parameters. The user authenticates, and the client receives an authorization code. It exchanges this code at the /token endpoint, providing the code_verifier. The AS issues a JWT access token (RFC 7519) with the payload containing aud: \u0026quot;https://model.example.com\u0026quot; (from RFC 8707) and iss: \u0026quot;https://auth.mcp.com\u0026quot;. The client presents this token to https://model.example.com using the Authorization: Bearer header (RFC 6750). The RS statelessly validates the token's signature, expiration, issuer, and, most critically, its audience (aud), then grants access. Conclusion: The final publication of RFC 9728 was the key technical enabler for realizing MCP's vision of a fully federated and dynamic authorization model.\nPart 4: The Final Picture: MCP as a Reference Implementation for OAuth 2.1 This final part synthesizes the preceding analysis to show that the protocol stack mandated by MCP is, by definition, the emerging OAuth 2.1 standard.\n4.1 draft-ietf-oauth-v2-1: The OAuth 2.1 Authorization Framework The OAuth 2.1 draft, currently in progress, is not a rewrite but a consolidation and refinement of the best practices developed over the last decade. 
Its goal is to replace and obsolete the original RFC 6749 and RFC 6750 by absorbing mature extensions into the baseline and removing insecure patterns.\nKey Differences from OAuth 2.0:\nAdded: PKCE (RFC 7636) is now mandatory for the Authorization Code flow. Added: Incorporates security requirements from the BCP (RFC 9700), such as strict redirect_uri matching. Added: Recommends the use of Resource Indicators (RFC 8707) and AS Metadata (RFC 8414). Removed: The Implicit flow (response_type=token) is omitted entirely. Removed: The Resource Owner Password Credentials (ROPC) flow is omitted entirely. Removed: Sending bearer tokens in URL query parameters is forbidden. 4.2 Final Analysis: MCP as a Reference Implementation of OAuth 2.1 The MCP authorization flow is the OAuth 2.1 flow, enhanced with the most modern discovery mechanisms.\nMCP mandates two legs of the \u0026quot;Discovery Trifecta\u0026quot; (RFC 9728 and RFC 8414) and treats the third, Dynamic Client Registration (RFC 7591), as an optional step, representing the cutting edge of the OAuth ecosystem. By aligning with OAuth 2.1, MCP implicitly adopts the security BCP of RFC 9700, the PKCE mechanism of RFC 7636, and the token specificity of RFC 8707. It inherits the deprecation of the insecure Implicit and ROPC flows, guaranteeing a secure-by-default posture. Conclusion: MCP serves as a \u0026quot;domain-specific profile\u0026quot; of OAuth 2.1. Its designers have selected a complete, modern, and dynamic set of specifications to build a protocol that is not just secure but also federation-capable and scalable—precisely what the AI ecosystem it serves requires. 
The specification is a textbook example of how to build a modern, interoperable authorization protocol.\nVisualizing the MCP Authorization Flow with Mermaid The following diagram illustrates the complete MCP authorization flow, showing how the \u0026quot;Discovery Trifecta\u0026quot; enables dynamic federation:\nsequenceDiagram participant User participant Client as MCP Client participant RS as MCP Server (RS) participant AS as Authorization Server Note over Client,RS: Step 1: Discover RS Metadata (RFC 9728) Client-\u0026gt;\u0026gt;RS: GET /.well-known/oauth-protected-resource RS--\u0026gt;\u0026gt;Client: {\u0026#34;issuer\u0026#34;: \u0026#34;https://auth.mcp.com\u0026#34;, \u0026#34;resource\u0026#34;: \u0026#34;https://model.example.com\u0026#34;, \u0026#34;scopes_supported\u0026#34;: [\u0026#34;mcp:run\u0026#34;]} Note over Client,AS: Step 2: Discover AS Metadata (RFC 8414) Client-\u0026gt;\u0026gt;AS: GET /.well-known/oauth-authorization-server AS--\u0026gt;\u0026gt;Client: {\u0026#34;issuer\u0026#34;: \u0026#34;https://auth.mcp.com\u0026#34;, \u0026#34;authorization_endpoint\u0026#34;: \u0026#34;/authorize\u0026#34;, \u0026#34;token_endpoint\u0026#34;: \u0026#34;/token\u0026#34;, \u0026#34;pkce_code_challenge_methods_supported\u0026#34;: [\u0026#34;S256\u0026#34;]} Note over User,Client: (Optional) Step 3: Dynamic Client Registration (RFC 7591) Client-\u0026gt;\u0026gt;AS: POST /register {redirect_uris, ...} AS--\u0026gt;\u0026gt;Client: {client_id, ...} Note over User,AS: Step 4: Authorization Code Flow with PKCE (RFC 7636) Client-\u0026gt;\u0026gt;User: Redirect to /authorize?resource=...\u0026amp;scope=...\u0026amp;code_challenge=... 
User-\u0026gt;\u0026gt;AS: Authenticate \u0026amp; Authorize AS--\u0026gt;\u0026gt;User: Redirect with authorization code User-\u0026gt;\u0026gt;Client: Return authorization code Note over Client,AS: Step 5: Exchange code for token Client-\u0026gt;\u0026gt;AS: POST /token {code, code_verifier, ...} AS--\u0026gt;\u0026gt;Client: JWT access token with aud: \u0026#34;https://model.example.com\u0026#34; Note over Client,RS: Step 6: Access protected resource Client-\u0026gt;\u0026gt;RS: GET /resource Authorization: Bearer \u0026lt;JWT\u0026gt; RS--\u0026gt;\u0026gt;Client: Protected resource data Analyzing IDaaS Provider Compatibility for the MCP Flow As the Model Context Protocol gains traction, Identity-as-a-Service (IDaaS) providers are evaluating support for its authorization flow. This section tracks the current state of compatibility across major identity platforms.\nSupport Matrix Provider OAuth 2.1 Core (w/ PKCE) RFC 8707 (Resource Indicators) RFC 7591 (Dynamic Client Reg.) Summary \u0026amp; Key Incompatibilities Auth0 Yes No (Incompatible) Yes Uses proprietary audience parameter. Auth0's legacy audience parameter conflicts with the MCP-mandated resource parameter. Okta Yes No Yes Does not support RFC 8707. Okta's documentation explicitly states the resource parameter/claim is not supported for this function. Amazon Cognito Yes Yes No (Workaround) Natively supports RFC 8707. No native DCR. Requires a custom endpoint (e.g., API Gateway + Lambda) to emulate RFC 7591. Microsoft Entra ID Yes No (Incompatible) No Uses proprietary scope parameter. Entra ID rejects the resource parameter and uses a non-standard scope={resource}/.default syntax. No DCR support is planned. Google Cloud Identity Yes (Caveat) No (Incompatible) No Requires manual registration. Google does not support DCR. Uses proprietary \u0026quot;custom audiences\u0026quot; instead of RFC 8707. PKCE flow still requires a client secret. 
Keycloak (OSS) Yes No (Workaround) Yes Uses proprietary audience parameter. RFC 8707 is not supported but is in development. The workaround is using Keycloak's audience parameter with a custom \u0026quot;Audience Mapper\u0026quot;. Ping Identity Yes Yes Yes Fully Compliant. Natively supports both RFC 8707 (since v12.1) and a mature implementation of RFC 7591. OneLogin Yes No Data Yes RFC 8707 support is unknown. The provider supports DCR and proper PKCE, but there is no documentation confirming or denying support for the resource parameter. Zitadel (OSS) Yes No No Lacks support for both standards. Documentation explicitly states RFC 8707 is not supported and will cause an error. DCR is a known feature request but is not implemented. Key Insights RFC 8707 is the Critical Bottleneck: The resource parameter (RFC 8707) is the primary compatibility issue. Only Amazon Cognito and Ping Identity natively support this MCP-mandatory specification. Most major providers (Auth0, Okta, Microsoft, Google) use proprietary audience parameters that predate the standard, creating a significant barrier to adoption. Dynamic Client Registration is Unevenly Implemented: While many providers support RFC 7591, major cloud vendors like Google and Microsoft have chosen not to implement it, requiring manual client registration that conflicts with MCP's goal of zero-configuration federation. PKCE is Nearly Universal: All reviewed providers support PKCE (RFC 7636). However, Google's implementation is non-compliant, as it still requires a client_secret in a flow designed to eliminate it. Only One Provider is Fully Compliant: At present, Ping Identity is the only IDaaS provider that natively supports all MCP-required specifications, including both RFC 8707 and RFC 7591. 
Workarounds for Providers Without RFC 8707 Support Given that RFC 8707 is the main compatibility hurdle, MCP implementers can consider these strategies:\nProvider-Specific Token Exchange: Use RFC 8693 (Token Exchange) where available to exchange a generic token for a resource-specific one. Audience Mapper Configuration: For extensible platforms like Keycloak, configure custom audience mappers to emulate RFC 8707 behavior. Authorization Server Proxy: Deploy a lightweight proxy to translate between MCP's resource parameter and a provider's proprietary audience parameter. Scope-Based Resources: For Microsoft Entra ID, encode resource identifiers in the scope parameter using their proprietary format (scope={resource}/.default). This analysis reveals that MCP's dynamic federation model faces significant headwinds from IDaaS provider incompatibility. Until major providers adopt RFC 8707, production MCP deployments will require custom integration work or reliance on fully compliant providers.\nProvider-Specific Analysis This section provides a more detailed breakdown of the compatibility status for each provider.\nAuth0 (an Okta company) Auth0 strongly supports OAuth 2.1 and has excellent DCR support. However, it is incompatible with MCP's mandatory RFC 8707 requirement. Auth0 implemented a non-standard audience parameter years before RFC 8707 was finalized and does not accept the standard resource parameter.\nOkta As a leader in the IETF OAuth working group, Okta fully supports the OAuth 2.1 core and provides a compliant DCR API. However, like Auth0, it is incompatible because its documentation explicitly states it does not support the resource parameter for audience restriction as defined in RFC 8707.\nAmazon Cognito Cognito is partially compliant and a viable option with custom engineering. It is one of the few providers that natively supports RFC 8707 (Resource Indicators). However, it lacks native support for RFC 7591 (DCR). 
The standard workaround is to build a custom DCR endpoint using AWS API Gateway and Lambda to call Cognito's internal APIs.\nMicrosoft Entra ID (formerly Azure AD) Entra ID is incompatible with the MCP flow on two critical points. First, it rejects the resource parameter and uses a proprietary mechanism where the resource is specified within the scope parameter (e.g., scope=https://graph.microsoft.com/.default). Second, Microsoft has stated it has no plans to implement RFC 7591 (DCR).\nGoogle Cloud Identity The Google Identity Platform is incompatible and highly proprietary. It does not support DCR, requiring manual client creation. It also does not support RFC 8707, instead using a non-standard \u0026quot;custom audience\u0026quot; (aud) claim. Furthermore, its PKCE implementation is non-compliant, as it still requires a client_secret for a flow designed to be secretless.\nKeycloak (OSS) This open-source project is partially compliant. It offers excellent, extensible support for DCR but does not support RFC 8707 out-of-the-box. This is a known issue with active development. The established workaround is to use Keycloak's proprietary audience parameter and configure a custom \u0026quot;Audience Mapper.\u0026quot;\nFor a complete implementation guide showing how to deploy Keycloak on AWS with full MCP OAuth 2.1 support—including the audience mapper workaround, JDBC_PING clustering for zero-downtime deployments, and automated Terraform orchestration—see Implementing MCP OAuth 2.1 with Keycloak on AWS. The guide provides infrastructure code, step-by-step deployment instructions, and detailed configuration examples for making Keycloak compatible with MCP clients through realm default scopes and protocol mappers.\nPing Identity (PingFederate) Ping Identity is the most compliant provider on this list. 
Its platform offers mature, robust support for RFC 7591 (DCR) and, as of version 12.1, provides full, standards-compliant support for RFC 8707 (Resource Indicators).\nOneLogin (by One Identity) OneLogin appears partially compliant, but with incomplete data. The platform supports RFC 7591 (DCR) and correctly implements PKCE. However, there is no available documentation to confirm or deny its support for the MCP-mandatory resource parameter (RFC 8707).\nZitadel (OSS) As a newer open-source platform, Zitadel is not yet compliant with the MCP flow. Its documentation clearly states that it does not support RFC 8707 and will reject requests containing the resource parameter. Similarly, RFC 7591 (DCR) is not implemented and remains a feature request.\nResources \u0026amp; References Official Specifications RFC 6749 - The OAuth 2.0 Authorization Framework: The original 2012 OAuth 2.0 specification RFC 6750 - The OAuth 2.0 Bearer Token Usage: Bearer token specification RFC 7636 - Proof Key for Code Exchange (PKCE): PKCE extension for authorization code interception protection RFC 7519 - JSON Web Token (JWT): JWT token format specification RFC 8707 - Resource Indicators for OAuth 2.0: Audience restriction via resource parameter RFC 7591 - OAuth 2.0 Dynamic Client Registration: Dynamic client registration protocol RFC 8414 - OAuth 2.0 Authorization Server Metadata: AS discovery and metadata RFC 9728 - OAuth 2.0 Protected Resource Metadata: RS discovery and authorization requirements [NEW - April 2025] RFC 9700 - OAuth 2.0 Security Best Current Practice: Security BCP and OAuth 2.1 foundation [NEW - January 2025] draft-ietf-oauth-v2-1-14 - OAuth 2.1 Authorization Framework: Latest OAuth 2.1 draft MCP Specification Model Context Protocol Specification: Official MCP specification MCP Authorization Documentation: MCP-specific authorization requirements Related Articles MCP OAuth Evolution: SEP-991 Simplifies Client Registration: How Client ID Metadata Documents replace Dynamic 
Client Registration [NEW] Building an MCP Agentic Chatbot on AWS: My previous exploration of MCP server implementation Using MCP Client OAuthClientProvider with AWS Agentcore: Practical implementation of MCP OAuth client patterns Implementing MCP OAuth 2.1 with Keycloak on AWS: Complete guide to configuring Keycloak as an MCP-compatible authorization server Appendix: Detailed RFC Timeline 2012-10: RFC 6749 (OAuth 2.0 Framework) 📜 2012-10: RFC 6750 (Bearer Tokens) 📜 2015-05: RFC 7519 (JWT) 🔐 2015-07: RFC 7591 (Dynamic Registration) 🔄 2015-09: RFC 7636 (PKCE) 🔒 2018-06: RFC 8414 (AS Metadata) 🔍 2020-02: RFC 8707 (Resource Indicators) 🎯 2025-01: RFC 9700 (Security BCP) 🛡️ [RECENT] 2025-04: RFC 9728 (Protected Resource) 🔗 [LATEST] 2025-??: OAuth 2.1 (Final) 🚀 [IN PROGRESS] Key milestones in OAuth's evolution toward MCP requirements\n","link":"https://kane.mx/posts/2025/mcp-authorization-oauth-rfc-deep-dive/","section":"posts","tags":["MCP","Model Context Protocol","OAuth 2.1","OAuth 2.0","PKCE","JWT","RFC 7636","RFC 9700","RFC 9728","Resource Indicators","Federated Authentication","AI Security"],"title":"Technical Deconstruction of MCP Authorization: A Deep Dive into OAuth 2.1 and IETF RFC Specifications"},{"body":"The Problem: Premature Input Submission When working with Claude Code in a VS Code Remote SSH session (e.g., connecting from a macOS or Windows client to a remote Linux host), a common frustration arises: pressing Shift+Enter in the terminal is supposed to create a new line for multi-line input. Instead, it often submits the current line prematurely.\nThis behavior disrupts the workflow for crafting complex, multi-line prompts, leading to fragmented interactions and inefficient token usage.\nUnderstanding the Root Cause The issue stems from how VS Code handles keyboard shortcuts in a remote context. By default, the local VS Code client's keybindings take precedence. 
It captures the Shift+Enter event and processes it according to its local configuration before the remote session has a chance to interpret it, resulting in an unintended submission.\nThe Solution: Synchronize Keybindings To resolve this, you must ensure that the keybinding for Shift+Enter is correctly configured on both your local machine and the remote host.\nStep 1: Configure Your Local Machine First, define the correct behavior on your local client. You need to add a custom keybinding that sends a specific escape sequence for a new line.\nOpen your keybindings.json file. The location varies by operating system:\nmacOS: ~/Library/Application Support/Code/User/keybindings.json Windows: %APPDATA%\\Code\\User\\keybindings.json Linux: ~/.config/Code/User/keybindings.json You can also open this file from within VS Code by opening the Command Palette (Cmd+Shift+P or Ctrl+Shift+P), typing Preferences: Open Keyboard Shortcuts (JSON), and pressing Enter.\nAdd the following JSON object to the file:\n[ { \u0026#34;key\u0026#34;: \u0026#34;shift+enter\u0026#34;, \u0026#34;command\u0026#34;: \u0026#34;workbench.action.terminal.sendSequence\u0026#34;, \u0026#34;args\u0026#34;: { \u0026#34;text\u0026#34;: \u0026#34;\\u001b\\r\u0026#34; }, \u0026#34;when\u0026#34;: \u0026#34;terminalFocus\u0026#34; } ] If the file already contains other keybindings, add this object inside the existing array, separated by a comma.\nStep 2: Verify the Remote Configuration Claude Code typically attempts to configure this automatically on the remote machine. However, it's essential to verify that the configuration exists and is correct.\nConnect to your remote host via VS Code SSH and check the contents of its keybindings.json file, located at ~/.config/Code/User/keybindings.json. Ensure it contains the same JSON block from Step 1. 
If the file is missing or the entry is not present, create or update it accordingly.\nStep 3: Restart VS Code For the changes to take full effect, completely quit and restart your local VS Code application. After restarting, reconnect to your remote SSH host.\nHow the Keybinding Works This configuration instructs VS Code to perform a specific action only when the integrated terminal is in focus (\u0026quot;when\u0026quot;: \u0026quot;terminalFocus\u0026quot;).\ncommand: workbench.action.terminal.sendSequence tells VS Code to send a sequence of characters to the terminal. args.text: \u0026quot;\\u001b\\r\u0026quot; is the key part of the solution. It sends an ESC character (\\u001b) followed by a Carriage Return (\\r). This sequence is interpreted by the terminal as a command to insert a newline character rather than executing the current command. This approach correctly enables multi-line input in the Claude Code prompt without affecting the Enter key's behavior in the chat panel or other editor inputs.\nVerification To confirm the fix is working:\nOpen the Claude Code terminal prompt in your remote session. Type a line of text. Press Shift+Enter. The cursor should move to a new line, allowing you to continue typing a multi-line prompt. 
The input should only be sent to Claude when you press Enter by itself.\nResources VS Code Remote Development Documentation Claude Code Official Documentation ","link":"https://kane.mx/posts/2025/vscode-remote-ssh-claude-code-keybindings/","section":"posts","tags":["VS Code","Claude Code","Remote SSH","Keyboard Shortcuts","Troubleshooting"],"title":"How to Fix Shift+Enter in VS Code Remote SSH for Claude Code"},{"body":"","link":"https://kane.mx/tags/keyboard-shortcuts/","section":"tags","tags":null,"title":"Keyboard Shortcuts"},{"body":"","link":"https://kane.mx/tags/remote-ssh/","section":"tags","tags":null,"title":"Remote SSH"},{"body":"","link":"https://kane.mx/categories/tips--tricks/","section":"categories","tags":null,"title":"Tips \u0026 Tricks"},{"body":"","link":"https://kane.mx/tags/troubleshooting/","section":"tags","tags":null,"title":"Troubleshooting"},{"body":"","link":"https://kane.mx/tags/vs-code/","section":"tags","tags":null,"title":"VS Code"},{"body":"","link":"https://kane.mx/tags/amazon-quick-suite/","section":"tags","tags":null,"title":"Amazon Quick Suite"},{"body":"Introduction Business intelligence has long been the domain of specialists, requiring complex tools and time-consuming analysis. But what if you could simply ask your data questions in plain English and receive comprehensive, visualized answers in seconds? What if you could automate your weekly reporting with a simple conversation?\nAmazon Quick Suite is AWS's answer. It's a generative AI-powered business intelligence platform designed to democratize data analysis, making it accessible to everyone in your organization—no machine learning expertise required.\nIn this deep dive, we'll explore the game-changing benefits of Quick Suite and share five essential best practices for building production-ready data analysis agents. 
Whether you're new to AWS or an experienced developer, this guide will help you avoid common pitfalls and accelerate your path to AI-driven insights.\nWhat is Amazon Quick Suite? Amazon Quick Suite is a comprehensive business intelligence platform that combines five powerful components:\nQuick Index: A unified knowledge base that consolidates documents, files, and application data to power AI-driven insights. It creates a secure, searchable repository and automatically indexes unstructured data. Quick Research: A powerful agent that conducts research across enterprise data and external sources to deliver contextual, actionable insights in minutes. It breaks down complex questions, gathers information, and validates findings with citations. Quick Sight: An AI-powered BI solution that transforms data into insights through natural language queries and interactive visualizations. It enables users to build dashboards, perform what-if analysis, and respond with one-click actions. Quick Flows: A tool for automating repetitive tasks using natural language. It fetches information, takes action in business applications, generates content, and handles process-specific requirements. Quick Automate: A solution for enterprise-scale process automation that transforms complex business processes into multi-agent workflows, complete with advanced orchestration, governance, and observability. The platform serves three user personas:\nReaders: Access dashboards, run automations, and consume insights. Authors: Build datasets, create agents, and design workflows. Administrators: Manage permissions, monitor costs, and maintain data sources. What truly sets Quick Suite apart is its conversational AI interface. Instead of wrestling with complex query languages, you can ask questions and get intelligent, context-aware responses backed by your organization's data.\nFour Powerful Benefits of Amazon Quick Suite 1. 
Custom Chat Agents with Your Data Create AI agents that understand your business domain and answer questions using your organization's private data—no coding required.\nKey Features:\nUpload company documentation, policies, and procedures as agent knowledge. Connect to multiple data sources: databases (RDS, PostgreSQL), SaaS apps (Salesforce, Jira), data warehouses (Redshift), and spreadsheets. Ask questions in plain English and receive answers with auto-generated visualizations. Export results as reports, dashboards, or data files. Example in Action:\n1You: \u0026#34;Show me our top 10 products by revenue this quarter 2 compared to last quarter, broken down by region.\u0026#34; 3 4Agent: [Queries database, performs analysis, generates comparative charts] 5 6You: \u0026#34;Why did the Northeast region decline?\u0026#34; 7 8Agent: [Analyzes details, identifies patterns, visualizes root causes] The agent maintains conversational context, understands your business logic, and automatically chooses the right visualizations, turning complex data analysis into a natural dialogue.\nExample: An Amazon Quick Suite agent analyzing business data with natural language queries.\n2. Simple Workflow Automation with Quick Flows Automate repetitive tasks by describing them in natural language.\nHow It Works: Simply describe your workflow, and Quick Flows builds it for you.\n1You describe: \u0026#34;Every Monday morning, check last week\u0026#39;s sales data, 2 generate a summary report, and email it to leadership.\u0026#34; 3 4Quick Flows creates: A complete, automated workflow that: 5→ Connects to the sales database 6→ Filters the previous week\u0026#39;s data 7→ Calculates key metrics 8→ Generates a formatted report 9→ Emails it to the specified recipients What You Can Automate:\nData Gathering: Fetch information from databases, applications, and external APIs. Notifications: Send updates via email, Slack, or Microsoft Teams. 
Content Generation: Create brand-compliant reports and documentation. Approval Routing: Handle conditional logic and multi-step approvals. Scheduled Tasks: Run daily, weekly, or monthly processes automatically. With over 50 pre-built connectors for services like Slack, Jira, and Salesforce, Quick Flows streamlines your operations.\n3. Enterprise-Scale Automation with Quick Automate For complex business processes requiring sophisticated orchestration and governance, Quick Automate provides enterprise-grade automation powered by Amazon Bedrock agents.\nKey Differentiators:\nMulti-Agent Orchestration: Coordinate specialized agents for research, analysis, and execution. Dynamic AI Planning: Break down complex tasks and adapt based on real-time results. Human-in-the-Loop: Pause workflows for human approval at critical decision points. Enterprise Governance: Enforce role-based access, audit logging, and compliance tracking. Full Observability: Monitor performance with real-time metrics and execution history. When to Use Quick Flows vs. Quick Automate:\nQuick Flows: Ideal for simple tasks, notifications, and basic data workflows. Quick Automate: Best for mission-critical processes, multi-system orchestration, and regulated compliance workflows. 4. Seamless Collaboration and Sharing Share insights, agents, and workflows across your organization while maintaining enterprise-grade security.\nShareable Assets:\nDashboards: Share or embed visualizations with granular permissions and row-level security. AI Agents: Package custom agents as reusable organizational assets. Workflows: Distribute Quick Flows templates across teams with version control. Datasets: Publish curated datasets for governed, self-service analysis. Enterprise Security:\nSingle Sign-On (SSO) integration with your identity provider. Data encryption at rest and in transit. Complete audit logging for compliance (GDPR, HIPAA, SOC 2). Row-level security to ensure users only see authorized data. 
5 Best Practices for Production-Ready BI Agents These five practices are critical for building reliable and accurate data analysis agents.\nPractice 1: Preview Every Transformation Step Amazon Quick Suite's visual data preparation interface lets you transform data without SQL, but the key to success is previewing after every single change. One bad transformation can corrupt everything downstream.\nThe Visual Workflow: Amazon Quick Suite visualizes your transformations as a sequential pipeline, where each step is editable.\nflowchart LR A[Import Data] --\u0026gt; P1{Preview} P1 --\u0026gt; B[Filter Rows] --\u0026gt; P2{Preview} P2 --\u0026gt; C[Cast Types] --\u0026gt; P3{Preview} P3 --\u0026gt; D[Create Columns] --\u0026gt; P4{Preview} P4 --\u0026gt; E[Aggregate] --\u0026gt; P5{Preview} P5 --\u0026gt; F[Join Tables] --\u0026gt; P6{Preview} P6 --\u0026gt; G[Publish Dataset] style P1 fill:#60a5fa,stroke:#3b82f6,color:#fff style P2 fill:#60a5fa,stroke:#3b82f6,color:#fff style P3 fill:#60a5fa,stroke:#3b82f6,color:#fff style P4 fill:#60a5fa,stroke:#3b82f6,color:#fff style P5 fill:#60a5fa,stroke:#3b82f6,color:#fff style P6 fill:#60a5fa,stroke:#3b82f6,color:#fff style G fill:#10b981,stroke:#059669,color:#fff Example:\n1Step 1: Import sales_data.csv → Preview shows 10,000 rows. ✓ 2Step 2: Filter out test transactions → Preview shows 9,847 rows. ✓ 3Step 3: Cast \u0026#34;sale_date\u0026#34; from String to Date → Preview shows proper dates. ✓ 4Step 4: Cast \u0026#34;amount\u0026#34; from String to Decimal → Preview shows 12 nulls. ⚠️ 5 → Investigation: Found \u0026#34;$1,234.56\u0026#34; format needs cleaning. 6 → Add step: Remove \u0026#34;$\u0026#34; and \u0026#34;,\u0026#34; before casting. 7 → Preview again: 0 nulls. ✓ This iterative validation is crucial for building reliable datasets.\nPractice 2: Handle DateTime Conversion Carefully When datetime strings don't match supported formats, Amazon Quick Suite doesn't fail—it silently converts them to null. 
This can lead to incomplete results that are difficult to detect.\nThe Problem:\n1Your data: \u0026#34;2025.10.29\u0026#34; (dot separator - not supported) 2After cast to Date: NULL 3Import status: ✓ Success (no error!) 4Query result: Missing dates, inaccurate metrics. Supported Formats (AWS docs):\nISO 8601 (Recommended): yyyy-MM-dd'T'HH:mm:ss.SSSZ Common US Formats: MM/dd/yyyy HH:mm:ss and MM-dd-yyyy ⚠️ Important: Always use a 4-digit year (yyyy). Two-digit years (yy) are not supported and will result in nulls.\nPrevention: Always preview your data after a datetime conversion and check for unexpected nulls.\nPractice 3: Sanitize Monetary Values Financial data is often formatted with currency symbols and commas (\u0026quot;$1,234.56\u0026quot;), which Amazon Quick Suite interprets as strings. Attempts to perform calculations on these fields will fail or produce incorrect results.\nThe Solution: Clean the data either before import or during dataset preparation in Amazon Quick Suite.\nCreate a calculated field to remove symbols and commas: 1parseDecimal(replace(replace(revenue, \u0026#34;$\u0026#34;, \u0026#34;\u0026#34;), \u0026#34;,\u0026#34;, \u0026#34;\u0026#34;)) Preview the conversion to ensure all values are now numeric. Set the field role to \u0026quot;Measure\u0026quot; (see Practice 4). Hide the original string field to avoid confusion. For high-precision financial calculations, use the Decimal-fixed data type to prevent floating-point errors.\nPractice 4: Set Field Roles Explicitly (Measure vs. Dimension) Amazon Quick Suite assigns every field a role: Measure (for aggregation) or Dimension (for grouping). The auto-detection can be wrong, leading to broken analytics.\nMeasure: A field you can aggregate (e.g., SUM, AVG). Examples: revenue, quantity. Dimension: A field you use to group or filter. Examples: product_category, region, customer_id. 
Common Mistake: A numeric customer_id might be auto-detected as a Measure, leading to meaningless calculations like AVG(customer_id).\nBest Practice: Manually set roles for all fields.\nNumeric IDs (customer_id, zip_code) should be Dimensions. Calculated metrics (profit, conversion_rate) should be Measures. Practice 5: Implement Row-Level Security (RLS) Row-Level Security (RLS) is essential for controlling data access in multi-user environments. It restricts which rows users can see in a dataset. (RLS is an Amazon QuickSight Enterprise Edition feature).\nHow It Works: RLS uses a separate rules dataset to define permissions.\n1Main Dataset (sales_data): Rules Dataset (sales_rls_rules): 2order_id, region, amount UserName, region 3O-1001, East, $5000 alice@company.com, East 4O-1002, West, $8000 bob@company.com, West 5O-1003, East, $3000 alice@company.com, East 6 7When alice queries → She only sees \u0026#34;East\u0026#34; region rows. 8When bob queries → He only sees \u0026#34;West\u0026#34; region rows. Critical Limitation: RLS rules only work with text fields. To filter by numbers or dates, you must first cast those fields to strings in your rules dataset.\nImplementation Steps:\nCreate a Rules Dataset: A CSV or database table with UserName and the fields to restrict. Upload the Rules: Create a new dataset in Amazon Quick Suite and mark it as containing RLS rules. Apply to Main Dataset: In your main dataset's permissions, apply the rules dataset. Quick Start: Build Your First AI Agent in 7 Steps Let's build a Customer Support Analytics Agent.\nGoal: Analyze support tickets to identify trends, measure team performance, and predict resolution times.\nPrepare Your Data: Start with a support_tickets.csv file containing ticket details. Create \u0026amp; Transform Dataset: In Amazon Quick Suite, create a new dataset and apply transformations. Fix Dates: Use parseDate to convert date strings to the correct format. 
Clean Money: Use parseDecimal and replace to clean monetary values. Calculate Metrics: Create new fields like resolution_hours using dateDiff. Set Roles: Define agent_id as a Dimension and resolution_hours as a Measure. Create a Topic: Make your dataset conversational by linking it to a new Topic and adding custom instructions like \u0026quot;SLA is 24 hours\u0026quot;. Create a Space: Bundle your Topic and related documentation into a Space for your support team. Create the Chat Agent: Create an agent, link it to your Space, and give it a persona, such as \u0026quot;You are a support analytics assistant.\u0026quot; Test with Real Questions: Validate the agent with progressively complex queries, from simple counts to comparative analyses. Share with Your Team: Deploy the agent, space, and topic for your organization to use. Conclusion Amazon Quick Suite represents a fundamental shift in business intelligence. By combining generative AI with a comprehensive BI platform, it empowers everyone in an organization to interact with data conversationally.\nThe key to success lies in a disciplined approach. By following the five best practices—previewing every transformation, handling data types carefully, and implementing robust security—you can build reliable, production-ready AI agents that deliver real business value.\nThe future of BI is conversational, intelligent, and accessible. Amazon Quick Suite brings that future to your organization today.\nResources Amazon Quick Suite Official Documentation Supported Data Types and Formats in QuickSight Row-Level Security (RLS) Guide for QuickSight Have you built AI-powered analytics agents with Amazon Quick Suite? 
Share your experiences and lessons learned in the comments below!\n","link":"https://kane.mx/posts/2025/amazon-quicksuite-deep-dive/","section":"posts","tags":["Amazon Quick Suite","AWS","QuickSight","Business Intelligence","AI Agents","Data Analysis","Workflow Automation","QuickFlows","QuickAutomate"],"title":"Amazon Quick Suite Deep Dive: Build AI-Powered Business Intelligence on AWS"},{"body":"","link":"https://kane.mx/categories/business-intelligence/","section":"categories","tags":null,"title":"Business Intelligence"},{"body":"","link":"https://kane.mx/tags/business-intelligence/","section":"tags","tags":null,"title":"Business Intelligence"},{"body":"","link":"https://kane.mx/tags/data-analysis/","section":"tags","tags":null,"title":"Data Analysis"},{"body":"","link":"https://kane.mx/tags/quickautomate/","section":"tags","tags":null,"title":"QuickAutomate"},{"body":"","link":"https://kane.mx/tags/quickflows/","section":"tags","tags":null,"title":"QuickFlows"},{"body":"","link":"https://kane.mx/tags/quicksight/","section":"tags","tags":null,"title":"QuickSight"},{"body":"","link":"https://kane.mx/tags/workflow-automation/","section":"tags","tags":null,"title":"Workflow Automation"},{"body":"","link":"https://kane.mx/tags/aws-skills/","section":"tags","tags":null,"title":"AWS Skills"},{"body":"Introduction Building on AWS is powerful but complex. What if your AI assistant had deep AWS expertise built-in? That's what AWS Skills brings to Claude Code.\nAWS Skills is a plugin that transforms Claude Code into an intelligent AWS development partner. It understands CDK best practices, estimates costs before you deploy, and guides you through serverless patterns. This post shows you how to build a serverless REST API using Claude Code supercharged with AWS Skills.\nHow It Works: Agent Skills AWS Skills uses Claude Agent Skills, which are modular extensions that give Claude new capabilities. 
Claude autonomously decides when to use a skill based on your request, allowing it to interact with external tools like the AWS CDK, pricing calculators, and documentation.\nThe AWS Skills Capabilities AWS Skills bundles this power into three plugins:\nAWS CDK Plugin: Brings CDK expertise, best practices, and cdk-nag security checks into your workflow. Cost \u0026amp; Operations Plugin: Provides pre-deployment cost estimates, billing analysis, and CloudWatch monitoring. Serverless \u0026amp; EDA Plugin: Offers patterns for event-driven architectures using EventBridge, Step Functions, and SAM. Building a Serverless REST API Let's build a simple task management REST API to see how AWS Skills improves the development workflow.\nArchitecture We'll build an API with three main components:\nAPI Gateway: Provides the RESTful HTTP endpoint. Lambda Function: Contains the business logic. DynamoDB: Stores the task data. Step 1: Ask Claude to Create the CDK Infrastructure With AWS Skills, you can ask for the infrastructure in plain English.\nYou:\n/aws-cdk-development Create a CDK stack for a task management API with an API Gateway, a Lambda function (without detailed implementation), and a DynamoDB table. Follow AWS best practices and write in Python. Claude (with AWS Skills):\nThe AWS CDK Plugin helps Claude generate a complete, best-practice CDK stack in Python based on the prompt. It defines the DynamoDB table, the Lambda function, and the API Gateway, including setting up the necessary IAM permissions.\nStep 2: Implement the Lambda Handler You can then ask Claude to generate the Lambda function code with best practices.\nYou:\n/aws-serverless-eda Now, create the Python Lambda handler for the task API. It should handle POST requests to create a task and GET requests to retrieve one. 
Claude (with AWS Skills):\nClaude generates the Lambda handler code, including the logic for creating and retrieving tasks from the DynamoDB table using the AWS SDK.\nStep 3: Estimate Costs Before Deployment This is where AWS Skills really shines. Ask for a cost estimate before deploying anything.\nYou:\nEstimate the monthly cost for this API assuming 1 million requests per month, 100ms average Lambda duration, and 10GB of data in DynamoDB. Claude (with Cost \u0026amp; Operations Plugin):\nClaude uses its cost estimation skill to give you a clear breakdown of the estimated monthly costs for API Gateway, Lambda, and DynamoDB based on your usage estimates. This helps you make informed decisions before deployment.\nStep 4: Deploy with Confidence When you're ready, Claude guides you through deployment.\nYou:\nDeploy this stack to us-east-1. Claude:\nClaude provides the exact cdk commands to bootstrap your environment (if needed) and deploy the stack. It will also show you the API endpoint URL from the stack outputs once the deployment is complete.\nDevelopment Workflow: Before vs. After AWS Skills streamlines your workflow by keeping you in your IDE, saving time and reducing context switching.\nBefore: The Old Way\nWrite code in your IDE. Switch to a browser to search documentation. Open another tab for the AWS Pricing Calculator. Manually check for best practices. Deploy from the CLI. Debug issues in the AWS Console. After: With AWS Skills\nDescribe your goal in plain language. Claude generates the code and a cost estimate. Review and refine in your IDE. Deploy automatically from Claude. You ship better code faster, all from one place.\nConclusion AWS Skills turns Claude Code into a specialized AWS development partner. 
It brings real-time AWS knowledge directly into your editor, helping you build faster, more cost-effective, and more reliable applications.\nBy automating best practices, providing pre-deployment cost insights, and guiding you through complex patterns, AWS Skills lets you focus on what matters most: your application's business logic.\nResources AWS Skills GitHub Repository Claude Code Agent Skills Model Context Protocol Ready to build on AWS faster? Install AWS Skills and let me know what you think in the comments!\n","link":"https://kane.mx/posts/2025/aws-skills-claude-code/","section":"posts","tags":["Claude Code","AWS Skills","AWS CDK","Serverless","Lambda","DynamoDB","API Gateway","Agent Skills","MCP Protocol","Infrastructure as Code"],"title":"Build on AWS Faster with Claude Code and AWS Skills"},{"body":"","link":"https://kane.mx/tags/lambda/","section":"tags","tags":null,"title":"Lambda"},{"body":"","link":"https://kane.mx/tags/mcp-protocol/","section":"tags","tags":null,"title":"MCP Protocol"},{"body":"","link":"https://kane.mx/tags/agent-framework/","section":"tags","tags":null,"title":"Agent Framework"},{"body":"","link":"https://kane.mx/tags/ai-automation/","section":"tags","tags":null,"title":"AI Automation"},{"body":"","link":"https://kane.mx/tags/claude-agent-sdk/","section":"tags","tags":null,"title":"Claude Agent SDK"},{"body":"","link":"https://kane.mx/tags/sdk-migration/","section":"tags","tags":null,"title":"SDK Migration"},{"body":"","link":"https://kane.mx/tags/typescript/","section":"tags","tags":null,"title":"TypeScript"},{"body":"Heads up, builders! You may have noticed that Anthropic has rebranded the Claude Code SDK to the new Claude Agent SDK.\nThis is more than just a new name. 
As the official announcement explains, this change reflects a strategic focus on making it easier than ever to build, debug, and deploy powerful AI agents.\nIf you've been following our deep dive into building agentic applications, you'll be happy to know that the migration is incredibly straightforward. All the core concepts and powerful features like conversation management, MCP integration, and response streaming are still there.\nHere’s the TL;DR on how to upgrade your project.\n1. Update Your Dependencies First, uninstall the old package and install the new one.\nnpm uninstall @anthropic-ai/claude-code npm install @anthropic-ai/claude-agent-sdk 2. Update Your Imports Next, just find and replace the import statements in your TypeScript files.\nBefore:\nimport { query, type SDKMessage } from \u0026#39;@anthropic-ai/claude-code\u0026#39;; After:\nimport { query, type SDKMessage } from \u0026#39;@anthropic-ai/claude-agent-sdk\u0026#39;; 3. Handle the Breaking Change in query Options This is the most important part of the migration. The new SDK introduces a breaking change in how you provide a system prompt. The appendSystemPrompt option has been replaced with a more structured systemPrompt object.\nHere’s how to adapt your query calls:\nBefore (claude-code-sdk):\nconst response = query({ prompt: prompt, options: { appendSystemPrompt: systemPrompt, // This is now deprecated mcpServers: mcpServers, // ...other options } }); After (claude-agent-sdk):\nconst response = query({ prompt: prompt, options: { systemPrompt: systemPrompt, // Use the new systemPrompt option mcpServers: mcpServers, // ...other options } }); In some cases, you might also want to specify the systemPrompt as a preset, like so:\nconst response = query({ prompt: slashCommand, options: { systemPrompt: { type: \u0026#39;preset\u0026#39;, preset: \u0026#39;claude_code\u0026#39; }, // ...other options } }); And that's it! 
With these changes, your existing code, including all your slash commands and MCP configurations, will work exactly as before.\nThe move to the Claude Agent SDK is a clear signal of the agent-first future of AI. By making this small update, you're keeping your projects aligned with the latest and greatest from Anthropic.\nFor a complete guide to the architecture and patterns for building with the SDK, be sure to check out our original, in-depth tutorial.\nHappy building!\n","link":"https://kane.mx/posts/2025/claude-agent-sdk-update/","section":"posts","tags":["Claude Agent SDK","Claude Code","Agent Framework","AI Automation","TypeScript","SDK Migration"],"title":"Upgrade to Claude Agent SDK: A Quick Migration Guide from Claude Code"},{"body":"","link":"https://kane.mx/tags/agentic-ai/","section":"tags","tags":null,"title":"Agentic AI"},{"body":"Introduction If you've been following the AI space, you know that \u0026quot;agents\u0026quot; are the next big thing. These autonomous, intelligent systems promise to revolutionize how we automate complex tasks. But building them can be a daunting undertaking, often involving a tangled web of conversation management, tool integration, and state tracking.\nWhat if I told you there's a framework that handles the heavy lifting, letting you focus on your agent's unique logic? Enter Claude Code. It has evolved beyond a simple CLI tool into a powerful AI application framework for building sophisticated AI applications.\nIn this deep dive and Claude Code tutorial, I'll show you how to leverage Claude Code's SDK and infrastructure to create intelligent agents that can interact with external services, maintain conversations, and execute complex agentic workflows. This guide is based on the social-agents project, a real-world implementation that demonstrates Claude Code's capabilities for multi-platform social media automation. 
We'll explore the AI agent architecture, patterns, and best practices that make it an excellent choice for your AI agent development.\nWhat Makes Claude Code Special for Agent Development? So what sets Claude Code apart from the crowd of AI frameworks? While many tools leave you wrestling with conversation state, tool integration, and response streaming, Claude Code delivers these as core, out-of-the-box features for building AI agents:\n🧠 Intelligent Conversation Management Built-in session tracking and resumption for stateful workflows. Maintains context across multiple interactions. Automatically stores conversation history in a simple JSONL format. 🔧 Seamless Tool Integration via MCP Natively supports the Model Context Protocol (MCP) for connecting to local or remote tools. Provides standardized tool discovery, execution, and permission management. ⚡ Effortless Streaming Responses Processes AI responses in real-time with progress tracking. Natively handles different message types (system, assistant, user, tool results). Comes with built-in error handling and recovery mechanisms. 🎯 Intuitive Agent Customization Define agent behaviors declaratively using simple Markdown files (.claude/commands/*.md). Easily customize system prompts and agent personalities. Provides a natural language interface for triggering complex operations. Architecture Overview Let's examine how the social-agents project structures a multi-platform agent system. 
The architecture is designed for modularity and scalability.\n1social-agents/ 2├── src/ 3│ ├── social-sdk-executor.ts # Core agent execution engine 4│ ├── env-loader.ts # Environment configuration 5│ ├── logger.ts # Structured logging 6│ └── types.ts # TypeScript definitions 7├── .claude/ 8│ └── commands/ 9│ ├── twitter.md # Twitter agent configuration 10│ ├── reddit.md # Reddit agent configuration 11│ └── linkedin.md # LinkedIn agent configuration 12├── twitter.ts # Twitter command interface 13├── reddit.ts # Reddit command interface 14├── linkedin.ts # LinkedIn command interface 15├── .mcp.json # MCP server configuration 16└── package.json # Scripts and dependencies This architecture demonstrates several key patterns for building agentic applications:\nGeneric Executor Pattern: A single SocialSDKExecutor handles the core logic for all platforms. Platform-Specific Commands: Individual TypeScript files create dedicated command-line interfaces for each social media platform. Declarative Agent Configuration: Markdown files in the .claude/commands/ directory define each agent's unique behavior, tools, and prompts. Standardized MCP Integration: External services are connected through a single, unified protocol. 
Here’s a look at the interaction flow:\nsequenceDiagram participant User as User participant CLI as Platform CLI (e.g., twitter.ts) participant Executor as SocialSDKExecutor participant ClaudeCode as Claude Code SDK participant MCP as MCP Server (Rube) participant SocialAPI as Social Media API (e.g., Twitter) User-\u0026gt;\u0026gt;+CLI: npm run twitter -- \u0026#34;post a tweet\u0026#34; CLI-\u0026gt;\u0026gt;+Executor: execute(\u0026#39;twitter\u0026#39;, \u0026#39;post a tweet\u0026#39;, options) Executor-\u0026gt;\u0026gt;+ClaudeCode: query({ prompt: \u0026#39;/twitter post a tweet\u0026#39; }) ClaudeCode-\u0026gt;\u0026gt;ClaudeCode: Load .claude/commands/twitter.md ClaudeCode-\u0026gt;\u0026gt;+MCP: list_tools() MCP--\u0026gt;\u0026gt;-ClaudeCode: Available tools ClaudeCode-\u0026gt;\u0026gt;+MCP: call_tool(\u0026#39;post_tweet\u0026#39;, {text: \u0026#39;...\u0026#39;}) MCP-\u0026gt;\u0026gt;+SocialAPI: POST /2/tweets SocialAPI--\u0026gt;\u0026gt;-MCP: Tweet created MCP--\u0026gt;\u0026gt;-ClaudeCode: Tool result ClaudeCode--\u0026gt;\u0026gt;-Executor: Streaming messages (assistant, result) Executor--\u0026gt;\u0026gt;-CLI: Process and log messages CLI--\u0026gt;\u0026gt;-User: Display output Core Implementation: The Agent Executor At the heart of our agent is the executor. This is the engine that drives the AI interaction, orchestrating everything from loading configurations to processing streaming responses. Let's break down how it works.\nHere's a simplified version of the implementation:\n1import { query, type SDKMessage, type McpServerConfig } from \u0026#39;@anthropic-ai/claude-code\u0026#39;; 2 3export class SocialSDKExecutor { 4 static async execute(platform: string, prompt: string, options: SocialOptions): Promise\u0026lt;void\u0026gt; { 5 // Load MCP server configuration 6 const mcpServers = this.loadMCPServers(); 7 8 // Build the slash command with platform and options 9 const slashCommand = `/${platform} ${prompt} ${options.dryRun ? 
\u0026#39;--dry-run\u0026#39; : \u0026#39;\u0026#39;}`.trim(); 10 11 // Execute the command using the Claude Code SDK 12 const response = query({ 13 prompt: slashCommand, 14 options: { 15 mcpServers: mcpServers, 16 cwd: process.cwd(), 17 ...(options.resume \u0026amp;\u0026amp; { resume: options.resume }) 18 } 19 }); 20 21 // Process the streaming responses 22 for await (const message of response) { 23 await this.processMessage(message, options.verbose); 24 } 25 } 26 27 private static async processMessage(message: SDKMessage, verbose: boolean): Promise\u0026lt;void\u0026gt; { 28 switch (message.type) { 29 case \u0026#39;assistant\u0026#39;: 30 // Handle AI responses 31 console.log(message.message.content); 32 break; 33 case \u0026#39;result\u0026#39;: 34 // Handle execution results 35 if (message.subtype === \u0026#39;success\u0026#39;) { 36 console.log(\u0026#39;Operation completed successfully!\u0026#39;); 37 } 38 break; 39 case \u0026#39;system\u0026#39;: 40 // Handle system messages (e.g., model info, MCP status) 41 if (verbose) { 42 console.log(`System: ${message.model}`); 43 } 44 break; 45 } 46 } 47} Slash Command Configuration The real power of this framework comes from configuring agent behaviors through simple Markdown files. These \u0026quot;slash commands\u0026quot; are the brains of the operation, defining what an agent is and what it can do.\nHere’s how the Twitter agent is defined in .claude/commands/twitter.md:\n1--- 2allowed-tools: mcp__rube__RUBE_SEARCH_TOOLS, mcp__rube__RUBE_MULTI_EXECUTE_TOOL, mcp__rube__RUBE_CREATE_PLAN 3description: Twitter/X engagement and content creation specialist 4argument-hint: [natural language request] [--dry-run] [--verbose] 5--- 6 7You are an expert Twitter operations specialist with comprehensive AI-driven capabilities. 
8 9AVAILABLE OPERATIONS (when RUBE tools are accessible): 10• Generate viral tweets, threads, and engaging content 11• Search and analyze Twitter posts, trends, and conversations 12• Engage with tweets through likes, retweets, and replies 13• Monitor topics, hashtags, and mentions for social listening 14• Analyze sentiment, engagement metrics, and performance data 15 16EXECUTION APPROACH: 171. Try to use RUBE_SEARCH_TOOLS to discover available Twitter tools. 182. If permission is needed, explain the requirement clearly. 193. Execute operations using RUBE_MULTI_EXECUTE_TOOL. 204. Provide detailed results, insights, and recommendations. 21 22Execute the following Twitter operation: $ARGUMENTS This configuration file tells Claude Code:\nTools: Which MCP tools the agent is permitted to use. Persona: The agent's designated role and capabilities. Logic: How to handle different scenarios, such as permissions or failures. Task: The specific operation to execute, passing in the user's arguments. MCP Integration for External Services The Model Context Protocol (MCP) enables seamless integration with external services. Instead of writing custom API clients for every service, you just point to an MCP server. The configuration is straightforward in .mcp.json:\n1{ 2 \u0026#34;mcpServers\u0026#34;: { 3 \u0026#34;rube\u0026#34;: { 4 \u0026#34;type\u0026#34;: \u0026#34;http\u0026#34;, 5 \u0026#34;url\u0026#34;: \u0026#34;https://rube.app/mcp\u0026#34;, 6 \u0026#34;headers\u0026#34;: { 7 \u0026#34;Authorization\u0026#34;: \u0026#34;Bearer ${RUBE_API_TOKEN}\u0026#34; 8 } 9 } 10 } 11} In this project, we connect to RUBE, an MCP server that provides a massive library of pre-built tool integrations. This instantly gives our agent access to:\nTwitter/X API operations Reddit API integration LinkedIn automation Over 500 other applications A unified way to discover and execute tools. 
Platform-Specific Command Interfaces To make the agents easy to use from the command line, each platform gets its own wrapper script. This provides a clean, dedicated entry point for each agent.\n1// twitter.ts 2import { SocialSDKExecutor, type SocialOptions } from \u0026#39;./src/social-sdk-executor.js\u0026#39;; 3 4async function main() { 5 const args = process.argv.slice(2); 6 7 // Parse command-line options 8 const options: SocialOptions = { 9 dryRun: args.includes(\u0026#39;--dry-run\u0026#39;), 10 verbose: args.includes(\u0026#39;--verbose\u0026#39;), 11 resume: extractResumeId(args) 12 }; 13 14 // Extract the natural language prompt 15 const prompt = args 16 .filter(arg =\u0026gt; !arg.startsWith(\u0026#39;--\u0026#39;)) 17 .join(\u0026#39; \u0026#39;); 18 19 // Execute with the generic executor, specifying the \u0026#39;twitter\u0026#39; platform 20 await SocialSDKExecutor.execute(\u0026#39;twitter\u0026#39;, prompt, options); 21} 22 23main().catch(console.error); This approach provides:\nA Consistent Interface: The same patterns are used across all platforms. Platform Flexibility: It's easy to add new platforms by creating a new command file. Command-Line Integration: It uses standard Unix-style flags and arguments. Type Safety: It offers full TypeScript support with proper error handling. Session Management and Conversation Continuity One of the most frustrating parts of building chatbots is managing conversation history. Claude Code turns this into a superpower with its built-in session management. 
You don't have to do anything; it just works.\n1# Start a new conversation 2npm run twitter -- \u0026#34;create viral content about TypeScript\u0026#34; 3# Output: 📌 Session ID: 77552924-a31c-4c1a-a07c-990855aa95a3 4 5# Resume and continue the conversation 6npm run twitter -- \u0026#34;now create a follow-up thread\u0026#34; --resume 77552924-a31c-4c1a-a07c-990855aa95a3 7 8# Keep iterating within the same context 9npm run twitter -- \u0026#34;make it more technical\u0026#34; --resume 77552924-a31c-4c1a-a07c-990855aa95a3 Sessions are automatically stored locally in ~/.claude/projects/ as JSONL transcripts, enabling:\nStateful Workflows: Maintain context across multiple interactions. Conversation History: Review previous exchanges and decisions. Debugging: Trace the execution flow to identify issues. Collaboration: Share session IDs with team members to pick up where you left off. Error Handling and Fallback Strategies Robust agent applications require comprehensive error handling. The social-agents project demonstrates several useful patterns for building resilient agents.\nPermission Management The agent can detect when it needs permissions for a tool and inform the user.\n1// Handle MCP tool permission requirements 2if (!toolsAccessible) { 3 logger.info(\u0026#39;🔑 Permission Required for RUBE MCP Server\u0026#39;); 4 logger.info(\u0026#39;Please grant permission when prompted to enable operations.\u0026#39;); 5 return; 6} Graceful Degradation If a tool fails or isn't available, the agent can fall back to a different mode of operation, such as providing strategic advice instead of executing a task.\n1// Provide strategic guidance when tools aren\u0026#39;t available 2if (message.type === \u0026#39;result\u0026#39; \u0026amp;\u0026amp; message.subtype === \u0026#39;error\u0026#39;) { 3 logger.warning(\u0026#39;Tools not accessible - providing strategic guidance instead.\u0026#39;); 4 // Continue with educational/planning mode 5} Advanced Patterns and Best Practices 
As you build more complex agents, you'll find these patterns from the social-agents project invaluable.\nPrioritized Environment Configuration The project uses a sophisticated system for loading environment variables, ensuring that local overrides are respected while maintaining sensible defaults.\n1export function loadEnvironment(): EnvironmentConfig { 2 // Priority: .env.local → system environment variables 3 const localEnvPath = path.join(process.cwd(), \u0026#39;.env.local\u0026#39;); 4 5 // Load and merge configurations 6 const env = { 7 ...process.env, 8 ...loadEnvFile(localEnvPath) // .env.local has the highest priority 9 }; 10 11 return validateEnvironment(env); 12} Streaming Response Processing Handle different message types as they arrive to provide a rich, real-time user experience.\n1for await (const message of response) { 2 switch (message.type) { 3 case \u0026#39;assistant\u0026#39;: 4 // Stream AI responses in real-time 5 process.stdout.write(message.message.content); 6 break; 7 case \u0026#39;system\u0026#39;: 8 // Capture session IDs and server status 9 if (message.session_id) { 10 sessionId = message.session_id; 11 } 12 break; 13 case \u0026#39;result\u0026#39;: 14 // Handle final outcomes of tool executions 15 displayResults(message); 16 break; 17 } 18} Type Safety with Zod Validation Ensure runtime type safety for configurations and options using Zod.\n1import { z } from \u0026#39;zod\u0026#39;; 2 3const SocialOptionsSchema = z.object({ 4 dryRun: z.boolean(), 5 verbose: z.boolean(), 6 resume: z.string().optional() 7}); 8 9export type SocialOptions = z.infer\u0026lt;typeof SocialOptionsSchema\u0026gt;; Building Your Own Agent Application Ready to build your own autonomous agent? Here's a quick-start guide to get you up and running in minutes.\n1. Project Setup 1npm init -y 2npm install @anthropic-ai/claude-code tsx typescript @types/node 2. 
Create the Core Executor Create a generic executor to handle the agent's core logic.\n1// src/agent-executor.ts 2import { query } from \u0026#39;@anthropic-ai/claude-code\u0026#39;; 3 4export class AgentExecutor { 5 static async execute(agentType: string, prompt: string, options: AgentOptions) { 6 const response = query({ 7 prompt: `/${agentType} ${prompt}`, 8 options: { 9 mcpServers: await this.loadMCPServers(), 10 cwd: process.cwd() 11 } 12 }); 13 14 for await (const message of response) { 15 // Your message processing logic here 16 } 17 } 18} 3. Configure Agent Behaviors Create a file at .claude/commands/my-agent.md to define your agent's persona and tools.\n1--- 2allowed-tools: your_mcp_tools_here 3description: A short description of your agent 4--- 5 6You are an expert assistant for [your domain]. 7 8Execute the following operation: $ARGUMENTS 4. Create the Command Interface Create a simple command-line entry point for your agent.\n1// my-agent.ts 2import { AgentExecutor } from \u0026#39;./src/agent-executor.js\u0026#39;; 3 4async function main() { 5 const args = process.argv.slice(2); 6 const prompt = args.join(\u0026#39; \u0026#39;); 7 8 await AgentExecutor.execute(\u0026#39;my-agent\u0026#39;, prompt, { 9 dryRun: args.includes(\u0026#39;--dry-run\u0026#39;), 10 verbose: args.includes(\u0026#39;--verbose\u0026#39;) 11 }); 12} 13 14main().catch(console.error); 5. Configure MCP Servers Create a .mcp.json file to connect to your tools.\n1{ 2 \u0026#34;mcpServers\u0026#34;: { 3 \u0026#34;your-service\u0026#34;: { 4 \u0026#34;type\u0026#34;: \u0026#34;http\u0026#34;, 5 \u0026#34;url\u0026#34;: \u0026#34;https://your-mcp-server.com\u0026#34;, 6 \u0026#34;headers\u0026#34;: { 7 \u0026#34;Authorization\u0026#34;: \u0026#34;Bearer ${YOUR_API_TOKEN}\u0026#34; 8 } 9 } 10 } 11} Conclusion Claude Code represents a significant step forward in AI application development. 
By providing built-in conversation management, MCP integration, and streaming responses, it eliminates much of the boilerplate that traditionally plagued agent development.\nThe social-agents project demonstrates how these capabilities enable sophisticated, multi-platform automation with surprisingly little code. The slash command architecture makes agents configurable and maintainable, while the generic executor pattern ensures consistency across different domains.\nKey takeaways for your own agent applications:\nStart with the Executor Pattern: Build a generic executor that can handle multiple agent types. Use Slash Commands: Configure agent behaviors through .claude/commands/*.md files. Embrace MCP Integration: Connect to external services through standardized protocols. Implement Session Management: Support conversation continuity and stateful workflows. Plan for Fallbacks: Handle permissions, errors, and degraded functionality gracefully. Test with Dry Run: Build safe testing into every operation. Whether you're building social media automation, customer service bots, or complex workflow orchestration, Claude Code provides the foundation for sophisticated agentic applications. The patterns from the social-agents project offer a proven template for scaling AI automation across multiple domains and platforms.\nThe future of AI application development is agentic, and Claude Code gives you the tools to build it today. I encourage you to clone the social-agents repository, experiment with the patterns, and start building your own intelligent agents.\nResources Social Agents Repository - Complete implementation example Claude Code SDK Documentation - Official documentation and guides RUBE MCP Server - 500+ app integrations for your agents Model Context Protocol - Learn about MCP standards Have you built agentic applications with Claude Code? 
Share your experiences and patterns in the comments below!\n","link":"https://kane.mx/posts/2025/claude-code-agent-framework/","section":"posts","tags":["Claude Code","Agent Framework","AI Automation","TypeScript","MCP Protocol","Social Media Automation","Agentic AI"],"title":"Building Agentic Applications with Claude Code: A Developer's Guide to AI-Powered Automation"},{"body":"","link":"https://kane.mx/tags/social-media-automation/","section":"tags","tags":null,"title":"Social Media Automation"},{"body":"","link":"https://kane.mx/tags/authentication/","section":"tags","tags":null,"title":"Authentication"},{"body":"","link":"https://kane.mx/tags/aws-agentcore-gateway/","section":"tags","tags":null,"title":"AWS AgentCore Gateway"},{"body":"","link":"https://kane.mx/tags/aws-agentcore-runtime/","section":"tags","tags":null,"title":"AWS AgentCore Runtime"},{"body":"Overview Building on my previous exploration of connecting to MCP servers hosted on AWS AgentCore, I've been working extensively with the native MCP SDK's OAuth Client Provider to streamline authentication workflows. The MCP SDK's built-in OAuth support has evolved significantly, offering robust solutions for both interactive user authentication and machine-to-machine (M2M) flows.\nIn this follow-up article, I'll share the key improvements and special techniques I've discovered for using the MCP Client's OAuthClientProvider with AWS AgentCore, including handling AgentCore's unique behavior with 403 responses, implementing M2M authentication flows, and leveraging automatic token refresh capabilities.\nWhat makes this approach particularly compelling is how the native SDK abstracts away much of the OAuth complexity while providing the flexibility needed for enterprise-grade deployments on AWS AgentCore.\nKey Improvements Over Manual OAuth Implementation The native MCP SDK OAuth Client Provider offers several advantages over the manual OAuth implementations I covered in my previous post:\n1. 
Automatic Token Management Built-in token storage and refresh mechanisms Seamless handling of expired tokens with automatic retry logic Support for both refresh_token (interactive) and client_credentials (M2M) flows 2. AgentCore-Specific Compatibility Custom handling of 403 HTTP responses (AgentCore returns 403 instead of 401 for unauthorized requests) Proper cross-domain OAuth metadata configuration Enhanced error handling and debugging capabilities 3. Dual-Mode Authentication Automatic detection of M2M vs Interactive mode based on client configuration Single codebase supporting both authentication patterns Intelligent scope selection based on OAuth provider type The AgentCoreOAuthClientProvider The heart of this improved implementation is a custom OAuth provider that extends the native MCP SDK's OAuthClientProvider:\n1class AgentCoreOAuthClientProvider(OAuthClientProvider): 2 \u0026#34;\u0026#34;\u0026#34;Custom OAuth provider that triggers on 403 (not just 401) for AgentCore compatibility. 3 4 Supports both interactive OAuth flows and M2M (client_credentials) flows with automatic 5 token refresh for both modes. 
6 \u0026#34;\u0026#34;\u0026#34; 7 8 def __init__(self, *args, **kwargs): 9 super().__init__(*args, **kwargs) 10 self.is_m2m_mode = False # Will be set after client info is available 11 12 def _detect_m2m_mode(self) -\u0026gt; bool: 13 \u0026#34;\u0026#34;\u0026#34;Detect if we\u0026#39;re in M2M mode based on client_secret availability.\u0026#34;\u0026#34;\u0026#34; 14 return bool( 15 self.context.client_info and 16 self.context.client_info.client_secret and 17 hasattr(self.context, \u0026#39;client_metadata\u0026#39;) and 18 (not hasattr(self.context.client_metadata, \u0026#39;redirect_uris\u0026#39;) or 19 not self.context.client_metadata.redirect_uris) 20 ) 21 22 async def async_auth_flow(self, request: httpx.Request) -\u0026gt; AsyncGenerator[httpx.Request, httpx.Response]: 23 \u0026#34;\u0026#34;\u0026#34;HTTPX auth flow integration with 403 support and M2M mode.\u0026#34;\u0026#34;\u0026#34; 24 # ... initialization logic ... 25 26 response = yield request 27 28 # CUSTOM FIX: Trigger OAuth flow on 403 OR 401 (AgentCore returns 403) 29 if response.status_code in (401, 403): 30 # Perform appropriate OAuth flow based on mode 31 if self.is_m2m_mode: 32 # M2M mode: Use client_credentials directly, no browser interaction 33 token_request = await self._get_m2m_token() 34 token_response = yield token_request 35 await self._handle_m2m_token_response(token_response) 36 else: 37 # Interactive mode: Use authorization code flow 38 auth_code, code_verifier = await self._perform_authorization() 39 token_request = await self._exchange_token(auth_code, code_verifier) 40 token_response = yield token_request 41 await self._handle_token_response(token_response) 42 43 # Retry with new tokens 44 self._add_auth_header(request) 45 yield request Special Tricks for AgentCore Runtime 1. 403 Response Handling AWS AgentCore returns HTTP 403 (Forbidden) instead of the standard HTTP 401 (Unauthorized) when authentication is required. 
This is a critical detail that trips up most OAuth implementations:\n1# Standard OAuth implementations only handle 401 2if response.status_code == 401: 3 # Trigger OAuth flow 4 5# AgentCore-compatible implementation handles both 6if response.status_code in (401, 403): 7 # Trigger OAuth flow - works with both standard servers and AgentCore 2. Cross-Domain Metadata Configuration AgentCore MCP servers run on a different domain from the OAuth provider (typically AWS Cognito). This requires manual configuration of protected resource metadata:\n1# Extract OAuth server URL from discovery URL 2oauth_server_url = config[\u0026#39;discovery_url\u0026#39;].replace(\u0026#39;/.well-known/openid_configuration\u0026#39;, \u0026#39;\u0026#39;) 3 4# Create protected resource metadata pointing to Cognito 5protected_metadata = ProtectedResourceMetadata( 6 resource=PydanticUrl(config[\u0026#39;mcp_server_url\u0026#39;]), 7 authorization_servers=[PydanticUrl(oauth_server_url)] 8) 9 10# Manually inject the metadata into the OAuth context 11oauth_auth.context.protected_resource_metadata = protected_metadata 12oauth_auth.context.auth_server_url = oauth_server_url 3. 
Pre-configured Client Information AWS Cognito doesn't support OAuth dynamic client registration, so we need to pre-configure client information:\n1# Pre-configure client info to skip registration 2client_info = OAuthClientInformationFull( 3 client_id=config[\u0026#39;client_id\u0026#39;], 4 client_secret=config.get(\u0026#39;client_secret\u0026#39;), 5 authorization_endpoint=\u0026#34;\u0026#34;, # Will be populated during OAuth metadata discovery 6 token_endpoint=\u0026#34;\u0026#34;, # Will be populated during OAuth metadata discovery 7 redirect_uris=redirect_uris 8) 9await token_storage.set_client_info(client_info) M2M Authentication Flow Support One of the most significant improvements is robust support for M2M authentication using the OAuth 2.0 client credentials flow:\nAutomatic Mode Detection The system automatically detects whether to use M2M or interactive authentication based on the presence of a client secret:\n1# Detect M2M mode based on client_secret presence 2is_m2m_mode = bool(config.get(\u0026#39;client_secret\u0026#39;)) 3 4if is_m2m_mode: 5 print(\u0026#34;🏭 Detected client_secret - using M2M authentication\u0026#34;) 6 print(\u0026#34;🚀 No user interaction required - fully automated\u0026#34;) 7else: 8 print(\u0026#34;🔐 No client_secret detected - using interactive authentication\u0026#34;) 9 print(\u0026#34;🌐 Browser-based user authentication required\u0026#34;) M2M Token Acquisition The M2M flow bypasses browser-based authorization entirely:\n1async def _get_m2m_token(self) -\u0026gt; httpx.Request: 2 \u0026#34;\u0026#34;\u0026#34;Get M2M access token using client_credentials flow.\u0026#34;\u0026#34;\u0026#34; 3 token_data = { 4 \u0026#34;grant_type\u0026#34;: \u0026#34;client_credentials\u0026#34;, 5 \u0026#34;client_id\u0026#34;: self.context.client_info.client_id, 6 \u0026#34;client_secret\u0026#34;: self.context.client_info.client_secret, 7 } 8 9 # Add scope if specified 10 if self.context.client_metadata.scope: 11 
token_data[\u0026#34;scope\u0026#34;] = self.context.client_metadata.scope 12 13 return httpx.Request( 14 \u0026#34;POST\u0026#34;, 15 token_url, 16 data=token_data, 17 headers={\u0026#34;Content-Type\u0026#34;: \u0026#34;application/x-www-form-urlencoded\u0026#34;} 18 ) AWS Cognito M2M Configuration For AWS Cognito M2M flows, specific configuration is required:\n1# For AWS Cognito M2M, configure appropriate scopes 2if \u0026#39;cognito-idp\u0026#39; in discovery_url.lower(): 3 if is_m2m_mode: 4 # Use configured M2M scopes or None for default 5 scope = config[\u0026#39;m2m_scopes\u0026#39;] # e.g., \u0026#34;mcp-server/read mcp-server/write\u0026#34; 6 else: 7 scope = \u0026#39;openid email aws.cognito.signin.user.admin\u0026#39; Complete Implementation Example Here's how to use the improved OAuth Client Provider:\n1async def test_native_sdk_oauth_flow(config: dict): 2 \u0026#34;\u0026#34;\u0026#34;Test native MCP SDK OAuth flow with auto-detection of M2M vs interactive mode.\u0026#34;\u0026#34;\u0026#34; 3 # Detect M2M mode based on client_secret presence 4 is_m2m_mode = bool(config.get(\u0026#39;client_secret\u0026#39;)) 5 6 # Configure appropriate scopes based on provider and mode 7 if \u0026#39;cognito-idp\u0026#39; in config[\u0026#39;discovery_url\u0026#39;].lower(): 8 if is_m2m_mode: 9 scope = config[\u0026#39;m2m_scopes\u0026#39;] # Resource server scopes 10 else: 11 scope = \u0026#39;openid email aws.cognito.signin.user.admin\u0026#39; 12 else: 13 scope = \u0026#39;openid email profile\u0026#39; if not is_m2m_mode else config[\u0026#39;m2m_scopes\u0026#39;] 14 15 # Create OAuth client metadata 16 client_metadata = OAuthClientMetadata( 17 client_name=\u0026#34;MCP AgentCore OAuth Client\u0026#34;, 18 redirect_uris=[AnyUrl(\u0026#34;http://localhost:3000\u0026#34;)], 19 grant_types=[\u0026#34;authorization_code\u0026#34;, \u0026#34;refresh_token\u0026#34;], 20 response_types=[\u0026#34;code\u0026#34;], 21 scope=scope, 22 ) 23 24 # Create token storage with 
debugging 25 token_storage = DebugTokenStorage() 26 27 # Pre-configure client info 28 client_info = OAuthClientInformationFull( 29 client_id=config[\u0026#39;client_id\u0026#39;], 30 client_secret=config.get(\u0026#39;client_secret\u0026#39;), 31 authorization_endpoint=\u0026#34;\u0026#34;, 32 token_endpoint=\u0026#34;\u0026#34;, 33 redirect_uris=[AnyUrl(\u0026#34;http://localhost:3000\u0026#34;)] 34 ) 35 await token_storage.set_client_info(client_info) 36 37 # Create custom OAuth client provider with AgentCore compatibility 38 oauth_auth = AgentCoreOAuthClientProvider( 39 server_url=config[\u0026#39;mcp_server_url\u0026#39;], 40 client_metadata=client_metadata, 41 storage=token_storage, 42 redirect_handler=handle_redirect if not is_m2m_mode else dummy_handler, 43 callback_handler=handle_callback if not is_m2m_mode else dummy_handler, 44 ) 45 46 # Configure protected resource metadata for cross-domain support 47 oauth_server_url = config[\u0026#39;discovery_url\u0026#39;].replace(\u0026#39;/.well-known/openid_configuration\u0026#39;, \u0026#39;\u0026#39;) 48 protected_metadata = ProtectedResourceMetadata( 49 resource=PydanticUrl(config[\u0026#39;mcp_server_url\u0026#39;]), 50 authorization_servers=[AnyUrl(oauth_server_url)] 51 ) 52 oauth_auth.context.protected_resource_metadata = protected_metadata 53 oauth_auth.context.auth_server_url = oauth_server_url 54 55 # Use the OAuth provider with streamable HTTP client 56 async with streamablehttp_client(config[\u0026#39;mcp_server_url\u0026#39;], auth=oauth_auth) as (read, write, _): 57 async with ClientSession(read, write) as session: 58 await session.initialize() 59 60 # List and invoke tools 61 tools_result = await session.list_tools() 62 print(f\u0026#34;Found {len(tools_result.tools)} tools available\u0026#34;) 63 64 return True Configuration and Environment Setup The improved implementation supports flexible configuration through environment variables:\n1# OAuth 2.0 Configuration 2export 
OAUTH_DISCOVERY_URL=\u0026#34;https://cognito-idp.us-east-1.amazonaws.com/us-east-1_ABC123/.well-known/openid_configuration\u0026#34; 3export OAUTH_CLIENT_ID=\u0026#34;your-cognito-client-id\u0026#34; 4 5# M2M Mode (optional - enables machine-to-machine authentication) 6export OAUTH_CLIENT_SECRET=\u0026#34;your-client-secret\u0026#34; 7export OAUTH_M2M_SCOPES=\u0026#34;mcp-server/read mcp-server/write\u0026#34; 8 9# AgentCore Runtime Configuration 10export AGENTCORE_RUNTIME_ARN=\u0026#34;arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-server\u0026#34; 11export AGENTCORE_REGION=\u0026#34;us-west-2\u0026#34; 12 13# Interactive Mode Testing (optional) 14export OAUTH_TEST_USERNAME=\u0026#34;testuser@example.com\u0026#34; 15export OAUTH_TEST_PASSWORD=\u0026#34;your-password\u0026#34; Key Advantages 1. Simplified Integration The native SDK OAuth provider handles all the complex OAuth state management, token storage, and refresh logic automatically.\n2. Production-Ready M2M Support M2M authentication enables fully automated server-to-server communication without user intervention, perfect for production deployments.\n3. AgentCore Compatibility Custom handling of AgentCore's 403 responses and cross-domain metadata configuration ensures seamless integration.\n4. Automatic Token Refresh Both interactive and M2M modes support automatic token refresh, ensuring long-running applications maintain connectivity.\n5. 
Comprehensive Error Handling Detailed logging and error handling make troubleshooting authentication issues much easier.\nTroubleshooting Common Issues M2M Authentication Failures 1# Ensure client_credentials flow is enabled in Cognito (the app client must already have a secret; one can only be generated when the client is created) 2aws cognito-idp update-user-pool-client \\ 3 --user-pool-id \u0026lt;your-user-pool-id\u0026gt; \\ 4 --client-id \u0026lt;your-client-id\u0026gt; \\ 5 --allowed-o-auth-flows \u0026#34;client_credentials\u0026#34; \\ 6 --allowed-o-auth-flows-user-pool-client Scope Configuration For AWS Cognito M2M, you may need to configure resource server scopes:\nInteractive mode: openid email aws.cognito.signin.user.admin M2M mode: Custom resource server scopes like mcp-server/read mcp-server/write Cross-Domain Issues Ensure the protected resource metadata correctly maps your MCP server URL to the OAuth authorization server.\nConclusion The native MCP SDK's OAuth Client Provider, enhanced with AgentCore-specific compatibility fixes, provides a robust foundation for production MCP client applications. 
The automatic detection of M2M vs interactive modes, combined with comprehensive error handling and token management, significantly reduces the complexity of integrating with OAuth-protected MCP servers on AWS AgentCore.\nThe key innovations—handling 403 responses, cross-domain metadata configuration, and dual-mode authentication—make this approach far more reliable than manual OAuth implementations for enterprise deployments.\nAs the MCP ecosystem continues to evolve, I expect we'll see these patterns become standard practice for production MCP client implementations, particularly in enterprise environments where M2M authentication and automated token management are essential requirements.\nResources Previous Post: How invoking remote MCP servers hosted on AWS AgentCore AWS AgentCore Documentation Model Context Protocol Specification Amazon Bedrock AgentCore MCP Guide MCP Inspector Tool Complete Sample Implementation ","link":"https://kane.mx/posts/2025/use-mcp-client-oauthclientprovider-invoke-mcp-hosted-on-aws-agentcore/","section":"posts","tags":["MCP","MCP Client","OAuth Client Provider","AWS AgentCore Runtime","AWS AgentCore Gateway","OAuth","M2M Authentication","Authentication"],"title":"Leveraging MCP Client's OAuthClientProvider for Seamless AWS AgentCore Authentication"},{"body":"","link":"https://kane.mx/tags/m2m-authentication/","section":"tags","tags":null,"title":"M2M Authentication"},{"body":"","link":"https://kane.mx/tags/mcp-client/","section":"tags","tags":null,"title":"MCP Client"},{"body":"","link":"https://kane.mx/tags/oauth/","section":"tags","tags":null,"title":"OAuth"},{"body":"","link":"https://kane.mx/tags/oauth-client-provider/","section":"tags","tags":null,"title":"OAuth Client Provider"},{"body":"Overview Recently, I've been exploring AWS AgentCore's new capability to host Model Context Protocol (MCP) servers, and I wanted to share my experience with connecting to these remote servers as a client. 
The Model Context Protocol is an open standard that enables AI assistants to securely connect with external data sources and tools, and AWS AgentCore provides a managed hosting environment for these servers with built-in authentication and scaling capabilities.\nIn this article, I'll walk through the process of invoking MCP servers hosted on AWS AgentCore Runtime or proxied via AgentCore Gateway, covering different authentication methods, client implementation patterns, and practical considerations. What struck me most about this approach is how it bridges the gap between local development and enterprise-grade deployment while maintaining the flexibility that makes MCP so powerful.\nUnderstanding AWS AgentCore and MCP Before diving into the implementation details, let's understand what we're working with. AWS AgentCore is Amazon's managed runtime environment for AI agent applications that supports the Model Context Protocol natively. When you deploy an MCP server to AgentCore Runtime or Gateway, you get:\nManaged Infrastructure: No need to worry about scaling, monitoring, or infrastructure management Built-in Authentication: OAuth 2.0 and AWS SigV4 authentication out of the box Session Isolation: Each client connection gets its own isolated session Serverless Scaling: Automatically scales based on demand The MCP servers deployed on AgentCore expose their tools and resources through a standardized HTTP interface, making them accessible from any MCP-compatible client regardless of where it's running.\nAuthentication Methods One of the first challenges I encountered was understanding the authentication options. AWS AgentCore supports several authentication mechanisms for MCP servers:\n1. OAuth 2.0 Authentication This is the most common approach for production deployments. 
The OAuth flow involves several modes:\nManual Mode: Interactive browser-based authentication\n1class OAuth2Handler: 2 def __init__(self, discovery_url: str, client_id: str, client_secret: str = None): 3 self.discovery_url = discovery_url.rstrip(\u0026#39;/\u0026#39;) 4 self.client_id = client_id 5 self.client_secret = client_secret 6 self.redirect_uri = \u0026#34;http://localhost:3000\u0026#34; 7 8 async def discover_endpoints(self) -\u0026gt; dict: 9 \u0026#34;\u0026#34;\u0026#34;Discover OAuth 2.0 endpoints using well-known configuration.\u0026#34;\u0026#34;\u0026#34; 10 async with httpx.AsyncClient() as client: 11 response = await client.get(self.discovery_url, timeout=10.0) 12 if response.status_code == 200: 13 return response.json() 14 raise ValueError(f\u0026#34;Discovery failed: HTTP {response.status_code}\u0026#34;) 15 16 async def get_authorization_url(self) -\u0026gt; str: 17 config = await self.discover_endpoints() 18 auth_endpoint = config.get(\u0026#39;authorization_endpoint\u0026#39;) 19 20 params = { 21 \u0026#39;response_type\u0026#39;: \u0026#39;code\u0026#39;, 22 \u0026#39;client_id\u0026#39;: self.client_id, 23 \u0026#39;redirect_uri\u0026#39;: self.redirect_uri, 24 \u0026#39;scope\u0026#39;: \u0026#39;openid email aws.cognito.signin.user.admin\u0026#39;, 25 \u0026#39;state\u0026#39;: \u0026#39;random_state_12345\u0026#39; 26 } 27 return f\u0026#34;{auth_endpoint}?{urlencode(params)}\u0026#34; 28 29 async def exchange_code_for_tokens(self, authorization_code: str) -\u0026gt; dict: 30 config = await self.discover_endpoints() 31 token_endpoint = config.get(\u0026#39;token_endpoint\u0026#39;) 32 33 data = { 34 \u0026#39;grant_type\u0026#39;: \u0026#39;authorization_code\u0026#39;, 35 \u0026#39;client_id\u0026#39;: self.client_id, 36 \u0026#39;code\u0026#39;: authorization_code, 37 \u0026#39;redirect_uri\u0026#39;: self.redirect_uri 38 } 39 40 async with httpx.AsyncClient() as client: 41 response = await client.post(token_endpoint, data=data) 42 if 
response.status_code == 200: 43 return response.json() 44 raise ValueError(f\u0026#34;Token exchange failed: {response.status_code}\u0026#34;) 45 46# Usage example 47async def manual_oauth_flow(): 48 handler = OAuth2Handler( 49 discovery_url=\u0026#34;https://cognito-idp.us-east-1.amazonaws.com/.../openid-configuration\u0026#34;, 50 client_id=\u0026#34;your-cognito-client-id\u0026#34; 51 ) 52 53 auth_url = await handler.get_authorization_url() 54 webbrowser.open(auth_url) 55 56 # User completes auth and provides callback URL 57 callback_url = input(\u0026#34;Enter callback URL: \u0026#34;) 58 code = parse_qs(urlparse(callback_url).query)[\u0026#34;code\u0026#34;][0] 59 60 tokens = await handler.exchange_code_for_tokens(code) 61 return tokens[\u0026#39;access_token\u0026#39;] Machine-to-Machine Mode: For automated systems using client credentials\n1async def get_m2m_token(oauth_handler: OAuth2Handler) -\u0026gt; dict: 2 \u0026#34;\u0026#34;\u0026#34;Get M2M access token using client_credentials flow.\u0026#34;\u0026#34;\u0026#34; 3 config = await oauth_handler.discover_endpoints() 4 token_endpoint = config.get(\u0026#39;token_endpoint\u0026#39;) 5 6 data = { 7 \u0026#39;grant_type\u0026#39;: \u0026#39;client_credentials\u0026#39;, 8 \u0026#39;client_id\u0026#39;: oauth_handler.client_id, 9 \u0026#39;client_secret\u0026#39;: oauth_handler.client_secret, 10 } 11 12 async with httpx.AsyncClient() as client: 13 response = await client.post(token_endpoint, data=data) 14 if response.status_code == 200: 15 return response.json() 16 raise ValueError(f\u0026#34;M2M token request failed: {response.status_code}\u0026#34;) 17 18# Usage example 19async def m2m_oauth_flow(): 20 handler = OAuth2Handler( 21 discovery_url=\u0026#34;https://cognito-idp.us-east-1.amazonaws.com/.../openid-configuration\u0026#34;, 22 client_id=\u0026#34;your-m2m-client-id\u0026#34;, 23 client_secret=\u0026#34;your-m2m-client-secret\u0026#34; 24 ) 25 26 tokens = await get_m2m_token(handler) 27 return 
tokens[\u0026#39;access_token\u0026#39;] Quick Mode: For AWS Cognito with existing user credentials\n1async def cognito_quick_mode(discovery_url: str, client_id: str, username: str, password: str): 2 \u0026#34;\u0026#34;\u0026#34;Quick token retrieval using AWS Cognito direct authentication.\u0026#34;\u0026#34;\u0026#34; 3 # Extract region from discovery URL 4 region = re.search(r\u0026#39;cognito-idp\\.([^.]+)\\.amazonaws\\.com\u0026#39;, discovery_url).group(1) 5 6 # Use boto3 for direct authentication 7 session = boto3.Session() 8 cognito_client = session.client(\u0026#39;cognito-idp\u0026#39;, region_name=region) 9 10 response = cognito_client.initiate_auth( 11 ClientId=client_id, 12 AuthFlow=\u0026#39;USER_PASSWORD_AUTH\u0026#39;, 13 AuthParameters={\u0026#39;USERNAME\u0026#39;: username, \u0026#39;PASSWORD\u0026#39;: password} 14 ) 15 16 return response[\u0026#39;AuthenticationResult\u0026#39;][\u0026#39;AccessToken\u0026#39;] 2. AWS SigV4 Authentication For AWS-native integrations, you can use SigV4 signing with your AWS credentials:\n1from botocore.auth import SigV4Auth 2from botocore.awsrequest import AWSRequest 3 4class HTTPXSigV4Auth(httpx.Auth): 5 def __init__(self, credentials, service: str, region: str): 6 self.credentials = credentials 7 self.service = service 8 self.region = region 9 10 def auth_flow(self, request: httpx.Request): 11 # Extract request body for signing 12 body = request.content if hasattr(request, \u0026#39;content\u0026#39;) else b\u0026#39;\u0026#39; 13 14 # Create AWS request for signing 15 aws_request = AWSRequest(method=request.method, url=str(request.url), data=body) 16 aws_request.headers[\u0026#39;Host\u0026#39;] = request.url.host 17 18 # Sign the request 19 signer = SigV4Auth(self.credentials, self.service, self.region) 20 signer.add_auth(aws_request) 21 22 # Update HTTPX request with signed headers 23 for name, value in aws_request.headers.items(): 24 request.headers[name] = value 25 26 yield request 27 28class 
SigV4AgentCoreMCPClient: 29 def __init__(self, agent_arn: str, region: str = \u0026#34;us-west-2\u0026#34;): 30 self.agent_arn = agent_arn 31 self.region = region 32 self.session = boto3.Session() 33 self.credentials = self.session.get_credentials() 34 35 def get_mcp_url(self) -\u0026gt; str: 36 encoded_arn = self.agent_arn.replace(\u0026#39;:\u0026#39;, \u0026#39;%3A\u0026#39;).replace(\u0026#39;/\u0026#39;, \u0026#39;%2F\u0026#39;) 37 return f\u0026#34;https://bedrock-agentcore.{self.region}.amazonaws.com/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT\u0026#34; 38 39 async def connect(self): 40 mcp_url = self.get_mcp_url() 41 auth = HTTPXSigV4Auth(self.credentials, \u0026#39;bedrock-agentcore\u0026#39;, self.region) 42 43 async with streamablehttp_client(url=mcp_url, auth=auth) as (read, write, _): 44 async with ClientSession(read, write) as session: 45 await session.initialize() 46 return session 47 48# Usage example 49async def sigv4_connection_example(): 50 client = SigV4AgentCoreMCPClient( 51 agent_arn=\u0026#34;arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-server\u0026#34; 52 ) 53 session = await client.connect() 54 return session Client Implementation Patterns Based on my experience, here are the key patterns I've found effective for implementing MCP clients that connect to AgentCore-hosted servers:\nBasic Client Structure 1async def connect_to_agentcore_server(agent_arn, bearer_token): 2 # Encode the ARN for URL usage 3 encoded_arn = agent_arn.replace(\u0026#39;:\u0026#39;, \u0026#39;%3A\u0026#39;).replace(\u0026#39;/\u0026#39;, \u0026#39;%2F\u0026#39;) 4 mcp_url = f\u0026#34;https://bedrock-agentcore.us-west-2.amazonaws.com/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT\u0026#34; 5 6 headers = {\u0026#34;authorization\u0026#34;: f\u0026#34;Bearer {bearer_token}\u0026#34;} 7 8 async with streamablehttp_client(mcp_url, headers) as (read, write, _): 9 async with ClientSession(read, write) as session: 10 await session.initialize() 11 
12 # Discover capabilities 13 tools = await session.list_tools() 14 resources = await session.list_resources() 15 16 return session 17 18# Usage 19async def main(): 20 session = await connect_to_agentcore_server(agent_arn, bearer_token) 21 result = await session.call_tool(\u0026#34;add_numbers\u0026#34;, {\u0026#34;a\u0026#34;: 5, \u0026#34;b\u0026#34;: 3}) 22 print(f\u0026#34;Result: {result}\u0026#34;) MCP Connection Testing Here's a simplified approach to test MCP connections:\n1async def test_mcp_connection(mcp_server_url: str, access_token: str): 2 \u0026#34;\u0026#34;\u0026#34;Test MCP connection with access token.\u0026#34;\u0026#34;\u0026#34; 3 headers = {\u0026#34;Authorization\u0026#34;: f\u0026#34;Bearer {access_token}\u0026#34;} 4 5 async with streamablehttp_client(mcp_server_url, headers) as (read, write, _): 6 async with ClientSession(read, write) as session: 7 await session.initialize() 8 9 # List available tools and resources 10 tools = await session.list_tools() 11 resources = await session.list_resources() 12 13 print(f\u0026#34;Found {len(tools.tools)} tools and {len(resources.resources)} resources\u0026#34;) 14 return session 15 16def load_config() -\u0026gt; dict: 17 \u0026#34;\u0026#34;\u0026#34;Load configuration from environment variables.\u0026#34;\u0026#34;\u0026#34; 18 return { 19 \u0026#39;discovery_url\u0026#39;: os.getenv(\u0026#39;OAUTH_DISCOVERY_URL\u0026#39;), 20 \u0026#39;client_id\u0026#39;: os.getenv(\u0026#39;OAUTH_CLIENT_ID\u0026#39;), 21 \u0026#39;agentcore_runtime_arn\u0026#39;: os.getenv(\u0026#39;AGENTCORE_RUNTIME_ARN\u0026#39;), 22 \u0026#39;agentcore_region\u0026#39;: os.getenv(\u0026#39;AGENTCORE_REGION\u0026#39;, \u0026#39;us-west-2\u0026#39;), 23 } 24 25async def test_oauth_flow(config: dict): 26 \u0026#34;\u0026#34;\u0026#34;Test OAuth flow and MCP connection.\u0026#34;\u0026#34;\u0026#34; 27 # Get OAuth token 28 handler = OAuth2Handler(config[\u0026#39;discovery_url\u0026#39;], config[\u0026#39;client_id\u0026#39;]) 
29 auth_url = await handler.get_authorization_url() 30 webbrowser.open(auth_url) 31 32 # User provides callback URL 33 callback_url = input(\u0026#34;Enter callback URL: \u0026#34;) 34 code = parse_qs(urlparse(callback_url).query)[\u0026#34;code\u0026#34;][0] 35 36 tokens = await handler.exchange_code_for_tokens(code) 37 38 # Test MCP connection 39 encoded_arn = config[\u0026#39;agentcore_runtime_arn\u0026#39;].replace(\u0026#39;:\u0026#39;, \u0026#39;%3A\u0026#39;).replace(\u0026#39;/\u0026#39;, \u0026#39;%2F\u0026#39;) 40 mcp_url = f\u0026#34;https://bedrock-agentcore.{config[\u0026#39;agentcore_region\u0026#39;]}.amazonaws.com/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT\u0026#34; 41 42 session = await test_mcp_connection(mcp_url, tokens[\u0026#39;access_token\u0026#39;]) 43 return session Practical Usage Patterns Tool Discovery and Invocation MCP enables dynamic discovery of server capabilities:\n1async def explore_server_capabilities(session): 2 # Discover available tools and resources 3 tools_response = await session.list_tools() 4 resources_response = await session.list_resources() 5 6 for tool in tools_response.tools: 7 print(f\u0026#34;Tool: {tool.name} - {tool.description}\u0026#34;) 8 9 for resource in resources_response.resources: 10 print(f\u0026#34;Resource: {resource.name} ({resource.mimeType})\u0026#34;) 11 12async def call_tool_dynamically(session, tool_name, **kwargs): 13 result = await session.call_tool(tool_name, kwargs) 14 return result.content Resource Access Access server resources with simple calls:\n1async def read_server_resource(session, resource_uri): 2 result = await session.read_resource(resource_uri) 3 return result.contents 4 5# Example usage 6contents = await read_server_resource(session, \u0026#34;file://config.json\u0026#34;) 7for content in contents: 8 print(f\u0026#34;{content.mimeType}: {content.text}\u0026#34;) Complete Working Example Here's a simplified production-ready example:\n1async def main(): 2 
\u0026#34;\u0026#34;\u0026#34;Main function demonstrating MCP client connection.\u0026#34;\u0026#34;\u0026#34; 3 config = load_config() 4 5 print(\u0026#34;Select authentication mode:\u0026#34;) 6 print(\u0026#34;1. OAuth 2.0 (Manual)\u0026#34;) 7 print(\u0026#34;2. AWS Cognito (Quick)\u0026#34;) 8 print(\u0026#34;3. AWS SigV4\u0026#34;) 9 10 choice = input(\u0026#34;Choose (1/2/3): \u0026#34;) 11 12 if choice == \u0026#34;1\u0026#34;: 13 session = await test_oauth_flow(config) 14 elif choice == \u0026#34;2\u0026#34;: 15 token = await cognito_quick_mode( 16 config[\u0026#39;discovery_url\u0026#39;], 17 config[\u0026#39;client_id\u0026#39;], 18 config[\u0026#39;test_username\u0026#39;], 19 config[\u0026#39;test_password\u0026#39;] 20 ) 21 session = await test_mcp_connection(config[\u0026#39;mcp_server_url\u0026#39;], token) 22 elif choice == \u0026#34;3\u0026#34;: 23 client = SigV4AgentCoreMCPClient(config[\u0026#39;agentcore_runtime_arn\u0026#39;]) 24 session = await client.connect() 25 26 # Use the session 27 tools = await session.list_tools() 28 print(f\u0026#34;Connected! 
Found {len(tools.tools)} tools available.\u0026#34;) 29 30# Environment setup example 31def setup_environment(): 32 \u0026#34;\u0026#34;\u0026#34;Required environment variables.\u0026#34;\u0026#34;\u0026#34; 33 env_vars = { 34 \u0026#39;OAUTH_DISCOVERY_URL\u0026#39;: \u0026#39;https://cognito-idp.us-east-1.amazonaws.com/.../openid-configuration\u0026#39;, 35 \u0026#39;OAUTH_CLIENT_ID\u0026#39;: \u0026#39;your-cognito-client-id\u0026#39;, 36 \u0026#39;AGENTCORE_RUNTIME_ARN\u0026#39;: \u0026#39;arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-server\u0026#39;, 37 \u0026#39;OAUTH_TEST_USERNAME\u0026#39;: \u0026#39;testuser@example.com\u0026#39;, 38 \u0026#39;OAUTH_TEST_PASSWORD\u0026#39;: \u0026#39;your-password\u0026#39; 39 } 40 41 for key, example in env_vars.items(): 42 print(f\u0026#34;export {key}=\u0026#39;{example}\u0026#39;\u0026#34;) 43 44if __name__ == \u0026#34;__main__\u0026#34;: 45 asyncio.run(main()) Update: Improved OAuth Client Provider Approach Update: For a more robust, production-ready solution using the native MCP SDK's built-in OAuth Client Provider with automatic token management, M2M authentication support, and AgentCore-specific compatibility fixes, see: Leveraging MCP Client's OAuthClientProvider for Seamless AWS AgentCore Authentication.\nKey Learnings 1. Authentication Complexity The biggest lesson from working with MCP servers on AgentCore is that authentication setup is often the most complex part. Whether you're using OAuth with Cognito, Azure AD, or other providers, getting the client credentials and discovery URLs right is crucial. I recommend starting with the manual OAuth mode for testing before moving to automated flows.\n2. Connection Lifecycle Management Unlike local MCP servers where you might maintain persistent connections, AgentCore-hosted servers require careful attention to connection lifecycle. The platform provides session isolation, but you need to handle reconnection gracefully in your client code.\n3. 
Error Handling is Critical Remote MCP servers introduce network-related failure modes that don't exist with local servers. Building robust retry logic and graceful degradation from the start saves significant debugging time later.\n4. Tool Discovery Enables Dynamic Behavior One of MCP's most powerful features is tool discovery. Rather than hard-coding tool names and parameters, building clients that dynamically discover and adapt to server capabilities makes your code much more resilient to server updates.\nConclusion Working with MCP servers hosted on AWS AgentCore opens up exciting possibilities for building distributed AI agent systems. The combination of MCP's flexible protocol with AgentCore's managed infrastructure provides a powerful foundation for enterprise AI applications.\nThe key to success lies in understanding the authentication flows, building robust connection management, and embracing MCP's dynamic discovery capabilities. Authentication and error handling add real complexity, but the benefits of managed hosting and automatic scaling make this approach compelling for production deployments.\nAs the ecosystem continues to mature, I expect we'll see more standardized client libraries and simplified authentication flows that make this integration even more accessible to developers.\nResources AWS AgentCore Documentation Model Context Protocol Specification Amazon Bedrock AgentCore MCP Guide MCP Inspector Tool Sample Implementation ","link":"https://kane.mx/posts/2025/invoke-mcp-hosted-on-aws-agentcore/","section":"posts","tags":["MCP","MCP Client","AWS AgentCore Runtime","AWS AgentCore Gateway","OAuth","Authentication"],"title":"How invoking remote MCP servers hosted on AWS AgentCore"},{"body":"","link":"https://kane.mx/tags/aws-amplify/","section":"tags","tags":null,"title":"AWS Amplify"},{"body":"","link":"https://kane.mx/tags/bedrock-knowledgebase/","section":"tags","tags":null,"title":"Bedrock 
Knowledgebase"},{"body":"Overview In this article, I'll share my experience building an agentic chatbot on AWS using Amazon Bedrock, Amplify Gen2, and Amplify AI kit. This project, called Industry Assistant Portal, serves as an internal industry assistant that provides industry-specific AWS solutions guidance. The chatbot leverages Amazon Bedrock's powerful foundation models and knowledge base capabilities to deliver contextually relevant information about AWS industry solutions.\nThe journey of building this chatbot taught me valuable lessons about implementing agentic AI systems that can reason, plan, and execute complex tasks while maintaining context awareness. I'll cover the architecture, implementation details, challenges faced, and key learnings from this project.\nArchitecture The Portal is built with a modern tech stack:\nFrontend: Next.js 14 with Amplify UI React components (including AIConversation component) Backend: AWS Amplify Gen2 with GraphQL API AI Services: Amazon Bedrock (Claude Sonnet 3.5 v2 and Haiku 3.5) Knowledge Base: Amazon Bedrock Knowledge Base Vector Search: Amazon OpenSearch Serverless's vector search capabilities are used to retrieve relevant documents Authentication: Amazon Cognito flowchart TD Client[\u0026#34;Next.js Client\u0026#34;] Auth[\u0026#34;Amplify UI Authenticator + Cognito\u0026#34;] OIDCProvider[\u0026#34;Corp SSO\u0026#34;] API[\u0026#34;AWS AppSync (GraphQL) + Lambda\u0026#34;] AI[\u0026#34;AWS Bedrock (Claude 3.5 \u0026amp; Haiku 3.5)\u0026#34;] KB[\u0026#34;Bedrock Knowledge Base\u0026#34;] DB[\u0026#34;DynamoDB\u0026#34;] VectorDB[\u0026#34;Amazon OpenSearch Serverless\u0026#34;] DocExporter[\u0026#34;Doc Exporter Lambda\u0026#34;] DocProcessor[\u0026#34;Doc Sheet Processor Lambda\u0026#34;] KBSync[\u0026#34;Knowledge Base Sync Workflow\u0026#34;] S3[\u0026#34;S3 Buckets\u0026#34;] Client \u0026lt;--\u0026gt; Auth Auth \u0026lt;--\u0026gt; OIDCProvider Client \u0026lt;--\u0026gt; API API \u0026lt;--\u0026gt; AI API 
\u0026lt;--\u0026gt; DB AI \u0026lt;--\u0026gt; KB KB \u0026lt;--\u0026gt; VectorDB DocExporter --\u0026gt; S3 S3 --\u0026gt; DocProcessor DocProcessor --\u0026gt; S3 KBSync --\u0026gt; KB The architecture follows a serverless approach, with the frontend hosted on Amplify Hosting and the backend services managed through Amplify Gen2. The chatbot's intelligence comes from Amazon Bedrock's foundation models, particularly Claude Sonnet 3.5 v2, enhanced with a custom knowledge base containing industry-specific information.\nKey Components 1. Agentic Conversation Flow The heart of the system is an agentic conversation flow that follows a structured approach to understanding and responding to user queries:\nIntention Validation: Every user query is first analyzed to determine if it falls within the chatbot's domain of expertise (AWS solutions and guidance for specific industries). Sequential Intent Identification: The chatbot follows a strict sequence to identify the industry, solution area, and specific use case the user is interested in. Knowledge Base Integration: Relevant information is retrieved from the knowledge base to provide accurate and up-to-date responses. Response Generation: The chatbot generates comprehensive responses with proper citations and relevant industry contacts. 
sequenceDiagram User-\u0026gt;\u0026gt;+Frontend: Send message Frontend-\u0026gt;\u0026gt;+API: Send message to AI API-\u0026gt;\u0026gt;+AI Service: Process with Claude Sonnet AI Service-\u0026gt;\u0026gt;+Knowledge Base: Query for relevant info via searchDocumentation tool Knowledge Base-\u0026gt;\u0026gt;+Vector DB: Initial vector search (30 results) Vector DB-\u0026gt;\u0026gt;-Knowledge Base: Return initial chunks Knowledge Base-\u0026gt;\u0026gt;+Reranking: Process with Cohere Rerank Reranking-\u0026gt;\u0026gt;-Knowledge Base: Return top 10 reranked results Knowledge Base--\u0026gt;\u0026gt;-AI Service: Return documentation with source URIs AI Service--\u0026gt;\u0026gt;-API: Generate response with citations API--\u0026gt;\u0026gt;-Frontend: Return AI response Frontend--\u0026gt;\u0026gt;-User: Display response with citations and expert contacts This structured approach ensures that the chatbot provides relevant and accurate information while guiding users through a logical conversation flow.\n2. Knowledge Base Integration One of the most powerful features of the chatbot is its integration with Amazon Bedrock Knowledge Base. 
The knowledge base contains curated information about AWS industry solutions, best practices, and partner offerings.\nThe integration is implemented through a custom resolver that:\nAccepts user queries as input Performs vector search against the knowledge base Applies filtering based on metadata (e.g., updated date, status) Uses reranking to improve relevance of results Returns the most relevant documents to inform the chatbot's responses sequenceDiagram AI Service-\u0026gt;\u0026gt;+Custom Resolver: Call searchDocumentation tool Custom Resolver-\u0026gt;\u0026gt;+Bedrock API: POST to /knowledgebases/{id}/retrieve Bedrock API-\u0026gt;\u0026gt;+Vector DB: Perform vector search Vector DB-\u0026gt;\u0026gt;-Bedrock API: Return 30 most similar chunks Bedrock API-\u0026gt;\u0026gt;+Reranking Model: Process with Cohere Rerank v3.5 Reranking Model-\u0026gt;\u0026gt;-Bedrock API: Return top 10 reranked results Bedrock API-\u0026gt;\u0026gt;-Custom Resolver: Return documentation with metadata Custom Resolver-\u0026gt;\u0026gt;-AI Service: Return formatted results 1export function request(ctx) { 2 const { input } = ctx.args; 3 const { KNOWLEDGE_BASE_ID, RERANK_MODEL_ARN } = ctx.env; 4 5 return { 6 resourcePath: `/knowledgebases/${KNOWLEDGE_BASE_ID}/retrieve`, 7 method: \u0026#34;POST\u0026#34;, 8 params: { 9 headers: { 10 \u0026#34;Content-Type\u0026#34;: \u0026#34;application/json\u0026#34;, 11 }, 12 body: JSON.stringify({ 13 retrievalQuery: { 14 text: input, 15 }, 16 retrievalConfiguration: { 17 vectorSearchConfiguration: { 18 numberOfResults: 30, 19 filter: { 20 orAll: [ 21 { 22 greaterThan: { 23 key: \u0026#39;updated_date\u0026#39;, 24 value: \u0026lt;relative date\u0026gt;, 25 }, 26 }, 27 { 28 equals: { 29 key: \u0026#39;status\u0026#39;, 30 value: \u0026lt;status\u0026gt;, 31 } 32 } 33 ] 34 }, 35 rerankingConfiguration: { 36 type: \u0026#39;BEDROCK_RERANKING_MODEL\u0026#39;, 37 bedrockRerankingConfiguration: { 38 modelConfiguration: { 39 modelArn: RERANK_MODEL_ARN, 
40 }, 41 numberOfRerankedResults: 10 42 } 43 } 44 } 45 }, 46 }), 47 }, 48 }; 49} This approach ensures that the chatbot has access to the most relevant and up-to-date information when responding to user queries.\n3. User Interface The chatbot's user interface is built using Next.js and Amplify UI React components (especially \u0026lt;AIConversation\u0026gt;), providing a clean and intuitive chat experience. Key features include:\nMarkdown rendering with support for code blocks, tables, and lists Custom rendering for special content types (contacts, user choices) generated by LLM Message feedback mechanism for continuous improvement Responsive design for desktop and mobile devices The UI is designed to handle various response formats, including:\nText responses with proper formatting User choice prompts for guided conversations Contact information for industry experts Error messages with appropriate styling Implementation Highlights Amplify Gen2 Configuration The project leverages Amplify Gen2's TypeScript-based configuration for defining backend resources. This approach provides type safety and better developer experience compared to traditional YAML or JSON configurations.\n1export const schema = a.schema({ 2 knowledgeBase: CONFIG.knowledgeBaseId 3 ? a 4 .query() 5 .arguments({ 6 input: a.string(), 7 }) 8 .handler( 9 a.handler.custom({ 10 dataSource: \u0026#39;KnowledgeBaseDataSource\u0026#39;, 11 entry: \u0026#39;./resolvers/kbResolver.js\u0026#39;, 12 }) 13 ) 14 .returns(a.string()) 15 .authorization((allow) =\u0026gt; allow.authenticated()) 16 : a.customType({ 17 input: a.string(), 18 kbId: a.string(), 19 }), 20 21 chat: a.conversation({ 22 aiModel: modelId, 23 systemPrompt: `You are industry assistant, a highly skilled AWS Industry solutions expert...`, 24 tools: [ 25 ...(CONFIG.knowledgeBaseId ? 
[ 26 a.ai.dataTool({ 27 name: \u0026#39;searchDocumentation\u0026#39;, 28 description: `Performs a similarity search over the documentation...`, 29 query: a.ref(\u0026#39;knowledgeBase\u0026#39;), 30 }), 31 ] : []), 32 a.ai.dataTool({ 33 name: \u0026#39;intentionCheck\u0026#39;, 34 description: `Analyzes the user\u0026#39;s intention...`, 35 query: a.ref(\u0026#39;chatIntention\u0026#39;), 36 }), 37 ], 38 inferenceConfiguration: { 39 maxTokens: 3600, 40 temperature: 0.1, 41 topP: 0.6, 42 }, 43 }).authorization((allow) =\u0026gt; allow.owner()), 44}); This configuration defines the GraphQL schema, AI conversation resources, and authorization rules in a concise and maintainable way.\nAgentic Behavior with Function Calling The chatbot's agentic behavior is implemented using Amazon Bedrock's function calling capabilities. Two main functions are defined:\nsearchDocumentation: Performs similarity search over the knowledge base to retrieve relevant information. intentionCheck: Analyzes the user's intention to determine if it meets the criteria for appropriate responses. 
These functions allow the chatbot to:\nRetrieve relevant information from the knowledge base Validate user queries against its domain of expertise Guide users through a structured conversation flow Provide accurate and contextually relevant responses Challenges and Solutions Challenge 1: Structured Conversation Flow Challenge: Implementing a structured conversation flow that guides users through industry selection, solution area identification, and use case specification without feeling rigid or unnatural.\nSolution: I designed a sequential intent identification framework that analyzes user queries at multiple levels:\nFirst checking if the query relates to supported industries Then identifying specific solution areas within that industry Finally determining the specific use case the user is interested in This approach allows the chatbot to maintain context while guiding users through a logical conversation flow. The implementation uses a combination of knowledge base retrieval and intent analysis to make this process feel natural.\nChallenge 2: Knowledge Base Integration Challenge: Integrating the knowledge base effectively to provide accurate and relevant information while handling the limitations of retrieval-augmented generation.\nSolution: The solution involved three key aspects:\nData Preprocessing and Conversion:\nConverting various internal resources (PPT, PDF, HTML, etc.) 
into Markdown format for consistent processing Implementing effective chunking strategies to ensure chunks aren't too large but contain sufficient context Preserving document structure and relationships between sections during chunking Adding metadata to chunks to enable effective filtering and retrieval Optimized Retrieval Implementation:\nImplementing a custom resolver that performs vector search with a relatively high number of results (30) Applying metadata filtering to focus on recent or important documents Using reranking to improve relevance of results Returning a manageable number of reranked results (10) Knowledge Base Sync Workflow:\nImplementing a daily sync process to keep the knowledge base up-to-date Monitoring ingestion jobs to ensure data quality Publishing metrics to CloudWatch for observability This comprehensive approach ensures that the chatbot has access to the most relevant and up-to-date information while avoiding context window limitations. The careful preprocessing of knowledge base data proved to be just as important as the retrieval mechanism itself, as it directly impacts the quality and relevance of the information available to the model.\nChallenge 3: Response Quality and Citations Challenge: Ensuring response quality with proper citations while maintaining a natural conversation flow.\nSolution: I implemented a comprehensive response formatting system that:\nStructures responses in a natural conversation flow Includes citations when providing answers based on knowledge base retrieval Formats citations as footnotes with proper links Includes a references section at the end of responses This approach ensures that responses are both informative and trustworthy, with clear attribution to source materials.\nKey Learnings 1. The Power of Structured Prompting One of the most important lessons from this project is the power of structured prompting. 
By designing a clear and logical framework for the chatbot's behavior, I was able to create a system that feels intelligent and helpful without going off-track.\nThe sequential intent identification framework ensures that the chatbot:\nStays within its domain of expertise Guides users through a logical conversation flow Provides relevant and accurate information This structured approach is essential for creating agentic AI systems that can handle complex tasks while maintaining context awareness.\n2. Knowledge Base Design Matters The design of the knowledge base significantly impacts the quality of the chatbot's responses. Key considerations include:\nDocument Chunking: Finding the right balance between document size and context preservation Metadata Enrichment: Adding relevant metadata to enable effective filtering Regular Updates: Keeping the knowledge base up-to-date with the latest information Quality Control: Ensuring that the knowledge base contains accurate and relevant information Investing time in knowledge base design pays dividends in the quality of the chatbot's responses.\n3. User Experience Considerations Creating a good user experience for an agentic chatbot involves more than just accurate responses. 
Important considerations include:\nResponse Formatting: Using markdown, tables, and other formatting to improve readability Guided Options: Providing clear choices when users need to select from multiple options Error Handling: Gracefully handling errors and providing helpful guidance Feedback Mechanisms: Allowing users to provide feedback on responses for continuous improvement These considerations help create a chatbot that feels helpful and intuitive to use.\nFuture Improvements While the current implementation is functional and useful, there are several areas for future improvement:\nMulti-modal Support: Adding support for image and document upload to enable more complex interactions Personalization: Tailoring responses based on user preferences and history Improved Analytics: Implementing more comprehensive analytics to track usage patterns and identify areas for improvement Enhanced Knowledge Base: Expanding the knowledge base with more industry-specific information and use cases Multi-language Support: Adding support for multiple languages to serve a global audience Conclusion Building an agentic chatbot on AWS using Amazon Bedrock and Amplify Gen2 has been a rewarding experience. The combination of powerful foundation models, knowledge base integration, and a well-designed conversation flow creates a system that can provide valuable industry-specific guidance.\nThe key to success lies in the structured approach to conversation design, effective knowledge base integration, and attention to user experience details. By following these principles, you can create agentic AI systems that are both powerful and user-friendly.\nAs AI technology continues to evolve, the possibilities for agentic chatbots will only expand. 
I'm excited to see how this field develops and how these systems can be further enhanced to provide even more value to users.\nResources Amazon Bedrock Documentation Amplify Gen2 Documentation Next.js Documentation Amplify UI React Documentation Amplify AI kit ","link":"https://kane.mx/posts/2025/build-agentic-chatbot-on-aws/","section":"posts","tags":["AWS Amplify","Amazon Bedrock","AWS","Bedrock Knowledgebase","Agentic AI","Chatbot","Claude","Next.js"],"title":"Build Agentic Chatbot on AWS with Amazon Bedrock"},{"body":"","link":"https://kane.mx/tags/chatbot/","section":"tags","tags":null,"title":"Chatbot"},{"body":"","link":"https://kane.mx/tags/next.js/","section":"tags","tags":null,"title":"Next.js"},{"body":"Overview This comprehensive benchmark evaluates the capabilities of 10 leading AI-powered developer tools and IDEs. The focus is on their ability to autonomously complete real-world programming tasks through natural language conversations, minimizing the need for manual coding. The evaluation excludes common features like code explanation, completion, unit testing, and documentation generation to focus on advanced AI capabilities.\nTesting Period: Late December 2024 to Mid-January 2025\nTools Tested: 10 major AI development assistants\nTasks Evaluated: 3 real-world programming scenarios\nNote: Several tools were tested across multiple versions during this period.\nTest Methodology The evaluation focuses on real-world programming scenarios, measuring each tool's ability to act as an autonomous developer through natural language conversations. We employed two distinct testing approaches based on the tool's capabilities:\n1. AI Agentic Approach Applied to: Tools with full agentic capabilities (Cursor, Cline, Continue, Windsurf)\nKey Characteristics:\nComplete codebase context awareness Autonomous exploration and decision-making Multi-tool utilization (file system, terminal, etc.) 
Minimal human intervention (only approving changes) Performance measured by autonomous work quality 2. Multi-Round Conversation Approach Applied to: Tools without full agentic support\nKey Characteristics:\nTask decomposition into manageable steps Iterative instruction and feedback cycles Active human developer guidance Focus on code generation and modifications Performance measured by code quality and iteration count Evaluation Criteria Each tool was evaluated across four key dimensions:\nTask Completion Accuracy\nSolution correctness Requirement adherence Code functionality Code Quality \u0026amp; Maintainability\nCode structure and organization Documentation quality Best practices adherence Human Intervention Requirements\nNumber of guidance instances needed Complexity of required interventions Error resolution assistance Efficiency Metrics\nTime to completion Resource utilization Cost considerations Test Cases Test Environment Background The web application used for testing is built with the following technologies:\nAmplify Gen2: Latest version of AWS Amplify for frontend and backend development Frontend: Next.js 14.x with MUI (Material-UI) and Amplify UI components Backend: AWS AppSync for GraphQL API management and DynamoDB for data storage Amplify AI: Integrated for GenAI capabilities and chat conversations Test Environment Application Stack The benchmark was conducted using a modern web application built with the following technology stack:\nFrontend Architecture Framework: Next.js 14.x UI Components: Material-UI (MUI) for core components Amplify UI for AWS service integrations Custom themed components State Management: React Context and Hooks Styling: Tailwind CSS with custom theming Backend Services API Layer: AWS AppSync (GraphQL) Database: Amazon DynamoDB Authentication: AWS Cognito AI Services: Amplify AI for GenAI features Development Platform Infrastructure: AWS Amplify Gen2 CI/CD: Amplify Hosting This modern stack was chosen to evaluate the AI 
tools' capabilities across various technologies and integration points, providing a realistic enterprise development scenario.\nTools and IDEs Tested AI-Powered IDEs Cursor\nStandalone AI-powered IDE built on VSCode Uses Claude 3.5 Sonnet v2 as primary model Built for AI-first development (https://cursor.sh) Windsurf\nStandalone AI-powered editor built on VSCode Features Cascade for deep contextual awareness Powered by advanced AI models Includes inline commands, codelenses, and terminal integration Available at https://codeium.com/windsurf VSCode Extensions Cline (Open Source)\nCommunity-driven AI assistant with CLI and editor integration VSCode extension with human-in-the-loop GUI Features include file editing, terminal execution, browser automation Supports multiple models via OpenRouter, Anthropic, OpenAI, Amazon Bedrock Active open-source community Available at https://github.com/cline/cline Continue (Open Source)\nOpen-source VSCode extension for AI pair programming Supports multiple AI models with flexible configuration Community-driven development Learn more at https://continue.dev GitHub Copilot\nMicrosoft and GitHub's AI pair programmer Multi-IDE integration Available at https://github.com/features/copilot Amazon Q Developer\nAWS's AI coding assistant VS Code and JetBrains IDE integration Access via https://aws.amazon.com/q/developers MarsCode\nByteDance's Douyin AI coding assistant Features code completion, testing, explanation, and error fixing Available as VSCode and JetBrains plugins Focus on data security and privacy Try at https://marscode.cn Tongyi\nAlibaba's AI coding assistant Features real-time completion, multi-file editing, testing IDE integration with VS Code, Visual Studio, JetBrains Enterprise knowledge base integration Access at https://lingma.aliyun.com/ Baidu Comate\nAI coding assistant powered by ERNIE model Features code completion, explanation, debugging Multi-file editing and task decomposition support R\u0026amp;D knowledge system 
integration Available at https://comate.baidu.com/en Task 1: Theme Management in Next.js Task Detail: Add a new theme, set it as default, and refactor it as a shared variable. Additionally, the web application has two different theme implementations. Based on the prompt, generate a new UI theme, add it to the existing implementation, and make it configurable.\nDifficulty: Easy\nExplanation: For human developers, this task is relatively straightforward. It involves copying and pasting the existing theme code multiple times, then making necessary adjustments to the color and style properties. The process is simple and does not require advanced programming skills, as it mainly focuses on duplicating and modifying existing code snippets to create a new theme.\nTool Result Cost Detailed Notes Cline (Claude 3.5 Sonnet v2) ✅ Success $0.9929 Added and configured the new theme in a single attempt. Additional prompts were used to refactor the code into shared variables. Cline (Deepseek v3) ✅ Success $0.02/¥0.15 Required 2-3 iterations for proper theme implementation and for refactoring the code into shared variables. Cursor (Claude 3.5 Sonnet v2) ✅ Success few fast requests in Pro subscription Excellent first-try implementation. Additional prompts were used to refactor the code into shared variables. Tongyi ✅ Success - Clean implementation but needed help with icon integration. Provided additional optimization suggestions for theme switching. Successfully added the new theme and made it configurable. Windsurf (GPT-4o) ✅ Success - Required iterations for proper CSS variable scope. Strong documentation output. Successfully added and configured the new theme. GitHub Copilot ⚠️ Partial - Successfully implemented theme but struggled with refactoring to shared variables. Continue ❌ Failed - The generated code looked good, but the tool failed to apply the diff changes to source files. This appeared to be a bug in the tool's diff-apply step.
Amazon Q ❌ Failed - Generated the wrong code. Does not support applying diff changes to source files. MarsCode ❌ Failed - Multiple syntax errors in generated code. Poor understanding of Next.js theme architecture. Does not support applying diff changes to source files. Task 2: Amplify AI Function Calling Integration Task Detail: Implement function calling capabilities in Amplify AI conversations to fetch football scores and standings using the external api-football API. The application already has a chat dialog powered by Amplify AI conversations, but it is currently limited to the model's training knowledge and cannot provide current game results or information. The task involves extending the existing chat functionality to fetch and display up-to-date football data through function calling.\nDifficulty: Medium\nExplanation: This task requires deep understanding of Amplify AI's conversation tools and function calling features. The main challenge lies in correctly implementing the tools specification and handling external API integration, as most LLMs lack comprehensive knowledge about Amplify AI's latest features.\nTool Result Cost Detailed Notes Cline (Claude 3.5 Sonnet v2) ✅ Success $2.3679 Completed with minimal iterations. Required manual update to standings tool description for season parameter clarity. Demonstrated strong understanding of Amplify AI tools implementation by reading the online documentation. Cursor (Claude 3.5 Sonnet v2) ✅ Success ~5 fast requests in Pro subscription Generated correct implementation after being given documentation links and sample code. Lambda function for standings worked perfectly on first attempt. Required minor manual fixes for season parameter description. Cline (Deepseek v3) ❌ Failed $0.049/¥0.36 Hit auto-approved API request limit (30 requests for a task). Struggled with Amplify Gen2 code comprehension despite existing references. Accidentally deleted crucial code (later restored).
Continue ❌ Failed - Failed to index external documentation despite configuration. Struggled with Amplify AI tools implementation across both Sonnet 3.5 and Deepseek v3. Multiple bugs in editor functionality. Amazon Q ❌ Failed - Unable to generate correct tools definition. Limited knowledge of AWS Amplify Gen2. Poor source file integration from the chat window: it could not apply diff changes to source files. MarsCode ❌ Failed - Lacked Amplify Gen2 knowledge. No support for external documentation fetching. Unable to reference Amplify and api-football documentation. Baidu Comate ❌ Failed - Required extensive human assistance. Poor source file integration from chat interface. GitHub Copilot ❌ Failed - Generated incorrect code despite being provided sample code for Amplify AI conversation tools. Tongyi ❌ Failed - Failed to generate correct code despite detailed prompts with sample code. Unable to utilize IDE lint outputs for error resolution. Generated hallucinated responses. Windsurf ❌ Failed - No support for external documentation fetching. Generated code had multiple compilation errors in both Amplify Gen2 resources and Lambda functions. Task 3: Next.js 15 and React 19 Migration Task Detail: Upgrade the web application from Next.js 14.x to Next.js 15.x, which includes upgrading React from 18.x to 19.x. There is an implicit requirement to upgrade other UI dependencies to support React 19.x, including Amplify UI React, MUI 5.x, and react-draggable.\nDifficulty: Hard\nExplanation: This task requires exploring the implicit dependencies between UI components and React 19. The AI developer needs to understand these dependencies through internet searches and codebase analysis.
Another challenge is handling compatibility issues between Next.js 15, React 19, and the existing code, updating routing patterns, and adapting to new Next.js or React APIs.\nNote: For this challenging migration test case, we only evaluated tools/IDEs that demonstrated strong performance in the previous test cases. This focused approach allowed us to better assess the capabilities of the most promising solutions when handling complex migration tasks.\nTool Result Cost Detailed Notes Cursor (Claude 3.5 Sonnet v2) ✅ Success ~25 fast requests in Pro subscription Successfully upgraded core Next.js and React versions but required significant human assistance for dependency resolution. Struggled with conflicts between React 19 and UI libraries (amplify-ui-react, mui 5.x, react-draggable). Successfully implemented code changes after being provided with specific resolution steps. Resolved the errors that surfaced after the React and Next.js upgrade once it was given the online migration guides. Cline (Claude 3.5 Sonnet v2) ✅ Success $25+ Successfully completed version upgrades for Next.js, React, MUI, and Amplify UI React with detailed human guidance. Generated correct code to mitigate errors in Next.js based on migration guides and external documentation (like GitHub issues). However, showed inconsistent performance when handling migration-related errors - the same task was resolved quickly for under $1 in one attempt, while another attempt required multiple requests costing over $12. Cline (Deepseek V3) ⚠️ Partial $0.055/¥0.4 Successfully upgraded Next.js, React, MUI, and Amplify UI React versions with human guidance. Generated correct fixes for some migration errors in newer Next.js and React versions, but couldn't resolve all issues. The Deepseek API demonstrated stability issues, often encountering unexpected errors during processing.
Key Takeaways Model Performance Differences Claude 3.5 Sonnet v2 demonstrated consistent excellence Tool implementation significantly impacts model performance Amazon Q showed limited AWS service knowledge AI Agentic Capabilities Matter Tools with agentic capabilities (Cursor, Cline) consistently outperformed traditional AI assistants The ability to autonomously explore codebases and make decisions significantly reduced human intervention Agentic tools showed better understanding of complex project structures and dependencies Web Content Access Critical Tools with web access capabilities performed significantly better Access to up-to-date documentation, GitHub issues, and Stack Overflow was crucial for complex tasks Cost Efficiency Monthly subscription-based tools (like Cursor Pro) proved more cost-effective for programming tasks, offering a low price rate for each request Pay-per-request tools (like Cline/Continue) showed suboptimal cost efficiency due to high API costs for programming tasks requiring multiple iterations Emerging models like Deepseek demonstrated potential for significantly reducing costs while maintaining reasonable performance, though reliability needs improvement Open Source vs Commercial Tools Open-source tools (Cline, Continue) demonstrated several advantages: Faster feature development and community-driven improvements Greater flexibility in model selection and configuration Transparent operation and customizable workflows Cost advantages through flexible API provider choices Commercial products showed no absolute advantage even when using the same underlying models Open source solutions offered better cost control and feature customization Community contributions led to innovative features like MCP protocol support and expanded agentic capabilities Additional Resources LLM Rankings on OpenRouter OpenRouter provides a unified API for accessing multiple LLM providers, similar to AWS Bedrock's InvokeModel API. 
The platform maintains comprehensive rankings of LLMs based on performance metrics and capabilities across different use cases.\nAccording to OpenRouter's public statistics, weekly LLM request volume has shown exponential growth over the past year, increasing from approximately 14 billion to over 460 billion tokens per week.\nNotable findings from the rankings:\nAI development tools dominate the top 3 LLM usage positions, including Cline, Roo Cline, Aide, and Aider DeepSeek v3, an open-source LLM from China, maintains a strong position in the top 6 Usage patterns indicate growing adoption of AI-assisted development workflows Roo Cline: Community-Driven Innovation Roo Cline represents a significant community contribution to AI-assisted development. This open-source fork of Cline extends the original platform with:\nChat Modes: choose between different prompts for Roo Cline to better suit your workflow Chat Mode Prompt Customization \u0026amp; Prompt Enhancements With over 25,000 installations in the VS Code marketplace, Roo Cline has become a popular choice for developers seeking customizable AI assistance in their workflow. 
Its specialized chat modes have sparked significant interest in the developer community, with some seeing it as a revolutionary step in AI-assisted development.\n","link":"https://kane.mx/posts/2025/ai-developer-tools-benchmark-comparison/","section":"posts","tags":["AI Development Tools","IDE Comparison","Cursor IDE","Cline","GitHub Copilot","Amazon Q","MarsCode","Tongyi","Claude","Deepseek","GPT-4","Next.js","React","AWS Amplify"],"title":"2025 AI Developer Tools Benchmark: Comprehensive IDE \u0026 Assistant Comparison"},{"body":"","link":"https://kane.mx/tags/ai-development-tools/","section":"tags","tags":null,"title":"AI Development Tools"},{"body":"","link":"https://kane.mx/tags/amazon-q/","section":"tags","tags":null,"title":"Amazon Q"},{"body":"","link":"https://kane.mx/tags/cline/","section":"tags","tags":null,"title":"Cline"},{"body":"","link":"https://kane.mx/tags/cursor-ide/","section":"tags","tags":null,"title":"Cursor IDE"},{"body":"","link":"https://kane.mx/tags/deepseek/","section":"tags","tags":null,"title":"Deepseek"},{"body":"","link":"https://kane.mx/tags/github-copilot/","section":"tags","tags":null,"title":"GitHub Copilot"},{"body":"","link":"https://kane.mx/tags/gpt-4/","section":"tags","tags":null,"title":"GPT-4"},{"body":"","link":"https://kane.mx/tags/ide-comparison/","section":"tags","tags":null,"title":"IDE Comparison"},{"body":"","link":"https://kane.mx/tags/marscode/","section":"tags","tags":null,"title":"MarsCode"},{"body":"","link":"https://kane.mx/tags/react/","section":"tags","tags":null,"title":"React"},{"body":"","link":"https://kane.mx/tags/tongyi/","section":"tags","tags":null,"title":"Tongyi"},{"body":"","link":"https://kane.mx/tags/amazon-nova/","section":"tags","tags":null,"title":"Amazon Nova"},{"body":"","link":"https://kane.mx/tags/amazon-nova-canvas/","section":"tags","tags":null,"title":"Amazon Nova Canvas"},{"body":"","link":"https://kane.mx/tags/claude-desktop/","section":"tags","tags":null,"title":"Claude 
Desktop"},{"body":"Ever wished you could conjure up the perfect images for your blog posts or articles without leaving your editor? Wouldn't it be amazing to generate professional-quality visuals with just a few keystrokes while you're in your creative flow?\nWell, grab your virtual paintbrush because we're about to dive into how you can do exactly that using GenAI's image generation capabilities with Model Context Protocol (MCP) server!\nThe Magic Behind the Scenes: Model Context Protocol (MCP) Think of Model Context Protocol (MCP) as the universal translator for AI applications. Just like how USB-C lets you plug anything into anything (well, almost), MCP is the cool new standard that helps Large Language Models (LLMs) talk to all sorts of data sources and tools.\nflowchart LR subgraph \u0026#34;Your Computer\u0026#34; Host[\u0026#34;Host with MCP Client\\n(Claude, IDEs, Tools)\u0026#34;] S1[\u0026#34;MCP Server A\u0026#34;] S2[\u0026#34;MCP Server B\u0026#34;] S3[\u0026#34;MCP Server C\u0026#34;] Host \u0026lt;--\u0026gt;|\u0026#34;MCP Protocol\u0026#34;| S1 Host \u0026lt;--\u0026gt;|\u0026#34;MCP Protocol\u0026#34;| S2 Host \u0026lt;--\u0026gt;|\u0026#34;MCP Protocol\u0026#34;| S3 S1 \u0026lt;--\u0026gt; D1[(\u0026#34;Local\\nData Source A\u0026#34;)] S2 \u0026lt;--\u0026gt; D2[(\u0026#34;Local\\nData Source B\u0026#34;)] end subgraph \u0026#34;Internet\u0026#34; S3 \u0026lt;--\u0026gt;|\u0026#34;Web APIs\u0026#34;| D3[(\u0026#34;Remote\\nService C\u0026#34;)] end The elegant simplicity of MCP Architecture\nHere's how this clever system works:\nMCP Hosts: Your favorite apps like Claude Desktop and IDEs that want to tap into the AI goodness MCP Clients: The friendly middlemen ensuring smooth conversations between apps and servers MCP Servers: The specialized workers that make the magic happen Local Data Sources: Your computer's treasure trove of files and services Remote Services: The vast world of internet services at your fingertips What makes MCP really shine 
is its ability to:\nPlug and play with a growing collection of pre-built tools Switch between different AI providers as easily as changing TV channels Keep your data safe and sound within your own setup Amazon Nova Canvas: Your AI Art Studio Amazon Nova Canvas is like having a professional artist at your beck and call. This cutting-edge image generation model from AWS turns your ideas into stunning visuals, whether you describe them in words or show it reference images.\nWhat's in the toolbox?\nCreative Control: Want to tweak colors or adjust layouts? Just tell it what you want! Magic Wand for Images: Need to swap backgrounds or remove objects? Nova's got your back Safety First: Built-in watermarking and content checks keep things professional and traceable Let's see it in action! Here's what happened when I asked Nova to create \u0026quot;a dinosaur sitting in a teacup\u0026quot;:\nWho knew dinosaurs could be so adorable at teatime?\nPretty neat, right? Nova took this whimsical idea and turned it into something that's both charming and surprisingly believable. The way it handled the size difference between our tea-loving dino and its delicate perch is just chef's kiss.\nMaking It Work: MCP Server for Amazon Bedrock Nova Canvas Ready to connect your favorite editor to this image-generating wonderland? Enter the Amazon Bedrock MCP Server - your bridge to Nova Canvas's creative powers.\nThis open-source gem comes packed with:\nText-to-image generation that actually works Fine-tuning through negative prompts (for when you want to say \u0026quot;but definitely not like that!\u0026quot;) Flexible size and quality settings Consistent results with seed control Rock-solid error handling (because we all need a safety net) Playing Nice with Others: MCP Client Integration The beauty of MCP is how well it plays with others. 
Here are some of your soon-to-be favorite companions:\nClaude Desktop App: The full package with all the bells and whistles Continue: Your open-source coding buddy that speaks MCP fluently Cline: A VS Code extension that makes AI feel like a natural part of your workflow Setting Up Claude Desktop Want to start generating images in Claude Desktop? Here's your quick setup guide:\nGet your AWS credentials sorted (with Bedrock permissions) Add a sprinkle of MCP server config to Claude Desktop settings Here's what the magic looks like in action:\nNova Canvas bringing ideas to life in Claude Desktop\nWrapping Up As we roll into the New Year, it's incredible to see tools like Nova Canvas and MCP making creative AI so accessible and fun to use. Whether you're a developer, writer, or creative soul, generating professional images is now as easy as describing what's in your imagination.\nThe future is looking bright (and beautifully illustrated) with these tools at our fingertips. As we step into 2025, we can expect even more exciting developments in the world of AI-assisted creativity.\nHere's to a New Year filled with endless creative possibilities! May your prompts be inspired and your generations be spectacular. 
🎨✨\n","link":"https://kane.mx/posts/2024/ai-image-generation-with-amazon-nova/","section":"posts","tags":["Model Context Protocol","Amazon Nova","Amazon Nova Canvas","Image Generation","GenAI","AWS","Anthropic","Claude Desktop","Cline"],"title":"Create Amazing Images with Amazon Nova and Model Context Protocol"},{"body":"","link":"https://kane.mx/tags/genai/","section":"tags","tags":null,"title":"GenAI"},{"body":"","link":"https://kane.mx/tags/image-generation/","section":"tags","tags":null,"title":"Image Generation"},{"body":"","link":"https://kane.mx/tags/aws-appsync/","section":"tags","tags":null,"title":"AWS AppSync"},{"body":"","link":"https://kane.mx/tags/aws-cognito/","section":"tags","tags":null,"title":"AWS Cognito"},{"body":"","link":"https://kane.mx/tags/fullstack/","section":"tags","tags":null,"title":"Fullstack"},{"body":"AWS Amplify is a powerful set of tools and services for developing, hosting, and managing serverless applications. With the recent launch of Amplify Gen 2, the platform has evolved significantly to enhance the developer experience. In this guide, we'll explore nine essential tips that will help you maximize your productivity with AWS Amplify, covering everything from authentication and infrastructure management to AI integration and deployment.\nUnderstanding Amplify Gen 2 Before diving into the tips, let's understand what makes Amplify Gen 2 special. It introduces a code-first developer experience that enables building fullstack applications using TypeScript. Key benefits include:\nTypeScript-first backend development Faster local development with cloud sandbox environments Improved team workflows with fullstack Git branches Unified management console Enhanced integration with AWS CDK Tip 1: Implementing Third-Party Authentication AWS Amplify provides seamless integration with popular authentication providers like Google, Facebook, and Amazon.
You can also leverage any service supporting industry-standard protocols like OpenID Connect (OIDC) or SAML. While the built-in Authenticator component doesn't directly support third-party provider customization, you can achieve this through Header and Footer customization.\nimport { Authenticator, Button, Divider, Flex } from \u0026#39;@aws-amplify/ui-react\u0026#39;;\nimport { signInWithRedirect } from \u0026#39;aws-amplify/auth\u0026#39;;\n\n// SignInHeader is your own component, defined elsewhere in the app.\n\u0026lt;Authenticator\n  components={{\n    Header: SignInHeader,\n    SignIn: {\n      Header() {\n        return (\n          \u0026lt;div className=\u0026#34;px-8 py-2\u0026#34;\u0026gt;\n            \u0026lt;Flex direction=\u0026#34;column\u0026#34;\n              className=\u0026#34;federated-sign-in-container\u0026#34;\u0026gt;\n              \u0026lt;Button\n                onClick={async () =\u0026gt; {\n                  await signInWithRedirect({\n                    provider: {\n                      custom: \u0026#39;OIDC-Provider\u0026#39; // OIDC Provider name created in Cognito User Pool\n                    }\n                  });\n                }}\n                className=\u0026#34;federated-sign-in-button\u0026#34;\n                gap=\u0026#34;1rem\u0026#34;\n              \u0026gt;\n                \u0026lt;svg\n                  xmlns=\u0026#34;http://www.w3.org/2000/svg\u0026#34;\n                  fill=\u0026#34;#000\u0026#34;\n                  version=\u0026#34;1.1\u0026#34;\n                  viewBox=\u0026#34;0 0 32 32\u0026#34;\n                  xmlSpace=\u0026#34;preserve\u0026#34;\n                  className=\u0026#34;amplify-icon federated-sign-in-icon\u0026#34;\n                \u0026gt;\n                  \u0026lt;path\n                    d=\u0026#34;M31 31.36H1v-.72h30v.72zm0-7H1A.36.36 0 01.64 24V1A.36.36 0 011 .64h30a.36.36 0 01.36.36v23a.36.36 0 01-.36.36zm-29.64-.72h29.28V1.36H1.36v22.28zm7.304-7.476c-.672 0-1.234-.128-1.687-.385s-.842-.6-1.169-1.029l.798-.644c.28.355.593.628.938.819.345.191.747.287 1.204.287.476 0 .847-.103 1.113-.308.266-.206.399-.495.399-.868 0-.28-.091-.52-.273-.721-.182-.201-.511-.338-.987-.414l-.574-.084a4.741 4.741 0 01-.924-.217c-.28-.098-.525-.229-.735-.392s-.374-.366-.49-.609a1.983 1.983 0 01-.175-.868c0-.354.065-.665.196-.931.13-.266.31-.488.539-.665s.501-.311.819-.399a3.769 3.769 0 011.022-.133c.588 0 1.08.103 1.477.308.396.206.744.49 1.043.854l-.742.672c-.159-.224-.392-.427-.7-.609-.308-.182-.695-.272-1.162-.272s-.819.1-1.057.3c-.238.201-.357.474-.357.819 0
.354.119.611.357.77.238.159.581.275 1.029.35l.56.084c.803.122 1.372.353 1.708.693.336.341.504.786.504 1.337 0 .7-.238 1.251-.714 1.652-.476.402-1.13.603-1.96.603zm6.733 0c-.672 0-1.234-.128-1.687-.385s-.842-.6-1.169-1.029l.798-.644c.28.355.593.628.938.819.345.191.747.287 1.204.287.476 0 .847-.103 1.113-.308.266-.206.399-.495.399-.868 0-.28-.091-.52-.273-.721-.182-.201-.511-.338-.987-.413l-.574-.084c-.336-.046-.644-.119-.924-.217s-.525-.229-.735-.392-.374-.366-.49-.609a1.983 1.983 0 01-.175-.868c0-.354.065-.665.196-.931.13-.266.31-.488.539-.665.229-.177.501-.311.819-.399a3.769 3.769 0 011.022-.133c.588 0 1.08.103 1.477.308.396.206.744.49 1.043.854l-.742.672c-.158-.224-.392-.427-.7-.609s-.695-.273-1.162-.273-.819.101-1.057.301c-.238.201-.357.474-.357.819 0 .354.119.611.357.77s.581.275 1.029.35l.56.084c.803.122 1.372.353 1.708.693.337.341.505.786.505 1.337 0 .7-.238 1.251-.715 1.652-.475.401-1.129.602-1.96.602zm7.378 0c-.485 0-.929-.089-1.33-.266s-.744-.432-1.028-.763a3.584 3.584 0 01-.665-1.19 4.778 4.778 0 01-.238-1.561c0-.569.079-1.087.238-1.554a3.56 3.56 0 01.665-1.197c.284-.332.627-.586 1.028-.763s.845-.266 1.33-.266.927.089 1.323.266.739.432 1.029.763c.289.331.513.73.672 1.197.158.467.238.985.238 1.554 0 .579-.08 1.099-.238 1.561a3.546 3.546 0 01-.672 1.19c-.29.331-.633.585-1.029.763a3.19 3.19 0 01-1.323.266zm0-.995c.606 0 1.102-.187 1.484-.56.383-.373.574-.942.574-1.708v-1.036c0-.765-.191-1.334-.574-1.708s-.878-.56-1.484-.56-1.102.187-1.483.56c-.383.374-.574.943-.574 1.708v1.036c0 .766.191 1.335.574 1.708.382.374.877.56 1.483.56z\u0026#34;\u0026gt;\u0026lt;/path\u0026gt;\n                  \u0026lt;path fill=\u0026#34;none\u0026#34; d=\u0026#34;M0 0H32V32H0z\u0026#34;\u0026gt;\u0026lt;/path\u0026gt;\n                \u0026lt;/svg\u0026gt;\n                \u0026lt;span style={{ color: \u0026#34;white\u0026#34; }}\u0026gt;Sign In with My OIDC Provider\u0026lt;/span\u0026gt;\n              \u0026lt;/Button\u0026gt;\n              \u0026lt;Divider label=\u0026#34;or\u0026#34; size=\u0026#34;small\u0026#34;/\u0026gt;
\n            \u0026lt;/Flex\u0026gt;\n          \u0026lt;/div\u0026gt;\n        );\n      }\n    }\n  }}\n  loginMechanisms={[\u0026#39;email\u0026#39;]}\n  signUpAttributes={[\u0026#39;email\u0026#39;]}\n  initialState=\u0026#34;signIn\u0026#34;\n  hideSignUp={true}\n/\u0026gt;\nTip 2: Building Passwordless Authentication Amazon Cognito now supports passwordless authentication, including sign-in with passkeys, email, and text messages. While the Authenticator component doesn't natively support these features, you can create a custom authentication experience using the Amplify JS library.\n// Needed in the Next.js App Router: this component uses React state and next/navigation.\n\u0026#39;use client\u0026#39;;\n\nimport { useState } from \u0026#39;react\u0026#39;;\nimport { useRouter } from \u0026#39;next/navigation\u0026#39;;\nimport { signIn, confirmSignIn, fetchUserAttributes } from \u0026#39;aws-amplify/auth\u0026#39;;\nimport { TextField, Button, CircularProgress, Alert } from \u0026#39;@mui/material\u0026#39;;\n\nexport default function Home() {\n  const [email, setEmail] = useState(\u0026#39;\u0026#39;);\n  const [code, setCode] = useState(\u0026#39;\u0026#39;);\n  const [loading, setLoading] = useState(false);\n  const [error, setError] = useState(\u0026#39;\u0026#39;);\n  const [showConfirmation, setShowConfirmation] = useState(false);\n  const router = useRouter();\n\n  // Step 1: start the passwordless flow and ask Cognito to email a one-time code.\n  const handleSignIn = async (e: React.FormEvent) =\u0026gt; {\n    e.preventDefault();\n    setLoading(true);\n    setError(\u0026#39;\u0026#39;);\n\n    try {\n      const { nextStep } = await signIn({\n        username: email,\n        options: {\n          authFlowType: \u0026#39;USER_AUTH\u0026#39;,\n          preferredChallenge: \u0026#39;EMAIL_OTP\u0026#39;,\n        },\n      });\n      if (nextStep.signInStep === \u0026#39;CONFIRM_SIGN_IN_WITH_EMAIL_CODE\u0026#39; ||\n        nextStep.signInStep === \u0026#39;CONTINUE_SIGN_IN_WITH_FIRST_FACTOR_SELECTION\u0026#39;\n      ) {\n        setShowConfirmation(true);\n      }\n    } catch (err) {\n      setError(err instanceof Error ?
err.message : \u0026#39;Sign in failed\u0026#39;); 34 } finally { 35 setLoading(false); 36 } 37 }; 38 39 const handleConfirmSignIn = async (e: React.FormEvent) =\u0026gt; { 40 e.preventDefault(); 41 setLoading(true); 42 setError(\u0026#39;\u0026#39;); 43 44 try { 45 const { nextStep: confirmSignInNextStep } = await confirmSignIn({ challengeResponse: code }); 46 47 if (confirmSignInNextStep.signInStep === \u0026#39;DONE\u0026#39;) { 48 const attributes = await fetchUserAttributes(); 49 if (attributes.email) { 50 router.push(\u0026#39;/home\u0026#39;); 51 } 52 } 53 } catch (err) { 54 setError(err instanceof Error ? err.message : \u0026#39;Confirmation failed\u0026#39;); 55 } finally { 56 setLoading(false); 57 } 58 }; 59 60 return ( 61 \u0026lt;div className=\u0026#34;flex items-center justify-center min-h-screen\u0026#34;\u0026gt; 62 \u0026lt;div className=\u0026#34;w-full max-w-md p-6\u0026#34;\u0026gt; 63 \u0026lt;div className=\u0026#34;text-center mb-8\u0026#34;\u0026gt; 64 \u0026lt;h1 className=\u0026#34;text-2xl font-bold mb-2\u0026#34;\u0026gt;Sign in to My App\u0026lt;/h1\u0026gt; 65 \u0026lt;p className=\u0026#34;text-gray-600\u0026#34;\u0026gt; 66 {showConfirmation ? \u0026#39;Enter the code sent to your email\u0026#39; : \u0026#39;Enter your email to receive a code\u0026#39;} 67 \u0026lt;/p\u0026gt; 68 \u0026lt;/div\u0026gt; 69 70 {error \u0026amp;\u0026amp; ( 71 \u0026lt;Alert severity=\u0026#34;error\u0026#34; className=\u0026#34;mb-4\u0026#34;\u0026gt; 72 {error} 73 \u0026lt;/Alert\u0026gt; 74 )} 75 76 {!showConfirmation ? 
( 77 \u0026lt;form onSubmit={handleSignIn}\u0026gt; 78 \u0026lt;TextField 79 fullWidth 80 label=\u0026#34;Email\u0026#34; 81 type=\u0026#34;email\u0026#34; 82 value={email} 83 onChange={(e) =\u0026gt; setEmail(e.target.value)} 84 disabled={loading} 85 required 86 className=\u0026#34;mb-4\u0026#34; 87 /\u0026gt; 88 \u0026lt;Button 89 fullWidth 90 variant=\u0026#34;contained\u0026#34; 91 type=\u0026#34;submit\u0026#34; 92 disabled={loading} 93 className=\u0026#34;mt-2\u0026#34; 94 \u0026gt; 95 {loading ? \u0026lt;CircularProgress size={24} /\u0026gt; : \u0026#39;Continue\u0026#39;} 96 \u0026lt;/Button\u0026gt; 97 \u0026lt;/form\u0026gt; 98 ) : ( 99 \u0026lt;form onSubmit={handleConfirmSignIn}\u0026gt; 100 \u0026lt;TextField 101 fullWidth 102 label=\u0026#34;Verification Code\u0026#34; 103 value={code} 104 onChange={(e) =\u0026gt; setCode(e.target.value)} 105 disabled={loading} 106 required 107 className=\u0026#34;mb-4\u0026#34; 108 /\u0026gt; 109 \u0026lt;Button 110 fullWidth 111 variant=\u0026#34;contained\u0026#34; 112 type=\u0026#34;submit\u0026#34; 113 disabled={loading} 114 className=\u0026#34;mt-2\u0026#34; 115 \u0026gt; 116 {loading ? \u0026lt;CircularProgress size={24} /\u0026gt; : \u0026#39;Verify\u0026#39;} 117 \u0026lt;/Button\u0026gt; 118 \u0026lt;/form\u0026gt; 119 )} 120 \u0026lt;/div\u0026gt; 121 \u0026lt;/div\u0026gt; 122 ); 123} Tip 3: Managing Backend Access with ID Tokens When working with authenticated users, proper token management is crucial. 
While Amplify automatically handles access tokens for data API requests, some scenarios require passing the ID token manually, because the ID token (unlike the access token) carries user attributes such as email as claims that your backend services can read.\n1import { fetchAuthSession } from \u0026#39;aws-amplify/auth\u0026#39;; 2 3const session = await fetchAuthSession(); 4if (!session.tokens?.idToken) throw new Error(\u0026#39;User not signed in\u0026#39;); 5 6await client.mutations.action({ 7 ...formData, 8}, { 9 authMode: \u0026#39;userPool\u0026#39;, 10 headers: { 11 \u0026#39;Authorization\u0026#39;: session.tokens.idToken.toString(), 12 } 13}); In the backend, an AppSync JS resolver can then read the user's email attribute from the identity claims: 1import { util } from \u0026#39;@aws-appsync/utils\u0026#39;; 2 3export function request(ctx) { 4 const owner = ctx.identity.claims.email || ctx.identity.username; 5} 6 7export function response(ctx) { 8 return ctx.result; 9} Tip 4: Mastering UI Development Amplify UI provides a rich set of components designed for seamless integration. 
This example shows how to maintain a consistent look and feel when combining Amplify UI with other popular libraries like Material UI (MUI): the MUI palette is derived from the Amplify theme tokens.\n1import { ThemeProvider, createTheme, defaultDarkModeOverride } from \u0026#39;@aws-amplify/ui-react\u0026#39;; 2import { styled, ThemeProvider as MUIThemeProvider, createTheme as createMUITheme } from \u0026#39;@mui/material/styles\u0026#39;; 3 4const theme = createTheme({ 5 name: \u0026#39;christmas-theme\u0026#39;, 6 tokens: { 7 colors: { 8 background: { 9 primary: { value: \u0026#39;#FFFFFF\u0026#39; }, // Snow white background 10 secondary: { value: \u0026#39;#165B33\u0026#39; }, // Christmas green 11 }, 12 }, 13 components: { 14 button: { 15 primary: { 16 backgroundColor: { value: \u0026#39;#CC231E\u0026#39; }, 17 color: { value: \u0026#39;#FFFFFF\u0026#39; }, 18 _hover: { 19 backgroundColor: { value: \u0026#39;#165B33\u0026#39; }, 20 }, 21 }, 22 }, 23 }, 24 }, 25 overrides: [defaultDarkModeOverride] 26}); 27 28const muiTheme = createMUITheme({ 29 palette: { 30 primary: { 31 main: theme.tokens.colors.font.interactive.value, 32 }, 33 }, 34}); 35 36export default function RootLayout({ 37 children, 38}: { 39 children: React.ReactNode; 40}) { 41 return ( 42 \u0026lt;html lang=\u0026#34;en\u0026#34; className={inter.className}\u0026gt; 43 \u0026lt;head\u0026gt; 44 \u0026lt;meta name=\u0026#34;viewport\u0026#34; content=\u0026#34;width=device-width, initial-scale=1\u0026#34; /\u0026gt; 45 \u0026lt;/head\u0026gt; 46 \u0026lt;body\u0026gt; 47 \u0026lt;ThemeProvider theme={theme}\u0026gt; 48 \u0026lt;AmplifyProvider\u0026gt; 49 \u0026lt;main className=\u0026#34;min-h-screen\u0026#34;\u0026gt; 50 \u0026lt;MUIThemeProvider theme={muiTheme}\u0026gt; 51 {children} 52 \u0026lt;/MUIThemeProvider\u0026gt; 53 \u0026lt;/main\u0026gt; 54 \u0026lt;/AmplifyProvider\u0026gt; 55 \u0026lt;/ThemeProvider\u0026gt; 56 \u0026lt;/body\u0026gt; 57 \u0026lt;/html\u0026gt; 58 ); 59} Tip 5: Extending Infrastructure with CDK For complex backend requirements, AWS CDK enables powerful 
customization of your Amplify backend. This allows you to manage all types of AWS resources while benefiting from the extensive CDK construct ecosystem.\nExample 1: Customizing Lambda Logging You might want to customize the logging configuration of a Lambda function in your Amplify backend. Here's how you can achieve this using CDK Interoperability:\n1(backend.leagueHandler.resources.lambda.node.defaultChild as CfnFunction).addPropertyOverride(\u0026#39;LoggingConfig\u0026#39;, { 2 LogFormat: \u0026#39;JSON\u0026#39;, 3 ApplicationLogLevel: process.env.PRODUCTION ? \u0026#39;WARN\u0026#39; : \u0026#39;TRACE\u0026#39;, 4 SystemLogLevel: \u0026#39;INFO\u0026#39;, 5}); Example 2: Using Environment Variables in AppSync JS Resolvers When using AppSync resolvers, you might need to access different external resources in different stages. AppSync environment variables let you do this. Here's how you can set them in your Amplify backend and access them in a resolver:\nFilename: amplify/backend.ts 1const { cfnResources } = backend.data.resources; 2cfnResources.cfnGraphqlApi.xrayEnabled = true; 3cfnResources.cfnGraphqlApi.environmentVariables = { 4 ...(config.data.knowledgeBaseId ? 
{ KNOWLEDGE_BASE_ID: config.data.knowledgeBaseId } : {}), 5 RERANK_MODEL_ID: \u0026#39;cohere.rerank-v3-5:0\u0026#39;, 6}; Filename: amplify/data/resolvers/kbResolver.js 1export function request(ctx) { 2 const { input } = ctx.args; 3 const { REGION, KNOWLEDGE_BASE_ID, RERANK_MODEL_ID } = ctx.env; 4 return { 5 resourcePath: `/knowledgebases/${KNOWLEDGE_BASE_ID}/retrieve`, 6 method: \u0026#34;POST\u0026#34;, 7 params: { 8 headers: { 9 \u0026#34;Content-Type\u0026#34;: \u0026#34;application/json\u0026#34;, 10 }, 11 body: JSON.stringify({ 12 retrievalQuery: { 13 text: input, 14 }, 15 vectorSearchConfiguration: { 16 numberOfResults: 30, 17 rerankingConfiguration: { 18 type: \u0026#39;BEDROCK\u0026#39;, 19 bedrockRerankingConfiguration: { 20 modelConfiguration: { 21 modelArn: `arn:aws:bedrock:${REGION}::foundation-model/${RERANK_MODEL_ID}`, 22 additionalModelRequestFields: { 23 topK: 20 24 } 25 }, 26 numberOfRerankedResults: 10 27 } 28 } 29 }, 30 }), 31 }, 32 }; 33} 34 35export function response(ctx) { 36 return JSON.stringify(ctx.result.body); 37} Tip 6: Optimizing DynamoDB Access Learn how to handle common challenges like circular dependencies when accessing DynamoDB tables from Lambda resolvers in your Amplify-generated AppSync API. 
You can access user identity information in your resolvers using the AppSync identity context.\n1import { defineBackend } from \u0026#39;@aws-amplify/backend\u0026#39;; 2import { auth } from \u0026#39;./auth/resource\u0026#39;; 3import { data, leagueHandler } from \u0026#39;./data/resource\u0026#39;; 4export const backend = defineBackend({ 5 auth, 6 data, 7 leagueHandler, 8}); 9 10const externalTableStack = backend.createStack(\u0026#39;ExternalTableStack\u0026#39;); 11 12const leagueTable = new Table(externalTableStack, \u0026#39;League\u0026#39;, { 13 partitionKey: { 14 name: \u0026#39;id\u0026#39;, 15 type: AttributeType.STRING 16 }, 17 billingMode: BillingMode.PAY_PER_REQUEST, 18 removalPolicy: RemovalPolicy.DESTROY, 19}); 20 21backend.data.addDynamoDbDataSource( 22 \u0026#34;ExternalLeagueTableDataSource\u0026#34;, 23 leagueTable as any 24); 25 26leagueTable.grantReadWriteData(backend.leagueHandler.resources.lambda); 27(backend.leagueHandler.resources.lambda as NodejsFunction).addEnvironment(\u0026#39;LEAGUE_TABLE_NAME\u0026#39;, leagueTable.tableName); Declare the DynamoDB table outside of the generated Amplify stack in amplify/backend.ts 1const schema = a.schema({ 2 League: a.customType({ 3 id: a.string().required(), 4 leagueCountry: a.ref(\u0026#39;LeagueCountry\u0026#39;), 5 teams: a.ref(\u0026#39;Team\u0026#39;).array(), 6 season: a.integer(), 7 }), 8}); Declare the DynamoDB schema as a custom type for AppSync in amplify/data/resource.ts; see here for more details.\nTip 7: Building Resilient AI Features Improve your application's reliability by implementing cross-region model inference with the Amplify AI Kit. 
While this is not supported out of the box, you can achieve it using CDK Interoperability.\nPatch the role of the conversation Lambda function and the AppSync resolver role used for generation in amplify/backend.ts 1function createBedrockPolicyStatement(currentRegion: string, accountId: string, modelId: string, crossRegionModel: string) { 2 return new PolicyStatement({ 3 resources: [ 4 `arn:aws:bedrock:*::foundation-model/${modelId}`, 5 `arn:aws:bedrock:${currentRegion}:${accountId}:inference-profile/${crossRegionModel}`, 6 ], 7 actions: [\u0026#39;bedrock:InvokeModel*\u0026#39;], 8 }); 9} 10 11if (CROSS_REGION_INFERENCE \u0026amp;\u0026amp; CUSTOM_MODEL_ID) { 12 const currentRegion = getCurrentRegion(backend.stack); 13 const crossRegionModel = getCrossRegionModelId(currentRegion, CUSTOM_MODEL_ID); 14 15 // [chat conversation] 16 const chatStack = backend.data.resources.nestedStacks?.[\u0026#39;ChatConversationDirectiveLambdaStack\u0026#39;]; 17 if (chatStack) { 18 const conversationFunc = chatStack.node.findAll() 19 .find(child =\u0026gt; child.node.id === \u0026#39;conversationHandlerFunction\u0026#39;) as IFunction; 20 21 if (conversationFunc) { 22 conversationFunc.addToRolePolicy( 23 createBedrockPolicyStatement(currentRegion, backend.stack.account, CUSTOM_MODEL_ID, crossRegionModel) 24 ); 25 } 26 } 27 28 // [insights generation] 29 const insightsStack = backend.data.resources.nestedStacks?.[\u0026#39;GenerationBedrockDataSourceGenerateInsightsStack\u0026#39;]; 30 if (insightsStack) { 31 const dataSourceRole = insightsStack.node.findChild(\u0026#39;GenerationBedrockDataSourceGenerateInsightsIAMRole\u0026#39;) as IRole; 32 if (dataSourceRole) { 33 dataSourceRole.attachInlinePolicy( 34 new Policy(insightsStack, \u0026#39;CrossRegionInferencePolicy\u0026#39;, { 35 statements: [ 36 createBedrockPolicyStatement(currentRegion, backend.stack.account, CUSTOM_MODEL_ID, crossRegionModel) 37 ], 38 }), 39 ); 40 } 41 } 42}\nSpecify the model ID in amplify/data/resource.ts 1const schema = 
a.schema({ 2 generateInsights: a.generation({ 3 aiModel: CROSS_REGION_INFERENCE ? { 4 resourcePath: getCrossRegionModelId(getCurrentRegion(undefined), CUSTOM_MODEL_ID!), 5 } : a.ai.model(LLM_MODEL), 6 systemPrompt: LLM_SYSTEM_PROMPT, 7 inferenceConfiguration: { 8 maxTokens: 1000, 9 temperature: 0.65, 10 }, 11 }) 12 .arguments({ 13 requirement: a.string().required(), 14 }) 15 .returns(a.customType({ 16 insights: a.string().required(), 17 })) 18 .authorization(allow =\u0026gt; [allow.authenticated()]), 19 20 chat: a.conversation({ 21 aiModel: CROSS_REGION_INFERENCE ? { 22 resourcePath: getCrossRegionModelId(getCurrentRegion(undefined), CUSTOM_MODEL_ID!), 23 } : a.ai.model(LLM_MODEL), 24 systemPrompt: FOOTBALL_SYSTEM_PROMPT, 25 }).authorization(allow =\u0026gt; allow.owner()), 26});\nTip 8: Creating Sophisticated Chat Interfaces The AIConversation component provides a flexible foundation for building chat applications. Master state management and user context handling for multiple conversations.\n1import { useState } from \u0026#39;react\u0026#39;; 2import { Fab, Paper, IconButton, Box, Tooltip, Typography } from \u0026#39;@mui/material\u0026#39;; 3import { AIConversation } from \u0026#39;@aws-amplify/ui-react-ai\u0026#39;; 4import { Avatar } from \u0026#39;@aws-amplify/ui-react\u0026#39;; 5import \u0026#39;@aws-amplify/ui-react/styles.css\u0026#39;; 6import { generateClient } from \u0026#39;aws-amplify/data\u0026#39;; 7import { createAIHooks } from \u0026#39;@aws-amplify/ui-react-ai\u0026#39;; 8import { type Schema } from \u0026#39;../../amplify/data/resource\u0026#39;; 9import ReactMarkdown from \u0026#39;react-markdown\u0026#39;; 10 11const client = generateClient\u0026lt;Schema\u0026gt;({ authMode: \u0026#39;userPool\u0026#39; }); 12const { useAIConversation } = createAIHooks(client); 13 14interface ChatBotProps { 15 chatId?: string; 16 refreshKey: number; 17 onStartNewChat: () =\u0026gt; void; 18 onLoadConversations: () =\u0026gt; void; 19 isLoading: boolean; 20} 
21 22export default function ChatBot({ 23 chatId, 24 refreshKey, 25 onStartNewChat, 26 onLoadConversations, 27 isLoading 28}: ChatBotProps) { 29 const [open, setOpen] = useState(refreshKey \u0026gt; 0); 30 const [position, setPosition] = useState({ x: 0, y: 0 }); 31 32 const conversation = useAIConversation(\u0026#39;chat\u0026#39;, { 33 id: chatId, 34 }); 35 const [{ data: { messages }, isLoading: isLoadingChat }, sendMessage] = conversation; 36 37 const handleOpen = () =\u0026gt; { 38 setOpen(true); 39 onLoadConversations(); 40 }; 41 42 const handleClose = () =\u0026gt; setOpen(false); 43 44 const handleNewChat = () =\u0026gt; { 45 // Reset conversation and create new chat 46 onStartNewChat(); 47 }; 48 49 return ( 50\u0026lt;Box sx={{ flexGrow: 1, overflow: \u0026#39;hidden\u0026#39; }}\u0026gt; 51 \u0026lt;AIConversation 52 key={chatId} 53 allowAttachments 54 messages={messages} 55 handleSendMessage={sendMessage} 56 isLoading={isLoadingChat || isLoading} 57 avatars={{ 58 user: { 59 avatar: \u0026lt;Avatar size=\u0026#34;small\u0026#34; alt={email} /\u0026gt;, 60 username: \u0026#39;People\u0026#39; 61 }, 62 ai: { 63 avatar: \u0026lt;Avatar size=\u0026#34;small\u0026#34; alt=\u0026#34;AI\u0026#34; /\u0026gt;, 64 username: \u0026#39;Chat Bot\u0026#39; 65 } 66 }} 67 messageRenderer={{ 68 text: ({ text }) =\u0026gt; \u0026lt;ReactMarkdown\u0026gt;{text}\u0026lt;/ReactMarkdown\u0026gt;, 69 }} 70 /\u0026gt; 71\u0026lt;/Box\u0026gt; 72 ); 73} Tip 9: Streamlining Deployment Debugging When troubleshooting deployment issues in Amplify Hosting, leverage the --debug flag for deeper insights into pipeline failures, especially when code works in sandbox but fails in production.\n1version: 1 2backend: 3 phases: 4 build: 5 commands: 6 - nvm install 20 7 - nvm use 20 8 - npm ci --cache .npm --prefer-offline 9 - npx ampx pipeline-deploy --branch $AWS_BRANCH --app-id $AWS_APP_ID --debug 10frontend: 11 phases: 12 preBuild: 13 commands: 14 - nvm install 20 15 - nvm use 20 16 build: 
17 commands: 18 - npm run build 19 artifacts: 20 baseDirectory: .next 21 files: 22 - \u0026#39;**/*\u0026#39; 23 cache: 24 paths: 25 - .next/cache/**/* 26 - .npm/**/* Conclusion AWS Amplify Gen 2 represents a significant evolution in fullstack development on AWS, offering a developer experience comparable to platforms like Vercel with Next.js. These tips will help you leverage Amplify's generated services alongside CDK's powerful constructs to build sophisticated serverless applications efficiently. The platform's seamless integration with the AWS ecosystem makes it an excellent choice for teams looking to accelerate their development process while maintaining enterprise-grade quality and scalability.\nIntroducing the Next Generation of AWS Amplify's Fullstack Development Experience\nFullstack TypeScript: Reintroducing AWS Amplify\n","link":"https://kane.mx/posts/2024/aws-amplify-you-need-to-know/","section":"posts","tags":["AWS","AWS Amplify","Amazon Bedrock","AWS AppSync","AWS Cognito","LLM","Claude","Serverless","Fullstack","GenAI"],"title":"Nine Essential Tips of AWS Amplify for Boosting Development Productivity"},{"body":"","link":"https://kane.mx/categories/serverless-computing/","section":"categories","tags":null,"title":"Serverless-Computing"},{"body":"So, Who Am I? Hey there! 👋 I'm a proud dad of three awesome boys who keep me on my toes (and occasionally drive me crazy 😅). Currently calling Beijing my home, where I juggle parenting with my tech adventures.\nSpeaking of tech - I'm one of those *nix nerds who gets excited about terminal commands and clean code. By day, I wear multiple hats as a full-stack software engineer, but I also geek out over DevOps and all things cloud native, especially serverless technologies. 
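Think of me as a digital Swiss Army knife - I love tinkering with different technologies and making things work seamlessly together!\nWhen I'm not chasing my kids around or diving deep into code, you might find me obsessing over the latest tech trends or trying to automate yet another part of my life (because who doesn't love a good automation script, right? 🤖).\nWant to know more about my adventures in tech and parenting? Feel free to stalk me (professionally, of course!) through the social links above. I promise my feed is a fun mix of code snippets, tech thoughts, and occasional dad jokes! 😉\n","link":"https://kane.mx/about/","section":"","tags":null,"title":"About"},{"body":"If you follow the tech scene, you have surely seen eye-catching topics like these on WeChat Moments or in livestreams:\nCoding beginners program with AI, and anyone can become an expert programmer Cross-domain development with zero experience A coding beginner's comeback: building xxx with AI Fueled by social media creators, AI programming has become a trend, even packaged as a skill anyone can master. Training companies, of course, have also seized the chance to peddle anxiety.\nAs large models keep getting stronger, especially with the release of Claude Sonnet 3.5 v2, AI programming has taken another significant step forward. AI coding tools and AI IDEs keep appearing, such as Cline (a VS Code extension), Aider (primarily terminal-based), Continue (VS Code and JetBrains extensions), Cursor (a custom IDE based on VS Code), Amazon Q Developer (VS Code and JetBrains extensions), and GitHub Copilot (a VS Code extension). The rapid emergence of these tools does show that LLMs have become markedly more capable at programming and are ready for practical use.\nRecently I used Cursor's AI-assisted programming end to end on a personal project, and I will use that project as an example to share my experience.\nThe project is a web application built with Next.js and Material UI for practicing English word dictation; it also helps computer beginners measure their typing efficiency. Although I have years of JavaScript/TypeScript experience, this was my first time building a web application with React, Next.js, and Material UI. I started from scratch, worked with an LLM (mainly Claude Sonnet 3.5 v2) in Chat mode to create the project skeleton, and then refined the features and pages step by step. The application was finally deployed through Vercel and AWS Amplify (Vercel version, AWS version).\nDuring development I mainly described requirements in Cursor's Chat and let Cursor generate code from them. My job was mostly to review the generated code, accept or reject it, and ask Cursor to keep refining it based on new feedback. Unlike traditional development, I no longer typed large amounts of code in the editor; instead I talked with Cursor, described what I needed, and received the corresponding implementation. Notably, Cursor can read the project's entire codebase, generate new code based on existing code, and produce changes for multiple files at once. It can also reference external documents and links, using RAG to compensate for gaps in the LLM's knowledge.\nIn my experience, the strengths of AI/LLMs in programming are mainly the following:\nQuickly generating a project skeleton Helping developers get started with an unfamiliar tech stack Quickly looking up and fixing common errors Summarizing and explaining code efficiently Quickly generating test code Quickly generating documentation Offering a variety of implementation ideas However, AI programming still cannot fully replace human programming. Its main limitations are:\nEfficient AI-assisted code generation requires solid domain knowledge. So-called \u0026quot;beginner users\u0026quot; often struggle to translate user requirements accurately into programming concepts, so they cannot write effective prompts 
that produce code meeting the business need. In my case, I was initially unfamiliar with MUI components and did not know their names or what they would look like. As a result, I could not get the page appearance and layout I wanted just by describing the styling I needed. Only after studying the MUI documentation and becoming familiar with its component system did I start specifying exact component names and style properties in my prompts, even attaching links to the MUI component documentation, and only then did I get the page I wanted. Another typical case came up while implementing dictation. I needed fairly complex playback control: a word can be set to replay after an interval, but when the user finishes typing the current word and moves to the next one, any pending playback scheduled earlier must be cancelled. From a textual description alone, the AI struggled to generate satisfactory code for logic this complex. A developer with the relevant programming knowledge and a grounding in data structures, however, would readily think of managing the pending words in a queue and cancelling playback by clearing that queue.\nLLMs are good at retrieving and matching existing information rather than reasoning their way to the root cause of a problem. When both the user and the LLM lack the relevant knowledge, they can keep retrying along the wrong path. For example, when I hit intermittent bugs and runtime warnings caused by incompatibilities among Next.js 15.0.x, React 18.x, and MUI 6.1.x, the LLM kept offering ineffective fixes even with the latest documentation as reference, without recognizing that the real problem was version compatibility.\nAn LLM's knowledge can lag behind. Even when RAG supplies up-to-date material, the model is still limited by the recency of its training data and may reach wrong conclusions. As models keep improving and RAG gets stronger, this problem should gradually ease.\nMy view is that AI programming cannot yet fully replace human programming, but it is already a powerful assistant. For newcomers, AI-assisted programming helps them get started with unfamiliar tech stacks faster, but solving complex or non-standard problems still requires solid expertise. For experienced developers, it significantly boosts productivity, quickly generates usable code scaffolding, and offers a variety of implementation ideas. Experienced developers can combine their own expertise and judgment with AI assistance to complete complex programming tasks faster and with higher quality.\n","link":"https://kane.mx/posts/2024/ai-copilot-for-programming/","section":"posts","tags":["Programming","IDE","GenAI","Cline","Aider","Continue","Cursor","Amazon Q Developer","GitHub Copilot","Visual Studio Code","LLM","OpenAI","Amazon Bedrock","Anthropic Claude","Productivity","Next.js","Material UI","Vercel","AWS Amplify"],"title":"Can AI Really Program Now?"},{"body":"","link":"https://kane.mx/tags/aider/","section":"tags","tags":null,"title":"Aider"},{"body":"","link":"https://kane.mx/tags/amazon-q-developer/","section":"tags","tags":null,"title":"Amazon Q Developer"},{"body":"","link":"https://kane.mx/tags/anthropic-claude/","section":"tags","tags":null,"title":"Anthropic Claude"},{"body":"","link":"https://kane.mx/tags/continue/","section":"tags","tags":null,"title":"Continue"},{"body":"","link":"https://kane.mx/tags/cursor/","section":"tags","tags":null,"title":"Cursor"},{"body":"","link":"https://kane.mx/tags/ide/","section":"tags","tags":null,"title":"IDE"},{"body":"","link":"https://kane.mx/tags/material-ui/","section":"tags","tags":null,"title":"Material 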
UI"},{"body":"","link":"https://kane.mx/tags/openai/","section":"tags","tags":null,"title":"OpenAI"},{"body":"","link":"https://kane.mx/tags/programming/","section":"tags","tags":null,"title":"Programming"},{"body":"","link":"https://kane.mx/tags/vercel/","section":"tags","tags":null,"title":"Vercel"},{"body":"","link":"https://kane.mx/tags/visual-studio-code/","section":"tags","tags":null,"title":"Visual Studio Code"},{"body":"","link":"https://kane.mx/tags/clickstream-analytics/","section":"tags","tags":null,"title":"Clickstream Analytics"},{"body":"","link":"https://kane.mx/series/clickstream-analytics/","section":"series","tags":null,"title":"Clickstream-Analytics"},{"body":"In this post, we will explore the observability features of our clickstream solution. Observability is crucial for understanding the health of your data pipeline, identifying issues promptly, and ensuring optimal performance. We'll cover the monitoring, logging, and troubleshooting capabilities built into the solution.\nOverview The clickstream analytics solution incorporates several observability features to help users monitor and maintain their data pipelines:\nLogging Custom CloudWatch dashboards Automated alerting Pipeline health checks Troubleshooting tools Let's delve into each of these components.\nLogging The solution utilizes Amazon CloudWatch Logs and Amazon S3 to centralize logs from various components of the data pipeline and store them cost-effectively. This includes:\nData ingestion service logs, generated from the containers in ECS. You can also enable access logs for the Application Load Balancer when configuring the data pipeline. Data processing job logs, which use Amazon EMR Serverless to process the raw ingestion data and persist the job logs in an S3 bucket. Data modeling workflow logs, which use AWS Step Functions and AWS Lambda to orchestrate the workflow. All these logs are stored in CloudWatch Logs. 
Lambda functions now support configuring CloudWatch log groups, allowing all logs to be stored in a single CloudWatch log group, facilitating easy searching and analysis across the entire workflow. Web console application logs, which are stored in CloudWatch Logs. Furthermore, if you need to search and analyze logs across the entire data pipeline, consider implementing the AWS solution Centralized Logging with OpenSearch.\nCustom CloudWatch Dashboards The solution automatically creates custom CloudWatch dashboards for each data pipeline. These dashboards provide at-a-glance views of key metrics, including:\nIngestion service health and performance Data processing job health and key metrics statistics Data modeling workflow health and performance Redshift resource usage The challenge lies in the modular design of the data pipeline. The pipeline has different components based on user choices, and even the data modeling on Redshift can be either Redshift Serverless or provisioned Redshift. Therefore, the dashboard is not fixed before the customer creates the data pipeline. The implementation approach is as follows:\nA common metrics stack is deployed before all other components, monitoring the metrics configurations in AWS Systems Manager Parameter Store. Each module (stack) creates one or more metric configurations as standard parameters (which are free but have a 4KB size limit) with the name pattern /Clickstream/metrics/\u0026lt;project id\u0026gt;/\u0026lt;stack id prefix\u0026gt;/\u0026lt;stack id\u0026gt;/\u0026lt;index\u0026gt;. A function in the common metrics stack updates the dashboard based on the metrics specified in the parameters. 
Figure 1: Example CloudWatch Dashboard for the Data Pipeline with KDS as sink, enabling data processing and data modeling with Redshift Serverless Figure 2: Example CloudWatch Dashboard for the Data Pipeline with MSK as sink, enabling data processing only You can refer to the FAQ How do I monitor the health of the data pipeline for my project? for guidance on using these metrics to monitor the health of your data pipeline.\nAutomated Alerting To proactively notify users of potential issues, the solution sets up built-in CloudWatch Alarms for critical metrics. These alarms can trigger notifications via Amazon SNS, allowing users to respond quickly to problems. Some key alarms include:\nECS Cluster CPU Utilization, used for scaling out/in of the ingestion fleet ECS Pending Task Count, which alarms when the fleet is scaling out Data processing job failures Failures in the workflow of loading data into Redshift Abnormal situations where no data is loaded into Redshift Maximum file age exceeding the data processing interval threshold, which often indicates a failure in the data loading workflow or insufficient Redshift capacity Failures in the workflow of refreshing materialized views Failures in the workflow of scanning metadata Troubleshooting To aid in troubleshooting, consider the following steps:\nReview the Troubleshooting documentation to see if it addresses your issue.\nLog Queries: Query CloudWatch Logs for common troubleshooting scenarios. When encountering an \u0026quot;Internal Server Error\u0026quot; in the solution web console, collect verbose error logs of the Lambda function running the solution web console as follows:\nNavigate to the CloudFormation stack of the solution web console. Click the Resources tab to search for the Lambda function with Logical ID ApiFunction68. Click the link in the Physical ID column to open the Lambda function on the Function page. Choose Monitor and then select View logs in CloudWatch. 
Refer to Accessing logs with the console for more details. Error Pattern Detection: Use automated analysis of logs to identify common error patterns and suggest potential solutions.\nData Quality Checks: Implement regular checks on processed data to identify potential data quality issues early in the pipeline.\nPerformance Analyzer: Utilize tools to analyze and optimize query performance in Redshift, including recommendations for table design and query structure.\nContact Support: Get expert assistance with this solution if you have an active support plan.\nBest Practices for Pipeline Observability To make the most of these observability features, consider the following best practices:\nRegular Monitoring: Check the CloudWatch dashboards regularly to familiarize yourself with normal patterns and quickly spot anomalies.\nAlert Tuning: Adjust alert thresholds based on your specific use case to minimize false positives while ensuring you catch real issues. Add additional alerts as needed.\nLog Analysis: Use CloudWatch Logs Insights to perform ad-hoc analysis on your logs when troubleshooting specific issues.\nOn-call Integration: Integrate alert notifications with your preferred tools, such as PagerDuty or AWS Chatbot.\nContinuous Improvement: Use the insights gained from observability tools to continuously improve your data pipeline's performance and reliability.\nConclusion Observability is a critical aspect of maintaining a healthy and efficient clickstream analytics pipeline. 
By leveraging the built-in monitoring, logging, and troubleshooting capabilities of the solution, you can ensure that your data pipeline remains robust and performant, allowing you to focus on deriving insights from your clickstream data.\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/pipeline-observability/","section":"posts","tags":["Clickstream Analytics","AWS","Observability","Monitoring","Logging","Troubleshooting"],"title":"Deep Dive Clickstream Analytics Series: Data Pipeline Observability"},{"body":"","link":"https://kane.mx/tags/logging/","section":"tags","tags":null,"title":"Logging"},{"body":"","link":"https://kane.mx/tags/monitoring/","section":"tags","tags":null,"title":"Monitoring"},{"body":"","link":"https://kane.mx/tags/observability/","section":"tags","tags":null,"title":"Observability"},{"body":"","link":"https://kane.mx/tags/amazon-quicksight/","section":"tags","tags":null,"title":"Amazon QuickSight"},{"body":"","link":"https://kane.mx/tags/data-visualization/","section":"tags","tags":null,"title":"Data Visualization"},{"body":"In this post, we will explore the reporting module of our clickstream solution. This module leverages Amazon QuickSight to provide powerful visualization and analysis capabilities for clickstream data, enabling users to gain valuable insights from their data.\nOverview The reporting module is engineered to be flexible, user-friendly, scalable, and highly customizable. It provides a comprehensive suite of capabilities designed to meet diverse analytical needs:\nIntegration with Amazon QuickSight for interactive data visualization and dashboard creation. Pre-built dashboards for common clickstream analytics use cases. A collection of advanced analytics models, including event analysis, funnel analysis, path analysis, retention analysis, and attribution analysis. Custom dashboard creation capabilities. 
Out-of-the-box Dashboards The clickstream solution comes with a set of pre-built dashboards that provide immediate value to users. These dashboards cover common analytics use cases and are designed to work with the standardized data schema produced by the data processing and data modeling modules.\nOverview Dashboard The out-of-the-box dashboards include:\nOverview: Provides a high-level summary of key metrics and trends. Acquisition: Focuses on user acquisition channels and their performance. Engagement: Analyzes user engagement patterns and metrics. Retention: Examines user retention rates and factors affecting them. Device: Provides insights into the devices, operating systems, and browsers used by your app or website users. Details: Views common and custom dimensions for individual events, and queries all user attributes for a specific user. These dashboards are automatically created and configured when a new application pipeline is registered, allowing users to start gaining insights immediately without any additional setup.\nThe solution leverages the CreateAnalysis API to create the analysis and dashboard programmatically.\nCustom Reporting While the out-of-the-box dashboards cover many common use cases, the solution also provides the flexibility for users to create custom reports tailored to their specific needs.\nThe custom reporting feature leverages Amazon QuickSight's capabilities, allowing users to:\nCreate new visualizations using the rich set of chart types available in QuickSight. Design custom dashboards by combining various visualizations. Use advanced features like parameters, controls, and calculated fields to create interactive and dynamic reports. Join other data sources outside of clickstream data for comprehensive analysis. To facilitate custom reporting, the solution creates a semantic layer in QuickSight, which includes:\nA dataset based on the clickstream_event_base_view table in Redshift. 
Predefined dimensions and measures derived from the clickstream data. Proper relationships between tables to enable easy joining of data. This semantic layer abstracts the complexity of the underlying data model, making it easier for business users to create reports without needing to write complex SQL queries.\nData Exploration For more advanced users or those who need to perform ad-hoc analysis, the solution provides a data exploration interface within the Analytics Studio.\nData Exploration Interface The data exploration feature allows users to:\nExplore event data with built-in analysis models without SQL and QuickSight knowledge. Create temporary visualizations based on query results. Save and share insights derived from exploratory analysis. This feature is particularly useful for:\nInvestigating specific user behaviors or event patterns. Validating hypotheses about user interactions. Identifying new metrics or dimensions for custom reporting. Performance Optimization To ensure optimal performance of the reporting module, especially when dealing with large volumes of clickstream data, the solution implements several optimization strategies:\nMaterialized Views: Frequently used aggregations and complex joins are pre-computed and stored as materialized views in Redshift, reducing query time for common reporting scenarios.\nPrecalculation: The metrics used in out-of-the-box dashboards are pre-calculated on a daily basis and stored in individual tables.\nCaching: QuickSight's SPICE engine is utilized to cache frequently accessed data, improving dashboard load times. See the FAQ How to speed up the loading of the default dashboard? 
to enable SPICE for the out-of-the-box dashboard.\nQuery Optimization: The semantic layer in QuickSight is designed with performance in mind, using appropriate aggregations and filters to minimize data scanned during queries.\nSecurity and Access Control The reporting module integrates with the solution's overall security model, ensuring that:\nOnly authorized users can access the dashboards and exploration interface. All data transfers between QuickSight and Redshift are encrypted in transit within the VPC via QuickSight's VPC connection. Conclusion The reporting module of the clickstream analytics solution provides a powerful and flexible way to derive insights from clickstream data. By combining out-of-the-box dashboards, custom reporting capabilities, and a data exploration interface, it caters to a wide range of analytics needs while maintaining performance and security.\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/report/","section":"posts","tags":["Clickstream Analytics","AWS","Amazon QuickSight","Reporting","Data Visualization","Business Intelligence"],"title":"Deep Dive Clickstream Analytics Series: Reporting"},{"body":"","link":"https://kane.mx/tags/reporting/","section":"tags","tags":null,"title":"Reporting"},{"body":"","link":"https://kane.mx/tags/chatgpt/","section":"tags","tags":null,"title":"ChatGPT"},{"body":"","link":"https://kane.mx/tags/development-tools/","section":"tags","tags":null,"title":"Development Tools"},{"body":"","link":"https://kane.mx/tags/openai-api/","section":"tags","tags":null,"title":"OpenAI API"},{"body":"Cursor is a powerful AI-assisted programming Integrated Development Environment (IDE) that comes with built-in Large Language Model (LLM) capabilities to aid in coding. With the introduction of Amazon's Bedrock Claude 3.5 model, developers now have access to advanced alternatives. 
In this post, we'll explore how to set up a custom gateway for Amazon Bedrock with the Claude Sonnet 3.5 foundation model and use it in Cursor as a custom model provider. This approach allows us to harness the cutting-edge language model capabilities of Amazon Bedrock within the familiar Cursor environment, potentially offering enhanced performance and cost-effectiveness for AI-assisted coding tasks.\nSetting Up the Bedrock Access Gateway Follow the steps outlined in this post to set up your Bedrock gateway.\nConfiguring Cursor to Use Bedrock Gateway To configure Cursor to use our Bedrock gateway:\nOpen the full settings of Cursor Navigate to the Models section Input the Bedrock gateway endpoint URL and API key (specified in the previous step) into the OpenAI API Key field Configure Bedrock gateway as OpenAI API server and key Choose any GPT model as your Cursor model. The Bedrock gateway will forward your request to Claude or other LLMs you've configured in the Bedrock gateway. Testing the Integration Try out some prompts in Cursor to ensure everything is working correctly. You should now be using Claude Sonnet 3.5 models via Amazon Bedrock!\nConclusion By setting up this custom gateway, you can leverage the power of Amazon's Bedrock models (such as Anthropic's Claude or Meta's Llama) within your familiar Cursor environment. This approach offers flexibility, potentially lower costs, and the ability to keep your data within the AWS ecosystem.\nRemember to monitor your usage and costs, as Bedrock pricing follows a pay-as-you-go model. Happy coding!\n","link":"https://kane.mx/posts/2024/cursor-meets-bedrock/","section":"posts","tags":["Cursor","LLM","OpenAI API","GenAI","ChatGPT","Anthropic Claude","AWS","Development Tools"],"title":"Using Amazon Bedrock as a Custom OpenAI Server Alternative in Cursor"},{"body":"Many Alfred users enjoy the convenience of the OpenAI ChatGPT workflow for quick AI-powered assistance. 
However, with the introduction of Amazon's Bedrock Claude 3 and 3.5 models, some may want to leverage these powerful alternatives. In this post, we'll explore how to set up a custom gateway to access Bedrock Claude models instead of the OpenAI API for your Alfred OpenAI workflow.\nWhy Use Bedrock Claude? Potentially lower costs compared to OpenAI's API Access to the latest Claude models Data privacy considerations (your data stays within AWS) Integration with other AWS services Setting Up the Bedrock Access Gateway Bedrock Access Gateway provides OpenAI-compatible RESTful APIs for Amazon Bedrock. It supports streaming responses via server-sent events (SSE) and is compatible with various Bedrock model families, including Anthropic Claude 3 (Haiku/Sonnet/Opus), Claude 3.5 Sonnet, Meta Llama 3, Mistral, and more.\nTo set up the Bedrock access gateway endpoint in your AWS account, follow the deployment guide. You can choose to deploy using either AWS Lambda or AWS Fargate for Amazon ECS as the compute resource based on your cost and performance trade-off. Once the deployment is complete, you can find the gateway endpoint in the Outputs tab of the CloudFormation stack, as shown below:\nProxy API Base URL aka OPENAI_API_BASE Configuring the Alfred ChatGPT Workflow Now, we need to configure the Alfred ChatGPT workflow to use our Bedrock gateway:\nOpen Alfred Preferences Install the ChatGPT workflow, then choose it in the workflow tab Replace the OpenAI API endpoint with your API Gateway URL (NOTE: remove the trailing /v1 from the URL shown in the stack output) Configure ChatGPT workflow Specify the API key you set when deploying the Bedrock Access Gateway as the OpenAI API key Testing the Integration Try out some prompts in Alfred to ensure everything is working correctly. 
You should now be using Bedrock Claude models instead of OpenAI!\nConclusion By setting up this custom gateway, you can leverage the power of Amazon's Bedrock Claude models within your familiar Alfred OpenAI workflow. This approach offers flexibility, potentially lower costs, and the ability to keep your data within the AWS ecosystem.\nRemember to monitor your usage and costs, as Bedrock pricing may differ from OpenAI's. Happy prompting!\n","link":"https://kane.mx/posts/2024/alfred-integration-with-bedrock-claude/","section":"posts","tags":["Alfred","Alfred Workflow","LLM","GenAI","OpenAI API","ChatGPT","Amazon Bedrock","Anthropic Claude","AWS","Productivity"],"title":"Access Bedrock Claude 3/3.5 Models with Alfred OpenAI ChatGPT Workflow"},{"body":"","link":"https://kane.mx/tags/alfred/","section":"tags","tags":null,"title":"Alfred"},{"body":"","link":"https://kane.mx/tags/alfred-workflow/","section":"tags","tags":null,"title":"Alfred Workflow"},{"body":"","link":"https://kane.mx/tags/amazon-redshift/","section":"tags","tags":null,"title":"Amazon Redshift"},{"body":"","link":"https://kane.mx/tags/data-modeling/","section":"tags","tags":null,"title":"Data Modeling"},{"body":"In this post, we will delve into the data modeling module of our clickstream solution. This module is an optional component that creates data models in the Amazon Redshift data warehouse and calculates reporting dimensions based on the event, session, and user factor tables generated in the data processing module.\nOverview Architecture Overview architecture for data modeling module The overview architecture demonstrates how the solution orchestrates loading clickstream data into the Amazon Redshift data warehouse and triggers data modeling within Redshift.\nData Loading Workflow The first workflow in the architecture diagram above illustrates how factor event, session, item, and user records are loaded into the Redshift data warehouse for further processing. 
The sequence for loading clickstream data into Redshift is as follows:\nRaw event data is processed by Apache Spark, a distributed system that shards the data across multiple nodes and sinks the processed data to S3 without maintaining order. The best approach for processing this sink data is using an event-driven architecture. We subscribe to the ObjectCreated event of the sink S3 bucket in EventBridge, then record the file information in DynamoDB as a to-be-processed item.\nWhen the EMR job execution completes, the Job Success event of the EMR Serverless job execution is emitted via EventBridge. This triggers the load data workflow orchestrated by AWS Step Functions to load the items from the DynamoDB table into Redshift.\nThe workflow mentioned in the previous step triggers four sub-workflows in parallel to load records into the corresponding target tables (event, user, session, and item).\nEach sub-workflow follows the same procedure:\nIt scans the DynamoDB table to find a batch of files to be loaded into the target Redshift table. Once files are found, it creates a manifest file in the S3 bucket for the Redshift COPY command, which is the most efficient way to load batch data into Redshift tables. The workflow submits batch SQL statements to Redshift via the Redshift Data API, then periodically checks the execution status with backoff until completion. When a batch of files is loaded into Redshift, the workflow scans the DynamoDB table again to load another batch if available, continuing until all existing files are loaded. The statements to load data into Redshift tables follow these steps:\nCopy the batch files from S3 into a temporary staging table with the same schema as the target table. The COPY command removes any duplicate records in that batch of data. Use the MERGE command to upsert the records to the target table, avoiding duplicate items. 
A best practice for using Step Functions to orchestrate long-running workflows is to be aware of the hard limit for Maximum execution history size in standard state machines. To mitigate this limitation, extract steps as sub-workflows when orchestrating long-running processes.\nData Modeling Workflow Once a batch of data is loaded into the tables, the load data workflow asynchronously starts the data modeling workflow without waiting for its completion. The data modeling in the clickstream solution consists of materialized views, self-managed view tables, dimension tables, and stored procedures. The workflow triggers the refreshing of materialized views, incrementally populates data into the clickstream_base_view table (which joins event and session data), and executes stored procedures to calculate daily metric dimensions for out-of-the-box dashboards. For cost-effectiveness, materialized view refreshing and base view population are executed at four-hour intervals by default, even if data loading occurs more frequently. The daily metric calculation is executed on a daily basis by default.\nMetadata Scanning Workflow The final step of the data loading workflow asynchronously triggers another workflow to scan metadata of new events on a daily basis. It aggregates the keys of custom properties for events and users, collecting their top 20 values and counts, then persists this information to the database (DynamoDB) of the web console for further exploratory analysis and metadata management.\nData Expiration Workflow The solution is designed to keep only recent (hot) events in Redshift to reduce data volume during exploratory event analysis. A time-based event scheduler triggers a workflow to clean up aging events as specified by the user.\nRedshift Management The solution supports both provisioned Redshift clusters and Redshift Serverless. 
Users must provision the Redshift cluster outside of the solution and can choose an existing provisioned Redshift cluster in the same region and account when configuring the pipeline.\nRedshift Serverless is handled slightly differently. The solution provisions a new Redshift Serverless namespace and workgroup, using IAM roles to access Redshift Serverless without creating an administrator user with a password. For information on accessing Redshift Serverless in the query editor, see the question \u0026quot;I already enable data modeling on Redshift, so why can't I see the schema and tables created by this solution in the Redshift query editor?\u0026quot; in the FAQ.\nThe two types of Redshift offer different pricing models for various use scenarios:\nProvisioned Redshift has a fixed cost, suitable for long-running workloads such as streaming ingestion. Redshift Serverless uses a pay-as-you-go model, where you only pay for actual usage. It's ideal for varied workloads. See this post for a deep dive into Redshift Serverless costs and use cases. You can also combine Redshift Serverless and provisioned Redshift to share the same data for optimal performance and cost trade-offs. See the question \u0026quot;How to implement a dedicated Redshift for Analytics Studio?\u0026quot; in the FAQ for information on combining the two types of Redshift in clickstream analysis.\nRedshift Resource Management The solution uses the infrastructure-as-code (IaC) tool AWS CDK to manage cloud resources. After provisioning the data pipeline with the given configuration, users can immediately use the solution without additional manual configurations. This means that Redshift tables, views, materialized views, and other resources are created or updated when the data pipeline is created or updated. 
Updating existing materialized views or creating new ones over large volumes of existing records (tens of billions or more) is time-consuming, potentially taking minutes to hours depending on data volume and overall Redshift cluster load. This could cause the entire pipeline provisioning process to time out or fail on rollback if Redshift resource updates are synchronous steps in the pipeline creation or update phase.\nTo address this, the solution makes the job of updating Redshift resources an asynchronous workflow, which never blocks pipeline creation or updates. The workflow is designed to be idempotent, allowing safe resumption of the job if it fails due to timeout or other constraints.\nRedshift Schema Best Practices Properly Configure BACKUP Option for MATERIALIZED VIEW When creating materialized views in Redshift that are crucial for persisting data for business purposes, use the BACKUP YES option. This ensures that the materialized view is included in automated and manual cluster snapshots.\nFor provisioned Redshift clusters, this practice is particularly important. When a cluster is maintained due to underlying hardware failure, it typically involves dumping a snapshot and restoring a new cluster from that snapshot. Materialized views created with the BACKUP NO option would be lost in this process.\nUse Proper Sort Key Clickstream events are time-series data, with each event containing a mandatory event_timestamp field of type timestamp. This field serves as the sort key for the event table, distributing records across Redshift clusters. When querying the event table, using the event_timestamp field in filter conditions is crucial. Without specifying this field in query conditions, Redshift would perform a full table scan, significantly impacting performance.\nUse SUPER Type for Semi-structured Data Clickstream events often require the ability to store arbitrary key-value pairs as custom properties for events and users. 
This presents a challenge for traditional relational databases. A common approach is to create an event_prop table to store these dynamic key-value pairs, like so:\nuuid event_id key value value_type 123e4567-e89b-12d3-a456-426614174000 1 user_id 42 INTEGER 123e4567-e89b-12d3-a456-426614174001 1 session_id abc123 STRING 123e4567-e89b-12d3-a456-426614174002 1 event_type click STRING 123e4567-e89b-12d3-a456-426614174003 2 user_id 43 INTEGER 123e4567-e89b-12d3-a456-426614174004 2 session_id def456 STRING 123e4567-e89b-12d3-a456-426614174005 2 event_type view STRING 123e4567-e89b-12d3-a456-426614174006 2 custom_field my_field_value STRING However, querying custom properties of events becomes extremely slow when joining billions of event records with hundreds of billions of event property records (assuming an event has 10 or more properties).\nThe SUPER type in Redshift offers a solution to this problem. It can contain complex values such as arrays, nested structures, and other complex structures associated with serialization formats like JSON. The SUPER data type is a set of schemaless array and structure values that encompass all other scalar types in Amazon Redshift. Our solution utilizes a SUPER field to represent the custom properties of events and users, containing an object with arbitrary key-value pairs.\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/data-modeling/","section":"posts","tags":["Clickstream Analytics","AWS","Amazon Redshift","Data Modeling"],"title":"Deep dive clickstream analytic series: Data Modeling"},{"body":"","link":"https://kane.mx/tags/amazon-emr/","section":"tags","tags":null,"title":"Amazon EMR"},{"body":"","link":"https://kane.mx/tags/apache-spark/","section":"tags","tags":null,"title":"Apache Spark"},{"body":"","link":"https://kane.mx/tags/data-processing/","section":"tags","tags":null,"title":"Data Processing"},{"body":"In this post, we will delve into the data processing module of our clickstream solution. 
This module is an optional component that normalizes raw clickstream events by cleaning, transforming, and enriching them to fit the standard clickstream data schema defined in the solution. It's designed for flexibility, reliability, high performance, and cost-effectiveness.\nOverview Architecture Overview architecture for data process module The data processing is designed for batch or micro-batch data processing to optimize performance and cost-effectiveness. It's primarily an Apache Spark application running on Amazon EMR Serverless to achieve a balance between high performance and cost. It offers the following capabilities:\nThe data processing job is triggered by a time-based Amazon EventBridge rule. The rule can be set to a fixed interval or any EventBridge rule-supported crontab expression. Users can tune the interval based on their business needs and budget; generally, more frequent processing intervals incur higher costs. A Lambda function is executed when the scheduler is triggered. This function acts as a job submitter, which submits a new job to the EMR Serverless application created in the data pipeline provisioning. It performs the following steps: Scans the S3 bucket storing the raw events collected by the data ingestion service. Estimates the minimum computing resources needed for processing the batch data based on the data volume. It uses the last modified timestamp of S3 objects, which is a strongly consistent system, to find new and previously unprocessed files. Does nothing if no new data files are found in the S3 bucket since the last run. This saves costs, especially in test environments with occasional events. Submits a new job execution to the EMR Serverless application with initial CPU cores, memory size, and timestamp range as the application's arguments. 
The data processing application, powered by Apache Spark, loads the given data files, cleans, transforms, and enriches the data, and finally writes the results back to the sink S3 bucket. Flexibility - A Pluggable Data Processing Implementation The solution provides flexibility to support multiple SDKs and even unknown third-party clients based on the HTTP protocol, as mentioned in the data ingestion section. The data processing application has a pluggable implementation to support other data transform and data enrichment implementations. When introducing a new client that sends clickstream events, you can create a custom data transform implementation to clean and transform the data to the normalized data schema of the solution. You can also specify zero or multiple enrichment implementations to enhance the clickstream event.\nBy default, the solution provides three data transformers to support the official SDKs of the clickstream solution, Google Tag Manager for server-side tagging (you can follow the Guidance for Using Google Tag Manager for Server-Side Website Analytics on AWS to set up GTM server-side servers on AWS), and Sensors Data.\nAdditionally, the solution provides the following built-in enrichments for clickstream events:\nIP enrichment: Uses GeoLite2 Free Geolocation Data by MaxMind to enrich source IP with city, continent, and country information. UA enrichment: Uses the ua_parser Java library to enrich User-Agent in the HTTP header with device and browser information. Traffic source enrichment: Uses page referrer, well-known UTM parameters, and configurable mediums to enrich the source, medium, campaign, and Click ID of the traffic for your websites and mobile applications. 
If you want to analyze your custom data with these built-in benefits, you can refer to this documentation to start developing your transform implementation and custom enrichment implementations.\nData Schema The data processing module processes raw clickstream events, which are mostly JSON data containing one or more client-side events, into the normalized data schema defined in the solution for further data modeling. The solution uses the following four tables to represent clickstream events:\nEvent: This table stores event data. Each record represents an individual event. The event records keep all common properties collected by the SDK and the custom properties specified by the user as a JSON object. Additionally, the data processing application appends a few JSON object columns to collect process info such as source IP, source file, etc. User: This table stores the latest user attributes. Each record represents a visitor (pseudonymous user). Item: This table stores event-item data. Each record represents an event that is associated with an item. Session: This table stores session data. Each record represents a session for each pseudonymous user. See detailed session definition in the SDK manual. Data Storage All processed files are saved in the data lake (S3 bucket) in Parquet format. The prefix of the object path contains the project id and app id for each application in the same project. The data processing module also creates a Glue table for each project when provisioning the pipeline. Users can use Amazon Athena to query the processed data with predefined partitions (app id, year, month, and day) to improve performance and save costs.\nBy default, all processed data is stored in the S3 bucket indefinitely. 
You can use S3 object lifecycle policies to automatically delete outdated files and save costs.\nCost-effectiveness Due to the nature of batch data processing, we use EMR Serverless to achieve a balance between performance and cost-effectiveness. The EMR Serverless application is not charged when no job is executing.\nSecondly, the job submitter intelligently allocates only the necessary compute resources for job execution to avoid using excessive resources and save costs.\nThird, the EMR Serverless application uses AWS Graviton architecture, providing at least 20% cost savings with better performance compared to using x86_64 architecture, without requiring any code changes!\nBased on the characteristics of batch data processing, larger data volumes and right-sized compute resources can achieve a better performance-cost ratio. In the benchmark conducted by the team, it costs $0.26 per GB when processing an appropriately sized data volume.\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/data-processing/","section":"posts","tags":["Clickstream Analytics","AWS","Amazon EMR","Apache Spark","Data Processing"],"title":"Deep dive clickstream analytic series: Data Processing"},{"body":"","link":"https://kane.mx/tags/amazon-ecs/","section":"tags","tags":null,"title":"Amazon ECS"},{"body":"","link":"https://kane.mx/tags/aws-cloudformation/","section":"tags","tags":null,"title":"AWS CloudFormation"},{"body":"","link":"https://kane.mx/tags/container/","section":"tags","tags":null,"title":"Container"},{"body":"In this post, we will delve into the data ingestion service of our clickstream solution. This service is a vital part of the clickstream analytics system. It is designed to be reliable, resilient, high-performing, flexible, and cost-effective. 
It plays a key role in capturing clickstream data from various sources and delivering it to downstream processing and modeling components.\nOverview Architecture Overview architecture for data ingestion service The data ingestion service of our clickstream solution is a standard web service that can be used by a wide range of clients, including web servers, mobile apps, IoT devices, and third-party platforms. It offers the following capabilities:\nOptional use of AWS Global Accelerator for accelerating end-users' global footprints and fast failover for multi-Region architectures. Application Load Balancer for backend fleet load balancing, SSL certificate offload, basic payload verification, and authentication via OIDC provider. The backend service is hosted on Amazon Elastic Container Service (Amazon ECS), supporting either EC2 or serverless Fargate (in future releases) for container hosting. Two-service backend architecture: Nginx service for processing CORS requests and forwarding incoming events. Worker service powered by open-source Vector for sending events to the configured sink service and enriching events with fields like ingest time and source IP. Support for three different sink services based on user preference: Amazon Kinesis or Apache Kafka for near real-time analysis with seconds latency. Amazon Managed Streaming for Apache Kafka (MSK) for those who prefer a fully managed Apache Kafka service; self-managed Kafka is also supported. Amazon S3 bucket for cost-effective sink when real-time analysis isn't required. Provisioning of MSK Connect instances or AWS Lambda functions to periodically synchronize events from upstream services to S3 bucket when using Kafka or Kinesis as sink destinations. Design Tenets Reliability Reliability is crucial for data ingestion, especially when handling large-scale events. Our service ensures reliability through:\nAuto Scaling groups with warm pools for rapid scaling during traffic spikes. 
Buffering mechanisms using Amazon Kinesis Data Streams, Amazon MSK, or Amazon S3 to handle sudden traffic spikes and potential downstream processing delays. All sink requests to Kafka and Kinesis are synchronous; an error HTTP status code (500 or greater) is returned when an event fails to be written to the downstream sink destination. All clickstream SDKs retry the event-sending request when they receive an error response code from the data ingestion service. However, using Amazon S3 as a sink destination is a bit different. The worker uses an in-memory cache to buffer the incoming events, then persists the batched events (using total event size or a time interval as the threshold) to the S3 bucket. It might lose a few events if the container or host instance crashes or restarts before the batched events are persisted. Once the incoming events are persisted in streaming services (Kafka or Kinesis), the solution uses asynchronous jobs to write the events to the S3 bucket to make sure all events are well received. Resilience Our data ingestion service is designed to withstand and recover from various failures and disruptions:\nMulti-AZ deployment for enhanced fault tolerance and high availability. Retry mechanisms with exponential backoff for handling transient failures. High Performance at Scale Optimized for high performance, especially during large scaling events:\nAWS Global Accelerator integration for improved global availability and performance. High-performance ingestion service capable of handling 4000 to 6000 requests per second (RPS) per EC2 server, with horizontal scaling to serve massive data volumes. Because the ingestion service is stateless, a horizontal scaling policy is enough to serve massive data volumes. We successfully guided a customer to serve 500,000 RPS in a single ingestion fleet. Flexibility The service offers flexibility through:\nMultiple sink options: Amazon S3, Amazon Kinesis Data Streams, and Amazon MSK. 
Customizable authentication and compatibility with third-party clickstream SDKs or clients. Cost-Effectiveness We ensure cost-effectiveness by:\nLeveraging serverless technologies for a pay-per-use model. Implementing data lifecycle management for optimized storage costs. Utilizing auto-scaling fleet with built-in policies to scale in during low-load periods. Resources Management Resource management is handled through AWS CloudFormation and AWS CDK, with:\nSeparate CloudFormation nested stacks for different sink destinations. Conditional resource management to control sink destination configurations. Shared stack for managing replacement CloudFormation resources to minimize outages during service updates. In conclusion, the data ingestion service is a critical component of our clickstream analytics system, providing reliable, resilient, high-performance, flexible, and cost-effective capabilities for capturing and processing clickstream data. Its robust architecture makes it suitable for businesses of all sizes, from small startups to large enterprises handling massive data volumes.\nStay tuned for our next post, where we'll delve into the data processing module of the clickstream analytics system!\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/data-ingestion/","section":"posts","tags":["Clickstream Analytics","AWS","Container","Amazon ECS","AWS CDK","AWS CloudFormation"],"title":"Deep dive clickstream analytic series: Data Ingestion"},{"body":"This post explores the web console module of the clickstream solution.\nThe web console allows users to create and manage projects with their data pipeline, which ingests, processes, analyzes, and visualizes clickstream data. 
In version 1.1, the Analytics Studio was introduced for business analysts, enabling them to view metrics dashboards, explore clickstream data, design customized dashboards, and manage metadata without requiring in-depth knowledge of data warehouses and SQL.\nOne code base for different architectures The web console is a web application built using AWS serverless technologies, as demonstrated in the Build serverless web application with AWS Serverless series.\nUse CloudFront, S3 and API Gateway for hosting web console Another use case is deploying the web console within a private network, such as a VPC, to meet compliance requirements. In this architecture, CloudFront, S3, and API Gateway are replaced by an internal application load balancer and Lambda functions running within the VPC.\nUse Application Load Balancer and Lambda for hosting web console privately When implementing those two deployment modes, using the AWS Lambda Web Adapter allows for sharing the same code base to gracefully process the events sent from both API Gateway and Application Load Balancer.\nAuthentication and Authorization The web console supports two deployment modes for authentication:\nIf the AWS region has Amazon Cognito User Pool available, the solution can automatically create a Cognito user pool as an OIDC provider. If the region does not have Cognito User Pool, or the user wants to use existing third-party OIDC providers like Okta or Keycloak, the solution allows specifying the OIDC provider information. 
For the API layer, a custom authorizer is used when the API is served by API Gateway, and an Express middleware is used when the API is served by the Application Load Balancer.\nThe backend code uses the Express framework to implement the authorization of the API, supporting both deployment modes.\nCentralized web console The clickstream solution uses a single web console, which serves as the control plane, to manage one or more isolated data pipelines across any supported region.\nPipeline lifecycle management The web console manages the lifecycle of the project's data pipeline, which is composed of modular components managed by multiple AWS CloudFormation stacks. The web console uses Step Functions workflows to manage the lifecycle of those CloudFormation stacks. The workflows are abstracted as parallel execution, serial execution and stack execution.\nStack Orchestration - Parallel execution Stack Orchestration - Serial execution Stack Orchestration - Stack execution The status change events of the CloudFormation stacks are emitted to an SNS topic via EventBridge, and the messages are then delivered to an SQS queue in the same or another region so that the centralized web console receives them.\nUse SQS \u0026 SNS cross-regions messages Service availability checks The web console provides a wizard that allows users to provision a data pipeline on the cloud based on their requirements, such as pipeline region, VPC, sink type, and data processing interval. 
Since AWS services have varying regional availability and features, the web console needs to dynamically display the available options based on the service availability, which it checks through the CloudFormation registry, as the pipeline components are managed by CloudFormation.\nBelow is a sample code snippet for checking the availability of key services used in the solution.\n1import { CloudFormationClient, DescribeTypeCommand } from \u0026#39;@aws-sdk/client-cloudformation\u0026#39;; 2 3export const describeType = async (region: string, typeName: string) =\u0026gt; { 4 try { 5 const cloudFormationClient = new CloudFormationClient({ 6 ...aws_sdk_client_common_config, 7 region, 8 }); 9 const params: DescribeTypeCommand = new DescribeTypeCommand({ 10 Type: \u0026#39;RESOURCE\u0026#39;, 11 TypeName: typeName, 12 }); 13 return await cloudFormationClient.send(params); 14 } catch (error) { 15 logger.warn(\u0026#39;Describe AWS Resource Types Error\u0026#39;, { error }); 16 return undefined; 17 } 18}; 19 20export const pingServiceResource = async (region: string, service: string) =\u0026gt; { 21 let resourceName = \u0026#39;\u0026#39;; 22 switch (service) { 23 case \u0026#39;emr-serverless\u0026#39;: 24 resourceName = \u0026#39;AWS::EMRServerless::Application\u0026#39;; 25 break; 26 case \u0026#39;msk\u0026#39;: 27 resourceName = \u0026#39;AWS::KafkaConnect::Connector\u0026#39;; 28 break; 29 case \u0026#39;redshift-serverless\u0026#39;: 30 resourceName = \u0026#39;AWS::RedshiftServerless::Workgroup\u0026#39;; 31 break; 32 case \u0026#39;quicksight\u0026#39;: 33 resourceName = \u0026#39;AWS::QuickSight::Dashboard\u0026#39;; 34 break; 35 case \u0026#39;athena\u0026#39;: 36 resourceName = \u0026#39;AWS::Athena::WorkGroup\u0026#39;; 37 break; 38 case \u0026#39;global-accelerator\u0026#39;: 39 resourceName = \u0026#39;AWS::GlobalAccelerator::Accelerator\u0026#39;; 40 break; 41 case \u0026#39;flink\u0026#39;: 42 resourceName = 
\u0026#39;AWS::KinesisAnalyticsV2::Application\u0026#39;; 43 break; 44 default: 45 break; 46 }; 47 if (!resourceName) return false; 48 const resource = await describeType(region, resourceName); 49 return !!resource?.Arn; 50}; Optimize package size Node modules hell One crucial best practice when working with serverless functions like AWS Lambda is optimizing the package size. Just like in the analogy shown here, a smaller and more compact package results in faster performance, lower costs, and more efficient resource utilization. However, when we reach the node_modules folder, it's akin to a vast, sprawling structure, representing an unoptimized and bloated package size. This can lead to slower cold starts, higher compute costs, and potential issues with deployment limits. By following best practices such as minimizing dependencies, leveraging code bundling and tree-shaking techniques, and optimizing asset handling, you can achieve an optimized package size, ensuring your serverless functions are lean, efficient, and cost-effective.\nOriginally, the solution used node_prune to remove unneeded files from node_modules and reduce the Lambda package size. The package size was reduced from 200 MB to 120 MB. After new dependencies were introduced, the unpruned package grew to 300 MB, and node_prune could no longer bring it under Lambda's hard limit of 250 MB for the unzipped package. The team then adopted vercel/ncc, an open-source tool, to bundle and tree-shake node_modules. vercel/ncc bundles the solution code and its dependencies into a single file of only 20 MB, down from the original 300 MB. 
It’s amazing!\nOptimized package size One more thing to keep in mind is to maintain the relative path of the configuration files used by your application.\nEmbedded QuickSight dashboards Analytics Studio is a component of the clickstream web console that integrates Amazon QuickSight dashboards into our web application.\nHigh level architecture of embedded QuickSight in web console Embedded QuickSight dashboard In the Exploration of Analytics Studio, the web console will automatically create temporary visuals and dashboards for each query owned by the QuickSight role/user used by the web console, ensuring these temporary resources remain invisible to other QuickSight users.\n","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/web-console/","section":"posts","tags":["Clickstream Analytics","AWS","Serverless computing"],"title":"Deep dive clickstream analytic series: Serverless web console"},{"body":"","link":"https://kane.mx/tags/serverless-computing/","section":"tags","tags":null,"title":"Serverless Computing"},{"body":"In the last couple of months, I led a team to build a comprehensive and open-sourced solution that helps customers analyze clickstream events on the cloud. The solution provides data autonomy, allowing users full access to raw data, near real-time ingestion, flexible configurations, and cost-effectiveness. It is a system that utilizes serverless services to cater to various customers, whether small businesses or large-scale events with massive data volumes, offering fully managed services with minimal operational efforts or the flexibility to use preferred open-source technical stacks.\nThe clickstream analytics system typically consists of several modules: SDKs, Data Ingest, Data Processing, Data Modeling and Visualization.\nGeneral clickstream analytics system architecture Building a well-architected and secure cloud-native system with modular, resilient, and cost-effective components is always challenging. 
The solution provides a production-ready and decoupled implementation, with most modules being optional.\nIn this series, I will walk through every module and dive deep into how to build a cloud-native system that supports multiple technical variations and components.\nServerless Web Console Data Ingestion Data Processing Data Modeling Reporting Data Pipeline Observability ","link":"https://kane.mx/posts/deep-dive-clickstream-analytics/preface/","section":"posts","tags":["Clickstream Analytics","AWS","Serverless computing"],"title":"How to build a clickstream analytic system for small businesses to large-scale events"},{"body":"","link":"https://kane.mx/tags/amazon-athena/","section":"tags","tags":null,"title":"Amazon Athena"},{"body":"","link":"https://kane.mx/tags/analytics/","section":"tags","tags":null,"title":"Analytics"},{"body":"In today's digital age, businesses are constantly seeking ways to understand and analyze user behavior on their websites. Clickstream events provide valuable insights into how users interact with a website, and analyzing this data can help businesses make informed decisions to improve user experience and drive conversions.\nClickstream Analytics on AWS collects, ingests, analyzes, and visualizes clickstream events from your websites and mobile applications. The solution manages an ingestion endpoint that receives clickstream events, which the solution's SDKs send in batches of multiple events.\nOnce the ingestion endpoint receives the events, they are stored in an Amazon S3 bucket without additional processing. The bucket path is configured as a Glue table in the solution's AWS Glue Data Catalog, so the data is available for analysis using Amazon Athena.\nOne use case is to query and analyze the raw clickstream data to gain immediate insights after the data is stored in the S3 bucket. For example, the operators can debug the clickstream events without waiting for the data to be processed. 
However, querying the raw data poses two challenges:\nThe clickstream events are compressed by the SDKs, so the data is not easily queryable. Extracting the events with an Athena UDF can hit the Lambda payload limit, with the error Response payload size exceeded maximum allowed payload size (6291556 bytes). In this post, I will show you how to use Amazon Athena UDFs to query the raw clickstream data and overcome these challenges.\nThe steps are:\nClone the repo: https://github.com/zxkane/aws-athena-query-federation Follow its steps to build and deploy the UDFs as a Lambda function. After the deployment completes, find the ARN of the Lambda function. Let’s call it clickstream-udfs. Go to the console of Glue. Run the query below to load the latest partitions of the raw data. 1msck repair table \u0026lt;your project id\u0026gt;.ingestion_events; Run the sample queries below to view the compressed data. 1-- view compressed data 2USING EXTERNAL FUNCTION decompress_clickstream_common_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39;, 3 EXTERNAL FUNCTION decompress_clickstream_attribute_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39;, 4 EXTERNAL FUNCTION decompress_clickstream_user_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39; 5SELECT 6 json_parse(decompress_clickstream_user_fields(data)), 7 json_parse(decompress_clickstream_common_fields(data)), 8 json_parse(decompress_clickstream_attribute_fields(data)) 9FROM \u0026#34;\u0026lt;your project id\u0026gt;\u0026#34;.\u0026#34;ingestion_events\u0026#34; 10WHERE year=\u0026#39;2024\u0026#39; and month=\u0026#39;06\u0026#39; and day=\u0026#39;20\u0026#39; and hour=\u0026#39;02\u0026#39; 11limit 10; 12 13-- count the received raw events 14USING EXTERNAL FUNCTION decompress_clickstream_common_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39;, 15 EXTERNAL FUNCTION 
decompress_clickstream_attribute_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39;, 16 EXTERNAL FUNCTION decompress_clickstream_user_fields(col1 VARCHAR) RETURNS VARCHAR LAMBDA \u0026#39;\u0026lt;your lambda arn\u0026gt;\u0026#39; 17SELECT 18 sum(json_array_length(json_parse(decompress_clickstream_common_fields(data)))) 19FROM \u0026#34;\u0026lt;your project id\u0026gt;\u0026#34;.\u0026#34;ingestion_events\u0026#34; 20WHERE year=\u0026#39;2024\u0026#39; and month=\u0026#39;06\u0026#39; and day=\u0026#39;20\u0026#39; and hour=\u0026#39;02\u0026#39;; In summary, using Amazon Athena UDFs to query raw clickstream data offers the following benefits:\nImmediate access to data: You can analyze clickstream events as soon as they're stored in the S3 bucket, without waiting for additional processing. Debugging capabilities: Operators can quickly debug clickstream events by directly querying the raw data. Overcoming compression challenges: The UDFs allow you to decompress and parse the data on-the-fly, making it easily queryable. Avoiding Lambda payload limitations: By using separate UDFs for different parts of the data, you can circumvent the Lambda payload size restrictions. 
","link":"https://kane.mx/posts/2024/analyzing-clickstream-events-using-amazon-athena-udfs/","section":"posts","tags":["Amazon Athena","Analytics","Athena UDF","Clickstream Analytics","AWS","AWS Lambda"],"title":"Analyzing Clickstream Events Using Amazon Athena UDFs"},{"body":"","link":"https://kane.mx/tags/athena-udf/","section":"tags","tags":null,"title":"Athena UDF"},{"body":"","link":"https://kane.mx/tags/aws-lambda/","section":"tags","tags":null,"title":"AWS Lambda"},{"body":"","link":"https://kane.mx/tags/clean-code/","section":"tags","tags":null,"title":"Clean Code"},{"body":"","link":"https://kane.mx/tags/devops/","section":"tags","tags":null,"title":"DevOps"},{"body":"As developers, we all know the importance of maintaining high code quality standards. One powerful tool that can help us achieve this is SonarQube, a renowned platform for continuous code quality inspection. However, setting up and maintaining a dedicated SonarQube instance can be a cumbersome task, requiring significant resources and ongoing maintenance.\nFortunately, GitHub Actions offers a convenient solution by allowing us to spin up an ephemeral (short-lived) SonarQube instance directly within our workflow. This approach streamlines the process, eliminating the need for a permanent SonarQube server while still reaping the benefits of its code analysis capabilities.\nWhy Ephemeral SonarQube? Using an ephemeral SonarQube instance in your GitHub Actions workflow provides several advantages:\nNo Infrastructure Overhead: With ephemeral SonarQube, you don't need to worry about provisioning and maintaining a dedicated server or virtual machine for SonarQube. This reduces infrastructure costs and management overhead. Scalability: Ephemeral instances can be easily spun up and torn down as needed, making the process highly scalable and adaptable to your project's requirements. 
Consistent Environment: By running SonarQube within the GitHub Actions environment, you ensure a consistent and reproducible analysis environment across all your builds. For open-source projects, it validates every Pull Request without requiring a SonarQube instance and permissions for contributors. Secure and Isolated: Each ephemeral SonarQube instance is isolated and secure, reducing the risk of cross-contamination or security vulnerabilities. Setting up Ephemeral SonarQube in GitHub Actions Setting up an ephemeral SonarQube instance in your GitHub Actions workflow is a straightforward process. Here's a high-level overview of the steps involved:\nDefine a GitHub Actions Workflow: Create a new workflow file (e.g., .github/workflows/sonar-check.yml) in your repository. Configure SonarQube Instance as a service: Run the SonarQube instance as a service container. Specify the instance details, including the version and edition (community or enterprise) you want to use. Then, configure the custom quality profiles and quality gate to match the code quality standards of your project. Build Your Project and Run Tests with Reports: Build your code and run tests for coverage reports, if applicable. Trigger Code Analysis: Use the sonarsource/sonarqube-scan-action or another community action using SonarQube Scanner CLI to run the SonarQube Scanner against your codebase, configuring any necessary analysis properties or exclusions. Check the Analysis Result: After the analysis is complete, check whether the result meets the quality gate. Update Analysis Result to Pull Request: Write the SonarQube analysis result to the Pull Request as a new comment. Ideally, any newly found issues are also commented inline on the new code. 
Here's an example of how your GitHub Actions workflow might look:\n1name: code scans 2on: 3 pull_request: {} 4 workflow_dispatch: {} 5jobs: 6 sonarqube: 7 name: sonarqube scan 8 runs-on: \u0026#39;ubuntu-latest\u0026#39; 9 services: 10 sonarqube: 11 image: public.ecr.aws/docker/library/sonarqube:10-community 12 ports: 13 - 9000:9000 14 steps: 15 - uses: actions/checkout@v4 16 - name: Setup Node.js 17 uses: actions/setup-node@v4 18 with: 19 node-version: 20.x 20 - name: Install dependencies 21 run: yarn install --check-files \u0026amp;\u0026amp; yarn --cwd example/ install --check-files 22 - name: Run build and unit tests 23 run: npx projen compile \u0026amp;\u0026amp; npx projen test 24 - name: Configure sonarqube 25 env: 26 SONARQUBE_URL: http://localhost:9000 27 SONARQUBE_ADMIN_PASSWORD: ${{ secrets.SONARQUBE_ADMIN_PASSWORD }} 28 run: | 29 bash .github/workflows/sonarqube/sonar-configure.sh 30 - name: SonarQube Scan 31 uses: sonarsource/sonarqube-scan-action@master 32 env: 33 SONAR_HOST_URL: http://sonarqube:9000 34 SONAR_TOKEN: ${{ env.SONARQUBE_TOKEN }} 35 with: 36 args: \u0026gt; 37 -Dsonar.projectKey=pr-${{ github.event.pull_request.number }} 38 # Check the Quality Gate status. 39 - name: SonarQube Quality Gate check 40 id: sonarqube-quality-gate-check 41 uses: sonarsource/sonarqube-quality-gate-action@master 42 # Force to fail step after specific time. 
43 timeout-minutes: 5 44 env: 45 SONAR_TOKEN: ${{ env.SONARQUBE_TOKEN }} 46 SONAR_HOST_URL: http://localhost:9000 47 - uses: phwt/sonarqube-quality-gate-action@v1 48 id: quality-gate-check 49 if: always() 50 with: 51 sonar-project-key: pr-${{ github.event.pull_request.number }} 52 sonar-host-url: http://sonarqube:9000 53 sonar-token: ${{ env.SONARQUBE_TOKEN }} 54 github-token: ${{ secrets.PROJEN_GITHUB_TOKEN }} 55 - name: Comment results and findings on Pull Request 56 uses: zxkane/sonar-quality-gate@master 57 if: always() 58 env: 59 DEBUG: true 60 GITHUB_TOKEN: ${{ secrets.PROJEN_GITHUB_TOKEN }} 61 GIT_URL: \u0026#34;https://api.github.com\u0026#34; 62 GIT_TOKEN: ${{ secrets.PROJEN_GITHUB_TOKEN }} 63 SONAR_URL: http://sonarqube:9000 64 SONAR_TOKEN: ${{ env.SONARQUBE_TOKEN }} 65 SONAR_PROJECT_KEY: pr-${{ github.event.pull_request.number }} 66 with: 67 login: ${{ env.SONARQUBE_TOKEN }} 68 skipScanner: true In this example, the workflow is triggered on every pull request and can also be started manually through workflow_dispatch. The sonarsource/sonarqube-scan-action runs the SonarQube Scanner to perform the code analysis. The SONAR_TOKEN environment variable is used to authenticate with the ephemeral SonarQube instance, whose URL is specified as localhost:9000 or sonarqube:9000 depending on the step; both addresses are reachable only from within the workflow runtime.\nAfter the analysis is complete, the sonarsource/sonarqube-quality-gate-action is used to check if the code meets the defined quality gate criteria. The custom quality gate is configured in the Configure sonarqube step.\nThe final step comments the results on the Pull Request and also adds the issues found as inline comments.\nYou can check the complete sample in this repo.\nConclusion Incorporating ephemeral SonarQube into your GitHub Actions workflow streamlines the process of continuous code quality inspection. 
By leveraging the power of SonarQube without the overhead of maintaining a dedicated instance, you can ensure that your codebase adheres to high quality standards with minimal effort.\nGive ephemeral SonarQube a try in your next project and experience the benefits of seamless code analysis and quality assurance within your GitHub Actions workflows.\n","link":"https://kane.mx/posts/2024/scan-your-code-with-ephemeral-sonarqube-in-github-actions/","section":"posts","tags":["Clean Code","SonarQube","Github Actions","CI","DevOps"],"title":"Scan Your Code with Ephemeral SonarQube in GitHub Actions"},{"body":"","link":"https://kane.mx/tags/sonarqube/","section":"tags","tags":null,"title":"SonarQube"},{"body":"","link":"https://kane.mx/tags/amazon-dynamodb/","section":"tags","tags":null,"title":"Amazon DynamoDB"},{"body":"","link":"https://kane.mx/tags/amazon-vpc/","section":"tags","tags":null,"title":"Amazon VPC"},{"body":"Amazon DynamoDB now supports AWS PrivateLink as of March 19, 2024. This feature allows you to securely access DynamoDB from your Amazon Virtual Private Cloud (VPC) without exposing your traffic to the public internet.\nHowever, unlike VPC endpoints for other AWS managed services, the AWS PrivateLink for Amazon DynamoDB does not support the Private DNS feature. This means that if your subnets are configured with only a DynamoDB Interface VPC endpoint, the public DNS name of the DynamoDB service (e.g., dynamodb.us-east-1.amazonaws.com in the us-east-1 region) cannot be resolved in those subnets.\nAs a result, you cannot share the same code to connect to the DynamoDB endpoint via the internet or a Gateway VPC endpoint when using Interface VPC endpoints. Instead, when you create an interface endpoint, DynamoDB generates two types of endpoint-specific DNS names: Regional and zonal. 
You must specify your own endpoint information when creating the DynamoDB client.\n1# replace the Region us-east-1 and VPC endpoint ID https://vpce-1a2b3c4d-5e6f.dynamodb.us-east-1.vpce.amazonaws.com with your own information. 2ddb_client = session.client( 3service_name=\u0026#39;dynamodb\u0026#39;, 4region_name=\u0026#39;us-east-1\u0026#39;, 5endpoint_url=\u0026#39;https://vpce-1a2b3c4d-5e6f.dynamodb.us-east-1.vpce.amazonaws.com\u0026#39; 6) As an experienced AWS developer, it's easy to assume that the newly launched DynamoDB Interface VPC endpoint behaves like other AWS managed services, allowing you to continue using existing code to initialize the DynamoDB client in isolated subnets. However, this assumption would be incorrect and could lead to issues.😂😂😂\nMake sure to update your application code to use the endpoint-specific DNS names or the endpoint URL when working with DynamoDB Interface VPC endpoints. You can find more examples in the AWS documentation.\n","link":"https://kane.mx/posts/2024/dynamodb-interface-vpc-endpoint/","section":"posts","tags":["AWS","Amazon DynamoDB","Amazon VPC","Tip"],"title":"Avoiding Pitfalls When Using Amazon DynamoDB Interface VPC Endpoints"},{"body":"","link":"https://kane.mx/tags/tip/","section":"tags","tags":null,"title":"Tip"},{"body":"Serverless computing is all the rage, promising pay-as-you-go magic and freedom from infrastructure woes. But what about serverless for data warehouses? Let's delve into the fascinating (and sometimes confusing) world of Redshift Serverless: its cost structure, ideal use cases, and situations where it might not be the best fit.\nCost Breakdown: Beyond the Illusion of Free Redshift Serverless offers a compelling promise: only pay for what you use. But like any good magic trick, there's more to the story. Here's the primary cost breakdown:\nCompute Units (RPUs): You're charged per second for used compute capacity. This is fantastic for burst workloads, but beware of idle charges. 
Even when your warehouse is inactive, the base capacity incurs costs. There is also a 60-second minimum charge, even if a single query runs for only one second within the billing period. Storage: Redshift Managed Storage (RMS) charges apply to the data you store, regardless of serverless or provisioned clusters. Data Transfer: Cross-region data sharing, or accessing data from other AWS services such as S3 or Glue outside the region, attracts data transfer charges. Of these components, RPU usage has the largest impact on the cost of Redshift Serverless. Let's see a few examples of how to analyze the cost of your Redshift Serverless workgroup.\nRedshift Serverless provides the SYS_SERVERLESS_USAGE system view, which shows details of resource usage. A few rows selected from it look like this:\nstart_time end_time compute_seconds compute_capacity data_storage cross_region_transferred_data charged_seconds 2024-02-24 16:34:00 2024-02-24 16:35:00 62 8 31224 0 480 2024-02-24 16:33:00 2024-02-24 16:34:00 48 8 31218 0 0 2024-02-24 16:30:00 2024-02-24 16:31:00 0 0 31218 0 480 2024-02-24 16:29:00 2024-02-24 16:30:00 13 8 31217 0 0 2024-02-24 16:28:00 2024-02-24 16:29:00 0 0 31217 0 480 2024-02-24 16:27:00 2024-02-24 16:28:00 29 8 31210 0 480 From the above records, we know the Redshift Serverless workgroup is configured with 8 RPUs (minimum RPUs). 
For every minute in which Redshift Serverless is active, 480 RPU-seconds (60 seconds * 8 RPUs) are charged for compute, even though the actual compute-seconds used are far fewer!\nYou can use the query below to view the percentage of the charged seconds that your queries actually used.\n1-- query actual compute usage vs charged usage per day for Redshift Serverless running in us-west-2 2select 3 DATE_TRUNC(\u0026#39;day\u0026#39;, start_time) AS query_day, 4 sum(compute_seconds) as used_compute_seconds, sum(charged_seconds) as charged_seconds, sum(compute_seconds) * 100.0 / nullif(sum(charged_seconds), 0) as utility_percentage, 5 sum(charged_seconds)*0.360/3600 as cost 6from sys_serverless_usage where CAST(start_time AS TIMESTAMP) \u0026gt;= CURRENT_DATE - INTERVAL \u0026#39;7 days\u0026#39; 7GROUP BY 1; If you want to know which queries are charged in a specific period, use SYS_QUERY_HISTORY to view the details of user queries. You can join the two views to see the relationship, as in the example below:\n1with query_info as ( 2 select 3 start_time, 4 end_time, 5 (\u0026#39;[ user_id: \u0026#39; || user_id || \u0026#39; query_id: \u0026#39; || query_id || \u0026#39; transaction_id: \u0026#39; || transaction_id || \u0026#39; session_id: \u0026#39; || session_id ||\u0026#39; - queries: \u0026#39; || SUBSTRING(btrim(query_text), 1, 100) || \u0026#39; ]\u0026#39;) as per_query_info 6 from sys_query_history where start_time ilike \u0026#39;%2024-02-24%\u0026#39; 7 order by start_time 8) 9select 10 syu.start_time, 11 syu.end_time, 12 compute_capacity, 13 charged_seconds, 14 listagg(per_query_info, \u0026#39;,\u0026#39;) as queries_within 15from sys_serverless_usage syu 16inner join query_info sqh 17on sqh.end_time \u0026lt;= syu.end_time 18and sqh.end_time \u0026gt;= syu.start_time 19group by 1,2,3,4; By analyzing the queries, you can evaluate the performance of Redshift serverless and identify the most expensive queries for optimization.\nUse Cases: Where Serverless Shines Redshift Serverless shines in 
specific scenarios:\nThe intensive queries in a short period (like massive BI queries in few hours). Ad-hoc analytics: Need to run quick queries on your data without spinning up a cluster? Serverless is perfect. Dev/test environments: Test your data pipeline and queries without managing infrastructure. Unpredictable workloads: For workloads with variable demand, serverless scales automatically, saving you from overprovisioning costs. When to Say No: Serverless Isn't for Everyone While tempting, serverless isn't always the answer. Consider these situations:\nLong-running queries: Serverless charges per second, making it less cost-effective for long-running queries compared to provisioned clusters. For example, streaming ingestion from Kinesis data stream or Kafka. Cost-sensitive workloads: If strict budget control is crucial, the base capacity charge and potential idle costs might outweigh the benefits. Conclusion: Choose Wisely Redshift Serverless offers a powerful, flexible option for specific data warehouse needs. However, understanding its cost structure and ideal use cases is crucial to avoid surprises. Carefully evaluate your workload characteristics and budget constraints before diving in. Remember, the magic of serverless lies in using it wisely!\nBonus Tip: Explore hybrid approaches, combining serverless for ad-hoc queries with provisioned clusters for predictable workloads via data sharing feature.\nI hope this blog post helps you navigate the world of Redshift Serverless! Do you have any questions or experiences to share? Let's discuss in the comments!\n","link":"https://kane.mx/posts/2024/redshift-serverless-cost-deep-dive/","section":"posts","tags":["AWS","Amazon Redshift","Serverless Computing"],"title":"Redshift Serverless: Cost Deep Dive and Use Cases"},{"body":"AWS CDK accelerates cloud development using common programming languages to model your applications. 
I had a series of posts using CDK to demonstrate Building serverless web applications with AWS Serverless. Because CDK uses a programming language to model your application, you can encapsulate your library via Constructs, and then reuse it across the entire application.\nMeanwhile, you can create your own constructs to encapsulate the compliance requirements to simplify the code. For example, in our solution, I used the construct SolutionNodejsFunction to enforce the same Node.js version (18.x), architecture (ARM64), Lambda logging configuration (JSON log), environment variables for Powertools Logger and so on across all NodejsFunction instances. In addition, Aspects and escape hatches make sure the application meets the compliance requirements.\nLet's deep dive into how to make all Nodejs Lambda functions compliant with the above requirements.\nFirstly, define the SolutionNodejsFunction to provide a generic configuration for the solution's Nodejs Lambda functions,\n1export class SolutionNodejsFunction extends NodejsFunction { 2 3 constructor(scope: Construct, id: string, props?: NodejsFunctionProps) { 4 super(scope, id, { 5 ...props, 6 bundling: props?.bundling ? { 7 ...props.bundling, 8 externalModules: props.bundling.externalModules?.filter(p =\u0026gt; p === \u0026#39;@aws-sdk/*\u0026#39;) ?? [], 9 } : { 10 externalModules: [], 11 }, 12 runtime: Runtime.NODEJS_18_X, 13 architecture: Architecture.ARM_64, 14 environment: { 15 ...POWERTOOLS_ENVS, 16 ...(props?.environment ?? {}), 17 }, 18 logRetention: props?.logRetention ?? RetentionDays.ONE_MONTH, 19 logFormat: \u0026#39;JSON\u0026#39;, 20 applicationLogLevel: props?.applicationLogLevel ?? 
\u0026#39;INFO\u0026#39;, 21 }); 22 } 23} Then, add an Aspect to the application to make sure every NodejsFunction is an instance of SolutionNodejsFunction.\n1class NodejsFunctionSanityAspect implements IAspect { 2 3 public visit(node: IConstruct): void { 4 if (node instanceof NodejsFunction) { 5 if (!(node instanceof SolutionNodejsFunction)) { 6 Annotations.of(node).addError(\u0026#39;Directly using NodejsFunction is not allowed in the solution. Use SolutionNodejsFunction instead.\u0026#39;); 7 } 8 if (node.runtime != Runtime.NODEJS_18_X) { 9 Annotations.of(node).addError(\u0026#39;You must use Nodejs 18.x runtime for Lambda with javascript in this solution.\u0026#39;); 10 } 11 } 12 } 13} 14Aspects.of(app).add(new NodejsFunctionSanityAspect()); The above code snippets help us achieve compliance for the Nodejs Lambda functions without modifying tens or hundreds of occurrences one by one.\nHowever, due to service availability, the ARM64 architecture and the JSON log format for Lambda functions are not available in the AWS China partition. 
So another Aspect uses escape hatches to override those attributes with conditional values.\n1class CNLambdaFunctionAspect implements IAspect { 2 3 private conditionCache: { [key: string]: CfnCondition } = {}; 4 5 public visit(node: IConstruct): void { 6 if (node instanceof Function) { 7 const func = node.node.defaultChild as CfnFunction; 8 if (func.loggingConfig) { 9 func.addPropertyOverride(\u0026#39;LoggingConfig\u0026#39;, 10 Fn.conditionIf(this.awsChinaCondition(Stack.of(node)).logicalId, 11 Fn.ref(\u0026#39;AWS::NoValue\u0026#39;), { 12 LogFormat: (func.loggingConfig as CfnFunction.LoggingConfigProperty).logFormat, 13 ApplicationLogLevel: (func.loggingConfig as CfnFunction.LoggingConfigProperty).applicationLogLevel, 14 LogGroup: (func.loggingConfig as CfnFunction.LoggingConfigProperty).logGroup, 15 SystemLogLevel: (func.loggingConfig as CfnFunction.LoggingConfigProperty).systemLogLevel, 16 })); 17 } 18 if (func.architectures \u0026amp;\u0026amp; func.architectures[0] == Architecture.arm64) { 19 func.addPropertyOverride(\u0026#39;Architectures\u0026#39;, 20 Fn.conditionIf(this.awsChinaCondition(Stack.of(node)).logicalId, 21 Fn.ref(\u0026#39;AWS::NoValue\u0026#39;), func.architectures)); 22 } 23 } 24 } 25 26 private awsChinaCondition(stack: Stack): CfnCondition { 27 const conditionName = \u0026#39;AWSCNCondition\u0026#39;; 28 // Check if the resource already exists 29 const existingResource = this.conditionCache[stack.artifactId]; 30 31 if (existingResource) { 32 return existingResource; 33 } else { 34 const awsCNCondition = new CfnCondition(stack, conditionName, { 35 expression: Fn.conditionEquals(\u0026#39;aws-cn\u0026#39;, stack.partition), 36 }); 37 this.conditionCache[stack.artifactId] = awsCNCondition; 38 return awsCNCondition; 39 } 40 } 41} 42Aspects.of(app).add(new CNLambdaFunctionAspect()); Alright, using the above two aspects forces the solution to meet the compliance requirements of Lambda functions with the same runtime version, architecture, 
and logger configuration. \u0026#x1f929; \u0026#x1f604; \u0026#x1f929;\n","link":"https://kane.mx/posts/2024/custom-compliance-for-aws-cdk/","section":"posts","tags":["AWS","AWS CDK","AWS Lambda","Tips"],"title":"Custom compliance implementation in AWS CDK"},{"body":"","link":"https://kane.mx/tags/tips/","section":"tags","tags":null,"title":"Tips"},{"body":"","link":"https://kane.mx/tags/amazon-codewhisperer/","section":"tags","tags":null,"title":"Amazon CodeWhisperer"},{"body":" Disclaimer: the cover image was generated by Amazon Bedrock's Titan Image Generator G1.\nAWS CLI is a swiss knife for orchestrating the operations of AWS resources. Especially, the filter option could help your filter and transform the output then combine with other Linux commands together.\nThis post collects the CLI usages to resolve my AWS operation needs.\nDelete the legacy versions of a service catalog product AWS Service Catalog has default 100 versions per product. Below is a one line command to delete the legacy versions.\n1export PRODUCT_ID=\u0026lt;product-id\u0026gt; 2 3# query the version name starting with \u0026#39;v5.0.0\u0026#39; then show Id and Name only 4aws servicecatalog describe-product --no-paginate --id $PRODUCT_ID --query \u0026#39;ProvisioningArtifacts[?starts_with(Name, `v5.0.0`)].{Id:Id, Name:Name}\u0026#39; 5 6# query the version name contains \u0026#39;v5.0.0-beta\u0026#39; then delete them 7aws servicecatalog describe-product --no-paginate --id $PRODUCT_ID --query \u0026#39;ProvisioningArtifacts[?contains(Name, `v5.0.0-beta`)].Id\u0026#39; |jq -r \u0026#39;.[]\u0026#39; | xargs -I {} aws servicecatalog delete-provisioning-artifact --product-id $PRODUCT_ID --provisioning-artifact-id {} Public all S3 objects with specific prefix 1aws s3 ls s3://$name/$prefix --recursive | awk \u0026#39;{print $4}\u0026#39; | xargs -I {} -n 1 aws s3api put-object-acl --acl public-read --bucket $name --key {} Reset resource policy of CloudWatch logs You might encounter a 
CloudFormation stack deployment failure when creating a CloudWatch log group, with an error message like the one below,\nCannot enable logging. Policy document length breaking Cloudwatch Logs Constraints, either \u0026lt; 1 or \u0026gt; 5120 (Service: AmazonApiGatewayV2; Status Code: 400; Error Code: BadRequestException; Request ID: xxx-yyy-zzz; Proxy: null)\nCloudWatch Logs resource policies are limited to 5120 characters. The remediation is to merge or remove unneeded policies, then update the resource policies of CloudWatch Logs to reduce the number of policies.\nBelow is a sample command to reset the resource policy of CloudWatch Logs:\nPush Helm chart to all regional ECR repositories Import a local SSH key to all AWS regions Query latest Amazon Linux 2 AMI Delete multiple CloudWatch Log groups Launch an EC2 within default VPC with default security group Add the script below to your .zshrc, then run ec2-launch-amazon-linux in a terminal to launch a new instance.\nAmazon CodeWhisperer for command line is a new set of capabilities and integrations for the AI-powered productivity tool, Amazon CodeWhisperer, that makes software developers more productive in the command line. It can also help you generate CLI commands from your natural language input.\n","link":"https://kane.mx/posts/2024/awscli-collection/","section":"posts","tags":["AWS","AWS CLI","Amazon CodeWhisperer","Tips","Collections"],"title":"Awesome AWS CLI"},{"body":"","link":"https://kane.mx/tags/collections/","section":"tags","tags":null,"title":"Collections"},{"body":" Disclaimer: the cover image was generated by StableDiffusionXL with prompts 'cover image, spring boot, flask framework running in aws lambda'.\nWhen deploying and operating a web application on the cloud, you prefer to use your favorite programming language and web framework. 
Also, you want to benefit from serverless technologies for stability, scalability, cost optimization, and operational excellence.\nAWS Lambda Web Adapter is a tool that meets these expectations. It lifts and shifts a web application built with your preferred language and web framework, including FastAPI, Flask, Django, Express.js, Next.js, Spring Boot, Nginx, PHP, Rust, Golang Gin, Laravel, ASP.NET, and so on! You don't have to change any code to migrate your application to the Lambda runtime. It also supports WebSocket and streaming features that work well with your LLM applications.\nAnother use case is orchestrating your cloud infrastructure to support different network topologies without changing any code. Suppose you are an ISV and your customers want to deploy your service both as a public service and as a private service. With Lambda Web Adapter, you can share the same source code for the service and only change how the AWS services are orchestrated to meet each requirement.\nPublic service pattern: CloudFront + S3 + API Gateway + Lambda You can use CloudFront to publish your entire web service. S3 hosts all the static content of your site, and API Gateway with a Lambda integration serves as the backend API.\nThe pattern architecture for hosting a web application as a public service Private service pattern: Application Load Balancer (ALB) + Lambda With Lambda Web Adapter, you can deploy your web application inside an Amazon VPC without exposing it to the internet. In this pattern, we choose ALB as the gateway for network traffic, then forward requests to two Lambda functions running the web frontend and backend respectively.\nThe pattern architecture for hosting a web application as a private service without public access The Clickstream Analytics on AWS solution applies the above patterns to deploy its web console for different network topologies without changing the code of the web application. 
Also, the solution implements the above pattern as CDK constructs for replication using,\nCloudFront + S3 + API Gateway + Lambda ALB + Lambda Learns While implementing the above patterns in the Clickstream solution, we learned the below tips for applying them on AWS.\nFor CloudFront + S3 + API Gateway + Lambda Put API Gateway behind CloudFront for the same origin Use CloudFront function to rewrite requests for React Browser Router Can not enable access log of CloudFront in the same region when deploying to opt-in regions For ALB + Lambda The payload size for Lambda behind ALB is 1MB Split the bundled JS into multiple chunks Handle with the authentication and authorization via your Web framework I presented this topic in AWS User Group Taiwan CDK Squad Meetup in Chinese. Below are the slides in the community sharing,\n","link":"https://kane.mx/posts/2023/build-serverless-web-application-with-aws-lambda-web-adapter/","section":"posts","tags":["AWS","AWS CDK","AWS Lambda","Lambda Web Adapter","Serverless Pattern","Serverless","CDK Construct"],"title":"Build serverless web application with AWS Lambda web adapter"},{"body":"","link":"https://kane.mx/tags/cdk-construct/","section":"tags","tags":null,"title":"CDK Construct"},{"body":"","link":"https://kane.mx/tags/lambda-web-adapter/","section":"tags","tags":null,"title":"Lambda Web Adapter"},{"body":"","link":"https://kane.mx/tags/serverless-pattern/","section":"tags","tags":null,"title":"Serverless Pattern"},{"body":"","link":"https://kane.mx/tags/aws-js-sdk/","section":"tags","tags":null,"title":"AWS JS SDK"},{"body":"When programming with the AWS SDK, developers sometimes want to debug a specific HTTP request when invoking an SDK API. 
Due to the poor documentation of AWS JS SDK v3, it takes a lot of work to find a way to print the verbose logging of AWS SDK by asking it to the LLMs.\nBelow is a practical tip for enabling verbose logging for AWS JS SDK v3.\nSolution 1 - specify a custom logger for AWS SDK clients 1import { DescribeParametersCommand, SSMClient } from \u0026#34;@aws-sdk/client-ssm\u0026#34;; 2import * as log4js from \u0026#34;log4js\u0026#34;; 3 4log4js.configure({ 5 appenders: { out: { type: \u0026#34;stdout\u0026#34; } }, 6 categories: { default: { appenders: [\u0026#34;out\u0026#34;], level: \u0026#34;debug\u0026#34; } }, 7}); 8 9const logger = log4js.getLogger(); 10 11const ssmClient = new SSMClient({ 12 logger: logger, 13}); Solution 2 - use middleware to hook the life cyele of request 1import { DescribeParametersCommand, SSMClient } from \u0026#34;@aws-sdk/client-ssm\u0026#34;; 2 3const logRequestMiddleware = (next: any, _context: any) =\u0026gt; async (args: any) =\u0026gt; { 4 console.log(\u0026#39;Request:\u0026#39;, args.request); 5 return next(args); 6}; 7 8const ssmClient = new SSMClient({ 9}); 10 11ssmClient.middlewareStack.add(logRequestMiddleware, { step: \u0026#39;finalizeRequest\u0026#39; }); See complete working example gist below,\n","link":"https://kane.mx/posts/2023/aws-js-sdk-v3-verbose-logging/","section":"posts","tags":["AWS JS SDK","Tip","AWS"],"title":"Verbose logging for AWS JS SDK v3"},{"body":"","link":"https://kane.mx/tags/amazon-api-gateway/","section":"tags","tags":null,"title":"Amazon API Gateway"},{"body":"","link":"https://kane.mx/tags/amazon-sqs/","section":"tags","tags":null,"title":"Amazon SQS"},{"body":"Application Programming Interfaces(APIs) is a critical part of the web service, Werner Vogel, the CTO of AWS had a great 6 Rules for Good API Design presentation in 2021 re:Invent keynote.\nIn AWS the developers could manage and proxy the APIs via Amazon API Gateway. 
The developers can use console, CLI, API or IaC code(for example, Terraform/CloudFormation/CDK) to provisioning the API resources on AWS. However some developers might flavor with using OpenAPI specification to define the APIs. It enables multiple services/tools to understand the APIs' specification, such as Postman. Amazon API Gateway supports this use case, you can import the existing OpenAPI definition as API.\nAmazon API Gateway offers two RESTful API products, REST API and HTTP API. Both of those two APIs support importing OpenAPI definition, but they might use different OpenAPI extensions to support different features.\nAnd below example will use infrastructure as code(AWS CDK) to import the OpenAPI definition to the API Gateway APIs. While importing OpenAPI definition, the most challenge is updating the OpenAPI definition with dynamic resources information(for example, IAM role for calling downstream resources of integration) before importing the OpenAPI definition. For AWS CDK(on top of AWS CloudFormation) uses the intrinsic functions of CloudFormation(Fn::Join) to archive it.\nREST API 1 const deployOptions = { 2 stageName: \u0026#39;\u0026#39;, 3 loggingLevel: MethodLoggingLevel.ERROR, 4 dataTraceEnabled: false, 5 metricsEnabled: true, 6 tracingEnabled: false, 7 }; 8 const restOpenAPISpec = this.resolve(Mustache.render( 9 fs.readFileSync(path.join(__dirname, \u0026#39;./rest-sqs.yaml\u0026#39;), \u0026#39;utf-8\u0026#39;), 10 variables)); 11 new SpecRestApi(this, \u0026#39;rest-to-sqs\u0026#39;, { 12 apiDefinition: ApiDefinition.fromInline(restOpenAPISpec), 13 endpointExportName: \u0026#39;APIEndpoint\u0026#39;, 14 deployOptions, 15 }); HTTP API But above solution does not work with HTTP API, because the CloudFormation of HTTP API does not support intrinsic functions of CFN. \u0026#x1f625; The workaround is putting the OpenAPI definition to Amazon S3 firstly, then import it from S3 bucket via CloudFormation. 
It involves putting the OpenAPI definition with dynamic resource information to S3 bucket before importing the OpenAPI definition from S3. Here I leveage the CDK built-in custom resource to call S3 API to put the OpenAPI definition file to S3.\n22/11/09 UPDATE: The Body of AWS::ApiGatewayV2::Api only supports the json object. It works after converting the Yaml OpenAPI definition to JSON!\n1const yaml = require(\u0026#39;js-yaml\u0026#39;); 2 3... 4 5 // import openapi as http api 6 const variables = { 7 integrationRoleArn: apiRole.roleArn, 8 queueName: bufferQueue.queueName, 9 queueUrl: bufferQueue.queueUrl, 10 }; 11 const openAPISpec = this.resolve(yaml.load(Mustache.render( 12 fs.readFileSync(path.join(__dirname, \u0026#39;./http-sqs.yaml\u0026#39;), \u0026#39;utf-8\u0026#39;), variables))); 13 14 const httpApi = new CfnApi(this, \u0026#39;http-api-to-sqs\u0026#39;, { 15 body: openAPISpec, 16 failOnWarnings: false, 17 }); The example code creates both REST API and HTTP API, both of them forwards the events to Amazon SQS queue that are sent by HTTP POST requests. 
See OpenAPI definition of HTTP to SQS, OpenAPI definition of REST to SQS or complete source for further reference.\n","link":"https://kane.mx/posts/2022/import-oas-as-api-on-aws/","section":"posts","tags":["Serverless","Amazon API Gateway","OpenAPI","OAS","Amazon SQS","AWS","AWS CDK"],"title":"Define your API via OpenAPI definition on AWS"},{"body":"","link":"https://kane.mx/tags/oas/","section":"tags","tags":null,"title":"OAS"},{"body":"","link":"https://kane.mx/tags/openapi/","section":"tags","tags":null,"title":"OpenAPI"},{"body":"","link":"https://kane.mx/tags/codepipeline/","section":"tags","tags":null,"title":"CodePipeline"},{"body":"","link":"https://kane.mx/tags/continuous-deployment/","section":"tags","tags":null,"title":"Continuous Deployment"},{"body":"A DevOps pipeline is a key component of project operations; it helps you automate the steps in your software delivery process.\nAmazon itself has rich experience running DevOps for large-scale services, and it shares the lessons learned from operating Amazon's services. You can read this summary post written in Chinese.\nAWS also provides fully managed services for the software development lifecycle, including AWS CodePipeline for automating continuous delivery pipelines, AWS CodeCommit for securely hosting highly scalable private Git repositories, AWS CodeArtifact for artifact management, AWS CodeBuild for building and testing code with continuous scaling, and AWS CodeDeploy for automating code deployments to maintain application uptime.\nThe AWS Code series services can be combined into different DevOps pipelines to satisfy customers' requirements, but it takes some work to assemble the building blocks into a pipeline.\nCDK Pipeline is an abstraction that simplifies building a DevOps pipeline for a CDK application. 
It leveages the Infrastructure as Code and Construct to standarndize and customize the pipeline of CDK application.\nThe pipeline code just has few lines and looks like below,\n1 const connectArn = scope.node.tryGetContext(\u0026#39;SourceConnectionArn\u0026#39;); 2 if (!connectArn) {throw new Error(\u0026#39;Must specify the arn of source repo connection.\u0026#39;);} 3 const oidcSecret: string = scope.node.tryGetContext(\u0026#39;OIDCSerectArn\u0026#39;); 4 if (!oidcSecret) {throw new Error(\u0026#39;Must specify the context \u0026#34;OIDCSerectArn\u0026#34; for storing secret.\u0026#39;);} 5 6 const pipeline = new CodePipeline(this, \u0026#39;Pipeline\u0026#39;, { 7 synth: new ShellStep(\u0026#39;Synth\u0026#39;, { 8 input: CodePipelineSource.connection(\u0026#39;zxkane/cdk-collections\u0026#39;, \u0026#39;master\u0026#39;, { 9 connectionArn: connectArn, 10 codeBuildCloneOutput: true, 11 }), 12 installCommands: [ 13 \u0026#39;git submodule init \u0026amp;\u0026amp; git submodule update \u0026amp;\u0026amp; git submodule sync\u0026#39;, 14 \u0026#39;npm i --prefix serverlesstodo/frontend\u0026#39;, 15 \u0026#39;npm run build --prefix serverlesstodo/frontend\u0026#39;, 16 \u0026#39;yarn --cwd serverlesstodo install --check-files --frozen-lockfile\u0026#39;, 17 ], 18 commands: [ 19 \u0026#39;cd serverlesstodo\u0026#39;, 20 \u0026#39;npx projen\u0026#39;, 21 \u0026#39;npx projen test\u0026#39;, 22 `npx cdk synth serverlesstodo -c OIDCSerectArn=${oidcSecret} -c SourceConnectionArn=${connectArn} -c CognitoDomainPrefix=todolist-userpool-prod`, 23 ], 24 primaryOutputDirectory: \u0026#39;serverlesstodo/cdk.out/\u0026#39;, 25 }), 26 dockerEnabledForSynth: true, 27 codeBuildDefaults: { 28 cache: Cache.local(LocalCacheMode.SOURCE, LocalCacheMode.DOCKER_LAYER), 29 }, 30 synthCodeBuildDefaults: { 31 partialBuildSpec: BuildSpec.fromObject({ 32 version: \u0026#39;0.2\u0026#39;, 33 phases: { 34 install: { 35 \u0026#39;runtime-versions\u0026#39;: { 36 nodejs: 14, 37 }, 38 }, 39 
}, 40 }), 41 }, 42 }); 43 44 pipeline.addStage(new TodolistApplication(this, \u0026#39;Prod\u0026#39;, { 45 env: { 46 account: process.env.CDK_DEFAULT_ACCOUNT, 47 region: process.env.CDK_DEFAULT_REGION, 48 }, 49 })); Some key points in above pipeline code snippet,\nthis example code hosts on Github, so using CodeStar connection to fetch code from Github synth of CodePipeline is the configuration of CodeBuild project, it installs the dependencies of project then build, test and generate the deployment artifacts(CloudFormation template), see docs of deploying from source the CDK pipeline has built-in mutation step to update pipeline itself before deploying the application After deploying the pipeline stack, the screenshot of pipeline looks like below, Todolist app pipeline As usual, all AWS resources are orchestrated by a AWS CDK project, it's easliy to be deployed to any account and any region of AWS!\nHappying continuously deploy your application \u0026#x1f680; \u0026#x1f606;\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/devops-pipeline/","section":"posts","tags":["Serverless","AWS","AWS CDK","CodePipeline","DevOps","Continuous Deployment"],"title":"Setup DevOps pipeline with few code"},{"body":"","link":"https://kane.mx/tags/amplify/","section":"tags","tags":null,"title":"Amplify"},{"body":"When working on either 2C application or 2B service, the customers do not want to or is not allowed to sign up the new account, they can login the application via existing IdP or enterprise SSO. 
So, building the application supports the federated OIDC login to address such requirements.\nThis post extends the capability of Todolist application protected by Amazon Cognito, using Auth0 as the third party OpenID Connect provider introduces the external user pool.\nThe application also uses the AWS Amplify to build the frontend capabilities(for example, authentication, invoke backend restful api), Amazon Cognito providing both federated OIDC login and self-managed users sign in/sign up, and Amazon API Gateway providing the backend API and validating the token with OIDC provider.\nBelow is the key procedures to add the federated OIDC login to the existing web application protected by Cognito,\n1. Update the authorizer of API Gateway to validate the token issued by OIDC providers.\nThe previous authorizer is using API Gateway Cognito authorizer, it only can validate the token issued by Cognito user pool. Cognito user pool also complies with the OIDC standard, using Lambda authorizer can implement to validate the tokens issued by either Cognito user pool and third party OIDC provider.\nThe CDK code creates a lambda function as Lambda Authorizer of API Gateway, which sets the supported OIDC issuers as environment,\n1 const authFunc = new NodejsFunction(this, `${resourceName}AuthFunc`, { 2 entry: path.join(__dirname, \u0026#39;./lambda.d/authorizer/index.ts\u0026#39;), 3 handler: \u0026#39;handler\u0026#39;, 4 architecture: Architecture.ARM_64, 5 timeout: Duration.seconds(5), 6 memorySize: 128, 7 runtime: Runtime.NODEJS_16_X, 8 tracing: Tracing.ACTIVE, 9 environment: { 10 ISSUERS: issuers, 11 RESOURCE_PREFIX: Arn.format({ 12 service: \u0026#39;execute-api\u0026#39;, 13 resource: api.restApiId, 14 }, Stack.of(this)), 15 }, 16 }); The custom lambda authorizer uses the Auth0's jwt-decode and AWS JWT Verify to verify the ID tokens issued by OIDC provider. See source for detail implementation.\n2. Add the third party OIDC provider to Cognito user pool. 
It involves the client information with secrets generated by OIDC provider, we use the AWS Secrets Manager to securely store the credentials.\nAs prerequisites of this step, you must create an application in your OIDC provider. For example, creating an application in Auth0, then configure the allowed callback URLs to the pool domain. The next saving the issuer domain, client id, client secret and name(will be readable string in UI) to a secret in Secrets Manager. Todolist app in Auth0 The code snippet of CDK creates the external OIDC provider looks like below,\n1 const oidcSecretArn = this.node.tryGetContext(\u0026#39;OIDCSerectArn\u0026#39;); 2 var oidcProvider: UserPoolIdentityProviderOidc | undefined; 3 if (oidcSecretArn) { 4 const secret = Secret.fromSecretAttributes(this, \u0026#39;OIDCSecret\u0026#39;, { 5 secretCompleteArn: oidcSecretArn, 6 }); 7 oidcProvider = new UserPoolIdentityProviderOidc(this, \u0026#39;FedarationOIDC\u0026#39;, { 8 clientId: secret.secretValueFromJson(\u0026#39;clientId\u0026#39;).toString(), 9 clientSecret: secret.secretValueFromJson(\u0026#39;clientSecret\u0026#39;).toString(), 10 issuerUrl: secret.secretValueFromJson(\u0026#39;issuerUrl\u0026#39;).toString(), 11 name: secret.secretValueFromJson(\u0026#39;name\u0026#39;).toString(), 12 userPool: userpool, 13 scopes: [ 14 \u0026#39;profile\u0026#39;, 15 \u0026#39;openid\u0026#39;, 16 \u0026#39;email\u0026#39;, 17 ], 18 }); 19 userpool.registerIdentityProvider(oidcProvider); 20 } 3. 
Update the amplify configuration file with OIDC provider information.\n1 const amplifyConfFile = \u0026#39;aws-exports.json\u0026#39;; 2 const body = 3`{ 4 \u0026#34;aws_project_region\u0026#34;: \u0026#34;${Aws.REGION}\u0026#34;, 5 \u0026#34;Auth\u0026#34;: { 6 \u0026#34;region\u0026#34;: \u0026#34;${Aws.REGION}\u0026#34;, 7 \u0026#34;userPoolId\u0026#34;: \u0026#34;${poolInfo.userpool.userPoolId}\u0026#34;, 8 \u0026#34;userPoolWebClientId\u0026#34;: \u0026#34;${poolInfo.client.userPoolClientId}\u0026#34;, 9 \u0026#34;authenticationFlowType\u0026#34;: \u0026#34;USER_SRP_AUTH\u0026#34;, 10 \u0026#34;oauth\u0026#34;: { 11 \u0026#34;name\u0026#34;: \u0026#34;${poolInfo.oidc.name}\u0026#34;, 12 \u0026#34;domain\u0026#34;: \u0026#34;${poolInfo.poolDomain.domainName}.auth.${Aws.REGION}.amazoncognito.com\u0026#34;, 13 \u0026#34;scope\u0026#34;: [\u0026#34;email\u0026#34;, \u0026#34;openid\u0026#34;, \u0026#34;aws.cognito.signin.user.admin\u0026#34;, \u0026#34;profile\u0026#34;], 14 \u0026#34;redirectSignIn\u0026#34;: \u0026#34;${poolInfo.oidc.signinUrl}\u0026#34;, 15 \u0026#34;redirectSignOut\u0026#34;: \u0026#34;${poolInfo.oidc.signinUrl}\u0026#34;, 16 \u0026#34;responseType\u0026#34;: \u0026#34;code\u0026#34; 17 } 18 }, 19 \u0026#34;API\u0026#34;: { 20 \u0026#34;endpoints\u0026#34;: [ 21 { 22 \u0026#34;name\u0026#34;: \u0026#34;backend-api\u0026#34;, 23 \u0026#34;endpoint\u0026#34;: \u0026#34;https://${cloudFrontS3.cloudFrontWebDistribution.distributionDomainName}/prod/\u0026#34; 24 } 25 ] 26 } 27}`; 4. 
Customize the Amplify's React Authenticator component to add the federated OIDC login entrance.\n1 SignIn: { 2 Footer() { 3 const { toResetPassword } = useAuthenticator(); 4 5 return ( 6 \u0026lt;View textAlign=\u0026#34;center\u0026#34;\u0026gt; 7 \u0026lt;Divider orientation=\u0026#34;horizontal\u0026#34; /\u0026gt; 8 \u0026lt;Text\u0026gt; 9 { 10 !isAuthenticated \u0026amp;\u0026amp; ( 11 \u0026lt;View 12 as=\u0026#34;div\u0026#34; 13 backgroundColor=\u0026#34;var(--amplify-colors-white)\u0026#34; 14 borderRadius=\u0026#34;6px\u0026#34; 15 color=\u0026#34;var(--amplify-colors-blue-60)\u0026#34; 16 height=\u0026#34;4rem\u0026#34; 17 maxWidth=\u0026#34;100%\u0026#34; 18 padding=\u0026#34;1rem\u0026#34; 19 \u0026gt; 20 \u0026lt;Button 21 variation=\u0026#34;primary\u0026#34; 22 onClick={ 23 () =\u0026gt; { 24 Auth.federatedSignIn({ customProvider: oidcProviderName }); 25 }} 26 \u0026gt; 27 Sign In with {oidcProviderName} 28 \u0026lt;/Button\u0026gt; 29 \u0026lt;/View\u0026gt; 30 ) 31 } 32 \u0026lt;/Text\u0026gt; 33 \u0026lt;/View\u0026gt; 34 ); 35 }, 36 }, The new look of Amplify's authoricator component looks like below with both self-managed user pool and federated OIDC login, Todolist federated OIDC login As usual, all AWS resources are orchestrated by a AWS CDK project, it's easliy to be deployed to any account and any region of AWS!\nHappying logging your website with externl OIDC provider \u0026#x1f512; \u0026#x1f606;\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/federated-oidc-login-with-cognito-and-amplify/","section":"posts","tags":["Serverless","AWS","AWS CDK","API Gateway","Cognito","Amplify","OpenID Connect","Authentication"],"title":"Federated OIDC login with Cognito and Amplify"},{"body":"","link":"https://kane.mx/tags/openid-connect/","section":"tags","tags":null,"title":"OpenID 
Connect"},{"body":"","link":"https://kane.mx/tags/authorization/","section":"tags","tags":null,"title":"Authorization"},{"body":"Previous post we demonstrated how distributing and securely deploying the website to global end users. The authentication and authorization are always mandatory features of web application. Amazon Cognito is a managed AWS serverless service helping the applications to implement AuthN and AuthZ, with Cognito the applications securely scales to millions of users(up to 50,000 free users) supporting identity and access management standards, such as OAuth 2.0, SAML 2.0, and OpenID Connect.\nThe web application uses AWS Amplify to integrate with AWS services, such as Cognito and API Gateway. Below the procedures how integrating Cognito as AuthN via Amplify in Todolist project,\nadd amplify JS libraries into your project's dependencies 1{ 2 \u0026#34;name\u0026#34;: \u0026#34;todo-list\u0026#34;, 3 \u0026#34;dependencies\u0026#34;: { 4 \u0026#34;@aws-amplify/ui-react\u0026#34;: \u0026#34;^3.5.0\u0026#34;, 5 \u0026#34;aws-amplify\u0026#34;: \u0026#34;^4.3.34\u0026#34;, 6 \u0026#34;axios\u0026#34;: \u0026#34;^0.27.2\u0026#34;, 7 \u0026#34;react\u0026#34;: \u0026#34;^18.2.0\u0026#34;, 8 \u0026#34;react-dom\u0026#34;: \u0026#34;^18.2.0\u0026#34;, 9 \u0026#34;react-icons\u0026#34;: \u0026#34;^4.4.0\u0026#34;, 10 \u0026#34;sweetalert2\u0026#34;: \u0026#34;^11.4.24\u0026#34;, 11 \u0026#34;uuid\u0026#34;: \u0026#34;^8.3.2\u0026#34; 12 } 13} load the configuration file from server side and configure the Amplify categories 1 useEffect(() =\u0026gt; { 2 setLoadingConfig(true); 3 Axios.get(\u0026#34;/aws-exports.json\u0026#34;).then((res) =\u0026gt; { 4 const configData = res.data; 5 const tokenHeader = async () =\u0026gt; { return { Authorization: `Bearer ${(await Auth.currentSession()).getIdToken().getJwtToken()}` }; }; 6 configData.API.endpoints[0].custom_header = tokenHeader; 7 Amplify.configure(configData); 8 apiEndpointName = 
configData.API.endpoints[0].name; 9 setApiEndpoint(configData.API.endpoints[0].name); 10 11 Hub.listen(\u0026#39;auth\u0026#39;, ({ payload }) =\u0026gt; { 12 const { event } = payload; 13 switch (event) { 14 case \u0026#39;signIn\u0026#39;: 15 case \u0026#39;signUp\u0026#39;: 16 case \u0026#39;autoSignIn\u0026#39;: 17 getTasks(); 18 break; 19 } 20 }); 21 22 getTasks(); 23 24 setLoadingConfig(false); 25 }); 26 }, []); use [Authenticator component][authenticator] adding complete authentication flows with minimal boilerplate 1 return ( 2 \u0026lt;Authenticator components={components} loginMechanisms={[\u0026#39;email\u0026#39;]}\u0026gt; 3 {({ signOut, user }) =\u0026gt; ( 4 \u0026lt;Flex 5 direction=\u0026#34;column\u0026#34; 6 justifyContent=\u0026#34;flex-start\u0026#34; 7 alignItems=\u0026#34;center\u0026#34; 8 alignContent=\u0026#34;flex-start\u0026#34; 9 wrap=\u0026#34;nowrap\u0026#34; 10 gap=\u0026#34;1rem\u0026#34; 11 textAlign=\u0026#34;center\u0026#34; 12 \u0026gt; 13 \u0026lt;View width=\u0026#34;100%\u0026#34;\u0026gt; 14 ... 
15 \u0026lt;/View\u0026gt; 16 \u0026lt;/Flex\u0026gt; 17 )} 18 \u0026lt;/Authenticator\u0026gt; 19 ) update TODO CRUD methods to use Amplify's API catagory to make HTTP requests to API Gateway 1 const getTasks = async () =\u0026gt; { 2 const canEnter = await ionViewCanEnter(); 3 if (canEnter) { 4 try { 5 setLoadingData(true); 6 7 const initData = { 8 headers: { \u0026#34;content-type\u0026#34;: \u0026#34;application/json\u0026#34; }, // OPTIONAL 9 response: true, // OPTIONAL (return the entire Axios response object instead of only response.data) 10 }; 11 API 12 .get(apiEndpointName || apiEndpoint, \u0026#34;/todo\u0026#34;, initData) 13 .then(res =\u0026gt; { 14 setLoadingData(false); 15 const tasksData = res.data; 16 if ((typeof tasksData === \u0026#34;string\u0026#34;)) { 17 Swal.fire(\u0026#34;Ops..\u0026#34;, tasksData); 18 } else { 19 setTasks(tasksData); 20 } 21 }) 22 .catch(error =\u0026gt; { 23 setLoadingData(false); 24 console.error(error); 25 Swal.fire( 26 `${error.message}`, 27 `${error?.response?.data?.message}`, 28 undefined 29 ); 30 }); 31 } catch (error) { 32 console.info(error); 33 } 34 } 35 }; All above changes are implemented Cognito authN with the web react application.\nIn the server-side the Cognito user pool will be provisioned, the API Gateway endpoint is authorized by Cognito user pool authorizer. 
The Amplify configuration file aws-exports.json is generated on the fly with the user pool and API information when the stack is provisioned.\nAs usual, all AWS resources are orchestrated by an AWS CDK project, so it's easy to deploy to any account and any region of AWS!\nHappy protecting the website with Cognito \u0026#x1f512; \u0026#x1f606;\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/protect-website-with-cognito/","section":"posts","tags":["Serverless","AWS","AWS CDK","API Gateway","Cognito","Amplify","Authentication","Authorization"],"title":"Protect website with Cognito"},{"body":"It's a well-known pattern to distribute a website globally via a CDN: it reduces the latency of the site and improves availability and security by leveraging the cloud provider's infrastructure.\nOn AWS, the CDN service CloudFront and the storage service S3 host the static website. This fits SPA (single page application) frameworks well, for example React, Vue, and AngularJS. There are many existing projects and code snippets sharing this pattern, such as CloudFront to S3 and API Gateway and AWS S3 / React Website Pattern.\nThe TODO application reuses an existing project, Todolist, built with React. The original Todolist application is a pure client application that does not communicate with a backend service. In this demo the Todolist application is updated to communicate with the RESTful TODO APIs created by Amazon API Gateway. The RESTful backend API is also distributed through CloudFront to reduce latency, protect the origin service, and avoid cross-domain requests.\nTodolist app The demo uses the aws-cloudfront-s3 construct from AWS Solutions Constructs to simplify orchestrating the CloudFront to S3/API Gateway pattern. It uses the AWS S3 Deployment Construct Library to publish the static web pages to the S3 bucket. 
See the code snippet below for how to achieve it in CDK.\nAs usual, all AWS resources are orchestrated by an AWS CDK project, so it's easy to deploy to any account and any region of AWS!\nHappy distributing the website \u0026#x1f310; \u0026#x1f606;\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/static-website/","section":"posts","tags":["Serverless","AWS","AWS CDK","CloudFront","S3"],"title":"Distribute the website globally"},{"body":"","link":"https://kane.mx/tags/s3/","section":"tags","tags":null,"title":"S3"},{"body":"Most web applications use RESTful APIs to interact with backend services. In the TODO application, it's straightforward to get, update, and delete items in the backend database. Amazon DynamoDB is a key-value database that fits this scenario with its scalability and pay-as-you-go pricing. Amazon API Gateway also has built-in integrations with AWS services, so a RESTful API request can be transformed into a request to the DynamoDB APIs. 
Using this combination you can provide the restful APIs only provisioning AWS resources without writing the CRUD code!\nLet's assume the TODO application having below model to represent the TODO items,\n1{ 2\u0026#34;subject\u0026#34;: \u0026#34;my-memo\u0026#34;, // some subject of TODO item 3\u0026#34;description\u0026#34;: \u0026#34;the great idea\u0026#34;, // some description for the TODO item 4\u0026#34;dueDate\u0026#34;: 1661926828, // the timestamp of sceonds for the due date of TODO item 5} Then define below restful APIs for list, fetch, update and delete TODO item/items.\nCreate new TODO item 1PUT /todo Update a TODO item 1POST /todo/\u0026lt;todo id\u0026gt; Delete a TODO item 1DELETE /todo/\u0026lt;todo id\u0026gt; List TODO items 1GET /todo All magic with no code restful API of API Gateway is setting up data transformations for REST API.\nBelos is using the Apache VTL to transform the request JSON payload to DynamoDB UpdateItem API request.\nAlso using API Gateway's transformation feature of the response of integration(DynamoDB API in this case) to shape the response like below,\nThere are few best practise of using API Gateway and AWS services integration to simplify the CRUD operations,\nuse request validator to validate the request payload use integration response to handle with the error cases of integration services. 
Below is an example that checks the error type returned by the DynamoDB API and then reshapes the error message\n#if($input.path(\u0026#39;$.__type\u0026#39;) == \u0026#34;com.amazonaws.dynamodb.v20120810#ConditionalCheckFailedException\u0026#34;)\n{\n  \u0026#34;message\u0026#34;: \u0026#34;the todo id already exists.\u0026#34;\n}\n#end\nsanitize all string inputs from the client with the API Gateway built-in $util method $util.escapeJavaScript() to avoid NoSQL injection attacks\nreturn valid JSON when a string contains single quotes (') by reverting the escaped single quotes that $util.escapeJavaScript() produces\n\u0026#34;subject\u0026#34;: \u0026#34;$util.escapeJavaScript($input.path(\u0026#39;$.Attributes.subject.S\u0026#39;)).replaceAll(\\\u0026#34;\\\\\\\\\u0026#39;\\\u0026#34;,\\\u0026#34;\u0026#39;\\\u0026#34;)\u0026#34;\nAs usual, all AWS resources are orchestrated by an AWS CDK project, so it can easily be deployed to any account and any region of AWS!\nHappy 👨‍💻 API \u0026#x1f606;\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/restful-api/","section":"posts","tags":["Serverless","AWS","API Gateway","DynamoDB","AWS CDK"],"title":"Build no code restful HTTP API with API Gateway and DynamoDB"},{"body":"Building web applications is a common use case, and leveraging cloud services can help builders develop and deploy them faster. 
With AWS serverless services, the application easily gains capabilities such as security, high availability, scalability, resiliency, and cost optimization.\nThis is a series of posts demonstrating how to build a serverless TODO web application on AWS with AWS serverless services and AWS CDK. It consists of,\nRESTful HTTP APIs, using Amazon API Gateway and Amazon DynamoDB\nSecure and accelerated distribution of the static website via Amazon CloudFront and Amazon S3\nAuthentication and authorization via Amazon Cognito and AWS Amplify\nFederated OIDC authentication with Amazon Cognito\nCI/CD DevOps pipeline\nsource code written with AWS CDK to achieve the above features ","link":"https://kane.mx/posts/2022/build-serverless-app-on-aws/intro/","section":"posts","tags":["Serverless","AWS"],"title":"Build serverless web application with AWS Serverless"},{"body":"","link":"https://kane.mx/tags/cd/","section":"tags","tags":null,"title":"CD"},{"body":"","link":"https://kane.mx/tags/continuous-delivery/","section":"tags","tags":null,"title":"Continuous Delivery"},{"body":"","link":"https://kane.mx/tags/debugging/","section":"tags","tags":null,"title":"Debugging"},{"body":"","link":"https://kane.mx/tags/flux/","section":"tags","tags":null,"title":"Flux"},{"body":"After enabling E2E testing of FluxCD-powered GitOps continuous deployment, feedback on new commits is quite slow. 
You have to wait for the E2E test results, and a lot of time is spent setting up the environment and provisioning everything from scratch.\nInspired by E2E testing in Github actions, DevOps engineers can build a local debugging environment in Kind or minikube.\nBelow is a script that uses Kind to provision FluxCD and then has FluxCD reconcile the latest commits.\n","link":"https://kane.mx/posts/gitops/fluxcd-local-debug-tip/","section":"posts","tags":["Flux","GitOps","Kubernetes","Git","CD","Continuous Delivery","Debugging"],"title":"FluxCD GitOps debugging tip"},{"body":"","link":"https://kane.mx/tags/git/","section":"tags","tags":null,"title":"Git"},{"body":"","link":"https://kane.mx/series/gitops/","section":"series","tags":null,"title":"Gitops"},{"body":"","link":"https://kane.mx/tags/gitops/","section":"tags","tags":null,"title":"GitOps"},{"body":"","link":"https://kane.mx/categories/kubernetes/","section":"categories","tags":null,"title":"Kubernetes"},{"body":"","link":"https://kane.mx/tags/kubernetes/","section":"tags","tags":null,"title":"Kubernetes"},{"body":"","link":"https://kane.mx/tags/aws-secrets-manager/","section":"tags","tags":null,"title":"AWS Secrets Manager"},{"body":"","link":"https://kane.mx/tags/eks/","section":"tags","tags":null,"title":"EKS"},{"body":"","link":"https://kane.mx/tags/external-secrets-operator/","section":"tags","tags":null,"title":"External Secrets Operator"},{"body":"Background Secret management is a challenge for GitOps-style continuous delivery, especially when the target platform is Kubernetes. K8S manages the desired state through declarative configuration, yet a K8S Secret merely base64-encodes the secret content. The post GitOps in action with Flux introduced Bitnami Sealed Secrets to encrypt secret content, so that the encrypted Kubernetes manifests can be committed safely to a Git repository; Sealed Secrets discovers these SealedSecret objects, decrypts them, and dynamically creates native K8S Secret objects.\nSealedSecret solves the pain point of storing secrets safely in a Git repository, but system administrators still have to safeguard the private key that Sealed Secrets uses and plan for disaster recovery. In addition, the whole secret lifecycle is managed inside the K8S cluster, so workloads outside the cluster, such as a managed RDS database from a cloud provider, cannot use these secrets safely and efficiently.\nManaging K8S secrets with an external secrets service A technology radar report on secret management published by the CNCF in 2021 lists AWS Secrets Manager and HashiCorp Vault as mature secret management services or solutions. If these mature services or solutions can be used to manage secrets in Kubernetes, we gain both the strong security features of a secrets service and the workload orchestration of Kubernetes.\nFigure 1: CNCF End User Technology Radar, Secret Management, February 2021\nExternal Secrets Operator(ESO) To address the shortcomings above, the External Secrets Operator introduced next follows this approach.\nExternal Secrets Operator is a Kubernetes operator that integrates external secret management systems such as AWS Secrets Manager, HashiCorp Vault, Google Secrets Manager, Azure Key Vault, and more. It reads information through the external secret management service's API and automatically injects the values into a Kubernetes Secret.\nThat is the introduction of External Secrets Operator, and it may look very familiar. It closely resembles External DNS, another project under the CNCF for DNS resolution, which provides a unified experience for registering DNS records from within Kubernetes and integrates with many mature third-party DNS services.\nThe following sections show how to manage External Secrets Operator with FluxCD and how to use secrets managed by AWS Secrets Manager in EKS.\nDeploying External Secrets Operator with FluxCD External Secrets Operator can be installed with Helm, and deploying ESO with Flux is like installing any other Helm chart.\nAdd the ESO Helm repository\napiVersion: source.toolkit.fluxcd.io/v1beta1\nkind: HelmRepository\nmetadata:\n  name: external-secrets\nspec:\n  interval: 10m\n  url: https://charts.external-secrets.io\nDeploy ESO through a HelmRelease\napiVersion: helm.toolkit.fluxcd.io/v2beta1\nkind: HelmRelease\nmetadata:\n  name: external-secrets\nspec:\n  # Override Release name to avoid the pattern Namespace-Release\n  # Ref: https://fluxcd.io/docs/components/helm/api/#helm.toolkit.fluxcd.io/v2beta1.HelmRelease\n  releaseName: external-secrets\n  targetNamespace: kube-system\n  interval: 10m\n  chart:\n    spec:\n      chart: external-secrets\n      sourceRef:\n        kind: HelmRepository\n        name: external-secrets\n        namespace: kube-system\n  values:\n    installCRDs: true\n  serviceAccountName: helm-controller\n  timeout: 5m\n  test:\n    enable: true\n    ignoreFailures: true\n  install:\n    crds: CreateReplace\n    remediation:\n      retries: 3\n  upgrade:\n    crds: CreateReplace\n    remediation:\n      remediateLastFailure: false\nCreate an IRSA configuration for ESO EKS uses IRSA to unify the ServiceAccount of K8S RBAC with IAM roles, letting workloads in K8S bind an IAM role through a native ServiceAccount and access AWS APIs without explicitly specifying an AccessKey/Secret.\nESO has to read the secrets stored in AWS Secrets Manager through the AWS API, so it needs either AWS access keys or IRSA support.\nCreate an IAM policy with the AWS Secrets Manager permissions recommended by the ESO documentation (replace the region, account ID, and secret name prefix in the resource ARN)\n{\n  \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;,\n  \u0026#34;Statement\u0026#34;: [\n    {\n      \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;,\n      \u0026#34;Action\u0026#34;: [\n        \u0026#34;secretsmanager:GetResourcePolicy\u0026#34;,\n        \u0026#34;secretsmanager:GetSecretValue\u0026#34;,\n        \u0026#34;secretsmanager:DescribeSecret\u0026#34;,\n        \u0026#34;secretsmanager:ListSecretVersionIds\u0026#34;\n      ],\n      \u0026#34;Resource\u0026#34;: [\n        \u0026#34;arn:aws:secretsmanager:us-west-2:111122223333:secret:dev/*\u0026#34;\n      ]\n    }\n  ]\n}\nUse eksctl to create the role ESO needs for the EKS cluster and attach the permissions ESO requires, for example,\neksctl create iamserviceaccount --cluster=gitops-cluster --name=external-secrets \\\n--role-only --role-name=gitops-cluster-dev-external-secrets-role --region ap-southeast-1 \\\n--namespace=kube-system --attach-policy-arn=arn:aws:iam::123456789012:policy/gitops-dev-external-secrets-sm \\\n--approve\nnamespace must match the namespace where ESO is deployed\nname must match the ServiceAccount name specified when deploying the ESO chart, which defaults to external-secrets\nUse a Kustomization patch to specify the IAM role for the ServiceAccount created by the ESO chart\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - ../../base\n  - ./secrets.yaml\npatches:\n  - patch: |\n      - op: add\n        path: /spec/patches/-\n        value:\n          patch: |\n            - op: add\n              path: /spec/values/serviceAccount/annotations/eks.amazonaws.com~1role-arn\n              value: arn:aws:iam::845861764576:role/gitops-cluster-dev-external-secrets-role\n          target:\n            kind: HelmRelease\n            name: external-secrets\n    target:\n      group: kustomize.toolkit.fluxcd.io\n      version: v1beta2\n      kind: Kustomization\n      name: external-secrets\nCreate a SecretStore or ClusterSecretStore to configure access to AWS Secrets Manager\napiVersion: external-secrets.io/v1beta1\nkind: ClusterSecretStore\nmetadata:\n  name: secretstore\n  namespace: kube-system\nspec:\n  provider:\n    aws:\n      service: SecretsManager\n      region: ap-southeast-1\n      auth:\n        jwt:\n          serviceAccountRef:\n            name: external-secrets\n            namespace: kube-system\nThe configuration above uses the ServiceAccount's short-lived JWT token to access the AWS API, which avoids storing and managing AWS credentials inside the cluster.\nCreate an ExternalSecret object that fetches the secret from Secrets Manager and populates a K8S Secret object\napiVersion: external-secrets.io/v1beta1\nkind: ExternalSecret\nmetadata:\n  name: slack-url\n  namespace: kube-system\nspec:\n  refreshInterval: 1h\n  secretStoreRef:\n    name: secretstore\n    kind: ClusterSecretStore\n  target:\n    name: slack-url\n    creationPolicy: Owner\n    deletionPolicy: Delete\n  data:\n    - secretKey: address\n      remoteRef:\n        key: dev/slackurl\nThe ExternalSecret above declares a secret named slack-url in the kube-system namespace. ESO obtains AWS Secrets Manager credentials through the ClusterSecretStore named secretstore and sets the content of the AWS Secrets Manager secret dev/slackurl as the value of the address key in the K8S Secret slack-url.\nEnsuring the order in which FluxCD creates ESO resources The deployment above installs ESO through Helm and creates secrets through the ESO custom resources ClusterSecretStore and ExternalSecret. These resources are created by different Flux controllers (Kustomization or Helm), so the order in which they become available is not guaranteed. However, ESO custom resource objects (such as ClusterSecretStore) depend on a complete ESO deployment having created the custom resource definitions. Nested Flux Kustomization objects are used here to manage the dependencies between objects. A sample implementation follows,\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: secrets\n  namespace: flux-system\nspec:\n  interval: 10m0s\n  path: ./infrastructure/overlays/development/secrets\n  prune: true\n  dependsOn:\n    - name: sealed-secrets\n    - name: external-secrets\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\nThe ESO Examples documentation also explains this FluxCD issue in detail and demonstrates a solution.\nSummary This post introduced how External Secrets Operator brings mature, proven secret management services (such as AWS Secrets Manager) into the Kubernetes-native ecosystem. Users keep the best practices and experience they have with these services, while workloads orchestrated by K8S still access secrets the cloud native way without changes. The solution reconciles secure, mature secret management with the need of in-cluster programs to access secrets.\nIt then briefly showed how to manage an ESO deployment on EKS following best practices, and how to manage both the ESO deployment and external secrets the FluxCD GitOps way. The complete sample code is available in this repository.\nIf you need to access AWS Secrets Manager secrets as files, you can use AWS's open-source AWS Secrets Manager and Config Provider for Secret Store CSI Driver, which mounts Secrets Manager/Parameter Store into containers through a CSI driver and provides file system access.\n","link":"https://kane.mx/posts/gitops/manage-k8s-secrets-in-external-secrets-manager/","section":"posts","tags":["External 
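Secrets Operator","AWS Secrets Manager","Flux","GitOps","Kubernetes","Git","EKS","CD","Continuous Delivery"],"title":"Managing Kubernetes secrets with an external Secrets Manager"},{"body":"","link":"https://kane.mx/tags/crossplane/","section":"tags","tags":null,"title":"Crossplane"},{"body":"Background One direction named in the summary and outlook of the Flux deployment in action post is how to manage cloud infrastructure resources together with resources inside Kubernetes. Crossplane provides a highly extensible backend that orchestrates applications and infrastructure declaratively, regardless of where they run.\nThe AWS blog recently announced AWS Blueprints for Crossplane, which gives customers a reference implementation for applying Crossplane on Amazon EKS.\nAWS Blueprints for Crossplane AWS Blueprints for Crossplane is an open-source project on Github that provides the following reference architectures and features,\n✅ Create an Amazon EKS cluster and deploy Crossplane with Terraform\n✅ Create an Amazon EKS cluster and deploy Crossplane with eksctl\n✅ AWS Provider - Crossplane Compositions for AWS Services\n✅ Terrajet AWS Provider - Another Crossplane Compositions for AWS Services\n✅ AWS IRSA on EKS - AWS Provider Config with IRSA enabled\n✅ Sample deployment patterns for Composite Resources (XRs) using AWS Provider and Terrajet AWS Provider\n✅ Sample deployments using Crossplane Managed Resources (MRs)\nDeploying Crossplane The EKS Crossplane reference blueprints show how to deploy an EKS cluster and Crossplane with Terraform (through Amazon EKS Blueprints for Terraform) and eksctl. This post demonstrates how to deploy and manage Crossplane the GitOps way with Flux, reusing the sample repo from the Flux in action posts.\nDeploying Crossplane manually According to the Crossplane deployment documentation, deploying Crossplane on EKS takes the following three steps,\nDeploy the Crossplane chart with Helm\nBecause Crossplane relies heavily on CRDs for extensibility, once the Crossplane components are deployed, deploy and configure the required Provider through the Crossplane pkg CRD, for example AWS Provider or Terrajet AWS Provider for managing AWS\nAWS Provider or Terrajet AWS Provider is deployed asynchronously through the pkg CRD, so the corresponding Provider Config can be deployed only after the Provider CRDs become available\nDeploying Crossplane through GitOps with Flux Given the strong dependencies among the three deployment steps, the Flux deployment uses the Kustomization dependencies feature to order the creation of the three groups of resources.\n1. 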
Deploy the Crossplane Helm chart The manifest below creates the Crossplane helm release kustomization. Its healthChecks ensure that the kustomization is marked as successfully reconciled only after the Crossplane components are deployed.\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: crossplane\n  namespace: flux-system\nspec:\n  interval: 10m0s\n  path: ./infrastructure/base/crossplane/release\n  targetNamespace: crossplane-system\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\n    namespace: flux-system\n  timeout: 5m\n  healthChecks:\n    - apiVersion: apps/v1\n      kind: Deployment\n      name: crossplane\n      namespace: crossplane-system\nDeploy the Crossplane helm release through Flux's Helm support\napiVersion: helm.toolkit.fluxcd.io/v2beta1\nkind: HelmRelease\nmetadata:\n  name: crossplane\n  namespace: crossplane-system\nspec:\n  releaseName: crossplane\n  targetNamespace: crossplane-system\n  chart:\n    spec:\n      chart: crossplane\n      version: \u0026#34;1.8.0\u0026#34;\n      sourceRef:\n        kind: HelmRepository\n        name: crossplane-stable\n        namespace: crossplane-system\n  serviceAccountName: helm-controller\n  timeout: 5m\n  test:\n    enable: true\n    ignoreFailures: true\n  interval: 1h0m0s\n  install:\n    crds: CreateReplace\n    remediation:\n      retries: 3\n  upgrade:\n    crds: CreateReplace\n    remediation:\n      remediateLastFailure: false\n2. 
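Create the Crossplane AWS Provider The crossplane-provider Kustomization depends on the crossplane kustomization and checks whether the Crossplane AWS provider custom resource definition providerconfigs.aws.crossplane.io has been created.\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: crossplane-provider\n  namespace: flux-system\nspec:\n  interval: 10m0s\n  path: ./infrastructure/base/crossplane/provider\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\n    namespace: flux-system\n  dependsOn:\n    - name: crossplane\n  targetNamespace: crossplane-system\n  healthChecks:\n    - apiVersion: apiextensions.k8s.io/v1\n      kind: CustomResourceDefinition\n      name: providerconfigs.aws.crossplane.io\n  timeout: 5m\n  patches:\n    - patch: |\n        - op: replace\n          path: /metadata/annotations/eks.amazonaws.com~1role-arn\n          value: arn:aws:iam::845861764576:role/crossplane-provider-aws\n      target:\n        group: pkg.crossplane.io\n        version: v1alpha1\n        kind: ControllerConfig\n3. Create the Provider Config In the same way, create the kustomization object that deploys the ProviderConfig resource, depending on the crossplane-provider kustomization.\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: crossplane-provider-config\n  namespace: flux-system\nspec:\n  interval: 10m0s\n  path: ./infrastructure/base/crossplane/provider-config\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\n  dependsOn:\n    - name: crossplane-provider\n  timeout: 5m\nCreating AWS infrastructure with Crossplane Crossplane offers two ways to represent resources in external systems,\nA managed resource (MR) is Crossplane's representation of a resource in an external system, most commonly a cloud provider. The resource declaration below is supported by AWS Provider and creates an RDS database instance on AWS.\napiVersion: database.aws.crossplane.io/v1beta1\nkind: RDSInstance\nA Crossplane composite resource (XR) is an encapsulated Kubernetes custom resource made up of managed resources. Composite resources are designed to let users build their own platform with their own opinionated concepts and APIs, without writing a Kubernetes controller from scratch. Instead, a user-defined XR schema teaches Crossplane which managed resources it should compose (that is, create) when someone creates the user-defined XR.\nAWS Blueprints for Crossplane provides Compositions examples covering services such as VPC, S3, IAM, RDS, DynamoDB, and EKS. As introduced above, Crossplane Compositions (XRs) encapsulate and combine infrastructure patterns and do not create cloud resources directly.\nAWS Blueprints for Crossplane also provides Examples that directly use the managed resources (MR) offered by AWS Provider and the sample composite resources (XR), such as the VPC, S3, and DynamoDB AWS resources shown in the Compositions above.\nSummary and outlook Crossplane is currently an incubating project at the CNCF. To a certain extent it allows cloud infrastructure resources and in-cluster Kubernetes resources to be managed in one declarative way. Composite Resources support high-level abstractions of business needs, a concept similar to Construct Hub. Infrastructure teams can offer high-level abstractions through composite resources, reuse validated combinations that meet governance requirements, and reduce the complexity downstream teams face when managing resources.\nCrossplane itself uses K8S CRDs to create and manage Composite Resources, so users first need to be familiar with how CRDs work. XRs are essentially a declarative way to manage cloud infrastructure; likewise, AWS CloudFormation is the declarative way AWS natively provides for managing resources on AWS. Because cloud resources are functionally complex, the problems CloudFormation faces, namely complex declarative code that is hard to test and reuse, also exist with Crossplane XRs. Meanwhile, given the vast number of native service resources from AWS and other cloud vendors, creating and maintaining reusable composite resource patterns takes substantial community effort. For a considerable time to come, the coverage of cloud vendors' managed resources and the number of high-level composite resources will likely remain barriers to wide adoption of this technology.\nCompared with the reusable resources and higher-level abstractions created programmatically with AWS CDK/Pulumi, Crossplane has no advantage in development or reuse efficiency. Its biggest strength is managing cloud resources and in-cluster Kubernetes resources through one unified Kubernetes declarative approach. But the learning cost and development complexity of adopting Crossplane are high for users, so Crossplane and similar technologies belong in the keep-evaluating category and should be used in production sparingly and with caution.\n","link":"https://kane.mx/posts/gitops/crossplane-meets-gitops/","section":"posts","tags":["Crossplane","Flux","GitOps","Kubernetes","Git","EKS","CD","Continuous Delivery"],"title":"Managing Crossplane deployment and resources with Flux-based GitOps"},{"body":"AWS CDK is a great abstraction that accelerates managing cloud infrastructure as code. The journey becomes even more enjoyable when you leverage Construct Hub to use high-level constructs contributed by AWS partners and the community.\nUse Case AWS CloudFormation is one of the underlying technologies AWS CDK uses to manage cloud infrastructure. It makes it easy for IT administrators, and even business operators with limited or no development skills, to launch end-to-end solutions with a one-click user experience.\nSo the use case is to develop the cloud application efficiently with AWS CDK, then publish it as a CloudFormation template for a better user experience.\ncdk synth command CDK has a built-in capability to synthesize an application into CloudFormation templates, known as the cdk synth command. You can upload the synthesized output templates to an Amazon S3 bucket, then deploy them via AWS CloudFormation. 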
It looks quite easy to publish the CDK application as CloudFormation templates.\nWhy cdk synth does not work However, the above procedure does not work in most cases when orchestrating a large application in the cloud, because CDK applications often contain assets that must be uploaded to S3 and ECR before the application is deployed. For example, a CDK application that uses Node.js Function, Python Function, S3 Deployment, Docker Image Assets, and so on is synthesized to templates that are not directly deployable. The assets (both S3 and ECR assets) must be published first, and then the templates deployed with parameters pointing to those assets. This step is hard to complete manually, because in CDK v1 the assets are named by content hash, which is not human-readable. CDK v2 uses the modern bootstrapping template, which gives resources deterministic names and removes the parameters, but it still requires the assets to be published before the CloudFormation template is deployed.\ncdk-assets command Hence there is another experimental tool provided with the CDK project: cdk-assets. The cdk-assets command takes the output of cdk synth, publishes the application's assets to S3 and ECR, and updates the templates to refer to the assets in S3 and ECR. The utility looks like a perfect fit for my use case.\ncdk-assets drawbacks But this solution still has some drawbacks. Some AWS services require assets to come from the same region. This means Lambda code packages (residing on S3) must come from an S3 bucket in the same region, and container images (residing on ECR) must be in the same region as the SageMaker training job or inference endpoint. For applications that support multiple regions, we have to repeat the above procedure for each region and provide a separate CloudFormation link per region, like below. 
It means users cannot switch to another region via the region selector after opening one of the links.\nCloudFormation link per region\nthe solution cdk-bootstrapless-synthesizer There is another community tool, cdk-bootstrapless-synthesizer, that resolves the above pain points. It helps synthesize a single CloudFormation template entry point that can be deployed to any supported region. It also provides a pipeline example (based on AWS CodePipeline) for publishing a CDK application as a CloudFormation template with multi-region support.\n","link":"https://kane.mx/posts/2022/publish-cdk-app-via-cloudformation/","section":"posts","tags":["AWS CDK","AWS CloudFormation","AWS","Tip"],"title":"Publish your AWS CDK applications via AWS CloudFormation templates"},{"body":"The previous post used FluxCD v2, a GitOps tool under the CNCF, to manage the shared components of Kubernetes clusters across multiple accounts, and covered best practices for using Secrets, the integration of GitOps pipeline events with IM (Slack), and a CI process for the GitOps code.\nThis post focuses on using Flux's multi-tenancy best practices to build a shared service platform on a GitOps workflow, so that tenants (business/application teams) can deploy continuously in a self-service way.\nI. Setup of the GitOps-based shared service platform Kubernetes provides namespaces as a mechanism for dividing resources in one cluster into mutually isolated groups. Managing multi-tenant, multi-team applications in one cluster builds on Kubernetes' built-in mechanisms to isolate tenants, teams, or applications, including but not limited to,\nNamespaces\nResource Quotas, which cap the total resources of an application\nRBAC authorization, which restricts an application's permissions, for example it may create Ingresses, may not create secrets but may read secrets with specific names, may not create persistent volumes, and so on\nNetwork Policies\nBuilding on these Kubernetes capabilities, the GitOps-based shared service platform is set up as follows,\nThe platform team uses one Git repository to manage multiple Kubernetes clusters across networks, accounts, and cloud platforms. Through GitOps the platform team manages the following resources,\nGitOps Toolkit components, such as Flux\nShared cluster components, such as CNI, CSI Driver, Ingress Class, Service Accounts, CRD, DNS\nShared observability components, such as Log, Metrics, Trace\nBase resources for each tenant/team/application, such as Namespaces, Resource Quotas, Open Policy, Service Accounts, secrets\nEach tenant/team/application in the cluster uses a separate Git repository to isolate its continuous deployment. Assuming an application named app-a,\nAll resources of application app-a are deployed in namespace app-a\nThe total resources the application uses are limited, for example to no more than 2 vCPU and 4 GiB of memory\nThe application team manages application orchestration in a separate Git repository and owns the cadence of releasing to the different stage environments\nThe application team can deploy applications with Kustomization and Helm\nThe application team cannot create cluster-level components such as persistent volumes or CRDs\nThe application team cannot create resources such as secrets or Service Accounts, and can only use the ones the infra team created for the application in advance\nII. Flux multi-tenancy security settings For a cluster that isolates tenants by namespace, Flux provides an option to forbid cross-namespace references, for example preventing Flux Kustomizations or Helm Releases from referencing a Source defined in another namespace. In addition, enabling enforced impersonation restricts deployments of Kustomization or Helm Release resources to minimal permissions by default, which explicitly raises the security of deployments.\nFollowing these Flux multi-tenancy security best practices, the Flux Toolkit configuration below (./cluster/cluster-dev/kustomization.yaml) disables cross-namespace references and enforces impersonation to restrict the default permissions of Kustomization and Helm deployments,\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - gotk-components.yaml\n  - gotk-sync.yaml\npatches:\n  - patch: |\n      - op: add\n        path: /spec/template/spec/containers/0/args/0\n        value: --no-cross-namespace-refs=true\n    target:\n      kind: Deployment\n      name: \u0026gt;-\n        (kustomize-controller|helm-controller|notification-controller|image-reflector-controller|image-automation-controller)\n  - patch: |\n      - op: add\n        path: /spec/template/spec/containers/0/args/0\n        value: --default-service-account=default\n    target:\n      kind: Deployment\n      name: (kustomize-controller|helm-controller)\n  - patch: |\n      - op: add\n        path: /spec/serviceAccountName\n        value: kustomize-controller\n    target:\n      kind: Kustomization\n      name: flux-system\n  - patch: |\n      - op: add\n        path: /spec/serviceAccountName\n        value: helm-controller\n    target:\n      kind: HelmRelease\n      name: flux-system\nAlso explicitly specify deployment permissions for the shared Kustomization/Helm components managed by the infra team, for example in the entry point of the DEV environment's infrastructure configuration, ./clusters/cluster-dev/infrastructure.yaml\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: infrastructure\n  namespace: flux-system\nspec:\n  interval: 10m0s\n  serviceAccountName: kustomize-controller\n  path: ./infrastructure/overlays/product\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\nOr use Kustomize patches to specify deployment permissions for all Kustomization/Helm Flux custom resources, for example in the entry point of the DEV environment's overlay, ./infrastructure/overlays/development/kustomization.yaml,\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - ../../base\n  - ./secrets.yaml\npatches:\n  - path: ./aws-load-balancer-controller-patch.yaml\n  - path: ./aws-load-balancer-serviceaccount-patch.yaml\n  - path: ./dns-patch.yaml\n  - path: ./dns-sa-patch.yaml\n  - path: ./slack-patch.yaml\n  - patch: |\n      - op: add\n        path: /spec/serviceAccountName\n        value: kustomize-controller\n    target:\n      kind: Kustomization\n      namespace: (flux-system|kube-system|mariadb)\n  - patch: |\n      - op: add\n        path: /spec/serviceAccountName\n        value: helm-controller\n    target:\n      kind: HelmRelease\n      namespace: (flux-system|kube-system|mariadb)\nBest practice In the Kustomization configuration managed by the platform team, enforce the Service Account used at deployment time for deployment objects such as Kustomization and HelmRelease in the application teams' Git repositories.\nIII. Managing tenants' cluster resources Based on the management requirements assumed earlier, the infrastructure Git repository has the following directory structure dedicated to multiple tenants/teams/applications, with the common tenant configuration shared under apps, such as namespaces, RBAC (through Service Accounts) plus Policy implementations, and so on.\napps\n|-- base\n|   |-- app-a\n|   |   |-- bitnami.yaml\n|   |   |-- kustomization.yaml\n|   |   |-- namespace.yaml\n|   |   |-- policies.yaml\n|   |   `-- rbac.yaml\n|   `-- kustomization.yaml\n`-- overlays\n    `-- development\n        |-- app-a\n        |   |-- kustomization.yaml\n        |   `-- prestashop-sealed-secrets.yaml\n        `-- kustomization.yaml\nAlso create an apps Kustomization entry point to integrate with the cluster, for example ./clusters/cluster-dev/apps.yaml with the following content,\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: apps\n  namespace: flux-system\nspec:\n  dependsOn:\n    - name: infrastructure\n  interval: 3m0s\n  serviceAccountName: kustomize-controller\n  path: ./apps/overlays/development\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: flux-system\nBest practice Kubernetes' native RBAC cannot control resource permissions at a fine granularity, such as requiring certain Labels when a resource is created. Combined with Policy as Code tools such as Gatekeeper or Kyverno, fine-grained management requirements can be met.\nThe Policy below is created for application app-a. It allows deployments from the application's self-service Git repository to create only the Secrets that Helm chart deployments require.\napiVersion: kyverno.io/v1\nkind: Policy\nmetadata:\n  name: restrict-secrets-by-type\n  namespace: app-a\n  annotations:\n    policies.kyverno.io/title: Restrict Secrets by Name\n    policies.kyverno.io/category: security\n    policies.kyverno.io/subject: Secret\n    policies.kyverno.io/description: \u0026gt;-\n      Disallow creating/deleting secrets in namespace \u0026#39;app-a\u0026#39; beside the helm\n      storage.\nspec:\n  background: false\n  validationFailureAction: enforce\n  rules:\n    - name: safe-secrets-for-helm-storage\n      match:\n        resources:\n          kinds:\n            - Secret\n      preconditions:\n        all:\n          - key: \u0026#39;{{request.operation}}\u0026#39;\n            operator: In\n            value:\n              - CREATE\n              - UPDATE\n              - DELETE\n          - key: \u0026#39;{{serviceAccountName}}\u0026#39;\n            operator: Equals\n            value: app-a-reconciler\n      validate:\n        message: Only Secrets are created by Helm v3+\n        pattern:\n          type: helm.sh/release.v1\nIV. Tenant-isolated, self-service continuous deployment of applications The previous step configured a dedicated namespace, deployment permissions, policies, and so on for tenant/application app-a. A separate GitOps repository was also created for app-a, through which the application team can release its application to the clusters of different stages on its own. In the sample repository, the application team uses Kustomize to manage deployments to the different stage environments and deploys the e-commerce application Prestashop with Helm. The team's deployments can use External DNS, the Ingress Class, and the Secrets in the application's namespace, all managed centrally by the infrastructure team.\nFinally, the platform team registers app-a's separate repository as a new GitOps source and associates its deployment with the cluster through the configuration below,\napiVersion: source.toolkit.fluxcd.io/v1beta1\nkind: GitRepository\nmetadata:\n  name: app-a-tenant\nspec:\n  interval: 1m\n  url: https://github.com/zxkane/eks-gitops-app-a.git\n  ref:\n    branch: main\n---\napiVersion: kustomize.toolkit.fluxcd.io/v1beta2\nkind: Kustomization\nmetadata:\n  name: app-a-tenant\nspec:\n  serviceAccountName: app-a-reconciler\n  interval: 5m0s\n  retryInterval: 5m0s\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: app-a-tenant\n    namespace: app-a\n  patches:\n    - patch: |-\n        - op: replace\n          path: /spec/serviceAccountName\n          value: app-a-reconciler\n        - op: replace\n          path: /metadata/namespace\n          value: app-a\n      target:\n        group: helm.toolkit.fluxcd.io\n        version: v2beta1\n        kind: HelmRelease\n    - patch: |-\n        - op: replace\n          path: /spec/serviceAccountName\n          value: app-a-reconciler\n        - op: replace\n          path: /metadata/namespace\n          value: app-a\n      target:\n        group: kustomize.toolkit.fluxcd.io\n        version: v1beta2\n        kind: Kustomization\n    - patch: |-\n        - op: replace\n          path: /namespace\n          value: app-a\n      target:\n        group: kustomize.config.k8s.io\n        version: v1beta1\n        kind: Kustomization\nThe app-a team then continuously releases its application in a self-service way through its separate application GitOps repository. In the example below, app-a deploys a web application through a HelmRelease in its self-service Git repository.\napiVersion: helm.toolkit.fluxcd.io/v2beta1\nkind: HelmRelease\nmetadata:\n  name: prestashop\nspec:\n  releaseName: prestashop\n  chart:\n    spec:\n      chart: prestashop\n      sourceRef:\n        kind: HelmRepository\n        name: bitnami\n        namespace: app-a\n      version: 14.0.10\n  values:\n    existingSecret: prestashop\n    service:\n      type: ClusterIP\n    ingress:\n      enabled: true\n      path: \u0026#39;/*\u0026#39;\n      annotations:\n        alb.ingress.kubernetes.io/scheme: internet-facing\n        alb.ingress.kubernetes.io/inbound-cidrs: \u0026#39;0.0.0.0/0\u0026#39;\n        alb.ingress.kubernetes.io/auth-type: none\n        alb.ingress.kubernetes.io/target-type: ip\n        kubernetes.io/ingress.class: alb\n        alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-Ext-2018-06\n        alb.ingress.kubernetes.io/listen-ports: \u0026#39;[{\u0026#34;HTTP\u0026#34;: 80}]\u0026#39;\n        alb.ingress.kubernetes.io/backend-protocol: HTTP\n        alb.ingress.kubernetes.io/healthcheck-path: \u0026#39;/\u0026#39;\n    persistence:\n      enabled: false\n      storageClass: gp2\n    # for mariadb\n    mariadb:\n      enabled: false\n    externalDatabase:\n      host: mariadb.kube-system.svc.cluster.local\n      user: prestashop\n      database: prestashop\n      existingSecret: prestashop-db-secrets\n    allowEmptyPassword: false\n  interval: 1h0m0s\n  install:\n    remediation:\n      retries: 3\nV. Automatically releasing image updates This section uses Sock Shop, a containerized microservices demo application built with Spring Boot, Go kit, and Node.js. Just as app-a was configured in sections III and IV, the sock-shop application gets a dedicated namespace, RBAC, and a separate Git repository, set up in the infrastructure GitOps repository, to manage its releases; see commit1 and commit2 for the implementation.\n1. 
Deploy the microservices application Sock Shop Our fork of Sock Shop adds multi-cluster deployment support through Kustomization and exposes the front-end service as a LoadBalancer type, relying on the integration between Amazon EKS and Amazon Elastic Load Balancing to load-balance front-end, the entry service of the Sock Shop application.\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - ./complete-demo.yaml\npatchesStrategicMerge:\n  - delete-ns.yaml\npatches:\n  - patch: |-\n      - op: replace\n        path: /spec/type\n        value: LoadBalancer\n      - op: replace\n        path: /metadata/annotations/service.beta.kubernetes.io~1aws-load-balancer-type\n        value: external\n      - op: replace\n        path: /metadata/annotations/service.beta.kubernetes.io~1aws-load-balancer-nlb-target-type\n        value: ip\n      - op: replace\n        path: /metadata/annotations/service.beta.kubernetes.io~1aws-load-balancer-scheme\n        value: internet-facing\n    target:\n      version: v1\n      kind: Service\n      name: front-end\nWe keep improving our Sock Shop application by customizing the front-end microservice. After the latest front-end passes automated tests, its image is packaged and published through the Github Packages container registry. In the DEV environment, Kustomization overlays replace the front-end microservice with the customized, updated version.\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - ../../base\npatches:\n  - patch: |-\n      - op: replace\n        path: /metadata/annotations/external-dns.alpha.kubernetes.io~1hostname\n        value: socks-dev.test.kane.mx\n    target:\n      version: v1\n      kind: Service\n      name: front-end\nimages:\n- name: weaveworksdemos/front-end\n  newName: ghcr.io/zxkane/weaveworksdemos/front-end\n  newTag: 0.3.13-rc0\nIn agile environments with continuous integration such as DEV, updating the GitOps repository by hand or by script after each new service image is built and published is too tedious. Flux itself offers a complete and powerful capability for automatically updating image versions in a Git repository. The following implements it in our GitOps deployment repository.\nNote Automated image updates require that the image automation components were enabled when Flux was installed. If they were not, bootstrap Flux again with the --components-extra=image-reflector-controller,image-automation-controller argument to enable them.\n2. Register the image repository of the front-end microservice\napiVersion: image.toolkit.fluxcd.io/v1beta1\nkind: ImageRepository\nmetadata:\n  name: sock-shop-front-end\nspec:\n  image: ghcr.io/zxkane/weaveworksdemos/front-end\n  interval: 1m0s\n3. 
Set the image update policy The rule ^0.3.x-0 below matches image versions such as 0.3.13-rc0, 0.3.13-rc1, and 0.3.13.\napiVersion: image.toolkit.fluxcd.io/v1beta1\nkind: ImagePolicy\nmetadata:\n  name: sock-shop-front-end\nspec:\n  imageRepositoryRef:\n    name: sock-shop-front-end\n  policy:\n    semver:\n      range: \u0026#39;^0.3.x-0\u0026#39;\n4. Create the image update automation The Flux image automation configuration specifies the Git repository holding the application configuration, including the branch, path, and so on.\napiVersion: image.toolkit.fluxcd.io/v1beta1\nkind: ImageUpdateAutomation\nmetadata:\n  name: sock-shop-front-end\nspec:\n  git:\n    checkout:\n      ref:\n        branch: gitops\n    commit:\n      author:\n        email: fluxcdbot@users.noreply.github.com\n        name: fluxcdbot\n      messageTemplate: \u0026#39;{{range .Updated.Images}}{{println .}}{{end}}\u0026#39;\n    push:\n      branch: gitops\n  interval: 1m0s\n  sourceRef:\n    kind: GitRepository\n    name: sock-shop-tenant\n    namespace: sock-shop\n  update:\n    path: ./deploy/kubernetes/overlays/development\n    strategy: Setters\n5. Configure read-write credentials for the application GitOps repository Because Flux must commit the updated image version back to the application repository, the application's GitRepository configured in Flux needs credentials with read-write access. The reference steps below create and configure the Git repository credentials.\n1. Create a Sealed Secret that holds the private key with read-write access to the Git repository\n# Make sure the public key below is configured on the Git repository with read-write access, for example in the `Deploy Keys` of a Github repository\nkubectl -n sock-shop create secret generic flux-image-automation \\\n--from-file=identity=/path/gitops-image-update-id-ecdsa \\\n--from-file=identity.pub=/path/gitops-image-update-id-ecdsa.pub \\\n--from-literal=known_hosts=\u0026#34;github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=\u0026#34; \\\n--dry-run=client \\\n-o yaml \u0026gt; flux-image-automation-secrets.yaml\n\nkubeseal --format=yaml --cert=pub-sealed-secrets-dev.pem \\\n\u0026lt; flux-image-automation-secrets.yaml \u0026gt; ./apps/overlays/development/sock-shop/sealed-git-token.yaml\n2. 
通过 Kustomize 为 DEV 环境的 GitRepository 配置指定访问凭证 1apiVersion: kustomize.config.k8s.io/v1beta1 2kind: Kustomization 3namespace: sock-shop 4resources: 5 - ../../../base/sock-shop 6 - ./sealed-slack-secrets.yaml 7 - ./sealed-git-token.yaml 8 - ./registry.yaml 9 - ./policy.yaml 10 - ./image-automation.yaml 11patches: 12 - patch: |- 13 - op: replace 14 path: /spec/path 15 value: ./deploy/kubernetes/overlays/development 16 target: 17 group: kustomize.toolkit.fluxcd.io 18 version: v1beta2 19 kind: Kustomization 20 name: sock-shop-tenant 21 - patch: | 22 - op: replace 23 path: /spec/channel 24 value: gitops-flux 25 target: 26 group: notification.toolkit.fluxcd.io 27 version: v1beta1 28 kind: Provider 29 name: slack 30 - patch: | 31 - op: replace 32 path: /spec/url 33 value: git@github.com:zxkane/microservices-demo.git 34 - op: replace 35 path: /spec/secretRef 36 value: {} 37 - op: replace 38 path: /spec/secretRef/name 39 value: flux-image-automation 40 target: 41 group: source.toolkit.fluxcd.io 42 version: v1beta1 43 kind: GitRepository 44 name: sock-shop-tenant 6. 
验证镜像自动更新 更新微服务 front-end 代码且tag版本后，新的镜像版本被发布到镜像仓库。 通过前面配置的 ImageRepository 和 ImagePolicy 扫描到 front-end 镜像符合策略的新版本发布， 根据 ImageUpdateAutomation 配置的 Sock Shop 应用仓库，查找指定的镜像变量， Flux 的 image-automation-controller 自动将更新的镜像信息提交到应用仓库实现持续部署。\n图1：镜像自动更新消息通知 六、小结及展望 本文介绍了如何使用 GitOps 工具 FluxCD v2 构建企业内部在 Kubernetes 上持续交付共享服务平台， 将平台团队和应用/业务团队统一在同样的 Git 工作流程下，同时授权应用/业务团队用自服务的方式持续交付应用的敏捷部署。 此方案将安全和效率有效的结合在一起。前述的示例可在此仓库获取完整的 GitOps 代码。\n同时面对复杂的企业场景，还有一些方面还可以持续的优化，例如，\n面对关键的线上生产系统，如何安全增量的灰度发布？ Sealed Secrets 引入了额外的私钥管理需求，在云计算环境如何改善 GitOps 密钥的管理？ 如何将云平台的资源 IaC 同 Kubernetes 内资源 GitOps 协同管理？ 如何更加高效的开发 Kubernetes manifests(YAML)？ 将在后续的文章中逐个探讨这些问题。\n","link":"https://kane.mx/posts/gitops/flux-in-action-2/","section":"posts","tags":["GitOps","Kubernetes","Flux","Git","EKS","CD","Continuous Delivery"],"title":"基于 Flux 的 GitOps 实战（下）"},{"body":"在前文介绍了 GitOps 的概念，Kubernetes 上 GitOps 最佳实践以及对比了 CNCF 基金会下 云原生的 GitOps 工具（ArgoCD 和 Flux）。本篇将带你按照 Flux 的最佳实践在跨VPC跨账户的 Kubernetes 上实践 GitOps 的持续集成，轻松管理数十数百乃至更多的集群及部署在上面的应用。\n0. 必备条件 假设业务对稳定性的需求，使用3个 Kubernetes 集群分别对应 DEV, STAGING 和 PRODUCT 环境。这些集群环境根据企业的需求 可能会分布在不同的云账户和VPC网络中。读者可根据实际企业情况创建一个或多个集群。本文以 Amazon EKS 为例，EKS集群的创建请参阅其文档。 Git 仓库用于保存集群的声明式配置。Flux 支持 Git 在线服务（包括 Github, Gitlab, Bitbucket）和其他任意 Git 服务。本文将使用 Github 托管 Git 仓库为例。 安装 Flux CLI 1. 
Kubernetes 集群安装配置 Flux Github repo 为例，执行以下命令，\n1export GITHUB_TOKEN=\u0026lt;your-token\u0026gt; 2 3flux bootstrap github \\ 4 --components-extra=image-reflector-controller,image-automation-controller \\ 5 --owner=zxkane \\ 6 --repository=eks-gitops \\ 7 --path=clusters/cluster-dev \\ 8 --personal 重要 请确保 Flux CLI 执行环境可以通过 kubectl 连接到 Kubernetes 集群，且用户具备 admin 权限。\n重要 创建的 Github Personal Accesss Token 需要至少同时选中全部 repo 和 user 的权限。\n注意 如需在 DEV 环境 启用镜像自动更新功能，bootstrap Flux 时需要加上 --components-extra=image-reflector-controller,image-automation-controller 参数。\n通过类似的步骤在 STAGING 和 PRODUCT 集群安装配置 Flux 。\n1export KUBECONFIG=$HOME/.kube/config-cluster-staging 2flux bootstrap github \\ 3 --owner=zxkane \\ 4 --repository=eks-gitops \\ 5 --path=clusters/cluster-staging \\ 6 --personal 7 8export KUBECONFIG=$HOME/.kube/config-cluster-product 9flux bootstrap github \\ 10 --owner=zxkane \\ 11 --repository=eks-gitops \\ 12 --path=clusters/cluster-product \\ 13 --personal 以上步骤是手动安装及配置 Flux ，Flux 也支持同现有的 IaC 代码集成，如 eksctl, Terraform。\n最佳实践 上面示例对多环境集群的支持并没有采用多仓库/多分支的策略，而是用的使用不同路径来管理不同的集群。 这也是 Flux 推荐的策略，可以减少代码维护和合并的难度。\n1./clusters/ 2├── cluster-dev [集群名称] 3│ ├── flux-system [命名空间] 4│ ├── gotk-components.yaml [默认 Flux 配置，请勿手动修改] 5│ ├── gotk-sync.yaml [默认 Flux 配置，请勿手动修改] 6│ └── kustomize.yaml [Kustomize 配置入口文件，将通过此入口聚合了集群的全部配置] 7├── cluster-product 8│ ├── flux-system 9│ ├── gotk-components.yaml 10│ ├── gotk-sync.yaml 11│ └── kustomize.yaml 12├── cluster-staging 13│ ├── flux-system 14│ ├── gotk-components.yaml 15│ ├── gotk-sync.yaml 16│ └── kustomize.yaml 在完成初始化不同的环境集群后，将在我们的Git仓库中查看到如上目录结构。 我们可以看到 Flux 自身的配置也是通过 GitOps 的方式来管理的。\n2. 
管理集群共享的组件 在企业中通常会由 Infrastructure 团队统一管理集群的共享组件，例如，Namespace, CSI Driver, Ingress Class, Persist Volume, Service Account, Secret, DaemonSet, NetworkPolicy，CustomResource 等等 Kubernetes 对象。 接下来将演示如何在多集群中创建集群内共享组件，例如，AWS Load Balancer Controller 和 External DNS， 并且逐步将这些组件部署在不同的环境中。\nFlux 自身大量依赖了 Kustomize，通过 Flux 的 Kustomize Controller 来渲染最终的 Kubernetes 声明式配置，并集成了 Hook，ServiceAccount，超时等额外配置。\n通过如下Flux Kustomize对象声明为DEV环境声明了共享 Infrastructure 配置所在的路径（该配置文件放置在cluster/cluster-dev目录下），\n1apiVersion: kustomize.toolkit.fluxcd.io/v1beta2 2kind: Kustomization 3metadata: 4 name: infrastructure 5 namespace: flux-system 6spec: 7 interval: 10m0s 8 path: ./infrastructure/overlays/development 9 prune: true 10 sourceRef: 11 kind: GitRepository 12 name: flux-system 以 External DNS(一个 CNCF 基金会项目，为 K8S Service LoadBalancer / Ingress 对象提供 DNS 域名解析注册) 为完整示例。\n使用 Flux 的 Helm Repositories 自定义对象，注册 bitnami 的 Helm Charts 仓库。\n1apiVersion: source.toolkit.fluxcd.io/v1beta1 2kind: HelmRepository 3metadata: 4 name: bitnami 5spec: 6 interval: 30m 7 url: https://charts.bitnami.com/bitnami 按照 External DNS for Amazon Route 53 的文档为 external-dns POD 创建执行 IAM 角色， 可以通过 Route 53 API 来创建修改相应的域名解析。针对 External DNS 部署的在 K8S 集群配置如下，\n为 External DNS 创建独立的 service account，同对应的 AWS IAM Role 绑定，限制该 Pod 仅拥有必需的最小权限。 关于 EKS 上如何绑定最小 AWS 权限到 pod 上请参考IAM roles for service accounts。 1apiVersion: v1 2kind: ServiceAccount 3metadata: 4 name: external-dns 5 annotations: 6 # create IAM role via following docs, 7 # https://docs.aws.amazon.com/eks/latest/userguide/specify-service-account-role.html 8 # https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/aws.md#iam-permissions 9 # the role specified by kustomize 10 # eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/external-dns-role 11 eks.amazonaws.com/sts-regional-endpoints: true 定义 HelmRelease Flux 对象 从 Bitnami 的 Helm Charts 仓库安装 external-dns 。 1apiVersion: helm.toolkit.fluxcd.io/v2beta1 2kind: HelmRelease 3metadata: 4 name: external-dns 5spec: 
6 releaseName: external-dns 7 targetNamespace: kube-system 8 chart: 9 spec: 10 chart: external-dns 11 version: \u0026#39;\u0026gt;=6.2.1 \u0026lt;7\u0026#39; 12 sourceRef: 13 kind: HelmRepository 14 name: bitnami 15 namespace: kube-system 16 interval: 1h0m0s 17 install: 18 remediation: 19 retries: 3 20 values: 21 provider: aws 22 aws: 23 zoneType: public 24 serviceAccount: 25 create: false 26 name: external-dns 27 podSecurityContext: 28 fsGroup: 65534 29 runAsUser: 0 使用 Kustomization 将相关的配置整合。 1apiVersion: kustomize.config.k8s.io/v1beta1 2kind: Kustomization 3namespace: kube-system 4resources: 5 - serviceaccount.yaml 6 - release.yaml 用 Kustomization 整合多个组件的配置。 1apiVersion: kustomize.config.k8s.io/v1beta1 2kind: Kustomization 3resources: 4 - sources 5 - aws-load-balancer-controller 6 - dns 以上所有配置都保存在Git仓库infrastructure/base下（详见下图）作为多套环境通用的配置，按照 Kustomize 的 Overlays 布局。\n1./infrastructure/ 2|-- base 3| |-- aws-load-balancer-controller 4| | |-- kustomization.yaml 5| | |-- release.yaml 6| | `-- serviceaccount.yaml 7| |-- dns 8| | |-- kustomization.yaml 9| | |-- release.yaml 10| | `-- serviceaccount.yaml 11| |-- kustomization.yaml 12| `-- sources 13| |-- bitnami.yaml 14| |-- eks-charts.yaml 15| `-- kustomization.yaml 16`-- overlays 17 |-- development 18 | |-- aws-load-balancer-controller-patch.yaml 19 | |-- aws-load-balancer-serviceaccount-patch.yaml 20 | |-- dns-patch.yaml 21 | |-- dns-sa-patch.yaml 22 | `-- kustomization.yaml 在DEV环境对应的 overlay 下面创建如下的补丁来覆盖跟DEV环境相关的信息声明，如集群名称、域名、 为External DNS pod所创建AWS IAM Role的ARN。\n1apiVersion: helm.toolkit.fluxcd.io/v2beta1 2kind: HelmRelease 3metadata: 4 name: external-dns 5spec: 6 values: 7 txtOwnerId: gitops-cluster 8 domainFilters[0]: test.kane.mx 9 policy: sync 10--- 11apiVersion: v1 12kind: ServiceAccount 13metadata: 14 name: external-dns 15 annotations: 16 eks.amazonaws.com/role-arn: \u0026gt;- 17 arn:aws:iam::845861764576:role/gitops-cluster-external-dns-role 最佳实践 充分利用 Kustomize 的 Overlays 机制来抽象通用的配置和覆盖每个环境所对应的特殊部分。\n最佳实践 
将共享组件部署在非 Flux 命名空间(默认flux-system)，避免清理 Flux 时影响运行中的部署。\n同样在DEV环境验证External DNS组件部署成功后，将相似的配置应用到STAGING和PRODUCT环境。 通过Kustomize的Overlays分别设置STAGING和PRODUCT环境相关的配置。再将变更推送到Git仓库， Flux将会为我们部署这些声明在Git仓库中的组件！可查阅DEV, STAGING, PRODUCT这三个提交查看完整实现。\n3. 密钥的管理 最佳实践 2022/06更新：使用成熟的K8S集群外置的密钥管理服务可以很好的将成熟密钥管理最佳实践和K8S原生生态集成在一起。 详见博文使用外部Secrets Manager管理Kubernetes密钥。\nGitOps 的理念是将一切配置以声明式文本方式保存在仓库中。而对保存 Kubernetes Secrets 是个挑战， 因为 Git 仓库对所有读权限的用户公开，甚至项目的仓库是开源。Flux 通过支持 Bitnami Sealed Secrets 和 Mozilla SOPS 安全的在 Git 仓库中管理密钥。接下来将示例如何使用 Sealed Secrets 为 MariaDB 创建密码。\n首先使用 HelmRelease 部署 Bitnami Sealed Secrets。类似上面部署 External DNS，将 sealed secrets 添加到 infrastructure/base 里作为共享组件。 1apiVersion: helm.toolkit.fluxcd.io/v2beta1 2kind: HelmRelease 3metadata: 4 name: sealed-secrets 5 namespace: kube-system 6spec: 7 chart: 8 spec: 9 chart: sealed-secrets 10 sourceRef: 11 kind: HelmRepository 12 name: sealed-secrets 13 version: \u0026#34;\u0026gt;=1.15.0-0\u0026#34; 14 interval: 1h0m0s 15 releaseName: sealed-secrets-controller 16 targetNamespace: kube-system 17 install: 18 crds: Create 19 upgrade: 20 crds: CreateReplace 按照 Flux 的 Sealed Secrets 文档，安装 kubeseal。 使用 kubeseal 从集群中下载公钥。 1kubeseal --fetch-cert \\ 2--controller-name=sealed-secrets-controller \\ 3--controller-namespace=kube-system \\ 4\u0026gt; pub-sealed-secrets-dev.pem 为 Bitnami MariaDB 生成密钥。 1kubectl -n kube-system create secret generic prestashop-mariadb \\ 2--from-literal=mariadb-root-password=\u0026lt;put the ariadb root password here\u0026gt; \\ 3--from-literal=mariadb-replication-password=\u0026lt;put the replication password here\u0026gt; \\ 4--from-literal=mariadb-password=\u0026lt;put the mariadb password here\u0026gt; \\ 5--dry-run=client \\ 6-o yaml \u0026gt; /tmp/mariadb-secrets.yaml 从 K8S 内置的 Opaque Secrets 格式文件生成 sealed secret。 1kubeseal --format=yaml --cert=pub-sealed-secrets-dev.pem \\ 2\u0026lt; /tmp/mariadb-secrets.yaml \u0026gt; infrastructure/overlays/development/prestashop-mariadb-secrets.yaml 部署 Bitnami Helm 
Chart MariaDB，使用提前创建的密钥作为 DB 的密钥。 1apiVersion: helm.toolkit.fluxcd.io/v2beta1 2kind: HelmRelease 3metadata: 4 name: prestashop-mariadb 5spec: 6 releaseName: mariadb 7 chart: 8 spec: 9 chart: mariadb 10 sourceRef: 11 kind: HelmRepository 12 name: bitnami 13 namespace: kube-system 14 version: 10.4.0 15 interval: 1h0m0s 16 install: 17 remediation: 18 retries: 3 19 valuesFrom: 20 - kind: ConfigMap 21 name: prestashop-values 22--- 23auth: 24 existingSecret: prestashop-mariadb 25primary: 26 persistence: 27 enabled: false 28 storageClass: standard 最佳实践 切记使用 Sealed Secrets, SOPS 等工具仅将加密后的密钥提交到 Git 仓库，避免密钥的泄露！\n查阅DEV, STAGING, PRODUCT这三个提交查看完整 sealed secrets 使用。\n4. 通知集成 在运维集群的时候，不同的团队有订阅不同的 GitOps 流水线通知的需求。 例如，oncall 团队将收到有关集群中协调失败的警报， 而开发团队可能希望在部署新版本的应用程序以及部署是否健康时收到警报。\nFlux 内置了同 Slack, MS Teams, Discord 等知名 IM 工具的集成，也支持将消息发送到 webhook 接口， 由用户自行实现消息通知。\n下面以 Slack 为例，示例如何集成 GitOps 流水线消息。\n定义一个名为 slack 的 Flux 自定义资源 Provider 1apiVersion: notification.toolkit.fluxcd.io/v1beta1 2kind: Provider 3metadata: 4 name: slack 5 namespace: kube-system 6spec: 7 type: slack 8 secretRef: 9 name: slack-url 因为 Slack WebHook 并没有额外的鉴权保护，这里我们使用上一节的密钥管理机制加密保存在 Git 仓库的 slack webhook url， 同时 Provider 引用 Secrets 对象中保存的 url。 2. 创建 Flux Alert 对象订阅命名空间的各类 Flux 对象事件，并且同第一步定义的 Provider 关联。 3. 
当创建多个 Alert 和不同的 Provider 关联，可以将消息发送到不同的 Slack channel 甚至是不同的 IM。\n1apiVersion: notification.toolkit.fluxcd.io/v1beta1 2kind: Alert 3metadata: 4 name: flux-alert 5 namespace: kube-system 6spec: 7 providerRef: 8 name: slack 9 eventSeverity: info 10 eventSources: 11 - kind: GitRepository 12 name: \u0026#39;*\u0026#39; 13 - kind: Kustomization 14 name: \u0026#39;*\u0026#39; 15 - kind: HelmRelease 16 name: \u0026#39;*\u0026#39; 17--- 18apiVersion: notification.toolkit.fluxcd.io/v1beta1 19kind: Alert 20metadata: 21 name: kube-system-alert 22 namespace: kube-system 23spec: 24 providerRef: 25 name: slack 26 eventSeverity: info 27 eventSources: 28 - kind: Kustomization 29 name: \u0026#39;*\u0026#39; 30 namespace: \u0026#39;kube-system\u0026#39; 31 - kind: HelmRelease 32 name: \u0026#39;*\u0026#39; 33 namespace: \u0026#39;kube-system\u0026#39; 查阅DEV, STAGING, PRODUCT这三个提交查看完整 commits 如何在不同环境集群中部署了 Slack 通知集成。\n图1：Slack channel 订阅 GitOps 流水线消息通知 最佳实践 针对订阅不同命名空间(非Alert对象定义的命令空间)的事件通知，需要显示指定命名空间属性。\n5. GitOps 代码的 CI GitOps 模式带来的又一个好处是可以使用企业成熟且惯用的代码管理工作流来自动化验证变更及代码审核审批。 针对 GitOps 代码可以引入如下 CI 步骤，\n由于 Flux 大量使用 Kustomize 来生成最终的声明式配置，可以实现在每次提交 Pull Request/Merge Request 后的验证阶段引入 kustomize CLI 验证 GitOps 配置是否可以正确的被生成。同时，使用 Flux OpenAPI 结合 kubeconform 验证 Kubernetes 内置资源和 Flux CRD 类型是否配置正确。 借助 KIND(Kubernetes in Docker) 实现完整的端到端测试。KIND 实现了 Docker 容器打包的 Kubernetes 环境， 可以每次 PR 验证阶段启动新的 KIND 环境且安装 Flux 后，执行 GitOps 代码的 reconciliation， 验证 GitOps 代码配置的资源是否可以被创建且状态为READY。 借助 Git 服务的 CI 能力，如 Github Actions, Gitlab CI/CD 等， 实现 GitOps 代码的上述两种自动化检查，以及同代码审核审批集成。 查阅此 Github Actions workflow 配置实现在 KIND 环境 End-To-End 验证 GitOps 配置，和 声明式配置 Manifests 验证。\n最佳实践 利用 Github Actions 或 Gitlab CI/CD 非常容易的将 GitOps 代码集成到 CI 环境， 通过 KIND/Kubeconform 验证代码的正确性。\n6. 
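Summary This article showed how to use the GitOps tool FluxCD v2 to manage the shared components of Kubernetes clusters across multiple accounts and VPCs, applied best practices for handling Secrets, integrated CD deployment events with an IM tool (Slack), and finally demonstrated a CI process for GitOps code that improves its quality and reduces deployment outages. The complete GitOps code is available in this repository.\nThe next part will introduce a shared service platform built on Flux under the GitOps working model.\n","link":"https://kane.mx/posts/gitops/flux-in-action-1/","section":"posts","tags":["GitOps","Kubernetes","Flux","Git","EKS","CD","Continuous Delivery"],"title":"基于 Flux 的 GitOps 实战（上）"},{"body":"","link":"https://kane.mx/tags/argocd/","section":"tags","tags":null,"title":"ArgoCD"},{"body":"Today Kubernetes has become a major player in IT infrastructure and the de facto standard for container orchestration. The view of the article on not building your own Kubernetes, written three years ago, has been accepted by the vast majority of enterprises.\nHowever, conversations with many enterprises show that, while they are very willing to use Kubernetes to manage workloads and many already run it in production, they are unsure how to do continuous deployment across multiple Kubernetes clusters at different stages while meeting requirements such as strong security, separation of privileges, auditability, and agility for business teams. Customers currently implement this in many different ways, and few follow industry best practices closely.\nGitOps is currently the best approach to continuous deployment on Kubernetes clusters that also satisfies enterprise requirements such as security and separation of privileges.\nWhat is GitOps GitOps is a way of implementing continuous deployment for cloud-native applications. It focuses on a developer-centric experience when operating infrastructure, by using tools developers are already familiar with, including Git and continuous deployment tools.\nThe core ideas of GitOps include:\nHaving a Git repository that always contains a declarative description of the infrastructure currently desired in the production environment, and an automated process that makes the production environment match the state described in the repository. The desired state is stored in a way that enforces immutability and versioning, and retains a complete version history. If you want to deploy a new application or update an existing one, you only need to update the repository; a software agent automatically pulls the desired state declaration from the source, and the automated process takes care of everything else. Software agents continuously observe the actual system state and attempt to apply the desired state. The comparison with traditional continuous deployment systems is as follows:\nTraditional CD system | GitOps system\nTriggered by push events such as code commits, scheduled jobs, or manual actions | The system continuously polls for changes\nDeploys only the changed parts | Declares the entire system for any deployment\nThe system may drift between deployments | The system corrects any drift\nThe CD program must have access to the deployment environment | The deployment pipeline is authorized to run within the system scope\nGitOps does not handle the following:\nPersistent application data, for example user uploads. Schema-based deployments, for example database schemas. Even with GitOps deployment, backing up and restoring this data remains just as important.\nDeclarative configuration has been supported by Kubernetes since day one, so Kubernetes declarative configuration matches the declarative-description requirement of GitOps well. Custom resource definitions extend declarative configuration to custom resources, so custom resources in K8S also fit the GitOps deployment approach seamlessly.\nGitOps best practices on Kubernetes Under the GitOps approach, Git becomes the single source of truth for the desired state of the system, supporting repeatable automated deployment, cluster management, and monitoring. Enterprises reuse their already mature Git workflows for continuous integration steps such as build, test, and scan. Once the code declaring the final state of the system reaches the main branch of the Git repository, the GitOps toolchain closes the loop from validating the deployment, through observation and alerting, to remediation that brings the system to its final state (see Figure 1).\nFigure 1: GitOps -- a model for continuous operations Now let's look at the ideal GitOps best practices on Kubernetes.\nGitOps-based continuous deployment The GitOps-based Kubernetes deployment flow is as follows:\nDevelopers commit application code and configuration to the source code repository. The continuous integration server builds, tests, scans, and pushes the new image version to the image registry, and updates the K8S manifests in the Git repository. The GitOps agent (such as ArgoCD/Flux) detects the change in the Git repository and automatically applies it to the K8S cluster. Figure 2: GitOps-based 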
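continuous deployment GitOps-based multi-cluster management When a pull request is merged into the platform repository, the GitOps agents configured on the clusters watch this repository and deploy the updates to kube-system or the platform namespace. Each cluster has a cluster repository that stores configuration such as access control and DNS. Each cluster also synchronizes with the upstream platform through the platform repository. Figure 3: GitOps-based multi-cluster management The application development experience under GitOps Business teams are responsible for building, testing, and scanning images. Declarative configuration is committed to a configuration repository specific to the business or application. The GitOps agent runs in a specific tenant namespace, and the application state is synchronized from the application team's repository to that tenant namespace. Figure 4: Application development under GitOps Multi-tenant clusters As above, each business or application team has its own configuration repository. Inside the cluster, businesses and applications are isolated by namespaces, and the GitOps agent continuously deploys them into their respective namespaces.\nFigure 5: GitOps multi-tenancy support Managing applications with Helm Helm is the package manager of the Kubernetes ecosystem, similar to apt on Debian. Seamless Helm support makes it possible to use the existing, mature K8S application packaging ecosystem, whether you deploy third-party components or applications the enterprise already has.\nDeployment management for multiple environments Multiple environments, multi-cloud, and hybrid cloud are very common in enterprise applications. Deployments need a base manifest plus a customized manifest per environment to simplify deployment management across environments.\nKustomize was created to solve exactly this problem and has become part of the native K8S toolchain. Supporting Kustomize satisfies this kind of requirement well in GitOps.\nFigure 6: GitOps multi-environment support In summary, the sections above describe the ideal state of GitOps on Kubernetes in terms of continuous deployment, multi-cluster management, multi-tenancy support, and integration with the existing toolchain and ecosystem. Next, let's discuss whether existing GitOps tools support this ideal state well.\nCloud-native GitOps tools Because Kubernetes is a core project of the CNCF, the whole ecosystem looks first at projects under the CNCF, and the same is true for GitOps.\nIn 2020 the CNCF published its Continuous Delivery technology radar, in which Flux and Helm were classified as Adopt, Kustomize as Trial, and Argo CD as Assess.\nIn the Core Services and Add-ons radar of the Multicluster Management technology radar that the CNCF published in 2021, Flux, Kustomize, Argo, and Helm were all rated as Adopt.\nBased on the core definition of GitOps above, the CNCF technology radar quadrants, and comparisons made by the community, the most widely used GitOps tools in the community today are Argo CD and Flux.\nPush or Pull? 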
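In practice, GitOps faces a choice between the push and pull deployment styles.\nThe push deployment style has the following benefits:\nSimple and easy to understand. This style has been adopted by many well-known CI/CD tools, such as Jenkins CI and the AWS Code series. Flexible. It is easy to integrate with other scripts or tools, whereas pull-style GitOps agents can only run inside Kubernetes. The pull deployment style has the following benefits:\nMore secure. Because the GitOps agent runs inside the Kubernetes cluster, it needs only minimal permissions to deploy. Network configuration is also simpler, since the cluster does not need a network connection to a CD program, which is especially clean when managing multiple clusters. Consistency. When managing multiple clusters, every cluster is managed in the same way. Isolation. Each cluster's deployment does not depend on a centralized CD pipeline. Scalability. This approach scales easily to managing hundreds or thousands of clusters. As this comparison shows, the pull style is better in security, scalability, isolation, and consistency, so it should be the first choice for GitOps deployments.\nComparison of mainstream cloud-native GitOps tools The table below compares Argo CD and Flux v2 in detail, for reference.\nFeature | ArgoCD | Flux v2\nInstallation/configuration | Installed with one command, but there is no native mechanism for configuration; it has to be done through the UI or by creating many manifests | One command completes both installation and configuration\nDeployment style | Push / pull | Pull\nSecret management[1] | Sealed secrets | Sealed secrets / Mozilla SOPS\nWebhook receiver[2] | Supported | Supported\nAlerts and notifications | Built-in integration with slack, email, Google Chat, and others | Built-in integration with slack, discord, MS Teams, and others\nImage update automation | Supported | Supported\nReconciliation configurability | Limited (the reconciliation time can only be set globally, not per application) | Supports setting sync and reconciliation intervals\nApplication delivery -- native Kubernetes manifests (YAML) | Yes; a side effect of Argo CD applications is that resources need to be re-applied on retries | Supported; treated the same as kustomization yaml\nApplication delivery -- Kustomization | Supported | Supported; also provides dependencies between GitOps components\nApplication delivery -- Helm charts | Supported, but without using the Helm Go library. Helm chart hooks are converted to Argo CD syncwaves/hooks, so the Helm cli is not supported | Supported; chart hooks are supported natively, and charts can depend on each other as components. The Helm cli is supported\nWeb UI | Supported, with full UI operations | No official UI. The open source platform Weave GitOps provides a UI based on Flux\nMulti-tenant permission management | Supported; implements its own ACL-based RBAC independent of Kubernetes, with fine-grained control | Yes; it is strictly based on Kubernetes RBAC and needs other CNCF projects, such as Kyverno, for fine-grained control\nMulti-cluster management -- managing and deploying to multiple clusters | Supported, with a native abstraction for clusters | Supported. In theory, workloads on multiple clusters can be managed from one Flux instance through Kustomization and Helm using KubeConfig settings\nMulti-cluster management -- creating clusters | Not supported; requires third-party tools such as CAPI, Crossplane, or Open Cluster Management | Not supported; requires third-party tools such as CAPI, Crossplane, or Open Cluster Management\nObservability of the GitOps tool itself | Provided through Prometheus + Grafana | Provided through Prometheus + Grafana\nJudged against the core GitOps principles in the table above, both ArgoCD and Flux satisfy them, and both meet enterprise requirements such as multi-tenant permissions, multi-cluster management, secret management, alerts and notifications, and support for Helm and Kustomization. In terms of implementation, ArgoCD provides a complete set of abstractions, including but not limited to Cluster, RBAC, Application, and Hook. This makes a broader feature set possible, but it also increases the complexity of the program and raises the learning curve. Flux has a simpler architecture: its default components are only Source, Kustomize, Helm, Notification, and Image automation, and it reuses native Kubernetes capabilities as much as possible, for example using ServiceAccounts to implement multi-tenant RBAC control, 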
which lowers the learning curve for users and makes it more compatible with other projects in the cloud-native community.\n[1]: A Git repository may be publicly readable and would store Secret objects in plain text, so they must be encrypted before being committed to Git\n[2]: By default, changes are pulled from the Git repository by polling; a webhook endpoint is provided that can be triggered by Git repository commit events\nSummary This article introduced what GitOps is, the best practices for GitOps-based continuous deployment on Kubernetes, and a comparison of Argo CD and Flux, the most popular GitOps projects under the CNCF. Follow-up articles will use Flux v2 in practice to show in depth how to implement GitOps continuous deployment while meeting a range of enterprise requirements.\nReferences GitOps: Cloud-native Continuous Deployment; The CNCF End User Technology Radar Continuous Delivery, June 2020; The CNCF End User Technology Radar Multicluster Management, June 2021; Push vs. Pull in GitOps: Is There Really a Difference?; Why is a PULL vs a PUSH pipeline important?; GitOps on Kubernetes: Deciding Between Argo CD and Flux","link":"https://kane.mx/posts/gitops/the-best-practise-of-gitops-in-kubernetes/","section":"posts","tags":["GitOps","Kubernetes","Flux","ArgoCD","Git","CD","Continuous Delivery"],"title":"GitOps Best Practices on Kubernetes"},{"body":"","link":"https://kane.mx/tags/athena/","section":"tags","tags":null,"title":"Athena"},{"body":"","link":"https://kane.mx/tags/cost/","section":"tags","tags":null,"title":"Cost"},{"body":"As a builder in the cloud, you might wonder which resources cost the most in your account.\nIn AWS, you can quickly find out which services, or even which features, cost a lot via AWS Billing or AWS Cost Explorer. However, it is painful to find out which functions cost the most when you have hundreds of Lambda functions, or which metrics or log groups cost the most in Amazon CloudWatch.\nAWS Cost and Usage Reports help in these scenarios. The AWS Cost and Usage Reports (AWS CUR) contain the most comprehensive set of cost and usage data available, including product and product resource, and tags that you define yourself. You can use Cost and Usage Reports to publish your AWS billing reports to an Amazon S3 bucket that you own. The CUR reports are plain CSV text files, so you still need to analyze them to find the insights you want. Amazon Athena is one of the simplest and most efficient ways to analyze your cost on demand. 
See the doc on how to set up Athena to analyze your AWS cost.\nAthena integrates out of the box with the AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services. With Amazon Athena, you pay only for the queries that you run. See my post on using Glue and Athena to analyze images in a Docker repository.\nBelow are a few sample queries that find the most costly resources in your AWS account:\nSELECT line_item_resource_id as resource, sum(line_item_unblended_cost) as total_cost FROM \u0026#34;athenacurcfn_main_account\u0026#34;.\u0026#34;main_account\u0026#34;\nWHERE year=\u0026#39;2022\u0026#39; and month=\u0026#39;1\u0026#39; and product_product_name = \u0026#39;AmazonCloudWatch\u0026#39;\nGROUP BY line_item_resource_id\nORDER BY total_cost DESC\nLIMIT 10\nSample 1: find the top 10 most costly resources in CloudWatch, including log groups, metrics, Synthetics, and so on\nSELECT line_item_resource_id, sum(line_item_usage_amount) as usage_amount, sum(line_item_blended_cost) as paid_amount FROM \u0026#34;athenacurcfn_main_account\u0026#34;.\u0026#34;main_account\u0026#34;\n  WHERE line_item_product_code=\u0026#39;AWSLambda\u0026#39; and product_group=\u0026#39;AWS-Lambda-Duration\u0026#39;\n  and year=\u0026#39;2022\u0026#39; and month=\u0026#39;1\u0026#39;\n  GROUP BY line_item_resource_id\n  ORDER BY usage_amount desc\n  LIMIT 10;\nSample 2: find the top 10 Lambda functions by duration usage, together with what was paid for them\nYou can refer to the CUR data dictionary to understand the field definitions in the report.\n","link":"https://kane.mx/posts/2022/find-out-most-costly-resource-in-your-aws-account/","section":"posts","tags":["AWS","Cost","Athena","Glue","Tip"],"title":"Find out the most costly resources in your AWS account"},{"body":"","link":"https://kane.mx/tags/glue/","section":"tags","tags":null,"title":"Glue"},{"body":"","link":"https://kane.mx/tags/aws-eks/","section":"tags","tags":null,"title":"AWS EKS"},{"body":"Even if you are an administrator of your AWS account, you 
probably see the warning below when viewing your cluster in the EKS console.\nYour current user or role does not have access to Kubernetes objects on this EKS cluster.\nThis happens because Kubernetes has its own RBAC authorization, while AWS uses IAM to grant permissions to users. You have to map your IAM user or role to K8S RBAC authorization to grant permission to access K8S resources in the EKS cluster.\nThe AWS documentation on this mapping demonstrates how to add IAM roles/users to an EKS cluster so that they can access K8S resources. However, the documentation is not clear about how to add federated users to an EKS cluster.\nI face two scenarios in which federated AWS users need to access K8S resources in the EKS console:\nUsing corporate SSO to access an internal system, then logging in to the AWS account by assuming an existing role of that account. Using a tool such as AWS Vault or an Alfred workflow to log in to the AWS console with the ak/sk of an IAM user. The configuration below turned out to grant both kinds of federated users access to K8S resources in the EKS console:\napiVersion: v1\ndata:\n  mapRoles: |\n    - groups:\n      - system:bootstrappers\n      - system:nodes\n      rolearn: arn:aws:iam::123456789012:role/cluster-nodegroup-n-NodeInstanceRole-1OQT1WT84WVS8 # created by eksctl when bootstrapping cluster\n      username: system:node:{{EC2PrivateDNSName}}\n    - groups:\n      - eks-console-dashboard-full-access-group\n      rolearn: arn:aws:iam::123456789012:role/Admin # granting the federated user via assuming role\n      username: Admin/kane\n  mapUsers: |\n    - userarn: arn:aws:sts::123456789012:federated-user/kane # granting the federated user via aws-vault\n      username: ops-user\n      groups:\n      - eks-console-dashboard-full-access-group","link":"https://kane.mx/posts/2022/grant-federated-users-accessing-k8s-resources-in-eks-console/","section":"posts","tags":["Kubernetes","AWS EKS","Tip","AWS"],"title":"Grant federated users accessing kubernetes resources in EKS 
console"},{"body":"","link":"https://kane.mx/tags/construct-hub/","section":"tags","tags":null,"title":"Construct Hub"},{"body":"","link":"https://kane.mx/tags/npm/","section":"tags","tags":null,"title":"Npm"},{"body":"","link":"https://kane.mx/tags/projen/","section":"tags","tags":null,"title":"Projen"},{"body":"Construct Hub is a web portal that collects constructs for AWS CDK, CDK8s and CDKtf. A construct can support multiple programming languages, such as JavaScript/TypeScript, Python, Java and C#. The construct is actually developed in TypeScript and then compiled into a cross-language library by jsii! Any npm/pypi package with certain tags is discovered by Construct Hub, automatically recognized as a construct package, and listed in Construct Hub.\nProjen is a project generator that simplifies project configuration for dependency management, building, unit testing, code style linting, and CI/CD via GitHub pull requests and GitHub Actions. Projen supports construct projects out of the box: it sets up the jsii configuration that builds the construct into a cross-language library, and it publishes the packages to various package registries, such as npmjs, pypi and maven central.\nProjen provides a Publishing capability to publish a construct library to the supported package managers. 
For example, for npm (JavaScript/TypeScript) it can publish the package to several kinds of npm registries, such as the npm public registry, GitHub Packages, AWS CodeArtifact, and any publicly accessible private npm registry.\nHowever, projen only supports publishing the package to a single npm registry by default. What if you would like to publish your package to both the npm public registry and GitHub Packages?\nThere is no mature way to achieve this, but projen is a flexible tool, so we can hack it as shown below to publish the package to both the npm public registry and GitHub Packages:\n// assumes `project` is your projen project and `github` is imported from projen\nconst target = \u0026#39;js\u0026#39;;\nconst REPO_TEMP_DIRECTORY = \u0026#39;.repo\u0026#39;;\nconst options = {\n  registry: \u0026#39;npm.pkg.github.com\u0026#39;,\n  prePublishSteps: [\n    {\n      name: \u0026#39;Prepare Repository\u0026#39;,\n      run: `mv ${project.artifactsDirectory} ${REPO_TEMP_DIRECTORY}`,\n    },\n    {\n      name: \u0026#39;Install Dependencies\u0026#39;,\n      run: `cd ${REPO_TEMP_DIRECTORY} \u0026amp;\u0026amp; ${project.package.installCommand}`,\n    },\n    {\n      // remove this if your package name already has scope\n      name: \u0026#39;Update package name\u0026#39;,\n      run: `cd ${REPO_TEMP_DIRECTORY} \u0026amp;\u0026amp; sed -i \u0026#34;1,5s/\\\\\u0026#34;packagename\\\\\u0026#34;/\\\\\u0026#34;@scope\\\\/packagename\\\\\u0026#34;/g\u0026#34; package.json`,\n    },\n    {\n      name: `Create ${target} artifact`,\n      run: `cd ${REPO_TEMP_DIRECTORY} \u0026amp;\u0026amp; npx projen package:js`,\n    },\n    {\n      name: `Collect ${target} Artifact`,\n      run: `mv ${REPO_TEMP_DIRECTORY}/${project.artifactsDirectory} ${project.artifactsDirectory}`,\n    },\n  ],\n};\nproject.release.publisher.addPublishJob((_branch, branchOptions) =\u0026gt; {\n  return {\n    name: \u0026#39;npm_github\u0026#39;,\n    publishTools: {},\n    prePublishSteps: options.prePublishSteps ?? [],\n    run: project.release.publisher.publibCommand(\u0026#39;publib-npm\u0026#39;),\n    registryName: \u0026#39;npm-github\u0026#39;,\n    env: {\n      NPM_DIST_TAG: branchOptions.npmDistTag ?? options.distTag ?? \u0026#39;latest\u0026#39;,\n      NPM_REGISTRY: options.registry,\n    },\n    permissions: {\n      contents: github.workflows.JobPermission.READ,\n      packages: github.workflows.JobPermission.WRITE,\n    },\n    workflowEnv: {\n      NPM_TOKEN: \u0026#39;${{ secrets.YOUR_GITHUB_REGISTRY_TOKEN }}\u0026#39;,\n      // if we are publishing to AWS CodeArtifact, pass AWS access keys that will be used to generate NPM_TOKEN using AWS CLI.\n      AWS_ACCESS_KEY_ID: undefined,\n      AWS_SECRET_ACCESS_KEY: undefined,\n      AWS_ROLE_TO_ASSUME: undefined,\n    },\n  };\n});\nThe code snippet above adds an additional publish job to the GitHub Actions release workflow managed by projen, which publishes the package to GitHub Packages.\nHAPPY Projen!\n","link":"https://kane.mx/posts/2022/publishing-npm-packages-to-multiple-registries-with-projen/","section":"posts","tags":["CDK Construct","AWS CDK","npm","projen","continuous delivery","construct hub"],"title":"Publishing npm packages to multiple registries with Projen"},{"body":"I recently shared in a webinar how to build a shared service platform on AWS that serves decentralized development teams. The key points are summarized below.\nA decentralized team here is the opposite of the ideal, fully fledged DevOps team that owns every stage, including design, development, testing, operations, and business operations. In larger organizations, areas such as account management, network planning, and service auditing are owned by platform, infrastructure, or security teams, while multiple development teams are responsible for developing, testing, and operating the individual business systems.\nRunning an organization in a healthy way now demands more and more security and compliance. Usually the infrastructure team together with the security team owns the overall strategy and implementation for security and compliance requirements. However, meeting these requirements usually conflicts with the expected speed of business delivery. On one hand, the platform and security teams must review application launches and changes for security and compliance, and development teams must invest more resources to meet those requirements, which inevitably delays delivery. On the other hand, if delivery changes also require internal processes and manual operations, cross-team communication and collaboration slow delivery down further. A shared service platform that combines abstractions of the security and compliance requirements with self-service is made for this scenario: it can greatly improve delivery speed while meeting security and compliance requirements.\nHow abstractions help on AWS Abstractions in software development at Amazon fall into the following categories.\nFrameworks When writing code for an application, a framework lets you write smaller and more efficient code, which is then expanded or built into more substantial pieces of software. Developers are very familiar with frameworks: Spring, Django, VueJS, and React for web development, and TensorFlow and PyTorch for ML model training, are all frameworks that lower the barrier to development and let developers focus on the core business. From the perspective of organizational security and compliance, infrastructure or security teams can use templates or patterns to enforce company standards and thereby meet security and compliance requirements.\nThe framework tools and services AWS currently provides are AWS SAM, AWS CDK, and AWS CloudFormation. 
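With these framework tools and services, users can quickly build cloud applications, or create components of cloud applications, while meeting enterprise security and compliance requirements.\nCommand line (CLI) A framework's CLI lets development teams build automation on top of it in the language they know best. A CLI makes the framework easier to use: developers can call it from the language they are most familiar with (for example shell scripts or Python) to encapsulate business logic.\nAWS Copilot and the AWS SAM CLI are CLI tools provided by AWS.\nDeployment services Finally, a deployment service takes the application software written by the development team, along with the definition of how that software is to be used, and runs it in a real infrastructure environment. Development teams need to speed up application delivery to meet business needs, while the infrastructure and security teams need to act as the gatekeepers of enterprise security and compliance. A sound collaboration model, such as a shared self-service platform, speeds up delivery for development teams while meeting security requirements.\nAWS provides many out-of-the-box products for starting to build a shared service platform, for example AWS Proton and AWS CodeDeploy.\nThoughts on platform ownership models In a shared service platform, different services play different roles depending on the management requirements or model and on the type of application workload. Below are some platform models that customers have put into practice.\nAWS accounts as the \u0026quot;platform\u0026quot; The organization treats AWS accounts as the platform and automates their creation. The characteristics are as follows:\nAWS accounts belong to the application teams and are operated by them. The infrastructure team provides tooling for the application accounts, including but not limited to: supporting infrastructure (for example networking, security, domain names, and company standards), AWS Proton environments and applications, and AWS Service Catalog products. In addition, AWS Control Tower, AWS Organizations, CloudFormation StackSets, and the AWS Config conformance features are services or features that can support a shared platform managed at the account level.\nManaged container clusters as the \u0026quot;platform\u0026quot; The AWS accounts are operated by the platform or infrastructure team, which manages the lifecycle of the container clusters, for example managed EKS (Kubernetes) clusters. Application teams deploy their applications to the multi-tenant clusters managed by the infrastructure team and take on the following responsibilities (including but not limited to):\nContinuous integration of the application. Ingress traffic control. Access management. Operational visibility. The surrounding infrastructure that serves the application (for example databases, queues, or caches) can be deployed through continuously integrated AWS CloudFormation/AWS CDK templates or through AWS Proton environments and applications.\nDeployable application patterns as the \u0026quot;platform\u0026quot; In this model the infrastructure team usually owns a shared account and is responsible for resources such as networking, domain names, and auditing, while application teams own the application accounts. The infrastructure team creates ready-made applications, deployment mechanisms, and abstraction libraries for application teams to use or customize. The following services or features can implement this model:\nAWS Proton; AWS CloudFormation / AWS Service Catalog; AWS CDK constructs. Operating the shared self-service platform as a product Beyond choosing an implementation model that fits the organization's management structure and technology stack, whether a shared self-service platform really improves efficiency and speeds up delivery inside an organization depends on how the platform is operated. The success stories we have seen all treat the platform as a product and run it the way a product is run. A successful shared self-service platform usually realizes its value in three steps that form a growth flywheel.\n1. Experiment and prove the value The first step is always the hardest: building a shared service platform requires finding a suitable application team as the seed user. That team's scenario should be a typical use case: the team needs to release frequently to deliver business value, the application touches modules owned by the infrastructure team, and it must pass the organization's security and compliance reviews. The two teams jointly define the interaction model, for example how to develop infrastructure as code, the platform infrastructure, and how applications are deployed. The application team also needs to recognize the value of the platform, so that after the seed application lands successfully on the shared service platform, it gradually moves more applications onto the shared platform model.\nThe technology choice for the platform is also very important and must fit the organization's management structure and technical capabilities. For example:\nAWS CloudFormation / AWS CDK: the general and most flexible implementation, but also a double-edged sword. Maintaining large CloudFormation templates is very hard, adopting AWS CDK requires learning new skills, and sometimes you need to dig deep before you can hack certain internals. It is the choice when you need full control over implementation details. AWS Copilot: only suitable for container applications deployed on ECS, where the team developing the application does not care about or need to manage infrastructure. AWS Proton: suitable for scenarios that separate application and infrastructure permissions, where application teams need a self-service model. 2. 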
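Replicate and promote After the needs of the seed application have been met, the key to making the shared service platform succeed is how you operate and promote it. Operating it as a product involves the following:\nDocument the platform you built, including the self-service or automated adoption steps, user manuals and API documentation, and the support mechanism, that is, how internal users who adopt the platform get technical support. Publish the code, at least as internal open source, so that more users can take part and share. Market the platform internally: use success stories to win the support of leadership, quantify the value of the platform, mitigate risks, and improve the efficiency of business development; run internal \u0026quot;roadshows\u0026quot; to recruit teams and collect feedback. Prove that your shared service platform approach scales: consider supporting dozens or hundreds of teams; make sure it holds up across scenarios such as application outages, AWS service events, migrations, and 0-day patches; and consider where the bottlenecks of replicating the platform are, for example adoption difficulty, learning curve, deployment, support and on call, and so on. 3. Build the flywheel The shared self-service platform eventually forms a virtuous cycle that generates its own growth momentum, for example:\nInvest in the platform team --\u0026gt; add platform features --\u0026gt; happier application developers --\u0026gt; more platform usage --\u0026gt; higher organizational efficiency --\u0026gt; invest in the platform team --\u0026gt; ...\nClosing this flywheel loop leads to long-term success.\n","link":"https://kane.mx/posts/2021/shared-service-platform-for-decentralized-developer-teams/","section":"posts","tags":["SSP","DevOps","GitOps","Infrastructure as Code","AWS","Proton","Service Catalog"],"title":"Building a shared self-service platform on AWS for decentralized development teams"},{"body":"","link":"https://kane.mx/tags/proton/","section":"tags","tags":null,"title":"Proton"},{"body":"","link":"https://kane.mx/tags/service-catalog/","section":"tags","tags":null,"title":"Service Catalog"},{"body":"","link":"https://kane.mx/tags/ssp/","section":"tags","tags":null,"title":"SSP"},{"body":"","link":"https://kane.mx/tags/amazon-builders-library/","section":"tags","tags":null,"title":"Amazon Builders' Library"},{"body":"","link":"https://kane.mx/series/amazon-builders-library/","section":"series","tags":null,"title":"Amazon-Builders-Library"},{"body":"","link":"https://kane.mx/tags/resilience-engineering/","section":"tags","tags":null,"title":"Resilience Engineering"},{"body":"","link":"https://kane.mx/tags/system-design/","section":"tags","tags":null,"title":"System Design"},{"body":"The AWS Well-Architected Framework covers five pillars. Among them, the reliability pillar focuses on ensuring that a workload performs its intended function correctly and consistently when it is expected to. This requires the application to be designed for resilience so that it recovers quickly from failures and meets business and customer needs. However, designing, developing, and validating a resilient application demands a lot of experience and hands-on skill. Proven experience and good tools speed up building a resilient application that behaves as expected.\nThe Application Resilience Workshop is a set of lessons and hands-on exercises for learning how to run experiments that observe system behavior, for example under extreme system load and network outages, and how to use different software patterns to reduce the impact of those experiments on the steady state of the system. Each experiment is divided into steps such as hypothesis, method, observation, and mitigation, which is very similar in spirit to chaos engineering.\nThe application resilience experiment assumes an application built from microservices and injects extreme system load with a stress-testing tool, 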
通过应用程序各服务的可观测性来理解目标应用的延迟，吞吐、容量、RTO等指标。\n当发现最初设计的应用程序在极端压力下会有灾难性的故障，教程中给出了队列，负载卸载， 通过降低服务QoS的Client Deadline Cutoff，断路器，令牌桶等程序设计模式来缓解极端压力对系统造成的灾难性的故障。\n但是无论使用队列解耦还是负载卸载都不是绝对完美的解决方案，在Amazon Builders' Library 中的几篇文章为我们分享了Amazon从运行大规模分布式系统中学习到的宝贵且成熟的经验，\n避免无法克服的队列积压 通过卸除负载来避免过载 超时、重试和抖动回退 利用这些成熟的经验我们可以权衡系统的需求和技术实现，选择当下最合理且可行的解决方案。\n在今年的Pre-re:Invent之际，AWS发布了AWS Resilience Hub服务，将应用程序云上资源状态的扫描， 系统弹性的评估，符合RPO/RTO的配置建议，以及基于混沌工程的实验运行集成为一个整体的服务，通过一个控制面板实现了应用程序弹性的管理。\n此外，如果你的应用刚好是一个电商或在线票务系统，系统正在面对秒杀、黑五等大规模负载压力的考验，可以参考甚至直接尝试AWS解决方案 AWS Virtual Waiting Room来直接构建一个弹性系统。\n","link":"https://kane.mx/posts/2021/application-resilience/","section":"posts","tags":["AWS","resilience engineering","Amazon Builders' Library","System Design"],"title":"应用程序弹性设计"},{"body":"","link":"https://kane.mx/tags/metaverse/","section":"tags","tags":null,"title":"Metaverse"},{"body":"","link":"https://kane.mx/tags/nft/","section":"tags","tags":null,"title":"Nft"},{"body":"","link":"https://kane.mx/tags/%E5%85%83%E5%AE%87%E5%AE%99/","section":"tags","tags":null,"title":"元宇宙"},{"body":"元宇宙是近期的热点话题，近期做了些学习了解，将一些学习内容总结为一个deck。分享如下，\n","link":"https://kane.mx/posts/2021/metaverse/","section":"posts","tags":["metaverse","nft","元宇宙"],"title":"元宇宙风口下的机会"},{"body":"","link":"https://kane.mx/tags/aws-fault-injection-simulator/","section":"tags","tags":null,"title":"AWS Fault Injection Simulator"},{"body":"","link":"https://kane.mx/tags/aws-fis/","section":"tags","tags":null,"title":"AWS FIS"},{"body":"混沌工程是一种帮助系统满足弹性需求的技术，它起源于Netflix的工程实践，著名的猴子军团。\nAWS一直提倡架构的完善(AWS Well-Architected)，混沌工程正是卓越运营和可靠性支柱的实践。 因此在 re:Invent 2020 AWS发布了Fault Injection Simulator服务来简化开发者在AWS上的混动工程实践。\nAWS FIS作为AWS上原生的混沌工程服务，目前已同EC2，ECS，EKS，RDS，CloudWatch，甚至是IAM Role API集成，可以触发这些服务中资源的变更来假设故障， 例如，重启或终止EC2实例，重启RDS实例等。\nChaos Engineering on AWS是一份非常详细的混沌工程在AWS上动手实验。 该实验将指导参与者快速设置实验初始环境，通过可观测性工具了解系统状态，然后带领实验参与者通过详细的实验步骤学习如何使用FIS服务来达到对系统可靠性的验证和优化。 实验项目除了覆盖FIS支持集成的EC2，ECS，RDS等服务外，还演示了SSM集成，并且通过FIS内置的SSM文档或自定义的SSM文档来假设系统故障。 
混沌工程作为一种提升系统弹性的质量手段，需要重复性的在系统中实验，动手实验也为参与者设计了CI/CD实验，通过Gitops方式将混动工程持续实验到系统环境中。 总之，对混沌工程有兴趣的开发者，Chaos Engineering on AWS非常值得一做的动手实验，可以快速的帮助您了解混沌工程，及FIS在AWS上的实践。\n最后再强调下最重要的事，混沌工程不是一个孤立的系统弹性实验，它需要系统本身的弹性、可靠性设计以及可观测性的实现，是一个系统整体的设计实践。\n","link":"https://kane.mx/posts/2021/chaos-engineering-on-aws/","section":"posts","tags":["AWS","chaos engineering","AWS Fault Injection Simulator","AWS FIS"],"title":"AWS上的混沌工程"},{"body":"","link":"https://kane.mx/tags/chaos-engineering/","section":"tags","tags":null,"title":"Chaos Engineering"},{"body":"","link":"https://kane.mx/tags/filevault/","section":"tags","tags":null,"title":"Filevault"},{"body":"","link":"https://kane.mx/tags/macos-monterey/","section":"tags","tags":null,"title":"MacOS Monterey"},{"body":"","link":"https://kane.mx/tags/macosx/","section":"tags","tags":null,"title":"MacOSX"},{"body":"I'm trying to upgrade my Macbook Pro to macOS Monterey, however the installation can not be started due to the disk is encrypted by Filevault \u0026#x1f615; I have to turn off Filevault to disable disk encrpytion before installing macOS Monterey.\nI found this support article on how turning off Filevault, but it does not work at all. There is nothing hint or error message after clicking the option Turn off Filevault.\nAfter researching it for a while, I found this post via CLI command,\n1sudo fdesetup disable But above command also does not work, it exits with error code -69594.\n1sudo fdesetup disable 2Enter the user name:kane 3Enter the password for user \u0026#39;kane\u0026#39;: 4FileVault was not disabled (-69594). I found some articles said that the Filevault only can be disabled by the user whom enables it. I found below command to show the user whom enabled the Filevault, it's enabled by an unknown user! 
I have no idea how it was enabled.\nsudo fdesetup list -extended\nESCROW UUID TYPE USER\n2D3F7CA5-4ED4-4537-8DA2-98B1E3637954 Unknown User\nFinally, I found the command below, which disables FileVault even though I don't know which user enabled it.\ndiskutil apfs disableFileVault disk1s1 -user disk\nEnter the disk password, the one you type when booting macOS. FileVault is then disabled in the background; you can check the progress with the command below,\ndiskutil apfs list\nHappy Monterey!\n","link":"https://kane.mx/posts/2021/turn-off-filevault-on-macosx/","section":"posts","tags":["MacOSX","macOS Monterey","filevault","Tip"],"title":"Turn off Filevault on macOS"},{"body":"","link":"https://kane.mx/tags/aws-ecr/","section":"tags","tags":null,"title":"AWS-ECR"},{"body":"","link":"https://kane.mx/tags/helm/","section":"tags","tags":null,"title":"Helm"},{"body":"I recently needed to mirror existing Helm charts to another repository. Such a need can arise from network availability or compliance requirements.\nThere are multiple ways to host a Helm repository, for example, Nexus OSS Repository, GitHub Pages, AWS ECR and so on.\nAmazon Elastic Container Registry (Amazon ECR) is a fully managed container registry that makes it easy to store, manage, share, and deploy your container images and artifacts anywhere. It is built for scale and security. 
In my case I use this existing service to mirror the Helm charts.\nI created a script that mirrors an existing Helm chart to AWS ECR, based on the official guide in the ECR documentation.\nFor example, the snippet below mirrors eks-charts/aws-load-balancer-controller version 1.2.7 to Amazon ECR,\nhelm repo add eks-charts https://aws.github.io/eks-charts\nhelm pull eks-charts/aws-load-balancer-controller --version 1.2.7\n./push-helm-chart-to-all-ecr-regions.sh aws-load-balancer-controller 1.2.7","link":"https://kane.mx/posts/2021/mirror-helm-chart-to-aws-ecr/","section":"posts","tags":["AWS","AWS-ECR","Helm","Kubernetes"],"title":"Mirror Helm Charts to AWS ECR"},{"body":"","link":"https://kane.mx/tags/amazon-neptune/","section":"tags","tags":null,"title":"Amazon Neptune"},{"body":"","link":"https://kane.mx/tags/graph-database/","section":"tags","tags":null,"title":"Graph Database"},{"body":"Amazon Neptune is a managed graph database on AWS whose compute and storage are decoupled, like Amazon Aurora. Neptune supports popular open graph APIs such as Gremlin and SPARQL, which makes it easy to migrate existing applications.\nAfter exploring Neptune for a few months in a solution, I have the learnings below,\nBulk loading\nI always hit ConcurrentModificationException errors when concurrently loading vertices/edges into Neptune. Using neptune-python-utils with retry backoff mitigates it, but it requires a large, expensive Neptune instance.\nThe best way to batch load a large number of vertices/edges into Neptune is the bulk load feature, which works fine even on a small Neptune instance. The loading time depends on the instance size.\nProperties of vertices\nIn my use case, I store the embedding as properties of vertices, as I would in columns of a relational database. There are almost 400 properties on every vertex, and query performance is bad with such a large number of properties. 
Due to the embedding properties will not be queried, consolidating the 400 properties as a single one properties to improve the query performance.\nstreams Neptune Streams logs every change to the graph. It's a Lab feature in 2019, and GA in 2020. However there is no Lambda integration now! It means you can not process the Neptune streams in Lambda functions!\nTools Neptune Tools Amazon Neptune Tools is a toolkit maintained by Neptune service team.\nNeptune sigv4 The script can connect Neptune to call control plane APIs with aswauthsigv4 and proxy support.\n","link":"https://kane.mx/posts/2021/the-practise-of-amazon-neptune/","section":"posts","tags":["graph database","Amazon Neptune","AWS"],"title":"The practise of Amazon Neptune"},{"body":"","link":"https://kane.mx/tags/amazon-eks/","section":"tags","tags":null,"title":"Amazon EKS"},{"body":"","link":"https://kane.mx/tags/sonatype-nexus/","section":"tags","tags":null,"title":"Sonatype Nexus"},{"body":"Last year I shared the production-ready, cloud native solution to deploy Sonatype Nexus Repository OSS on AWS.\nThe solution has an update with below notable changes,\nsupport specifying EKS version, v1.20, v1.19, and v1.18 are supported versions support provisioning to existing VPC support provisioning to existing EKS(require EKS v1.17+) update aws-efs-csi-driver to 1.3.1 update aws-load-balancer-controller to 2.2.0 See the solution page for detail usage.\n","link":"https://kane.mx/posts/2021/nexus-oss-on-aws-v110-update/","section":"posts","tags":["Amazon EKS","Kubernetes","Helm","AWS CDK","AWS","Sonatype Nexus"],"title":"The update of Sonatype Nexus repository OSS on AWS 
solution"},{"body":"","link":"https://kane.mx/tags/nat/","section":"tags","tags":null,"title":"NAT"},{"body":"","link":"https://kane.mx/tags/network/","section":"tags","tags":null,"title":"Network"},{"body":"本方案的起因是，一个源代码托管在Github上的项目fix一个重要的bug后，在AWS上的持续部署流水线一直失败。分析日志后，发现流水线中的数个步骤需要克隆源代码，但是访问Github的网络非常不稳定，这数个流水线任务持续因连接超时，连接拒绝等网络错误而失败。而流水线任务大量使用了CodeBuild, Lambda等AWS托管服务，无法为执行环境配置可靠的网络连接。\n本方案思路如下，\n在 VPC public subnets 中创建 NAT instance 即 EC2 虚拟机， 配置 NAT instance，使用 tunnel 网络访问 github， 修改 private subnets 的路由表，添加 github 服务的 IP CIDRs，将对这些IP地址的请求通过 NAT instance 转发。 综上，实现了不用对现有持续部署流水线做任何修改，流水线中运行在 VPC private subnet 内的任务(包括但不限于CodeBuild, Fargate, Lambda, Glue等)，对外网的请求目标地址如在路由表的特殊规则(IP CIDRs)中，网络请求将会通过 NAT instance 来转发。\n为此，创建了一个基于 AWS CDK construct 的开源项目 SimpleNAT 来封装和复用创建配置 NAT instances，并且将指定的IP地址段更新到路由表设置路由规则。\n该项目同时提供了一个完整示例应用，演示了如何配置 NAT instance 使用 sshuttle 建立网络隧道，并且将指定的IP地址段请求通过 NAT instance 来转发。\n","link":"https://kane.mx/posts/2021/simple-nat-on-aws/","section":"posts","tags":["AWS","Tip","network","NAT","CDK Construct","AWS CDK"],"title":"在AWS上快速部署专用的NAT实例"},{"body":"Infrastructure as Code is the trend to manage the resources of application. AWS CloudFormation is the managed service offering the IaC capability on AWS since 2011. CloudFormation uses the declarative language to manage your AWS resources with the style what you get is what you declare.\nHowever there are cons of CloudFormation as a declarative language,\nthe readability and maintenance for applications involving lots of resources the reuseable of code, CloudFormation modules released in re:Invent 2020 might help mitigate it AWS CDK provides the programming way to define the infra in code by your preferred programming languages, such as Typescript, Javascript, Python, Java and C#. AWS CDK will synthesis the code to CloudFormation template, then deploying the stack via AWS CloudFormation service. 
It benefits the Devops engineers manage the infra on AWS as programming application, having version control, code review, unit testing, integration testing and CI/CD pipelines, the deployment still depends on the mature CloudFormation service to rolling update the resources and rollback when failing.\nFor solution development, using CDK indeed improves the productivity then publish the deployment assets as CloudFormation templates.\nThough CDK application can be synthesized to CloudFormation template, there are still some differences blocking the synthesized templates to be deployed across multiple AWS regions.\nThis post will share the tips on how effectively writing AWS CDK application then deploying the application by CloudFormation across multiple regions.\nGeneral Environment-agnostic stack Don’t specify env with account and region like below that will generate account/region hardcode in CloudFormation template.\n1new MyStack(app, \u0026#39;Stack1\u0026#39;, { 2 env: { 3 account: \u0026#39;123456789012\u0026#39;, 4 region: \u0026#39;us-east-1\u0026#39; 5 }, 6}); use CfnMapping/CfnCondition instead of if-else clause CloudFormation does not have logistic processing like programming language. Use CfnMapping or CfnCondition instead.\nNote: the CfnMapping does not support default value, you have to list all supported regions like below code snippet,\n1getAwsLoadBalancerControllerRepo() { 2 const albImageMapping = new cdk.CfnMapping(this, \u0026#39;ALBImageMapping\u0026#39;, { 3 mapping: { 4 \u0026#39;me-south-1\u0026#39;: { 5 2: \u0026#39;558608220178\u0026#39;, 6 }, 7 \u0026#39;eu-south-1\u0026#39;: { 8 2: \u0026#39;590381155156\u0026#39;, 9 }, 10 \u0026#39;ap-northeast-1\u0026#39;: { 11 2: \u0026#39;602401143452\u0026#39;, 12 }, 13 \u0026#39;ap-northeast-2\u0026#39;: { 14 2: \u0026#39;602401143452\u0026#39;, 15 }, 16 ... 
17 \u0026#39;ap-east-1\u0026#39;: { 18 2: \u0026#39;800184023465\u0026#39;, 19 }, 20 \u0026#39;af-south-1\u0026#39;: { 21 2: \u0026#39;877085696533\u0026#39;, 22 }, 23 \u0026#39;cn-north-1\u0026#39;: { 24 2: \u0026#39;918309763551\u0026#39;, 25 }, 26 \u0026#39;cn-northwest-1\u0026#39;: { 27 2: \u0026#39;961992271922\u0026#39;, 28 }, 29 } 30 }); 31 return `${albImageMapping.findInMap(cdk.Aws.REGION, \u0026#39;2\u0026#39;)}.dkr.ecr.${cdk.Aws.REGION}.${cdk.Aws.URL_SUFFIX}/amazon/aws-load-balancer-controller`; 32 } never use Stack.region Don’t rely on stack.region to do the logistic for China regions. Use additional context parameter or CfnMapping like below snippet,\n1const partitionMapping = new cdk.CfnMapping(this, \u0026#39;PartitionMapping\u0026#39;, { 2 mapping: { 3 aws: { 4 nexus: \u0026#39;quay.io/travelaudience/docker-nexus\u0026#39;, 5 nexusProxy: \u0026#39;quay.io/travelaudience/docker-nexus-proxy\u0026#39;, 6 }, 7 \u0026#39;aws-cn\u0026#39;: { 8 nexus: \u0026#39;048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn/quay/travelaudience/docker-nexus\u0026#39;, 9 nexusProxy: \u0026#39;048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn/quay/travelaudience/docker-nexus-proxy\u0026#39;, 10 }, 11 } 12 }); 13partitionMapping.findInMap(cdk.Aws.PARTITION, \u0026#39;nexus\u0026#39;); Use core.Aws.region token referred to the region which region of the stack is deployed.\nexplicitly add dependencies on resources to control the creation/deletion order of resources For example, when deploying a solution with creating a new VPC with NAT gateway, then deploying EMR cluster in private subnets of VPC. The EMR cluster might fail on creation due to network issue. 
This happens because the NAT gateway is not ready when the EMR cluster initializes; you have to explicitly declare a dependency between the EMR cluster and the NAT gateway.\nAlways override the logical ID of a CloudFormation resource when creating AWS resources with a unique name\nUPDATED: 2023/10/24\nSuppose you create an AWS resource via CDK with a friendly name, for example, a Glue Table named my-table. The logical ID is generated from the IDs of the construct and its parents, but you might later refactor your constructs at the system design level. The default logical ID of the resource changes when you refactor or rename the ID of a construct. After the logical ID changes, the resource is replaced when the CloudFormation stack is updated to the new template. During the stack update, the new resource is created first, and that creation fails because the resource name conflicts with the existing one. Currently, the workaround is to explicitly override the logical ID of the AWS resource created by CDK so that it is not replaced during stack updates. Explicitly overriding the logical ID keeps the code readable and avoids unintended stack update failures.\n(table.node.defaultChild as CfnResource).overrideLogicalId(tableLogicId);\nEKS module(@aws-cdk/aws-eks)\nspecify kubectl layer when creating EKS cluster\nNOTE: This trick only applies to AWS CDK versions prior to 1.81.0. Since 1.81.0, CDK bundles kubectl, helm and awscli as a Lambda layer instead of a SAR application, which resolves the limitation below.\nThe EKS module uses a Lambda layer to run the kubectl/helm CLI in a custom resource, and the @aws-cdk/aws-eks module depends on Stack.region to determine the target region during synthesis. It violates the principle of the environment-agnostic stack! 
Use the workaround below to create the EKS cluster,\nconst partitionMapping = new cdk.CfnMapping(this, \u0026#39;PartitionMapping\u0026#39;, {\n  mapping: {\n    aws: {\n      // see https://github.com/aws/aws-cdk/blob/60c782fe173449ebf912f509de7db6df89985915/packages/%40aws-cdk/aws-eks/lib/kubectl-layer.ts#L6\n      kubectlLayerAppid: \u0026#39;arn:aws:serverlessrepo:us-east-1:903779448426:applications/lambda-layer-kubectl\u0026#39;,\n    },\n    \u0026#39;aws-cn\u0026#39;: {\n      kubectlLayerAppid: \u0026#39;arn:aws-cn:serverlessrepo:cn-north-1:487369736442:applications/lambda-layer-kubectl\u0026#39;,\n    },\n  }\n});\n\nconst kubectlLayer = new eks.KubectlLayer(this, \u0026#39;KubeLayer\u0026#39;, {\n  applicationId: partitionMapping.findInMap(cdk.Aws.PARTITION, \u0026#39;kubectlLayerAppid\u0026#39;),\n});\nconst cluster = new eks.Cluster(this, \u0026#39;MyK8SCluster\u0026#39;, {\n  vpc,\n  defaultCapacity: 0,\n  kubectlEnabled: true,\n  mastersRole: clusterAdmin,\n  version: eks.KubernetesVersion.V1_16,\n  coreDnsComputeType: eks.CoreDnsComputeType.EC2,\n  kubectlLayer,\n});\nIf you're interested in this issue, see the cdk issue for details.\nmanage the lifecycle of helm chart deployment\nA Kubernetes Helm chart might create AWS resources outside the scope of CloudFormation. You have to manage the lifecycle of those resources yourself.\nFor example, suppose an EKS cluster runs the AWS load balancer controller and you deploy a Helm chart with an ingress that creates an ALB/NLB; you must clean up those load balancers when the chart is deleted. 
Also the uninstall of Helm chart is asynchronous, you have to watch the deletion of resource completing before continuing to clean other resources.\nTHE END The tips will be updated when something new is found or the one is deprecated after CDK is updated.\nHAPPY CDK \u0026#x1f606;\n","link":"https://kane.mx/posts/2020/effective-aws-cdk-for-aws-cloudformation/","section":"posts","tags":["Infrastructure as Code","AWS CloudFormation","AWS CDK","AWS"],"title":"Effective AWS CDK for AWS CloudFormation"},{"body":"近期Amazon Builders Library发布了数篇文章介绍亚马逊如何实践持续部署，同时分享了亚马逊在部署方面的最佳实践。\n这里将这三篇文章核心内容做个概述，方便大家按需细读。\nGoing faster with continuous delivery 这篇文章先是分享了亚马逊持续改进和软件自动化的文化(Amazonian随时都惦记着的领导力准则)，然后介绍了亚马逊内部的持续部署工具Pipelines。从一个试点工具进化为亚马逊标准、一致且简洁的发布工具。并且将构建和发布软件的最佳实践检查也融入到Pipelines中。\n接下来是分享如何减小故障影响到客户的风险。有过软件开发经验的都知道，软件变更引入故障是不可避免的，如何将故障对客户的影响控制到最小是非常重要的。该文从下面几个方面给出了建议，\n部署卫生，如对新部署程序的健康检查 上生产系统之前的测试，自动化单元、集成、预生产测试 生产系统上的验证，分批的部署，控制故障影响半径 控制何时发布软件 最后作者介绍了亚马逊如何快速执行业务创新 -- 通过自动化一切事情。\nAutomating safe, hands-off deployments 这篇文章很好的呼应了Going faster with continuous delivery一文中如何避免新的部署导致故障影响，非常详细的介绍了亚马逊关于自动化安全部署的实践。\n对于持续部署，源码 -\u0026gt; 构建 -\u0026gt; 测试 -\u0026gt; 生产 这个流程大家都很熟悉。\n从下图看，亚马逊对于源码和构建的理解是非常深入和全面的。\n源码并不仅仅是应用程序源代码，还可以包括运维工具代码、测试代码、基础架构代码、静态资源、依赖库、配置和操作系统补丁。\n代码审核是必须的。对于全自动的流水线，代码审核是最后一道人工核验。代码审核不仅仅是审核代码的正确性，还应该检查代码是否包括足够的测试，是否有完善的工具来监测部署以及能否安全的回退。\n同时构建也不光是编译源代码，打包并存储构件。也包含单元测试，静态代码分析，代码覆盖率检查，代码审核检查。\n测试在亚马逊是一个多阶段的预生产环境，详见下图。\n集成测试是自动化的模拟客户一样使用服务，实现端到端的测试。部署到生产之前，还需要执行向后兼容性测试以及借助负载均衡实现one-box测试。\nAWS服务是部署在全球多个区域内的多个可用区，为了减少部署故障对客户的影响，生产通过波次部署来分批分阶段的安全部署。\n首先部署是在单区域的单可用区做one-box部署，如果引起负面问题，会自动回退并停止生产后续的部署。系统指标的监控是实现自动化安全部署的核心，需要通过监控的指标来自动触发部署回退。\nBake time也是实践经验总结出来的精髓。有时故障不是在部署后马上显现的，需要时间才会逐渐显现。设置合理的Bake time，能够让故障有足够时间被暴露出来，不至于照成大范围影响。\nEnsuring rollback safety during deployments 因为故障是不可避免的，部署能够被安全回退是非常必要的。这篇文章就详细介绍了如何实现可安全回退的部署 -- 
通过两阶段部署的技术，以及序列化的最佳实践。\n这三篇文章分别从术和器的角度分享了亚马逊在软件部署的实践经验，开发者们可以结合自身业务情况集成适合的最佳实践。\n","link":"https://kane.mx/posts/2020/the-best-practise-of-deployment-at-amazon/","section":"posts","tags":["DevOps","Continuous Deployment","Amazon Builders' Library","System Design"],"title":"亚马逊的部署最佳实践"},{"body":"","link":"https://kane.mx/tags/aws-step-functions/","section":"tags","tags":null,"title":"AWS Step Functions"},{"body":"AWS CDK是编排部署AWS云上资源最佳的工具之一。基于AWS CDK的应用应该如何实践DevOps持续集成和部署呢？\n通常我们有下面几种方法，\n使用AWS CodePipeline来完成DevOps pipeline搭建。CodePipeline是AWS Code系列服务中的持续集成编排工具，它可以集成CodeBuild项目，在CodeBuild项目build中安装cdk，并执行cdk deploy命令来实现应用部署。 这种方法简单直接的实现了DevOps部署流水线。但缺少staging，将最新提交直接部署到生产是一种非常高风险的做法。\nCDK近期发布了体验性的新特性CDK Pipelines来封装CDK应用持续部署流水线的配置。CDK Pipelines也是基于AWS CodePipeline服务，提供快速创建可跨账号区域的持续部署流水线，同时支持部署流水线项目的自升级更新。整个流水线流程如下图所示， CDK Pipelines是非常高效且灵活的持续部署流水线创建的方式，但由于是体验性特性，在生产应用中还有一些局限性。例如，\n不支持context provider查找。也就是说，无法支持CDK应用查找账户中存在的VPC，R53 HostedZone等。 由于CDK Pipelines实际是使用CodePipeline来编排部署流水线，CodePipeline的局限性，CDK Pipelines同样存在。 CodePipeline在某些分区和区域还不可用。例如，AWS中国区暂时还没有CodePipeline服务，CDK Pipelines在AWS中国区也就无法使用。 使用AWS Step Functions来编排CDK应用部署的流水线。在Step Functions编译的部署流水线中，可用通过CodeBuild项目来完成cdk deploy执行做到完整的支持CDK的所有功能。同时Step Functions具备最大的灵活性来支持持续部署过程中的各种编排需求，例如，跨账户部署应用的不同stage，引入人工审批流程，通过Slack等chatops工具来完成审批。 Opentuna项目就实践了用Step Functions来编排持续部署流水线。整个部署流程如下图，\n如果对基于Step Functions实现的CDK应用持续部署感兴趣，可以访问OpenTUNA项目实现的源码了解更多细节。\n","link":"https://kane.mx/posts/2020/deploy-aws-cdk-applications-cross-accounts/","section":"posts","tags":["AWS","AWS CDK","DevOps","AWS Step Functions"],"title":"跨账号跨区域部署AWS CDK编排的应用"},{"body":"Sonatype Nexus repository OSS is an artifact repository that supports most software repositories such as Maven, Pypi, Npmjs, Rubygems, Yum, Apt, Docker registry and etc. 
In the enterprise Nexus repository is widely used for storing proprietary artifacts and caching the artifacts for speeding up the devops.\nBuilding a production ready Nexus repository always is a requirement for devops team, it should satisfy below criterias at least,\nartifacts storage management It's difficult to predicate the storage usage of artifacts, allocating large volume is not cost optimized. the durability of nexus3 data storage We need a way to make sure data storage of nexus when updating Nexus OSS to newer version or recover the service from unhealthy status. self healing capability when the service is down A reliable way recovers the Nexus repository OSS when it's unhealth. There is a well-architected solution(maintained by AWS team) to quickly(~10 minutes) deploy Nexus OSS leveraging below capabilities,\nHost on EKS cluster using managed EC2 nodes with IRSA Expose service via AWS Application load balancer managed by AWS load balancer controller(former ALB Ingress Controller) Use dedicated S3 bucket for storing Nexus OSS blobstore with ulimited and on-demand storage Use EFS, EFS CSI Driver, PV and PVC storing nexus data Use Helm to deploy Sonatype Nexus chart Optional Use External DNS to registry the domain record of Nexus repository to Route 53 Optional Use AWS Certificate Manager to create SSL certificate of domain name of Nexus repository Enjoy it\u0026#x1f60f;\n","link":"https://kane.mx/posts/2020/deploy-sonatype-nexus-oss-on-eks/","section":"posts","tags":["Amazon EKS","Kubernetes","Helm","AWS CDK","AWS","Sonatype Nexus"],"title":"Deploy Sonatype Nexus repository OSS on EKS"},{"body":"","link":"https://kane.mx/tags/aws-athena/","section":"tags","tags":null,"title":"AWS Athena"},{"body":"","link":"https://kane.mx/tags/big-data/","section":"tags","tags":null,"title":"Big Data"},{"body":"","link":"https://kane.mx/tags/cloud-native/","section":"tags","tags":null,"title":"Cloud 
Native"},{"body":"","link":"https://kane.mx/tags/data-lakes/","section":"tags","tags":null,"title":"Data Lakes"},{"body":"","link":"https://kane.mx/tags/docker/","section":"tags","tags":null,"title":"Docker"},{"body":"","link":"https://kane.mx/tags/%E4%BA%91%E8%AE%A1%E7%AE%97/","section":"tags","tags":null,"title":"云计算"},{"body":"近期对Docker镜像做了些数据分析，这里分享一下利用云原生技术快速且低成本的实现任意数量的数据分析。\n之前通过文章介绍了不用拉取Docker镜像就可获取镜像的大小的一种方法，通过其中的示例脚本，我们可以获取到待分析的原始数据。\n比如nginx镜像的部分原始数据(csv格式)如下，\n1 2 3 4 5 6 7 8 9 10 11 12 1.18.0-alpine,sha256:676b8117782d9e8c20af8e1b19356f64acc76c981f3a65c66e33a9874877892a,amd64,linux,null,null,\u0026#34;sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08\u0026#34;,2813316 1.18.0-alpine,sha256:676b8117782d9e8c20af8e1b19356f64acc76c981f3a65c66e33a9874877892a,amd64,linux,null,null,\u0026#34;sha256:6ade829cd166df9b2331da48e3e60342aef9f95e1e45cde8d20e6b01be7e6d9a\u0026#34;,6477096 1.18.0-alpine,sha256:70feed62d5204358ed500463c0953dce6c269a0ebeef147a107422a2c78799a9,arm,linux,v6,null,\u0026#34;sha256:b9e3228833e92f0688e0f87234e75965e62e47cfbb9ca8cc5fa19c2e7cd13f80\u0026#34;,2619936 1.18.0-alpine,sha256:70feed62d5204358ed500463c0953dce6c269a0ebeef147a107422a2c78799a9,arm,linux,v6,null,\u0026#34;sha256:a03f81873d278ad248976b107883f0452d33c6f907ebcdd832a6041f1d33118a\u0026#34;,6080562 1.18.0-alpine,sha256:2ba714ccbdc4c2a7b5a5673ebbc8f28e159cf2687a664d540dcb91d325934f32,arm64,linux,v8,null,\u0026#34;sha256:29e5d40040c18c692ed73df24511071725b74956ca1a61fe6056a651d86a13bd\u0026#34;,2724424 1.18.0-alpine,sha256:2ba714ccbdc4c2a7b5a5673ebbc8f28e159cf2687a664d540dcb91d325934f32,arm64,linux,v8,null,\u0026#34;sha256:806787fcd4f9e2f814506fb53e81b6fb33f9eea04e5b537b31fa5fb601a497ee\u0026#34;,6423816 1.18.0-alpine,sha256:6d6f19360150548bbb568ecd3e1affabbdce0fcc39156e70fbae8a0aa656541a,386,linux,null,null,\u0026#34;sha256:2826c1e79865da7e0da0a993a2a38db61c3911e05b5df617439a86d4deac90fb\u0026#34;,2808418 
1.18.0-alpine,sha256:6d6f19360150548bbb568ecd3e1affabbdce0fcc39156e70fbae8a0aa656541a,386,linux,null,null,\u0026#34;sha256:f2ab0e3b0ff04d1695df322540631708c42b0a68925788de2290c9497e44fef3\u0026#34;,6845295 1.18.0-alpine,sha256:c0684c6ee14c7383e4ef1d458edf3535cd62b432eeba6b03ddf0d880633207da,ppc64le,linux,null,null,\u0026#34;sha256:9a8fdc5b698322331ee7eba7dd6f66f3a4e956554db22dd1e834d519415b4f8e\u0026#34;,2821843 1.18.0-alpine,sha256:c0684c6ee14c7383e4ef1d458edf3535cd62b432eeba6b03ddf0d880633207da,ppc64le,linux,null,null,\u0026#34;sha256:30a37aac8b54a38e14e378f5122186373cf233951783587517243e342728a828\u0026#34;,6746511 1.18.0-alpine,sha256:714439fec7e1f55c29b57552213e45c96bbfeefddea2b3b30d7568591966c914,s390x,linux,null,null,\u0026#34;sha256:7184c046fdf17da4c16ca482e5ede36e1f2d41ac8cea9c036e488fd149d6e8e7\u0026#34;,2582859 1.18.0-alpine,sha256:714439fec7e1f55c29b57552213e45c96bbfeefddea2b3b30d7568591966c914,s390x,linux,null,null,\u0026#34;sha256:214dff8a034aad01facf6cf63613ed78e9d23d9a6345f1dee2ad871d6f94b689\u0026#34;,6569410 各列的含义分别是，镜像tag, 镜像Digest, 镜像对应平台的Architecture, 镜像对应平台的OS, 镜像对应平台的变种（例如，ARM的v7, v8等）, 镜像对应平台的OS版本, 镜像组成层的Digest, 镜像组成层的大小。\n上面nginx镜像的示例数据，告诉我们镜像名nginx且tag为1.18.0-alpine的镜像包含了amd64-linux, arm-linux-v6, arm64-linux-v8, 386-linux, ppc64le-linux以及s390x-linux共5种Arch合计6个版本的镜像。且每个平台的对应镜像包含了两个层以及这两个层的大小。\n当我们有了成百数千甚至海量镜像的原始数据后，如何能快速且低成本的分析这些数据呢？\n在AWS上，我们可以利用数据湖相关的系列产品来实现低成本的交互式分析。\n在Docker镜像分析这个场景下，我已经获取到了待分析镜像的平台、层等数据。我将这些数据上传到Amazon S3作为数据湖的数据源。 接下来使用AWS Glue以S3中的数据创建Table并且从中提前数据的metadata。同时做数据分区，为接下来的查询做性能和成本优化。 打开Amazon Athena，根据业务需求通过SQL语句查询分析Docker镜像数据。 就是通过以上3个简单步骤，我就得到了一个无服务器架构的Docker镜像数据分析应用！整个应用完全是按量计费的，主要成本包括S3对象存储费用，和Athena费用（根据每次查询扫描数据的大小来计算）。\n使用该分析应用，我统计了Docker Hub官方镜像中包含层最多的10个镜像(分平台统计)， 最后，得力于AWS Infra as Code的强大能力，整个应用也是通过代码管理的且开源的，有兴趣的读者也可以部署自己的分析应用。\n","link":"https://kane.mx/posts/2020/serverless-docker-images-analytics/","section":"posts","tags":["云计算","AWS","Big Data","Data Lakes","Analytics","AWS Athena","Cloud 
Native","docker"],"title":"无服务器架构的Docker镜像数据分析应用"},{"body":"Recently I needed to collect statistics on the size of some Docker images. Pulling them all first and then calculating the size of each image would be wasteful. Also, a Docker image consists of layers that are probably shared with other images, so simply summing the size of each image does not give the real disk usage.\nIs there any way to get the size of a Docker image without pulling it?\nThe answer is definitely yes. Docker images are hosted by a Docker Registry, whose API is defined by a public specification. The latest version of the Registry API, V2, can fetch the manifest of an image, and the manifest contains the size of every layer. Using the manifest API satisfies my requirement!\nOne more thing to note: a V2 registry is still compatible with the manifest schema V1. You have to properly handle mixed manifest responses when you query the manifest of an image.\nI created a simple shell script that gracefully handles both V1 and V2 manifest responses. It can calculate the total layer size of a Docker image with a specific tag, or the size of all tags of a Docker image.\nThe script was inspired by this post. 
Hope you enjoy it.\n","link":"https://kane.mx/posts/2020/get-docker-image-size-without-pulling-image/","section":"posts","tags":["docker","Tip"],"title":"Get the size of Docker image without pulling image"},{"body":"","link":"https://kane.mx/tags/oh-my-zsh/","section":"tags","tags":null,"title":"Oh-My-Zsh"},{"body":"Z shell搭配oh-my-zsh自定义配置已成为众多Linux/Macosx用户的标准terminal配置。\n最近遇到在zsh中执行任意命令都变得特别慢(哪怕简单执行ls也要花费肉眼可见的1，2秒钟)，这里记录下如何排查Z shell下启用oh-my-zsh的性能问题。\n性能问题症状 突然某天起在终端中执行任意命令，都至少要花费1，2秒（肉眼计数），该命令才会完成执行并退出到终端开始接受新的输入。\n我当前主要使用的终端是iTerm2，执行命令后，在终端Tab的title bar上能显式的看到git命令也被执行了。\n尝试了其他的shell，比如bash，是没有这个问题。基本断定问题同zsh相关。更多问题描述乃至动画截屏，可以参见这个issue。\nzsh + oh-my-zsh 性能问题分析 oh-my-zsh其实就是提供zsh的定制化配置，主要包括Theme主题和各种软件的插件。\noh-my-zsh 插件 通常oh-my-zsh中内置或三方社区提供的插件是导致性能降低甚至互相冲突的主要原因。排查思路也很简单，通过逐个禁用已加载的插件来测试是否可以解决问题。\n用文本编辑器打开当前用户的~/.zshrc配置，找到plugins开头的配置行，例如，\n1plugins=( 2 git 3 osx 4## gradle 5 brew 6## command-not-found 7 github 8# gnu-utils 9## mvn 10 python 11 pip 12# screen 13 vi-mode 14 docker 15## docker-compose 16 node 17## spring 18 mosh 19# httpie 20## sudo 21 tmux 22## kubectl 23## helm 24 golang 25 history 26 history-substring-search 27 zsh-autosuggestions 28 zsh-syntax-highlighting 29) 通过行首添加#来禁用oh-my-zsh插件，启动新的终端窗口或tab来验证是否该插件是引起问题的根源。\n在我的配置中，出现过因为启用过多插件，导致新建终端需要10来秒钟。但因为创建终端不是一个高频的需求，这个性能通常来说还是可以忍受。\noh-my-zsh 主题 在我的这个问题中，即使将所有插件都禁用了，命令执行后退出速度还是没有改善，git命令仍然有被执行。这时我尝试更换不同oh-my-zsh内置主题来排查问题。但是使用了包括默认主题robbyrussell，极简主题ys在内的多个主题都无法解决该问题。\n最后直接禁止oh-my-zsh使用主题，问题没有了！\n然而oh-my-zsh主题是对zsh的极大增强，改善了默认的用户体验，没有主题扩展使用起来会非常不习惯。\n小结 最终试用了另一个社区维护的知名zsh主题Pure，性能问题得到了解决\u0026#x270c;\u0026#xfe0f; 同时也满足了主题对zsh输入输出用户体验的增强 
\u0026#x1f60a;\n希望这里分享的oh-my-zsh性能的调优思路，可以帮助到有类似需要的各位。\n将来社区对这个问题如有进一步的反馈，将会做更新。\n","link":"https://kane.mx/posts/2020/zsh-performance-tuning/","section":"posts","tags":["zsh","oh-my-zsh","performance-tuning","trobule-shooting"],"title":"oh-my-zsh性能调优思路"},{"body":"","link":"https://kane.mx/tags/performance-tuning/","section":"tags","tags":null,"title":"Performance-Tuning"},{"body":"","link":"https://kane.mx/tags/trobule-shooting/","section":"tags","tags":null,"title":"Trobule-Shooting"},{"body":"","link":"https://kane.mx/tags/zsh/","section":"tags","tags":null,"title":"Zsh"},{"body":"","link":"https://kane.mx/tags/codecommit/","section":"tags","tags":null,"title":"CodeCommit"},{"body":"Github/Gitlab已经成为众多开发者非常熟悉的代码协作平台，通过他们参与开源项目或实施企业内部项目协作。\nAWS也提供了托管的、基于Git、安全且高可用的代码服务CodeCommit。CodeCommit主要针对企业用户场景，所以他并没有社交功能以及代码仓库fork功能，是否CodeCommit就无法实现Github基于Pull Request的协同工作模式呢？\n答案是，CodeCommit完全可以实现基于Pull Request的代码协作。由于Git的分布式代码管理特性，首先fork上游项目仓库，将修改后的代码提交到fork仓库，通过Pull Request申请修改请求合并。Github将这套协作流程推广开来并被开源项目广泛采用。其实还有另外的Git仓库协同方式来完成多人的协作开发，例如Gerrit Code Review。目前Android、Eclipse Foundation下面的各种项目都在使用Gerrit作为协同开发工具。Gerrit通过控制同一个代码仓库中不同角色的用户可提交代码分支的权限来实现代码贡献、Review、持续集成以及协同开发的。\nCodeCommit作为AWS托管的服务，同IAM认证和授权管理做了很好的集成。完全可以通过IAM Policy的设置，为同一个代码仓库中不同用户角色设置不同的权限。使用类似Gerrit的权限控制思路，\n任意代码仓库协作者可以提交代码到特定含义的分支，例如，features/*, bugs/*。可以允许多人协同工作在某一特定分支上。协作者同时可以创建新的Pull Request请求合并代码到主分支，例如master或者mainline。 代码仓库Master/Owner有权限合并Pull Request。 拒绝任何人直接推送代码到仓库主分支，包括仓库Owner/Admin。 监听仓库Pull Request创建和PR源分支更新事件，自动触发该PR对应分支的automation build，编译、测试等通过后，自动为PR的通过投票+1。反之若失败，则取消投票。 为代码仓库设置PR Review规则，至少需要收到PR automation build和仓库Master/Owner合计两票通过才允许合并代码。 监听代码仓库主分支，任意新提交将触发自动化发布Build。将最新变更在整个系统上做集成。 是不是很棒！完全做到了Github、Github Pull Request、Github Action/Travis CI整套devops协同开发的流程。\n协作流程如下图， 同时，以上整套基于CodeCommit代码管理的devops工作流程可以利用CloudFormation实现AWS资源编排，将Devops依赖的Infra使用代码来做管理！这样的好处是，企业内部即使有数百数千甚至更多代码仓库都可以统一管理，新仓库的申请也可以通过Infra代码的PR，在通过审批合并后自动从AWS 
provisioning创建出符合企业管理要求的安全代码仓库。很酷吧\u0026#x1f606;\n这里有一套完整的创建以上工作流的演示，有兴趣的读者可以在自己的账户内体验。整套方案完全使用的是AWS托管服务，仅按实际使用量(如使用CodeBuild编译了代码)计费。\n","link":"https://kane.mx/posts/2020/codecommit-devops-model/","section":"posts","tags":["云计算","AWS","Devops","CodeCommit","Git"],"title":"基于CodeCommit代码管理的无服务器架构Devops"},{"body":"","link":"https://kane.mx/tags/aws-api-gateway/","section":"tags","tags":null,"title":"AWS API Gateway"},{"body":"AWS在3月12日正式发布了新一代的API网关 -- HTTP APIs。AWS发布的第一代API Gateway服务已经快5年了，通过这些年来大规模服务客户的心得以及客户反馈，由此重新构建了更快（相比第一代网关60%的延迟减少）、更便宜（至少节省71%的费用）、更易用的第二代网关服务。\n除了性能、费用、易用性的大幅度改进之外，在HTTP APIs发布博客中着重介绍了以下新特性，\nHTTP APIs网关可同私有VPC内的负载均衡(ALB/NLB)，服务发现(Cloup Map)集成。意味着可将目前最流行且普遍应用的容器服务作为API后端 可以将自定义域名的API路径混合映射到第一代的REST APIs和最新的HTTP APIs 请求限流的改进。支持对不同stage以及请求路由分别设置不同的限流 Stage变量。可以将Stage变量传递给API网关后端的服务。同时支持路由在不同的stage动态集成不同的后端Lambda函数 Lambda集成时使用Payload version 2.0。Version 2.0格式提供了更多的灵活性及简化了数据格式 支持导入 Swagger / OpenAPI 配置文件 如果对HTTP APIs感兴趣，可以尝试在自己的账户内部署这个示例。这个示例演示了如何按需使用AWS Batch服务进行批量任务计算，同时将任务提交和查询状态通过HTTP接口提供出来。该示例支持部署时选用不同的AWS服务（ALB、REST APIs或HTTP APIs）来提供这些API接口访问。整个示例都是基于无服务器架构实现的，不进行批量计算是不产生任何费用的哦\u0026#x1f604;。\n","link":"https://kane.mx/posts/2020/new-http-apis-of-api-gateway/","section":"posts","tags":["云计算","FaaS","AWS","AWS API Gateway","Serverless Computing"],"title":"AWS发布更快、更便宜、更易用的HTTP APIs"},{"body":"","link":"https://kane.mx/tags/faas/","section":"tags","tags":null,"title":"FaaS"},{"body":"在re:Invent 2019之前，AWS Toolkit发布了Cloud Debugging beta功能。该功能支持在IntelliJ IDEs(IntelliJ, PyCharm, Webstorm, 以及 Rider)中远程调试 ECS Fargate 容器中执行的应用程序。\n\b对ECS Fargate demo启用了远程调试并调试成功后，这里记录一下该功能的使用体验并且分享体验过程中掉进去过的一些坑。\n试用体验 首先，该功能不适用于生产环境。因为对ECS Fargate类型的Service启用Cloud Debugging功能会将原始的ECS Services收缩为0个task副本，同时创建一个新的Service并启用新的Task Definition，新的Task Definition中会加入Cloud Debug Sidecar容器来辅助实现远程调试。整个过程会对生产环境造成变更。 如果ECS集群是通过CI/CD持续部署，并且是多人协同使用的环境，该功能也不适用。因为，对某些容器服务启用Cloud Debugging将导致他人的持续部署失败或不生效。 启用Cloud Debugging操作比较麻烦，且启用状态下无法更新ECS中部署的版本。需要先停用Cloud 
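Debugging, deploy the new code, and then enable Cloud Debugging again to debug the new code. Try not to rely on Cloud Debugging; put the effort into solid unit tests, integration tests and E2E tests so that debugging in the cloud is unnecessary. Lessons learned After enabling Cloud Debugging according to the official documentation, creating a Cloud Debugging Launch Configuration and starting to debug, I hit the error **Retrieve execution role finished exceptionally**. The cause is that the documentation does not mention that the Cloud Debug Sidecar needs the logs:CreateLogStream permission to create a CloudWatch Logs stream. The fix is to add the logs:CreateLogStream permission to the ECS Task Execution Role. The current AWS Toolkit for JetBrains version, 1.9-193, does not support tasks with App Mesh or X-Ray enabled. The fix is to temporarily disable App Mesh and X-Ray on the task that needs Cloud Debugging. Cloud Debugging is a nice idea for a development tool that helps developers build Cloud Native applications. But it is still an early-stage project with many issues to fix and improve.\n","link":"https://kane.mx/posts/2019/aws-cloud-debugging/","section":"posts","tags":["AWS","AWS Toolkit","AWS ECS","AWS Fargate","IntelliJ IDEs"],"title":"A first look at AWS Cloud Debugging"},{"body":"","link":"https://kane.mx/tags/aws-ecs/","section":"tags","tags":null,"title":"AWS ECS"},{"body":"","link":"https://kane.mx/tags/aws-fargate/","section":"tags","tags":null,"title":"AWS Fargate"},{"body":"","link":"https://kane.mx/tags/aws-toolkit/","section":"tags","tags":null,"title":"AWS Toolkit"},{"body":"","link":"https://kane.mx/tags/intellij-ides/","section":"tags","tags":null,"title":"IntelliJ IDEs"},{"body":"AWS Batch is a fully managed batch scheduling service. It manages all the infrastructure for you, avoiding the complexity of provisioning, managing, monitoring and scaling batch computing jobs. AWS Batch is natively integrated with the AWS platform, so users can take advantage of AWS scaling, networking and access management, and easily run jobs that securely read from and write to AWS data stores such as Amazon S3 and Amazon DynamoDB. AWS Batch provisions compute resources and optimizes job placement based on the number and resource requirements of submitted jobs. It scales compute dynamically to whatever is needed to run the batch jobs, so you are not constrained by a fixed-capacity cluster. AWS Batch can also use Spot Instances to further reduce the cost of running batch jobs.\nAWS Batch itself is free; you pay only for the EC2 instances actually used.\nI created a Batch App demo to show how AWS Batch is used. The example submits batch jobs through a RESTful API exposed via ALB + a Lambda function. When triggered, the Lambda function sends the new job request to SQS. Another Lambda then consumes the SQS queue, calls the AWS Batch API to submit a new batch job, and stores the job information in DynamoDB. The demo also builds the Docker image used by the Batch jobs and pushes it to ECR in advance. The Batch configuration defines the EC2 instance types to use (c5 family, with both Spot and On-Demand instances, Spot preferred), and the default instance count is 0 (instances are terminated when there are no runnable jobs). Submitted jobs are divided into compute jobs and an aggregation job; the aggregation job depends on all compute jobs and starts only after they finish. Finally, another RESTful endpoint, also implemented with ALB + a Lambda function, queries the final result of the computation.\nEnjoy this Batch App demo 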
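orchestrated by AWS CDK.\n","link":"https://kane.mx/posts/2019/aws-batch/","section":"posts","tags":["AWS","Batch","Infrastructure as Code"],"title":"Introduction to AWS Batch"},{"body":"","link":"https://kane.mx/tags/batch/","section":"tags","tags":null,"title":"Batch"},{"body":"Once you own a domain, you usually want a few mailboxes under it for different purposes, without paying for that\u0026#x1f603;. A free business mailbox (such as NetEase or Alibaba Cloud enterprise mail) is one option. You then need to configure mail addresses and a mail client to receive mail, which is tedious with several addresses. And sometimes these mailbox services inexplicably lose mail.\nIn this scenario a mail forwarding service is a very good solution. There is no need to run a mail server or sign up for a free mailbox; just point the MX records of the domain at the forwarding service and use a DNS TXT record to configure the forwarding rule, and all mail sent to addresses under your domain is forwarded to an existing mailbox!\u0026#x1f192;\nI especially recommend Forward Email, a free and open source mail forwarding service.\nAs described above, you only need to create the following 3 DNS records for the domain mydomain.com,\nName TTL Record type Priority Value @ or blank 3600 MX 10 mx1.forwardemail.net @ or blank 3600 MX 20 mx2.forwardemail.net @ or blank 3600 TXT 20 forward-email=niftylettuce@gmail.com All mail sent to @mydomain.com is forwarded to the mailbox niftylettuce@gmail.com.\u0026#x270c;\u0026#xfe0f;\nSee the documentation for more configuration options.\nAfter several years as a paying Forward Email user, I moved this setup to the free Cloudflare Email Routing, which provides the same functionality. Because it requires the DNS of the domain to be hosted on Cloudflare, configuration took just a few clicks. 😊✌️\n","link":"https://kane.mx/posts/2019/email-forwarding/","section":"posts","tags":["邮件转发","技巧"],"title":"Free email forwarding services"},{"body":"","link":"https://kane.mx/tags/%E6%8A%80%E5%B7%A7/","section":"tags","tags":null,"title":"技巧"},{"body":"","link":"https://kane.mx/tags/%E9%82%AE%E4%BB%B6%E8%BD%AC%E5%8F%91/","section":"tags","tags":null,"title":"邮件转发"},{"body":"","link":"https://kane.mx/tags/edas/","section":"tags","tags":null,"title":"EDAS"},{"body":"","link":"https://kane.mx/tags/microservice/","section":"tags","tags":null,"title":"Microservice"},{"body":"","link":"https://kane.mx/tags/migration/","section":"tags","tags":null,"title":"Migration"},{"body":"I recently migrated an Alibaba Cloud EDAS microservice application to AWS, and share the migration approach here.\nThe approach covers the following three areas,\nMicroservice application cluster. On AWS the microservices are deployed on an ECS cluster, with Cloud Map for service registration and discovery and App Mesh for traffic control between services. For more detailed migration points and the corresponding solutions, see the deck below. DevOps pipeline. Managed CodePipeline and CodeBuild implement CI/CD. Infra as Code. Using the strong Infra as Code capabilities of AWS, the cloud infrastructure and the microservice orchestration are implemented in CDK code. 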
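Below is the deck for the migration approach. The complete, deployable PoC code is here.\n","link":"https://kane.mx/posts/2019/aliyun-edas-migration-in-action/","section":"posts","tags":["AWS","EDAS","Migration","Microservice","Infrastructure as Code"],"title":"Migrating an Aliyun EDAS application to AWS in practice"},{"body":"","link":"https://kane.mx/tags/analysis/","section":"tags","tags":null,"title":"Analysis"},{"body":"Managed RDS databases are a very mature cloud service. Most cloud users adopt RDS to improve database availability while reducing database operations work.\nAWS RDS supports enabling and querying various database logs, including general logs, slow query logs, error logs and audit logs. However, the default RDS log viewer is little more than a text viewer; it cannot compute statistics over log data or browse rotated historical archives.\nThis post describes how to build a serverless real-time log analysis and reporting system using cloud native services on AWS. The idea comes from a post on the AWS China blog, which describes using CloudWatch Logs, Kinesis Firehose, Athena and QuickSight to analyze Amazon Aurora database audit logs in real time.\nA complete AWS CDK application is provided here that implements the architecture from that blog post. RDS audit logs flow through a CloudWatch Logs -\u0026gt; Kinesis Firehose -\u0026gt; S3 data pipeline, where they are filtered, transformed, compressed and finally stored on S3, ready for the serverless query service Athena. The application also creates a Lambda function that simulates application access to the database: it periodically connects to the RDS Aurora database created by the application and runs queries or data-modifying SQL. A few minutes after the application is deployed, the Aurora audit logs can be queried and aggregated through Athena tables. Enjoy it\u0026#x1f606;\u0026#x1f606;\n","link":"https://kane.mx/posts/2019/rds-log-analysis/","section":"posts","tags":["AWS","Serverless","Analysis"],"title":"Analyzing and visualizing AWS RDS database logs"},{"body":"","link":"https://kane.mx/tags/aws-vpn/","section":"tags","tags":null,"title":"AWS VPN"},{"body":"","link":"https://kane.mx/tags/openswan/","section":"tags","tags":null,"title":"Openswan"},{"body":"","link":"https://kane.mx/tags/site-to-site-vpn/","section":"tags","tags":null,"title":"Site-to-Site VPN"},{"body":"After moving workloads to the cloud, there is often a need to connect private networks in other clouds, data centers or offices to the private network in the cloud. AWS Site-to-Site VPN is the managed VPN service from AWS; on the private network at the other end we can use Openswan to establish a secure IPSec connection with the AWS VPC network.\nThe detailed configuration steps are as follows,\nIf connecting a data center or office, it needs a public IP. On another public cloud, create an EC2 instance with a public IP, or use an EIP. If Openswan is configured on an AWS EC2 instance, disable the Source/Destination Check of that instance. On AWS, create a Virtual Private Gateway and a Customer Gateway (specifying the public IP of the peer as the static route). On AWS, create a Site-to-Site VPN connection using the Virtual Private Gateway and Customer Gateway created in the previous step. Install openswan on the peer machine. sudo yum install openswan Edit /etc/sysctl.conf and make sure it contains the following settings, net.ipv4.ip_forward = 1 net.ipv4.conf.default.rp_filter = 0 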
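net.ipv4.conf.default.accept_source_route = 0 Reload the sysctl configuration and restart the network service. sudo sysctl -p sudo service network restart Edit /etc/ipsec.conf and make sure include /etc/ipsec.d/*.conf is not commented out. Create /etc/ipsec.d/aws.conf, copying its content from the recommended Openswan configuration of the VPN connection created above. Create /etc/ipsec.d/aws.secrets, copying its content from the Openswan configuration of the same VPN connection. Start the ipsec service. # Start the ipsec service. sudo service ipsec start # Check the logs. sudo service ipsec status sudo ipsec auto --status The configuration above was verified on Amazon Linux and CentOS 6.9. On newer Linux distributions such as Amazon Linux 2 and CentOS 7, however, starting the ipsec service fails with the following error, Starting Internet Key Exchange (IKE) Protocol Daemon for IPsec... ERROR: /etc/ipsec.d/aws.conf: 12: keyword auth, invalid value: esp The fix is to delete the unsupported line auth=esp from the Openswan configuration downloaded from AWS Site-to-Site VPN.\n","link":"https://kane.mx/posts/2019/using-openswan-connect-aws-vpn/","section":"posts","tags":["AWS","AWS VPN","Site-to-Site VPN","Openswan"],"title":"Connecting to an AWS VPC with Openswan"},{"body":"Infrastructure as Code has long been an important measure of whether a public cloud supports good operations. As the cloud computing leader, AWS orchestrates infrastructure resources through CloudFormation. But CloudFormation is a declarative language written in YAML/JSON: it handles logic poorly, is verbose to write and hard to debug, and presents a considerable learning curve for DevOps engineers new to it. The third-party open source tool Terraform does not solve these CloudFormation problems well either.\nAWS CDK solves most of the current problems with CloudFormation and greatly improves the efficiency of developing and maintaining infrastructure code.\nAWS CDK is an open source software development framework that lets developers model and provision cloud application resources in a familiar programming language; it currently supports TypeScript/JavaScript, Python, Java and .NET. AWS CDK abstracts cloud resources as objects, and creates or updates them from very simple syntax that describes the resource objects or sets their properties (overriding the CDK defaults).\nFor example, the few lines below create a new VPC named Gameday, with a public subnet and a private subnet in each of two availability zones.\nthis.vpc = new ec2.Vpc(this, \u0026#39;Gameday\u0026#39;, { cidr: \u0026#39;10.0.0.0/16\u0026#39;, maxAzs: 2, subnetConfiguration: [ { cidrMask: 24, name: \u0026#39;Public\u0026#39;, subnetType: SubnetType.PUBLIC }, { cidrMask: 24, name: \u0026#39;Private\u0026#39;, subnetType: SubnetType.PRIVATE } ] }); I created two sample projects that use AWS CDK to quickly create an application environment and deploy the application,\nGameday orchestrates a complete environment for a web application running on ECS, including VPC, RDS Aurora, NAT Gateway, security groups, ECS cluster, ECS Task definition and ALB load balancing. Serverless Domain Redirect builds a serverless domain redirect service on AWS. Depending on the configuration parameters, it offers two solutions, based on either S3 + CloudFront + Route 53 or Lambda + API Gateway + 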
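Route 53. Overall, AWS CDK is a very worthwhile way to orchestrate and manage cloud resources, and it manages AWS resources efficiently.\nBecause CDK is still relatively young, it is not yet fully mature. I found the following issues worth noting.\nA CDK program ultimately generates a CloudFormation template and submits it to CloudFormation to apply the resource changes. The core user experience therefore depends on CloudFormation. CloudFormation create and rollback timeouts are long and often hurt the deployment experience. Also, when cleaning up, some resources, such as Aurora clusters, cannot be removed and there is no clear message. The CDK libraries lack configuration validation. Such errors are only discovered and returned by the resource provider after the CloudFormation deployment starts, which rolls back the whole stack; debugging a large stack takes a long time. I recommend splitting the deployment into several small stacks to shorten each deployment and ease debugging. The documentation is still thin and lacks in-depth examples, which steepens the learning curve for developers. Backward compatibility of new versions is not good enough; releases often contain breaking changes. Releases after 1.0 GA have fewer breaking changes, but they still occur. ","link":"https://kane.mx/posts/2019/aws-cdk/","section":"posts","tags":["AWS","AWS CDK","Infrastructure as Code"],"title":"Introduction to AWS CDK"},{"body":"","link":"https://kane.mx/tags/aws-s3/","section":"tags","tags":null,"title":"AWS S3"},{"body":"Businesses often need to redirect requests for one domain (A) to another domain (B). Even such a simple requirement usually means deploying a web server (such as Nginx) that returns a 302 response for requests to domain A, with a Location header pointing to domain B. With cloud services we can now do the same thing with managed services, without managing servers or maintaining an application, and at minimal cost.\nThe following describes how to meet this requirement with AWS services.\nDomain redirect with AWS S3 and AWS CloudFront Create a new S3 bucket, for example redirect.domain.com Configure the bucket properties: enable static website hosting and set it to redirect requests to the desired domain redirected-host.domain.com Create a new CloudFront distribution with the S3 bucket created in the first step as a custom origin (do not configure the origin as an S3 bucket origin). Configure it to use the custom domain redirect.domain.com. Note that custom CNAMEs require an SSL certificate for the domain; AWS Certificate Manager can issue a free SSL/TLS certificate With the DNS provider of domain.com, create a new record for redirect.domain.com Domain redirect with AWS Lambda and API Gateway Create a Lambda function that returns a 302 response, or an HTML page that redirects via JavaScript Create an API Gateway trigger for the Lambda function Create a custom domain for the API Gateway endpoint With the DNS provider of domain.com, create a new record for redirect.domain.com I created an AWS CDK based GitHub project that uses the power of AWS Infrastructure as Code to deploy both serverless setups above with one command; use it as a reference implementation if needed.\n","link":"https://kane.mx/posts/effective-cloud-computing/serverless-domain-redirect/","section":"posts","tags":["云计算","AWS","AWS S3","AWS Lambda","AWS CDK"],"title":"A serverless domain redirect service"},{"body":"","link":"https://kane.mx/tags/amazon-alexa/","section":"tags","tags":null,"title":"Amazon Alexa"},{"body":"I recently needed to do some Alexa development and installed Amazon Alexa on my phone, but kept getting the error below and could not log in.\nConnection Timed 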
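Out.\nI tried a proxy to get around network restrictions, changing the language and other approaches, but still could not log in. Finding no working solution online, I decided to capture packets and find out why my account could never log in.\nThe packets sent by Alexa during login showed that it accessed a series of services under amazon.cn and several other cn domains. Apparently some of these services no longer work properly, which causes the login error above.\nAs an app serving users worldwide, Amazon Alexa presumably detects that the phone user is in mainland China and uses the services configured for that region.\nThe idea behind the workaround is to configure the system or the app so that it cannot determine the real location of the phone; the Alexa app then uses the default global servers.\nThe workaround steps are as follows,\nInstall the Alexa app from the Play Store. If it is already installed, clear the app data. Disable the Location permission of the app (it is disabled by default). Change the system language to English and set the time zone to any US time zone. Remove the SIM card, or disable all SIM cards. Open the Alexa app and log in with an existing or newly registered Amazon account.\n","link":"https://kane.mx/posts/2019/alexa-login-issue/","section":"posts","tags":["AWS","Amazon Alexa","Troubleshoot"],"title":"Login problem with the Amazon Alexa Android app in mainland China"},{"body":"","link":"https://kane.mx/tags/troubleshoot/","section":"tags","tags":null,"title":"Troubleshoot"},{"body":"Backing up personal computer data has always been a strong need. Cloud drives and similar storage products partly meet it, but they still cannot offer both convenience and strong data safety.\nMacOSX has a built-in backup solution, Time Machine. Time Machine supports AirPort Time Capsule, NAS storage or external storage devices. But all of these depend on hardware, with capacity limits or poor portability. Now that cloud computing is mainstream, is there a backup solution that targets the object storage of a cloud vendor and gives MacOSX backups unlimited capacity and strong security? After some searching I found both the free open source tool Restic and the paid software Arq. Both Restic and Arq are standalone third-party tools for backing up to and restoring from cloud storage; is there a way to combine Time Machine itself with cloud storage?\nTime Machine supports external storage as a backup device. The method described here mounts the object storage of a cloud vendor as a local device and sets Time Machine to use it as the backup target, so that backups land in the object storage.\nBelow I show step by step how to use an AWS S3 bucket as the Time Machine backup device.\nThis method works with the object storage of any cloud vendor, as long as it can be mounted on MacOSX as a local disk.\nThere are many mature tools for mounting AWS S3 as a MacOSX disk, such as S3fs and Goofys. This post recommends Juicefs, which caches object storage metadata and greatly speeds up list, get and similar operations on the mounted disk.\nFirst install the required dependencies and the Juicefs client following the Juicefs documentation. After registering with Juicefs, create a file system to hold the backup data. Note: the bucket name here must match the name of the bucket created later or of the existing bucket. Create a new AWS S3 bucket (or use an existing one), and create an IAM user dedicated to the Juicefs client for this bucket. I strongly recommend not using the access token of the cloud account itself for mounting; the best practice is to create separate IAM users for different purposes. For more on IAM users, see the article on IAM best practices. The following AWS CLI commands create the new S3 bucket and IAM user, # Create the S3 bucket aws s3 mb s3://my-bucket-for-mac-backup # Create the IAM user aws iam create-user --user-name juicefs # Grant the juicefs user read/write access to the backup S3 bucket echo \u0026#39;{ \u0026#34;UserName\u0026#34;: \u0026#34;juicefs\u0026#34;, \u0026#34;PolicyName\u0026#34;: \u0026#34;mac-backup-bucket-all-permissions\u0026#34;, 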
\u0026#34;PolicyDocument\u0026#34;: \u0026#34;{ \\\u0026#34;Version\\\u0026#34;: \\\u0026#34;2012-10-17\\\u0026#34;, \\\u0026#34;Statement\\\u0026#34;: [ { \\\u0026#34;Effect\\\u0026#34;: \\\u0026#34;Allow\\\u0026#34;, \\\u0026#34;Action\\\u0026#34;: \\\u0026#34;s3:*\\\u0026#34;, \\\u0026#34;Resource\\\u0026#34;: [ \\\u0026#34;arn:aws-cn:s3:::my-bucket-for-mac-backup/*\\\u0026#34;, \\\u0026#34;arn:aws-cn:s3:::my-bucket-for-mac-backup\\\u0026#34; ] } ] }\u0026#34; }\u0026#39; \u0026gt; policy.json aws iam put-user-policy --cli-input-json file://policy.json # Create an access key for the juicefs user, used by the juicefs client to mount the bucket aws iam create-access-key --user-name juicefs { \u0026#34;AccessKey\u0026#34;: { \u0026#34;UserName\u0026#34;: \u0026#34;juicefs\u0026#34;, \u0026#34;AccessKeyId\u0026#34;: \u0026#34;\u0026lt;key id\u0026gt;\u0026#34;, \u0026#34;Status\u0026#34;: \u0026#34;Active\u0026#34;, \u0026#34;SecretAccessKey\u0026#34;: \u0026#34;\u0026lt;access key\u0026gt;\u0026#34;, \u0026#34;CreateDate\u0026#34;: \u0026#34;2019-06-30T15:25:41Z\u0026#34; } } Mount the S3 bucket following the Juicefs documentation. Enter the mounted directory (for example /jfs) and create a sparse image for Time Machine to write backups to. cd /jfs hdiutil create -size 600g -type SPARSEBUNDLE -fs \u0026#34;HFS+J\u0026#34; TimeMachine.sparsebundle The command above creates a 600 GB image named TimeMachine (initially it occupies only a few hundred MB; disk space is consumed only as files are written). Adjust the image size to your needs; twice the size of the Mac disk is the usual recommendation. Users unfamiliar with the command line can also create the image with Disk Utility. Mount the sparse image created earlier through Finder. Now the magic step: tell Time Machine to use the virtual device created earlier as its backup disk. sudo tmutil setdestination \u0026#34;/Volumes/Time MachineDisk\u0026#34; Because the S3 bucket holds backup data, consider enabling S3 Intelligent-Tiering or IA storage to reduce cost. You can also enable S3 KMS encryption for the data stored in the cloud to improve security.\n","link":"https://kane.mx/posts/2019/using-s3-as-device-for-mac-time-machine-backup/","section":"posts","tags":["MacOSX","AWS","AWS S3","Tip"],"title":"Using AWS S3 as the backup storage for the MacOSX backup utility (Time 
Machine)"},{"body":"","link":"https://kane.mx/tags/dingtalk/","section":"tags","tags":null,"title":"Dingtalk"},{"body":"","link":"https://kane.mx/tags/spring/","section":"tags","tags":null,"title":"Spring"},{"body":"","link":"https://kane.mx/tags/spring-cloud-function/","section":"tags","tags":null,"title":"Spring Cloud Function"},{"body":"The post on the Serverless Framework based DingTalk callback function introduced the Serverless Framework, a deployment tool that works across cloud vendors and serverless platforms. The function code, however, still has to be adapted to each serverless platform. Spring Cloud Function is a cross-platform framework built for exactly this case: one code base, packaged differently, runs on different serverless platforms. Spring Cloud Function currently supports AWS Lambda, Microsoft Azure Functions and Apache OpenWhisk.\nHere we continue with the serverless DingTalk callback function to demonstrate Spring Cloud Function for AWS.\nFirst add the Spring Cloud Function adapter for AWS to the project dependencies,\nimplementation(\u0026#34;org.springframework.cloud:spring-cloud-function-adapter-aws:${springCloudFunctionVersion}\u0026#34;) Next create the function Handler by extending SpringBootRequestHandler, the class Spring Cloud Function provides to abstract over function platforms, or one of its trigger-specific subclasses such as SpringBootApiGatewayRequestHandler\nimport org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler class Handler : SpringBootApiGatewayRequestHandler() Then create the Spring Boot application and register the serverless function as a Spring Bean; the function body is the business logic the serverless function performs.\n@SpringBootApplication open class DingtalkCallbackApplication { @Bean open fun dingtalkCallback(): Function\u0026lt;Message\u0026lt;EncryptedEvent\u0026gt;, Map\u0026lt;String, String\u0026gt;\u0026gt; { val callback = Callback() return Function { callback.handleRequest(it) } } } fun main(args: Array\u0026lt;String\u0026gt;) { SpringApplication.run(DingtalkCallbackApplication::class.java, *args) } Finally package the function as a fat jar (not required if the dependencies are packaged as a Lambda layer) to use as the Lambda code.\nDeploying the function is no different from any other Lambda function; this example reuses the earlier SAM/CloudFormation configuration or the Serverless Framework deployment configuration.\nThe complete runnable and deployable code is on this branch.\nOverall, the idea behind Spring Cloud 
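Function is not complicated: define a uniform function entry point and use per-platform adapters to bridge the API of each platform, so that a function is written once and runs on different serverless platforms by packaging a different adapter.\nPersonally, though, I think such scenarios are rare in practice. Serverless functions are triggered in many ways, for example by API Gateway, Kinesis, CloudWatch and IoT on AWS; integrating with or calling these services also creates coupling, so the function cannot simply be moved to another serverless platform. Developers also have to pull in spring/spring boot/spring cloud dependencies, which adds complexity and lengthens the Lambda cold start. They also need to learn Spring Cloud Function itself, which quietly adds more complexity. In short, the payoff of using Spring Cloud Function as a function framework is low, and the project feels of marginal value.\n","link":"https://kane.mx/posts/effective-cloud-computing/spring-cloud-function-for-aws/","section":"posts","tags":["云计算","FaaS","函数计算","AWS","AWS Lambda","钉钉","dingtalk","Serverless Computing","Spring","Spring Cloud Function"],"title":"Spring Cloud Function -- a function computing framework across serverless platforms"},{"body":"","link":"https://kane.mx/tags/%E5%87%BD%E6%95%B0%E8%AE%A1%E7%AE%97/","section":"tags","tags":null,"title":"函数计算"},{"body":"","link":"https://kane.mx/tags/%E9%92%89%E9%92%89/","section":"tags","tags":null,"title":"钉钉"},{"body":"AWS is the global leader in cloud computing. It has innovated extensively in compute, storage and networking, and is the one other cloud vendors study and imitate.\nAlibaba Cloud currently has the largest market share in China, exceeding the combined share of the second through fifth vendors; its lead is even more pronounced than the lead AWS holds globally, and its global share has passed IBM to reach fourth place.\nThis post briefly compares the core services of AWS and Alibaba Cloud and offers some personal views on where the two vendors are heading.\nAt the core of cloud computing are compute, storage and networking. The stability and completeness of these basic capabilities set the floor for a cloud vendor.\nBesides those three core capabilities, the following are also very important parts of cloud computing,\nPay-as-you-go billing Resource orchestration (that is, the platform as code) Authentication and authorization for cloud resources API Based on the core services and key capabilities listed above, let's see where AWS is strong.\nAs the leading cloud vendor, AWS keeps innovating in compute, storage and networking, and keeps being imitated by competitors. In compute, AWS was first to launch the Lambda serverless compute engine as a fully managed pay-per-use service, offers production-ready GPU instances (up to 64 GPU cards on a single VM, whereas Alibaba Cloud sells only 2 GPU cards by default), and EC2 instances on the Nitro architecture have brought customers better performance at lower prices.\nS3, the earliest AWS cloud service, is still evolving. S3 now offers 11 nines of durability, and to meet different storage needs AWS added S3 Glacier, Glacier Deep Archive and other storage options. It has also kept releasing services and tools such as Amazon Athena, Redshift and S3 Select for processing big data at scale.\nAWS applies the PAYG (Pay-As-You-Go) billing model across its services. EC2 (including GPU instances), ELB, NAT gateways and so on are all billed by the hour. Alibaba Cloud has considerable room for improvement here: for example, the minimum purchase period for GPU instances is one week, SLB is sold primarily by specification, and NAT gateways are billed per calendar day.\nIAM provides the finest-grained authorization for cloud resources, and AWS services strictly enforce fine-grained authorization to meet enterprise permission management needs. In several Alibaba Cloud services I have used, I repeatedly found that newer services launched with poorly designed IAM, overly coarse permissions, or even features that did not work.\nAWS CloudFormation provides orchestration of cloud resources, making resources code and versioned (usually called Infrastructure as Code). It takes cloud resource management and operations to a new level.\nAWS offers three ways to manage cloud resources: Web Console, 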
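CLI and API. AWS does its best to provide consistent functionality across all three.\nAWS is also a cloud ecosystem: third-party vendors sell SaaS and PaaS services through the Marketplace, creating a win for cloud users, third-party vendors and AWS alike.\nAll in all, AWS keeps investing and innovating in core and key cloud services, which keeps its overall offering in the lead.\nNext, the strengths of Alibaba Cloud.\nBesides basic compute, storage and networking, Alibaba Cloud offers many additional SaaS services, such as Application Performance Monitor, Performance Testing Service, Log Service, a tracing service and a database management service. These are clearly better integrated with Alibaba Cloud and give users out-of-the-box solutions. This is a double-edged sword, though: using platform bundling to take market share from partner developers amounts to a platform monopoly that, in the long run, discourages technology startups building on Alibaba Cloud.\nIn short, Alibaba Cloud still lags AWS in core cloud services, but it does well in PaaS/SaaS services and can more easily offer a complete solution on Alibaba Cloud. Thanks to its larger number of data centers in China and the BGP resources acquired with 万网, its services have lower network latency within China.\nFinally, an interesting topic: whether to worry about cloud vendor lock-in.\nKubernetes has become the de facto container orchestration platform. Consider K8S and the projects in the CNCF landscape first as the application runtime, to reduce potential migration and learning costs.\nCompanies of different sizes look at vendor lock-in very differently. Startups, and mid-to-large companies whose business is still growing quickly, should first choose a cloud vendor that is reliable, offers many solutions and is easy to learn, use the services of that vendor to move the business forward quickly, efficiently and reliably, and put as much energy and staff as possible into the business itself. Large companies with stable businesses can use multiple data centers to make critical workloads highly available; going multi-cloud should not be treated as a necessary route to high availability. Cloud vendors will also devote extra staff and priority support to their large customers, even adjusting product roadmaps or co-developing features for them, which is a win for both sides; the way Netflix and AWS have helped each other succeed is a good example. Without a compelling reason, do not spend effort moving a business from a cloud vendor whose service is already stable to a multi-cloud platform; it is usually wasted effort.\nBelow is the latest full version of the slides,\n","link":"https://kane.mx/posts/2019/aws-vs-aliyun/","section":"posts","tags":["云计算","AWS","阿里云"],"title":"Comparing public clouds"},{"body":"","link":"https://kane.mx/tags/%E9%98%BF%E9%87%8C%E4%BA%91/","section":"tags","tags":null,"title":"阿里云"},{"body":"Serverless Framework is an open source command line tool. It provides function scaffolding, workflow automation and best practices to help develop and deploy to managed serverless compute services across cloud vendors (officially supporting aws, Azure, GCP, IBM Cloud and others). It also supports plugins that extend it, for example to add the serverless services of more vendors, such as Alibaba Cloud Function Compute.\nHere the function compute based DingTalk callback endpoint example is used to show how to deploy a serverless function to AWS Lambda with Serverless Framework.\nAfter installing serverless, run the serverless create command to scaffold a function project, or create the serverless configuration file serverless.yml in an existing project.\nNext, follow the serverless aws reference to configure your aws lambda function and the resources it needs. Anyone with AWS CloudFormation or AWS SAM experience will quickly get used to writing Serverless configuration. In essence, Serverless configuration abstracts the CloudFormation/SAM concepts to give the serverless service of each vendor uniform tools, commands and abstractions. When deploying to aws lambda, the serverless configuration is converted into a CloudFormation template and created or changed through the AWS API.\nFor Dingtalk Callback on AWS Lambda, the serverless configuration is declared as follows. It specifies the basic service information, global settings (such as stage and region) and the cloud provider (aws here), followed by the basic information, permissions, layers and triggers of each function, custom layers, and other vendor resources such as the DynamoDB table used by Dingtalk callback. See the complete serverless configuration here.\nservice: 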
name: dingtalk-callback frameworkVersion: \u0026#34;\u0026gt;=1.0.0 \u0026lt;2.0.0\u0026#34; provider: name: aws runtime: java8 stage: ${opt:stage, \u0026#39;dev\u0026#39;} # Set the default stage used. Default is dev region: ${opt:region, \u0026#39;ap-southeast-1\u0026#39;} # Overwrite the default region used. Default is ap-southeast-1 profile: ${opt:profile, \u0026#39;default\u0026#39;} # The default profile to use with this service versionFunctions: true # Optional function versioning endpointType: regional # Optional endpoint configuration for API Gateway REST API. Default is Edge. functions: dingtalk-callback: handler: com.github.zxkane.dingtalk.Callback::handleRequest # required, handler set in AWS Lambda name: ${self:provider.stage}-dingtalk-callback # optional, Deployed Lambda name memorySize: 384 # optional, in MB, default is 1024 timeout: 15 # optional, in seconds, default is 6 environment: # Function level environment variables PARA_DD_TOKEN: DD_TOKEN TABLE_NAME: {Ref: BPMTable} package: artifact: build/libs/dingtalk-callback-1.0.0-SNAPSHOT.jar role: dingtalkCallbackIAMRole layers: # An optional list Lambda Layers to use - {Ref: DependenciesLambdaLayer} events: # The Events that trigger this Function - http: # This creates an API Gateway HTTP endpoint which can be used to trigger this function. 
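Learn more in \u0026#34;events/apigateway\u0026#34; path: dingtalk # Path for this endpoint method: post # HTTP method for this endpoint layers: dependencies: path: build/deps resources: # CloudFormation template syntax Resources: dingtalkCallbackIAMRole: Type: AWS::IAM::Role Properties: Policies: - PolicyName: SSMPolicy - PolicyName: DynamoDBPolicy BPMTable: Type: AWS::DynamoDB::Table Properties: TableName: bpm_raw_${self:service.name}_${self:provider.stage} ProvisionedThroughput: ReadCapacityUnits: 1 WriteCapacityUnits: 1 For users on the serverless service of a single cloud vendor who already do continuous integration and deployment with something like sam cli, Serverless Framework does not add much productivity, and in stability (wrapping vendor features adds complexity and may introduce new problems) or timeliness of features it may fall behind the tools of the vendor itself.\nFor users who need to deploy serverless functions to several vendors, Serverless Framework does not make it easy to deploy the same functions to the managed services of different vendors; it only provides a uniform toolchain across vendors and similar best-practice workflows for continuous integration and deployment. Moving a set of functions from AWS to Azure, for example, requires rewriting the configuration for the Azure provider, because the managed serverless services and other cloud resources of the vendors differ substantially. The function code needs rework too, since the trigger events of each vendor have a different format! A solution like Spring Cloud Function can be considered here for writing functions across vendors.\nIn short, Serverless Framework improves productivity somewhat for cross-vendor deployment, but it is still far from fully solving cross-vendor managed serverless (the services of the vendors are inherently incompatible); perhaps this approach simply cannot work\u0026#x1f615;.\n","link":"https://kane.mx/posts/2019/serverless-framework/","section":"posts","tags":["云计算","FaaS","AWS","AWS Lambda","Serverless Computing"],"title":"Serverless framework 101"},{"body":"The post on the function compute based DingTalk callback endpoint used the DingTalk callback case to try out AWS Lambda serverless functions. In that example, the custom function code and its third-party dependencies (such as the json library jackson, the DingTalk openapi encryption library and the aws dynamodb client) were packaged together as one deployment package and uploaded to Lambda for execution.\nIn real projects, however, many related functions may share these base dependencies or third-party libraries (such as headless chrome (Puppeteer), pandoc, the OCR library Tesseract and so on), or need a custom runtime (such as java11, which is not officially supported). AWS Lambda released the Lambda layers feature at the end of last year to meet these needs.\nNext, let's see how to move the function dependencies from the earlier post into a separate layer shared by different functions.\nIn our build configuration build.gradle, copy the shared dependencies of the function into java/lib/, the directory layout required by the java runtime,\ntasks.register\u0026lt;Copy\u0026gt;(\u0026#34;depsLayer\u0026#34;) { into(\u0026#34;$buildDir/deps/java/lib\u0026#34;) from(configurations.compileClasspath.get()) from(configurations.runtimeClasspath.get()) } 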
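Next, create a lambda layer from the shared dependencies and make the callback function depend on it. The dependencies are no longer bundled into one large deployment package, which shrinks the package published on each change.\nDependenciesLayer: Type: AWS::Serverless::LayerVersion Properties: LayerName: DingTalkDependencies Description: DingTalk Dependencies Layer ContentUri: \u0026#39;build/deps\u0026#39; CompatibleRuntimes: - java8 LicenseInfo: \u0026#39;Available under the MIT-0 license.\u0026#39; RetentionPolicy: Retain CallbackFunction: Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction Properties: CodeUri: build/libs/dingtalk-callback-1.0.0-SNAPSHOT.jar Handler: com.github.zxkane.dingtalk.Callback::handleRequest Layers: - !Ref DependenciesLayer Policies: Viewing the deployed function in the console, as shown in the figure below, the function now has one layer attached.\nAs with other languages and technologies, the Awesome Layers project collects commonly used, well-maintained layers; check it before reinventing the wheel\u0026#x1f600;.\nLayers also come with the following limits to keep in mind,\nA function can depend on at most 5 layers The function and all the layers it depends on must not exceed 250MB unzipped ","link":"https://kane.mx/posts/2019/aws-lambda-layers/","section":"posts","tags":["云计算","FaaS","AWS","AWS Lambda","Serverless Computing"],"title":"AWS Lambda Layers in practice"},{"body":"","link":"https://kane.mx/tags/qcon/","section":"tags","tags":null,"title":"QCon"},{"body":"This week I attended QCon Beijing 2019. Here I record some of the talks that impressed me and my own impressions.\nQCon is a comprehensive technology conference organized by InfoQ, covering hot topics such as front-end, high-availability architecture, container technology, big data and machine learning. There were also interesting cutting-edge topics such as next-generation distributed applications and chaos engineering; the related talks are described in detail below.\nImproving engineering efficiency This was the track on the first day of QCon that I found most interesting. Startups, unicorns and internet giants all want to keep improving engineering efficiency, and the 3 related talks came from the three BAT companies, which shows how eager the giants are to improve team efficiency.\nHow the 10x principle shapes the direction of engineering productivity This talk was by Qiao Liang, a senior consultant at Tencent, who has spoken at QCon on continuous integration, continuous delivery and other engineering efficiency topics for 10 years in a row! His talk started from the ten-thousand-experiments rule behind successful companies, where large numbers of efficient experiments rest on the fast validation loop of a double-loop model. Ultimately, engineering productivity is determined jointly by workflow, supporting tools and engineering literacy. I fully agree with these three factors, and I believe engineering literacy is the foundation of the other two; the opening emphasis in the Netflix culture handbook on hiring only adults illustrates this well.\nHow Baidu improves engineering capability This talk came from a product manager in the engineering efficiency department at Baidu, and presented a strategy model for improving engineering capability from three angles: people, technology and process. The model corresponds to the three factors of engineering productivity in the talk by Qiao Liang above. For engineer training and technical standards, Baidu has published the \u0026quot;Baidu Engineer Handbook\u0026quot;, which can reportedly be downloaded online. Most of the tool details concerned internal Baidu tools, but the thinking behind them is still worth borrowing.\nR\u0026amp;D efficiency transformation in practice at Cainiao 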
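This talk came from Cainiao, part of the Alibaba group, and stressed data-driven improvement of R\u0026amp;D efficiency. One interesting point was building a cost model to judge whether efficiency is good or bad.\nFor the head of an efficiency department, having data, especially cost data, to get senior management to buy in to your ideas is a very good angle.\nHigh-availability architecture Declarative self-healing systems: designing highly available distributed systems This talk was a fairly theoretical introduction to the principles and practice of declarative, self-healing distributed systems. The industry already has an excellent reference implementation, namely Kubernetes \u0026#x1f603;.\nThe Zen of building ultra-large-scale, highly available cloud systems This was a very practical engineering talk that listed many challenges any large-scale cloud native application will face, with simple and practical solutions. Every cloud native application developer should read the slides and learn from this experience. A plug for the speaker, Cai Chao: if you are interested in Go, consider his Geek Time course Go语言从入门到实战.\nOperations architecture Best practices for building a Kubernetes logging platform This talk described an approach to logging on Kubernetes and its implementation, Alibaba Cloud Log Service. It is a good reference for teams building their own infrastructure services. For applications already hosted on Alibaba Cloud, I suggest not reinventing a lesser wheel; Alibaba Cloud Log Service should be the first choice of the team. In performance and in integration with other managed cloud services it is far better than a self-built solution.\nAutomated operations architecture under multi-cloud management Multi-cloud is a topic some vendors push hard; personally I think lower-ranked vendors need some way to make their products more competitive\u0026#x1f60f;. The company of the speaker built an ops platform to manage multi-cloud resources, abstracting vendor differences and resources through adapters. This involves a great deal of work handling product differences and reactive adaptation, and I do not really agree with the approach. It also loses important capabilities such as infra as code, so it is not an ideal solution for the large enterprises that have this need.\nChaos engineering Chaos engineering is a very interesting topic and a relatively new practice; less than 10 years have passed from its first proposal and systematic practice to today. Two talks, Chaos engineering practice under cloud native architecture from Alibaba and Chaos engineering on AWS: designing and running controlled experiments from AWS, covered everything from the origins of chaos engineering to applying it comprehensively to improve the \u0026quot;resilience\u0026quot; of cloud native applications, and are well worth studying. The talk by Cai Chao on building ultra-large-scale highly available cloud systems also mentioned using chaos engineering to improve availability. As cloud native applications become more common, passively designing for high availability is certainly less effective than actively (even continuously and automatically) injecting controlled chaos to improve availability step by step. Tooling and platform support for chaos engineering is still immature, so this looks like a good entry point for a technology startup\u0026#x1f60f;. One last thing to remember: chaos engineering must ultimately be practiced on production systems. Next-generation distributed applications Although this track was named next-generation distributed applications, most talks were about traffic governance between services, especially experience with Service Mesh. The talk The future of distributed applications: Distributionless by Li Yun from Alibaba deserves special mention. It contained no concrete cases or experience; it focused on his views of the nature and trends of Cloud Native, which I strongly agree with (it felt like finding a kindred spirit:grinning:)! Download the complete slides here. User growth Chen Guancheng from 云测, in the talk Intelligent optimization \u0026amp; A/B testing - 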
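Theory and technical practice of experiment-driven user growth, shared the theory and practice of using A/B test experiments for user growth, and also promoted the A/B testing SaaS service from 云测. According to friends in the field, the 云测 A/B testing service is indeed simple and easy to use, making it convenient to create tests from the product console and analyze results; those who need growth can try it and avoid reinventing the wheel.\n","link":"https://kane.mx/posts/2019/2019-qconbeijing-reviews/","section":"posts","tags":["会议","QCon","DevOps","架构","混沌工程","工程效率"],"title":"QCon Beijing 2019 review"},{"body":"","link":"https://kane.mx/tags/%E4%BC%9A%E8%AE%AE/","section":"tags","tags":null,"title":"会议"},{"body":"","link":"https://kane.mx/tags/%E5%B7%A5%E7%A8%8B%E6%95%88%E7%8E%87/","section":"tags","tags":null,"title":"工程效率"},{"body":"","link":"https://kane.mx/tags/%E6%9E%B6%E6%9E%84/","section":"tags","tags":null,"title":"架构"},{"body":"","link":"https://kane.mx/tags/%E6%B7%B7%E6%B2%8C%E5%B7%A5%E7%A8%8B/","section":"tags","tags":null,"title":"混沌工程"},{"body":"","link":"https://kane.mx/tags/istio/","section":"tags","tags":null,"title":"Istio"},{"body":"","link":"https://kane.mx/tags/service-mesh/","section":"tags","tags":null,"title":"Service Mesh"},{"body":"","link":"https://kane.mx/tags/spring-cloud/","section":"tags","tags":null,"title":"Spring Cloud"},{"body":"Java-based Spring Cloud is the out-of-the-box distributed microservice solution from the Spring community, the largest open source ecosystem for Java, and developers have thought highly of it since its release in 2016. With Java a widely popular server-side language, Spring Cloud is increasingly used for microservice development.\nSpring Cloud integrates the Netflix OSS projects to implement many features (or as one of the implementations), including service governance, gateway routing, client-side load balancing, inter-service calls and circuit breakers. Spring Cloud Netflix brings many production-grade microservice capabilities to Spring Cloud microservices out of the box, helping developers quickly build applications that satisfy the twelve factors.\nThe Spring Cloud Greenwich release at the end of last year announced that, because the upstream open source projects have entered maintenance mode, the corresponding Spring Cloud Netflix modules for important components such as Hystrix, Ribbon and Zuul 1 are also in maintenance mode. These projects are no longer suitable for products that need long-term maintenance!\nMeanwhile, with the growth of cloud computing in recent years, especially Kubernetes becoming the de facto standard for container orchestration, plus Service Mesh for service governance and traffic control, cloud native applications now have a more modern, platform-independent solution.\nLet's look one by one at how Kubernetes plus a Service Mesh (such as Istio) provides service discovery, routing, tracing, circuit breaking and other microservice features.\nConfiguration center Spring Cloud Config provides several configuration backends by default, such as Git, Vault and JDBC Backend. There are also many open source alternatives, such as Alibaba Nacos.\nFor applications deployed on Kubernetes, the best practice is to balance ConfigMap and Spring Cloud Config. Put configuration that affects program behavior in ConfigMaps and Secrets and version it together with the microservice release, so that rolling back the application also rolls back to the matching historical configuration and an old version of the code is not broken by the latest configuration. The Spring Cloud Kubernetes project has good support for Spring 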
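Cloud applications reading configuration from ConfigMaps and Secrets. Business-related configuration options can go into a Spring Cloud Config backend for unified management. If the application runs on Alibaba Cloud, the managed configuration service of Alibaba Cloud together with Spring Cloud Config -- Nacos is a good choice.\nService discovery Kubernetes Services provide native in-cluster service discovery and are a good replacement for discovery services such as Eureka or Spring Cloud Zookeeper. With service discovery based on K8S Services, a Service Mesh can easily add rate limiting, A/B testing, canary releases, circuit breaking, chaos injection and other governance capabilities. Microservices also no longer need third-party libraries on the application side for registration and discovery, which reduces application development work.\nTraffic governance scenarios Once applications become services, traffic governance inevitably follows. How to implement rate limiting, A/B testing, canary releases, circuit breaking, chaos injection testing, tracing and so on between services is really one class of general problem.\nSpring Cloud offers a client-side approach in which every application must include libraries for each capability. Even though spring boot starters make this nearly out of the box, each application still has to add the capabilities itself, and version updates, security fixes and the like all require manual upgrade, testing, packaging and deployment. In a microservice architecture with heterogeneous programming languages, not every framework offers good support for these capabilities. Unless there is a special governance strategy, I do not recommend implementing traffic control inside the microservices themselves.\nA Service Mesh (such as Istio or Linkerd) provides a uniform solution to these needs at the level of overall service governance, without upgrading or changing the microservices. For microservices deployed on Kubernetes, a Service Mesh frees users from maintaining governance features in each microservice and provides a more uniform and consistent solution at the platform level.\nAt SpringOne Platform 2018 last year there was also a talk, A Tale of Two Frameworks: Spring Cloud and Istio, discussing when to use a Service Mesh and when to use the Spring Cloud governance components; take a look if interested.\n","link":"https://kane.mx/posts/effective-cloud-computing/spring-cloud-or-cloud-native/","section":"posts","tags":["云计算","kubernetes","spring","spring cloud","service mesh","istio"],"title":"Spring Cloud or Cloud Native"},{"body":"","link":"https://kane.mx/tags/oauth2/","section":"tags","tags":null,"title":"Oauth2"},{"body":"This is the second part of Adding oauth2-based authentication to any application in Kubernetes. With illustrations, it explains how to use an oauth2 proxy based on DingTalk authentication to add authentication and authorization to a web site that has none of its own.\nThe example uses AWS EKS as the K8S environment. Given how K8S applications run, the example can also be deployed on K8S managed by other cloud vendors.\nExample modules Nginx Ingress Controller provides a reverse proxy for web applications in the K8S cluster and supports external authentication. A simple web site based on the Nginx docker container. The site has no authentication or authorization by default and uses an external DingTalk application for both. OAuth2 Proxy on Dingtalk provides QR-code authentication and authorization based on a DingTalk application; only authenticated and authorized users can access the web site above. Default settings Web site domain web.kane.mx Authentication service domain oauth.kane.mx Prepare the AWS EKS environment Create an EKS cluster. Because the Nginx Ingress service is of type LoadBalancer, and EKS needs the targets to be deployed in public VPC subnets when it creates the targets for the NLB or ELB, all VPC subnets of the EKS cluster are public subnets to keep the deployment simple. The new EKS cluster allows public access. Install and configure kubectl and aws-iam-authenticator locally to manage the cluster remotely. Add worker nodes to the cluster. Set up the Helm deployment environment. Prepare the DingTalk applications Enable the DingTalk developer platform for the company or organization Create a new mobile application. Set the callback domain to \u0026lt;http or 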
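https\u0026gt;://\u0026lt;authentication service domain\u0026gt;/oauth2/callback. Record the appId and appSecret of the application. Create an internal enterprise workbench application. The address can be anything. Set the server egress IP to the public IP of the EKS worker nodes or the NAT EIP, depending on how the worker nodes reach the Internet. Record the appKey and appSecret of the application. Deploy the example application Clone the example deployment scripts. In values.yaml, replace dingtalk_corpid with the appKey of the workbench application and dingtalk_corpsecret with its appSecret. Because the community-maintained oauth2-proxy chart does not support the SECRET ENV of the dingtalk extension, the secret is placed in the configmap. For production use, store the application secret in a Secret as done in this commit. oauth2-proxy: config: clientID: aaa clientSecret: bbb cookieSecret: ccc configFile: |+ email_domains = [ \u0026#34;*\u0026#34; ] cookie_domain = \u0026#34;kane.mx\u0026#34; cookie_secure = false dingtalk_corpid = \u0026#34;\u0026lt;appkey of dingtalk app\u0026gt;\u0026#34; dingtalk_corpsecret = \u0026#34;\u0026lt;appsecret of dingtalk app\u0026gt;\u0026#34; To authorize only employees of certain departments, add the following under the configFile setting above, dingtalk_departments = [\u0026#34;xx公司/产品技术中心\u0026#34;,\u0026#34;xx公司/部门2/子部门3\u0026#34;] Replace the domain of the deployed application with your own domain. Run the following command to install the Helm dependencies. helm dep up Run the following command to deploy the nginx ingress controller, the web application and oauth2 proxy helm upgrade --install -f values.yaml --set oauth2-proxy.config.clientID=\u0026lt;mobile app appid\u0026gt;,oauth2-proxy.config.clientSecret=\u0026lt;mobile app appsecret\u0026gt; site-with-auth --wait ./ If an Nginx Ingress Controller is already deployed in the cluster, change values.yaml as follows to skip deploying Nginx ingress, affinity: {} nginx-ingress: enabled: false controller: ingressClass: nginx config: After a successful deployment, get the ELB address. kubectl get svc -o jsonpath=\u0026#39;{ $.status.loadBalancer.ingress[*].hostname }\u0026#39; \u0026lt;deployment name\u0026gt;-nginx-ingress-controller;echo a3afe672259c511e98e2a0a0d88fda3e-xx.elb.ap-southeast-1.amazonaws.com Post-deployment configuration Point the site and oauth service domains at the ELB created by the deployment above.\nTesting Visit the web site (http://web.kane.mx in this example). When not authorized, you are redirected to the DingTalk QR-code login page. After a member of the organization scans the code with DingTalk and authorizes, the browser is redirected back to the web site and pages under the domain can be browsed normally.\n","link":"https://kane.mx/posts/effective-cloud-computing/oauth2-proxy-on-kubernetes/part2/","section":"posts","tags":["云计算","IAM","kubernetes","oauth2","钉钉","dingtalk","AWS","AWS 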
EKS"],"title":"为Kubernetes中任意应用添加基于oauth2的认证保护 (下)"},{"body":"由于企业内部管理的需要，用到了钉钉的业务事件回调能力，正好将这个轻量级的接口使用无服务器技术来实现部署，以应对流量无规律下的动态扩展伸缩、按需使用、按量计费等需求。\n阿里云函数计算版本 由于公司系统部署在阿里云，首先选择使用阿里云函数计算来实现及部署。该接口使用了JVM上语言Kotlin开发，虽然阿里云函数计算官方支持的开发语言有Java但没有Kotlin。其实无论Java或Kotlin最终部署文件都是Java Class字节码，加上Kotlin与Java良好的互操作性，实测函数计算可以完美支持Kotlin开发(个人认为任意JVM上的开发语言都是支持的)。\n同时该函数使用了表格存储来持久化回调事件。表格存储是个按量计费的分布式存储，有兴趣的可以自行查阅文档了解更多。\n该函数通过API网关和表格存储触发器来触发。访问日志和执行日志被存储在日志服务中。\n函数的本地测试和线上部署，使用了函数计算提供的命令行工具Fun。基于Fun定义的阿里云Serverless模型实现了对函数们使用资源的声明和编排，集成Gitlab CI实现了函数的CI/CD自动化发布流程。\n不涉及公司业务的代码已开源在Github，有兴趣的可以作为参考。\n目前函数计算和表格存储有各自的免费配额，在业务量不大的情况下，该服务完全免费。\nAWS Lambda版本 AWS Lambda是目前全球使用最为广泛的serverless服务，同时也是函数计算发展方向的引领者。\n由于一些个人原因，笔者最近接触了部分AWS服务，同时尝试将钉钉回调函数移植到了AWS Lambda上。阿里云上使用的云服务改为由AWS上对应服务来实现，例如存储使用了DynamoDB，日志使用CloudWatch收集和查询。\n本地测试和部署工具，使用的是SAM CLI，持续集成和持续部署使用的是AWS CodeBuild和AWS CodePipeline。此外AWS通过AWS CloudFormation提供一种非常强大的能力，可以将AWS上的各种资源通过配置声明的方式来管理(也就是现在非常热门的一个概念--Infrastructure as Code)。AWS CloudFormation会为每次一个或多个资源的变更生成ChangeSet，提供查看对比、版本管理、遇到变更错误整体回退等能力。所以，AWS版本也将该项目的CI/CD部署用到的AWS CodeBuild、AWS CodePipeline、Amazon DynamoDB等资源通过CloudFormation的配置管理起来。\n配置代码段如下， 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 Description: Create a CodePipeline to include Github source, CodeBuild and Lambda deployment. Parameters: AppBaseName: Type: String Description: App base name Default: dingtalk-callback ArtifactStoreS3Location: Type: String Description: Name of the S3 bucket to store CodePipeline artificat. 
BranchName: Description: GitHub branch name Type: String Default: master RepositoryName: Description: GitHub repository name Type: String Default: dingtalk-callback-on-aws GitHubOAuthToken: Type: String NoEcho: true Resources: BuildDingtalkProject: Type: AWS::CodeBuild::Project Properties: Name: Fn::Sub: ${AppBaseName}-build-${AWS::StackName} Description: Build, test, package dingtalk callback project ServiceRole: Fn::GetAtt: [ CodeBuildRole, Arn ] Artifacts: Type: S3 Location: Ref: ArtifactStoreS3Location Name: Fn::Sub: ${AppBaseName}-build-${AWS::StackName} NamespaceType: BUILD_ID Path: Fn::Sub: ${AppBaseName}/artifacts Packaging: NONE OverrideArtifactName: true EncryptionDisabled: true Environment: Type: LINUX_CONTAINER ComputeType: BUILD_GENERAL1_SMALL Image: aws/codebuild/java:openjdk-11 PrivilegedMode: false ImagePullCredentialsType: CODEBUILD EnvironmentVariables: - Name: s3_bucket Value: Ref: ArtifactStoreS3Location Source: DingtalkCallbackPipeline: Type: \u0026#39;AWS::CodePipeline::Pipeline\u0026#39; Properties: Name: Fn::Sub: ${AppBaseName}-pipeline-${AWS::StackName} RoleArn: Fn::GetAtt: [ CodePipelineRole, Arn ] Stages: - Name: Source Actions: - Name: SourceAction ActionTypeId: Category: Source Owner: ThirdParty Version: 1 Provider: GitHub OutputArtifacts: - Name: Fn::Sub: ${AppBaseName}-source-changed Configuration: Owner: !Ref GitHubOwner Repo: !Ref RepositoryName Branch: !Ref BranchName OAuthToken: !Ref GitHubOAuthToken PollForSourceChanges: false RunOrder: 1 - Name: Build Actions: - Name: Build_Test_Package InputArtifacts: - Name: Fn::Sub: ${AppBaseName}-source-changed ActionTypeId: Category: Build Owner: AWS Version: 1 Provider: CodeBuild OutputArtifacts: - Name: Fn::Sub: ${AppBaseName}-packaged-yml Configuration: ProjectName: Ref: BuildDingtalkProject RunOrder: 1 
AWS版本完整的代码、CloudFormation配置以及部署文档可以通过这里查看。\n","link":"https://kane.mx/posts/effective-cloud-computing/serverless-dingtalk-callback/","section":"posts","tags":["云计算","FaaS","阿里云","函数计算","AWS","AWS Lambda","钉钉","dingtalk","Serverless Computing"],"title":"基于函数计算的钉钉回调函数接口"},{"body":"Serverless Computing(无服务器计算)是目前最被看好的云端计算执行模型。其最大的好处是提供分布式弹性可伸缩的计算执行环境，仅为实际使用资源付费，并且将应用维护者从常规的运维事务中解放出来，更利于专注到具体的业务上。\n在主流的应用部署方式下，无论是使用云主机还是Kubernetes作为运行环境，都会有大量运维层面的事务需要考虑和处理，并且应用程序需要按照分布式程序的设计准则来应对应用的水平伸缩。同时随着云计算服务的发展和完善，云计算厂商提供了越来越多的基础服务，例如API网关、对象存储、消息队列、日志、监控等服务，函数计算可以完美的同其他云服务集成，帮助用户快速实现出生产级别的弹性可伸缩的应用。\n那函数计算是什么呢？让我们一起来看看阿里云对于函数计算的定义。\n阿里云函数计算是事件驱动的全托管计算服务。通过函数计算，您无需管理服务器等基础设施，只需编写代码并上传。函数计算会为您准备好计算资源，以弹性、可靠的方式运行您的代码，并提供日志查询、性能监控、报警等功能。借助于函数计算，您可以快速构建任何类型的应用和服务，无需管理和运维。而且，您只需要为代码实际运行所消耗的资源付费，代码未运行则不产生费用。\n基于函数计算的特点，可以很好满足以下需求，\n业务流量不确定或有明细的周期性 构建分布式系统经验不足 无需运维 按需计算 计费灵活 由于函数计算的扩展能力，对运维的要求极少，按量计费等特性用于需要快速验证的早期项目也是非常好的场景。\n下面这个slide是近期针对阿里云函数计算做的分享。\n","link":"https://kane.mx/posts/effective-cloud-computing/serverless-computing-101/","section":"posts","tags":["云计算","FaaS","阿里云","函数计算","Serverless Computing"],"title":"无服务器计算101"},{"body":"企业随着业务的发展，必然会部署各种各样的IT系统。出于安全性的考虑，一些系统仅可企业内部使用，甚至仅开放给企业部分部门员工使用。\n这些IT系统大致可分为两类，\n系统本身不支持任何认证机制，例如资讯或文档类系统。需要增加认证保护，能够限制非企业员工访问即可。系统运维通常的做法是，为站点设置HTTP Basic认证保护。由于HTTP Basic认证是通过预设的用户、密码认证，认证信息比较容易泄露。即使定期更换密码，但需要额外的机制通知用户密码的变更，用户体验也不好。 系统自身支持认证，甚至支持多种认证机制。比如最常用的开源CI/CD工具，Jenkins内置支持本地数据库认证、通过插件支持多种第三方系统集成认证。如果大量的IT系统都有一套独立的用户管理，随着企业的员工的变更，用户的增删等操作对系统管理员来说是不小的工作量。同时，也很容易由于人为疏忽，造成资产、数据的安全隐患。 假设企业自身已经有了一套OA系统包含员工、组织结构管理，例如，国内目前最为普及流行的钉钉或企业微信。我们完全可以提供一套基于oauth 2.0协议的认证方式，让以上两类IT系统使用企业已有的OA系统(钉钉或企业微信)来实现登录认证。做到这一点后，企业无论有多少IT系统都不再需要额外管理用户的成本，并且也避免了数据安全隐患。\n钉钉通过钉钉开放平台提供的API开放了许多钉钉内部的能力，例如，身份验证、通讯录管理等等。然而钉钉的三方网站登录接口并不是标准的oauth 2.0协议实现，我们需要通过一个oauth2 proxy代理工具实现将钉钉的三方网站登录兼容oauth2协议。同理，使用这个oauth2代理工具，可以使用Google、Facebook等三方网站作为统一认证方式。\n有了基于钉钉的oauth2代理作为企业统一登录方式，对于上面两大类系统的认证需求解决方案分别如下，\n部署在Kubernetes中无内置认证机制的Web应用，通过nginx-ingress的外部OAUTH认证实现基于oauth2的安全认证。 
Jenkins可以通过反向代理插件实现使用oauth2认证登录。 在下篇中，我们将图文详解如何一步步实现为一个无认证的企业文档Web应用添加基于钉钉的统一认证。\n","link":"https://kane.mx/posts/effective-cloud-computing/oauth2-proxy-on-kubernetes/part1/","section":"posts","tags":["云计算","IAM","kubernetes","oauth2","钉钉","dingtalk"],"title":"为Kubernetes中任意应用添加基于oauth2的认证保护 (上)"},{"body":"企业使用公有云服务的第一件事情就是创建云帐号，有了帐号之后如何让企业员工安全合规的使用云帐号下的各种资源是开启云之旅后的第一个考验。\n云计算厂商针对企业上云后面临的第一个需求已经推出了完善的解决方案--Identity and Access Management。IAM可以帮助云帐号安全地控制对云计算服务资源的访问。企业可以使用IAM控制对哪个用户进行身份验证 (登录) 和授权 (具有权限) 以使用资源。\n云厂商是否提供完善的IAM服务可以作为整体产品解决方案是否成熟的一个衡量指标，比如AWS的IAM和阿里云的访问控制都是较为成熟完善的产品。国内某个以AI能力为卖点的云厂商，在IAM产品方面几乎为零，很难相信对安全合规有需求的企业会完整使用他的云产品作为解决方案。\nIAM通常提供以下功能:\n对云账户的共享访问权限 允许在一个云账户下创建并管理多个用户身份，并允许给单个身份或一组身份（既可以是当前云帐号下也可以是其他云帐号下）分配不同的权限策略，从而实现不同用户拥有不同的云资源访问权限，而不必共享云帐号根用户的密码或访问密钥。\n精细权限 可以针对不同资源向不同人员授予不同权限。可以要求用户必须使用安全信道（如 SSL）、在指定时间范围、或在指定源 IP 条件下才能操作指定的云资源。\n多重验证 (MFA) 可以向云账户和各个用户添加双重身份验证以实现更高安全性。借助MFA，用户不仅必须提供使用账户所需的密码或访问密钥，还必须提供来自经过特殊配置的设备的代码。\n联合身份 可以允许已在其他位置（例如，在企业网络中或通过 Internet 身份提供商）获得密码的用户获取对云账户的用户访问权限。\n后面会有专门的文章来讲如何实践联合身份。\n统一账单 云账户接收包括所有用户的资源操作所发生费用的统一账单。\n尽管IAM提供了上面种种功能，云帐号的管理者仍可通过一些最佳实践来更好的使用IAM产品来提升安全级别和减少运维成本。\nIAM最佳实践 尽量不要使用云帐号的根用户，不要为根用户创建AK。云帐号管理员也使用各自独立的子账号。 为企业中每一个需要使用云服务的员工单独创建子账户，且默认不允许创建AK。便于员工离职的时候，通过删除帐号来完全清理用户在云计算平台的各种权限。 密码安全实践， 限制密码强度不少于8位，必须由大小写字母、数字和符号中的三种组成。 强制密码过期时间不超过90天，且过期后不可登录。 新密码至少禁止使用前3次密码。 设置密码重试约束，例如，一小时内使用错误密码最大尝试9次登录。 强制所有用户启用两步认证。 对访问网络有限制的企业，可以开启登录IP限制。 [推荐做法]已有SSO单点登录系统的企业，可以通过SAML 2.0标准实现从企业本地账号系统登录到阿里云，从而满足企业的统一用户登录认证要求。 细粒度的权限管理， 为各种云资源创建最细粒度的权限策略。例如，分别为RDS实例rds-instance-1创建只读权限策略rds-instance-1-readonly-access，RDS实例rds-instance-2创建只读权限策略rds-instance-2-readonly-access。 根据职能、部门等维度为云帐号子用户创建用户组。例如，按项目创建用户组，group-project-a，group-project-b。如果project-a用户需要访问rds-instance-1的信息，将自定义权限rds-instance-1-readonly-access授权给group-project-a。再将相关用户加到用户组group-project-a中，这样这些用户就具有只读访问RDS实例rds-instance-1的权限。而不是将所有RDS的读写权限都授予这些用户，最大限度的保证用户不获取超过实际需要的权限。 
在实际场景中，通常会通过云计算服务的API来完成某些周期性任务，比如每日RDS中的慢查询统计、云帐号每日花费统计等。这些任务都需要一个云帐号的AK来完成API的身份认证。最佳的做法是，为每类相关的任务创建一个功能性子账号，禁用他们的web登录，且遵循特殊的命名规范(functional-开头)，比如functional-rds-stats、functional-cost-stats。创建最小的权限策略，然后分配给这些功能性用户。例如，functional-rds-stats仅被授予RDS只读权限，functional-cost-stats仅被授予费用的只读权限。为这些子账号创建AK，每类任务使用不同的AK来完成API认证，而不是都使用同一个AK。这样的好处是，不同类型任务的AK具有不同的权限，最大限度的保护了云帐号的安全，并且这些AK不跟实际的员工子账号关联，不会因为员工帐号的变更而受影响。如有更高的安全合规的要求下，可以定期作废已有AK，创建新AK替换。至于AK怎样安全管理，之后会有专门的文章来详解。 ","link":"https://kane.mx/posts/effective-cloud-computing/iam-best-practice/","section":"posts","tags":["云计算","阿里云","AWS","IAM"],"title":"IAM最佳实践"},{"body":"这是“如何高效使用云服务”系列文章的首篇分享。可能有朋友好奇为什么不是从云计算最基础的服务--计算资源ECS/EC2讲起呢？在Cloud Native已经被越来越接受的今天，基于Kubernetes部署、编排应用的方式已经是业界的事实标准。无论是互联网巨头，传统500强企业，还是创业团队都在使用或规划使用Kubernetes作为应用程序的自动化部署、可扩展管理平台。在云计算平台，虚拟机越来越不需要单独的管理，在绝大多数的业务场景下，它们只是作为容器集群所管理的计算资源。甚至虚拟机的创建到销毁整个生命周期管理都可以由Kubernetes根据集群的负载来自动完成。\n所有主流的云计算厂商都在解决方案中力推托管的Kubernetes，AWS的EKS，Azure上的AKS，当然少不了Google家GCP上的Kubernetes Engine。国内阿里云，腾讯云等每一个公有云玩家也都基于开源Kubernetes推出了托管服务。如果一家云计算厂商在提供托管Kubernetes这一服务上没跟上业界的步伐，将来极大可能被淘汰出这个市场。\n托管的Kubernetes类型 以国内的阿里云为例，目前提供了两大类三种不同的Kubernetes托管服务。\n经典Dedicated Kubernetes模式。这种模式下用户可以选择宿主机实例规格和操作系统，指定Kubernetes版本、自定义Kubernetes特性开关设置等。用户需要手动维护集群，例如升级Kubernetes版本，内置组件版本等。可以手动或自动伸缩集群节点数目。目前该模式下有两种类型，第一种集群主节点需要使用用户的ECS，用户可远程登录或管理这些ECS。另一种是，主节点也由云厂商托管，用户只能通过API Server管理Kubernetes。在费用方面，无论是否托管集群主节点，集群服务免费，按使用的ECS实例及计费方式收费。 Serverless 模式(目前公测中，暂时免费)。无需创建底层虚拟化资源，可以利用 Kubernetes 命令指明应用容器镜像、CPU和内存要求以及对外服务方式，直接启动应用程序。按容器使用的CPU和内存资源量计费。这种模式下应该是在一个集群内实现多租户，目前有些features不被支持。例如，部署不支持DaemonSet，Ingress不支持NodePort类型，存储不支持PV和PVC等。 用户可以根据自己的业务类型来选择适合的托管Kubernetes集群。如果部署的应用是无状态的Web服务，可以选择Serverless Kubernetes集群，进一步减少运维工作量。\n如果用户部署的应用有状态，需要挂载外部存储，例如MongDB集群，MQ集群，可以选择经典Dedicated Kubernetes模式。如果用户需要通过Kubernetes组件扩展或自定义实现某些功能，这些需求云厂商的标准版并没有提供，这时可以选择经典Dedicated Kubernetes模式，利用Kubernetes高度灵活的扩展机制来满足自定义需求。\n托管Kuberentes的优势 国内的阿里云有篇技术文档对比阿里云Kubernetes vs. 
自建Kubernetes，文章看起来虽然有厂商自卖自夸的嫌疑。作为阿里云K8S的客户，在使用托管K8S近一年来，深切的体会到云厂商托管K8S带来的种种好处，文档中提到的种种优势确实是言之凿凿。\n接下来具体看看云厂商托管K8S到底有哪些优势。\n便捷 通过Web界面/API一键创建Kubernetes集群，集群升级。 Web界面/API实现集群的扩容或缩容。 集群的安装，补丁以及常规版本升级在运维工作中属于体力活。在规模不大的时候，使用人工实现需要花费不少时间准备环境测试验证，且易错。如果集群体量不够大的话，开发自动化运维脚本又浪费人力成本。云计算厂商的托管K8S集群将提供专业、稳定的技术运维服务，和几乎为零的人力成本。\n从效率和人力成本上看，托管K8S集群完胜自建Kubernetes集群。\n功能更强大 Kubernetes作为一个容器编排系统，开源版本中许多组件没有默认实现或实现有限，需要跟运行环境(如托管K8S的云平台)集成。例如，存储，Load Balancer，网络等核心组件。官方文档Internal load balancer就提供了在不同的云厂商环境中的使用示例。部署一个强大且完整的K8S集群需要同许多云计算的基础组件集成(且只能通过API完成)，这往往是云计算厂商的强项。\n云厂商托管的K8S可以在以下方面提供强大的云计算平台支持，\n网络 高性能 VPC 网络插件。 支持 network policy 和流控。 负载均衡 支持创建公网或内网负载均衡实例，或者复用已有实例。支持指定带宽大小、计费方式、4层或7层协议代理等云厂商负载均衡功能。对应用运维来说可以把负载均衡的配置通过代码实现，并且支持版本控制。对比传统的云端部署，也可以将应用部署和应用运维集成在一起统一管理，避免应用发布和运维配置的割裂，减少人为运维失误。\n阿里云托管K8S的负载均衡详细配置可以参考这个文档，AWS上见此文档。\n存储 集成了云厂商的云盘、文件存储NAS、块存储等存储方案，基于标准的FlexVolume驱动，提供了最佳的无缝集成。\n如果是在云厂商的虚拟机上自建Kubernetes集群，默认无法使用云上的存储资源。如果需要利用云厂商提供的存储方案，例如对象存储，就需要自行开发基于FlexVolume的驱动。在厂商托管K8S已经完美解决了存储集成的问题，何必自己又去费时费力的定制开发呢？\n可以看到，云厂商托管的K8S集群在网络、负载均衡和存储上有许多天然的优势。在其他几个维度，托管的K8S集群同样也优于自建的K8S，\n运维 集成厂商的日志服务，监控服务。 K8S集群cluster autoscaler自动利用云厂商的弹性伸缩扩缩容集群节点。 镜像仓库 高可用，支持大并发。 支持镜像加速。 支持 p2p 分发。 可集成云平台的用户权限。 部分厂商目前免费且不限容量。 高可用 提供多可用区支持。 支持备份和容灾。 技术支持 专门的技术团队保障容器的稳定性。 每个 Linux 版本，每个 Kubernetes 版本都会在经过严格测试之后之后才会提供给用户。 提供 Kubernetes 升级能力，新版本一键升级。 为开源软件提供兜底，无论是K8S、Docker甚至Linux自身的问题提供支持。 专业的技术团队是提供稳定K8S服务必不可少的。但绝大多数企业是无法做到有专业的技术团队来维护K8S、提供K8S或容器技术自身的各种最佳实践、发现以及修复开源软件Bug。\n在笔者的使用托管K8S的时候就遇到这样的状况。其中一个集群升级到新版本Kubernetes后，内置DNS组件从KubeDNS被替换为全新的CoreDNS，而当时的CoreDNS版本在Service ExternalName支持上有Bug，导致已有的这种Service无法提供服务。在同云厂商的技术团队沟通后，先用workaround将问题快速绕过，不影响业务的使用。同时，云厂商的技术人员（也是K8S社区committer）继续调研，发现该问题是CoreDNS的Bug。在为开源CoreDNS项目创建Issue后，同时提供Patch，又在CoreDNS 
committer建议下完善了测试用例，推动了该问题快速在CoreDNS中被修复。CoreDNS包含Fix的版本发布后，云厂商技术支持团队将更完美的解决方案提供给了我们。作为K8S服务的用户，这种体验是极好的。当时我们的技术团队既没有精力也没有能力快速发现并修复开源软件中的这类问题，而云厂商的服务间接帮我们实现了这种能力。\n这其实是一种非常好的共赢商业模式，云厂商有能力且有动力投入顶尖技术团队将开源技术商业化，云厂商的用户则用最小的代价获得了最优的基础服务来为核心业务赋能。\n","link":"https://kane.mx/posts/effective-cloud-computing/using-kubernetes-on-cloud/","section":"posts","tags":["云计算","阿里云","AWS","kubernetes"],"title":"不要自建Kubernetes"},{"body":"这是“如何高效使用云服务”系列文章的引子。该系列将讲述如何利用各种公有云服务来安全合规、高质量、快速、低成本的打造产品/系统，帮助企业（特别是中小微创业团队）在人少，钱缺的情况下做到最高效率。\n个人使用公有云服务的经历 初会 最早是2012年在parttime项目中开始接触使用云计算服务，当时的初创团队也是希望用最低的成本来验证idea，所有使用了云服务来做POC。目前国内市场最领先的云计算厂商阿里云那时也才提供公有云服务不到1年。由于云产品不够成熟，加上团队技能经验不足，自助互助的渠道不畅，导致最初的云计算使用体验并不好，团队没有选择完全使用云服务构建产品。\nIaaS or PaaS 云计算兴起的早期，云厂商大致分为两类，提供基于IaaS或PaaS的云服务。2013年起也有尝试不同类型的厂商平台，虽然也较好的完成一些体量不大的项目，但要在他们上面构建大规模用户产品或企业级应用，在云产品完善度上或支持开发团队协作上都有不少欠缺，还有大量的基础工作或限制留给了开发团队自身解决。\nAll-in Cloud 2015年我开始一个微电影项目创业，团队是不到10人的微型团队。从效率和成本考虑，我们将所有的服务都放到了阿里云上。我们使用了多种云产品，例如，云主机（多种OS），对象存储，图片处理，CDN，SLB，人脸识别等云服务，结合Devops集成开发，测试，部署pipeline来加速产品的迭代和更新。每名工程师承担一种以上角色，前端，后端，运维，数据，视频渲染等。合理使用云厂商的各种产品帮我们在质量，效率，成本上获得巨大的收益。\n2017年我加入了一家企业财税服务的初创公司负责技术团队。公司在2018年获得了B轮投资，研发产品运营团队近百人，属于中等规模。随着各种开源技术的巨大进步和影响逐步扩大，微服务架构的流行，基于Kubernetes的Cloud Native Computing兴起。我们利用云厂商的容器服务，DBaaS，Big Data，AI技术等用最高效的方式将数个单体应用平滑升级到高可用弹性的分布式架构，更好的满足复杂业务的多变需求，公司服务也在全国300多个城市落地，服务了数十万中小微企业客户。同时利用云厂商的VPC，访问控制，WAF等产品进行权限控制和安全保护，有效防范了因为团队扩大管理难度增加而出现安全问题。\n缘起 
作为一名云计算服务6年的用户，见证了开源技术的快速发展和影响力急剧扩大，感受到整个云计算行业和厂商的长足进步。见证了国内头部云厂商从最初的使用难度颇大，现在成长为万众创业的首选服务商。\n过去的一年参加了数场技术会议，其中主题大多偏向于由知名的互联网或行业公司分享在海量数据下的技术应用。这些技术广泛涉及开发语言、应用架构、性能、大数据、机器学习和人工智能等领域，无论这些公司是否采用开源产品，在团队单兵技术能力，专业的分工，对开源项目的研发投入力量，这些经验和方法并不是中小企业可以轻易借鉴的。而云计算厂商将这些领域最基础通用的能力以产品的方式输出给用户，以按用量的方式计费，使用更简单，有专业团队维护和支持。中小团队就应该将这些事情“外包”给云厂商，集中精力到业务上，将最大的研发资源用到最核心最关键的地方。\n我同团队同事沟通中，和公司研发候选人面试交流中，发现许多从业者对云计算服务了解还不够深入。许多人理解中的云计算服务只有云服务器、云数据库等少数产品，需要自己安装维护应用服务器、负载均衡、收集日志等等看起来每个应用都绕不开的事情。他们的认知还停留在排查应用异常还需要远程登录服务器看日志，做不到合理的根据场景高效组合使用云服务，将云服务当做水电一样，作为最基础的能力加速业务的发展。业务上是采用名气大且成熟的产品，尝试新鲜看起来酷但不那么完善的产品，还是二次开发或自研开发？要做出最优的选择需要工程师能够从有高度的全局角度来考量，甚至在短时间内能用POC项目验证多个可选的方案，基于数据做出最终的选择。\n这就是这个系列的缘起，之后我将陆续分享使用那些高效的云服务产品的场景、心得、体会等等。\n封面图片Cloud Computing引用自The Blue Diamond Gallery under CC BY-SA 3.0\n","link":"https://kane.mx/posts/effective-cloud-computing/preface/","section":"posts","tags":["云计算","阿里云"],"title":"真的会用云服务吗？"},{"body":"上周参加了ArchSummit(全球架构师峰会)，在这里记录下部分参加的主题以及个人感受。\n会议回顾 今年参加了几次技术会议，微服务、容器技术、区块链、大数据、机器学习以及人工智能都是当下最热门的主题。同样这次ArchSummit绝大部分topics都跟这些主题相关。\n这次会议主要参加了两个专场主题，Kubernetes的应用和快手科技技术专题。\n基于 Kubernetes 的 DevOps是来自微软Azure的容器工程师分享如何基于 Kubernetes 的 CI/CD 落地实践。该分享中提到了CI/CD各个步骤中都有众多的工具支持，如何选择合适Kubernetes的工具将持续集成和部署串联在一起是Devops中的主要挑战。分享者也安利了AKS提供Devops完整的工具链，以及将开源工具同AKS中的服务集成实现CI/CD的最佳实践。\n我们噼里啪团队在CI/CD、Devops这块做得还不错。CI/CD pipelines持续将应用部署在运行的Kubernetes集群，过程中使用的工具链基本也是社区或CNCF推荐的主流工具。下一步可以考虑同云厂商的Devops工具链集成，进一步减少维护成本。\n基于Istio on Kubernetes云原生应用的最佳实践来自阿里云容器工程师的分享。他概要的分享了Istio技术和实现原理。当然也大力介绍了阿里云容器服务对Istio的原生支持，以及阿里云对客户使用Istio的支持，即使客户问题非常的初级他们的技术支持也很到位。\nIstio可以说是CNCF在Kubernetes上事实的服务治理实现。噼里啪技术团队也一直在关注这一块，正在尝试引入Istio提升服务的SLA。\n快手技术团队的4个分享都是围绕解决明确的业务问题而做得技术工作，非常具有实战性。其中快手万亿级实时 OLAP 平台的建设与实践介绍了快手实时OLAP平台从0到1的搭建过程。该平台从今年4月开始搭建，截止到11月，每日可以实时计算处理超过万亿的数据。而整个平台的搭建由两名大数据工程师外加一名前端工程师负责portal等UI，人效产出让人非常佩服。结合朋友间传言快手给技术人员的offer，快手应该是一家在实践类似Netflix管理文化的公司。\n最后给大家推荐一个国产的分布式New SQL数据库TiDB相关的主题。TiDB是国内技术团队开源的一个分布式数据库，已被CNCF作为Database实现推荐方案之一。他们的CTO分享了TiDB on Kubernetes 
最佳实践，以及他们客户北京银行在两地多活的核心系统中采用的数据库就是TiDB。\n个人感受 会议的分享者大多来自国内一线的互联网公司，他们普遍具备流量大、数据多、技术团队能力更强等特质。并且很少使用公有云服务，使用开源产品多数也会维护私有版本。他们的业务解决方案对中小型技术团队来说可复制性不强，照搬实施的难度高，更多的是在扩展思路了解业界技术动态。中小型技术团队最紧迫的事情是满足业务快速发展和需求多变，更合理的解法是选用云厂商的服务或第三方服务快速高效的满足业务需求。极客邦旗下的会议大多缺少这类的分享，相比之下AWS的reInvent大会在这方面做得更好。\n","link":"https://kane.mx/posts/2018/2018-12-13-bj-archsummit-review/","section":"posts","tags":["会议","架构","ArchSummit"],"title":"2018北京ArchSummit回顾"},{"body":"","link":"https://kane.mx/tags/archsummit/","section":"tags","tags":null,"title":"ArchSummit"},{"body":"","link":"https://kane.mx/tags/jenkins/","section":"tags","tags":null,"title":"Jenkins"},{"body":"","link":"https://kane.mx/tags/trouble-shooting/","section":"tags","tags":null,"title":"Trouble-Shooting"},{"body":"V秘开发团队一直使用着Jenkins CI来持续集成V秘服务的新功能和各种改进。近日，Jenkins CI在重启之后，很多已有任务的配置无法被Jenkins CI完整的加载，导致很多功能无法使用。导致我们整个网站的各种服务无法被升级更新了:-(\nJenkins CI在管理控制台列出如下的错误信息，示意现有任务的部分配置由于错误无法加载。\nCannotResolveClassException: hudson.plugins.git.GitSCM, CannotResolveClassException: com.cloudbees.jenkins.plugins.BitBucketTrigger, CannotResolveClassException: hudson.plugins.emailext.ExtendedEmailPublisher, CannotResolveClassException: hudson.plugins.parameterizedtrigger.BuildTrigger 通过上面的错误信息，我们初步认为错误是由于插件无法被Jenkins CI加载。但是通过Jenkins CI的插件管理列表，我们发现Git插件已经被认为是安装的了。同时我们也可以在Jenkins CI安装目录中找到插件对应的文件git.jar，并且成功验证了类hudson.plugins.git.GitSCM也是存在在jar文件里面的。重新安装Git client插件也不能解决这个错误！\n经过进一步的分析，通过Jenkins CI的系统日志，我们发现Git插件虽然是成功安装了，但是它所依赖的某些插件没有被安装！这导致Jenkins CI无法正确加载Git插件。通过日志的提示，将缺失的插件一一安装上，重启Jenkins CI后，插件加载正常，任务执行也恢复正常。\n这个错误出现的还是相当奇怪。因为Jenkins CI会在安装插件的时候将依赖的插件一并安装上。此外该Jenkins CI已经运行很久了，这些插件也是一直安装着的。不过现在回想起之前升级Jenkins CI插件的时候，部分插件由于网络原因升级失败了，但是没有重新更新。这导致这些插件处在了一个不正确的状态。在重启Jenkins CI后，这些插件被标记为未安装，导致依赖它们的插件无法加载。\n","link":"https://kane.mx/posts/2016/how-to-fix-jenkins-fail-to-load-job-config/","section":"posts","tags":["Jenkins","trouble-shooting"],"title":"如何修复Jenkins 
CI无法读取存在的任务配置"},{"body":"","link":"https://kane.mx/tags/mongodb/","section":"tags","tags":null,"title":"MongoDB"},{"body":"MongoDB是目前最为流行的NoSQL数据库之一。V秘的后台数据就是保存在MongoDB中的哦;)\n尽管MongoDB的性能为业界称道，但任何数据库系统使用中都存在着慢查询的问题。慢查询的性能问题，可能是由于使用非最优的查询语句，不正确的索引或其他配置原因导致的。但开发人员或数据库维护人员首先要找出这些低效的查询，才能做出对应的查询优化。\n在MongoDB中实现慢查询的profile是非常容易，因为MongoDB内置了profile开关来记录执行时间触发了profile条件的查询。\n参照db.setProfileLevel()的文档，通过以下命令就可以记录执行时长超过300ms的查询。\n1db.setProfilingLevel(1, 300) 当慢查询被重现后，可以通过查找system.profile collection来查看执行时长超过300ms的查询。\n被profiler记录下来慢查询record看起来如下，\n1{ 2 \u0026#34;op\u0026#34; : \u0026#34;query\u0026#34;, 3 \u0026#34;ns\u0026#34; : \u0026#34;myCollection\u0026#34;, 4 \u0026#34;query\u0026#34; : { 5 \u0026#34;builds\u0026#34; : { 6 \u0026#34;$elemMatch\u0026#34; : { 7 \u0026#34;builtTime\u0026#34; : null, 8 \u0026#34;$and\u0026#34; : [ 9 { 10 \u0026#34;createdTime\u0026#34; : { 11 \u0026#34;$lt\u0026#34; : ISODate(\u0026#34;2016-09-20T20:07:00.796Z\u0026#34;) 12 } 13 } 14 ] 15 } 16 } 17 }, 18 \u0026#34;ntoreturn\u0026#34; : 0, 19 \u0026#34;ntoskip\u0026#34; : 0, 20 \u0026#34;nscanned\u0026#34; : 0, 21 \u0026#34;nscannedObjects\u0026#34; : 18231, 22 \u0026#34;keyUpdates\u0026#34; : 0, 23 \u0026#34;writeConflicts\u0026#34; : 0, 24 \u0026#34;numYield\u0026#34; : 577, 25 \u0026#34;locks\u0026#34; : { 26 \u0026#34;Global\u0026#34; : { 27 \u0026#34;acquireCount\u0026#34; : { 28 \u0026#34;r\u0026#34; : NumberLong(1156) 29 } 30 }, 31 \u0026#34;Database\u0026#34; : { 32 \u0026#34;acquireCount\u0026#34; : { 33 \u0026#34;r\u0026#34; : NumberLong(578) 34 } 35 }, 36 \u0026#34;Collection\u0026#34; : { 37 \u0026#34;acquireCount\u0026#34; : { 38 \u0026#34;r\u0026#34; : NumberLong(578) 39 } 40 } 41 }, 42 \u0026#34;nreturned\u0026#34; : 2, 43 \u0026#34;responseLength\u0026#34; : 98076, 44 \u0026#34;millis\u0026#34; : 11161, 45 \u0026#34;execStats\u0026#34; : { 46 \u0026#34;stage\u0026#34; : \u0026#34;COLLSCAN\u0026#34;, 47 \u0026#34;filter\u0026#34; : { 48 \u0026#34;builds\u0026#34; : { 49 
\u0026#34;$elemMatch\u0026#34; : { 50 \u0026#34;$and\u0026#34; : [ 51 { 52 \u0026#34;$and\u0026#34; : [ 53 { 54 \u0026#34;createdTime\u0026#34; : { 55 \u0026#34;$lt\u0026#34; : ISODate(\u0026#34;2016-09-20T20:07:00.796Z\u0026#34;) 56 } 57 } 58 ] 59 }, 60 { 61 \u0026#34;builtTime\u0026#34; : { 62 \u0026#34;$eq\u0026#34; : null 63 } 64 } 65 ] 66 } 67 } 68 }, 69 \u0026#34;nReturned\u0026#34; : 2, 70 \u0026#34;executionTimeMillisEstimate\u0026#34; : 11080, 71 \u0026#34;works\u0026#34; : 18233, 72 \u0026#34;advanced\u0026#34; : 2, 73 \u0026#34;needTime\u0026#34; : 18230, 74 \u0026#34;needFetch\u0026#34; : 0, 75 \u0026#34;saveState\u0026#34; : 577, 76 \u0026#34;restoreState\u0026#34; : 577, 77 \u0026#34;isEOF\u0026#34; : 1, 78 \u0026#34;invalidates\u0026#34; : 0, 79 \u0026#34;direction\u0026#34; : \u0026#34;forward\u0026#34;, 80 \u0026#34;docsExamined\u0026#34; : 18231 81 }, 82 \u0026#34;ts\u0026#34; : ISODate(\u0026#34;2016-09-20T23:07:14.313Z\u0026#34;), 83 \u0026#34;client\u0026#34; : \u0026#34;10.171.127.66\u0026#34;, 84 \u0026#34;allUsers\u0026#34; : [ 85 { 86 \u0026#34;user\u0026#34; : \u0026#34;dbuser\u0026#34;, 87 \u0026#34;db\u0026#34; : \u0026#34;mydb\u0026#34; 88 } 89 ], 90 \u0026#34;user\u0026#34; : \u0026#34;dbuser@mydb\u0026#34; 91} 上面的数据具体解读如下，\nop: 'query'表示执行的是查询， ns是指查询的collection， query是具体的查询语句， 核心部分是execStats，给出了的查询语句具体执行统计，跟**.explain('execStats')**的内容是一致的。上面的统计是说，这个query执行了整个collection的扫描(总计扫描了18231个文档)，最终返回了2条文档，花费了11080ms，也就是11s还多的时间！这表明被记录下的慢查询跟collection的索引设置有问题，该查询没有用上索引。解决方案很简单，改善查询语句使用存在的索引或者设置合理的索引。 ts是查询开始请求的时间， allUsers和user都是MongoDB client连接所使用的用户。 ","link":"https://kane.mx/posts/2016/how-to-find-slow-queries-in-mongodb/","section":"posts","tags":["MongoDB","performance-tuning"],"title":"MongoDB中如何找出慢查询"},{"body":"Swarm mode在Docker v1.12中正式发布，Swarm mode带来了诸如Docker集群，容器编排，多主机网络等激动人心的特性。V秘团队也尝试着将各种后台服务部署到Docker Swarm Cluster获取更好的弹性计算能力。\nDocker v1.12中正式发布的Docker 
Swarm在我们使用中发现仍有不少不足之处，让我们一一分享给大家。\n无法将服务的published端口只绑定到特定的网卡上。比如我们的云主机（同时也是Swarm manager/node）有eth0和eth1两块网卡，分别连接内网和外网。我们计划在Docker Swarm中运行一个nginx服务，通过80/443端口提供HTTP/HTTPS服务。当我们希望将nginx中的Web服务暴露在云主机上时，我们通过以下命令创建nginx服务。然而我们无法选择将published的80端口绑定在哪个interface上。Docker Swarm会自动让服务监听Swarm node所有网卡的80端口。如果我们只想将这个服务暴露在内网interface上，暂时无法实现。\ndocker service create --name vme-nginx --network vme-network --replicas 1 \\\n  --publish 80:80 --publish 443:443 \\\n  nginx:1.11\n无法为Docker Swarm内运行的服务设置主机名。通过docker run命令执行的容器可以设置hostname。比如，\ndocker run --hostname vme-nginx nginx:1.11\n但是docker service create命令缺少等价的参数为容器指定hostname。一些依赖于hostname的服务将无法部署在Docker Swarm中，比如clustered rabbitmq。 Docker compose还不能与Docker Swarm完美集成。目前有一个experimental的Docker Stacks and Distributed Application Bundles在尝试做更好的整合。 docker service update有时不能更新正在运行中的container。更多讨论见这个issue。 ","link":"https://kane.mx/posts/2016/the-limitations-docker-swarm-mode-v1.12/","section":"posts","tags":["docker","docker-swarm"],"title":"Docker Swarm mode(v1.12.x)的一些使用限制"},{"body":"","link":"https://kane.mx/tags/docker-swarm/","section":"tags","tags":null,"title":"Docker-Swarm"},{"body":"","link":"https://kane.mx/tags/ubuntu-1404/","section":"tags","tags":null,"title":"Ubuntu-1404"},{"body":"V秘团队一直致力于用技术改善产品。V秘后台的各种服务一直是通过完善的Devops流程自动部署到Docker容器集群。随着Swarm mode在Docker v1.12中正式发布，Swarm mode带来了诸如Docker集群，多主机网络等激动人心的特性。我们也在尝试将V秘服务部署到Docker Swarm Cluster获取更好的弹性计算能力。\n然而我们将V秘的服务部署到Docker Swarm Cluster时遇到服务容器无法启动的错误。错误信息类似如下，\nstarting container failed: could not add veth pair inside the network sandbox: could not find an appropriate master \u0026quot;ov-000100-1wkbc\u0026quot; for \u0026quot;vethee39f9d\u0026quot;\n经过与Docker社区的反馈讨论，暂时通过升级Docker主机(OS: Ubuntu 14.04 LTS)的内核版本解决了这个错误。\n具体方法如下，\nroot@swarm1:~# uname -r\n3.13.0-32-generic\n\nroot@swarm1:~# apt-get install linux-generic-lts-vivid\nroot@swarm1:~# reboot\n\nroot@swarm1:~# uname -r\n3.19.0-69-generic\n至于这个错误的根本原因是Docker的bug还是对Linux Kernel有特殊的要求，需要Docker开发进一步确认。如果对此问题有更多兴趣，可以关注docker 
issue #25039。\n","link":"https://kane.mx/posts/2016/docker-swarm-mode-in-ubuntu-1404/","section":"posts","tags":["docker","docker-swarm","ubuntu-1404"],"title":"创建于Docker Swarm的服务无法在Ubuntu 14.04 LTS中运行"},{"body":"","link":"https://kane.mx/tags/angularjs/","section":"tags","tags":null,"title":"Angularjs"},{"body":"","link":"https://kane.mx/tags/nginx/","section":"tags","tags":null,"title":"Nginx"},{"body":"","link":"https://kane.mx/tags/seo/","section":"tags","tags":null,"title":"Seo"},{"body":"","link":"https://kane.mx/tags/single-page-app/","section":"tags","tags":null,"title":"Single-Page-App"},{"body":"在之前的文章我曾提到基于Angularjs的单页面应用在用户体验上的种种好处。然而任何事情都不是完美的，Angular和类似的框架通过应用内做页面路由的实现给SEO（也俗称搜索引擎优化）带来了不少麻烦。\n首先，我们来看看页面内路由是如何实现的。默认Angularjs生成的页面uri类型如下，\nhttp://mydomain.com/#/app/page1\n浏览器请求上面这个uri的时候，实际发送给服务器的请求地址是http://mydomain.com/, web服务器会将默认的页面响应给浏览器，比如index.html或index.php等。\n浏览器返回的页面里面引入了Angularjs和其他应用需要的JS库。Angularjs应用开始执行后，尝试处理路由**/app/page1**。如果应用定义了该路由，将加载必要的JS库和其他html片段来完成页面的渲染。\n理解了Angularjs页面内路由的原理后，我们知道了对浏览器或搜索引擎爬虫而言，单页面应用所有的页面对浏览器和搜索引擎都是一个网址，比如http://mydomain.com/。这样对爬虫抓取站内链接造成了困难，因为所有应用内的链接都被认做了同一个链接。\n我们理解了uri http://mydomain.com/#/app/page1给SEO造成的麻烦，接下来就是讨论如何针对SEO来作的优化。\n最理想的情况当然是搜索引擎爬虫变的更加智能，它能理解网站的框架，并且针对此种情况做出优化。但截止到目前，包括Google在内的所有爬虫都无法做到这点。那我们SEO的优化只能在应用这边来做了。\nAngularjs提供了一种HTML5 mode模式可以利用HTML5 History API来实现页面内路由。打开的方法如下，\n1$locationProvider.html5Mode(true); 同时在index.html页面加上如下标签，\n1\u0026lt;base href=\u0026#34;/\u0026#34;\u0026gt; 在打开HTML5 mode后的Angularjs应用的链接看起来就是这样了，\nhttp://mydomain.com/app/page1\n新的链接模式和站内跳转通过访问网站主页请求将没有任何问题。然而直接在浏览器请求如上链接的话，Web服务器将尝试请求/app/page1，通常会得到404的页面响应。因为服务器上并没有部署页面/app/page1。\n这时就需要在Web应用服务器或应用里面实现URL Rewrite。将/app/page1的请求转到单页面应用html文件上。\n下面是一些Web服务器或应用的参考配置，\nApache Rewrites\n1\u0026lt;VirtualHost *:80\u0026gt; 2 ServerName my-app 3 4 DocumentRoot /path/to/app 5 6 \u0026lt;Directory /path/to/app\u0026gt; 7 RewriteEngine on 8 9 # Don\u0026#39;t rewrite files or directories 10 RewriteCond %{REQUEST_FILENAME} 
-f [OR] 11 RewriteCond %{REQUEST_FILENAME} -d 12 RewriteRule ^ - [L] 13 14 # Rewrite everything else to index.html to allow html5 state links 15 RewriteRule ^ index.html [L] 16 \u0026lt;/Directory\u0026gt; 17\u0026lt;/VirtualHost\u0026gt; Nginx Rewrites\n1server { 2 server_name my-app; 3 4 root /path/to/app; 5 6 location / { 7 try_files $uri $uri/ /index.html; 8 } 9} Azure IIS Rewrites\n1\u0026lt;system.webServer\u0026gt; 2 \u0026lt;rewrite\u0026gt; 3 \u0026lt;rules\u0026gt; 4 \u0026lt;rule name=\u0026#34;Main Rule\u0026#34; stopProcessing=\u0026#34;true\u0026#34;\u0026gt; 5 \u0026lt;match url=\u0026#34;.*\u0026#34; /\u0026gt; 6 \u0026lt;conditions logicalGrouping=\u0026#34;MatchAll\u0026#34;\u0026gt; 7 \u0026lt;add input=\u0026#34;{REQUEST_FILENAME}\u0026#34; matchType=\u0026#34;IsFile\u0026#34; negate=\u0026#34;true\u0026#34; /\u0026gt; 8 \u0026lt;add input=\u0026#34;{REQUEST_FILENAME}\u0026#34; matchType=\u0026#34;IsDirectory\u0026#34; negate=\u0026#34;true\u0026#34; /\u0026gt; 9 \u0026lt;/conditions\u0026gt; 10 \u0026lt;action type=\u0026#34;Rewrite\u0026#34; url=\u0026#34;/\u0026#34; /\u0026gt; 11 \u0026lt;/rule\u0026gt; 12 \u0026lt;/rules\u0026gt; 13 \u0026lt;/rewrite\u0026gt; 14\u0026lt;/system.webServer\u0026gt; Express Rewrites\n1var express = require(\u0026#39;express\u0026#39;); 2var app = express(); 3 4app.use(\u0026#39;/js\u0026#39;, express.static(__dirname + \u0026#39;/js\u0026#39;)); 5app.use(\u0026#39;/dist\u0026#39;, express.static(__dirname + \u0026#39;/../dist\u0026#39;)); 6app.use(\u0026#39;/css\u0026#39;, express.static(__dirname + \u0026#39;/css\u0026#39;)); 7app.use(\u0026#39;/partials\u0026#39;, express.static(__dirname + \u0026#39;/partials\u0026#39;)); 8 9app.all(\u0026#39;/*\u0026#39;, function(req, res, next) { 10 // Just send the index.html for other files to support HTML5Mode 11 res.sendFile(\u0026#39;index.html\u0026#39;, { root: __dirname }); 12}); 13 14app.listen(3006); //the port you want to use ASP.Net C# Rewrites\n1private const 
string ROOT_DOCUMENT = \u0026#34;/default.aspx\u0026#34;; 2 3protected void Application_BeginRequest( Object sender, EventArgs e ) 4{ 5 string url = Request.Url.LocalPath; 6 if ( !System.IO.File.Exists( Context.Server.MapPath( url ) ) ) 7 Context.RewritePath( ROOT_DOCUMENT ); 8} ","link":"https://kane.mx/posts/2016/seo-optimization-for-angularajs-based-app/","section":"posts","tags":["angularjs","single-page-app","seo","nginx","搜索引擎优化"],"title":"基于Angularjs单页面应用的SEO优化"},{"body":"","link":"https://kane.mx/tags/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E4%BC%98%E5%8C%96/","section":"tags","tags":null,"title":"搜索引擎优化"},{"body":"","link":"https://kane.mx/tags/session-management/","section":"tags","tags":null,"title":"Session-Management"},{"body":"","link":"https://kane.mx/tags/spring-boot/","section":"tags","tags":null,"title":"Spring-Boot"},{"body":"","link":"https://kane.mx/tags/spring-framework/","section":"tags","tags":null,"title":"Spring-Framework"},{"body":"","link":"https://kane.mx/tags/spring-session/","section":"tags","tags":null,"title":"Spring-Session"},{"body":"在微服务和容器等技术的帮助下，Web应用可以较为容易的进行水平扩展，来部署更多的应用实例来提升请求处理数QPS。当Web服务有状态的时候，如何在集群下管理用户session成为新的待解决问题。\nSpring Framework针对此问题衍生出了一个子项目Spring Session来实现集群下的session管理。该项目提供了以下功能：\n提供API和实现管理用户session HttpSession - 替换实现应用容器(tomcat)中的HttpSession Clustered Sessions - 实现集群的session而不依赖任何应用容器特定的解决方案 Multiple Browser Sessions - 支持多个用户session保存在同一个浏览器实例中 (例如，类似Google的多用户认证). 
RESTful APIs - 支持在HTTP请求头中传递session id，从而支持RESTful API的认证 WebSocket - 在接收WebSocket消息时保持HttpSession存活 从上面的功能列表中，我们可以看到Spring Session能够满足集群下各种session的使用场景和需求。\nSpring Session在1.0.0 GA可以使用Redis作为session存储的backend。\n根据changelog，最新的1.1.0 GA支持自定义Cookie的创建，允许自定义Cookie的过期时间，作用域等。在即将发布的1.2.0 GA版本中，将添加支持JDBC的关系数据库和MongoDB作为session保存的backend。\n此外，Spring Session同Spring-boot的应用有很好的集成，只需要十多行代码及配置即可集成！\n","link":"https://kane.mx/posts/2016/clustered-session-under-spring-framework/","section":"posts","tags":["web-2.0","session-management","spring-framework","spring-session","spring-boot"],"title":"Spring框架下的分布式session管理"},{"body":"","link":"https://kane.mx/tags/web-2.0/","section":"tags","tags":null,"title":"Web-2.0"},{"body":"","link":"https://kane.mx/tags/architecture/","section":"tags","tags":null,"title":"Architecture"},{"body":"春天来了，V秘大家庭也新增了两位10后的传人。新爸爸经过一番忙乱后，希望在这里与大家分享V秘的架构，共同探讨如何快速的构建高可用，高性能的Web服务。\nV秘致力于提供最好的在线视频制作云平台。让用户随时随地零门槛的快速制作出高质量高清晰度的视频，来纪念记录生活中有意义的时刻，同时将这份快乐传递给更多的家人朋友一起分享。\n然而要可靠的可扩展的实现这样看似简单的需求，其背后却是由众多知名开源技术，可靠的云服务，不间歇的监控运维来实现和保证的。\nV秘架构的基本目标就是要实现，\n服务的高扩展性。有可靠有效的方法支撑从数万并发到数十万，百万及更多的并发请求。 服务的高可用性。各种服务都是多实例的集群，某些服务故障后，集群中的其他实例仍然能够提供服务。 服务的自动化构建。从代码到服务部署上线是一套自动化的流程，越少的人工介入越能保证服务的可用性。 系统的实时监控。7x24小时的监控保证服务的可用性，当监控到数据异常或服务停止运行时能及时告警，引入人工运维团队。 更多细节请参阅下面的slides,\nHow we build Videome from Meng Xin Zhu 欢迎留言与我们探讨你的心得和建议。\n","link":"https://kane.mx/posts/2016/how-we-build-videome/","section":"posts","tags":["web-2.0","architecture","网站架构"],"title":"V秘是如何构建的"},{"body":"","link":"https://kane.mx/tags/%E7%BD%91%E7%AB%99%E6%9E%B6%E6%9E%84/","section":"tags","tags":null,"title":"网站架构"},{"body":"","link":"https://kane.mx/tags/aliyun/","section":"tags","tags":null,"title":"Aliyun"},{"body":"","link":"https://kane.mx/tags/oss/","section":"tags","tags":null,"title":"Oss"},{"body":"","link":"https://kane.mx/tags/ossfs/","section":"tags","tags":null,"title":"Ossfs"},{"body":"阿里云提供的对象或者文件存储叫OSS，为应用程序提供了海量存储，按需付费等服务。应用程序则需要通过Aliyun 
OSS的各语言SDK才能操作（读，写，遍历等）OSS中的文件。\n对运维人员来说，做一些数据维护工作的时候，通过SDK操作OSS中的文件就会比较麻烦。在linux/unix环境下，通常有一些工具把远程文件系统或云盘挂载为本地文件。在网络状况比较好的情况下，操作远程文件就像操作本地文件一样。例如，把Amazon S3，Dropbox云盘，可通过ssh登录的远程服务器上的磁盘挂载为本地文件系统。\n之前也有第三方公司开发的工具把OSS bucket挂载为本地磁盘。出于安全考虑一直未敢使用。\n终于，阿里云推出了官方开源版本的ossfs，并且提供技术支持（通过工单）。\n接下来，聊聊我的使用体会。\n安装，配置都还简单。 文档看起来比较详细，但实际操作起来有些就不对。感觉写文档的人，并没有在相应环境上测试过。 权限设计得一塌糊涂。ossfs基于FUSE，理当允许非root挂载或卸载OSS bucket。非root用户使用ossfs挂载的文件默认的owner都是root! 还好目前有workaround，挂载的时候指定参数，-ouid=your_uid -ogid=your_gid来指定文件的owner。 性能极其低下！！！一台ECS主机挂载了一个使用内网地址的oss bucket，bucket根下面有2k+子目录（对文件系统而言），bucket内文件总计有28G。然而执行ls /tmp/\u0026lt;bucket mount point\u0026gt;超过10分钟都无法完成。而我们V秘之前用Java实现的AliyunOSSFS执行同样的操作只需要数秒。 阿里云相关的技术支持人员极其不专业。很多文件系统，FUSE等概念都不甚了解。跟他们沟通这些技术问题，首先要花时间进行教育。花费大量时间来沟通，进展却缓慢。 总之，阿里云ossfs这个工具远远没有达到production ready的质量。无法使用到生产环境中。 ","link":"https://kane.mx/posts/2016/aliyun-ossfs-sucks/","section":"posts","tags":["aliyun","oss","ossfs","阿里云"],"title":"说一说阿里云ossfs"},{"body":"","link":"https://kane.mx/tags/wechat/","section":"tags","tags":null,"title":"Wechat"},{"body":"","link":"https://kane.mx/tags/weixin/","section":"tags","tags":null,"title":"Weixin"},{"body":"","link":"https://kane.mx/tags/%E5%85%AC%E4%BC%97%E5%B9%B3%E5%8F%B0/","section":"tags","tags":null,"title":"公众平台"},{"body":"微信给公众平台提供了素材管理的接口，通过这一系列接口可以上传，接收以及管理图片，视频等多媒体文件。其中又分为临时和永久两种类型。永久素材有总量的限制，临时素材微信服务器只给保存3天。\n最近V秘刚好有个同微信用户互动的场景，为用户美化微信拍摄的小视频。V秘后台服务器收到用户发送过来的小视频（微信将其认作临时素材），将其美化处理后，再将美化的视频上传为临时素材，最终美化后的视频作为视频类型的客服消息被推送给用户。整个流程很简洁，用户发送小视频后，就坐等观看美化后的小视频了。\n然而最终经过V秘开发团队的实践及测试，得出的结论是，\n微信公众平台的临时素材不能用！绝对的鸡肋！\n公众平台上传素材的API以及使用已有素材发送视频消息API都很健壮。但问题出在了微信后台资源的服务上面。\n开发者把图片视频成功上传为临时素材后，会从微信的接口得到这个素材的ID。这个ID随后作为给用户发送图文消息或视频消息的资源。微信后台会把这个ID对应到素材的真实URL路径上。这个过程是没有问题的。同时微信作为一个拥有海量用户的软件，它会将这些将要推送给用户的素材都发布到它的CDN。用户收到的最终图片视频的地址就是素材文件在微信/腾讯CDN上的地址。对CDN有了解的朋友都知道，CDN服务器分散在全国或全世界各地，当用户请求这个资源的时候，请求会被路由到离用户最近的CDN服务器上。当CDN服务器上还没有缓存请求的资源时，这时候有个回源的过程，也就是原始文件从文件服务器传送到该CDN服务器的过程。这时，用户有一个额外的等待，等待时间取决于文件大小以及CDN服务器和文件服务器间的带宽。\n微信用来给公众号放置临时素材的CDN在这一块出了问题。在
我们的测试中，微信CDN可能一直无法提供这些临时素材（某些文件超过1天后仍然无法访问）。而且出现错误的几率相当高，至少有20%。由于CDN无法为临时素材提供可靠的访问保障，所以我们得出结论：微信给公众号的临时素材这个功能基本就是不能用。\n","link":"https://kane.mx/posts/2016/weixin-temporary-materials/","section":"posts","tags":["weixin","wechat","微信","公众平台"],"title":"如何使用微信公众平台的临时素材"},{"body":"","link":"https://kane.mx/tags/%E5%BE%AE%E4%BF%A1/","section":"tags","tags":null,"title":"微信"},{"body":"","link":"https://kane.mx/tags/pay/","section":"tags","tags":null,"title":"Pay"},{"body":"随着AngularJS等前端MVC框架的流行，AJAX的异步请求数据结合H5的push state等特性，极大的改善了网站的用户体验和页面加载性能。这类网站应用通常只有一个入口页面，通过应用内路由到不同的页面，所以俗称单页面(single page application)应用。页面URL看起来如下，\n网站首页 http://mysite.com/#/index 商品列表页 http://mysite.com/#/goods/list 商品详情页 http://mysite.com/#/goods/skuid 网站关于页 http://mysite.com/#/about 对浏览器而言，上面几个地址都是访问的网站/目录，每个url不同的是hash部分。而AngularJS正是依赖页面的hash来做的应用内路由，根据不同的路由来加载不同的js和html片段，实现动态内容的加载。\n世上并没有绝对完美的事情，单页面应用在用户体验和性能上获得了好处。然而，在别的地方必然付出代价。这里就分享一下单页面应用和微信支付集成的一些经验。\n这里的微信支付指的是，在微信浏览器中通过JS接口调起微信支付来完成网页应用中商品的购买。微信支付本身的开发集成并不复杂，这里就不赘述了。微信支付出于安全考虑，要求公众号必须注册支付发起页面的地址（到支付页面的上级目录为止），并且能够添加到白名单的地址不超过3个。也就是如果应用在商品详情页发起支付请求，那么地址 http://mysite.com/#/goods/ 必须在白名单列表。\n目前为止，一切都很好理解，把支付页面加到微信支付白名单不就万事大吉了。可经过实测，事实却不是这么简单！\n在微信iOS版本中，微信支付JS会错误的使用landing网站页面的URL，而不是发起支付的页面URL！比如用户通过网站首页 http://mysite.com/#/index 进入应用，通过站内链接浏览到了某商品详情页 http://mysite.com/#/goods/skuid 并发起了支付。但微信JS会把landing页面URL http://mysite.com/#/index 判定为支付的发起页面，从而导致支付JS调用失败！\n因为应用存在多个页面，不可能把所有的页面都加到支付白名单中(有3个数目限制，并且工作量也大到不现实)。要解决这个问题，只好另辟蹊径。我目前找到的方法是，在打开商品详情页的时候强制刷新页面。等同于直接在微信浏览器中打开了商品详情页。虽然对用户体验有些影响，但支付功能正常工作了。\n","link":"https://kane.mx/posts/2016/single-page-app-meets-weixin-pay/","section":"posts","tags":["weixin","wechat","pay","支付"],"title":"单页面应用(single page application)中使用微信支付"},{"body":"","link":"https://kane.mx/tags/%E6%94%AF%E4%BB%98/","section":"tags","tags":null,"title":"支付"},{"body":"最近在提交前端代码后，前端代码的自动发布老是失败。失败的原因多是编译Docker镜像时在执行COPY语句拷贝文件到镜像文件系统时，抛出了'No space left on device'这个错误。这个错误根据描述非常好理解，就是docker文件系统所在磁盘没有了空间。\n但是通过df 
-h命令，该磁盘至少还有3，4个G的剩余空间。而前端镜像的文件大小最多也不超过300M。在该磁盘通过touch,cp仍然可以创建文件。\n所以这个问题非常奇怪，为什么docker或者操作系统抱怨磁盘没有了空间？在磁盘仍然剩余数个G的情况下？\n再通过相关的查找后，docker的这个issue给了我启发。Linux文件系统的inode在耗尽后，该文件系统将不能再创建新文件。因为前端页面是基于nodejs的程序，它依赖的packages产生了大量文件，在反复制作不同的docker images时，这些依赖文件又被反复复制，导致文件数量远远超过了默认inode和磁盘大小的比例，最终inode先于磁盘空间被全部使用。\n遇到类似问题的同学，可以通过df -i查看inode的使用情况来排查问题是否由于inode耗尽导致这个错误。\n","link":"https://kane.mx/posts/2016/docker-build-no-space-left-caused-by-inode-exhausted/","section":"posts","tags":["docker","troubleshoot"],"title":"文件系统的Inode耗尽，会导致Docker编译镜像出现'No space left on device'错误"},{"body":"","link":"https://kane.mx/tags/daemon/","section":"tags","tags":null,"title":"Daemon"},{"body":"Recently I wrote a Linux like initd script to start/stop my web application.\nThe script works well when running it in shell of linux. The web application will run in background by daemon.\nHowever I found both daemon and web application(java) exited immediately if I started the script in Jenkins as a shell step of build process.\nI put below simple script in 'Execute shell' block,\n1daemon --name=test-daemon -- sleep 200sleep 60 The process 'daemon' and 'sleep 200' should exit after 200 seconds the 'sleep' exits. The jenkins job will be finished in 60 secs.\njenkins 9954 9950 0 21:48 ? 00:00:00 sleep 60 jenkins 9955 1 0 21:48 ? 00:00:00 daemon —name=test-daemon — sleep 200 jenkins 9956 9955 0 21:48 ? 00:00:00 sleep 200 Above is the process info queried via ps command. The father pid of daemon is 1, not the script generated by Jenkins.\nBut both the process 'daemon' and 'sleep 200' immediately exited when the script finished. 
Something in Jenkins must be causing the daemon to exit unexpectedly.\nIt is really frustrating to use daemon to stop/start the web application in Jenkins.\nFinally I used a Docker container to run my web application, which can easily be stopped/started via a script in Jenkins.\n","link":"https://kane.mx/posts/archive/blogspot/daemon-hell-in-jenkins/","section":"posts","tags":["docker","daemon","jenkins","jenkins-cli"],"title":"Daemon hell in Jenkins"},{"body":"","link":"https://kane.mx/tags/jenkins-cli/","section":"tags","tags":null,"title":"Jenkins-Cli"},{"body":"","link":"https://kane.mx/tags/java/","section":"tags","tags":null,"title":"Java"},{"body":"","link":"https://kane.mx/tags/mac-osx/","section":"tags","tags":null,"title":"Mac OSX"},{"body":"After uninstalling some applications from my Mac OSX, I found that the applications that depend on the JRE no longer worked at all. I noticed the following symptoms,\nEclipse Mars could not be launched, even though I pointed the launching VM to another one (`java -version` still worked). The SWT native library failed to resolve its dependency on '/System/Library/Frameworks/JavaVM.framework/Versions/A/JavaVM', which did not exist. I tried to reinstall Oracle 1.8.0_u45 via both brew and the dmg image downloaded from the Oracle website; both attempts failed as well. The Mac pkg Installer could not be started due to a broken dylib, which means I could not install any pkg via the GUI. The command line (such as sudo installer -verboseR -target / -pkg /Volumes/OS\\ X\\ 10.10.4\\ Update\\ Combo/OSXUpdCombo10.10.4.pkg) still worked. Finally I realized the problem was caused by my uninstalling the out-of-date Apple Java 6. It looks like everything that failed above requires the system built-in Java. It really does not make sense for the Oracle 1.8 installer script to depend on the out-of-date Java.\nFinally I reinstalled Java for OS X 2014-001 to make everything work again. 
The GUI installer for pkg files still does not work, so you need to use the command below to install the pkg.\nsudo installer -verboseR -target / -pkg /Volumes/Java\\ for\\ OS\\ X\\ 2014-001/JavaForOSX.pkg\n","link":"https://kane.mx/posts/archive/blogspot/the-symptoms-of-java-broken-in-mac-osx/","section":"posts","tags":["Java","Mac OSX","troubleshoot"],"title":"The symptoms of Java broken in Mac OSX 10.10 and fix solution"},{"body":"","link":"https://kane.mx/tags/groovy/","section":"tags","tags":null,"title":"Groovy"},{"body":"Jenkins supports ssh authentication in its CLI.\nBelow is a command to verify that I am authenticated:\njava -jar jenkins-cli.jar -s http://myserver/jenkins who-am-i\nAuthenticated as: myuser\nAuthorities:\n  authenticated\nHowever, you would still hit a permission error when running a groovy script in the CLI.\njava -jar jenkins-cli.jar -s http://myserver/jenkins groovysh \u0026#39;jenkins.model.Jenkins.instance.pluginManager.plugins.each { println(\u0026#34;${it.longName} - ${it.version}\u0026#34;) };\u0026#39;\nException in thread \u0026#34;main\u0026#34; java.lang.reflect.UndeclaredThrowableException\nat $Proxy2.main(Unknown Source)\nat hudson.cli.CLI.execute(CLI.java:271)\nat hudson.cli.CLI._main(CLI.java:417)\nat hudson.cli.CLI.main(CLI.java:322)\nIt's a bug in Jenkins. The workaround is to create a groovy script file, then run that script via the Jenkins CLI.\njava -jar jenkins-cli.jar -s http://myserver/jenkins/ groovy test_script.gsh\n","link":"https://kane.mx/posts/archive/blogspot/run-groovy-script-via-jenkins-cli/","section":"posts","tags":["groovy","jenkins","jenkins-cli"],"title":"Run groovy script via Jenkins CLI"},{"body":"","link":"https://kane.mx/tags/lucene/","section":"tags","tags":null,"title":"Lucene"},{"body":"","link":"https://kane.mx/tags/solr/","section":"tags","tags":null,"title":"Solr"},{"body":"The index has a field named 'create_time' that is the timestamp of document created time. 
The query string can boost the most recently created documents like below,\n{!boost b=recip(ms(NOW,create_time),3.16e-11,0.08,0.05)}name:keyword\nThere is another field named 'important' that indicates whether the document is important or not. The query string can boost important documents like below,\nq={!boost b=$importfunc}name:keyword\u0026amp;importfunc=query({!v='important:true'})\nThe above query string uses a sub query in the boost function.\nFinally I want to boost on both fields, with the 'important' field having the higher priority. The query string looks like below,\ndefType=edismax\u0026amp;q=name:keyword\u0026amp;bf=query({!v='important:true'})^20.0 recip(ms(NOW,create_time),3.16e-11,0.08,0.05)^10.0\n","link":"https://kane.mx/posts/archive/blogspot/solr-boost-examples/","section":"posts","tags":["lucene","solr"],"title":"Solr boost examples"},{"body":"","link":"https://kane.mx/tags/django/","section":"tags","tags":null,"title":"Django"},{"body":"It's a common and ugly problem when using non-ascii characters in Django.\nThe general solution is below,\nput # -*- coding: utf-8 -*- at the beginning of every Python source file that uses UTF-8 characters\ndeclare every string variable as unicode, such as str_var = u'中文字符'\nadd a __unicode__ method to your model classes\nif you are running the server on apache/mod_wsgi or nginx, you need to configure the web server to use utf-8 encoding\n","link":"https://kane.mx/posts/archive/blogspot/djangos-unicdoe-encode-error/","section":"posts","tags":["django","encoding","python"],"title":"Django's unicode encode error"},{"body":"","link":"https://kane.mx/tags/encoding/","section":"tags","tags":null,"title":"Encoding"},{"body":"","link":"https://kane.mx/tags/python/","section":"tags","tags":null,"title":"Python"},{"body":"","link":"https://kane.mx/tags/eclipse/","section":"tags","tags":null,"title":"Eclipse"},{"body":"","link":"https://kane.mx/tags/mountain-lion/","section":"tags","tags":null,"title":"Mountain 
Lion"},{"body":"","link":"https://kane.mx/tags/php/","section":"tags","tags":null,"title":"Php"},{"body":"I installed both Zend CE and zend debugger of Eclipse on my Mac. Both of them work well in Mac lion. However they don't work any more after I upgraded my Mac to mountain lion. After some investigation I found some extensions of Zend PHP can't be loaded due to shared library dependency can't be found in mountain lion. The xslt module of PHP depends on some system libraries(suc as /usr/local/libxslt-1.1.23/lib/libxslt.1.dylib) that have been removed by mountain lion.\nThe temporary solution is disabling xlst module of zend PHP if your application doesn't need them. The workaround fix of Zend CE on Mac, rename /usr/local/zend/lib/php_extensions/xsl.so to any other name\nThe workaround fix of zend debugger for Eclipse, Delete the line extension=xsl.so from file /plugins/org.zend.php.debug.debugger.macosx_5.3.18.v20110322/resources/php53/php.ini\n","link":"https://kane.mx/posts/archive/blogspot/workaround-of-making-zend-ce-mountain-lion/","section":"posts","tags":["zend ce","zend debugger","workaround","Eclipse","php","mountain lion"],"title":"The workaround of making Zend CE/Zend debugger work on mountain lion"},{"body":"","link":"https://kane.mx/tags/workaround/","section":"tags","tags":null,"title":"Workaround"},{"body":"","link":"https://kane.mx/tags/zend-ce/","section":"tags","tags":null,"title":"Zend Ce"},{"body":"","link":"https://kane.mx/tags/zend-debugger/","section":"tags","tags":null,"title":"Zend Debugger"},{"body":"","link":"https://kane.mx/tags/dual-monitor/","section":"tags","tags":null,"title":"Dual Monitor"},{"body":"I had two monitors for my workstation. One is 22' and the another is 17'. I used the small one as a extend desktop.\nToday I get a another 23' monitor to replace the small one. However the resolution of the 23' monitor can't be changed after pluging it in. 
It always used the resolution matching the 17' one.\nBoth 'Setting - Display' and 'AMD Catalyst control' can't adjust it as higher resolution.\nAfter some tuning, I found a workaround.\nI totally remove all config of small one from /etc/X11/xorg.conf. Then change its resolution in 'AMD Catalyst control', it works!\n","link":"https://kane.mx/posts/archive/blogspot/dual-monitors-on-ubuntu/","section":"posts","tags":["Tip","Ubuntu","Dual monitor","Trick"],"title":"Dual monitors on Ubuntu"},{"body":"","link":"https://kane.mx/tags/trick/","section":"tags","tags":null,"title":"Trick"},{"body":"","link":"https://kane.mx/tags/ubuntu/","section":"tags","tags":null,"title":"Ubuntu"},{"body":"I want to create a test server for my application. Using embedding Http server in equinox is my first option.\nI had experience using simple http service implementation of equinox, however I want to play with Jetty this time.\nFollowing the guide of Equinox server, I can't running a Jetty server with my servlet in Eclipse Indigo. Obviously the guide is out of date.\nAfter tuning it, I found below bundles are minimum collection to run Jetty inside OSGi runtime.\nYou only need create a run configuration of OSGi framework, add your bundles with servlets and above bundles.\n","link":"https://kane.mx/posts/archive/blogspot/embedding-http-server-in-equinox/","section":"posts","tags":["Equinox","Jetty","OSGi"],"title":"Embedding an HTTP server in Equinox"},{"body":"","link":"https://kane.mx/tags/equinox/","section":"tags","tags":null,"title":"Equinox"},{"body":"","link":"https://kane.mx/tags/jetty/","section":"tags","tags":null,"title":"Jetty"},{"body":"","link":"https://kane.mx/tags/osgi/","section":"tags","tags":null,"title":"OSGi"},{"body":"Sometimes I need access the Intranet of company, however I don't like to create VPN connection. 
The connection is slow, it takes time to set up, and I have to change the password regularly due to the security policy.\nMy workstation runs Linux, which has a lot of utilities that help me access the Intranet from home without VPN.\nFirst I set up an ssh server on my personal computer. It's quite easy on Linux; for Windows I installed Copssh.\nThen I registered a free domain name and configured it in my router, and let the router forward port 22 (or any port you want to use) to my personal computer.\nOn my Linux workstation at work, I create an ssh tunnel to my personal computer. Public/private key authentication must be used. For example,\nIt means the remote server can reach my workstation's port 22 through its own port 10002 once the ssh tunnel is created successfully. The above command line also forwards ports 5900 and 6500. The default VNC session listens on port 5900.\nBut it only works while my personal computer is running, and the connection is not re-established after it fails once.\nThe graceful solution is installing 'autossh' on my Linux workstation, a utility that retries the ssh connection at an interval if it is disconnected or fails.\nThen create a script and run it when the OS boots. Boot scripts are executed by the root user, so it needs to be configured to run as the normal user.\nA while after my personal computer has booted (the default interval of autossh is 300 seconds), I can use localhost:10002 to log in to my workstation and localhost:5900 to access my VNC session. 
Of course you can also use the 'FoxyProxy' add-on of Firefox through a local port to access Intranet web pages.\n","link":"https://kane.mx/posts/archive/blogspot/acess-intranet-without-vpn/","section":"posts","tags":["ssh"],"title":"Access Intranet without VPN"},{"body":"","link":"https://kane.mx/tags/ssh/","section":"tags","tags":null,"title":"Ssh"},{"body":"","link":"https://kane.mx/tags/configuration/","section":"tags","tags":null,"title":"Configuration"},{"body":"","link":"https://kane.mx/tags/gerrit/","section":"tags","tags":null,"title":"Gerrit"},{"body":"An internal Gerrit server was moved, so the hostname of the server changed. Because we use OpenID for user control, the OpenID provider (such as Google account) generates a new token for the new server (a hostname change affects the identity token of a Google account) when we log in to Gerrit with the same OpenID account. By default Gerrit then creates a new internal account, even though my OpenID account already exists in the system and has a lot of activity.\nThe solution is to update the 'ACCOUNT_EXTERNAL_IDS' table of Gerrit via gsql: set 'ACCOUNT_ID' to your existing account_id on the new record whose 'EXTERNAL_ID' is the new token obtained from Google.\nupdate ACCOUNT_EXTERNAL_IDS set ACCOUNT_ID='1000001' where EXTERNAL_ID='https://www.google.com/accounts/o8/id?id=xxxxxxxxxx';\nSearching the Gerrit documentation afterwards, I found a configuration property that appears to support such a migration for OpenID authentication.\nauth.allowGoogleAccountUpgrade\nAllows Google Account users to automatically update their Gerrit account when/if their Google Account OpenID identity token changes. Identity tokens can change if the server changes hostnames, or for other reasons known only to Google. 
The upgrade path works by matching users by email address if the identity is not present, and then changing the identity.\nThis setting also permits old Gerrit 1.x users to seamlessly upgrade from Google Accounts on Google App Engine to OpenID authentication.\nHaving this enabled incurs an extra database query when Google Account users register with the Gerrit server.\nBy default, unset/false.\n","link":"https://kane.mx/posts/archive/blogspot/how-to-reuse-existing-openid-accounts/","section":"posts","tags":["OpenID","gerrit","configuration"],"title":"How to reuse the existing OpenID accounts after the host name of Gerrit server is changed"},{"body":"","link":"https://kane.mx/tags/openid/","section":"tags","tags":null,"title":"OpenID"},{"body":"","link":"https://kane.mx/tags/certificate/","section":"tags","tags":null,"title":"Certificate"},{"body":"The problem came from I tried to set up send mail server(SMTP) for my Gerrit server. My Gerrit server is using OpenID for user authorization, so I registered a new email account to send notification from Gerrit.\nMost of email service providers require the secure authorization when using its SMTP server to send mail. However the root CA of my email provider is not added into the default certificate of JRE. So Gerrit always failed to send email due to ssl verification exception.\nMy solution is adding the certificate of SMTP server into the certificate used by JRE.\nThe detail steps are below,\nUse open_ssl utility to the certificate of SMTP server or its root CA of email provider. Below command can list the certificate of SMTP and its chain. You can paste any of them into a file.\nopenssl s_client -connect smtp.163.com:465\nThen import the certificate saved in previous step into my JRE's key store. The default password of JRE's default keystore is 'changeit'. 
You can find the cacerts under jre/lib/security folder.\nsudo keytool -import -keystore cacerts -alias Smtp163com -file /tmp/smtp.163.PEM\n","link":"https://kane.mx/posts/archive/blogspot/jrejdks-certificate-issue-and-solution/","section":"posts","tags":["Java","gerrit","configuration","smtp","certificate"],"title":"JRE/JDK's certificate issue and solution"},{"body":"","link":"https://kane.mx/tags/smtp/","section":"tags","tags":null,"title":"Smtp"},{"body":"","link":"https://kane.mx/tags/build/","section":"tags","tags":null,"title":"Build"},{"body":"","link":"https://kane.mx/tags/maven/","section":"tags","tags":null,"title":"Maven"},{"body":"I successfully converted our product build from PDE build to Maven/Tycho. Something is worth to be documented here.\nThere are several examples and posts to demonstrate how using Tycho building your Eclipse plug-ins, features, applications and products. The most helpful example is the demo of Tycho project.\nBelow are some traps I met when building my project by Tycho,\nproduct build\nOur product is based on plug-ins, however we added the 'featurelist' in build.properties of PDE build to include some root binary for the product. However Tycho doesn't support this type of build, we create some features as the placeholder of plug-ins. Then change the product as features based. You have to manually remove the plugins tag in .product definition file, otherwise Tycho will fail on strange error if the .produce has both features and plugins tag. 
Then configure the director plugin as not installing features.\norg.eclipse.tycho\ntycho-p2-director-plugin\n${tycho-version}\nmaterialize-products\nmaterialize-products\nfalse\nmyappprofile\narchive-products\narchive-products\nAnd I used below way to customize the qualifier string of our build.\norg.eclipse.tycho tycho-packaging-plugin ${tycho-version} '${qualifier-prefix}_'yyyyMMddHHmm An limitation of director plugin is that no way using different profile name for the application installed on different hosts. I contributed a patch on bug 362550 for this enhancement.\nfeature build\nWe have some features to pack some binary files as root files. But Tycho doesn't support root folder that is recognized by PDE build. The workaround is creating an additional folder, then put the root files into it.\nMeanwhile Tycho doesn't support wildcard to other native touch points, such as changing the files permission. For static file list use comma separated list as workaround.\neclipse test plug-in\nI have a plug-in whose scope is 'test', but it doesn't have test case and no dependency for any test framework, such as junit 3.8 or junit 4. And it's used for mocking test server. Configure surefire plugin to let it build as test plug-in as well.\norg.eclipse.tycho\ntycho-surefire-plugin\n${tycho-version}\njunit\njunit\n4.1\nfalse junit\njunit\n4.1\nAnd configure the surefire plugin like below to test code in Maven build.\norg.eclipse.tycho\ntycho-surefire-plugin\n${tycho-version}\nmy.group\nmy.feature\n${version}\neclipse-feature\nmy.group\nmy.testserver\n1.0.0\neclipse-plugin\n${testSuiteName}\n${testClassName} -Dcom.sun.management.jmxremote\n-consoleLog\norg.eclipse.equinox.ds\n1\ntrue\nsign jars\nAdd below signjar plugin into parent pom.xml, however I met the md5 error when materializing the repository built on .product. 
There is a workaround mentioned on Bug 344691.\norg.apache.maven.plugins maven-jarsigner-plugin 1.2 ${keystore} MyCompany ${storepass} ${keypass} true ${skip.jar.signing} -tsa https://timestamp.geotrust.com/tsa **/artifacts.jar **/content.jar jar eclipse-plugin eclipse-feature eclipse-test-plugin sign sign verify verify ","link":"https://kane.mx/posts/archive/blogspot/tips-of-maventycho-building/","section":"posts","tags":["Maven","Eclipse","build","Tycho"],"title":"The tips of Maven/Tycho building crossplatform RCP and repository"},{"body":"","link":"https://kane.mx/tags/tycho/","section":"tags","tags":null,"title":"Tycho"},{"body":"","link":"https://kane.mx/tags/clearcase/","section":"tags","tags":null,"title":"Clearcase"},{"body":"Several days ago I had a post to record the unsuccessful experience to migrate source code from Clearcase to Git.\nWe have a new way after doing some brain storms. This way still is not a perfect solution, but it's much better than previous trial.\nUse clearexport_ccase to export the source folder to intermittent data. See documentation of Clearcase admin. Create a temporary vob for importing the data later. See example. Import the data into temporary vob. See example. Repeat step 1 to 3 for importing all necessary data into temporary vob. Use the SVN Importer to import the temporary vob as Subversion repository. Last steps refer to a documentation of succeeded migration case of one of Eclipse project from Subversion to Git. Git definitely is greatest SCM tool now. The size of Subversion repository is around 10GB, finally the Git repository is less than 700MB, which saves more than 10 times disk space. It's awesome!\nThe flaw of this way is that the removed elements in Clearcase(said using Main/LATEST as cspec of Clearcase vob when exporting) would lose after importing into a temporary vob. So switching to a maintenance branch or tag like 1.0/2.0 in Git, the source code is incomplete. 
The files existed in that branch or tag, then removed in latest code base are lost. The workaround could be manually checking in GA version to have complete code.\nIf anybody have graceful and perfect solution to migrate Clearcase to Git, I think he could start a new business. :)\n","link":"https://kane.mx/posts/archive/blogspot/migration-clearcase-to-git-part-2/","section":"posts","tags":["Git","Clearcase"],"title":"Migration Clearcase to Git -- part 2"},{"body":"I tried to migrate the source code of project from Clearcase to Git repository. As far as I know there is no elegant solution for such migration. For purpose of this migration, I want to keep the history and label of files in Clearcase after migrating to Git repository.\nThere are mature tools to migrate CVS/SVN repository to Git, so I tried to use Subversion as a bridge for my migration.\nI used a free software 'SVN Importer' to import the Clearcase vobs to Subversion. The tool is great, and it keeps the history of files, labels and branches. The entire size of new Subversion repository has near 50GB which is unacceptable size of Git repository. The subversion repository contains a lot of legacy code and unwanted binaries, so removing those revisions could significantly reduce the size of subversion repository. And subversion provides some admin tools to manipulate the metadata of subversion, it's possible to remove the unnecessary revisions and re-create a subversion repository with refined content. But I don't have any experience to use the admin tool of subversion before, I failed to filter the unwanted data. It's not worthy of costing too much effects on it. Finally I failed to filter the subversion repository.\nActually the detail history of files is rarely used. If need, we still can find it in Clearcase. 
At last I manually checked in the released version of our project into Git repository, and tagged them.\nWrote this unsuccessful idea here for elapsed efforts.\n","link":"https://kane.mx/posts/archive/blogspot/migrate-clearcase-to-git/","section":"posts","tags":["Git","Clearcase"],"title":"Migrate Clearcase to Git"},{"body":"","link":"https://kane.mx/tags/p2/","section":"tags","tags":null,"title":"P2"},{"body":"Our p2 based on installer suffered performance issue when querying IUs from repositories. Though the repositories have a large number of IUs to be queried, but we find the performance of using QL is unacceptable in some special scenarios.\nI published several different methods to find the expected IUs. Thomas pointed out the better expression of QL and finally helped us to find out the our repository without IIndexProvider implementation.\nIIndexProvider implementation of a repository is quite important to improve the performance of QL, especially use the 'traverse' clause to query something.\nAnd Slicer API is an alternative method when querying the complete dependencies.\n","link":"https://kane.mx/posts/archive/blogspot/p2-query-performance/","section":"posts","tags":["p2","performance"],"title":"p2 query performance"},{"body":"","link":"https://kane.mx/tags/performance/","section":"tags","tags":null,"title":"Performance"},{"body":"","link":"https://kane.mx/tags/compile/","section":"tags","tags":null,"title":"Compile"},{"body":"Yesterday I modified an existing c++ application for Windows. And its default build environment is Makefile and MinGW.\nHowever I used a newly Windows API that is not included by header files of MinGW.\nFirst of all, I copied the constant definition from header file of Windows SDK, and defined the Windows API method as a extern C method. So it's no problem to compile the code in MinGW.\nSecondly I have to fix the link issue. Because the symbol of the Windows API also can't be found by gcc link.\nHere great thanks to Google. 
It's quite easy to get the knowledge from others.\nI found a way to create an library by using dlltool. Dlltool is a utility to create an library with specified methods from existing dll library, which can be used by gcc link later.\nBelow are links I referred to create an import library,\n[1] http://www.emmestech.com/moron_guides/moron1.html\n[2] http://www.mingw.org/wiki/CreateImportLibraries\n[3] http://lists-archives.org/mingw-users/19461-import-library-for-c.html\n","link":"https://kane.mx/posts/archive/blogspot/create-import-library-for-building/","section":"posts","tags":["compile","MinGW"],"title":"Create an import library for building application in MinGW"},{"body":"","link":"https://kane.mx/tags/mingw/","section":"tags","tags":null,"title":"MinGW"},{"body":"The documentation of PDE has a chapter for this topic. Basically it's simply. Copy the template scripts what you want from templates/headless-build folder under org.eclipse.pde.build plug-in to your build configuration directory that is the folder has build.properties file.\nHowever I found the variables listed in template 'customAssembly.xml' can't be used in the runtime. I filed bug 346370 against it.\n","link":"https://kane.mx/posts/archive/blogspot/customize-pde-build/","section":"posts","tags":["PDE","Eclipse","build"],"title":"Customize PDE build"},{"body":"","link":"https://kane.mx/tags/pde/","section":"tags","tags":null,"title":"PDE"},{"body":"","link":"https://kane.mx/tags/code-signing/","section":"tags","tags":null,"title":"Code Signing"},{"body":"I did sign the jars via reusing the existing certificate of Windows code signing several months ago. Writing it down for further reference.\nWhatever your purpose of reusing the existing Windows code certificate, I only document the way from technical perspective.\nAfter buying the certificate of Windows code signing from CA, you will get a .pvk file that stores both the certificate and private key. 
The PVK file is in PKCS12 format[1], whereas Java uses the JKS format by default. So you need to convert the pvk file to a JKS keystore and certificate.\nSince JDK 6.0 supports PKCS12 directly, you can use 'jarsigner' with the PVK file to sign jars directly[2].\njarsigner -keystore /working/mystore.pvk -storetype pkcs12 -storepass myspass -keypass j638klm -signedjar sbundle.jar bundle.jar jane\nOr use keytool to convert the PKCS#12 file to JKS format[3] if you are using the Eclipse PDE build to sign your jars.\nkeytool -importkeystore -srckeystore KEYSTORE.pvk -destkeystore KEYSTORE.jks -srcstoretype PKCS12 -deststoretype JKS -srcstorepass mysecret -deststorepass mysecret -srcalias myalias -destalias myalias -srckeypass mykeypass -destkeypass mykeypass -noprompt\n[1] http://en.wikipedia.org/wiki/PKCS\n[2] http://download.oracle.com/javase/6/docs/technotes/tools/solaris/jarsigner.html\n[3] http://shib.kuleuven.be/docs/ssl_commands.shtml#keytool\n","link":"https://kane.mx/posts/archive/blogspot/using-certificate-of-windows-code/","section":"posts","tags":["code signing","java","certificate"],"title":"Using the certificate of Windows code signing to sign jars"},{"body":"I ran into firefox/thunderbird complaining that another instance was already running even though there was no running firefox/thunderbird process. They finally started again after I removed the '.parentlock' file in their default profile.\nThe strace utility helped me a lot in finding the solution.\nstrace -f -e file firefox\n","link":"https://kane.mx/posts/archive/blogspot/unlock-locked-profile-if/","section":"posts","tags":null,"title":"Unlock the locked profile if firefox/thunderbird crash"},{"body":"I implemented the replication tool at the end of 2009, then published it to Eclipse Marketplace in May 2010. However, it was not widely used because users had to install that plug-in first.\nI found a similar request on bugzilla, then started my contribution early this year. 
Finally it was accepted and will release as part of eclipse itself since Eclipse 3.7 M7! I hope it would benefit the users of Eclipse more and more.\nAnd I was nominated and elected as the committer of Equinox p2, it's a great honor for me. :)\n","link":"https://kane.mx/posts/archive/blogspot/eclipse-p2s-importexport-capability/","section":"posts","tags":["Equinox","p2","Eclipse","feature"],"title":"Eclipse P2's import/export capability"},{"body":"","link":"https://kane.mx/tags/feature/","section":"tags","tags":null,"title":"Feature"},{"body":":g!/some expression/d\n","link":"https://kane.mx/posts/archive/blogspot/vim-delete-lines-not-contain-words/","section":"posts","tags":null,"title":"[vim] delete the lines not contain words"},{"body":"Recently our installer met a strange bug, it didn't uninstall all legacy bundles after updating to new version. Finally I found it's due to a magic fragment is missing in the profile due to some causes.\ninstallBundle(bundle:${artifact})\nuninstallBundle(bundle:${artifact})\nsetStartLevel(startLevel:4);\nIt has 'hostRequirements' element that represents it's a fragment IU and match all the eclipse's plug-ins in that profile. And this fragment defines the touch point actions for its hosts that will do installBundle action during 'install' phrase and uninstallBundle action during 'uninstall' phrase. It's a very good way to remove the duplicate touch point definitions for all eclipse's plug-ins in the profile.\nBTW, p2's engine also doesn't attach this fragment to the eclipse's plug-in IU if the top level IU doesn't have the STRICT rule. 
I'm not sure why it was designed this way, but that is how it behaves.\n","link":"https://kane.mx/posts/archive/blogspot/inside-p2s-profile-2-fragment-matches/","section":"posts","tags":["p2","Eclipse","profile"],"title":"Inside P2's profile (2) - the fragment matches all osgi bundles"},{"body":"","link":"https://kane.mx/tags/profile/","section":"tags","tags":null,"title":"Profile"},{"body":"You will see some interesting properties at the bottom of Eclipse's profile.\nFor example,\nIt attaches a property named 'org.eclipse.equinox.p2.internal.inclusion.rules' with value 'STRICT' to the IU 'org.eclipse.sdk.ide' with version 3.6.1.M20100909-0800.\nIt's a very important property for the p2 engine. It means the IU 'org.eclipse.sdk.ide' has been explicitly installed into the profile, so it is not allowed to be implicitly updated or removed.\nFor example,\nWe have the top feature IU 'org.eclipse.sdk.ide' that represents the Eclipse SDK, 'org.eclipse.pde.feature' that represents the Plug-in Development Environment and 'org.eclipse.jdt.feature' that represents the Java Development Tools. Both JDT and PDE are part of the Eclipse SDK, so 'org.eclipse.pde.feature' and 'org.eclipse.jdt.feature' are required by 'org.eclipse.sdk.ide'.\nIf the profile only has the STRICT rule for 'org.eclipse.sdk.ide', 'org.eclipse.jdt.feature' and 'org.eclipse.pde.feature' will implicitly be updated to 3.6.2 when updating 'org.eclipse.sdk.ide' from 3.6.1 to 3.6.2.\nHowever, if the profile also has the following STRICT rule for the PDE feature,\nThe p2 engine will report errors because 'org.eclipse.pde.feature' has its own STRICT rule and cannot be updated implicitly. 
Hence a third party must explicitly update both 'org.eclipse.sdk.ide' and 'org.eclipse.pde.feature' from 3.6.1 to 3.6.2.\n","link":"https://kane.mx/posts/archive/blogspot/inside-p2s-profile-1-inclusion-rules/","section":"posts","tags":["p2","Eclipse","profile"],"title":"Inside P2's profile (1) - inclusion rules"},{"body":"Recent gcc compilers enable the stack overflow protector, which is only supported since GLIBC 2.4. So a library or executable compiled by a recent gcc cannot be loaded or executed on RHEL4 or Solaris 9, which only have GLIBC 2.3. Hence, compile the library or executable with the option '-fno-stack-protector' to make sure it can run on older Linux releases.\ng++ -fno-stack-protector -o test.o test\n","link":"https://kane.mx/posts/archive/blogspot/stack-overflow-protector/","section":"posts","tags":null,"title":"stack overflow protector"},{"body":"I recently learned this useful piece of Java syntax.\naLoopName: for (;;) {\n  // ...\n  while (someCondition) {\n    // ...\n    if (otherCondition)\n      continue aLoopName;\n  }\n}\n","link":"https://kane.mx/posts/archive/blogspot/loop-name-for-for-clause-in-java/","section":"posts","tags":null,"title":"the loop name for 'for' clause in java"},{"body":"It's a powerful command for renaming files in a batch.\nUsage:\nrename 's/(\\d+)$/$1\\.txt/' *\nThis adds the '.txt' extension to every file whose name ends with a number.\n","link":"https://kane.mx/posts/archive/blogspot/rename-command/","section":"posts","tags":null,"title":"rename command"},{"body":"If you have an http proxy, set it in the environment,\nexport http_proxy=http://127.0.0.1:8000\nThen start the application in that same terminal.\nIf the proxy is a socks proxy, use 'tsocks' to wrap the application in the terminal.\n","link":"https://kane.mx/posts/archive/blogspot/applying-proxy-for-softwares-without/","section":"posts","tags":null,"title":"applying proxy for the softwares without proxy support in linux"},{"body":"Honestly speaking, you have already eaten the most delicious food if you're living in 
China. Though we have more and more concerns on the safety of food, we have to recognize that Chinese food is more delicious than others.\nThe cuisine is simple in Austria. People always use pork, beef, flour, tomato, potato and few green vegetables. So they surprised Chinese cost several hours to make the food.\nGulasch, it's good tasted after eating pizza several times\npasta\nAbout the drinking, most of them directly drink the water from water pipe. And some of them like the special water that mixes water with gas. The coffee and beer are the favorite of local citizens. You can find more than one hundred beer brand in the city, and some of them have been found for centuries. Indeed they're good tasted.\nspecial water\nDie Weissf\nWieninger, it comes from Vienna\nStiegl, local famous brand\n","link":"https://kane.mx/posts/archive/blogspot/food-and-drinking/","section":"posts","tags":["salzburg","travel","tour"],"title":"Food and Drinking"},{"body":"","link":"https://kane.mx/tags/salzburg/","section":"tags","tags":null,"title":"Salzburg"},{"body":"","link":"https://kane.mx/tags/tour/","section":"tags","tags":null,"title":"Tour"},{"body":"","link":"https://kane.mx/tags/travel/","section":"tags","tags":null,"title":"Travel"},{"body":"","link":"https://kane.mx/categories/trip/","section":"categories","tags":null,"title":"Trip"},{"body":"Joel posted a blog related to how to hire the great programmers. One of his key points is building comfortable workspaces.\nI believe every programmer loves the workspace like Google and Fog Creek. The workspace of Google has been very famous due to its French chef, gymnasium and big sofas. Why is Fog Creek? It's the company created by Joel, he also practiced his theory on his company. Ruan YiFeng posted a blog for it. I bet you would envy the guys working in that office.\nHow about the workspace of the office in Salzburg? 
Let me show some pictures.\nSpace\nprogrammers have two monitors\nReading\nnon-technique magazines\ntechnique books\nDrinking\ncoffee machine\nKitchen\nfreezer for fast food\nEntertainment\ntable football game\n","link":"https://kane.mx/posts/archive/blogspot/working-workspace/","section":"posts","tags":["salzburg","travel","tour"],"title":"Working Workspace"},{"body":"Salzburg is a small city and is on the banks of the Salzach River. It's easy to go through the city by bus in 30 minutes.\nSalzach River\noutline view\nRiding the bicycle is a very good way to enjoy the beautiful sight of the city. You could see many kids with parents riding bicycle in the sunny weekend.\nThe train station and a major bus transient station are the same one that is called as 'main station' by local residents. it's not far from the office of company, about 20 minutes by foot.\nThe ticket system of bus is more complicated than Beijing's. People can buy the ticket for single, 24 hours, 48 hours, a week and even a year. The children can get discount. There's no ticket seller in the bus. Usually nobody checks whether you have valid ticket. Pressing the button to open the door when getting on/off the bus.\nThe public transportation is designed well. There are different tickets for different people. For example, tourists would prefer to buy 24 hours ticket or 48 hours tickets. 24 hours ticket means the passengers can take any bus in the 24 hours after it's used first time. So it's very convenient for tourists. 24 hours ticket is 4.2€ for adults, 2.1€ for children. The price of train tickets is same. The faster train \u0026quot;ICE\u0026quot; have higher price. The regular train is much cheaper. And the ticket allows a family not exceed 5 persons to go back and forth another city in a day. It's very cheap for a family to enjoy weekend in another city or town by train. 
I think it's a good approach to use public transportation more to reduce environment pollution.\nround-trip ticket\nTaking the train is more convenient than China. There is no security checking, no long distance between gate and platform and even no staffs in the platform. Meanwhile there is no any limitation to travel among the European countries. I went to Munich of Germany by train, I felt it's even more convenient than taking subway in Beijing. Both the train and bus have a lot of humanization design for disability people and people with their bicycles or pets.\n","link":"https://kane.mx/posts/archive/blogspot/transportation/","section":"posts","tags":["salzburg","travel","tour"],"title":"Transportation"},{"body":"There are 20+ staffs in Salzburg office. Most of them are developers, one is administrator of office.\nGenerally the staffs in Salzburg work more flexible than the staffs working in Beijing.\nSome of them live in German. Even though it's not so far as Salzburg, they also need come to Salzburg by train. So sometimes they work at home, use internet and phone as communication tool.\nAnd they have different responsibilities for products. For example, Helmut works on installer, Matthias and Michael are responsibility for QFT testing, Martin N. is focus on license API developing. So everybody has himself schedule, he can decide when he come to office and when leave office based on his working schedule. Nobody cares when you come/leave office or how long you work every day. I believe all of them do well on their jobs.\nFurthermore you can work with your dogs together if nobody takes care of them at home.\nMax's dogs\nMost of foreign like coffee, so there is a kitchen with seats in the office. Some of them like to drink a cup of coffee or tea as a break, and it's a good chance to talk with others. It's a relaxing time for changing your mind out of work. Besides drinking some things, there is a room for playing table football game. 
It's a small amount of exercise, and it's good for the body.\nAnother thing that impressed me is that the team is far more stable than at any company I know in China. Most of them have worked at the company for more than 10 years. So I think that's why they know more than us. Anyone can become an expert after doing the same thing for more than 10 years. They love coding, and they would like to keep coding until they retire. That's why some of them are more than 40, even 50 years old.\n","link":"https://kane.mx/posts/archive/blogspot/working-style/","section":"posts","tags":null,"title":"Working style"},{"body":"Getting through a long-haul flight is not easy. I had to stay in a cramped space for nearly 10 hours; fortunately there were two of us traveling, so we passed the time half sleeping and half chatting.\nAfter arriving in Vienna, the first thing I saw on leaving the jet bridge was, of all things, a bar-style restaurant. It felt novel and had plenty of character.\nThere were of course duty-free shops and stores around, but nowhere near the scale of Beijing Capital Airport. Overall it felt like the checkout exit of a large supermarket back home, and there weren't many people either.\nKudos as well to the Wi-Fi at Vienna airport: a simple setup and it just worked, unlike the Wi-Fi back home, which asks for a mobile number and then a code sent by SMS, and I still couldn't get it working after a long while.\nThere was some time before the connecting flight, so I wandered around the airport. I had dinner at a bar along the way and tried a glass of the local beer.\nThe plane to Salzburg was a propeller plane; it was my first time on such an old-style aircraft.\n","link":"https://kane.mx/posts/archive/blogspot/day-1/","section":"posts","tags":null,"title":"day 1"},{"body":"Check out this SlideShare Presentation:\nDiscovering the p2 API\nView more presentations from Sonatype.\n","link":"https://kane.mx/posts/archive/blogspot/discovering-p2-api/","section":"posts","tags":null,"title":"Discovering the p2 API"},{"body":"A few days ago I updated my p2 replication tool. It's now easier to install it in your Eclipse.\nA new component named 'Eclipse Marketplace' has been part of the Eclipse SDK since Helios; it is an application store for Eclipse. It makes it easy to install third-party plug-ins into Eclipse.\nYou can launch the marketplace via 'Help' - 'Eclipse Marketplace...', then search for the keyword 'p2' or 'replication' to find the tool. Finally click next to install it.\nIt's a very graceful workflow, much like installing add-ons in Firefox.\nThe p2 replication tool can then help you replicate your environment. The tool now also supports installing components from another existing Eclipse instance, which saves the time spent downloading them from the Internet! 
Enjoy it.\n","link":"https://kane.mx/posts/archive/blogspot/p2-replication-tool-lives-on-eclipse/","section":"posts","tags":null,"title":"P2 replication tool lives on Eclipse Marketplace"},{"body":"1. tcpdump\ntcpdump -n -i eth0 port 80\nmonitor all packets transferred on port 80 on the network interface eth0 (use -i lo for the loopback interface)\n2. netstat\nnetstat -anp|grep java\nlist all network connections of the process named java\nnetstat -anp|grep 128.224.159.xxx\nlist all network connections to the host whose ip address is 128.224.159.xxx\n3. nslookup\nnslookup 206.191.52.46\nlook up the domain name whose ip address is 206.191.52.46\n","link":"https://kane.mx/posts/archive/blogspot/useful-network-utility-tools/","section":"posts","tags":null,"title":"useful network utility tools"},{"body":"When using -exec as below, you need to escape the semicolon, because the shell otherwise treats it as a command separator.\nfind directory/ -type d -exec chmod a+x {} \\;\nFeb 24, 2010 - update:\nfind . -maxdepth 4 -type d -name \u0026#39;g-vxworks\u0026#39; 2\u0026gt;/dev/null -print\nJan. 7, 2024 - update: You might see No such file or directory when combining -exec rm to delete the found files.\nYou can add the -depth option to avoid the message.\nfind deployment/g -depth -name \u0026#39;asset.*\u0026#39; -type d -exec rm -rf {} \\;\n","link":"https://kane.mx/posts/archive/blogspot/find-exec-tip/","section":"posts","tags":["Shell"],"title":"[tip]Find -exec tip"},{"body":"","link":"https://kane.mx/tags/shell/","section":"tags","tags":null,"title":"Shell"},{"body":"I ran into a p2 installation that failed while parsing the configuration. 
That was because I tried to add VM arguments for my application.\nFor example, I added '-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8272' in the product configuration.\nP2 fails when parsing the argument, because it contains ':' and ',', which must be escaped.\nIt works again after replacing it with '-agentlib${#58}jdwp=transport=dt_socket${#44}server=y${#44}suspend=n${#44}address=8272'.\nMore details can be found in the p2 touchpoint wiki.\nI also opened a bug to request an improvement.\n","link":"https://kane.mx/posts/archive/blogspot/special-characters-in-p2-touchpoint/","section":"posts","tags":null,"title":"special characters in p2 touchpoint instruction"},{"body":"mount -t cifs -o username=xxx,password=xxx,workgroup=xx,iocharset=utf8 //share.domain/folder /localfolder\n","link":"https://kane.mx/posts/archive/blogspot/mount-windows-share-folder/","section":"posts","tags":null,"title":"mount windows share folder"},{"body":"1. open the file in vim, :%!xxd\n2. hexdump\n","link":"https://kane.mx/posts/archive/blogspot/way-to-dump-hex-file/","section":"posts","tags":null,"title":"the way to dump hex file"},{"body":"import static org.junit.Assert.assertEquals;\nimport org.junit.Rule;\nimport org.junit.Test;\nimport org.junit.rules.TestName;\n\npublic class NameRuleTest {\n  // The TestName rule exposes the name of the currently running test method.\n  @Rule\n  public TestName name = new TestName();\n\n  @Test\n  public void testA() {\n    assertEquals(\u0026quot;testA\u0026quot;, name.getMethodName());\n  }\n\n  @Test\n  public void testB() {\n    assertEquals(\u0026quot;testB\u0026quot;, name.getMethodName());\n  }\n}\n","link":"https://kane.mx/posts/archive/blogspot/how-to-get-name-of-running-test-case-in/","section":"posts","tags":null,"title":"How to get the name of running test case in JUnit4"},{"body":"The p2 install wizard first queries the repository to find the root installable units (also known as top-level installable units). After the user submits the installation request, p2 recalculates the dependencies and searches for the requirements in all available repositories. 
The wizard proceeds to the license agreement page if all the dependencies are satisfied.\nThe p2 agreement page obtains all the units to be installed from the operands of the provisioning plan. Their number is always much greater than the number submitted by the user, because the submitted IUs are only the root IUs.\nThe p2 UI checks for unaccepted licenses by comparing against earlier records. The policy class of the p2 UI provides the license manager, which records the licenses that have already been accepted. It traverses all the installable units and, for each one that has a license, checks whether that license has already been accepted. If the license agreement has been accepted, it is ignored and not shown on the agreement page. Otherwise, a new record is created by the license manager to mark it as accepted, and the license is displayed on the agreement wizard page.\nThe default implementation of the license manager persists the accepted information in the file /.metadata/.plugins/org.eclipse.equinox.p2.ui.sdk/license.xml.\n","link":"https://kane.mx/posts/archive/blogspot/how-p2-ui-handles-with-license/","section":"posts","tags":null,"title":"How p2 UI handles with license agreements"},{"body":"Set up SSH without a password.\na) Execute \u0026quot;ssh-keygen -t rsa\u0026quot; under your linux/unix login to get the RSA keys.\n(press Enter for all)\nYou will get 2 files under ~/.ssh/, id_rsa and id_rsa.pub\nb) Put the public key id_rsa.pub on your remote host in ~/.ssh/authorized_keys. If the remote host shares the same nfs, just try \u0026quot; cat id_rsa.pub \u0026gt;\u0026gt; ~/.ssh/authorized_keys\u0026quot;\n* Remember to modify the hostname or ip info in ~/.ssh/authorized_keys to \u0026quot;\u0026quot;, so that you can log in from any host in your NIS domain without a password.\nFor example:\nssh-rsa 
AAAAB3NzaC1yc2EAAAABIwAAAQEA4Ri5J0s1BL/+mR7RfAuDW6FY2P6ILc61Zvw1BdDkvHMFrTzaC/AUMw33H7biAMCXuCleakCuSoV8ZDiGHYs4wOVvet5sDmphkwdiC4xTekdl3dRNvGjMVbvFUta/Y5CiayL6YIu47Ro6Vvu4Mutsrv/13pTlifrEz+NTR/+bzMb9nTniCwiryMyYod3E46b8WvS8yE3WK+tH4BZE8bjiCwdvAzSdPyk/OFNrlBNuF1yewwnxv1roRD3UalT2+7O4kfEG9sMvvBHjuX2l7xlUe3stBftYpigBbwGmmadxjRpNIlk88t5xKcQX6nSu7V8HI3GWPHI0D+ISIlbfU5Sunw== kzhu0@\n","link":"https://kane.mx/posts/archive/blogspot/ssh-key/","section":"posts","tags":["ssh","Tip"],"title":"[tip]ssh key"},{"body":"I wrote a plug-in to simplify the process to install the same plug-ins in different platform or different workstation.\nAnyone is interested in it, pls follow below guide to freely use it.\nhttp://code.google.com/p/kane-toolkit/wiki/P2Replication\nEnjoy it.\n","link":"https://kane.mx/posts/archive/blogspot/p2-replication-plug-in/","section":"posts","tags":["p2","Eclipse"],"title":"[Eclipse][P2]P2 replication plug-in"},{"body":"How Equinox load bundles Equinox launcher is responsible to start OSGi framework. The system bundle would be created and marked as installed when initializing the framework. Equinox also tries to install the installed bundles if finding them in persistence data during the initializing period. Of course there is no extra bundles would be installed when launching Equinox first time.\nThen Equinox launcher would install the bundles specified by vm's system property 'osgi.bundles'. And start the initial bundles that are marked as early start. 
For example, let's have a look at the configuration/config.ini of Eclipse, you would find a line similar as below,\nosgi.bundles=reference\\:file\\:org.eclipse.equinox.simpleconfigurator_1.0.200.v20090831.jar@1\\:start\nIt means the start level of bundle 'org.eclipse.equinox.simpleconfigurator_1.0.200.v20090831.jar' is 1, and it would be started after installing it.\nHere you would ask there are only two bundles are installed(one is system bundle 'org.eclipse.osgi', the other is 'org.eclipse.equinox.simpleconfigurator') when launching Equinox, how the other bundles are installed? It's done by the activate method of 'simpleconfigurator' bundle. The available bundles are recorded in plain text file configuration/org.eclipse.equinox.simpleconfigurator/bundles.info, simpleconfigurator read the file then install those bundles.\nIt's a new bundle management introduced by p2. P2 also supports the traditional way to install extensions, such as link file, .eclipseproduct file and directly copying features/plugins.\nBelow table lists the p2 bundles to implement the compatibility installation feature,\nBundle\nUsage\norg.eclipse.equinox.p2.directorywatcher\nthe definition and implementation of directory watcher API\norg.eclipse.equinox.p2.updatesite\nthe implementation of updatesite repository\norg.eclipse.equinox.p2.extensionlocation\nthe implementation of extension repository\norg.eclipse.equinox.p2.reconciler.dropins\nscan dropin folder and link files; watch the traditional configuration file used by update manager\nP2 reconciler would scan the dropin, link folder and legacy configuration file in every Equinox launching. 
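You can disable this capability by marking the bundle as not started early.\norg.eclipse.equinox.p2.reconciler.dropins,1.0.100.v20091010,plugins/org.eclipse.equinox.p2.reconciler.dropins_1.0.100.v20091010.jar,4,false\nIf it finds new bundles in the dropins folder, the reconciler adds them to a local metadata repository that is stored as OSGi data of Equinox. It then synchronizes the bundles into the current p2 profile and adds them to the bundles.info file.\n","link":"https://kane.mx/posts/archive/blogspot/how-equinox-load-bundles/","section":"posts","tags":["Equinox","Eclipse"],"title":"[eclipse]How Equinox load bundles"},{"body":"Learn p2 step by step See this link for detail\nLearn P2 step by step\np2 concept p2 install p2 install practice p2 repository publish customized p2 touchpoint p2 repository publish practice Example Code Reference p2 concept First, let's understand several concepts introduced by p2 [1].\np2 / Agent\nThe provisioning infrastructure on client machines\nInstallable Unit (IU)\nMetadata that describes things that can be installed/configured\nArtifact\nThe actual content being installed/configured(e.g., bundle JARs)\nRepository\nA store of metadata or artifacts\nProfile\nThe target of install/management operations\nPlanner\nThe decision-making entity in the provisioning system\nEngine\nThe mechanism for executing provisioning requests\nTouchpoint\nThe part of the engine responsible for integrating the provisioning\nsystem to a particular runtime or management system\nAn IU is fairly easy to understand: it is a description of something that can be installed or configured, and it does not correspond to the actual files to be installed.\nAn artifact describes the actual files to be installed: bundle JARs, features, and binary files.\nThis is where the concept of a repository comes in. It stores the artifacts and the metadata of those artifacts. The metadata includes the artifact's unique identifier, its version, the capabilities it exposes, the capabilities it depends on together with their versions, and the configuration to run in each installation phase. In p2's default implementation, these two repositories are described by XML files, which are compressed into artifacts.jar and content.jar to reduce file size and shorten transfer time.\nSince Eclipse 3.4, when you install new software from a remote site, you will see a worker thread downloading content.jar in the background. 
During installation, p2 first resolves the dependencies of the software being installed based on content.xml (the metadata repository). It looks up the dependencies specified in the metadata in the current runtime, and continues the installation only if they are satisfied. In my personal experience, if the software being installed is fairly complex, the metadata file it produces is fairly large (easily over a megabyte), so downloading the file and parsing its content are both slow, which hurts the user experience.\nA flexible point is that users can implement their own ArtifactRepository and MetadataRepository and simply register them with their respective managers. All of these services are implemented as OSGi services.\nNext is the profile, which manages the information about the software in an installation target. When p2 was designed, one goal was to let multiple Eclipse instances share a single installed copy of some software. For example, I have several Eclipse installations on my machine for various purposes, and they all need CDT; sharing saves the trouble of installing it repeatedly. The profile records what each installation contains, so that the whole application is managed. Software installed in Galileo can all be found in its software management UI.\nThe planner and the engine are purely internal to p2. Any p2 operation (install, remove, configure) needs a planner instance to describe it. Once you have the planner, you also need to create an engine object and use it to execute the corresponding plan. This is currently the process of completing an installation through the p2 API.\nThe last one is the touchpoint. When a program is installed, some configuration may need to run depending on the runtime (os, ws, arch, and so on) or the phase (install, uninstall, configure, and so on); touchpoints help implement this configuration. The concrete actions are recorded per IU in the metadata repository. p2 implements some Eclipse touchpoint actions by default, such as copying files, deleting files, and running external programs. If users have special native operations of their own to run, they can implement a custom touchpoint.\np2 install With these concepts in place, let's look at how to use the p2 API, taking installation as an example.\nFirst you need to get the profile for the current installation. For a fresh installation, create a new profile via IProfileRegistry.addProfile. For an update, look up the profile to update via IProfileRegistry. When creating a profile, take care to set its properties:\nMap\u0026lt;String, String\u0026gt; profileProperties = new HashMap\u0026lt;String, String\u0026gt;();\nprofileProperties.put(IProfile.PROP_INSTALL_FOLDER, installLocation.getAbsolutePath());\nprofileProperties.put(IProfile.PROP_FLAVOR, \u0026quot;tooling\u0026quot;); //$NON-NLS-1$\nprofileProperties.put(IProfile.PROP_ENVIRONMENTS, \u0026quot;osgi.os=\u0026quot; + Platform.getOS() + \u0026quot;,osgi.ws=\u0026quot; + Platform.getWS() + \u0026quot;,osgi.arch=\u0026quot; + Platform.getOSArch()); //$NON-NLS-1$;\nprofileProperties.put(IProfile.PROP_NL, \u0026quot;en_US\u0026quot;); //$NON-NLS-1$\nprofileProperties.put(IProfile.PROP_INSTALL_FEATURES, \u0026quot;true\u0026quot;);\nprofileProperties.put(IProfile.PROP_CONFIGURATION_FOLDER, new File(installLocation, \u0026quot;configuration\u0026quot;).getAbsolutePath());\nprofileProperties.put(IProfile.PROP_ROAMING, \u0026quot;true\u0026quot;);\nprofileProperties.put(IProfile.PROP_CACHE, installLocation.getAbsolutePath());\ncurrentProfile = registry.addProfile(PROFILE_ID, 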
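profileProperties);\nPROP_INSTALL_FOLDER sets the installation directory, and PROP_CACHE sets the directory where downloaded Eclipse IUs (features/plugins) are stored. If the repository is published in units of features, PROP_INSTALL_FEATURES must be set to true. If the repository contains native binaries (such as the launcher), the correct PROP_ENVIRONMENTS must also be specified, including OS, WS, and ARCH or PROCESSOR.\nNext, you need to obtain the set of IMetadataRepository instances to install from. For example:\nArrayList ius = new ArrayList();\nIMetadataRepositoryManager repositoryManager = (IMetadataRepositoryManager) ServiceHelper.getService(Activator.getDefault().getBundle().getBundleContext(),\nIMetadataRepositoryManager.class.getName());\nif (repositoryManager == null) throw new InterruptedException(\u0026quot;Failed to get IMetadataRepositoryManager.\u0026quot;);\ntry {\nfor (URI uri : uris) {\nIMetadataRepository metaRepo = repositoryManager.loadRepository(uri, progress.newChild(50/uris.length));\nCollector collector = metaRepo.query(new AccpetQuery(), new LatestNoninstalledIUCollector(currentProfile), progress.newChild(50/uris.length));\nius.addAll(collector.toCollection());\n}\n} catch (ProvisionException e) {\nthrow new InterruptedException(\u0026quot;Failed to get IMetadataRepository.\u0026quot;);\n}\n-This also finds the IUs in the IMetadataRepository that have not been installed yet, which requires comparing against what is already installed in the current profile:\nCollector collector = metaRepo.query(new AccpetQuery(), new LatestNoninstalledIUCollector(currentProfile), progress.newChild(50/uris.length));\nNote that IMetadataRepository implements the IQueryable interface. IQueryable is the query interface introduced by p2; it returns the collection of items matching the given query, and it takes an IProgressMonitor so that query progress can be reported. AcceptQuery and LatestNoninstalledIUCollector here are custom query and collector objects. p2 already implements many useful queries; frequently used ones include InstallableUnitQuery, IUPropertyQuery, and RangeQuery.\n-Next, generate the ProvisioningPlan that IEngine needs. First create a ProfileChangeRequest object and add the previously found IUs to install.\nrequest.addInstallableUnits(ius);\nRemoval is the opposite. For an update, you also need to remove the old versions of the IUs via ProfileChangeRequest.removeInstallableUnits().\nCall getProvisioningPlan(ProfileChangeRequest, ProvisioningContext, IProgressMonitor) on the IPlanner service to get the plan for the current request.\n-Finally, call IEngine.perform(IProfile, PhaseSet, Operand[], ProvisioningContext, IProgressMonitor) to perform the provisioning operation. 
The PhaseSet specifies which phases the engine will execute and the relative time weight of each phase. The phases are Collect, Unconfigure, Uninstall, Property, CheckTrust, Install, and Configure. If you are familiar with Eclipse's earlier install handlers, Unconfigure/Uninstall/Install/Configure should all look familiar. p2 goes further and exposes steps such as Collect and CheckTrust as well. Below is p2's default PhaseSet implementation:\npublic DefaultPhaseSet() {\nthis(new Phase[] {new Collect(100), new Unconfigure(10, forcedUninstall), new Uninstall(50, forcedUninstall), new Property(1), new CheckTrust(10), new Install(50), new Configure(10)});\n}\nThe Operand[] is obtained via ProvisioningPlan.getOperands().\np2 install practice First, build an installable repository. The approach here is to create an RCP application from a template provided by Eclipse, such as the mail template,\nthen create a feature that contains the newly created plug-in 'com.example.mail'.\nCreate a product configuration based on the existing 'com.example.mail.product', set it to be based on features, and add the following features on the dependencies page. The feature qualifiers depend on the Eclipse version in use; as the screenshot below shows, I am using Eclipse 3.5.1 here. For the RCP application to be able to install plug-ins (that is, to include p2 and the p2 UI), it needs to depend on more features. A later example implements this. Also note: the ID must not contain space characters.\nNext, use the Eclipse Product Export Wizard to generate the repository. Remember to check 'generate metadata repository'.\nAfter the repository for the Mail Application has been created successfully, try installing the application with our own p2 installer. The installation process looks like the screenshots below. Then run /folk/kzhu0/tmp/mailrcp/mail to launch the Mail Application.\np2 repository publish This section shows how to publish/generate a p2-based repository. In the earliest p2 release, Eclipse 3.4, the program that generates a repository was called the generator; 3.5 refactored it and renamed it the publisher. The refactored publish process is simple and clear. First, create an IPublisherInfo object, which provides the information about the repository to generate: the metadata repository and artifact repository information, properties, and advice objects that supply auxiliary information. IPublisherAdvice can be seen as similar to helper classes such as WorkbenchAdvisor and WorkbenchWindowAdvisor used when creating RCP windows. It supplies IU-specific information that needs to be recorded in the repository, such as IU properties, the touchpoint type and the actions executed in each phase, and the handling of IUs for executables or configuration files.\nYou also need to create IPublisherAction instances to handle publishing different kinds of IUs. For example, BundlesAction publishes bundles to the repository, while FeaturesAction handles features. Other IPublisherAction implementations that p2 already provides include the product action, config action, launcher action, and JRE action [2].\nWith the publisher info describing the repository and the actions that publish the various IUs, call Publisher.publish to complete publishing the repository.\nIPublisherInfo info = createPublisherInfo();\nIPublisherAction[] actions = createActions();\nPublisher publisher = new 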
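Publisher(info);\npublisher.publish(actions, new NullProgressMonitor());\nOne thing to note here: publishing only publishes the features/plugins/binaries to be deployed into the repository; it does not compile or package them. Earlier we used the Eclipse Export function, which both compiles and packages the features/plugins and generates the repository. Export works by first calling PDE to compile and package the features/plugins, and then calling the corresponding publisher application to publish the compiled features/plugins/product as a repository.\ncustomized p2 touchpoint The previous section mentioned that IPublisherInfo uses additional IPublisherAdvice objects to customize the IU information published to the repository. This section shows how to define a new touchpoint type for our own IU and have it create a launch shortcut for the application on the operating system desktop during the configure phase. First, add an advice that handles touchpoint data to our PublisherInfo. NativeLauncherTouchPoint implements the ITouchpointAdvice interface; when the publisher reaches the touchpoint data during publishing, it looks for advice that implements ITouchpointAdvice. If such advice is available, it lets the advice process the existing touchpoint data, obtains the new touchpoint data, and saves the result to the metadata repository.\nPublisherInfo result = new PublisherInfo();\nresult.addAdvice(new NativeLauncherTouchPoint());\nNativeLauncherTouchPoint specifies that, for a particular IU, the createDesktop action runs in the configure phase and the opposite action, deleteDesktop, runs in the unconfigure phase.\nThe touchpoint type is changed as follows. You can of course also extend an existing touchpoint type with new actions. For the built-in touchpoint types and the exact usage of their actions, see the p2 wiki [3].\niu.setTouchpointType(DesktopTouchpoint.TOUCHPOINT_TYPE);\nBoth touchpoint types and actions are contributed through extension points. Extend 'org.eclipse.equinox.p2.engine.touchpoints' to add a new touchpoint type, and extend 'org.eclipse.equinox.p2.engine.actions' to associate a new action with a type.\np2 repository publish practice We create the plug-in 'com.example.p2.touchpoint' to implement the desktop shortcut extension, and create 'com.example.p2.feature' to contain the plug-in that implements the touchpoint. For the implementation details, see the p2 example source code.\nThen add dependencies on the p2-related features to the Mail Application and republish it to get a new version that supports installing software, and install it with the p2 example installer. P.S.: Personally, I feel the layering is somewhat off when Eclipse includes third-party plug-ins. p2, as a runtime project (at the same level as Equinox and ECF), surprisingly has to depend, directly or indirectly, on upper-layer modules such as help and rcp.platform.\nNext, create the plug-in 'com.example.mail.desktop' and the feature 'com.example.mail.desktop.feature' as the IU that provides the desktop shortcut. Export 'com.example.mail.desktop.feature' with Eclipse Export Feature, which really just uses PDE to compile and package it for us :).\nRun the headless publisher provided by 'com.example.p2.generator' to generate our customized repository. '/folk/kzhu0/tmp/mail/desktop-deploy' is the path where the desktop feature was exported earlier, and '/folk/kzhu0/tmp/mail/desktop' is the path of the generated repository.\nRun the new version of the Mail Application, and an Install New Software option appears under the Help menu. 
Add the Desktop feature repository generated by the custom publisher as a new software site and install the Mail Desktop Feature. After the installation completes, you will find a shortcut for the Mail Application on the desktop. The content of this installation shows up in Installation Details. Select the Desktop Feature and choose uninstall, and the shortcut file on the desktop is deleted. Of course, you can also use the p2 example installer to install the desktop feature for the Mail Application. P.S.: The example code only implements creating desktop shortcuts on linux/unix.\nExample Code The example code should only compile and run on Eclipse 3.5.x. It uses p2 internal APIs throughout; the p2 public API will first ship with Eclipse 3.6. Most of these classes and methods will be kept, but the names and packages will certainly be refactored.\nhttp://code.google.com/p/kane-toolkit/source/browse/#svn/trunk/p2-example\nReference [1] http://wiki.eclipse.org/Equinox/p2/Concepts\n[2] http://wiki.eclipse.org/Equinox/p2/Publisher\n[3] http://wiki.eclipse.org/Equinox/p2/Engine/Touchpoint_Instructions\n","link":"https://kane.mx/posts/archive/blogspot/learn-p2-step-by-step/","section":"posts","tags":["Equinox","p2","Eclipse"],"title":"[Eclipse][P2]Learn p2 step by step"},{"body":"ssh -qTfnN -D LocalPort remotehost\nAll the added options are for an ssh session that's used for tunneling.\n-q :- be very quiet, we are acting only as a tunnel.\n-T :- Do not allocate a pseudo tty, we are only acting as a tunnel.\n-f :- move the ssh process to background, as we don't want to interact with this ssh session directly.\n-N :- Do not execute remote command.\n-n :- redirect standard input to /dev/null.\nIn addition, on a slow line you can gain performance by enabling compression with the -C option.\n","link":"https://kane.mx/posts/archive/blogspot/ssh-forward/","section":"posts","tags":["ssh","Tip"],"title":"[tip]ssh forward"},{"body":"-Dosgi.install.area=\u0026lt;launcher's folder\u0026gt;\n-Declipse.p2.profile=\n","link":"https://kane.mx/posts/archive/blogspot/simulate-p2-self-host-in-eclipse-run/","section":"posts","tags":["p2","Eclipse"],"title":"Simulate p2 self host in Eclipse run"},{"body":"The IPreferenceStore API of Eclipse is based on OSGi's preferences service. 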
Equinox implements several scope context for different preferences, such DefaultScope, InstanceScope and ConfigurationScope. The IPreferenceStore is the wrapper of instance scope for back-compatibility. It stored the data in workspace(osgi.data.area).\nThe workspace folder would be created when launching RCP application if it doesn't exist. But we can use argument '-data @none' to suppress the creation of workspace. If that, the instance scope/IPreferenceStore can't store any value any more.\nThere is a workaround to resolve such issue. Use ConfigurationScope instead of InstanceScope. Both of them are implemented the same interface, so it's easy to migrate to use ConfigurationScope. The data of configuration scope would be stored in @config.dir/.setting folder.\n","link":"https://kane.mx/posts/archive/blogspot/eclipseosgi-preference/","section":"posts","tags":["Equinox","OSGi","Eclipse"],"title":"Eclipse/OSGi preference"},{"body":"Eclipse platform register an OSGi service 'IProxyService' to manage network connection, which has capability to set proxy setting. There are three types of proxy working mode,\nDirect(no proxy), Manual(specified by user), Native(using OS's proxy setting, such as gnome-proxy, IE). There are three types of proxy supported by IProxyService. They're http, https and socks.\nIt also allows to add/remove ip address from white list, which are accessed without connecting proxy.\nEnd users can manage the proxy setting of Eclipse via Preference - General - Network Connections. Eclipse would do persistence of user's setting. 
Other components of Eclipse also use those proxy settings to access network, such as ECF.\nBelow code snippet shows how to use proxy API to manually specify proxy server,\nproxyService.setProxiesEnabled(true);\nproxyService.setSystemProxiesEnabled(false); IProxyData[] datas = proxyService.getProxyData(); IProxyData proxyData = null; for(IProxyData data : datas) { // clean old data ((ProxyData)data).setSource(\u0026quot;Manual\u0026quot;); //$NON-NLS-1$ data.setUserid(null); //$NON-NLS-1$ data.setPassword(null); //$NON-NLS-1$ if(proxyType == SOCKSPROXY \u0026amp;\u0026amp; IProxyData.SOCKS_PROXY_TYPE.equals(data.getType())) {\nproxyData = data; continue; }else if(proxyType == WEBPROXY \u0026amp;\u0026amp; IProxyData.HTTP_PROXY_TYPE.equals(data.getType())){\nproxyData = data; continue; } data.setHost(null); //$NON-NLS-1$ data.setPort(0); } if(proxyData != null){ proxyData.setHost(proxyServer); proxyData.setPort(proxyPort); } try { proxyService.setProxyData(datas); } catch (CoreException e) { proxyService.setProxiesEnabled(false); proxyService.setSystemProxiesEnabled(false); return false; }\nOfficial API Reference\n","link":"https://kane.mx/posts/archive/blogspot/usage-of-eclipses-proxy-api/","section":"posts","tags":["Eclipse"],"title":"The usage of Eclipse's Proxy API"},{"body":"tune2fs -i 0 -c 0 /dev/sdx\n","link":"https://kane.mx/posts/archive/blogspot/turnoff-automatically-scanning-disk/","section":"posts","tags":["Tip","Linux"],"title":"[tip]Turn off automatically scanning disk when reboot"},{"body":"","link":"https://kane.mx/tags/linux/","section":"tags","tags":null,"title":"Linux"},{"body":"add below lines in ~/.gnomerc\nexport XMODIFIERS=\u0026quot;@im=fcitx\u0026quot;\nexport GTK_IM_MODULE=\u0026quot;xim\u0026quot;\n","link":"https://kane.mx/posts/archive/blogspot/how-to-set-default-input-method-for/","section":"posts","tags":null,"title":"How to set default input method for GNOME"},{"body":"create symbol link under lib/plugins of firefox to link 
jre/plugins/i386/ns**/libjavaplugin_oji.so\n","link":"https://kane.mx/posts/archive/blogspot/how-to-set-up-jre-environment-in/","section":"posts","tags":["JRE","Firefox","HowTo"],"title":"[HowTo]How to set up jre environment in firefox"},{"body":"","link":"https://kane.mx/tags/firefox/","section":"tags","tags":null,"title":"Firefox"},{"body":"","link":"https://kane.mx/tags/howto/","section":"tags","tags":null,"title":"HowTo"},{"body":"","link":"https://kane.mx/tags/jre/","section":"tags","tags":null,"title":"JRE"},{"body":"For example:\n(gdb) handle SIGPIPE nostop noprint pass\n","link":"https://kane.mx/posts/archive/blogspot/how-to-ignore-specified-signal-when/","section":"posts","tags":["Tip","GDB","Debug"],"title":"[Debug][tip]How to ignore specified signal when debugging program via gdb"},{"body":"","link":"https://kane.mx/tags/debug/","section":"tags","tags":null,"title":"Debug"},{"body":"","link":"https://kane.mx/tags/gdb/","section":"tags","tags":null,"title":"GDB"},{"body":"Close Notes.\nDouble-click c:\\notes\\notes.ini to open it.\nAdd a new line \u0026quot;Display_font_adjustment=n\u0026quot; after the third line of notes.ini, where \u0026quot;n\u0026quot; is a number such as 1, 2 or 3; the larger the number, the larger the font.\nLaunch Notes.\n","link":"https://kane.mx/posts/archive/blogspot/how-to-adjust-font-size-of-notes-editor/","section":"posts","tags":["Tip","IBM Notes"],"title":"[HowTo][tip]How to adjust the font size of Notes editor"},{"body":"","link":"https://kane.mx/tags/ibm-notes/","section":"tags","tags":null,"title":"IBM Notes"},{"body":"Set the VM arguments 'osgi.framework.extensions' and 'osgi.frameworkClassPath' when the VM starts. 
If those values are set, the specified jars or paths are added to the class loader when EclipseStarter starts.\nSee org.eclipse.equinox.launcher.Main in the Eclipse 3.4 source code for more details.\nBest Regards\nKane\n","link":"https://kane.mx/posts/archive/blogspot/add-custom-jar-or-path-into-equinox/","section":"posts","tags":["Equinox","OSGi","Eclipse"],"title":"[OSGi][Eclipse]Add custom jar or path into Equinox Framework"},{"body":"The answer is very simple: use the service 'org.osgi.service.packageadmin.PackageAdmin'.\n","link":"https://kane.mx/posts/archive/blogspot/osgihow-to-acquire-fragments-of/","section":"posts","tags":["Equinox","OSGi","Eclipse"],"title":"[OSGi]How to acquire the fragments of specified bundle"},{"body":"Equinox uses adaptor hooks to implement its class loader.\nSee http://wiki.eclipse.org/Adaptor_Hooks for more detail.\nBaseClassLoadingHook searches for native code within the bundle itself. If it finds the file in the bundle's jar file, it extracts the native library into the bundle's storage folder.\nEclipseClassLoadingHook defines some variables used to search for native libraries. The built-in variables are:\nresult.add(\u0026quot;ws/\u0026quot; + info.getWS() + \u0026quot;/\u0026quot;); //$NON-NLS-1$ //$NON-NLS-2$\nresult.add(\u0026quot;os/\u0026quot; + info.getOS() + \u0026quot;/\u0026quot; + info.getOSArch() + \u0026quot;/\u0026quot;); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$\nresult.add(\u0026quot;os/\u0026quot; + info.getOS() + \u0026quot;/\u0026quot;); //$NON-NLS-1$ //$NON-NLS-2$\nSo the class loader can find a native library located under any of those paths. If your bundle is a jar file, Equinox extracts the native library into its storage folder.\nI prefer to define the path of the native code with the OSGi header (Bundle-NativeCode), which also works on other OSGi implementations.\nEquinox defines its own URL schema; one of its classes is named 'BundleURLConnection'. As the name suggests, it describes the files of a bundle. 
You can obtain the URL of a file located in a bundle via Bundle.getResource()/Bundle.getEntry()/Bundle.findEntries()/Bundle.getResources(). The URLs returned by those functions are handled by BundleURLConnection. Once such a URL is passed to FileLocator.toFileURL(URL), the jar bundle is unpacked into its storage folder recursively.\n","link":"https://kane.mx/posts/archive/blogspot/eclipseequinoxs-classloader-and-its-url/","section":"posts","tags":["Equinox","OSGi","Eclipse"],"title":"[Eclipse]Equinox's classloader and its URL schema"},{"body":" 1. Paste the text you want to reformat into vim.\n2. set textwidth=70\n3. In visual mode, select the text and press gq; it is rewrapped to 70 characters per line.\n","link":"https://kane.mx/posts/archive/blogspot/tipvim/","section":"posts","tags":null,"title":"[tip][vim]排版小技巧"},{"body":"In vi/vim,\n:set fileformat=unix\nor use dos2unix / unix2dos\n","link":"https://kane.mx/posts/archive/blogspot/tipconvert-dos-format-to-unix/","section":"posts","tags":["Tip","Linux"],"title":"[tip]convert dos format to unix"},{"body":"Building a C/C++ project on Linux usually requires third-party libraries such as gtk+ and glib. Hard-coding their absolute paths in a Makefile is inflexible; use pkg-config instead. Below is a code snippet:\nGTK_LIB=$(shell pkg-config --libs gtk+-2.0)\nGTK_INC=$(shell pkg-config --cflags gtk+-2.0)\ngcc -o yourlibrary.so $(GTK_INC) $(GTK_LIB)\n","link":"https://kane.mx/posts/archive/blogspot/makefile-tip/","section":"posts","tags":["Tip","Linux","Makefile"],"title":"[tip]Makefile"},{"body":"","link":"https://kane.mx/tags/makefile/","section":"tags","tags":null,"title":"Makefile"},{"body":"OSGi provides a mechanism that lets users contribute custom URL schemes automatically. It avoids some restrictions of the standard Java facilities for extending URL handlers. More detail can be found in the OSGi R4 specification, which describes how OSGi implements the URL Handlers Service.\nThe following sample illustrates how to contribute your own scheme (protocol):\n1. 
Register your URLStreamHandlerService implementation, which must carry a property named \u0026quot;url.handler.protocol\u0026quot;. The code below registers my scheme 'smb':\npublic void start(BundleContext context) throws Exception {\nHashtable properties = new Hashtable();\nproperties.put( URLConstants.URL_HANDLER_PROTOCOL, new String[] { \u0026quot;smb\u0026quot; } );\ncontext.registerService(URLStreamHandlerService.class.getName(), new SmbURLHandler(), properties );\n}\n2. Your URL handler extends AbstractURLStreamHandlerService and implements the abstract method 'openConnection(URL)':\npublic class SmbURLHandler extends AbstractURLStreamHandlerService {\npublic URLConnection openConnection(URL url) throws IOException {\nreturn new SmbURLConnection(url);\n}\n}\n3. Your URL connection extends java.net.URLConnection:\npublic class SmbURLConnection extends URLConnection {\nprotected SmbURLConnection(URL url) {\nsuper(url);\n}\npublic void connect() throws IOException {\n}\n}\n","link":"https://kane.mx/posts/archive/blogspot/url-handlers-service/","section":"posts","tags":["Equinox","URL Handler Service","OSGi"],"title":"[OSGi][Equinox]URL Handlers Service"},{"body":"","link":"https://kane.mx/tags/url-handler-service/","section":"tags","tags":null,"title":"URL Handler Service"},{"body":"The OSGi specification defines the Bundle-NativeCode header to specify the native code libraries contained in a bundle. All the magic starts in org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.findLibrary(String) and org.eclipse.osgi.framework.internal.core.BundleLoader.findLibrary(String). BundleLoader then uses org.eclipse.osgi.baseadaptor.BaseData (an implementation of BundleData) to find the library path. If the bundle is NOT a jar file, it gets the absolute path of the library directly. Otherwise, BaseData extracts the library file if it can NOT find it in the OSGi bundle storage (located in ${data}/org.eclipse.osgi/bundles/[bundle_id]/.cp/). 
Refer to org.eclipse.osgi.baseadaptor.BaseData.findLibrary(String) for more detail.\n","link":"https://kane.mx/posts/archive/blogspot/bundle-nativecode-implementation-in/","section":"posts","tags":["Equinox","NativeCode","Bundle","OSGi","Eclipse"],"title":"[OSGi][Equinox]the Bundle-NativeCode implementation in Equinox"},{"body":"","link":"https://kane.mx/tags/bundle/","section":"tags","tags":null,"title":"Bundle"},{"body":"","link":"https://kane.mx/tags/nativecode/","section":"tags","tags":null,"title":"NativeCode"},{"body":"1\u0026gt; redirects normal (standard) output and 2\u0026gt; redirects error output; to write both to the same file, use 2\u0026gt;\u0026amp;1.\nWhat is /dev/null? Basically, it works like a 'black hole' trash can: anything you redirect to this virtual device simply vanishes.\n","link":"https://kane.mx/posts/archive/blogspot/linux-shelllearning-note-31908/","section":"posts","tags":["Linux","Shell"],"title":"[shell]Learning Note - 3/19/08"},{"body":"You may have noticed the qualifier string property when exporting your features and plug-ins with Eclipse PDE. However, the specified qualifier string won't appear in the output even after the export succeeds.\nTo make the qualifier string take effect, define your feature and plug-in versions like this:\n1.0.0.qualifier, 2.2.2.qualifier\n:)\n","link":"https://kane.mx/posts/archive/blogspot/how-to-use-qualifier-string-when/","section":"posts","tags":["PDE","Eclipse"],"title":"[Eclipse]How to use qualifier string when exporting features and plug-ins"},{"body":"For the past few days a registry-related defect has been driving me up the wall.\nThe background: during installation, our product modifies the registry to register file associations and similar information. This worked correctly while the installer was based on InstallShield, but after the installer recently switched to MSI, the information we wrote to the registry no longer landed in the expected location.\nAfter various experiments and a lot of searching, I finally understood the cause. The process that modifies the registry runs as a system process rather than as the current user, so data written under HKEY_CURRENT_USER does not end up under the currently logged-in user.\nWe should not use \u0026quot;HKEY_CURRENT_USER\u0026quot; to retrieve the current user's registry values, because Windows services always start before the user logs in. Doing so may cause errors or load the wrong settings profile. 
If you still need the current user's registry settings, refer to \u0026quot;RegOpenCurrentUser\u0026quot;.\nIn the end I had to write the data under the Local Machine key instead.\n","link":"https://kane.mx/posts/archive/blogspot/suck-windows-registry/","section":"posts","tags":["注册表"],"title":"万恶的注册表"},{"body":"","link":"https://kane.mx/tags/%E6%B3%A8%E5%86%8C%E8%A1%A8/","section":"tags","tags":null,"title":"注册表"},{"body":"When you develop a rich client application based on the Eclipse framework and your application requires the Eclipse platform feature, you will find that it has some menu items contributed by the Eclipse platform. Those menu items are defined by several plug-ins' implementations of the actionSets extension point. Eclipse provides an activity mechanism to suppress the contributions you don't want. However, you must know the identifiers of the contributions you want to suppress, and it is hard work to find all of them among dozens of plug-ins. So I wrote a utility function that lists all the contributions to a specified extension point.\nIExtensionRegistry registry = Platform.getExtensionRegistry();\nIExtensionPoint extensionPoint = registry.getExtensionPoint(\u0026quot;org.eclipse.ui.actionSets\u0026quot;);\nIExtension[] extensions = extensionPoint.getExtensions();\nfor(int i = 0; i \u0026lt; extensions.length; i++){\nIConfigurationElement elements[] = extensions[i].getConfigurationElements();\nfor(int j = 0; j \u0026lt; elements.length; j++){\nString pluginId = elements[j].getNamespaceIdentifier();\nif(pluginId.indexOf(\u0026quot;org.eclipse\u0026quot;) \u0026gt; -1){ //$NON-NLS-1$\nIConfigurationElement[] subElements = elements[j].getChildren(\u0026quot;action\u0026quot;);\nfor(int m = 0; m \u0026lt; subElements.length; m++){\nSystem.out.println(\u0026quot;Plugin: \u0026quot; + pluginId + \u0026quot; Id: \u0026quot; + subElements[m].getAttribute(\u0026quot;id\u0026quot;));\n}\n}\n}\n}\nThe following snippet shows the activities for the 
menus of the Eclipse platform:\n","link":"https://kane.mx/posts/archive/blogspot/get-rid-of-menus-of-eclipse-platform/","section":"posts","tags":["Eclipse","RCP"],"title":"[Eclipse]get rid of the menus of eclipse platform"},{"body":"","link":"https://kane.mx/tags/rcp/","section":"tags","tags":null,"title":"RCP"},{"body":"","link":"https://kane.mx/tags/jni/","section":"tags","tags":null,"title":"JNI"},{"body":"Java programs often use JNI to call a native library, and it is convenient to package the library together with the class files in a single jar for deployment.\nHowever, the JDK class loader cannot load a library from inside a jar file. A workaround is to write the library from the jar out as a temporary file in a temp directory or the java.library directory.\nTwo related articles:\nLoad Library inside a jar file\nTips for loading native libraries when using JNI\n","link":"https://kane.mx/posts/archive/blogspot/jar/","section":"posts","tags":["Java","JNI"],"title":"加载jar文件里的本地库"},{"body":"These days my work is focused on Eclipse's update support. I now understand the general mechanism, and I have run into some issues while using it in development.\nThe update mechanism includes four major operations: install, enable, disable and uninstall. All of them can be executed from the command line; for example, a feature can be installed with the following line:\n-application org.eclipse.update.core.standaloneUpdate -command install -featureId my.feature -version 1.0.0 -from file:/v:/local_updateSite/ -to file:/v:/eclipse/.\nThe installation process copies the feature, and the plug-ins it includes, from the update site to the local site, then executes the feature's global install handler if it has one.\nA strange issue occurs when I want to disable a feature. I try to disable the feature with the command,\n-command disable -featureId my.feature -version 1.0.0 -to file:/v:/eclipse/\nThe command's output indicates that it executed successfully.\nBut when I list the status of features with the command line \u0026quot;-command listFeatures\u0026quot;, my.feature is still shown as enabled.\nThen I try to uninstall my.feature with the command,\n-command uninstall -featureId my.feature -version 1.0.0 -to 
file:/v:/eclipse/\nIt fails, and the log file shows the following root cause:\n!MESSAGE [Cannot find unconfigured feature my.feature with version 1.0.0]\nAn unconfigured feature is a disabled feature.\nI posted my question in the forum, and someone told me that it might be an Eclipse bug and advised me to file a bug report for it.\n","link":"https://kane.mx/posts/archive/blogspot/eclipse-update-support/","section":"posts","tags":["Update","Eclipse","RCP"],"title":"[Eclipse]Eclipse update support"},{"body":"","link":"https://kane.mx/tags/update/","section":"tags","tags":null,"title":"Update"},{"body":"I hit a defect where dynamically created menu items disappeared after a new viewPart was created. It cost me overtime last Friday. Today I found the root cause.\nThe scenario is:\nopen the first document, the items are shown correctly\nopen another document, the items disappear\nThe requirement is to show the menu items while the current part is a document, and hide them otherwise.\nSo our implementation is:\nwhen the current document part is deactivated, set the menu items invisible\nwhen a document part is activated, set the menu items visible\nAfter debugging, I found that the menu items were updated before the part-activated listener was notified. Hence the items were still invisible when the parent menu was updated. The fix is to set the menu items visible when the part-opened listener is notified.\n","link":"https://kane.mx/posts/archive/blogspot/call-sequence-between-partactivated-and/","section":"posts","tags":["Eclipse","RCP"],"title":"[Eclipse]The call sequence between partActivated and menu update"},{"body":"I needed remote debugging in our project, but I could only find write-ups about WebLogic on the internet. After some investigation, I worked out how to remote-debug an RCP application with Eclipse.\nTwo JVM parameters are important. 
The remote Java application must be launched with both of them.\n-Xdebug //tells the JVM to start in debug mode\n-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 //transport=dt_socket means communication over a socket; address=1044 means the port number is 1044\nThen there are 3 steps in the local environment:\n1. Import the source code into an Eclipse project\n2. Debug - Remote Java Application; see the attachment for a sample\n3. Set breakpoints\nUpdate:\nA simpler way:\n-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000\n","link":"https://kane.mx/posts/archive/blogspot/remote-debug-in-eclipse/","section":"posts","tags":["Java","Eclipse","Debug"],"title":"[debug][java]Remote debug in Eclipse"}]