System Architecture
Learn how BlackTide executes millions of checks per day with low latency, reliable alerting, and real-time data aggregation.
High-Level Overview
BlackTide is built on a modern microservices architecture with 12 specialized services:
Core Services
- Core API Service (Port 8080): Authentication, CRUD operations, rate limiting, CSRF protection
- Scheduler Service (Port 8082): Manages check scheduling and publishes to NATS queue
- Check Runners (Ports 8083-8084): Execute checks (HTTP, TCP, ICMP, TLS, Web3, etc.)
- Ingestion & Alerts (Port 8085): Store results and evaluate alert rules
- Incident & Notifications (Port 8086): Manage incidents and send notifications
- Status Pages (Port 8087): Serve public/private status pages
- Aggregation Timer: NATS publisher for materialized views
P0 Differentiation Services (Web3)
- Transaction Indexer (Port 8090): Whale tracking and smart money analytics
- MEV Detector (Port 8089): Sandwich attack detection
- Security Analyzer (Port 8088): Exploit detection and auto-pause circuit breaker
Content Services
- Blog Service Backend (Port 8091): FastAPI blog API (Python 3.9)
- Blog Service Frontend (Port 3001): Next.js blog (SSR + SSG)
Tech Stack
- Backend: Go 1.23.4, Chi Router, GORM
- Database: PostgreSQL 15+ with native partitioning
- Cache: Redis 7
- Message Queue: NATS JetStream
- Service Discovery: HashiCorp Consul (registration sketch after this list)
- Frontend: Next.js 16, React 19, TypeScript
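For a concrete feel of how these pieces fit together, here is a minimal sketch of a service registering itself with Consul, including the health check Consul uses to drop unhealthy instances (see Failure Handling below). The service name, port, and /healthz endpoint are illustrative, not taken from the BlackTide codebase.
Consul Registration (sketch)
go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default 127.0.0.1:8500).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance; Consul polls /healthz and deregisters
	// the instance if it stays critical. Names and ports are illustrative.
	reg := &consul.AgentServiceRegistration{
		ID:   "core-api-1",
		Name: "core-api",
		Port: 8080,
		Check: &consul.AgentServiceCheck{
			HTTP:                           "http://localhost:8080/healthz",
			Interval:                       "10s",
			Timeout:                        "2s",
			DeregisterCriticalServiceAfter: "1m",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}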
Data Flow
1. Check Scheduling
Scheduling Flow
text
User creates monitor via frontend
↓
Core API validates and stores monitor config
↓
Scheduler Service picks up monitor (via polling or NATS)
↓
Scheduler publishes "check.execute" event to NATS JetStream
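A minimal sketch of that last publish step, assuming the JetStream API from github.com/nats-io/nats.go; the CheckJob payload shape is hypothetical.
Publishing a Check Job (sketch)
go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// CheckJob is an illustrative payload; the real schema may differ.
type CheckJob struct {
	MonitorID string `json:"monitor_id"`
	Type      string `json:"type"`     // http, tcp, icmp, tls, web3...
	Location  string `json:"location"` // e.g. "us-east"
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	payload, _ := json.Marshal(CheckJob{MonitorID: "mon_123", Type: "http", Location: "us-east"})

	// JetStream acknowledges the publish once the message is persisted,
	// so a scheduled check is never silently dropped.
	if _, err := js.Publish("check.execute", payload); err != nil {
		log.Fatal(err)
	}
}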
2. Check Execution
Execution Flow
text
Check Runners subscribe to NATS queue
↓
Runner fetches monitor config
↓
Execute check from configured location (US East, EU West, etc.)
↓
Publish "check.completed" event to NATS with result3. Result Ingestion
3. Result Ingestion
Ingestion Flow
text
Ingestion Service receives "check.completed" event
↓
Store result in PostgreSQL (partitioned by time range)
↓
Update Redis cache with latest status
↓
Evaluate alert rules for this monitor
↓
If rule triggers → publish "alert.triggered" event
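The same steps in sketch form, with hypothetical table, key, and function names standing in for the real ingestion code:
Ingestion Steps (sketch)
go
package ingest

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
	"gorm.io/gorm"
)

// ingestResult mirrors the flow above; evaluateRules is a stand-in for
// the real alert engine.
func ingestResult(ctx context.Context, db *gorm.DB, rdb *redis.Client,
	monitorID, status string, latency time.Duration) error {

	// 1. Store in the time-partitioned check_results table.
	if err := db.Exec(
		"INSERT INTO check_results (monitor_id, status, latency_ms, created_at) VALUES (?, ?, ?, now())",
		monitorID, status, latency.Milliseconds(),
	).Error; err != nil {
		return err
	}

	// 2. Cache the latest status for dashboards and status pages.
	if err := rdb.Set(ctx, "monitor:"+monitorID+":status", status, 0).Err(); err != nil {
		return err
	}

	// 3. Evaluate alert rules; a trigger publishes "alert.triggered".
	return evaluateRules(ctx, monitorID, status)
}

func evaluateRules(ctx context.Context, monitorID, status string) error {
	// Hypothetical: load rules for monitorID, compare thresholds,
	// publish "alert.triggered" to NATS when a rule fires.
	return nil
}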
4. Alert & Notification
Notification Flow
text
Incident Service receives "alert.triggered" event
↓
Create or update incident
↓
Notification Service sends alerts to configured channels
↓
Email, Slack, Discord, PagerDuty, etc.
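The final hop can be pictured as a fan-out over channel implementations. The Notifier interface and Slack webhook payload below are illustrative, not BlackTide's actual abstraction:
Notification Fan-Out (sketch)
go
package notify

import (
	"bytes"
	"fmt"
	"net/http"
)

// Notifier is a hypothetical channel abstraction; each configured
// channel (email, Slack, Discord, PagerDuty...) implements it.
type Notifier interface {
	Send(message string) error
}

type SlackWebhook struct{ URL string }

func (s SlackWebhook) Send(message string) error {
	body := []byte(fmt.Sprintf(`{"text": %q}`, message))
	resp, err := http.Post(s.URL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

// FanOut delivers one incident update to every configured channel,
// collecting errors instead of stopping at the first failure.
func FanOut(channels []Notifier, message string) []error {
	var errs []error
	for _, c := range channels {
		if err := c.Send(message); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}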
Database Architecture
PostgreSQL Partitioning
We use PostgreSQL 15+ native partitioning for time-series data to maintain high performance as data grows (a DDL sketch follows the list):
- check_results - Partitioned by week (PARTITION BY RANGE on created_at)
- incidents_timeline - Partitioned by month
- notifications_log - Partitioned by month
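A sketch of what the weekly range partitioning for check_results looks like, run here through GORM's raw-SQL escape hatch; the column list is abbreviated and illustrative:
Partition DDL (sketch)
go
package migrate

import "gorm.io/gorm"

// createCheckResultsPartitions shows the shape of the DDL; the real
// migration creates partitions on a rolling schedule.
func createCheckResultsPartitions(db *gorm.DB) error {
	stmts := []string{
		`CREATE TABLE IF NOT EXISTS check_results (
			monitor_id  text        NOT NULL,
			status      text        NOT NULL,
			latency_ms  bigint,
			created_at  timestamptz NOT NULL
		) PARTITION BY RANGE (created_at)`,
		// One partition per week; a background job creates the next week
		// ahead of time and drops expired ones (see Retention Policies).
		`CREATE TABLE IF NOT EXISTS check_results_2025_w42
			PARTITION OF check_results
			FOR VALUES FROM ('2025-10-13') TO ('2025-10-20')`,
	}
	for _, s := range stmts {
		if err := db.Exec(s).Error; err != nil {
			return err
		}
	}
	return nil
}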
Migration from TimescaleDB
We migrated from TimescaleDB to PostgreSQL native partitioning in October 2025 to simplify our stack while maintaining the same performance. See backend/ARCHITECTURE_MIGRATION.md for details.
Retention Policies
| Data Type | Retention | Aggregation |
|---|---|---|
| Raw check results | 90 days | 1-minute granularity |
| Hourly aggregates | 1 year | Avg/min/max/p95 latency |
| Daily aggregates | Forever | Uptime percentage |
| Incidents | Forever | Full timeline |
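Partitioning also makes retention cheap: expiring 90-day-old raw results is a DROP TABLE of one partition, not a bulk DELETE over millions of rows. A sketch, assuming the weekly partition naming from the DDL example above:
Retention Job (sketch)
go
package retention

import (
	"fmt"
	"time"

	"gorm.io/gorm"
)

// dropExpiredPartition removes the weekly check_results partition that
// fell out of the 90-day window; dropping a partition is O(1) compared
// to deleting rows one by one.
func dropExpiredPartition(db *gorm.DB, now time.Time) error {
	cutoff := now.AddDate(0, 0, -90)
	year, week := cutoff.ISOWeek()
	name := fmt.Sprintf("check_results_%d_w%02d", year, week)
	return db.Exec("DROP TABLE IF EXISTS " + name).Error
}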
Scalability
Horizontal Scaling
All services are stateless and can scale horizontally:
- Check Runners - Scale to 50+ instances for high check volume
- Ingestion Service - Multiple instances consume from NATS queue
- API Service - Load balanced across multiple instances
Load Distribution
Checks are distributed across 6 global locations:
- US East (Virginia)
- US West (California)
- EU West (Ireland)
- EU Central (Frankfurt)
- Asia Pacific (Singapore)
- South America (São Paulo)
High Availability
Redundancy
- Database - PostgreSQL with streaming replication (primary + 2 replicas)
- Redis - Sentinel mode with automatic failover
- NATS - JetStream cluster (3 nodes)
- Services - Multiple instances behind load balancer
Failure Handling
- NATS retries - Failed checks are retried with exponential backoff (see the sketch after this list)
- Circuit breaker - Services auto-pause when downstream dependencies fail
- Health checks - Consul monitors all services and removes unhealthy instances
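As a dependency-free illustration of the retry pattern (the real behavior lives in the NATS consumer configuration rather than application loops):
Exponential Backoff (sketch)
go
package resilience

import "time"

// retryWithBackoff retries op with delays of base, 2*base, 4*base, ...
// maxAttempts must be >= 1. JetStream redelivery gives the same effect
// at the queue level; a circuit breaker would stop calling a failing
// dependency entirely instead of retrying it.
func retryWithBackoff(op func() error, maxAttempts int, base time.Duration) error {
	delay := base
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2 // double the wait after each failure
		}
	}
	return err
}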
Security
Multi-Layer Protection
- Authentication - httpOnly cookies (unreadable by page JavaScript, so tokens can't be exfiltrated via XSS) + JWT
- CSRF Protection - Double Submit Cookie pattern (middleware sketch below)
- Security Headers - CSP, X-Frame-Options, HSTS, etc.
- CORS - Configured with credentials for specific origins only
- Rate Limiting - Redis token bucket per user/IP
Security Best Practice
We never use localStorage or sessionStorage for JWT tokens. All authentication tokens are stored in httpOnly cookies to prevent XSS attacks.
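Here is what the Double Submit Cookie check looks like as Chi-compatible middleware: the token arrives both in a cookie and in a header, and the two must match. The cookie and header names are illustrative:
CSRF Middleware (sketch)
go
package security

import (
	"crypto/subtle"
	"net/http"
)

// csrfMiddleware implements the Double Submit Cookie pattern: the
// frontend echoes the csrf_token cookie in an X-CSRF-Token header,
// which a script on another origin cannot do.
func csrfMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet || r.Method == http.MethodHead || r.Method == http.MethodOptions {
			next.ServeHTTP(w, r) // safe methods skip the check
			return
		}
		cookie, err := r.Cookie("csrf_token")
		header := r.Header.Get("X-CSRF-Token")
		if err != nil || header == "" ||
			subtle.ConstantTimeCompare([]byte(cookie.Value), []byte(header)) != 1 {
			http.Error(w, "CSRF token mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}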
Observability
Monitoring & Metrics
- Prometheus - Metrics collection from all services (instrumentation sketch below)
- Grafana - Real-time dashboards and alerts
- Consul - Service health and discovery
- Logs - Structured JSON logging with correlation IDs
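For instance, instrumenting check execution time with a Prometheus histogram might look like the sketch below; the metric name and labels are illustrative:
Check Duration Histogram (sketch)
go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkDuration could back the "Check Execution Time" row in the Key
// Metrics table below; p95 comes from histogram quantiles in Grafana.
var checkDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "blacktide_check_duration_seconds",
	Help:    "Time taken to execute a single check.",
	Buckets: prometheus.DefBuckets,
}, []string{"check_type", "location"})

func runCheck(checkType, location string) {
	start := time.Now()
	// ... execute the check ...
	checkDuration.WithLabelValues(checkType, location).Observe(time.Since(start).Seconds())
}

func main() {
	runCheck("http", "us-east")
	// Prometheus scrapes /metrics from every service instance.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}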
Key Metrics
| Metric | Target | Purpose |
|---|---|---|
| API Response Time (p95) | <200ms | User experience |
| Check Execution Time | <30s | Timely alerts |
| NATS Queue Depth | <1000 | Processing backlog |
| Database Query Time (p95) | <50ms | Data access speed |
Deployment
Infrastructure
- Container Orchestration - Docker + Docker Compose (production)
- CI/CD - GitHub Actions for automated builds and deployments
- Image Registry - GitHub Container Registry (GHCR)
- Reverse Proxy - Nginx with SSL/TLS termination
Deployment Process
Deployment Commands
bash
# Build all services (multi-platform: amd64 + arm64)
make build-and-push
# Deploy via Ansible
ansible-playbook deploy.yml
# Health check all services
make health-check
Performance Optimizations
Caching Strategy
- Monitor configs - Cached in Redis for 5 minutes (cache-aside sketch after this list)
- Latest check results - Cached for real-time dashboard
- Uptime aggregates - Cached for 1 hour
- Status page data - Cached for 30 seconds (public)
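A sketch of the cache-aside pattern behind the first item, using github.com/redis/go-redis/v9; the key scheme and loader function are hypothetical:
Cache-Aside Lookup (sketch)
go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// getMonitorConfig tries Redis first and falls back to Postgres on a
// miss, writing the result back with a 5-minute TTL (cache-aside).
func getMonitorConfig(ctx context.Context, rdb *redis.Client, id string,
	loadFromDB func(string) (string, error)) (string, error) {

	key := "monitor:" + id + ":config"
	if val, err := rdb.Get(ctx, key).Result(); err == nil {
		return val, nil // cache hit
	}
	val, err := loadFromDB(id) // cache miss: hit Postgres
	if err != nil {
		return "", err
	}
	// The TTL bounds staleness: config edits show up within 5 minutes.
	if err := rdb.Set(ctx, key, val, 5*time.Minute).Err(); err != nil {
		return "", err
	}
	return val, nil
}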
Database Optimizations
- Indexes - B-tree on monitor_id, created_at, status
- Partial indexes - Index only failed checks for faster incident queries
- Connection pooling - Max 100 connections per service
- Query optimization - Use EXPLAIN ANALYZE for slow queries
Future Roadmap
Planned Improvements
- Kubernetes - Migrate from Docker Compose to K8s for better orchestration
- Multi-region - Deploy services in multiple AWS regions for global HA
- GraphQL API - Add GraphQL endpoint alongside REST API
- Real-time WebSocket - Live dashboard updates without polling
- Machine Learning - Anomaly detection for unusual patterns
Technical Documentation
For more technical details, see:
- backend/ARCHITECTURE_MIGRATION.md - TimescaleDB → PostgreSQL migration
- backend/docker-compose.yml - Service configuration
- backend/README.md - Backend development guide
- API Reference - Complete API documentation