System Architecture
Learn how BlackTide executes millions of checks per day with low latency, reliable alerting, and real-time data aggregation.
High-Level Overview
BlackTide is built on a modern microservices architecture with 12 specialized services:
Core Services
- Core API Service (Port 8080): Authentication, CRUD operations, rate limiting, CSRF protection
- Scheduler Service (Port 8082): Manages check scheduling and publishes to NATS queue
- Check Runners (Ports 8083-8084): Execute checks (HTTP, TCP, ICMP, TLS, Web3, etc.)
- Ingestion & Alerts (Port 8085): Store results and evaluate alert rules
- Incident & Notifications (Port 8086): Manage incidents and send notifications
- Status Pages (Port 8087): Serve public/private status pages
- Aggregation Timer: NATS publisher for materialized views
P0 Differentiation Services (Web3)
- Transaction Indexer (Port 8090): Whale tracking and smart money analytics
- MEV Detector (Port 8089): Sandwich attack detection
- Security Analyzer (Port 8088): Exploit detection and auto-pause circuit breaker
Content Services
- Blog Service Backend (Port 8091): FastAPI blog API (Python 3.9)
- Blog Service Frontend (Port 3001): Next.js blog (SSR + SSG)
Tech Stack
- Backend: Go 1.23.4, Chi Router, GORM
- Database: PostgreSQL 15+ with native partitioning
- Cache: Redis 7
- Message Queue: NATS JetStream
- Service Discovery: HashiCorp Consul (registration sketch after this list)
- Frontend: Next.js 16, React 19, TypeScript
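For a concrete feel of how these pieces fit together, here is a minimal sketch of a service registering itself with Consul, including the health check Consul uses to drop unhealthy instances (see Failure Handling below). The service name, port, and /healthz endpoint are illustrative, not taken from the BlackTide codebase.
Consul Registration (sketch)
go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default 127.0.0.1:8500).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance; Consul polls /healthz and deregisters
	// the instance if it stays critical. Names and ports are illustrative.
	reg := &consul.AgentServiceRegistration{
		ID:   "core-api-1",
		Name: "core-api",
		Port: 8080,
		Check: &consul.AgentServiceCheck{
			HTTP:                           "http://localhost:8080/healthz",
			Interval:                       "10s",
			Timeout:                        "2s",
			DeregisterCriticalServiceAfter: "1m",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}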
Data Flow
1. Check Scheduling
Scheduling Flow
text
User creates monitor via frontend
↓
Core API validates and stores monitor config
↓
Scheduler Service picks up monitor (via polling or NATS)
↓
Scheduler publishes "check.execute" event to NATS JetStream
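A minimal sketch of that last publish step, assuming the JetStream API from github.com/nats-io/nats.go; the CheckJob payload shape is hypothetical.
Publishing a Check Job (sketch)
go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// CheckJob is an illustrative payload; the real schema may differ.
type CheckJob struct {
	MonitorID string `json:"monitor_id"`
	Type      string `json:"type"`     // http, tcp, icmp, tls, web3...
	Location  string `json:"location"` // e.g. "us-east"
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	payload, _ := json.Marshal(CheckJob{MonitorID: "mon_123", Type: "http", Location: "us-east"})

	// JetStream acknowledges the publish once the message is persisted,
	// so a scheduled check is never silently dropped.
	if _, err := js.Publish("check.execute", payload); err != nil {
		log.Fatal(err)
	}
}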
2. Check Execution
Execution Flow
text
Check Runners subscribe to NATS queue
↓
Runner fetches monitor config
↓
Execute check from configured location (US East, EU West, etc.)
↓
Publish "check.completed" event to NATS with result3. Result Ingestion
3. Result Ingestion
Ingestion Flow
text
Ingestion Service receives "check.completed" event
↓
Store result in PostgreSQL (partitioned by time range)
↓
Update Redis cache with latest status
↓
Evaluate alert rules for this monitor
↓
If rule triggers → publish "alert.triggered" event
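The same steps in sketch form, with hypothetical table, key, and function names standing in for the real ingestion code:
Ingestion Steps (sketch)
go
package ingest

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
	"gorm.io/gorm"
)

// ingestResult mirrors the flow above; evaluateRules is a stand-in for
// the real alert engine.
func ingestResult(ctx context.Context, db *gorm.DB, rdb *redis.Client,
	monitorID, status string, latency time.Duration) error {

	// 1. Store in the time-partitioned check_results table.
	if err := db.Exec(
		"INSERT INTO check_results (monitor_id, status, latency_ms, created_at) VALUES (?, ?, ?, now())",
		monitorID, status, latency.Milliseconds(),
	).Error; err != nil {
		return err
	}

	// 2. Cache the latest status for dashboards and status pages.
	if err := rdb.Set(ctx, "monitor:"+monitorID+":status", status, 0).Err(); err != nil {
		return err
	}

	// 3. Evaluate alert rules; a trigger publishes "alert.triggered".
	return evaluateRules(ctx, monitorID, status)
}

func evaluateRules(ctx context.Context, monitorID, status string) error {
	// Hypothetical: load rules for monitorID, compare thresholds,
	// publish "alert.triggered" to NATS when a rule fires.
	return nil
}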
4. Alert & Notification
Notification Flow
text
Incident Service receives "alert.triggered" event
↓
Create or update incident
↓
Notification Service sends alerts to configured channels
↓
Email, Slack, Discord, PagerDuty, etc.
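The final hop can be pictured as a fan-out over channel implementations. The Notifier interface and Slack webhook payload below are illustrative, not BlackTide's actual abstraction:
Notification Fan-Out (sketch)
go
package notify

import (
	"bytes"
	"fmt"
	"net/http"
)

// Notifier is a hypothetical channel abstraction; each configured
// channel (email, Slack, Discord, PagerDuty...) implements it.
type Notifier interface {
	Send(message string) error
}

type SlackWebhook struct{ URL string }

func (s SlackWebhook) Send(message string) error {
	body := []byte(fmt.Sprintf(`{"text": %q}`, message))
	resp, err := http.Post(s.URL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

// FanOut delivers one incident update to every configured channel,
// collecting errors instead of stopping at the first failure.
func FanOut(channels []Notifier, message string) []error {
	var errs []error
	for _, c := range channels {
		if err := c.Send(message); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}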
Database Architecture
PostgreSQL Partitioning
We use PostgreSQL 15+ native partitioning for time-series data to maintain high performance as data grows (a DDL sketch follows the list):
- check_results - Partitioned by week (PARTITION BY RANGE on created_at)
- incidents_timeline - Partitioned by month
- notifications_log - Partitioned by month
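A sketch of what the weekly range partitioning for check_results looks like, run here through GORM's raw-SQL escape hatch; the column list is abbreviated and illustrative:
Partition DDL (sketch)
go
package migrate

import "gorm.io/gorm"

// createCheckResultsPartitions shows the shape of the DDL; the real
// migration creates partitions on a rolling schedule.
func createCheckResultsPartitions(db *gorm.DB) error {
	stmts := []string{
		`CREATE TABLE IF NOT EXISTS check_results (
			monitor_id  text        NOT NULL,
			status      text        NOT NULL,
			latency_ms  bigint,
			created_at  timestamptz NOT NULL
		) PARTITION BY RANGE (created_at)`,
		// One partition per week; a background job creates the next week
		// ahead of time and drops expired ones (see Retention Policies).
		`CREATE TABLE IF NOT EXISTS check_results_2025_w42
			PARTITION OF check_results
			FOR VALUES FROM ('2025-10-13') TO ('2025-10-20')`,
	}
	for _, s := range stmts {
		if err := db.Exec(s).Error; err != nil {
			return err
		}
	}
	return nil
}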
Migration from TimescaleDB
We migrated from TimescaleDB to PostgreSQL native partitioning in October 2025 to simplify our stack while maintaining the same performance. See backend/ARCHITECTURE_MIGRATION.md for details.
Retention Policies
| Data Type | Retention | Aggregation |
|---|---|---|
| Raw check results | 90 days | 1-minute granularity |
| Hourly aggregates | 1 year | Avg/min/max/p95 latency |
| Daily aggregates | Forever | Uptime percentage |
| Incidents | Forever | Full timeline |
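Partitioning also makes retention cheap: expiring 90-day-old raw results is a DROP TABLE of one partition, not a bulk DELETE over millions of rows. A sketch, assuming the weekly partition naming from the DDL example above:
Retention Job (sketch)
go
package retention

import (
	"fmt"
	"time"

	"gorm.io/gorm"
)

// dropExpiredPartition removes the weekly check_results partition that
// fell out of the 90-day window; dropping a partition is O(1) compared
// to deleting rows one by one.
func dropExpiredPartition(db *gorm.DB, now time.Time) error {
	cutoff := now.AddDate(0, 0, -90)
	year, week := cutoff.ISOWeek()
	name := fmt.Sprintf("check_results_%d_w%02d", year, week)
	return db.Exec("DROP TABLE IF EXISTS " + name).Error
}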
Scalability
Horizontal Scaling
All services are stateless and can scale horizontally:
- Check Runners - Scale to 50+ instances for high check volume
- Ingestion Service - Multiple instances consume from NATS queue
- API Service - Load balanced across multiple instances
Load Distribution
Checks are distributed across 6 global locations:
- US East (Virginia)
- US West (California)
- EU West (Ireland)
- EU Central (Frankfurt)
- Asia Pacific (Singapore)
- South America (São Paulo)
High Availability
Redundancy
- Database - PostgreSQL with streaming replication (primary + 2 replicas)
- Redis - Sentinel mode with automatic failover
- NATS - JetStream cluster (3 nodes)
- Services - Multiple instances behind load balancer
Failure Handling
- NATS retries - Failed checks are retried with exponential backoff (see the sketch after this list)
- Circuit breaker - Services auto-pause when downstream dependencies fail
- Health checks - Consul monitors all services and removes unhealthy instances
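As a dependency-free illustration of the retry pattern (the real behavior lives in the NATS consumer configuration rather than application loops):
Exponential Backoff (sketch)
go
package resilience

import "time"

// retryWithBackoff retries op with delays of base, 2*base, 4*base, ...
// maxAttempts must be >= 1. JetStream redelivery gives the same effect
// at the queue level; a circuit breaker would stop calling a failing
// dependency entirely instead of retrying it.
func retryWithBackoff(op func() error, maxAttempts int, base time.Duration) error {
	delay := base
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2 // double the wait after each failure
		}
	}
	return err
}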
Security
Multi-Layer Protection
- Authentication - httpOnly cookies (unreadable by page JavaScript, so tokens can't be exfiltrated via XSS) + JWT
- CSRF Protection - Double Submit Cookie pattern (middleware sketch below)
- Security Headers - CSP, X-Frame-Options, HSTS, etc.
- CORS - Configured with credentials for specific origins only
- Rate Limiting - Redis token bucket per user/IP
Security Best Practice
We never use localStorage or sessionStorage for JWT tokens. All authentication tokens are stored in httpOnly cookies to prevent XSS attacks.
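Here is what the Double Submit Cookie check looks like as Chi-compatible middleware: the token arrives both in a cookie and in a header, and the two must match. The cookie and header names are illustrative:
CSRF Middleware (sketch)
go
package security

import (
	"crypto/subtle"
	"net/http"
)

// csrfMiddleware implements the Double Submit Cookie pattern: the
// frontend echoes the csrf_token cookie in an X-CSRF-Token header,
// which a script on another origin cannot do.
func csrfMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet || r.Method == http.MethodHead || r.Method == http.MethodOptions {
			next.ServeHTTP(w, r) // safe methods skip the check
			return
		}
		cookie, err := r.Cookie("csrf_token")
		header := r.Header.Get("X-CSRF-Token")
		if err != nil || header == "" ||
			subtle.ConstantTimeCompare([]byte(cookie.Value), []byte(header)) != 1 {
			http.Error(w, "CSRF token mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}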
Observability
Monitoring & Metrics
- Prometheus - Metrics collection from all services (instrumentation sketch below)
- Grafana - Real-time dashboards and alerts
- Consul - Service health and discovery
- Logs - Structured JSON logging with correlation IDs
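For instance, instrumenting check execution time with a Prometheus histogram might look like the sketch below; the metric name and labels are illustrative:
Check Duration Histogram (sketch)
go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkDuration could back the "Check Execution Time" row in the Key
// Metrics table below; p95 comes from histogram quantiles in Grafana.
var checkDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "blacktide_check_duration_seconds",
	Help:    "Time taken to execute a single check.",
	Buckets: prometheus.DefBuckets,
}, []string{"check_type", "location"})

func runCheck(checkType, location string) {
	start := time.Now()
	// ... execute the check ...
	checkDuration.WithLabelValues(checkType, location).Observe(time.Since(start).Seconds())
}

func main() {
	runCheck("http", "us-east")
	// Prometheus scrapes /metrics from every service instance.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}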
Key Metrics
| Metric | Target | Purpose |
|---|---|---|
| API Response Time (p95) | <200ms | User experience |
| Check Execution Time | <30s | Timely alerts |
| NATS Queue Depth | <1000 | Processing backlog |
| Database Query Time (p95) | <50ms | Data access speed |
Deployment
Infrastructure
- Container Orchestration - Docker + Docker Compose (production)
- CI/CD - GitHub Actions for automated builds and deployments
- Image Registry - GitHub Container Registry (GHCR)
- Reverse Proxy - Nginx with SSL/TLS termination
Deployment Process
Deployment Commands
bash
# Build all services (multi-platform: amd64 + arm64)
make build-and-push
# Deploy via Ansible
ansible-playbook deploy.yml
# Health check all services
make health-check
Performance Optimizations
Caching Strategy
- Monitor configs - Cached in Redis for 5 minutes (cache-aside sketch after this list)
- Latest check results - Cached for real-time dashboard
- Uptime aggregates - Cached for 1 hour
- Status page data - Cached for 30 seconds (public)
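A sketch of the cache-aside pattern behind the first item, using github.com/redis/go-redis/v9; the key scheme and loader function are hypothetical:
Cache-Aside Lookup (sketch)
go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// getMonitorConfig tries Redis first and falls back to Postgres on a
// miss, writing the result back with a 5-minute TTL (cache-aside).
func getMonitorConfig(ctx context.Context, rdb *redis.Client, id string,
	loadFromDB func(string) (string, error)) (string, error) {

	key := "monitor:" + id + ":config"
	if val, err := rdb.Get(ctx, key).Result(); err == nil {
		return val, nil // cache hit
	}
	val, err := loadFromDB(id) // cache miss: hit Postgres
	if err != nil {
		return "", err
	}
	// The TTL bounds staleness: config edits show up within 5 minutes.
	if err := rdb.Set(ctx, key, val, 5*time.Minute).Err(); err != nil {
		return "", err
	}
	return val, nil
}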
Database Optimizations
- Indexes - B-tree on monitor_id, created_at, status
- Partial indexes - Index only failed checks for faster incident queries
- Connection pooling - Max 100 connections per service
- Query optimization - Use EXPLAIN ANALYZE for slow queries
Future Roadmap
Planned Improvements
- Kubernetes - Migrate from Docker Compose to K8s for better orchestration
- Multi-region - Deploy services in multiple AWS regions for global HA
- GraphQL API - Add GraphQL endpoint alongside REST API
- Real-time WebSocket - Live dashboard updates without polling
- Machine Learning - Anomaly detection for unusual patterns
Technical Documentation
For more technical details, see:
- backend/ARCHITECTURE_MIGRATION.md - TimescaleDB → PostgreSQL migration
- backend/docker-compose.yml - Service configuration
- backend/README.md - Backend development guide
- API Reference - Complete API documentation