# AI Service Troubleshooting Guide

## Overview

This guide covers how to diagnose and resolve common AI service problems.

## Problem Categories

### 1. API Request Issues

#### 1.1 401 Unauthorized

**Symptoms** (`message` translates to "Unauthorized"):

```json
{
  "success": false,
  "message": "未授权",
  "error_code": "UNAUTHORIZED"
}
```

**Possible causes**:

- The token has expired
- The token is invalid
- No Authorization header was sent

**Solution**:

```bash
# Log in again to obtain a new token
curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "phone": "13800138000",
    "password": "your_password"
  }'

# Retry with the new token
export TOKEN="new_access_token"
```

#### 1.2 404 Not Found

**Symptoms** (`message` translates to "Resource does not exist"):

```json
{
  "success": false,
  "message": "资源不存在",
  "error_code": "NOT_FOUND"
}
```

**Possible causes**:

- The conversation_id does not exist
- The job_id does not exist
- The screenplay_id does not exist
- The user has no access to the resource

**Solution**:

```bash
# Check whether the resource exists
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT conversation_id, title, created_at
FROM ai_conversations
WHERE conversation_id = 'your_conversation_id';
"

# Check which user owns the resource
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT c.conversation_id, c.user_id, u.nickname
FROM ai_conversations c
JOIN users u ON c.user_id = u.user_id
WHERE c.conversation_id = 'your_conversation_id';
"
```

#### 1.3 400 Bad Request

**Symptoms** (`message` translates to "Parameter validation failed"):

```json
{
  "success": false,
  "message": "参数验证失败",
  "error_code": "VALIDATION_ERROR"
}
```

**Possible causes**:

- A required parameter is missing
- A parameter has the wrong type
- A parameter value is out of range

**Solution**:

```bash
# Check the API documentation
curl -X GET http://localhost:8000/docs

# View the detailed error (2>&1 also captures what the container wrote to stderr)
docker logs jointo-server-app 2>&1 | grep "ValidationError"
```

#### 1.4 402 Payment Required

**Symptoms** (`message` translates to "Insufficient credits"):

```json
{
  "success": false,
  "message": "积分不足",
  "error_code": "INSUFFICIENT_CREDITS"
}
```

**Possible causes**:

- The user's credit balance is too low
- The task costs more credits than the balance holds

**Solution**:

```bash
# Check the user's credits
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT user_id, balance, total_earned, total_spent
FROM user_credits
WHERE user_id = 'your_user_id';
"

# Top up credits (test environments only)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
UPDATE user_credits
SET balance = balance + 10000
WHERE user_id = 'your_user_id';
"
```

### 2. Celery Task Issues

#### 2.1 Task Stuck in pending

**Symptoms**:

- The task was created successfully
- The task status stays `pending`
- No progress for a long time

**Possible causes**:

- The Celery Worker is not running
- The RabbitMQ connection failed
- The task queue is blocked

**Solution**:

```bash
# 1. Check the Celery Worker status
docker ps | grep celery-ai

# 2. Check the Celery Worker logs
docker logs jointo-server-celery-ai -f

# 3. Check the RabbitMQ status
docker exec jointo-server-rabbitmq rabbitmqctl status

# 4. Check the task queue
#    The Redis keys below exist only if Redis is the Celery broker.
docker exec jointo-server-redis redis-cli KEYS 'celery*'
docker exec jointo-server-redis redis-cli LLEN celery
docker exec jointo-server-redis redis-cli LRANGE celery 0 -1
#    If RabbitMQ is the broker (an assumption based on step 3), list its queues instead:
docker exec jointo-server-rabbitmq rabbitmqctl list_queues name messages

# 5. Restart the Celery Worker
docker restart jointo-server-celery-ai

# 6. Purge the task queue (use with caution: queued tasks are lost)
docker exec jointo-server-redis redis-cli DEL celery
```

#### 2.2 Task Failed

**Symptoms**:

- The task status changes to `failed`
- error_message contains the error details

**Possible causes**:

- The AI Provider API call failed
- Network connectivity problems
- Invalid parameters
- Timeout

**Solution**:

```bash
# 1. View the task details
curl -X GET http://localhost:8000/api/v1/ai/jobs/$JOB_ID \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the Celery Worker logs
docker logs jointo-server-celery-ai 2>&1 | grep "ERROR"

# 3. Check the application logs
docker logs jointo-server-app 2>&1 | grep "ai_tasks"

# 4. Check the AI Provider configuration
docker exec jointo-server-app env | grep AIHUBMIX

# 5. Test the AI Provider connection
docker exec jointo-server-app python -c "
from app.services.ai_providers.aihubmix_provider import AIHubMixProvider
provider = AIHubMixProvider()
print(provider.test_connection())
"
```

#### 2.3 Task Timeout

**Symptoms**:

- The task runs longer than expected
- The task status changes to `timeout`

**Possible causes**:

- The AI model responds slowly
- Network latency
- The configured timeout is too short

**Solution**:

```bash
# 1. Check the timeout configuration
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'AI Task Timeout: {settings.AI_TASK_TIMEOUT}')
"

# 2. Raise the timeout (edit .env)
docker exec jointo-server-app bash -c "
echo 'AI_TASK_TIMEOUT=600' >> .env
"

# 3. Restart the application
docker restart jointo-server-app
docker restart jointo-server-celery-ai

# 4. Check how long the task actually ran
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT
  job_id,
  job_type,
  status,
  created_at,
  started_at,
  completed_at,
  EXTRACT(EPOCH FROM (completed_at - started_at)) AS duration_seconds
FROM ai_jobs
WHERE job_id = 'your_job_id';
"
```

### 3. AI Provider Issues

#### 3.1 Invalid API Key

**Symptoms**:

```
AIHubMix API Error: Invalid API Key
```

**Solution**:

```bash
# 1. Check the environment variable
docker exec jointo-server-app env | grep AIHUBMIX_API_KEY

# 2. Update the API Key
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEY=your_new_api_key' >> .env
"

# 3. Restart the application
docker restart jointo-server-app
docker restart jointo-server-celery-ai
```

#### 3.2 Model Unavailable

**Symptoms**:

```
AIHubMix API Error: Model not found or not available
```

**Solution**:

```bash
# 1. List the available models
curl -X GET http://localhost:8000/api/v1/ai/models \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the model configuration
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT model_id, model_name, model_type, provider, is_active
FROM ai_models
WHERE model_name = 'your_model_name';
"

# 3. Sync the latest model list
docker exec jointo-server-app python scripts/sync_models_from_api.py
```

#### 3.3 Quota Exceeded

**Symptoms**:

```
AIHubMix API Error: Rate limit exceeded
```

**Solution**:

```bash
# 1. Check the API quota
# Log in to the AIHubMix console and review quota usage

# 2. Apply rate limiting
# Edit app/services/ai_providers/aihubmix_provider.py
# and add rate limiting logic

# 3. Rotate across multiple API Keys
# Add the keys to .env
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEYS=key1,key2,key3' >> .env
"
```

### 4. Database Issues

#### 4.1 Connection Pool Exhausted

**Symptoms**:

```
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached
```

**Solution**:

```bash
# 1. Check the connection pool configuration
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'Pool Size: {settings.DB_POOL_SIZE}')
print(f'Max Overflow: {settings.DB_MAX_OVERFLOW}')
"

# 2. Increase the pool size
docker exec jointo-server-app bash -c "
echo 'DB_POOL_SIZE=20' >> .env
echo 'DB_MAX_OVERFLOW=40' >> .env
"

# 3. Restart the application
docker restart jointo-server-app

# 4. Look for connections that were never closed
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'jointo'
GROUP BY state;
"
```

#### 4.2 Deadlock

**Symptoms**:

```
sqlalchemy.exc.OperationalError: deadlock detected
```

**Solution**:

```bash
# 1. Check the deadlock log entries
docker logs jointo-server-postgres 2>&1 | grep "deadlock"

# 2. List lock requests that are still waiting
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT
  l.locktype,
  l.relation::regclass,
  l.mode,
  l.granted,
  a.query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;
"

# 3. Terminate sessions that have sat idle in a transaction for over 5 minutes
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND query_start < now() - interval '5 minutes';
"
```

#### 4.3 Slow Queries

**Symptoms**:

- Long API response times
- High database CPU usage

**Solution**:

```bash
# 1. Enable slow query logging (statements slower than 1000 ms)
#    ALTER SYSTEM cannot run inside a multi-statement string, so each
#    statement gets its own -c option.
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo \
  -c "ALTER SYSTEM SET log_min_duration_statement = 1000;" \
  -c "SELECT pg_reload_conf();"

# 2. View the slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"

# 3. Analyze the query plan
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
EXPLAIN ANALYZE
SELECT * FROM ai_jobs WHERE user_id = 'your_user_id';
"

# 4. Add an index
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_created_at
ON ai_jobs(user_id, created_at DESC);
"
```

### 5. Memory Issues

#### 5.1 Memory Leak

**Symptoms**:

- Container memory usage keeps growing
- The container is eventually killed by OOM (Out of Memory)

**Solution**:

```bash
# 1. Monitor memory usage
docker stats jointo-server-app

# 2. List the processes using the most memory
docker exec jointo-server-app ps aux --sort=-%mem | head -10

# 3. Profile with memory_profiler
docker exec jointo-server-app pip install memory_profiler
docker exec jointo-server-app python -m memory_profiler app/tasks/ai_tasks.py

# 4. Limit container memory by editing docker-compose.yml:
#   services:
#     app:
#       mem_limit: 2g
#       mem_reservation: 1g

# 5. Recreate the container so the new limits apply
#    (a plain restart does not re-read docker-compose.yml)
docker-compose up -d app
```

#### 5.2 Celery Worker Memory Leak

**Symptoms**:

- Celery Worker memory keeps growing
- Tasks run more slowly over time

**Solution**:

```bash
# 1. Recycle Worker processes automatically by editing docker-compose.yml:
#   services:
#     celery-ai:
#       command: celery -A app.tasks.celery_app worker --max-tasks-per-child=100

# 2. Recreate the Worker so the new command applies
docker-compose up -d celery-ai

# 3. Monitor Worker memory
docker stats jointo-server-celery-ai
```

## Log Analysis

### Application Logs

```bash
# Follow the logs in real time
docker logs jointo-server-app -f

# Show the last 100 lines
docker logs jointo-server-app --tail 100

# Show logs since a given time
docker logs jointo-server-app --since 2026-02-03T10:00:00

# Filter error logs
docker logs jointo-server-app 2>&1 | grep "ERROR"

# Filter AI-related logs
docker logs jointo-server-app 2>&1 | grep "ai_"
```

### Celery Logs

```bash
# AI Worker logs
docker logs jointo-server-celery-ai -f

# Beat logs (scheduled tasks)
docker logs jointo-server-celery-beat -f

# Filter task execution logs
docker logs jointo-server-celery-ai 2>&1 | grep "Task"
```

### Database Logs

```bash
# PostgreSQL logs
docker logs jointo-server-postgres -f

# Filter errors
docker logs jointo-server-postgres 2>&1 | grep "ERROR"

# Filter slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"
```

## Performance Optimization Suggestions

### 1. Database Optimization

```sql
-- Add indexes (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_status
ON ai_jobs(user_id, status);

CREATE INDEX CONCURRENTLY idx_ai_jobs_created_at
ON ai_jobs(created_at DESC);

CREATE INDEX CONCURRENTLY idx_ai_conversations_user_id
ON ai_conversations(user_id);

-- Periodically purge old data
DELETE FROM ai_jobs
WHERE status IN ('completed', 'failed', 'cancelled')
  AND created_at < NOW() - INTERVAL '30 days';

-- Refresh planner statistics
ANALYZE ai_jobs;
ANALYZE ai_conversations;
ANALYZE ai_messages;
```

### 2. Cache Optimization

```python
# Cache the AI model list in Redis
from app.core.cache import cache

@cache.cached(timeout=3600, key_prefix='ai_models')
async def get_ai_models():
    # ...
    pass
```

### 3. Async Optimization

```python
# Run independent lookups concurrently with asyncio.gather
import asyncio

# load_conversation is an illustrative wrapper; get_conversation,
# get_messages, and get_mentions are the application's own coroutines.
async def load_conversation(conversation_id):
    conversation, messages, mentions = await asyncio.gather(
        get_conversation(conversation_id),
        get_messages(conversation_id),
        get_mentions(conversation_id),
    )
    return conversation, messages, mentions
```

## Monitoring and Alerting

### 1. Define Monitoring Metrics

```bash
# Use Prometheus + Grafana
# Metrics to monitor:
# - API response time
# - Task success rate
# - Average task execution time
# - Database connection count
# - Memory usage
# - CPU usage
```

### 2. Define Alert Rules

```yaml
# Prometheus alert rules
groups:
  - name: ai_service
    rules:
      - alert: HighTaskFailureRate
        expr: rate(ai_task_failures_total[5m]) > 0.1
        annotations:
          summary: "AI task failure rate is too high"

      - alert: SlowAPIResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        annotations:
          summary: "API response time is too long"
```

## Related Documents

- [AI Service Testing Guide](./ai-service-testing-guide.md)
- [AI Service Implementation Summary](../changelogs/2026-02-03-ai-services-implementation-summary.md)
- [Jointo Tech Stack Specification](../../architecture/tech-stack.md)
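
## Appendix: Rate Limiting Sketch for Section 3.3

Section 3.3 suggests adding rate limiting logic and rotating across several API Keys, but gives no code for either. The block below is a minimal sketch of one way to combine the two: a token bucket per key, with keys handed out round-robin. `TokenBucket` and `KeyRotator` are hypothetical names that do not exist in the codebase, the `rate` and `capacity` values are placeholders rather than real AIHubMix limits, and the state lives in one process only, so several Celery Worker processes would each enforce their own budget.

```python
import itertools
import time
from typing import Callable, Iterable, Optional


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class KeyRotator:
    """Hand out API keys round-robin, skipping keys whose bucket is empty."""

    def __init__(self, keys: Iterable[str], rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        cleaned = [k.strip() for k in keys if k.strip()]
        if not cleaned:
            raise ValueError("at least one API key is required")
        self._buckets = {k: TokenBucket(rate, capacity, clock) for k in cleaned}
        self._cycle = itertools.cycle(cleaned)
        self._count = len(cleaned)

    def next_key(self) -> Optional[str]:
        """Return a key that still has budget, or None if all are exhausted."""
        for _ in range(self._count):
            key = next(self._cycle)
            if self._buckets[key].try_acquire():
                return key
        return None
```

A caller could build the rotator from the `AIHUBMIX_API_KEYS` value shown in section 3.3, for example `KeyRotator(os.environ["AIHUBMIX_API_KEYS"].split(","), rate=1.0, capacity=5)`, and delay or requeue the task whenever `next_key()` returns `None`. The class is not thread-safe; a lock around `try_acquire` would be needed if threads share one instance.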