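AI Service Troubleshooting Guide
Overview
This guide covers how to diagnose and resolve common AI service problems.
Problem Categories
1. API Request Problems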
1.1 401 Unauthorized
Symptom (the message means "unauthorized"):
{
  "success": false,
  "message": "未授权",
  "error_code": "UNAUTHORIZED"
}
Possible causes:
- The token has expired
- The token is invalid
- No Authorization header was provided
Solution:
# Log in again to get a new token
curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "phone": "13800138000",
    "password": "your_password"
  }'
# Retry with the new token
export TOKEN="new_access_token"
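Before logging in again, it can help to check locally whether the token has simply expired. The sketch below assumes the access token is a JWT, which this guide does not confirm; `jwt_expiry` and `is_expired` are hypothetical helper names, and the signature is deliberately not verified.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the `exp` claim (Unix seconds) of a JWT, or None if absent.

    Assumption: the token is a three-part JWT. The signature is NOT verified,
    so this is only a diagnostic aid.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated parts")
    payload = parts[1]
    # base64url payloads omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims.get("exp")


def is_expired(token, now=None):
    """True if the token carries an `exp` claim that is in the past."""
    exp = jwt_expiry(token)
    current = time.time() if now is None else now
    return exp is not None and exp <= current
```

If `is_expired` returns False and the API still answers 401, the token is more likely invalid or the Authorization header is missing.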
1.2 404 Not Found
Symptom (the message means "resource does not exist"):
{
  "success": false,
  "message": "资源不存在",
  "error_code": "NOT_FOUND"
}
Possible causes:
- The conversation_id does not exist
- The job_id does not exist
- The screenplay_id does not exist
- The user is not allowed to access the resource
Solution:
# Check whether the resource exists
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT conversation_id, title, created_at
  FROM ai_conversations
  WHERE conversation_id = 'your_conversation_id';
"
# Check which user owns the resource
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT c.conversation_id, c.user_id, u.nickname
  FROM ai_conversations c
  JOIN users u ON c.user_id = u.user_id
  WHERE c.conversation_id = 'your_conversation_id';
"
1.3 400 Bad Request
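Symptom (the message means "parameter validation failed"):
{
  "success": false,
  "message": "参数验证失败",
  "error_code": "VALIDATION_ERROR"
}
Possible causes:
- A required parameter is missing
- A parameter has the wrong type
- A parameter value is out of range
Solution:
# Consult the API documentation
curl -X GET http://localhost:8000/docs
# View the detailed error message (2>&1 so that stderr output is searched too)
docker logs jointo-server-app 2>&1 | grep "ValidationError"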
1.4 402 Payment Required
Symptom (the message means "insufficient credits"):
{
  "success": false,
  "message": "积分不足",
  "error_code": "INSUFFICIENT_CREDITS"
}
Possible causes:
- The user's credit balance is too low
- The task costs more credits than the remaining balance
Solution:
# Check the user's credits
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT user_id, balance, total_earned, total_spent
  FROM user_credits
  WHERE user_id = 'your_user_id';
"
# Top up credits (test environments only)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  UPDATE user_credits
  SET balance = balance + 10000
  WHERE user_id = 'your_user_id';
"
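2. Celery Task Problems
2.1 Tasks stay in the pending state
Symptoms:
- The task is created successfully
- The task status stays at pending
- No progress is made for a long time
Possible causes:
- The Celery worker is not running
- The RabbitMQ connection failed
- The task queue is blocked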
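To tell a stuck job from one that is merely slow, it can help to poll the job status against a deadline. The sketch below is one way to do that; `wait_for_job` is a hypothetical helper, `fetch_status` is a caller-supplied function (for example one that wraps `GET /api/v1/ai/jobs/$JOB_ID`), and the set of terminal statuses is assembled from the status names that appear in this guide.

```python
import time

# Status names taken from this guide; adjust if the service uses others.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "timeout"}


def wait_for_job(fetch_status, timeout=60.0, interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until a terminal status or the deadline.

    Returns the last status seen. A result of "pending" after the deadline
    suggests that no worker ever picked the job up.
    """
    deadline = clock() + timeout
    status = fetch_status()
    while status not in TERMINAL_STATUSES and clock() < deadline:
        sleep(interval)
        status = fetch_status()
    return status
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.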
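Solution:
# 1. Check that the Celery worker container is running
docker ps | grep celery-ai
# 2. Check the Celery worker logs
docker logs jointo-server-celery-ai -f
# 3. Check the RabbitMQ status
docker exec jointo-server-rabbitmq rabbitmqctl status
# 4. Inspect the task queue
# (these commands apply when the queue is stored in Redis; if RabbitMQ is
# the broker, "rabbitmqctl list_queues" shows the queue depth instead)
docker exec -it jointo-server-redis redis-cli
> KEYS celery*
> LLEN celery
> LRANGE celery 0 -1
# 5. Restart the Celery worker
docker restart jointo-server-celery-ai
# 6. Purge the task queue (use with caution: queued tasks are lost)
docker exec -it jointo-server-redis redis-cli
> DEL celery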
2.2 Task failures
Symptoms:
- The task status changes to failed
- error_message contains the error details
Possible causes:
- The AI provider API call failed
- Network connectivity problems
- Invalid parameters
- Timeout
Solution:
# 1. View the task details
curl -X GET http://localhost:8000/api/v1/ai/jobs/$JOB_ID \
  -H "Authorization: Bearer $TOKEN"
# 2. Search the Celery worker logs (2>&1 so that stderr output is searched too)
docker logs jointo-server-celery-ai 2>&1 | grep "ERROR"
# 3. Search the application logs
docker logs jointo-server-app 2>&1 | grep "ai_tasks"
# 4. Check the AI provider configuration
docker exec jointo-server-app env | grep AIHUBMIX
# 5. Test the AI provider connection
docker exec jointo-server-app python -c "
from app.services.ai_providers.aihubmix_provider import AIHubMixProvider
provider = AIHubMixProvider()
print(provider.test_connection())
"
2.3 Task timeouts
Symptoms:
- The task runs longer than expected
- The task status changes to timeout
Possible causes:
- The AI model responds slowly
- Network latency
- The configured timeout is too short
Solution:
# 1. Check the timeout configuration
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'AI Task Timeout: {settings.AI_TASK_TIMEOUT}')
"
# 2. Adjust the timeout (edit .env)
docker exec jointo-server-app bash -c "
  echo 'AI_TASK_TIMEOUT=600' >> .env
"
# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai
# 4. Check how long the task took
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT job_id, job_type, status,
         created_at, started_at, completed_at,
         EXTRACT(EPOCH FROM (completed_at - started_at)) AS duration_seconds
  FROM ai_jobs
  WHERE job_id = 'your_job_id';
"
3. AI Provider Problems
3.1 Invalid API key
Symptom:
AIHubMix API Error: Invalid API Key
Solution:
# 1. Check the environment variable (note that this prints the key)
docker exec jointo-server-app env | grep AIHUBMIX_API_KEY
# 2. Update the API key
docker exec jointo-server-app bash -c "
  echo 'AIHUBMIX_API_KEY=your_new_api_key' >> .env
"
# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai
3.2 Model unavailable
Symptom:
AIHubMix API Error: Model not found or not available
Solution:
# 1. List the available models
curl -X GET http://localhost:8000/api/v1/ai/models \
  -H "Authorization: Bearer $TOKEN"
# 2. Check the model configuration
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT model_id, model_name, model_type, provider, is_active
  FROM ai_models
  WHERE model_name = 'your_model_name';
"
# 3. Sync the latest model list
docker exec jointo-server-app python scripts/sync_models_from_api.py
3.3 Quota exceeded
Symptom:
AIHubMix API Error: Rate limit exceeded
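This error means requests reach the provider faster than it allows for a single key. One way to stay under the limit is to space out calls per key and rotate across several keys. The sketch below illustrates that idea only; `KeyRotator` and `parse_keys` are hypothetical names, the comma-separated format is assumed to match an `AIHUBMIX_API_KEYS` setting, and nothing here reflects how `aihubmix_provider.py` is actually written.

```python
import itertools
import time


class KeyRotator:
    """Hand out API keys round-robin, spacing out reuse of each key.

    Hypothetical helper: `min_interval` is the minimum number of seconds
    between two uses of the same key.
    """

    def __init__(self, keys, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._last_used = {key: None for key in keys}
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep

    def next_key(self):
        key = next(self._cycle)
        last = self._last_used[key]
        if last is not None:
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                # Throttle locally instead of running into the provider limit.
                self._sleep(wait)
        self._last_used[key] = self._clock()
        return key


def parse_keys(raw):
    """Split a comma-separated key list into individual keys."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```

With three keys and `min_interval=1.0`, each key is used at most once per second while the caller can still issue up to three requests per second overall. This sketch is not thread-safe; concurrent workers would need a lock or a shared store.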
Solution:
# 1. Check the API quota
# Log in to the AIHubMix console to review quota usage
# 2. Apply rate limiting
# Edit app/services/ai_providers/aihubmix_provider.py
# and add rate limiting logic
# 3. Rotate across multiple API keys
# Add the keys to .env
docker exec jointo-server-app bash -c "
  echo 'AIHUBMIX_API_KEYS=key1,key2,key3' >> .env
"
4. Database Problems
4.1 Connection pool exhausted
Symptom:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached
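The two numbers in this error are the pool's `pool_size` and `max_overflow`, so a single pool can hold at most their sum in open connections (15 here). Before raising them, it is worth checking that all processes together stay under the server limit. The sketch below does that arithmetic; the function names and the process count are hypothetical, and the defaults of 100 and 3 are PostgreSQL's stock values for `max_connections` and `superuser_reserved_connections`, which a given deployment may have changed.

```python
def max_connections_needed(pool_size, max_overflow, processes):
    """Upper bound on connections opened by `processes` identical pools."""
    return (pool_size + max_overflow) * processes


def fits_server_limit(pool_size, max_overflow, processes,
                      server_max_connections=100, reserved=3):
    """Check the bound against the PostgreSQL connection limit.

    100 and 3 are PostgreSQL defaults; the real values can be read with
    SHOW max_connections and SHOW superuser_reserved_connections.
    """
    budget = server_max_connections - reserved
    return max_connections_needed(pool_size, max_overflow, processes) <= budget
```

For example, a pool of 20 with an overflow of 40 lets each process open up to 60 connections, so two processes could request 120 and exceed a default server limit of 100.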
Solution:
# 1. Check the connection pool configuration
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'Pool Size: {settings.DB_POOL_SIZE}')
print(f'Max Overflow: {settings.DB_MAX_OVERFLOW}')
"
# 2. Increase the connection pool size
docker exec jointo-server-app bash -c "
  echo 'DB_POOL_SIZE=20' >> .env
  echo 'DB_MAX_OVERFLOW=40' >> .env
"
# 3. Restart the application
docker restart jointo-server-app
# 4. Check for connections that were never closed
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT count(*), state
  FROM pg_stat_activity
  WHERE datname = 'jointo'
  GROUP BY state;
"
4.2 Deadlocks
Symptom:
sqlalchemy.exc.OperationalError: deadlock detected
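PostgreSQL resolves a deadlock by aborting one of the transactions involved, so the aborted operation can usually be retried safely once it starts a fresh transaction. The sketch below shows one possible retry wrapper; `retry_on_deadlock` and `is_deadlock` are hypothetical helpers, and matching on the message text is a simplification.

```python
import random
import time


def is_deadlock(exc):
    """Heuristic: PostgreSQL deadlock errors mention 'deadlock detected'."""
    return "deadlock detected" in str(exc).lower()


def retry_on_deadlock(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with jittered exponential backoff on deadlock.

    `operation` should open and commit its own transaction so that each
    retry starts from a clean state. Other errors are re-raised immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_deadlock(exc) or attempt == attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            # Jitter keeps two competing transactions from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
```

Retrying treats the symptom only; acquiring locks in a consistent order across code paths addresses the cause.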
Solution:
# 1. Search the PostgreSQL logs for deadlocks (2>&1 so that stderr output is searched too)
docker logs jointo-server-postgres 2>&1 | grep "deadlock"
# 2. List lock requests that are still waiting
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT
    l.locktype,
    l.relation::regclass,
    l.mode,
    l.granted,
    a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON l.pid = a.pid
  WHERE NOT l.granted;
"
# 3. Terminate sessions that have been idle in a transaction for over 5 minutes
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
    AND query_start < now() - interval '5 minutes';
"
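4.3 Slow queries
Symptoms:
- API responses take a long time
- Database CPU usage is high
Solution:
# 1. Enable the slow query log (statements slower than 1000 ms)
# ALTER SYSTEM cannot run inside a multi-statement -c string,
# so each statement gets its own -c option
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo \
  -c "ALTER SYSTEM SET log_min_duration_statement = 1000;" \
  -c "SELECT pg_reload_conf();"
# 2. View the slow queries (2>&1 so that stderr output is searched too)
docker logs jointo-server-postgres 2>&1 | grep "duration:"
# 3. Analyze the query plan
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  EXPLAIN ANALYZE
  SELECT * FROM ai_jobs WHERE user_id = 'your_user_id';
"
# 4. Add an index
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
  CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_created_at
  ON ai_jobs(user_id, created_at DESC);
"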
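5. Memory Problems
5.1 Memory leaks
Symptoms:
- Container memory usage grows steadily
- The container is eventually killed by OOM (out of memory)
Solution:
# 1. Monitor memory usage
docker stats jointo-server-app
# 2. Show the processes that use the most memory
docker exec jointo-server-app ps aux --sort=-%mem | head -10
# 3. Profile with memory_profiler
docker exec jointo-server-app pip install memory_profiler
docker exec jointo-server-app python -m memory_profiler app/tasks/ai_tasks.py
# 4. Limit container memory
# Edit docker-compose.yml
services:
  app:
    mem_limit: 2g
    mem_reservation: 1g
# 5. Recreate the container (a plain restart does not apply docker-compose.yml changes)
docker-compose up -d app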
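When installing memory_profiler inside the container is not convenient, the standard library's `tracemalloc` can give a first indication of where memory grows. The sketch below is a generic illustration; `top_allocation_growth` is a hypothetical helper, and `workload` stands for whatever function is suspected of leaking.

```python
import tracemalloc


def top_allocation_growth(workload, limit=5):
    """Run workload() and return the source lines whose allocations grew most.

    Each returned item is a tracemalloc.StatisticDiff; printing it shows the
    file, line number, and size difference.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]
```

Running the suspected task body several times inside `workload` makes a leak stand out, because memory that is released between runs does not show up as growth.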
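5.2 Celery worker memory leaks
Symptoms:
- Celery worker memory grows steadily
- Task execution slows down
Solution:
# 1. Configure worker processes to be recycled automatically
# Edit docker-compose.yml
services:
  celery-ai:
    command: celery -A app.tasks.celery_app worker --max-tasks-per-child=100
# 2. Recreate the worker (a plain restart does not apply docker-compose.yml changes)
docker-compose up -d celery-ai
# 3. Monitor worker memory
docker stats jointo-server-celery-ai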
Log Analysis
Viewing application logs
# Follow the logs in real time
docker logs jointo-server-app -f
# Show the last 100 lines
docker logs jointo-server-app --tail 100
# Show logs since a given time
docker logs jointo-server-app --since 2026-02-03T10:00:00
# Filter error logs
docker logs jointo-server-app 2>&1 | grep "ERROR"
# Filter AI-related logs
docker logs jointo-server-app 2>&1 | grep "ai_"
Viewing Celery logs
# AI worker logs
docker logs jointo-server-celery-ai -f
# Beat logs (scheduled tasks)
docker logs jointo-server-celery-beat -f
# Filter task execution logs
docker logs jointo-server-celery-ai 2>&1 | grep "Task"
Viewing database logs
# PostgreSQL logs
docker logs jointo-server-postgres -f
# Filter errors
docker logs jointo-server-postgres 2>&1 | grep "ERROR"
# Filter slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"
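The `duration:` filter returns slow statements in log order. To rank them, the lines can be parsed and sorted. The sketch below assumes the usual PostgreSQL layout of `duration: <number> ms` followed by the statement text; `slowest_statements` is a hypothetical helper name.

```python
import re

# Matches e.g. "LOG:  duration: 1523.456 ms  statement: SELECT ..."
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+(.*)")


def slowest_statements(log_lines, top=5):
    """Return (milliseconds, text) pairs for the slowest logged statements."""
    found = []
    for line in log_lines:
        match = DURATION_RE.search(line)
        if match:
            found.append((float(match.group(1)), match.group(2).strip()))
    return sorted(found, key=lambda pair: pair[0], reverse=True)[:top]
```

The function accepts any iterable of lines, so it can be fed `sys.stdin` when the `docker logs` output is piped into a small script.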
Performance Optimization Recommendations
1. Database optimization
-- Add indexes
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_status
ON ai_jobs(user_id, status);
CREATE INDEX CONCURRENTLY idx_ai_jobs_created_at
ON ai_jobs(created_at DESC);
CREATE INDEX CONCURRENTLY idx_ai_conversations_user_id
ON ai_conversations(user_id);
-- Periodically purge old data
DELETE FROM ai_jobs
WHERE status IN ('completed', 'failed', 'cancelled')
  AND created_at < NOW() - INTERVAL '30 days';
-- Refresh table statistics
ANALYZE ai_jobs;
ANALYZE ai_conversations;
ANALYZE ai_messages;
2. Cache optimization
# Cache the AI model list in Redis
from app.core.cache import cache

@cache.cached(timeout=3600, key_prefix='ai_models')
async def get_ai_models():
    # ...
    pass
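The `cache` object above comes from the project's own `app.core.cache` module, whose behavior this guide does not show. To illustrate the caching idea itself, here is a minimal in-process sketch using only the standard library; `async_ttl_cache` is a hypothetical decorator, it does not use Redis, and it only supports coroutine functions without arguments.

```python
import asyncio
import functools
import time


def async_ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache the result of a zero-argument coroutine function for ttl_seconds."""
    def decorator(func):
        cached = {}  # holds "value" and "expires_at" once populated

        @functools.wraps(func)
        async def wrapper():
            now = clock()
            if cached and now < cached["expires_at"]:
                return cached["value"]
            value = await func()
            cached["value"] = value
            cached["expires_at"] = now + ttl_seconds
            return value

        return wrapper
    return decorator
```

Because the cache lives in process memory, each application process keeps its own copy, and concurrent callers that arrive while the entry is expired may each trigger a refresh. A shared store such as Redis avoids the first limitation.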
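3. Async optimization
# Run independent lookups concurrently with asyncio.gather
import asyncio

async def load_conversation(conversation_id):
    # await is only valid inside a coroutine, hence the wrapper function
    conversation, messages, mentions = await asyncio.gather(
        get_conversation(conversation_id),
        get_messages(conversation_id),
        get_mentions(conversation_id),
    )
    return conversation, messages, mentions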
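Monitoring and Alerting
1. Define monitoring metrics
# Use Prometheus + Grafana
# Metrics to monitor:
# - API response time
# - Task success rate
# - Average task execution time
# - Database connection count
# - Memory usage
# - CPU usage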
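As a rough illustration of what two of these metrics mean, the sketch below computes a task success rate and a 95th-percentile duration from raw samples with the standard library. The function names are hypothetical; in a real setup Prometheus would derive these values from counters and histograms.

```python
import statistics


def success_rate(succeeded, failed):
    """Fraction of finished tasks that succeeded (0.0 when none finished)."""
    total = succeeded + failed
    return succeeded / total if total else 0.0


def p95(durations):
    """95th percentile of task durations; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(durations, n=100)[94]
```

A percentile is usually a better alert input than the average, because a few very slow tasks barely move the mean while still affecting users.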
2. Define alert rules
# Prometheus alert rules
groups:
  - name: ai_service
    rules:
      - alert: HighTaskFailureRate
        expr: rate(ai_task_failures_total[5m]) > 0.1
        annotations:
          summary: "AI task failure rate is too high"
      - alert: SlowAPIResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        annotations:
          summary: "API response time is too long"