
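AI Service Troubleshooting Guide

Overview

This guide explains how to diagnose and resolve common problems with the AI service.

Problem Categories

1. API Request Problems

1.1 401 Unauthorized

Symptoms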

{
  "success": false,
  "message": "未授权",
  "error_code": "UNAUTHORIZED"
}

Possible causes

  • The token has expired
  • The token is invalid
  • No Authorization header was provided

Solution

# Log in again to obtain a new token
curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "phone": "13800138000",
    "password": "your_password"
  }'

# Retry with the new token
export TOKEN="new_access_token"
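The log-in-then-retry flow above can also be automated in client code. The following is a minimal sketch under stated assumptions, not the service's own client: `send` and `login` are hypothetical callables supplied by the caller (one performs a single API request with a given token, the other calls the login endpoint and returns a fresh token).

```python
from typing import Callable, Tuple

# Hypothetical shapes: a response is (status_code, body).
Response = Tuple[int, dict]


def call_with_reauth(
    send: Callable[[str], Response],
    login: Callable[[], str],
    token: str,
) -> Tuple[Response, str]:
    """Call the API; on a 401, log in once and retry with the new token.

    Returns the final response together with the token that produced it,
    so the caller can keep the refreshed token for later requests.
    """
    response = send(token)
    if response[0] != 401:
        return response, token
    # The token was rejected: obtain a new one and retry exactly once.
    new_token = login()
    return send(new_token), new_token
```

Retrying only once keeps a wrong password or a disabled account from turning into a login loop.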

1.2 404 Not Found

Symptoms

{
  "success": false,
  "message": "资源不存在",
  "error_code": "NOT_FOUND"
}

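Possible causes

  • The conversation_id does not exist
  • The job_id does not exist
  • The screenplay_id does not exist
  • The user does not have access to the resource

Solution

# Check whether the resource exists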
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT conversation_id, title, created_at 
FROM ai_conversations 
WHERE conversation_id = 'your_conversation_id';
"

# Check which user owns the resource
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT c.conversation_id, c.user_id, u.nickname
FROM ai_conversations c
JOIN users u ON c.user_id = u.user_id
WHERE c.conversation_id = 'your_conversation_id';
"

1.3 400 Bad Request

Symptoms

{
  "success": false,
  "message": "参数验证失败",
  "error_code": "VALIDATION_ERROR"
}

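Possible causes

  • A required parameter is missing
  • A parameter has the wrong type
  • A parameter value is out of range

Solution

# Check the API documentation
curl -X GET http://localhost:8000/docs

# View the detailed error message (2>&1 also captures the container's stderr)
docker logs jointo-server-app 2>&1 | grep "ValidationError"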

1.4 402 Payment Required

Symptoms

{
  "success": false,
  "message": "积分不足",
  "error_code": "INSUFFICIENT_CREDITS"
}

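Possible causes

  • The user's credit balance is too low
  • The task costs more credits than the remaining balance

Solution

# Check the user's credits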
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT user_id, balance, total_earned, total_spent
FROM user_credits
WHERE user_id = 'your_user_id';
"

# Top up credits (test environments only)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
UPDATE user_credits
SET balance = balance + 10000
WHERE user_id = 'your_user_id';
"

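2. Celery Task Problems

2.1 Task stays in the pending state

Symptoms

  • The task is created successfully
  • The task status stays at pending
  • No progress for a long time

Possible causes

  • The Celery worker is not running
  • The RabbitMQ connection has failed
  • The task queue is blocked

Solution

# 1. Check the Celery worker status
docker ps | grep celery-ai

# 2. Check the Celery worker logs
docker logs jointo-server-celery-ai -f

# 3. Check the RabbitMQ status
docker exec jointo-server-rabbitmq rabbitmqctl status

# 4. Inspect the task queue
# (these Redis keys exist only if Redis is the Celery broker; for a RabbitMQ broker,
#  run: docker exec jointo-server-rabbitmq rabbitmqctl list_queues name messages)
docker exec -it jointo-server-redis redis-cli
> KEYS celery*
> LLEN celery
> LRANGE celery 0 -1

# 5. Restart the Celery worker
docker restart jointo-server-celery-ai

# 6. Purge the task queue (use with caution: queued tasks are discarded)
docker exec -it jointo-server-redis redis-cli
> DEL celery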
2.2 Task fails

Symptoms

  • The task status changes to failed
  • error_message contains the error details

Possible causes

  • The AI provider API call failed
  • Network connectivity problems
  • Invalid parameters
  • Timeout

Solution

# 1. View the task details
curl -X GET http://localhost:8000/api/v1/ai/jobs/$JOB_ID \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the Celery worker logs
docker logs jointo-server-celery-ai 2>&1 | grep "ERROR"

# 3. Check the application logs
docker logs jointo-server-app 2>&1 | grep "ai_tasks"

# 4. Check the AI provider configuration
docker exec jointo-server-app env | grep AIHUBMIX

# 5. Test the AI provider connection
docker exec jointo-server-app python -c "
from app.services.ai_providers.aihubmix_provider import AIHubMixProvider
provider = AIHubMixProvider()
print(provider.test_connection())
"

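2.3 Task times out

Symptoms

  • The task runs longer than expected
  • The task status changes to timeout

Possible causes

  • The AI model responds slowly
  • Network latency
  • The timeout setting is too short

Solution

# 1. Check the timeout setting
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'AI Task Timeout: {settings.AI_TASK_TIMEOUT}')
"

# 2. Raise the timeout (edit .env)
# Note: >> appends, so any existing AI_TASK_TIMEOUT line stays in the file.
docker exec jointo-server-app bash -c "
echo 'AI_TASK_TIMEOUT=600' >> .env
"

# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai

# 4. Check how long the task took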
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT job_id, job_type, status, 
       created_at, started_at, completed_at,
       EXTRACT(EPOCH FROM (completed_at - started_at)) as duration_seconds
FROM ai_jobs
WHERE job_id = 'your_job_id';
"
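Step 2 above appends to `.env` with `>>`, which leaves any earlier `AI_TASK_TIMEOUT` line in place; which of the duplicate lines wins then depends on the loader. As an alternative, here is a small sketch of an in-place update. The helper name `set_env_var` is hypothetical, and it assumes plain `KEY=value` lines without an `export` prefix or quoting.

```python
def set_env_var(env_text: str, key: str, value: str) -> str:
    """Return env_text with `key` set to `value`, replacing any existing line."""
    new_line = f"{key}={value}"
    result = []
    replaced = False
    for line in env_text.splitlines():
        # A line belongs to `key` when the text before the first "=" equals it.
        if line.split("=", 1)[0].strip() == key:
            if not replaced:
                result.append(new_line)
                replaced = True
            # Later duplicates of the same key are dropped.
            continue
        result.append(line)
    if not replaced:
        result.append(new_line)
    return "\n".join(result) + "\n"
```

The function works on text rather than on a file path, so the caller decides how to read and write the file; for example, the updated text could be written to a temporary file and then moved over `.env`.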

3. AI Provider Problems

3.1 Invalid API key

Symptoms

AIHubMix API Error: Invalid API Key

Solution

# 1. Check the environment variable
docker exec jointo-server-app env | grep AIHUBMIX_API_KEY

# 2. Update the API key
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEY=your_new_api_key' >> .env
"

# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai

3.2 Model unavailable

Symptoms

AIHubMix API Error: Model not found or not available

Solution

# 1. List the available models
curl -X GET http://localhost:8000/api/v1/ai/models \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the model configuration
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT model_id, model_name, model_type, provider, is_active
FROM ai_models
WHERE model_name = 'your_model_name';
"

# 3. Sync the latest model list
docker exec jointo-server-app python scripts/sync_models_from_api.py

3.3 Quota exceeded

Symptoms

AIHubMix API Error: Rate limit exceeded

Solution

# 1. Check the API quota
# Log in to the AIHubMix console to review quota usage

# 2. Apply rate limiting
# Edit app/services/ai_providers/aihubmix_provider.py
# and add rate limiting logic

# 3. Rotate across multiple API keys
# Edit .env to add multiple API keys
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEYS=key1,key2,key3' >> .env
"
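Steps 2 and 3 above name two techniques without showing them. The sketch below illustrates one possible shape for each: a token-bucket limiter and round-robin key rotation over a comma-separated `AIHUBMIX_API_KEYS` value. The class and function names are hypothetical, and nothing here reflects how `aihubmix_provider.py` is actually written.

```python
import itertools
import time
from typing import Callable, Iterator, List


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; False means the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def parse_api_keys(raw: str) -> List[str]:
    """Split a comma-separated key list, ignoring blanks and whitespace."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def round_robin(keys: List[str]) -> Iterator[str]:
    """Return an iterator that cycles through the keys forever."""
    if not keys:
        raise ValueError("at least one API key is required")
    return itertools.cycle(keys)
```

A provider could call `try_acquire()` before each upstream request and take `next(keys)` for the Authorization header. Both objects hold their state in one process, so several workers would each enforce their own limit.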

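4. Database Problems

4.1 Connection pool exhausted

Symptoms

sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached

Solution

# 1. Check the connection pool settings
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'Pool Size: {settings.DB_POOL_SIZE}')
print(f'Max Overflow: {settings.DB_MAX_OVERFLOW}')
"

# 2. Increase the connection pool size
# Keep (pool size + max overflow) x number of processes below PostgreSQL's max_connections.
docker exec jointo-server-app bash -c "
echo 'DB_POOL_SIZE=20' >> .env
echo 'DB_MAX_OVERFLOW=40' >> .env
"

# 3. Restart the application
docker restart jointo-server-app

# 4. Check for connections that were never closed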
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'jointo'
GROUP BY state;
"
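A larger pool only postpones exhaustion if connections are never handed back. The toy model below is not SQLAlchemy; it is a standard-library stand-in that shows why a context manager, which releases the connection even when the body raises, prevents that kind of leak. `ToyPool` and its methods are hypothetical names.

```python
import queue
from contextlib import contextmanager
from typing import Iterator


class ToyPool:
    """A toy connection pool with a hard limit (a stand-in, not SQLAlchemy)."""

    def __init__(self, limit: int) -> None:
        self._free: "queue.Queue[int]" = queue.Queue()
        for conn_id in range(limit):
            self._free.put(conn_id)

    def acquire(self, timeout: float) -> int:
        # Raises queue.Empty when every connection is checked out, the toy
        # counterpart of the QueuePool TimeoutError shown above.
        return self._free.get(timeout=timeout)

    def release(self, conn_id: int) -> None:
        self._free.put(conn_id)

    @contextmanager
    def connection(self, timeout: float = 1.0) -> Iterator[int]:
        conn_id = self.acquire(timeout)
        try:
            yield conn_id
        finally:
            # Runs even if the body raises, so the connection cannot leak.
            self.release(conn_id)
```

In application code the same idea usually means opening sessions in a `with` block or through a request-scoped dependency, so that every code path closes them.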

4.2 Deadlocks

Symptoms

sqlalchemy.exc.OperationalError: deadlock detected

Solution

# 1. Check the deadlock log entries
docker logs jointo-server-postgres 2>&1 | grep "deadlock"

# 2. View lock requests that are still waiting
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT 
    l.locktype,
    l.relation::regclass,
    l.mode,
    l.granted,
    a.query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;
"

# 3. Terminate sessions that have been idle in a transaction for more than 5 minutes
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND query_start < now() - interval '5 minutes';
"
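When PostgreSQL detects a deadlock it aborts one of the transactions involved, and the aborted transaction can usually be run again. The following is a hedged sketch of such a retry wrapper; `DeadlockDetected` is a hypothetical marker exception that the caller would raise after recognising the driver's deadlock error, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadlockDetected(Exception):
    """Hypothetical marker; map the driver's deadlock error onto it."""


def run_with_deadlock_retry(
    txn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run txn(); if it raises DeadlockDetected, back off and run it again."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts:
                raise
            # Jittered exponential backoff makes it less likely that the
            # competing transactions collide again on the retry.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
    raise ValueError("attempts must be at least 1")
```

`txn` has to contain the whole transaction, from begin to commit, because the work done before the deadlock has been rolled back.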

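4.3 Slow queries

Symptoms

  • API response times are long
  • Database CPU usage is high

Solution

# 1. Enable slow query logging for statements slower than 1000 ms
# (ALTER SYSTEM cannot run inside a multi-statement -c string, so use one -c per statement)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo \
  -c "ALTER SYSTEM SET log_min_duration_statement = 1000;" \
  -c "SELECT pg_reload_conf();"

# 2. View the slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"

# 3. Analyze the query plan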
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
EXPLAIN ANALYZE
SELECT * FROM ai_jobs WHERE user_id = 'your_user_id';
"

# 4. Add an index
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_created_at 
ON ai_jobs(user_id, created_at DESC);
"
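Once slow query logging is on, the `duration:` lines can be ranked rather than read one by one. This sketch assumes the usual PostgreSQL message shape `duration: <ms> ms  statement: <sql>` (or `execute <name>: <sql>` for prepared statements); the text before that depends on `log_line_prefix` and is ignored. The function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

DURATION_RE = re.compile(
    r"duration: (\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]+): (.*)"
)


def slowest_statements(lines: Iterable[str], top: int = 5) -> List[Tuple[float, str]]:
    """Return the `top` slowest (duration_ms, statement) pairs, slowest first."""
    found = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            found.append((float(match.group(1)), match.group(2).strip()))
    return sorted(found, key=lambda pair: pair[0], reverse=True)[:top]
```

The input can be any iterable of log lines, for example the saved output of the `docker logs` command from step 2.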

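5. Memory Problems

5.1 Memory leak

Symptoms

  • Container memory usage keeps growing
  • The container is eventually killed by OOM (Out of Memory)

Solution

# 1. Monitor memory usage
docker stats jointo-server-app

# 2. List the processes using the most memory
docker exec jointo-server-app ps aux --sort=-%mem | head -10

# 3. Profile with memory_profiler
docker exec jointo-server-app pip install memory_profiler
docker exec jointo-server-app python -m memory_profiler app/tasks/ai_tasks.py

# 4. Limit container memory
# Edit docker-compose.yml
services:
  app:
    mem_limit: 2g
    mem_reservation: 1g

# 5. Recreate the container so the new limits apply
# (restart does not re-read docker-compose.yml)
docker-compose up -d app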
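If installing memory_profiler inside the container is not an option, the standard-library `tracemalloc` module offers a different way to look for growth: compare two snapshots and list the source lines that allocated the most in between. This is a sketch of that alternative technique, not of memory_profiler; `top_allocation_growth` is a hypothetical helper name.

```python
import tracemalloc
from typing import Callable, List


def top_allocation_growth(work: Callable[[], object], limit: int = 5) -> List[str]:
    """Run work() and report the source lines whose allocations grew the most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        keep_alive = work()  # hold the result so its memory is still allocated
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() returns entries ordered by the size of the change.
    stats = after.compare_to(before, "lineno")
    del keep_alive
    return [str(stat) for stat in stats[:limit]]
```

For a suspected leak in a task, `work` could run the task body a few hundred times; lines that keep appearing at the top across runs are the ones to inspect.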

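5.2 Celery worker memory leak

Symptoms

  • Celery worker memory keeps growing
  • Task execution slows down

Solution

# 1. Configure the worker to recycle its child processes automatically
# Edit docker-compose.yml
services:
  celery-ai:
    command: celery -A app.tasks.celery_app worker --max-tasks-per-child=100

# 2. Recreate the worker so the new command applies
docker-compose up -d celery-ai

# 3. Monitor worker memory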
docker stats jointo-server-celery-ai

Log Analysis

Viewing application logs

# Follow the logs in real time
docker logs jointo-server-app -f

# Show the last 100 lines
docker logs jointo-server-app --tail 100

# Show logs since a given time
docker logs jointo-server-app --since 2026-02-03T10:00:00

# Filter error logs
docker logs jointo-server-app 2>&1 | grep "ERROR"

# Filter AI-related logs
docker logs jointo-server-app 2>&1 | grep "ai_"

Viewing Celery logs

# AI worker logs
docker logs jointo-server-celery-ai -f

# Beat logs (scheduled tasks)
docker logs jointo-server-celery-beat -f

# Filter task execution logs
docker logs jointo-server-celery-ai 2>&1 | grep "Task"

Viewing database logs

# PostgreSQL logs
docker logs jointo-server-postgres -f

# Filter errors
docker logs jointo-server-postgres 2>&1 | grep "ERROR"

# Filter slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"

Performance Optimization Recommendations

1. Database optimization

-- Add indexes
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_status 
ON ai_jobs(user_id, status);

CREATE INDEX CONCURRENTLY idx_ai_jobs_created_at 
ON ai_jobs(created_at DESC);

CREATE INDEX CONCURRENTLY idx_ai_conversations_user_id 
ON ai_conversations(user_id);

-- Periodically clean up old data
DELETE FROM ai_jobs 
WHERE status IN ('completed', 'failed', 'cancelled')
AND created_at < NOW() - INTERVAL '30 days';

-- Refresh table statistics
ANALYZE ai_jobs;
ANALYZE ai_conversations;
ANALYZE ai_messages;

2. Cache optimization

# Cache the AI model list in Redis
from app.core.cache import cache

@cache.cached(timeout=3600, key_prefix='ai_models')
async def get_ai_models():
    # ...
    pass
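The decorator above comes from the project's own `app.core.cache` module, whose implementation is not shown here. To illustrate the idea of time-based caching for an async function, here is a self-contained in-process sketch. It is a stand-in, not the Redis-backed cache: `ttl_cache` is a hypothetical name and the model list is placeholder data.

```python
import asyncio
import functools
import time
from typing import Any, Awaitable, Callable, Dict, Tuple


def ttl_cache(timeout: float, clock: Callable[[], float] = time.monotonic):
    """Cache an async function's result per argument tuple for `timeout` seconds."""

    def decorator(func: Callable[..., Awaitable[Any]]):
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}

        @functools.wraps(func)
        async def wrapper(*args: Any) -> Any:
            hit = store.get(args)
            if hit is not None and clock() - hit[0] < timeout:
                return hit[1]  # still fresh: skip the expensive call
            value = await func(*args)
            store[args] = (clock(), value)
            return value

        return wrapper

    return decorator


call_count = {"n": 0}


@ttl_cache(timeout=3600)
async def get_ai_models() -> list:
    # Placeholder body: the real function would query the database or provider.
    call_count["n"] += 1
    return ["model-a", "model-b"]
```

Awaiting `get_ai_models()` twice within the hour runs the body once. Unlike Redis, this cache is private to each process and is lost on restart, and two concurrent first calls may both miss.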

3. Async optimization

# Run independent calls concurrently with asyncio.gather
import asyncio

results = await asyncio.gather(
    get_conversation(conversation_id),
    get_messages(conversation_id),
    get_mentions(conversation_id)
)
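The fragment above uses `await` at top level, so it only runs inside a coroutine. A complete version might look like the following sketch, where the three data-access functions are stubs that simulate I/O latency; their real signatures and return types are assumptions.

```python
import asyncio


# Stub coroutines standing in for the real data-access functions; each one
# simulates 0.1 s of I/O latency.
async def get_conversation(conversation_id: str) -> dict:
    await asyncio.sleep(0.1)
    return {"conversation_id": conversation_id}


async def get_messages(conversation_id: str) -> list:
    await asyncio.sleep(0.1)
    return []


async def get_mentions(conversation_id: str) -> list:
    await asyncio.sleep(0.1)
    return []


async def load_conversation_view(conversation_id: str) -> tuple:
    # gather() runs the three awaitables concurrently and returns their
    # results in the same order as the arguments.
    conversation, messages, mentions = await asyncio.gather(
        get_conversation(conversation_id),
        get_messages(conversation_id),
        get_mentions(conversation_id),
    )
    return conversation, messages, mentions
```

Because the three waits overlap, the total time is close to the slowest call rather than the sum of all three. By default, if one call raises, `gather` propagates the first exception to the awaiting caller.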

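Monitoring and Alerting

1. Define monitoring metrics

# Use Prometheus + Grafana
# Metrics to monitor:
# - API response time
# - Task success rate
# - Average task execution time
# - Database connection count
# - Memory usage
# - CPU usage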
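Two of the listed metrics, task success rate and average execution time, can be derived from job records before any exporter is wired up. The sketch below assumes each record carries `status` and `duration_seconds`, mirroring the `ai_jobs` query in section 2.3; the function name and the dictionary shape are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, Iterable, Optional


def summarize_jobs(jobs: Iterable[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Compute success rate and mean duration over finished jobs."""
    # Only jobs that reached a final state count toward the success rate.
    finished = [j for j in jobs if j["status"] in ("completed", "failed")]
    if not finished:
        return {"success_rate": None, "avg_duration_seconds": None}
    completed = [j for j in finished if j["status"] == "completed"]
    durations = [float(j["duration_seconds"]) for j in finished
                 if j["duration_seconds"] is not None]
    return {
        "success_rate": len(completed) / len(finished),
        "avg_duration_seconds": mean(durations) if durations else None,
    }
```

Pending and running jobs are left out on purpose, since counting them would lower the success rate without any task having failed.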

2. Define alert rules

# Prometheus alert rules
groups:
  - name: ai_service
    rules:
      - alert: HighTaskFailureRate
        expr: rate(ai_task_failures_total[5m]) > 0.1
        annotations:
          summary: "AI task failure rate is too high"
      
      - alert: SlowAPIResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        annotations:
          summary: "API response time is too long"

Related Documentation