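AI Service Troubleshooting Guide

Overview

This guide covers how to diagnose and resolve common problems with the AI service.

Problem Categories

1. API Request Problems

1.1 401 Unauthorized

Symptom (the message translates to "Unauthorized"):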

{
  "success": false,
  "message": "未授权",
  "error_code": "UNAUTHORIZED"
}

Possible causes:

The token has expired
The token is invalid
No Authorization header was provided

Solution:

# Log in again to obtain a new token
curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "phone": "13800138000",
    "password": "your_password"
  }'

# Retry with the new token
export TOKEN="new_access_token"
curl -X GET http://localhost:8000/api/v1/ai/models \
  -H "Authorization: Bearer $TOKEN"
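
If the access token is a JWT (an assumption; this guide does not state the token format), its expiry can be read locally before logging in again. The sketch below decodes the payload without verifying the signature, so it is suitable for diagnosis only. The names `jwt_expiry` and `is_expired` are illustrative.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[float]:
    """Return the `exp` claim (Unix seconds) of a JWT, or None if absent.

    The signature is NOT verified; use this for diagnosis only.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload_b64 = parts[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token carries an `exp` claim that is already in the past."""
    exp = jwt_expiry(token)
    if exp is None:
        return False  # no expiry claim, so expiry is not the cause
    return (time.time() if now is None else now) >= exp
```

If `is_expired` returns False and the request still fails with 401, the remaining causes are an invalid token or a missing Authorization header.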

1.2 404 Not Found

Symptom (the message translates to "Resource not found"):

{
  "success": false,
  "message": "资源不存在",
  "error_code": "NOT_FOUND"
}

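Possible causes:

The conversation_id does not exist
The job_id does not exist
The screenplay_id does not exist
The user does not have permission to access the resource

Solution:

# Check whether the resource exists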
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT conversation_id, title, created_at 
FROM ai_conversations 
WHERE conversation_id = 'your_conversation_id';
"

# Check user permissions (shows which user owns the conversation)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT c.conversation_id, c.user_id, u.nickname
FROM ai_conversations c
JOIN users u ON c.user_id = u.user_id
WHERE c.conversation_id = 'your_conversation_id';
"

1.3 400 Bad Request

Symptom (the message translates to "Parameter validation failed"):

{
  "success": false,
  "message": "参数验证失败",
  "error_code": "VALIDATION_ERROR"
}

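Possible causes:

A required parameter is missing
A parameter has the wrong type
A parameter value is out of range

Solution:

# Check the API documentation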
curl -X GET http://localhost:8000/docs

# View the detailed error message (2>&1 also captures the container's stderr)
docker logs jointo-server-app 2>&1 | grep "ValidationError"

1.4 402 Payment Required

Symptom (the message translates to "Insufficient credits"):

{
  "success": false,
  "message": "积分不足",
  "error_code": "INSUFFICIENT_CREDITS"
}

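Possible causes:

The user's credit balance is too low
The task costs more credits than the remaining balance

Solution:

# Check the user's credits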
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT user_id, balance, total_earned, total_spent
FROM user_credits
WHERE user_id = 'your_user_id';
"

# Top up credits (test environments only)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
UPDATE user_credits
SET balance = balance + 10000
WHERE user_id = 'your_user_id';
"

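2. Celery Task Problems

2.1 Task Stuck in pending

Symptoms:

The task is created successfully
The task status stays pending
No progress is made for a long time

Possible causes:

The Celery worker is not running
The RabbitMQ connection failed
The task queue is blocked

Solution:

# 1. Check the Celery worker status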
docker ps | grep celery-ai

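# 2. Check the Celery worker logs
docker logs jointo-server-celery-ai -f

# 3. Check RabbitMQ status
docker exec jointo-server-rabbitmq rabbitmqctl status

# 4. Inspect the task queue (these keys exist only if Redis is the Celery broker)
docker exec jointo-server-redis redis-cli KEYS 'celery*'
docker exec jointo-server-redis redis-cli LLEN celery
docker exec jointo-server-redis redis-cli LRANGE celery 0 -1
# If RabbitMQ is the broker, list its queues instead
docker exec jointo-server-rabbitmq rabbitmqctl list_queues name messages consumers

# 5. Restart the Celery worker
docker restart jointo-server-celery-ai

# 6. Purge the task queue (use with caution: all queued tasks are lost)
docker exec jointo-server-redis redis-cli DEL celery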
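
A stuck task can also be detected from the client side. The following is a minimal sketch of a polling helper that reports a job as stuck when it stays in pending beyond a limit. `wait_for_job`, the `"stuck"` return value, and the 60-second limit are hypothetical; `fetch_status` would wrap a request to the job-detail endpoint and return its status string.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], str],
                 pending_limit: float = 60.0,
                 interval: float = 2.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job leaves `pending`; return "stuck" if it never does.

    `clock` and `sleep` are injectable so the helper can be tested
    without real waiting.
    """
    start = clock()
    while True:
        status = fetch_status()
        if status != "pending":
            return status
        if clock() - start >= pending_limit:
            return "stuck"  # time to check the worker, broker, and queue
        sleep(interval)
```

A `"stuck"` result is the cue to run the worker, broker, and queue checks listed in this section.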

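2.2 Task Failed

Symptoms:

The task status changes to failed
error_message contains the error details

Possible causes:

The AI provider API call failed
Network connectivity problems
Invalid parameters
Timeout

Solution:

# 1. View the task details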
curl -X GET http://localhost:8000/api/v1/ai/jobs/$JOB_ID \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the Celery worker logs
docker logs jointo-server-celery-ai 2>&1 | grep "ERROR"

# 3. Check the application logs
docker logs jointo-server-app 2>&1 | grep "ai_tasks"

# 4. Check the AI provider configuration
docker exec jointo-server-app env | grep AIHUBMIX

# 5. Test the AI provider connection
docker exec jointo-server-app python -c "
from app.services.ai_providers.aihubmix_provider import AIHubMixProvider
provider = AIHubMixProvider()
print(provider.test_connection())
"

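When failures come from transient network problems or provider timeouts, retrying with exponential backoff is a common mitigation. The sketch below is a generic helper, not the project's actual retry logic; `call_with_backoff` and its defaults are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      retries: int = 3,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying transient errors with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many workers from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
            attempt += 1
```

Errors outside `retry_on`, such as invalid parameters, are raised immediately because retrying them cannot succeed.
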
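2.3 Task Timeout

Symptoms:

The task runs longer than expected
The task status changes to timeout

Possible causes:

The AI model responds slowly
Network latency
The timeout setting is too short

Solution:

# 1. Check the timeout configuration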
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'AI Task Timeout: {settings.AI_TASK_TIMEOUT}')
"

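# 2. Raise the timeout (appends to .env; check for an older AI_TASK_TIMEOUT line that could conflict)
docker exec jointo-server-app bash -c "
echo 'AI_TASK_TIMEOUT=600' >> .env
"

# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai

# 4. Check the task execution time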
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT job_id, job_type, status, 
       created_at, started_at, completed_at,
       EXTRACT(EPOCH FROM (completed_at - started_at)) as duration_seconds
FROM ai_jobs
WHERE job_id = 'your_job_id';
"

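Appending to .env with `>>` leaves any earlier entry for the same key in place, and which one wins depends on the loader. A minimal sketch of an update-or-append helper follows; `set_env_var` is a hypothetical name, and the sketch handles only plain `KEY=value` lines (no `export` prefixes or quoting).

```python
from pathlib import Path


def set_env_var(path: str, key: str, value: str) -> None:
    """Set KEY=value in a dotenv-style file, replacing any existing entry."""
    env_file = Path(path)
    lines = env_file.read_text().splitlines() if env_file.exists() else []
    prefix = f"{key}="
    new_line = f"{key}={value}"
    replaced = False
    result = []
    for line in lines:
        if line.strip().startswith(prefix):
            if not replaced:
                result.append(new_line)  # update the first occurrence in place
                replaced = True
            # Later duplicates of the same key are dropped.
        else:
            result.append(line)
    if not replaced:
        result.append(new_line)
    env_file.write_text("\n".join(result) + "\n")
```

For example, `set_env_var(".env", "AI_TASK_TIMEOUT", "600")` leaves exactly one AI_TASK_TIMEOUT line in the file.
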
3. AI Provider Problems

3.1 Invalid API Key

Symptom:

AIHubMix API Error: Invalid API Key

Solution:

# 1. Check the environment variable
docker exec jointo-server-app env | grep AIHUBMIX_API_KEY

# 2. Update the API key
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEY=your_new_api_key' >> .env
"

# 3. Restart the application and the worker
docker restart jointo-server-app
docker restart jointo-server-celery-ai

3.2 Model Unavailable

Symptom:

AIHubMix API Error: Model not found or not available

Solution:

# 1. List the available models
curl -X GET http://localhost:8000/api/v1/ai/models \
  -H "Authorization: Bearer $TOKEN"

# 2. Check the model configuration
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT model_id, model_name, model_type, provider, is_active
FROM ai_models
WHERE model_name = 'your_model_name';
"

# 3. Sync the latest model list
docker exec jointo-server-app python scripts/sync_models_from_api.py

3.3 Quota Exceeded

Symptom:

AIHubMix API Error: Rate limit exceeded

Solution:

# 1. Check the API quota
# Log in to the AIHubMix console to review quota usage

# 2. Apply rate limiting
# Edit app/services/ai_providers/aihubmix_provider.py
# and add rate-limiting logic

# 3. Rotate across multiple API keys
# Add the keys to .env
docker exec jointo-server-app bash -c "
echo 'AIHUBMIX_API_KEYS=key1,key2,key3' >> .env
"

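Steps 2 and 3 can be sketched as follows. This is one possible shape, not the code in aihubmix_provider.py: a token bucket caps the request rate, and a round-robin rotator hands out keys parsed from the comma-separated AIHUBMIX_API_KEYS value. The class and function names are hypothetical.

```python
import itertools
import threading
import time
from typing import Callable, List, Sequence


class TokenBucket:
    """Allow about `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller must wait."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


class KeyRotator:
    """Hand out API keys in round-robin order."""

    def __init__(self, keys: Sequence[str]):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        with self._lock:
            return next(self._cycle)


def parse_keys(raw: str) -> List[str]:
    """Split a comma-separated key list, ignoring blanks and whitespace."""
    return [k.strip() for k in raw.split(",") if k.strip()]
```

A provider would call `try_acquire()` before each request and either wait or requeue the task when it returns False. Whether the application actually reads AIHUBMIX_API_KEYS depends on its settings code.
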
4. Database Problems

4.1 Connection Pool Exhausted

Symptom:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached

Solution:

# 1. Check the connection pool configuration
docker exec jointo-server-app python -c "
from app.core.config import settings
print(f'Pool Size: {settings.DB_POOL_SIZE}')
print(f'Max Overflow: {settings.DB_MAX_OVERFLOW}')
"

# 2. Increase the connection pool size
docker exec jointo-server-app bash -c "
echo 'DB_POOL_SIZE=20' >> .env
echo 'DB_MAX_OVERFLOW=40' >> .env
"

# 3. Restart the application
docker restart jointo-server-app

# 4. Check for connections that were never closed
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'jointo'
GROUP BY state;
"

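A larger pool only postpones exhaustion if code paths fail to return connections. The toy pool below is a stand-in for illustration, not SQLAlchemy's QueuePool; it shows why acquiring through a context manager keeps a slot from leaking when a handler raises. `TinyPool` is a hypothetical name.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Toy bounded pool standing in for a real database connection pool."""

    def __init__(self, size: int):
        self._free = queue.Queue(maxsize=size)
        for n in range(size):
            self._free.put(f"conn-{n}")

    @contextmanager
    def connection(self, timeout: float = 0.1):
        try:
            conn = self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("pool exhausted") from None
        try:
            yield conn
        finally:
            # Runs even if the caller raises, so the slot is never leaked.
            self._free.put(conn)

    def available(self) -> int:
        return self._free.qsize()
```

The same principle applies to real sessions: open them in a `with` block or a framework-managed dependency so that every exit path releases the connection.
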
4.2 Deadlocks

Symptom:

sqlalchemy.exc.OperationalError: deadlock detected

Solution:

# 1. Search the database logs for deadlocks
docker logs jointo-server-postgres 2>&1 | grep "deadlock"

# 2. View lock requests that are currently waiting
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT 
    l.locktype,
    l.relation::regclass,
    l.mode,
    l.granted,
    a.query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted;
"

# 3. Terminate blocking sessions (idle in a transaction for more than 5 minutes)
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND query_start < now() - interval '5 minutes';
"

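Terminating sessions clears a deadlock but does not prevent the next one. A common prevention technique is to acquire locks in a consistent order, which in SQL means updating rows in sorted key order. The sketch below uses thread locks as a stand-in for row locks; `Account` and `transfer` are hypothetical and not part of the application.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    """Move credits between accounts, always locking the lower id first."""
    if src is dst:
        return  # nothing to move, and locking the same lock twice would hang
    # Sorting gives every caller the same acquisition order, so two opposite
    # transfers can never each hold one lock while waiting for the other.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

Without the sort, one thread running `transfer(a, b, 1)` and another running `transfer(b, a, 1)` could block each other indefinitely.
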
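4.3 Slow Queries

Symptoms:

API response times are long
Database CPU usage is high

Solution:

# 1. Enable the slow query log (statements slower than 1000 ms)
# ALTER SYSTEM cannot run inside a multi-statement -c string, so each statement gets its own -c
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo \
  -c "ALTER SYSTEM SET log_min_duration_statement = 1000;" \
  -c "SELECT pg_reload_conf();"

# 2. View slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"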

# 3. Analyze the query plan
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
EXPLAIN ANALYZE
SELECT * FROM ai_jobs WHERE user_id = 'your_user_id';
"

# 4. Add an index
docker exec -it jointo-server-postgres psql -U jointoAI -d jointo -c "
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_created_at 
ON ai_jobs(user_id, created_at DESC);
"

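Once slow-query logging is on, the grep output can be ranked to find the worst offenders. The sketch below assumes PostgreSQL's default `duration: <ms> ms  statement: <sql>` log shape (or `execute <name>:` for prepared statements); `slowest_queries` is an illustrative name, and multi-line statements are reduced to their first line.

```python
import re
from typing import Iterable, List, Tuple

DURATION_RE = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+(?:statement|execute[^:]*): (?P<sql>.*)"
)


def slowest_queries(lines: Iterable[str], top: int = 5) -> List[Tuple[float, str]]:
    """Return the `top` slowest (duration_ms, sql) pairs found in log lines."""
    found = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            found.append((float(match["ms"]), match["sql"].strip()))
    # Tuples sort by duration first, so reverse order puts the slowest on top.
    return sorted(found, reverse=True)[:top]
```

To use it, save the `docker logs jointo-server-postgres 2>&1` output to a file and pass the open file object to `slowest_queries`.
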
5. Memory Problems

5.1 Memory Leak

Symptoms:

Container memory usage keeps growing
The container is eventually killed by OOM (out of memory)

Solution:

# 1. Monitor memory usage
docker stats jointo-server-app

# 2. List the processes using the most memory
docker exec jointo-server-app ps aux --sort=-%mem | head -10

# 3. Profile with memory_profiler
docker exec jointo-server-app pip install memory_profiler
docker exec jointo-server-app python -m memory_profiler app/tasks/ai_tasks.py

# 4. Limit container memory
# Edit docker-compose.yml
services:
  app:
    mem_limit: 2g
    mem_reservation: 1g

# 5. Recreate the container (a plain restart does not apply docker-compose.yml changes)
docker-compose up -d app

5.2 Celery Worker Memory Leak

Symptoms:

Celery worker memory keeps growing
Task execution slows down

Solution:

# 1. Configure automatic worker recycling
# Edit docker-compose.yml
services:
  celery-ai:
    command: celery -A app.tasks.celery_app worker --max-tasks-per-child=100

# 2. Recreate the worker so the new command takes effect
docker-compose up -d celery-ai

# 3. Monitor worker memory
docker stats jointo-server-celery-ai

Log Analysis

Viewing Application Logs

# Follow the logs in real time
docker logs jointo-server-app -f

# Show the last 100 lines
docker logs jointo-server-app --tail 100

# Show logs since a specific time
docker logs jointo-server-app --since 2026-02-03T10:00:00

# Filter error logs
docker logs jointo-server-app 2>&1 | grep "ERROR"

# Filter AI-related logs
docker logs jointo-server-app 2>&1 | grep "ai_"

Viewing Celery Logs

# AI worker logs
docker logs jointo-server-celery-ai -f

# Beat logs (scheduled tasks)
docker logs jointo-server-celery-beat -f

# Filter task execution logs
docker logs jointo-server-celery-ai 2>&1 | grep "Task"

Viewing Database Logs

# PostgreSQL logs
docker logs jointo-server-postgres -f

# Filter errors
docker logs jointo-server-postgres 2>&1 | grep "ERROR"

# Filter slow queries
docker logs jointo-server-postgres 2>&1 | grep "duration:"

Performance Optimization Recommendations

1. Database Optimization

-- Add indexes
CREATE INDEX CONCURRENTLY idx_ai_jobs_user_id_status 
ON ai_jobs(user_id, status);

CREATE INDEX CONCURRENTLY idx_ai_jobs_created_at 
ON ai_jobs(created_at DESC);

CREATE INDEX CONCURRENTLY idx_ai_conversations_user_id 
ON ai_conversations(user_id);

-- Periodically purge old data
DELETE FROM ai_jobs 
WHERE status IN ('completed', 'failed', 'cancelled')
AND created_at < NOW() - INTERVAL '30 days';

-- Refresh planner statistics
ANALYZE ai_jobs;
ANALYZE ai_conversations;
ANALYZE ai_messages;
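
On a large table, a single DELETE over 30 days of history can hold locks for a long time. A common alternative is to delete in small batches, committing after each one. The sketch below demonstrates the pattern against an in-memory SQLite database purely so it runs anywhere; the `IN (SELECT ... LIMIT n)` subquery form also works in PostgreSQL, where the placeholder syntax depends on the driver. `purge_old_jobs` is a hypothetical name.

```python
import sqlite3

BATCH_DELETE = """
DELETE FROM ai_jobs
WHERE job_id IN (
    SELECT job_id FROM ai_jobs
    WHERE status IN ('completed', 'failed', 'cancelled')
      AND created_at < ?
    LIMIT ?
)
"""


def purge_old_jobs(conn: sqlite3.Connection, cutoff: str, batch_size: int = 1000) -> int:
    """Delete finished jobs older than `cutoff` in batches; return the total deleted."""
    total = 0
    while True:
        cur = conn.execute(BATCH_DELETE, (cutoff, batch_size))
        conn.commit()  # commit per batch so locks are held only briefly
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total  # a short batch means nothing older is left
```

Running ANALYZE afterwards, as shown above, keeps the planner statistics in line with the smaller table.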

2. Cache Optimization

# Cache the AI model list in Redis
from app.core.cache import cache

@cache.cached(timeout=3600, key_prefix='ai_models')
async def get_ai_models():
    # ...
    pass
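
Whether `cache.cached` handles coroutine functions depends on how app.core.cache is implemented, which this guide does not show. As a point of comparison, here is a minimal in-process sketch of a time-based cache that awaits the coroutine and stores its result rather than the coroutine object. It is not Redis-backed, `async_ttl_cache` is a hypothetical name, and concurrent first calls may each hit the backend.

```python
import asyncio
import functools
import time


def async_ttl_cache(ttl: float):
    """Cache the awaited result of a zero-argument coroutine function for ttl seconds."""
    def decorator(func):
        cached = {}  # holds "value" and "expires" once populated

        @functools.wraps(func)
        async def wrapper():
            if cached and time.monotonic() < cached["expires"]:
                return cached["value"]
            value = await func()  # store the result, not the coroutine
            cached.update(value=value, expires=time.monotonic() + ttl)
            return value
        return wrapper
    return decorator


@async_ttl_cache(ttl=3600)
async def get_ai_models():
    # Placeholder for the real lookup (database query or provider call).
    return ["model-a", "model-b"]
```

Within one process, the second and later calls inside the hour return the stored list without running the lookup again.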

3. Async Optimization

# Run independent calls concurrently with asyncio.gather
import asyncio

results = await asyncio.gather(
    get_conversation(conversation_id),
    get_messages(conversation_id),
    get_mentions(conversation_id)
)
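
The fragment above assumes three existing coroutines. A self-contained version, with stub implementations standing in for the real data access, might look like the following. The stub bodies and `load_conversation_view` are assumptions; `return_exceptions=True` is added so that a failure in one lookup is returned as a value instead of raising out of `gather`.

```python
import asyncio


async def get_conversation(conversation_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a database or HTTP call
    return {"conversation_id": conversation_id}


async def get_messages(conversation_id: str) -> list:
    await asyncio.sleep(0.05)
    return ["hello"]


async def get_mentions(conversation_id: str) -> list:
    raise RuntimeError("mentions backend unavailable")


async def load_conversation_view(conversation_id: str) -> dict:
    # The three lookups run concurrently; results keep the argument order.
    conversation, messages, mentions = await asyncio.gather(
        get_conversation(conversation_id),
        get_messages(conversation_id),
        get_mentions(conversation_id),
        return_exceptions=True,  # failures come back as exception objects
    )
    if isinstance(conversation, Exception):
        raise conversation  # the view is useless without the conversation
    return {
        "conversation": conversation,
        "messages": [] if isinstance(messages, Exception) else messages,
        "mentions": [] if isinstance(mentions, Exception) else mentions,
    }
```

Calling `asyncio.run(load_conversation_view("c1"))` takes roughly as long as the slowest lookup rather than the sum of all three.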

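Monitoring and Alerting

1. Define Monitoring Metrics

# Use Prometheus + Grafana
# Metrics to monitor:
# - API response time
# - Task success rate
# - Average task execution time
# - Database connection count
# - Memory usage
# - CPU usage

2. Define Alert Rules

# Prometheus alert rules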
groups:
  - name: ai_service
    rules:
      - alert: HighTaskFailureRate
        expr: rate(ai_task_failures_total[5m]) > 0.1
        annotations:
          summary: "AI task failure rate is too high"
      
      - alert: SlowAPIResponse
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        annotations:
          summary: "API response time is too long"
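
When tuning the first rule, note that `rate()` yields a per-second average, so the 0.1 threshold is an absolute failure count per second and not a fraction of all tasks. The arithmetic is sketched below. A ratio-based alert would also need a counter of all tasks, which is not defined in this guide, so the `total` argument here is hypothetical.

```python
def failures_in_window(rate_per_second: float, window_seconds: int) -> float:
    """Number of failures implied by a per-second rate over a window."""
    return rate_per_second * window_seconds


def failure_ratio(failed: int, total: int) -> float:
    """Fraction of tasks that failed; 0.0 when no tasks ran."""
    return failed / total if total else 0.0


# The rule fires above 0.1 failures per second, averaged over 5 minutes.
threshold_count = failures_in_window(0.1, 5 * 60)  # about 30 failures
```

The same 30 failures could be a small share of a busy period or every task in a quiet one, which is what `failure_ratio` distinguishes.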
