Screenplay File Storage Architecture Refactor

Date: 2026-02-06
Version: v1.0
Type: Major architecture improvement
RFC: RFC 140 - Screenplay File Storage Refactor


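📋 Change Summary

Refactors the Screenplay file storage architecture so that original files and parsed files are stored separately, removing the ambiguity over what screenplays.file_url holds.

Core Improvement

Before (ambiguous)

screenplays.file_url → original DOCX/PDF ❌ (unclear meaning)
screenplays.content → parsed text

After (clear)

attachments table
└── file_url → original DOCX/PDF ✅ (stored in a dedicated table)

screenplays table
├── source_attachment_id → links to the original file (via polymorphic association)
├── file_url → parsed Markdown file ✅ (single, well-defined role)
└── content → parsed text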
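🔄 Main Changes

1. Data Model Layer (Models)

attachment.py

New enum values

  • RelatedType.SCREENPLAY = 8 - allows attachments to be associated with screenplays
  • AttachmentPurpose.SOURCE = 7 - marks an attachment as an original source file

class RelatedType(IntEnum):
    """Type of the related entity."""
    # ... existing types ...
    SCREENPLAY = 8  # RFC 140: 2026-02-06

class AttachmentPurpose(IntEnum):
    """Purpose of the attachment."""
    # ... existing purposes ...
    SOURCE = 7  # RFC 140: 2026-02-06 (original screenplay file)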
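
Because both enums derive from IntEnum, their members compare equal to plain integers, which is why raw SQL can filter on values such as related_type = 8. A minimal sketch that declares only the two members added by RFC 140 (the existing members are omitted here, as in the excerpt above):

```python
from enum import IntEnum


class RelatedType(IntEnum):
    SCREENPLAY = 8  # only the member added by RFC 140 is shown


class AttachmentPurpose(IntEnum):
    SOURCE = 7  # only the member added by RFC 140 is shown


# IntEnum members behave like ints, so they match integer columns directly.
is_screenplay = RelatedType.SCREENPLAY == 8
# An integer read back from the database converts to the enum member.
purpose = AttachmentPurpose(7)
```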

Impact: all modules that use polymorphic associations


2. Repository Layer (Repositories)

attachment_repository.py

New functionality

  1. Referential integrity check: exists_related_entity

    elif related_type == RelatedType.SCREENPLAY:
        # RFC 140: support the screenplay type
        from app.models.screenplay import Screenplay
        result = await self.session.execute(
            select(Screenplay.screenplay_id).where(
                Screenplay.screenplay_id == related_id,
                Screenplay.deleted_at.is_(None)
            ).limit(1)
        )

  2. Permission check: check_related_permission

    elif related_type == RelatedType.SCREENPLAY:
        # RFC 140: check permission via the project that owns the screenplay
        from app.models.screenplay import Screenplay
        result = await self.session.execute(
            select(Screenplay.project_id).where(
                Screenplay.screenplay_id == related_id,
                Screenplay.deleted_at.is_(None)
            )
        )
        project_id = result.scalar_one_or_none()
        if not project_id:
            logger.warning("Screenplay does not exist or has been deleted: %s", related_id)
            return False

        from app.repositories.project_repository import ProjectRepository
        project_repo = ProjectRepository(self.session)
        return await project_repo.check_user_permission(
            user_id, project_id, required_role
        )


Impact: the attachment validation flow in AttachmentService


3. Business Logic Layer (Services)

screenplay_service.py

Modified method: create_screenplay_from_file()

Before:

# Store the original file URL directly in screenplay.file_url
screenplay.file_url = uploaded_file.file_url  # ❌ ambiguous

After:

# 1. Create the screenplay record (file_url is empty for now)
screenplay = Screenplay(
    project_id=project_id,
    name=name,
    type=ScreenplayType.FILE,
    file_url=None,  # ✅ set to the Markdown URL once parsing completes
    parsing_status='pending',
    status=ScreenplayStatus.DRAFT,
    created_by=user_id,
    updated_by=user_id
)

# 2. Upload the original file and create an Attachment record (polymorphic association)
# Note: created_screenplay is the persisted screenplay record; the save step is not shown in this excerpt.
file_metadata = await file_storage.upload_file(...)
attachment = Attachment(
    file_url=file_metadata.file_url,
    related_id=created_screenplay.screenplay_id,
    related_type=RelatedType.SCREENPLAY,
    attachment_purpose=AttachmentPurpose.SOURCE,  # ✅ marks this as the original file
    uploaded_by=user_id
)
self.db.add(attachment)
await self.db.commit()

Impact: screenplay upload endpoint POST /api/v1/screenplays/upload


screenplay_file_parser_service.py

New functionality: Markdown file generation and upload

New methods:

  1. _format_as_markdown(content: str) -> str
    Converts the parsed text to Markdown

  2. _upload_markdown_file(screenplay_id, markdown_content, user_id) -> str
    Uploads the Markdown file to OSS and returns its URL
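
The RFC does not spell out the formatting rules that _format_as_markdown applies. As a rough illustration only, the sketch below assumes a minimal set of rules (normalize line endings, trim trailing whitespace, separate paragraphs with exactly one blank line). The standalone function name format_as_markdown and all of these rules are assumptions, not the actual implementation.

```python
import re


def format_as_markdown(content: str) -> str:
    """Normalize parsed text into plain Markdown paragraphs (assumed rules)."""
    # Normalize Windows and old-Mac line endings to "\n".
    text = content.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing whitespace on every line.
    lines = [line.rstrip() for line in text.split("\n")]
    text = "\n".join(lines).strip()
    # Collapse runs of blank lines so paragraphs are separated by exactly one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the file with a single newline.
    return text + "\n"


markdown = format_as_markdown("INT. ROOM - DAY  \r\n\r\n\r\n\r\nAlice enters.\r\n")
```

A real formatter would likely also map scene headings and dialogue to Markdown headings or emphasis; that mapping depends on the parser output and is left out here.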

Modified methods:

  • parse_file() - asynchronous parsing
  • parse_file_sync() - synchronous parsing

Core change:

# 1. Load the screenplay to get its creator (used for the Markdown upload)
screenplay = await self.repository.get_by_id(screenplay_id)
if not screenplay:
    raise ValueError(f"Screenplay not found: {screenplay_id}")

# 2. Generate the Markdown file and upload it to OSS (RFC 140)
markdown_content = self._format_as_markdown(content)
markdown_url = await self._upload_markdown_file(
    screenplay_id,
    markdown_content,
    user_id=screenplay.created_by
)

# 3. Update the screenplay record
await self.repository.update(screenplay_id, {
    'content': content,
    'file_url': markdown_url,  # ✅ stores the Markdown file URL (RFC 140)
    'word_count': word_count,
    'parsing_status': 3,  # 3=completed
    'parsed_at': datetime.now(timezone.utc)
})

Impact: all screenplay file parsing flows


4. Async Task Layer (Tasks)

screenplay_tasks.py

Modified method: _parse_file_async()

Core change:

async def _parse_file_async(screenplay_id: str, file_path: str, mime_type: str):
    async with async_session_maker() as db:
        from sqlmodel import select
        from app.models.attachment import Attachment, RelatedType, AttachmentPurpose

        # 1. Look up the original file attachment (via polymorphic association)
        result = await db.exec(
            select(Attachment)
            .where(Attachment.related_type == RelatedType.SCREENPLAY)
            .where(Attachment.related_id == screenplay_id)
            .where(Attachment.attachment_purpose == AttachmentPurpose.SOURCE)
        )
        source_attachment = result.first()

        if not source_attachment:
            raise ValueError(f"Screenplay has no associated source file: {screenplay_id}")

        # 2. Take the original file URL and MIME type from the attachment
        actual_file_path = source_attachment.file_url
        actual_mime_type = source_attachment.mime_type

        # 3. Parse the file
        parser_service = ScreenplayFileParserService(db)
        result = await parser_service.parse_file(
            screenplay_id=screenplay_id,
            file_path=actual_file_path,
            mime_type=actual_mime_type
        )

Impact: Celery asynchronous parsing task


5. Schema Layer (API Responses)

screenplay.py

New schema:

class SourceFileInfo(BaseModel):
    """Original file information (RFC 140)."""
    attachment_id: UUID = Field(..., alias="attachmentId")
    file_name: str = Field(..., alias="fileName")
    file_size: int = Field(..., alias="fileSize")
    mime_type: str = Field(..., alias="mimeType")
    file_url: str = Field(..., alias="fileUrl")

Modified schema: ScreenplayResponse

New field:

source_file: Optional[SourceFileInfo] = Field(
    default=None,
    alias="sourceFile",
    description="Original file information (RFC 140)"
)

Changed field semantics:

# Updated description of an existing field
file_url: Optional[str] = Field(
    default=None,
    alias="fileUrl",
    description="Screenplay file URL (RFC 140: now the parsed Markdown file)"
)

Impact:

  • GET /api/v1/screenplays/{id} - screenplay detail
  • GET /api/v1/screenplays - screenplay list

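6. URL Storage Strategy (Gradual Migration)

core/storage.py

New functionality: smart URL construction

def build_file_url(path_or_url: str, bucket_name: Optional[str] = None) -> str:
    """
    Build a full file URL (detects the input form, supports gradual migration).

    - If the value is already a full URL, return it unchanged (backward compatible)
    - If the value is a relative path, prepend the base URL
    """
    # Backward compatibility: return full URLs unchanged
    if path_or_url and path_or_url.startswith(('http://', 'https://')):
        return path_or_url

    # Build the full URL from the relative path
    if settings.S3_PUBLIC_URL:
        return f"{settings.S3_PUBLIC_URL}/{path_or_url}"
    else:
        # The original excerpt referenced an undefined name `bucket`.
        # Assumption: fall back to a default bucket setting; S3_BUCKET_NAME is a hypothetical name.
        bucket = bucket_name or settings.S3_BUCKET_NAME
        return f"https://{bucket}.s3.{settings.S3_REGION}.amazonaws.com/{path_or_url}"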
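
To make the two branches concrete, here is a self-contained sketch of the same idea that can be run without the application. StorageSettings is a stand-in for the real settings object, and its field values, the S3_BUCKET default, and the extra settings parameter are all assumptions made for illustration. Unlike the excerpt above, this version also trims stray slashes before joining.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageSettings:
    """Stand-in for the application's settings object (names are assumptions)."""
    S3_PUBLIC_URL: Optional[str] = "https://cdn.example.com"
    S3_REGION: str = "us-east-1"
    S3_BUCKET: str = "screenplays"  # hypothetical default bucket setting


def build_file_url(path_or_url: str, settings: StorageSettings,
                   bucket_name: Optional[str] = None) -> str:
    # Legacy rows already hold a full URL: return it unchanged.
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    # New rows hold a relative object key: join it to the configured base.
    key = path_or_url.lstrip("/")
    if settings.S3_PUBLIC_URL:
        return f"{settings.S3_PUBLIC_URL.rstrip('/')}/{key}"
    bucket = bucket_name or settings.S3_BUCKET
    return f"https://{bucket}.s3.{settings.S3_REGION}.amazonaws.com/{key}"


settings = StorageSettings()
legacy = build_file_url("https://oss.example.com/original.docx", settings)
fresh = build_file_url("screenplays/abc.md", settings)
no_cdn = build_file_url("screenplays/abc.md", StorageSettings(S3_PUBLIC_URL=None))
```

Passing the settings explicitly keeps the sketch testable; the real function reads a module-level settings object instead.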

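Modified functionality: StorageService.upload_bytes() now returns a relative path

async def upload_bytes(self, data: bytes, object_name: str, ...) -> str:
    """
    Returns:
        str: object storage path (relative path, without the domain)
    """
    self.client.put_object(...)

    # ✅ Return the relative path (after the refactor)
    return object_name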

Schema computed fields

New computed fields:

# app/schemas/screenplay.py
class ScreenplayResponse(BaseModel):
    file_url: Optional[str]  # the database stores a relative path

    @computed_field(alias="parsedFileUrl")
    @property
    def parsed_file_url(self) -> Optional[str]:
        """Build the full access URL on the fly."""
        return build_file_url(self.file_url) if self.file_url else None

# app/schemas/attachment.py
class AttachmentResponse(BaseModel):
    file_url: str

    @computed_field(alias="fullUrl")
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)

# app/schemas/file_checksum.py
class FileMetadata(BaseModel):
    # ... existing fields, including file_url ...
    @computed_field
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)

Advantages:

  • Gradual migration: old data (full URLs) keeps working, while new data uses relative paths
  • Zero downtime: no database migration is needed; the change takes effect on deployment
  • Domain independent: simplifies multi-environment deployment (dev/staging/prod)
  • Extensible: leaves room for CDN switching and multi-region storage

Impact:

  • All file upload operations
  • All file URL fields in API responses

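📊 Data Flow Changes

Original flow (before)

User uploads DOCX
    ↓
Stored directly in screenplay.file_url  ❌ (ambiguous)
    ↓
Asynchronous parsing → update the content field

Improved flow (after)

User uploads DOCX
    ↓
1. Create Screenplay (file_url=None, parsing_status=pending)
    ↓
2. Upload the original file to OSS → create an Attachment record
   - related_type=SCREENPLAY
   - attachment_purpose=SOURCE
    ↓
3. Parse the original file asynchronously
    ↓
4. Generate the Markdown file → upload it to OSS
    ↓
5. Update Screenplay.file_url = markdown_url  ✅ (clear)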
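
The improved flow can be walked through with an in-memory sketch. It uses plain dataclasses instead of the real ORM models, a list instead of the attachments table, and a made-up object key for the Markdown file, so every helper name here (upload_original, find_source, finish_parsing) is hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional
from uuid import uuid4


class RelatedType(IntEnum):
    SCREENPLAY = 8


class AttachmentPurpose(IntEnum):
    SOURCE = 7


@dataclass
class Screenplay:
    name: str
    screenplay_id: str = field(default_factory=lambda: str(uuid4()))
    file_url: Optional[str] = None  # filled in only after parsing
    content: Optional[str] = None
    parsing_status: str = "pending"


@dataclass
class Attachment:
    file_url: str
    related_id: str
    related_type: RelatedType
    attachment_purpose: AttachmentPurpose


def upload_original(sp: Screenplay, original_path: str, store: List[Attachment]) -> Attachment:
    """Steps 1-2: record the original file as an attachment, not on the screenplay."""
    att = Attachment(original_path, sp.screenplay_id,
                     RelatedType.SCREENPLAY, AttachmentPurpose.SOURCE)
    store.append(att)
    return att


def find_source(sp: Screenplay, store: List[Attachment]) -> Optional[Attachment]:
    """Step 3: look up the original file through the polymorphic association."""
    for att in store:
        if (att.related_type == RelatedType.SCREENPLAY
                and att.related_id == sp.screenplay_id
                and att.attachment_purpose == AttachmentPurpose.SOURCE):
            return att
    return None


def finish_parsing(sp: Screenplay, text: str) -> None:
    """Steps 4-5: store the parsed text and point file_url at the Markdown file."""
    sp.content = text
    sp.file_url = f"screenplays/{sp.screenplay_id}.md"  # hypothetical object key
    sp.parsing_status = "completed"


attachments: List[Attachment] = []
sp = Screenplay(name="Pilot")
upload_original(sp, "uploads/pilot.docx", attachments)
source = find_source(sp, attachments)
finish_parsing(sp, "INT. ROOM - DAY")
```

After the run, the original DOCX is reachable only through the attachment, and sp.file_url refers to the generated Markdown file.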

⚠️ Breaking Changes

API response format change

Affected endpoints:

  • GET /api/v1/screenplays/{id}
  • GET /api/v1/screenplays

What changed:

Before

{
  "fileUrl": "https://oss.example.com/original.docx",  // ambiguous meaning
  "content": "parsed text"
}

After

{
  "fileUrl": "https://oss.example.com/screenplay_xxx.md",  // parsed Markdown file
  "content": "parsed text",
  "sourceFile": {  // new: original file information
    "attachmentId": "xxx",
    "fileName": "original.docx",
    "fileSize": 102400,
    "mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "fileUrl": "https://oss.example.com/original.docx"
  }
}

Frontend compatibility:

  • Backward compatible: the fileUrl field is kept (but its meaning has changed)
  • ⚠️ Frontend changes required: to access the original file, use sourceFile.fileUrl
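
A client that needs the original upload can read it from the new nested object and fall back gracefully when it is absent. The sketch below is written in Python for consistency with the rest of this document; a frontend would apply the same lookup. The helper names are hypothetical, and the payload mirrors the example response above.

```python
from typing import Any, Dict, Optional


def original_file_url(screenplay: Dict[str, Any]) -> Optional[str]:
    """Return the URL of the uploaded DOCX/PDF, or None if it is not exposed."""
    source = screenplay.get("sourceFile")
    if isinstance(source, dict):
        return source.get("fileUrl")
    return None


def markdown_file_url(screenplay: Dict[str, Any]) -> Optional[str]:
    """After RFC 140, the top-level fileUrl refers to the parsed Markdown file."""
    return screenplay.get("fileUrl")


response = {
    "fileUrl": "https://oss.example.com/screenplay_xxx.md",
    "content": "parsed text",
    "sourceFile": {
        "attachmentId": "xxx",
        "fileName": "original.docx",
        "fileUrl": "https://oss.example.com/original.docx",
    },
}
download_url = original_file_url(response)
```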

🧪 Testing Impact

Tests that need updating

  1. Unit tests:

    • test_screenplay_service.py::test_create_screenplay_from_file - verify that the Attachment is created
    • test_screenplay_file_parser_service.py::test_parse_file - verify Markdown generation
    • test_attachment_repository.py::test_exists_related_entity - verify support for the SCREENPLAY type
  2. Integration tests:

    • test_screenplay_upload_api.py - verify the full upload flow
    • test_screenplay_parse_task.py - verify the asynchronous parsing flow

Test commands

# Run the related unit tests
pytest tests/unit/services/test_screenplay_service.py -v
pytest tests/unit/services/test_screenplay_file_parser_service.py -v
pytest tests/unit/repositories/test_attachment_repository.py -v

# Run the integration tests
pytest tests/integration/test_screenplay_upload_api.py -v

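🚀 Deployment Recommendations

Before deployment

  1. No data migration is required (development stage)
  2. Clear existing test data (recommended)
    -- Run only if you want to clear test data
    DELETE FROM attachments WHERE related_type = 8;
    DELETE FROM screenplays WHERE parsing_status IN ('pending', 'processing');


Deployment steps

  1. Stop the backend service

    docker-compose stop server

  2. Update the code

    git pull origin main

  3. Restart the service

    docker-compose up -d server

  4. Verify the deployment

    # 1. Upload a test file
    curl -X POST http://localhost:6170/api/v1/screenplays/upload \
      -H "Authorization: Bearer $TOKEN" \
      -F "file=@test.docx"

    # 2. Check that the Attachment was created
    # 3. Once parsing completes, check that file_url points to a .md file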
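
Checks 2 and 3 can be scripted against the JSON returned by GET /api/v1/screenplays/{id}. The sketch below only inspects an already-fetched payload; the function name deployment_problems is hypothetical, and it assumes the response shape shown in the breaking-changes section (sourceFile present once the Attachment exists, fileUrl set once parsing completes).

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def deployment_problems(screenplay: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems: List[str] = []
    # Check 2: the source Attachment should be exposed as sourceFile.
    if not screenplay.get("sourceFile"):
        problems.append("no sourceFile: the source Attachment may not have been created")
    # Check 3: after parsing, fileUrl should point to a Markdown file.
    file_url = screenplay.get("fileUrl")
    if not file_url:
        problems.append("fileUrl is empty: parsing may still be in progress")
    elif not urlparse(file_url).path.lower().endswith(".md"):
        problems.append("fileUrl does not point to a .md file")
    return problems


ok_payload = {
    "fileUrl": "https://oss.example.com/screenplay_xxx.md",
    "sourceFile": {"fileUrl": "https://oss.example.com/original.docx"},
}
old_payload = {"fileUrl": "https://oss.example.com/original.docx"}
ok_result = deployment_problems(ok_payload)
old_result = deployment_problems(old_payload)
```

The URL is parsed before the suffix check so that a signed URL with a query string would still be recognized as a .md file.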
    

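📝 Follow-up Improvements

  1. Schema completion: expose the source_file field in ScreenplayResponse
  2. Frontend adaptation: update the frontend download logic to distinguish the original file from the Markdown file
  3. Documentation: update the API documentation to describe the new response format
  4. Monitoring and alerting: add monitoring for Markdown upload failures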

📚 Related Documents


👥 Contributors

  • @panta - architecture design and implementation

Change type: Feature | 🔄 Refactor | ⚠️ Breaking Change
Impact: 🗄️ Database | 🔌 API | 🎨 Frontend
Priority: 🔴 High