# Screenplay File Storage Architecture Refactor

**Date**: 2026-02-06
**Version**: v1.0
**Type**: Major architecture optimization
**RFC**: [RFC 140 - Screenplay File Storage Refactor](../rfcs/140-screenplay-file-storage-refactor.md)

---

## 📋 Change Summary

This change refactors the Screenplay file storage architecture so that original files and parsed files are cleanly separated, removing the ambiguity in how files were stored.

### Core Improvement

**Before (ambiguous)**:

```
screenplays.file_url → original DOCX/PDF ❌ (unclear meaning)
screenplays.content  → parsed text
```

**After (clear)**:

```
attachments table
└── file_url → original DOCX/PDF ✅ (stored in a dedicated table)

screenplays table
├── source_attachment_id → link to the original file (via polymorphic association)
├── file_url → parsed Markdown file ✅ (single responsibility)
└── content → parsed text
```

---

## 🔄 Main Changes

### 1. Data Model Layer (Models)

#### `attachment.py`

**New enum values**:

- `RelatedType.SCREENPLAY = 8` - allows attachments to be associated with screenplays
- `AttachmentPurpose.SOURCE = 7` - marks an attachment as an original (source) file

```python
class RelatedType(IntEnum):
    """Type of the related entity"""
    # ... existing types ...
    SCREENPLAY = 8  # RFC 140: 2026-02-06


class AttachmentPurpose(IntEnum):
    """Purpose of the attachment"""
    # ... existing purposes ...
    SOURCE = 7  # RFC 140: 2026-02-06 (original screenplay file)
```

**Scope of impact**: all modules that use polymorphic associations

---

### 2. Repository Layer (Repositories)

#### `attachment_repository.py`

**New functionality**:

1. **Referential integrity check** (`exists_related_entity`)

   ```python
   elif related_type == RelatedType.SCREENPLAY:
       # RFC 140: support the screenplay type
       from app.models.screenplay import Screenplay
       result = await self.session.execute(
           select(Screenplay.screenplay_id).where(
               Screenplay.screenplay_id == related_id,
               Screenplay.deleted_at.is_(None)
           ).limit(1)
       )
   ```

2. **Permission check** (`check_related_permission`)

   ```python
   elif related_type == RelatedType.SCREENPLAY:
       # RFC 140: check permission through the project that owns the screenplay
       from app.models.screenplay import Screenplay
       result = await self.session.execute(
           select(Screenplay.project_id).where(
               Screenplay.screenplay_id == related_id,
               Screenplay.deleted_at.is_(None)
           )
       )
       project_id = result.scalar_one_or_none()
       if not project_id:
           logger.warning("Screenplay does not exist or has been deleted: %s", related_id)
           return False

       from app.repositories.project_repository import ProjectRepository
       project_repo = ProjectRepository(self.session)
       return await project_repo.check_user_permission(
           user_id, project_id, required_role
       )
   ```

**Scope of impact**: the attachment validation flow in AttachmentService

---

### 3. Business Logic Layer (Services)

#### `screenplay_service.py`

**Modified method**: `create_screenplay_from_file()`

**Before**:

```python
# The original file URL was stored directly in screenplay.file_url
screenplay.file_url = uploaded_file.file_url  # ❌ ambiguous
```

**After**:

```python
# 1. Create the screenplay record (file_url is empty for now)
screenplay = Screenplay(
    project_id=project_id,
    name=name,
    type=ScreenplayType.FILE,
    file_url=None,  # ✅ filled with the Markdown URL once parsing finishes
    parsing_status='pending',
    status=ScreenplayStatus.DRAFT,
    created_by=user_id,
    updated_by=user_id
)

# 2. Upload the original file and create an Attachment record (polymorphic association)
file_metadata = await file_storage.upload_file(...)
attachment = Attachment(
    file_url=file_metadata.file_url,
    # created_screenplay is the persisted screenplay record
    # (the save step is omitted from this excerpt)
    related_id=created_screenplay.screenplay_id,
    related_type=RelatedType.SCREENPLAY,
    attachment_purpose=AttachmentPurpose.SOURCE,  # ✅ marks the original file
    uploaded_by=user_id
)
self.db.add(attachment)
await self.db.commit()
```

**Scope of impact**: the screenplay upload endpoint `POST /api/v1/screenplays/upload`

---

#### `screenplay_file_parser_service.py`

**New functionality**: Markdown file generation and upload

**New methods**:

1. `_format_as_markdown(content: str) -> str`: converts the parsed text to Markdown
2. `_upload_markdown_file(screenplay_id, markdown_content, user_id) -> str`: uploads the Markdown file to OSS and returns its URL

**Modified methods**:

- `parse_file()` - asynchronous parsing
- `parse_file_sync()` - synchronous parsing

**Core change**:

```python
# 1. Load the screenplay to get its creator (used for the Markdown upload)
screenplay = await self.repository.get_by_id(screenplay_id)
if not screenplay:
    raise ValueError(f"Screenplay does not exist: {screenplay_id}")

# 2. Generate the Markdown file and upload it to OSS (RFC 140)
markdown_content = self._format_as_markdown(content)
markdown_url = await self._upload_markdown_file(
    screenplay_id,
    markdown_content,
    user_id=screenplay.created_by
)

# 3. Update the screenplay record
await self.repository.update(screenplay_id, {
    'content': content,
    'file_url': markdown_url,  # ✅ stores the Markdown file URL (RFC 140)
    'word_count': word_count,
    'parsing_status': 3,  # 3=completed
    'parsed_at': datetime.now(timezone.utc)
})
```

**Scope of impact**: all screenplay file parsing flows

---

### 4. Async Task Layer (Tasks)

#### `screenplay_tasks.py`

**Modified method**: `_parse_file_async()`

**Core change**:

```python
async def _parse_file_async(screenplay_id: str, file_path: str, mime_type: str):
    async with async_session_maker() as db:
        from sqlmodel import select
        from app.models.attachment import Attachment, RelatedType, AttachmentPurpose

        # 1. Look up the original file attachment (via polymorphic association)
        result = await db.exec(
            select(Attachment)
            .where(Attachment.related_type == RelatedType.SCREENPLAY)
            .where(Attachment.related_id == screenplay_id)
            .where(Attachment.attachment_purpose == AttachmentPurpose.SOURCE)
        )
        source_attachment = result.first()
        if not source_attachment:
            raise ValueError(f"Screenplay has no associated original file: {screenplay_id}")

        # 2. Take the original file URL and MIME type from the attachment
        actual_file_path = source_attachment.file_url
        actual_mime_type = source_attachment.mime_type

        # 3. Parse the file
        parser_service = ScreenplayFileParserService(db)
        result = await parser_service.parse_file(
            screenplay_id=screenplay_id,
            file_path=actual_file_path,
            mime_type=actual_mime_type
        )
```

**Scope of impact**: the Celery asynchronous parsing task

---

### 5. Schema Layer (API Responses)

#### `screenplay.py`

**New schema**:

```python
class SourceFileInfo(BaseModel):
    """Original file information (RFC 140)"""
    attachment_id: UUID = Field(..., alias="attachmentId")
    file_name: str = Field(..., alias="fileName")
    file_size: int = Field(..., alias="fileSize")
    mime_type: str = Field(..., alias="mimeType")
    file_url: str = Field(..., alias="fileUrl")
```

**Modified schema**: `ScreenplayResponse`

**New field**:

```python
source_file: Optional[SourceFileInfo] = Field(
    default=None,
    alias="sourceFile",
    description="Original file information (RFC 140)"
)
```

**Changed field semantics**:

```python
# Updated description of an existing field
file_url: Optional[str] = Field(
    default=None,
    alias="fileUrl",
    description="Screenplay file URL (RFC 140: now the parsed Markdown file)"
)
```

**Scope of impact**:

- `GET /api/v1/screenplays/{id}` - screenplay detail
- `GET /api/v1/screenplays` - screenplay list

---

### 6. URL Storage Strategy (Progressive Migration)

#### `core/storage.py`

**New functionality**: smart URL construction

```python
def build_file_url(path_or_url: str, bucket_name: Optional[str] = None) -> str:
    """
    Build a full file URL (detects the input form, supports progressive migration).

    - A full URL is returned unchanged (backward compatible).
    - A relative path is joined onto the base URL.
    """
    # Backward compatible: return full URLs unchanged
    if path_or_url and path_or_url.startswith(('http://', 'https://')):
        return path_or_url

    # Join relative paths onto the base URL
    if settings.S3_PUBLIC_URL:
        return f"{settings.S3_PUBLIC_URL}/{path_or_url}"
    else:
        # bucket_name must be supplied on this path;
        # resolving a default bucket is not shown in this excerpt
        return f"https://{bucket_name}.s3.{settings.S3_REGION}.amazonaws.com/{path_or_url}"
```

**Modified functionality**: `StorageService.upload_bytes()` now returns a relative path

```python
async def upload_bytes(self, data: bytes, object_name: str, ...) -> str:
    """
    Returns:
        str: object storage path (relative path, without the domain)
    """
    self.client.put_object(...)

    # ✅ Return the relative path (after the refactor)
    return object_name
```

#### Schema Computed Fields

**New computed fields**:

```python
# app/schemas/screenplay.py
class ScreenplayResponse(BaseModel):
    file_url: Optional[str]  # the database stores the relative path

    @computed_field(alias="parsedFileUrl")
    @property
    def parsed_file_url(self) -> Optional[str]:
        """Build the full access URL on demand"""
        return build_file_url(self.file_url) if self.file_url else None


# app/schemas/attachment.py
class AttachmentResponse(BaseModel):
    file_url: str

    @computed_field(alias="fullUrl")
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)


# app/schemas/file_checksum.py
class FileMetadata(BaseModel):
    # ... existing fields, including file_url ...

    @computed_field
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)
```

**Benefits**:

- ✅ **Progressive migration**: old rows (full URLs) keep working, while new rows store relative paths
- ✅ **Zero downtime**: no database migration is needed; the change takes effect on deployment
- ✅ **Domain independent**: simplifies multi-environment deployment (dev/staging/prod)
- ✅ **Extensible**: leaves room for CDN switching and multi-region storage

**Scope of impact**:

- All file upload operations
- All file URL fields in API responses

---

## 📊 Data Flow Changes

### Original flow (before)

```
User uploads DOCX
  ↓
Stored directly in screenplay.file_url ❌ (ambiguous)
  ↓
Async parsing → update the content field
```

### Improved flow (after)

```
User uploads DOCX
  ↓
1. Create Screenplay (file_url=None, parsing_status=pending)
  ↓
2. Upload the original file to OSS → create an Attachment record
   - related_type=SCREENPLAY
   - attachment_purpose=SOURCE
  ↓
3. Parse the original file asynchronously
  ↓
4. Generate the Markdown file → upload to OSS
  ↓
5. Update Screenplay.file_url = markdown_url ✅ (clear)
```

---

## ⚠️ Breaking Changes

### API response format change

**Affected endpoints**:

- `GET /api/v1/screenplays/{id}`
- `GET /api/v1/screenplays`

**What changed**:

#### Before

```json
{
  "fileUrl": "https://oss.example.com/original.docx",  // ❌ unclear meaning
  "content": "parsed text"
}
```

#### After

```json
{
  "fileUrl": "https://oss.example.com/screenplay_xxx.md",  // ✅ parsed Markdown
  "content": "parsed text",
  "sourceFile": {  // ✅ new: original file information
    "attachmentId": "xxx",
    "fileName": "original.docx",
    "fileSize": 102400,
    "mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "fileUrl": "https://oss.example.com/original.docx"
  }
}
```

**Frontend compatibility**:

- ✅ Structurally backward compatible: the `fileUrl` field is retained, but it now points to the parsed Markdown file rather than the original upload
- ⚠️ Frontend changes required: code that needs the original file should read `sourceFile.fileUrl`

---

## 🧪 Testing Impact

### Tests that need updating

1. **Unit tests**:
   - `test_screenplay_service.py::test_create_screenplay_from_file` - verify that the Attachment is created
   - `test_screenplay_file_parser_service.py::test_parse_file` - verify Markdown generation
   - `test_attachment_repository.py::test_exists_related_entity` - verify support for the SCREENPLAY type
2. **Integration tests**:
   - `test_screenplay_upload_api.py` - verify the full upload flow
   - `test_screenplay_parse_task.py` - verify the async parsing flow

### Test commands

```bash
# Run the related unit tests
pytest tests/unit/services/test_screenplay_service.py -v
pytest tests/unit/services/test_screenplay_file_parser_service.py -v
pytest tests/unit/repositories/test_attachment_repository.py -v

# Run the integration tests
pytest tests/integration/test_screenplay_upload_api.py -v
```

---

## 🚀 Deployment Recommendations

### Before deployment

1. ✅ **No data migration required** (the project is still in development)
2. ✅ **Clear existing test data** (recommended)

   ```sql
   -- To clear test data if needed
   DELETE FROM attachments WHERE related_type = 8;
   DELETE FROM screenplays WHERE parsing_status IN ('pending', 'processing');
   ```

### Deployment steps

1. **Stop the backend service**

   ```bash
   docker-compose stop server
   ```

2. **Update the code**

   ```bash
   git pull origin main
   ```

3. **Restart the service**

   ```bash
   docker-compose up -d server
   ```

4. **Verify the deployment**

   ```bash
   # 1. Upload a test file
   curl -X POST http://localhost:6170/api/v1/screenplays/upload \
     -H "Authorization: Bearer $TOKEN" \
     -F "file=@test.docx"

   # 2. Check that the Attachment was created
   # 3. After parsing completes, check that file_url points to a .md file
   ```

---

## 📝 Follow-up Work

1. **Schema completion**: expose the `source_file` field in `ScreenplayResponse`
2. **Frontend adaptation**: update the frontend download logic to distinguish the original file from the Markdown file
3. **Documentation**: update the API docs to describe the new response format
4. **Monitoring and alerting**: add monitoring for Markdown upload failures

---

## 📚 Related Documents

- [RFC 140 - Screenplay File Storage Refactor](../rfcs/140-screenplay-file-storage-refactor.md)
- [ADR 005 - Polymorphic Association Design](../../architecture/adrs/005-polymorphic-association.md)
- [Database Design Guidelines](../../architecture/database-design.md)

---

## 👥 Contributors

- @panta - architecture design and implementation

---

**Change type**: ✨ Feature | 🔄 Refactor | ⚠️ Breaking Change
**Scope of impact**: 🗄️ Database | 🔌 API | 🎨 Frontend
**Priority**: 🔴 High
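
The progressive migration behaviour of `build_file_url` (section 6) can be checked in isolation with a small standalone sketch. It is not the project's implementation: `StorageSettings`, its field names, the `default_bucket` fallback, and the slash normalization are all assumptions made so the example runs without the application's `settings` object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageSettings:
    """Hypothetical stand-in for the application's settings object."""
    public_url: Optional[str] = None      # plays the role of settings.S3_PUBLIC_URL
    region: str = "us-east-1"             # plays the role of settings.S3_REGION
    default_bucket: str = "screenplays"   # assumed fallback bucket name


def build_file_url(path_or_url: str, cfg: StorageSettings,
                   bucket_name: Optional[str] = None) -> str:
    """Return a full URL for either a legacy full URL or a relative object path."""
    # Legacy rows already hold a full URL: return it unchanged.
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url

    # Assumption: strip stray slashes so the join never yields "//".
    key = path_or_url.lstrip("/")
    if cfg.public_url:
        return f"{cfg.public_url.rstrip('/')}/{key}"

    # No public base URL configured: fall back to an S3 virtual-hosted-style URL.
    bucket = bucket_name or cfg.default_bucket
    return f"https://{bucket}.s3.{cfg.region}.amazonaws.com/{key}"


cdn = StorageSettings(public_url="https://cdn.example.com/")

# Old data: full URL stored in the database.
legacy = build_file_url("https://oss.example.com/original.docx", cdn)
# New data: relative path stored in the database.
fresh = build_file_url("screenplays/abc/parsed.md", cdn)
# New data without a public base URL, with an explicit bucket.
fallback = build_file_url("/screenplays/abc/parsed.md", StorageSettings(),
                          bucket_name="media")

print(legacy)
print(fresh)
print(fallback)
```

Passing the settings in as an argument, rather than reading a module-level `settings`, is only a convenience for testing; the point of the sketch is that both storage forms resolve through one function, so switching the public base URL does not require rewriting stored rows.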