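Screenplay File Storage Architecture Refactor
Date: 2026-02-06
Version: v1.0
Type: Major architecture optimization
RFC: RFC 140 - Screenplay File Storage Refactor
📋 Change Summary
This change refactors the Screenplay file storage architecture so that original files and parsed files are stored separately, resolving the ambiguity in how screenplay files were stored.
Core Improvements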
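Before (ambiguous):
screenplays.file_url → original DOCX/PDF ❌ (meaning unclear)
screenplays.content → parsed text
After (clear):
attachments table
└── file_url → original DOCX/PDF ✅ (stored in a dedicated table)
screenplays table
├── source_attachment_id → links to the original file (via polymorphic association)
├── file_url → parsed Markdown file ✅ (single, well-defined responsibility)
└── content → parsed text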
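🔄 Main Changes
1. Data Model Layer (Models)
attachment.py
New enum values:
- RelatedType.SCREENPLAY = 8 - supports associating attachments with screenplays
- AttachmentPurpose.SOURCE = 7 - marks an attachment as an original file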
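from enum import IntEnum

class RelatedType(IntEnum):
    """Type of the related entity."""
    # ... existing types ...
    SCREENPLAY = 8  # RFC 140: 2026-02-06

class AttachmentPurpose(IntEnum):
    """Purpose of the attachment."""
    # ... existing purposes ...
    SOURCE = 7  # RFC 140: 2026-02-06 (original screenplay file)
Scope of impact: all modules that use polymorphic associations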
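To illustrate how the two new enum values together identify a screenplay's original file, here is a minimal in-memory sketch. Only the values `RelatedType.SCREENPLAY = 8` and `AttachmentPurpose.SOURCE = 7` come from this change; the `AttachmentRow` dataclass, the `find_source_attachment` helper, and the sample rows are hypothetical stand-ins, not the project's actual models.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class RelatedType(IntEnum):
    SCREENPLAY = 8  # value from RFC 140; other members omitted


class AttachmentPurpose(IntEnum):
    SOURCE = 7  # value from RFC 140; other members omitted


@dataclass
class AttachmentRow:
    """Hypothetical stand-in for one row of the attachments table."""
    file_url: str
    related_type: int
    related_id: str
    attachment_purpose: int


def find_source_attachment(
    rows: Iterable[AttachmentRow], screenplay_id: str
) -> Optional[AttachmentRow]:
    """Return the first attachment marked as the screenplay's original file."""
    for row in rows:
        if (
            row.related_type == RelatedType.SCREENPLAY
            and row.related_id == screenplay_id
            and row.attachment_purpose == AttachmentPurpose.SOURCE
        ):
            return row
    return None


# Sample data (hypothetical): one source file and one unrelated attachment.
rows = [
    AttachmentRow("uploads/a.docx", 8, "sp-1", 7),
    AttachmentRow("uploads/cover.png", 8, "sp-1", 1),
]
match = find_source_attachment(rows, "sp-1")
```

All three conditions are needed: the related type and ID select the screenplay, and the purpose distinguishes the original file from any other attachment on the same screenplay.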
2. Repository Layer (Repositories)
attachment_repository.py
New functionality:
- Referential integrity check (exists_related_entity)
elif related_type == RelatedType.SCREENPLAY:
    # RFC 140: support the screenplay type
    from app.models.screenplay import Screenplay
    result = await self.session.execute(
        select(Screenplay.screenplay_id).where(
            Screenplay.screenplay_id == related_id,
            Screenplay.deleted_at.is_(None)
        ).limit(1)
    )
- Permission check (check_related_permission)
elif related_type == RelatedType.SCREENPLAY:
    # RFC 140: check permission through the project that owns the screenplay
    from app.models.screenplay import Screenplay
    result = await self.session.execute(
        select(Screenplay.project_id).where(
            Screenplay.screenplay_id == related_id,
            Screenplay.deleted_at.is_(None)
        )
    )
    project_id = result.scalar_one_or_none()
    if not project_id:
        logger.warning("Screenplay does not exist or has been deleted: %s", related_id)
        return False
    from app.repositories.project_repository import ProjectRepository
    project_repo = ProjectRepository(self.session)
    return await project_repo.check_user_permission(
        user_id, project_id, required_role
    )
Scope of impact: the attachment validation flow in AttachmentService
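The permission check above resolves the screenplay to its owning project and then defers to the project's permission rules. A minimal in-memory sketch of that chain follows. The dictionaries and the `check_screenplay_permission` function are hypothetical, and the `required_role` comparison from the real repository is left out for brevity.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory data standing in for the screenplays table and
# the project membership table.
SCREENPLAY_PROJECT: Dict[str, str] = {"sp-1": "proj-1", "sp-2": "proj-1"}
DELETED_SCREENPLAYS: Set[str] = {"sp-2"}
PROJECT_MEMBERS: Set[Tuple[str, str]] = {("user-1", "proj-1")}


def check_screenplay_permission(user_id: str, screenplay_id: str) -> bool:
    """Resolve screenplay -> project, then apply the project's membership."""
    if screenplay_id in DELETED_SCREENPLAYS:
        return False  # soft-deleted screenplays are treated as missing
    project_id = SCREENPLAY_PROJECT.get(screenplay_id)
    if project_id is None:
        return False  # unknown screenplay: deny instead of raising
    return (user_id, project_id) in PROJECT_MEMBERS
```

Denying access when the screenplay is missing or soft-deleted mirrors the `return False` branch in the repository code, so callers never need to distinguish "not found" from "not allowed".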
3. Business Logic Layer (Services)
screenplay_service.py
Modified method: create_screenplay_from_file()
Before:
# Store the original file URL directly in screenplay.file_url
screenplay.file_url = uploaded_file.file_url  # ❌ ambiguous
After:
# 1. Create the screenplay record (file_url is empty for now)
screenplay = Screenplay(
    project_id=project_id,
    name=name,
    type=ScreenplayType.FILE,
    file_url=None,  # ✅ filled with the Markdown URL once parsing completes
    parsing_status='pending',
    status=ScreenplayStatus.DRAFT,
    created_by=user_id,
    updated_by=user_id
)
# 2. Upload the original file and create the Attachment record (polymorphic association)
file_metadata = await file_storage.upload_file(...)
attachment = Attachment(
    file_url=file_metadata.file_url,
    # created_screenplay is the persisted screenplay record; the step that saves it is not shown in this excerpt
    related_id=created_screenplay.screenplay_id,
    related_type=RelatedType.SCREENPLAY,
    attachment_purpose=AttachmentPurpose.SOURCE,  # ✅ marks the attachment as the original file
    uploaded_by=user_id
)
self.db.add(attachment)
await self.db.commit()
Scope of impact: the screenplay upload endpoint POST /api/v1/screenplays/upload
screenplay_file_parser_service.py
New functionality: Markdown file generation and upload
New methods:
- _format_as_markdown(content: str) -> str
  Converts the parsed text to Markdown format
- _upload_markdown_file(screenplay_id, markdown_content, user_id) -> str
  Uploads the Markdown file to OSS and returns its URL
Modified methods:
- parse_file() - asynchronous parsing
- parse_file_sync() - synchronous parsing
Core changes:
# 1. Load the screenplay to get its creator (used for the Markdown file upload)
screenplay = await self.repository.get_by_id(screenplay_id)
if not screenplay:
    raise ValueError(f"Screenplay does not exist: {screenplay_id}")
# 2. Generate the Markdown file and upload it to OSS (RFC 140)
markdown_content = self._format_as_markdown(content)
markdown_url = await self._upload_markdown_file(
    screenplay_id,
    markdown_content,
    user_id=screenplay.created_by
)
# 3. Update the screenplay record
await self.repository.update(screenplay_id, {
    'content': content,
    'file_url': markdown_url,  # ✅ stores the Markdown file URL (RFC 140)
    'word_count': word_count,
    'parsing_status': 3,  # 3=completed
    'parsed_at': datetime.now(timezone.utc)
})
Scope of impact: all screenplay file parsing flows
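The body of `_format_as_markdown` is not shown in this changelog. As an illustration only, the sketch below shows one way such a formatter could work: it adds a title heading, trims trailing whitespace, and collapses runs of blank lines. The function names, the `title` parameter, and the object key layout are all assumptions, not the service's actual behavior.

```python
def format_as_markdown(content: str, title: str = "Screenplay") -> str:
    """Hypothetical formatter: title heading followed by the normalized text."""
    lines = [line.rstrip() for line in content.strip().splitlines()]
    normalized = []
    for line in lines:
        # Collapse consecutive blank lines into a single blank line.
        if line == "" and normalized and normalized[-1] == "":
            continue
        normalized.append(line)
    body = "\n".join(normalized)
    return f"# {title}\n\n{body}\n"


def markdown_object_name(screenplay_id: str) -> str:
    """Hypothetical object key for the generated Markdown file."""
    return f"screenplays/{screenplay_id}/parsed.md"


sample = format_as_markdown("INT. ROOM\n\n\n\nA: Hi  \n", title="Demo")
```

Line breaks inside the text are preserved on purpose, since screenplay dialogue and scene headings are line-oriented.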
4. Async Task Layer (Tasks)
screenplay_tasks.py
Modified method: _parse_file_async()
Core changes:
async def _parse_file_async(screenplay_id: str, file_path: str, mime_type: str):
    async with async_session_maker() as db:
        from sqlmodel import select
        from app.models.attachment import Attachment, RelatedType, AttachmentPurpose
        # 1. Look up the original file attachment (via polymorphic association)
        result = await db.exec(
            select(Attachment)
            .where(Attachment.related_type == RelatedType.SCREENPLAY)
            .where(Attachment.related_id == screenplay_id)
            .where(Attachment.attachment_purpose == AttachmentPurpose.SOURCE)
        )
        source_attachment = result.first()
        if not source_attachment:
            raise ValueError(f"Screenplay has no associated original file: {screenplay_id}")
        # 2. Read the original file URL and MIME type from the attachment
        actual_file_path = source_attachment.file_url
        actual_mime_type = source_attachment.mime_type
        # 3. Parse the file
        parser_service = ScreenplayFileParserService(db)
        result = await parser_service.parse_file(
            screenplay_id=screenplay_id,
            file_path=actual_file_path,
            mime_type=actual_mime_type
        )
Scope of impact: the Celery asynchronous parsing task
5. Schema Layer (API Responses)
screenplay.py
New schema:
class SourceFileInfo(BaseModel):
    """Original file information (RFC 140)."""
    attachment_id: UUID = Field(..., alias="attachmentId")
    file_name: str = Field(..., alias="fileName")
    file_size: int = Field(..., alias="fileSize")
    mime_type: str = Field(..., alias="mimeType")
    file_url: str = Field(..., alias="fileUrl")
Modified schema: ScreenplayResponse
New field:
source_file: Optional[SourceFileInfo] = Field(
    default=None,
    alias="sourceFile",
    description="Original file information (RFC 140)"
)
Field semantics change:
# Updated description of an existing field
file_url: Optional[str] = Field(
    default=None,
    alias="fileUrl",
    description="Screenplay file URL (RFC 140: now the parsed Markdown file)"
)
Scope of impact:
- GET /api/v1/screenplays/{id} - screenplay detail
- GET /api/v1/screenplays - screenplay list
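6. URL Storage Strategy (Gradual Migration)
core/storage.py
New functionality: smart URL construction
def build_file_url(path_or_url: str, bucket_name: Optional[str] = None) -> str:
    """
    Build the full file URL (detects the input form, supports gradual migration).
    - If the input is already a full URL, return it unchanged (backward compatible)
    - If the input is a relative path, prepend the base URL
    """
    # Backward compatible: return full URLs unchanged
    if path_or_url and (path_or_url.startswith('http://') or path_or_url.startswith('https://')):
        return path_or_url
    # Build the full URL from the relative path
    if settings.S3_PUBLIC_URL:
        return f"{settings.S3_PUBLIC_URL}/{path_or_url}"
    else:
        # Falls back to the default S3 endpoint; assumes bucket_name is set by the caller
        return f"https://{bucket_name}.s3.{settings.S3_REGION}.amazonaws.com/{path_or_url}"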
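The URL-building rules above can be exercised in isolation. The following is a self-contained sketch, not the project's `core/storage.py`: the `StorageSettings` dataclass, its `S3_BUCKET` default, and the explicit `settings` parameter are assumptions introduced so the example runs without the application's configuration. It also normalizes stray slashes, which the original function does not do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageSettings:
    """Hypothetical stand-in for the application's settings object."""
    S3_PUBLIC_URL: Optional[str] = None
    S3_REGION: str = "us-east-1"
    S3_BUCKET: str = "example-bucket"  # assumed default bucket name


def build_file_url(
    path_or_url: str,
    settings: StorageSettings,
    bucket_name: Optional[str] = None,
) -> str:
    """Return a full URL for either a legacy full URL or a relative object path."""
    # Legacy rows already hold a full URL: pass them through unchanged.
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url
    key = path_or_url.lstrip("/")
    # Prefer the configured public base URL (for example a CDN domain).
    if settings.S3_PUBLIC_URL:
        return f"{settings.S3_PUBLIC_URL.rstrip('/')}/{key}"
    # Otherwise fall back to the default virtual-hosted S3 endpoint.
    bucket = bucket_name or settings.S3_BUCKET
    return f"https://{bucket}.s3.{settings.S3_REGION}.amazonaws.com/{key}"


legacy = build_file_url("https://old.example.com/a.docx", StorageSettings())
cdn = build_file_url(
    "/screenplays/a.md", StorageSettings(S3_PUBLIC_URL="https://cdn.example.com/")
)
s3 = build_file_url("a.md", StorageSettings(), bucket_name="b")
```

Because full URLs are returned untouched, old and new rows can coexist in the same column while the stored format moves to relative paths.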
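Modified functionality: StorageService.upload_bytes() now returns a relative path
async def upload_bytes(self, data: bytes, object_name: str, ...) -> str:
    """
    Returns:
        str: object storage path (relative path, without the domain)
    """
    self.client.put_object(...)
    # ✅ Return the relative path (after the refactor)
    return object_name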
Schema Computed Fields
New computed fields:
# app/schemas/screenplay.py
class ScreenplayResponse(BaseModel):
    file_url: Optional[str]  # the database stores a relative path

    @computed_field(alias="parsedFileUrl")
    @property
    def parsed_file_url(self) -> Optional[str]:
        """Build the full access URL on the fly."""
        return build_file_url(self.file_url) if self.file_url else None

# app/schemas/attachment.py
class AttachmentResponse(BaseModel):
    file_url: str

    @computed_field(alias="fullUrl")
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)

# app/schemas/file_checksum.py
class FileMetadata(BaseModel):
    file_url: str  # other fields omitted

    @computed_field
    @property
    def full_url(self) -> str:
        return build_file_url(self.file_url)
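Advantages:
- ✅ Gradual migration: old data (full URLs) keeps working, while new data uses relative paths
- ✅ Zero downtime: no database migration is needed; the change takes effect on deployment
- ✅ Domain independent: simplifies multi-environment deployment (dev/staging/prod)
- ✅ Extensible: supports future CDN switching and multi-region storage
Scope of impact:
- All file upload operations
- All file URL fields in API responses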
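📊 Data Flow Changes
Original Flow (Before)
User uploads a DOCX
↓
Stored directly in screenplay.file_url ❌ (ambiguous)
↓
Asynchronous parsing → updates the content field
Improved Flow (After)
User uploads a DOCX
↓
1. Create the Screenplay (file_url=None, parsing_status=pending)
↓
2. Upload the original file to OSS → create the Attachment record
- related_type=SCREENPLAY
- attachment_purpose=SOURCE
↓
3. Parse the original file asynchronously
↓
4. Generate the Markdown file → upload it to OSS
↓
5. Update Screenplay.file_url = markdown_url ✅ (clear)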
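The five steps of the improved flow can be traced end to end with a small in-memory simulation. Everything here is a hypothetical stand-in: dictionaries replace OSS and the database tables, the object keys are invented, and real DOCX/PDF parsing is replaced by a plain UTF-8 decode. Only the enum values 8 and 7 come from this change.

```python
from typing import Dict, List

object_store: Dict[str, bytes] = {}   # hypothetical stand-in for OSS
attachments: List[dict] = []          # hypothetical attachments table
screenplays: Dict[str, dict] = {}     # hypothetical screenplays table

SCREENPLAY, SOURCE = 8, 7             # enum values from RFC 140


def upload_screenplay(screenplay_id: str, filename: str, data: bytes) -> None:
    # Step 1: create the screenplay with no file_url yet.
    screenplays[screenplay_id] = {
        "file_url": None, "content": None, "parsing_status": "pending",
    }
    # Step 2: store the original file and record it as a SOURCE attachment.
    key = f"sources/{screenplay_id}/{filename}"
    object_store[key] = data
    attachments.append({
        "file_url": key, "related_type": SCREENPLAY,
        "related_id": screenplay_id, "attachment_purpose": SOURCE,
    })


def parse_screenplay(screenplay_id: str) -> None:
    # Step 3: locate the original file through the polymorphic association.
    source = next(
        a for a in attachments
        if a["related_type"] == SCREENPLAY
        and a["related_id"] == screenplay_id
        and a["attachment_purpose"] == SOURCE
    )
    text = object_store[source["file_url"]].decode("utf-8")  # parsing stub
    # Step 4: generate the Markdown file and upload it.
    md_key = f"parsed/{screenplay_id}.md"
    object_store[md_key] = f"# Screenplay\n\n{text}\n".encode("utf-8")
    # Step 5: point file_url at the Markdown file, not the original.
    screenplays[screenplay_id].update(
        content=text, file_url=md_key, parsing_status="completed",
    )


upload_screenplay("sp-1", "draft.docx", b"INT. ROOM - DAY")
parse_screenplay("sp-1")
```

After both calls, the original file is reachable only through the attachment record, while the screenplay's `file_url` refers to the generated Markdown file.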
⚠️ Breaking Changes
API Response Format Change
Affected endpoints:
- GET /api/v1/screenplays/{id}
- GET /api/v1/screenplays
Changes:
Before
{
  "fileUrl": "https://oss.example.com/original.docx", // ❌ meaning unclear
  "content": "parsed text"
}
After
{
  "fileUrl": "https://oss.example.com/screenplay_xxx.md", // ✅ parsed Markdown
  "content": "parsed text",
  "sourceFile": { // ✅ new: original file information
    "attachmentId": "xxx",
    "fileName": "original.docx",
    "fileSize": 102400,
    "mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "fileUrl": "https://oss.example.com/original.docx"
  }
}
Frontend compatibility:
- ✅ Backward compatible: the fileUrl field is retained (but its semantics change)
- ⚠️ Frontend changes required: to access the original file, use sourceFile.fileUrl
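For API consumers, the practical consequence is that the original file and the parsed Markdown now come from different fields. The sketch below shows how a client might read both from the new response shape. The helper names are hypothetical, and the sample payload simply reuses the example values above.

```python
from typing import Any, Dict, Optional


def original_file_url(screenplay: Dict[str, Any]) -> Optional[str]:
    """Return the original DOCX/PDF URL, or None if no source file is attached."""
    source = screenplay.get("sourceFile")
    if source:
        return source.get("fileUrl")
    return None


def parsed_markdown_url(screenplay: Dict[str, Any]) -> Optional[str]:
    """Return the parsed Markdown URL; None until parsing has completed."""
    return screenplay.get("fileUrl")


response = {
    "fileUrl": "https://oss.example.com/screenplay_xxx.md",
    "content": "parsed text",
    "sourceFile": {
        "attachmentId": "xxx",
        "fileName": "original.docx",
        "fileUrl": "https://oss.example.com/original.docx",
    },
}
```

Treating `sourceFile` as optional keeps the client working for screenplays that have no attached original file.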
🧪 Testing Impact
Tests That Need Updating
- Unit tests:
  - test_screenplay_service.py::test_create_screenplay_from_file - verifies Attachment creation
  - test_screenplay_file_parser_service.py::test_parse_file - verifies Markdown generation
  - test_attachment_repository.py::test_exists_related_entity - verifies SCREENPLAY type support
- Integration tests:
  - test_screenplay_upload_api.py - verifies the full upload flow
  - test_screenplay_parse_task.py - verifies the asynchronous parsing flow
Test Commands
# Run the related unit tests
pytest tests/unit/services/test_screenplay_service.py -v
pytest tests/unit/services/test_screenplay_file_parser_service.py -v
pytest tests/unit/repositories/test_attachment_repository.py -v
# Run the integration tests
pytest tests/integration/test_screenplay_upload_api.py -v
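🚀 Deployment Recommendations
Before Deployment
- ✅ No data migration required (development stage)
- ✅ Clear existing test data (recommended)
-- To clear test data, if needed
DELETE FROM attachments WHERE related_type = 8;
DELETE FROM screenplays WHERE parsing_status IN ('pending', 'processing');
Deployment Steps
- Stop the backend service
docker-compose stop server
- Update the code
git pull origin main
- Restart the service
docker-compose up -d server
- Verify the deployment
# 1. Upload a test file
curl -X POST http://localhost:6170/api/v1/screenplays/upload \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@test.docx"
# 2. Check that the Attachment record was created
# 3. After parsing completes, check that file_url points to a .md file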
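📝 Follow-up Improvements
- Schema refinement: expose the source_file field in ScreenplayResponse
- Frontend adaptation: update the frontend download logic to distinguish original files from Markdown files
- Documentation: update the API docs to describe the new response format
- Monitoring and alerting: add monitoring for Markdown upload failures
📚 Related Documents
👥 Contributors
- @panta - architecture design and implementation
Change type: ✨ Feature | 🔄 Refactor | ⚠️ Breaking Change
Scope of impact: 🗄️ Database | 🔌 API | 🎨 Frontend
Priority: 🔴 High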