Sources of the Group collections
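- group: Q&A entries extracted from the timeline list (each has both a question and an answer)
- group2: results crawled from the front-page list of the Q&A section
- group_full: the complete list obtained by iterating over every question before 1032140; the complete answer list was then crawled from it
  - The crawl ran on June 15, so more than 1,200 groups created afterwards do not have their complete lists crawled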
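These collections store only the groups (questions) themselves, not the answers. Every answer to a group is actually an item (type=ganswer) and is automatically filed into the item collection.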
Merging the Groups
Run on db.group:
[ { $project: { _id: 0, gid: {$toString: '$gid'}, create_time: 1, detail_: '$detail', } }, { $addFields: { detail: { group: '$detail_' } } }, { $project: { _id: 0, gid: 1, create_time: 1, detail: 1, } }, { $addFields: { XX_imported_from_timelinegroup: true } }, { $merge: { into: 'group_full', on: 'gid', whenMatched: 'keepExisting', whenNotMatched: 'insert' } } ]
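The final `$merge` stage uses `whenMatched: 'keepExisting'` and `whenNotMatched: 'insert'`, so a document already present in group_full is never overwritten and only gids missing from it are added. The sketch below models that behaviour in plain JavaScript without touching MongoDB. The helper `mergeKeepExisting` and the sample documents are hypothetical and exist only for illustration.

```javascript
// Plain-JavaScript model of $merge with whenMatched: 'keepExisting' and
// whenNotMatched: 'insert'. Illustration only; it does not talk to MongoDB.
function mergeKeepExisting(target, incoming, key) {
  const byKey = new Map(target.map((doc) => [doc[key], doc]));
  for (const doc of incoming) {
    if (!byKey.has(doc[key])) {
      byKey.set(doc[key], doc); // whenNotMatched: 'insert'
    }
    // whenMatched: 'keepExisting' leaves the target document untouched
  }
  return [...byKey.values()];
}

// Hypothetical sample documents (not real crawl data).
const groupFull = [
  { gid: '1', detail: { group: { title: 'from full crawl' } } },
];
const fromTimeline = [
  { gid: '1', detail: { group: { title: 'from timeline' } }, XX_imported_from_timelinegroup: true },
  { gid: '2', detail: { group: { title: 'only in timeline' } }, XX_imported_from_timelinegroup: true },
];

const merged = mergeKeepExisting(groupFull, fromTimeline, 'gid');
```

Under this model the import flag ends up only on documents that were actually inserted by the merge, which makes it possible to tell later which groups did not come from the full crawl.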
Run on db.group2:
[ { $project: { _id: 0, gid: { $toString: "$gid", }, create_time: 1, detail_: "$detail", }, }, { $unset: "detail_.since", }, { $unset: "detail_.action_list", }, { $addFields: { detail: { group: "$detail_", }, }, }, { $project: { _id: 0, gid: 1, create_time: 1, detail: 1, }, }, { $addFields: { XX_imported_from_group2: true, }, }, { $merge: { into: "group_full", on: "gid", whenMatched: "keepExisting", whenNotMatched: "insert", }, }, ]
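The two pipelines differ only in the import flag they set and in the `detail_` subfields that are dropped for group2. As a minimal sketch, both can be generated from one helper. `buildMergePipeline` is a hypothetical name introduced here, not part of the original notes. Note that `$merge` requires the `on` field (`gid`) to be covered by a unique index in group_full. Both pipelines also convert `gid` with `$toString`, which suggests, though the notes do not say so, that group_full stores `gid` as a string.

```javascript
// Builds the merge pipeline for one source collection.
// Only the flag name and the dropped detail_ subfields differ between sources.
function buildMergePipeline(flagName, unsetPaths = []) {
  return [
    // Keep only the needed fields and normalise gid to a string.
    { $project: { _id: 0, gid: { $toString: '$gid' }, create_time: 1, detail_: '$detail' } },
    // Drop source-specific subfields (none for db.group).
    ...unsetPaths.map((path) => ({ $unset: `detail_.${path}` })),
    // Nest the original detail under detail.group.
    { $addFields: { detail: { group: '$detail_' } } },
    { $project: { _id: 0, gid: 1, create_time: 1, detail: 1 } },
    // Mark where the document was imported from.
    { $addFields: { [flagName]: true } },
    { $merge: { into: 'group_full', on: 'gid', whenMatched: 'keepExisting', whenNotMatched: 'insert' } },
  ];
}

const groupPipeline = buildMergePipeline('XX_imported_from_timelinegroup');
const group2Pipeline = buildMergePipeline('XX_imported_from_group2', ['since', 'action_list']);

// In mongosh (assumption: connected to the database holding these collections):
// db.group.aggregate(groupPipeline);
// db.group2.aggregate(group2Pipeline);
```

Because both merges keep existing documents, the order of execution sets the priority: data already in group_full wins, then db.group, then db.group2.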