feat: Zhihu now supports crawling creator profile data (answers, articles, and videos)

Relakkes 2024-10-16 21:02:27 +08:00
parent af9d2d8e84
commit da8f1c62b8
8 changed files with 511 additions and 66 deletions

README.md

@@ -13,17 +13,8 @@
How it works: [playwright](https://playwright.dev/) acts as a bridge, keeping the browser context alive after a successful login so that encrypted parameters can be obtained by executing JS expressions.
This avoids having to reimplement the core encryption JS, greatly lowering the reverse-engineering difficulty.
-[MediaCrawlerPro](https://github.com/MediaCrawlerPro) has been released. Advantages over the open-source version:
-- Multi-account + IP proxy support (the key one)
-- No Playwright dependency, much simpler to use
-- Linux deployment support (Docker, docker-compose)
-- Refactored, easier-to-read and easier-to-maintain code (decoupled JS signing logic)
-- Well-designed architecture that is easier to extend, with greater value for studying the source code
MediaCrawler repository platinum sponsor:
<a href="https://dashboard.ipcola.com/register?referral_code=atxtupzfjhpbdbl">[IPCola exclusive global overseas IP proxies] ⚡ Fresh native residential proxies, excellent value, many hard-to-find countries</a>
-> [IPCola global overseas IP proxies] Register with Ajiang's exclusive referral code atxtupzfjhpbdbl to get a 10% top-up bonus.

## Feature list
| Platform | Keyword search | Crawl by post ID | Second-level comments | Creator profile | Login state cache | IP proxy pool | Comment word cloud |
@@ -36,8 +27,80 @@ MediaCrawler repository platinum sponsor:
| Tieba | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Zhihu | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
+## Create and activate a Python virtual environment
+> If you are crawling Douyin or Zhihu, install Node.js (version >= `16`) in advance <br>
+```shell
+# enter the project root
+cd MediaCrawler
+# create the virtual environment
+# my Python version is 3.9.6, and the libraries in requirements.txt are based on it; with other Python versions the libraries may be incompatible, so resolve that yourself.
+python -m venv venv
+# macOS & Linux: activate the virtual environment
+source venv/bin/activate
+# Windows: activate the virtual environment
+venv\Scripts\activate
+```
+## Install dependencies
+```shell
+pip install -r requirements.txt
+```
+## Install the Playwright browser driver
+```shell
+playwright install
+```
+## Run the crawler
+```shell
+### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
+### Other options can also be found in config/base_config.py, documented with comments
+# read keywords from the config file, search for matching posts, and crawl their info and comments
+python main.py --platform xhs --lt qrcode --type search
+# read the post ID list from the config file and crawl the info and comments of those posts
+python main.py --platform xhs --lt qrcode --type detail
+# open the corresponding app and scan the QR code to log in
+# for usage examples of the other platform crawlers, run
+python main.py --help
+```
+## Saving data
+- Save to a relational database (MySQL); create the database in advance
+    - Run `python db.py` to initialize the database table schema (first run only)
+- Save to CSV (under the data/ directory)
+- Save to JSON (under the data/ directory)
+## MediaCrawlerPro
+[MediaCrawlerPro](https://github.com/MediaCrawlerPro) has been refactored and released. Advantages over the open-source version:
+- Multi-account + IP proxy support (the key one)
+- No Playwright dependency, much simpler to use
+- Linux deployment support (Docker, docker-compose)
+- Refactored, easier-to-read and easier-to-maintain code (decoupled JS signing logic)
+- Higher code quality, friendlier for building larger crawler projects
+- Well-designed architecture that is easier to extend, with greater value for studying the source code
+## Other common questions are covered in the online docs
+>
+> The online docs cover usage, FAQs, how to join the project chat group, and more.
+> [MediaCrawler online docs](https://nanmicoder.github.io/MediaCrawler/)
+>
## Developer services
-> Open source is not easy. Please star the MediaCrawler repo and consider supporting my course and knowledge planet. Many thanks <br>
+> Open source is not easy. Please star the MediaCrawler repo. Many thanks <br>
+> If paid knowledge works for you, take a look at the paid services I offer below. If you are a student, let me know in advance for a discount 💰<br>
- MediaCrawler source-code walkthrough course
  If you want to get up to speed on this project quickly, or understand how it is actually implemented, I recommend the video course I recorded. It starts from the design and walks you through usage step by step, greatly lowering the barrier to entry.
@@ -65,12 +128,6 @@ MediaCrawler repository platinum sponsor:
- [Idempotency issues with Python coroutines under concurrency](https://articles.zsxq.com/id_wocdwsfmfcmp.html)
- [Hidden bugs caused by misusing Python mutable types](https://articles.zsxq.com/id_f7vn89l1d303.html)

-## Usage documentation
-> The MediaCrawler docs are built with vitepress and cover usage, FAQs, how to join the project chat group, and more.
->
-[MediaCrawler online docs](https://nanmicoder.github.io/MediaCrawler/)

## Thanks to the following sponsors
> [IPCola global overseas IP proxies] Register with Ajiang's exclusive referral code atxtupzfjhpbdbl to get a 10% top-up bonus.
@@ -80,6 +137,19 @@ MediaCrawler repository platinum sponsor:
Become a sponsor and have your product displayed here with heavy daily exposure. Contact the author on WeChat: yzglan or by email: relakkes@gmail.com
+## MediaCrawler WeChat discussion group
+👏👏👏 A place for crawler enthusiasts to learn and grow together.
+(Ads are not allowed in the group, nor are rule-breaking posts or questions unrelated to MediaCrawler)
+### How to join
+> Add the note "github" and the assistant bot will pull you into the group automatically.
+>
+> If the image does not display or has expired, add my WeChat directly: yzglan, with the note "github", and the assistant bot will pull you into the group
+![relakkes_wechat](docs/static/images/relakkes_weichat.jpg)

## Donations
If you find this project useful, feel free to leave a tip. Your support is my biggest motivation!

config/base_config.py

@@ -1,6 +1,6 @@
# Basic configuration
PLATFORM = "xhs"
-KEYWORDS = "编程副业,编程兼职"
+KEYWORDS = "编程副业,编程兼职"  # keyword search config, comma-separated
LOGIN_TYPE = "qrcode"  # qrcode or phone or cookie
COOKIES = ""
# see the enum values under media_platform.xxx.field; currently only supported for XiaoHongShu
@@ -45,8 +45,8 @@ MAX_CONCURRENCY_NUM = 1
# whether to crawl images; off by default
ENABLE_GET_IMAGES = False
# whether to crawl comments; on by default
-ENABLE_GET_COMMENTS = False
+ENABLE_GET_COMMENTS = True
# whether to crawl second-level comments; off by default
# if an older version of the project used the db, add the table field per schema/tables.sql line 287
@@ -130,6 +130,13 @@ KS_CREATOR_ID_LIST = [
    # ........................
]

+# list of Zhihu creator profile URLs
+ZHIHU_CREATOR_URL_LIST = [
+    "https://www.zhihu.com/people/yd1234567",
+    # ........................
+]
+
# word cloud settings
# whether to generate a word cloud image from comments
ENABLE_GET_WORDCLOUD = False
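For reference, the crawler (see the core.py hunks below) derives the `url_token` the Zhihu API expects from each configured profile URL by taking the last path segment:

```python
# How a configured Zhihu profile URL maps to the url_token used by the API;
# this mirrors the user_link.split("/")[-1] logic in ZhihuCrawler.get_creators_and_notes.
url = "https://www.zhihu.com/people/yd1234567"
url_token = url.split("/")[-1]
assert url_token == "yd1234567"
```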

media_platform/zhihu/client.py

@@ -5,18 +5,19 @@ from typing import Any, Callable, Dict, List, Optional, Union
from urllib.parse import urlencode

import httpx
+from httpx import Response
from playwright.async_api import BrowserContext, Page
from tenacity import retry, stop_after_attempt, wait_fixed

import config
from base.base_crawler import AbstractApiClient
from constant import zhihu as zhihu_constant
-from model.m_zhihu import ZhihuComment, ZhihuContent
+from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
from tools import utils

from .exception import DataFetchError, ForbiddenError
from .field import SearchSort, SearchTime, SearchType
-from .help import ZhiHuJsonExtractor, sign
+from .help import ZhihuExtractor, sign


class ZhiHuClient(AbstractApiClient):
@@ -33,7 +34,7 @@ class ZhiHuClient(AbstractApiClient):
        self.timeout = timeout
        self.default_headers = headers
        self.cookie_dict = cookie_dict
-        self._extractor = ZhiHuJsonExtractor()
+        self._extractor = ZhihuExtractor()

    async def _pre_headers(self, url: str) -> Dict:
        """
@@ -95,7 +96,7 @@ class ZhiHuClient(AbstractApiClient):
        raise DataFetchError(response.text)

-    async def get(self, uri: str, params=None) -> Dict:
+    async def get(self, uri: str, params=None, **kwargs) -> Union[Response, Dict, str]:
        """
        GET request with signed request headers
        Args:
@@ -109,7 +110,7 @@ class ZhiHuClient(AbstractApiClient):
        if isinstance(params, dict):
            final_uri += '?' + urlencode(params)
        headers = await self._pre_headers(final_uri)
-        return await self.request(method="GET", url=zhihu_constant.ZHIHU_URL + final_uri, headers=headers)
+        return await self.request(method="GET", url=zhihu_constant.ZHIHU_URL + final_uri, headers=headers, **kwargs)

    async def pong(self) -> bool:
        """
@@ -194,7 +195,7 @@ class ZhiHuClient(AbstractApiClient):
        }
        search_res = await self.get(uri, params)
        utils.logger.info(f"[ZhiHuClient.get_note_by_keyword] Search result: {search_res}")
-        return self._extractor.extract_contents(search_res)
+        return self._extractor.extract_contents_from_search(search_res)

    async def get_root_comments(self, content_id: str, content_type: str, offset: str = "", limit: int = 10,
                                order_by: str = "sort") -> Dict:
@@ -317,3 +318,170 @@ class ZhiHuClient(AbstractApiClient):
            all_sub_comments.extend(sub_comments)
            await asyncio.sleep(crawl_interval)
        return all_sub_comments
+
+    async def get_creator_info(self, url_token: str) -> Optional[ZhihuCreator]:
+        """
+        Fetch creator info
+        Args:
+            url_token: the creator's profile URL token
+        Returns:
+        """
+        uri = f"/people/{url_token}"
+        html_content: str = await self.get(uri, return_response=True)
+        return self._extractor.extract_creator(url_token, html_content)
+
+    async def get_creator_answers(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
+        """
+        Fetch one page of the creator's answers
+        Args:
+            url_token: the creator's profile URL token
+            offset: pagination offset
+            limit: page size
+        Returns:
+        """
+        uri = f"/api/v4/members/{url_token}/answers"
+        params = {
+            "include": "data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,attachment,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,excerpt,paid_info,reaction_instruction,is_labeled,label_info,relationship.is_authorized,voting,is_author,is_thanked,is_nothelp;data[*].vessay_info;data[*].author.badge[?(type=best_answerer)].topics;data[*].author.vip_info;data[*].question.has_publishing_draft,relationship",
+            "offset": offset,
+            "limit": limit,
+            "order_by": "created"
+        }
+        return await self.get(uri, params)
+
+    async def get_creator_articles(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
+        """
+        Fetch one page of the creator's articles
+        Args:
+            url_token: the creator's profile URL token
+            offset: pagination offset
+            limit: page size
+        Returns:
+        """
+        uri = f"/api/v4/members/{url_token}/articles"
+        params = {
+            "include": "data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,reaction_instruction,is_labeled,label_info;data[*].vessay_info;data[*].author.badge[?(type=best_answerer)].topics;data[*].author.vip_info;",
+            "offset": offset,
+            "limit": limit,
+            "order_by": "created"
+        }
+        return await self.get(uri, params)
+
+    async def get_creator_videos(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
+        """
+        Fetch one page of the creator's videos
+        Args:
+            url_token: the creator's profile URL token
+            offset: pagination offset
+            limit: page size
+        Returns:
+        """
+        uri = f"/api/v4/members/{url_token}/zvideos"
+        params = {
+            "include": "similar_zvideo,creation_relationship,reaction_instruction",
+            "offset": offset,
+            "limit": limit,
+            "similar_aggregation": "true"
+        }
+        return await self.get(uri, params)
+
+    async def get_all_answer_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
+                                        callback: Optional[Callable] = None) -> List[ZhihuContent]:
+        """
+        Fetch all answers of a creator
+        Args:
+            creator: creator info
+            crawl_interval: delay between page fetches, in seconds
+            callback: invoked after each page is fetched
+        Returns:
+        """
+        all_contents: List[ZhihuContent] = []
+        is_end: bool = False
+        offset: int = 0
+        limit: int = 20
+        while not is_end:
+            res = await self.get_creator_answers(creator.url_token, offset, limit)
+            if not res:
+                break
+            utils.logger.info(f"[ZhiHuClient.get_all_answer_by_creator] Get creator {creator.url_token} answers: {res}")
+            paging_info = res.get("paging", {})
+            is_end = paging_info.get("is_end")
+            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
+            if callback:
+                await callback(contents)
+            all_contents.extend(contents)
+            offset += limit
+            await asyncio.sleep(crawl_interval)
+        return all_contents
+
+    async def get_all_articles_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
+                                          callback: Optional[Callable] = None) -> List[ZhihuContent]:
+        """
+        Fetch all articles of a creator
+        Args:
+            creator: creator info
+            crawl_interval: delay between page fetches, in seconds
+            callback: invoked after each page is fetched
+        Returns:
+        """
+        all_contents: List[ZhihuContent] = []
+        is_end: bool = False
+        offset: int = 0
+        limit: int = 20
+        while not is_end:
+            res = await self.get_creator_articles(creator.url_token, offset, limit)
+            if not res:
+                break
+            paging_info = res.get("paging", {})
+            is_end = paging_info.get("is_end")
+            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
+            if callback:
+                await callback(contents)
+            all_contents.extend(contents)
+            offset += limit
+            await asyncio.sleep(crawl_interval)
+        return all_contents
+
+    async def get_all_videos_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
+                                        callback: Optional[Callable] = None) -> List[ZhihuContent]:
+        """
+        Fetch all videos of a creator
+        Args:
+            creator: creator info
+            crawl_interval: delay between page fetches, in seconds
+            callback: invoked after each page is fetched
+        Returns:
+        """
+        all_contents: List[ZhihuContent] = []
+        is_end: bool = False
+        offset: int = 0
+        limit: int = 20
+        while not is_end:
+            res = await self.get_creator_videos(creator.url_token, offset, limit)
+            if not res:
+                break
+            paging_info = res.get("paging", {})
+            is_end = paging_info.get("is_end")
+            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
+            if callback:
+                await callback(contents)
+            all_contents.extend(contents)
+            offset += limit
+            await asyncio.sleep(crawl_interval)
+        return all_contents
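The `request` helper that `get` now forwards `**kwargs` to is not part of this diff, so how `return_response=True` produces the HTML string consumed by `get_creator_info` is not visible here. A minimal sketch of the assumed contract (hypothetical, not the committed implementation):

```python
import httpx
from typing import Dict, Union


class ZhihuClientSketch:
    """Hypothetical outline of the request/return_response contract."""

    def __init__(self, timeout: int = 10, proxies=None):
        self.timeout = timeout
        self.proxies = proxies

    async def request(self, method: str, url: str, **kwargs) -> Union[str, Dict]:
        # pop the flag so it is not forwarded to httpx
        return_response: bool = kwargs.pop("return_response", False)
        async with httpx.AsyncClient(proxies=self.proxies) as client:
            response = await client.request(method, url, timeout=self.timeout, **kwargs)
        if return_response:
            return response.text  # e.g. the raw creator profile HTML
        return response.json()  # JSON API endpoints return a dict
```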

media_platform/zhihu/core.py

@@ -10,7 +10,7 @@ from playwright.async_api import (BrowserContext, BrowserType, Page,
import config
from base.base_crawler import AbstractCrawler
-from model.m_zhihu import ZhihuContent
+from model.m_zhihu import ZhihuContent, ZhihuCreator
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import zhihu as zhihu_store
from tools import utils
@@ -18,7 +18,7 @@ from var import crawler_type_var, source_keyword_var
from .client import ZhiHuClient
from .exception import DataFetchError
-from .help import ZhiHuJsonExtractor
+from .help import ZhihuExtractor
from .login import ZhiHuLogin
@@ -31,7 +31,7 @@ class ZhihuCrawler(AbstractCrawler):
        self.index_url = "https://www.zhihu.com"
        # self.user_agent = utils.get_user_agent()
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
-        self._extractor = ZhiHuJsonExtractor()
+        self._extractor = ZhihuExtractor()

    async def start(self) -> None:
        """
@@ -74,7 +74,7 @@ class ZhihuCrawler(AbstractCrawler):
            await self.zhihu_client.update_cookies(browser_context=self.browser_context)

            # Zhihu's search API only hands out usable cookies after the search page has been opened; the homepage alone is not enough
            utils.logger.info("[ZhihuCrawler.start] Navigating Zhihu to the search page to fetch its cookies; this takes about 5 seconds")
            await self.context_page.goto(f"{self.index_url}/search?q=python&search_source=Guess&utm_content=search_hot&type=content")
            await asyncio.sleep(5)
            await self.zhihu_client.update_cookies(browser_context=self.browser_context)
@@ -88,7 +88,7 @@ class ZhihuCrawler(AbstractCrawler):
                raise NotImplementedError
            elif config.CRAWLER_TYPE == "creator":
                # Get creator's information and their notes and comments
-                raise NotImplementedError
+                await self.get_creators_and_notes()
            else:
                pass
@@ -169,6 +169,53 @@ class ZhihuCrawler(AbstractCrawler):
            callback=zhihu_store.batch_update_zhihu_note_comments
        )
+
+    async def get_creators_and_notes(self) -> None:
+        """
+        Get creator's information and their notes and comments
+        Returns:
+        """
+        utils.logger.info("[ZhihuCrawler.get_creators_and_notes] Begin get zhihu creators")
+        for user_link in config.ZHIHU_CREATOR_URL_LIST:
+            utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Begin get creator {user_link}")
+            user_url_token = user_link.split("/")[-1]
+            # get creator detail info from web html content
+            creator_info: ZhihuCreator = await self.zhihu_client.get_creator_info(url_token=user_url_token)
+            if not creator_info:
+                utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Creator {user_url_token} not found")
+                continue
+
+            utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Creator info: {creator_info}")
+            await zhihu_store.save_creator(creator=creator_info)
+
+            # By default only answers are fetched; uncomment the blocks below for articles and videos
+            # Get all answer information of the creator
+            all_content_list = await self.zhihu_client.get_all_answer_by_creator(
+                creator=creator_info,
+                crawl_interval=random.random(),
+                callback=zhihu_store.batch_update_zhihu_contents
+            )
+
+            # Get all articles of the creator's contents
+            # all_content_list = await self.zhihu_client.get_all_articles_by_creator(
+            #     creator=creator_info,
+            #     crawl_interval=random.random(),
+            #     callback=zhihu_store.batch_update_zhihu_contents
+            # )
+
+            # Get all videos of the creator's contents
+            # all_content_list = await self.zhihu_client.get_all_videos_by_creator(
+            #     creator=creator_info,
+            #     crawl_interval=random.random(),
+            #     callback=zhihu_store.batch_update_zhihu_contents
+            # )
+
+            # Get all comments of the creator's contents
+            await self.batch_get_content_comments(all_content_list)
    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""

media_platform/zhihu/help.py

@@ -1,8 +1,10 @@
# -*- coding: utf-8 -*-
-from typing import Dict, List
+import json
+from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse

import execjs
+from parsel import Selector

from constant import zhihu as zhihu_constant
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
@@ -29,11 +31,11 @@ def sign(url: str, cookies: str) -> Dict:
    return ZHIHU_SGIN_JS.call("get_sign", url, cookies)


-class ZhiHuJsonExtractor:
+class ZhihuExtractor:
    def __init__(self):
        pass

-    def extract_contents(self, json_data: Dict) -> List[ZhihuContent]:
+    def extract_contents_from_search(self, json_data: Dict) -> List[ZhihuContent]:
        """
        extract zhihu contents
        Args:
@@ -45,21 +47,34 @@ class ZhiHuJsonExtractor:
        if not json_data:
            return []
-        result: List[ZhihuContent] = []
         search_result: List[Dict] = json_data.get("data", [])
         search_result = [s_item for s_item in search_result if s_item.get("type") in ['search_result', 'zvideo']]
-        for sr_item in search_result:
-            sr_object: Dict = sr_item.get("object", {})
-            if sr_object.get("type") == zhihu_constant.ANSWER_NAME:
-                result.append(self._extract_answer_content(sr_object))
-            elif sr_object.get("type") == zhihu_constant.ARTICLE_NAME:
-                result.append(self._extract_article_content(sr_object))
-            elif sr_object.get("type") == zhihu_constant.VIDEO_NAME:
-                result.append(self._extract_zvideo_content(sr_object))
-            else:
-                continue
-        return result
+        return self._extract_content_list([sr_item.get("object") for sr_item in search_result if sr_item.get("object")])
+
+    def _extract_content_list(self, content_list: List[Dict]) -> List[ZhihuContent]:
+        """
+        extract zhihu content list
+        Args:
+            content_list: raw content dicts (answer / article / zvideo)
+        Returns:
+        """
+        if not content_list:
+            return []
+
+        res: List[ZhihuContent] = []
+        for content in content_list:
+            if content.get("type") == zhihu_constant.ANSWER_NAME:
+                res.append(self._extract_answer_content(content))
+            elif content.get("type") == zhihu_constant.ARTICLE_NAME:
+                res.append(self._extract_article_content(content))
+            elif content.get("type") == zhihu_constant.VIDEO_NAME:
+                res.append(self._extract_zvideo_content(content))
+            else:
+                continue
+        return res
    def _extract_answer_content(self, answer: Dict) -> ZhihuContent:
        """
@@ -72,22 +87,23 @@ class ZhiHuJsonExtractor:
        res = ZhihuContent()
        res.content_id = answer.get("id")
        res.content_type = answer.get("type")
-        res.content_text = extract_text_from_html(answer.get("content"))
+        res.content_text = extract_text_from_html(answer.get("content", ""))
        res.question_id = answer.get("question").get("id")
        res.content_url = f"{zhihu_constant.ZHIHU_URL}/question/{res.question_id}/answer/{res.content_id}"
-        res.title = extract_text_from_html(answer.get("title"))
-        res.desc = extract_text_from_html(answer.get("description"))
+        res.title = extract_text_from_html(answer.get("title", ""))
+        res.desc = extract_text_from_html(answer.get("description", "") or answer.get("excerpt", ""))
        res.created_time = answer.get("created_time")
        res.updated_time = answer.get("updated_time")
-        res.voteup_count = answer.get("voteup_count")
-        res.comment_count = answer.get("comment_count")
+        res.voteup_count = answer.get("voteup_count", 0)
+        res.comment_count = answer.get("comment_count", 0)

        # extract author info
-        author_info = self._extract_author(answer.get("author"))
+        author_info = self._extract_content_or_comment_author(answer.get("author"))
        res.user_id = author_info.user_id
        res.user_link = author_info.user_link
        res.user_nickname = author_info.user_nickname
        res.user_avatar = author_info.user_avatar
+        res.user_url_token = author_info.url_token
        return res
    def _extract_article_content(self, article: Dict) -> ZhihuContent:
@@ -106,17 +122,18 @@ class ZhiHuJsonExtractor:
        res.content_url = f"{zhihu_constant.ZHIHU_URL}/p/{res.content_id}"
        res.title = extract_text_from_html(article.get("title"))
        res.desc = extract_text_from_html(article.get("excerpt"))
-        res.created_time = article.get("created_time")
-        res.updated_time = article.get("updated_time")
-        res.voteup_count = article.get("voteup_count")
-        res.comment_count = article.get("comment_count")
+        res.created_time = article.get("created_time", 0) or article.get("created", 0)
+        res.updated_time = article.get("updated_time", 0) or article.get("updated", 0)
+        res.voteup_count = article.get("voteup_count", 0)
+        res.comment_count = article.get("comment_count", 0)

        # extract author info
-        author_info = self._extract_author(article.get("author"))
+        author_info = self._extract_content_or_comment_author(article.get("author"))
        res.user_id = author_info.user_id
        res.user_link = author_info.user_link
        res.user_nickname = author_info.user_nickname
        res.user_avatar = author_info.user_avatar
+        res.user_url_token = author_info.url_token
        return res
    def _extract_zvideo_content(self, zvideo: Dict) -> ZhihuContent:
@@ -129,25 +146,34 @@ class ZhiHuJsonExtractor:
        """
        res = ZhihuContent()

-        res.content_id = zvideo.get("zvideo_id")
+        if "video" in zvideo and isinstance(zvideo.get("video"), dict):  # the item comes from the creator profile's video-list API
+            res.content_id = zvideo.get("video").get("video_id")
+            res.content_url = f"{zhihu_constant.ZHIHU_URL}/zvideo/{res.content_id}"
+            res.created_time = zvideo.get("published_at")
+            res.updated_time = zvideo.get("updated_at")
+        else:
+            res.content_id = zvideo.get("zvideo_id")
+            res.content_url = zvideo.get("video_url")
+            res.created_time = zvideo.get("created_at")
+
        res.content_type = zvideo.get("type")
-        res.content_url = zvideo.get("video_url")
        res.title = extract_text_from_html(zvideo.get("title"))
        res.desc = extract_text_from_html(zvideo.get("description"))
-        res.created_time = zvideo.get("created_at")
        res.voteup_count = zvideo.get("voteup_count")
        res.comment_count = zvideo.get("comment_count")

        # extract author info
-        author_info = self._extract_author(zvideo.get("author"))
+        author_info = self._extract_content_or_comment_author(zvideo.get("author"))
        res.user_id = author_info.user_id
        res.user_link = author_info.user_link
        res.user_nickname = author_info.user_nickname
        res.user_avatar = author_info.user_avatar
+        res.user_url_token = author_info.url_token
        return res
    @staticmethod
-    def _extract_author(author: Dict) -> ZhihuCreator:
+    def _extract_content_or_comment_author(author: Dict) -> ZhihuCreator:
        """
        extract zhihu author
        Args:
@@ -165,6 +191,7 @@ class ZhiHuJsonExtractor:
        res.user_link = f"{zhihu_constant.ZHIHU_URL}/people/{author.get('url_token')}"
        res.user_nickname = author.get("name")
        res.user_avatar = author.get("avatar_url")
+        res.url_token = author.get("url_token")
        return res
    def extract_comments(self, page_content: ZhihuContent, comments: List[Dict]) -> List[ZhihuComment]:
@@ -209,7 +236,7 @@ class ZhiHuJsonExtractor:
        res.content_type = page_content.content_type

        # extract author info
-        author_info = self._extract_author(comment.get("author"))
+        author_info = self._extract_content_or_comment_author(comment.get("author"))
        res.user_id = author_info.user_id
        res.user_link = author_info.user_link
        res.user_nickname = author_info.user_nickname
@@ -254,3 +281,80 @@ class ZhiHuJsonExtractor:
        query_params = parse_qs(parsed_url.query)
        offset = query_params.get('offset', [""])[0]
        return offset
+
+    @staticmethod
+    def _format_gender_text(gender: int) -> str:
+        """
+        format gender text
+        Args:
+            gender: gender code from the profile JSON
+        Returns:
+        """
+        if gender == 1:
+            return "男"
+        elif gender == 0:
+            return "女"
+        else:
+            return "未知"
+
+    def extract_creator(self, user_url_token: str, html_content: str) -> Optional[ZhihuCreator]:
+        """
+        extract zhihu creator
+        Args:
+            user_url_token: zhihu creator url token
+            html_content: zhihu creator html content
+        Returns:
+        """
+        if not html_content:
+            return None
+
+        js_init_data = Selector(text=html_content).xpath("//script[@id='js-initialData']/text()").get(default="").strip()
+        if not js_init_data:
+            return None
+
+        js_init_data_dict: Dict = json.loads(js_init_data)
+        users_info: Dict = js_init_data_dict.get("initialState", {}).get("entities", {}).get("users", {})
+        if not users_info:
+            return None
+
+        creator_info: Dict = users_info.get(user_url_token)
+        if not creator_info:
+            return None
+
+        res = ZhihuCreator()
+        res.user_id = creator_info.get("id")
+        res.user_link = f"{zhihu_constant.ZHIHU_URL}/people/{user_url_token}"
+        res.user_nickname = creator_info.get("name")
+        res.user_avatar = creator_info.get("avatarUrl")
+        res.url_token = creator_info.get("urlToken") or user_url_token
+        res.gender = self._format_gender_text(creator_info.get("gender"))
+        res.ip_location = creator_info.get("ipInfo")
+        res.follows = creator_info.get("followingCount")
+        res.fans = creator_info.get("followerCount")
+        res.answer_count = creator_info.get("answerCount")
+        res.video_count = creator_info.get("zvideoCount")
+        res.question_count = creator_info.get("questionCount")
+        res.article_count = creator_info.get("articlesCount")
+        res.column_count = creator_info.get("columnsCount")
+        res.get_voteup_count = creator_info.get("voteupCount")
+        return res
+
+    def extract_content_list_from_creator(self, answer_list: List[Dict]) -> List[ZhihuContent]:
+        """
+        extract content list from creator
+        Args:
+            answer_list: content dicts returned by the creator paging APIs
+        Returns:
+        """
+        if not answer_list:
+            return []
+        return self._extract_content_list(answer_list)
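A minimal sketch of the page structure `extract_creator` relies on. The real profile page embeds a much larger JSON blob in the `js-initialData` script tag; the trimmed-down HTML and values below are invented for illustration, but the key path (`initialState.entities.users`) and field names are the ones the committed code reads, and the module path in the import is an assumption:

```python
from media_platform.zhihu.help import ZhihuExtractor  # assumed module path

# hypothetical, trimmed-down profile HTML for illustration only
html = """
<html><body>
<script id="js-initialData" type="text/json">
{"initialState": {"entities": {"users": {
  "yd1234567": {"id": "abc123", "name": "SomeUser", "urlToken": "yd1234567",
                "avatarUrl": "https://pic1.zhimg.com/xxx.jpg", "gender": 1,
                "ipInfo": "上海", "followingCount": 10, "followerCount": 100,
                "answerCount": 5, "zvideoCount": 0, "questionCount": 1,
                "articlesCount": 2, "columnsCount": 0, "voteupCount": 300}
}}}}
</script>
</body></html>
"""

creator = ZhihuExtractor().extract_creator("yd1234567", html)
assert creator is not None and creator.url_token == "yd1234567"
assert creator.fans == 100 and creator.answer_count == 5
```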

model/m_zhihu.py

@@ -15,8 +15,8 @@ class ZhihuContent(BaseModel):
    question_id: str = Field(default="", description="question ID; set when type is answer")
    title: str = Field(default="", description="content title")
    desc: str = Field(default="", description="content description")
-    created_time: int = Field(default="", description="creation time")
-    updated_time: int = Field(default="", description="update time")
+    created_time: int = Field(default=0, description="creation time")
+    updated_time: int = Field(default=0, description="update time")
    voteup_count: int = Field(default=0, description="number of upvotes")
    comment_count: int = Field(default=0, description="number of comments")
    source_keyword: str = Field(default="", description="source keyword")
@@ -25,6 +25,7 @@ class ZhihuContent(BaseModel):
    user_link: str = Field(default="", description="user profile link")
    user_nickname: str = Field(default="", description="user nickname")
    user_avatar: str = Field(default="", description="user avatar URL")
+    user_url_token: str = Field(default="", description="user url_token")


class ZhihuComment(BaseModel):
@@ -57,7 +58,15 @@ class ZhihuCreator(BaseModel):
    user_link: str = Field(default="", description="user profile link")
    user_nickname: str = Field(default="", description="user nickname")
    user_avatar: str = Field(default="", description="user avatar URL")
+    url_token: str = Field(default="", description="user url_token")
    gender: str = Field(default="", description="user gender")
    ip_location: Optional[str] = Field(default="", description="IP location")
    follows: int = Field(default=0, description="number of followings")
    fans: int = Field(default=0, description="number of followers")
+    answer_count: int = Field(default=0, description="number of answers")
+    video_count: int = Field(default=0, description="number of videos")
+    question_count: int = Field(default=0, description="number of questions")
+    article_count: int = Field(default=0, description="number of articles")
+    column_count: int = Field(default=0, description="number of columns")
+    get_voteup_count: int = Field(default=0, description="total upvotes received")
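The store layer persists these models via `model_dump()` (see `save_creator` below), so field names must line up one-to-one with the new `zhihu_creator` columns; a quick sketch of that mapping:

```python
from model.m_zhihu import ZhihuCreator

creator = ZhihuCreator(url_token="yd1234567", fans=100, answer_count=5)
row = creator.model_dump()  # pydantic v2; keys mirror the zhihu_creator table columns
assert row["url_token"] == "yd1234567" and row["answer_count"] == 5
```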

schema/tables.sql

@@ -474,6 +474,7 @@ CREATE TABLE `zhihu_content` (
    `user_link` varchar(255) NOT NULL COMMENT 'user profile link',
    `user_nickname` varchar(64) NOT NULL COMMENT 'user nickname',
    `user_avatar` varchar(255) NOT NULL COMMENT 'user avatar URL',
+    `user_url_token` varchar(255) NOT NULL COMMENT 'user url_token',
    `add_ts` bigint NOT NULL COMMENT 'record added timestamp',
    `last_modify_ts` bigint NOT NULL COMMENT 'record last modified timestamp',
    PRIMARY KEY (`id`),
@@ -482,6 +483,7 @@ CREATE TABLE `zhihu_content` (
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Zhihu content (answers, articles, videos)';

CREATE TABLE `zhihu_comment` (
    `id` int NOT NULL AUTO_INCREMENT COMMENT 'auto-increment ID',
    `comment_id` varchar(64) NOT NULL COMMENT 'comment ID',
@@ -513,10 +515,17 @@ CREATE TABLE `zhihu_creator` (
    `user_link` varchar(255) NOT NULL COMMENT 'user profile link',
    `user_nickname` varchar(64) NOT NULL COMMENT 'user nickname',
    `user_avatar` varchar(255) NOT NULL COMMENT 'user avatar URL',
+    `url_token` varchar(64) NOT NULL COMMENT 'user URL token',
    `gender` varchar(16) DEFAULT NULL COMMENT 'user gender',
    `ip_location` varchar(64) DEFAULT NULL COMMENT 'IP location',
-    `follows` int NOT NULL DEFAULT '0' COMMENT 'number of followings',
-    `fans` int NOT NULL DEFAULT '0' COMMENT 'number of followers',
+    `follows` int NOT NULL DEFAULT 0 COMMENT 'number of followings',
+    `fans` int NOT NULL DEFAULT 0 COMMENT 'number of followers',
+    `answer_count` int NOT NULL DEFAULT 0 COMMENT 'number of answers',
+    `video_count` int NOT NULL DEFAULT 0 COMMENT 'number of videos',
+    `question_count` int NOT NULL DEFAULT 0 COMMENT 'number of questions',
+    `article_count` int NOT NULL DEFAULT 0 COMMENT 'number of articles',
+    `column_count` int NOT NULL DEFAULT 0 COMMENT 'number of columns',
+    `get_voteup_count` int NOT NULL DEFAULT 0 COMMENT 'total upvotes received',
    `add_ts` bigint NOT NULL COMMENT 'record added timestamp',
    `last_modify_ts` bigint NOT NULL COMMENT 'record last modified timestamp',
    PRIMARY KEY (`id`),

store/zhihu/__init__.py

@@ -3,7 +3,7 @@ from typing import List
import config
from base.base_crawler import AbstractStore
-from model.m_zhihu import ZhihuComment, ZhihuContent
+from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
from store.zhihu.zhihu_store_impl import (ZhihuCsvStoreImplement,
                                          ZhihuDbStoreImplement,
                                          ZhihuJsonStoreImplement)
@@ -25,6 +25,21 @@ class ZhihuStoreFactory:
            raise ValueError("[ZhihuStoreFactory.create_store] Invalid save option only supported csv or db or json ...")
        return store_class()
+
+async def batch_update_zhihu_contents(contents: List[ZhihuContent]):
+    """
+    Batch-update zhihu contents
+    Args:
+        contents: content models to persist
+    Returns:
+    """
+    if not contents:
+        return
+
+    for content_item in contents:
+        await update_zhihu_content(content_item)
+

async def update_zhihu_content(content_item: ZhihuContent):
    """
    Update zhihu content
@@ -71,3 +86,19 @@ async def update_zhihu_content_comment(comment_item: ZhihuComment):
    local_db_item.update({"last_modify_ts": utils.get_current_timestamp()})
    utils.logger.info(f"[store.zhihu.update_zhihu_note_comment] zhihu content comment:{local_db_item}")
    await ZhihuStoreFactory.create_store().store_comment(local_db_item)
+
+async def save_creator(creator: ZhihuCreator):
+    """
+    Save zhihu creator info
+    Args:
+        creator: creator model to persist
+    Returns:
+    """
+    if not creator:
+        return
+    local_db_item = creator.model_dump()
+    local_db_item.update({"last_modify_ts": utils.get_current_timestamp()})
+    await ZhihuStoreFactory.create_store().store_creator(local_db_item)
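`save_creator` depends on each store implementation exposing a `store_creator` method; `zhihu_store_impl.py` is not among the 8 changed files, so that method presumably already exists on the CSV/DB/JSON implementations. A hypothetical outline of the contract:

```python
from typing import Dict


class ZhihuStoreSketch:
    """Hypothetical outline of the store_creator contract, not the committed implementation."""

    async def store_creator(self, creator_item: Dict):
        # creator_item is ZhihuCreator.model_dump() plus last_modify_ts,
        # so its keys line up with the zhihu_creator table columns
        print(f"would persist creator row: {creator_item}")
```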