完成词云图生成函数并添加至存储逻辑中

2024-06-12 15:33:39 +08:00 · 2024-06-12 15:33:39 +08:00 · 7048f040c9
parent 3c7c678d7a
commit 7048f040c9
12 changed files with 959 additions and 40 deletions
--- a/README.md
+++ b/README.md
@ -17,13 +17,13 @@
 ## 功能列表
 > 下面不支持的项目，相关的代码架构已经搭建好，只需要实现对应的方法即可，欢迎大家提交PR

-| 平台  | 关键词搜索 | 指定帖子ID爬取 | 二级评论 | 指定创作者主页 | 登录态缓存 | IP代理池 |
-|-----|-------|----------|-----|--------|-------|-------|
-| 小红书 | ✅     | ✅        | ✅   | ✅      | ✅     | ✅     |
-| 抖音  | ✅     | ✅        | ✅    | ✅       | ✅     | ✅     |
-| 快手  | ✅     | ✅        | ❌   | ❌      | ✅     | ✅     |
-| B 站 | ✅     | ✅        | ✅   | ❌      | ✅     | ✅     |
-| 微博  | ✅     | ✅        | ❌   | ❌      | ✅     | ✅     |
+| 平台  | 关键词搜索 | 指定帖子ID爬取 | 二级评论 | 指定创作者主页 | 登录态缓存 | IP代理池 | 生成评论词云图 |
+|-----|-------|----------|-----|--------|-------|-------|-------|
+| 小红书 | ✅     | ✅        | ✅   | ✅      | ✅     | ✅     | ✅    |
+| 抖音  | ✅     | ✅        | ✅    | ✅       | ✅     | ✅     | ✅    |
+| 快手  | ✅     | ✅        | ❌   | ❌      | ✅     | ✅     | ✅    |
+| B 站 | ✅     | ✅        | ✅   | ❌      | ✅     | ✅     | ✅    |
+| 微博  | ✅     | ✅        | ❌   | ❌      | ✅     | ✅     | ✅    |


 ## 使用方法
@ -186,4 +186,3 @@



-
--- a/config/base_config.py
+++ b/config/base_config.py
@ -95,3 +95,19 @@ DY_CREATOR_ID_LIST = [
    "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
    # ........................
 ]
+
+#词云相关
+#是否开启生成评论词云图
+ENABLE_GET_WORDCLOUD = False
+# 自定义词语及其分组
+#添加规则：xx:yy 其中xx为自定义添加的词组，yy为将xx该词组分到的组名。
+CUSTOM_WORDS = {
+    '零几': '年份',  # 将“零几”识别为一个整体
+    '高频词': '专业术语'  # 示例自定义词
+}
+
+#停用(禁用)词文件路径
+STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
+
+#中文字体文件路径
+FONT_PATH= "./docs/STZHONGS.TTF"
--- a/docs/STZHONGS.TTF
+++ b/docs/STZHONGS.TTF
--- a/docs/hit_stopwords.txt
+++ b/docs/hit_stopwords.txt
@ -0,0 +1,768 @@
+\n
+———
+》），
+）÷（１－
+”，
+）、
+＝（
+:
+→
+℃ 
+&
+*
+一一
+~~~~
+’
+. 
+『
+.一
+./
+-- 
+』
+＝″
+【
+［＊］
+｝＞
+［⑤］］
+［①Ｄ］
+ｃ］
+ｎｇ昉
+＊
+//
+［
+］
+［②ｅ］
+［②ｇ］
+＝｛
+}
+，也 
+‘
+Ａ
+［①⑥］
+［②Ｂ］ 
+［①ａ］
+［④ａ］
+［①③］
+［③ｈ］
+③］
+１． 
+－－ 
+［②ｂ］
+’‘ 
+××× 
+［①⑧］
+０：２ 
+＝［
+［⑤ｂ］
+［②ｃ］ 
+［④ｂ］
+［②③］
+［③ａ］
+［④ｃ］
+［①⑤］
+［①⑦］
+［①ｇ］
+∈［ 
+［①⑨］
+［①④］
+［①ｃ］
+［②ｆ］
+［②⑧］
+［②①］
+［①Ｃ］
+［③ｃ］
+［③ｇ］
+［②⑤］
+［②②］
+一.
+［①ｈ］
+.数
+［］
+［①Ｂ］
+数/
+［①ｉ］
+［③ｅ］
+［①①］
+［④ｄ］
+［④ｅ］
+［③ｂ］
+［⑤ａ］
+［①Ａ］
+［②⑧］
+［②⑦］
+［①ｄ］
+［②ｊ］
+〕〔
+］［
+://
+′∈
+［②④
+［⑤ｅ］
+１２％
+ｂ］
+...
+...................
+…………………………………………………③
+ＺＸＦＩＴＬ
+［③Ｆ］
+」
+［①ｏ］
+］∧′＝［ 
+∪φ∈
+′｜
+｛－
+②ｃ
+｝
+［③①］
+Ｒ．Ｌ．
+［①Ｅ］
+Ψ
+－［＊］－
+↑
+.日 
+［②ｄ］
+［②
+［②⑦］
+［②②］
+［③ｅ］
+［①ｉ］
+［①Ｂ］
+［①ｈ］
+［①ｄ］
+［①ｇ］
+［①②］
+［②ａ］
+ｆ］
+［⑩］
+ａ］
+［①ｅ］
+［②ｈ］
+［②⑥］
+［③ｄ］
+［②⑩］
+ｅ］
+〉
+】
+元／吨
+［②⑩］
+２．３％
+５：０  
+［①］
+::
+［②］
+［③］
+［④］
+［⑤］
+［⑥］
+［⑦］
+［⑧］
+［⑨］ 
+……
+——
+?
+、
+。
+“
+”
+《
+》
+！
+，
+：
+；
+？
+．
+,
+．
+'
+? 
+·
+———
+──
+? 
+—
+<
+>
+（
+）
+〔
+〕
+[
+]
+(
+)
+-
+
+～
+×
+／
+/
+①
+②
+③
+④
+⑤
+⑥
+⑦
+⑧
+⑨
+⑩
+Ⅲ
+В
+"
+;
+#
+@
+γ
+μ
+φ
+φ．
+× 
+Δ
+■
+▲
+sub
+exp 
+sup
+sub
+Lex 
+＃
+％
+＆
+＇
+＋
+＋ξ
+＋＋
+－
+－β
+＜
+＜±
+＜Δ
+＜λ
+＜φ
+＜＜
+=
+＝
+＝☆
+＝－
+＞
+＞λ
+＿
+～±
+～＋
+［⑤ｆ］
+［⑤ｄ］
+［②ｉ］
+≈ 
+［②Ｇ］
+［①ｆ］
+ＬＩ
+㈧ 
+［－
+......
+〉
+［③⑩］
+第二
+一番
+一直
+一个
+一些
+许多
+种
+有的是
+也就是说
+末##末
+啊
+阿
+哎
+哎呀
+哎哟
+唉
+俺
+俺们
+按
+按照
+吧
+吧哒
+把
+罢了
+被
+本
+本着
+比
+比方
+比如
+鄙人
+彼
+彼此
+边
+别
+别的
+别说
+并
+并且
+不比
+不成
+不单
+不但
+不独
+不管
+不光
+不过
+不仅
+不拘
+不论
+不怕
+不然
+不如
+不特
+不惟
+不问
+不只
+朝
+朝着
+趁
+趁着
+乘
+冲
+除
+除此之外
+除非
+除了
+此
+此间
+此外
+从
+从而
+打
+待
+但
+但是
+当
+当着
+到
+得
+的
+的话
+等
+等等
+地
+第
+叮咚
+对
+对于
+多
+多少
+而
+而况
+而且
+而是
+而外
+而言
+而已
+尔后
+反过来
+反过来说
+反之
+非但
+非徒
+否则
+嘎
+嘎登
+该
+赶
+个
+各
+各个
+各位
+各种
+各自
+给
+根据
+跟
+故
+故此
+固然
+关于
+管
+归
+果然
+果真
+过
+哈
+哈哈
+呵
+和
+何
+何处
+何况
+何时
+嘿
+哼
+哼唷
+呼哧
+乎
+哗
+还是
+还有
+换句话说
+换言之
+或
+或是
+或者
+极了
+及
+及其
+及至
+即
+即便
+即或
+即令
+即若
+即使
+几
+几时
+己
+既
+既然
+既是
+继而
+加之
+假如
+假若
+假使
+鉴于
+将
+较
+较之
+叫
+接着
+结果
+借
+紧接着
+进而
+尽
+尽管
+经
+经过
+就
+就是
+就是说
+据
+具体地说
+具体说来
+开始
+开外
+靠
+咳
+可
+可见
+可是
+可以
+况且
+啦
+来
+来着
+离
+例如
+哩
+连
+连同
+两者
+了
+临
+另
+另外
+另一方面
+论
+嘛
+吗
+慢说
+漫说
+冒
+么
+每
+每当
+们
+莫若
+某
+某个
+某些
+拿
+哪
+哪边
+哪儿
+哪个
+哪里
+哪年
+哪怕
+哪天
+哪些
+哪样
+那
+那边
+那儿
+那个
+那会儿
+那里
+那么
+那么些
+那么样
+那时
+那些
+那样
+乃
+乃至
+呢
+能
+你
+你们
+您
+宁
+宁可
+宁肯
+宁愿
+哦
+呕
+啪达
+旁人
+呸
+凭
+凭借
+其
+其次
+其二
+其他
+其它
+其一
+其余
+其中
+起
+起见
+起见
+岂但
+恰恰相反
+前后
+前者
+且
+然而
+然后
+然则
+让
+人家
+任
+任何
+任凭
+如
+如此
+如果
+如何
+如其
+如若
+如上所述
+若
+若非
+若是
+啥
+上下
+尚且
+设若
+设使
+甚而
+甚么
+甚至
+省得
+时候
+什么
+什么样
+使得
+是
+是的
+首先
+谁
+谁知
+顺
+顺着
+似的
+虽
+虽然
+虽说
+虽则
+随
+随着
+所
+所以
+他
+他们
+他人
+它
+它们
+她
+她们
+倘
+倘或
+倘然
+倘若
+倘使
+腾
+替
+通过
+同
+同时
+哇
+万一
+往
+望
+为
+为何
+为了
+为什么
+为着
+喂
+嗡嗡
+我
+我们
+呜
+呜呼
+乌乎
+无论
+无宁
+毋宁
+嘻
+吓
+相对而言
+像
+向
+向着
+嘘
+呀
+焉
+沿
+沿着
+要
+要不
+要不然
+要不是
+要么
+要是
+也
+也罢
+也好
+一
+一般
+一旦
+一方面
+一来
+一切
+一样
+一则
+依
+依照
+矣
+以
+以便
+以及
+以免
+以至
+以至于
+以致
+抑或
+因
+因此
+因而
+因为
+哟
+用
+由
+由此可见
+由于
+有
+有的
+有关
+有些
+又
+于
+于是
+于是乎
+与
+与此同时
+与否
+与其
+越是
+云云
+哉
+再说
+再者
+在
+在下
+咱
+咱们
+则
+怎
+怎么
+怎么办
+怎么样
+怎样
+咋
+照
+照着
+者
+这
+这边
+这儿
+这个
+这会儿
+这就是说
+这里
+这么
+这么点儿
+这么些
+这么样
+这时
+这些
+这样
+正如
+吱
+之
+之类
+之所以
+之一
+只是
+只限
+只要
+只有
+至
+至于
+诸位
+着
+着呢
+自
+自从
+自个儿
+自各儿
+自己
+自家
+自身
+综上所述
+总的来看
+总的来说
+总的说来
+总而言之
+总之
+纵
+纵令
+纵然
+纵使
+遵照
+作为
+兮
+呃
+呗
+咚
+咦
+喏
+啐
+喔唷
+嗬
+嗯
+嗳
--- a/docs/常见问题.md
+++ b/docs/常见问题.md
@ -22,4 +22,10 @@ Q: 报错 `playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.`<
 A: 出现这种情况检查下开梯子没有<br>

 Q: 小红书扫码登录成功后如何手动验证?
-A: 打开 config/base_config.py 文件, 找到 HEADLESS 配置项, 将其设置为 False, 此时重启项目, 在浏览器中手动通过验证码
+A: 打开 config/base_config.py 文件, 找到 HEADLESS 配置项, 将其设置为 False, 此时重启项目, 在浏览器中手动通过验证码<br>
+
+Q: 如何配置词云图的生成?
+A: 打开 config/base_config.py 文件, 找到`ENABLE_GET_WORDCLOUD` 以及`ENABLE_GET_COMMENTS` 两个配置项，将其都设为True即可使用该功能。<br>
+
+Q: 如何给词云图添加禁用词和自定义词组？
+A: 打开 `docs/hit_stopwords.txt` 输入禁用词(注意一个词语一行)。打开 config/base_config.py 文件找到 `CUSTOM_WORDS `按格式添加自定义词组即可。<br>
--- a/docs/项目代码结构.md
+++ b/docs/项目代码结构.md
@ -29,7 +29,8 @@ MediaCrawler
 │   ├── crawler_util.py         # 爬虫相关的工具函数
 │   ├── slider_util.py          # 滑块相关的工具函数
 │   ├── time_util.py            # 时间相关的工具函数
-│   └── easing.py               # 模拟滑动轨迹相关的函数
+│   ├── easing.py               # 模拟滑动轨迹相关的函数
+|   └── words.py				# 生成词云图相关的函数
 ├── db.py                       # DB ORM
 ├── main.py                     # 程序入口
 ├── var.py                      # 上下文变量定义
--- a/store/bilibili/bilibili_store_impl.py
+++ b/store/bilibili/bilibili_store_impl.py
@ -11,10 +11,11 @@ from typing import Dict

 import aiofiles

+import config
 from base.base_crawler import AbstractStore
 from tools import utils
 from var import crawler_type_var
-
+from tools import words

 def calculate_number_of_files(file_store_path: str) -> int:
    """计算数据保存文件的前部分排序数字，支持每次运行代码不写到同一个文件中
@ -130,12 +131,14 @@ class BiliDbStoreImplement(AbstractStore):


 class BiliJsonStoreImplement(AbstractStore):
-    json_store_path: str = "data/bilibili"
+    json_store_path: str = "data/bilibili/json"
+    words_store_path: str = "data/bilibili/words"
    lock = asyncio.Lock()
    file_count:int=calculate_number_of_files(json_store_path)
+    WordCloud = words.AsyncWordCloudGenerator()


-    def make_save_file_name(self, store_type: str) -> str:
+    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
@ -145,7 +148,10 @@ class BiliJsonStoreImplement(AbstractStore):

        """

-        return f"{self.json_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json"
+        return (
+            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
+            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
+        )

    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
@ -158,7 +164,8 @@ class BiliJsonStoreImplement(AbstractStore):

        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
-        save_file_name = self.make_save_file_name(store_type=store_type)
+        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
+        save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []

        async with self.lock:
@ -170,6 +177,12 @@ class BiliJsonStoreImplement(AbstractStore):
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))

+            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
+                try:
+                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
+                except:
+                    pass
+
    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
--- a/store/douyin/douyin_store_impl.py
+++ b/store/douyin/douyin_store_impl.py
@ -12,8 +12,9 @@ from typing import Dict
 import aiofiles

 from base.base_crawler import AbstractStore
-from tools import utils
+from tools import utils,words
 from var import crawler_type_var
+import config


 def calculate_number_of_files(file_store_path: str) -> int:
@ -162,11 +163,14 @@ class DouyinDbStoreImplement(AbstractStore):
            await update_creator_by_user_id(user_id, creator)

 class DouyinJsonStoreImplement(AbstractStore):
-    json_store_path: str = "data/douyin"
+    json_store_path: str = "data/douyin/json"
+    words_store_path: str = "data/douyin/words"
+
    lock = asyncio.Lock()
    file_count: int = calculate_number_of_files(json_store_path)
+    WordCloud = words.AsyncWordCloudGenerator()

-    def make_save_file_name(self, store_type: str) -> str:
+    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
@ -176,8 +180,10 @@ class DouyinJsonStoreImplement(AbstractStore):

        """

-        return f"{self.json_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json"
-
+        return (
+            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
+            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
+        )
    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
        Below is a simple way to save it in json format.
@ -189,7 +195,8 @@ class DouyinJsonStoreImplement(AbstractStore):

        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
-        save_file_name = self.make_save_file_name(store_type=store_type)
+        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
+        save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []

        async with self.lock:
@ -201,6 +208,12 @@ class DouyinJsonStoreImplement(AbstractStore):
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))

+            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
+                try:
+                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
+                except:
+                    pass
+
    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
--- a/store/kuaishou/kuaishou_store_impl.py
+++ b/store/kuaishou/kuaishou_store_impl.py
@ -12,9 +12,9 @@ from typing import Dict
 import aiofiles

 from base.base_crawler import AbstractStore
-from tools import utils
+from tools import utils,words
 from var import crawler_type_var
-
+import config

 def calculate_number_of_files(file_store_path: str) -> int:
    """计算数据保存文件的前部分排序数字，支持每次运行代码不写到同一个文件中
@ -131,12 +131,15 @@ class KuaishouDbStoreImplement(AbstractStore):


 class KuaishouJsonStoreImplement(AbstractStore):
-    json_store_path: str = "data/kuaishou"
+    json_store_path: str = "data/kuaishou/json"
+    words_store_path: str = "data/kuaishou/words"
    lock = asyncio.Lock()
    file_count:int=calculate_number_of_files(json_store_path)
+    WordCloud = words.AsyncWordCloudGenerator()


-    def make_save_file_name(self, store_type: str) -> str:
+
+    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
@ -146,8 +149,10 @@ class KuaishouJsonStoreImplement(AbstractStore):

        """

-
-        return f"{self.json_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json"
+        return (
+            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
+            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
+        )

    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
@ -160,7 +165,8 @@ class KuaishouJsonStoreImplement(AbstractStore):

        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
-        save_file_name = self.make_save_file_name(store_type=store_type)
+        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
+        save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []

        async with self.lock:
@ -172,6 +178,12 @@ class KuaishouJsonStoreImplement(AbstractStore):
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))

+            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
+                try:
+                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
+                except:
+                    pass
+
    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
--- a/store/weibo/weibo_store_impl.py
+++ b/store/weibo/weibo_store_impl.py
@ -12,9 +12,9 @@ from typing import Dict
 import aiofiles

 from base.base_crawler import AbstractStore
-from tools import utils
+from tools import utils,words
 from var import crawler_type_var
-
+import config

 def calculate_number_of_files(file_store_path: str) -> int:
    """计算数据保存文件的前部分排序数字，支持每次运行代码不写到同一个文件中
@ -132,12 +132,14 @@ class WeiboDbStoreImplement(AbstractStore):


 class WeiboJsonStoreImplement(AbstractStore):
-    json_store_path: str = "data/weibo"
+    json_store_path: str = "data/weibo/json"
+    words_store_path: str = "data/weibo/words"
    lock = asyncio.Lock()
    file_count:int=calculate_number_of_files(json_store_path)
+    WordCloud = words.AsyncWordCloudGenerator()


-    def make_save_file_name(self, store_type: str) -> str:
+    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
@ -147,7 +149,10 @@ class WeiboJsonStoreImplement(AbstractStore):

        """

-        return f"{self.json_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json"
+        return (
+            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
+            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
+        )

    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
@ -160,7 +165,8 @@ class WeiboJsonStoreImplement(AbstractStore):

        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
-        save_file_name = self.make_save_file_name(store_type=store_type)
+        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
+        save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []

        async with self.lock:
@ -172,6 +178,12 @@ class WeiboJsonStoreImplement(AbstractStore):
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))

+            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
+                try:
+                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
+                except:
+                    pass
+
    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
--- a/store/xhs/xhs_store_impl.py
+++ b/store/xhs/xhs_store_impl.py
@ -12,9 +12,9 @@ from typing import Dict
 import aiofiles

 from base.base_crawler import AbstractStore
-from tools import utils
+from tools import utils,words
 from var import crawler_type_var
-
+import config

 def calculate_number_of_files(file_store_path: str) -> int:
    """计算数据保存文件的前部分排序数字，支持每次运行代码不写到同一个文件中
@ -161,11 +161,13 @@ class XhsDbStoreImplement(AbstractStore):


 class XhsJsonStoreImplement(AbstractStore):
-    json_store_path: str = "data/xhs"
+    json_store_path: str = "data/xhs/json"
+    words_store_path: str = "data/xhs/words"
    lock = asyncio.Lock()
    file_count:int=calculate_number_of_files(json_store_path)
+    WordCloud = words.AsyncWordCloudGenerator()

-    def make_save_file_name(self, store_type: str) -> str:
+    def make_save_file_name(self, store_type: str) -> (str,str):
        """
        make save file name by store type
        Args:
@ -175,7 +177,10 @@ class XhsJsonStoreImplement(AbstractStore):

        """

-        return f"{self.json_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json"
+        return (
+            f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
+            f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
+        )

    async def save_data_to_json(self, save_item: Dict, store_type: str):
        """
@ -188,7 +193,8 @@ class XhsJsonStoreImplement(AbstractStore):

        """
        pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
-        save_file_name = self.make_save_file_name(store_type=store_type)
+        pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
+        save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
        save_data = []

        async with self.lock:
@ -200,6 +206,11 @@ class XhsJsonStoreImplement(AbstractStore):
            async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
                await file.write(json.dumps(save_data, ensure_ascii=False))

+            if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
+                try:
+                    await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
+                except:
+                    pass
    async def store_content(self, content_item: Dict):
        """
        content JSON storage implementation
--- a/tools/words.py
+++ b/tools/words.py
@ -0,0 +1,68 @@
+import aiofiles
+import asyncio
+import jieba
+from collections import Counter
+from wordcloud import WordCloud
+import json
+import matplotlib.pyplot as plt
+import config
+from tools import utils
+
+plot_lock = asyncio.Lock()
+
+class AsyncWordCloudGenerator:
+    def __init__(self):
+        self.stop_words_file = config.STOP_WORDS_FILE
+        self.lock = asyncio.Lock()
+        self.stop_words = self.load_stop_words()
+        self.custom_words = config.CUSTOM_WORDS
+        for word, group in self.custom_words.items():
+            jieba.add_word(word)
+
+    def load_stop_words(self):
+        with open(self.stop_words_file, 'r', encoding='utf-8') as f:
+            return set(f.read().strip().split('\n'))
+
+    async def generate_word_frequency_and_cloud(self, data, save_words_prefix):
+        all_text = ' '.join(item['content'] for item in data)
+        words = [word for word in jieba.lcut(all_text) if word not in self.stop_words]
+        word_freq = Counter(words)
+
+        # Save word frequency to file
+        freq_file = f"{save_words_prefix}_word_freq.json"
+        async with aiofiles.open(freq_file, 'w', encoding='utf-8') as file:
+            await file.write(json.dumps(word_freq, ensure_ascii=False, indent=4))
+
+        # Try to acquire the plot lock without waiting
+        if plot_lock.locked():
+            utils.logger.info("Skipping word cloud generation as the lock is held.")
+            return
+
+        await self.generate_word_cloud(word_freq, save_words_prefix)
+
+    async def generate_word_cloud(self, word_freq, save_words_prefix):
+        await plot_lock.acquire()
+        top_20_word_freq = {word: freq for word, freq in
+                            sorted(word_freq.items(), key=lambda item: item[1], reverse=True)[:20]}
+        wordcloud = WordCloud(
+            font_path=config.FONT_PATH,
+            width=800,
+            height=400,
+            background_color='white',
+            max_words=200,
+            stopwords=self.stop_words,
+            colormap='viridis',
+            contour_color='steelblue',
+            contour_width=1
+        ).generate_from_frequencies(top_20_word_freq)
+
+        # Save word cloud image
+        plt.figure(figsize=(10, 5), facecolor='white')
+        plt.imshow(wordcloud, interpolation='bilinear')
+
+        plt.axis('off')
+        plt.tight_layout(pad=0)
+        plt.savefig(f"{save_words_prefix}_word_cloud.png", format='png', dpi=300)
+        plt.close()
+
+        plot_lock.release()