Say No to Inefficiency! Scraping WeChat Official Account Articles and Links with Python

Original

ithorizon · 7 months ago (10-21) · 41 views · #Backend Development


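1. Introduction

WeChat Official Account (公众号) articles have become an important source of information, but collecting articles and their links by hand is slow and tedious. This article shows how to use Python to fetch an account's article list, download each article, and save its content to local files.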

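2. Preparation

Before scraping, set up the following:

Install a Python 3 environment

Install the requests library (pip install requests)

Install the BeautifulSoup library (pip install beautifulsoup4)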
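
To confirm the setup before running any scraping code, a small check along the following lines may help. It is only a sketch: the helper name missing_packages is made up for this example, and it relies on the fact that the two libraries are imported as requests and bs4.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "requests" and "bs4" are the import names of the two libraries listed above
    missing = missing_packages(["requests", "bs4"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If a name is reported as missing, installing it with pip and rerunning the check should be enough.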

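3. Getting the Article List

First we need the list of the account's articles. This usually means calling the official account's API endpoint or requesting its web page.

3.1 Sending the request with requests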


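import requests
# YOUR_TOKEN is a placeholder; this backend URL normally also expects the cookies of a logged-in session
url = 'https://mp.weixin.qq.com/cgi-bin/appmsg?token=YOUR_TOKEN&lang=zh_CN'
headers = {
    # Browser-like User-Agent so the request does not look like an obvious script
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on HTTP errors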
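
The request URL above is written out by hand with the token embedded in the query string. As a minimal sketch, the same URL can be assembled from its parts so that the token is always escaped correctly. The function name build_list_url is hypothetical; the base URL and the token and lang parameters are the ones used in the request above.

```python
from urllib.parse import urlencode

# Base address taken from the request shown above
BASE_URL = "https://mp.weixin.qq.com/cgi-bin/appmsg"


def build_list_url(token, lang="zh_CN"):
    """Build the article-list URL, percent-encoding the query parameters."""
    query = urlencode({"token": token, "lang": lang})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # YOUR_TOKEN stays a placeholder here, exactly as in the article
    print(build_list_url("YOUR_TOKEN"))
```

requests can also do this encoding itself when the parameters are passed as a dictionary through the params argument of requests.get.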

3.2 Parsing the article list

Once we have the response, we extract the article list from it. Here we use the BeautifulSoup library for parsing.


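from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Each article entry is expected in a <div class="appmsg_item">; adjust the selector if the markup differs
article_list = soup.find_all('div', class_='appmsg_item')
for article in article_list:
    title_tag = article.find('h4')
    link_tag = article.find('a', href=True)
    if title_tag is None or link_tag is None:
        continue  # skip entries without a title or a link
    title = title_tag.get_text(strip=True)
    link = link_tag['href']
    print(title, link)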
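
The href values read from the list may not always be complete URLs; pages often use relative or protocol-relative links. The sketch below normalizes them with the standard library before they are requested. It assumes the links belong to https://mp.weixin.qq.com/, and both the helper name absolute_link and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed base address for resolving relative links
BASE = "https://mp.weixin.qq.com/"


def absolute_link(href, base=BASE):
    """Resolve a possibly relative href against the base address."""
    return urljoin(base, href.strip())


if __name__ == "__main__":
    for href in ["/s/example", "//mp.weixin.qq.com/s/example",
                 "https://mp.weixin.qq.com/s/example"]:
        print(absolute_link(href))
```

Links that are already absolute pass through unchanged, so the helper can be applied to every href without checking its form first.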

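4. Getting the Article Content

With the article list in hand, we can go on to fetch the content of each article.

4.1 Requesting the article by its link

The article links were already extracted from the list. Next, we use the requests library to request each link and download the article page.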


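def get_article_content(link):
    """Download an article page and return its HTML."""
    response = requests.get(link, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    return response.text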
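
Downloads can fail temporarily, for instance on a timeout. One possible way to make the fetching step more tolerant is a small retry wrapper like the sketch below. It is not part of the original code: the name fetch_with_retry and its parameters are assumptions, and it takes the download function as an argument so it can wrap get_article_content or any similar function.

```python
import time


def fetch_with_retry(fetch, link, attempts=3, delay=1.0):
    """Call fetch(link), retrying after a failure, at most `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(link)
        except Exception as error:  # in real use, catching requests.RequestException is narrower
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    raise last_error
```

A call such as fetch_with_retry(get_article_content, link) would then replace the direct call. The growing pause between attempts also keeps the request rate low, which is considerate towards the server.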

4.2 Parsing the article content

Once the article page has been downloaded, we can parse it with the BeautifulSoup library.


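def parse_article_content(content):
    """Return (title, body text) extracted from an article's HTML."""
    soup = BeautifulSoup(content, 'html.parser')
    title_tag = soup.find('h2')
    body_tag = soup.find('div', class_='rich_media_content')
    # find() returns None when a tag is missing, so fall back to empty strings
    article_title = title_tag.get_text(strip=True) if title_tag else ''
    article_content = body_tag.get_text('\n', strip=True) if body_tag else ''
    return article_title, article_content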
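
To see how the parsing step behaves without touching the network, it can be tried on a small hand-written page. The HTML below is invented for this example and only mimics the h2 title and the rich_media_content container that the parser looks for; real article pages may be structured differently. The function name extract_article is also just for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample page; real article markup may differ
SAMPLE_HTML = """
<html><body>
  <h2 class="rich_media_title"> Sample title </h2>
  <div class="rich_media_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_article(html):
    """Return (title, body text); missing elements yield empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h2")
    body_tag = soup.find("div", class_="rich_media_content")
    title = title_tag.get_text(strip=True) if title_tag else ""
    body = body_tag.get_text("\n", strip=True) if body_tag else ""
    return title, body


if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
    print(extract_article("<html><body>no article here</body></html>"))
```

The second call shows the fallback: a page without the expected elements produces empty strings instead of an AttributeError.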

5. Saving the Article Content

After extracting the article content, we can save it to a local file for later reading.


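def save_article(title, content):
    """Write the article text to '<title>.txt' in the current directory."""
    # Note: open() fails if the title contains characters that are invalid in file names, such as '/'
    with open(title + '.txt', 'w', encoding='utf-8') as file:
        file.write(content)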
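
Because the file name is built directly from the article title, a title containing characters such as / or : can make open() fail or write to an unexpected path. A minimal sketch of a sanitizing step follows; the helper name safe_filename and the fallback name are assumptions, and the set of replaced characters covers the ones Windows rejects plus line breaks and tabs.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that are not allowed in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title)
    # Drop leading and trailing spaces, dots and underscores
    cleaned = cleaned.strip(" ._")
    return cleaned or fallback


if __name__ == "__main__":
    print(safe_filename("Python/crawler: part 1"))
    print(safe_filename("???"))
```

The result can be passed on as the title, for example save_article(safe_filename(title), content).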

6. Complete Code Example

The complete code for the whole scraping process is shown below:


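import re
import requests
from bs4 import BeautifulSoup

# Browser-like headers shared by every request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def get_article_list(url):
    """Fetch the list page and return the elements that wrap each article."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.find_all('div', class_='appmsg_item')

def get_article_content(link):
    """Download an article page and return its HTML."""
    response = requests.get(link, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text

def parse_article_content(content):
    """Return (title, body text) extracted from an article's HTML."""
    soup = BeautifulSoup(content, 'html.parser')
    title_tag = soup.find('h2')
    body_tag = soup.find('div', class_='rich_media_content')
    # find() returns None when a tag is missing, so fall back to empty strings
    article_title = title_tag.get_text(strip=True) if title_tag else ''
    article_content = body_tag.get_text('\n', strip=True) if body_tag else ''
    return article_title, article_content

def save_article(title, content):
    """Write the article text to '<title>.txt', replacing characters that are invalid in file names."""
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'
    with open(safe_title + '.txt', 'w', encoding='utf-8') as file:
        file.write(content)

def main():
    # YOUR_TOKEN is a placeholder; this backend URL normally also expects the cookies of a logged-in session
    url = 'https://mp.weixin.qq.com/cgi-bin/appmsg?token=YOUR_TOKEN&lang=zh_CN'
    for article in get_article_list(url):
        link_tag = article.find('a', href=True)
        if link_tag is None:
            continue  # skip entries without a link
        link = link_tag['href']
        content = get_article_content(link)
        article_title, article_content = parse_article_content(content)
        print(article_title, link)
        save_article(article_title, article_content)

if __name__ == '__main__':
    main()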
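
The script saves each article's text, while the titles and links are only available during the run. If a lasting record of the links is wanted as well, one option is to collect (title, link) pairs and write them to a CSV file. The sketch below is an assumption about how that could look: write_index is a hypothetical helper and the sample row is invented.

```python
import csv
import io


def write_index(rows, file):
    """Write (title, link) pairs to an open text file as CSV with a header row."""
    writer = csv.writer(file)
    writer.writerow(["title", "link"])
    for title, link in rows:
        writer.writerow([title, link])


if __name__ == "__main__":
    # Invented sample data; in the script, rows would be collected inside main()
    rows = [("Sample title", "https://mp.weixin.qq.com/s/example")]
    buffer = io.StringIO()
    write_index(rows, buffer)
    print(buffer.getvalue())
```

When writing to a real file instead of the in-memory buffer, opening it with open('index.csv', 'w', newline='', encoding='utf-8') follows the csv module's recommendation and avoids blank lines between rows on Windows.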