How to Use Python Web Scraping Libraries

Original

ithorizon, 9 months ago (08-22), 137 views, #Python

A Guide to Using Python Web Scraping Libraries

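Python is a powerful programming language that is well suited to writing web crawlers. The Python community maintains many good scraping libraries that help developers collect data efficiently. Below are several commonly used ones, each with a basic usage example.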

1. Requests

Requests is a simple, easy-to-use HTTP library for sending network requests.


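import requests

# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get('https://www.example.com', timeout=10)
print(response.text)  # response body decoded as text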
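
A bare `requests.get` call does not raise on an error status such as 404 or 500, so a crawler can end up parsing an error page without noticing. The following is a minimal sketch of a slightly more defensive fetch, not a definitive recipe. The helper names `build_session` and `fetch_html`, the user-agent string, and the 10-second timeout are illustrative assumptions that do not come from the original article.

```python
import requests


def build_session(user_agent="my-crawler/0.1"):
    """Create a Session that reuses connections and sends a custom User-Agent.

    The default user-agent string is a made-up placeholder.
    """
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"Request to {url} failed: {exc}")
        return None
    return response.text
```

A typical call would look like `html = fetch_html(build_session(), 'https://www.example.com')`. Reusing one session across many requests also lets Requests keep connections to the same host open.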

2. BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML documents, and it pairs well with Requests.


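from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.example.com', timeout=10)
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())  # print the parsed tree with indentation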
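
Printing the whole tree is rarely the end goal. Usually you want specific elements out of the page. As a rough sketch, the snippet below collects every link from a document and resolves relative URLs against a base address. The helper `extract_links` and the `SAMPLE_HTML` string are hypothetical and exist only so the example runs without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        text = anchor.get_text(strip=True)
        links.append((text, urljoin(base_url, anchor["href"])))
    return links


# Hypothetical sample document, used instead of a live page.
SAMPLE_HTML = """
<html><body>
  <h1>Example</h1>
  <a href="/about">About</a>
  <a href="https://www.example.com/contact">Contact</a>
  <a>No target</a>
</body></html>
"""

for text, url in extract_links(SAMPLE_HTML, "https://www.example.com"):
    print(text, url)
```

Passing `href=True` to `find_all` skips anchors that have no `href` attribute, so the loop never hits a missing key.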

3. Scrapy

Scrapy is a powerful crawling framework suited to complex data-scraping tasks.


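import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'  # unique name used to run the spider
    start_urls = ['https://www.example.com']  # pages to crawl first

    def parse(self, response):
        # Called for each downloaded page; yields one item per page
        yield {'title': response.css('h1::text').get()}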

4. Selenium

Selenium is a browser automation and testing tool that is also widely used in scraping to simulate real browser behavior.


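from selenium import webdriver

driver = webdriver.Chrome()  # launches a local Chrome browser
try:
    driver.get('https://www.example.com')
    print(driver.page_source)  # HTML after the browser has rendered the page
finally:
    driver.quit()  # always close the browser, even if an error occurs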

5. aiohttp

aiohttp is an asynchronous HTTP library built on asyncio, suited to writing high-performance crawlers.


import aiohttp
import asyncio


async def fetch(session, url):
    # Request the page and return its body as text
    async with session.get(url) as response:
        return await response.text()


async def main():
    # One ClientSession should be shared by all requests
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'https://www.example.com')
        print(html)


# asyncio.run creates the event loop, runs main(), and closes the loop
asyncio.run(main())
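
The performance benefit of aiohttp only appears when several requests are in flight at once. The following is a minimal sketch of that idea, assuming a simple cap on concurrency. It uses `asyncio.gather` to run the downloads concurrently and an `asyncio.Semaphore` to limit how many run at the same time. The function `fetch_all`, the limit of 5, and the 10-second timeout are illustrative choices, not recommendations from the original article.

```python
import asyncio

import aiohttp


async def fetch(session, url, semaphore):
    """Fetch one URL; the semaphore caps how many requests run at once."""
    async with semaphore:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return url, await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Return the error instead of raising so one failure
            # does not cancel the other downloads.
            return url, exc


async def fetch_all(urls, max_concurrency=5, total_timeout=10):
    """Fetch all URLs concurrently; return a list of (url, body_or_error)."""
    semaphore = asyncio.Semaphore(max_concurrency)
    timeout = aiohttp.ClientTimeout(total=total_timeout)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
```

It could be driven with `results = asyncio.run(fetch_all(['https://www.example.com']))`. Results come back in the same order as the input URLs, because `asyncio.gather` preserves order.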