Web Scraping in Practice
Learn how to fetch data from web pages
Environment Setup
pip install requests beautifulsoup4
A Basic Scraper
import requests
from bs4 import BeautifulSoup

# Send the request
url = "https://example.com"
response = requests.get(url)
response.encoding = "utf-8"

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract data
title = soup.find("title").text
links = soup.find_all("a")
for link in links:
    print(link.get("href"), link.text)
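The snippet above depends on a live network request. The parsing step itself can be tried offline by feeding BeautifulSoup a static HTML string (the markup below is a made-up stand-in for `response.text`), using the same `find` / `find_all` calls:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text
html = """
<html><head><title>Example</title></head>
<body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("title").text                              # "Example"
links = [(a.get("href"), a.text) for a in soup.find_all("a")]
```

Working from a saved string like this is also a convenient way to debug selectors before pointing the scraper at a real site.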
Hands-on: Scraping News Headlines
import requests
from bs4 import BeautifulSoup

def get_news(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    # timeout keeps the scraper from hanging on a slow server
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    news_list = []
    for item in soup.find_all("h2", class_="title"):
        news_list.append({
            "title": item.text.strip(),
            "link": item.find("a").get("href") if item.find("a") else None
        })
    return news_list
# Usage
# news = get_news("https://news.example.com")
# for n in news:
#     print(n["title"])
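`get_news` assumes headlines live in `<h2 class="title">` elements (a structure that varies by site). The extraction logic can be checked offline against canned HTML that mimics that assumed layout, as in this sketch:

```python
from bs4 import BeautifulSoup

# Canned HTML mimicking the assumed news-page structure
html = """
<h2 class="title"><a href="/n/1">Headline One</a></h2>
<h2 class="title">Plain headline without a link</h2>
"""

def extract_news(html_text):
    """Same extraction logic as get_news, minus the HTTP request."""
    soup = BeautifulSoup(html_text, "html.parser")
    news = []
    for item in soup.find_all("h2", class_="title"):
        news.append({
            "title": item.text.strip(),
            "link": item.find("a").get("href") if item.find("a") else None,
        })
    return news

# extract_news(html)[0] → {"title": "Headline One", "link": "/n/1"}
```

Separating fetching from parsing like this makes the parsing step easy to test without hitting the network.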
Caveats
- Obey robots.txt
- Throttle your request rate
- Set a User-Agent header
- Handle exceptions (timeouts, non-200 responses)
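The first two rules can be automated with the standard library's `urllib.robotparser`. A minimal sketch (the robots.txt content and the `MyCrawler/1.0` bot name are hypothetical; in practice you would call `rp.set_url(...)` and `rp.read()` instead of parsing a canned string):

```python
import time
from urllib.robotparser import RobotFileParser

# Canned robots.txt so the sketch runs offline; normally fetched
# from https://example.com/robots.txt via rp.set_url() + rp.read()
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

USER_AGENT = "MyCrawler/1.0"  # hypothetical bot name

def allowed(path):
    """Check a path against the parsed robots.txt rules."""
    return rp.can_fetch(USER_AGENT, path)

# Honor the site's Crawl-delay between requests (fall back to 1s)
delay = rp.crawl_delay(USER_AGENT) or 1

# allowed("/news/1")   → True
# allowed("/private/x") → False
```

Before each request, check `allowed(path)` and then `time.sleep(delay)` — this covers "obey robots.txt" and "throttle your request rate" in a few lines.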
Chapter Summary
- requests: sends HTTP requests
- BeautifulSoup: parses HTML
- Crawler etiquette: respect the site's rules
→ Continue reading: 31-Web开发入门