Web Scraping in Practice
Learn how to fetch data from web pages
Environment Setup
pip install requests beautifulsoup4
A Basic Scraper
import requests
from bs4 import BeautifulSoup

# Send the request
url = "https://example.com"
response = requests.get(url)
response.encoding = "utf-8"

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Extract data
title = soup.find("title").text
links = soup.find_all("a")
for link in links:
    print(link.get("href"), link.text)
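The snippet above depends on a live network request. The parsing step itself can be tried offline by feeding BeautifulSoup a static HTML string (the markup below is a made-up stand-in for `response.text`), using the same `find` / `find_all` calls:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text
html = """
<html><head><title>Example</title></head>
<body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("title").text                              # "Example"
links = [(a.get("href"), a.text) for a in soup.find_all("a")]
```

Working from a saved string like this is also a convenient way to debug selectors before pointing the scraper at a real site.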
Hands-on: Scraping News Headlines
import requests
from bs4 import BeautifulSoup

def get_news(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    # timeout keeps the scraper from hanging on a slow server
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    news_list = []
    for item in soup.find_all("h2", class_="title"):
        news_list.append({
            "title": item.text.strip(),
            "link": item.find("a").get("href") if item.find("a") else None
        })
    return news_list
# Usage
# news = get_news("https://news.example.com")
# for n in news:
#     print(n["title"])
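`get_news` assumes headlines live in `<h2 class="title">` elements (a structure that varies by site). The extraction logic can be checked offline against canned HTML that mimics that assumed layout, as in this sketch:

```python
from bs4 import BeautifulSoup

# Canned HTML mimicking the assumed news-page structure
html = """
<h2 class="title"><a href="/n/1">Headline One</a></h2>
<h2 class="title">Plain headline without a link</h2>
"""

def extract_news(html_text):
    """Same extraction logic as get_news, minus the HTTP request."""
    soup = BeautifulSoup(html_text, "html.parser")
    news = []
    for item in soup.find_all("h2", class_="title"):
        news.append({
            "title": item.text.strip(),
            "link": item.find("a").get("href") if item.find("a") else None,
        })
    return news

# extract_news(html)[0] → {"title": "Headline One", "link": "/n/1"}
```

Separating fetching from parsing like this makes the parsing step easy to test without hitting the network.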
Caveats
- Obey robots.txt
- Throttle your request rate
- Set a User-Agent header
- Handle exceptions (timeouts, non-200 responses)
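The first two rules can be automated with the standard library's `urllib.robotparser`. A minimal sketch (the robots.txt content and the `MyCrawler/1.0` bot name are hypothetical; in practice you would call `rp.set_url(...)` and `rp.read()` instead of parsing a canned string):

```python
import time
from urllib.robotparser import RobotFileParser

# Canned robots.txt so the sketch runs offline; normally fetched
# from https://example.com/robots.txt via rp.set_url() + rp.read()
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

USER_AGENT = "MyCrawler/1.0"  # hypothetical bot name

def allowed(path):
    """Check a path against the parsed robots.txt rules."""
    return rp.can_fetch(USER_AGENT, path)

# Honor the site's Crawl-delay between requests (fall back to 1s)
delay = rp.crawl_delay(USER_AGENT) or 1

# allowed("/news/1")   → True
# allowed("/private/x") → False
```

Before each request, check `allowed(path)` and then `time.sleep(delay)` — this covers "obey robots.txt" and "throttle your request rate" in a few lines.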
Chapter Summary
- requests: sends HTTP requests
- BeautifulSoup: parses HTML
- Crawler etiquette: respect the site's rules
→ Continue reading: 31-Web开发入门