python爬虫图解

Posted Aug 13, 2024 Updated Sep 11, 2024

4 min read

发送http请求

以csdn为例，鼠标右键点击检查（或F12进入开发者模式），然后点击Network，刷新网页，继续点击Name列表中的第一个。我们发现此网站的请求方式为GET，请求头Headers反映用户电脑系统、浏览器版本等信息。

在展开框中选择请求标头，一般拿到Cookie和User-Agent就可以（前者用来加载登陆状态，后者用来伪装浏览器访问）。

构造一个请求

url = 'https://blog.csdn.net/xxx/.html'
header = {
    "User-Agent": "Mozilla/5.0 (xxx)",
    "Cookie": "uuid_tt_dd=xxxxx"
}
response = requests.get(url, headers=header, verify=False, timeout=10)

verify设为False可以规避认证问题，同样也存在风险（有可能被第三方攻击）。真正的勇士从不在意这些。

作为进阶，你还可以随机设一些user-agent列表来模拟不同浏览器访问。

解析网页内容

有很多方式，这里使用 BeautifulSoup（TODO: 不同解析库的异同、哪个是如今最流行的）

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())  # 输出页面内容，帮助你理解页面结构
# 提取文章链接
article_list = soup.find('ul', class_='column_article_list')

基本上，soup.find能解决百分之90的问题，核心在于你要先找到自己想爬取的内容在哪里。一种方式是，在你想要爬取的内容上右击，选择检查，会自动引到对应的源代码。

图片区域

以上是通过右击找到了csdn首页的推荐文章列表，如果想要爬取首页这一列所有的推荐文章链接，可以看到，它们都属于<div class="Community">这个大类。那么你可以：

  
article_list = soup.find('div', class_='Community') # 获取文章列表（小类列表）

实际内容往往层层嵌套，拿到所有小类并不是终极：

图片区域

上图红框里的a中的href字段才是你想要的链接。那么你需要找到所有小类列表中的<div class="content">，再找到a，再去获取a中的href：

if not article_list:
    print("No article list found.")

else:
    article_links = []
    article_items = article_list.find_all('div', class_='content')
    for item in article_items:
        a_tag = item.find('a')
        if a_tag and 'href' in a_tag.attrs:
            link = a_tag['href']
            article_links.append(link)

这样就拿到了。后面可以再读取文章内容、标题，然后保存为md什么的，原理是一样的：

  
from markdownify import markdownify as md

# 遍历文章链接，获取文章内容并转换为Markdown格式
for link in article_links:
    article_response = requests.get(link, headers=header, verify=False, timeout=10)
    article_soup = BeautifulSoup(article_response.content, 'html.parser')
    
    # 提取文章内容
    article_content = article_soup.find('div', id='content_views')
    if article_content:
        # 转换为Markdown格式
        markdown_content = md(str(article_content))
    
        # 获取文章标题
        title = article_soup.find('h1', class_='title-article').text.strip().replace(" ", "").replace("/", "-")
    
        # 保存为Markdown文件
        filename = f"{output_dir}/{title}.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(markdown_content)
        print(f"Saved {filename}")
    else:
        print(f"No content found for {link}")
    wait_time = random.uniform(1, 10) # 模拟人类操作
    time.sleep(wait_time)
    
print("All articles have been processed.")

一些需要注意的小tips：

random.uniform: 随机生成一个在[x,y]范围内的实数

coding

黑科技爬虫

This post is licensed under CC BY 4.0 by the author.

发送http请求

解析网页内容

Trending Tags