Selenium | Notes

Introduction: saving a web page as an image with Selenium, saving a web page as a PDF with Selenium, and more.

Preparation

chromedriver downloads
- Official: https://chromedriver.storage.googleapis.com/index.html
- Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/

Chrome downloads
- https://www.slimjet.com/chrome/google-chrome-old-version.php

selenium / webdriver basics

Importing packages

Install the Python selenium package with pip:
pip install selenium
Download and install Chrome on Ubuntu. Note: it is recommended to pin the Chrome version; the Chrome version must match the chromedriver version.
# Install
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
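Because the Chrome and chromedriver versions have to match, it can help to check them before starting a session. The following is a minimal sketch, not an official tool: the helper names are hypothetical, and comparing the first three version parts (major.minor.build) is an assumption based on how the two version numbers in these notes line up (Chrome 88.0.4324.182 with chromedriver 88.0.4324.96).

```python
import re
import subprocess

# Matches a four-part version such as 88.0.4324.182
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_version(text):
    """Return the first four-part version found in text as a tuple of ints, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def versions_compatible(chrome_output, driver_output):
    """Compare major.minor.build of both versions (assumed matching rule)."""
    chrome = parse_version(chrome_output)
    driver = parse_version(driver_output)
    if chrome is None or driver is None:
        return False
    return chrome[:3] == driver[:3]


def read_version(command):
    """Run a command such as ['google-chrome', '--version'] and return its stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

A possible use is `versions_compatible(read_version(["google-chrome", "--version"]), read_version(["./chromedriver", "--version"]))`, assuming both binaries are installed at those locations.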
Download the matching version of chromedriver:
# Download chromedriver
sudo wget http://chromedriver.storage.googleapis.com/88.0.4324.96/chromedriver_linux64.zip

sudo apt-get install unzip

# Unzip
sudo unzip chromedriver_linux64.zip

# Add execute permission for all users (on the chromedriver file)
sudo chmod a+x chromedriver

# Fix garbled Chinese characters in screenshots of Chinese pages by installing Chinese fonts
sudo apt install -y --no-install-recommends fonts-wqy-microhei
sudo apt install -y --no-install-recommends fonts-wqy-zenhei
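The download, unzip, and chmod steps above can also be scripted. This is a rough sketch under stated assumptions: the URL layout is copied from the wget command in these notes, the function names are hypothetical, and the archive is assumed to contain a single file named chromedriver.

```python
import os
import stat
import urllib.request
import zipfile

# Bucket layout taken from the wget command above (assumed to hold for other versions)
BASE_URL = "https://chromedriver.storage.googleapis.com"


def chromedriver_zip_url(version, platform="linux64"):
    """Build the download URL for a given chromedriver version."""
    return f"{BASE_URL}/{version}/chromedriver_{platform}.zip"


def download_chromedriver(version, dest_dir=".", platform="linux64"):
    """Download, unzip, and mark chromedriver as executable; return its path."""
    zip_path = os.path.join(dest_dir, f"chromedriver_{platform}.zip")
    urllib.request.urlretrieve(chromedriver_zip_url(version, platform), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    binary = os.path.join(dest_dir, "chromedriver")
    # zipfile does not restore the executable bit, so set it (like chmod a+x)
    mode = os.stat(binary).st_mode
    os.chmod(binary, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return binary
```

Calling `download_chromedriver("88.0.4324.96")` would mirror the shell commands above, without needing sudo when the target directory is writable.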
Import in code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
Supplement (Options classes for each browser):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.ie.options import Options as IEOptions

driver example

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

# 1. Use Chrome  2. Point to the chromedriver binary at './chromedriver'
# (Selenium 4 style; Selenium 3 accepted the path as the first positional argument)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)

driver.get("https://github.com/yiyungent/WebScreenshot")

# Resize the window to the full page size so the screenshot covers the whole page
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)

# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S') + '.png')

driver.quit()
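The screenshot call above writes into ./screenshots and uses a timestamp containing spaces and colons, which some filesystems reject. A small sketch of a helper that creates the directory and builds a more portable file name follows; the function name and the timestamp format are assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def screenshot_path(directory="./screenshots", now=None):
    """Return a timestamped .png path inside directory, creating the directory if needed."""
    now = now or datetime.now()
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    # Avoid spaces and colons so the name is valid on most filesystems
    return folder / (now.strftime("%Y-%m-%d_%H-%M-%S") + ".png")
```

With a live driver this could be used as `driver.save_screenshot(str(screenshot_path()))`.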
Saving a web page as an image with Selenium
References:
- Selenium C#中完整元素的屏幕截图 | 码农家园
- TakeScreenshot | Working with windows and tabs | Selenium

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

# 1. Use Chrome  2. Point to the chromedriver binary at './chromedriver'
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)

driver.get("https://github.com/yiyungent/WebScreenshot")

# Resize the window to the full page size so the screenshot covers the whole page
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)

# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S') + '.png')

driver.quit()
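Saving a web page as a PDF with Selenium
References:
- 使用selenium把网页保存为PDF | 郑小凯的个人博客
- 利用selenium将edge浏览器里面的网页保存为pdf - 队长 - 博客园

Approach

There are mainly the following options:
- Use a third-party package such as pdfkit; see https://www.cnblogs.com/silence-cc/p/9463227.html
- Use Chrome's --print-to-pdf mode to export the requested HTML as a PDF; see http://osask.cn/front/ask/view/1029784
- Use the JS command 'window.print();' to invoke the browser's print function; see https://gitee.com/shinemic/codes/09y87ph6vf2c5zamwls3q48

Here we choose the third option: it adapts comparatively well and makes it easy to watch the progress. To hide the page, just add the --headless option.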
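For comparison, the second option (Chrome's --print-to-pdf mode) can be driven from Python without Selenium. This is only a sketch: the function names are hypothetical, and the binary name google-chrome is assumed to be on PATH.

```python
import subprocess


def build_print_to_pdf_command(url, output_pdf, chrome_binary="google-chrome"):
    """Build the headless Chrome command line that prints url to output_pdf."""
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",
        url,
    ]


def print_to_pdf(url, output_pdf, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome once; raises CalledProcessError on a non-zero exit code."""
    command = build_print_to_pdf_command(url, output_pdf, chrome_binary)
    subprocess.run(command, check=True, timeout=timeout)
```

This route gives less control over timing than the window.print() approach, since the page is printed as soon as Chrome considers it loaded.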

Implementation

Configure the chromedriver options:
import json

from selenium import webdriver

# Make "Save as PDF" the selected destination in Chrome's print preview
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}
profile = {
    'printing.print_preview_sticky_settings.appState': json.dumps(appState),
    'savefile.default_directory': './articles'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
# Print silently, without showing the print dialog
chrome_options.add_argument('--kiosk-printing')
Here savefile.default_directory specifies the directory where the file is saved; configure it yourself.
Save the PDF:
driver.get(url)
time.sleep(5)
# Save as PDF
temp_title = driver.title
driver.execute_script('window.print();')
When Chrome prints a page, the default file name is the page title, so the title is stored first with temp_title = driver.title.
Rename:
os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')
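The rename step can fail when the page title is empty or the printed file is missing, so it is worth guarding. The following is a minimal sketch with hypothetical helper names; it assumes Chrome saved the file as the page title plus .pdf, which may not hold when the title contains characters Chrome itself replaces.

```python
import os
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that are unsafe in file names; fall back if nothing is left."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', "_", title or "").strip(" ._")
    return cleaned or fallback


def rename_pdf(directory, printed_title, new_title):
    """Rename <printed_title>.pdf to <new_title>.pdf; return the new path or None."""
    if not printed_title:
        return None  # empty title: nothing reliable to look for
    source = os.path.join(directory, printed_title + ".pdf")
    if not os.path.isfile(source):
        return None  # printing failed or Chrome used a different name
    target = os.path.join(directory, safe_filename(new_title) + ".pdf")
    os.replace(source, target)
    return target
```

Returning None instead of raising lets the caller log the failure and continue with the next page.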
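If several pages of the same site are opened and saved as PDFs, files are likely to overwrite each other because the pages share the same title, so the PDF is renamed after each save. Note: when a page fails to load properly, the title may be empty, and the rename will then raise an exception, so exception handling is needed.

Cookies
Reference: Working with cookies | Selenium

Waits
Reference: Waits | Selenium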

Explicit waits

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def document_initialised(driver):
    return driver.execute_script("return initialised")

driver.get("file:///race_condition.html")
# In Python, WebDriverWait requires a timeout (in seconds)
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hello from JavaScript!"
The above can be simplified to:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver.get("file:///race_condition.html")
el = WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.TAG_NAME, "p"))
assert el.text == "Hello from JavaScript!"
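To make the idea behind an explicit wait concrete, here is a simplified polling loop in plain Python. It is an illustration of the concept only, not Selenium's implementation: the function name is hypothetical, and the ignored_exceptions parameter is a stand-in for the exceptions a real wait would tolerate while polling.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5, ignored_exceptions=()):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored_exceptions:
            pass  # treat listed exceptions as "not ready yet"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The return value of the condition is passed back to the caller, which is why the lambda form above can return the located element directly.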
Q&A

Similar tools

Puppeteer

puppeteer/puppeteer: Headless Chrome Node.js API Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

PhantomJS

ariya/phantomjs: Scriptable Headless Browser PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. The latest stable release is version 2.1. Important: PhantomJS development is suspended until further notice (see #15344 for more details).

Supplement

Selenium driver.Url vs. driver.Navigate().GoToUrl()

Reference: c# - Selenium driver.Url vs. driver.Navigate().GoToUrl() - Stack Overflow
Selenium is an open source framework, so please have a look at the source code here. GoToUrl() is defined in RemoteNavigator.cs:
/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">String of where you want the browser to go to</param>
public void GoToUrl(string url)
{
    this.driver.Url = url;
}

/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">Uri object of where you want the browser to go to</param>
public void GoToUrl(Uri url)
{
    if (url == null)
    {
        throw new ArgumentNullException("url", "URL cannot be null.");
    }

    this.driver.Url = url.ToString();
}
Internally, driver.Navigate().GoToUrl() simply sets driver.Url = url.

Installing/uninstalling *.deb on Ubuntu

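References:
- 手动安装 | Ubuntu
- 如何在Ubuntu上安装Deb文件软件包 | myfreax
- 在 Ubuntu Linux 上安装 Deb 文件的 3 种方法 | Linux 中国 - 知乎
To install a deb package from the command line, you can use either the apt command or the dpkg command. Under the hood, apt uses dpkg, but apt is more popular and easier to use.
If you get a dependency error while installing a deb package, you can fix it with the following command: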
sudo apt install -f
Method 1
# Install a .deb file
sudo dpkg -i package_name.deb

# Uninstall
sudo dpkg -r program_name

# Query
# This lists every package whose name contains "grid", from which I can get the exact program name.
apt list --installed | grep grid
#WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
#appgrid/now 0.298 all [installed,local]
Method 2
# Install
# sudo apt install ./teamviewer_amd64.deb
sudo apt install path_to_deb_file

# Uninstall
# or: sudo apt remove program_name
sudo apt-get remove package_name

# Query
dpkg -l | grep grid
#ii appgrid 0.298 all Discover and install apps for Ubuntu

Selenium anti-anti-scraping

Reference: 网站如何识别 你是 selenium爬虫?那我们怎么解决(反反爬) - 程序员宅基地

chromedriver: error while loading shared libraries: libglib-2.0.so.0:

Reference: chromedriver: error while loading shared libraries: libgconf-2.so.4: cannot open shared object file_UESTC Like-CSDN博客
The following command resolved it:
apt-get install libglib2.0 -y
But it did not resolve the following:
Network is unreachable Network is unreachable
OpenQA.Selenium.WebDriverException: Cannot start the driver service on http://localhost:39255/
at OpenQA.Selenium.DriverService.Start()

chromedriver: error while loading shared libraries: libnss3.so:

References:
- Linux上error while loading shared libraries问题解决方法 - 锅边糊 - 博客园
- Linux/ubuntu:Chrome报错解决: error while loading shared libraries: libnss3.so libXss.so.1 libasound.so._个人博客-CSDN博客
apt-get install libnss3-dev -y

chromedriver: error while loading shared libraries: libxcb.so.1:

apt-get install libxcb1 -y

OpenQA.Selenium.WebDriverException: unknown error: cannot find Chrome binary

Reference: org.openqa.selenium.WebDriverException: unknown error: cannot find Chrome binary 异常解决_往復不息的博客-CSDN博客
Fix: Chrome is not installed correctly. If the error persists, specify the binary location manually:
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");

// Path to the Chrome executable
// Not needed as long as Chrome is installed correctly
//options.BinaryLocation = "";

OpenQA.Selenium.WebDriverArgumentException: invalid argument

// The url must be a valid, complete URL, e.g. http://moeci.com
OpenQA.Selenium.Navigator.GoToUrl(String url)
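One way to avoid this error is to validate or normalize the URL before navigating. The sketch below uses only the Python standard library; the function names are hypothetical, and restricting to http and https is an assumption that suits a screenshot service rather than a rule of WebDriver itself.

```python
from urllib.parse import urlparse


def is_absolute_http_url(url):
    """True when url has an http(s) scheme and a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def normalize_url(url, default_scheme="http"):
    """Return a complete URL, prepending default_scheme when the scheme is missing."""
    url = url.strip()
    if is_absolute_http_url(url):
        return url
    if "://" in url:
        # Some other scheme was given explicitly; do not guess
        raise ValueError(f"unsupported URL: {url!r}")
    candidate = f"{default_scheme}://{url}"
    if is_absolute_http_url(candidate):
        return candidate
    raise ValueError(f"not a valid URL: {url!r}")
```

For example, normalize_url("moeci.com") yields a URL that the driver accepts, while clearly malformed input is rejected before it reaches the browser.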

OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:40811/session timed out after 60 seconds.

References:
- c#-硒错误-60秒钟后,到远程WebDriver的HTTP请求超时 - ITranslater
- c# - Selenium Error - The HTTP request to the remote WebDriver timed out after 60 seconds - Stack Overflow
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--headless");

// Note: TimeSpan.FromMinutes(5) sets a 5-minute timeout
var driver = new ChromeDriver(chromeDriverDirectory: "/app/tools/selenium/", options, commandTimeout: TimeSpan.FromMinutes(5));

driver.Navigate().GoToUrl(url);

OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash

OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.ExecuteScriptCommand(String script, String commandName, Object[] args)
at OpenQA.Selenium.WebDriver.ExecuteScript(String script, Object[] args)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
References:
- selenium selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash from unknown error: cannot determine loading status - 胖头鹅 - 博客园
- selenium.WebDriverException: unknown error: session deleted because of page crash from tab crashed - Stack Overflow
- python - unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed with ChromeDriver Selenium - Stack Overflow
- SeleniumHQ/docker-selenium: Docker images for the Selenium Grid Server
- Docker run reference | Docker Documentation
This error only occurs when running inside a Docker container: the shared memory (shm_size) runs out, since the default is 64MB.
docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:4.1.2-20220217
docker-compose.yml
version: "3"
services:
  hub:
    image: selenium/hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
  firefox:
    image: selenium/node-firefox
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
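Since the crash comes from /dev/shm being too small, a quick check at startup can confirm whether the container received the larger size. This is a sketch only: the function names are hypothetical, and the 1 GiB threshold is an assumption chosen to match the shm_size value above.

```python
import shutil

# 1 GiB, chosen to match shm_size: '1gb' (an assumption; tune as needed)
MIN_SHM_BYTES = 1 * 1024 ** 3


def shm_is_sufficient(total_bytes, minimum_bytes=MIN_SHM_BYTES):
    """True when the shared memory size meets the minimum."""
    return total_bytes >= minimum_bytes


def check_dev_shm(path="/dev/shm", minimum_bytes=MIN_SHM_BYTES):
    """Return (ok, total_bytes) for the shared memory mount at path."""
    total = shutil.disk_usage(path).total
    return shm_is_sufficient(total, minimum_bytes), total
```

With Docker's default of 64MB the check reports False, which is a hint to pass --shm-size or set shm_size before starting Chrome.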

System.InvalidOperationException: session not created

System.InvalidOperationException: session not created
from tab crashed
(Session info: headless chrome=88.0.4324.182) (SessionNotCreated)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.StartSession(ICapabilities desiredCapabilities)
at OpenQA.Selenium.WebDriver..ctor(ICommandExecutor executor, ICapabilities capabilities)
at OpenQA.Selenium.Chromium.ChromiumDriver..ctor(ChromiumDriverService service, ChromiumOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(ChromeDriverService service, ChromeOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(String chromeDriverDirectory, ChromeOptions options, TimeSpan commandTimeout)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
Reference: selenium.common.exceptions.SessionNotCreatedException: Message: session not created from tab crashed using ChromeDriver Chrome Selenium Python - Stack Overflow
Fix:
var options = new ChromeOptions();
// https://stackoverflow.com/questions/59186984/selenium-common-exceptions-sessionnotcreatedexception-message-session-not-crea
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Important: in testing, it only succeeded after adding this line
options.AddArgument("--ignore-certificate-errors");

Timed out receiving message from renderer: 10.000

ChromeDriver was started successfully.
[1646482757.506][SEVERE]: Timed out receiving message from renderer: 10.000
[1646482757.506][WARNING]: screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.000
[1646482767.506][SEVERE]: Timed out receiving message from renderer: 10.000
OpenQA.Selenium.WebDriverTimeoutException: timeout: Timed out receiving message from renderer: 10.000
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.GetScreenshot()
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 288
at WebScreenshot.Controllers.HomeController.FileCache(Byte[]& cacheEntry, String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 167
at WebScreenshot.Controllers.HomeController.Get(String url, String jsurl, Int32 windowWidth, Int32 windowHeight) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 130
References:
- google chrome - Selenium Timed out receiving message from renderer - Stack Overflow
- selenium - How to handle "Time out receiving message from the renderer" in chrome driver? - Software Quality Assurance & Testing Stack Exchange
Fix:
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
options.AddArgument("--ignore-certificate-errors");
// Important: the line below
options.AddArgument("--disable-gpu");
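The same set of flags can be applied from Python. The sketch below keeps the flag list in one place and applies it to any object with an add_argument method; the names are hypothetical, and the selenium import is done lazily so the list and helper work without Selenium installed.

```python
# Flags collected from the C# snippets in these notes
CONTAINER_CHROME_ARGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--headless",
    "--ignore-certificate-errors",
    "--disable-gpu",
]


def apply_arguments(options, arguments=CONTAINER_CHROME_ARGS):
    """Add each flag to an options object exposing add_argument(); return it."""
    for argument in arguments:
        options.add_argument(argument)
    return options


def build_chrome_options():
    """Create selenium ChromeOptions with the flags applied (requires selenium)."""
    from selenium.webdriver.chrome.options import Options  # imported lazily
    return apply_arguments(Options())
```

Keeping the flags in a single list makes it easier to see which one was added for which error when revisiting these notes.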

screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.

References:
- automated testing - Selenium Java :- [1553593587.996][SEVERE]: Timed out receiving message from renderer: 10.000 [1553593587.997][WARNING]: screenshot failed, retrying - Software Quality Assurance & Testing Stack Exchange
- Timed out receiving message from renderer: 0.100 · Issue #442 · bonigarcia/webdrivermanager
- selenium - Timed out receiving message from renderer: 10.000 while capturing screenshot using chromedriver and chrome through Jenkins on Windows - Stack Overflow

Dockerfile: /bin/sh: 1: source: not found

References:
- /bin/sh: 1: source: not found (docker) - 哔哩哔哩
- 【docker】——报错:/bin/sh: 1: source: not found,docker加环境变量_怡宝2号-CSDN博客
- source ~/.bash_profile是什么意思 - fen斗 - 博客园
Add the directory containing chromedriver to PATH
Dockerfile
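# TODO: adding to PATH this way fails: it has no effect (each RUN starts a new shell, so the sourced environment is not kept)
RUN echo 'export PATH=$PATH:/app' >> ~/.bash_profile
RUN /bin/bash -c "source ~/.bash_profile"
# Add to PATH the Dockerfile way
ENV PATH=/app:$PATH
# Verify the versions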
RUN google-chrome --version
RUN chromedriver --version
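PS:
~ stands for your home directory,
.bash_profile is a hidden configuration file, mainly used to configure the bash shell,
source ~/.bash_profile makes this configuration file take effect immediately after it is modified.

References:
- 利用cookie免帐号密码登陆b站 - JavaShuo
- 利用python+selenium带上cookies自动登录bilibili-python黑洞网
Execute JavaScript: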
document.cookie ="SESSDATA=49d4147c%256557247677%2Cf295e641;domain=.bilibili.com;path=/";
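As an alternative to setting document.cookie through JavaScript, the same string can be turned into the dictionary form that Selenium's add_cookie accepts. This is a sketch with a hypothetical helper name; it handles only the name, value, domain, path, and secure parts, and the cookie value in the usage note is a placeholder.

```python
def cookie_from_string(cookie_string):
    """Convert 'NAME=VALUE;domain=...;path=...' into a dict with name/value/domain/path."""
    parts = [part.strip() for part in cookie_string.split(";") if part.strip()]
    if not parts or "=" not in parts[0]:
        raise ValueError("cookie string must start with NAME=VALUE")
    name, _, value = parts[0].partition("=")
    cookie = {"name": name.strip(), "value": value.strip()}
    for attribute in parts[1:]:
        key, _, attr_value = attribute.partition("=")
        key = key.strip().lower()
        if key in ("domain", "path"):
            cookie[key] = attr_value.strip()
        elif key == "secure":
            cookie["secure"] = True
    return cookie
```

With a live driver that has already loaded a page on the matching domain, this could be used as `driver.add_cookie(cookie_from_string("SESSDATA=abc123;domain=.bilibili.com;path=/"))` followed by a page refresh.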
References
Thanks for the help!
- wkhtmltopdf
- wkhtmltopdfhtml php生成pdf快照,网页截图,网页快照完整版 (原) - 戈丫汝 - 博客园
- 在Ubuntu上安装Chrome浏览器和ChromeDriver - 想54256 - 博客园
- .NET Core(C#) 操作selenium(Chrome)对网页截完整页面长图的方法及示例代码-CJavaPy
- chromedriver.storage.googleapis.com/index.html
- .NET(C#) Selenium操作调用浏览器判断页面元素(ElementIsVisible)可见的方法-CJavaPy
- .NET Selenium WebDriver操作调用浏览器后台执行Js(JavaScript)代码-CJavaPy
- The Selenium Browser Automation Project | Selenium (important: official documentation, with implementations in each language)
- Working with cookies | Selenium
- Github | 使用 Action 操作 Selenium 方案 | ZkeqのCoding日志