Python is frequently used for web scraping. Often, the ‘requests’ library is sufficient. However, it only sends plain HTTP requests. We can send a GET request to a website, but what if the actual page is loaded via JavaScript? Using a real browser/web driver allows us to load the page completely. Instead of simply sending a request to a URL, we can automatically execute scripts and download resources.
In my case, I’m scraping a website which requires me to be logged in. However, the login page has a variety of security measures that can’t easily be circumvented with simple HTTP requests. After realizing this, I decided to use a Selenium webdriver.
The selenium webdriver objects have a get_cookies function, which returns a list of dicts. Here is a list of the keys in each dictionary, alongside their type and a brief description:
- name (string): The name of the cookie.
- value (string): The value of the cookie.
- domain (string): The domain of the server the cookie is sent to.
- path (string): Document location in which cookie is sent.
- secure (bool): Cookie is only sent to the server in encrypted requests.
- httpOnly (bool): Prevents the cookie from being accessed through client side scripts.
- expiry (int): Unix timestamp indicating when the cookie expires.
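As a rough sketch, the shape of one such dictionary can be written down as a TypedDict. The names SeleniumCookie and expiry_as_datetime are made up for illustration, and the sketch assumes only the keys listed above are present. The expiry key is treated as optional here, on the assumption that session cookies do not carry one.

```python
from datetime import datetime, timezone
from typing import TypedDict


class _RequiredKeys(TypedDict):
    name: str
    value: str
    domain: str
    path: str
    secure: bool
    httpOnly: bool


class SeleniumCookie(_RequiredKeys, total=False):
    # Assumption: session cookies have no expiry, so this key may be absent.
    expiry: int


def expiry_as_datetime(cookie):
    """Return the expiry as an aware UTC datetime, or None if there is none."""
    expiry = cookie.get("expiry")
    if expiry is None:
        return None
    return datetime.fromtimestamp(expiry, tz=timezone.utc)


# Hand-written sample standing in for one item of get_cookies().
sample: SeleniumCookie = {
    "name": "cookie1",
    "value": "whatever",
    "domain": ".justin.ooo",
    "path": "/",
    "secure": True,
    "httpOnly": False,
    "expiry": 1590978394,
}

print(expiry_as_datetime(sample))
```

The type annotations are only documentation for a reader or a type checker; at runtime the cookie is still an ordinary dict.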
Knowing this, here is an example of the result of using json.dumps(list) to serialize the resulting list from get_cookies:
[
    {
        "name": "cookie1",
        "value": "whatever",
        "path": "/",
        "domain": ".justin.ooo",
        "secure": true,
        "httpOnly": false,
        "expiry": 1590978394
    },
    {
        "name": "cookie2",
        "value": "doesn't matter",
        "path": "/",
        "domain": "justin.ooo",
        "secure": true,
        "httpOnly": false,
        "expiry": 1559528794
    }
]
That’s a very simple example, containing 2 meaningless cookies. We need to somehow derive
After some brief searching on the CPython git repo, we can find the http.cookiejar.Cookie class source code & constructor. To effectively copy these cookies, we’ll need to create a Cookie object from each of our cookie dicts:
import http.cookiejar


def generate_cookie(cookie_raw):
    """
    Creates an http.cookiejar.Cookie object, given raw cookie information as dict.
    This dict must contain the following keys: name, value, domain, path, secure

    Parameters:
        cookie_raw (dict): The cookie information dictionary.

    Returns:
        http.cookiejar.Cookie: The generated cookie object.
    """
    domain = cookie_raw['domain']
    # expiry is optional; http.cookiejar expects None (not False) when it is
    # missing, because False compares as 0 and the cookie would count as expired
    expiry = cookie_raw.get('expiry')
    # initialize Cookie object
    cookie = http.cookiejar.Cookie(
        0,                       # version
        cookie_raw['name'],      # name
        cookie_raw['value'],     # value
        None,                    # port
        False,                   # port_specified
        domain,                  # domain
        True,                    # domain_specified
        domain.startswith('.'),  # domain_initial_dot
        cookie_raw['path'],      # path
        True,                    # path_specified
        cookie_raw['secure'],    # secure
        expiry,                  # expires
        False,                   # discard
        None,                    # comment
        None,                    # comment_url
        {},                      # rest (non-standard attributes)
    )
    return cookie
This block of code generates a single http.cookiejar.Cookie object from a dict. Logically, we can complete our goal by simply:
- Getting the list of raw cookies (dict) from our driver.
- Calling generate_cookie(dict) on each of those items to generate an http.cookiejar.Cookie object.
- Setting these http.cookiejar.Cookie objects as cookies in a requests.Session instance using the set_cookie(cookie) method.
Here is a working implementation of this logic, alongside a usage example:
import requests


def session_from_driver(browser):
    """
    Creates a 'requests.Session' object to make requests from.
    Automatically copies cookies from selenium driver into new session.

    Parameters:
        browser (selenium.webdriver): An instance of a selenium webdriver.

    Returns:
        requests.Session: A session containing cookies from the provided selenium.webdriver object.
    """
    cookies = browser.get_cookies()
    session = requests.Session()
    for cookie_raw in cookies:
        cookie = generate_cookie(cookie_raw)
        session.cookies.set_cookie(cookie)
    return session


# usage example (assumes the two functions above are saved in utils.py)
from selenium import webdriver
from selenium.webdriver.common.by import By
import utils

# initialize our webdriver
browser = webdriver.Firefox()  # note: geckodriver is needed for this
# load the website's login page
browser.get("https://example.com/login.php")
# fill out username/password, click the 'submit' button
# (the find_element_by_* helpers were removed in Selenium 4.3; use find_element with By)
browser.find_element(By.ID, "username").send_keys("justin")
browser.find_element(By.ID, "password").send_keys("testing123")
browser.find_element(By.XPATH, "//input[@type='submit']").click()
# use our session_from_driver function to create a requests.Session
session = utils.session_from_driver(browser)
# access data that should only be available if we're logged in!
response = session.get("https://example.com/profile.php")
print(response)
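One way to sanity-check the conversion without launching a browser is to feed hand-written cookie dicts through the standard library and look at the Cookie header it would produce. This is only a sketch: FakeDriver, to_cookie and cookie_header_for are hypothetical names, to_cookie is a condensed stand-in for generate_cookie so the snippet runs on its own, and the cookie values are made up.

```python
import http.cookiejar
import urllib.request


def to_cookie(raw):
    """Condensed stand-in for generate_cookie so this sketch runs on its own."""
    domain = raw["domain"]
    return http.cookiejar.Cookie(
        version=0, name=raw["name"], value=raw["value"],
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=raw["path"], path_specified=True,
        secure=raw["secure"], expires=raw.get("expiry"),
        discard=False, comment=None, comment_url=None, rest={},
    )


class FakeDriver:
    """Hypothetical stand-in for a webdriver; only get_cookies() is mimicked."""

    def get_cookies(self):
        # No "expiry" key, so this behaves like a session cookie.
        return [{
            "name": "cookie1", "value": "whatever", "path": "/",
            "domain": ".justin.ooo", "secure": True, "httpOnly": False,
        }]


def cookie_header_for(url, driver):
    """Return the Cookie header the jar would attach to a request for url."""
    jar = http.cookiejar.CookieJar()
    for raw in driver.get_cookies():
        jar.set_cookie(to_cookie(raw))
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)  # builds the header only; no network traffic
    return request.get_header("Cookie")


# Secure cookie, matching domain, HTTPS: the cookie should be attached.
print(cookie_header_for("https://www.justin.ooo/profile.php", FakeDriver()))
# Same cookie over plain HTTP: the secure flag should keep it out.
print(cookie_header_for("http://www.justin.ooo/profile.php", FakeDriver()))
```

If the first call comes back empty, the usual suspects are an expiry in the past or a domain that does not match the requested host.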
This solution allows us to use a real browser when needed, and seamlessly switch to the requests library to send standard HTTP requests while retaining session information.
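As an aside, requests can also build the Cookie objects itself: the session's cookie jar has a set(name, value, **kwargs) method that accepts keyword arguments such as domain, path, secure and expires. Below is a minimal sketch of the same copy step written that way. It is an alternative to the approach above, not a replacement for it; session_from_cookie_dicts is a hypothetical helper name, and the dicts are hand-written stand-ins for what get_cookies() returns.

```python
import requests


def session_from_cookie_dicts(cookie_dicts):
    """Build a requests.Session carrying the given Selenium-style cookie dicts."""
    session = requests.Session()
    for raw in cookie_dicts:
        # RequestsCookieJar.set creates the underlying http.cookiejar.Cookie.
        session.cookies.set(
            raw["name"],
            raw["value"],
            domain=raw["domain"],
            path=raw["path"],
            secure=raw["secure"],
            expires=raw.get("expiry"),  # None when the cookie has no expiry
        )
    return session


# Made-up sample data; the expiry value is an arbitrary far-future timestamp.
sample_cookies = [
    {"name": "cookie1", "value": "whatever", "path": "/",
     "domain": ".justin.ooo", "secure": True, "httpOnly": False,
     "expiry": 4102444800},
    {"name": "cookie2", "value": "doesn't matter", "path": "/",
     "domain": "justin.ooo", "secure": True, "httpOnly": False},
]

session = session_from_cookie_dicts(sample_cookies)
print(session.cookies.get("cookie1", domain=".justin.ooo"))
```

With a real driver, the list passed in would simply be browser.get_cookies().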