Python is frequently used for web scraping. Often, the ‘requests’ library is sufficient, but it only sends plain HTTP requests. We can send a GET request to a website, but what if the actual page is loaded via JavaScript? Using a real browser through a web driver allows us to load the page completely: instead of simply sending a request to a URL, the browser automatically executes scripts and downloads resources.
In my case, I’m scraping a website which requires me to be logged in. However, the login page has a variety of security measures that can’t easily be circumvented with simple HTTP requests. After realizing this, I decided to use a Selenium webdriver to log in, then carry the resulting cookies over to a requests session.
Selenium webdriver objects have a get_cookies() method, which returns a list of dicts. Here is a list of the keys in each dict, alongside their type and a brief description:
- name (string): The name of the cookie.
- value (string): The value of the cookie.
- domain (string): The domain of the server the cookie is sent to.
- path (string): Document location in which the cookie is sent.
- secure (bool): Cookie is only sent to the server in encrypted requests.
- httpOnly (bool): Prevents the cookie from being accessed through client-side scripts.
- expiry (int): Unix timestamp indicating when the cookie expires.
Knowing this, here is an example of the result of using json.dumps(list) to serialize the resulting list from get_cookies:
    [
        {
            "name": "cookie1",
            "value": "whatever",
            "path": "/",
            "domain": ".justin.ooo",
            "secure": true,
            "httpOnly": false,
            "expiry": 1590978394
        },
        {
            "name": "cookie2",
            "value": "doesn't matter",
            "path": "/",
            "domain": "justin.ooo",
            "secure": true,
            "httpOnly": false,
            "expiry": 1559528794
        }
    ]
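Because this is plain JSON, a saved dump can be loaded back and inspected without a browser. The following is a small sketch using only the standard library; the `raw` string simply repeats the first sample cookie above, and the variable names are illustrative.

```python
import json
from datetime import datetime, timezone

# Sample data: the first cookie from the example above, as serialized JSON.
raw = '''
[
  {"name": "cookie1", "value": "whatever", "path": "/",
   "domain": ".justin.ooo", "secure": true, "httpOnly": false,
   "expiry": 1590978394}
]
'''

cookies = json.loads(raw)

# Index the cookies by name for easy lookup.
by_name = {c["name"]: c for c in cookies}

# 'expiry' is a Unix timestamp; convert it to a timezone-aware datetime.
# The key may be absent (e.g. for session cookies), hence the .get().
expiry = by_name["cookie1"].get("expiry")
expires_at = (
    datetime.fromtimestamp(expiry, tz=timezone.utc) if expiry is not None else None
)

print(by_name["cookie1"]["domain"], expires_at)
```

Treating expiry as optional here mirrors what the driver returns: the other keys are always worth reading directly, while expiry is the one most likely to be missing.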
That’s a very simple example, containing 2 meaningless cookies. We need to somehow derive cookie objects that requests can use from these dicts.
After some brief searching in the CPython git repo, we can find the source code and constructor of the http.cookiejar.Cookie class. This is the class we want, because requests keeps its cookies in a jar built on http.cookiejar. To effectively copy these cookies, we’ll need to create a Cookie instance from each of our cookie dicts:
    import http.cookiejar


    def generate_cookie(cookie_raw):
        """
        Creates a http.cookiejar.Cookie object, given raw cookie information as dict.
        This dict must contain the following keys: name, value, domain, path, secure

        Parameters:
            cookie_raw (dict): The cookie information dictionary.

        Returns:
            http.cookiejar.Cookie: The generated cookie object.
        """
        # expiry is optional; None marks a session cookie. (False would be
        # coerced to 0, i.e. a cookie that expired in 1970 and is never sent.)
        expiry = cookie_raw.get('expiry')

        # initialize Cookie object
        cookie = http.cookiejar.Cookie(
            0,                                     # version
            cookie_raw['name'],                    # name
            cookie_raw['value'],                   # value
            None,                                  # port
            False,                                 # port_specified
            cookie_raw['domain'],                  # domain
            True,                                  # domain_specified
            cookie_raw['domain'].startswith('.'),  # domain_initial_dot
            cookie_raw['path'],                    # path
            True,                                  # path_specified
            cookie_raw['secure'],                  # secure
            expiry,                                # expires
            expiry is None,                        # discard (session cookies only)
            None,                                  # comment
            None,                                  # comment_url
            {},                                    # rest (non-standard attributes)
        )
        return cookie
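One detail worth checking when a cookie has no expiry is what gets passed as expires. As far as I can tell from the CPython source, http.cookiejar treats None as "session cookie", while any timestamp in the past marks the cookie as expired, and an expired cookie is left out of outgoing requests. The sketch below illustrates this with the standard library only; make_cookie is a hypothetical helper written for this demonstration, not part of any library.

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, expires):
    """Hypothetical helper: build a minimal cookie for example.com."""
    return http.cookiejar.Cookie(
        0, name, value,
        None, False,                 # port, port_specified
        "example.com", True, False,  # domain, domain_specified, domain_initial_dot
        "/", True,                   # path, path_specified
        False,                       # secure
        expires,                     # expires
        expires is None,             # discard
        None, None, {},              # comment, comment_url, rest
    )


session_cookie = make_cookie("keep", "1", None)  # no expiry: session cookie
stale_cookie = make_cookie("drop", "1", 0)       # expired at the Unix epoch

jar = http.cookiejar.CookieJar()
jar.set_cookie(session_cookie)
jar.set_cookie(stale_cookie)

# Ask the jar which cookies it would attach to a request for this URL.
request = urllib.request.Request("https://example.com/profile.php")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")

print(session_cookie.is_expired(), stale_cookie.is_expired())
print(cookie_header)
```

The same check is a handy way to debug a copied session: if a cookie you expect is missing from the header, its expiry, domain, path or secure flag is the first place to look.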
This block of code generates a single http.cookiejar.Cookie object from a dict. Logically, we can complete our goal by simply:
- Getting the list of raw cookies (each one a dict) from our driver.
- Calling generate_cookie(dict) on each of those items to generate a http.cookiejar.Cookie object.
- Setting these http.cookiejar.Cookie objects as cookies in a requests.Session instance using the set_cookie(cookie) method.
Here is a working implementation of this logic, alongside a usage example:
    import requests


    def session_from_driver(browser):
        """
        Creates a 'requests.Session' object to make requests from.
        Automatically copies cookies from selenium driver into new session.

        Parameters:
            browser (selenium.webdriver): An instance of a selenium webdriver.

        Returns:
            requests.Session: A session containing cookies from the provided selenium.webdriver object.
        """
        cookies = browser.get_cookies()
        session = requests.Session()

        # convert each raw cookie dict and add it to the session's cookie jar
        for cookie_raw in cookies:
            cookie = generate_cookie(cookie_raw)
            session.cookies.set_cookie(cookie)

        return session
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    import utils  # the module containing session_from_driver

    # initialize our webdriver
    browser = webdriver.Firefox()  # note: geckodriver is needed for this

    # load the website's login page
    browser.get("https://example.com/login.php")

    # fill out username/password, click the 'submit' button
    # (the find_element_by_* helpers were removed in Selenium 4)
    browser.find_element(By.ID, "username").send_keys("justin")
    browser.find_element(By.ID, "password").send_keys("testing123")
    browser.find_element(By.XPATH, "//input[@type='submit']").click()

    # use our session_from_driver function to create a requests.Session
    session = utils.session_from_driver(browser)

    # access data that should only be available if we're logged in!
    response = session.get("https://example.com/profile.php")
    print(response)
This solution allows us to use a real browser when needed, and seamlessly switch to the requests library to send standard HTTP requests while retaining session information.