{"id":43,"date":"2019-06-04T03:08:42","date_gmt":"2019-06-04T03:08:42","guid":{"rendered":"http:\/\/justin.ooo\/?p=43"},"modified":"2023-06-20T03:12:16","modified_gmt":"2023-06-20T03:12:16","slug":"creating-a-requests-session-from-a-selenium-webdriver","status":"publish","type":"post","link":"https:\/\/justin.ooo\/index.php\/2019\/06\/04\/creating-a-requests-session-from-a-selenium-webdriver\/","title":{"rendered":"Creating a &#8216;requests&#8217; session from a selenium web driver"},"content":{"rendered":"\n<p>Python is frequently used for web scraping. Often, the &#8216;requests&#8217; library is sufficient. However, it only handles plain HTTP requests. We can send a GET request to a website, but what if the actual page is loaded via JavaScript? Using a real browser\/web driver allows us to load the page completely. Instead of simply sending a request to a URL, we can automatically execute scripts and download resources.<\/p>\n\n\n\n<p>In my case, I&#8217;m scraping a website which requires me to be logged in. However, the login page has a variety of security measures that can&#8217;t easily be circumvented with simple HTTP requests. After realizing this, I decided to use a selenium webdriver to complete the login. After logging in, I simply needed the session information (cookies) established by the login request to scrape the rest of the site.<\/p>\n\n\n\n<p>Selenium webdriver objects have a <a href=\"https:\/\/selenium-python.readthedocs.io\/api.html#selenium.webdriver.remote.webdriver.WebDriver.get_cookies\"><strong>get_cookies<\/strong><\/a> method, which returns a <strong>list<\/strong> of <strong>dict<\/strong>s. 
Here is a list of the <strong>key<\/strong>s in each dictionary, alongside their <strong>type<\/strong> and a brief <strong>description<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted has-white-color has-black-background-color has-text-color has-background\">name (string): The name of the cookie.\nvalue (string): The value of the cookie.\ndomain (string): The domain of the server the cookie is sent to.\npath (string): The URL path for which the cookie is sent.\nsecure (bool): Cookie is only sent to the server in encrypted requests.\nhttpOnly (bool): Prevents the cookie from being accessed through client side scripts.\nexpiry (int): Unix timestamp indicating when the cookie expires. Omitted for session cookies.<\/pre>\n\n\n\n<p>Knowing this, here is an example of the list returned by <strong>get_cookies<\/strong>, serialized with <strong>json.dumps(list)<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"json\" class=\"language-json line-numbers\">[\n  {\n    \"name\": \"cookie1\",\n    \"value\": \"whatever\",\n    \"path\": \"\/\",\n    \"domain\": \".justin.ooo\",\n    \"secure\": true,\n    \"httpOnly\": false,\n    \"expiry\": 1590978394\n  },\n  {\n    \"name\": \"cookie2\",\n    \"value\": \"doesn't matter\",\n    \"path\": \"\/\",\n    \"domain\": \"justin.ooo\",\n    \"secure\": true,\n    \"httpOnly\": false,\n    \"expiry\": 1559528794\n  }\n]<\/code><\/pre>\n\n\n\n<p>That&#8217;s a very simple example, containing two placeholder cookies. We need to somehow derive a requests.Session object from this list of dictionaries. Unfortunately, the requests library does not store cookies in simple dicts. 
Instead, it uses <strong><a href=\"https:\/\/docs.python.org\/2\/library\/cookielib.html#cookielib.Cookie\">http.cookiejar.Cookie<\/a><\/strong> objects.<\/p>\n\n\n\n<p>After some brief searching on the <strong><a href=\"https:\/\/github.com\/python\/cpython\/\">CPython git repo<\/a><\/strong>, we can find the <strong>http.cookiejar.Cookie<\/strong> class <a href=\"https:\/\/github.com\/python\/cpython\/blob\/c76add7afd68387aa2481d672e1c0d7e7b4c9afc\/Lib\/http\/cookiejar.py#L729\">source code<\/a> and <strong><a href=\"https:\/\/github.com\/python\/cpython\/blob\/c76add7afd68387aa2481d672e1c0d7e7b4c9afc\/Lib\/http\/cookiejar.py#L747\">constructor<\/a><\/strong>. To copy these cookies, we&#8217;ll need to create a Cookie object from each of our cookie dicts, and then set each of those cookies in the new session. The <a href=\"https:\/\/github.com\/kennethreitz\/requests\/blob\/75bdc998e2d430a35d869b2abf1779bd0d34890e\/requests\/cookies.py#L441\">requests.cookies.create_cookie<\/a> function can do the same job, but I chose to use the standard constructor. 
My solution is written as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import http.cookiejar\n\n\ndef generate_cookie(cookie_raw):\n    \"\"\"\n    Creates a http.cookiejar.Cookie object, given raw cookie information as dict.\n    This dict must contain the following keys: name, value, domain, path, secure\n    The expiry key is optional.\n    Parameters:\n        cookie_raw (dict): The cookie information dictionary.\n    Returns:\n        http.cookiejar.Cookie: The generated cookie object.\n    \"\"\"\n    # expiry is optional; session cookies have none, so default to None\n    # (False compares equal to 0, which would make the cookie look expired)\n    expires = cookie_raw.get('expiry')\n    # initialize Cookie object\n    cookie = http.cookiejar.Cookie(\n        0,                      # version\n        cookie_raw['name'],     # name\n        cookie_raw['value'],    # value\n        None,                   # port\n        False,                  # port_specified\n        cookie_raw['domain'],   # domain\n        True,                   # domain_specified\n        cookie_raw['domain'].startswith('.'),  # domain_initial_dot\n        cookie_raw['path'],     # path\n        True,                   # path_specified\n        cookie_raw['secure'],   # secure\n        expires,                # expires\n        expires is None,        # discard (True for session cookies)\n        None,                   # comment\n        None,                   # comment_url\n        {},                     # rest (non-standard attributes)\n        )\n    return cookie<\/code><\/pre>\n\n\n\n<p>This block of code generates a single <strong>http.cookiejar.Cookie<\/strong> object from a dict. 
Logically, we can complete our goal by:<\/p>\n\n\n\n<ul>\n<li>Getting the list of raw cookies (<strong class=\"\">dict<\/strong>) from our driver.<\/li>\n\n\n\n<li>Calling <strong>generate_cookie(dict)<\/strong> on each of those items to generate a <strong>http.cookiejar.Cookie<\/strong> object.<\/li>\n\n\n\n<li>Setting these <strong>http.cookiejar.Cookie<\/strong> objects as cookies in a <strong>requests.Session<\/strong> instance using the <strong><a href=\"https:\/\/2.python-requests.org\/en\/master\/api\/#requests.cookies.RequestsCookieJar.set_cookie\">set_cookie(cookie)<\/a><\/strong> method.<\/li>\n<\/ul>\n\n\n\n<p>Here is a working implementation of this logic, alongside a usage example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import requests\n\n\ndef session_from_driver(browser):\n    \"\"\"\n    Creates a 'requests.Session' object to make requests from.\n    Automatically copies cookies from selenium driver into new session.\n    Parameters:\n        browser (selenium.webdriver): An instance of a selenium webdriver.\n    Returns:\n        requests.Session: A session containing cookies from the provided selenium.webdriver object.\n    \"\"\"\n    cookies = browser.get_cookies()\n    session = requests.Session()\n    for cookie_raw in cookies:\n        cookie = generate_cookie(cookie_raw)\n        session.cookies.set_cookie(cookie)\n    # hand the populated session back to the caller\n    return session<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">from selenium import webdriver\nfrom selenium.webdriver.common.by import By\nimport utils  # module containing generate_cookie and session_from_driver\n# initialize our webdriver\nbrowser = webdriver.Firefox() # note: geckodriver is needed for this\n# load the website's login page\nbrowser.get(\"https:\/\/example.com\/login.php\")\n# fill out username\/password, click the 'submit' button\n# (Selenium 4 removed the find_element_by_* helpers in favor of find_element)\nbrowser.find_element(By.ID, \"username\").send_keys(\"justin\")\nbrowser.find_element(By.ID, \"password\").send_keys(\"testing123\")\nbrowser.find_element(By.XPATH, \"\/\/input[@type='submit']\").click()\n# use our session_from_driver function to create a requests.Session\nsession = utils.session_from_driver(browser)\n# access data that should only be available if we're logged in!\nresponse = session.get(\"https:\/\/example.com\/profile.php\")\nprint(response)<\/code><\/pre>\n\n\n\n<p>This solution allows us to use a real browser when needed, and seamlessly switch to the requests library to send standard HTTP requests while retaining session information.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python is frequently used for web scraping. Often times, the &#8216;requests&#8217; library is sufficient. However, it is typically only used for basic requests. We can send a GET request to a website, but what if the actual page is loaded via javascript? Using a real browser\/web driver allows us to load the page completely. 
Instead of simply sending a request to a url, we can automatically execute scripts and download [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[2,3],"tags":[],"_links":{"self":[{"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/posts\/43"}],"collection":[{"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/comments?post=43"}],"version-history":[{"count":35,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/posts\/43\/revisions"}],"predecessor-version":[{"id":264,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/posts\/43\/revisions\/264"}],"wp:attachment":[{"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/media?parent=43"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/categories?post=43"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/justin.ooo\/index.php\/wp-json\/wp\/v2\/tags?post=43"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}