Python Web Scraping


1. What is it?

1.1. Web Crawling involves programs that automatically scan web page content, normally for the purpose of classification

1.1.1. such programs are often referred to as spiders

1.1.2. spiders are used by search engines like Google

1.1.3. web searches are based on classification data produced by spiders

1.1.4. spiders are automated programs, often categorised as bots, which automatically run and follow links from page to page across millions of web sites

1.2. Web Scraping can describe part of the activity of Web Crawling, but more often it describes a program that targets a specific website

1.2.1. Web Scraping involves an automated program known as a scraper

1.2.2. Purpose of a scraper is to extract relevant data from targeted web page(s) and aggregate the data into a usable, structured format

2. Web Scraping Ethics

2.1. Legally ambiguous

2.1.1. Intellectual Property (IP) rights include Copyright, an automatically granted right assigned to the creator of data; Copyright applies not to acquiring data but to the usage of it, so the scraping itself is generally legal

2.2. Must consult the robots.txt document that should be stored in the root directory of the website

2.2.1. robots.txt specifies what is and what is not allowed by the site in terms of bots

2.2.2. you should respect the information in the robots.txt file

2.3. You should generally respect the website you are scraping and avoid putting excessive loads on the server via high volumes of automated requests that might overload the server and degrade its performance

3. Jupyter Notebook

3.1. Server-client application that allows editing and execution of notebook documents via web browser

3.2. notebook documents are rich text, human readable documents that combine executable code, text, pictures, graphs, etc.

3.3. application can be installed on a local machine and run without an Internet connection, or can be installed on a remote server and run via an Internet connection

3.4. Notebook documents attach to a code kernel, which enables the embedded code to be executed when notebook is run

3.4.1. Jupyter notebooks attached to a Python code kernel are known as IPython notebooks; notebook files in IPython format take the *.ipynb file extension

3.5. Notebook editing tips

3.5.1. Ctrl + Enter executes cell and keeps focus on that cell

3.5.2. Shift + Enter executes cell and creates new cell below, shifting focus to the new cell

3.5.3. The print function can be used, but the value of the last expression in a cell is also displayed implicitly - example cell input: x = [1,2,3,4,5] followed by x on the last line

3.5.4. Press A to insert new cell above active one, press B to insert new cell below

3.5.5. Press D twice (i.e. D + D) to delete active cell

3.5.6. Press M to convert cell to Markdown (see Markdown cell formatting reference)

3.5.7. Press Y to convert cell to Code

4. Anaconda

4.1. A distribution that bundles Python language, Jupyter Notebook application and numerous packages for data science

4.2. Installing packages

4.2.1. Use the Anaconda Prompt or Anaconda Powershell Prompt; Windows provides the older cmd command prompt and the newer Powershell prompt - you can run the same commands for tasks like package installation via either of these

4.2.2. pip install <package_name> example pip install requests-html

5. APIs

5.1. Application Programming Interface

5.2. Not web scraping, but should always be used in preference to web scraping where available

5.3. APIs can be free or paid

5.4. For getting data from websites, we use web APIs, based on HTTP

5.5. API documentation should specify how to use the API and the format of the response

5.5.1. common response format is JSON


6. HTTP

6.1. HyperText Transfer Protocol

6.2. Websites consist of a collection of HTML code files, image files, video files and various other files, such as style sheets, etc.

6.3. Client makes request to download file and server responds with requested file

6.3.1. Web browser (client) interprets downloaded files, displaying website content within the browser window

6.4. HTTP Request Methods

6.4.1. GET - the most popular request method: fetches data from the server; can be bookmarked; parameters are added to the URL, but in plain text; not to be used for exchanging sensitive information

6.4.2. POST - the 2nd most popular HTTP method: alters the state of some object held server-side (example: a shopping basket); used to send confidential information - login credentials are always passed via a POST request; parameters are added in a separate request body

6.5. HTTP Response Codes

6.5.1. 200 Request was processed successfully

6.5.2. 404 Error: page not found

6.6. HTTP Response format

6.6.1. HTML for web pages

6.6.2. For web APIs, most common response format is JSON


7. JSON

7.1. Fundamentally you can think of JSON as Python dictionaries, where the keys are always strings and the values can be any of the following:

7.1.1. string

7.1.2. number

7.1.3. object i.e. a nested dictionary { }

7.1.4. array i.e. a list [ ]

7.1.5. null

7.1.6. Boolean

7.2. Many web APIs give their response payloads for both GET and POST in JSON format
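The JSON-to-Python mapping above can be demonstrated with the standard library json module; the payload below is invented for illustration and covers each JSON value type:

```python
import json

# Hypothetical API payload: string, number, object, array, null and Boolean values
payload = ('{"name": "rates", "count": 2, "meta": {"page": 1}, '
           '"tags": ["fx", "daily"], "next": null, "ok": true}')

data = json.loads(payload)       # JSON text -> Python dict
print(type(data["meta"]))        # object -> dict
print(type(data["tags"]))        # array  -> list
print(data["next"], data["ok"])  # null -> None, true -> True
```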

8. Web API with JSON Response

8.1. import requests

8.1.1. requests module gives us the ability to send HTTP GET requests to a web server and capture the response

8.2. base_url = "<url>"

8.2.1. capture the URL for the GET request in a variable

8.3. response = requests.get(base_url)

8.3.1. invoke get function from imported requests module, passing in the URL via the variable

8.3.2. requests.get() returns requests.Response object

8.4. print(response.ok) print(response.status_code)

8.4.1. check the response status (looking for True and 200)

8.5. response.json()

8.5.1. returns response payload as a Python dictionary - note: print response.text first to verify that the format is JSON

8.5.2. note: when status_code is 400 (Bad Request) you can normally expect to see details of the error returned in the JSON response

8.5.3. note: response is a variable that references an object instance of class Response, which is returned by requests.get()

8.6. import json
results = json.dumps(response.json(), indent=4)
print(results)

8.6.1. use dumps() function from json module to print a pretty string representation of dictionary

8.6.2. json.dumps()

8.7. Adding parameters

8.7.1. after base URL, add "?" followed by <param>=<val>

8.7.2. multiple <val> may be comma separated

8.7.3. to specify more than one parameter, separate <param>=<val> sequences by &

8.7.4. example,EUR,CAD

8.7.5. requests.get() with params is a better way to pass parameters in a GET request because it automatically handles symbols/whitespace that are not legal in a URL; params takes a dictionary for its value, where every key is a valid API parameter name and the value is the paired parameter value - example:
import requests
base_url = ""
r = requests.get(base_url, params={"term":"kraftwerk", "country":"GB"})

8.8. Pagination

8.8.1. Search APIs sometimes deliver results in pages - e.g. Google search sends results one page at a time to the client browser

8.8.2. API documentation should describe if pagination is used and how to retrieve particular pages

8.8.3. example:
import requests
get_url = ""
r = requests.get(get_url, params={"page":1, "description":"sql"})
this API has a page parameter to specify which page you want; to loop through multiple pages, we can use a for loop and append results to a list
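The loop-and-append approach can be sketched as follows; fake_get is a stand-in for requests.get() with invented data, so the pagination pattern runs without a live API:

```python
# fake_get stands in for a paginated API call: two pages of invented results
def fake_get(url, params):
    pages = {1: ["job A", "job B"], 2: ["job C"]}
    return pages.get(params["page"], [])

all_results = []
page = 1
while True:
    results = fake_get("", {"page": page, "description": "sql"})
    if not results:              # an empty page means no more data
        break
    all_results.extend(results)  # accumulate this page's results
    page += 1

print(all_results)
```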

8.9. APIs with authentication

8.9.1. Requires registration and sometimes a paid subscription

8.9.2. Once registered, you get an application ID and Key, which is the equivalent of a username + password but dedicated to API use

8.9.3. HTTP POST method used instead of GET - with POST requests, the data is sent in the message body and not as part of the request URL itself (as with GET); more secure; the amount of data that can be sent as part of the request is unlimited, unlike GET, which is limited by URL length restrictions

8.9.4. - url is the first and required parameter, as with requests.get(); params is an optional parameter - this is where you will typically put the application ID and key that you get as a registered user of the API; its value will be a dictionary with keys/values driven by the API documentation; headers is an optional dictionary parameter; json is an optional parameter taking the JSON payload, which in Python terms is a dictionary

8.9.5. example (see my "Python Web API with POST (Authentication).ipynb" notebook for more info):
import requests
import json
app_id = "8548bf9b"
app_key = "df38bd7b9b3a6283ea6b1f5dca7ed85f"
api_endpoint = ""
header = {"Content-Type": "application/json"}
recipe = {"title":"Cappuccino","ingr":["18g ground espresso","150ml milk"]}
r =, params={"app_id":app_id, "app_key":app_key}, headers=header, json=recipe)

9. Structure data with pandas

9.1. pandas library has data frame object

9.1.1. data frame is a structured data type, a lot like a table

9.2. import pandas as pd

9.2.1. pd alias is a commonly used convention but not required

9.3. passing a dictionary of shallow dictionaries into a DataFrame constructor is very quick and easy method for creating a new data frame

9.3.1. a shallow dictionary is one that does not consist of any complex values such as nested lists or nested dictionaries

9.3.2. example: searchResDF = pd.DataFrame(r.json()["results"]) - note that r represents a response object returned from a Web API GET request, the json() method returns a dictionary object from the JSON-formatted response payload, and "results" is the key that references the dictionary of dictionaries

9.4. data frame objects can be easily exported to csv format

9.4.1. example searchResDF.to_csv("itunes_search_results.csv")

9.5. data frame objects can be easily exported to Excel format

9.5.1. example searchResDF.to_excel("itunes_search_results.xlsx")

10. HTTP File Downloads

10.1. Any response can be captured and then written to a file, but it is inefficient to download the content of a large request wholly into RAM and then open up a file stream to write that content to file

10.2. Leveraging the stream parameter of the requests.get() function is the key to implementing a smarter, more efficient process for HTTP downloads, and use the Python with statement for more elegant coding

10.2.1. Python with statement: commonly used with file streams because it guarantees the file is closed in the event of an exception arising, and removes the need to explicitly call the .close() method on the file stream; also used for other processes with dependencies on things external to the Python environment, including locks, sockets, telnet connections, etc. (the recurring theme being automated assurance of closing down the connection when the work is done, including if the work is interrupted by an exception)
syntax: with <expression opening connection> as <connection object>: <statement(s) involving connection object>
example:
with open('file_path', 'w') as file:
    file.write('hello world !')

10.2.2. when we create a GET request using requests.get() and specify stream=True, we can iterate the response (for the purpose of writing) in chunks, the size of which we specify in bytes; to iterate the response content we invoke the .iter_content() method, passing the chunk_size keyword argument

11. HTML

11.1. HyperText Markup Language is code that describes structure and content of web pages, and is used by web browsers to render web pages

11.1.1. In addition to HTML, web pages are often supplemented by CSS and JavaScript

11.2. HTML document is composed of nested elements

11.2.1. every document has a Head and a Body element; Head contains metadata describing such things as Title, Language and Style (all of which are nested elements in the Head element) - no data in the Head is used to display content in the browser; it is primarily used by search engines to understand what kind of page they are looking at; Body contains all the content that will become visible in the browser, including elements such as Link, Image, Paragraph and Table

11.2.2. To be valid HTML, every element (other than the root html element) must be wholly nested inside another element

11.2.3. The syntax for every element is: <tag_name>content</tag_name> - tag names identify the element and can be any of the predefined names for HTML, or in some cases custom; content can be text, other elements, or a combination of both; note: some elements are content-less and consist solely of attributes - these have a syntax of: <tag_name attribute="value">, e.g. <img src="logo.png" alt="Logo">

11.2.4. First tag in an HTML document should be: <!DOCTYPE html> Below the opening <!DOCTYPE html> tag comes the root <html>...</html> element and all content is placed between the opening and closing html tags

11.2.5. Tag attributes are specified as name="value" pairs and are always placed inside opening tags; different elements support different tag attributes; tag attributes are separated from the tag name by a space, and multiple name="value" attributes are separated by spaces. The element for a link is <a>..</a>, which stems from the old term "anchor"; the content of the <a> link element is text that serves as a hot-link label; to re-direct the browser (on click) to an external URL, we add the href attribute; to make the browser open the external link in a new browser tab, we use the target attribute with the "_blank" value - example: <a href="" target="_blank">Example link</a>

11.3. Most important tag attributes we need to understand for web scraping are Class and ID

11.3.1. HTML class attribute: used to group elements of the same category so that all of them can be manipulated at once - example: class="menu"; elements can have more than one class, as they can belong to more than one category; multiple class assignments are specified inside a single class="value" attribute, where multiple values are separated by a space; web developers often use very descriptive names for class attribute values because this helps improve their search engine rankings

11.3.2. HTML id attribute: value must be unique across the HTML document (web page); every element can have only one value for its id attribute

11.4. Popular tags

11.4.1. HTML head tags: the title tag is mandatory for all HTML docs, there can be only one such element, and it is used by search engines like Google for categorisation; the meta tag is a content-less tag featuring attributes only - the charset attribute specifies the character encoding (e.g. UTF-8), while the name and content attributes pair together for various metadata, including author, description, keywords, etc.; the style tag is used with CSS content to define the style of the HTML document, working in combination with class and id attributes in the body elements; the script tag is used with JavaScript content

11.4.2. HTML body tags: the div tag defines a division or section in the HTML doc - just a container for other elements, a way to group them, used almost exclusively with class and id attributes; the span tag is embedded within content, typically to apply some styling to part of the content whilst keeping the content together - unlike div tags, span tag content never starts on a new line; the iframe tag is used to embed another HTML document, with the src attribute specifying a link to the embedded document, often a URL external to the site; the img tag specifies an image to display - its src and alt attributes are both required

11.5. HTML Lists

11.5.1. ordered lists HTML ol tag numbered by default, but alternatives can be specified

11.5.2. unordered lists HTML ul tag bullet point list

11.5.3. HTML li tag list item, carries content for both ordered and unordered lists

11.6. HTML table tag

11.6.1. defines HTML table

11.6.2. consists of nested table row elements HTML tr tag consists of nested table data or table header elements

11.6.3. example <table> <tr> <th>Month</th> <th>Savings</th> </tr> <tr> <td>January</td> <td>$100</td> </tr> </table>

11.7. Handling reserved characters or symbols not on our keyboard

11.7.1. Reserved symbols include < and >

11.7.2. 3 methods: specify the name of the symbol with an & prefix and ; suffix, e.g. &lt; for < (note: not every symbol can be represented by a name and not every browser will recognise it); specify the decimal code of the Unicode codepoint for the symbol, topped and tailed with & and ;, e.g. &#60;; specify the hex code of the Unicode codepoint for the symbol, topped and tailed with & and ; but with x in between the & and the hex number, e.g. &#x3C;

11.8. Watch out for the non breaking space!

11.8.1. the non breaking space looks like a regular space on the screen but it has a different Unicode codepoint value of 160 (vs 32 for a regular space) in hex, the nbsp is A0

11.8.2. referred to as nbsp character

11.8.3. an nbsp is used in HTML to ensure that two words are kept together and not allowed to break apart over two lines
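Python's standard library html module decodes all three entity spellings, and makes the nbsp pitfall concrete:

```python
import html

# Name, decimal and hex entity forms of "<" all decode to the same character
print(html.unescape("&lt;"), html.unescape("&#60;"), html.unescape("&#x3C;"))

nbsp = html.unescape("&nbsp;")
print(ord(nbsp), ord(" "))  # 160 vs 32 - looks identical on screen
print(nbsp == " ")          # False: a frequent source of scraping surprises
```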

11.9. XHTML

11.9.1. HTML rules are specified as guidelines, which means poorly written HTML code that ignores certain rules is still allowed - web browsers automatically handle things like opening tags with no closing tags, attribute values specified without double quotes, etc.; XHTML is a strict standard that insists on valid HTML; there are websites out there written in XHTML, but it never took off in a big way, which means most websites are based on HTML

12. CSS

12.1. Cascading Style Sheets

12.2. language used to describe presentation and style of HTML documents

12.3. 3 ways that style can be applied to HTML element

12.3.1. inline - style attribute placed on the element itself, e.g. <p style="color:red; text-align:center;">...</p> - note the property:value; syntax for the style attribute values

12.3.2. internal - style element embedded inside the head element; style element content is composed of CSS selectors followed by CSS properties and values wrapped in curly braces { }, e.g. table, th, td { border: 1px solid black; } - note that you can specify multiple CSS selectors (e.g. table, th, td) with a common set of CSS properties, separating each by a comma

12.3.3. external separate file that uses same syntax as the internal style element browser downloads the CSS file and applies styles to every page in the site based on that file this approach allows the entire look and feel of a website to be changed by altering this single CSS file it's also faster for browsers to apply styling this way

12.4. CSS Ref for Properties

13. Beautiful Soup

13.1. Python package for parsing HTML and XML documents - ideal for web scraping

13.2. Web scraping workflow

13.2.1. 1. Inspect the page - use the browser developer tool to inspect and get a feel for the page structure; be aware that the developer's inspect tool often shows HTML after JavaScript has modified it, so it may differ from the raw HTML you download

13.2.2. 2. Obtain HTML requests.get()

13.2.3. 3. Choose Parser - parsing is the process of decomposing the HTML page and reconstructing it into a parse tree (think element hierarchy); Beautiful Soup does not have its own parser and currently supports 3 external parsers: html.parser, lxml, html5lib

13.2.4. 4. Create a Beautiful Soup object Input parameter for the Beautiful Soup constructor is a parse tree (produced by the chosen parser)

13.2.5. 5. Export the HTML to a file (optional) Recommended because different parsers can produce different parse trees for same source HTML document, and it's useful to store the parsed HTML for reference

13.3. Basics of the web scraping workflow in Python

13.3.1. import requests from bs4 import BeautifulSoup

13.3.2. get the HTML using requests.get() e.g. url = "" r = requests.get(url)

13.3.3. Peek at the content to verify that response looks like an HTML document html = r.content html[:100]

13.3.4. Make the "soup" by invoking the BeautifulSoup constructor, passing in the html response as 1st arg and the HTML parser name as 2nd arg e.g. soup = BeautifulSoup(html, "html.parser")

13.3.5. Write the parsed HTML to file, which involves opening a binary file stream in write mode and, for the file write() method, passing in the soup object with its prettify() method invoked, which produces a nicely formatted representation of the HTML for writing to the file - e.g. (noting that "soup" is an instance of BeautifulSoup):
with open("Wiki_response.html","wb") as file:
    file.write(soup.prettify("utf-8"))

13.3.6. Use the BeautifulSoup find() method to find the first instance of a given element, where the tag name is passed as a string argument - e.g. (noting that "soup" is an instance of BeautifulSoup): soup.find('head'); the result of find() is a bs4.element.Tag, an object on which you can also invoke the find_all() method, but if no such element is found the result is None; a common pattern is finding a tbody (table body) tag and then invoking find_all() on it to get all td (table data) tags contained within it

13.3.7. Use the BeautifulSoup find_all() method to find all instances of a given element, where the tag name is passed as string argument e.g. (noting that "soup" is instance of BeautifulSoup) links = soup.find_all('a') the result of find_all() is bs4.element.ResultSet, which is a subclass of list if no elements are found, the result is still bs4.element.ResultSet but akin to an empty list

13.3.8. Every element in the parse tree can have multiple children but only one parent; navigate to children via the contents property of a soup element object - e.g. (noting "table" is an instance of bs4.element.Tag): table.contents; navigate to the parent via the parent property - e.g. table.parent; for navigating up multiple levels, chain with dot notation (e.g. table.parent.parent)

13.4. Searching by Attribute

13.4.1. both find() and find_all() methods support attribute searches in same way

13.4.2. HTML standard attributes can be specified as additional keyword arguments followed by = and the value enclosed in quotes " " - e.g. (noting that "soup" is an instance of BeautifulSoup): soup.find("div", id = "siteSub"); note that user-defined attributes cannot be searched in this manner because the find() and find_all() methods will not recognise them as keyword arguments - this limitation can be overcome by using the attrs argument of find() and find_all(); because class is a Python reserved keyword, it will raise an exception if you try to pass it as an argument to find() or find_all() - the fix is to append an underscore: class_

13.4.3. we can also search based on multiple attribute values - just pass them as 3rd, 4th, etc. arguments in find() or find_all() e.g. (noting that "soup" is instance of BeautifulSoup) soup.find("a",class_ = "mw-jump-link",href = "#p-search")

13.5. Extracting attribute data

13.5.1. we can extract attribute data from a soup tag object (bs4.element.Tag) using two approaches: the 1st approach is to reference the attribute name as a dictionary key on the tag object - e.g. (noting that "a" is an instance of bs4.element.Tag): a["href"] - if the attribute does not exist, this approach raises an exception; the 2nd approach is to invoke the get() method on the tag object - e.g. a.get("href") - which behaves the same as approach 1, but for non-existent attributes it returns None rather than raising an exception

13.5.2. to get a dictionary containing all attributes and assigned values for a soup tag object, just use the attrs property - e.g. (noting that "a" is an instance of bs4.element.Tag): a.attrs
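Both extraction approaches and the attrs property in one short sketch (bs4 required; the tag is invented) - note that the multi-valued class attribute comes back as a list:

```python
from bs4 import BeautifulSoup

a = BeautifulSoup('<a href="/wiki/Python" class="ext link">Python</a>',
                  "html.parser").find("a")

print(a["href"])       # dictionary-style: raises KeyError if attribute missing
print(a.get("href"))   # same value via get()
print(a.get("title"))  # missing attribute: None, no exception
print(a.attrs)         # all attributes as a dict; class is a list of values
```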

13.6. Extracting tag string content data

13.6.1. The text and string properties on a soup tag object both have the same effect on tags with a single string for content, but behave differently when a tag includes nested elements: the text property strips away all nested tags to provide all text content as a single string (however, the text property on the parent soup object will also render embedded JavaScript as text, because it only handles HTML); the string property returns None if the content of the tag does not consist of a single string (unbroken by other tags)

13.6.2. strings is a generator available for the tag object and it enables us to iterate over every string fragment in a for loop, processing string by string e.g. (noting that "p" is an instance of bs4.element.Tag) for s in p.strings: print(repr(s))

13.6.3. stripped_strings is another generator available for the tag object, behaving like strings but eliminating all leading/trailing whitespace, including newline characters e.g. (noting that "p" is an instance of bs4.element.Tag) for s in p.stripped_strings: print(repr(s))
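The difference between text, string and the two generators shows up as soon as a tag has nested elements (bs4 required; the snippet is invented):

```python
from bs4 import BeautifulSoup

p = BeautifulSoup("<p>  Tour <b>dates</b>\n 2024 </p>", "html.parser").find("p")

print(repr(p.text))              # every fragment joined, tags stripped
print(p.string)                  # None: content is broken up by <b>
print(list(p.strings))           # raw fragments, whitespace intact
print(list(p.stripped_strings))  # fragments with whitespace stripped
```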

13.7. Scraping links

13.7.1. you can capture a list of all links from the Beautiful Soup object using the find_all() method and then you can pull out the URL via the href attribute

13.7.2. it is common to encounter relative URLs, which are just folder/file references relative to the page base URL; we can use the urljoin() function from the parse module of the urllib package (urllib.parse.urljoin) to combine the page base URL with the relative URL to form the absolute URL - e.g. (noting that "l" is an instance of bs4.element.Tag associated with an "a" tag, and url is a string holding the base URL of the page being scraped): urljoin(url, l["href"]); to process multiple links, we can use a list comprehension

13.8. Scraping nested elements

13.8.1. sometimes you need to perform nested searches - for example, you might identify sections of a page that are commonly identifiable via a div tag with the role attribute set to "note", and you want to scrape every link from within these particular div tags; method 1: nested for loop with the list append() method; method 2: for loop with the list extend() method
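Both methods side by side on a miniature invented page (bs4 required): append() keeps one sublist per matching div, while extend() produces a single flat list:

```python
from bs4 import BeautifulSoup

html_doc = """
<div role="note"><a href="/a">A</a><a href="/b">B</a></div>
<div role="note"><a href="/c">C</a></div>
<div><a href="/skip">not a note</a></div>"""
soup = BeautifulSoup(html_doc, "html.parser")

grouped, flat = [], []
for div in soup.find_all("div", attrs={"role": "note"}):
    hrefs = [a["href"] for a in div.find_all("a")]
    grouped.append(hrefs)  # method 1: list of lists
    flat.extend(hrefs)     # method 2: one flat list

print(grouped)
print(flat)
```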

13.9. Scraping multiple pages automatically

13.9.1. this builds on scraping links from a single page - we can press on from a scraped list of links (in this example, captured in a variable named url_list); we start by iterating our list of scraped URLs and using our core techniques to scrape all <p> tag text from each page: 1. GET request 2. Capture content of response (html) 3. Create BeautifulSoup object 4. Capture list of <p> content strings 5. Append <p> string list to master list
para_list = []
i = 0
for l in url_list:
    para_resp = requests.get(l)
    i += 1
    if para_resp.status_code == 200:
        print(i, ": good response :", l)
    else:
        print(i, ":", para_resp.status_code, " response (skipped):", l)
        continue
    para_html = para_resp.content
    para_soup = BeautifulSoup(para_html, "lxml")
    paras = [p.text for p in para_soup.find_all("p")]
    para_list.append(paras)

13.10. Using pandas to structure results and capture them to file

13.10.1. Having scraped related data into various list objects in Python, it's really easy to add these lists as columns to a pandas dataframe object - e.g. (noting that titles, years_cleaned, scores_cleaned, critics_consensus, synopsis, directors and cast are all Python variables referencing list objects related to scraped data from the Rotten Tomatoes site):
import pandas as pd
movie_list = pd.DataFrame()  # create empty dataframe
movie_list["Title"] = titles
movie_list["Year"] = years_cleaned
movie_list["Score"] = scores_cleaned
movie_list["Critic's Consensus"] = critics_consensus
movie_list["Synopsis"] = synopsis
movie_list["Director"] = directors
movie_list["Cast"] = cast

13.11. Handling None ('NoneType' AttributeError) when scraping

13.11.1. When scraping a list of elements using a list comprehension, it is quite common for the code to fail on the following error: AttributeError: 'NoneType' object has no attribute <name_of_attribute>

13.11.2. Common example happens when scraping a list of elements (tags) for their string content by invoking the string property - when find() matches nothing for a given tag (returning None), accessing .string on that None raises the exception:
duration_list = [t.find("span",{"class":"accessible-description"}).string for t in related_vids]

13.11.3. We can handle this by implementing conditional logic in the main expression of the list comprehension - syntax: [ <val1> if <boolean_expression> else <val2> for x in y ] - e.g.:
duration_list = [None if t.find("span",{"class":"accessible-description"}) is None else t.find("span",{"class":"accessible-description"}).string for t in related_vids]

14. Browser Developer Tools

14.1. Useful for web scraping because you can inspect the underlying HTML of any element on the page

14.2. Chrome

14.2.1. right-click any part of a web page and choose Inspect; under the Elements pane, right-click an element and choose Copy | Copy Element; paste into Notepad++

14.3. Edge

14.3.1. to access Developer Tools, click ellipses ... in top right corner and choose More Tools | Developer Tools

15. Using Pandas to Scrape HTML Tables

15.1. We can scrape HTML tables using Beautiful Soup, but it has to be done column by column and can be a bit of a tedious process

15.1.1. Pandas provides the read_html() function, which takes an HTML document as its argument and returns a list of all table elements converted into dataframes; note: in the background, the pandas.read_html() function leverages BeautifulSoup, but it provides a much faster way to capture table content - e.g. (noting that html is a variable captured from the content property of a request response object):
import pandas as pd
tables = pd.read_html(html)
tables[1]  # returns 2nd dataframe

16. Common Roadblocks for Web Scraping

16.1. Request headers

16.1.1. sent as part of the request, containing metadata about the request

16.1.2. content of header can vary and may include information such as application type, operating system, software vendor, software version, etc.

16.1.3. header content is combined into a user agent string - think of the user agent string as an ID card for the application making the request; all web browsers have their own unique user agent string, and well-known bots like the Google web crawler also have their own unique user agent strings

16.1.4. many servers send different responses based on the user agent string; when the user agent string is missing or cannot be interpreted, many sites will return a default response - this can lead to differences between the HTML we inspect via browser developer tools and the actual HTML captured via our web scraping response; fix: always write our HTML response to file and use this as our reference for scraping

16.1.5. some sites block all anonymous requests (i.e. requests that do not include a recognised user agent string); fix: use the user agent string of one of the main web browser applications (e.g. Chrome), as these are publicly available; requests supports a headers parameter with the value passed as a dictionary
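A sketch of the headers fix; the user agent string below is an illustrative Chrome-on-Windows value (substitute your own browser's current string), and the request itself is commented out to avoid a live call:

```python
# Illustrative desktop-browser user agent string (check your own browser's value)
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36")
}

# import requests
# r = requests.get("", headers=headers)  # hypothetical URL

print(list(headers))  # requests merges this dict into the outgoing request headers
```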

16.2. Cookies

16.2.1. small piece of data that a server sends to the user's web browser

16.2.2. browser may store it and send it back with later requests to the same server

16.2.3. Typically, it's used to tell if two requests came from the same browser e.g. keeping a user logged-in

16.2.4. Cookies are mainly used for three purposes: Session management Logins, shopping carts, game scores, or anything else the server should remember Personalization User preferences, themes, and other settings Tracking Recording and analyzing user behaviour

16.2.5. Sites that require registration and login in order to access site pages will refuse GET requests with a 403 Forbidden response; fix: create a stateful session that allows the session cookies to be received and used by our Python program, and use an appropriate POST request to obtain the session cookie; the requests module has a Session class, which we can use to create session objects, and subsequent post/get requests are then invoked via the session object; sites that require login often redirect to a login page that includes a form tag - the action attribute of the form tag holds the relative URL of the login page, and the form tag includes a number of input tags; use the Chrome Developer tools Network tab to trace the login request

16.3. Denial of Service

16.3.1. When making multiple requests of a web server, we must be mindful of the risk of the server being overwhelmed by too many requests; many websites have protection against denial of service attacks and will refuse multiple requests from a single client that are made too rapidly; fix: import the time module and use the sleep() function to create a wait duration between successive requests
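The pacing fix sketched with time.sleep(); the delay value is a judgment call (some sites publish a Crawl-delay in robots.txt), and the fetch itself is left as a comment:

```python
import time

urls = ["/page1", "/page2", "/page3"]  # stand-ins for scraped URLs
delay = 0.01                           # use 1-2 seconds against a real site

start = time.monotonic()
for url in urls:
    # requests.get(url) would go here
    time.sleep(delay)                  # pause between successive requests
elapsed = time.monotonic() - start
print(round(elapsed, 3))
```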

16.4. Captchas

16.4.1. Sites protected by captchas are deliberately hard to scrape with bots, so you should avoid attempting to do so

16.5. Dynamically generated content with Javascript

16.5.1. For this problem we will turn to the requests-html package, to be used in place of requests + BeautifulSoup

17. requests-html package

17.1. created by the author of the requests library to combine requests + BeautifulSoup functionality

17.2. Full JavaScript support

17.3. Get page and parse html

17.3.1. from requests_html import HTMLSession

17.3.2. session = HTMLSession()

17.3.3. r = session.get("url_goes_here")

17.3.4. r.html; the HTMLSession.get() method automatically parses the html response and encapsulates it in the html property of the response object; this property becomes the basis for the scraping operations

17.4. Scrape links

17.4.1. relative links urls = r.html.links

17.4.2. absolute links full_path_urls = r.html.absolute_links

17.4.3. both links and absolute_links return a set rather than a list

17.5. Element search

17.5.1. the html.find() method returns a list by default, so it behaves like the find_all() method in BeautifulSoup, e.g. r.html.find("a"); if we use the first parameter, we can make it return a single element rather than a list, e.g. r.html.find("a", first=True)

17.5.2. the individual elements of the list returned by html.find() are typed as requests_html.Element

17.5.3. We can get a dictionary of an element's attributes by referencing the element's attrs property, e.g. element = r.html.find("a")[0] then element.attrs

17.5.4. We can get the html string representation of an element using the requests_html.Element.html property, e.g. element = r.html.find("a")[0] then element.html

17.5.5. We can get an element's string content using the requests_html.Element.text property, e.g. element = r.html.find("a")[0] then element.text

17.5.6. We can filter an element search by using the containing parameter of the html.find() method, e.g. r.html.find("a", containing="wikipedia"); note that the search is made on the text of the element and is not case sensitive

17.6. Text pattern search

17.6.1. the html.search() method returns the first result that matches the search() argument; we can find all text that falls in between two strings by passing the argument as "string1{}string2"; the result will be the html found in between an occurrence of string1 and string2, where the curly braces {} represent the result to be found and returned; e.g., noting that r represents a response object, r.html.search("known{}soccer") returns the match, and r.html.search("known{}soccer")[0] returns the captured text

17.6.2. html.search_all() works the same as search() but returns all results that match e.g. noting that r represents a response object r.html.search_all("known{}soccer")

17.7. CSS Selectors

17.7.1. used to "find" (or select) the HTML elements you want to style

17.7.2. CSS Selector Reference

17.7.3. we need to understand CSS Selectors and the notation used, because this is the notation used when we pass arguments to the html.find() method: the element selector, html.find("element"); the #id selector, html.find("#id"); the .class_name selector, html.find(".class_name"); the general attribute selectors (there are multiple forms; three of the most common are [attribute], [attribute=value], and [attribute*=value]); combining selectors is done by concatenation, remembering that when tags are included in a combined selection they must come first, e.g. r.html.find("a[href*=wikipedia]") returns a list of <a> tag elements with href attributes that include the "wikipedia" substring, and r.html.find("a.internal") returns a list of <a> tag elements with class="internal"; context in the tag hierarchy is specified for an element search using the descendant selector (parent_element child_element) and the child selector (parent_element > child_element)

17.8. Scraping pages with Javascript content

17.8.1. from requests_html import AsyncHTMLSession, then session = AsyncHTMLSession(); one of the differences when scraping pages with JavaScript, for the purpose of rendering the dynamic content into a regular html document, is that we need to use an asynchronous session; this requires us to use the await keyword before requests

17.8.2. site_url = "" then r = await session.get(site_url) and check r.status_code; with an asynchronous session, we prefix the get request with the await keyword

17.8.3. await r.html.arender(); we use the asynchronous version of the render() method, arender(), combined with the await keyword; this uses the Chromium browser to render the page content and converts it into regular html that we can scrape; the first time this is run on a host, it will attempt to automatically download and install Chromium

17.8.4. session.close(); once the session is closed, we can proceed with scraping the html object in the normal way

17.8.5. tips for timeout errors: try the render() method's wait parameter; try the render() method's retries parameter; view the method documentation with print(r.html.render.__doc__)