Python Web Scraping

1. Jupyter Notebook

1.1. Client-server application that allows notebook documents to be edited and executed via a web browser

1.2. notebook documents are rich text, human readable documents that combine executable code, text, pictures, graphs, etc.

1.3. the application can be installed on a local machine and run without an Internet connection, or installed on a remote server and accessed over an Internet connection

1.4. Notebook documents attach to a code kernel, which enables the embedded code to be executed when notebook is run

1.4.1. Jupyter notebooks attached to Python code kernel are known as IPython notebooks

1.4.1.1. Notebook files in IPython format take the *.ipynb file extension

1.5. Notebook editing tips

1.5.1. Ctrl + Enter executes cell and keeps focus on that cell

1.5.2. Shift + Enter executes cell and creates new cell below, shifting focus to the new cell

1.5.3. The print function can be used, but the value of the last expression in a cell is also displayed automatically, without an explicit print

1.5.3.1. example cell input:

1.5.3.1.1. x = [1,2,3,4,5]
x

1.5.4. Press A to insert new cell above active one, press B to insert new cell below

1.5.5. Press D twice (i.e. D + D) to delete active cell

1.5.6. Press M to convert cell to Markdown

1.5.6.1. Markdown cell formatting

1.5.7. Press Y to convert cell to Code

2. Anaconda

2.1. A distribution that bundles Python language, Jupyter Notebook application and numerous packages for data science

2.2. Installing packages

2.2.1. Use Anaconda Prompt or Anaconda Powershell Prompt

2.2.1.1. Windows provides the older cmd command prompt and the newer PowerShell prompt - you can run the same commands for tasks like package installation in either of them

2.2.2. pip install <package_name>

2.2.2.1. example

2.2.2.1.1. pip install requests-html

3. APIs

3.1. Application Programming Interface

3.2. Not web scraping, but should always be used in preference to web scraping where available

3.3. APIs can be free or paid

3.4. For getting data from websites, we use web APIs, based on HTTP

3.5. API documentation should specify how to use the API and the format of the response

3.5.1. common response format is JSON

4. JSON

4.1. Fundamentally you can think of JSON as Python dictionaries, where the keys are always strings and the values can be any of the following:

4.1.1. string

4.1.2. number

4.1.3. object

4.1.3.1. i.e. a nested dictionary { }

4.1.4. array

4.1.4.1. i.e. a list [ ]

4.1.5. null

4.1.6. Boolean

4.2. Many web APIs give their response payloads for both GET and POST in JSON format
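
4.3. A minimal illustrative sketch (values invented here, not from the notes) of how a JSON payload maps onto Python types via the standard json module:
# Illustrative only: parse a JSON string and inspect the resulting Python types
import json

payload = '{"name": "Kraftwerk", "albums": 10, "active": false, "members": ["Ralf", "Florian"], "label": null}'
data = json.loads(payload)   # dict with str, int, bool, list and None values
print(data["members"][0])    # -> Ralf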

5. Structure data with pandas

5.1. pandas library has data frame object

5.1.1. data frame is a structured data type, a lot like a table

5.2. import pandas as pd

5.2.1. pd alias is a commonly used convention but not required

5.3. passing a collection of shallow dictionaries (e.g. a list of dictionaries) into the DataFrame constructor is a very quick and easy method for creating a new data frame

5.3.1. a shallow dictionary is one that does not consist of any complex values such as nested lists or nested dictionaries

5.3.2. example

5.3.2.1. searchResDF = pd.DataFrame(r.json()["results"])

5.3.2.2. note that r represents a response object returned from a web API GET request; the json() method returns a dictionary built from the JSON-formatted response payload, and "results" is the key that references the list of shallow dictionaries

5.4. data frame objects can be easily exported to csv format

5.4.1. example

5.4.1.1. searchResDF.to_csv("itunes_search_results.csv")

5.5. data frame objects can be easily exported to Excel format

5.5.1. example

5.5.1.1. searchResDF.to_excel("itunes_search_results.xlsx")

6. HTML

6.1. HyperText Markup Language is code that describes structure and content of web pages, and is used by web browsers to render web pages

6.1.1. In addition to HTML, web pages are often supplemented by CSS and JavaScript

6.2. HTML document is composed of nested elements

6.2.1. every document has a Head and a Body element

6.2.1.1. Head contains metadata describing such things as Title, Language and Style (all of which are nested elements in the Head element)

6.2.1.1.1. Head content is not rendered as page content in the browser window - it is primarily used by browsers and search engines to understand what kind of page they are looking at

6.2.1.2. Body contains all the content that will become visible in the browser

6.2.1.2.1. Includes elements such as Link, Image, Paragraph, Table

6.2.2. To be valid HTML every element must be wholly nested inside another element

6.2.3. The syntax for every element is:

6.2.3.1. <tag_name>content</tag_name>

6.2.3.1.1. tag names identify the element and they can be any of the predefined names for HTML, or in some cases they can be custom

6.2.3.1.2. content can be text, other elements, or a combination of both

6.2.3.1.3. note: some elements are content-less and consist solely of attributes - these have the syntax <tag_name attribute="value">, with no closing tag (e.g. <img> and <meta>)

6.2.4. First tag in an HTML document should be: <!DOCTYPE html>

6.2.4.1. Below the opening <!DOCTYPE html> tag comes the root <html>...</html> element and all content is placed between the opening and closing html tags

6.2.5. Tag attributes are specified as name="value" pairs and they are always placed inside of opening tags

6.2.5.1. different elements support different tag attributes

6.2.5.2. tag attributes are separated from the tag name by a space, and multiple attributes (name="value") can be specified by using a space separator

6.2.5.3. The element for a link is <a>..</a>, which stems from an old term "anchor"

6.2.5.3.1. The content of the <a> link element is text that serves as a hot-link label

6.2.5.3.2. to re-direct the browser (on click) to an external URL, we add the href attribute

6.2.5.3.3. to make the browser open the external link in a new browser tab, we use the target attribute with the "_blank" value

6.2.5.3.4. example: <a href="https://www.example.com" target="_blank">Example site</a>

6.3. Most important tag attributes we need to understand for web scraping are Class and ID

6.3.1. HTML class attribute

6.3.1.1. used to group elements that are of the same category so that all of them can be manipulated at once

6.3.1.2. example

6.3.1.2.1. class="menu"

6.3.1.3. elements can have more than one class as they can belong to more than one category

6.3.1.3.1. multiple class assignments are specified inside a single class="value" attribute, where multiple values are separated by a space

6.3.1.4. Web developers often use very descriptive names for class attribute values because this helps improve their search engine rankings

6.3.2. HTML id attribute

6.3.2.1. value must be unique across the HTML document (web page)

6.3.2.2. every element can only have one value for its id attribute

6.4. Popular tags

6.4.1. HTML head tag

6.4.1.1. HTML title tag

6.4.1.1.1. mandatory for all HTML docs

6.4.1.1.2. can only be one such element

6.4.1.1.3. used by search engines like Google to categorise

6.4.1.2. HTML meta tag

6.4.1.2.1. content-less tags featuring attributes only

6.4.1.2.2. charset tag attribute used to specify the character encoding - e.g. UTF-8

6.4.1.2.3. name and content attributes pair together for various metadata, including author, description, keywords, etc.

6.4.1.3. HTML style tag

6.4.1.3.1. used with CSS content to define style of HTML document

6.4.1.3.2. works in combination with class and id attributes in the body elements

6.4.1.4. HTML script tag

6.4.1.4.1. used with Javascript content

6.4.2. HTML body tag

6.4.2.1. HTML div tag

6.4.2.1.1. defines division or section in HTML doc

6.4.2.1.2. just a container for other elements, a way to group elements

6.4.2.1.3. used almost exclusively with class and id attributes

6.4.2.2. HTML span tag

6.4.2.2.1. embedded within content, typically to apply some styling to part of the content whilst keeping the content together

6.4.2.2.2. unlike div tags, span tag content never starts on a new line

6.4.2.3. HTML iframe tag

6.4.2.3.1. used to embed another HTML document

6.4.2.3.2. src attribute specifies link to embedded document, often a URL external to site

6.4.2.4. HTML img tag

6.4.2.4.1. specifies image to display

6.4.2.4.2. src and alt attributes both required

6.5. HTML Lists

6.5.1. ordered lists

6.5.1.1. HTML ol tag

6.5.1.2. numbered by default, but alternatives can be specified

6.5.2. unordered lists

6.5.2.1. HTML ul tag

6.5.2.2. bullet point list

6.5.3. HTML li tag

6.5.3.1. list item, carries content for both ordered and unordered lists

6.6. HTML table tag

6.6.1. defines HTML table

6.6.2. consists of nested table row elements

6.6.2.1. HTML tr tag

6.6.2.1.1. consists of nested table data or table header elements

6.6.3. example

6.6.3.1. <table>
  <tr>
    <th>Month</th>
    <th>Savings</th>
  </tr>
  <tr>
    <td>January</td>
    <td>$100</td>
  </tr>
</table>

6.7. Handling reserved characters or symbols not on our keyboard

6.7.1. Reserved symbols include < and >

6.7.2. 3 methods

6.7.2.1. specify name of symbol with & prefix and ; suffix

6.7.2.1.1. e.g. &lt; for < and &gt; for >

6.7.2.1.2. note: not every symbol can be represented by a name and not every browser will recognise it

6.7.2.2. specify the decimal code of the Unicode codepoint for the symbol, prefixed with &# and suffixed with ;

6.7.2.2.1. e.g. &#60; for <

6.7.2.3. specify the hex code of the Unicode codepoint for the symbol, prefixed with &#x and suffixed with ;

6.7.2.3.1. e.g. &#x3C; for <

6.8. Watch out for the non breaking space!

6.8.1. the non breaking space looks like a regular space on the screen but it has a different Unicode codepoint value of 160 (vs 32 for a regular space)

6.8.1.1. in hex, the nbsp is A0

6.8.2. referred to as the nbsp character, written as &nbsp; in HTML

6.8.3. an nbsp is used in HTML to ensure that two words are kept together and not allowed to break apart over two lines

6.9. XHTML

6.9.1. HTML rules are specified as guidelines, which means that poorly written HTML code that ignores certain rules is allowed

6.9.1.1. web browsers automatically handle things like opening tags with no closing tags, attribute values specified without double quotes, etc.

6.9.1.1.1. XHTML is a strict standard that insists on valid HTML

6.9.1.1.2. there are websites out there written in XHTML but it never took off in a big way, which means most websites are based on HTML

7. Beautiful Soup

7.1. Python package for parsing HTML and XML documents - ideal for web scraping

7.2. Web scraping workflow

7.2.1. 1. Inspect the page

7.2.1.1. use browser developer tool to inspect, and get a feel for the page structure

7.2.1.1.1. be aware that the inspector shows the live DOM after the browser has run any JavaScript, so it can differ from the raw HTML a scraper receives

7.2.2. 2. Obtain HTML

7.2.2.1. requests.get()

7.2.3. 3. Choose Parser

7.2.3.1. Parsing is process of decomposing HTML page and reconstructing into a parse tree (think element hierarchy)

7.2.3.2. Beautiful Soup does not have its own parser, and currently supports 3 external parsers

7.2.3.2.1. html.parser

7.2.3.2.2. lxml

7.2.3.2.3. html5lib

7.2.4. 4. Create a Beautiful Soup object

7.2.4.1. Input parameter for the Beautiful Soup constructor is a parse tree (produced by the chosen parser)

7.2.5. 5. Export the HTML to a file (optional)

7.2.5.1. Recommended because different parsers can produce different parse trees for same source HTML document, and it's useful to store the parsed HTML for reference

7.3. Basics of the web scraping workflow in Python

7.3.1. import requests
from bs4 import BeautifulSoup

7.3.2. get the HTML using requests.get()

7.3.2.1. e.g.

7.3.2.1.1. url = "https://en.wikipedia.org/wiki/Music"
r = requests.get(url)

7.3.3. Peek at the content to verify that response looks like an HTML document

7.3.3.1. html = r.content
html[:100]

7.3.4. Make the "soup" by invoking the BeautifulSoup constructor, passing in the html response as 1st arg and the HTML parser name as 2nd arg

7.3.4.1. e.g.

7.3.4.1.1. soup = BeautifulSoup(html, "html.parser")

7.3.5. Write the parsed HTML to file: open a binary file stream in write mode and pass the result of the soup object's prettify() method to the file's write() method; prettify() produces a nicely formatted representation of the HTML

7.3.5.1. e.g.(noting that "soup" is instance of BeautifulSoup)

7.3.5.1.1. with open("Wiki_response.html","wb") as file:
    file.write(soup.prettify("utf-8"))

7.3.6. Use the BeautifulSoup find() method to find first instance of a given element, where the tag name is passed as string argument

7.3.6.1. e.g. (noting that "soup" is instance of BeautifulSoup)

7.3.6.1.1. soup.find('head')

7.3.6.2. the result of find() is bs4.element.Tag, which is an object you can also invoke the find_all() method on

7.3.6.2.1. but if no such element is found then result is None

7.3.6.2.2. example of finding a tbody (table body) tag and then invoking find_all() to get all td (table data) tags contained within it
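
7.3.6.2.2.1. a minimal sketch of this idea, assuming "soup" is a BeautifulSoup object for a page that contains a table:
# find() returns the first matching tag (or None); find_all() can then be called on it
tbody = soup.find("tbody")          # first table body, or None if not found
if tbody is not None:
    cells = tbody.find_all("td")    # all table data cells nested inside it
    print([td.text for td in cells])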

7.3.7. Use the BeautifulSoup find_all() method to find all instances of a given element, where the tag name is passed as string argument

7.3.7.1. e.g. (noting that "soup" is instance of BeautifulSoup)

7.3.7.1.1. links = soup.find_all('a')

7.3.7.2. the result of find_all() is bs4.element.ResultSet, which is a subclass of list

7.3.7.2.1. if no elements are found, the result is still bs4.element.ResultSet but akin to an empty list

7.3.8. Every element in the parse tree can have multiple children but only one parent

7.3.8.1. navigate to children via the contents property (a list of direct children) of a soup element object

7.3.8.1.1. e.g. (noting "table" is instance of bs4.element.Tag)

7.3.8.2. navigate to parent by invoking the parent property of soup element object

7.3.8.2.1. e.g. (noting "table" is instance of bs4.element.Tag)

7.3.8.2.2. for navigating up multiple levels, use dot notation
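
7.3.8.2.2.1. a minimal sketch of both directions, assuming "table" is a bs4.element.Tag returned earlier by soup.find("table"):
children = table.contents           # list of the table's direct children (tags and strings)
parent = table.parent               # the element that contains the table
grandparent = table.parent.parent   # chain parent with dot notation to climb multiple levels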

7.4. Searching by Attribute

7.4.1. both find() and find_all() methods support attribute searches in same way

7.4.2. HTML standard attributes can be specified as additional arguments followed by equals = and the value enclosed in quotes " "

7.4.2.1. e.g. (noting that "soup" is instance of BeautifulSoup)

7.4.2.1.1. soup.find("div", id = "siteSub")

7.4.2.2. note that user-defined attributes cannot be searched in this manner because the find() and find_all() methods will not recognise them as a keyword argument

7.4.2.2.1. this limitation can be overcome by using the attrs argument for find() and find_all()

7.4.2.3. because class is a Python reserved keyword, it will raise exception if you try to pass it as argument to find() or find_all()

7.4.2.3.1. fix is to append an underscore to class
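
7.4.2.4. a minimal sketch covering both cases, assuming "soup" is a BeautifulSoup object (the attribute names and values here are illustrative only):
soup.find_all("div", class_="menu")                     # class_ avoids the reserved keyword class
soup.find_all("div", attrs={"data-section": "intro"})   # attrs handles non-standard/user-defined attributes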

7.4.3. we can also search based on multiple attribute values - just pass them as 3rd, 4th, etc. arguments in find() or find_all()

7.4.3.1. e.g. (noting that "soup" is instance of BeautifulSoup)

7.4.3.1.1. soup.find("a",class_ = "mw-jump-link",href = "#p-search")

7.5. Extracting attribute data

7.5.1. we can extract attribute data from a soup tag object (bs4.element.Tag) using two approaches

7.5.1.1. 1st approach is to reference the attribute name as a dictionary key on the tag object

7.5.1.1.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

7.5.1.1.2. if attribute does not exist, this approach causes exception

7.5.1.2. 2nd approach is to invoke the get() method on the tag object

7.5.1.2.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

7.5.1.2.2. behaves same as approach 1 but for non existent attributes, it returns None and does not raise exception
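
7.5.1.3. a minimal sketch of both approaches, assuming "a" is a bs4.element.Tag for an <a> element:
a["href"]           # dictionary-style access; raises KeyError if href is missing
a.get("href")       # get() access; returns None if href is missing
a.get("href", "#")  # optional default value instead of None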

7.5.2. to get a dictionary containing all attributes and assigned values for a soup tag object, just use the attrs property

7.5.2.1. e.g. (noting that "a" is an instance of bs4.element.Tag)

7.5.2.1.1. a.attrs

7.6. Extracting tag string content data

7.6.1. The text and string properties on the soup tag object both have the same effect on tags with a single string for content, but behave differently when a tag includes nested elements

7.6.1.1. text property strips away all nested tags to provide all text content as a single string

7.6.1.1.1. however, the text property of the top-level soup object will also include the contents of <script> tags as plain text, because it does not distinguish JavaScript from displayable content

7.6.1.2. string property returns None if content of tag does not consist of a single string (unbroken by other tags)

7.6.2. strings is a generator available for the tag object and it enables us to iterate over every string fragment in a for loop, processing string by string

7.6.2.1. e.g. (noting that "p" is an instance of bs4.element.Tag)

7.6.2.1.1. for s in p.strings:
    print(repr(s))

7.6.3. stripped_strings is another generator available for the tag object, behaving like strings but eliminating all leading/trailing whitespace, including newline characters

7.6.3.1. e.g. (noting that "p" is an instance of bs4.element.Tag)

7.6.3.1.1. for s in p.stripped_strings:
    print(repr(s))

7.7. Scraping links

7.7.1. you can capture a list of all links from the Beautiful Soup object using the find_all() method and then you can pull out the URL via the href attribute

7.7.2. it is common to encounter relative URLs, which are just folder/file references relative to page base URL

7.7.2.1. we can use the urljoin() function from the parse module of the urllib package to combine the page base URL with the relative URL in order to form the absolute URL

7.7.2.1.1. e.g. (noting that "l" is an instance of bs4.element.Tag associated with an "a" tag, and url is a string that holds the base URL for the page being scraped)

7.7.2.1.2. Python urllib.parse.urljoin

7.7.2.1.3. to process multiple links, we can use list comprehension
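
7.7.2.1.3.1. a minimal sketch combining urljoin() with a list comprehension, assuming "soup" is a BeautifulSoup object and "url" holds the page's base URL:
from urllib.parse import urljoin

links = soup.find_all("a")
# build absolute URLs, skipping <a> tags that have no href attribute
absolute_urls = [urljoin(url, l.get("href")) for l in links if l.get("href") is not None]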

7.8. Scraping nested elements

7.8.1. sometimes you need to perform nested searches

7.8.1.1. for example, you might identify sections of a page that are commonly identifiable via a div tag with role attribute set to "note", and you want to scrape every link from within these particular div tags

7.8.1.1.1. method 1: nested for loop with list append() method

7.8.1.1.2. method 2: for loop with list extend() method
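
7.8.1.1.3. a minimal sketch of the two methods, assuming "soup" is a BeautifulSoup object and the div/role="note" structure described above:
# method 1: nested for loop with append()
note_links = []
for div in soup.find_all("div", attrs={"role": "note"}):
    for a in div.find_all("a"):
        note_links.append(a.get("href"))

# method 2: single for loop with extend()
note_links2 = []
for div in soup.find_all("div", attrs={"role": "note"}):
    note_links2.extend([a.get("href") for a in div.find_all("a")])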

7.9. Scraping multiple pages automatically

7.9.1. this builds from scraping links from a single page - we can press on from a scraped list of links (in this example, captured in a variable named url_list)

7.9.1.1. we start by iterating our list of scraped URLs and using our core techniques to scrape all <p> tag text from each page:
1. GET request
2. Capture content of response (html)
3. Create BeautifulSoup object
4. Capture list of <p> content strings
5. Append <p> string list to master list

7.9.1.1.1. para_list = []
i = 0
for l in url_list:
    para_resp = requests.get(l)
    i += 1
    if para_resp.status_code == 200:
        print(i,": good response :",l)
    else:
        print(i,":",para_resp.status_code," response (skipped):",l)
        continue
    para_html = para_resp.content
    para_soup = BeautifulSoup(para_html,"lxml")
    paras = [p.text for p in para_soup.find_all("p")]
    para_list.append(paras)

7.10. Using pandas to structure results and capture them to file

7.10.1. Having scraped related data into various list objects in Python, it's really easy to add these lists as columns to a Pandas dataframe object

7.10.1.1. e.g. (noting that titles, years_cleaned, scores_cleaned, critics_consensus, synopsis, directors and cast are all Python variables referencing list objects related to scraped data from the Rotten Tomatoes site)

7.10.1.1.1. import pandas as pd
movie_list = pd.DataFrame() #Create empty dataframe
movie_list["Title"] = titles
movie_list["Year"] = years_cleaned
movie_list["Score"] = scores_cleaned
movie_list["Critic's Consensus"] = critics_consensus
movie_list["Synopsis"] = synopsis
movie_list["Director"] = directors
movie_list["Cast"] = cast

7.11. Handling None ('NoneType' AttributeError) when scraping

7.11.1. When scraping a list of elements using a list comprehension, it is quite common for the code to fail on the following error:

7.11.1.1. AttributeError: 'NoneType' object has no attribute <name_of_attribute>

7.11.2. Common example happens when scraping a list of elements (tags) for their string content by invoking the string property

7.11.2.1. when the nested find() returns None for a particular tag (no matching element), invoking .string on that None raises the exception

7.11.2.2. duration_list = [t.find("span",{"class":"accessible-description"}).string for t in related_vids]

7.11.3. We can handle this by implementing conditional logic in the main expression of the list comprehension

7.11.3.1. syntax is:

7.11.3.1.1. [ <val1> if <boolean_expression> else <val2> for x in y ]

7.11.3.2. e.g.

7.11.3.2.1. duration_list = [None if t.find("span",{"class":"accessible-description"}) == None else t.find("span",{"class":"accessible-description"}).string for t in related_vids]

8. Using Pandas to Scrape HTML Tables

8.1. We can scrape html tables using Beautiful Soup but it has to be done column by column and can be a bit of a tedious process

8.1.1. Pandas provides the read_html() function, which takes an html document as its argument and returns a list of all table elements converted into dataframes

8.1.1.1. Note: in the background, the pandas.read_html() function leverages BeautifulSoup but it provides a much faster way to capture table content

8.1.1.2. e.g. noting that html is a variable captured from the content property of a request response object

8.1.1.2.1. import pandas as pd
tables = pd.read_html(html)
tables[1] #returns 2nd dataframe

9. requests-html package

9.1. created by the creator of requests library to combine requests + BeautifulSoup functionality

9.2. Full JavaScript support

9.3. Get page and parse html

9.3.1. from requests_html import HTMLSession

9.3.2. session = HTMLSession()

9.3.3. r = session.get("url_goes_here")

9.3.4. r.html

9.3.4.1. The HTMLSession.get() method automatically parses the html response and encapsulates it in the html property of the response object

9.3.4.2. the html property of the response becomes the basis for the scraping operations

9.4. Scrape links

9.4.1. relative links

9.4.1.1. urls = r.html.links

9.4.2. absolute links

9.4.2.1. full_path_urls = r.html.absolute_links

9.4.3. both links and absolute_links return a set rather than a list

9.5. Element search

9.5.1. html.find() method returns a list by default, so it behaves like the find_all() method in BeautifulSoup

9.5.1.1. r.html.find("a")

9.5.1.2. if we use the first parameter, we can make it return a single element, not a list

9.5.1.2.1. r.html.find("a", first=True)

9.5.2. the individual elements of the list returned by html.find() are typed as requests_html.Element

9.5.3. We can get a dictionary of an element's attributes by referencing the html.attrs property

9.5.3.1. element = r.html.find("a")[0]
element.attrs

9.5.4. We can get the html string representation of an element using the requests_html.Element.html property

9.5.4.1. element = r.html.find("a")[0]
element.html

9.5.5. We can get an element's string content using the requests_html.Element.text property

9.5.5.1. element = r.html.find("a")[0]
element.text

9.5.6. We can filter element search by using containing parameter of html.find() method

9.5.6.1. r.html.find("a", containing="wikipedia")

9.5.6.2. note that search is made on text of element and search is not case sensitive

9.6. Text pattern search

9.6.1. html.search() method searches the raw html and returns the first result that matches the search template argument

9.6.1.1. We can find all text that falls in between two strings by passing argument as "string1{}string2"

9.6.1.1.1. result will be all the html found between an occurrence of string1 and string2, where the curly braces {} represent the captured text to be found and returned

9.6.1.2. e.g. noting that r represents a response object

9.6.1.2.1. r.html.search("known{}soccer")

9.6.1.2.2. r.html.search("known {} soccer")[0]

9.6.2. html.search_all() works the same as search() but returns all results that match

9.6.2.1. e.g. noting that r represents a response object

9.6.2.1.1. r.html.search_all("known{}soccer")

9.7. CSS Selectors

9.7.1. used to "find" (or select) the HTML elements you want to style

9.7.2. CSS Selector Reference

9.7.3. we need to understand CSS Selectors and the notation used because this is the notation used when we pass arguments to the html.find() method

9.7.3.1. element selector

9.7.3.1.1. html.find("element")

9.7.3.2. #id selector

9.7.3.2.1. html.find("#id")

9.7.3.3. .class_name selector

9.7.3.3.1. html.find(".class_name")

9.7.3.4. general attribute selectors (there are multiple forms - to the left are 3 of most common)

9.7.3.4.1. [attribute] selector

9.7.3.4.2. [attribute=value] selector

9.7.3.4.3. [attribute*=value] selector

9.7.3.5. combining selectors is done by concatenation with spaces

9.7.3.5.1. remember that when tag names are included in a combined selector, they must come first

9.7.3.5.2. e.g. r.html.find("a[href*=wikipedia]") returns list of <a> tag elements with href attributes that include "wikipedia" substring

9.7.3.5.3. e.g. r.html.find("a.internal") returns list of <a> tag elements with class="internal"

9.7.3.6. specifying context in tag hierarchy for element search

9.7.3.6.1. parent_element child_element selector

9.7.3.6.2. parent_element > child_element selector
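
9.7.3.6.3. a minimal sketch of the two hierarchy selectors, assuming "r" is a requests-html response (the class name "note" is illustrative):
r.html.find("div.note a")    # descendant selector: all <a> elements anywhere inside a div with class "note"
r.html.find("div.note > a")  # child selector: only <a> elements that are direct children of such a div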

9.8. Scraping pages with Javascript content

9.8.1. from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()

9.8.1.1. One of the differences when scraping pages with JavaScript (in order to render the dynamic content into a regular html document) is that we need to use an asynchronous session, which requires us to use the await keyword on requests and on rendering calls

9.8.2. site_url = "https://angular.io/"
r = await session.get(site_url)
r.status_code

9.8.2.1. with an asynchronous session, we prefix the get request with the await keyword

9.8.3. await r.html.arender()

9.8.3.1. we use the asynchronous version of the render() method, arender(), combined with the await keyword

9.8.3.2. this uses the Chromium browser to render the page content and converts this into regular html that we can scrape

9.8.3.3. the first time this is run on a host, it will attempt to automatically download and install Chromium

9.8.4. session.close()

9.8.4.1. once the session is closed, we can proceed with scraping the html object in the normal way

9.8.5. tips for timeout errors

9.8.5.1. try render() method wait parameter

9.8.5.2. try render() method retries parameter

9.8.5.3. print(r.html.render.__doc__)
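
9.8.5.4. a minimal sketch combining the wait and retries parameters (the values here are illustrative, not from the notes):
await r.html.arender(wait=2, retries=5)  # wait 2 seconds before rendering and retry loading up to 5 times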

10. What is it?

10.1. Web Crawling involves programs that automatically scan web page content, normally for the purpose of classification

10.1.1. such programs are often referred to as spiders

10.1.2. spiders are used by search engines like Google

10.1.3. web searches are based on classification data produced by spiders

10.1.4. spiders are automated programs, often categorised as bots, which automatically run and follow links from page to page across millions of web sites

10.2. Web Scraping can be used to describe part of the activity of Web Crawling, but often Web Scraping can describe a program that targets a specific website

10.2.1. Web Scraping involves an automated program known as a scraper

10.2.2. Purpose of a scraper is to extract relevant data from targeted web page(s) and aggregate the data into a usable structured format

11. Web Scraping Ethics

11.1. Legally ambiguous

11.1.1. Intellectual Property (IP) Rights

11.1.1.1. includes Copyright, which is an automatically granted right assigned to creator of data

11.1.1.2. generally applies to how the data is used rather than to acquiring it, so the scraping itself is usually legal

11.2. Must consult the robots.txt document that should be stored in the root directory of the website

11.2.1. robots.txt specifies what is and what is not allowed by the site in terms of bots

11.2.2. you should respect the information in the robots.txt file

11.3. You should generally respect the website you are scraping and avoid putting excessive loads on the server via high volumes of automated requests that might overload the server and degrade its performance

12. HTTP

12.1. HyperText Transfer Protocol

12.2. Websites consist of a collection of HTML code files, image files, video files and various other files, such as style sheets, etc.

12.3. Client makes request to download file and server responds with requested file

12.3.1. Web browser (client) interprets downloaded files, displaying website content within the browser window

12.4. HTTP Request Methods

12.4.1. GET

12.4.1.1. Most popular request method

12.4.1.2. fetches data from server

12.4.1.3. can be bookmarked

12.4.1.4. parameters are added to URL, but in plain text

12.4.1.4.1. not to be used for exchanging sensitive information

12.4.2. POST

12.4.2.1. 2nd most popular HTTP method invoked

12.4.2.2. Alters state of some object held server side

12.4.2.2.1. example is a shopping basket

12.4.2.3. Used to send confidential information

12.4.2.3.1. Login credentials always passed via POST request

12.4.2.4. Parameters added in separate body

12.5. HTTP Response Codes

12.5.1. 200

12.5.1.1. Request was processed successfully

12.5.2. 404

12.5.2.1. Error: page not found

12.6. HTTP Response format

12.6.1. HTML for web pages

12.6.2. For web APIs, most common response format is JSON

13. Web API with JSON Response

13.1. import requests

13.1.1. requests module gives us the ability to send HTTP GET requests to a web server and capture the response

13.2. base_url = "<url>"

13.2.1. capture the URL for the GET request in a variable

13.3. response = requests.get(base_url)

13.3.1. invoke get function from imported requests module, passing in the URL via the variable

13.3.2. requests.get()

13.3.2.1. returns

13.3.2.1.1. requests.Response object

13.4. print(response.ok) print(response.status_code)

13.4.1. check the response status (looking for True and 200)

13.5. response.json()

13.5.1. returns response payload as Python dictionary

13.5.1.1. note: print response.text first to verify that format is JSON

13.5.2. note: when status_code is 400 (Bad Request) you can normally expect to see details of the error returned in the JSON response

13.5.3. note: response is a variable that references an object instance of class Response, which is returned by requests.get()

13.6. import json results = json.dumps(response.json(),indent=4) print(results)

13.6.1. use dumps() function from json module to print a pretty string representation of dictionary

13.6.2. json.dumps()

13.7. Adding parameters

13.7.1. after base URL, add "?" followed by <param>=<val>

13.7.2. multiple <val> may be comma separated

13.7.3. to specify more than one parameter, separate <param>=<val> sequences by &

13.7.4. example

13.7.4.1. https://api.exchangeratesapi.io/latest?base=GBP&symbols=USD,EUR,CAD

13.7.5. requests.get() with params

13.7.5.1. better way to pass parameters in a GET request because it automatically URL-encodes symbols/whitespace that are not legal in a URL

13.7.5.2. params takes a dictionary for its value, where every key is valid API parameter name and value is the paired parameter value

13.7.5.3. example:

13.7.5.3.1. import requests
base_url = "https://itunes.apple.com/search"
r = requests.get(base_url, params={"term":"kraftwerk", "country":"GB"})

13.8. Pagination

13.8.1. Search APIs sometimes deliver results in pages - e.g. Google search sends results one page at a time to the client browser

13.8.2. API documentation should describe if pagination is used and how to retrieve particular pages

13.8.3. example

13.8.3.1. import requests
get_url = "https://jobs.github.com/positions.json"
r = requests.get(get_url,params = {"page":1,"description":"sql"})

13.8.3.1.1. this API has a page parameter to specify which page you want

13.8.3.1.2. to loop through multiple pages, we can use a for loop and append results to a list
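
13.8.3.1.2.1. a minimal sketch of such a loop, assuming this API's page parameter works as described above and returns an empty list when there are no more results:
import requests

all_results = []
for page in range(1, 6):  # fetch pages 1 to 5
    r = requests.get("https://jobs.github.com/positions.json",
                     params={"page": page, "description": "sql"})
    if r.status_code != 200:
        break
    results = r.json()
    if not results:        # an empty page signals no more results
        break
    all_results.extend(results)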

13.9. APIs with authentication

13.9.1. Requires registration and sometimes a paid subscription

13.9.2. Once registered, you get an application ID and Key, which are the equivalent of a username and password but dedicated to API use

13.9.3. HTTP POST method used instead of GET

13.9.3.1. With POST requests, the data is sent in the message body and not as part of the request URL itself (as with GET)

13.9.3.1.1. more secure

13.9.3.1.2. amount of data that can be sent as part of the request is effectively unlimited, unlike GET, which is limited by URL length restrictions

13.9.4. requests.post()

13.9.4.1. url is first and required parameter

13.9.4.1.1. like requests.get()

13.9.4.2. params is optional parameter

13.9.4.2.1. this is where you will typically put the application ID and key that you get as a registered user of the API

13.9.4.2.2. the value will be a dictionary and the keys/values driven by the API documentation

13.9.4.3. headers is optional parameter

13.9.4.3.1. dictionary argument

13.9.4.4. json is optional parameter

13.9.4.4.1. takes a Python dictionary, which requests serialises to JSON in the request body

13.9.5. example (see my "Python Web API with POST (Authentication).ipynb" notebook for more info)

13.9.5.1. import requests
import json
app_id = "8548bf9b"
app_key = "df38bd7b9b3a6283ea6b1f5dca7ed85f"
api_endpoint = "https://api.edamam.com/api/nutrition-details"
header = {"Content-Type": "application/json"}
recipe = {"title":"Cappuccino","ingr":["18g ground espresso","150ml milk"]}
r = requests.post(api_endpoint, params = {"app_id":app_id,"app_key":app_key}, headers = header, json = recipe)

14. HTTP File Downloads

14.1. Any response can be captured and then written to a file, but it can be inefficient to download the content of some request wholly into RAM and then open up a file stream to write that content to file

14.2. Leveraging the stream parameter of the requests.get() function is the key to implementing a smarter, more efficient process for HTTP downloads; combining it with the Python with statement also makes for more elegant code

14.2.1. Python with statement

14.2.1.1. Used commonly with file streams because it guarantees to close the file in the event of some exception arising, and removes the need to explicitly call the .close() method on the file stream

14.2.1.2. Also used for other processes with dependencies on things external to the Python environment, including locks, sockets, telnets, etc. (with the recurring theme being automated assurance of closing down the connection when the work is done, including if the work is interrupted via exception)

14.2.1.3. syntax:

14.2.1.3.1. with <open connection to create object> as <connection object>:
    <statement(s) involving connection object>

14.2.1.4. example:

14.2.1.4.1. with open('file_path', 'w') as file:
    file.write('hello world !')

14.2.2. when we create a GET request using requests.get() and specify stream = True, this allows us to iterate the response (for the purpose of writing) in chunks, the size of which we can specify in bytes

14.2.2.1. in order to iterate the response content we need to invoke the .iter_content() method

14.2.2.1.1. we pass the chunk_size keyword argument to iter_content()

14.2.2.1.2. example
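
14.2.2.1.2.1. a minimal sketch of a streamed download (the URL is illustrative and the 1 MB chunk size is an arbitrary choice):
import requests

file_url = "https://www.example.com/big_file.zip"  # illustrative URL
with requests.get(file_url, stream=True) as r:
    with open("big_file.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):  # write 1 MB at a time
            f.write(chunk)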

15. CSS

15.1. Cascading Style Sheets

15.2. language used to describe presentation and style of HTML documents

15.3. 3 ways that style can be applied to HTML element

15.3.1. inline

15.3.1.1. style attribute

15.3.1.1.1. example

15.3.1.1.2. note the syntax for style attribute values: property:value pairs separated by semicolons, e.g. style="color:blue; font-size:14px"

15.3.2. internal

15.3.2.1. style element embedded inside the head element

15.3.2.1.1. style element content is composed of CSS selectors followed by CSS properties and values wrapped in curly braces { }

15.3.2.1.2. example

15.3.2.1.3. note that you can specify multiple CSS selectors (e.g. table, th, td) with a common set of CSS properties, separating each by comma

15.3.3. external

15.3.3.1. separate file that uses same syntax as the internal style element

15.3.3.2. browser downloads the CSS file and applies styles to every page in the site based on that file

15.3.3.2.1. this approach allows the entire look and feel of a website to be changed by altering this single CSS file

15.3.3.2.2. it's also faster for browsers to apply styling this way

15.4. CSS Ref for Properties

16. Browser Developer Tools

16.1. Useful for web scraping because you can inspect the underlying HTML of any element on the page

16.2. Chrome

16.2.1. right-click any part of web page and choose Inspect

16.2.1.1. under Elements pane, right-click and choose Copy | Copy Element

16.2.1.1.1. paste into Notepad++

16.3. Edge

16.3.1. to access Developer Tools, click ellipses ... in top right corner and choose More Tools | Developer Tools

17. Common Roadblocks for Web Scraping

17.1. Request headers

17.1.1. sent as part of the request and contain metadata about the request

17.1.2. content of header can vary and may include information such as application type, operating system, software vendor, software version, etc.

17.1.3. much of this header content is combined into the user agent string, sent in the User-Agent request header

17.1.3.1. think of user agent string as an ID card for the application making the request

17.1.3.2. all web browsers have their own unique user agent string

17.1.3.3. well known bots like the Google web crawler also have their own unique user agent strings

17.1.4. many servers send different responses based on user agent string

17.1.4.1. when user agent string is missing or cannot be interpreted many sites will return a default response

17.1.4.2. this can lead to differences between html we can inspect via browser developer tools and actual html captured via our web scraping response

17.1.4.2.1. fix is to always write our html response to file and use this as our reference for scraping

17.1.5. some sites block all anonymous requests (i.e. requests that do not include a recognised user agent string)

17.1.5.1. fix is to use user agent string of one of the main web browser applications, as these are publicly available

17.1.5.1.1. Chrome user agent

17.1.5.1.2. requests supports headers parameter with value passed as dictionary

17.1.5.1.3. e.g.
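
17.1.5.1.3.1. a minimal sketch of passing a browser user agent string, assuming we have copied one from a browser (the value is shortened and illustrative here):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}  # illustrative, shortened
r = requests.get("https://www.example.com", headers=headers)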

17.2. Cookies

17.2.1. small piece of data that a server sends to the user's web browser

17.2.2. browser may store it and send it back with later requests to the same server

17.2.3. Typically, it's used to tell if two requests came from the same browser

17.2.3.1. e.g. keeping a user logged-in

17.2.4. Cookies are mainly used for three purposes:

17.2.4.1. Session management

17.2.4.1.1. Logins, shopping carts, game scores, or anything else the server should remember

17.2.4.2. Personalization

17.2.4.2.1. User preferences, themes, and other settings

17.2.4.3. Tracking

17.2.4.3.1. Recording and analyzing user behaviour

17.2.5. Sites that require registration and login in order to access site pages will refuse get requests with a 403 Forbidden response

17.2.5.1. fix is to create a stateful session that allows the session cookies to be received and used by our Python program, and to use an appropriate post request to get the session cookie

17.2.5.1.1. requests module has a Session class, which we can use to create a session object; subsequent post/get requests are then invoked via the session object (see the sketch at the end of this subsection)

17.2.5.2. Sites that require login often redirect to a login page that includes a form tag

17.2.5.2.1. action attribute of the form tag holds the (often relative) URL that the login form submits to

17.2.5.2.2. form tag includes a number of input tags

17.2.5.2.3. use Chrome Developer tools to trace login request via the Network tab
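
17.2.5.3. a minimal sketch of a stateful login with requests.Session, assuming a form that posts to /login with "username" and "password" input fields (the URL and field names are illustrative only):
import requests

payload = {"username": "my_user", "password": "my_password"}  # illustrative credentials
with requests.Session() as session:
    session.post("https://www.example.com/login", data=payload)   # session stores the cookies it receives
    r = session.get("https://www.example.com/members-only-page")  # cookies are sent automatically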

17.3. Denial of Service

17.3.1. When making multiple requests of a web server, we must be mindful of the risk of the server being overwhelmed by too many requests

17.3.1.1. many websites have protection against denial of service attacks and will refuse multiple requests from a single client that are made too rapidly

17.3.1.1.1. fix is to import the time module and use the sleep() function to create wait duration in between multiple requests
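
17.3.1.1.1.1. a minimal sketch of pausing between requests, assuming "url_list" holds the pages to fetch (the 2 second wait is an arbitrary choice):
import time
import requests

responses = []
for url in url_list:
    responses.append(requests.get(url))
    time.sleep(2)  # wait 2 seconds between requests to avoid overloading the server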

17.4. Captchas

17.4.1. Sites protected by captchas are deliberately hard for bots to scrape, so you should avoid attempting to do so

17.5. Dynamically generated content with Javascript

17.5.1. For this problem we will turn to the requests-html package, to be used in place of requests + BeautifulSoup