Data Ingress Workshop

1. How to improve Data In

1.1. Definition of quality (KPIs)

1.1.1. Quantitative

1.1.1.1. Using statistics: every number is compared to its expected value, taken as the average over the previous x months

1.1.1.1.1. e.g. if the 3-month average is 10 reviews/week and the count drops to 2/week, that is 20% of expectation.

1.1.1.2. Number of products

1.1.1.3. Number of reviews

1.1.1.4. Number of data fields

1.1.1.4.1. Date

1.1.1.4.2. Rating

1.1.1.4.3. Pros / Cons

1.1.1.4.4. Price

1.1.1.4.5. etc.

1.1.1.5. Speed of scraping (if needed)
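
The statistical comparison in 1.1.1.1 can be sketched as a ratio of the current count to a rolling average. This is only a minimal illustration: the function name `expectation_ratio` and the sample weekly counts are assumptions, not part of the workshop notes.

```python
from statistics import mean


def expectation_ratio(history, current):
    """Return current / average(history), or None when there is no usable baseline."""
    if not history:
        return None
    baseline = mean(history)
    if baseline == 0:
        return None
    return current / baseline


# Hypothetical weekly review counts covering the baseline period.
weekly_reviews = [10, 10, 10, 10]
ratio = expectation_ratio(weekly_reviews, 2)
print(f"{ratio:.0%} of expectation")  # prints "20% of expectation"
```

The same ratio could be computed for number of products or number of filled data fields, so one function would cover all quantitative KPIs listed above.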

1.1.2. Qualitative

1.1.2.1. Quality of Data

1.1.2.1.1. Product name accuracy (matching %)

1.1.2.1.2. Date format

1.1.2.1.3. Character encoding errors

1.1.2.1.4. etc.

1.1.2.2. Quality of relevance

1.1.2.2.1. Matching % against clients
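
The qualitative checks above (name matching %, date format, character encoding errors) could be approximated with the Python standard library. The sketch below is one possible reading: the function names, the `%Y-%m-%d` target format, and the mojibake heuristic are all assumptions chosen for illustration.

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_match_pct(scraped, reference):
    """Similarity of two product names as a percentage (case-insensitive)."""
    return 100 * SequenceMatcher(None, scraped.lower(), reference.lower()).ratio()


def is_valid_date(value, fmt="%Y-%m-%d"):
    """True if value parses with the assumed target date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def has_encoding_error(text):
    """Heuristic: flag replacement characters and UTF-8 text mis-decoded as Latin-1."""
    if "\ufffd" in text:
        return True
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return False  # not a Latin-1/UTF-8 mix-up
    return repaired != text


print(is_valid_date("2018-02-30"))   # False: February has no 30th
print(has_encoding_error("CafÃ©"))   # True: "Café" decoded with the wrong codec
```

The encoding check only catches one common failure mode; other codec mix-ups would need their own rules.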

2. Data Ingestion Goals

2.1. Get relevant data

2.1.1. Define field importance weights

2.1.1.1. Product Name (100)

2.1.1.2. Brand (60)

2.1.1.3. Rating (50)

2.1.1.4. Date (75)

2.1.1.5. etc.
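
One way the field weights could be used is as a weighted completeness score per scraped item. The weights below are the ones listed above; the dictionary keys, the function name `weighted_completeness`, and the idea of scoring by presence alone are assumptions for the sake of a small example.

```python
# Weights from the outline; the field keys are hypothetical item keys.
FIELD_WEIGHTS = {"product_name": 100, "brand": 60, "rating": 50, "date": 75}


def weighted_completeness(item, weights=FIELD_WEIGHTS):
    """Share of the total weight covered by fields that are present and non-empty."""
    total = sum(weights.values())
    present = sum(
        weight
        for field, weight in weights.items()
        if item.get(field) not in (None, "")
    )
    return present / total


# An item missing its brand keeps 225 of 285 weight points.
item = {"product_name": "Example Phone", "rating": 4.5, "date": "2018-03-01"}
print(round(weighted_completeness(item), 3))  # 0.789
```

A missing product name costs far more than a missing rating, which matches the intent of giving the name the highest weight.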

2.2. Don’t get blocked

2.2.1. Efficient scraping

2.2.1.1. Only visit a site once; serve later requests from cache

2.2.1.2. Use Sitemaps / RSS (e.g. https://www.cnet.com/sitemaps/reviews/2018/)

2.2.1.3. Not too fast; mimic a normal user
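
Since Scrapy is listed as the new tool, the caching and pacing points above might map onto a project `settings.py` roughly as follows. The setting names are standard Scrapy settings; the specific values are placeholder assumptions that would need tuning per source.

```python
# settings.py fragment (values are illustrative assumptions, not recommendations)

# Serve repeat requests from the local HTTP cache instead of revisiting the site.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 24 * 60 * 60  # re-fetch after one day

# Pace requests so traffic looks closer to a normal user.
DOWNLOAD_DELAY = 2.0              # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay around DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True
```

For the sitemap point, Scrapy also ships a `SitemapSpider` class whose `sitemap_urls` attribute could point at a sitemap such as the CNET example above.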

2.2.2. Feed Partnerships with key data sources

2.3. Classify the data for processing?

2.4. Optimize the processes/resources

2.4.1. Efficient scraping

2.4.2. Source prioritization

2.5. Minimize maintenance

2.5.1. Create solutions for the most common issues

2.5.2. Smart patterns for flexible scraping

2.6. Smarter error detection

2.6.1. Via the KPI measurement

2.6.2. Via change detection templates

2.6.3. Alert upon errors
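
A change detection template could be read as a per-source record of how often each field is normally filled, with an alert when a scrape falls well below it. The sketch below follows that interpretation; the function name, the `expected_fill` structure, and the 0.2 tolerance are assumptions, and a real alert would be sent somewhere rather than returned as text.

```python
def detect_field_changes(items, expected_fill, tolerance=0.2):
    """Return alert messages for fields whose fill rate fell below the template."""
    if not items:
        return ["no items scraped"]
    alerts = []
    for field, expected in expected_fill.items():
        filled = sum(1 for item in items if item.get(field) not in (None, ""))
        rate = filled / len(items)
        if rate < expected - tolerance:
            alerts.append(
                f"{field}: fill rate {rate:.0%}, expected about {expected:.0%}"
            )
    return alerts


# Hypothetical template and scrape result for one source.
template = {"rating": 1.0, "price": 0.9}
scraped = [
    {"rating": 5, "price": "9.99"},
    {"rating": 4, "price": ""},
    {"rating": 3},
]
for message in detect_field_changes(scraped, template):
    print(message)  # price: fill rate 33%, expected about 90%
```

A sudden drop in a single field usually points to a layout change on the source site, which is the case this check is meant to surface early.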

3. Technology Changes

3.1. Current

3.1.1. Screenscraper (Old)

3.1.2. Scrapy (New)