1. about
1.1. founder @ Plnnr.com
2. agenda
2.1. What is it
2.2. Possible solutions
2.3. ...
3. The problem
3.1. They collect lots of data
3.2. Need to resolve many representations of the same entity
3.3. Plnnr needs to aggregate info on tourist attractions
4. Complications
4.1. Languages
4.2. Duplications
4.3. Missing information in some sources
4.4. Need to allow manual correction
4.5. Process must be repeatable & deterministic
4.5.1. They do harvesting all the time
5. Solution
5.1. DB holds a table of POIs
5.1.1. Point of Interest
5.1.1.1. their entity class
5.2. Each source POI points to its combined version
5.2.1. Representations
5.3. Algorithm
5.3.1. Create graph of entity & its representations
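One way to read the graph step is as connected components: each source representation is a node, each accepted match is an edge, and every component becomes one combined entity. The sketch below is a minimal illustration under that assumption, not the speaker's actual algorithm. The function name `resolve`, the record ids, and the idea that match pairs arrive from some upstream comparison are all hypothetical.

```python
from collections import defaultdict


def resolve(records, matches):
    """Group record ids into combined entities (connected components).

    records: iterable of comparable record ids (hypothetical format "source:key").
    matches: iterable of (id_a, id_b) pairs judged to be the same entity.
    Returns a dict mapping each group's canonical id to its sorted members.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra != rb:
            # The smaller id always becomes the root, so the canonical id
            # does not depend on the order in which matches arrive.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    groups = defaultdict(list)
    for r in sorted(parent):
        groups[find(r)].append(r)
    return dict(groups)


records = ["a:1", "b:7", "c:3", "a:2"]
matches = [("a:1", "b:7"), ("b:7", "c:3")]
entities = resolve(records, matches)
```

Picking the smallest id as the canonical representative keeps the output repeatable across harvesting runs, which is the property section 4.5 asks for.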
6. Entity resolution in general
6.1. Many use cases
6.1.1. When working with TV series
6.1.1.1. many formats & representations, & no identity standard
6.1.2. eLibrary
6.1.2.1. OS project
6.1.3. Delver
6.1.3.1. needed to resolve people
7. Properties of the problem
7.1. Single or multiple entity types
7.2. Are there standard/strong identifiers
7.3. Do entities have relations between them
7.4. Entity resolution across
7.4.1. data sources
7.4.2. time
7.5. Conflicting versions (across sources or time)
7.5.1. Show all or just most common version
8. These property dimensions mean there is no single silver-bullet solution
9. Design goals
9.1. Quality results
9.2. Repeatable & deterministic
9.3. Reasonably fast
9.4. DRY code
10. Possible design decisions
10.1. Relational or Schema-less DB
10.2. Fragments (source entities) in the same table as combined entity
10.3. Combine only source entities, or use past results
11. Relevant design patterns
11.1. Actual data needs to be explicit
11.1.1. priorities between data-sources
11.1.1.1. to resolve conflicts
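Explicit source priorities could look like the sketch below: each field of the combined entity is taken from the highest-priority source that has a value, and the winning source is stored next to the value so the origin of the data stays explicit. The source names, the `SOURCE_PRIORITY` list, and the `merge_fields` helper are hypothetical, and putting manual edits first is an assumption drawn from the need for manual correction.

```python
# Hypothetical priority order, highest first. Manual edits win over harvested data.
SOURCE_PRIORITY = ["manual", "source_a", "source_b"]


def merge_fields(fragments):
    """Merge per-source field values into one record.

    fragments: dict mapping source name -> dict of field -> value (None = missing).
    Returns a dict mapping field -> (value, source it came from).
    """
    rank = {s: i for i, s in enumerate(SOURCE_PRIORITY)}
    # Unknown sources sort last; the name is a tie-breaker to stay deterministic.
    ordered = sorted(fragments, key=lambda s: (rank.get(s, len(rank)), s))
    merged = {}
    for source in ordered:
        for field, value in fragments[source].items():
            if value is not None and field not in merged:
                merged[field] = (value, source)
    return merged


fragments = {
    "source_b": {"name": "Tour Eiffel", "phone": "123"},
    "source_a": {"name": "Eiffel Tower", "phone": None},
    "manual": {"name": "Eiffel Tower, Paris"},
}
combined = merge_fields(fragments)
```

Here the name comes from the manual edit, while the phone number falls through to the only source that has one.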
11.2. Always use slugs, or UUIDs at the least
11.2.1. Auto-incremented IDs as foreign keys are meaningless & problematic when regenerating the DB
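A minimal sketch of identifiers that survive a DB rebuild, assuming each source record has its own stable key: a name-based UUID (`uuid.uuid5`) derived from the source and that key always yields the same value, unlike an auto-incremented id. The namespace string, `slugify`, and `stable_id` are hypothetical names. This simple slug keeps only ASCII letters and digits, so non-Latin names would need transliteration first.

```python
import re
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def slugify(name):
    """Lowercase the name and collapse non-alphanumeric runs into single dashes."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def stable_id(source, source_key):
    """Same (source, key) pair gives the same UUID on every run."""
    return uuid.uuid5(NAMESPACE, f"{source}:{source_key}")


slug = slugify("Eiffel Tower (Paris)")
first = stable_id("source_a", "poi-42")
second = stable_id("source_a", "poi-42")
```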
11.3. Data-sources for everything
11.4. Avoid n^2 algorithms by taking advantage of data properties
11.4.1. hurts when having 10K such operations
11.4.2. e.g., use coordinates to distinguish entities
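The coordinate idea can be sketched as spatial blocking: bucket POIs into a grid and compare only those in the same or neighbouring cells, instead of every pair. This is an illustrative assumption about how coordinates might be used, not the method from the talk. The `candidate_pairs` name and the 0.01 degree cell size are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(pois, cell_deg=0.01):
    """Return sorted id pairs whose POIs fall in the same or adjacent grid cells.

    pois: list of (id, latitude, longitude) tuples.
    cell_deg: grid cell size in degrees (hypothetical default).
    """
    grid = defaultdict(list)
    for pid, lat, lon in pois:
        grid[(int(lat // cell_deg), int(lon // cell_deg))].append(pid)

    pairs = set()
    for (cx, cy), ids in list(grid.items()):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells in the defaultdict.
                for other in grid.get((cx + dx, cy + dy), ()):
                    for pid in ids:
                        if pid < other:
                            pairs.add((pid, other))
    return sorted(pairs)


pois = [
    ("a", 48.8584, 2.2945),
    ("b", 48.8585, 2.2946),
    ("c", 40.6892, -74.0445),
]
pairs = candidate_pairs(pois)
```

Only the surviving pairs go on to the expensive name or attribute comparison, so the cost grows with local density and not with the square of the table size.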
11.5. Don't allow new data to ruin old matches easily
11.5.1. Content editor already corrected data
12. Tools
12.1. Google Translate
12.1.1. to deal with languages
12.2. DBpedia
12.3. Semantinet
12.4. gype
12.4.1. example of a good RESTful API