Entity Resolution in Plnnr, Imri Goldberg


1. about

1.1. founder @ Plnnr.com

2. agenda

2.1. What is it

2.2. Possible solutions

2.3. ...

3. The problem

3.1. They collect lots of data

3.2. Need to resolve many representations of the same entity

3.3. Plnnr needs to aggregate info on tourist attractions

4. Complications

4.1. Languages

4.2. Duplications

4.3. Missing information in some sources

4.4. Need to allow manual correction

4.5. Process must be repeatable & deterministic

4.5.1. They do harvesting all the time

5. Solution

5.1. DB holds a table of POIs

5.1.1. Point of Interest

5.1.1.1. their entity class

5.2. Each source POI points to its combined version

5.2.1. Representations

5.3. Algorithm

5.3.1. Create a graph of each entity & its representations
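
One way to read the graph step above, as a minimal sketch only: treat each source POI as a node, add an edge for every pair judged to be the same place, and take each connected component as one combined entity. The outline does not say how Plnnr builds or traverses its graph, so the union-find approach, the function name `connected_components`, and the `source:key` identifiers below are all assumptions made for illustration.

```python
def connected_components(nodes, edges):
    """Group nodes into components with union-find (illustrative, not Plnnr's code)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Path halving keeps repeated lookups cheap.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller root so the result does not depend on
            # the order in which edges arrive (repeatable, deterministic).
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    groups = {}
    for n in sorted(nodes):
        groups.setdefault(find(n), []).append(n)
    return groups


# Hypothetical representations of two attractions from three sources.
nodes = ["a:eiffel-tower", "b:tour-eiffel", "c:eiffel", "a:louvre"]
edges = [("c:eiffel", "b:tour-eiffel"), ("b:tour-eiffel", "a:eiffel-tower")]
groups = connected_components(nodes, edges)
```

Because the smallest identifier always ends up as the component's root, re-running the harvest with the same matches in a different order yields the same grouping, which fits the "repeatable & deterministic" requirement in 4.5.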

6. Entity resolution in general

6.1. Many use cases

6.1.1. When working with TV series

6.1.1.1. many formats & representations, & no identity standard

6.1.2. eLibrary

6.1.2.1. open-source project

6.1.3. Delver

6.1.3.1. needed to resolve people

7. Properties of the problem

7.1. Single or multiple entity types

7.2. Are there standard/strong identifiers

7.3. Do entities have relations between them

7.4. Entity resolution across

7.4.1. data sources

7.4.2. time

7.5. Conflicting versions (across sources or time)

7.5.1. Show all or just most common version

8. These properties mean there is no single silver-bullet solution

9. Design goals

9.1. Quality results

9.2. Repeatable & deterministic

9.3. Reasonably fast

9.4. DRY code

10. Possible design decisions

10.1. Relational or Schema-less DB

10.2. Fragments (source entities) in the same table as combined entity

10.3. Combine only source entities, or use past results

11. Relevant design patterns

11.1. Actual data needs to be explicit

11.1.1. priorities between data-sources

11.1.1.1. to resolve conflicts
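
A minimal sketch of explicit source priorities, assuming each fragment is a dict with a `source` name and a `data` mapping. The source names, the `combine` function, and the field names are hypothetical; the outline only states that priorities between data sources are used to resolve conflicts.

```python
# Hypothetical source names, highest priority first.
SOURCE_PRIORITY = ["editor", "source_a", "source_b"]


def combine(fragments, priority=SOURCE_PRIORITY):
    """Merge per-source fragments into one record, keeping each field's origin."""
    rank = {name: i for i, name in enumerate(priority)}
    combined = {}
    # Visit fragments from highest to lowest priority; first value wins.
    for frag in sorted(fragments, key=lambda f: rank[f["source"]]):
        for field, value in frag["data"].items():
            if value is None or field in combined:
                continue
            combined[field] = {"value": value, "source": frag["source"]}
    return combined


fragments = [
    {"source": "source_b", "data": {"name": "Tour Eiffel", "hours": "9-18"}},
    {"source": "source_a", "data": {"name": "Eiffel Tower", "hours": None}},
]
record = combine(fragments)
```

Storing the winning source next to each value keeps the data explicit: a lower-priority source can still fill a field the higher-priority one is missing, and it stays visible where every value came from.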

11.2. Always use slugs, or UUIDs at the least

11.2.1. Auto-incremented IDs as foreign keys are meaningless & problematic when regenerating the DB
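
A short sketch of identifiers that survive regenerating the DB, using only the Python standard library. The namespace string `poi.example.com` and the helper names `slugify` and `stable_id` are assumptions for illustration, not something the talk specifies.

```python
import re
import unicodedata
import uuid

# Hypothetical fixed namespace; any constant UUID would do.
POI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "poi.example.com")


def slugify(text):
    """Lowercase ASCII slug; accents are stripped, other characters become '-'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def stable_id(source, source_key):
    """Name-based UUID: the same source and key always give the same ID."""
    return uuid.uuid5(POI_NAMESPACE, f"{source}:{source_key}")


slug = slugify("Musée du Louvre")
poi_id = stable_id("source_a", "12345")
```

Unlike an auto-incremented key, both values can be recomputed from the source data alone, so foreign keys still line up after the tables are rebuilt. Names written entirely in non-Latin scripts would produce an empty slug with this simple approach, which is one reason to fall back on the UUID.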

11.3. Data-sources for everything

11.4. Avoid n^2 algorithms by taking advantage of data properties

11.4.1. hurts when there are 10K such operations

11.4.2. e.g., use coordinates to distinguish entities
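
As a hedged illustration of using coordinates to avoid comparing every pair, the sketch below buckets POIs into a latitude/longitude grid and only proposes pairs from the same or neighbouring cells. The cell size, the function name `candidate_pairs`, and the sample identifiers are assumptions; the coordinates are approximate and only there to make the example concrete.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # assumed cell size, roughly 1 km of latitude


def candidate_pairs(pois, cell=CELL_DEG):
    """Return sorted pairs of POI ids in the same or adjacent grid cells."""
    grid = defaultdict(list)
    for poi_id, lat, lon in pois:
        grid[(math.floor(lat / cell), math.floor(lon / cell))].append(poi_id)

    pairs = set()
    for (cx, cy), ids in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty cells while iterating.
                for other in grid.get((cx + dx, cy + dy), ()):
                    for poi_id in ids:
                        if poi_id < other:
                            pairs.add((poi_id, other))
    return sorted(pairs)


pois = [
    ("a:eiffel", 48.8584, 2.2945),
    ("b:eiffel", 48.8583, 2.2944),
    ("a:louvre", 48.8606, 2.3376),
]
pairs = candidate_pairs(pois)
```

Only the candidate pairs then go through the expensive comparison (names, translations, and so on), so the work grows with local density rather than with the square of the table size.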

11.5. Don't allow new data to ruin old matches easily

11.5.1. Content editor already corrected data
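
One possible way to protect a content editor's corrections, sketched under the assumption that automatic matches and editor decisions are both stored as mappings from a source POI id to a combined entity id. The function `resolve_links` and the sample ids are hypothetical.

```python
def resolve_links(automatic, pinned):
    """Merge automatic matches with editor-pinned ones; pinned links always win.

    Returns the final links plus the automatic proposals that were blocked,
    so they can be reviewed instead of silently applied.
    """
    links = dict(automatic)
    blocked = {
        key: value
        for key, value in automatic.items()
        if key in pinned and pinned[key] != value
    }
    links.update(pinned)
    return links, blocked


automatic = {"a:1": "eiffel-tower", "b:7": "louvre"}
pinned = {"b:7": "eiffel-tower"}  # an editor's earlier correction
links, blocked = resolve_links(automatic, pinned)
```

Keeping the pinned decisions in their own store means a fresh harvest can rerun the whole matching step without overwriting them.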

12. Tools

12.1. Google Translate

12.1.1. to deal with languages

12.2. DBpedia

12.3. Semantinet

12.4. gype

12.4.1. example of a good RESTful API

12.5. Google Refine