Entity Resolution at Scale • Huon Wilson • YOW! 2019

This presentation was recorded at YOW! 2019.

Huon Wilson - Software Engineer at CSIRO’s Data61

ABSTRACT
Real-world data is rarely clean: there are often corrupted records, duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records into the single underlying entity that they represent. This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark, and scaling it to process billions of records. [...]
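To make the abstract's definition of entity resolution concrete, here is a minimal single-machine sketch of the "connecting duplicate records into one entity" step. It is not the approach from the talk, which runs on Apache Spark. It assumes some earlier matching step has already produced pairs of records judged to be duplicates, and it groups them with a union-find structure. The function name `resolve_entities`, the record ids, and the matched pairs are all hypothetical.

```python
def resolve_entities(record_ids, matched_pairs):
    """Group records into entities, given pairs already judged to be duplicates.

    Hypothetical helper: treats each matched pair as an edge and returns the
    connected components, each component standing for one underlying entity.
    """
    # Every record starts out as its own entity.
    parent = {r: r for r in record_ids}

    def find(r):
        # Walk up to the representative, halving the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Merge the groups of any two records that were matched.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect records by their final representative.
    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


# Made-up example: r1 matches r2 and r2 matches r3, so all three records
# end up in one entity even though r1 and r3 were never compared directly.
records = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3")]
entities = resolve_entities(records, pairs)
```

With billions of records, the pairwise matching and the grouping both need a distributed implementation, which is the scaling problem the talk addresses.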