Bibliographic Record - Detail View

 
Author: Huang, Jian
Title: A Tale of Two Paradigms: Disambiguating Extracted Entities with Applications to a Digital Library and the Web
Source: (2010), (134 pages)
Ph.D. Dissertation, The Pennsylvania State University
Language: English
Document type: print; online; monograph
ISBN: 978-1-1241-6687-2
Keywords: Thesis; Dissertation; Electronic Libraries; Language Processing; Profiles; Social Networks; Internet; Mathematics; Information Retrieval; Natural Language Processing; Library Science; Information Science; Computer Science; Models; Metadata; Evaluation Methods; Information Technology; Computer Uses in Education
Abstract: With the increasing wealth of information on the Web, information integration is ubiquitous as the same real-world entity may appear in a variety of forms extracted from different sources. This dissertation proposes supervised and unsupervised algorithms that are naturally integrated in a scalable framework to solve the entity resolution problem, which lies at the heart of the information integration process. This dissertation focuses on two incarnations of the entity resolution problem that arise in the data mining and natural language processing areas. First, "name disambiguation" occurs when one is seeking a list of publications of an author in a digital library, who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework that disambiguates the extracted author metadata from paper headers in a divide-and-conquer fashion: based on the metadata records extracted from paper headers, a blocking method retrieves candidate classes of authors with similar names and a density-based clustering method, DBSCAN, clusters the records by author. The distance metric between papers used for clustering is calculated by an online active selection Support Vector Machines algorithm, LASVM. We prove that by recasting transitivity as density connectivity in DBSCAN, transitivity is guaranteed for core points. The method achieves high accuracy on a manually labeled dataset and readily disambiguates about a million author metadata records in CiteSeer, which paves the way for the fielded search by author name feature in CiteSeerX. Second, as a key step towards document understanding in natural language processing, we investigate the problem of "cross document coreference" (CDC), which aims to decipher the true reference of a named entity across the boundary of documents.
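The divide-and-conquer pipeline described for name disambiguation (blocking by similar names, then density-based clustering inside each block) can be illustrated with a small sketch. Everything specific here is an assumption for illustration, not the dissertation's implementation: the records and names are invented, the blocking key (last name plus first initial) is a guess at one plausible scheme, and a plain Jaccard distance over coauthor sets stands in for the LASVM-learned distance. The DBSCAN routine is a minimal textbook version written out so the example has no dependencies.

```python
from collections import defaultdict

# Hypothetical records: (record id, author name as printed, coauthor names).
RECORDS = [
    (0, "Jane Doe", {"A. Ito", "B. Khan"}),
    (1, "J. Doe", {"A. Ito", "B. Khan", "C. Lund"}),
    (2, "Jane Doe", {"A. Ito", "C. Lund"}),
    (3, "J. Doe", {"M. Roy", "N. Sato"}),
    (4, "John Doe", {"M. Roy", "N. Sato", "O. Tran"}),
    (5, "J. Doe", {"M. Roy", "O. Tran"}),
    (6, "Wei Li", {"A. Ito"}),
]


def block_key(name):
    """Assumed blocking key: (last name, first initial), lowercased."""
    parts = name.replace(".", "").split()
    return (parts[-1].lower(), parts[0][0].lower())


def distance(a, b):
    """Jaccard distance between coauthor sets (stand-in for a learned metric)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def dbscan(items, dist, eps, min_pts):
    """Minimal DBSCAN; returns one label per item, -1 meaning noise."""
    n = len(items)
    neighbors = [
        [j for j in range(n) if dist(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return labels


def disambiguate(records, eps=0.6, min_pts=2):
    """Block records by name key, then cluster each block independently."""
    blocks = defaultdict(list)
    for rec_id, name, coauthors in records:
        blocks[block_key(name)].append((rec_id, coauthors))
    result = {}
    for key, members in blocks.items():
        labels = dbscan([c for _, c in members], distance, eps, min_pts)
        for (rec_id, _), label in zip(members, labels):
            result[rec_id] = (key, label)
    return result
```

With these invented records and the arbitrary `eps=0.6`, records 0 and 2 lie farther apart than `eps`, yet both end up in the same cluster because each is within `eps` of record 1. That is the sense in which density connectivity plays the role of transitivity for core points. The single "Wei Li" record never meets the other names because it falls into a different block.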
This dissertation presents a novel cross document coreference approach that leverages the profiles of entities which are constructed by information extraction tools and reconciled using a within-document coreference module. We propose to match the profiles by using a learned ensemble distance function comprised of a suite of similarity specialists. We develop a kernelized soft relational clustering algorithm that makes use of the learned distance function to partition the entities into fuzzy sets of identities. Evaluation on a large benchmark collection shows that the proposed methods achieve competitive coreference results. We further discuss the details of the implementation of the CDC and web person search system. This dissertation surveys the literature on author name disambiguation in citations and paper headers, citation matching and cross document coreference. Additionally, we explore the social networks of the disambiguated authors, performing a comprehensive study of the network and community level characteristics and proposing a stochastic model to predict collaborations of individuals. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.] (As Provided).
Notes: ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Indexed by: ERIC (Education Resources Information Center), Washington, DC
Updated: 2017/4/10