Friday, May 18, 2012

[G] From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas


Google Research Blog: From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Posted by Valentin Spitkovsky and Peter Norvig, Research Team

Yet in each word some concept there must be...
— from Goethe's Faust (Part I, Scene III)

Human language is both rich and ambiguous.
When we hear or read words, we resolve meanings to mental representations,
for example recognizing and linking names to the intended persons, locations or organizations.
Bridging words and meaning —
from turning search queries into relevant results to suggesting targeted keywords for advertisers —
is also Google's core competency, and
important for many other tasks in information retrieval and natural language processing.
We are happy to release a resource,
spanning 7,560,141 concepts and 175,100,788 unique text strings,
that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on
the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases.
We consider each individual Wikipedia article
as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to
concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link)
that point to each Wikipedia page, thus drawing on the vast link structure of the web.
For every English article we harvested the strings associated
with its incoming hyperlinks from the rest of Wikipedia, the greater web,
and also anchors of parallel, non-English Wikipedia pages.
Our dictionaries are cross-lingual, and
any concept deemed too fine can be broadened to a desired level of generality using
groupings of articles into hierarchical categories.

The data set contains triples, each consisting of
(i) text, a short, raw natural language string;
(ii) url, a related concept, represented by an
English Wikipedia article's canonical location;
and (iii) count, an integer indicating the number of times
text has been observed connected with the concept's url.
Our database thus includes weights that measure degrees of association.
For example, the top two entries for football indicate
that it is an ambiguous term, which is almost twice as likely
to refer to what we in the US call soccer:

1. Association football 44,984
2. American football 23,373
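The "almost twice as likely" reading follows directly from the two counts. Below is a minimal sketch, assuming the triples are held in memory as (text, url, count) tuples. The shortened url values and the function name `concept_shares` are illustrative assumptions, not the release's file format (the README describes that).

```python
from collections import defaultdict

# Illustrative (text, url, count) triples using the two counts quoted above.
# The url values are shortened placeholders, not the data set's exact format.
TRIPLES = [
    ("football", "Association_football", 44984),
    ("football", "American_football", 23373),
]

def concept_shares(triples, text):
    """Return {url: share of observations} for one string, over the given triples."""
    counts = defaultdict(int)
    for t, url, count in triples:
        if t == text:
            counts[url] += count
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}

shares = concept_shares(TRIPLES, "football")
ratio = shares["Association_football"] / shares["American_football"]
print(f"{ratio:.2f}")  # ratio of the two counts, 44984 / 23373
```

The shares here are computed over these two entries only; with the full dictionary, the denominator would include every concept observed for the string.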

An inverted index can be
used to perform reverse look-ups, identifying salient terms for each concept.
Some of the highest-scoring strings for both sports,
including synonyms and translations, are listed below:


Association football:
football and Football
Soccer and soccer
Association football
fútbol and Fútbol
Futbol and futbol
sepak bola
bóng đá
لعبة كرة القدم

American football:
American football
football and Football
fútbol americano
football américain
American football rules
futebol americano
فوتبال آمریکایی
football americano
Amerikan futbolu
Le Football Américain
football field
كرة القدم الأمريكية
Futbol amerykański
futbolu amerykańskiego
football team
американского футбола
Amerikai futball
sepak bola Amerika
football player
američki fudbal
كرة القدم الأميركية
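The reverse look-up can be sketched as follows. This is one simple way to build such an inverted index, not necessarily how the released dictionaries were produced. The function name `build_inverted_index` and the url placeholders are assumptions, and only the two "football" counts come from this post; the other counts are invented for illustration.

```python
from collections import defaultdict

# (text, url, count) triples. Only the two "football" counts come from the
# post; the other counts are invented placeholders for this example.
TRIPLES = [
    ("football", "Association_football", 44984),
    ("football", "American_football", 23373),
    ("soccer", "Association_football", 1000),        # invented count
    ("fútbol americano", "American_football", 500),  # invented count
]

def build_inverted_index(triples):
    """Map each concept url to its (text, count) pairs, highest count first."""
    by_url = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        by_url[url][text] += count
    return {
        # Sort by descending count, breaking ties alphabetically.
        url: sorted(texts.items(), key=lambda item: (-item[1], item[0]))
        for url, texts in by_url.items()
    }

index = build_inverted_index(TRIPLES)
for text, count in index["American_football"]:
    print(text, count)
```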

Associated counts can easily be turned into percentages.
The following table illustrates
the concept-to-words dictionary direction,
which may be useful for paraphrasing and topic modeling,
for the idea of soft drink,
restricted to English (and normalized for punctuation, pluralization and capitalization differences):

1. soft drink (and soft-drinks)    28.6%
2. soda (and sodas)    5.5%
3. soda pop    0.9%
4. fizzy drinks    0.6%
5. carbonated beverages (and beverage)    0.3%
6. non-alcoholic    0.2%
7. soft    0.1%
8. pop    0.1%
9. carbonated soft drink (and drinks)    0.1%
10. aerated water    0.1%
11. non-alcoholic drinks (and drink)    0.1%
12. soft drink controversy    0.0%
13. citrus-flavored soda    0.0%
14. carbonated    0.0%
15. soft drink topics    0.0%
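To make the normalization and percentage steps concrete, here is a rough sketch. The `normalize` heuristic (lowercasing, turning punctuation into spaces, stripping a trailing "s" from the last word) is a deliberately crude assumption and not the procedure used for the table above. The counts are invented for the example.

```python
import re
from collections import Counter

def normalize(text):
    """Crude key: lowercase, punctuation to spaces, naive plural stripping."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    last = words[-1] if words else ""
    # Naive plural handling: drop one trailing "s" from longer words.
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)

def string_percentages(pairs):
    """Merge (text, count) pairs for one concept and return percentages."""
    merged = Counter()
    for text, count in pairs:
        merged[normalize(text)] += count
    total = sum(merged.values())
    return {key: 100.0 * value / total for key, value in merged.most_common()}

# Invented counts for a single concept, for illustration only.
PAIRS = [("soft drink", 6), ("Soft-drinks", 2), ("soda", 1), ("sodas", 1)]
result = string_percentages(PAIRS)
print(result)
```

A real pipeline would likely need a proper stemmer or lemmatizer; the trailing-"s" rule mishandles many words.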

The words-to-concepts dictionary direction can
disambiguate senses
and link entities, which are often highly ambiguous,
since people, places and organizations can (nearly) all be named after each other.
The next table shows the top concepts meant by the
string Stanford, which refers to all three (and other) types:

1. Stanford University    50.3%    ORGANIZATION
2. Stanford (disambiguation)    7.7%    a disambiguation page
3. Stanford, California    7.5%    LOCATION
4. Stanford Cardinal football    5.7%    ORGANIZATION
5. Stanford Cardinal    4.1%    multiple athletic programs
6. Stanford Cardinal men's basketball    2.0%    ORGANIZATION
7. Stanford prison experiment    2.0%    a famous psychology experiment
8. Stanford, Kentucky    1.7%    LOCATION
9. Stanford, Norfolk    1.0%    LOCATION
10. Bank of the West Classic    1.0%    a recurring sporting event
11. Stanford, Illinois    0.9%    LOCATION
12. Leland Stanford    0.9%    PERSON
13. Charles Villiers Stanford    0.8%    PERSON
14. Stanford, New York    0.8%    LOCATION
15. Stanford, Bedfordshire    0.8%    LOCATION
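A table like this suggests a simple most-frequent-sense baseline for entity linking: pick the highest-scoring concept for a string. The sketch below uses the top four rows of the table. The function name `rank_concepts` and the rule of dropping titles that end in "(disambiguation)" are assumptions made for illustration, not part of the release.

```python
# Top four rows of the Stanford table: (article title, percentage).
STANFORD = [
    ("Stanford University", 50.3),
    ("Stanford (disambiguation)", 7.7),
    ("Stanford, California", 7.5),
    ("Stanford Cardinal football", 5.7),
]

def rank_concepts(candidates, drop_disambiguation=True):
    """Return candidates sorted by descending score.

    Optionally filters out disambiguation pages, identified here by a
    title-suffix heuristic (an assumption, not a rule from the data set).
    """
    kept = [
        (title, score)
        for title, score in candidates
        if not (drop_disambiguation and title.endswith("(disambiguation)"))
    ]
    return sorted(kept, key=lambda pair: -pair[1])

best_title, best_score = rank_concepts(STANFORD)[0]
print(best_title, best_score)
```

Such a baseline ignores context entirely, so it would always resolve Stanford to the university; the remaining rows show how much probability mass that choice gives up.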

The database that we are providing was designed for recall.
It is large and noisy, incorporating 297,073,139 distinct
string-concept pairs, aggregated over 3,152,091,432 individual
links, many of them referencing non-existent articles.
For technical details, see our paper
(to be presented at LREC 2012)
and the README file accompanying the data.

We hope that this release will fuel numerous creative applications that haven't been previously thought of!

