2008
DOI: 10.1613/jair.2409

Creating Relational Data from Unstructured and Ungrammatical Data Sources

Abstract: In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration difficult because the at…

Cited by 27 publications (26 citation statements)
References 34 publications
“…The idea of wisdom of the crowd can be adopted for assessing precision while having perfect groundtruth to measure recall could still be difficult. , and their F1-score (F_cs) [10,26] are three commonly used metrics. N and M are the size of two instance sets that are matched to one another.…”
Section: Evaluation and Preliminary Results (mentioning)
confidence: 99%
“…ASN [26] relies on human input for identifying a candidate selection key; but sufficient domain expertise may not be available for various domains. Supervised [10] or partially-supervised [4] approaches have been explored to learn the candidate selection key; however, obtaining a sufficiently-sized groundtruth data is impractical for large datasets. Compared to these systems, our proposed candidate selection algorithm is unsupervised and is able to automatically learn the candidate selection key.…”
Section: Related Work (mentioning)
confidence: 99%
“…Unlike Marlin, our system can both effectively reduce candidate set size and achieve good coverage on true matches. Although BSL achieved good results on various domains, its drawbacks are that it requires sufficient training data and is not able to scale to large datasets [13]. Cao et.…”
Section: Related Work (mentioning)
confidence: 99%
“…It has 4 attributes: name, address, type and city. Another dataset is the Hotel dataset [13] that has 5 attributes: name, rating, area, price and date, matching 1,125 online hotel bidding posts from the Bidding For Travel website to another 132 hotel information records from the Bidding For Travel hotel guides with 1,028 coreferent pairs. The last one is dataset4 [9], a synthetic census dataset, with 10K records and 5K duplicates within themselves.…”
Section: Datasets (mentioning)
confidence: 99%
“…Phoebus [18][19] ontoX [20][21] uses OWL to represent extraction ontology. It defines data according to the data types available in OWL 1.0 such as int, float, string.…”
Section: Related Work (mentioning)
confidence: 99%