Zipf's law states that the frequency of a word in a large corpus of natural language is inversely proportional to its frequency rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words. However, when single words and n-gram phrases are combined in one list and sorted by frequency, the combined list follows Zipf's law accurately for all words and phrases, down to the lowest frequencies, in both languages. The Zipf curves for the two languages are then almost identical.
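As a reading aid, the law stated above can be written in its usual textbook form. The symbols are notation assumed here, not taken from the abstract: f(r) is the frequency of the item at rank r, C is a corpus-dependent constant, and α is an exponent close to 1.

```latex
f(r) = \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1
\quad\Longrightarrow\quad
\log f(r) = \log C - \alpha \log r
```

On a log-log plot of frequency against rank this is a straight line with slope −α, so a slope close to −1 corresponds to frequency being inversely proportional to rank.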
Experiments show that for a large corpus, Zipf's law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf's law for ranks greater than about 5,000 word types in English and about 30,000 word types in the inflected languages Irish and Latin. The law also does not hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf's law with a slope close to −1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf's law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
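The combined-list procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenization, the choice of n-grams up to length 3, and the ordinary least-squares fit over all ranks are assumptions made here.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_rank_frequency(tokens, max_n=3):
    """Count single words and n-grams (n = 1..max_n) in ONE shared list,
    then return the counts sorted from most to least frequent."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1. Needs at least two frequencies, all positive.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a meaningful slope needs a large corpus.
    text = "the cat sat on the mat and the dog sat on the log"
    freqs = combined_rank_frequency(text.split(), max_n=3)
    print(len(freqs), "distinct words and n-grams; slope =", loglog_slope(freqs))
```

On a real corpus, a fitted slope near −1 over the full rank range would be consistent with the extended law reported above. The toy sentence is far too short to show this.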
Web search engines often federate user queries to relevant structured databases. For example, a recruitment-related query might be federated to a database of job seekers and employers containing their resumes and skills. The relevant structured data items are then returned to the user along with the web search results. Because each structured database is searched in isolation, the search often produces empty or incomplete results, as the database may not contain the information required to answer the query. In our Applicant Tracking System (ATS), we have 16 development databases holding over 650,000 profile documents (resumes, cover letters and skills), with an average of 238 keywords per document. There can be up to 200,000 transactions per minute across these databases. Our existing database search technique (PostgreSQL full-text keyword search) can freeze or take very long to respond: unless we cut the search off and search only the top profiles, results are not ready within thirty seconds. This cut-off, however, returned incorrect results. For example, for the query "Jet fuel Thermal Oxidation", which requests job seekers whose resumes contain skills in the oil and gas industry, there was a conflict in the relevance ranking of the top ten results. To find a full-text keyword search technique better suited than the existing database search, we considered semantic search models. Our semantic search technique achieves 88% to 91.22% accuracy with much faster queries: a search on four skill keywords completes in 1 to 28 seconds. Furthermore, the semantic search engine is robust enough that users can search by entering a whole paragraph of text. We designed a combination of a semantic search that looks for web pages on each search and a database search.