The Australian National Database of Spoken Language (ANDOSL) was collected to provide speech data for the research community in Australia. It was intended that the data be representative of the major speech varieties in Australia, and that the collection have sufficient coverage to meet the needs of several groups, such as speech scientists, linguists, TESOL and TEFL teachers, engineers, computer scientists and speech pathologists. The data on which we report here is the foundation upon which it is hoped that a very large database will be established. As the data discussed in this paper was to be the foundation material and would probably be the largest single input to the collection, it was important to collect an adequate and representative core of material which could underlie a variety of speech research in Australia. The current material is therefore a collection with the appropriate segmental distribution for the three major dialects of Australian English, currently described as Broad, General and Educated/Cultivated Australian. In addition, it was possible to commence collection of a limited sample of the English of Australians who were born overseas and speak English with an accent.
Abstract. Large databases are useful tools for speech technology research. Their usefulness is greatly enhanced if the data is annotated with time-aligned labels. Manual annotation is expensive and time-consuming, which has led to the investigation and development of automatic aligners. This paper reports on an automatic aligner developed initially to solve the problem of annotating a large database within a set period of time. While developing the aligner, we investigated the importance of the models, the use of manual labels to bootstrap the system, and the role of the dictionary in the effectiveness of the aligner, and found that each contributed. The aligner was tested on unseen data to gauge its accuracy before being applied to the annotation of a large amount of data. It was developed in a way that facilitates its use in other applications.