The Never Ending Learning of Sound (NELS) is an effort to make machines develop the capability to hear. Most machines today make decisions based on visual images, but they seldom use sound as a sensory input for taking actions. Imagine a machine that could understand different sounds the way you do: one that could sense a knock on your door or someone breaking into your house, and act by detecting different sounds in the environment. This would enable more efficient decision making and make machines more intelligent. But how does one make a machine understand the millions of sounds that exist in the universe? The primary motivation behind this long-term project is to make the computer continuously learn all the sounds that exist in the world. Nature has its own grammar, and sounds follow rules that humans understand intuitively, but this information is not directly usable by machines. The NELS team is developing an artificial intelligence system that continually crawls the web for sound samples and automatically learns their meanings, associations and semantics with minimal human intervention.

Audio event detection remains a challenge in the era of big data because the annotated data needed to train robust models does not match the scale of class diversity. Manually annotating sound events in isolation, or events within segments of audio recordings, is a time-consuming and expensive process. The NELS framework consists of three main components: audio detectors, a web crawler and a web interface. We use sound corpora such as UrbanSound8K and ESC-50 to train the audio detectors. However, manually annotated sound corpora are not available at the scale we need, and relying on them would not make the system self-sufficient. Hence, we use a novel self-learning pipeline to train our detectors on web data. A distributed web crawler continuously crawls the web to collect sound samples and metadata for audio clips on the internet; our crawler is currently focused on YouTube clips. These clips, along with the metadata indexed by the crawler, support self-learning and ensure that NELS is self-sufficient and can scale as the number of audio classes grows.

An effective way to represent audio files as feature vectors for categorizing audio events is the bag-of-audio-words representation. Low-level features such as Mel-frequency cepstral coefficients (MFCCs) are used to construct the bag of audio words. We use Gaussian mixture models (GMMs) for a robust representation, as the GMM components effectively capture the distribution of MFCC vectors in a recording. Once we obtain this feature representation, support vector machines (SVMs) are used to train sound detectors. These preliminary detectors are then used to obtain predictions on audio segments in unlabelled web data. The web data is divided into multiple batches for iterative self-learning of the audio detectors, and we employ various selection techniques to choose candidate segments for the next retraining round. As illustrated in our paper "An Approach for Self-Training Audio Event Detectors Using Web Data", we obtain a one percent improvement in the precision of the audio detectors per batch of 10 YouTube clips across all selection techniques. The broader goal of NELS is a system robust enough to develop an independent classifier for each audio category that predicts the category of a given segment with high precision.
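The pipeline described above can be summarised in a short sketch. This is not the NELS implementation, only an illustration of the same steps using librosa and scikit-learn; the file names, the 64-component GMM, the linear SVM and the confidence-threshold selection rule are all placeholder choices for illustration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def extract_mfcc(path, n_mfcc=20):
    """Load an audio clip and return its MFCC frames (num_frames x n_mfcc)."""
    y, sr = librosa.load(path, sr=22050)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def bag_of_audio_words(mfcc_frames, gmm):
    """Encode a recording as its average posterior over the GMM components,
    giving a fixed-length bag-of-audio-words vector."""
    return gmm.predict_proba(mfcc_frames).mean(axis=0)

# Hypothetical labelled clips, e.g. drawn from UrbanSound8K or ESC-50.
train_paths = ["dog_bark_01.wav", "siren_01.wav"]       # placeholder paths
train_labels = np.array(["dog_bark", "siren"])          # placeholder labels

# 1. Fit a GMM "codebook" on MFCC frames pooled across the training clips.
all_frames = np.vstack([extract_mfcc(p) for p in train_paths])
gmm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
gmm.fit(all_frames)

# 2. Encode every clip and train a preliminary SVM detector.
X_train = np.array([bag_of_audio_words(extract_mfcc(p), gmm) for p in train_paths])
clf = SVC(kernel="linear")
clf.fit(X_train, train_labels)

# 3. Iterative self-training on batches of unlabelled web (e.g. YouTube) clips.
web_batches = [["crawled_clip_001.wav", "crawled_clip_002.wav"]]  # placeholders
for batch_paths in web_batches:
    X_batch = np.array([bag_of_audio_words(extract_mfcc(p), gmm) for p in batch_paths])
    scores = clf.decision_function(X_batch)
    confidence = np.abs(scores) if scores.ndim == 1 else scores.max(axis=1)
    keep = confidence > 0.5        # one possible selection rule among several
    X_train = np.vstack([X_train, X_batch[keep]])
    train_labels = np.concatenate([train_labels, clf.predict(X_batch)[keep]])
    clf.fit(X_train, train_labels)  # retrain with the selected web segments
```

The key design point is that each recording, regardless of its length, is reduced to a fixed-length vector over the GMM components, so standard classifiers such as SVMs can be trained and retrained cheaply as new web batches arrive.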
For the classifiers to improve, we need user feedback to validate and refine the accuracy of the audio detectors. We have a web interface (available at http://nels.cs.cmu.edu/) that lets users visualize the relationships and attributes formulated by NELS. The interface also collects binary user feedback on predicted segments of web data; this feedback provides ground truth for audio segments, which improves detection in the next retraining phase (see the sketch below).

Classification of audio events has applications in many fields. Robots can perceive their environment using sound as a sensory input, smart cities can obtain real-time dynamics, and audio event detection can improve the experience of hearing aids for the deaf and assistive devices for the blind. NELS aims to transform society by helping machines make sense of audio data, driving the technological advancements mentioned above. Humans can only detect audio events between 20 Hz and 20 kHz; in the future, we believe NELS will be able to draw inferences from audio events outside this range as well, enabling applications never thought of before.
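As a small sketch of the feedback step, and again not the NELS code itself, the snippet below shows one way binary yes/no answers from the web interface could be turned into ground truth for the next retraining phase; the helper name and feedback format are hypothetical.

```python
import numpy as np

def apply_feedback(segments, predicted_labels, feedback):
    """Keep segments whose predicted label a user confirmed ("yes")
    as newly labelled ground truth for retraining."""
    confirmed_X, confirmed_y = [], []
    for features, label, answer in zip(segments, predicted_labels, feedback):
        if answer == "yes":                 # binary feedback: "yes" / "no"
            confirmed_X.append(features)
            confirmed_y.append(label)
    return np.array(confirmed_X), np.array(confirmed_y)

# Example usage: fold confirmed segments into the training set, then retrain.
# X_batch and preds come from the detector; answers come from the web interface.
# X_new, y_new = apply_feedback(X_batch, preds, answers)
# clf.fit(np.vstack([X_train, X_new]), np.concatenate([y_train, y_new]))
```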