Used with permission from the Microsoft Next Blog
by Allison Linn
Microsoft has released a set of 100,000 questions and answers that artificial intelligence researchers can use in their quest to create systems that can read and answer questions as well as a human.
The dataset is called MS MARCO, which stands for Microsoft MAchine Reading COmprehension, and the team behind it says it’s the most useful dataset of its kind because it is based on anonymized real-world data. By making it broadly available for free to researchers, the team is hoping to spur the kind of breakthroughs in machine reading that are already happening in image and speech recognition.
They also hope to facilitate the kind of advances that could lead to the long-term goal of ‘artificial general intelligence,’ or machines that can think like humans.
“In order to move towards artificial general intelligence, we need to take a step towards being able to read a document and understand it as well as a person,” said Rangan Majumder, a partner group program manager with Microsoft’s Bing search engine division who is leading the effort. “This is a step in that direction.”
Right now, Majumder said, systems to answer sophisticated questions are still in their infancy. Search engines like Bing and virtual assistants like Cortana can answer basic questions, like “What day does Hanukkah start?” or “What’s 2,000 times 43?”
But in many cases, Majumder said, search engines and virtual assistants will instead point the user to a set of search engine results. Users can still get the information they need, but it requires sifting through the results and finding the answer on the web page.
In order to make automated question-and-answer systems better, researchers need a strong source of what is called training data. These datasets can be used to teach artificial intelligence systems to recognize questions and formulate answers and, eventually, to create systems that can come up with their own answers based on unique questions they haven’t seen before.
Majumder and his team – which includes Microsoft researchers and people working on Microsoft products – say the MS MARCO dataset is particularly useful because the questions are based on real, anonymized queries from Microsoft’s Bing search engine and Cortana virtual assistant. The team chose the anonymized questions based on the queries they thought would be more interesting to researchers. In addition, the answers were written by humans, based on real web pages, and verified for accuracy.
By providing realistic questions and answers, the researchers say they can train systems to better deal with the nuances and complexities of questions regular people actually ask – including those queries that have no clear answer or multiple possible answers.
For example, the dataset contains the question, “What foods did ancient Greeks eat?” Answering it correctly means combing through snippets of information from multiple documents or pieces of text to come up with foods such as grains, cake, milk, olives, fish, garlic and cabbage.
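To make the shape of such an example concrete, here is a minimal sketch of how one question, its source snippets and its human-written answer might be held together in code. The class name, field names and snippet text are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAExample:
    """One hypothetical training record (field names are assumptions)."""
    query: str                                          # the question a user asked
    passages: List[str] = field(default_factory=list)   # snippets drawn from several documents
    answers: List[str] = field(default_factory=list)    # human-written answers

    def supporting_passages(self, term: str) -> List[str]:
        """Return the passages that mention the given term, ignoring case."""
        needle = term.lower()
        return [p for p in self.passages if needle in p.lower()]


# The snippets below are invented placeholders, not text from the dataset.
example = QAExample(
    query="What foods did ancient Greeks eat?",
    passages=[
        "Placeholder snippet A: meals were built around grains, olives and fish.",
        "Placeholder snippet B: garlic and cabbage were common, along with milk and cake.",
    ],
    answers=["Grains, cake, milk, olives, fish, garlic and cabbage."],
)
```

In this sketch no single snippet contains the whole answer, which is the point of the example: the answer has to be assembled from more than one piece of text.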
Li Deng, partner research manager of Microsoft’s Deep Learning Technology Center, said previous datasets were designed with certain limitations, or constraints. That made it easier for researchers to create solutions that could be formulated as what machine learning researchers call “classification problems,” rather than solutions that seek to understand the actual text of the question.
He said MS MARCO is designed so that researchers can experiment with more advanced deep learning models designed to push artificial intelligence research further forward.
“Our dataset is designed not only using real-world data but also removing such constraints so that the new-generation deep learning models can understand the data first before they answer questions,” he said.
Majumder said the ability of systems to answer complex questions could augment human abilities by helping people get information more efficiently.
Let’s say a Canadian student wants to know if she qualifies for a certain loan program. A search engine might direct that user to a set of websites, where she would have to read through the data and come up with an answer on her own. With better tools, a virtual assistant could scan that information for her and quickly provide a more nuanced and perhaps even personalized answer.
“Given much of the world’s knowledge is found in a written format, if we can get machines to be able to read and understand documents as well as humans, we can unlock all of these kinds of scenarios,” Majumder said.
Long-term goal: ‘Artificial general intelligence’
For now at least, researchers are still far from creating systems that can truly understand or comprehend what humans are saying, seeing or writing – what many refer to as “artificial general intelligence.”
But in the last few years, machine learning and artificial intelligence researchers at Microsoft and elsewhere have made great strides in creating systems that can recognize the words in a conversation and correctly identify the elements of an image.
“Microsoft has led the way in speech recognition and image recognition, and now we want to lead the way in reading comprehension,” Majumder said.
But, he noted, this isn’t a problem that any one company can solve alone. Majumder said one reason his team released the dataset is that they want to work with others in the field.
MS MARCO is modeled on similar training sets that were created to help spur breakthroughs in other areas of machine learning and artificial intelligence. That includes the ImageNet database, which is considered to be the premier dataset for testing advances in image recognition. A team at Microsoft used ImageNet to test its first deep residual networks, sparking major leaps in the accuracy of image recognition.
The MS MARCO team also plans to follow ImageNet’s example by creating a leaderboard that shows which teams of researchers are getting the best results. Eventually, they may create a more formal competition along the lines of ImageNet’s annual challenges.
The MS MARCO dataset is available for free to any researcher who wants to download it and use it for non-commercial applications.