Exploring Paralex, a paraphrase-driven open question answering engine
Out of curiosity, I did some quick research to find out whether any existing solutions provide answers to open (but simple) English questions, using data from popular sites such as Wikipedia, WikiAnswers or Yahoo Answers. Surprisingly, very few such tools seem to be publicly available, except for Paralex, a proof-of-concept paraphrase-driven open question answering engine developed by a research team at the computer science department of the University of Washington. I decided to give it a try, mostly to find out what types of questions it could answer and the quality of the answers provided.
Examining the download
An evaluation version of the software, claimed to be able to answer simple questions using data from WikiAnswers, is available for download from Paralex’s homepage. To try it out, download the paralex-evaluation.tar.bz2 file and extract it using tar. You will see the following folder structure:
# ls -lah
total 76K
drwxr-xr-x 6 16835 11679 4.0K Feb 12 12:29 .
drwxr-xr-x 3 root  root  4.0K Feb  1 14:57 ..
drwxrwxr-x 8 16835 11679 4.0K Jun 11  2013 data
drwxrwxr-x 3 16835 11679 4.0K Jun 11  2013 external
drwxrwxr-x 4 16835 11679 4.0K Jun 11  2013 python
-rw-rw-r-- 1 16835 11679  12K Jun 12  2013 README.md
drwxrwxr-x 3 16835 11679 4.0K Jun 11  2013 scripts
drwxrwxr-x 4 16835 11679 4.0K Jun 11  2013 web
The README.md file contains everything you need to know to get the demo up and running. Using MarkdownPad, I created a PDF version of the readme file which can be downloaded here. Of these folders, scripts contains the bash scripts needed to configure the engine and start the demo:
-rwxrwxr-x 1 16835 11679  378 Jun 12  2013 run_aveweights.sh
-rwxrwxr-x 1 16835 11679  583 Jun 11  2013 run_eval.sh
-rwxrwxr-x 1 16835 11679 1.6K Jun 12  2013 run_lexlearn.sh
-rwxrwxr-x 1 16835 11679  347 Jun 12  2013 run_mergelex.sh
-rwxrwxr-x 1 16835 11679  929 Jun 11  2013 run_qa.sh
-rwxrwxr-x 1 16835 11679  639 Jun 12  2013 run_weightlearn.sh
-rwxrwxr-x 1 16835 11679   67 Jun 11  2013 start_demo.sh
-rwxrwxr-x 1 16835 11679   99 Jun 11  2013 start_nlp.sh
Setting up the demo
First install Python 2.7 and common Python packages such as bottle. (The code uses Python 2 print syntax, so Python 3 will not work.) To get the basic version running, you will only need the following files:
- start_nlp.sh: starts an instance of the Stanford NLP tools, accessible via HTTP on port 8082. The code for this can be downloaded here.
- start_demo.sh: starts a simple web server which uses the above NLP tools and data from WikiAnswers to answer simple English questions. It runs on port 8083 and provides a simple JSON-based question-to-answer API.
One thing to note: although the bash files are in the scripts folder, their code assumes that they are run from the root folder, since relative paths are used:
#!/bin/bash
set -u
set -e
cd external/stanford-corenlp-python
python corenlp.py -H 0.0.0.0 -p 8082
Therefore you will need to move the scripts outside the scripts folder for things to work. After that, run start_nlp.sh and wait a while for the NLP tools to finish initializing. Once initialization is complete, run start_demo.sh (without terminating start_nlp.sh; for example, run it in a different terminal) and you will see some numbers on the screen:
0 100000 200000 300000 ............... 600000 ............... 0 100000 ...............
The numbers seem to count up from 0, reset to 0 and count up again, and the loop continues for a while. What is the script doing, and are we stuck in an infinite loop? The answer can be found by examining web/web.py, which is called by start_demo.sh. That Python script contains the following code:
import lex.demo
import bottle
from bottle import static_file, route, request, run, post, get, response

WEB_PATH = 'web/'
LEXICON_PATH = 'data/lexicons/paralex'
WEIGHTS_PATH = 'data/weights/paralex.txt'
DATABASE_PATH = 'data/database'
NLP_PORT = 8082
NLP_HOST = 'localhost'

qa = lex.demo.QA(lexicon_path=LEXICON_PATH,
                 weights_path=WEIGHTS_PATH,
                 database_path=DATABASE_PATH,
                 nlp_port=NLP_PORT,
                 nlp_host=NLP_HOST)
Now examine the python/lex/demo.py file and you will see the following code:
class QA:
    def __init__(self, lexicon_path='data/lexicons/paralex_iter1',
                 weights_path='data/weights/paralex_iter1/mixed.iter21',
                 database_path='data/database',
                 nlp_port=8082, nlp_host='rv-n15'):
        nl_vocab = lex.vocab.read_vocab(open('%s/vocab.txt' % lexicon_path))
        nl_vocab_inv = lex.vocab.read_vocab_inv(open('%s/vocab.txt' % lexicon_path))
        db_vocab = lex.vocab.read_vocab(open('%s/vocab.txt' % database_path))
        db_vocab_inv = lex.vocab.read_vocab_inv(open('%s/vocab.txt' % database_path))
        lexicon = lex.lexicon.read_lexicon(open('%s/lexicon.txt' % lexicon_path))
        weights = lex.learn.load_weights(open(weights_path))
        db = lex.db.open_db('%s/tuples.db' % database_path)
        config = dict(nlp_port=nlp_port, nlp_host=nlp_host)
Finally, have a look at python/lex/vocab.py, which contains the following:
import sys
from collections import defaultdict
from lex.semantics import str2obj as str2semantics

def swap(a, b):
    return b, a

def ident(a, b):
    return a, b

def read_vocab(input, fn=ident):
    result = dict()
    for i, line in enumerate(input):
        if i % 100000 == 0:
            print >>sys.stderr, i
        id, word = line.strip().replace('\t', ' ').split(' ', 1)
        id = int(id)
        word = word.strip()
        key, value = fn(word, id)
        result[key] = value
    return result

def read_vocab_inv(input):
    return read_vocab(input, fn=swap)
It should now be clear why various numbers are shown when start_demo.sh is executed. The code reads the vocabulary files, presumably generated from the WikiAnswers data dump, to build its knowledge base. Several files are read, and the count is printed for every 100,000 lines that have been read. Certainly not a good way to show loading progress! If I were to do this, I would at least print the name of the file being read and how many files are left in the queue, so as not to confuse the user, instead of a single number that jumps up and down as files are processed. I also wonder why the same file (vocab.txt) is read again and again, with only slightly different methods (read_vocab and read_vocab_inv). Some optimizations are obviously needed here, but that is not something to worry about for now.
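Each vocab.txt only needs to be read once: the inverse mapping can be built in memory instead of re-parsing the file. A minimal sketch of the idea (my own illustration, not code from the project):

```python
def read_vocab_once(lines):
    """Build both the word->id and id->word mappings in a single pass
    over 'id word' lines, instead of reading the file twice."""
    vocab = {}
    for line in lines:
        id_str, word = line.strip().replace('\t', ' ').split(' ', 1)
        vocab[word.strip()] = int(id_str)
    # Invert the dict in memory; no second pass over the file is needed
    vocab_inv = {v: k for k, v in vocab.items()}
    return vocab, vocab_inv

vocab, vocab_inv = read_vocab_once(['0 who', '1 invent', '2 wikipedia'])
# vocab['wikipedia'] is 2; vocab_inv[0] is 'who'
```

This halves the number of passes over each multi-hundred-megabyte file, although it does nothing about the memory footprint, which is discussed further below.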
With the demo server running, open a browser and type the following URL:
http://localhost:8083/parse?sent=Who+invented+Wikipedia
If everything is working correctly, you will get a response that looks like this:
{
  "result" : [
    {
      "parse" : [["who invent $y ?", "found.r(?, y)"], ["wikipedia", "wikipedia.e"]],
      "query" : "SELECT arg1 FROM tuples WHERE rel=\"found.r\" AND arg2=\"wikipedia.e\"",
      "score" : 0.333333333332,
      "answers" : ["jimmy-wales.e", "rick-astley.e"]
    },
    {
      "parse" : [["who invent $y ?", "have-invent.r(?, y)"], ["wikipedia", "wikipedia.e"]],
      "query" : "SELECT arg1 FROM tuples WHERE rel=\"have-invent.r\" AND arg2=\"wikipedia.e\"",
      "score" : 0.222222222222,
      "answers" : []
    },
    ..............
  ]
}
The JSON output provides answers, in simple words, to the question asked (“Who invented Wikipedia”), and the SQL queries shown in the response also give some insight into how Paralex actually works. The engine parses the given question into known formats, e.g. who/what/when/where, and looks up its tuple database, which has been carefully constructed from the WikiAnswers dataset, in the hope that at least one answer matches the given question. Although this might be a shot in the dark, in our case the algorithm correctly returns Jimmy Wales as the founder of Wikipedia, although Rick Astley, the English singer, is also returned. I am not sure what makes Paralex think that Rick Astley has anything to do with Wikipedia. Also shown in the response are other queries which were attempted but did not return any answers at all.
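Since the /parse endpoint returns plain JSON, it is easy to consume programmatically. A small sketch using only the standard library (the helper names are my own; the response shape is as observed above):

```python
import json
import urllib.request
from urllib.parse import urlencode

def extract_answers(data):
    """Flatten the ranked parses in a /parse response into a list of
    readable answers, stripping the '.e' entity suffix and hyphens."""
    answers = []
    for parse in data.get('result', []):
        for a in parse.get('answers', []):
            if a.endswith('.e'):
                a = a[:-2].replace('-', ' ')
            answers.append(a)
    return answers

def ask(question, host='localhost', port=8083):
    """Send a question to the Paralex demo server and return its answers."""
    url = 'http://%s:%d/parse?%s' % (host, port, urlencode({'sent': question}))
    with urllib.request.urlopen(url) as resp:
        return extract_answers(json.load(resp))

# Example (requires the demo server to be running):
# print(ask('Who invented Wikipedia'))
```

Answers arrive sorted by parse score, so the first entries are the engine's best guesses.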
Hosting the demo
Although the download contains a web folder with a simple index.html page to demonstrate the use of the API, the provided demo page is unusable. It contains one single text box (!) with no labels or text at all and uses complicated jQuery code to parse the JSON response. It also did not work in my tests. I therefore wrote a simple HTML form with a simple PHP parser script, which accepts questions typed into a text box, calls the Paralex API for the answers and produces a nicely-formatted output page.
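The formatting step is trivial in any language; a minimal Python sketch of the same idea, rendering the answer list as an HTML fragment for whatever small web handler you prefer (the function name and markup are my own, not part of Paralex):

```python
def render_answers(question, answers):
    """Render a question and its list of answers as a simple HTML fragment,
    suitable for returning from a small web handler or CGI script."""
    items = '\n'.join('  <li>%s</li>' % a.capitalize() for a in answers)
    return '<h2>%s</h2>\n<ul>\n%s\n</ul>' % (question, items)

html = render_answers('Who invented Wikipedia?', ['jimmy wales', 'rick astley'])
```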
For this to run unattended, you will need to start the Paralex engine at system startup, for example by creating a bash file with the following code and adding it to your rc.local, or to your crontab with the @reboot directive:
cd /root/paralex/paralex-evaluation
nohup ./start_nlp.sh < /dev/null &
nohup ./start_demo.sh < /dev/null &
Below is the full list of answers returned for the above question, as produced by my code:
- Jimmy wales
- Rick astley
- 2001
- January 15 , 2001
- Design
- Andy
- Billmon
- Bridge creek
- Ed miller
- More detailed explanation
- 2001
As can be seen, except for the first answer, which is correct, the rest make little sense. Dates are even returned as answers to a ‘Who’ question, although it can be argued that 15 January 2001 is the date Wikipedia went public. Oh well.
Accuracy of answers
To see how accurate the answers can be, I tried a few other questions, listed below:
- Who invented Facebook?
- Who wrote Robinson Crusoe?
- When did Wikipedia start?
- What did Robert Hooke discover?
These questions produced the following correct answers, among other things:
- Mark zuckerberg
- Daniel defoe / Defoe
- 15 january 2001
- Plant cell
Unfortunately, after repeated tests, I concluded that the engine can only answer these types of simple questions. Slightly more complicated questions, such as ‘When did the First World War start?’ or ‘In which country is Tokyo?’, produce nonsensical answers or even no answers at all. This is to be expected from the way the system is designed: there is no way that simple SQL queries can find the answers to many different types of questions. My last attempt, ‘In which continent is Singapore’, produced the following answers:
- Asian destination
- Destination
- Deterrent
- Example
- International destination
- International package
- Manufacturing facility
- South pacific
- Subsequent production
- Hunting and fishing
- Buyer
- Territory
The first and eighth answers are the closest to being acceptable. I guess some further processing is needed to paraphrase the keywords returned from the SQL queries into an acceptable answer. For example, answer (1) should have been ‘Asia’ to be correct, while answer (8) is incorrect since Singapore is not in the South Pacific. The rest of the answers are, well, probably taken blindly from some WikiAnswers text that contains the word ‘Singapore’.
The list of language structures recognized by Paralex can be found in several files named vocab.txt in the data folder. The knowledge base itself is the file tuples.db in the data/database folder. Some examples of the types of questions that can be answered can be found in the data/questions folder. However, in my tests, some of these sample questions did not produce any answers. This is perhaps due to different datasets being used to train the downloaded evaluation version of Paralex, or simply because the authors do not expect Paralex to be able to answer all sample questions.
Design scalability
The design uses tuples.db, a 3.5GB SQLite 3.x database to store the list of tuples:
[root@panel database]# file tuples.db tuples.db: SQLite 3.x database [root@panel database]#
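Being a plain SQLite database, tuples.db can also be queried directly with Python’s built-in sqlite3 module. A sketch mirroring the SELECT statements seen in the API responses above (assuming the tuples table schema those queries imply):

```python
import sqlite3

def lookup(conn, rel, arg2):
    """Return all arg1 values for a given relation and arg2 entity,
    mirroring the queries Paralex generates, e.g.
    SELECT arg1 FROM tuples WHERE rel="found.r" AND arg2="wikipedia.e"."""
    cur = conn.execute(
        'SELECT arg1 FROM tuples WHERE rel = ? AND arg2 = ?',
        (rel, arg2))
    return [row[0] for row in cur.fetchall()]

# Example (against the real 3.5GB database from the download):
# conn = sqlite3.connect('data/database/tuples.db')
# print(lookup(conn, 'found.r', 'wikipedia.e'))
```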
As the theoretical size limit of an SQLite database is huge, more tuples can be added to this database for a better knowledge base. What limits the scalability of this design, however, is the lex/vocab.py script, which reads the vocab.txt file into memory several times and consumes a lot of system memory.
In my tests, this causes Paralex to consume well above 4GB of memory just after loading. You will need at least 6GB of memory for it to start successfully, and probably 8GB if you want Paralex to coexist with other processes on your server. This is a waste of system resources, given that the author could have converted vocab.txt to a database and used queries to retrieve the entries, instead of keeping everything in memory. The in-memory approach was most likely chosen to save time and effort, or simply as a quick and dirty way to complete the project, since the accuracy of the answers provided by the engine probably means that Paralex can never be used for any serious products anyway.
Conclusion
In summary, I am disappointed with the engine. I was expecting it to provide full sentences as answers, not just a few English words. The accuracy of the answers is too low for any serious purpose, and the memory requirements (at least 6GB of RAM) mean that most people, including me, would hesitate to run it on a production server, even just for a demo. I could probably do better by using the Bing Search API, or perhaps the DuckDuckGo Instant Answer API, followed by some simple string parsing to select the most probable answers from, say, the first 20 or 50 results. Hopefully a new version with better accuracy and performance will eventually be available.
The original research paper for the Paralex project, namely ‘Paraphrase-Driven Learning for Open Question Answering’, can be downloaded from here. The raw WikiAnswers data can be downloaded here. For those who are interested, download similar question-to-answer datasets for Yahoo Answers and Wikipedia. Also check out the source code for Open Question Answering Over Curated and Extracted Knowledge Bases, another natural language processing project from the same team that developed Paralex.