Saturday, January 17, 2009

Guest Post: Indian Kanoon - The road so far and the road ahead

The following guest post is from Sushant Sinha - a student at the University of Michigan and the force behind the searchable legal databases website Indian Kanoon. Our blog has previously debated issues of free digital accessibility of legal information (here, here, here and here). Indian Kanoon is an important step in that direction. (We realise that some of the technical details in the post may be unfamiliar to our readers, but the broad themes have been discussed regularly on our blog.)

I was quite pleased to find legal information publicly available on judis and indiacode. However, it was too difficult to find anything on these websites, so I started building tools to play with the law data. At a certain point I felt that integrating these small pieces of software would be very interesting. I was still skeptical about whether search over legal documents would mean anything to ordinary people who do not know legal jargon. In any case, I integrated the tools into a search engine and was pleasantly surprised when many of my everyday queries were answered well. So I deployed it as a publicly available service, called it Indian Kanoon, and fortunately many people have found it useful over time. When actual people start using a service (whether free or fee-based), the demand for correctness and usability increases significantly. The need to understand the problems, think about the issues and fix them has kept me in a tight grip. Indian Kanoon was announced last January in a very crude form, and a number of changes have gone in over the past year. This post highlights the work that has gone into Indian Kanoon in the last year, what the challenges were and what features are planned for the future.

Integrating more legal documents
Indian Kanoon started with only Supreme Court judgments and central laws. Clearly this was not sufficient for the many people who wanted to search high court judgments, law commission reports and law journals. Over the last year, a number of other legal documents have been added. First, the law commission reports and a law journal were added. The law journal "Central India Law Quarterly" had been digitized and put up on the Internet by Devaranjan. The only problem in integrating them was that many of these documents were images scanned from books. So I used Tesseract, a free OCR program supported by Google, to extract text from these images. However, the text extraction quality was just 90%, and I doubt that Google uses Tesseract for its own Google Books project. Tarunabh pointed out that the Constituent Assembly debates were available and could be integrated. He also pointed out two main problems in integrating them. First, the article numbers in the debates are different from those in the Constitution. Second, debates are cited in court judgments using page numbers from the official books. Neither the article mapping nor the page numbers were available in the digital copy provided by the government, so the only way out was to go back to the actual books. We did not want to give up on the digital route yet, so we turned to Google, which had a scanned copy of the debates. Tarunabh emailed Google asking them to release those books into the public domain, as the copyright on them had expired the previous year. Google replied that they were not sure about the copyright expiration and would be conservative in making books publicly available. Finally, I borrowed the books from a library, manually copied the page numbers and the association list between the article numbers in the debates and the article numbers in the Constitution, and integrated the Constituent Assembly debates.
Indian Kanoon was highly deficient in high court judgments, and even in Supreme Court judgments, as Dilip had earlier pointed out on my blog. So I integrated the high court judgments and made Indian Kanoon more comprehensive.

Features
Besides making Indian Kanoon comprehensive in terms of legal documents, a number of features have been added to make searching easier. The most common problem was the misspelling of Indian names, so the first and most critical feature I added was spelling suggestions. The ability to search and order documents by date was added next. The search pages and forums were redesigned to look more appealing. To provide notifications of new judgments, an RSS feed for court judgments was recently added. Finally, people may like to monitor documents related to certain words or phrases, so on Tarunabh's suggestion I added an RSS feed for any arbitrary query.
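As a rough illustration of how spelling suggestions for names can work, here is a minimal Python sketch. It is not the Indian Kanoon implementation: the vocabulary list, the `suggest` function and the similarity cutoff are all assumptions made for the example, and it simply relies on the standard library's `difflib` to pick the closest known word.

```python
import difflib

# Hypothetical vocabulary; a real system would build it from the indexed documents.
VOCABULARY = ["kesavananda", "bharati", "maneka", "gandhi", "golaknath", "minerva", "mills"]


def suggest(query, vocabulary=VOCABULARY, cutoff=0.75):
    """Return a corrected query, or None when no word needs a suggestion."""
    known = set(vocabulary)
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in known:
            corrected.append(word)
            continue
        # Closest vocabulary word by difflib's similarity ratio (assumed cutoff).
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            # Unknown word with no close match: leave it untouched.
            corrected.append(word)
    return " ".join(corrected) if changed else None


if __name__ == "__main__":
    print(suggest("kesavanand bharti"))
```

A query whose words are all known returns `None`, so the search page would only show a "did you mean" line when something actually changed.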

Contributing code back
Developing the Indian Kanoon software has been possible because of the availability of a large amount of free software, which I was able to modify and customize for law search. Indian Kanoon uses a feature-rich open source database, PostgreSQL, as the backend. When users submit a query, matching documents are found and ordered, and the top few are shown. For each document, the search engine also displays a small text excerpt where the query terms appear. The excerpt allows people to quickly evaluate whether the document is relevant to the query. The headline function developed for Indian Kanoon was contributed back to PostgreSQL and has been added to the PostgreSQL CVS head. Besides that, a bug in PostgreSQL was fixed as well. I also sent the phrase search function to the PostgreSQL list. However, Teodor Sigaev, who merged OpenFTS into PostgreSQL, wants a generic operator that can check for an arbitrary distance between lexemes. I have not yet had time to work on this operator. Besides development on the database, the Indian Kanoon forum software has been released as djangobb, a Django bulletin board that uses the Django web application framework. The judis recently moved to a really obfuscated website where judgments did not have stable URLs. Prashant Iyengar pointed out that we were not getting the live feed from the judis. So I reverse engineered the website and released the judis reverse engineering code.
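To make the idea of a query-aware excerpt concrete, here is a minimal Python sketch. It is not the headline function that was contributed to PostgreSQL; the `headline` name, the `max_words` parameter and the `<b>` markers are assumptions chosen for illustration. It slides a fixed-size window over the document, keeps the first window that covers the most distinct query terms, and marks the hits.

```python
import re


def headline(text, query, max_words=12):
    """Return the max_words-long window covering the most distinct query terms."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    # Normalised copy of each word (punctuation stripped) used only for matching.
    norm = [re.sub(r"\W+", "", w).lower() for w in words]

    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - max_words + 1)):
        window = norm[start:start + max_words]
        score = len(terms.intersection(window))
        if score > best_score:  # strict comparison keeps the earliest best window
            best_start, best_score = start, score

    end = min(len(words), best_start + max_words)
    pieces = [
        f"<b>{w}</b>" if norm[i] in terms else w
        for i, w in enumerate(words[best_start:end], start=best_start)
    ]
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(pieces) + suffix


if __name__ == "__main__":
    doc = ("The court held that the right to life under Article 21 includes "
           "the right to livelihood and that eviction without notice violates it")
    print(headline(doc, "livelihood eviction", max_words=6))
```

Scanning every window is quadratic in the worst case, which is acceptable for a sketch but is one reason a production version would work from the positions of the matching terms instead.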

Future work
Even after so much work, a number of things need to be improved in Indian Kanoon. Here is a list of changes that I think are required to make Indian Kanoon more comprehensive, richer and better at search. Please feel free to suggest more.

1. Reverse engineering the websites of different courts and tribunals so that Indian Kanoon can provide a live feed of all Indian court and tribunal judgments.

2. Currently Indian Kanoon cannot answer questions like "list of judgments in which a particular law section was held" and "search only in family law judgments". The problem is that we do not have enough semantic information about judgments. So I want to enable ordinary users to start tagging documents. There will be two kinds of tagging: first, categorizing court judgments and laws into broad categories like family law, constitutional law, right to equality, etc.; and second, tagging whether a judgment explains, bolsters, or overturns a given law or judgment. The tags generated by users will be available to everyone under the Creative Commons Attribution-Share Alike 3.0 license.

3. A number of people type natural language into the search box. For example, someone will type "recent judgments from delhi high court". Even though the engine could answer such a question precisely, we currently match the query words directly against the documents. For example, the above query could have been reduced to "doctypes: delhi sortby: mostrecent". So what we need is a small natural language processor that can automatically convert such natural language queries into a more precise query that the engine can evaluate.

4. I only support searching for a set of words in the documents. Roy wanted a more sophisticated query language that supports boolean queries. This will enable people to issue more complicated queries like (freedom OR speech) AND (NOT expression).

5. With the addition of more data over time, Indian Kanoon takes more than a second to evaluate some queries. A number of software changes (or possibly a hardware upgrade) are required to bring the evaluation time back under a second.
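The query rewriting described in item 3 could start out as a handful of pattern rules. The Python sketch below is only an assumption about how such a processor might look: the pattern tables, the filler-word list and the `rewrite` function are hypothetical, and only the target form "doctypes: delhi sortby: mostrecent" comes from the example above.

```python
import re

# Hypothetical rule tables; a real list would cover every court and ordering.
COURT_PATTERNS = {
    r"\bdelhi high court\b": "delhi",
}
SORT_PATTERNS = {
    r"\b(recent|latest|newest)\b": "mostrecent",
}
# Words that carry no search meaning once the filters are extracted (assumed list).
FILLER = {"judgments", "judgment", "cases", "from", "of", "the", "in"}


def rewrite(query):
    """Turn a natural language query into keywords followed by filters."""
    text = query.lower()
    filters = []
    for pattern, doctype in COURT_PATTERNS.items():
        if re.search(pattern, text):
            filters.append(f"doctypes: {doctype}")
            text = re.sub(pattern, " ", text)
    for pattern, order in SORT_PATTERNS.items():
        if re.search(pattern, text):
            filters.append(f"sortby: {order}")
            text = re.sub(pattern, " ", text)
    if not filters:
        # Nothing recognised: leave the query exactly as the user typed it.
        return query
    keywords = [w for w in text.split() if w not in FILLER]
    return " ".join(keywords + filters)


if __name__ == "__main__":
    print(rewrite("recent judgments from delhi high court"))
```

Queries that match no rule are passed through unchanged, so a rewriter like this can only narrow a search when it recognises something and never makes a plain keyword query worse.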