Saturday, January 17, 2009

Guest Post: Indian Kanoon - The road so far and the road ahead

The following guest post is from Sushant Sinha - a student at the University of Michigan and the force behind the searchable legal databases website Indian Kanoon. Our blog has previously debated issues of free digital accessibility of legal information (here, here, here and here). Indian Kanoon is an important step in that direction. (We realise that some of the technical details in the post may be unfamiliar to our readers, but the broad themes have been discussed regularly on our blog.)

----------------------------------------------------------------------------
I was quite pleased to find legal information publicly available on judis and indiacode. However, it was too difficult to look for anything on these websites, so I started building tool sets to play with law data. At a certain point I felt that integrating these small software pieces would be very interesting. I was still skeptical as to whether search on legal documents meant anything to common people who do not know legal jargon. In any case, I integrated the tool sets into a search engine and was pleasantly surprised when many of my common queries were well answered. So I deployed it as a publicly available service, called it Indian Kanoon, and fortunately many people have found it useful over time. When actual people start using a service (whether free or fee-based), the demand for correctness and usability increases significantly. The need to understand the problems, think about the issues and fix them has kept me in a tight grip. Indian Kanoon was announced last January in a very crude form, and a number of changes have gone in over the past year. So this post is mostly to highlight what work has gone into Indian Kanoon in the last year, what the challenges were and what features are planned for the future.

Integrating more legal documents
Indian Kanoon started with only Supreme Court judgments and central laws. Clearly this was not sufficient for many people who wanted to search high court judgments, law commission reports and law journals. Over the last year, a number of other legal documents have been added. First, the law commission reports and a law journal were added. The law journal "Central India Law Quarterly" was digitized and put on the Internet by Devaranjan. The only problem in integrating them was that many of these documents were images scanned from books. So I used Tesseract, a free OCR program supported by Google, to extract text from these images. However, the text extraction quality was just 90%, and I doubt that Google uses Tesseract for its own Google Books project.
Tarunabh pointed out that the constituent assembly debates were available and could be integrated. He also pointed out two main problems in integrating them. First, the article numbers in the debates are different from those in the Constitution. Second, debates are cited in court judgments using page numbers from the official books. Neither of these was available in the digital copy provided by the government, so the only way out was to go back to the actual books. We did not want to give up on the digital route yet, so we went to books.google.com, which had a scanned copy of the debates. Tarunabh emailed Google asking them to release those books into the public domain, as the copyright on them had expired the previous year. Google replied that they were not sure about the copyright expiration and would be conservative in making books publicly available. Finally, I borrowed the books from a library, manually copied the page numbers and the association list between the article numbers in the debates and the article numbers in the Constitution, and integrated the constituent assembly debates.
Indian Kanoon was highly deficient in terms of high court judgments, and even in Supreme Court judgments, as Dilip earlier pointed out on my blog. So I integrated the high court judgments and made Indian Kanoon more comprehensive.

Features
Besides making Indian Kanoon comprehensive in terms of legal documents, a number of features have been added to make searching easier. The most common problem was the misspelling of Indian names, so I first added the most critical feature: spelling suggestions. The ability to search and order documents by date was added next. The search pages and forums were redesigned to look more appealing. To provide notifications for new judgments, an RSS feed for court judgments was recently added. Finally, people may like to monitor documents related to certain words or phrases, so on Tarunabh's suggestion I added an RSS feed for any arbitrary query.
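The post does not say how the spelling suggestions are computed. As a rough illustration only, here is a minimal sketch that assumes a vocabulary of known words and uses Python's standard difflib module to find the closest match for each unknown word. The vocabulary list, the function name suggest and the cutoff value are all hypothetical.

```python
import difflib

# Hypothetical vocabulary. In a real system this would be built from the
# words (for example, party names) that occur in the indexed documents.
VOCABULARY = ["kesavananda", "bharati", "maneka", "gandhi", "golaknath", "minerva"]


def suggest(query, vocabulary=VOCABULARY, cutoff=0.75):
    """Return a corrected query string, or None if nothing needed correcting."""
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Pick the single closest vocabulary word above the similarity cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            # No plausible correction: keep the word as typed.
            corrected.append(word)
    return " ".join(corrected) if changed else None


if __name__ == "__main__":
    print(suggest("kesavanand bharti"))
```

A query whose words are all in the vocabulary returns None, so the caller can skip showing a "did you mean" line.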

Contributing code back
Developing the Indian Kanoon software has been possible because of the availability of a large amount of free software, which I was able to modify and customize for law search. Indian Kanoon uses a feature-rich open source database, PostgreSQL, as the backend. When users submit a query, matching documents are found and ordered, and the top few are shown. For each document, the search engine also displays a small text excerpt in which the query terms appear. The excerpt allows people to quickly evaluate whether the document is relevant to the query. The headline function developed for Indian Kanoon was contributed back to PostgreSQL and has been added to the PostgreSQL CVS head. Besides that, a bug in PostgreSQL was fixed as well. I also sent the phrase search function to the PostgreSQL list, but Teodor Sigaev, who merged OpenFTS into PostgreSQL, wants a generic operator that can check for an arbitrary distance between lexemes. I have not yet had time to work on this operator. Besides development on the database, the Indian Kanoon forums have been released as djangobb, a Django bulletin board that uses the Django web application framework. The judis recently moved to a heavily obfuscated website where judgments did not have a stable URL. Prashant Iyengar pointed out that we were not getting the live feed from the judis, so I reverse engineered the website and released the judis reverse engineering code.
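To make the excerpt idea above concrete, here is a minimal sketch of a headline-style function in Python. It is not the code contributed to PostgreSQL. It assumes a simple approach: slide a fixed-size window of words over the text, keep the first window that contains the most distinct query terms, and wrap the matching words in bold tags. The function name headline and the max_words parameter are illustrative.

```python
import re


def headline(text, query, max_words=12):
    """Return a max_words-long excerpt of text covering the most query terms."""
    terms = {t.lower() for t in query.split()}
    words = text.split()
    # Normalise each word for matching: strip punctuation and lowercase.
    normalised = [re.sub(r"\W+", "", w).lower() for w in words]

    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - max_words + 1)):
        window = normalised[start:start + max_words]
        # Score a window by the number of distinct query terms it contains.
        score = len(terms.intersection(window))
        if score > best_score:
            best_start, best_score = start, score

    end = best_start + max_words
    excerpt = []
    for raw, norm in zip(words[best_start:end], normalised[best_start:end]):
        # Highlight words that match a query term.
        excerpt.append(f"<b>{raw}</b>" if norm in terms else raw)
    return " ".join(excerpt)


if __name__ == "__main__":
    sample = ("The court held that freedom of speech is a fundamental right "
              "under the Constitution and cannot be curtailed lightly")
    print(headline(sample, "freedom speech", max_words=5))
```

A real search engine would match on stemmed lexemes rather than raw words, and would usually prefer windows where the terms sit close together. This sketch ignores both refinements.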

Future work
Even after so much work, a number of things need to be improved on Indian Kanoon. Here is a list of changes that I think are required to make Indian Kanoon more comprehensive, richer and better at search. Please feel free to suggest more.

1. Reverse engineering different court and tribunal websites so that indian kanoon can provide a live feed of all Indian court and tribunal judgments.

2. Currently Indian Kanoon cannot answer questions like "list of judgments in which a particular law section was held" and "search only in family law judgments". The problem is that we do not have enough semantic information about judgments. So I want to enable common users to start tagging documents. There will be two kinds of tagging: first, categorizing court judgments and laws into broad categories like family law, constitutional law, right to equality, etc.; and second, tagging whether a judgment explains, bolsters, or overturns a given law or judgment. The tags generated by users will be available to everyone under the Creative Commons Attribution-Share Alike 3.0 license.

3. A number of people type natural language into the search box. For example, someone will type "recent judgments from delhi high court". Even though the engine can answer such questions, we currently match the query text directly against the documents. The above query could instead have been reduced to "doctypes: delhi sortby: mostrecent". So what we need is a small natural language processor that can automatically convert such natural language queries into a more precise query that the engine can evaluate.

4. I currently support only searching for a set of words in the documents. Roy wanted a more sophisticated query language that supports boolean queries. This will enable people to issue more complicated queries like (freedom OR speech) AND (NOT expression).

5. With the addition of more data over time, Indian Kanoon takes more than a second to evaluate some queries. A number of software changes (or possibly a hardware upgrade) are required to bring the evaluation time back under a second.
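As an illustration of item 3 above, here is a minimal sketch of a rule-based query rewriter in Python. It reproduces the one example given in the list, turning "recent judgments from delhi high court" into "doctypes: delhi sortby: mostrecent". Everything else is an assumption made for the sketch: the phrase table, the "supremecourt" doctype value, the filler-word list and the function name rewrite are hypothetical, and a real natural language processor would need far more than keyword rules.

```python
# Hypothetical phrase table mapping court names to doctype filters.
# Only the "delhi" value comes from the example in the post.
COURT_PHRASES = {
    "delhi high court": "delhi",
    "supreme court": "supremecourt",  # assumed value, for illustration only
}

# Words that signal the user wants the newest documents first.
RECENCY_WORDS = {"recent", "latest", "newest"}

# Words that carry no search meaning once the filters are extracted.
FILLER_WORDS = {"judgments", "judgment", "from", "of", "the", "in"}


def rewrite(query):
    """Convert a natural language query into a more precise engine query."""
    text = query.lower()
    filters = []
    for phrase, doctype in COURT_PHRASES.items():
        if phrase in text:
            filters.append(f"doctypes: {doctype}")
            # Remove the court name so it is not also searched as keywords.
            text = text.replace(phrase, " ")
            break

    words = text.split()
    sort_recent = any(w in RECENCY_WORDS for w in words)
    keywords = [w for w in words
                if w not in RECENCY_WORDS and w not in FILLER_WORDS]

    parts = keywords + filters
    if sort_recent:
        parts.append("sortby: mostrecent")
    return " ".join(parts)


if __name__ == "__main__":
    print(rewrite("recent judgments from delhi high court"))
```

Queries that contain none of the known phrases pass through unchanged apart from lowercasing and filler-word removal, so ordinary keyword search keeps working.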

6 comments:

Renu Gupta said...

Indian kanoon seems like a useful portal..just noticed an error while browsing..

http://www.indiankanoon.org/feeds/latest/supremecourt/

on this link, the date of the newsfeed just below each heading seems to be running into 2014, 2013..You might want to correct this.

Sushant said...

@Renu: You are right. This is mostly because of errors in recovering the date from the judgment. It extracts most of the dates correctly except for a few, and that is why you will see some judgment dates that are after the current date. I will try to fix them as well.

suchit said...

Sushant and your team,

I think you guys are doing a great job. Indian Kanoon is just a fabulous web-site and far better than paid websites.

Your efforts are helping activists and RTI guys in India to quote case laws without recourse to expensive database and the horrendous nic websites that are a nightmare for any visitor.

But the feeling is always there- what if the database goes off the net?

Humble suggestion: Why don't you keep a nominal charge per annum which students, teachers and small lawyers without enough means may afford. You probably can give them extra facilities like being able to track their past search and refining within search and so on. This will help you get funds to keep your project running. At the same time, the fear of the website going, howsover misplaced will vanish. You will also be doing a great service to nation and sometimes unimaginable help to the destitute.

Thanks and regards.
You are doing a fabulous job.

Regards.

Suchitt D Dave,
Advocate - Supreme Court of India.

Sushant said...

Hi Suchith,

Thanks for your appreciation! Do not worry, Indian Kanoon is not going to fold now. Actually, I am planning a major infrastructure upgrade by putting new servers in the data center that will speed up query evaluation and provide better fault tolerance (in case some servers die).

After that there is also a plan to launch value-added services on IK that will make it self-sufficient.

Unknown said...

INDIA KANOON - A GREAT WEBSITE . IT IS A GREAT HELP FOR THE PEOPLE OF INDIA AND ENCOURAGE LEGAL LITERACY AND ACCESS TO JUSTICE AMONG MASSES. A EMPOWERING TOOL. KEEP UP THE GREAT WORK

Ravi Kant
Advocate Supreme Court of India
President, Shakti Vahini