To understand the scope of a patent, a reader must understand its claims, because the claims define the patent's legal scope. However, most researchers have instead focused on combining standard information retrieval (IR) methods, applied to all or part of the patent, with IR methods based on patent citations.
Yet the features used in standard IR methods do not perform as well as richer features that preserve the semantic structure of the document. One such feature is the dependency triple, consisting of two words and the grammatical relation between them. Learning these structures gives a system deeper insight into the document and enables better analysis of it. Using these structures for patent text classification has been shown to outperform simple n-grams.
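For illustration, a dependency triple pairs a head word and a dependent word with the relation linking them. The following minimal sketch extracts such triples using spaCy as a stand-in dependency parser (any dependency parser would serve); the sentence and model name are illustrative assumptions, not part of our system.

```python
# A minimal sketch of extracting dependency triples (head, relation,
# dependent), using spaCy as a stand-in parser for illustration.
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A head is attached to said shaft.")
for token in doc:
    if token.dep_ != "ROOT":
        # e.g. ('attached', 'nsubjpass', 'head')
        print((token.head.text, token.dep_, token.text))
```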
Although patent claims are written in English, they do not read like standard English. The entire scope of the invention must be described in a single sentence: patent agents and attorneys must cover every possible form of the invention in one long, run-on sentence written in very exacting language.
These constraints, along with other drafting rules, make patent claims difficult to parse. Off-the-shelf NLP parsers such as the Stanford Parser do not produce correct parse trees for most patent claims. As seen in the figure, the Stanford Parser never correctly labels ``said'' as an adjective, and the parsing degrades as a result. This stems from the data used to train the parser's model of English: the Wall Street Journal corpus.
Researchers have proposed ways to avoid this problem. One common method adopted by the patent parsing community is to chunk long patent claims into smaller segments. Chunking also avoids the time and memory costs of parsing sentences as long as most patent claims. Even with smaller segments, however, the Stanford Parser does not perform well.
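As one illustration, a minimal chunking heuristic might split a claim at its ``comprising:'' transition and at the semicolons that separate claim elements. The delimiters below are an assumption for illustration; the chunking rules used in the literature vary.

```python
import re

def chunk_claim(claim_text):
    """Split a long patent claim into smaller segments for parsing.

    A minimal heuristic: claims typically introduce elements with a
    'comprising:' transition and separate them with semicolons, so
    splitting on those delimiters yields shorter, more parseable text.
    """
    segments = re.split(r";\s*(?:and\s+)?|:\s*", claim_text)
    return [s.strip() for s in segments if s.strip()]

claim = ("A fastener comprising: a threaded shaft; "
         "a head attached to said shaft; and "
         "a washer surrounding said shaft.")
print(chunk_claim(claim))
```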
The only approach guaranteed to yield better parses is to train a grammar model on patents and use that model in the parser. Training such a model, however, requires a hand-annotated corpus of patent claims, and creating that corpus demands development time and resources that make it infeasible.
The Stanford Parser does, however, provide a mechanism to force certain words to receive particular part-of-speech (POS) tags. As seen in the figure, simply correcting the incorrect verb tags (sometimes over multiple iterations) and rerunning the Stanford Parser yields a correct parse.
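Concretely, the parser accepts pre-tagged input in which each token is written as word/TAG. The sketch below only prepares such input from a first-pass tagging plus a set of corrections; the exact parser flags for reading pre-tagged text (e.g., -tokenized and -tagSeparator) vary by version, so the comment is indicative rather than definitive.

```python
# A minimal sketch, assuming the Stanford Parser's pre-tagged input
# mode: tokens written as word/TAG, read with whitespace tokenization
# via flags along the lines of
#   -tokenized -tagSeparator / \
#   -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer
# (check the parser documentation for the exact options).

def force_tags(tokens, corrections):
    """Emit word/TAG tokens, overriding tags listed in `corrections`.

    tokens:       list of (word, tag) pairs from the parser's first pass
    corrections:  dict mapping token index -> corrected tag
    """
    return " ".join(
        f"{word}/{corrections.get(i, tag)}"
        for i, (word, tag) in enumerate(tokens)
    )

tokens = [("said", "VBD"), ("fastener", "NN")]
print(force_tags(tokens, {0: "JJ"}))  # -> said/JJ fastener/NN
```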
We developed a system that automatically corrects these incorrect POS tags. Using a simple SVM-based classifier, it learns the properties of incorrectly tagged words and the tags they should instead receive, and then applies these corrections to POS tags in other patent claims.
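A minimal sketch of such a corrector follows, assuming simple lexical and context features; the feature template and training examples here are illustrative assumptions, not the feature set our system actually uses.

```python
# A minimal sketch of an SVM-based tag corrector over hypothetical
# lexical/context features for tokens the parser tagged as verbs.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def features(word, prev_tag, next_tag):
    # Hypothetical feature template for a token originally tagged VB*.
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "prev_tag": prev_tag,
        "next_tag": next_tag,
        "capitalized": word[0].isupper(),
    }

# Training data: (features, correct tag) pairs, e.g. gathered via AMT.
X = [features("said", "IN", "NN"), features("comprising", "NN", "DT")]
y = ["JJ", "VBG"]  # the tags the words should have received

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)
print(model.predict([features("said", "DT", "NN")]))  # -> ['JJ']
```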
For the system to learn, we gathered a corpus of words incorrectly labeled as verbs, together with their correct tags. Reducing the annotation task from obtaining POS tags for every word in a claim to verifying only the words originally tagged as verbs made assembling this corpus a problem that Amazon Mechanical Turk (AMT) could feasibly solve.
To show that this technique has merit, we devised a simple patent subject classification task. The mature field of patent subject classification *cite several papers here* is well suited to demonstrating that this simple fix to patent claim parsing is enough to improve performance. The task involves classifying new patents into one of several categories or subcategories.
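As a rough sketch of the evaluation pipeline, once dependency triples have been extracted from corrected parses, each patent can be represented as a bag of triples and fed to a standard classifier. The triples, category labels, and classifier choice below are illustrative assumptions.

```python
# A minimal sketch of subject classification over dependency-triple
# features; each claim is represented as a "document" of triple tokens.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

claims = [
    "nsubj(comprising,fastener) dobj(comprising,shaft)",
    "nsubj(comprising,circuit) dobj(comprising,transistor)",
]
labels = ["mechanical", "electrical"]  # hypothetical subject categories

clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), LinearSVC())
clf.fit(claims, labels)
print(clf.predict(["dobj(comprising,transistor)"]))  # -> ['electrical']
```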
In this paper, we provide an overview of our AMT campaign and its evolution over time, along with statistics from each stage of the campaign. We then describe the automatic POS tag corrector and its performance on data from our AMT campaign. Finally, we present our patent subject classification system.