IP Australia Metadata Extraction Project

The Client: IP Australia

IP Australia is the Australian Government agency responsible for administering intellectual property (IP) rights and legislation relating to patents, trademarks, designs and plant breeders’ rights.


The Patent Backcapture Challenge

IP Australia (IPA) had 390,000 historic patent documents, dating back to 1904, with little or no metadata. It was impossible to search through them effectively. IPA asked Semantic Sciences to extract items of metadata using the Sintelix extraction capabilities so that these records would be accessible to clients.

Many of these documents were available only in hard copy, and some were over 100 years old, in black and white and of moderate quality. Using OCR, these documents were converted into PDF format, creating new opportunities for storage and analysis.


IP Australia Project Requirements

  1. Capture/extract bibliographic fields from OCRed patent records and specifications from 1904 to 1979.
  2. Provide IPA with captured/extracted data in a specified structured XML format.


The Backcapture Solution

As shown in the workflow diagram below, Sintelix provided a solution to IP Australia’s challenges within 2 months by:

  1. Extracting and transforming existing patent specification documents into 390,000 PDF documents.
  2. Loading those documents into Sintelix.
  3. Normalizing and extracting information from those documents, creating 390,000 XML files.
  4. Placing the metadata back into IP Australia databases in a searchable and easy to analyze format, making records accessible to clients.

Sintelix workflow detailing the back-capture solution for IP Australia

With Sintelix, IP Australia were able to transform this large volume of scanned documents into structured data, extracting bibliographic fields including:

  1. Filing date (lodging or lodged date) of patent specification
  2. Invention title
  3. Applicant(s) name
  4. Inventor(s) name
  5. Agent’s name
  6. OPI (open to public inspection) date
  7. Filing date of basic application/priority application
  8. IP Office of priority country
  9. Priority application number/number assigned to priority application
  10. Divisional application numbers (parent/child applications)
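
As an illustration of how fields like those above might be carried in a per-document XML record, here is a minimal Python sketch. The element names (`PatentRecord`, `FilingDate`, `InventionTitle`, `Applicant`, `Inventor`) and the sample values are assumptions made for this example only; the XML format actually specified by IP Australia is not described in this case study, and this is not the Sintelix implementation.

```python
import xml.etree.ElementTree as ET


def build_record(fields: dict) -> str:
    """Serialize extracted bibliographic fields into one XML record.

    Element names are hypothetical; the real schema is not given here.
    A list value produces one element per item (e.g. several inventors).
    """
    root = ET.Element("PatentRecord")
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, name).text = item
    return ET.tostring(root, encoding="unicode")


# Invented sample values, for illustration only.
sample = {
    "FilingDate": "1925-03-14",
    "InventionTitle": "Improved wool press",
    "Applicant": ["Example Pty Ltd"],
    "Inventor": ["A. Example", "B. Example"],
}

xml_text = build_record(sample)
print(xml_text)
```

Repeating an element for each value keeps multi-valued fields, such as several inventors on one specification, easy to query once the records are loaded into a database.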

See examples below, showing the metadata extracted from historic patent specifications:

Extracting text from patents

Data Backcapture Project Outcome

With Sintelix, IP Australia were able to successfully extract metadata from 390,000 patent specifications within 6 weeks, meeting the tight deadline and delivering the required level of accuracy.

The letter of recommendation below from IP Australia confirms the following project highlights:

  • High consistency
  • Excellent accuracy
  • Rapid execution
  • Low cost

Here are some of the comments from the letter of recommendation:

“The project was organised in two (2) stages: a proof of concept and a main delivery, with a decision gate in between. The results IPA received from the proof of concept were good and achieved within a very short period, so IPA authorised the main project to proceed. Its timelines were tight (6 weeks) and required high accuracy.

Semantic Sciences Research provided IPA with visibility of its progress via online access to progress reports with drill-down to the source and processed data provided from its Sintelix software platform.

Delivered results were excellent. A field accuracy of 99.7% was achieved, which is significantly greater than IPA would expect from human transcription. The project was performed on time and on budget.

IP Australia enjoyed a positive experience of working with Semantic Sciences Research and using Sintelix. The company met our procurement and performance expectations for service providers. We valued Semantic Sciences Research’s timeliness, responsiveness and proactivity.”  Veena Bhat, Patent Search Capability Coordinator, IP Australia.

Referral from IP Australia
