The entity extraction framework – known as Baleen – can automatically extract information from unstructured and semi-structured text. It tries to identify and extract entities from the text, such as people, locations, organisations and dates.
Baleen has been under development for a number of years; Dstl is now seeking community contributions that can feed back into the software, improving the quality of the code and extending its capability.
Similar projects already exist in the public domain, but Baleen provides an end-to-end entity extraction capability based on Apache UIMA (Unstructured Information Management Architecture), which is becoming a widely used and accepted framework.
Dstl’s James Baker says he hopes the text analytics developer community will help develop the software further:
“We are releasing the core framework and a number of components of Baleen onto Github.com for the community to use, adapt and improve. We hope suppliers, members of academia and individuals will help take this further and develop capabilities which we have not yet uncovered, as well as find a use for it in their own work.”
Dstl’s code can be found on the GitHub site. For further information on the technical side, email: firstname.lastname@example.org.