Aleph's main features are available through the web interface. Data loading and other maintenance operations are provided via command-line tools. To understand the underlying concepts and how Aleph works, please refer to the chapters below.
Below are some of the common use cases Aleph was designed to cover. To learn more about the needs such a tool addresses, please refer to the page on user needs in investigative journalism.
There’s also a glossary describing the keywords used in Aleph.
Consider some common use cases:
- As a journalist, I want to combine different types of facets which represent document and entity metadata.
- As a journalist, I want to see a list of documents that mention a person/org/topic so that I can sift through the documents.
- As a journalist, I want to intersect sets of documents that mention multiple people/orgs/topics so that I can drill down on the relationships between them.
- As a data importer, I want to routinely crawl and import documents from many data sources, including web scrapers, structured sources and filesystems.
- As a data importer, I want to associate metadata with documents and entities so that users can browse by various facets.
One way to get data into Aleph is to provide files and folders that it can crawl, loading their content into the database.
From files and folders
Aleph provides a command-line tool to process all the files and folders under a given input path. Some files, such as archives (ZIP packages or tarballs), are treated as virtual folders: all of their content is imported under the archive's name.
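For example, if the crawled directory contains an archive named reports.zip (a hypothetical file name), its contents would appear in Aleph as documents nested under a folder called reports.zip.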
It is important to note that this method of loading data provides very limited ways of including metadata (e.g. document titles, source URLs, or document languages).
To use this tool, run:
docker-compose run app python aleph/manage.py crawldir <DIRECTORY|FILE PATH>
The tool accepts the following options:

-f, --foreign_id COLLECTION_IMPORT_ID   the import ID of the target collection (look it up in the settings tab of the collection)
-c, --country COUNTRY                   a country to associate with the imported documents
-l, --language LANGUAGE                 a language to associate with the imported documents
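For example, assuming a collection whose import ID is kenya-leaks and a folder of English-language documents at /data/kenya (both hypothetical values), the import could be started with:

docker-compose run app python aleph/manage.py crawldir -f kenya-leaks -l en /data/kenya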
Note that importing the same directory multiple times will not duplicate the source files, as long as the base path of the crawl is identical: the base path of each file is used to identify the document source.
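Following the hypothetical example above, re-running the crawl on /data/kenya would update the existing documents rather than create duplicates, whereas crawling a copy of the same files from a different base path (say, /data/kenya-copy) would be treated as a separate set of documents, since the base path identifies the source.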