Data files that include textual content are received from data sources. The data files are categorized using a taxonomy of categories, where each category has sample textual content that defines a concept for the category. The categorizing includes comparing the textual content of each data file with the sample textual content for the category. A file score is calculated for each data file, indicating the degree of similarity between the defined concept of the category and a determined concept for the data file. Each data file is associated with the category if its file score is equal to or greater than a pre-determined minimum score for the category. A portion of the data file and/or the file score is provided.
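The scoring and thresholding described above can be illustrated with a minimal sketch. The abstract does not specify how similarity is measured, so this example assumes cosine similarity over simple word counts as a stand-in; the function names (`term_vector`, `file_score`, `categorize`), the taxonomy layout, and the sample texts and minimum scores are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased word counts for a piece of text (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def file_score(file_text, sample_text):
    """Cosine similarity between the two word-count vectors, in [0, 1].

    This is an assumed similarity measure, not one named by the source.
    """
    a, b = term_vector(file_text), term_vector(sample_text)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def categorize(file_text, taxonomy):
    """Return {category: score} for every category whose minimum score is met."""
    matches = {}
    for name, (sample_text, min_score) in taxonomy.items():
        score = file_score(file_text, sample_text)
        if score >= min_score:  # "equal to or greater than" the minimum
            matches[name] = score
    return matches


# Hypothetical taxonomy: category -> (sample textual content, minimum score)
taxonomy = {
    "finance": ("stock market shares earnings investors", 0.3),
    "sports": ("football match goal team league", 0.3),
}

result = categorize("Investors watched the stock market as earnings rose", taxonomy)
print(result)  # only the "finance" category meets its minimum score
```

A per-category minimum score, rather than one global threshold, lets broad categories accept looser matches while narrow categories stay strict, which matches the abstract's wording of a pre-determined minimum score "for the category".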