In my previous post, Classifier development workflow, I briefly mentioned the supervised machine learning algorithms I’m using (CRFs) and the steps that usually comprise the classifier development process.
Today I’m going to take a closer look at my experimental setup. I’d like to start with the first part of the classifier research framework, which I’ll call the Batch Mode.
So, I’ve already pointed out that I’m currently doing NER research on the CoNLL’03 tagged corpus.
The English corpus, according to its description, is divided into three parts: training, testa and testb. It’s also convenient to call them train, dev and test datasets. The train and dev parts are drawn from the same collection of texts - Reuters news articles dated August 1996, but the test part is somewhat different - it’s taken from December 1996.
The train set is 8-9 times larger than any of the dev/test sets. It’s used, obviously, for the training of a supervised classifier. The evaluation of the classifier on the dev set (unseen during training) helps to find good parameters for the system. The evaluation on the test set becomes the final evaluation and helps to compare different classifiers.
My system is currently a C# program that performs the following steps (a rough sketch of the driver code follows the list):
- extracts features from the raw texts and parser results for both the train and eval sets (eval being dev or test, whichever is selected for a given run);
- trains a CRF classifier from the MALLET toolkit on the train set;
- runs the classifier to mark up the eval set and evaluates the results, computing precision, recall and FB1 scores for each individual named entity category as well as microaveraged totals, using my own C# port of this program.
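To make this concrete, here is a rough sketch of how one run of the pipeline could be driven from C#. It assumes MALLET is invoked as an external Java process through its SimpleTagger command-line front end (the exact flag names may differ between MALLET versions), and ExtractFeatures/EvaluateResults are hypothetical stand-ins for my actual extraction and scoring code:

```csharp
using System.Diagnostics;

static void RunPipeline(string trainSet, string evalSet, int iterations)
{
    // 1. Extract feature tables from raw texts and parser results.
    ExtractFeatures(trainSet, "train.features");
    ExtractFeatures(evalSet, "eval.features");

    // 2. Train a MALLET CRF on the train set.
    Exec("java", "-cp mallet.jar cc.mallet.fst.SimpleTagger " +
                 $"--train true --iterations {iterations} " +
                 "--model-file crf.model train.features");

    // 3. Mark up the eval set with the trained model.
    Exec("java", "-cp mallet.jar cc.mallet.fst.SimpleTagger " +
                 "--model-file crf.model eval.features");

    // 4. Score the markup: precision, recall and FB1 per category plus totals.
    EvaluateResults("eval.features", "eval.scores");
}

static void Exec(string fileName, string arguments)
{
    using var process = Process.Start(new ProcessStartInfo(fileName, arguments)
    {
        UseShellExecute = false
    })!;
    process.WaitForExit();
}

// Placeholders for the real feature extraction and scoring code.
static void ExtractFeatures(string dataset, string featuresFile) { /* ... */ }
static void EvaluateResults(string markupFile, string scoresFile) { /* ... */ }
```

The important property here is that a run has no manual steps: given its inputs, it can proceed from start to finish unattended.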
As I experimented with the system, I saw that the resulting scores depend on the following settings (a sketch of how a run’s configuration can be captured follows the list):
- The training set size, obviously.
The larger the training set, the more distinct feature combinations the classifier sees during training. Larger training sets, however, may require more training iterations.
- The learning algorithm’s iteration count.
This was a bit of a surprise, though it’s easily explained. When the number of iterations is too small, the classifier never gets a chance to capture all the significant relations between features in the data, and undertraining occurs. When it’s too high, the classifier gets overtrained and doesn’t extrapolate well to the eval set.
- The exact set of features being used.
The total feature set in my case numbers a few tens of features, so they can form a huge number of different combinations. The features are also highly interrelated - some dominate over the others (making the learning process assign them major weights while extrapolating poorly to the eval set), some give good results only when combined with other specific features, and so on.
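Seen this way, every run is fully determined by a handful of knobs, so its configuration can be captured in a single value. A minimal sketch using a C# record (the names are illustrative, not the actual types from my system):

```csharp
using System.Collections.Generic;

// The settings a single run depends on; keeping them together lets every
// score be traced back to the exact configuration that produced it.
record RunSettings(
    string EvalPart,                 // "dev" or "test"
    int TrainingSetSize,             // number of train documents used
    int Iterations,                  // CRF training iteration count
    IReadOnlySet<string> Features);  // the exact feature subset for this run
```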
The existence and impact of these settings means that a lot of test runs have to be conducted before the optimal choice of settings can be made. Let’s see how the Batch Mode can help me do it.
Running a lot of tests can be done in two ways, or a mixture of them:
- Using more machines to compute;
- Making the machines work day and night without the need to control them manually.
A single test run may take hours or even days. I’d like to be able to schedule a few tens of runs for all available machines and wait for them to complete while doing my everyday work, sleeping, eating and so on.
To understand what goes into a run, let me enumerate all the artifacts used or produced during it (a sketch of the corresponding data structures follows the list):
- Source dataset files, a combination of train + dev or train + test.
- Training set size, an integer number of documents (in the simple case) or a list of document names (in a more complex case).
- Classifier iteration count.
- The set of chosen features.
- The system produces some intermediate files (tables of feature values) that are fed into the learning process. Generating these files is time-consuming, but they take up a lot of disk space and are entirely determined by the aforementioned parameters, so for now I think I can create them on each run and discard them afterwards.
- The trained classifier model file. This is a very valuable piece of information that sums up all the source data and the invested compute time. It’s rather small and allows the classifier to be evaluated again at any moment, so it may be worth keeping for the future.
- The evaluation results - the precision, recall and other scores that are used to build charts and make further development decisions. These are the key product of each run.
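Putting this list together, a completed run could be reduced to a small record for storage. Here is a sketch that reuses the hypothetical RunSettings type from above (again, all names are illustrative):

```csharp
using System.Collections.Generic;

// A precision/recall/FB1 triple, per entity category or in total.
record Scores(double Precision, double Recall, double FB1);

// Everything worth keeping after a run: the configuration, the trained
// model (if kept) and the evaluation scores.
record RunResult(
    RunSettings Settings,
    string? ModelFilePath,                            // null if the model was discarded
    IReadOnlyDictionary<string, Scores> PerCategory,  // PER, LOC, ORG, MISC
    Scores Total);                                    // microaveraged over all categories
```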
So, the Batch Mode looks like a separate application that receives the settings’ values, runs the classifier system and neatly stores the results and some of the intermediate files in a safe place.
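At its core, such an application could be little more than a loop over a queue of settings. A minimal sketch, with ExecuteRun standing in for the pipeline described earlier and SaveResult for whatever storage layer I end up with:

```csharp
using System;
using System.Collections.Generic;

static void RunBatch(IEnumerable<RunSettings> queue)
{
    foreach (var settings in queue)
    {
        // Run the whole train-and-evaluate pipeline for this configuration.
        RunResult result = ExecuteRun(settings);

        // Persist the scores (and possibly the model) right away, so that
        // a crash in the middle of the batch doesn't lose completed runs.
        SaveResult(result);
    }
}

// Placeholders for the pipeline wrapper and the storage layer.
static RunResult ExecuteRun(RunSettings settings) => throw new NotImplementedException();
static void SaveResult(RunResult result) { /* ... */ }
```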
Let me take a pause for some coding, and later I’ll write about the progress of the Batch Mode implementation.