F1 Score Calculator for Python Model Review
An F1 score summarizes a classifier's precision and recall in a single number: their harmonic mean. This is useful when accuracy hides important mistakes. A model can look accurate simply because the dataset is dominated by easy negative cases. Binary F1 focuses on the positive class. With averaging, it covers every class.
Why F1 matters
Precision answers a direct question. Of the items predicted positive, how many were correct? Recall asks another question. Of the real positives, how many did the model find? F1 rewards models that keep both values strong. A high precision with weak recall can still produce a modest F1 score. A high recall with weak precision can do the same.
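The two questions above translate directly into ratios of counts. The following is a minimal sketch, not the calculator's own code: the function name `precision_recall_f1` is hypothetical, and reporting an undefined ratio as 0.0 is an assumption made here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from binary counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# High precision with weak recall still gives a modest F1.
p, r, f1 = precision_recall_f1(tp=10, fp=1, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.909 recall=0.200 f1=0.328
```

Because the harmonic mean is pulled toward the smaller value, the weak recall of 0.2 dominates the result even though precision is above 0.9.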
Working with counts
The binary count method uses true positives, false positives, false negatives, and true negatives. Together, these four values form a confusion matrix. The calculator checks each value, prevents invalid totals, and shows related measures. These include accuracy, specificity, error rate, and balanced accuracy. You can also enter a beta value to compute the general F-beta score. Beta above one gives more weight to recall. Beta below one gives more weight to precision.
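As a rough illustration of the count method, the sketch below derives the listed measures and an F-beta score from the four confusion matrix counts. The function name `binary_report`, the specific validation rules, and the choice to report undefined ratios as 0.0 are assumptions for this example. The calculator's actual checks are not described in detail here.

```python
def binary_report(tp, fp, fn, tn, beta=1.0):
    """Compute common binary measures from confusion matrix counts."""
    # Assumed validation: counts are non-negative integers, not all zero.
    for name, value in {"tp": tp, "fp": fp, "fn": fn, "tn": tn}.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one count must be greater than zero")
    if beta <= 0:
        raise ValueError("beta must be positive")

    def ratio(num, den):
        # Assumption: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = (tp + tn) / total
    b2 = beta ** 2
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "balanced_accuracy": (recall + specificity) / 2,
        # F-beta: weighted harmonic mean of precision and recall.
        "f_beta": ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


for beta in (0.5, 1.0, 2.0):
    report = binary_report(tp=40, fp=10, fn=20, tn=30, beta=beta)
    print(f"beta={beta}: F={report['f_beta']:.3f}")
# prints:
# beta=0.5: F=0.769
# beta=1.0: F=0.727
# beta=2.0: F=0.690
```

In this example precision (0.8) is higher than recall (about 0.667), so the score drops as beta grows and recall gains weight.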
Working with labels
You can paste true labels and predicted labels from Python output. Use comma-separated values. The tool creates class-level counts for each label. It then reports precision, recall, F1, and support, where support is the number of true instances of a class. Micro averaging pools the counts from all classes before computing the score. Macro averaging takes the unweighted mean of the per-class scores, so every class counts equally. Weighted averaging weights each per-class score by its support. These options match common model review needs.
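The label method can be sketched in plain Python as follows. The helper names `parse_labels`, `class_report`, and `averaged_f1` are hypothetical, the sample labels are made up, and treating undefined ratios as 0.0 is an assumption. The tool may handle edge cases differently.

```python
from collections import Counter


def parse_labels(text):
    """Split a comma-separated string into a list of stripped labels."""
    return [item.strip() for item in text.split(",") if item.strip()]


def class_report(y_true, y_pred):
    """Return per-class tp/fp/fn counts, precision, recall, F1, and support."""
    if len(y_true) != len(y_pred):
        raise ValueError("true and predicted label lists differ in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    rows = {}
    for label in sorted(set(y_true) | set(y_pred)):
        t, p, n = tp[label], fp[label], fn[label]
        rows[label] = {
            "tp": t, "fp": p, "fn": n,
            "precision": t / (t + p) if (t + p) else 0.0,
            "recall": t / (t + n) if (t + n) else 0.0,
            "f1": 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0,
            "support": t + n,
        }
    return rows


def averaged_f1(rows):
    """Return micro, macro, and weighted F1 from per-class rows."""
    tp = sum(r["tp"] for r in rows.values())
    fp = sum(r["fp"] for r in rows.values())
    fn = sum(r["fn"] for r in rows.values())
    support = sum(r["support"] for r in rows.values())
    return {
        "micro": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        "macro": sum(r["f1"] for r in rows.values()) / len(rows) if rows else 0.0,
        "weighted": (sum(r["f1"] * r["support"] for r in rows.values()) / support
                     if support else 0.0),
    }


y_true = parse_labels("cat, cat, cat, dog, dog, bird")
y_pred = parse_labels("cat, cat, dog, dog, bird, bird")
rows = class_report(y_true, y_pred)
for name, value in averaged_f1(rows).items():
    print(f"{name}: {value:.3f}")
# prints:
# micro: 0.667
# macro: 0.656
# weighted: 0.678
```

The three averages differ here because the classes have unequal support: the largest class, "cat", also has the best F1, which lifts the weighted average above the macro average.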
Using the result
Do not judge a model by F1 alone. Weigh it against business risk, class balance, and data quality. For fraud detection, recall may matter more, because a missed case can be costly. For spam filtering, precision may matter more, because a false alarm hides a legitimate message. Check the class table before choosing a model. Look for weak classes and uneven support.
Downloads
The CSV file is useful for spreadsheets. The PDF file is useful for records. Both exports include the inputs and calculated values. Keep them with training notes. They make model comparisons easier.
Best practice
Use the same validation split for each comparison. Record the decision threshold used to turn model scores into predicted classes. Small threshold changes can shift precision and recall. Save each run, then compare scores with care.
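To see why the threshold is worth recording, the sketch below scores the same toy predictions at three nearby thresholds. The data is invented for illustration and the function name `scores_at_threshold` is hypothetical. A score at or above the threshold is assumed to count as a positive prediction.

```python
def scores_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, F1) when score >= threshold means positive."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


# Toy validation data: 1 = positive, 0 = negative (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.55, 0.45, 0.6, 0.4, 0.3, 0.1]

for threshold in (0.42, 0.50, 0.58):
    p, r, f1 = scores_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints:
# threshold=0.42 precision=0.800 recall=1.000 f1=0.889
# threshold=0.50 precision=0.750 recall=0.750 f1=0.750
# threshold=0.58 precision=0.667 recall=0.500 f1=0.571
```

On this small sample, moving the threshold by less than 0.1 in either direction changes F1 noticeably, which is why two runs scored at different thresholds are not directly comparable.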