Hierarchical Datasets for Feature Selection in Genomics


mkdir -p static/datasets
cd static/datasets
for f in CellCycle Church Derisi Eisen Expr Gasch1 Gasch2 Sequence SPO; do
  wget https://oliveira-sh.github.io/datasets/${f}.arff
done
cd -

These .arff files come from functional genomics research (Amanda Clare) and are annotated using FunCat and Gene Ontology (GO), which makes them well suited to hierarchical multi-label classification with tools like Clus or GMNB (ebi.ac.uk, dtai.cs.kuleuven.be). The table below summarizes the main characteristics of the ten freely available datasets used.
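To get a first look at a downloaded file, one option is to count the attribute declarations in its ARFF header. The sketch below assumes the files follow the usual ARFF layout (`@attribute` lines followed by an `@data` marker); `count_attributes` is a hypothetical helper name, and the count it reports may include the class attribute, so it can differ by one from the attribute counts quoted elsewhere.

```shell
# Hypothetical helper: count @attribute declarations in an ARFF header.
count_attributes() {
  # ARFF keywords are case-insensitive; stop reading at the @data marker
  # so that data rows are never scanned.
  sed '/^@[Dd][Aa][Tt][Aa]/q' "$1" | grep -ci '^@attribute'
}

# Assumes the download loop above has populated static/datasets/.
for f in static/datasets/*.arff; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  printf '%s\t%s\n' "$f" "$(count_attributes "$f")"
done
```

Stopping at `@data` keeps the check fast even on the larger files, since only the header is read.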

Datasets

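| Dataset   | Number of attributes | Number of classes | Number of classes per level  |
|-----------|----------------------|-------------------|------------------------------|
| Cellcycle | 77                   | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Church    | 27                   | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Derisi    | 63                   | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Eisen     | 79                   | 461               | 18 - 76 - 165 - 131 - 67 - 4 |
| Expr      | 551                  | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Gasch1    | 173                  | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Gasch2    | 52                   | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Pheno     | 69                   | 455               | 18 - 74 - 165 - 129 - 65 - 4 |
| Seq       | 478                  | 499               | 18 - 80 - 178 - 142 - 77 - 4 |
| Spo       | 80                   | 499               | 18 - 80 - 178 - 142 - 77 - 4 |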
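
The per-level counts in each row should add up to the total number of classes. A minimal shell sketch of that arithmetic follows; `sum_levels` is a hypothetical helper name used only for illustration, not part of any tool mentioned here.

```shell
# Hypothetical helper: sum a per-level string such as "18 - 80 - 178".
sum_levels() {
  total=0
  # Turn every non-digit character into a space, then add up the numbers.
  for n in $(printf '%s' "$1" | tr -c '0-9' ' '); do
    total=$((total + n))
  done
  echo "$total"
}

sum_levels "18 - 80 - 178 - 142 - 77 - 4"   # prints 499 (Cellcycle and most others)
sum_levels "18 - 76 - 165 - 131 - 67 - 4"   # prints 461 (Eisen)
sum_levels "18 - 74 - 165 - 129 - 65 - 4"   # prints 455 (Pheno)
```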

Further reading & resources