-
- Description:
A FEAture Selection Toolbox for C/C++, MATLAB/OCTAVE and Java, v2.0.0.
FEAST provides implementations of common mutual information based filter feature selection algorithms, and an implementation of RELIEF. All functions expect discrete inputs (except RELIEF, which accepts continuous inputs and does not depend on MIToolbox), and they return the selected feature indices. These implementations were developed to help our research into the similarities between these algorithms, and our results are presented in the following paper:
G. Brown, A. Pocock, M.-J. Zhao, M. Lujan. "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection". Journal of Machine Learning Research, 13:27-66 (2012).
The weighted feature selection algorithms are described in:
A. Pocock, N. Edakunni, M.-J. Zhao, M. Lujan, G. Brown. "Information Theoretic Feature Selection for Cost-Sensitive Problems". ArXiv.
If you use these implementations for academic research please cite the relevant paper above. All FEAST code is licensed under the BSD 3-Clause License.
Contains implementations of: mim, mrmr, mifs, cmim, jmi, disr, cife, icap, condred, cmi, relief, fcbf, betagamma
And weighted implementations of: mim, cmim, jmi, disr, cmi
References for these algorithms are provided in the accompanying feast.bib file (in BibTeX format).
FEAST works on discrete inputs, so all continuous values must be discretised before use with FEAST. In our experiments we've found that 10 equal-width bins are suitable for many problems, though the best choice depends on the size of the data set. When given continuous inputs, FEAST produces unreliable results, runs slowly, and uses much more memory than usual. The discrete inputs should also have small cardinality: FEAST treats the values {1,10,100} the same way it treats {1,2,3}, and the latter will be both faster and use less memory. An example discretisation is sketched below.
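As an illustration, a minimal MATLAB sketch of this equal-width binning (the variable names are ours; data is assumed to be a samples-by-features matrix of continuous values):

numBins = 10;
discData = zeros(size(data));
for f = 1:size(data,2)
    col = data(:,f);
    range = max(col) - min(col);
    if range == 0
        discData(:,f) = 1;              % constant feature: one bin
    else
        bins = floor((col - min(col)) ./ range .* numBins) + 1;
        bins(bins > numBins) = numBins; % the maximum lands in the top bin
        discData(:,f) = bins;
    end
end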
MATLAB Example (using "data" as our feature matrix, and "labels" as the class label vector):
size(data)
ans = (569,30)   %% denoting 569 examples, and 30 features

selectedIndices = feast('jmi',5,data,labels)   %% selecting the top 5 features using the jmi algorithm
selectedIndices =
28 21 8 27 23

selectedIndices = feast('mrmr',10,data,labels)   %% selecting the top 10 features using the mrmr algorithm
selectedIndices =
28 24 22 8 27 21 29 4 7 25

selectedIndices = feast('mifs',5,data,labels,0.7)   %% selecting the top 5 features using the mifs algorithm with beta = 0.7
selectedIndices =
28 24 22 20 29
The library is written in ANSI C for compatibility with the MATLAB mex compiler, except for MIM, FCBF and RELIEF, which are written in MATLAB/OCTAVE script. A separate implementation of MIM is available for use in the C library.
MIToolbox v3.0.1 is required to compile these algorithms, and these implementations supersede the example implementations given in that package (they behave more robustly when given unexpected inputs).
MIToolbox can be found at: http://www.github.com/Craigacp/MIToolbox/
The C library expects all matrices in column-major format (i.e. Fortran style). This is for two reasons: (a) MATLAB generates Fortran-style arrays, and (b) feature selection iterates over columns (features) rather than rows, unlike most other ML processes.
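For example, MATLAB's linear indexing runs down each column before moving to the next, so a samples-by-features matrix already stores each feature contiguously:

A = [1 2; 3 4];
A(:)'   % returns 1 3 2 4, i.e. column by column
% element (s,f) of an n-by-d matrix sits at linear index (f-1)*n + s,
% which is the layout the C library expects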
Compilation instructions:
- MATLAB/OCTAVE - run CompileFEAST.m
- Linux C shared library - use the included makefile
- Java - see java/README.md
- Changes to previous version:
Major refactoring of FEAST to improve speed and portability.
- FEAST now clones the input data if it's floating point and discretises it to unsigned ints once in a single pass. This improves the speed by about 30%.
- FEAST now has unsigned int entry points which avoid this discretisation and are much faster if the data is already categorical.
- Added weighted feature selection algorithms to FEAST which can be used for cost-sensitive feature selection.
- Added a Java API using JNI.
- FEAST now returns the internal score for each feature according to the criterion. Available in all three APIs.
- Rearranged the repository to make it easier to work with. Header files are now in `include`, source in `src`, the MATLAB API is in `matlab/` and the Java API is in `java/`.
- FEAST now compiles cleanly using `-std=c89 -Wall -Werror`.
- Supported Operating Systems: Linux, Mac OS X, Windows
- Data Formats: MATLAB
- Tags: MATLAB, Feature Selection, Feature Ranking, Mutual Information, JMLR
Other available revisions
Version 2.0.0 (January 8, 2017)
- Major refactoring of FEAST to improve speed and portability.
- FEAST now clones the input data if it's floating point and discretises it to unsigned ints once in a single pass. This improves the speed by about 30%.
- FEAST now has unsigned int entry points which avoid this discretisation and are much faster if the data is already categorical.
- Added weighted feature selection algorithms to FEAST which can be used for cost-sensitive feature selection.
- Added a Java API using JNI.
- FEAST now returns the internal score for each feature according to the criterion. Available in all three APIs.
- Rearranged the repository to make it easier to work with. Header files are now in `include`, source in `src`, the MATLAB API is in `matlab/` and the Java API is in `java/`.
- FEAST now compiles cleanly using `-std=c89 -Wall -Werror`.

Version 1.1.4 (March 12, 2016)
- Fixed an issue where zero MI values would cause a segfault.
- Fixes to documentation and comments.
- Updated internal version of MIToolbox.

Version 1.1.1 (June 30, 2014)
- Bug fixes to memory management.
- Compatibility changes for the PyFeast Python wrapper (note the C library now returns feature indices starting from 0; the MATLAB wrapper still returns indices starting from 1).
- Added a C version of MIM.
- Updated internal version of MIToolbox.

Version 1.00 (February 13, 2012)
- Initial announcement on mloss.org.
Comments
-
- Ryan Brauchler (on February 5, 2014, 22:57:35)
- First off, I'm a huge fan of this toolbox, and generally, it works great. I am using it to select features from an extremely high dimensional data set (~1,000 x 500,000). To do so, I placed this function into a loop where I break the initial matrix (which is held in a memmapfile to save memory) into N smaller bits and then feed each into FEAST and return the columns of the chosen features. The output of this loop is simply a 1xN cell array where each cell in the array is a list of the columns chosen by the algorithm. I am just using the mrmr algorithm in the toolbox. After a few iterations of this loop, however, I always get an out of memory error from Matlab. The toolbox works great in isolation, but when you start using it multiple times in a loop, it seems there is a memory leak in the MEX file for mrmr, and I would imagine it's also in the others. Just wanted to bring the issue to your attention. I am going to focus on fixing it myself, but figured it might be faster to put it in the hands of the experts who wrote it. Thanks!
-
- Adam Pocock (on February 6, 2014, 02:40:32)
- Hi Ryan, Do you know if the error occurs when using FEAST from C/C++? I've had issues with Matlab before where it doesn't properly free memory allocated in MEX files (even though mxFree was called) which eventually leads to memory fragmentation and then subsequent MEX calls fail. What version of Matlab are you using, and how much RAM is there in the machine? Adam
-
- Ryan Brauchler (on February 6, 2014, 04:16:54)
- Hi Adam. Thank you so much for the prompt response. I have yet to try running the code outside Matlab, though I could attempt to do so tomorrow. It very well could be the case that it's a Matlab fragmentation error. If that's in fact the case is there a workaround that overcomes this issue? I'm using Matlab R2013a with 32GB of RAM in the machine. When the array is broken up into smaller pieces, each piece is less than 3MB in size, so that shouldn't end up being an issue of physical memory limitations. Thanks, Ryan
-
- Adam Pocock (on February 6, 2014, 05:22:11)
- Hi Ryan, One test you could do is see if Feast has the same issue (in Matlab) if you just rerun it multiple times on the same partition of the dataset. That way Matlab shouldn't do any more allocation beyond what is necessary to store the features (though it will still fragment the RAM a bit). We tested it fairly heavily by running it repeatedly across multiple datasets in the same Matlab instance, but we didn't have anything quite as large as the one you are using. Drop me an email at apocock at cs.man.ac.uk and we can look at this in a little more detail. Thanks, Adam
-
- YS L (on July 11, 2014, 03:27:02)
- Hi Admin, I just found this toolbox and am going to use it for my project. However, I ran into a memory problem. I use the function like "[selectedFeatures] = feast('mrmr',200,X,Y)", where X is a (# sample) × (# feature) matrix and Y is a (# sample) × 1 binary vector (-1,1). When I call the feast function, MATLAB reports the error "Out of memory". I downloaded the newest version of the toolbox. My MATLAB version is R2012a and memory size is 32 GB. The data is quite small, around 200 × 300. Is there any suggestion to fix the problem? Thanks a lot!
-
- Adam Pocock (on July 11, 2014, 04:46:35)
- Hi, There are two preprocessing steps you should do before using the toolbox: (1) ensure the data has been discretised into bins, and (2) transform the data by relabelling it (if a feature has values {1,100,1000}, convert them to {1,2,3}; a sketch follows). This should fix any out of memory errors for small datasets. Adam
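A minimal MATLAB sketch of that relabelling step (assuming data is a samples-by-features matrix of already-discretised values):

% Remap each feature's values onto the contiguous integers 1..k,
% e.g. {1,100,1000} becomes {1,2,3}, using the third output of unique.
for f = 1:size(data,2)
    [~, ~, data(:,f)] = unique(data(:,f));
end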
-
- YS L (on July 11, 2014, 05:17:59)
- Hi Adam, That works! Thanks a lot
-
- A Smith (on August 5, 2014, 03:08:58)
- Adam, I'm having some trouble using FCBF. Inside the "SU" subfunction, there seem to be two functions h() and mi() that are undefined. Did I not install the Toolbox correctly?
-
- Adam Pocock (on August 5, 2014, 04:03:14)
- Looks like I forgot to add a bit to my setup instructions. FCBF uses the Matlab interface to MIToolbox (unlike the rest of FEAST which compiles in the C code), so you need to add the MIToolbox folder to your Matlab path. MIToolbox contains the h.m and mi.m files which FCBF uses. Adam
-
- ahmed hamdy (on September 22, 2014, 16:05:11)
- First of all, let me thank you for your perfect work. I'm intending to use your toolbox (FEAST). I'm using MATLAB 12 and I added feast and the MIToolbox to the MATLAB path. Then I tried index = feast('jmi',10,horse,Diagnosis), where horse is the feature matrix (299x26) and Diagnosis is the target concept (299x1), but I got the following error: ========================= Undefined function 'MIToolboxMex' for input argument of type 'double'. Error in feast (line 70): selectedFeatures = FSToolboxMex(4, numToSelect, data, labels); ========================= Please help me. Ahmed Hamdy.
-
- Adam Pocock (on September 22, 2014, 16:19:20)
- Hi Ahmed, You need to compile MIToolbox as well as FEAST. Run CompileMIToolbox.m in the MIToolbox directory. Adam
-
- ahmed hamdy (on September 22, 2014, 16:35:50)
- Thanks a lot sir Adam. When running CompileMIToolbox.m I got the following error: ========================== Error using mex (line 206) Unable to complete successfully. Error in CompileMIToolbox (line 3): mex MIToolboxMex.c MutualInformation.c Entropy.c CalculateProbability.c ArrayOperations.c
-
- Adam Pocock (on September 22, 2014, 16:42:30)
- Hi Ahmed, Drop me an email (apocock at cs dot man dot ac dot uk). It looks like mex isn't setup correctly, so could you send me your OS (plus if its 32 or 64 bit), matlab version, and C compiler version? Adam
-
- Anupam D (on October 7, 2014, 22:55:39)
- Hi Adam, I'm trying to use the FEAST toolbox for feature selection. I have a relatively small data set right now, but I'm getting the following error in MATLAB: "Out of Memory". I tried using both MATLAB R2014a and R2013. My data size is 130x50 (samples x features). The feature values are different floating point numbers ranging from -1 to 1000000. Are there any restrictions on the range of values that can be used to represent a feature?
-
- Adam Pocock (on October 8, 2014, 00:20:20)
- Hi Anupam, The short answer is yes, you should use a small range of values. FEAST uses the discrete mutual information and will round your floating point values to the lowest integer. It's better to do this yourself before using the tool as then you get more control over it. After you've done the discretisation you should map the discrete values into a smaller range (e.g. if a feature has value {1,100,1000} convert it to {1,2,3}). FEAST uses a relatively naive mapping function as it is written for maximum compatibility, so smaller values allocate less memory. The discrete mutual information doesn't take into account the value of a variable so this transformation doesn't change the result. You should discretise it into a small number of bins given you only have 130 samples, the mutual information behaves poorly when the number of states/bins is approximately the same as the number of samples. Adam
-
- Anupam D (on October 8, 2014, 00:49:41)
- Thanks Adam, for the explanation. Two more questions for you :) 1. Could you clarify what you are referring to by discretization? The 130 samples are from 13 classes. 2. So would normalizing the feature values to [0,1] work? I see it works (doesn't give the 'Out of memory' error) but what kind of impact does that have on the selection algorithm? -Anupam
-
- Adam Pocock (on October 8, 2014, 01:00:33)
- Hi Anupam, By discretisation I mean mapping the floating point values onto a small set of integer values. For example if you have a feature in the range [1, 100], then you could map all values in the range [1,10] to 1, [11,20] -> 2 ... [91,100] -> 10. I would not recommend normalizing the features to [0,1] as then there are only two states in the discrete mi calculation (0 and 1), which loses a lot of information. The choice of the number of bins is a trade off between the amount of information available for learning, and the estimation error in the mutual information. Small numbers of states have low amounts of information and low estimation error. Large numbers of states have high information and high estimation error. In my publications we usually used 10 bins, and that seemed to work well across a variety of datasets. If you had many more samples then you could try using more bins, but with 130 samples I wouldn't recommend more than 10. If you want to read about our experimental setup it's in the "Conditional Likelihood Maximisation: A Unifying..." paper in the bibtex on this page. Adam
-
- Anupam D (on October 8, 2014, 01:06:35)
- Great! Thanks for the quick response. -Anupam
-
- Guofa Li (on October 10, 2014, 05:23:47)
- I'd like to say that this is a great toolbox! Using the toolbox, I selected the best features to recognize the different categories. Here is a question: is there a way to know the correct categorization rate using the selected features?
-
- Guofa Li (on October 10, 2014, 05:37:48)
- I read the questions from Anupam and the answers from Adam. Here is another question. If I do not map the values into ranges (e.g. [1,10] to 1, [11,20] -> 2 ... [91,100] -> 10), what will happen? I ran the feast function and got the selected indices. But the values I used were all double numbers (e.g. 75.7702, 98.4044) and I got the selected indices successfully. Is there anything wrong with the results?
-
- Adam Pocock (on October 11, 2014, 00:24:00)
- Hi Guofa, I'm not quite sure what your first question is asking. Do you want to know the classification error rate using the selected features, or some other property? If you want to find the classification error rate then you will need to train a classifier on the training dataset using the selected features. FEAST uses filter algorithms which measure the quality of a feature set in a different way than the classification error. The feast function will run in many cases without discretisation, though the selected features might not have any meaning. There isn't any validation in the code to check to see if the mutual information values used are calculated accurately (and this is a difficult problem to detect in general). If you only have a small number of floating point values (i.e. many data points all had the value 75.7702) then the result will be sound, however in general it is better to discretise the values. If you don't remap the values into ranges then you are likely to run into the out of memory errors reported in the other comments, though the remapping doesn't affect the quality of the results. Basically, discretise your dataset into a smallish number of bins before running FEAST. Otherwise the answers that are generated are not reliable. Thanks, Adam
-
- Joshua Juen (on October 12, 2014, 06:33:33)
- Hi, I am having a problem getting the same results with the C (and Python bindings) as I am with the Matlab version of the library. I am using as input the test data set provided by PyFEAST (digit.txt). I ran jmi with the top 15 input vectors through the python test and got: [6.0, 11.0, 14.0, 19.0, 22.0, 27.0, 30.0, 35.0, 38.0, 43.0, 46.0, 51.0, 54.0, 59.0, 62.0] I also ran a test on the same data in the Matlab version running on Windows and got: 14, 21, 22, 27, 28, 30, 35, 37, 38, 43, 44, 45, 51, 59, 62 I was confused by the difference in output with the same input data, same method so I wrote a program in straight C to use the C library directly (No Python Bindings). When I call the C function jmi with the same input data I get: 27,19,35,11,51,59,14,22,38,54,62,46,43,30,6 This clearly matches the output from the python binding (when sorted) but not Matlab. To make sure I was not having a problem with Matlab on Windows, I also installed and compiled on Matlab running on Ubuntu. I ran with the same input data again and got: 14, 21, 22, 27, 28, 30, 35, 37, 38, 43, 44, 45, 51, 59, 62 Am I doing something wrong here? Why is the output different on my Matlab compilations than my C (and PyFeast) implementations? Thanks Josh
-
- Joshua Juen (on October 12, 2014, 16:28:53)
- Upon further investigation, it looks like the Matlab code feeds the data into jmi in column-major order and the C code examples (and python bindings) feed them in using row-major order. That seems to be the difference. Should jmi take the data in row-major or column-major order? Based on the code, it looks like it should be column-major order, but I wanted to confirm before moving ahead. In that case, the sample input files in pyFEAST are incorrect and I will correspond with the author. I will also fix my C implementation as well. Thanks Josh
-
- Adam Pocock (on October 12, 2014, 17:32:18)
- Hi Joshua, FEAST expects things in column major order. I haven't looked too closely at the python bindings, I had assumed they also worked in column major. One thing to be careful of is that the ordering of the feature indices has meaning. You should get exactly the same result in the same order out of FEAST no matter what interface you use. Thanks, Adam
-
- Adam Pocock (on October 12, 2014, 23:56:24)
- Hi Joshua, I had a look at PyFeast, and it's not the sample data file that is incorrect, it's just missing the order="F" flag when the numpy array is created (line 28). From there you get the same results out of MATLAB and Python (once you've subtracted 1 from the MATLAB results as it indexes from 1). I made an issue on their Github, and updated the documentation for FEAST on my Github to reflect that it requires column-major matrices. I think I must have removed the comments that noted this in an earlier pre-release version. Thanks, Adam
-
- Peter Maier (on July 1, 2015, 18:44:55)
- Hello everybody, I read all of the entries above and have to ask one simple question about the "out of memory" problem: I have a feature in the range [0,1000] and map it to [1,10] with 10 equally spaced bins. (1. 0-100; 2.101-200 ....) AND I have a feature in the range [-5, 5] -> ?: Do I also map it in the range [1,10] (1. -5 to -4 and so on...) ? Thank you very much! Peter
-
- Adam Pocock (on July 1, 2015, 19:07:38)
- Hi Peter, If you have a discrete feature in the range -5,5 then MIToolbox will automatically map it into the range 1,10 as part of the feature selection process (as it doesn't change the output). If it's a continuous feature you might want to discretise it yourself. For discrete mutual informations it's only the number of bins that can change the value, the id given to each bin is irrelevant (i.e. if you have a two valued feature {0,1} it will give an identical entropy & mutual information to the same feature using the values {10,100}). Thanks, Adam
-
- Peter Maier (on July 1, 2015, 19:14:08)
- Wow! Thanks for the fast answer. I implemented it already and it gives me a (looks like ;) ) reasonable result! Another question: If I choose for example "selectedIndices = feast('xxx',5,data,labels) %% selecting the top 5 features" -> is the first feature the most "independent or relevant", or is there any relevance in the output order? Furthermore: If I want to select "the" most relevant features, how can I get information about which algorithm to use? - Is it clever to just use the overlap of all feature selection algorithms? (As we come again to the question of the relevance of the output order.) - Is there some recommendation which algorithm works best with classifier X or with noisy data, or is it just trial and error? Thanks in advance! Peter
-
- Sarah Smith (on August 24, 2015, 14:13:39)
- Hi, I'm new to the FEAST toolbox and want to use it in my class project. I have a matrix with dimensions 13677*20000. 13677 is the number of data points and 20000 is the number of features. When I use the feast function in MATLAB, I get the error 'out of memory'. I do not have any idea how to discretize it into bins and how to use feast with bins. I really appreciate your help. Thanks so much. Sarah
-
- Adam Pocock (on August 24, 2015, 16:12:53)
- Hi Sarah, There are a couple of things: one, that dataset is actually quite large, so don't try to get a full ranking of all the features. FEAST uses a cache equal to the number of features times the number of selected features and that will be about 3GB on its own. There are multiple ways you could discretise the data, in our paper we used a very simple procedure which divided each feature into 10 bins of equal width. You can replicate this by finding the range of each feature (i.e. range = max(feature) - min(feature)) and then assigning everything below (min + range/10) to bin 0, everything in the range (min + range/10) to (min + 2*range/10) to bin 1 etc. Thanks, Adam PS @Peter, I missed your last comment, if you are still interested in the relative merits of different feature selection techniques then I recommend you read our paper accompanying FEAST (http://www.jmlr.org/papers/volume13/brown12a/brown12a.pdf). The short version is I'd recommend you use JMI, it has the best empirical performance in our study and has an acceptable set of theoretical tradeoffs.
-
- Sarah Smith (on August 24, 2015, 17:19:29)
- Thank you very much Adam. By using bins, should I use the feast function for each bin separately? Thanks for your help. Sarah
-
- Adam Pocock (on August 24, 2015, 17:29:59)
- Hi Sarah, You should create one new data matrix, where each of the features has been replaced with a binned version, and then use that matrix with a single feast call. Thanks, Adam
-
- Mayank Mahajan (on October 16, 2015, 06:33:33)
- Hi Adam, I'm trying to use FEAST for a machine learning research project, but I'm getting errors using both MATLAB and PyFeast saying that there is a segmentation fault. Do you know if there's a problem with the input data that is causing this, and what can be done to fix it? Thanks, Mayank
-
- Mayank Mahajan (on October 16, 2015, 06:35:13)
- Hey Adam, I should add that for PyFeast even the tests are all failing. So it's possible it may not be a data issue. Thanks, Mayank
-
- Adam Pocock (on October 16, 2015, 06:49:15)
- Hi Mayank, I wouldn't expect Matlab to throw a segmentation fault unless you have particularly strange data. Even then it should give an error message rather than taking down the program. What kind of data are you using? Is it categorical values encoded as integers or something else? Does the crash kill Matlab itself or does it just return with an error? Also, what platform are you running on (i.e. Matlab version, OS, 32 or 64 bit, compiler version). If the pyfeast tests are failing you should post an issue on their GitHub page. Thanks, Adam
-
- Mayank Mahajan (on October 16, 2015, 06:55:19)
- I'm using MATLAB R2015b on Mac OS X Yosemite 10.10.5 with XCode 6.4, 64 bit. When I run the function, MATLAB says it encountered an internal error and says it needs to be closed. It outputs a crash dump file, whose header I've added below:

------------------------------------------------------------------------
Segmentation violation detected at Thu Oct 15 19:45:52 2015
------------------------------------------------------------------------
Configuration:
Crash Decoding : Disabled
Crash Mode : continue (default)
Current Graphics Driver: Unknown hardware
Current Visual : Quartz
Default Encoding : ISO-8859-1
Host Name : dyn-160-39-233-199.dyn.columbia.edu
MATLAB Architecture : maci64
MATLAB Root : /Applications/MATLAB_R2015b.app
MATLAB Version : 8.6.0.267246 (R2015b)
OpenGL : hardware
Operating System : Darwin 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64
Processor ID : x86 Family 6 Model 69 Stepping 1, GenuineIntel
Virtual Machine : Java 1.7.0_75-b13 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
Window System : Quartz
Fault Count: 1
Abnormal termination: Segmentation violation

The data I'm using is a bunch of floats, and the labels are 0s and 1s. It's roughly 180000 rows and 175 columns.
-
- Asma Ahmad (on October 21, 2015, 04:36:37)
- Hi, I am trying to start working with your toolbox. My dataset has 438 rows and 12 columns and I want to select features (maybe 6) from the data. But when I try self = feast('condred',6,trainData1,trainLabel1), the result is: self = 5 1 2 3 4 6. And this came up instantly. I tried various combinations but got the same results. Both MIToolbox and FEAST are under the same directory and I did run "compileFEAST.m" before running the example. Is there anything I am missing? There are no errors, but the results seem to be very inaccurate. Please help me with this. Thanks, Asma
-
- Adam Pocock (on October 21, 2015, 23:43:12)
- Hi Mayank, Could you send me an email and we'll fix this offline. My address is further up this thread. Is the label stored as a double or some other Matlab type? Thanks, Adam
-
- Adam Pocock (on October 21, 2015, 23:47:58)
- Hi Asma, I wouldn't recommend you use condred, it's a poorly performing measure we implemented to show a theoretical point. You should try jmi or mrmr instead. If these still give similar results then you should check that the features are properly discretised as this behaviour implies all features have the same score, which could occur when each feature has maximum entropy (i.e. each value is unique). In this case FEAST defaults to returning whatever ranking it had, with an ordered list of the remaining features appended to it. I would expect FEAST to return results very quickly for such a small dataset, so the runtime isn't an issue. Thanks, Adam
-
- Rodrigo Ceballos (on March 12, 2016, 12:34:43)
- Hi Adam and Mayank, I'm having the same segmentation violation that Mayank reported, which causes MATLAB to quit. I've tried it on both Windows 7 64 bit with 64 bit MATLAB 2015a, and Mac OS X 10.11.1 with MATLAB 2014b, with identical results. Could you mention what the solution was to this issue? I've just installed the toolbox and wanted to test if it runs first. This is my test:

%Random 100 trials with 10 features
data = rand(100,10)
%Random 100 labels
labels = round(rand(100,1))
%Test MRMR
selectedIndices = feast('mrmr',5,data,labels)

Maybe this is not a valid test to run? Either way, it shouldn't crash MATLAB the way it currently does. Thanks for any help, Rodrigo
-
- Rodrigo Ceballos (on March 12, 2016, 13:24:43)
- Sorry, the code and comments got messed up above, here it is: data = rand(100,10); labels = round(rand(100,1)); selectedIndices = feast('mrmr',5,data,labels);
-
- Adam Pocock (on March 12, 2016, 17:01:00)
- Hi Rodrigo, I never got a follow up email from Mayank so I couldn't figure out what was causing that segfault. In your case it's because you are feeding feast continuous data when you should be feeding it discretised data. If you use randi instead of rand then it will work. I thought I'd stamped out all the places where this issue makes Matlab segfault, so I'll take a look and fix it. I think the issue in this case is that it's coming back with a zero vector after the rounding, as if you do "data=rand(100,10)*2;" that doesn't crash it. I'll post an update once I've fixed the bug (though it's probably in MIToolbox rather than FEAST itself). Thanks, Adam
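For reference, a version of that test with categorical inputs, following the randi suggestion above (the bin counts are arbitrary):

data = randi(10,100,10);   % 100 samples, 10 features, integer values in {1,...,10}
labels = randi(2,100,1);   % binary labels in {1,2}
selectedIndices = feast('mrmr',5,data,labels)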
-
- Adam Pocock (on March 12, 2016, 18:38:49)
- Hi Rodrigo, I've fixed the issue in FEAST v1.1.4. If all the features had zero MI with the target then the code that found the maximum feature would store its default value of -1 as the first selected feature. So depending on the memory layout it would try and read outside allocated memory, causing the segmentation fault. However in this situation algorithms like mRMR will return the list of feature indices in order, as there is no way to select features, and propagating the error out of Mex is difficult. Thanks, Adam
-
- Rodrigo Ceballos (on March 31, 2016, 13:59:43)
- Hi Adam, Thank you for clarifying the need to use discrete data and fixing the MI = 0 bug so fast! We have also found another potential bug that you might want to take a look at, or advertise the solution we found. While working with large loops that used the mrmr algorithm repeatedly we found that the memory leak issues only happen when using labels of type logical. For us, merely casting these into doubles fixed the memory leak issues some people above have mentioned. Thanks for all the support and a great library! Rodrigo
-
- Adam Pocock (on April 1, 2016, 01:17:15)
- Hi Rodrigo, You should only use double inputs for FEAST, its behaviour is undefined if you use any other kind (as the C code underneath expects everything to be doubles). I'm surprised it ran at all when you were using logical labels. I've added a quick check to feast.m on Github which throws an error if you supply it the wrong kind of input (I thought I'd done this a while ago, but it appears not). I don't have time to update the version on MLOSS at the moment, but you should just be able to pull the git version, or copy the 3 lines into your copy of feast.m. Thanks, Adam
-
- Alejandro Blanco (on April 7, 2016, 13:52:15)
- Hello Adam, First of all, thanks so much for this fantastic library. Some people in my doctoral school asked me if I could modify your code. They want to modify it (mainly mim) in order to start from a pre-selected set, as I will provide a starting set of variables selected by an exhaustive, closed-loop procedure (select -> classify -> evaluate the results) that I have already implemented. Could you give me some hints about where I have to modify your code in order to supply a preselected set? I'm a senior programmer in many languages, so you can reply as technically as you like. I will be so grateful. Greetings, Alex.
-
- Alejandro Blanco (on April 7, 2016, 13:57:49)
- Hello Again, The algorithm actually is CMI not MIM, sorry for my mistake. Greetings, Alex.
-
- Adam Pocock (on April 7, 2016, 15:33:26)
- Hi Alejandro, I wouldn't recommend using CMI (implemented in the CondMI.c file) with a large number of selected features as it has estimation issues after a certain point (unless you have exponential amounts of data). There are two ways of modifying the code to supply a preselected feature set, one easy to implement but slow, and one harder to implement but faster. The first option is to start from the example Matlab code in MIToolbox and use that as a basis for modification. Looking at the CMIM.m or IAMB.m (the first stage of IAMB is equivalent to CMI) files, they call the C mutual information functions from Matlab and require about 3 lines of modification to add this functionality. Basically preloading things into answerFeatures for CMIM or cmb for IAMB will have the right effect (CMIM also requires correct calculation of the partialScores vector). The second option requires modifying the feast.m file that does the argument processing and passes things through to FSToolboxMex.c. FSToolboxMex.c needs modification to pass through the extra argument to whichever feature selection technique you choose to modify. Then the FS method itself needs modification to add an extra argument to the function. This needs to be reflected in the FSAlgorithms.h header file, and preferably you should add a method with the original signature that calls the updated method with an empty selected features list. Finally you'll need to actually add the code that incorporates the pre-selected features into the method. We should move this discussion to the [FEAST Github page](https://github.com/Craigacp/FEAST) so we can reference lines of code properly. Thanks, Adam
-
- Alejandro Blanco (on April 7, 2016, 16:14:58)
- Hello Adam, Thanks for your prompt reply. I was looking carefully at your code and I think that I have a first approach. OK, I will continue on git. Thank you!
-
- Ziba Gandomkar (on April 8, 2016, 09:03:42)
- Hi Adam, First of all, many thanks for this great library. It helped me a lot! I'm facing a problem similar to the one asked by Ryan Brauchler (his first comment was posted on February 5, 2014, 22:57:35). I have a high dimensional feature set (500 samples * 100,000 features). So I placed the feast function inside a loop and each time a subset of features was fed into the function, but MATLAB crashes after a few iterations. I'll appreciate it a lot if you could kindly advise whether there is a way to deal with this problem. Best regards, Ziba
-
- Adam Pocock (on April 9, 2016, 21:30:47)
- Hi Ziba, First thing to check is that you are using the latest version of FEAST from GitHub. Second, does it crash if you just feed in that subset of features outside of a loop? FEAST is sensitive to the values of the inputs, and only works on categorical (integer) values which should be supplied as a double matrix. The main thing to try is to map each feature to a small number of contiguous integers starting from 0. If one subset of features has a large range of values (i.e. {1,100000}, instead of {1,2}) then it will allocate a lot of memory which can cause issues. Third, FEAST caches the mutual information values, so its memory usage is a linear function of the number of features you want to select and the number of features in the dataset (within a single call to FEAST). It frees all the memory after the call returns so running it in a loop shouldn't cause any trouble. Thanks, Adam
-
- Ziba Gandomkar (on April 12, 2016, 02:22:43)
- Dear Adam, Thank you very much for your response. The second point you've mentioned helped me to solve the problem. Thanks a lot again, Ziba
-
- Ganesh Singadkar (on May 22, 2016, 07:15:55)
- Hi, I am using MATLAB 2014a on Windows 7, 64 bit. My feature vector is of size (1484x13) of type double and the labels of size (1484x1) of type double. But when I try to use the feature selection algorithm selectedIndices = feast('jmi',5,data,labels), I get the error: Undefined function 'FSToolboxMex' for input arguments of type 'double'. Error in feast (line 64): selectedFeatures = FSToolboxMex(1,numToSelect,data,labels,beta); I have followed the correct procedure. Initially I ran CompileMIToolbox.m. Please help me to solve this problem.
-
- Ganesh Singadkar (on May 22, 2016, 07:51:11)
- Hi Adam, First of all, thank you so much for this toolbox. I have fixed the issue which I mentioned in my last comment. Thanks a lot.
-
- Ganesh Singadkar (on May 30, 2016, 17:57:57)
- Hi Adam, I am using your feature selection toolbox in MATLAB. My feature vector is of size (2000x26). I have normalized the feature vector by applying vector normalization. But after applying the feature selection method to the normalized feature vector to select the 10 best features out of 26, it gives the indices of the first 10 features only (like 1,2,3,4,...,10). I have no idea why it's giving output like this. Is this the correct procedure or am I making a mistake? Please help me to solve this problem. Thanks in advance
-
- Adam Pocock (on May 30, 2016, 18:47:02)
- Hi Ganesh, If by vector normalization you mean you've made each feature vector unit length then you're not using integer valued features which is necessary for FEAST. All inputs to FEAST should be integers, preferably from a small set of contiguous values. If your data is continuous you should convert it by binning it into 5 or 10 bins. Thanks, Adam
-
- Ganesh Singadkar (on May 30, 2016, 19:37:47)
- Hi Adam, Thanks for the prompt response. By applying the normalization, I want to adjust the (GLCM method) feature vector range to between 0 and 1. But the biggest problem is that my original feature vector ranges from -0.5 to 2000, and I don't know how to adjust this range to between 0 and 1 (is it necessary to adjust it to between 0 and 1?). I have tried applying different normalization methods, but it affects the result of the feature selection. Am I doing something wrong here? My feature vector range is vast; in that case how should I do the normalization and how should I apply the feature selection stage to it? Please give me some hint to solve this problem. Thanks in advance,
-
- Adam Pocock (on May 30, 2016, 19:45:28)
- Hi Ganesh, FEAST expects the inputs to be integers, if you only give it values between 0 and 1 then they will be converted into just "0" and "1" by rounding. I recommend you convert each value into the range [0,10) and then discretise the values. You should apply this transform to each feature independently. If you subtract the minimum value from each feature then divide by the input range and multiply by the output range then this will convert each value into the correct range, at which point you can call floor() on each value to discretise it. Different normalisations will change the output of the feature selection, this is expected. Thanks, Adam
-
- harsh joshi (on May 31, 2016, 07:36:54)
- Hi Adam, First of all thanks for the excellent toolbox. I have a feature vector of size (3000x40) in which the first 1500 rows are from class 1 and the next 1500 from class 2. I want to train on this feature vector using an SVM classifier (I am planning to use the LIBSVM toolbox from https://www.csie.ntu.edu.tw/~cjlin/libsvm/). I would like to know whether I can use a feature selection stage before training. Can I apply the feature selection directly, or do I have to follow some procedure on the feature vector before the feature selection? Please tell me what steps I should follow before feature selection. Thanks Harsh
-
- Adam Pocock (on June 5, 2016, 19:44:15)
- Hi Harsh, If your data is discrete you can directly apply FEAST, otherwise you will need to discretise it first. You don't have to use the discretised data in LibSVM, just for the feature selection. Thanks, Adam
-
- elina shakier (on September 22, 2016, 23:02:02)
- Hi Adam, Which function takes into account the correlation between features when identifying the most significant one? Can you please advise about a discretisation method for the below range of feature values?

1 0 -0.1048 3.8500e 21 0 0 2 2 0 0 0
2 0 -0.0298 0.0552 6 0 0 0 0 1 1 0
1 0 0.0570 0.0173 6 1 1 0 0 0 0 0
2 0 0.0478 0.0872 18 0 0 1 1 0 0 0
1 0 0.0347 0.0925 21 0 0 1 1 0 0 1
2 0 0.0326 -0.0257 27 1 1 0 0 0 0 0
2 0 0.1134 0.0428 8 1 1 1 0.99 0 0 0

Thanks Elina
-
- Adam Pocock (on September 23, 2016, 02:08:16)
- Hi Elina, Methods such as CMIM, JMI and mRMR explicitly take into account the correlations between features. CMIM and mRMR will reduce a feature's score if it is correlated with other already selected features, and JMI will both reduce its score if it is correlated, and increase it if it combines well with already selected features. You should apply an independent discretisation to each feature, it doesn't need to be done with fixed parameters for a whole dataset. As I've mentioned before we've found 5 or 10 equal width bins to work pretty well for hundreds of datapoints, and you could use a few more bins if you have more data. Thanks, Adam
-
- elina shakier (on September 23, 2016, 07:45:56)
- Hi Adam, Thank you for the fast reply! Do these methods take into account linear and non-linear correlation between the features? What should I do with confounding features when applying these methods? Do you recommend feast or MIToolbox for features that have non-linear and linear relationships? If I would like to know more details about these methods, do you recommend your journal paper? Please let me know if you know any other references too. Thanks
-
- Adam Pocock (on September 23, 2016, 15:14:33)
- Hi Elina, All the mutual information based methods (everything except relief) are non-linear correlation methods. They all detect non-linear correlations between the features and the class, but only methods which use the conditional mutual information take into account non-linear correlations between groups of features and the class. The journal paper has a much fuller description of all the techniques and when they are appropriate. Confounding features are hard to detect with these methods, but JMI or CMIM is the best bet. CMIM is pretty similar to IAMB which is a proper structure learning algorithm which copes with confounding variables better (though it requires much more data). FEAST is a feature selection package which uses MIToolbox to provide mutual information functions, MIToolbox itself isn't a feature selection package. Thanks, Adam
-
- elina shakier (on October 6, 2016, 19:45:10)
- Hi Adam, Thank you for the previous reply. I am wondering why I get two highly correlated features as the top ones using mRMR and JMI, can you please advise? How can I check the stability of these algorithms? Also, can feast manage a database with 8967 features and 13674 examples on the server?
-
- Adam Pocock (on October 6, 2016, 20:20:34)
- Hi Elina, I'm not sure what you mean by correlated features. Are you measuring the correlation using mutual information, Pearson's linear correlation coefficient or something else? JMI may return correlated features, but only if they are predictive and have good class conditional correlations (which are beneficial to non-linear classification algorithms). For stability there are the references in the journal paper, or some newer work by one of my advisor's other students, [Sarah Nogueira](http://www.cs.man.ac.uk/~nogueirs/index.html). I believe Sarah released Matlab code with her paper. FEAST should cope with a dataset that large if you don't ask for a full ranking of all the features. If you do then it will require a few more gigs of RAM but should execute fine (though it will take a lot longer). The first thing to try is to get the top 5 features using a given method, that should execute fairly quickly and will tell you if FEAST managed to process the data properly. Thanks, Adam
-
- elina shakier (on October 7, 2016, 12:24:53)
- Hi Adam, Thank you. There is pairwise non-linear and linear correlation between the features (we have both). I am applying these algorithms for a practical application. The aim is to reduce the dependency between the features as much as we can for identifying the difference between two problems. I have tried JMI, CMIM and mRMR. I am not sure, maybe I should only use CMIM? Should I? I have a problem with interpreting the results! For JMI, some of the top features are highly correlated; how should I deal with this, as I should report the best top features that don't give the same information? Also, if we want to use a wrapper method, does correlation between the top features cause a problem there too? Thanks
-
- Adam Pocock (on October 12, 2016, 03:16:57)
- Hi Elina, If you really require no correlations between the top features then I'd use CMIM. JMI may return correlated feature sets, but in some cases these correlations are beneficial to the classification process (as class conditional correlations add information rather than reduce it). Wrappers are prone to the same problems as filters with respect to correlations, but in this case you don't get any values that describe the correlations. It's also much more important to get a good search algorithm than it is with a filter as the wrapper procedure is more expensive. Thanks, Adam
-
- elina shakier (on October 12, 2016, 16:37:30)
- Thanks Adam, If I use CMIM, how should I compare the difference between the top features? For example, how much more important is feature 1 than feature 2? Regards
-
- Adam Pocock (on October 12, 2016, 17:14:54)
- Hi Elina, Unfortunately FEAST doesn't return its internal scores for features, so you'd have to calculate this after the fact. You can directly calculate the CMIM score for each feature: for the first selected feature it's I(F_1;Y), and for each later feature F_t it's min(I(F_t;Y), min_{i<t} I(F_t;Y|F_i)), as sketched below. There is some example code in MIToolbox that runs CMIM in Matlab which you could modify to produce this output, but it's not as well tested as FEAST is, so it might do odd things. Thanks, Adam
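A sketch of that calculation in MATLAB, assuming MIToolbox's mi.m and cmi.m are on the path and that cmi(X,Y,Z) computes I(X;Y|Z) (selected is the index list returned by feast; data and labels as elsewhere in this thread):

scores = zeros(length(selected),1);
scores(1) = mi(data(:,selected(1)), labels);   % first feature: I(F_1;Y)
for t = 2:length(selected)
    s = mi(data(:,selected(t)), labels);       % I(F_t;Y)
    for i = 1:(t-1)                            % minimum over already selected features
        s = min(s, cmi(data(:,selected(t)), labels, data(:,selected(i))));
    end
    scores(t) = s;
end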
-
- elina shakier (on October 20, 2016, 18:01:21)
- Hi Adam, Thank you. I really need to get the score for each feature as I need to know the difference between them. When you do feature ranking you presumably get the ranks from the lowest to the highest value. I tried to modify the cmi.m in MIToolbox but I was not successful. Can you please help more? Thanks Elina
-
- Adam Pocock (on October 20, 2016, 19:03:57)
- Hi Elina, Make an issue on the [FEAST Github](https://github.com/Craigacp/FEAST) page, and we'll take the discussion there. It's easier to refer to the source code that way. Thanks, Adam
-
- elina shakier (on October 20, 2016, 21:14:52)
- Hi Adam, Thanks, I created an issue: https://github.com/Craigacp/FEAST/issues/6 Elina
-
- elina shakier (on November 7, 2016, 13:28:50)
- Hi Adam, To use feast for feature selection, I split the data into training and testing parts. I have a large dataset (3000 examples); do you suggest cross validation or the bootstrap? If the bootstrap, how many observations should be resampled? Thanks
-
- Adam Pocock (on November 7, 2016, 15:36:40)
- Hi Elina, I'd use 10 fold cross validation rather than a bootstrap, but that's my personal preference, either method should work fine. Thanks, Adam
-
- elina shakier (on February 8, 2017, 10:34:21)
- Hi Adam, Is there any relationship between the 'feast' algorithms and Random Forest (TreeBagger in MATLAB)? Can we get the same top features? Thanks Elina
-
- Adam Pocock (on February 9, 2017, 02:16:14)
- Hi Elina, The features generated by an RF and by FEAST will usually be different. It's possible they could be similar if there is a small set of relevant features and you ignore the rank order, but usually you'll get quite different answers. Thanks, Adam
-
- elina shakier (on February 13, 2017, 16:50:03)
- Thanks Adam. Does it make sense to do feature selection (feast) before random forest? We can use random forest for both feature selection and prediction. Thanks Elina
-
- Adam Pocock (on February 14, 2017, 02:46:49)
- Hi Elina, I probably wouldn't use feature selection before a random forest unless my data was extremely sparse. It shouldn't hurt performance if you do, but I tend to prefer gradient boosted trees as my classifier (and feature selection seems more likely to be useful there). Thanks, Adam
-
- elina shakier (on February 15, 2017, 16:22:35)
- Thanks Adam, 1. How can I check if the data is sparse? Can you please clarify the second sentence and also the performance part? As far as I am aware Random Forest is a bagging model; do you mean that if I use boosted trees (AdaBoost), feature selection (feast) is useful to do before that? 2. If we do n-fold cross validation, as the top features vary in each fold, does it make sense to compute the mutual information score for each feature in each fold and take the average of the scores over all n folds? So the feature that has the highest average score is the best one? 3. Are there any advantages to using a wrapper method after mutual information techniques? Do you know any techniques to do feature selection for batches of features? Thanks Elina
-
- Adam Pocock (on February 15, 2017, 17:50:46)
- Hi Elina, If you don't know your data is sparse, it probably isn't. Feature selection before a random forest reduces the amount of variability in the data, and high variability is part of the way that a random forest reduces variance (and thus improves classification performance). Boosted trees reduce bias so are less reliant upon having variability. The two statements before are pretty crude exaggerations of why ensemble methods work, I recommend you read the ensemble methods chapters of [The Elements of Statistical Learning](https://statweb.stanford.edu/~tibs/ElemStatLearn/) for a proper explanation. I wouldn't recommend averaging the mutual information scores across folds when using more complicated techniques than MIM. Doing that will tie together the folds and leak information between them, which might taint any stats tests you do on the cross validated output. There are definite advantages for using wrappers after mutual information techniques. MI based filters are good for describing datasets and finding generally useful sets of features. To optimise performance with a specific classifier then a wrapper based on that classifier will always find a better feature set as it's optimising that classifier's loss function. For batches of features you could look at things like the group lasso, but I don't know too much about those techniques. Thanks, Adam
-
- elina shakier (on February 15, 2017, 23:42:16)
- Thanks Adam for the good explanation. I am wondering how I can deal with top-feature variability in cross validation when I am applying JMI, mRMR, and CMIM. The top features vary in each fold! Do you have any suggestions if averaging is not a good idea? In addition, is applying a wrapper to an ensemble method the right thing to do? Thanks Elina
-
- nick joe (on March 12, 2017, 18:25:19)
- Hi Adam, How does FEAST deal with imbalanced data? Thanks
-
- Adam Pocock (on March 13, 2017, 01:15:57)
- Hi, FEAST doesn't do anything special for imbalanced data. The mutual information that underlies FEAST should take into account all the labels. If your dataset is really unbalanced try using the weighted feature selection algorithms and put a high weight on the minority class. Thanks, Adam
-
- Ilkka Rautiainen (on July 19, 2017, 15:24:51)
- Hi Adam, I'm trying to get some basic understanding of the feature scores provided by FEAST. These are the issues I'm currently having trouble with: 1) The top feature seems to be the same (at least with the two datasets I've experimented with and their subsets) when using different methods (CMIM, JMI, DISR). This alone isn't that suspicious, and I'm more concerned about the following points. 2) The top feature score is exactly the same with the three methods mentioned above. This seems to be true even when different subsets are selected from the datasets. Is it a coincidence that the top score always seems to be exactly the same for different algorithms? 3) With DISR the score of the first feature (0.5849) doesn't seem to fit the pattern of increasing or decreasing scores [0.5849;0.1548;0.2972;0.4586;0.6215]. How should this top score be interpreted? I'm seeing this behavior with the real dataset I'm working on and with a basic MATLAB 'ovariancancer' sample dataset. I'm using FEAST 2.0.0 and MIToolbox 3.0.1 with MATLAB R2016b on a Windows 10 machine. An overview of the FEAST feature data for the 'ovariancancer' dataset is as follows:

FEAST method, Selected indices, Feature scores
'cmim', [2814;3080;950;2337;2731], [0.5849;0.1765;0.1701;0.1624;0.1570]
'jmi', [2814;2735;2644;2336;2731], [0.5849;0.7782;1.4733;2.0527;2.7469]
'disr', [2814;679;3041;3032;3040], [0.5849;0.1548;0.2972;0.4586;0.6215]

This data can be reproduced with the following MATLAB code:

%Example code using MATLAB sample data 'ovariancancer':
load ovariancancer.mat;
labels = grp2idx(grp);
feast_methods = {'cmim', 'jmi', 'disr'};
obs_discretized = zeros(size(obs,1), size(obs,2));
for ii=1:size(obs,2)
    [Y,~] = discretize(obs(:,ii), 10); %Discretize data to 10 uniform bins
    obs_discretized(:,ii) = Y;
end
selections = struct('feast_method', 'NaN', 'indices', {'NaN'}, 'scores', {'NaN'});
for ii=1:size(feast_methods, 2)
    [selected_indices, feature_scores] = feast(feast_methods{ii}, 5, obs_discretized, labels); %Get 5 features
    selections(ii).feast_method = feast_methods{ii};
    selections(ii).indices = selected_indices;
    selections(ii).scores = feature_scores;
end

Thanks!
-
- Adam Pocock (on July 23, 2017, 00:00:29)
- Hi Ilkka, This is expected behaviour. All the feature selection algorithms in FEAST use the mutual information to score the first feature, and they select the one that scores the highest. After the first feature they diverge by using the specific scoring criterion given in that algorithm. With DISR the score is the unnormalised mutual information, and with each feature after that it's the score normalised by the joint entropy (which is probably going to shrink the score). Have a look through the associated journal paper, it goes through the differences in some detail, and should give you some intuition for what each criterion will do. Thanks, Adam
-
- Sarah Khan (on September 6, 2017, 10:45:16)
- Hi, I compiled MIToolbox and it worked just fine; however, when I try to compile FEAST, I get the following errors:

Error using mex
C:\Users\src\BetaGamma.c not found; check that you are in the correct current folder, and check the spelling of 'C:\Users\src\BetaGamma.c'.
Error in CompileFEAST (line 4)
mex -I../../MIToolbox/include -I../include FSToolboxMex.c ../src/BetaGamma.c ../src/CMIM.c ../src/CondMI.c ../src/DISR.c ../src/ICAP.c ../src/JMI.c ../src/mRMR_D.c ../../MIToolbox/src/MutualInformation.c ../../MIToolbox/src/Entropy.c ../../MIToolbox/src/CalculateProbability.c ../../MIToolbox/src/ArrayOperations.c

Any help would be appreciated.
-
- Adam Pocock (on September 6, 2017, 21:17:02)
- Hi Sarah, It looks like you're running CompileFEAST from your username directory on Windows, rather than from the FEAST/matlab directory. Could you check where you are running the script from? Thanks, Adam