OpenFst Forum

You need to be a registered user to participate in the discussions.
Log In or Register

You can start a new discussion here:

help You can use the formatting commands describes in TextFormattingRules in your comment.
tip If you want to post some code, surround it with <verbatim> and </verbatim> tags.
warning Auto-linking of WikiWords is now disabled in comments, so you can type VectorFst and it won't result in a broken link.
warning You now need to use <br> to force new lines in your comment (unless inside verbatim tags). However, a blank line will automatically create a new paragraph.

Subject
Comment
Log In

Adding other properties to arcs

AlexeiIvanov - 22 Nov 2009 - 17:00

I wonder what is the best way to augment arcs with properties other then input/output symbols and weights? E.g. in ASR lattice re-scoring sometimes it is beneficial to keep the time stamps of elementary acoustic events and propagate them to the final FST.

Thank you.

 
Log In

k-closed semirings

RolandSchwarz - 20 Nov 2009 - 08:28

I probably did not totally understand the Mohri papers, but what I got was that the single source shortest distance algorithm only works for locally closed semirings, where in the general case the Floyd-Warshall or Gauss-Jordan all-pairs shortest distance has to be applied. Now the code documentation of the openFST library tells me that the implementation there is the single source one according to Mohri 2002 and that it only works for k-closed semirings. Now here's what I think that means: (i) the code documentation is not totally precise, in that the algorithm also works for locally closed, not just k-closed semirings (ii) it therefore works on the tropical (which is 1-closed) and the log semiring with positive weights (which corresponds to the real semiring with 0 <= weights <=1) up to a certain accuracy (iii) if that is true, it does not work on the log semiring with positive and negative weights. It does not give an error though, and it seems to compute sensible distnances. What happens in that case? Is there an internal check for closedness? Does it fall back to an all-pairs implementation of the Floyd-Warshall-whatever variant?

Or alternatively, am I totally mistaken? Can someone shed some light on my confusion?

Cheers,

Roland

 
Log In

ReWeight? : Working Example

NewestUser - 08 Nov 2009 - 04:11

Would it be possible to get a practical working example of the ReWeight operation, along with an example of the 'potentials.txt' input? At present it is not exactly clear how this operation might be used, or what it might be useful for.

CyrilAllauzen - 09 Nov 2009 - 11:45

An example of potentials.txt input is the ouptut of fstshortestdistance.

fstshortestdistance a.fst > potentials.txt
fstreweight --to_initial a.fst potentials.txt > b.fst
This is equivalent to:
fstpush --push_weights --to_initial a.fst > b.fst
This is actually the way pushing is implemented in the library, shortest-distance followed by reweighting.

 
Log In

read function causes problems

MichaelBerger - 30 Oct 2009 - 16:38

Hi,

I am trying to read a fst into a simple C++ application, however, for some reason I am being told that certain symbols are undefined:

Ld build/Debug/fstprintallpaths normal x86_64
cd /.../fstprintallpaths
setenv MACOSX_DEPLOYMENT_TARGET 10.6
/Developer/usr/bin/g++-4.2 -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.6.sdk -L/Users/michael/code/fstprintallpaths/build/Debug -L/usr/local/lib -L/usr/local/lib/Cabal-1.6.0.2 -L/usr/local/lib/ImageMagick-6.4.1 -L/usr/local/lib/pkgconfig -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4 -L/usr/local/lib/ImageMagick-6.4.1/config -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16 -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/coders -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/filters -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Compat -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/PackageDescription -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language/Haskell -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/Build -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/GHC -L. -F/Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug -filelist /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/fstprintallpaths.build/Debug/fstprintallpaths.build/Objects-normal/x86_64/fstprintallpaths.LinkFileList -mmacosx-version-min=10.6 -o /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug/fstprintallpaths

Undefined symbols:
  "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      fst::FloatWeightTpl<float>::GetPrecisionString()  in fstprintallpaths.o
  "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from:
      fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in fstprintallpaths.o
ld: symbol(s) not found
collect2: ld returned 1 exit status



-----------------------------------------------------------------
fstprintallpaths.cpp:

#include "fstprintallpaths.h"

#include <iostream>
#include <vector>
#include <fst/fstlib.h>
#include <fst/fst-decl.h>
#include <fst/vector-fst.h>
#include <fst/weight.h>

using namespace fst;

int main (int argc, char * const argv[]) {
    // insert code here...
   std::cout << "Hello, World!\n";
   StdFst *input = StdFst::Read("blah.fst");
   //StdFst *model = StdFst::Read("blah.fst");
   //printAllStrings(&fst);
    return 0;
}

string itos(int i) { // convert int to string; from Bjarne Stroustrup's FAQ
   stringstream s;
   s << i;
   return s.str();
}

string vectorToString(vector<int> v) {
   if(v.size() == 0)
      return "<>";
   string result = "<" + itos(v[0]);
   for(int i = 1; i < v.size(); i++) {
      result + "," + itos(v[i]);
   }
   return result + ">";
}

void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost) {
   if(fst.Final(state) == TropicalWeight::Zero()) 
      cout << vectorToString(str) << " with cost " << (Times(cost,fst.Final(state))) << endl;
   for(ArcIterator <VectorFst <StdArc> > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
      StdArc arc = aiter.Value();
      str.push_back(arc.ilabel);
      printAllStringsHelper(fst, arc.nextstate, str, Times(cost, arc.weight.Value()));
      str.pop_back();
   }
}

// a bad idea if there are loops :)
void printAllStrings(VectorFst<StdArc> &fst) {
   vector<int> str(0);
   TropicalWeight tw(TropicalWeight::One());
   printAllStringsHelper(fst, 0, str, tw);
}

-----------------------------------------------------------------

fstprintallpaths.h

/*
 *  fstprintallpaths.h
 *  fstprintallpaths
 *
 */

#include <iostream>
#include <vector>
#include <fst/fstlib.h>
#include <fst/fst-decl.h>
#include <fst/vector-fst.h>
#include <fst/weight.h>

using namespace fst;

int main(int argc, char * const argv[]);

string itos(int i);

string vectorToString(vector<int> v);

void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost);

void printAllStrings(VectorFst<StdArc> &fst);

MichaelBerger - 30 Oct 2009 - 16:57

Trying it on the commandline gives the same problems:
$ g++ fstprintallpaths.cpp fstprintallpaths.h -I  /usr/local/include 
Undefined symbols:
  "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from:
      fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in ccspqZtg.o
  "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      fst::FloatWeightTpl<float>::GetPrecisionString()  in ccspqZtg.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

PaulDixon - 31 Oct 2009 - 08:19

Add a link switch -lfst for openfst to the g++ command
 
Log In

Determinization increases size of FST

MichaelBerger - 25 Oct 2009 - 14:36

Hi, I am trying to determinize a FST using fstdeterminize but for some reason the determinized fst is much larger than the original one:

$ fstinfo a_binary.fst
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       24
# of arcs                                         9095
initial state                                     0
# of final states                                 4
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              9095
# of accessible states                            24
# of coaccessible states                          24
# of connected states                             24
# of strongly conn components                     24
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                n
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n


$ fstinfo a_rmeps.fst
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       24
# of arcs                                         645
initial state                                     0
# of final states                                 4
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              645
# of accessible states                            24
# of coaccessible states                          24
# of connected states                             24
# of strongly conn components                     24
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                n
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n


$ fstinfo a_determinized.fst 
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       1026
# of arcs                                         22054
initial state                                     0
# of final states                                 906
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              22054
# of accessible states                            1026
# of coaccessible states                          1026
# of connected states                             1026
# of strongly conn components                     1026
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               y
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                y
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n

MichaelBerger - 25 Oct 2009 - 14:38

sorry for the cluttered command line output. The point I would like to get across is that the size of the original FST increases by 1000 states and the number of arcs also more than doubles when determinizing it.

Is this normal?

CyrilAllauzen - 02 Nov 2009 - 14:31

I fixed the formating of your original post. I don't see anything very unusual here. It is possible for determinization to significantly increase the size of the Fst.
 
Log In

ArcSort? error with Mutable FSTs

JonathanB - 16 Oct 2009 - 16:45

I am receiving the compile error "no matching function for call to 'ArcSort(fst::StdVectorFst&, fst::StdOLabelCompare)' " when I attempt to compose two mutable FSTs. Although an answer to an older post with the same error stated that making FSTs mutable would resolve this error, I receive the same error with mutable FSTs. I would really appreciate some help!

My code is of the form:

#include "fst/fstlib.h"

using namespace std; using namespace fst;

int main() { StdVectorFst? fst1; StdVectorFst? fst2; StdVectorFst? fst3;

ArcSort? (fst1, StdOLabelCompare? ()); fst3 = ComposeFst? (fst1, fst2);

return 0; }

I posted this question as part of an old topic, but I am reposting it as a new topic because I am not sure whether old topics are still reviewed.

Thank you very much, Jonathan

JonathanB - 16 Oct 2009 - 16:48

I apologize for the poor formatting. The code should read:

#include "fst/fstlib.h"

using namespace std;
using namespace fst;

int main()
{
StdVectorFst fst1;
StdVectorFst fst2;
StdVectorFst fst3;

ArcSort(fst1, StdOLabelCompare());
fst3 = ComposeFst(fst1, fst2);
return 0;
}

PaulDixon - 16 Oct 2009 - 20:27

Changing to the following compiles for me

 
ArcSort(&fst1, StdOLabelCompare());
fst3 = StdComposeFst(fst1, fst2);

 
Log In

do we have a transition that encodes a fail transition?

CamilleChen - 15 Oct 2009 - 15:37

Hi,

I know that an epsilon transition encodes an empty string. However, I'd like to compose two FSA. In case a label exists in one FSA, but not in another, do we have a fail transition, instead of the epsilon transition, can encode this?

Thanks very much!!

CyrilAllauzen - 15 Oct 2009 - 16:00

Hi Camille,

You want to use PhiMatcher (or maybe RhoMatcher, I'm not completely sure from your message). See the matcher documentation here.

Best, Cyril

CamilleChen - 15 Oct 2009 - 17:35

Hi Cyril, Yeah, these matchers are quite useful! Thanks!

Best, Camille

 
Log In

regular expression compiler

FrancoisBarthelemy - 15 Oct 2009 - 05:10

Is there an available regular expression compiler for OpenFst, namely a compiler which compiles a regular expression into an fst?

Thanks

 
Log In

Traverse the FST to check if it accepts a string or not and give appropriate output

VipulMittal - 15 Oct 2009 - 02:09

I have built an FST for a dictionary and I want to traverse it to check if the given word is a valid dictionary entry. If its a valid word, I'll output its feature values. Can someone please help me with how can I traverse a FST ? My Fst size is around 100MB.

Example FST for word : cat

0 1 c c 1 2 a a 2 3 t n,sg 3

VipulMittal - 15 Oct 2009 - 02:15

dont knw why the FST appeared like this. It is: 0 1 c c 1 2 a a 2 3 t n,sg 3

CyrilAllauzen - 15 Oct 2009 - 14:41

What you type is interpreted as twiki markup language. As mentioned above, you would have need to use the verbatim tags.

As for your original question, you want to use Composition.

 
Log In

compute weight automatically

CamilleChen - 14 Oct 2009 - 23:34

Hi,

I have another question:

if we know the total cost of path1 and we know the total cost of path 2 and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)

then, is there a way to automatically combine these two paths in a single fst, by keeping path1's cost as the 1st arc's cost, and subtracting the path1's total cost from path2's total cost, and assign it as the cost of the second arc?

thanks!

CamilleChen - 14 Oct 2009 - 23:36

sorry: let me reorganize the condition a little bit:

if we know the total cost of path1

and we know the total cost of path 2

and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)

CyrilAllauzen - 15 Oct 2009 - 16:02

Hi Camille,

This is still not very clear to me. Could you clarify further?

Best, Cyril

CamilleChen - 16 Oct 2009 - 19:18

Hi Cyril,

Now I see how to solve my question. Actually I did not know hot to indicate a path's weight when writing this path into a fst format. After I can do that, I can use the commands to do the calculation, and now, I know how to indicate a path's weight, so, the question is solved! smile

Thanks!

Camille

 
Log In

how to remove a weighted FSA's weights?

CamilleChen - 13 Oct 2009 - 21:50

Hi all, Suppose now I have a weighted FSA ready, and I'd like to rewrite it into another FSA by removing all the weights. Can anybody give me some suggestions? Thanks a lot!

RolandSchwarz - 14 Oct 2009 - 03:49

I don't know if there is a built-in function, but you could always try it with fstprint | cut -f 1-4 | fstcompile ... i.e. you print it, extract the fields which you want to keep and then recompile it or redirect it to a new text file.

RolandSchwarz - 14 Oct 2009 - 03:50

btw, I'm still looking for a solution to my problem below...

PaulDixon - 14 Oct 2009 - 19:00

fstmap with map_type set to rmweight can remove the arc and final weights

fstmap --map_type=rmweight

CamilleChen - 14 Oct 2009 - 23:27

both way works!! Thanks so much!!!

 
Log In

Performance issues?

RolandSchwarz - 09 Oct 2009 - 10:07

Hiya,

I'v been experimenting around with the openfst library lately and have the following problem: I'm trying to align sequences using a transducer (i.e. find the edit distance with specifics weights to indels and mismatches). The transducer has 4 states and some epsilon transitions to model indels (no double epsilon transitions tho). The sequences to be aligned are nucleotide sequences (alphabet of size 4) and about 1000 nucleotides long. Composition of two of these sequences with the alignment transducer (as in I o T o O, where I is the input and O the output sequence/acceptor) results in a 400mb fst file and takes approximately 3-4 minutes. Computation of the shortest path again another 2 minutes or so.

Now here's my question. It is somehow obvious, that the number of states explodes in composition (1000 * 4 * 1000 are at least 4 mill states), so maybe direct composition is not the way to go. But what can I honestly do to improve performance? I switched from the shell versions to the C library and tried lazy composition. Now composition is done in an instant, but the shortest path algorithm still seems to expand the whole transducer, as memory usage goes up to 500mb and it still takes about 4 minutes.

Is it simply such that the transducer approach to alignment is slow? Computing 200 of these pairwise alignments using traditional dynamic programming takes about 20 seconds. Or did I make a mistake somewhere? Does the shortest path algorithm neccessarily have to expand the full transducer, i.e. does lazy composition help at all in this case? I didn't even expect it to be competitive in terms of speed to dedicated tools, but I hoped it to be at least computationally feasible. I would really appreciate any kind of help.

Cheers,

Roland

CyrilAllauzen - 15 Oct 2009 - 15:51

Hi Roland,

The issue is indeed that computing the direct composition is not the way to go since it is rather expensive. Your idea of using lazy composition was good but as you noticed, ShortestPath expands the full machine by default. However, there is a way to avoid this behaviour: by using the shortest-first queue discipline and setting the option first_path to true.

ComposeFstOptions opts();
opts.gc_limit=0;  
// This disable the caching of the result of composition
// and should lower the memory usage at no computational cost since
// shortest-path should visit each state in the composed machine at most once.
ComposeFst<Arc> cfst(fst1, fst2, opt);

NaturalShortestFirstQueue<StateId, Weight> state_queue(distance);
ShortestPathOptions<Arc, NaturalShortestFirstQueue<StateId, Weight>,
AnyArcFilter<Arc> >
opts(&state_queue, AnyArcFilter<Arc>());
opts.first_path = true;
ShortestPath(cfst, ofst, &distance, opts);

This example assume composing two machines. In you case fst1 could be I o T (itself delayed or not, since it is a smaller machine both might be worth tryingboth) and fst2 could be O.

This should improve things a lot although it should still be slower than dedicated tools. I am very interested in this, so let me know how things turn out.

Best, Cyril

RolandSchwarz - 16 Oct 2009 - 05:56

Thanks alot Cyril, this is really working now. For some sequences it still takes 10 seconds or so, but memory consumption is low and most sequences are aligned within 1-2 seconds. Very nice. Does the NaturalShortestFirstQueue impose any restrictions on the type of semiring I chose? Or could I use the same strategy if I'm working on the log semiring and would be interested in the shortest distance?

All the best,

Roland

 
Log In

Label pushing - unexpected behavior?

JohnS - 02 Oct 2009 - 08:52

Hello, I have a couple of questions regarding label pushing in OpenFst.
It seems like a relatively common behavior of the operation Push(), in label pushing, is to generate resulting FSTs that are a lot larger than the original FST.
For fairly large FSTs, this (for me unexpected) growth in number of states and arcs, can be so significant that it makes any further processing impossible due to practical memory limitations.

I have attached the following example that produces the same behavior, using a very small FST.

Assume we have an FST A as depicted below:

By label pushing FST A (to final state), e.g. by the command
fstpush --push_labels --to_final A.fst PA.fst,
one might expect the following FST:

But instead the result FST PA.fst is generated, depicted below:


Question 1. In many cases, as is the example, there is an obvious way of performing pushing of the output labels to a final state, without adding any new states to the FST. Can you explain this growth behavior – under what circumstances it is triggered and, possibly, how it can be avoided?

When an FST that has been label pushed, is label pushed again, one might expect it to stabilize. However, Applying Push() to the already Pushed FST A generates the following output.

Question 2. Why is not the OpenFst implementation of label pushing idempotent, (e.g. like weight pushing is) so that the result of Push(A) is identical to result of Push(Push(A)) ?
Thanks!
-John

CyrilAllauzen - 15 Oct 2009 - 16:29

Hi John,

The reason for this is that label-pushing is implemented by considering the input transducer as a weighted automaton over the string semiring and applying weight-pushing (over the string semiring). The result is then the following weighted automaton over the string semiring:

label-pushing.jpg

The issue is how to convert that weighted automaton over the string into a transducer. Currently, we use the second step of the epsilon-normalization algorithm to do this. This is also what the cause the operation to be non idempotent.

One way to partially avoid the problem, would be to first reverse the transducer, apply label-pushing towards the initial state and reverse again:

fstreverse A.fst | fstpush --push_labels | fstreverse > PA.fst

This would result in:

reverse-label-pushing.jpg

The extra epsilon transitions are due to the use of reverse.

Best, Cyril

JohnS - 23 Oct 2009 - 05:48

Hi Cyril,

Thank you for taking your time to answer this question so thoroughly. I assume there exists, at least theoretically, a generic pushing algorithm to accomplish idempotent and “non-growing” label pushing. Do you agree on this? And if so, would you endorse such an algorithm to be included in any future version of OpenFST?

Best, John

 
Log In

Problem with Determinize and Minimize

VipulMittal - 01 Oct 2009 - 02:17

Hi, I want to build an FST for the words of a language where if the word is accepted, the output will be the features of the word. For this I'm using two FSTs - one for the prefix (root) part and other for the suffix - and I concatenate them to get the fina FST. Both the input files are in AT&T format.

But the problem is when I determinize it, it gives an error that the FST is non-functional i.e. for a particulat input there are more than one possible outputs, which will be there in this case.

So can someone please help me with this ??

Thank you, Vipul

DoganCan - 01 Oct 2009 - 19:55

Current version of OpenFst library does not support determinization of non-functional transducers. You can take the "encode -> determinize -> decode" path to determinize your final fst over (input label,output label) pairs. The result will be non-deterministic over input labels but may satisfy your current needs. Check the answer to the "det(L o G) becomes more complex than L o G" question below for a pointer to the encode/decode mechanism.

Best

Dogan Can

 
Log In

Constrains for the maximum size of the FSTs?

OanaNicolov - 30 Sep 2009 - 19:53

Hi,

Could you please tell me an estimate of the maximum size (in terms of #states and/or #transitions) of the FSTs, the openfst toolkit can work with (let's say on a machine with 8G of memory)? Alternatively, what was the size of your largest FST you worked with?

Thank you. Oana

DoganCan - 01 Oct 2009 - 20:14

Hi Oana,

OpenFst is a very efficient library which scales well to large problems. The answer to your question depends on the properties of the fsts and what you would like to do with them. For instance, you may run the shortest path and pruning algorithms over very large fsts (several millions of states and transitions) while you may not be able to determinize them or compose them with other large machines. What you can do also depends on whether your input fsts are acyclic, epsilon free etc. Using lazy implementations may help if your particular task does nor require complete fsts to be loaded onto the memory upfront. In short, (in my opinion) your best bet is to try and see.

Best,

Dogan

CyrilAllauzen - 15 Oct 2009 - 14:58

Hi Oana,

There are two main concrete class in the library ConstFst and VectorFst.

For ConstFst, the memory requirement is 20 bytes per state. and 16 bytes per arc. For VectorFst, the per-state memory requirement is about twice as much due to the overhead introduced by STL data structures.

Doing the math, you will see that you can easily represent machines with hundred of millions of states and transitions in a few GB of RAM.

If memory usage is critical, the CompactFst class can also be used to provide more efficient memory representations.

This addresses the memory requirement for representing an FST in memory. But, as Dogan pointed out, your memory usage will also depends on which algorithms you want to use.

Best, Cyril

 
Log In

RmEpsilon? , AutoQueue? and Queue Disciplines

NewestUser - 11 Sep 2009 - 11:51

Hi, I would like to know how 'reliable' the AutoQueue associated with RmEpsilon is; that is, how effective is this at selecting the optimal queue discipline given a particular input fst.

Also, are there any examples of employing specific queue disciplines with RmEpsilon? I've not been able to get this to work in practice.

NewestUser - 11 Sep 2009 - 23:59

Nevermind the second part of the above question, I was looking in the wrong place.
 
Log In

a stand in symbol for any symbol; printing out the strings for FSA

AnasElghafari - 28 Aug 2009 - 12:32

Hi there,

Thank you for this great library; I am using it for my B.A. thesis project in C.L. I am using it through the shell from within my Java code.

I have a bit of difficulty with a few things. This is one of them:

- Is there a compact way to specify "any other symbol". As in: in state 0 transduce "a" to "b" and transduce any other symbol to "c"?

The problem specifically is that I am creating acceptors where the symbols are words rather than letters, in some cases I might have 1000 words in my isymbols list. I want to build a rational kernel (transducers to count sequences) to work with these acceptors. At some states I need to be able to map every non-explicitly-mentioned input symbol to epsilon with weight 1. How can I specify that? Do I really need to add a thousand arcs for each state?

AnasElghafari - 28 Aug 2009 - 12:34

The second question (I included it in the title but forgot to include it in the body): is there a way to obtain/print out all the string accepted by some FSA ? (in the case where FSA accepts a finite set, of course)

CyrilAllauzen - 01 Sep 2009 - 11:46

For your first question, you can use RhoMatcher. See the advanced usage section.

For your second question, you will need to write your own function for this but it should be rather easy. Remember that the number of strings accepted by an acyclic automaton is in the worst case exponential in the size of the automaton.

Best, Cyril

 
Log In

det(L o G) becomes more complex than L o G

ZhijianOu - 25 Aug 2009 - 03:45

The experiment setup is as follows.
A Chinese lexicon L:
# of states                                       215371
# of arcs                                         303286

A Chinese bigram G:
# of states                                        87918
# of arcs                                       16822088

Run fstcompose to get
L o G 
# of states                                       391202
# of arcs                                       17125372

After determinization by running fstdeterminize,
det(L o G)
# of states                                      3479987
# of arcs                                       20071667
more complex !

I don't know why. I follows exactly the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008". The output word label in L is already placed on the initial transitions of the word.

The code is as follows:

fstarcsort.exe --sort_type=olabel  L.fst L.fst.sort
fstarcsort.exe --sort_type=ilabel  G.fst G.fst.sort
fstcompose.exe L.fst.sort G.fst.sort LG.fst
fstdeterminize.exe LG.fstb LG.fst.det
fstminimize.exe LG.fst.det LG.fst.min
Memory allocation failed

I run 64-bit openFST on windows 2003 server 64bit edition. The fstminimize.exe returns "Memory allocation failed" is a surprise. 64bit memory space is not sufficient?

Any suggestion that you could offer would be much appreciated.

Best, Zhijian

CyrilAllauzen - 30 Sep 2009 - 12:43

I'm not sure what the issue is. You might want to try minimizing the transducer as an acceptor: use fstencode to encode the labels, apply minimization and use fstencode to decode (see FST.EncodeDecodeDoc for more details).

Best, Cyril

 
Log In

How to deal with < / s > label in constructing ngram model

ZhijianOu - 23 Aug 2009 - 23:08

Dear all,

As mentioned in the paper "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003." , in constructing a ngram model, there are transitions labeled with the end symbol < / s >, that lead to the final state.

To compose L with G, I need to add a psuedo word as follows to the L.

r eh d #0 read
r eh d #1 read
...
#0 </s>
 

Am I correct?

(btw: if I replace the < / s > in G with < eps >, the G is not deterministic. This is not a good way. So I propose the above way.)

Thanks for your patience. Your help is greately appreciated.

Best, Zhijian

CyrilAllauzen - 01 Sep 2009 - 12:09

As far as I remember, we were using </s> in G and adding </s> as a word with empty pronunciation (or silence) in L. The GRM library also allowed replace </s> in G by a final weight.

Best, Cyril

 
Log In

Switching weight pushing in the log semiring and in the tropical semiring

ZhijianOu - 23 Aug 2009 - 22:51

dear all,

I built a transducer of a trigram, following the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008"

As we note, after we introduce the backoff state, the G is only an approximate representation, if there are such seen trigram that have lower probability than its backed-off bigram, i.e. p(z|x,y) < alph(x,y)*p(z|y). To my viewpoints, there are two solutions.

1) One is to splitting states before replacing failure-transition by eps-transition, as proposed in "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003."

2) The other, as I think, is to create a backoff model from an interpolated algorithm (e.g. modified kn smoothing). Then we have p(z|x,y) > alph(x,y)*p(z|y) for all trigrams. And if we further work in the tropical semiring, the representation will be exact.

So I chose to construct G in the tropical semiring. Now the question comes. As we know, weight pushing in the tropical semiring is inferior to pusing in the log semiring. I'm not sure whether the following procedure is correct, and if it's correct, how to do it? The semiring info is embeded in the arc type. If I use shell command, how can I change the arc type?

1) Switch from representing G (or L o G) in the tropical semiring to representing in the log semiring

2) run fstminimize to get min(L o G)

3) Swtich back to representing min(L o G) in the tropical semiring.

Thanks for your patience. Your help is greately appreciated.

Best, Zhijian

DoganCan - 24 Aug 2009 - 21:08

Current shell-level map command (fstmap) does not support mapping between std and log arcs. A simple workaround is to print the fst and compile the text representation with the appropriate arc type. On the other hand, C++ library includes mappers for conversion between std and log arcs. Following code might help:
VectorFst<StdArc> ifst;
VectorFst<StdArc> ofst;
VectorFst<LogArc> lfst;

Map(ifst,&lfst,StdToLogMapper());
Minimize(&lfst);
Map(lfst,&ofst,LogToStdMapper());

Best, Dogan

ZhijianOu - 25 Aug 2009 - 03:35

Hi, thanks to Dogan.

Sorry for the above lengthy question. To summarize, the main question is: when both lexicon L and grammar G are constructed in tropical semiring. Compared with directly run fstminimize.exe (i.e. in tropical semiring), is it better to switch to log semiring, run fstminimize.exe, and then switch back ?

Best, Zhijian

 
Log In

Compiling openfst-1.1 on cygwin

HenrikKinnemann - 11 Aug 2009 - 03:28

Hi,

is there anybody who has compiled openfst-1.1 (latest version) on Cygwin successfully? Which "3rd-party" libs need to be available on Cygwin in advance?

Best regards, Henrik

PaulDixon - 11 Aug 2009 - 04:08

I have compiled it in cygwin with the gcc-4 package and without any 3rd party libraries ./configure CXX=g++-4.exe CC=gcc-4.exe make install make
 
Log In

build 64bit openFST

ZhijianOu - 06 Aug 2009 - 10:19

hi,

I tried to run fstcompile.exe on a transducer with 23M states and 41M arcs, from AT&T FSM format text file. It terminates, saying that "Memory allocation failed". The 2G address space is not sufficent for compiling such a large transducer. Does the current openFST code support 64bit compile? Could you give me some instructions to build a 64bit openFST?

It is very appreciated that you could favor me with a reply. Thank you.

Zhijian

PaulDixon - 06 Aug 2009 - 23:10

Are you running the Windows version? If so, I can try add a 64bit profile to the Visual Studio solution

ZhijianOu - 14 Aug 2009 - 00:03

Hi, PaulDixon, I have installed Windows server 2003 Enterprise x64 edition.

It's very kind of you if you can add a 64bit profile to the Visual Studio solution.

Best regards, Zhijian

PaulDixon - 17 Aug 2009 - 08:47

Hi I added an updated solution to the contrib section. To build 64bit select the Release64 x64 in your build toolbar. If you are running VS express edition you might also need the platform SDK to get the 64 bit compiler.

ZhijianOu - 23 Aug 2009 - 22:53

PaulDixon, It works! Thank you very much. Best, Zhijian
 
Log In

valgrind complaints with Replace on openfst-1.1

GonzaloIglesias - 02 Aug 2009 - 13:36

Hi again, I have discovered that valgrind starts complaining with the openfst library with many operations, but only after a replace operation is performed. This didn't happen if compiled using openfst beta. I'm using valgrind 3.2.3. Here goes a small example:



    //Building 3 trivial fsts for fst replacement                                                                                                            
    VectorFst<StdArc> fst1,fst2,fst3;
    fst1.AddState();
    fst1.AddState();
    fst1.AddArc(0, StdArc(0, 123456789, StdArc::Weight::One(), 1));
    fst1.SetFinal(1,StdArc::Weight::One());
    fst1.SetStart(0);
    fst2.AddState();
    fst2.AddState();
    fst2.AddArc(0, StdArc(0, 1234567890, StdArc::Weight::One(), 1));
    fst2.SetFinal(1,StdArc::Weight::One());
    fst2.SetStart(0);
    fst3.AddState();
    fst3.AddState();
    fst3.AddState();
    fst3.SetFinal(2,StdArc::Weight::One());
    fst3.SetStart(0);
    fst3.AddArc(0,StdArc(0, 1000, StdArc::Weight::One(), 1));
    fst3.AddArc(1,StdArc(0, 2000, StdArc::Weight::One(), 2));
    vector< pair< StdArc::Label, const Fst<StdArc> * > > pairlabelfsts;
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(1000,&fst1) );
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(2000,&fst2) );
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(3000,&fst3) );
    VectorFst<StdArc> rulefst;
    StdArc::Label slabel=3000;

#ifndef USEOPENFSTBETA
    rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions<StdArc>(slabel,false));
#else
    rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions(slabel,false));
#endif
    RmEpsilon(&rulefst); //valgrind complains here
    rulefst.Write(some.fst); // and valgrind complains here too
    return 0;
}

(I may submit a link to a full example code)

Cheers!

CyrilAllauzen - 11 Aug 2009 - 10:09

Hi Gonzalo,

Thanks for pointing this out! I'll look more deeply into this.

Best, Cyril

 
Log In

very slow delayed union on openfst 1.1

GonzaloIglesias - 02 Aug 2009 - 11:45

Hi,

Thanks for such a splendid tool as openfst! I'm a very happy user of the beta version for more than a year now.

I'm writing here because I have encountered a specific problem with the delayed union operation using openfst-1.1. library. Basically, I perform a delayed union of many fsts. Afterwards, I update by reinstancing to a VectorFst . I have isolated a particular case in which the version compiled with openfst-beta works things out very quickly (less than one minute), whilst it takes 40 minutes on openfst 1.1 to create the same fst. I have a small but full example including the necessary fsts to read from (~6MB) and perform union with. I don't understand this behaviour and it seems to me something could be wrong within version 1.1

Perhaps this is a known issue? May I post a link to a tar file containing the example? Please let me know how should I proceed.

Thanks in advance!

CyrilAllauzen - 11 Aug 2009 - 10:08

Hi Gonzalo,

I've started to look into this and I think I found the reason for this inefficiency. We'll fix this for the next release.

Thanks a lot for your feedback!

Best, Cyril

CyrilAllauzen - 13 Aug 2009 - 16:11

Update: there is actually no efficiency issue for delayed union. The issue is that Gonzalo was using a sub-optimal approach. The most efficient way of performing a large number of delayed union is to use Union(RationalFst<A> *fst1, const Fst<A> &fst2) as follows:
UnionFst<A> *union = new UnionFst<A>(fsts[0], fsts[1]);
for (int i = 2; i < fsts.size(); ++i)
  Union(union, fsts[i]);
This takes advantage of the fact that UnionFst derives from RationalFst and that delayed rational operations such as union can be applied destructively to a RationalFst.

Cyril

 
Log In

Transforming a collection of strings

MadhuTherani - 29 Jul 2009 - 14:35

I am trying to transduce a set of strings (defined by a Regex) into one final string..I have an FST defining the Regex and another FST defining the final string.. I need the second FST to match to any output label of the first FST in the composition fo the two FSTs..Can anyone suggest how to go about it..? Sigma matcher in the composition.? thanks

madhu

DoganCan - 30 Jul 2009 - 15:47

Depending on what you mean by "I need the second FST to match to any output label of the first FST in the composition of the two FSTs", there might be many different solutions to your problem.

Say you have the string "x y z" in the set defined by the regex and the final string is "a b".

If you want the transduction "x y z" -> "a b a b a b", one possible solution is to create a simple transducer T in the form:

0 1 X a
1 2 <eps> b
2 0 <eps> <eps>
2
where X is any symbol in your input alphabet (You need to add a transition in the form "0 1 X a" for each X in your alphabet) and compose the regex with T.


If you want the transduction "x y z" -> "a b" instead, then one possible solution is to create a transducer T in the form:

0 1 X <eps>
1 0 <eps> <eps>
1 2 <eps> a
2 3 <eps> b
3
where X is any symbol in your input alphabet and compose the regex with T.

You might achieve the same behaviour using sigma matcher in composition (At least I believe it is possible).

You might even try (for "x y z" -> "a b"):

  • replace output labels of the regex with <eps> labels
  • replace input labels of the final string with <eps> labels
  • concat two transducers
  • syncronize
  • rmepsilon

Best, Dogan

 
Log In

AT&T FsmLib? vs openFST for exe ussage

ZhijianOu - 29 Jul 2009 - 11:20

hi,

1) How to easily get the command-line help in openFST's shell-level exe? You know, in FsmLib, an argument "-?" will print help information.

2) If exe is sufficient for my application, which one do you recommend to use? In particular, I have a special emphasis on the performance of acceptor optimization operations, e.g., determinize, minimize, remove-epsilon. As we see, minimizing a transducer is not supported at the command level.

Thanks in advance.

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 23:41

1) fstcommand -help
 
Log In

What semiring is the default in fstminimize, fstdeterminize ?

ZhijianOu - 29 Jul 2009 - 11:07

hi,

I'd like you to point out whether my following understandings are correct.

1) If the input is an acceptor, the implementation of fstminimize is first do weight pushing, and then do (unweighted) acceptor minimization.

2) If 1) is correct, then using different semirings will have different weight pushing results, and different minimization results.

3) In the description of fstminimize in openFST (or fsmminimize in fsmLib), there is no mention of the default semiring used. What semiring is the default?

4) For a deterministic transducer, P(I, x, y, F) contains only one path, using the notation in the paper - Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, Springer-Verlag, 2008.

5) fstdeterminize depends on the semiring ?

6) If we create binary fst from text file using fstcompile or fsmcompile, where to specify the semiring ?

It is very appreciated that you could favor me with a reply. Thank you.

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 20:42

6) For fstcompile use the arc_type switch fstcompile --arc_type=log test.txt > test.fst

For fsmcompile use the -s switch fsmcompile -t -s log test.txt > test.fst

DoganCan - 30 Jul 2009 - 16:22

  1. Correct
  2. Correct
  3. No default semiring. Operation is performed in the semiring given by the arc type of the input fst.
  4. Correct
  5. Yes. Say there are distinct paths with the same labels in the input fst.
    • Tropical semiring: minimum of the path weights is assigned to the output path.
    • Log semiring: sum of the path weights is assigned to the output path.

ZhijianOu - 30 Jul 2009 - 22:04

Thank PaulDixon and DoganCan.

The -s switch for fsmcompile is to give the states textual names.

The output from "fsmcompile -?" is
FSM Version 3.7
Copyright (C) 1998 AT&T Corp. All rights reserved.
Usage: fsmcompile [-opts] [fsm]
Options:
 -i symbols       input arc symbols
 -o symbols       output arc symbols
 -s symbols       state symbols
 -t               transducer
 -V file          input potentials file
 -F file          output FSM file
 -?               info/options

Continue with my question 3. 
In openFST, we can use --arc_type to specify the semiring. But we can run fstcompile without --arc_type, such as
fstcompile --isymbols=isyms.txt --osymbols=osyms.txt text.fst binary.fst
The default --arc_type is "standard", which means tropical semiring. Am I right?

But in fsmLib, if we run fsmcompile as follow:
fsmcompile -t -i i.syms -o o.syms < T.txt > T.fst
It is not clear what semiring is used.

Thanks for your attention.

Best, ZhijianOu

PaulDixon - 30 Jul 2009 - 23:06

It seems the semiring option was added to fsmtoolkit 4.0.

Yes, standard means tropical semiring (Explained in previous post "Why_is_TropicalWeight_considered_standard?")

 
Log In

problem with compiling openfst on Windows with Visual Studio 2008

ZhijianOu - 27 Jul 2009 - 10:49

hi,

As said in the openfstwin-1.1.zip's readme.txt, I open the solution file openfstwin.sln in Visual Studio 2008 and compile. But there are errors as follows:

1>Compiling... 1>fst.cc 1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_set(19) : fatal error C1083: Cannot open include file: 'unordered_set': No such file or directory
1>properties.cc
1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_map(19) : fatal error C1083: Cannot open include file: 'unordered_map': No such file or directory
1>Generating Code...
1>Build log was saved at "file://d:\openfstwin-1.1\openfstwin\openfstwinlib\Debug\BuildLog.htm"
1>openfstwinlib - 2 error(s), 0 warning(s)

It is very appreciated that you could favor me with a reply. Thank you.

Best regards, ZhijianOu

CyrilAllauzen - 27 Jul 2009 - 12:31

Hi Zhijian,

First, I'll make sure that you are using Visual Studio 2008 Service Pack 1 as mentioned here.

This looks like a missing TR1 issue. You might not have a working version of TR1 currently installed. As far as I know (I have never used VS2008), an implementation TR1 is included with SP1.

Best, Cyril

ZhijianOu - 29 Jul 2009 - 11:03

Hi, Cyril,

It works. Thank you. After installing VS2008 Service Pack1, I compiled openfstwin-1.1 successfully, though still with warnings, either warning C4396 or warning C4800. For example,

d:\openfstwin-1.1\openfst-1.1\src\include\fst/pair-weight.h(195) : warning C4396: 'fst::operator >>' : the inline specifier cannot be used when a friend declaration refers to a specialization of a function template
d:\openfstwin-1.1\openfst-1.1\src\include\fst/matcher.h(317) : warning C4800: 'uint64' : forcing value to bool 'true' or 'false' (performance warning)

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 20:40

changing fst.Properties(kAcceptor, true) to fst.Properties(kAcceptor, true)!=0 Should stop the C4800 warning.

If you think any of the warnings are bogus you can the warning number to the pragma statement at the top of config.h

HenrikKinnemann - 25 Aug 2009 - 05:02

Hello Paul,

my build of OpenFST-1.1 with Visual Studio 2008 SP1 was successful.

Now I would like to compile/link an example test application which is shipped together with OpenFST in file 'openfst-1.1\src\test\fst_test.cc'.

Visual Studio reports the following compile errors for fst_tes.cc:

1>------ Rebuild All started: Project: TestFst, Configuration: Debug Win32 ------ 1>Deleting intermediate and output files for project 'TestFst', configuration 'Debug|Win32' 1>Compiling... 1>stdafx.cpp 1>Compiling... 1>TestFst.cpp 1>d:\usr\local\include\fst\test-properties.h(24) : error C3083: 'tr1': the symbol to the left of a '::' must be a type 1>d:\usr\local\include\fst\test-properties.h(24) : error C2039: 'unordered_set' : is not a member of 'std' 1>d:\usr\local\include\fst\test-properties.h(24) : error C2873: 'unordered_set' : symbol cannot be used in a using-declaration 1>TestFst - 3 error(s), 1 warning(s)

What do I have to do in order to use 'unordered_set'?

Thanks in advance, Henrik

 
Log In

OpenFst? and AT&T FSM support UTF8 or Unicode Encoding?

AbbasMalik - 24 Jul 2009 - 15:23

Dear All I have question regarding OpenFst and AT&T FSM.

Do OpenFst and AT&T FSM support UTF8 or Unicode Encoding?

Thank you in advance,

Best

CyrilAllauzen - 24 Jul 2009 - 16:45

Hi Abbas,

This question was discussed previously in the forum: link.

Let me know if that answer your questions.

Best, Cyril

 
Log In

Pattern Matching using Mohri 1997 algorithm

PhilSours - 16 Jul 2009 - 16:58

Hi,

I'm interested in pattern matching, i.e. finding occurrences of a set of strings within a text.

M. Mohri [1997] has described an algorithm to generate an efficient pattern search automaton based on failure functions in this paper:

Mohri, Mehryar. String-Matching with Automata. Nordic Journal of Computing, 4(2):217-231, Summer 1997. http://www.cs.nyu.edu/~mohri/postscript/njc.ps

This method has also been referred to by W. Skut [FSMNLP 2008].

Is there anyone who has submitted (or could submit) code implementing this algorithm that would be free for commercial use?

Thank you, Phil Sours

NewestUser - 18 Jul 2009 - 11:11

it's not all that clear what you are aiming to do, but there are quite a few generic string matching, or regular expression matching solutions out there. the following link is to a great tutorial on the subject, which presents a well known solution which is quite easy to implement using the openfst library,

http://swtch.com/~rsc/regexp/regexp1.html

although it is very simple to implement, this algorithm generates transducers which are, in some cases, not terribly well-suited to further optimization or post-processing, especially epsilon removal. the following paper provides an overview of some more efficient ( but more complex ) alternatives which may be better suited to general or more complex inputs,

www.cs.nyu.edu/~mohri/postscript/glush.pdf

dunno if that is what you are looking for, and perhaps it's overkill, but depending on what you mean by 'match strings in a text' it seems like openfst itself might be overkill.

CyrilAllauzen - 20 Jul 2009 - 15:09

Hi Phil,

A version of this algorithm is implemented in the GRM library. The binaries are available for free for non-commercial use, no source code however.

Cyril

 
Log In

Pruning without changing state ID's.

DoganCan - 12 Jul 2009 - 13:32

Hi,

I need to prune an fst while keeping the state ID's unchanged. I need this behaviour since I keep auxiliary information (i.e potentials) about the states of the input fst. Is there a convenient way of doing this?

Thanks, Dogan Can

NewestUser - 12 Jul 2009 - 21:08

can you use the 'keep_state_numbering' flag associated with fstcompile? i have not tested it in this particular situation, but it seems like it might suffice.

DoganCan - 13 Jul 2009 - 16:55

'keep_state_numbering' flag only works when compiling a binary fst from its text representation. it does not help when an fst operation like pruning is applied on that same fst.

CyrilAllauzen - 20 Jul 2009 - 15:19

Unfortunately, there is no way to do that in the current version of the library. A workaround would be "encode" the origin state of each transition in its labels.

Cyril

 
Log In

User-defined Flags

DoganCan - 12 Jul 2009 - 07:01

Hi,

I noticed the following comment in the advanced usage section.

In a user-defined binary, the command line options processing will all also work if the user calls:

SetFlags(usage, &argc, &argv, true);
In that case, the user can set his own flags as well, following the conventions in <fst/flags.h>.

My problem is that when I add a new flag to a custom fst binary, I need to put its definition along with other flag definitions in <src/bin/main.cc> or in <src/lib/flags.cc> and rebuild the library, otherwise SetFlags method does not recognize it.

I have the generic OpenFst binary setup:

fstbinary.cc -- includes binary-main.h, calls SetFlags and Arc templated BinaryMain function

binary-main.h -- includes <fst/main.h>, declares a new flag using DECLARE_type(newflag) and defines BinaryMain function

Defining the new flag using DEFINE_type(newflag,default,help_message) in fstbinary.cc or binary-main.h allows me to use the newflag with its default value, however i cannot pass it from the command line.

Is it possible to to define and use new flags in a user-defined binary without modifying the library source code? If so, could you provide an example?

Many thanks, Dogan Can

PaulDixon - 13 Jul 2009 - 06:36

I've managed to add and use flags without recompiling everything. No 100% sure if this does what you want as it is very similar to what you already tried, but here are minimalistic header and cpp files which allow the value of myflag to be changed. Compiled and tested on a standard OpenFst 1.1 install under Linux using g++ fsttest.cpp -lfstmain -lfst -ldl -o fsttest

#include "test-main.h"

namespace fst 
{
        REGISTER_FST_MAIN(TestMain, StdArc);    
        REGISTER_FST_MAIN(TestMain, LogArc);
}  

DEFINE_double(myflag, 1234, "My Flag");
using fst::CallFstMain;
int main(int argc, char **argv) 
{
  string usage = "Test flags \n\n  Usage: ";
  usage += argv[0];
  usage += "  Flags: myflag\n";
  std::set_new_handler(FailedNewHandler);
  SetFlags(usage.c_str(), &argc, &argv, true);
  if (argc !=1)
  {
    ShowUsage();
    return 1;
  }
  return CallFstMain("TestMain", argc, argv, FLAGS_arc_type);
}

#ifndef TEST_MAIN_H__
#define TEST_MAIN_H__

#include <fst/main.h>
DECLARE_double(myflag);
DECLARE_string(arc_type);

namespace fst 
{

   template <class Arc>
   int TestMain(int argc, char **argv, istream&, const FstReadOptions&)
   {
     
      cerr << "myflag=" << FLAGS_myflag << endl;
        return 0;
   }
}  
#endif  

DoganCan - 13 Jul 2009 - 17:32

Hi Paul,

This is exactly what I tried before. Seems like there is a problem with my build settings. Thanks for your help.

DoganCan - 13 Jul 2009 - 19:21

Hi,

After going through my build settings, I discovered that GCC_SYMBOLS_PRIVATE_EXTERN compiler flag was set to YES in default Release settings of XCode, and changing it to NO solved the problem. This flag sets symbol visibility and I suppose it makes flag definitions invisible to the flagregister.

 
Log In

auxiliary symbols and a speech recognition cascade

NewestUser - 10 Jul 2009 - 04:35

Hi, I'm trying to understand the proper usage of auxiliary symbols in the construction of a recognition cascade using.

unfortunately, although I've now read through,

  • "Speech Recognition with Weighted Finite State Transducers",
  • "A Generalized Construction of Integrated Speech Recognition Transducers",
  • "Generalized Optimization Algorithm for Speech Recognition Transducers", and
  • "Integrated Context-Dependent Networks in Very Large Vocabulary Speech Recognition",

this one point still eludes me, and does not seem to be described or illustrated in any of these papers ( or any others that I was able to find ). Almost every other aspect of the transducer construction and cascade composition seems to be very clearly explained and usually illustrated with simple examples, but the trick with auxiliary phones and composition continues to elude me.

At present my approach consists of the following steps,

  • construct G, a grammar transducer, which in my case happens to be determinizable. thus i perform no augmentation.
  • construct L, the closure of a lexicon transducer derived from a pronunciation dictionary.
    • i automatically supplement this transducer with auxiliary symbols/phones as necessary, as described in the papers mentioned above
    • e.g., red r eh d #1, read r eh d #2, etc.
  • construct C, an inverted, context-dependent, deterministic triphone transducer.
    • at present i am generating this transducer from a phoneme list which also includes the auxiliary symbols/phones so as to guarantee proper composition. i generate all possible logical triphones and rely on the composition process to weed out the superfluous combinations.
  • perform the composition and optimization of the cascade ( i am not dealing with the H-level at present )
    •  min( det( *C* o det( *L* o *G* ) ) ) 
  • replace all auxiliary triphones with the empty string, thus
     eh+d-#1 
    will be replaced with '-'

there are clearly a couple of problems with the last stage of my process. first, each auxiliary symbol at the context-independent level is being mapped to multiple different different auxiliary symbols at the context-dependent level. nevertheless the literature implies that this should not in fact be the case,

For further determinizations at the context-dependent phone level and distribution level, each auxiliary phone must be mapped to a distinct context-dependent phone. Thus, self-loops are added at each state of C mapping each auxiliary phone to a new auxiliary context-dependent phone. The augmented context-dependency transducer is denoted by C.

so i suppose that the mapping should be one-to-one, rather that one-to-many, which is what my current approach yields.

another clearly negative consequence of my current botched approach is that the final auxiliary-symbol removal step results in lost context wherever such auxiliary symbols appear. for example, the following transducer

0 1 SIL+r-eh red
1 2 r+eh-d -
2 3 eh+d-#2 -
3 4 d+#2-b -
4 5 #2+b-ah ball
5 6 b+ah-l -
...
...

defaults to,

0 1 SIL+r-eh red
1 2 r+eh-d -
2 3 - -
3 4 - -
4 5 - ball
5 6 b+ah-l -
...
...

which, although i can actually use it for recognition, is obviously wrong. so finally my question: are there any other papers which provide either diagrams of an augmented component context-dependent triphone transducer, or a more detailed description of the I(M) operation described in "Generalized Optimization...", For any transducer M, we also denote by I(M) the transducer augmented with extra transitions such that each new symbol on its input side is mapped to some new and distinct output symbol.

try as I might still don't quite get this. sorry if this seems a foolish question but im teaching this to myself and dont really know where else i might ask.

NewestUser - 10 Jul 2009 - 04:42

another, perhaps more obvious way of asking this might be, what exactly is a self-loop in the context of this discussion, and what does it look like in practice?

NewestUser - 11 Jul 2009 - 10:34

I believe I figured out a more appropriate solution, which seems to work, but perhaps Im still missing something,

The most trivial, canonical example of a deterministic context-dependent triphone transducer.

The same example, inverted, and augmented with 2 auxiliary phones, #0, and #1

Assuming we have an example grammar transducer of the following form,

0 1 a testa
1 2 b -
2 3 #0 -
3 4 a testb
4 5 b -
5 6 #1 -
6

Then composition with the CD transducer yields,

where the auxiliary triphone symbols can now be safely replaced with epsilons, or short-pauses, or removed without any loss of information.

CyrilAllauzen - 14 Jul 2009 - 18:33

This is indeed the correct construction. In this context, a self-loop is a transition whose destination state is the origin state.

NewestUser - 14 Jul 2009 - 23:25

thanks for the confirmation! i think there may also be at least one more valid construction, this involves augmenting the monophone symbols as needed with auxiliary symbols, then treating these as normal monophones during construction of the context-dependency transducer.

so, in contrast to the above examples where we have a list of monophones which includes auxiliary phones, 'a, b, #0, #1', which are all treated the same. and further in contrast to the self-loop approach, where we have two separate classes of symbols, monophones: 'a, b', and aux. symbols: '#0, #1' we instead of a selective combination depending on the structure of the lexicon transducer, e.g., monophones: 'a#0, b, a#1' which are all treated as normal monophones. this approach has an advantage in that there is no need to treat anything specially, and at the C-level ( assuming we've no need to worry about anything further down in the cascade ), there is no loss of information. instead of 'eps-a+b, #0c:, a-b+a' we get 'epse-a#0-b, a#0-b+a#1'. and instead of epsilon replacement, we have a simple symbol replacement.

this actually seems like a reasonable solution so long as, a. the degree of homophony is small, and b. there is no need to interact with lower levels in the cascade. if either a. or b. do not hold this alternative rapidly becomes intractable due to the increase in monophones and the increase in the size of the resulting C-level transducer.

CyrilAllauzen - 20 Jul 2009 - 15:16

This alternative approach would also work indeed.

However, I want to point out that the first approach does not really require to physically add the self-loops and can be achieved without increasing the size of C by using custom matchers? in composition.

 
Log In

Converting a "linearized transducer" into a true transducer FST

KennethRBeesley - 09 Jul 2009 - 12:17

Assume that you have an OpenFst Acceptor in which each string of the encoded Language has an even number of symbols, e.g.

a b c d e f

in which each even-numbered symbol (counting from 0) represents an Input symbol and each odd-numbered symbol represents an Output symbol. (This is the "linearized transducer".)

Question: Is there an easy way to turn this Acceptor (a kind of linearized transducer) into a real Transducer where each path like the one above is converted into a path of half the length, with these labels:

a:b c:d e:f

CyrilAllauzen - 20 Jul 2009 - 15:04

I would do it in two steps:

1) Use composition to turn a b c d e f into a:eps eps:b c:eps eps:d. This can easily be done using two two-state transducers T_1 and T_2 and computing T_1 o A o T_2. T_2 is the transducer erasing any symbol in odd position: for all a in the alphabet,

0 1 a eps
1 0 a a
1

T_1 is the inverse of the reverse of T_2

2) Apply Synchronize() followed by RmEpsilon().

Cyril

 
Log In

A few small bugs

DoganCan - 08 Jul 2009 - 09:08

Hi,

I noticed two small bugs/problems in the library and here are my fixes for them.

1. fstcompile does not give any error message when no input file is specified.

Solution: change line 00218 in <compile-main.h>

< if (!istrm) { 
> if (!*istrm) { 

2. <lexicographic-weight.h> defines pair weight type using "_<_" symbol between component weight types. However "<" symbol creates problems say when you want to register LexicographicArc for fst binaries since you can neither supply type1_<_type2 as an arc type nor create a dynamic shared object named type1_<_type2-arc.so.

Solution: change line 00075 in <lexicographic-weight.h>

< static const string type = W1::Type() + "_<_" + W2::Type(); 
> static const string type = W1::Type() + "_L_" + W2::Type(); 

Best, Dogan Can

CyrilAllauzen - 09 Jul 2009 - 11:00

Thanks a lot. We will fix this in the next release.

Best, Cyril

 
Log In

OpenFST? vs. BOOST Graph Library for operations on language models

HenrikKinnemann - 18 Jun 2009 - 08:01

Hello,

as a software developer of an OCR team I'm looking for a library which can

(a) represent certain kind of language models (e.g., OCR error correction rules, n-grams, word dictionaries, morphology and grammar rules for German language) in a unique way, for example, as weighted finite automata

and

(b) has some high-level operations already implemented to process the language models, e.g. functions like 'compose', 'minimize' and 'shortest-path-search'.

The BOOST Graph Library (http://www.boost.org/doc/libs/1_39_0/libs/graph/doc/index.html) does also have weighted finite automata which could be used to represent language models and BOOST Graph offers lot's of functions on them.

My question: Where are the supposed differences between BOOST Graph and OpenFST in relation to my OCR application, and, what are the strenghts and weaknesses of both of them?

Best regards, Henrik

CyrilAllauzen - 19 Jun 2009 - 12:09

Hi Henrik,

The Boost Graph Library focuses on directed and undirected graphs. As far as we know, it does not provide specific operations for weighted automata and transducers such as determinization, minimization or composition. You can probably use the BGL to represent language models, but you'll have to re-implement these algorithms. If you know otherwise, please point us to the relevant documents as there is no mention of automata-specific algorithms at the URL you provided.

The OpenFst library was especially designed for handling weighted automata and transducers. It provides a large selection of algorithms (including minimization and composition), some of them being implemented as on-the-fly (lazy) operations. The library also includes a general semiring framework. OpenFst provides some of the basic graph algorithms offered by the BGL such as generic visitors (depth-first, breadth-first, shortest-first, ...), generic shortest-path algorithms (Dijkstra, Bellman-Ford, ...), topological sort, strongly-connected components, ... However, OpenFst does not offer some other algorithms provided by BGL (spamming tree algorithms for instance).

The OpenFst library has been successful used by several natural language processing applications: speech recognition, speech synthesis, machine translation and OCR among others. It is for instance one of the core components of the speech recognition and speech synthesis efforts at Google. In that particular example, it is used among other things to represent (and combine) n-gram language models, word dictionaries and regular grammars.

To conclude, the Boost Graph library could provide a) but not b). The OpenFst library provides a) and b).

Best regards, Cyril

 
Log In

N-way Composition Support

GrahamNeubig - 25 May 2009 - 08:06

Hello,

Are there any plans to support the three-way composition mentioned in the technical report "N-way composition of weighted finite-state transducers" (by Cyril Allauzen and Mehryar Mohri) in the near future? It would come in handy for one of my projects and I would be willing to help with the implementation if no-one is working on it, but if it's already being worked on I'll wait for the next release.

 
Log In

New behaviour of SetInputSymbols? /SetOutputSymbols ?

ChristopherKermorvant - 07 May 2009 - 08:04

Hi,

I am working with the 1.1 version and it seems that there is a change in behavior of SetInputSymbols/SetOutputSymbols. As from version 1.1, it seems that these two functions make a deep copy of the SymbolTable given in argument. This is due to the addition of a copy constructor in symbol-table.h :

explicit SymbolTableImpl(const SymbolTableImpl& impl)
      : name_(impl.name_),
        available_key_(0),
        dense_key_limit_(0),
        check_sum_finalized_(false) {
    for (size_t i = 0; i < impl.symbols_.size(); ++i) {
      AddSymbol(impl.symbols_[i], impl.Find(impl.symbols_[i]));
    }
  } 
* Am I right ?

* if yes, i don't see the use of having a reference counting on a deep copy

* if yes, the following comment in symbol-table.h is not true anymore :

// SymbolTables are reference counted and can therefore be shared across
// multiple machines. For example a language model grammar G, with a
// SymbolTable for the words in the language model can share this symbol
// table with the lexical representation L o G.
 
Thank you for your answer and for this great library !

CyrilAllauzen - 07 May 2009 - 11:14

Hi,

This is inexact. SetInputSymbols and SetOutputSymbols still do a shallow copy of the SymbolTable in 1.1. They both call SymbolTable::Copy that in turn calls the copy constructor of SymbolTable:

  SymbolTable(const SymbolTable& table) : impl_(table.impl_) {
    impl_->IncrRefCount();
  }
The copy constructor of SymbolTableImpl is not called since impl_ is a pointer to a SymbolTableImpl.

Best, Cyril

 
Log In

Sat vocabulary building software

AllenSam - 05 May 2009 - 06:34

hi all looking for an English student to work on our ...


Off-topic post deleted.
MichaelRiley
.

Mindtuning for the Encode/Determinize/Decode workaround

KennethRBeesley - 06 Apr 2009 - 17:22

Cyril,

Back on 14 August 2007, you wrote:

"The library currently only supports the determinization of functional transducers (if two successful paths have the same input label, they need to also have the same output label). The reason for that is that we use the weighted automata determinization algorithm viewing the output labels as weights in the string semiring ....

A workaround is to use Encode/Decode to view the transducer as an acceptor, considering the pair (ilabel, olabel) as one symbol.

1. Encode: EncodeMapper encoder(kEncodeLabels, ENCODE); Encode(fst, &encoder);

2. Determinize.

3. Decode: Decode(fst, encoder);

*** End of Quotation ***

I need a bit of mind-tuning.

Question: Is the Encode/Determinize/Decode workaround recommended ONLY when determinizing a NON-functional Fst? If so, what's the best way to determine if an Fst candidate for determization is functional or not?

Background: In my application, Fsts are created from arbitrarily complex regular expressions, and the resulting Fsts may be functional or non-functional. Without the workaround, statements like

$v = a | a:b ;

currently cause a crash when the result Fst is determinized (without the Encode/Decode workaround), because the Fst maps "a" to "a" and to "b"--i.e. in this case the Fst is not functional.

Thanks,

Ken

MichaelRiley - 08 Apr 2009 - 00:01

You can always determinize and minimize an 'encoded' machine. The result may not be fully deterministic or minimal when 'decoded' but at least redundant transitions and states in the encoded version have been dealt with. Probably the safe thing to do in your application when a transduction is specified.
 
Log In

config.h

PaulDixon - 17 Mar 2009 - 08:14

Hi,

After following the INSTALL instructions using the defaults settings and using the library in my own program as described in the README file. The compiler gives and error looking for config.h (included compat.h). Is config.h copied as part of the installation? Everything compiles fine if I add another include path to the location where the configure script was run from.

MichaelRiley - 21 Mar 2009 - 11:45

Thanks. It's been fixed in version 1.1, now uploaded. Autotools growing pains.

 
Log In

Minor bug in assignment operator

DanEgnor - 16 Mar 2009 - 04:29

MutableFst declares an operator=(const Fst &), but it doesn't declare an operator=(const MutableFst &). You'd think that would be OK because MutableFst is a subclass of Fst, but in fact what it means is that C++ defines an implicit operator=(const MutableFst &), which does a shallow copy, which does nothing (because MutableFst is an abstract base class with no data members).

That means that if you then try to assign one MutableFst to another, nothing happens.

Recommended fix: add an operator=(const MutableFst &fst) method that simply calls the existing (abstract virtual) assignment operator, via static_cast or something. I can submit a diff if this isn't clear.

CyrilAllauzen - 16 Mar 2009 - 11:17

Hi Dan,

Indeed, we found out about this bug and fixed it a while back. The latest release of the library (version 1.0) included that fix.

Best, Cyril

DanEgnor - 17 Mar 2009 - 22:00

Zounds, I was three days out of date! Thanks.
 
Log In

When is it safe to RmEpsilon(), Determinize(), Minimize() When is it safe to automatically RmEpsilon? (), Determinize() and/or Minimize()

KennethRBeesley - 03 Mar 2009 - 16:59

Sorry to belabor this point, but I'm still struggling to understand when an Fst can be safely and automatically processed by RmEpsilon(), Determinize() and/or Minimize(). The challenge arises in my Kleene programming language, which uses OpenFst as the underlying library. In Kleene you can enter statements like

$myfst  =   a* b+ ((c|d|e|f) - (m|n|d)){2} ((dog|cat|rat) & (elephant|dog|pig))? ;

After the regular expression is parsed, the interpreter walks the parse tree and makes multiple calls to OpenFst library functions--Closure, Compose, Concat, Difference, Intersect, etc--building many intermediate Fsts and combining them using the various operations until the final Fst is created and (in this case) bound to the variable $myfst. Obviously, all this processing gets done automatically, without human intervention. The user is interested only in the final result, $myfst, which then might get used in subsequent regular expressions.

Ideally, the final result would be automatically determinized and minimized.

I understand, with help from Cyril , that if an Fst is tested and found to be 1) acyclic, or 2) both (unweighted AND an Acceptor) then it can be safely Determinized and Minimized. (Corrections would be welcome.) So after each intermediate or final Fst is created by the Kleene interpreter, it might be sent to the following clean-up function

// first-draft pseudo-code
if ( the fst is acyclic || (the fst is unweighted &&  the fst is an acceptor) ) {
      RmEpsilon()
      Determinize()
      Minimize()
} else {
      // this is questionable, see below
      RmEpsilon()
}

However, I'm aware that this could still be problematic. In a response on this forum, Cyril stated

"Indeed, in some situations you cannot remove epsilons because it will lead to a space blow up. In that case, you might still want to optimize the machine with epsilons [i.e. without first invoking RmEpsilon] using determinization and minimization."

In another response, Cyril stated

"Indeed, union introduces epsilon-transitions and epsilon-transitions will slow down composition. Non-determinism will also slow down composition. So, I would recommend you first epsilon-remove and then determinize your fst. However, there is always a risk with determinization (it might blow up or even not terminate). Hence, if what I suggest does not work better, you should investigate using determinize only, or rmepsilon only or determinize followed by rmepsilon."

This puts in question the 'else' clause in the pseudo-code above, where RmEpsilon() is automatically called on Fsts that don't satisfy the requirements for safe determinization and minimization.

Question: Assuming my Kleene scenario, where regular expressions are being interpreted and many Fsts (intermediate and final) are being created automatically, when can RmEpsilon(), Determinize() and Minimize() be safely invoked automatically? I want the interpreter to do all the optimizations that are mathematically safe, but to attempt nothing more.

We can assume that the interpreter can interrogate various features of the candidate Fst (cyclic vs. acyclic, weighted vs. unweighted, acceptor vs. transducer, etc.). The interpreter also knows exactly which operation has just been performed to produce the Fst in question (Union, Concat, Closure, etc.).

Thanks,

Ken

CyrilAllauzen - 03 Mar 2009 - 17:27

Hi Ken,

To clarify, when talking about determinization:

  • safe means a condition under which the algorithm will always terminate assuming you have enough memory and wait long enough;
  • unsafe means the algorithm might not terminate ever.

Assuming this definition of safe/unsafe, determinization is safe for acyclic FSTs and for unweighted acceptors and rmepsilon and minimization are always safe.

However, safe does not guarantee that the algorithms will terminate on your machine because you have a finite amount of memory and don't want to wait more than x minutes.

The complexity of determinization in the safe case is in the worst case exponential in the size of the input machine, and quadratic in the size of the result (see DeterminizeDoc).

The complexity of epsilon removal is in the worst case quadratic in the size of the input (see RmEpsilonDoc).

There is no way you can ensure these algorithms will never blowup of memory. However, you can ensure that for most "reasonable" input they will behave "reasonnably".

Due to potentially large cost of these algorithms, it is always recommended you try to reduce the size of the input Fst before hand. This is why I would recommend first optimizing the machine with epsilon, applying epsilon-removal and then optimizing the machine without epsilons, Hence, the "safest" optimization function would be:

// first-draft pseudo-code
if ( the fst is acyclic || (the fst is unweighted &&  the fst is an acceptor) ) {
      Determinize()
      Minimize()
      RmEpsilon()
      Determinize()
      Minimize()
} else {
      // this is questionable, see below
      RmEpsilon()
}

 
Log In

Compiling openfst-1.0 on cygwin

PaulDixon - 26 Feb 2009 - 18:17

I have successfully compiled openfst-1.0 on cygwin. It was necessary to install a newer version of gcc to get TR1 support. I also had to add this line to main.cc
#include <ctime>

I should also have Visual Studio build instructions ready soon.

Paul

PaulDixon - 17 Mar 2009 - 19:37

To compile on the latest verion of cygwin with the cygwin gcc-4 package. Add the #include to the main.cc file and Run configure with the following arguments. ./configure CXX=g++-4.exe CC=gcc-4.exe Build and install as described in the readme

HenrikKinnemann - 11 Aug 2009 - 03:24

 
Log In

fstcompose memory consumption?

DavidHugginsDaines - 19 Jan 2009 - 11:45

Hi,

I'm running into some trouble trying to build static recognition transducers using OpenFst. Specifically, for even a fairly trivial language model G:

# of states                     42980
# of arcs                       183468
# of final states               923
# of input/output epsilons      43867

and (determinized and minimized) dictionary L~:

# of states                     1180
# of arcs                       2424
# of final states               1
# of input/output epsilons      1

when I try to compose these with fstcompose, it rapidly uses all available memory. I haven't let it run long enough to see if it actually terminates.

I'm not sure this is specific to OpenFst, because I find that the FSM library has the same problem. It's possible that I'm building the language model incorrectly, though I'm not sure why this would cause composition to blow up. I am planning to do a sanity check using the GRM library but haven't figured this out yet.

Is it necessary to use on-demand composition when building a transducer of this sort?

CyrilAllauzen - 21 Jan 2009 - 18:10

Hi,

You should try not to determinize/minimize L and make sure that the output labels (i.e words) are on the first non-input epsilon on a path. Also, make sure that L is output label-sorted. (I assumed you do fstcompose L G > LG)

Let me know if this works.

Cyril

DavidHugginsDaines - 26 Jan 2009 - 13:30

Thanks for the advice. It is working quite well now. I had been putting words at the end of words, on the word boundary symbols in L~, under the assumption that epsilon normalization would push them forward eventually (which it does), and also including entirely unnecessary epsilon transitions from the start state to the first state of each word.

I assume the problem with a minimized L is that its maximum out-degree is much higher than an un-minimized one?

CyrilAllauzen - 30 Jan 2009 - 13:37

Hi David,

I'm glad to read that this works quite well now. The issue is with determinization and not minimization. Determinization pushes back the output labels creating a lot of long epsilons paths. In composition, while trying to match a label x at a state in G, all this epsilon paths in L are going to be followed in order to find an x. A lot of this work is wasteful since there is only a few of these paths at the end of which an x can be found.

Cyril

 
Log In

possibility of facing infinite loop in fstdeterminize

IlbinLee - 18 Jan 2009 - 21:47

Is ever possible for "fstdeterminize" to fall into infinite loop?

I'm trying to determinize an FST and the amount of memory it uses gradually increases and it reaches to the limit, results "memory allocation failed" error. However, after removing 2-3 certain arcs in the FST, the process ends pretty quickly. Those arcs don't have any particular characteristics, so I guess it's due to a structure of the FST which consists of more than 2-3 arcs.

Could you give me any solution or hint?

MarkusD - 20 Jan 2009 - 14:25

Yes, the determinization algorithm does not halt for certain weighted finite-state machines. This is described in Mohri 2004 (http://www.cs.nyu.edu/~mohri/postscript/fla.pdf). If the machine contains 'sibling states' then they must be 'twins'. Otherwise, the machine is not determinizable (under the commonly used semirings).

Roughly, two states are sibling states if they are reachable from the start state on different paths that have the same label sequence, and there are cycles at these two states that have similar labels. If these two cycles have the same weight then they are twins.

On this machine, for example, fstdeterminize doesn't terminate: 0 1 1 1 0 2 1 2 1 1 2 3 1 3 3 5 2 2 2 4 2 3 4 6 3

MarkusD - 20 Jan 2009 - 14:27

Here is the finite-state machine again (this time in "pre" tags):

0 1  1 1 0 2  1 2 1 1  2 3 1 3  3 5 2 2  2 4 2 3  4 6 3 
 
Log In

Minor fixes needed to compile with GCC 4.3

DavidHugginsDaines - 06 Jan 2009 - 13:46

As usual, the C++ "standard" got reinterpreted in GCC 4.3, and a few things need to be changed for OpenFst-20080422 to compile. They are mostly trivial, just adding std:: to C string functions and including to get INT_MAX. A patch is at http://www.cs.cmu.edu/~dhuggins/OpenFst-gcc-4.3.diff

 
Log In

Ignore(A, B, &C) algorithm?

KennethRBeesley - 05 Jan 2009 - 17:30

If you're taking requests: It would be convenient to have an Ignore(A, B, &C) algorithm, creating a network (C) that encodes the language/relation A, ignoring any occurrences of B.

 
Log In

Documentation of OpenFst? binary file format?

KennethRBeesley - 05 Jan 2009 - 14:46

Is there any available documentation on OpenFst's binary file format?

Thanks,

Ken

DavidHugginsDaines - 06 Jan 2009 - 14:24

I've been wondering this too, as I'm interested in reading and writing the binary file format without having to link in the OpenFst library (for no good reason aside from not wanting to deal with C++ compiler and library issues). It appears that it's slightly machine specific (but only in terms of endianness), so perhaps it's mainly meant to be used as an internal, intermediate format?

The definition of the header can be found in fst/lib/fst.h, and the symbol table format can be gleaned from fst/lib/symbol-table.cc. How the subsequent state and arc definitions are stored depends entirely on what particular Fst and Arc classes are being used.

In the case of VectorFst, I believe the whole thing looks like this, in native byte order. Strings are stored as a 32-bit length field followed by the raw character data. I may be a bit wrong about this as I haven't actually implemented it yet. Code for this should show up in the CMU Sphinx repository at some point in the near future though.

int32 magic_number = 2125659606;
string fst_type = "vector";
string arc_type = "standard";
int32 file_version = 2; /* file format version */
int32 flags; /* bitfield: HAS_ISYMBOLS = 1, HAS_OSYMBOLS = 2 */
uint64 properties; /* bitfield, defined in fst/lib/properties.h */
int64 start; /* start state ID */
int64 num_states;
int64 num_arcs;

/* input symbol table, if flags & HAVE_ISYMBOLS */
int32 magic_number = 2125658996;
string name;
int64 next_available_key; /* Next available key (number) */
int64 size;
struct {
    string symbol;
    int64 key;
} symbols[size];

/* output symbol table, if flags & HAVE_OSYMBOLS - see above... */

/* states and arcs */
struct {
    float final_weight;
    int64 num_arcs;
    struct {
        int ilabel;
        int olabel;
        float weight;
        int next_state;
    } arcs[num_arcs];
} states[num_states];

DavidHugginsDaines - 06 Jan 2009 - 14:26

That should say VectorFst<StdArc> ... forgot to format it correctly.
 
Log In

Patch for GNU autotools including libtool for OpenFst? 20080422

DavidShao - 28 Dec 2008 - 00:41

I have been developing patches to enable an open source software stack of OpenFst 20080422, leptonlib 1.58, iulib subversion revision 123, Tesseract subversion revision 205, and OCRopus subversion revision 1307 to all use libtool to create dynamic linked libraries on all of Mac OS X 10.5.6 using Macports, Ubuntu 8.10, FreeBSD 7, and Windows XP using Cygwin. The patches can be found at OCRopus files with filenames beginning with toautotools. As of today, for OpenFst, the corresponding file would be
toautotools_openfst_20080422_20081227.diff
As I have stated in a previous post, to apply and use the patch, say to a user subdirectory busr so that one does not need administrator privileges to install,

patch -p1 -E < ../toautotools_openfst_20080422_20081227.diff
autoreconf --install --force
./configure --prefix=$HOME/busr CPPFLAGS="-I/opt/local/include" LDFLAGS="-L/opt/local/lib"
make
make check
make install

I have enabled make check using tests already provided in OpenFst so that one can test whether modules can be dlopened. All tests pass for all platforms except Cygwin on Windows XP. Even for Cygwin on Windows XP, this patch enables a dynamic linked library to be built such that OCRopus is able to use it to pass its own OpenFst tests. (There is an additional failure of one OCRopus test, test-fst-search2.lua, on Mac OS X and FreeBSD.)

My preliminary investigation for Cygwin on Windows XP appears to show that something is going wrong with reinterpret_cast-ing the read() function so that it can be recovered using a map in both fst/lib/register.h and fst/bin/main.h. The pointer to this function appears to be immediately recovered if the map is accessed right after it is stored, but a later GetEntry() returns a null pointer.

I believe that using GNU autotools to provide dynamic linked libraries is an essential step to helping more wide distribution of OpenFst through standard platform package or port managers. I think the current patch just about finishes this task other than the above problem for Cygwin. I also have not tested this patch on any 64-bit platforms other than an Intel Macbook running Mac OS X.

DavidShao - 28 Dec 2008 - 00:42

 
Log In

Determinize, Sequentialize, Subsequentialize

KennethRBeesley - 11 Nov 2008 - 14:17

My colleagues and I are trying to understand the relationship between the OpenFst Determinize() algorithm and the sequentialization and subsequentialization discussed in Mohri's 1997 paper "Finite-State Transducers in Language and Speech Processing".

Any mind-tuning that you could offer would be much appreciated.

Ken

CyrilAllauzen - 13 Nov 2008 - 14:15

The Determinize() algorithm is the algorithm described by Mohri in 1997 Computational Linguistic paper.

In the weighted automaton case, the implementation follows exactly Mohri's paper. The weighted transducer is dealt by considering weighted transducer over seminiring K as weighted automaton over the semiring obtained by taking the cross product of the string semiring and the semiring K.

Cyril

 
Log In

Determinize and functional transducers

KennethRBeesley - 11 Nov 2008 - 14:10

The documentation of the Determinize() function in OpenFst states "The transducer must be functional." and deeper down, "N.B. Current version only determinizes functional transducers". Can we expect a future version of Determinize() that is not limited to functional transducers?

Thanks,

Ken

CyrilAllauzen - 13 Nov 2008 - 16:35

Yes, we indeed plan to have determinization of non-functional transducers in a future version.

Best, Cyril

ThomasDeniau - 11 Nov 2009 - 00:06

Has anybody written a version of determinization which works for non-functional transducers ? Cyril, would you have a preliminary version of that code to give to people looking to implement it a headstart ? Thanks
 
Log In

Request: Determinize in place

KennethRBeesley - 10 Sep 2008 - 23:44

Minimize(&A) minimizes a network A "in place", but Determinize(A, &B) takes A and produces as output a new determinized Fst B. Would it be possible to provide a new version of Determinize that determinizes in place, i.e. Determinize(&A)?

Thanks,

Ken

CyrilAllauzen - 24 Sep 2008 - 17:27

We could add a Determinize(&A). The reason we chose not to so was that the underlying algorithm would still be constructive: create B as the determinization of A, then delete A and assign B to A. Hence, it won't be any more (memory) efficient than using Determinize(A, &B).

KennethRBeesley - 24 Sep 2008 - 21:46

Whatever goes on underneath, I venture to suggest that it would be (to some of us at least) intuitive and useful to add a new Determinize(&A) that is parallel to RmEpsilon(&A) and Minimize(&A). From a user's point of view, it often makes sense to perform such operations "in place" on an existing Fst object.

In my own work (a programming language built on top of OpenFst) an assignment statement like

$net = a*b+[w-z]? ;

causes the right-hand side to be evaluated (by calling various functions in the OpenFst library) and in a symbol table the name $net is bound to a Java Fst object that stores (as a Java long int) the pointer to the resulting C++ Fst object. That pointer is my handle from the Java world to the Fst object in the C++ universe. If I then type

$net2 = $net ;

then $net2 becomes an alias for the same Java Fst object that contains the pointer to the original C++ Fst object. (If I really want to make a copy, I've got a copy function that relies on the Map() function in the OpenFst library.)

With RmEpsilon(&A), I can remove the epsilons from the original C++ Fst object in place, and both $net and $net2 still reference the same Java Fst object that still contains the pointer to the original C++ Fst object. If I could (at least superficially) also Determinize() that original Fst object in place, life would be a bit easier for me. With the existing Determinize(A, &B), a new C++ Fst object B is created, which usually causes my language to create a new Java Fst object to wrap it; and I've got to worry about changing the pointer in the original Java Fst object (the value, in the symbol table, of $net and any aliases like $net2) and deleting the original C++ object A to prevent a kind of memory leak.

So I now understand why Determinize(A, &B) is as it is, but I do still suggest that I and perhaps others would find a new Determinize(&A) useful and intuitively parallel to RmEpsilon(&A) and Minimize(&A), so that all could be performed "in place".

Best,

Ken

CyrilAllauzen - 14 Oct 2008 - 11:59

Hi Ken,

I understand that what you describe is a rather unsastisfying way of wrapping. However, I not sure I understand why this is the only of wrapping. Anyhow, you can easily add a Determinize(&A) as follows:

template <class Arc>
void Determinize(MutableFst<Arc> *fst) {
  *fst = DeterminizeFst<Arc>(*fst, 
                             DeterminizeFstOptions(CacheOptions(true, 0)));
}

Best, Cyril

 
Log In

StateIterator? and the Start state

KennethRBeesley - 21 Aug 2008 - 14:28

When you iterate through the states in an Fst:

for (StateIterator<StdFst> siter(fst); !siter.Done(); siter.Next()) {
     StateId state_id = siter.Value() ;
     ...
}


is the first state_id returned from the Iterator always the Start state of the Fst?

Thanks,

Ken

CyrilAllauzen - 02 Sep 2008 - 11:39

The first state returned by the iterator is not guaranteed to be the start state. You should always use the fst.Start() method to determinne the start state,

Best, Cyril

 
Log In

Failure transitions

ChrisHarris - 21 Jul 2008 - 14:56

Hey guys -

Great effort to build this library - there are so many uses - it's very nice to have such knowledgeable people building this for everyone. Thank you!

I'm interested in doing some language modeling with this library which I anticipate will require backoff weights that are commonly going to build paths which will be "better" than the proper path for certain n-grams.

I know you guys implemented failure transitions in the GRM library for LM purposes, and later an algorithm to replace them with epsilon transitions along with some compression heuristics to reduce the size of the final machines to address this exact problem.

My question is: What's the best route to take to implement this for the OpenFST project? If I take this on myself... should I create a new FST class to implement failure transitions? Or can I get away with just building a new Arc class which can trick the search functions for composition?

Thanks again for your help!

Chris

CyrilAllauzen - 21 Jul 2008 - 16:56

Hi Chris,

There is an undocumented support for failure transitions in composition. This feature is not documented and the handling of failure transitions will change significantly in the next version of the library.

For now, you can label failure transitions with kPhiLabel (you will need to use the C++ library for this, Fsts containing transitions labeled by kPhiLabel cannot be written to a file).

To perform composition with failure transitions, you need to use

ComposeFst<Arc>(fst1, fst2, ComposeFstOptions<COMPOSE_FST1_PHI>())
when fst1 contains transitions labeled by kPhiLabel or
ComposeFst<Arc>(fst1, fst2, ComposeFstOptions<COMPOSE_FST2_PHI>())
when fst2 contains transitions labeled by kPhiLabel.

Best, Cyril

ChrisHarris - 21 Jul 2008 - 20:03

Nice... thanks for the head's up... I'm glad I asked!

Out of curiosity, what's the new implementaiton going to be? Something along the lines of a custom Arc/State/Fsm class?

Chris

ChrisHarris - 23 Jul 2008 - 19:39

Cyril -

I noticed that during composition with failure transitions, the composition does not chase down the transitive closure of both epsilon and failure transitions.

You can reproduce the behavior by having a state with a failure transition lead to a state with an epsilon transition - which in turn leads to a state with a real symbol arc on it. For example an FSA with the following makeup: 0 1 -2 -2 1 2 0 0 2 3 1 1 3

When you do composition with an FST containing only that real symbol (symbol 1 in the above case) - I would expect it to yield a connected machine (failure -> epsilon -> symbol) - which it does not. Instead it looks through the state with the failure transition and sees no matching symbol (only an epsilon transition) and aborts composition.

I tried to fix the problem, but couldn't quite figure out a good way to do so. The logic to chase down the transitive closure of failure transitions is making assumptions on the structure of the FST (deterministic w.r.t. failure transitions, or at least that they're all equally good, because it takes the first one it finds which yields a match) which probably means any fix I make will be short lived anyway.

If you'd like more information, please let me know, otherwise I'll just try another workaround - but I don't think fixing it in the current implementation is worthwhile... do you?

Chris

CyrilAllauzen - 31 Jul 2008 - 18:42

Hi Chris,

Indeed, I should have mentioned this: failure and epsilon transitions do not work together. You should also have at most 1 failure transition at a given state.

Cyril

 
Log In

Interactive manual testing of an FST

KennethRBeesley - 07 Jul 2008 - 11:53

In the context of an interactive GUI, it might sometimes be useful to build an FST and then test it interactively by typing in individual strings to be generated (or analyzed), getting back the output string(s). If the FST to be tested is MainFST, then one obvious way to do this would be the following:

1. Get the input string typed by the user.
2. Build the input string into a one-path acceptor (call it InputACC)
3. StdVectorFst ResultFST ;
Compose(InputACC, MainFST, &ResultFST) ; // compose for generation
Project(&ResultFST, PROJECT_OUTPUT) ; // take the output projection
if (number of paths in ResultFST not infinite or huge) {
// print strings encoded by ResultFST
}

This testing would be put in a loop, allowing the user to manually enter as many test input strings as desired before terminating the loop. To implement analysis, one could 1) invert the MainFST before performing the steps above, or 2) do the following:

Compose(MainFST, InputACC, &ResultFST) ; // compose for analysis
Project(&ResultFST, PROJECT_INPUT) ; // take the input projection

QUESTION: Is there a better, more efficient, way to implement such manual testing? Perhaps using the lazy operations ComposeFst() and ProjectFst() ?

Thanks,

Ken

CyrilAllauzen - 09 Jul 2008 - 18:11

This approach looks good to me. Using lazy operations shouldn't improve the efficiency as is.

You could benefit from lazy operations when the number of strings in ResultFst is large and you don't want to display all these strings (or even don't want to compute all of ResultFst). Instead, you could rewrite your code to generate all paths to work for a non-expanded FST and to stop once a fixed number of strings have been found (you could even ask to the user if he wants to see more strings then).

Best, Cyril

 
Log In

Easy way to query the number of paths in an FST?

If you have an FST, is there an easy way to query the number of paths?

Thanks,
Ken
KennethRBeesley 13 Jun 2008 - 14:10
The natural way is to use the shortest-distance algorithm:
1. Use Map to turn your FST into an FST over the log semiring setting the weight of every transition and the final weight of final states to LogWeight? ::One().
2. Call ShortestDistance? with reverse == true. Let d be the value return in the distance vector for the initial state. d is then the -log of the number of successful paths in the FST.

Best,
Cyril
CyrilAllauzen 16 Jun 2008 - 12:56
Thanks, Cyril. I suspect that my terminology was wrong and/or my C++ skills are still shaky. What I really need is the number of complete paths in an FST, from the start state to a final state. Following these instructions (as best I can), I seem to get the number of paths from the start state to any state (final or not). I hesitate to post my code here, but I'd be more than happy to send it to you or anyone else.

Ken
KennethRBeesley 17 Jun 2008 - 16:48
The key is to use the reverse option: ShortestDistance(fst, &distance, true). For any state q, we then have that distance[q] is the sum of the weights of all the paths from q to a final states. Hence, distance[fst.Start()] is what you want (but you should always check before that distance.size() > fst.Start()).

Cyril
CyrilAllauzen 18 Jun 2008 - 11:37

Cyril,

I am using the reverse option = true. Perhaps I'm not properly setting the exit weight of the final states to LogWeight? ::One(). Will the following Map do it?

// Mapper from StdArc to LogArc.
// orig. from lib/map.h
// struct StdToLogMapper {
struct ModifiedStdToLogMapper {
typedef StdArc FromArc;
typedef LogArc ToArc;

// orig
//LogArc operator()(const StdArc &arc) const {
// return LogArc(arc.ilabel, arc.olabel, arc.weight.Value(), arc.nextstate);
//}

// KRB, change weight of all arcs to LogWeight::One()
LogArc operator()(const StdArc &arc) const {
return LogArc(arc.ilabel, arc.olabel, LogWeight::One(), arc.nextstate);
}

// KRB: also need to change the exit weight of all final states to
// LogWeight::One()

// KRB, will this do it?
MapFinalAction FinalAction() const { return MAP_REQUIRE_SUPERFINAL; }

// MAP_NO_SUPERFINAL, the final weight is mapped to a final weight
//
// MAP_ALLOW_SUPERFINAL, final weight mapped to an arc to the
// superfinal state, if the result cannot be represented as a final
// weight (see below?); superfinal state created only if needed)
//
// MAP_REQUIRE_SUPERFINAL, final weight mapped to an arc to the
// superfinal state, unless the weight can be represented as a final
// weight of weight Zero(). superfinal state always created. (N.B. the
// weight of the superfinal state is set to Weight::One())

uint64 Properties(uint64 props) const { return props; }
};



Thanks,

Ken
KennethRBeesley 25 Jun 2008 - 13:17
Following instructions from Cyril, here's a pathCount() function, and a trivial main() for testing, that work for me. If it gets garbled in the posting, I'd be glad to forward it by email to anyone interested.

Ken


#include "fst/lib/fstlib.h"
#include <iostream>
// iostream gives access to cout

using namespace fst ;

// From map.h
// Mapper to map all non-Zero() weights to One().
//template <class A, class B = A>
//struct RmWeightMapper {
//  typedef A FromArc;
//  typedef B ToArc;
//  typedef typename FromArc::Weight FromWeight;
//  typedef typename ToArc::Weight ToWeight;
//  B operator()(const A &arc) const {
//    ToWeight w = arc.weight != FromWeight::Zero() ?
//                   ToWeight::One() : ToWeight::Zero();
//    return B(arc.ilabel, arc.olabel, w, arc.nextstate);
//  }

//  MapFinalAction FinalAction() const { return MAP_NO_SUPERFINAL; }

//  uint64 Properties(uint64 props) const {
//    return props & kWeightInvariantProperties | kUnweighted;
//  }
//}

typedef VectorFst<LogArc> LogVectorFst ;
typedef Fst<LogArc>       LogFst ;
typedef LogArc::StateId StateId ;

// pathCount  return the number of paths in an FST
// (this version assumes that the input is a StdVectorFst)

int pathCount(const StdVectorFst &sfst) {
        // if the fst has loops, then the language or relation 
   // encoded by the FST is infinite, just return -1
   if (sfst.Properties(kCyclic, true)) {
      return -1 ;
   }
   // copy the FST input to a new unweighted network, 
   // over the Log Semiring
   LogVectorFst lfst ;

   // for each original StdArc, if the weight is _not_
   // TropicalWeight::Zero() then change it to LogWeight::One(),
   // else (when the weight is TropicalWeight::Zero()) change it to
   // LogWeight::Zero().  This is handily done with
   // RmWeightMapper(), which is in the library.

   Map(sfst, &lfst, RmWeightMapper<StdArc, LogArc>()) ;

   vector<LogArc::Weight> distance ;
   ShortestDistance(lfst, &distance, true) ;

   // from the distance vector, get the weight for the start state
   LogArc::Weight w = distance[lfst.Start()] ;

   // w.Value() is the -log of the number of paths
   // cout << "Weight from pathCount(): " << w << endl ;
   // so make w positive and get the exp()
   int paths = exp((double)(-1.0 * w.Value())) ;
   // cout << "Paths from pathCount(): " << paths << endl ; 
   return paths ;
}

// a trivial main() that tests pathCount()
//
int main() {
  // Create a little FST over the Tropical Semiring

  StdVectorFst fst ;
  fst.AddState() ;  // will be state 0
  fst.SetStart(0) ;

  fst.AddState() ;  // will be state 1
  fst.AddState() ;  // will be state 2

  // add some arcs
  fst.AddArc(0, StdArc(1, 1, 0.693, 1)) ; // arc from 0 to 1
  fst.AddArc(0, StdArc(2, 2, 0.693, 1)) ; // arc from 0 to 1
  fst.AddArc(0, StdArc(7, 7, 0.0, 1)) ;

  fst.AddArc(1, StdArc(3, 3, 0.693, 2)) ; // arc from 1 to 2
  fst.AddArc(1, StdArc(4, 4, 0.693, 2)) ; // arc from 1 to 2
  fst.AddArc(1, StdArc(5, 5, 0.693, 2)) ;
  fst.AddArc(0, StdArc(8, 8, 0.0, 2)) ;
  // Uncomment the following line to create a cycle/loop,
  // which makes the language or relation infinite.
  //fst.AddArc(1, StdArc(5, 5, 0.0, 1)) ;

  fst.SetFinal(2, 0.693) ;

  int count = pathCount(fst) ;
  cout << "Path Count from main(): " << count << endl << endl ;
  return 0 ;
}
KennethRBeesley 03 Jul 2008 - 16:46
Hi Ken,

I fixed your post.

Cyril
CyrilAllauzen 09 Jul 2008 - 11:41
Cyril,
Thanks again for your help in implementing the path-counting algorithm above.
Could you or anyone help me understand how the code above returns the number of paths? Why does the weight of the start state, in the Log semiring, correspond to the number of paths.? In larger FSTs, some accuracy seems to be lost. In one of our examples, an FST reported (by the algorithm above) to have 1,048,580 paths, has, in fact 4 fewer paths. Not a big error, but it surprised one of our testers who expected to get an exact path count.
Thanks,

Ken
KennethRBeesley 24 Jun 2009 - 19:55
It uses the shortest-distance algorithm describe in FST.ShortestDistanceDoc. In the real semiring, when the weights of all the arcs are 1, the shortest-distance from the initial state to a state q is the number of paths from the initial state to q. However, since the libary does not implement the real semiring, we have to do the computation in the log semiring. In this very special case where the weights are integer and we always multiply by 1, going to the log actually leads to more numerical errors.

There are two things you can try if you want to get better precision:

1. Use a smaller delta. Shortest-distance (like many algorithms in the library), use a delta parameter to determine whether two weights are "equal". By default, this delta is kDelta. This is to make the algorithm converge faster when they are cycles. Here, since the machine is acyclic, you could get away with a much much smaller delta.

2. Use doubles instead of floats. If reducing significantly delta does not help, the imprecisions might be due to floating point errors. You could try using doubles instead of floats by using ArcTpl<LogWeightTpl > instead of LogArc.

Finally, if this is very important to you, the best thing would probably be to implement the (+,x)-semiring over unsigned 64-bit integers.

Cyril
CyrilAllauzen 25 Jun 2009 - 17:02
 

About Factoring Algorithm of WFST

Hi,

In this paper

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley.
Speech Recognition with Weighted Finite-State Transducers.
In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech
Communication, Part E: Speech recognition. Springer-Verlag, Heidelberg, Germany, 2008.

The factoring algorithm is mentioned but no enough details.

Can you tell me where to find the code or detailed descrption of the factoring algorithm?

Thanks in advance!


Zhenyu Pan
ZyPan 30 May 2008 - 13:08
 

Here is a patch to use CMake for builds (tested on Linux and Mac OS X)

Performs build, tests, and installation. Could generate for Cygwin and even Visual Studio; might work with the former, but code is of course UNIX-specific, so won't build with the latter.

Can't upload here, so download from: http://lemurz.org/io/wp-content/uploads/2008/05/OpenFst-beta-20080422-cmakep1.patch.gz
TomMurray 12 May 2008 - 17:42
Small update to install header files with 'make install': http://lemurz.org/io/wp-content/uploads/2008/05/OpenFst-beta-20080422-cmakep2.patch.gz TomMurray 12 May 2008 - 19:03
 

SymbolTable? ::size()

It's minor, but it would be nice to have a SymbolTable? ::size() function that returns the number of keys. MarkusD 23 Apr 2008 - 11:06
 

Efficient Way of Taking the Union of Several FSTs

Hi,

I'm trying to take the union of several thousands of FSTs. After a while the union becomes quite large and the union operation appears to take more and more time to complete. I tried removing epsilon transitions and determinizing after each union. This helps to reduce the time consumed by the union operation, however this time the overall operation (union+rmepsilon+encode+determinize+decode) takes much more time compared to the former case. I wonder if there is an efficient way of taking the union of several FSTs.

Thanks in advance,
Dogan
DoganCan 22 Apr 2008 - 12:57
Hi,

When dealing with the union of a very large number N of FSTs, a good approach is two proceed in log N steps. The first step consists of computing the union of the (2i)-th and (2i+1)-th FSTs for each 0 < i < N/2, leading to a set of N/2 transducer for the next step. The final transducer is obtained as the union of 2 intermediary transducers at the log N step.

Best,
Cyril
CyrilAllauzen 22 Apr 2008 - 17:58
 

possible bug in fstdeterminize in case of log semiring

Hi,

I noticed that the outputs of fstdeterminize and fsmdeterminize from AT&T FSM toolkit do not match in the case of log semiring. Here is a simple fst in log semiring with the mentioned problem and the results of two executables


cat test.txt
4 0 74 0 0.00274658203
4 1 74 0 5.89324951
4 3 20364 0 0.0282592773
4 2 20364 0 3.57971692
4 5 0 1 -1.0986321
0 5 0 1 -4.69826591e-06
0 2 20364 0 3.68112183
0 3 20364 0 0.0255126953
1 5 0 1
1 2 20364 0
2 5 0 1
3 5 0 1
5

fstcompile --arc_type=log test.txt | fstdeterminize | fstprint
0 1 0 1 -1.0986321
0 2 74 0 -1.51862259e-05
0 3 20364 0 -1.98714133e-05
1
2 1 0 1 0.000163570745
2 3 20364 0 0.000163570745
3 1 0 1 4.95321437e-05

fsmcompile -t -s log test.txt | fsmdeterminize | fsmprint
0 1 0 1 -1.0986321
0 2 74 0 -1.51862259e-05
0 3 20364 0 -1.98714133e-05
1
2 1 0 1 -4.6852183e-06
2 3 20364 0 -4.68526787e-06
3 1 0 1 -7.15111692e-10


I believe a previous discussion titled "fstdeterminize" is about the same issue where fstdeterminize changes the path weights in the log semiring. I tried the suggestion

"The issue might be due to numerical instabilities. One solution would then be to use a smaller delta (in DeterminizeFstOptions? )."

with much smaller/larger deltas, but the result of fstdeterminize did not change.

Could you please explain what I'm missing or how to correct if this is actually a bug.

Best,
Dogan
DoganCan 15 Apr 2008 - 10:44
Hi,

The output of fsmdeterminize and fstdeterminize do not match because there are significant differences in the implementation. In that example, both outputs seem valid to me.

However, changing the delta should have had an effect on the result. It appears there was a bug and the delta parameter was ignored when determinizing transducers (delta works as it is supposed to for acceptors).

A new version of the library fixing this issue is available on the download page.

Best,
Cyril
CyrilAllauzen 17 Apr 2008 - 17:28
 

LM using FST

Hi,
I used to use GRM to build LMs with FSM, is FST compatible with GRM ?
If not, is there an alternative in FST ?

Also, What is the equivalent of farcompilestrings & farprintstrings under FST?

Thanks,
Mohamed Noamany
MohamedNoamany 03 Apr 2008 - 20:50
Hi,
Only the text representation of OpenFst and FSM are compatible. So, if you want to use a LM build by GRM in OpenFst, you need to convert it to the OpenFst format using fsmprint / fstcompile:
 fsmprint lm.fsm | fstcompile > lm.fst 
There are no equivalent of farcompilestrings/farprintstrings in OpenFst.

Best,
Cyril
CyrilAllauzen 04 Apr 2008 - 10:50
  JonathanB 16 Oct 2009 - 11:16
 

compilation gotcha with spaces in directory names

Low priority item for the todo list (which will be likely fixed with autoconf but good to put in the test suite): compiling in a directory with spaces in the name (allowed and quite possible under macos) causes $(PWD) in bin/Makefile to break.

Symptom:

localhost:~/cse733/sp08/794L tutorial materials/fst/OpenFst/fst fosler$ make
( cd lib ; make all )
make[1]: Nothing to be done for `all'.
( cd bin ; make all )
make[1]: * No rule to make target `"/Users/fosler/cse733/sp08/794L', needed by `fstarcsort.o'. Stop.
make: * [all] Error 2

Clear and easy workaround is to replace $(PWD) with a hard coded directory with escaped spaces.
EricFoslerLussier 26 Mar 2008 - 21:35
 

Input and output alphabets

I've been using non-disjoint but different input and output
alphabets, along with --keep_osymbols and --keep_isymbols.

Two commands in particular are confusing here. fstinvert
does not understand that the input and output
symbol tables need to be swapped if the result is going
to be correct, and fstproject does not understand that
the alphabets in the result are going to be either both
derived from the osymbols or both derived from the
isymbols. Presumably the trouble is that these two
commands are outliers compared to the rest of the
library, so the infrastructure which they get from the
library doesn't quite fit.
ChrisBrew 14 Mar 2008 - 16:40
This is actually a bug. It should do the right thing with the symbol tables for project and invert. The bug has been fixed and you can download the fixed version of the library on the download page. CyrilAllauzen 14 Mar 2008 - 19:24
Thankyou. There's an easy workaround: don't use different alphabets, which actually
works for me. But better not to have to.
ChrisBrew 15 Mar 2008 - 10:49
Thankyou. There's an easy workaround: don't use different alphabets, which actually
works for me. But better not to have to.
ChrisBrew 15 Mar 2008 - 10:49
Actually, a new bug was introduced in fstinvert in the version released on friday. A new version is now available on the download page fixing this bug. CyrilAllauzen 17 Mar 2008 - 11:51
 

Interpreting complement character ranges

I'm implementing a programming language based on regular expressions, which are interpreted to produce finite-state networks. The current interpreter, which is just starting to work, calls functions in the OpenFst? library, though I've kept the design modular enough to use other interpreters based on other libraries.

Interpreting character ranges: If I parse and interpret the following statement:

$foo = b [aeiou] t ;

a character range like [aeiou] or [a-z] can be handled straightforwardly as a union, equivalent to (a
e i o u) and (a b c d ... z), respectively. No problem. But how, in OpenFst? , can I interpret complemented character ranges such as

$bar = a [^aeiou] e ;

???

If . is used in the syntax to represent "any possible character" (denoting the set of all single-character strings), then [^aeiou] is equivalent to (. - (a
e i o u)), but that just raises the question, How would one interpret the following:

$fum = a . e ;

denoting the set of all three-character strings where the first character is 'a', the second character is "any possible character" and the third is 'e' ?

Thanks,

Ken
KennethRBeesley 13 Mar 2008 - 13:12
Wow. That last message got garbled wherever I used a vertical-bar to indicate union. The first garbled example, expanding [aeiou], was (a verticalbar e verticalbar i verticalbar o verticalbar u ).
The second, equivalent to [a-z] was
(a verticalbar b verticalbar c verticalbar d verticalbar ... z)

The third, equivalent to [^aeiou] was

(. - ( a verticalbar e verticalbar i verticalbar o verticalbar u))

Ken
KennethRBeesley 13 Mar 2008 - 14:02
 

Mindtuning: value of separate input and output symbol tables?

OpenFst allows the use of separate input and output symbol tables for mapping symbol names (strings) to the integers used as labels on an Fst. So, for examples, the label 2 on the input side could be mapped to 'a', and the label 2 on the output side could be mapped to 'b'.

1. What's the perceived value of allowing separate input and output mappings?
2. In practice, is it most common to use the same symbol table for both sides?
Thanks,

Ken
KennethRBeesley 04 Mar 2008 - 11:53
Actually, it is very common to have different symbol tables for each side. For instance, if you a lexicon that maps prononciation of words, i.e. sequence of phonemes, to words. The input symbol table would contains "phonemes" and the output table words. This is a real example from speech recognition (cf. this paper) CyrilAllauzen 04 Mar 2008 - 11:59
Yes, that's a good example. Thanks. In my own work, at least, I would tend to want individual alphabetic symbols like 'a', 'b' and 'c' to be stored (by default) on arcs with their standard Unicode code point values, including IPA symbols in a speech application; and any symbols with multicharacter print names (including whole words, as in the speech-recognition example) would be stored with code point values from a Unicode Private Use Area. Unicode currently offers 137,468 Private Use code points--is that enough for all the words and other multicharacter symbols (e.g. ngrams of phoneme characters) used in a typical speech application? KennethRBeesley 05 Mar 2008 - 15:43
Unfortunately, this solution would not be satisfying for speech recognition. Some very large vocabulary speech recognition systems use vocabularies consisting of several million words (i.e. multicharacter symbols).

Allowing users to define their own input and output symbol tables is the most general approach. For your work, the approach you describe may be more than adequate, and the library indeed allows you to have the same symbol table used as both input and output table for all your FSTs.

Cyril
CyrilAllauzen 06 Mar 2008 - 10:55
 

"Normalizing" arc/final weights on an FST?

I am using FSTs to represent probabilistic finite-state automata and
there are various situations in which I need to "normalize" them--that
is, scale the arc weights such that the probability of arcs leading
out of a state (plus the probability of state finality) sums to 1 for
each state. I am having trouble, however, figuring out how to modify
the weights on arcs of an FST. I have tried the following code, and
while this modifies the final weight appropriately, it doesn't affect the arc weights at all:

void makeProper(VectorFst<StdArc> &fst) {
    using fst::TropicalWeight;
    for(StateIterator<VectorFst<StdArc> > siter(fst); !siter.Done(); siter.Next()) {
      int s = siter.Value();
      float tot = pow(2, -1 * fst.Final(s).Value());
      for(ArcIterator<VectorFst<StdArc> > aiter(fst,s); ! aiter.Done(); aiter.Next()) {
        StdArc arc = aiter.Value();
        float contribution = pow(2, -1 * arc.weight.Value());
        tot += contribution;
      }
      float adjustment = -1 * log(tot) / log(2);
      float newFinal = fst.Final(s).Value() - adjustment;
      fst.SetFinal(s,TropicalWeight(newFinal));
      for(ArcIterator<VectorFst<StdArc> > aiter(fst,s); ! aiter.Done(); aiter.Next()) {
        StdArc arc = aiter.Value();
        float newWeight = arc.weight.Value() - adjustment;
        arc.weight = TropicalWeight(0.0);
      }
    }
  }


How can I modify the arc weights? Or is there some other
functionality I haven't noticed yet that performs this normalization
on its own?

Many thanks once again,

Roger
RogerLevy 02 Mar 2008 - 23:34
Ahh...my naive attempt to use
...
was not entirely successful. My full query with properly-indented code can be found here:

http://idiom.ucsd.edu/~rlevy/normalize_fst_weights

Roger
RogerLevy 02 Mar 2008 - 23:37
Hi Roger,

1. In order to modify an arc, you need to use a MutableArcIterator? instead of an ArcIterator? . Then do aiter.SetValue(arc) at the end of your last for loop. The constructor of aiter should then also take &fst as an argument instead of fst.

2. There is an operation to normalize the weights in the way you describe. It is call pushing: Push(&fst, REWEIGHT_TO_INITIAL)

However, pushing is defined using the semiring operation, hence in your case you'll need to push in the log semiring.
VectorFst<StdArc> fst;
VectorFst<LogArc> log_fst;
Map(fst, &log_fst, StdToLogMapper());
Push(&log_fst, REWEIGHT_TO_INITIAL);
Map(log_fst, &fst, LogToStdMapper());


The total weight will still be present at the initial state. You can remove it by doing after calling Push:
vector<LogWeight> distance;
ShortestDistance(log_fst, &distance, true);
LogWeight total = distance[log_fst->Start()];
MutableArcIterator< VectorFst<LogArc> > aiter(&log_fst, log_fst->Start());
for (; !aiter.Done(); aiter.Next()) {
 LogArc arc = aiter.Value();
 arc.weight = Divide(arc.weight, total);
 aiter.SetValue();
}


P.S. I didn't compile the code so there might be some typos.
P.P.S. I fixed the formating of your original post.
CyrilAllauzen 04 Mar 2008 - 11:25
Many thanks once more, Cyril. (The last call needed to be aiter.SetValue(arc) but that was easy enough to figure out.) The approach I was taking had an error in it, anyway, so Push is much better.

Best

Roger
RogerLevy 04 Mar 2008 - 22:50
One other question: is there an easy way of making LogArc? and LogWeight? operate in base 2 logs rather than in natural logs?

Roger
RogerLevy 05 Mar 2008 - 17:52
Unfortunately, they are no easy way to do this.

Cyril
CyrilAllauzen 06 Mar 2008 - 10:44
Got it. Also, I notice that RealWeight? and RealArc? are not yet implemented, is this correct? (they would be convenient for transparent post-hoc inspection of normalized log-semiring FSTs.)

Best

Roger
RogerLevy 08 Mar 2008 - 15:56
No RealWeight? yet. The reason was we never used the real semiring for anything else than toy example with the FSM library. But, we might decide to add it to the next release since people seem to request it.

Best,
Cyril
CyrilAllauzen 11 Mar 2008 - 11:59
 

Natural code for printing all strings accepted by an FST?

I have written some code for printing all strings accepted by an FST together with their weights. (Clearly a bad idea in general, but useful right now for some tests I'm running with very small FSTs.) Since I am new to both C++ and OpenFst? I wonder whether there is a more natural way to write this functionality. In particular I wonder:

1) is there a better way of checking whether a state is final?

2) whether there are better ways to implement the accumulation of weights along the path through the FST, perhaps by using a Weight class directly?

void printAllStrings(VectorFst<StdArc> &fst) {
  vector<int> str(0);
  printAllStringsHelper(fst, 0, str, 0.0);
}
void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, float cost) {
  if(fst.Final(state).Value() < 100000)
    cout << vectorToString(str) << " with cost " << (cost + fst.Final(state).Value()) << endl;
  for(ArcIterator<VectorFst<StdArc> > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
    StdArc arc = aiter.Value();
    str.push_back(arc.ilabel);
    printAllStringsHelper(fst, arc.nextstate, str, cost + arc.weight.Value());
    str.pop_back();
  }
}
string itos(int i) {   // convert int to string; from Bjarne Stroustrup's FAQ
  stringstream s;
  s << i;
  return s.str();
}
string vectorToString(vector<int> v) {
  if(v.size() == 0)
    return "<>";
  string result = "<" + itos(v[0]);
  for(int i = 1; i < v.size(); i++) {
    result +=  "," + itos(v[i]);
  }
  return result + ">";
}

Many thanks in advance for any pointers!

Roger
RogerLevy 01 Mar 2008 - 21:04
hmm...TWiki has done some damage here to my double-equals equality test, to my plus-equals assignments, and also has erased some templates...hopefully the code is still reasonably readable! I've posted it at

http://idiom.ucsd.edu/~rlevy/print_all_strings

Roger
RogerLevy 01 Mar 2008 - 21:09
1. Yes! There is a much better to test if a state is final.
if (fst.Final(state) == Weight::Zero()) // if state is not final 

Where you can replace Weight by the appropriate weight class (even better do a typedef of the current weight to Weight).

2. Indeed, you should try to use the Weight class operations. You should almost never need to use weight.Value(). Even cout << weight and cin >> weight should work since the operators << and >> are defined.

Remember, the operation to combine weight along a path is Times.
So in your code you should have
TropicalWeight cost
and
  Times(cost, arc.weight) 
P.S. I fixed the formatting of your original post
CyrilAllauzen 04 Mar 2008 - 12:12
Many thanks, Cyril, this is great!

Roger
RogerLevy 04 Mar 2008 - 22:47
Roger,
1. Do you have a final version of your code to print all paths with their weights?, and
2. Would you consider sharing it?
Thanks,

Ken
KennethRBeesley 11 Jun 2008 - 11:52
Hi Ken,

It looks like the current version of my code is pretty similar to the earlier version:

// a bad idea if there are loops :)
void printAllStrings(VectorFst? &fst) {
vector str(0);
TropicalWeight? tw(TropicalWeight? ::One());
printAllStringsHelper(fst, 0, str, tw);
}

void printAllStringsHelper(VectorFst? &fst, int state, vector str, TropicalWeight? cost) {
if(fst.Final(state) = TropicalWeight? ::Zero())
cout << vectorToString(str) << " with cost " << (Times(cost,fst.Final(state))) << endl;
for(ArcIterator? <VectorFst > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
StdArc? arc = aiter.Value();
str.push_back(arc.ilabel);
printAllStringsHelper(fst, arc.nextstate, str, Times(cost, arc.weight.Value()));
str.pop_back();
}
}

string vectorToString(vector v) {
if(v.size()
0)
return "<>";
string result = "<" + itos(v[0]);
for(int i = 1; i < v.size(); i++) {
result +
"," + itos(v[i]);
}
return result + ">";
}

string itos(int i) { // convert int to string; from Bjarne Stroustrup's FAQ
stringstream s;
s << i;
return s.str();
}

I have a similar but separate function pair for LogArc? FSTs. It would be nice to have a single function pair that works for any arc type, and except for the types the code would be the same, but I don't see how to do it because there isn't a common superclass for LogArc? and StdArc? . If someone else sees how to do this, let me know!

And please feel free to use this code :)

Roger
RogerLevy 11 Jun 2008 - 20:17
Ah, I also forgot -- I believe that arc.weight.Value() can be replaced by arc.weight, as per Cyril's suggestion earlier. RogerLevy 11 Jun 2008 - 20:18
Roger,
Many thanks for posting this code. One comment at this point: the initial call to printAllStringsHelper(fst, 0, str, tw) assumes that the start state is state 0. After combining and optimizing fsts, this may not be the case. Better to use
printAllStringsHelper(fst, fst.Start(), str, tw).

Ken
KennethRBeesley 13 Jun 2008 - 11:52
Roger,
Another possibility: generalization to print Input vs. Output strings. Your code currently uses arc.ilabel. You could do

enum Projection { PROJECT_INPUT, PROJECT_OUTPUT } ;

void printAllStrings(VectorFst? &fst, Projection proj) {
...
printAllStringsHelper(fst, fst.Start(), str, tw, proj) ;
}

....

if (proj == PROJECT_INPUT) {
str.push_back(arc.ilabel) ;
} else {
str.push_back(arc.olabel) ;
}

Ken
KennethRBeesley 13 Jun 2008 - 12:15
Roger,

In testing for a final state, I think that your code is incorrect (or garbled in the posting). You have if(fst.Final(state) = TropicalWeight? ::Zero())
but (as I understand it) a state is final if the exit weight is not equal to Weight::Zero(), so if (fst.Final(state) = TropicalWeight? ::Zero()) with a not-equal operator, is what you want.
Straighten me out as necessary.

Ken
KennethRBeesley 13 Jun 2008 - 12:37
Roger,
I see that my "correction" got corrupted the same way when posted. The test should be
if (fst.Final(state) notEqualTo TropicalWeight? ::Zero()) where notEqualTo is the usual exclamation mark followed by an equal-sign.
Ken
KennethRBeesley 13 Jun 2008 - 14:04
 

Why is TropicalWeight considered standard?

In another word, why does StdArc use TropicalWeight instead of FloatWeight or LogWeight. Is there some material that explains the significance of TropicalWeight?

Also, what's the reason that there are Plus(), Times() and Divide() implemented for FloatWeight()? Because it is too trivial? smile
JiaPu 29 Feb 2008 - 18:57
sorry, type. It should read:

".... there are NOT Plus() ..."
JiaPu 29 Feb 2008 - 18:58
sorry, typo. It should read:

".... there are NOT Plus() ..."
JiaPu 29 Feb 2008 - 18:58
StdArc use TropicalWeight and LogArc use LogWeight. StdArc could have been call TropicalArc, but since it is the most commonly used [it correspond to the Viterbi approximation] a shorter name like StdArc was prefered. If you don't like it, you can still do:
 typedef StdArc TropicalArc;

FloatWeight is just an abstract class from which LogWeight and TropicalWeight are derived. It does not correspond to a semiring, that's why it does not have Plus(), Times(), ...
CyrilAllauzen 04 Mar 2008 - 11:43
For the significance of the tropical semiring, you can check out the papers in FstBackground? or this paper or that one. CyrilAllauzen 04 Mar 2008 - 11:50
Cyril,

Please give me an algebra 101. According to the definition of semiring on wiki, float (or rational number) is a semiring. And when authorizing finite state grammar for speech recognition, it is more natural to use float as weight. Although, internally log weight is preferred.
JiaPu 04 Mar 2008 - 15:23
Float in FloatWeight only means than the underlying representation is a C/C++ float type. TropicalWeight and LogWeight both use C/C++ float to represent real numbers, hence they are both derived from the partially abstract type FloatWeight. Of course, the real number with as operation the usual plus and times is also a semiring. It is not implemented in the library for now because it is not very commonly used. We might add it in a future version. If we do, it will also be as a type derived from FloatWeight.

Float just defines the set, it is not enough to define a semiring. Different times and plus operations would lead to different semiring sharing the same set of elements.

Finally, you didn't have to all the way to wikipedia, there is some documentation on here on this website.
CyrilAllauzen 04 Mar 2008 - 19:10
 

Compiler error on Determinize().

When compiling following code:

StdVectorFst in_fst;
... add some states and arcs to in_fst ...
StdVectorFst out_fst
Determinize(fst, out_fst);

I got compiler error:
/Volumes/Data/Projects/OpenFst/macosx/Corpus2SphinxFsg.cpp:82: error: no matching function for call to 'Determinize(fst::StdVectorFst&, fst::StdVectorFst&)'

Since StdVectorFst? is derived from Fst, I don't understand that why this doesn't compile.
JiaPu 28 Feb 2008 - 12:48
Aha, never mind. Just realize that Determinize() expects a pointer instead of a reference on the second argument. But why can't we use reference on both arguments? JiaPu 28 Feb 2008 - 13:18
This is one of our coding conventions: input arguments are declared as const references and output arguments are declared as pointers. CyrilAllauzen 29 Feb 2008 - 14:26
 

New Release of OpenFst?

Are there any plans for a new release of OpenFst? ?
If so, I have some requests ...

Ken
KennethRBeesley 06 Feb 2008 - 17:20
There will be a new release of OpenFst? . You're welcome to make some suggestions. CyrilAllauzen 08 Feb 2008 - 15:20
Cyril,

Very glad to hear about the new release.

I respectfully request the following:

void Copy(&In, *Out) // create a deep copy of the input automaton

Convenience boolean functions for testing an existing network:

isLabelSorted(&Fst, side)
isFunctional
isAcceptor
isWeighted
isEpsilonFree
isDeterministic

Finally, a new text input-output format that is Unicode-capable. I'll try to
write up some more specific suggestions for an XML-based notation.

Thanks for all your good work,
Ken
KennethRBeesley 13 Feb 2008 - 23:53
Hi Ken,

A deep copy can already be performed using Map() with the IdentityMapper, however this might not be obvious at first so a Copy() could indeed be something useful.
StdVectorFst in;
StdVectorFst out;
Map(in, &out, IdentityMapper<StdArc>());
For the boolean functions, I'm not so sure that that:
if(isAcceptor(fst)) ...
is much simpler than:
if(fst.Properties(kAcceptor, true)) ...

Best,
Cyril
CyrilAllauzen 29 Feb 2008 - 12:48
For functional, the issue is that the best algorithms to test for functionality are quadratic in the number of transitions. This prevented us from using it as a property bits, that's why there are no kFunctional. CyrilAllauzen 29 Feb 2008 - 12:53
Thanks Cyril,

Your fst.Properties(kAcceptor, true) example was what I needed to get oriented. No boolean function isAcceptor(fst) is needed. Other beginners like me might want to look in fst/lib/properties.h for more property names, including:

kAcceptor
kNotAcceptor

kWeighted
kNotWeighted

kEpsilons
kNoEpsilons

kIDeterministic (deterministic on the input side)
kNonIDeterministic

kODeterministic (deterministic on the output side)
kNonODeterministic

kILabelSorted (sorted with regard to input labels)
kNotILabelSorted

kOLabelSorted (sorted with regard to output labels)
kNotOLabelSorted
KennethRBeesley 29 Feb 2008 - 17:06
 

Another newbie question

I'm having trouble compiling another example from the Quick Tour page,
to get myself going with FST composition. here it is:

#include "fst/lib/fstlib.h"

int main()
{

using fst::StdVectorFst;
using fst::StdFst;
using fst::StdOLabelCompare;
using fst::StdILabelCompare;
using fst::ArcSort;

StdFst? *input = StdFst? ::Read("input.fst");

StdFst? *model = StdFst? ::Read("model.fst");

ArcSort? (input, StdOLabelCompare? ());
ArcSort? (model, StdILabelCompare? ());

StdVectorFst? result;

Compose(*input, *model, &result);

}

Trying to compile gives me the following error:


rlevy@truffle:~/src/C++$ g++ -I /local/contrib/cpl/OpenFst Example1.cpp /local/contrib/cpl/OpenFst/fst/lib/libfst.so -lpthread
Example1.cpp: In function ‘int main()’:
Example1.cpp:16: error: no matching function for call to ‘ArcSort(fst::StdFst*&, fst::StdOLabelCompare)’
Example1.cpp:17: error: no matching function for call to ‘ArcSort(fst::StdFst*&, fst::StdILabelCompare)’

It's not obvious to me why it's not matching the function properly -- any ideas?

Many thanks!

Roger
RogerLevy 01 Feb 2008 - 01:14
This does not work because the input of ArcSort needs to be a mutable fst (of type MutableFst). Here, input and model are of type StdFst i.e. Fst and not MutableFst.

There are several solutions for this:
1. Restrict the program to only accepts as input VectorFst which is mutable:
StdVectorFst *input = StdVectorFst::Read("input.fst");
StdVectorFst *model = StdVectorFst::Read("model.fst");
Simple but not general.

2. Copy model and input into a VectorFst before sorting:
StdVectorFst input2(*input);
ArcSort(&input2, StdOLabelCompare());

3. Use on-the-fly sorting:
ArcSortFst StdOLabelCompare? > input2(*input, StdOLabelCompare());

Best,
Cyril
CyrilAllauzen 29 Feb 2008 - 13:19
I am getting the same error, except I have already declared mutable FSTs. I would really appreciate some help! My code is of the form:

#include "fst/fstlib.h"

using namespace std;
using namespace fst;

int main()
{
StdVectorFst? fst1;
StdVectorFst? fst2;
StdVectorFst? fst3;

ArcSort? (fst1, StdOLabelCompare? ());
fst3 = ComposeFst? (fst1, fst2);

return 0;
}

Specifically, the compile error is:

no matching function for call to 'ArcSort(fst::StdVectorFst&, fst::StdOLabelCompare)'


Thank you very much,
Jonathan
JonathanB 16 Oct 2009 - 12:03
 

Newbie question: compiling a C++ program using the OpenFst library

This is a newbie question about how to successfully compile a C++
program that refers to the OpenFst? library. I'm a newcomer to C++,
though I'm comfortable with C and with object-oriented programming
(e.g., Java). As a first example I tried constructing a program out
of the example "Creating FSTs Using Constructors and Mutators From
C++":

***

#include "fst/lib/fstlib.h"

int main()
{

using fst::StdVectorFst;
using fst::StdArc;

// A vector FST is a general mutable FST
StdVectorFst? fst;

// Adds state 0 to the initially empty FST and make it the start state.
fst.AddState(); // 1st state will be state 0 (returned by AddState? )
fst.SetStart(0); // arg is state ID

// Adds two arcs exiting state 0.
// Arc constructor args: ilabel, olabel, weight, dest state ID.
fst.AddArc(0, StdArc? (1, 1, 0.5, 1)); // 1st arg is src state ID
fst.AddArc(0, StdArc? (2, 2, 1.5, 1));

// Adds state 1 and its arc.
fst.AddState();
fst.AddArc(1, StdArc? (3, 3, 2.5, 2));

// Adds state 2 and set its final weight.
fst.AddState();
fst.SetFinal(2, 3.5); // 1st arg is state ID, 2nd arg weight

fst.Write("binary.fst");

}

***


When I try to compile this program (with filename Example.cpp) using
g++ version 4.1.2 (on a 64-bit Linux machine), I get some errors which
I'm having a hard time deciphering:

rlevy@morel:~/src/C++$ g++ -I /local/contrib/cpl/OpenFst Example.cpp
/tmp/cc1hh0FE.o: In function `fst::SymbolTable::Write(std::basic_ostream >&) const':
Example.cpp:(.text._ZNK3fst11SymbolTable5WriteERSo[fst::SymbolTable::Write(std::basic_ostream >&) const]+0x1c): undefined reference to `fst::SymbolTableImpl::Write(std::basic_ostream >&) const'
/tmp/cc1hh0FE.o: In function `fst::FstImpl::WriteHeaderAndSymbols(std::basic_ostream >&, fst::FstWriteOptions const&, int, fst::FstHeader*) const':
Example.cpp:(.text._ZNK3fst7FstImplINS_6StdArcEE21WriteHeaderAndSymbolsERSoRKNS_15FstWriteOptionsEiPNS_9FstHeaderE[fst::FstImpl::WriteHeaderAndSymbols(std::basic_ostream >&, fst::FstWriteOptions const&, int, fst::FstHeader*) const]+0xc3): undefined reference to `fst::FstHeader::Write(std::basic_ostream >&, std::basic_string, std::allocator > const&) const'
/tmp/cc1hh0FE.o: In function `fst::CompatProperties(unsigned long long, unsigned long long)':
Example.cpp:(.text._ZN3fst16CompatPropertiesEyy[fst::CompatProperties(unsigned long long, unsigned long long)]+0xd7): undefined reference to `fst::PropertyNames'
/tmp/cc1hh0FE.o: In function `unsigned long long fst::TestProperties(fst::Fst const&, unsigned long long, unsigned long long*)':
Example.cpp:(.text._ZN3fst14TestPropertiesINS_6StdArcEEEyRKNS_3FstIT_EEyPy[unsigned long long fst::TestProperties(fst::Fst const&, unsigned long long, unsigned long long*)]+0x18): undefined reference to `FLAGS_fst_verify_properties'
collect2: ld returned 1 exit status


Could anyone lend me insight into the problem I'm encountering here?

Many thanks, and apologies for the newbie question...

Best

Roger
RogerLevy 27 Nov 2007 - 13:33
You simply need to link with the library (and with the pthread library)!
For instance:
$ g++ -I /local/contrib/cpl/OpenFst Example.cpp /local/contrib/cpl/OpenFst/fst/lib/libfst.so   -lpthread
CyrilAllauzen 27 Nov 2007 - 15:37
Thank you so much Cyril -- this solved the problem!

Best

Roger
RogerLevy 27 Nov 2007 - 19:59
 

Autoconfiscating patch

I have uploaded a very preliminary patch that autoconfiscates OpenFst? . There are four files, a configure.ac, a Makefile.am, and two template files to generate test shell scripts.

Suppose the patch and uncompressed/untarred OpenFst? are in directory ~/stage, then, assuming one has a recent autoconf, automake, and libtool installed and one wishes to install to a subdirectory ~/usr of one's home directory
$ cd ~/stage/OpenFst; patch < ../openfst_auto_20071112.diff
$ autoreconf --install
$ ./configure --prefix=$HOME/usr
$ make
$ make install
will install OpenFst? . I have the impression that the headers go to ~/usr/include/fst/lib, the executables, which all begin with "fst", go to ~/usr/bin, and the library libfst goes to ~/usr/lib.
To run tests for whether dlopen() works
$ make test

I have tested the patch only on Mac OS 10.4. I have tested building in a directory separate from the source directory. I have not tested the installed executables.

Is libfstmain also supposed to be accessible after installation? Right now I believe the patch statically links it into the executables.

I have not extended the patch to work on Cygwin. Right now the patch has no host dependent code, but for Cygwin I believe I would be forced to use AC_CANONICAL_HOST and then test for Cygwin so that the proper flags are passed to produce dynamic modules and not static libraries.

I am willing to keep working on the patch to fill your requirements. Right now the patch does not recursively run make on subdirectories as there are only three subdirectories to process.

I also disclaim any copyright on the patch--the patch is completely in the public domain.
DavidShao 13 Nov 2007 - 00:30
 

Lazy VectorFst?

What happens when you construct a VectorFst from some lazy machine, like in the following example:

Fst* composed = new ComposeFst(fst1, fst2);
VectorFst
fst3(*composed);

Is the result fst3 still a lazy machine?

The background: I need to pass a machine into a function that needs a MutableFst. I can't pass ComposeFst or Fst directly, so I need to transform it to VectorFst, but want to be sure that the states are still only expanded lazily.

Thanks!
Markus
MarkusD 08 Nov 2007 - 13:46
No, fst3 is not a lazy machine. The constructor of VectorFst will expand the lazy machine.

Yes, you need to copy a ComposeFst into a VectorFst before passing it as an argument to a function that needs a MutableFst.

If a function (provided by the OpenFst library) requires its argument to be a MutableFst (or ExpandedFst) that means that the underlying algorithm only works with on expanded machine.

Best,
Cyril
CyrilAllauzen 09 Nov 2007 - 13:58
 

porting openfst to Windows

Hi,

I am starting to work on porting openFst to Windows/VisualC++. Is anyone interested or already working on this subject ?

Thank you,

Chris
ChristopherKermorvant 30 Oct 2007 - 07:10
I am generally interested in seeing OpenFst? ported to Windows, especially if the result remains free, open-source and Apache-licensed. Please keep us informed.

Ken
KennethRBeesley 25 Sep 2008 - 16:47
 

Evaluating the Weights

Hi all,

I have a FST and a training data set (input and output pairs). How should I evaluate the weights for the arcs? I want to count the number of times each arc has been crossed. But each time I compose the FST with the input, I get a new FST and the stateID would change. And I will not know which arc it corresponds to in the original FST. How should I deal with this problem?

Many thanks in advance!

Wei
WeiChen 04 Oct 2007 - 09:10
 

fstminimize

Hi all,

While testing some very simple examples I came across something which confused me w.r.t. the behavior of fstminimize: Running fstrmepsilon on the output of fstminimize can actually produce a more "minimal" FST in some situations. Is this to be expected?

Small example:


#!/bin/bash

function draw_fst() {
fst_file="$1"
dot_file="${1%%.fst}.dot"
ps_file="${1%%.fst}.ps"
fstdraw --title="$fst_file" "$fst_file" "$dot_file"
dot -Tps "$dot_file" >"$ps_file"
gv "$ps_file" &
}

cat <a.txt
0 1 a x
1 2 b y
2
EOF
cat <a.isyms
0
a 1
b 2
EOF
cat <a.osyms
0
x 1
y 2
EOF

fstcompile \
--isymbols=a.isyms \
--osymbols=a.osyms \
--keep_isymbols \
--keep_osymbols \
a.txt a.fst
fstunion a.fst a.fst union.fst
draw_fst union.fst
fstminimize union.fst minimize.fst
draw_fst minimize.fst
fstrmepsilon minimize.fst rmepsilon.fst
draw_fst rmepsilon.fst


Cheers,
Andy
AndySchlaikjer 19 Sep 2007 - 20:05
Seems as though the code got a little mangled in the original post. Each of the lines beginning with "cat " is missing the "here-document" operator (two less-than signs + "EOF").. Hope it makes sense. AndySchlaikjer 19 Sep 2007 - 20:08
This is indeed completely expected.

1. Determinize and Minimize treats epsilon as a regular symbol. Hence, if the input has epsilon transitions, the result is not minimal if you interpret epsilon as the empty word (i.e. true epsilon).

2. Assuming that the input has no epsilon transitions, Minimize will produce the minimal deterministic automaton equivalent to the input. There might be non-deterministic automata with fewer states however.

What you observe is a combination of 1+2 since epsilon-removal indeed interprets epsilon as the empty word and produce a non-deterministic machine.

Best,
Cyril
CyrilAllauzen 20 Sep 2007 - 13:15
Ah, i see now. Union was producing a non-deterministic machine for this example, and i was handing that over to Minimize which actually requires as input a deterministic machine. This was made clearer when I tried to run the sequence "Union, Epsnormalize, Minimize" and Minimize produced the error "FATAL: Input Fst is not deterministic".

So it seems as though there are two ways to deal with this: (1) "Union, Epsnormalize, Determinize, Minimize", to ensure both Determinize and Minimize wouldn't encounter epsilon symbols, or (2) "Union, Minimize, Rmepsilon", though I'm not convinced this second one will always produce equivalent output..

So are there other situations in which it can be useful (perhaps for efficiency?) to run Determinize and Minimize algorithms on machines containing "partial" non-determinism due to epsilon edges?

Thanks,
Andy
AndySchlaikjer 20 Sep 2007 - 14:31
In (1), no need to use EpsNormalize, you can simply use RmEpsilon instead. If you follow (2) by Determinize and Minimize you should get the same result as (1).

Indeed, in some situation you cannot remove epsilons because it will lead to a space blow up. In that case, you might still want to optimize the machine with epsilons using determinization and minimization.

Best,
Cyril
CyrilAllauzen 21 Sep 2007 - 11:52
 

fstdeterminize

Hello,

I have an FST with log-arcs (an acceptor). I use ShortestDistance? to compute the sum of its paths from initial state to all final states, and it somewhere around 0. However, after I run fstdeterminize, the sum of paths becomes -0.26 instead of the value close to 0. Shouldn't fstdeterminize maintain the value of sum of paths?

Thanks

Jerr.
JerrRo 05 Sep 2007 - 15:43
Indeed, fstdeterminize should maintain the value of the sum of all paths. Are you doing anything else than determinization (no minimization for instance)? And is everything indeed done in the log semiring?

The issue might be due to numerical instabilities. One solution would then be to use a smaller delta (in DeterminizeFstOptions? ).

Is your machine cyclic? Is the size of the determinized machine much larger than the original machine? The issue might then be that the machine is not determinizable and that fstdetermize terminates because of the numerical approximations. It is unlikely but possible.

In the log semiring, a cause of non-determinizablility is having two states q and q' both accessible from the initial state with paths labeled with the same input string x and such that q and q' both admits cycles labeled with the same input string y and the weight of that cycles are not equal.
CyrilAllauzen 09 Sep 2007 - 14:10
 

make on 64-bit Linux

Hi,

running make on a 64-bit machine does not work. Adding -fPIC to the CFLAGS in the Makefiles of bin, lib and test solves that problem.

Cheers,

Christian
ChristianKofler 04 Sep 2007 - 11:23
 

deleting transducers?

Suppose I have a subroutine that creates an empty transducer from heap memory, populates it somehow according to passed-in arguments, and passes back a ptr to it

StdVectorFst * makeFst( ... params ...) {
StdVectorFst? *newfstp = new StdVectorFst? () ; // get ptr to an empty FST
// add arcs, specify initial and final states
return newfstp ;
}

And assume that in the calling program I eventually want to delete it. Is it as
simple as calling 'delete' ;

StdVectorFst *ptr = makeFst(...args...) ;
// do something with the Fst, perhaps mutate it
// until I'm done with it
delete ptr ; // will this work as expected and avoid memory leaks?

Thanks,

Ken
KennethRBeesley 23 Aug 2007 - 17:59
Yes, this will work. CyrilAllauzen 23 Aug 2007 - 18:24
 

Crossproduct and Complement

1. Is there a general crossproduct algorithm that takes two acceptor networks and
produces a tranducer that encodes the crossproduct?

2. How about complement/negation? I presume that it would be limited to
unweighted networks?

Thanks,

Ken
KennethRBeesley 23 Aug 2007 - 13:03
1. You simply needs to create a transducer T with one unique state and a transition with labels (a,b) for each input symbol a and each output symbol b. You can then use composition: A o T o B would represent the cross product.

2. Complement is currently a library internal-only operation. It is limited to unweighted deterministic automata.
CyrilAllauzen 23 Aug 2007 - 18:10
2. Forgot to mention that you should use Difference instead of Complement. CyrilAllauzen 25 Aug 2007 - 13:14
 

AT&T FSM format (text format) and Unicode

You confirmed earlier that in Symbol Table Files that map symbols to internal
integer labels, one could simply use real Unicode code point values for the
integers. That was good news, but it leads to a few questions:

1. In the old AT&T documentation I read

"Some FSM operations allocate internal arrays based on the maximum inte-
ger used as an input or output arc symbol. This design choice, chosen
for efficiency, requires the user to avoid huge integer labels (e.g.,
INT_MAX) since memory may otherwise be exhausted."

Is that also the case with OpenFst? ? If so, it would be a major disincentive to
using real code point values, especially if (like myself) you work with characters encoded
in the Unicode supplementary area. Comments?

2. If you are using Supplementary Chars, or even Greek, Cyrillic, Hebrew, etc.,
then symbol table files could intuitively
contains lines such as the following (for the Deseret Alphabet):

?? 66560
?? 66561
?? 66562
?? 66563
?? 66564
?? 66565
etc.

Using such a symbol table, encoded in Unicode UTF-8, I tried 'fstcompile' on the following
file, also encoded in UTF-8

0 1 a b 0.5
1 2 ?? ?? 0.5
2 3 x y 0.5
3 1.0

fstcompile choked on line 2, which contains the first characters in this file
outside the ASCII range.
fstcompile has many options, but I haven't seen anything (yet) that would cause
it to handle the text input and symbol-table files as UTF-8. Have I missed
something?

3. (Previously mentioned topic) If one does use real Unicode code-point values
for the label integers, it would be convenient to allow them to be represented,
in the symbol-table files, in hex, e.g.

?? 0x10400
?? 0x10401
?? 0x10402
?? 0x10403
?? 0x10404
?? 0x10405
etc.

Comments would be welcome.

Thanks,

Ken
KennethRBeesley 23 Aug 2007 - 12:56
I see that the Unicode chars did not survive the reposting process in my question above. When pasted in, each ?? was a Unicode char.
Ken
KennethRBeesley 23 Aug 2007 - 13:05
The library can only handle ascii for symbol files and fst text format.

Note that the use of a symbol file is not required at all. You can directly define your fst using the "code-point value". Or write a script that turn the UTF-8 characters into their "code-point value".
CyrilAllauzen 24 Aug 2007 - 09:30
Thanks for the response.
I see that I asked too many questions in one posting. Sorry.

Here's Question #1 again:

In the old AT&T documentation I read

"Some FSM operations allocate internal arrays based on the maximum inte-
ger used as an input or output arc symbol. This design choice, chosen
for efficiency, requires the user to avoid huge integer labels (e.g.,
INT_MAX) since memory may otherwise be exhausted."

Is that also the case with OpenFst? ??
If so, it could be a major disincentive to using real code point values as labels, especially if (like myself) you work with characters encoded in the Unicode supplementary area.

Thanks,

Ken
KennethRBeesley 24 Aug 2007 - 18:18
No, it is not the case for OpenFst. The library does not allocate such arrays. CyrilAllauzen 29 Aug 2007 - 11:21
 

ShortestDistance?

Hi,

If I am not mistaken, when calculating shortest distance with reverse=true (say, in the log semiring), meaning - the backward values, the weight for states which have no access to the final state should be LogWeight? ::Zero (or +inf). However, when I did that for a machine of the following 0-1> 0->2 1 is final, 2 is not final, I got from ShortestDistance? LogWeight? ::One() (very small value close to 0) when calculating shortestdistance with reverse=true for state 2. Am I misunderstanding something?

Thanks.


Jerr.
JerrRo 17 Aug 2007 - 13:28
You should indeed get LogWeight::Zero() for state 2. However, the size of the vector returned by ShortestDistance can be less than NumStates(),i.e, it is the maximum state visited as mentioned here.

If a state i is such that i < distance.size(), then its shortest distance is distance[i] otherwise it is Weight::Zero().

In the example you gave, I suspect that distance.size() is 1 since 2 would likely not be visited by ShortestDistance when reverse is true.

Best,
Cyril
CyrilAllauzen 17 Aug 2007 - 16:45
 

composition of FSTs

I am trying to compose (programmatically) a transducer with itself several times. The FST is a result of many unions. The composition of that FST with itself takes a long time. I am trying to figure out what could be a good way to do that efficiently. Is composition slower with many epsilon moves? Would it be recommended to first determinize the FST and remove epsilon moves before composing it with itself (the original FST has many epsilon moves, being a result of a union)? JerrRo 14 Aug 2007 - 14:55
Indeed, union introduces epsilon-transitions and epsilon-transitions will slow down composition. Non-determinism will also slow down composition. So, I would recommend you first epsilon-remove and then determinize your fst.

However, there is always a risk with determinization (it might blow up or even not terminate). Hence, if what I suggest does not work better, you should investigate using determinize only, or rmepsilon only or determinize followed by rmepsilon.

Cyril
CyrilAllauzen 14 Aug 2007 - 15:21
Thanks for your quick response. I am using a different semi-ring I defined for the composition. For some odd reason, when trying to determinize the FST I get the error:

FATAL: StringWeight? ::Plus: unequal arguments (non-functional FST?)

I am not sure what it means. I am guessing the mention of StringWeight?
is an error (in the error :)).

Thanks.


Jerr
JerrRo 14 Aug 2007 - 15:37
The library currently only supports the determinization of functional transducers (if two successfull paths have the same input label, they need to also have the same output label). The reason for that is that we use the weighted automata determinization algorithm viewing the output labels as weights in the string semiring (hence StringWeight? in the error message).

A workaround is to use Encode/Decode to view the transducer as an acceptor, considering the pair (ilabel, olabel) as one symbol.

1. Encode:
EncodeMapper encoder(kEncodeLabels, ENCODE);
Encode(fst, &encoder);

2. Determinize.

3. Decode:
Decode(fst, encoder);
CyrilAllauzen 14 Aug 2007 - 16:16
Is it possible that RmEpsilonFst? /ComposeFst have a bug? When I tried creating an RmEpsilon? FST, I got SIGABRT later when trying to compose it with another machine. This means the following doesn't work:

RmEpsilonFst* pRm = new RmEpsilonFst? ( origFst);
ComposeFst
pCompose = new ComposeFst? (someFst, pRm);
(get here SIGABRT - probably because of double free/delete)

while this works:

RmEpsilonFst
pRm = new RmEpsilonFst? ( origFst);
ComposeFst
pCompose = new ComposeFst? (someFst, *origFst);
(just changed compose line, even keep the rm epsilon without using the resulting FST).

MyArc/MyWeight are debugged quite well (I use them in so many other places, and they seem to be working well), so I suspect the problem is not there.

I will try to check the FST library code, to see if anything pops out immediately, but I suspect that won't happen.

Thanks.


Jerr.
JerrRo 14 Aug 2007 - 17:08
Thanks Jerr., I will look into that. In the meantime, I recommend you use the destructive version of epsilon-removal: RmEpsilon(&A).
It is also more efficient in general.

Cyril
CyrilAllauzen 14 Aug 2007 - 20:12
Just to make sure I got it right:

2. Determinize.
Here I should output the fst to some new MutableFst? (maybe it is even possible to call Determinize(myFst, &myFst) without fear?)

3. Decode:
Decode(fst, encoder);
Here I should only decode the older FST (not the new MutableFst? ) to restore it to its previous state. The new determinized MutableFst? will be already decoded, so there is no need to change it.

Thanks.


Jerr.
JerrRo 15 Aug 2007 - 08:03
2. Both should work.
3. You want to decode the result of determinization (using the same encoder as for encoding). You want to decode the original fst only if you still need to use it. Encode and Decode take as argument pointers to MutableFst.
CyrilAllauzen 15 Aug 2007 - 09:44
 

write FST in AT&T FSM format

If I've constructed an fst and can write it out in binary form using
fst.Write("binary.fst")

how can I write out the same FST in AT&T FSM format?

Thanks,

Ken
KennethRBeesley 13 Aug 2007 - 15:51
The library only supports writing fsts in binary format, since this is the "working" format that all the utilities use. The command-line utility fstprint reads a binary fst and writes out its textual (AT&T FSM format) representation.

Best,
Cyril
CyrilAllauzen 14 Aug 2007 - 11:02
In the context of a GUI/IDE built on top of OpenFst? , it might well be useful to have the library support
1) writing out an fst in its textual (AT&T FSM Format) representation, and
2) drawing the fst (via DOT)

Any chance of adding such support to future versions of the library?

Thanks,

Ken
KennethRBeesley 21 Aug 2007 - 14:13
Maybe.

As of now, nothing prevents someone from simply using the FstPrinter and FstDrawer classes (from fst/bin/print-main.h and fst/bin/draw-main.h) by copy-pasting them into one's code.
CyrilAllauzen 24 Aug 2007 - 09:34
 

VectorFst?

Hello,

I am unioning several thousand of FSTs together... When I use the final UnionFst? to create a VectorFst? (new VectorFst? <>(unioned_fst)), it takes a long time to actually create that FST and it also takes a lot of memory. Is there a specific reason for such process to take such a long time when moving from UnionFst? to VectorFst? ?

Thanks.


Jerr.
JerrRo 13 Aug 2007 - 15:47
The reason is that UnionFst is a delayed fst. That means it is only built when visited, i.e., it's only when converting to VectorFst that the actual union is computed.

It will probably be more efficient to convert each intermediary UnionFst to VectorFst.

In most cases, a very efficient way will be to apply epsilon-removal and determinization after each union.

Best,
Cyril
CyrilAllauzen 14 Aug 2007 - 11:11
Thanks! JerrRo 14 Aug 2007 - 14:55
 

Determininizg / Minimizing

Hello,

After minimizing and determinizing a machine programmatically (using DeterminizeFst? and Minimize) I get all kind of isolated states that can never be reached. However, when I use fstminimize and fstdeterminize, this does not happen. Is there a way to get rid of these states programmatically? Right now, I am pretty much doing the same MinimizeMain? and DeterminizeMain? do, and still I get these isolated states.

Thanks.


Jerr.
JerrRo 09 Aug 2007 - 12:53
You can get rid of theses states by using Connect(&mutable_fst).

However, it is not clear to me why these non-accessible states appear. I would need to know more precisely what you are doing.

Best,
Cyril
CyrilAllauzen 09 Aug 2007 - 14:38
 

Multicharacter Labels

The Quick Tour indicates, for symbol-table files: "You may use any string for a label"

Question: Are there any restrictions or conventions on these strings?
Question: If a string label contains more than one character, are there cases in the code where input strings must be 'tokenized' into individual labels based on the labels defined in the symbol-table files? If so, how is such tokenization performed.
Question: How have multicharacter labels been used traditionally in the AT&T/OpenFst tradition? Examples?

Thanks,

Ken
KennethRBeesley 02 Aug 2007 - 14:00
I guess these strings should not contain spaces, tabs and newlines.

The symbol table does not play any role internally. It is only used for displaying and printing. No tokenization is performed. A multicharacter string is always treated as an indivisible label in the library.

An example of multicharacter labels is the example given in the top banner of this page. More seriously, you can check out this paper for an example of multicharacter labels when building a language model.

Best,
Cyril
CyrilAllauzen 02 Aug 2007 - 14:26
How would a literal space be represented in an AT&T FSM format file, and in a symbol-table file?
Thanks,

Ken
KennethRBeesley 23 Aug 2007 - 13:08
It cannot. You would need to replace by an other symbol.

As I was saying before, a symbol list is not required. I created fsts using ascii codes for labels. And in that case, it is simply better not to use a symbol list to create the fsts.
CyrilAllauzen 24 Aug 2007 - 09:39
 

changing an arc weight

Hello,

I have an FST with LogWeight? arcs. I would like to access these arcs and change them programatically. I wanted to do it using ArcIterator? , but Value() returns the arc as const. Is there other way to change the weight on an arc?

thanks.

jerr
JerrRo 01 Aug 2007 - 16:02
You need to use MutableArcIterator instead of ArcIterator. You can then use the SetValue method to modify the arc pointed to by the iterator.
Cyril
CyrilAllauzen 01 Aug 2007 - 17:23
 

labels and integers

The OpenFst? Quick Tour indicates, for symbol table files, that
"You may use any string for a label; you may use any non-negative integer for a label ID. The zero label is reserved for the epsilon label."

OK, but just to make sure, I have a few questions:
Question: Both examples use a dense range 0-3 of integers. Is this required or recommended?
Question: Could one just use Unicode code point values as the integers?
Question: Can one indicate the integer values in hex notation in the symbol-table file(s)?
Question: What is the maximum integer value? Can it handle Unicode supplementary code point values? up to 0x10FFFF?

Thanks,

Ken
KennethRBeesley 26 Jul 2007 - 00:47
You don't have to use dense range of integers.

You can indeed use Unicode values as the integers.

No, you cannot use hex notation in the symbol table file.

The maximum integer value depends on the Arc::Label type. For StdArc? , Label is defined as int (signed 32 bits on most machines). Hence, 0x10FFFF should be fine. You can also define your own Arc type if you wish.

Best,
Cyril
CyrilAllauzen 29 Jul 2007 - 15:12
We use strtoll(...,10) for reading integers. I guess we didn't use strtoll(,...0) (which would allow 012 or 0x12) because people might inadventedly have leading zeroes but want base 10. I could be convinced to change this if people feel strongly about it.

-Michael
MichaelRiley 30 Jul 2007 - 19:51
Cyril/Michael,

Many thanks for your responses, and in general for making OpenFst? available.

I (and I assume many others) are moving consistently toward using Unicode wherever possible for internal character representations, and in Unicode, Hex is the De Facto King. Having to use decimal for Unicode code point values is a nuisance and leads to errors. While I recognize your concern about those who might "inadvertently have leading zeroes but want base 10", I suspect that most of us are pretty well acquainted with the 12 vs. 012 vs. 0x12 distinction.

Possibilities:
1. Change to using strtoll(...,0) to allow 12, 012 and 0x12. Someone might write a trivial symbolTableLint program to scan the files and print warnings if both decimal and (apparently) octal representations are found in the same file.

2. Or better (?), provide some kind of optional header in the symbol-table file, or some other convenient user-specified flag, to indicate that strtoll(...,0) should be used instead of (the default) strtoll(...,10). That would maintain backward/legacy compability; and if I use 0x12 without overtly indicating that strtoll(...,0) should be used, then presumably I'd get a helpful compile-time error.

What do you think?

Ken
KennethRBeesley 02 Aug 2007 - 13:39
 

compiling OpenFst? beta on macosx

I'm running Mac OS X 10.4.10. I downloaded OpenFst? beta and edited bin/Makefile, bin/lib and bin/text, making the changes indicated for macosx. 'make all' seemed to work. But 'make test' yields:

[titania:openfst/OpenFst/fst] beesley% make test
( cd lib ; make all )
make[1]: Nothing to be done for `all'.
( cd bin ; make all )
make[1]: Nothing to be done for `all'.
(cd test ; make test )
g++ -I../.. -O2 -o weight_test.o -c weight_test.cc
g++ -Wl,-L/Users/beesley/fsimpls/openfst/OpenFst/fst/test/../lib -o weight_test weight_test.o -lfst -lm -lpthread -ldl
./weight_test # Tests Weight classes
dyld: Library not loaded: libfst.dylib
Referenced from: /Users/beesley/fsimpls/openfst/OpenFst/fst/test/./weight_test
Reason: image not found
make[1]: * [test_weights] Trace/BPT trap
make: * [test] Error 2


Question: what does this error indicate? and how can I fix it?
Thanks,

Ken
KennethRBeesley 26 Jul 2007 - 00:10
You need to add the paths to fst/lib, fst/bin and fst/test to your DYLD_LIBRARY_PATH environment variable:

export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/Users/beesley/fsimpls/openfst/OpenFst/fst/test:/Users/beesley/fsimpls/openfst/OpenFst/fst/bin:/Users/beesley/fsimpls/openfst/OpenFst/fst/lib
CyrilAllauzen 29 Jul 2007 - 14:48
 

-- CyrilAllauzen - 15 Jun 2007

Topic attachments
I Attachment Action Size Date Who Comment
jpgjpg label-pushing.jpg manage 6.8 K 15 Oct 2009 - 20:30 CyrilAllauzen Label pushing example: result of label pushing as an automaton over the string semiring
elsediff openfst_auto_20071112.diff manage 8.7 K 13 Nov 2007 - 05:13 DavidShao Autoconfiscating patch
jpgjpg reverse-label-pushing.jpg manage 10.9 K 15 Oct 2009 - 20:31 CyrilAllauzen Label pushing example: reverse, label-pushing towards initial state, reverse
Topic revision: r242 - 22 Nov 2009 - 22:00:16 - AlexeiIvanov
 
This site is powered by the TWiki collaboration platformCopyright &© by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback