//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0.
// https://www.boost.org/LICENSE_1_0.txt

/*!
\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search


\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool, allowing you to split text into characters,
words, or sentences. It is commonly used to find appropriate places for line breaks.

\note This is not a trivial task!
\par
A Unicode code point and a character are not equivalent.
For example, the Hebrew word Shalom - "שָלוֹם" - consists of 4 characters and
6 code points (4 base letters and 2 diacritical marks).
\par
Words may not be separated by space characters in some languages like in Japanese or Chinese.

Boost.Locale provides 2 major classes for boundary analysis:

-   \ref boost::locale::boundary::segment_index - an object that holds the index of segments in text (like words, characters,
    sentences). It provides access to \ref boost::locale::boundary::segment "segment" objects via iterators.
-   \ref boost::locale::boundary::boundary_point_index - an object that holds the index of boundary points in text.
    It can iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of the classes above use an iterator type as template parameter.
Both of these classes accept in their constructors:

- A flag that defines boundary analysis \ref boost::locale::boundary::boundary_type "boundary_type".
- The pair of iterators that define the text range to be analysed
- A locale parameter (if not given, the global one is used)

For example:
\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each class implements members \c begin(), \c end() and \c find() making it possible to iterate
over the selected segments or boundaries in the text or find a location of a segment or
boundary for a given iterator.


Convenience typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" are provided as well,
where "w", "u16" and "u32" prefixes define a character type \c wchar_t,
\c char16_t and \c char32_t and "c" and "s" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> are used.

\section boundary_analysys_segments Iterating Over Segments
\section boundary_analysys_segments_basics Basic Iteration

The text segments analysis is done using \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" object.
The segment object represents a pair of iterators that define this segment and a rule according to which it was selected.
It can be automatically converted to \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it:

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question."
// Create mapping of text for token iterator using the global locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout  << "\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

This sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from Tatoeba database</a>)
would be split into following segments in the \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
\endverbatim

The boundary analysis that is done by Boost.Locale
is much more complicated than just splitting the text according
to white space characters, although it is not always perfect.


\section boundary_analysys_segments_rules Using Rules

The segments selection can be customized using \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default, segment_index's iterator returns each text segment defined by two boundary points regardless
the way they were selected. Thus in the example above we could see text segments like "." or " "
that were selected as words.

Using a \c rule() member function we can specify a binary mask of rules we want to use for selection of
boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.

For example, by calling

\code
map.rule(word_any);
\endcode

Before starting the iteration process, specify a selection mask that fetches: numbers, letter, Kana letters and
ideographic characters ignoring all non-word related characters like white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question."
// Create mapping of text for token iterator using the global locale.
ssegment_index map(word,text.begin(),text.end());
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim

And the for given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print:

\verbatim
"生", "死", "問題",
\endverbatim

You can determine why a segment was selected by using the \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function.  The return value is a bit-mask of rules.

For example:

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print:

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
\endverbatim


Note that rules are applied to the end boundary of a segment when deciding
whether to include a segment. In some cases this can cause unexpected behavior.

For example, consider the text:

\verbatim
Hello! How
are you?
\endverbatim

Suppose we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text where sentence terminator like ".!?" are detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text where sentence separators such as "line feed" are detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

Would result in:
\verbatim
Sentence [Hello! ]
Sentence [are you?
]
\endverbatim

These (potentially unexpected) results occur because "How\n" is still considered
a sentence but is selected by a different rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. It will force the iterator to join the current segment with all previous segments even if they do not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

Right after "map.rule(sentence_term);" and get expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?
]
\endverbatim

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find a segment that some specific iterator is pointing to.

For example, suppose we want to find the word a user clicked on.

\ref boost::locale::boundary::segment_index "segment_index" provides
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
member function for this purpose.

This function returns an iterator to the segment that includes \a p.

For example:

\code
text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator  p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note

If the iterator is inside a segment, that segment is returned. If the segment does
not fit the selection rules, then the next segment following the requested position
that does fit the rules will be returned.

For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or ", would point to "to" - the iterator in the middle of segment "to".
- "to |be or ", would point to "be" - the iterator at the beginning of the segment "be".
- "to| be or ", would point to "be" - the iterator is not pointing to a segment fitting the required rule, so next valid segment selected is "be".
- "to be or| ", would point to end as no valid segment can be found.

\section boundary_analysys_break Iterating Over Boundary Points
\section boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar  to
\ref boost::locale::boundary::segment_index "segment_index" in its interface but has a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s, it returns a
\ref boost::locale::boundary::boundary_point "boundary_point" object that
represents a position in text - a base iterator that is used for
iteration of the source text C++ characters.
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that returns why this boundary was selected, i.e. the matched rule.

\note The beginning and the ending of the text are considered boundary points, so even
an empty text consists of at least one boundary point.

Lets see an example of selecting the first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// Our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: "
                << std::string(text.begin(),p->iterator())
                << std::endl;
}
else {
    std::cout   <<"There are less than two sentences in this "
                <<"text: " << text << std::endl;
}\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\section boundary_analysys_break_rules Using Rules

Just like \ref boost::locale::boundary::segment_index "segment_index" the
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter boundary points that interest us.

It allows to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Lets change the example above a bit:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above we would get:
\verbatim
First two sentences are: First sentence. Second
\endverbatim

Which is not really what we expected, because the "Second\n"
is considered an independent sentence that was separated by
a line separator "Line Feed".

However, we can set set the rule \ref boost::locale::boundary::sentence_term "sentence_term"
and the iterator would use only boundary points that are created
by a sentence terminators like ".!?".

So by adding:
\code
map.rule(sentence_term);
\endcode

Right after the generation of the index we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence!
\endverbatim

You can also use the \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn about the reason why this boundary point was created by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator())
                    << "|" << std::string(p->iterator(),text.end())
                    << "]\n";
}
\endcode

Would give the following output:
\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point according to a given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function.

It returns a boundary point on \a p or at the location following \a p if \a p does not point to an appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)

For example, if we want to select 6 words around a specific boundary point we can use following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// Our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore wite space
map.rule(word_any);

// Define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "no|t";

// Get the search range
sboundary_point_index::iterator
    begin = map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// Go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++)
    --it;

// Save the start
std::string::const_iterator start = *it;

// Go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// Make sure we are at a valid position
if(it==end)
    --it;

// Print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

This would print:

\verbatim
 be or not to be, that
\endverbatim


*/
