Re: pseudo-specs for a String class: tokenization

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Mon, 08 Sep 2008 23:52:57 +1200

Alex Rousskov wrote:
> On Fri, 2008-09-05 at 17:47 +0200, Kinkie wrote:
>> On Fri, Sep 5, 2008 at 5:02 PM, Alex Rousskov
>> <rousskov_at_measurement-factory.com> wrote:
>>> On Fri, 2008-09-05 at 10:19 +0200, Kinkie wrote:
>>>> On Fri, Sep 5, 2008 at 4:43 AM, Alex Rousskov
>>>> <rousskov_at_measurement-factory.com> wrote:
>>>>> Just like String, the iterator interface is pretty standard. For our
>>>>> Tokenizer, we can simplify it a little unless others think that
>>>>> compatibility with standard library algorithms is worth the trouble.
>>>>> Here is a sketch:
>>>>>
>>>>> class Tokenizer {
>>>>> public:
>>>>> Tokenizer(); // immediately atEnd
>>>> I'd avoid the default constructor entirely.
>>> Bad idea. The default constructor does not hurt in this case. It does
>>> help when you want another method to initialize the tokenizer or when
>>> you want to reset the already initialized tokenizer.
>> A tokenizer only has meaning when attached to a KBuf (String,
>> whatever); that's what I meant by not having a constructor without an
>> attached KBuf.
>
> From a practical point of view, you may not have the right string to
> "attach" to at the time of construction and attaching to the wrong
> string is worse than meaningless.
>
> From a design point of view, a basic tokenizer that is atEnd() with or
> without the attached buffer is perfectly fine and meaningful because you
> cannot do much with an atEnd tokenizer.
>
> As we add more bells and whistles to the Tokenizer class, the meaning of
> some methods may indeed become vague for an unattached tokenizer. For
> example, what should the originalString() or source() method return if
> we have one? For simplicity's sake, we can solve that problem by
> declaring that the default constructor has the same visible effect as
> the Tokenizer(String(), String()) constructor.
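
To make that equivalence concrete, here is a minimal sketch. It assumes
std::string as a stand-in for the proposed String class, and the member
names (source_, delimiters_, pos_) and the trivial atEnd() are made up
for illustration; none of this is an agreed interface, and there is no
tokenizing logic in it at all.

```cpp
#include <string>

// Assumption: std::string stands in for the proposed String class so
// that the sketch compiles on its own.
typedef std::string String;

class Tokenizer {
public:
    // Same visible effect as Tokenizer(String(), String()):
    // empty source, empty delimiter set, immediately atEnd.
    Tokenizer(): pos_(0) {}

    // The parameters are const references; the tokenizer keeps its
    // own copies of both strings.
    Tokenizer(const String &aString, const String &delimiters):
        source_(aString), delimiters_(delimiters), pos_(0) {}

    // Placeholder: true when there is nothing left to look at.
    bool atEnd() const { return pos_ >= source_.size(); }

    // Well defined even for an unattached tokenizer: an empty String.
    const String &source() const { return source_; }

private:
    String source_;
    String delimiters_;
    String::size_type pos_;
};
```

With something like this, a class can hold a default-constructed
Tokenizer member and assign a real one later, once the right string is
known.
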
>
>>>> I'd rather add a version which takes the String but not the delimiters.
>>> I would recommend avoiding implicit conversions from String to anything
>>> and I doubt there is a reasonable set of default delimiters.
>> Why would there be an implicit conversion?
>
> Ask Amos -- he has suffered enough from it to give an entertaining
> answer :-). Or see the attached source file.

All my failed attempts were broken by the String/MemBuf size separation.
I was attempting to extend String with implicit conversions as-is, but
got hamstrung when memory buffers were cast to char*, the MemBuf data
pointers were silently converted to Strings, and the copy-allocator size
asserts kicked in :-)

This new method of attack should not encounter that, due to two
differences: first, the lack of a buffer size assert :-), and second,
the lack of any need for an implicit conversion between the types.

>
>> And you're right, just as a Tokenizer has no meaning without a KBuf,
>> then it also has none without delimiters.
>
> Tokenizer may have meaning when it has nothing. We could assign some
> meaning to a Tokenizer that has a string but not delimiters (e.g., treat
> that as an empty delimiter set), but I think such unusual usage should
> be explicit: Tokenizer(myString, String()).
>
>>>>> Tokenizer(const String &aString, const String &delimiters);
>>>> String arg must be passed by value (which translates to refcounted ref
>>>> to the data). Passing by reference will alias the String, falling back
>>>> into the original problem.
>>> I am sorry, but you are mistaken. String arguments like that should be
>>> passed by reference. I do not know what "alias the String" means, but it
>>> is perfectly safe and noticeably more efficient to pass Strings by
>>> reference in contexts like that. This has been discussed recently
>>> already.
>> The point is that passing an object by (c++) reference does not create
>> a new copy. No new copy means that the (Kbuf-level) refcounts do not
>> get increased, it is "just" an alias for the original KBuf, with all
>> that it means
>
> That is exactly what we want for a method parameter.
>
>> (no content freezing, can be appended to while the
>> Tokenizer is running, etc).
>
> Tokenizer itself cannot append to the string parameter because it is a
> const parameter (being also a reference is irrelevant here). Code that
> has access to a non-const copy of the same string is free to modify the
> string, of course. It all "just works" and is standard practice.
>
> If you are thinking about threads, then references must not be passed
> across thread boundaries. However, Tokenizer will never be a thread so,
> again, there is no API problem here either.
>
>> As long as we're dealing with the KBuf class itself, that's no problem
>> and is a welcome optimization. But here we must make sure that
>> somewhere a copy of the KBuf object is created. Granted, it may be
>> done WITHIN the call; I'll just have to make sure that it gets done.
>
> If it is not done within the call, then there is still no danger (but no
> string to work with either!).
>
> The only danger here is that somebody will declare a Tokenizer class
> data member of the reference type and store a reference. That danger
> exists regardless of the constructor parameter type and no design can
> eliminate it. Hopefully, such bugs will be caught by review (or by
> compiler, if we have code that assigns Tokenizers).
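
The parameter-passing point can be sketched as follows, again with
std::string standing in for String. Note that the copy made here is a
deep copy, whereas the refcounted String under discussion would
presumably only take another reference to the shared data.
CopyingTokenizer and sourceAfterCallerAppends() are hypothetical names
used only for this illustration.

```cpp
#include <string>

// Assumption: std::string stands in for the proposed String class.
typedef std::string String;

class CopyingTokenizer {
public:
    // const reference parameters: no copy is made for the call itself,
    // and the constructor cannot modify the caller's strings.
    CopyingTokenizer(const String &aString, const String &delimiters):
        source_(aString), delimiters_(delimiters) {}

    const String &source() const { return source_; }

private:
    // Data members are values, so the copy happens "within the call".
    // Declaring "const String &source_;" here instead would be the
    // stored-reference bug described above.
    String source_;
    String delimiters_;
};

// Hypothetical helper: the caller appends to its own string after the
// tokenizer has been constructed.
inline String sourceAfterCallerAppends() {
    String s("a,b");
    CopyingTokenizer t(s, ",");
    s += ",c"; // modifies the caller's string only
    return t.source();
}
```

The tokenizer still sees the string as it was at construction time,
even though the argument was passed by reference.
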
>
>>> If we stick with named interfaces, then do this:
>>>
>>> // move to the next token, named and STL-like interfaces
>>> Tokenizer &operator ++() { return next(); }
>>> Tokenizer &next();
>> Agreed.
>>
>>> so that you can write:
>>>
>>> tokenizer.next().token()
>> No.
>> If knowing what the actual separator was is important, I'd rather have:
>> bool next(); //returns false @end-of-string
>> KBuf& token();
>> char separator();
>>
>> so it becomes:
>> while (tokenizer.next()) {
>>     KBuf t=tokenizer.token();
>> }
>
> The above loop misses the first token and I think you are switching
> topics (the original question was how to design nextToken and not how to
> loop).
>
> For nextToken, next().token() can hardly be improved. It is not perfect
> because next() may end up at the end of the string, but that is a
> problem with the nextToken idea itself, not the implementation.
>
> For looping, I have already posted the correct looping sketch. Here it
> is with named interfaces:
>
> for (Tokenizer tMaker(str, dels); !tMaker.atEnd(); tMaker.next()) {
>     String token = tMaker.token();
>     ...
> }
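
For reference, one way the named interface could be made to work with
that loop is sketched below. This is only an illustration under
several assumptions: std::string stands in for String, the delimiters
are treated as a set of chars, runs of delimiters are skipped
strtok-style (so no empty tokens are produced), and the private
members, findFrom(), and collectTokens() are hypothetical names.

```cpp
#include <string>
#include <vector>

// Assumption: std::string stands in for the proposed String class.
typedef std::string String;

class Tokenizer {
public:
    // Unattached tokenizer: immediately atEnd.
    Tokenizer(): tokenStart_(String::npos), tokenEnd_(String::npos) {}

    // Copies both strings and positions itself on the first token.
    Tokenizer(const String &aString, const String &delimiters):
        source_(aString), delimiters_(delimiters),
        tokenStart_(String::npos), tokenEnd_(String::npos)
    {
        findFrom(0);
    }

    bool atEnd() const { return tokenStart_ == String::npos; }

    // Current token; an empty String once atEnd.
    String token() const {
        return atEnd() ? String() :
            source_.substr(tokenStart_, tokenEnd_ - tokenStart_);
    }

    // Move to the next token; named and STL-like interfaces.
    Tokenizer &next() {
        if (!atEnd())
            findFrom(tokenEnd_);
        return *this;
    }
    Tokenizer &operator ++() { return next(); }

private:
    // Hypothetical helper: locates the first token at or after `from`.
    void findFrom(String::size_type from) {
        tokenStart_ = source_.find_first_not_of(delimiters_, from);
        if (tokenStart_ == String::npos)
            return; // only delimiters (or nothing) left: atEnd
        tokenEnd_ = source_.find_first_of(delimiters_, tokenStart_);
        if (tokenEnd_ == String::npos)
            tokenEnd_ = source_.size(); // last token runs to the end
    }

    String source_;
    String delimiters_;
    String::size_type tokenStart_;
    String::size_type tokenEnd_;
};

// Hypothetical helper wrapping the loop quoted above.
std::vector<String> collectTokens(const String &str, const String &dels) {
    std::vector<String> tokens;
    for (Tokenizer tMaker(str, dels); !tMaker.atEnd(); tMaker.next()) {
        String token = tMaker.token();
        tokens.push_back(token);
    }
    return tokens;
}
```

Because the constructor already stands on the first token, the loop
sees every token, and next().token() on a fresh tokenizer yields the
second one.
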
>
>>>> Would anyone think it'd be useful to have non-single-char delimiters?
>>>> It'd complicate the called code quite a bit, but if it's useful and it
>>>> simplifies the calling code...
>>> I think we will eventually have a DelimiterSet or StrFinder class so
>>> that we can support string delimiters, RE delimiters, and arbitrary code
>>> delimiters. I would not change the API though. Currently, our
>>> DelimiterSet or StrFinder is a String class, which is interpreted
>>> internally as a set of chars. Eventually, that interpretation would be
>>> up to the passed finder object...
>>>
>>> You can typedef Tokenizer::Finder to String right now. I did not propose
>>> that to keep things simple. It would be easy to change later without any
>>> effects on the caller code.
>> Nah. Might as well do that now.
>> And the class hierarchy makes sense. A pure-virtual class which
>> ignores the actual matching method.
>
> Are you sure you want to dive into that now? Given how much time it
> takes us to agree on trivial/standard things, would it not be better to
> start with something simple, well-designed, and immediately useful? And
> then add Finder if we need it? Again, given a correct implementation,
> the callers will most likely see no difference when we add Finder
> support.
>
>> How would Kbuf::Tokenizer::Finder sound? Would even avoid the
>> wart of friend classes.
>
> If you insist on starting to complicate things now, then:
>
> - Tokenizer and StringFinder should be stand-alone classes. There is no
> reason to place them inside a String or buffer class. Classes are not
> namespaces. Same for placing StringFinder inside Tokenizer.
>
> - StringFinder will have a virtual find() function that determines where
> the matching substring is. It will be used by Tokenizer to find
> delimiters (not tokens!). There may be a couple of ways to design this
> right, but I would rather not spend time on it now, in the hope that you
> will agree that we should focus on the basics first.
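
Purely to show the shape of that idea, here is a rough sketch of a
stand-alone StringFinder with a virtual find(). Everything beyond the
names StringFinder and find() is an assumption made for illustration:
the find() signature, the CharSetFinder and SubstringFinder subclasses,
and the splitWith() helper that stands in for the Tokenizer side.
std::string again substitutes for String.

```cpp
#include <string>
#include <vector>

// Assumption: std::string stands in for the proposed String class.
typedef std::string String;

// Stand-alone class; locates delimiters, not tokens.
class StringFinder {
public:
    virtual ~StringFinder() {}
    // Returns the position of the first delimiter at or after `from`
    // and stores its length, or returns String::npos if none is found.
    virtual String::size_type find(const String &haystack,
        String::size_type from, String::size_type &length) const = 0;
};

// Today's behaviour: the String is interpreted as a set of chars.
class CharSetFinder: public StringFinder {
public:
    explicit CharSetFinder(const String &chars): chars_(chars) {}
    virtual String::size_type find(const String &haystack,
        String::size_type from, String::size_type &length) const {
        length = 1;
        return haystack.find_first_of(chars_, from);
    }
private:
    String chars_;
};

// A multi-char string delimiter such as "\r\n".
class SubstringFinder: public StringFinder {
public:
    explicit SubstringFinder(const String &needle): needle_(needle) {}
    virtual String::size_type find(const String &haystack,
        String::size_type from, String::size_type &length) const {
        length = needle_.size();
        if (needle_.empty())
            return String::npos; // an empty delimiter never matches
        return haystack.find(needle_, from);
    }
private:
    String needle_;
};

// Hypothetical caller: splits on whatever the finder reports,
// skipping empty tokens.
std::vector<String> splitWith(const String &source,
    const StringFinder &finder) {
    std::vector<String> tokens;
    String::size_type pos = 0;
    while (pos < source.size()) {
        String::size_type len = 0;
        const String::size_type hit = finder.find(source, pos, len);
        const bool found = hit != String::npos && len > 0;
        const String::size_type end = found ? hit : source.size();
        if (end > pos)
            tokens.push_back(source.substr(pos, end - pos));
        if (!found)
            break;
        pos = hit + len;
    }
    return tokens;
}
```

The calling code only changes in which finder it passes, which is the
"callers see no difference" property mentioned above.
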
>
> Thank you,
>
> Alex.
>
>

-- 
Please use Squid 2.7.STABLE4 or 3.0.STABLE8
Received on Mon Sep 08 2008 - 11:53:37 MDT
