J5’s Blog

January 11, 2011

Time keeps on ticking

Filed under: Gnome, Python, community, conference, performance — J5 @ 6:54 pm

I reached a bit of a milestone today on the new PyGObject Introspection invoke code I have been working on. I can now run a handful of tests that marshal ins, outs and return values. It is mostly just basic types right now, but it works. Of significant importance is that I got the first torture test to run. We call a simple interface with a couple of in and out parameters 10000 times in a loop. Here is the output from the old implementation:


test_torture_profile (test_everything.TestTortureProfile) ...
        torture test 1 (10000 iterations): 0.240000 secs

and now from my new implementation:


test_torture_profile (test_cache.TestTortureProfile) ...
        torture test 1 (10000 iterations): 0.070000 secs

That is more than a 3x speedup for a simple case (0.24 secs down to 0.07 secs). Of course there is still a lot of work to do to handle more complex types and all of the edge cases, but again, progress. I’m probably losing some of the speed gains by moving to function calls instead of a big switch statement. One of the benefits of splitting everything up, though, is that when issues occur I know exactly where they are happening, instead of having to scroll up the code to see if I am decoding or encoding and which type is causing the issue.
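For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing loop in Python. The `simple_call` function is a hypothetical stand-in for the introspected interface, not the real test method, and the speedup arithmetic simply reuses the two timings quoted in the output.

```python
import time

def simple_call(a, b):
    """Hypothetical stand-in for an introspected call with in and out values."""
    return a + b, a * b  # a return value plus one "out" value

def torture(func, iterations=10000):
    """Call func in a tight loop and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for i in range(iterations):
        func(i, 2)
    return time.perf_counter() - start

elapsed = torture(simple_call)
print("torture test 1 (10000 iterations): %f secs" % elapsed)

# Speedup implied by the two runs quoted in the post
old_secs, new_secs = 0.24, 0.07
speedup = old_secs / new_secs
print("speedup: %.2fx" % speedup)
```

The absolute numbers will differ from machine to machine; only the ratio between the old and new implementation is meaningful.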

The hackfest is next week in Prague. Note to those going, because of the small amounts involved in your travel costs, I will be handling reimbursement in Euros or CZK. We will figure it all out when I get there.

I’ve been so busy I also forgot to thank Collabora, which is sponsoring the hotel, a thank you dinner, beer and a coffee machine for the hackspace, which is donating rooms for us to hack in. I’m looking forward to having a great time and getting some work done. Hopefully this blizzard that is hitting us tonight won’t affect my travel.

GNOME Foundation Sponsored

January 6, 2011

Cleaning up invoke with a caching layer

Filed under: Gnome, performance — J5 @ 7:32 pm

I’ve gotten pretty far with the new caching layer for PyGObject. Nothing compiles yet but the skeleton code is sitting on the invoke-rewrite branch. Right now I am working with the “in” marshallers and validators. I came to the conclusion that validation and marshalling are really intertwined, so it is much clearer to do the validation during the marshalling step. Now that I am breaking things up into cacheable functions, this hot mess:

UINT8 validation

        case GI_TYPE_TAG_UINT8:
            /* UINT8 types can be characters */
            if (PYGLIB_PyBytes_Check(object)) {
                if (PYGLIB_PyBytes_Size(object) != 1) {
                    PyErr_Format (PyExc_TypeError, "Must be a single character");
                    retval = 0;
                    break;
                }

                break;
            }
        case GI_TYPE_TAG_INT8:
        case GI_TYPE_TAG_INT16:
        case GI_TYPE_TAG_UINT16:
        case GI_TYPE_TAG_INT32:
        case GI_TYPE_TAG_UINT32:
        case GI_TYPE_TAG_INT64:
        case GI_TYPE_TAG_UINT64:
        case GI_TYPE_TAG_FLOAT:
        case GI_TYPE_TAG_DOUBLE:
        {
            PyObject *number, *lower, *upper;

            if (!PyNumber_Check (object)) {
                PyErr_Format (PyExc_TypeError, "Must be number, not %s",
                              object->ob_type->tp_name);
                retval = 0;
                break;
            }

            if (type_tag == GI_TYPE_TAG_FLOAT || type_tag == GI_TYPE_TAG_DOUBLE) {
                number = PyNumber_Float (object);
            } else {
                number = PYGLIB_PyNumber_Long (object);
            }

            _pygi_g_type_tag_py_bounds (type_tag, &lower, &upper);

            if (lower == NULL || upper == NULL || number == NULL) {
                retval = -1;
                goto check_number_release;
            }

            /* Check bounds */
            if (PyObject_RichCompareBool (lower, number, Py_GT)
                    || PyObject_RichCompareBool (upper, number, Py_LT)) {
                PyObject *lower_str;
                PyObject *upper_str;

                if (PyErr_Occurred()) {
                    retval = -1;
                    goto check_number_release;
                }

                lower_str = PyObject_Str (lower);
                upper_str = PyObject_Str (upper);
                if (lower_str == NULL || upper_str == NULL) {
                    retval = -1;
                    goto check_number_error_release;
                }

#if PY_VERSION_HEX < 0x03000000
                PyErr_Format (PyExc_ValueError, "Must range from %s to %s",
                              PyString_AS_STRING (lower_str),
                              PyString_AS_STRING (upper_str));
#else
                {
                    PyObject *lower_pybytes_obj = PyUnicode_AsUTF8String (lower_str);
                    if (!lower_pybytes_obj)
                        goto utf8_fail;

                    PyObject *upper_pybytes_obj = PyUnicode_AsUTF8String (upper_str);
                    if (!upper_pybytes_obj) {
                        Py_DECREF(lower_pybytes_obj);
                        goto utf8_fail;
                    }

                    PyErr_Format (PyExc_ValueError, "Must range from %s to %s",
                                  PyBytes_AsString (lower_pybytes_obj),
                                  PyBytes_AsString (upper_pybytes_obj));
                    Py_DECREF (lower_pybytes_obj);
                    Py_DECREF (upper_pybytes_obj);
                }
utf8_fail:
#endif
                retval = 0;

check_number_error_release:
                Py_XDECREF (lower_str);
                Py_XDECREF (upper_str);
            }

check_number_release:
            Py_XDECREF (number);
            Py_XDECREF (lower);
            Py_XDECREF (upper);
            break;
        }

UINT8 marshalling

        case GI_TYPE_TAG_UINT8:
            if (PYGLIB_PyBytes_Check(object)) {
                arg.v_long = (long)(PYGLIB_PyBytes_AsString(object)[0]);
                break;
            }

        case GI_TYPE_TAG_INT8:
        case GI_TYPE_TAG_INT16:
        case GI_TYPE_TAG_UINT16:
        case GI_TYPE_TAG_INT32:
        {
            PyObject *int_;

            int_ = PYGLIB_PyNumber_Long (object);
            if (int_ == NULL) {
                break;
            }

            arg.v_long = PYGLIB_PyLong_AsLong (int_);

            Py_DECREF (int_);

            break;
        }

becomes this easier to read function:

UINT8 cached marshalling and validation function

gboolean
_pygi_marshal_in_uint8 (PyGIState         *state,
                        PyGIFunctionCache *function_cache,
                        PyGIArgCache      *arg_cache,
                        PyObject          *py_arg,
                        GIArgument        *arg)
{
    long long_;

    if (PYGLIB_PyBytes_Check(py_arg)) { 

        if (PYGLIB_PyBytes_Size(py_arg) != 1) {
            PyErr_Format (PyExc_TypeError, "Must be a single character");
            return FALSE;
        }

        /* Cast through unsigned char so bytes above 127 are not sign-extended */
        long_ = (long)(unsigned char)(PYGLIB_PyBytes_AsString(py_arg)[0]);

    } else if (PyNumber_Check(py_arg)) {
        PyObject *py_long;
        py_long = PYGLIB_PyNumber_Long(py_arg);
        if (!py_long)
            return FALSE;

        long_ = PYGLIB_PyLong_AsLong(py_long);
        Py_DECREF(py_long);

        if (PyErr_Occurred())
            return FALSE;
    } else {
        PyErr_Format (PyExc_TypeError, "Must be number or single byte string, not %s",
                      py_arg->ob_type->tp_name);
        return FALSE;
    }

    if (long_ < 0 || long_ > 255) {
        PyErr_Format (PyExc_ValueError, "%li not in range %i to %i", long_, 0, 255);
        return FALSE;
    }

    arg->v_long = long_;

    return TRUE;
}

Notice the if (long_ < 0 || long_ > 255) check. With the previous version we were actually converting the ranges to python objects and comparing them via the python RichCompare interface. Now since we have to decode the value anyway, we can just do a simple C comparison which is an order of magnitude faster. This doesn't even show some of the areas where we are decoding twice because of the split validation and marshalling routines. In any case ... progress ... though we won't know how much until I have implemented enough to get it running the test suite.
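To make the caching idea concrete, here is a rough Python sketch of the same pattern: one function that validates and converts in a single step, plus a per-function cache that resolves the marshallers once. The `MARSHALLERS` table, `build_function_cache` and `invoke` are names made up for this illustration; they are not the PyGObject API.

```python
def marshal_in_uint8(py_arg):
    """Validate and convert in one step, mirroring the C function above."""
    if isinstance(py_arg, bytes):
        if len(py_arg) != 1:
            raise TypeError("Must be a single character")
        value = py_arg[0]
    elif isinstance(py_arg, (int, float)):
        value = int(py_arg)
    else:
        raise TypeError("Must be number or single byte string, not %s"
                        % type(py_arg).__name__)
    # Plain integer comparison; no bounds objects are built per call.
    if value < 0 or value > 255:
        raise ValueError("%d not in range 0 to 255" % value)
    return value

# Hypothetical type-tag table; the real tags live in GObject Introspection.
MARSHALLERS = {"uint8": marshal_in_uint8}

def build_function_cache(arg_type_tags):
    """Resolve each argument's marshaller once, when the function is first seen."""
    return [MARSHALLERS[tag] for tag in arg_type_tags]

def invoke(function_cache, py_args):
    """Per call, just run the cached marshallers; no type lookups are repeated."""
    return [marshal(arg) for marshal, arg in zip(function_cache, py_args)]

cache = build_function_cache(["uint8", "uint8"])
print(invoke(cache, [b"A", 200]))   # [65, 200]
```

The point of the split is that the expensive part (figuring out which marshaller applies to which argument) happens once per function rather than once per call.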

Other advantages come from reorganizing invoke, including preliminary support for default values. I added a way for user_data to default to NULL if not passed in by the user. It is designed in such a way that any default value can be added to an argument cache and invoke should be able to substitute it. This means that once GI adds a way to specify default values we simply need to marshal them into the corresponding cache and it should just work. Another enhancement that should be easy to add later is keyword argument support. Since we are normalizing the data in the cache, theoretically one should be able to pick a random argument and marshal it without having to process the argument list linearly. That, however, will be a later feature.
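As a rough illustration of how defaults and keyword arguments could fall out of a normalized cache, here is a small Python sketch. `ArgCache`, `NO_DEFAULT` and `resolve_args` are hypothetical names for this example only, and the real cache holds far more than a name and a default.

```python
NO_DEFAULT = object()  # sentinel: None is itself a valid default (NULL)

class ArgCache:
    """Hypothetical per-argument cache entry: a name and an optional default."""
    def __init__(self, name, default=NO_DEFAULT):
        self.name = name
        self.default = default

def resolve_args(arg_caches, args, kwargs):
    """Fill each cached slot from positional args, keywords, or its default."""
    if len(args) > len(arg_caches):
        raise TypeError("too many arguments")
    resolved = []
    for index, cache in enumerate(arg_caches):
        if index < len(args):
            resolved.append(args[index])
        elif cache.name in kwargs:
            resolved.append(kwargs[cache.name])
        elif cache.default is not NO_DEFAULT:
            resolved.append(cache.default)
        else:
            raise TypeError("missing argument: %s" % cache.name)
    return resolved

caches = [ArgCache("callback"), ArgCache("user_data", default=None)]
print(resolve_args(caches, ("cb",), {}))                              # ['cb', None]
print(resolve_args(caches, (), {"user_data": 7, "callback": "cb"}))   # ['cb', 7]
```

Because every slot is addressed through its cache entry, the order in which the caller supplies values stops mattering.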

December 22, 2010

Reality Bytes

Filed under: Gnome, Python, performance — J5 @ 7:59 pm

I’m sitting here at my parents’ house (I wanted to beat the snow for the holidays) working on the document that will guide the development of the new invoke module for PyGI. To start, I attacked it from a very high level, idealistic standpoint where we could both modularize and optimize the code all in one sweeping iteration over the GI function arguments.

Of course I knew there would be edge cases and a few places where the model needed to be tweaked. As I dived more and more into the details (I’m designing the caching layer right now) I realized there isn’t enough information in GI to completely get rid of processing my own metadata, nor are the args guaranteed to be able to be processed linearly. For instance most library init functions take argc (array length) before argv (the array). Of course this isn’t to say that GI and the typelibs are not complete. They are more than complete, just that we haven’t normalized the state machine.

This all necessitates a metadata sorting loop, which raises the question: why don’t we just combine that with the cache and state initialization layers and then just loop over the in args during the in validation and marshalling layer? We could, for a majority of in arguments, process them in the metadata loop, but for the few edge cases that break the model it would mean separate code paths and increased complexity.
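The argc/argv case can be sketched in a few lines of Python. This is only a toy model under my own assumptions: `ARG_INFO` is hypothetical metadata, and the two passes stand in for the metadata loop and the in marshalling loop described above.

```python
# Hypothetical metadata for an init-style function: (name, kind, index of the
# argument holding this array's length, or None).
ARG_INFO = [
    ("argc", "int", None),
    ("argv", "array", 0),   # argv's length is stored in argument 0 (argc)
]

def build_arg_plan(arg_info):
    """Metadata pass: note which arguments are lengths of some array."""
    length_of = {}  # index of length argument -> index of its array
    for index, (_name, kind, length_index) in enumerate(arg_info):
        if kind == "array" and length_index is not None:
            length_of[length_index] = index
    return length_of

def marshal_in(arg_info, length_of, py_args):
    """Marshalling pass: the caller passes only arrays; lengths are derived."""
    py_iter = iter(py_args)
    c_args = [None] * len(arg_info)
    for index in range(len(arg_info)):
        if index not in length_of:      # length slots are filled from the array
            c_args[index] = next(py_iter)
    for length_index, array_index in length_of.items():
        c_args[length_index] = len(c_args[array_index])
    return c_args

plan = build_arg_plan(ARG_INFO)
print(marshal_in(ARG_INFO, plan, [["prog", "--verbose"]]))
# [2, ['prog', '--verbose']]
```

The length argument comes first in the C signature but can only be filled in after the array has been seen, which is exactly why a strictly linear pass is not enough.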

The truth is, even though the paper is entitled Speeding Up Invoke, my real goals are to fully understand what is going on under the hood while explaining it to others, and to modularize the code so it is easier to understand (and optimize) going forward. In any case almost anything we do will be an improvement over the multiple times we call into the GI library for the same data in one invoke call. By producing a roadmap and breaking the code into smaller chunks, more people will be able to look at the code and provide optimization, feature and stability fixes. Actually, the way things are laid out in the paper, we will already fix some of the limitations: simplified argument counting that allows optional user_data values, multiple callback support, and multiple lists/arrays referencing one length in value (see clutter_actor_animatev for an example).

The good news is that once I fully flesh out the paper, implementing it should take a couple of days. Most of it is just rearranging code. I hope to get to it before the hackfest in the middle of January.

December 13, 2010

Mapping out PyGI Invoke

Filed under: Python, performance — J5 @ 5:42 pm

Download: PyGI speeding up invoke 0_1 draft.pdf – this is a highly incomplete first draft of the paper.  It lays out my design goals, explains the invoke flow and gets down to the details of in value validation.  Comments and constructive criticism welcome.

I’ve been taking a break from coding to really map out the internals of invoking a function using GObject Introspection.  A few bugs popped up that I could either band-aid or use as a reason to really fix some of the core issues we have in our invoke routine.  I wanted to get a deep understanding of how invoke works and how all of the GI functions fit together to validate and marshal between C and Python.  While I am at it I am also designing an idealistic view of how we could modularize the validation and marshalling routines so that it is easier to experiment with optimizations and add support for other types should they be needed in the future.  I say this is an idealistic view because there are still going to be limitations and edge cases we will have to work around, but it is my hope that writing up a design and cleaning up the code will make it easier for others to understand the GI internals and contribute to the code.  While my work is geared towards the Python GObject Introspection bindings, I hope it will help other language binding authors better understand the GI internals.

January 27, 2010

Flattening the model

Filed under: AMQP, Open Formats, Standards, messaging, performance, tubes — J5 @ 5:14 pm

In my quest for cleaner code in Kamaloka-js I have been working on simplifying the dispatch model.  AMQP has some interesting features built into it to facilitate real-time functionality along with message prioritization.  To accomplish this, messages can be sent on different queues and tracks, and can also be broken up into segments, which can be further broken up into frames.

Frames

Frames are the basic building blocks of the AMQP data stream.  They contain complete headers that describe the queue, track and segment that is currently being constructed.  The payload of a frame (the segment being built) can be broken up into arbitrarily sized byte arrays which are then reassembled based on the channel and track they are sent on.  In this way applications with memory constraints can request that frames be no bigger than what the application can fit in memory.  A typical frame header looks something like:

{
  channel: 0,
  track: 1,
  is_first_frame: 1,
  is_last_frame: 1,
  is_first_segment: 1,
  is_last_segment: 1
}
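To show how a size limit turns one segment into several frames, here is a small Python sketch. It reuses the header fields from the example above; the `payload` key, the `split_into_frames` name and the choice to repeat the segment flags on every frame are assumptions made for illustration, not something taken from the AMQP spec.

```python
def split_into_frames(payload, max_size, channel=0, track=1,
                      is_first_segment=True, is_last_segment=True):
    """Split one segment's payload into frames no larger than max_size bytes."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    # An empty payload still produces a single (empty) frame.
    chunks = [payload[i:i + max_size]
              for i in range(0, len(payload), max_size)] or [b""]
    frames = []
    for number, chunk in enumerate(chunks):
        frames.append({
            "channel": channel,
            "track": track,
            "is_first_frame": int(number == 0),
            "is_last_frame": int(number == len(chunks) - 1),
            "is_first_segment": int(is_first_segment),
            "is_last_segment": int(is_last_segment),
            "payload": chunk,
        })
    return frames

frames = split_into_frames(b"hello world", max_size=4)
print([f["payload"] for f in frames])   # [b'hell', b'o wo', b'rld']
```

A payload that fits in one frame comes out with both frame flags set, which matches the header shown above.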

Segments

Segments are like frames but instead of an arbitrary split on a data size, each segment is split on struct boundaries. That is to say, when you receive a complete segment you can be sure that it can be fully parsed. There are currently four segment types: A control segment, command segment, header segment and body segment. Command messages are currently the only type that can contain a header and body segment. For instance, the transfer command is used to send messages to and from a queue. It would contain header segments which could be used to route the entire message and it would contain a body segment which contains arbitrary data the user application cared about. Each segment is broken up into at least one frame. Multiple segments would never be sent in a single frame.

Channels and Tracks

A channel is just an integer that denotes related frames and segments. One can think of each channel being a list of incoming frames which are ordered correctly. Once the last frame in a channel is seen, the message is constructed and the channel is flushed and ready to receive a new message. In this way, multiple messages can be received at the same time by utilizing different channels but only one message can be sent on a single channel at a time.

Tracks are an exception to the one message per channel rule. There are two tracks in the current spec. The control track (track 0) and the command track (track 1). Controls preempt commands on a channel, so you can be in the middle of receiving frames on the command track and a control can come in on the control track and you must respond to that first.

This all sounds complicated but you can just think of the channel/track combination as being one entry in a hash. For instance, frames coming into channel 0 and track 1 would be given the hash key “0.1”:

message_channels["0.1"].add_frame(incoming)
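Spelled out a little further, the lookup might look like the Python sketch below. `FrameList`, `channel_key` and `receive` are hypothetical helpers for this illustration; only `message_channels` and `add_frame` come from the one-liner above.

```python
class FrameList:
    """Hypothetical per-channel/track buffer of frames received in order."""
    def __init__(self):
        self.frames = []

    def add_frame(self, frame):
        self.frames.append(frame)

def channel_key(frame):
    """Build the "channel.track" key, e.g. channel 0 and track 1 -> "0.1"."""
    return "%d.%d" % (frame["channel"], frame["track"])

message_channels = {}

def receive(frame):
    """File the frame under its channel/track entry, creating it if needed."""
    key = channel_key(frame)
    message_channels.setdefault(key, FrameList()).add_frame(frame)
    return key

print(receive({"channel": 0, "track": 1}))   # 0.1
print(receive({"channel": 0, "track": 0}))   # 0.0
```

Because a control on track 0 lands in its own entry, it can be handled right away without disturbing the command that is still being assembled on track 1 of the same channel.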

First pass – Frame dispatching

At first it was easier to think of the frame and segment issues as different layers. At the lowest layer I would decode and dispatch each frame and then pass it off to the segment layer once a complete segment had been decoded. The segment layer would then collect the segments, relate them to each other, and then construct the full message. The frame layer looked something like this:

Flowchart showing the frame decoding layer of the kamaloka-js AMQP bindings

The dispatch would then pass it off to the segment decoder.

This became overly complicated because each segment had varying degrees of metadata and the body and header segments didn’t map to the message object very well. I could have gone ahead and created a segment object but I wanted to simplify the code.

Flattening the frame and segment layers

As it turned out, flattening the model only added a couple more steps to the current frame layer. Since frames and segments are just two different ways of breaking up a message for transfer over the wire, combining the two in the same layer made sense. What I ended up with was this:

Flowchart showing how the Kamaloka-js AMQP bindings decode frame and segments into a message

Notice that I now only create a new message if it is the first frame and first segment on a channel. When I see the last frame of a segment I incrementally decode that segment, but I only dispatch the message once the last segment is seen. In the end, these minor adjustments allowed me to strip out a whole layer of redundant code. It also simplified the low level event code, as I used to have to manage callbacks for each segment in order to construct a message. Now events only trigger once a full message is received and not when each frame or segment is received.
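The flattened flow can be approximated in Python as follows. This is a simplified sketch and not the Kamaloka-js code: `MessageAssembler`, the `payload` field and the decision to store raw segment bytes instead of decoding them are all assumptions made to keep the example short.

```python
class MessageAssembler:
    """Sketch of a single flattened layer handling both frames and segments."""

    def __init__(self, on_message):
        self.on_message = on_message   # called once per complete message
        self.pending = {}              # "channel.track" -> message in progress

    def feed(self, frame):
        key = "%d.%d" % (frame["channel"], frame["track"])
        # Only the first frame of the first segment starts a new message.
        if frame["is_first_frame"] and frame["is_first_segment"]:
            self.pending[key] = {"segments": [], "buffer": b""}
        message = self.pending[key]
        message["buffer"] += frame["payload"]
        if frame["is_last_frame"]:
            # A complete segment can be decoded; here we just store its bytes.
            message["segments"].append(message["buffer"])
            message["buffer"] = b""
            if frame["is_last_segment"]:
                self.on_message(self.pending.pop(key)["segments"])

def frame(payload, ff, lf, fs, ls):
    """Helper that builds a frame dict on channel 0, track 1."""
    return {"channel": 0, "track": 1, "payload": payload,
            "is_first_frame": ff, "is_last_frame": lf,
            "is_first_segment": fs, "is_last_segment": ls}

received = []
assembler = MessageAssembler(received.append)
assembler.feed(frame(b"head", 1, 0, 1, 0))   # header segment, first frame
assembler.feed(frame(b"er", 0, 1, 1, 0))     # header segment, last frame
assembler.feed(frame(b"body", 1, 1, 0, 1))   # body segment in a single frame
print(received)   # [[b'header', b'body']]
```

The callback fires exactly once, after the last frame of the last segment, which is the behaviour that removed the per-segment callbacks.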

September 14, 2009

kamaloka-js AMQP bindings in four browser engines

Filed under: Open Formats, messaging, performance, webapps — J5 @ 3:44 pm

While I am developing on Firefox, today I tested and fixed some issues under Chromium (WebKit based), Opera (Presto based) and Konqueror (KHTML based). My pub/sub demo worked the same in all four browsers (and at the same time, I might add). We tried with IE8 on Luke’s machine but the error messages were cryptic. I’ll have to get my hands on a Windows box at some point, but I’m guessing the biggest issues are going to be trailing commas or names in the AMQP spec that collide with JavaScript keywords (e.g. I ran into void and class keyword issues with the WebKit browsers).

One thing to note is that Chromium’s integrated debugger is a lot faster than Firebug. I haven’t used it extensively, but if it can break at breakpoints and introspect objects without getting screwy I might end up switching. I really like Firebug but lately it has become dog slow. It usually takes a couple of minutes of waiting for each frame of data to be processed and displayed. Under Chromium with the debugger on, the slowdown is noticeable but not significant compared to when the debugger is off.

UPDATE: Chromium’s debugger is probably faster because it doesn’t actually do anything that I can see. It displays scripts but doesn’t actually break into the code.

June 24, 2008

Firefox 3 Delivers on Promises

Filed under: Gnome, Linux, community, friends, performance, usability — J5 @ 6:48 pm

I feel I owe this blog post to Chris, given that I’ve been cited as one of the catalysts for some in the GNOME community aligning themselves with WebKit.  Not that I think competition in the browser market is bad (competition is one thing, but a line in the sand is just counterproductive here); my original intent was merely to ask what our priorities are and which projects align most closely with those priorities.

In any case it was reported on Slashdot that according to an article at Dot Net Perls, Firefox is now one of the most efficient browsers when it comes to memory usage.  This meshes with the internal tests Mozilla was doing and Chris blogged about.  It was one of my main gripes with Firefox when using the XULRunner and Gecko engine as the basis for an embedded browser.  At the time I was a bit nonplussed as the work that was being done to make Firefox better revolved around blaming and removing important libraries instead of fixing the root causes.

If the data is to be believed (and is transferable to Linux, as the tests were run on Windows) then it does point to significant improvements in Firefox, and I thank the Mozilla community for listening and dealing with the issues head on.  Software is hard and we shouldn’t turn our backs on a friend of the Linux community even when they might not be walking in lockstep with us.  The flip side is that Mozilla does need to be conscious of the needs of downstream developers and not use its market position as a bludgeon to get its way. To that end there are still the issues of a stable embedded API and better platform integration. I hear those are being worked on, so hopefully it won’t be an issue going forward.

Again I would like to thank the Mozilla community for putting out a great browser that is a serious competitor with Internet Explorer. I would also like to thank the Mozilla Foundation for helping fund accessibility work in GNOME. By working with each other instead of butting heads, as happens every once in a while, the ecosystem grows and benefits both communities.

May 5, 2008

Move over traveling salesman…

Filed under: Redhat, performance, usability — J5 @ 12:43 pm

…and say hello to Cappuccino in a Cloud.

The Red Hat “Boston” office just moved into new digs down the street from our old office space.  This is the second move we have made since I got here four years ago, and a needed one as the company continues to grow at a steady pace. Inevitably the discussion of coffee makers comes up every time we make a move (and quite frequently in the interim too), with a new coffee gadget showing up shortly after. We opted for the Flavia drink station this time around. This brings up the issue that any new gadget presented to a large audience will inevitably see high traffic for the first few days before the novelty wears off and the traffic reduces to a steady level of consumers.

There are many questions that need to be considered here. Will the machine stand up to the first few weeks of abuse? If it was engineered for a high peak capacity is it still economical to run when that traffic has fallen off? Do we just accept that the first few weeks will see some breakdowns, pissed customers who will not come back because of the failed experience and keep on chugging with the knowledge that our initial costs were low? If coffee making could be parallelized could it scale up and down economically and efficiently?

This is the Cappuccino in a Cloud problem. How do you make processes efficient and scalable for both peak load and the inevitable lower day-to-day traffic? The travelling salesman problem dealt with the efficiency of one single entity (the salesman) finding the most efficient (read: cheapest) single-threaded route through a number of destinations. In today’s world the consumer comes to the business or service, sometimes all at once, and it is important to figure out the most efficient way (measured in the consumer’s satisfaction and the producer’s bottom line) to handle that load.

March 24, 2008

BOSSA Fun and Productive

Filed under: D-Bus, performance, travel — J5 @ 2:26 pm

I got back from the BOSSA conference last week. It still is my favorite conference for just talking to people and getting things rolling. The organizers keep it small and limit the number of talks so that people can get to what they do best and just talk to each other. While I am no longer directly involved in embedded development I still feel this should be the target of most developers. A win on current generation embedded devices means improvements across the board from those devices to the desktop and even the server room. Most of my debugging and optimization techniques for D-Bus, the focus of my BOSSA talk, were gained from working with it on embedded devices. I continue to gear my work with the idea that in the near future a large portion of the people consuming those technologies will be doing so on devices that are considered to be “embedded”.

Welcome Matthew

Filed under: Fedora, Linux, Redhat, friends, performance — J5 @ 11:58 am

For those of you in Fedora land who don’t know, Matthew Garrett has just accepted an offer from Red Hat. If that name doesn’t ring a bell, it should. Matthew is one of the reasons Linux works on laptops. Being one of the few people who truly understand Linux from the hardware all the way up to the desktop, he will be spending his time working on power management in both kernel land and userspace.

It is great to see my company recognize the need for such improvements and hire top notch people to get it done. Welcome aboard Matthew.
