Lab 6 – Inline Assembler

Part A

Q: What is an alternate approach?
A: An alternate approach is to let the compiler choose the registers, rather than binding each variable to a specific register ourselves.
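
As a rough illustration of this idea (not the lab's actual code), the sketch below uses generic "r" constraints so the compiler picks the registers. The function name add_ints is made up for this example, and the plain C fallback is only there so it builds on other architectures.

```c
/* Add two ints with inline assembly. The "r" constraints let the
   compiler choose which registers hold a, b and the result.
   add_ints is a hypothetical name used only for this sketch. */
static int add_ints(int a, int b) {
#if defined(__aarch64__)
    int sum;
    __asm__("add %w0, %w1, %w2" : "=r"(sum) : "r"(a), "r"(b));
    return sum;
#elif defined(__x86_64__) || defined(__i386__)
    /* AT&T syntax: addl src, dst computes dst += src. */
    __asm__("addl %1, %0" : "+r"(a) : "r"(b));
    return a;
#else
    return a + b; /* portable fallback, no assembly */
#endif
}
```

Calling add_ints(2, 3) should give 5 on any of these targets; the compiler decides where a and b live.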

Q: Should we use 32767 or 32768 in next line? why?
A: We should use 32767 because int16_t has a range between -32768 and 32767.
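
A small sketch of why the limit matters, assuming the volume is a float between 0.0 and 1.0 (the helper name volume_to_int16 is hypothetical): with 32767 a full volume of 1.0 still fits in an int16_t, while 32768 would not.

```c
#include <stdint.h>

/* Convert a volume in the range 0.0 to 1.0 into a 16-bit integer factor.
   Using 32767 (INT16_MAX) keeps volume 1.0 inside the int16_t range;
   1.0 * 32768 would be one past the largest representable value. */
static int16_t volume_to_int16(float volume) {
    return (int16_t)(volume * 32767.0f);
}
```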

Q: what does it mean to “duplicate” values here?
A: Duplicating the value here means that vol_int is copied into all 8 lanes of a vector register.

Q: what happens if we remove the following lines? Why?
A: If we remove these lines we get a segmentation fault. These lines specify the input and output operands; without them, the assembly code has nowhere to read its inputs from or store its results.

Q: Are the results usable? are they correct?
A: Yes, the results are usable and correct.

Part B

The open source package I decided to analyze is OpenBLAS. BLAS stands for Basic Linear Algebra Subprograms. It is the computational kernel of many linear algebra and scientific applications. It provides standard interfaces for vector-vector, matrix-vector and matrix-matrix operations. OpenBLAS is an open source implementation of the BLAS interface.
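
To give a feel for the three kinds of operations, here is a naive reference sketch in plain C. These are not the OpenBLAS functions or their signatures; the names dot, matvec and matmul and the row-major layout are assumptions made for this example.

```c
#include <stddef.h>

/* Vector-vector: dot product of x and y (both of length n). */
static double dot(size_t n, const double *x, const double *y) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Matrix-vector: y = A * x, where A is rows x cols in row-major order. */
static void matvec(size_t rows, size_t cols,
                   const double *a, const double *x, double *y) {
    for (size_t r = 0; r < rows; r++) {
        y[r] = dot(cols, &a[r * cols], x);
    }
}

/* Matrix-matrix: C = A * B, where A is m x k and B is k x n (row-major). */
static void matmul(size_t m, size_t k, size_t n,
                   const double *a, const double *b, double *c) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A tuned library does the same arithmetic, but with architecture-specific kernels in place of these simple loops.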

I chose OpenBLAS because I’ve used the BLAS library in my GPU course, so I am familiar with it. I hadn’t heard of OpenBLAS before, so I decided to learn more about it for this lab.

After spending some time going through their GitHub repo, I found a folder called “kernel” where they store code for different architectures. There is quite a bit of assembly language in this project. I am impressed by all of the different architectures they support. I was able to find a Readme file that lists all of the supported architectures.

2. Supported Architecture
X86 : Pentium3 Katmai
 Coppermine
 Athlon (not well optimized, though)
 PentiumM Banias, Yonah
 Pentium4 Northwood
 Nocona (Prescott)
 Core 2 Woodcrest
 Core 2 Penryn
 Nehalem-EP Corei{3,5,7}
 Atom
 AMD Opteron
 AMD Barlcelona, Shanghai, Istanbul
 VIA NANO

X86_64: Pentium4 Nocona
 Core 2 Woodcrest
 Core 2 Penryn
 Nehalem
 Atom
 AMD Opteron
 AMD Barlcelona, Shanghai, Istanbul
 VIA NANO

IA64 : Itanium2

Alpha : EV4, EV5, EV6

POWER : POWER4
 PPC970/PPC970FX
 PPC970MP
 CELL (PPU only)
 POWER5
 PPC440 (QCDOC)
 PPC440FP2(BG/L)
 POWERPC G4(PPC7450)
 POWER6

SPARC : SPARC IV
 SPARC VI, VII (Fujitsu chip)

MIPS64/32: Sicortex

Most of their assembly code is found in separate .s files. I found one instance of inline assembly code, located in a header file called common_x86.h.

 

Square root function found in common_x86.h

static __inline long double sqrt_long(long double val) {
#if defined(_MSC_VER) && !defined(__clang__)
  return sqrt(val); // not sure if this will use fsqrt
#else
  long double result;

  __asm__ __volatile__ ("fldt %1\n"
                        "fsqrt\n"
                        "fstpt %0\n" : "=m" (result) : "m"(val));
  return result;
#endif
}

Even with separate assembly code for each architecture, I think the library behaves the same, or with very little difference, across all platforms. OpenBLAS runs CPU-intensive algorithms, so its code needs to be efficient and well optimized. For example, multiplying two matrices that are each 1,000,000 by 1,000,000 needs a lot of processing power. By using assembly language to set up the registers and control where data is stored, I think this provides a huge increase in performance and a decrease in errors and memory leaks. With OpenBLAS, the complexity of the code isn’t much of an issue because it mainly deals with vector and matrix calculations, which don’t really require difficult coding. The main purpose is to run “easy” calculations many times. In terms of its portability, the OpenBLAS developers were able to write their code for many different platforms. However, I can see this becoming an issue for a project whose code is more complex, because that code would be difficult to rewrite for each architecture.

 

Lab 5 – Algorithm Selection

The purpose of this lab was to create random “sound samples”, scale the samples, store them in an array and calculate the sum. Once we have created the basic sound scale program, we will create alternate approaches to scale the sound samples.

The first approach is to multiply each sample by the floating point volume factor 0.75.
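
A minimal sketch of this first approach is shown here. It is not the code from the lab screenshots; the function name scale_float and the in-place update are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale every 16-bit sample in place by a floating point volume factor.
   Each sample is converted to float, multiplied, then truncated back. */
static void scale_float(int16_t *samples, size_t count, float volume) {
    for (size_t i = 0; i < count; i++) {
        samples[i] = (int16_t)(samples[i] * volume);
    }
}
```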

I increased the sample size to 500,000,000 from 500,000 to see if there was a difference between the 3 approaches. With the bigger sample size, it is more evident which approach runs faster.

I used gcc -O0 vol1.c -o vol1 to compile the code. The output remained the same every time I ran the code. The time changes slightly on every run.

I compiled the code again, but with optimizations. The measured time was slightly faster, by about 1 second. I noticed that the result changed with optimization. I’m not fully sure why this happens, but I think it has something to do with how the sum is calculated in the loop.

The next approach was to pre-calculate a lookup table and look up each sample in that table. In this code, I added the “soundTable” variable that holds the array for each sample.
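
The idea can be sketched as follows. Only the soundTable name comes from my code; the helper names build_table and scale_lookup, and the choice to index the table by the sample's unsigned bit pattern, are assumptions for this example.

```c
#include <stdint.h>

/* One precomputed result for every possible 16-bit sample value. */
static int16_t soundTable[65536];

/* Fill the table once: the multiplication happens 65536 times here
   instead of once per sample in the main loop. */
static void build_table(float volume) {
    for (int32_t s = INT16_MIN; s <= INT16_MAX; s++) {
        soundTable[(uint16_t)s] = (int16_t)(s * volume);
    }
}

/* Scaling a sample is now a single array read. */
static int16_t scale_lookup(int16_t sample) {
    return soundTable[(uint16_t)sample];
}
```

build_table has to be called once before any lookups, for example build_table(0.75f).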

The time it takes to run the program is slightly faster than with the first approach. Running the code multiple times yields the same result.

The last approach is to convert the volume factor to a fixed-point integer, multiply each sample by it, and then shift the result to the right by 8 bits.
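
Here is a rough sketch of that idea, assuming 8 fractional bits so that the fixed-point value “1” is 256. The function name scale_fixed is hypothetical. Note that right-shifting a negative number is implementation-defined in C; gcc uses an arithmetic shift, which is what this sketch relies on.

```c
#include <stdint.h>

/* Scale a sample using integer math only.
   volume * 256 turns 0.75 into the fixed-point factor 192;
   shifting right by 8 divides the product by 256 again. */
static int16_t scale_fixed(int16_t sample, float volume) {
    int32_t factor = (int32_t)(volume * 256.0f);
    return (int16_t)((sample * factor) >> 8);
}
```

In a real loop the factor would be computed once outside the loop, so each sample only costs one integer multiply and one shift.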

The time it takes to run the program is a bit faster than with the first 2 approaches, both with and without optimizations. Again, the results didn’t change after multiple runs.

Conclusions

Based on these results, we see that the last approach yields the fastest processing times. This approach converts the volume factor to a fixed-point integer by multiplying it by a binary number representing the fixed-point value “1”.

From the three approaches, we can see that optimization with -O3 decreases the time it takes to run each program by a significant amount. The distribution of the data does not matter because the code runs through every element of the arrays on every pass, whatever the values are.

 

Lab 4 – Vectorization

For this code, I used gcc -O0 lab4.c -olab4 to compile the program. I wanted to see what the disassembly output looked like without optimization. The <main> section had about 97 lines of code, with no vectorization.

Next, I compiled the code with gcc -O3 lab4.c -olab4.

Here, we see that there are fewer lines of code, about 64. The outlined lines in the disassembly are where the SIMD instructions occur.

I ran time ./lab4 to see if the vectorization made any difference in run time. There was a slight difference: the sys time was higher when I compiled the code with -O0. I’m not sure if this was supposed to happen or not.

[Screenshot: time output with no optimization]

[Screenshot: time output with optimization]

 

Lab 3 – VSCode Bug

From the list of bugs, I decided to work on “Fix #43302 Opening .bat file in VSCode via context menu in windows explorer tries to execute it”. The first thing I tried to do was recreate the bug. In the steps to reproduce, the user mentions that you need to right-click the .bat file and click on “Open with Code”. For some reason, this didn’t show up for me; File Explorer did not give me an “Open with” option. I googled “Visual Studio Code Open with” and came across a website that shows how to get this to show up. Apparently, it is an option when you first set up VSCode.

The only other option was to edit the registry on my laptop, which I did not feel comfortable doing, so I reinstalled VSCode with those options checked. Once it was installed, I right-clicked the file and “Open with Code” showed up! I clicked on it and the .bat file opened in VSCode with no problem. File Explorer didn’t run the .bat file. I went back to the bug to see if I had missed a step. I noticed that the user listed the extensions he had installed in VSCode, so I decided to install the same extensions to see if that might create the bug. After installing the 6 extensions, I tried again. Still nothing. The issue might be caused by the user running an older version of VSCode. He was on 1.19.3, while I’m using 1.20.1. Since I couldn’t recreate the bug, I tried another one.

The next bug I tried was “Fix #42720 Color picker: no longer appears in settings editor”. I went into VSCode’s settings to try to recreate the bug. I hovered the cursor over the hex color value and the color picker showed up. Again, I could not recreate the bug. I tried both the Dev and Standard versions of VSCode. My partner wasn’t able to recreate the bugs either. We tried one more bug.

The last bug we tried was “Fix #16834 Search in folder with special characters `{}` yields no results.” The user is on Mac OS for this bug, but I decided to try to recreate it on Windows. I created a folder, {hello}, with test.txt inside of it. I also created a folder without the {} just to see if the search function works to begin with. I followed the steps and the search couldn’t find the file within {hello}. I finally found a bug! Yay? I then tried to search in the folder without the {}, and the search was able to find the directory. The next step was to locate the code for the search feature. After about an hour of looking, I couldn’t find where the code was; I may have overlooked a function somewhere. (Edit 02/22: We went through this bug in class and found the code.)

In the end, finding and trying to recreate bugs is very tedious but rewarding. It really feels like you are doing detective work trying to solve a crime by working backwards. I definitely enjoy finding and recreating bugs. However, fixing them is another story.

Lab 2 – Visual Studio Code

This was my first time hearing about and using Visual Studio Code. The initial setup was a bit tricky because I haven’t had much practice installing applications through the command line. I had to install all of the prerequisites, such as Node.js, Yarn and Python. I ran into issues where a command didn’t exist or was missing another application. Once I was able to run “yarn run watch”, it started working!

My first impression of VSCode was that the layout and themes were nice. I generally prefer a darker theme to work in so I didn’t need to change that. I also didn’t change any of the settings because I haven’t used VSCode before, so I’m not even sure what I would want to change. I think once I use VSCode more, I’ll eventually change some of the keybindings.

The extensions I installed were the ones that dealt with other languages such as C++, C#, Java. Each of these extensions allows language support. I chose them because these are the languages that I know. It would be beneficial to be able to code those languages on VSCode.

I found the live debugging quite interesting. The option to toggle developer tools was easy to use. I have a little bit of experience playing around with the Inspect tool in Firefox and Chrome, so I was familiar with the one in VSCode. It was interesting to see how you can debug right in VSCode. It makes it easy to follow the code and find where the source code is stored.

The web technologies that VSCode uses are almost all new to me. With the exception of Node.js, this is my first time being exposed to them. After doing a bit of research on each of the different technologies, I have a better understanding of how each of them works.

TypeScript: a superset of JavaScript. Adds optional static typing.

Electron: Allows desktop GUI applications to be built with components meant for web applications. Examples are Discord and Spotify.

Node.js: Executes JavaScript code server-side. JavaScript was traditionally used mainly for client-side scripting; Node.js runs scripts on the server, for example to produce dynamic web pages before the page is sent to the browser.

ESLint: Open source JavaScript linting utility.

TSLint: Open source TypeScript linter.

gulp: JavaScript toolkit used as a build system for front end development. It is used for automation of repetitive and time-consuming tasks.

mocha: JavaScript test framework.

sinon: a standalone library of test spies, stubs and mocks for JavaScript.

yarn: a quick, secure and reliable package manager.

Lab 3 – Assembler

aarch64 code
x86_64 code

This lab was my first experience with assembly language. It took some time getting used to the syntax and commands for both x86_64 and aarch64. It was quite challenging because it is so different from other languages. The hardest part to get used to was the fact that we have to “manually” tell the program where to store values. I found it hard to keep track of where I stored my values and often had to draw out the registers with the values in them. The second challenge was working with two-digit numbers. Having to store each digit in a separate byte was quite difficult and frustrating. After some time, however, our group found a way to get the code to work.

First I will discuss the code on x86_64:

Our group came up with a solution that loops 30 times and prints each value. The following is the logic for our code:

  1. After setting up the initial variables (registers), increment the index.
  2. Store the index into rax
  3. Divide the value in rax (the index) by 10. This is to see if it is a multiple of 10.
  4. If it is not a multiple of 10, we add the remainder to the character ‘0’ and then store the value.
  5. If it is a multiple of 10, we increment the value in the 10s digit.
  6. Once the index reaches our ‘max’ value, we stop the loop and print out the values.
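
The digit handling in the steps above can be sketched in C. This is a simplified variant rather than a translation of our assembly: it computes both digits from the quotient and remainder on every pass instead of incrementing the tens digit, and the function name two_digits is made up for this example.

```c
/* Split an index from 0 to 99 into two ASCII digit characters.
   Adding the character '0' turns a digit value such as 7 into '7'. */
static void two_digits(unsigned index, char digits[2]) {
    unsigned quotient = index / 10;   /* tens digit (rax after div on x86_64) */
    unsigned remainder = index % 10;  /* ones digit (rdx after div on x86_64) */
    digits[0] = (char)('0' + quotient);
    digits[1] = (char)('0' + remainder);
}
```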

Once we decided on the logic, it was still difficult to figure out how assembly language works. We needed to figure out how to print the data, and we also needed to tell the program where to display each value. At the bottom of the code, we can see the line “num = msg + 5”. This tells the program to put the value at index 5 of the line. I played around with this value a bit just to try to get a better understanding of how it works. Changing it to 6 caused the newline to ‘break’ and showed the leading 0.

I believe this is caused by placing the number in the middle of the ‘\n’. i.e “Loop: 0 \[here]n”.

The code on aarch64 is slightly different, but the logic is similar. Some of these differences between aarch64 and x86_64 are:

  1. aarch64 doesn’t have an increment command, so we have to use ‘add’ instead of ‘inc’.
  2. aarch64’s divide command lets us choose the registers for the dividend, divisor and result, whereas x86’s ‘div’ always divides the value in rdx:rax by its single operand.
  3. In aarch64, you can use msub to find the remainder. In x86, the remainder is stored in rdx after using the ‘div’ command.
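
The remainder step from the last point can be written out in C to show the arithmetic that the divide followed by msub performs. The function name remainder_msub is hypothetical.

```c
/* Compute a remainder the way the aarch64 sequence does:
   first divide, then multiply the quotient back and subtract. */
static unsigned remainder_msub(unsigned dividend, unsigned divisor) {
    unsigned quotient = dividend / divisor;  /* the divide instruction */
    return dividend - quotient * divisor;    /* what msub calculates */
}
```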

Between the 2 architectures, I found the x86_64 was a bit easier to understand and follow. I also prefer the way x86 handles ‘div’. Storing the quotient in rax and the remainder in rdx makes it easier to remember where the values are. It also eliminates the ‘msub’ step that is found in aarch64.

Comparing assembly language and other languages, there seems to be a huge difference in syntax and commands. With other languages, we don’t have to worry about where each value is stored, the compiler handles that for us. We also don’t have to worry about the length of digits. It is also easier to convert strings into int and vice versa with other languages.

Learning assembly language gave me a better understanding of how code stores data and how it is accessed. It also made me appreciate how much the compiler does in other languages.

Release 0.1

I want to preface this blog post by saying that I have very little experience in creating RESTful APIs in Java. There were a lot of “firsts” for me on this project.

My Java API uses Google’s libphonenumber API to parse a phone number. My program uses Eclipse Neon 3, the Maven build tool with Jersey, the Tomcat Java Servlet container, and the JUnit and Mockito testing frameworks. With the exception of Eclipse, it was my first time using the mentioned tools.

It took a lot of time researching which tools to use and learning how they work. I would say the majority of my time was spent reading about the tools and watching YouTube tutorials on how to set up the environment. From what I’ve researched, this is my understanding of each tool:

Maven is a software project management tool. It can manage a project’s build, reporting and documentation from a central piece of information. Maven projects use a file called “pom.xml”. This file contains the project’s dependencies, group id, artifact id, version number and how to package the project.

Jersey is a RESTful Web Services framework. It is a JAX-RS (Java API for RESTful Web Services) reference implementation that provides features and utilities to simplify RESTful service and client development.

Tomcat is an open-source Java Servlet Container.

JUnit is a unit testing framework for Java.

Mockito is an open source testing framework. The framework allows the creation of mock objects.

I found this whole release to be challenging. I was quite discouraged to see that the majority of the class used JavaScript for this project and I was one of the few that used Java. The process looked simpler using Node.js and the environment looked easier to set up. The reason I used Java was because I’ve never learned Node.js. I felt that this was a disadvantage but pursued this challenge regardless.

Setting up the environment with the correct dependencies was the most challenging and time-consuming part. Once I set up my environment, I had to learn how to create a RESTful API in Java. This took quite some time for me to learn as well. The method of creating a REST API varies slightly depending on which tools you use, and it was difficult to find tutorials that used the same tools that I was using. The next challenge was fixing my errors. It became quite frustrating because one error would lead to another; every time I fixed an error, it would cause another error. Most of the errors I encountered were caused by the environment setup. At one point, I imported entire libraries and dependencies in order to get my code working. Once my code started to work, I had to create test cases.

I’ve never created test cases before and was unaware of the frameworks I needed. Again, this took quite some time to learn and set up. I don’t think my test cases are working properly yet; they are still works in progress.

For Part B, I felt I was slightly limited in which bugs I could fix. Only 3 other students wrote their code in Java. I also ran into trouble running their programs on my laptop. I think this was due to the way I set up Eclipse on my laptop. Also, some of the README files were not specific enough, so I couldn’t really figure out their instructions. The bug I decided to work on was an enhancement. I wanted to add code to parse PDFs. Again, this took some researching and finding sample code. After completing the code, I felt accomplished and surprised that I actually solved the issue. I prefer to be a contributor over a maintainer because I like fixing bugs. I don’t consider myself to be a strong programmer, so being a contributor and working on existing code is something I am more comfortable with.

Overall, this was a very challenging project but I learned a lot from it. Although I was discouraged at the beginning, I felt good about completing the project on my own.

 

Lab 2 – Compiled C Lab

The following lab was done on the Archie server:

The original C code was compiled with gcc -g -O0 -fno-builtin -o hello hello.c
Size: 73K, dynamically linked (libc.so.6)

This will provide a base for the following variations.

  1. gcc -g -O0 -fno-builtin -static -o hello1 hello.c
    This build is much larger than the original file. Its size is 680K, which is almost 10 times the size of the original. Also, because the -static option was added, the file is no longer dynamically linked. When running the ldd hello1 command, we get the message “not a dynamic executable”.
  2. gcc -g -O0 -o hello2 hello.c
    When we remove the -fno-builtin option, we can see a slight change when we run objdump. Where the “printf” function is called in the original file, we see “400480 <printf@plt>”. In the modified version, <printf@plt> is replaced with <puts@plt>, because the compiler is now allowed to swap a printf call that needs no formatting for the simpler puts. In terms of file size, hello2 is slightly smaller.
  3. gcc -O0 -fno-builtin -o hello3 hello.c
    When we remove the -g option, the size of the file is about 2K smaller than the original. This is because we are removing the debugging information from the binary. The -g option tells the compiler to generate debugging information.
  4. gcc -g -O0 -fno-builtin -o hello4 hello.c *added 10 arguments
    With the added arguments, we can see each one being loaded into its own register. After the 8th argument, we see “str”, which stores register w0 to an address in memory, because only the first 8 arguments are passed in registers.
  5. gcc -g -O0 -fno-builtin -o hello5 hello.c  *created output function
    Having added an output function, the results are pretty much the same. The main difference is in the objdump, where we see another function, <output>, that looks similar to the original <main> function.
  6. gcc -g -O3 -fno-builtin -o hello6 hello.c
    The -O3 option sets the compiler’s optimization level. With -O3, the compile time is a bit longer than with the default -O0 option. We can see that it is about 0.004 seconds longer. If the code was larger, I would assume that the compile time difference would also be larger.
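
For anyone who wants to repeat the argument experiment from variation 4 without printf, a small stand-in could look like the sketch below. The function sum10 is hypothetical and is not part of the lab’s hello.c. Compiled with -O0 on aarch64 and inspected with objdump, the call site should show the first eight values going into w0 to w7 and the last two being stored to the stack.

```c
/* Ten integer parameters: on aarch64 the first eight are passed in
   registers w0 to w7, and the remaining two are passed on the stack. */
static int sum10(int a, int b, int c, int d, int e,
                 int f, int g, int h, int i, int j) {
    return a + b + c + d + e + f + g + h + i + j;
}
```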

Lab 1 – PeerTube

PeerTube is an open source video streaming platform that uses P2P (BitTorrent) directly in the web browser with WebTorrent. It is fairly new and has not been completed yet. It is sponsored by Framasoft. Currently there are 57 issues and 21 contributors. The project is written mostly in TypeScript.

The interesting thing about PeerTube is that the front-end looks very similar to YouTube, but the back-end is very different. YouTube and other video streaming websites are on a centralized network. This means that everything is stored in a “central” location (servers). Decentralization means that there is no central point that data comes from and goes to. Peer-to-peer makes this possible by allowing users to send and receive data to and from one another. Decentralization is a hot topic right now when it comes to security and privacy.

With PeerTube, users can request a video from the server, which will then pass a torrent URI to the user. Any server (peer) that is seeding the video will send the video back to the requester for them to watch. This allows videos to be watched smoothly and avoids downtime caused by overloaded servers. Below is a diagram that illustrates this.

[Diagram: PeerTube network]

The goal for PeerTube is to “democratize video hosting by creating a network of hosts, whose video views are shared live between users”. I believe that if successful, PeerTube (or another decentralized streaming service) can eventually overtake YouTube as a video streaming site. I am curious to see how they overcome major obstacles such as copyright and advertisements. Their beta release date is March 2018.

Lab 1

The two open source software packages I am choosing to examine are Apache HTTP server and VLC Media Player.

Apache HTTP server
Apache is licensed under Apache License 2.0.

Apache has multiple ways for developers to get involved with the project. They have a mailing list for users to subscribe to announcements, user support, developing and debugging and GitHub repositories. Users can review and submit bugs through their site Bugzilla.

On Bugzilla, developers can submit new bugs or work on unresolved bugs. They can leave comments and propose patches or testcases as well.
Example of Resolved bug

This bug took about 7 days to fix. The user who submitted the bug (Vincent Privat) also solved the issue. Another user (Stefan Bodewig) commented with thanks for solving the issue. About a month later, another user (Alan Bateman) commented about a similar bug and asked a question related to the initial bug.

This community seems very quick with responding to bugs and fixes.

 

VLC Media Player
VLC is licensed under GNU General Public License Version 2.

VLC has various ways that users can contribute. They have areas for programmers, writers, translators, moderators and designers. Developers can subscribe to mailing lists, look at their bug-tracking system and join an IRC channel.

Example of Resolved Bug

This bug took about 4 weeks to fix. There were 4 users working on this bug. I’m not sure if they solved the issue or decided to close it because it was a server issue instead of an application issue. VLC’s bug tracking website does not give a specific time-stamp for each comment. It only shows “n weeks ago”, which makes it difficult to tell exactly how quickly they were responding to each other. My guess is that the average response time is less than 24 hours.


Apache’s and VLC’s bug tracking systems are very similar. They give a detailed look at what the bug is and the priority level of the bug. The user comment design is easy to follow and read. However, I prefer Apache’s comment section because it includes a time-stamp beside each comment, which makes it easier to know exactly when the comment was submitted. I noticed that on Apache’s bug-tracker, users can add an attachment directly to the comment section. VLC’s patch submission is on a separate page located at patches.videolan.org. I’m not sure which method I prefer because I haven’t had experience with either one yet. From my first impression, I would think Apache’s method makes it easier to submit patches.